Updated: May 30, 2018
and dancing on its grave might solve the oldest audio problem in broadcasting.
I have more than once heard 5.1 referred to as a “legacy format”; it is being replaced in high-end consumer systems by 7.1 or 9.1. DTS have Neo:X, which supports up to 11.1: that’s 7.1 plus four height speakers. In Japan, NHK promote 22.2, which requires 24 loudspeakers at three different heights! In the cinema, Dolby Atmos provides audio with height and the ability to move individual sounds around the cinema with great precision. But next generation TV will deliver something far more significant than “sound with height”, enjoyable though that can be.
Too Many Channels
The plethora of channel-based formats and the various ways in which consumers have installed their speakers (3 for the living room and 2 for the kitchen) make it really hard for the broadcaster to create content that’s going to work for all of them. Thankfully, proposals for next generation TV don’t rely on the traditional channel-based approach typified by stereo and 5.1. Instead, content will be created in a way which is “replay configuration agnostic” and the receiver will render the content according to the number and location of the speakers installed. There are huge benefits to this approach: the broadcaster only has to create one audio stream, and the consumer always gets the best sound their kit can deliver. So – how does it work, and will it really solve the oldest audio problem in broadcasting?
Enter stage left (at ceiling level, about a metre across) Objects and Scenes!
Instead of creating sound by panning it in a stereo, 5.1 or even 22.2 setup, each sound (voice, car or helicopter) has metadata associated with it, which describes where in three-dimensional space it should be. The metadata can be dynamic, to allow for moving objects. The metadata also include parameters for how large or diffuse the object is; a mosquito is smaller than an albatross. We will no longer have an excuse to confuse “small” with “far away”. The receiver uses the metadata, together with knowledge of the number and location of speakers available to it, to render the audio objects so they “appear” in the right place. The receiver could even render the audio to binaural presentation for 3D sound on earbuds.
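To give a flavour of what “rendering from metadata” means, here is a deliberately simplified sketch. It pans a single object between one adjacent pair of speakers using constant-power panning; the function name and the two-speaker restriction are my own illustration, not the actual MPEG-H rendering algorithm, which handles full 3D layouts and object size.

```python
import math

def pan_gains(obj_az, spk_left_az, spk_right_az):
    """Constant-power pan of one audio object between an adjacent
    speaker pair, all azimuths in degrees. Returns (g_left, g_right).
    A toy illustration of metadata-driven rendering, not a real renderer."""
    frac = (obj_az - spk_left_az) / (spk_right_az - spk_left_az)  # 0..1
    theta = frac * math.pi / 2          # map position onto a quarter circle
    return math.cos(theta), math.sin(theta)
```

An object placed exactly between a stereo pair at ±30° gets equal gains of about 0.707 on each speaker, so its total power stays constant as it moves. A receiver with a different layout would simply run the same metadata through gains computed for its own speaker positions.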
Audio objects are not the best way to handle diffuse, enveloping sounds, such as wind in a forest, or reverberation. Audio scenes deal better with these and can be used alongside (around?) audio objects.
Yes, good old channels will still be there, to allow simpler production where appropriate, and a greater degree of compatibility. But even where channels are used, the receiver will remap them onto the speakers available if the delivery and replay channel layouts don’t match.
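Channel remapping of this kind is already well established. For example, a receiver with only two speakers can fold a 5.1 feed down to stereo; the sketch below assumes the familiar ITU-R BS.775-style downmix weights (centre and surrounds attenuated by 3 dB, LFE discarded), applied to one frame of samples.

```python
def downmix_51_to_stereo(l, r, c, lfe, ls, rs):
    """Fold one 5.1 sample frame down to stereo using ITU-R
    BS.775-style coefficients. The LFE channel is commonly
    dropped in this downmix, as it is here."""
    k = 0.7071  # -3 dB
    lo = l + k * c + k * ls
    ro = r + k * c + k * rs
    return lo, ro
```

The point of next generation audio is that this kind of adaptation happens in the receiver as a matter of course, for whatever layout the viewer actually has, rather than being baked into the broadcast.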
What does it give me?
If you work in TV, it gives you the opportunity to differentiate your content from that of your competitors, by providing experiences which the audience really value! Production tools are already available and it would be a good idea to start experimenting now. The new features of MPEG-H can be introduced gradually starting with a more efficient codec, then adding height where it is editorially justified; you don’t have to adopt a “big bang” approach to change.
Solving the oldest audio problem in broadcasting
Obviously the viewers get the opportunity to enjoy sound with height. Rain no longer needs to “land” level with their ears, and helicopters really will go over their heads, not through them. But the fact that the audio is rendered in the receiver enables a far more fundamental change – customisation.

The most common complaint about broadcast sound on TV is that the dialogue can’t be heard over the background audio. Sometimes this is because the broadcaster has messed up the sound balance, but often it’s because one sound balance can’t serve all audiences well. I believe this has been the case ever since broadcasting began, and I call it the “oldest audio problem in broadcasting”. And it is getting worse. We have an ageing population, and modern flat screen TVs have tiny speakers, often pointing down or back at the wall behind them. The poor speakers combined with age-related hearing loss can make it hard for the older members of society to hear what’s being said when there are background sounds. Of course the broadcaster could just make the speech significantly louder in the mix, but if they do this to an extent that makes it easy for the majority of people with a hearing impairment to hear what’s being said on a flat-screen TV, you end up with a VERY SHOUTY presentation for the younger audience listening on a decent sound system. So we all have to live with a compromise. But not for much longer.
In 2011 I ran an experiment from Wimbledon which allowed the user to adjust the relative sound level of commentary and court sounds. What I found was that some people pushed the sound balance a little one way, and about the same number pushed it a little the other way. In other words, the broadcast sound balance was a perfect compromise, and pleased almost nobody. Well, nobody from the self-selecting group of internet users who tried it; I don’t claim it’s statistically significant, but it does demonstrate the difficulty of pleasing all audiences with a single balance. Next generation broadcast TV will allow the user to decide how loud they want the speech to be, relative to everything else. The commentary and background sounds will be audio objects and the viewer will be able to “tell” the receiver how loud to render them.
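Because commentary and background arrive as separate objects, the “tell the receiver how loud” step is conceptually just a gain applied to one object before mixing. The sketch below is my own minimal illustration of that idea; the function, the object names and the dB control are assumptions, not part of any broadcast specification.

```python
def mix_with_dialogue_gain(dialogue, background, dialogue_gain_db=0.0):
    """Mix two mono object streams (lists of samples), applying a
    user-chosen gain in dB to the dialogue object before summing.
    Illustrative only: a real receiver renders objects to the
    actual speaker layout rather than summing mono streams."""
    g = 10 ** (dialogue_gain_db / 20)   # dB to linear gain
    return [g * d + b for d, b in zip(dialogue, background)]
```

At 0 dB the viewer hears the broadcaster’s balance; a viewer who struggles with speech might push the dialogue up a few dB, while another might pull it down, exactly the split behaviour the Wimbledon experiment observed.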
Would you like it in Welsh?
At present, providing commentary in a different language uses up a lot of broadcast bandwidth. Imagine you are covering a rugby match in surround sound and would like to provide both English and Welsh language commentary. At present broadcasters need to provide two complete audio signals, doubling the audio bandwidth requirement. As a result, on the rare occasions a second language is made available, it is often only low bandwidth stereo. Why should those who want one language get a vastly inferior sound to those who want another? Audio objects allow the broadcaster to send the sound of the crowd alongside two additional audio objects, one for each commentary. The viewer will be able to decide which language they want, and perhaps even which end of the crowd they want to “sit with”, home or away.
Will I have to fix speakers to my ceiling?
Only if you really want the ultimate sound quality. A central ceiling speaker (known as the “Voice of God”) can be very effective, but ceiling mounted speakers, or speakers mounted high on the walls, are not essential. A number of different approaches are being tried. Dolby have small upward-facing directional speakers built into the top of the main speakers; these bounce sound off the ceiling to give a sense of height. Meanwhile Fraunhofer have demonstrated a prototype “sound bar” which goes under the TV; this can create three-dimensional sound for those who don’t want to fill their living room with wires and speakers.
When can I have it?
The ideas and benefits of next generation audio for TV have been tested online, and the MPEG-H alliance have demonstrated an end-to-end live production chain. Trials and experiments are underway in Europe, so hopefully we will get broadcast audio with height and customisation before I’m too old to enjoy it! At the moment, MPEG-H is on the air in South Korea, so you'll have to go there to experience it for yourself.