We are investigating expressive descriptions of object-based spatial sound scenes for transmission and reproduction on a wide variety of audio systems.
Why Do We Need to Describe Sound Scenes?
Today's spatial audio content - for instance for stereo, home surround, or cinema systems - is typically transmitted as a number of audio signals corresponding to the loudspeakers of the target audio system. In this channel-based setting, the spatial information of the sound scene, such as the positions of different persons, musical instruments, or sound effects, is contained in the set of loudspeaker signals, and there is no need for an additional description of the sound scene. But what happens if the loudspeakers in your home are placed in positions other than those for which the channel signals were produced? Or if you use a different loudspeaker system or headphones altogether? In these cases, the quality of channel-based spatial audio generally degrades. Object-based audio is a profoundly different approach to delivering spatial audio. Instead of fixed channel signals, it transmits the sounds of different audio objects individually -- for instance individual voices, groups of instruments, environmental sounds, or the reverberation in a room. These individual sounds are augmented by data describing the object itself, such as its position and sound level.
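The pairing of an audio signal with descriptive metadata can be sketched as a simple data structure. The field names below are purely illustrative assumptions, not taken from any standard such as the ADM:

```python
from dataclasses import dataclass
import numpy as np

# A minimal sketch of an audio object: one mono signal plus descriptive
# metadata. Field names and the position convention are assumptions.
@dataclass
class AudioObject:
    signal: np.ndarray   # mono audio samples
    position: tuple      # (azimuth_deg, elevation_deg, distance_m)
    level_db: float = 0.0  # object gain in decibels

# Example: one second of (silent) audio placed 30 degrees to the side.
obj = AudioObject(signal=np.zeros(48000),
                  position=(30.0, 0.0, 2.0),
                  level_db=-6.0)
```

A real scene description would carry many such objects, each with its own signal and metadata stream.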
Advantages of Object-Based Audio
The most obvious advantage of object-based audio is that the reproduction can be optimally adapted to the sound system, i.e., a loudspeaker setup or headphones. This is done by combining each object's signal with the corresponding object description to create signals for the actual loudspeakers and their positions. But the advantages of object-based audio do not stop there. On the one hand, it considerably eases the production, delivery, and archiving of spatial audio content, because only one production is needed instead of multiple mixes for the various reproduction systems. On the other hand, the object-based representation enables more sophisticated control over the reproduction of sound scenes than merely adapting to the loudspeaker positions. For instance, hearing-impaired listeners could increase the intelligibility of the dialogue, and users could select different commentaries or alternative storylines encoded in different objects.
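As a minimal sketch of this combining step, the function below renders one object to a two-loudspeaker (stereo) layout using a constant-power amplitude panning law. This is an assumed, simplified renderer for illustration; practical object renderers such as VBAP generalise the same idea to arbitrary 2-D and 3-D loudspeaker setups:

```python
import numpy as np

def render_stereo(signal, azimuth_deg, level_db=0.0, speaker_span_deg=60.0):
    """Render a mono object signal to left/right loudspeaker signals.

    Assumptions: positive azimuth points towards the right loudspeaker,
    and the loudspeakers sit at +/- speaker_span_deg / 2.
    """
    gain = 10.0 ** (level_db / 20.0)
    half_span = speaker_span_deg / 2.0
    # Map azimuth to a pan position in [0, 1]: 0 = fully left, 1 = fully right.
    pan = np.clip((azimuth_deg + half_span) / speaker_span_deg, 0.0, 1.0)
    # Constant-power panning: left^2 + right^2 == gain^2 for any pan position.
    left = gain * np.cos(pan * np.pi / 2.0) * signal
    right = gain * np.sin(pan * np.pi / 2.0) * signal
    return left, right
```

If the listener's loudspeakers move, only `speaker_span_deg` changes; the object's signal and position metadata stay untouched, which is precisely the adaptability argued for above.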
State of the Art in Object-Based Audio
While object-based audio concepts have been discussed in the research community for quite a while, they are only now gaining momentum in the audio industry, both in terms of standardisation - for instance within the upcoming MPEG-H standard or the EBU Audio Definition Model (ADM) - and in commercial technologies such as Dolby Atmos or DTS Multi-Dimensional Audio (MDA). However, it is as yet unclear to what extent these approaches will reach a broader public, especially for domestic use.
In Search of Expressive Audio Objects
The sound scene representations in current object-based technologies are typically very simple: a collection of objects of a single type, for instance point-like sources, possibly extended by properties such as physical extent or diffuseness. The main research question for this project is whether these scene descriptions are sufficient and suitable for next-generation audio systems, or whether novel, more expressive object models and scene descriptions can increase the benefits of the object-based paradigm. To this end, we explore novel object types and suitable data representations, in combination with algorithms to reproduce these objects on multiple reproduction systems. This includes the exploration of scalable objects, i.e., objects that contain audio and describing metadata at different levels of detail. Such objects can be used to adapt object-based reproduction to loudspeaker systems of different sizes, to audio systems with different levels of processing power, and to networks with different transmission bandwidths. Further research questions include the use of descriptive metadata to adapt the reproduction to the environment, the modality of the listener, and special listener requirements such as hearing impairments.
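One way to picture a scalable object is as metadata stored at several nested levels of detail, from which a renderer picks the richest level its resources allow. The level contents and the selection rule below are assumptions for illustration only, not a proposed format:

```python
# Hypothetical scalable object: level 0 is the minimal description,
# higher levels add progressively richer metadata.
SCALABLE_OBJECT = {
    0: {"position": (30.0, 0.0)},                      # minimal: position only
    1: {"position": (30.0, 0.0), "extent_deg": 15.0},  # adds source extent
    2: {"position": (30.0, 0.0), "extent_deg": 15.0,
        "diffuseness": 0.2},                           # full detail
}

def select_detail(scalable, max_level):
    """Return the metadata at the highest available level <= max_level."""
    usable = [lvl for lvl in scalable if lvl <= max_level]
    return scalable[max(usable)]
```

A small renderer on a low-bandwidth link might call `select_detail(SCALABLE_OBJECT, 0)`, while a full installation would use the highest level; the same transmitted object serves both.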
Andreas Franck, Filippo Fazi