Setting out our long-term vision for the future of machine listening
Our vision is to create exceptional human experiences through a greater sense of hearing. The ability to teach machines to understand sound will play a foundational role in human-centered experiences which are pivotal to ambient intelligence/computing and the metaverse. A ‘greater sense of hearing’ will connect us to the past, help us be in the present and anticipate what we need in the very near future.
By Dr Chris Mitchell | CEO and Founder, Audio Analytic
Sound recognition is already having an impact on consumers, such as giving them peace of mind when they leave their home by listening out for sounds that indicate all is not well, like breaking windows, smoke alarms and dogs barking. But by thinking about a sense of hearing in its broadest terms, there are many more experiences that an understanding of sound can enable within consumer devices. In some cases this will still mean responding to a particular sound, like a car horn, but in many more cases combinations of sounds will provide contextually rich information that will help connect us to the past, be in the present or anticipate the near future.
Plus, the range of consumer devices that would benefit from a sense of hearing is wide. These can be static devices around the home like smart speakers and displays, doorbells, cameras, set-top boxes, TVs, lightbulbs, thermostats, etc. They can be mobile too, like smartphones, earbuds, AR glasses, smart watches and vehicles.
A new era of human-centred experiences
These devices have different computing and power profiles, which is why compactness is so important. There has been a lot of talk about ambient computing and the metaverse in recent months. These visions foresee a world where AI and connectivity are dispersed, seamlessly fitting into our lives on all manner of form factors, fusing the digital and real world.
This fusion of physical and digital experiences changes the way that humans interact with technology, as it exists all around us, rather than just being on a central device like the PC or smartphone. It becomes about human-centricity or egocentricity (as Facebook called it recently).
This new reality sets up some exciting challenges for AI which will need to seamlessly integrate with the human’s main input mechanisms – our senses.
As our two primary senses, hearing and sight are integral to the way that we make sense of the world around us. For the metaverse concept to become reality it has to encapsulate both of these sensory inputs, using the unique capability of each at the right point in time. While AR glasses might be our go-to reference device for this potential future, we have to avoid thinking about it as a purely visual experience and a purely visual ML challenge.
The human-centred sense of hearing
If we place the user at the centre of our understanding of sound then we can see how to supercharge that sense of hearing through the devices and services that people engage with. We can get these devices to help us to hear things we can’t, to remember things we’ve heard, to improve our listening experiences, to anticipate our needs and even to extend our ability to understand what is going on in places beyond our natural range.
Let me give you some examples:
Help me to hear because my hearing is impaired
The source of impairment could be some form of hearing loss, the result of wearing noise-cancelling earbuds or even because you’re distracted by the sound coming from your AR glasses or your smart speaker. The devices in your ear or around you can use sound to help you avoid hazards or give you a nudge when they perceive that something is happening – like a knock at the door – to draw your attention to important or hazardous things in the real world.
Improve what I hear because the environment around me is challenging
This could be earbuds, AR glasses, the smart speakers within your home, the portable speaker you take to the beach or even the loudspeaker on your smartphone. Our enjoyment of music and podcasts, or our ability to interact with people, voice assistants or navigation aids, is affected by the world around us, whether that is background noise or poor room acoustics. By understanding the context of use and acoustics, products can adapt to the immediate world around them and improve our experiences.
Extend what I can hear because I’m not present
Our sense of hearing has its limits. We haven’t evolved as a species to hear over long distances but there are many situations where being able to understand what is happening in other locations that are miles or even thousands of miles away can be incredibly useful. Home security is a popular application for our sound recognition technology because people want peace of mind when they are at work. But there are other useful applications, such as caring for elderly relatives or knowing when your teenager is home from school. Sounds are really useful indicators of activity that don’t infringe on privacy.
Remember what I heard because my memories are precious
We capture so many special moments on our phones and now glasses. Some are for instant sharing while others are saved for another day in the future, when we want to relive that memory. As we amass tens of thousands of photos and videos, navigating them becomes a challenge, as does editing them on the fly. By understanding the sounds featured within videos or the sounds around you as you take a photo, phones can help suggest useful edits or filters or even help you find and relive that amazing moment from long ago. By recognising certain sounds and combining this information with time and location, devices can also help us to recall when things happened or where we were.
Anticipate my needs because I’m busy
Sounds provide useful cues. For example, I can hear that somebody is chopping food or frying a steak, even if I can’t see them. Combining these sounds makes me incredibly hungry but it also tells me that food is on its way. Giving a device like a plugged-in smart display or a pair of glasses the ability to draw conclusions or disambiguate situations like this means that it can offer hands-free help right at the specific time it is needed. This could be prompting the chef to add the next ingredient at the right time, like “did you remember to add the asparagus and garlic?”
In the above examples some of those experiences can be fully dependent on understanding sound, but in many cases they will involve fusing different AI branches like vision and sound. In applications where the field of view is constantly changing (such as AR glasses), a 360-degree input like hearing will probably solve a lot of the major technical headaches.
Where does this vision take us?
As with all good visions, ours doesn’t have an end point. It is a constant quest for ways of delivering incredible experiences through a greater sense of hearing.
For me, the area of contextual awareness is really exciting. Sound plays a key role in helping us to disambiguate situations – sometimes on its own or in conjunction with other sensory inputs. Giving devices as compact as earbuds or as pervasive as the smartphone the ability to understand contextual cues from the world around them means that you can improve user interfaces and reduce the need for users to manually intervene by anticipating the near future.
I also find it exciting because it poses significant technical challenges for our researchers and engineers. Contextual awareness involves teaching machines to infer things from sounds in a way that doesn’t become hard rules, but instead blends softness and precision. We have written separately about our ‘soft-temporal modelling’ design philosophy. This design philosophy will be essential to context as we move into areas where an understanding of sound needs to be modelled over short and long time ranges.
Our vision unlocks fascinating new experiences where devices can take an active role in helping us every day. And because it is on all devices, the technology just blends into the background, which is a critical component of ambient intelligence.