Why Audio Analytic's ‘soft temporal modelling’ design philosophy is key to understanding sounds


01-10-2021

To guide Audio Analytic's research and development, the team set a core technical design philosophy that allows for the flexibility that is inherently needed to take very diverse sound objects and sequences and translate these into very stable and unique detections. As you learn and read more about Audio Analytic's technology, it is useful to understand how their approach runs through their entire ML pipeline from the data they collect, label and augment, to the models they train and evaluate.


First, let’s briefly explore why defining the structure of sounds is so challenging for machines. For this example, we're going to use a sound familiar to all of us: the dog bark. Detecting the sound of your dog barking supports a range of consumer use cases, from smartphones and earbuds that alert you to sounds happening around you if you can’t hear them, to smart speakers and smart home devices that protect your property or enable you to track your pet’s anxiety levels while you are out.

Unless you suffer from some form of significant hearing loss, you will immediately be able to recall the sound of a dog barking, whether that is a Chihuahua or a German Shepherd. But how do you impart this understanding to a machine, especially when such a vast array of acoustic features makes up the sound? For example, there are big dogs, small dogs, old dogs and young dogs.

Different breeds have a variety of snout shapes, which shape the sounds they make. There are different types of barks, such as when a dog is stressed or just looking for friendly attention. And then there is the environment in which the animal barks, which can be inside a house, a garage, a street, a garden, a park and so on. That is a vast number of different sound combinations for the dog part alone. On top of that, each environment features additional sources of sound.

As experts in the understanding of sounds, Audio Analytic's number one goal is to develop algorithms and methods able to map all of these sounds to one label. In this example, that means mapping every possible dog bark in every possible location to these two words: “dog barking”. How do they do that?

One perspective on that problem is to find what is common to all dog barks at the acoustic level. Indeed, dog sounds exist within certain audio frequency boundaries, and are composed of “dog tones” and “dog noises” within certain ranges of variation. This gives part of the answer. But there is another part to it: finding what is different between dog bark audio sequences, and making the algorithm tolerant to these differences. That is the goal of Audio Analytic's core technical design philosophy, which they call “soft temporal modelling” and which looks at both the acoustic and the temporal variability of sounds. In the example of dogs, you could think of this as the musical score of a dog barking episode: how many individual barks there were, how long they were, how each individual bark is structured over time and where in time the barks happen relative to each other. “Woof”, “yap-yap”, “woof woof – woof”, “woof – arf woof – woof”, “woof woof woof arf woof”: five different sound sequences in this example, all still covered by two words, or just one concept: dog barking.

Image: Three very different examples of dog barks

The necessity for softness along the temporal modelling dimension comes from the fact that even for humans, it can be difficult to agree very precisely on where the start and end of each sound or each tone are located, or how long a silence between sounds needs to be to qualify as an interruption. Although intuitively we seem to know what a dog bark is, when it comes to labelling the boundaries of it on a given audio recording, we are not so sure anymore – for example, data labellers often disagree on whether a short silence amidst a sequence of sounds is a boundary or is short enough to be considered part of the whole sound event. This is why Audio Analytic's data is labelled the way it is. They know that various levels of labelling are key to various levels of modelling, so they label simultaneously at the fine and episodic levels in addition to the basic level of weak labels. As a result, Audio Analytic have been able to develop temporal modelling methods which are tolerant to variations in sequence. Two words, one concept, against an infinity of relevant sound combinations and labelling opinions.
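The relationship between the three labelling levels can be illustrated with a small sketch. This is not Audio Analytic's labelling tooling: the function names, the `(start, end)` tuple format and the 0.5 second gap tolerance are all assumptions made for illustration. The idea is that fine-level labels mark individual barks, episodic labels group barks separated by short silences, and a weak label only says whether the clip contains the sound at all.

```python
from typing import List, Tuple

# Hypothetical representation: one labelled segment is (start_s, end_s).
Segment = Tuple[float, float]


def merge_into_episodes(fine: List[Segment], max_gap_s: float = 0.5) -> List[Segment]:
    """Group fine-level segments into episodes.

    Two consecutive segments belong to the same episode when the silence
    between them is at most max_gap_s (an assumed, tunable tolerance).
    """
    episodes: List[Segment] = []
    for start, end in sorted(fine):
        if episodes and start - episodes[-1][1] <= max_gap_s:
            # Short silence: extend the current episode.
            prev_start, prev_end = episodes[-1]
            episodes[-1] = (prev_start, max(prev_end, end))
        else:
            # Long silence (or first segment): start a new episode.
            episodes.append((start, end))
    return episodes


def weak_label(fine: List[Segment]) -> bool:
    """Weak label: does the clip contain the sound at all?"""
    return len(fine) > 0


# Three individual barks: two close together, one much later.
fine_barks = [(0.0, 0.3), (0.5, 0.8), (2.5, 2.9)]
episodes = merge_into_episodes(fine_barks, max_gap_s=0.5)
print(episodes)
print(weak_label(fine_barks))
```

Changing `max_gap_s` changes where the episode boundaries fall, which mirrors the disagreement between labellers described above: with a tolerance of 2.0 seconds, the same three barks collapse into a single episode.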

Audio Analytic's design philosophy flows down to all aspects of their technical innovation, both in terms of what they have already developed and what drives their continuous evolution. For example:

  • The tolerance to sequence interruptions is baked into the way they train the network via their patented loss function, which tells the network what the sound is and what it is not. For example, “the recognition of dog barks should be continuous and tolerant to short interruptions”, which helps the network to find the boundaries between the sound objects in softer ways, instead of imposing hard and precise boundaries.

  • Audio Analytic's PSDS evaluation method, which has been adopted by the organisers of the DCASE Challenge Task 4 as the default metric, considers the quantity of overlap between sounds and their labels, rather than strict and instantaneous sound boundaries, to evaluate whether a detection was correct or not.

  • Similarly, their patented Temporal Decision Engine helps the system to make classification decisions based on soft temporal models of appropriate sound sequences for each type of sound event, in combination with decisions about the tones themselves.
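The difference between overlap-based and strict-boundary evaluation can be sketched as follows. This is a deliberately simplified illustration and not the PSDS definition: the function names, the 0.5 coverage threshold and the assumption that detections do not overlap each other are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# Hypothetical representation: one event is (start_s, end_s).
Segment = Tuple[float, float]


def overlap(a: Segment, b: Segment) -> float:
    """Duration in seconds shared by two segments (0.0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def covered_fraction(truth: Segment, detections: List[Segment]) -> float:
    """Fraction of the labelled event covered by detections.

    Assumes the detections do not overlap each other, so their
    individual overlaps with the label can simply be summed.
    """
    total = sum(overlap(truth, d) for d in detections)
    return total / (truth[1] - truth[0])


def is_detected(truth: Segment, detections: List[Segment],
                min_fraction: float = 0.5) -> bool:
    """Count the event as detected if enough of it is covered.

    min_fraction is an assumed threshold for illustration only.
    """
    return covered_fraction(truth, detections) >= min_fraction


# A labelled barking episode and two detections with imprecise
# boundaries and a short interruption between them.
truth = (1.0, 3.0)
detections = [(0.9, 1.8), (2.0, 3.2)]
print(round(covered_fraction(truth, detections), 3))
print(is_detected(truth, detections))
```

Under a strict criterion requiring the detected onset and offset to match the label exactly, neither detection here would count. Under the overlap criterion, the episode is recognised despite the soft boundaries and the short gap.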

Combining softness and precision, just the right amount of each, is how Audio Analytic get to truly define sounds. In the dog barking example, this is how they summarise thousands of sound instances into just one concept.

Audio Analytic's technology exists to create valuable and exciting experiences for consumers. This means that the models have to be accurate, robust and compact to be commercially successful, a feat that would be impossible without their design philosophy.


 About Audio Analytic 

Audio Analytic is the pioneer of AI sound recognition technology. The company is on a mission to give machines a compact sense of hearing. This empowers them with the ability to react to the world around us, helping satisfy consumers' entertainment, safety, security, wellbeing, convenience, and communication needs across a huge range of consumer products.

Audio Analytic’s ai3™ and ai3-nano™ sound recognition software enables device manufacturers to equip products at the edge with the ability to recognize and automatically respond to their growing list of sounds and acoustic scenes.
