Into the unknown: why sites like YouTube are useless for sound recognition

Blog by Adrian Stępień | Senior ML & Audio Research Engineer, Audio Analytic

Not all sounds are created equal. When it comes to training and evaluating sound recognition systems that deliver consistently good performance in consumer applications, you cannot rely on recordings downloaded from the internet (YouTube, Freesound, etc.) because of a range of technical and legal limitations.


“On two occasions I have been asked, ‘Pray, Mr Babbage, if you put into the machine wrong figures, will the right answers come out?’ … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.” Charles Babbage, Passages from the Life of a Philosopher

Although it can be accessed at a click of the mouse, audio and video content shared online is made available for specific purposes only – typically for social networking, entertainment and media production. It wasn’t uploaded so that researchers and engineers could use it for building sound recognition systems.

In many cases, use for commercial purposes is prohibited, and the copyright is held by the individual or company who created, filmed or recorded the content. There are also various technical limitations to using internet-downloaded recordings, which is why a detailed and considered approach is required when training and evaluating sound recognition technology that is fit for the real world.

In this blog, I look at six technical limitations, explaining the impact that each can have on sound recognition systems.


Depending on the evaluation circumstances, an internet-downloaded sound event may be correctly classified, misclassified or missed altogether by a sound recognition system. The outcome depends heavily on the distance between the playback speaker and the microphone, the type of sound, its frequency content and the effects applied to it. Internet-downloaded audio files are unknown quantities: information about the recording environment and processes is unclear. These inconsistencies make the files unsuitable for training, and undesirable for evaluating, a sound recognition system fit for consumer applications.
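
To make the "unknown quantity" point concrete, here is a minimal sketch in Python (standard library only) of what an audio file can and cannot tell you about itself. It is an illustration under stated assumptions, not a description of any production pipeline. The helper names (make_test_wav, describe_wav) and the provenance field names are hypothetical, and a synthetic tone stands in for a downloaded file.

```python
import io
import math
import struct
import wave


def make_test_wav(sample_rate=16000, seconds=0.1, freq=440.0):
    """Build a small mono 16-bit WAV in memory (a stand-in for a downloaded file)."""
    n_frames = int(sample_rate * seconds)
    frames = b"".join(
        struct.pack(
            "<h",
            int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / sample_rate)),
        )
        for i in range(n_frames)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        w.setframerate(sample_rate)
        w.writeframes(frames)
    buf.seek(0)
    return buf


def describe_wav(fileobj):
    """Return (known, unknown): header facts vs. provenance the header cannot hold."""
    with wave.open(fileobj, "rb") as w:
        known = {
            "channels": w.getnchannels(),
            "sample_rate_hz": w.getframerate(),
            "bit_depth": w.getsampwidth() * 8,
            "duration_s": w.getnframes() / w.getframerate(),
        }
    # Hypothetical provenance fields: none of these are stored in a WAV header,
    # so for an arbitrary downloaded file they can only be left as None.
    unknown = {
        key: None
        for key in (
            "microphone",
            "source_distance_m",
            "room",
            "effects_applied",
            "prior_codecs",
        )
    }
    return known, unknown


if __name__ == "__main__":
    known, unknown = describe_wav(make_test_wav())
    print("Known from the file:", known)
    print("Not recoverable from the file:", sorted(unknown))
```

The header reports only format-level facts such as sample rate, bit depth and duration. Everything that shapes how a sound event actually reaches a recognition system (microphone, distance, room, processing history) has to come from documentation of the recording process, which internet-sourced files rarely provide.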

The inconsistencies and problems introduced by internet-downloaded audio recordings span many factors, as highlighted in figure 1 below, which compares the issues facing real sounds in real environments with those facing files found on the internet. The combination – or mixture – of these factors makes it difficult to pinpoint the root cause (or causes) of any resulting issues in the source file, leading to mistakes in the evaluation and training of sound recognition systems.
