YouTube's captioning system now recognizes sound effects

Updated on 19 May, 2017

Advances in Google's machine learning (ML) work make it possible to transcribe the audio of YouTube videos into text automatically. Years ago, YouTube introduced an automatic captioning system that transcribes the speech in videos into captions, making uploaded content more accessible. On closer inspection, Google found that transcribing speech alone does not capture the full impact of a video. The company has therefore extended the system to recognize and caption ambient sounds as well.

The system can now recognize sound effects and add them to captions as labels such as [LAUGHTER], [APPLAUSE] and [MUSIC]. For now, it is limited to these three sounds. According to Google's post, they were chosen because they are less complex to detect and are the sound effects that video producers caption most often.

How does it work?

The YouTube captioning system applies machine learning (ML), specifically a deep neural network (DNN) model trained on labeled data. This model makes it possible to recognize and caption these three sounds. Whenever a new video is uploaded, the system runs the model on it to detect ambient sound effects.

For now, the system can only recognize [APPLAUSE], [MUSIC], and [LAUGHTER]. These are the sound effects that Google's research found to occur most often in videos and to be captioned most frequently.
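To make the idea concrete, here is a minimal sketch of how per-second scores from a sound classifier could be turned into caption labels. It is not Google's implementation: the function name scores_to_captions, the one-score-per-second input format, the 0.5 threshold and the two-second minimum duration are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# The three sound effects named in the article.
LABELS = ("APPLAUSE", "MUSIC", "LAUGHTER")


def scores_to_captions(
    frame_scores: List[Dict[str, float]],
    threshold: float = 0.5,   # assumed cut-off, not a documented value
    min_frames: int = 2,      # assumed minimum run length, in seconds
) -> List[Tuple[int, int, str]]:
    """Turn per-second scores into (start_sec, end_sec, "[LABEL]") segments.

    frame_scores[t] maps a label to the (hypothetical) classifier score
    for second t of the video; missing labels count as 0.0.
    """
    captions = []
    for label in LABELS:
        start = None
        # A trailing empty frame closes any run still open at the end.
        for t, scores in enumerate(frame_scores + [{}]):
            active = scores.get(label, 0.0) >= threshold
            if active and start is None:
                start = t  # a run of this sound begins
            elif not active and start is not None:
                # The run ended; keep it only if it lasted long enough.
                if t - start >= min_frames:
                    captions.append((start, t, f"[{label}]"))
                start = None
    return sorted(captions)


if __name__ == "__main__":
    demo = [
        {"MUSIC": 0.9},
        {"MUSIC": 0.8},
        {"MUSIC": 0.7, "APPLAUSE": 0.6},
        {"APPLAUSE": 0.9},
        {"APPLAUSE": 0.8},
        {"LAUGHTER": 0.9},  # a single second: shorter than min_frames
    ]
    for start, end, text in scores_to_captions(demo):
        print(f"{start:>3}s to {end:>3}s  {text}")
```

In this sketch, overlapping sounds each get their own segment, and very short detections are dropped so that the captions do not flicker. A real system would need to handle both cases more carefully.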

Although the sound space contains many more ambient sounds that carry relevant information, these three are comparatively unambiguous and straightforward. Suppose a video contains a [RING]. That raises the question of what rang: a bell, a phone or an alarm? Such ambiguity makes it harder for the automatic captioning system to recognize the sound and therefore to caption it meaningfully.

Moreover, as part of its work to make the automatic captioning system recognize and caption more sounds, Google has developed an analysis framework and infrastructure designed to scale, covering the detection of ambient sounds and their rendering as captions.

Finally, Google promises to expand the algorithm so that the automatic captioning system can detect more ambient sounds, such as [PIANO], in the future.