NYC Gazette

Sunday, July 7, 2024

New AI method filters out background noise from bird song recordings

Rabbi Dr. Ari Berman, President and Rosh Yeshiva | Yeshiva University

Researchers have developed a deep-learning method for removing unwanted background noise from audio recordings of bird sounds.

The method, called ViTVS, uses image processing technology to divide audio signals into distinct parts, or segments, for isolating clean bird sounds from a noisy background. The approach is explained in the paper “Vision Transformer Segmentation for Visual Bird Sound Denoising,” which has been accepted for presentation at InterSpeech 2024 by researchers from the Katz School’s Department of Computer Science and Engineering and Cornell University’s School of Public Policy.

“The vision transformer architecture is a powerful tool that can look at small parts of a whole, like pieces of a puzzle, and understand how they fit together, which helps in identifying and separating sounds from noise,” said Sahil Kumar, the first author of the paper and a student in the Katz School’s M.S. in Artificial Intelligence. Youshan Zhang, assistant professor of artificial intelligence and computer science, is a co-author of the paper and Kumar’s faculty mentor.

ViTVS helps the model understand and represent the audio comprehensively and in detail, capturing patterns and features both small and large, as well as those that occur over short and long time spans. The model can also capture fine details in the audio, helping it distinguish subtle differences between sounds, such as nuances in bird calls.
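As a rough illustration of the puzzle-piece idea, the snippet below (a toy sketch using NumPy with made-up sizes, not the ViTVS code itself) slices a spectrogram into fixed-size patches and flattens each patch into a token, the way a vision transformer prepares its input:

```python
import numpy as np

# A vision transformer treats the spectrogram like an image and cuts it
# into fixed-size patches ("puzzle pieces"). Sizes here are illustrative.
spectrogram = np.random.rand(128, 128)  # frequency bins x time frames (toy data)

patch_size = 16
patches = []
for i in range(0, spectrogram.shape[0], patch_size):
    for j in range(0, spectrogram.shape[1], patch_size):
        patches.append(spectrogram[i:i + patch_size, j:j + patch_size])

patches = np.stack(patches)                 # (64, 16, 16): an 8x8 grid of patches
tokens = patches.reshape(len(patches), -1)  # flatten each patch into a token
print(tokens.shape)                         # (64, 256)
```

The transformer then learns how these tokens relate to one another, which is what lets it reason about both local detail and broad context at once.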

“This is important for understanding sounds that change slowly or have a broad context,” said Youshan Zhang. “This method enhances the model’s ability to process and understand audio by capturing detailed, extensive, and varied patterns, which is crucial for tasks like separating clean bird sounds from noisy backgrounds.”

The team used sophisticated algorithms, specifically fully convolutional neural networks, to automatically learn how to distinguish between noise and actual bird sounds, leading to more effective noise removal. Additionally, techniques like the Short-Time Fourier Transform (STFT) and Inverse Short-Time Fourier Transform (ISTFT) were employed to convert audio into a visual format.
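To make the convolutional idea concrete, here is a minimal sketch (with a hand-made averaging filter standing in for learned weights, not the paper’s actual network) of how a fully convolutional pass slides one small filter across the whole spectrogram and scores every time-frequency cell:

```python
import numpy as np
from scipy.signal import convolve2d

np.random.seed(0)
spectrogram = np.random.rand(64, 64)  # toy magnitude spectrogram
kernel = np.ones((3, 3)) / 9.0        # stand-in for a learned convolutional filter

# The same filter is applied at every position, so the network works on
# inputs of any size and produces a per-cell score map.
feature_map = convolve2d(spectrogram, kernel, mode="same")
mask = (feature_map > feature_map.mean()).astype(float)  # crude "bird vs. noise" mask

denoised = spectrogram * mask  # in ViTVS, the mask would come from training
print(denoised.shape)          # (64, 64)
```

In the trained model, many such filters are stacked and their weights are learned from labeled examples rather than fixed by hand.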

STFT converted the audio signal into a visual representation, similar to an image, showing how the frequency content of the signal changes over time. After the noise was identified and removed in the visual format, ISTFT converted the cleaned representation back into the original audio format.
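The round trip described here can be sketched with SciPy’s `stft` and `istft` functions. This is a toy example with a synthetic tone and a naive magnitude threshold standing in for the learned denoiser:

```python
import numpy as np
from scipy.signal import stft, istft

np.random.seed(0)
fs = 16000                                 # sample rate (toy example)
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 2000 * t)       # stand-in for a bird call
noisy = clean + 0.1 * np.random.randn(fs)  # add background noise

# STFT: waveform -> time-frequency "image"
f, frames, Z = stft(noisy, fs=fs, nperseg=512)

# Naive denoising: zero out cells whose magnitude looks like noise.
Z_denoised = Z * (np.abs(Z) > 0.02)

# ISTFT: cleaned "image" -> waveform
_, recovered = istft(Z_denoised, fs=fs, nperseg=512)
```

In ViTVS itself, the mask applied to the time-frequency representation comes from the trained segmentation network rather than a fixed threshold.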

“This makes it easier to see and identify patterns in noise and actual bird sounds,” said Kumar.

These techniques made cleaning up audio, or removing noise, more manageable by transforming the audio into a format where the patterns and differences between noise and actual bird sounds are more apparent.

“Traditional deep-learning methods often struggle with certain types of noise, especially artificial low-frequency noises,” said Zhang. “Extensive testing shows that ViTVS outperforms existing methods, setting a new standard for cleaning up bird sounds and making it a benchmark solution for real-world applications.”
