You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and
Sound Event Detection
S. Venkatesh, D. Moffat, and E. Miranda (2021). arXiv:2109.00962. Comment: 20 pages, 4 figures, 6 tables. Added more experimental validation.
Abstract
Audio segmentation and sound event detection are crucial topics in machine
listening that aim to detect acoustic classes and their respective boundaries.
They are useful for audio-content analysis, speech recognition, audio indexing,
and music information retrieval. In recent years, most research articles adopt
segmentation-by-classification. This technique divides audio into small frames
and individually performs classification on these frames. In this paper, we
present a novel approach called You Only Hear Once (YOHO), which is inspired by
the YOLO algorithm popularly adopted in Computer Vision. We convert the
detection of acoustic boundaries into a regression problem instead of
frame-based classification. This is done by having separate output neurons to
detect the presence of an audio class and predict its start and end points.
YOHO obtained a higher F-measure and lower error rate than the state-of-the-art
Convolutional Recurrent Neural Network on multiple datasets. As YOHO is purely
a convolutional neural network and has no recurrent layers, it is faster during
inference. In addition, as this approach is more end-to-end and predicts
acoustic boundaries directly, it is significantly quicker during
post-processing and smoothing.
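The regression idea in the abstract (separate output neurons for class presence plus start and end points, decoded directly into event boundaries) can be sketched as follows. The array shapes, field order, `bin_duration` default, and merging logic are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def decode_events(output, bin_duration=0.5, threshold=0.5):
    """Turn a YOHO-style output grid into (class, onset, offset) events.

    `output` has shape (num_bins, num_classes, 3): for each time bin and
    acoustic class, [presence, rel_start, rel_end], where rel_start and
    rel_end are fractions of the bin duration. These shapes and defaults
    are assumptions for illustration only.
    """
    events = []
    num_bins, num_classes, _ = output.shape
    for b in range(num_bins):
        for c in range(num_classes):
            presence, rel_start, rel_end = output[b, c]
            if presence >= threshold:
                # Regress absolute boundaries from the bin index.
                onset = (b + rel_start) * bin_duration
                offset = (b + rel_end) * bin_duration
                events.append([c, onset, offset])
    # Merge same-class detections that touch or overlap -- a much cheaper
    # smoothing step than frame-wise post-processing.
    merged = []
    for ev in sorted(events, key=lambda e: (e[0], e[1])):
        if merged and merged[-1][0] == ev[0] and ev[1] <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], ev[2])
        else:
            merged.append(ev)
    return [tuple(ev) for ev in merged]
```

Because boundaries come out of the network as numbers rather than per-frame labels, post-processing reduces to thresholding and merging, which is the speed advantage the abstract claims over frame-based classification.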
@misc{venkatesh2021yololike,
abstract = {Audio segmentation and sound event detection are crucial topics in machine
listening that aim to detect acoustic classes and their respective boundaries.
They are useful for audio-content analysis, speech recognition, audio indexing,
and music information retrieval. In recent years, most research articles adopt
segmentation-by-classification. This technique divides audio into small frames
and individually performs classification on these frames. In this paper, we
present a novel approach called You Only Hear Once (YOHO), which is inspired by
the YOLO algorithm popularly adopted in Computer Vision. We convert the
detection of acoustic boundaries into a regression problem instead of
frame-based classification. This is done by having separate output neurons to
detect the presence of an audio class and predict its start and end points.
YOHO obtained a higher F-measure and lower error rate than the state-of-the-art
Convolutional Recurrent Neural Network on multiple datasets. As YOHO is purely
a convolutional neural network and has no recurrent layers, it is faster during
inference. In addition, as this approach is more end-to-end and predicts
acoustic boundaries directly, it is significantly quicker during
post-processing and smoothing.},
added-at = {2022-03-08T09:49:54.000+0100},
author = {Venkatesh, Satvik and Moffat, David and Miranda, Eduardo Reck},
biburl = {https://www.bibsonomy.org/bibtex/29695a761267d44a7f7be1c13bfcac84d/annakrause},
description = {[2109.00962] You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection},
interhash = {e0dd5369a04a8b31e886d111f68708e3},
intrahash = {9695a761267d44a7f7be1c13bfcac84d},
keywords = {audio segmentation singleshot},
  note = {cite arxiv:2109.00962. Comment: 20 pages, 4 figures, 6 tables. Added more experimental validation},
timestamp = {2022-03-08T09:49:54.000+0100},
title = {You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and
Sound Event Detection},
url = {http://arxiv.org/abs/2109.00962},
year = 2021
}