HOULE: Hierarchical Object-based Unsupervised Learning for Computational Auditory Scene Analysis
Generic auditory scene analysis tools aimed at environmental soundscapes
The field of auditory signal analysis has long been concerned mainly with two domains, speech and music, both of which support a wide range of profitable applications. In recent years, however, environmental soundscape analysis has become a topic of increasing interest for the research community, owing to a better understanding of its importance, omnipresence, and applicative potential: well-being, environmental quality, monitoring, detection of emergency situations, etc.
The main difficulty of this task is that environmental soundscapes, unlike most speech or music signals, are not richly structured according to formal or semantic rules: the types of events and their acoustic signatures can vary widely. It is therefore harder to base their analysis on prior knowledge about the patterns or features to look for, which constitutes the most common and fruitful approach in music or speech processing.
Our aim with this project is to overcome those barriers by developing auditory scene analysis tools that remain as generic as possible, relying on minimal prior knowledge.
Making use of perceptually motivated analysis principles in order to build a multi-level structure of sound objects
To limit assumptions about the nature of the analyzed signal, we rely only on simple object-constitution criteria drawn from studies of human perception: the smooth evolution of features in simple objects (continuity), and the identification of consistently repeated patterns that can be interpreted as more complex objects.
We model the audio stream as a multi-level structure in which objects of increasing complexity are progressively built by concatenating simpler ones. This structural description of the auditory scene is constructed iteratively by the ALC (Alternate Levels Clustering) algorithm in a bottom-up manner, starting at the lowest levels with elementary sound fragments that are step by step gathered into larger objects. At each level of the analysis, a different weighting of the considered criteria drives the aggregation; by controlling those weights, the operation of the algorithm can be adapted to a great variety of scenes and objects.
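The level-wise aggregation can be illustrated with a minimal sketch. All names, thresholds and feature dimensions below are hypothetical and do not reflect the actual ALC implementation: adjacent fragments are merged when their weighted feature distance is small (the continuity cue), and each level reruns the pass with its own criterion weights.

```python
import numpy as np

def merge_level(fragments, weights, threshold):
    """Greedily concatenate adjacent fragments whose weighted
    feature distance falls below the threshold (continuity cue)."""
    merged = [fragments[0]]
    for frag in fragments[1:]:
        # weighted Euclidean distance between mean feature vectors
        diff = merged[-1].mean(axis=0) - frag.mean(axis=0)
        d = np.sqrt(np.sum(weights * diff ** 2))
        if d < threshold:
            merged[-1] = np.vstack([merged[-1], frag])  # grow current object
        else:
            merged.append(frag)  # start a new object
    return merged

# Elementary fragments: rows are frames, columns are (toy) features.
rng = np.random.default_rng(0)
fragments = [rng.normal(loc=c, scale=0.01, size=(4, 3))
             for c in (0.0, 0.02, 1.0, 1.02)]

# One pass per level, each with its own criterion weights.
level1 = merge_level(fragments, weights=np.ones(3), threshold=0.5)
level2 = merge_level(level1, weights=np.array([2.0, 1.0, 0.5]), threshold=2.0)
```

With a tight threshold, the first pass merges only near-identical neighbors (two objects remain); the second pass, with looser weighting, joins them into a single higher-level object.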
Main outcomes of the project
The ALC algorithm described above constitutes the main contribution of this project, which has also given rise to other significant outcomes.
SimScene is a tool for the semi-automatic generation of audio scenes, dedicated to the thorough evaluation of analysis systems against a reliable ground truth. It can generate on demand a large number of distinct scenes following a generic high-level description. SimScene has notably been used during the IEEE AASP DCASE (Detection and Classification of Acoustic Scenes and Events) evaluation challenge.
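The idea of sampling many concrete scenes from one high-level description can be sketched as follows; the description format, class names and parameters here are hypothetical and do not reflect SimScene's actual interface. Each event class is given a mean inter-onset interval and a level range, from which distinct timed event lists are drawn.

```python
import random

def generate_scene(description, duration, rng):
    """Sample one concrete scene (a list of timed events) from a
    high-level description: per-class mean interval and level range."""
    events = []
    for cls, spec in description.items():
        # Poisson-like event arrivals: exponential inter-onset intervals
        t = rng.expovariate(1.0 / spec["mean_interval"])
        while t < duration:
            level_db = rng.uniform(*spec["level_range_db"])
            events.append({"class": cls, "onset": round(t, 2),
                           "level_db": round(level_db, 1)})
            t += rng.expovariate(1.0 / spec["mean_interval"])
    return sorted(events, key=lambda e: e["onset"])

# Hypothetical high-level description of a street scene.
description = {
    "car_pass": {"mean_interval": 8.0, "level_range_db": (-12.0, -3.0)},
    "bird_call": {"mean_interval": 5.0, "level_range_db": (-20.0, -10.0)},
}
rng = random.Random(1)
scenes = [generate_scene(description, duration=60.0, rng=rng)
          for _ in range(10)]
```

Because the description is generative rather than a fixed annotation, every sampled scene is distinct while the exact onset and level of each event remain known, providing the reliable ground truth needed for evaluation.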
The demanding requirements of the ALC algorithm for fast and flexible clustering led us to develop k-averages, an iterative hard clustering algorithm operating on inter-object similarities. k-averages yields better results than the standard algorithms in the literature while running 20 to 100 times faster.
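A similarity-based hard clustering pass in this spirit can be sketched as follows; this is a simplified illustration, not the actual k-averages algorithm. Each object is repeatedly reassigned to the cluster with which it has the highest mean similarity, until assignments stabilize.

```python
import numpy as np

def similarity_clustering(S, k, n_iter=20):
    """Hard clustering from a similarity matrix S (n x n): repeatedly
    reassign each object to the cluster whose members it is, on
    average, most similar to."""
    n = S.shape[0]
    labels = np.arange(n) % k  # deterministic initial assignment
    for _ in range(n_iter):
        changed = False
        for i in range(n):
            # mean similarity of object i to each (non-empty) cluster
            scores = np.array([S[i, labels == c].mean()
                               if np.any(labels == c) else -np.inf
                               for c in range(k)])
            best = int(scores.argmax())
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break  # assignments have stabilized
    return labels

# Toy similarity matrix over two well-separated groups of objects.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
S = -np.abs(X - X.T)  # higher (less negative) = more similar
labels = similarity_clustering(S, k=2)
```

Working directly on pairwise similarities, rather than on coordinates and centroids as k-means does, is what makes this family of methods usable for sound objects, whose comparison is naturally expressed as a similarity rather than a position in a vector space.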
Three production-grade tools have been developed and made available to the public:
k-averages can be found at https://bitbucket.org/mlagrange/kaverages in two optimized versions (C and Matlab); it is the topic of an article submitted to the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
ALC can be used online; it is the topic of an article submitted to the European Signal Processing Conference (EUSIPCO) 2015.
SimScene has been used as a central piece in the setting up of a validation protocol based on simulated scenes during the IEEE AASP DCASE evaluation challenge; the full Matlab tool can be found at https://bitbucket.org/mlagrange/simscene and is the topic of a paper submitted to the IEEE Transactions on Audio, Speech, and Language Processing (TASLP).
HOULE is a Young Researcher Project in fundamental research coordinated by Mathieu Lagrange, researcher at CNRS. Also involved in the project are Nicolas Misdariis, Arshia Cont and Axel Roëbel, researchers at IRCAM. The project started in September 2009 and lasted 41 months. It benefited from ANR funding of 232,000 euros, for a total cost of about 444,000 euros.