Information site for ANR project HOULE

ISMIR 2012 Workshop on “CASA for MIR”

Half-Day ISMIR 2012 Satellite Workshop: “CASA for MIR: Approaching Computational Auditory Scene Analysis from a Music Information Retrieval standpoint”

Saturday, 13th October 2012, 09:00-13:00


Católica Porto, Porto, Portugal — Campus da Foz


  • Luís Gustavo Martins (lmartins[at]porto.ucp.pt)
  • Mathieu Lagrange (mathieu.lagrange[at]ircam.fr)


Organized as an ISMIR 2012 Satellite Event, this half-day workshop will focus on the use of Computational Auditory Scene Analysis (CASA) approaches for Music Information Retrieval (MIR), in an attempt to raise discussion around questions such as:

  • Is music a harder scenario for a Machine Hearing system than speech/generic sound?
  • What are the specific challenges music poses?
  • How good would a CASA approach be for MIR?
  • How can we evaluate such a system in a music scenario?

The aim of the workshop is thus to bring together key researchers working across different multidisciplinary aspects related to Sound/Music Perception, Sound Source Perception/Segregation, (Computational) Auditory Scene Analysis, Computational Systems for Music Perception, Machine Audition/Hearing, and other related areas.

This workshop is a follow-up of the one organized during DAFx11 (more info here: http://houle.ircam.fr/wordpress/?page_id=29).

Invited Speakers:

Prof. Emmanuel Bigand (http://leadserv.u-bourgogne.fr/fr/membres/emmanuel-bigand)

(Institut Universitaire de France)

Title: “Music cognition and emotion: some challenges for MIR”

Prof. Shihab Shamma (http://www.ece.umd.edu/meet/faculty/shamma.php3#news)

(Dep. of Electrical and Computer Engineering, A. James Clark School of Engineering, University of Maryland, USA)

Title: “Role of coherence and rapid-plasticity in active perception of complex auditory scenes”

Joakim Andén (http://www.cmap.polytechnique.fr/~anden/)

(on behalf of Prof. Stéphane Mallat, who unfortunately could not attend this workshop)

(Ecole Polytechnique, France)

Title: “Stability to time-warping and frequency transposition invariance using scattering representations”

Dr. Tom Walters (http://research.google.com/pubs/author38237.html)

(Machine Hearing Group at Google, USA)

Title: “Machine hearing for large-scale content-based audio analysis”

Prof. George Tzanetakis (http://www.cs.uvic.ca/~gtzan/)

(University of Victoria, BC, Canada)

Title: “Active Computational Musicianship”

Detailed Information:


Prof. Shihab Shamma





Role of coherence and rapid-plasticity in active perception of complex auditory scenes


Humans and other animals can attend to one of multiple sounds, and follow it selectively over time. The neural underpinnings of this perceptual feat remain mysterious. Some studies have concluded that sounds are heard as separate streams when they activate well-separated populations of central auditory neurons, and that this process is largely pre-attentive. Here, we argue instead that stream formation depends primarily on temporal coherence between responses that encode various features of a sound source. Furthermore, we postulate that only when attention is directed towards a particular feature (e.g., pitch) do all other temporally coherent features of that source (e.g., timbre and location) become bound together as a stream that is segregated from the incoherent features of other sources.


Shihab Shamma is a Professor of Electrical and Computer Engineering and the Institute for Systems Research. His research deals with auditory perception, cortical physiology, the role of attention and behavior in learning and plasticity, computational neuroscience, and neuromorphic engineering. One focus has been on studying the computational principles underlying the processing and recognition of complex sounds (speech and music) in the auditory system, and the relationship between auditory and visual processing. Another aspect of the research deals with how behavior induces rapid adaptive changes in neural selectivity and responses, and the mechanisms that facilitate and control these changes. Finally, signal processing algorithms inspired by data from these neurophysiological and psychoacoustic experiments have been developed and applied in a variety of systems such as speech and voice recognition, diagnostics in industrial manufacturing, and underwater and battlefield acoustics. Other research interests include aVLSI implementations of auditory processing algorithms, and the development of robotic systems for the detection and tracking of multiple simultaneous sound sources.


Joakim Andén

(on behalf of Prof. Stéphane Mallat, who unfortunately could not attend this workshop)






Stability to time-warping and frequency transposition invariance using scattering representations


Mel-frequency spectral coefficients (MFSCs), a spectrogram averaged along a mel-frequency scale, have proven very useful for various audio classification tasks. The mel-scale averaging ensures the stability of MFSCs to time-warping in a Euclidean norm, which partly explains their success. However, the averaging loses high-frequency information, so windows are kept small (around 20 ms) to reduce this loss. Consequently, MFSCs cannot capture large-scale structures. The scattering representation recovers the lost information using a cascade of wavelet decompositions and modulus operators which then allows larger window sizes. To create invariance to frequency transposition, scattering coefficients are averaged along acoustic frequency. In order to minimize the information loss due to this averaging, purely temporal wavelet filters are replaced with spectrotemporal filters starting at the second level of the cascade. By inverting the wavelet modulus operator, the original signal can be recovered from the scattering representation. Experiments in audio classification also show that scattering coefficients bring a significant improvement over standard MFSC methods.
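As a rough illustration of the first stage of the cascade described above, the sketch below computes first-order scattering-like coefficients: a bank of bandpass filters, a complex modulus, then temporal averaging. This is a toy under stated assumptions, not the actual transform — Gaussian (Gabor-like) frequency-domain filters stand in for proper Morlet wavelets, and all filter centers, widths, and pooling sizes are illustrative choices.

```python
import numpy as np

def gabor_filter(n, center, width):
    """Frequency-domain Gaussian bandpass centered at `center` (cycles/sample)."""
    freqs = np.fft.fftfreq(n)
    return np.exp(-0.5 * ((freqs - center) / width) ** 2)

def scattering1(x, centers, width=0.01, pool=64):
    """First-order scattering-like coefficients: |x * psi| followed by averaging.

    The local averaging (low-pass pooling) is what buys stability to small
    time-warps, at the cost of fine temporal detail -- the information that a
    second wavelet layer would recover in a full scattering cascade.
    """
    n = len(x)
    assert n % pool == 0, "signal length must be a multiple of the pooling size"
    X = np.fft.fft(x)
    coeffs = []
    for c in centers:
        band = np.abs(np.fft.ifft(X * gabor_filter(n, c, width)))  # wavelet modulus
        coeffs.append(band.reshape(-1, pool).mean(axis=1))         # temporal averaging
    return np.array(coeffs)

# A pure tone at 0.125 cycles/sample lights up only the matching band.
t = np.arange(1024)
x = np.sin(2 * np.pi * 0.125 * t)
S = scattering1(x, centers=[0.0625, 0.125, 0.25])
```

Averaging over frequency instead of (or in addition to) time, as in the transposition-invariant variant mentioned in the abstract, would follow the same pattern with the roles of the two axes exchanged.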


Joakim Andén is a Ph.D. candidate in applied mathematics at Ecole Polytechnique in Paris, France under the supervision of Prof. Stéphane Mallat. Previously, he studied engineering physics and mathematics at the Royal Institute of Technology in Stockholm, Sweden and mathematics at Université Pierre et Marie Curie in Paris, France, from which he received an M.Sc. in 2010. His research focuses on invariant signal representations and their applications to classification and similarity estimation for music, speech and environmental sounds.


Dr. Tom Walters




Machine hearing for large-scale content-based audio analysis


The Machine Hearing group at Google is interested in the application of computational auditory models to content-based audio analysis problems. Being part of Google, we need our approaches to work at scale. In recent years, we’ve developed a number of audio features based on the stabilized auditory image (SAI). The SAI is a correlogram-like representation generated from the output of a compressive auditory filterbank, which simulates the motion of the basilar membrane in the cochlea. I’ll discuss our group’s work at all stages of the model, including the development of the new CAR-FAC auditory filterbank, features derived from the auditory image, and some recent applications in melody recognition, sound-effects search and music recommendation.
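The pipeline sketched in the abstract — a cochlear filterbank followed by a correlogram-like lag analysis — can be illustrated with a toy model. Everything here is an assumption made for illustration: a Gaussian frequency-domain bandpass stands in for the CAR-FAC filterbank, and plain autocorrelation stands in for the strobed temporal integration used in the actual SAI.

```python
import numpy as np

def bandpass(x, fs, fc, bw):
    """Crude frequency-domain Gaussian bandpass standing in for one cochlear channel."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    return np.fft.irfft(X * np.exp(-0.5 * ((f - fc) / bw) ** 2), len(x))

def correlogram(x, fs, centers, max_lag):
    """Toy correlogram: per-channel autocorrelation of the rectified filter output.

    Rows are channels, columns are lags; a periodic source shows up as a ridge
    at the lag matching its period, across the channels it excites.
    """
    rows = []
    for fc in centers:
        y = np.maximum(bandpass(x, fs, fc, fc / 8.0), 0.0)  # half-wave rectification
        ac = np.correlate(y, y, mode="full")[len(y) - 1:len(y) - 1 + max_lag]
        rows.append(ac / (ac[0] + 1e-12))  # normalize by zero-lag energy
    return np.array(rows)

# A 200 Hz tone at fs = 8000 Hz has a period of 40 samples, so the channel
# tuned near 200 Hz should peak at lag 40.
fs = 8000
t = np.arange(4096) / fs
x = np.sin(2 * np.pi * 200.0 * t)
C = correlogram(x, fs, centers=[100.0, 200.0, 400.0], max_lag=100)
```

Stabilizing such a frame sequence against phase jitter (the "stabilized" part of the SAI) is where strobe detection would come in, which this sketch omits.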


Tom’s research focuses on computational models of the human auditory system. Tom received an MSci in Natural Sciences from the University of Cambridge in 2004 and a PhD from the Centre for the Neural Basis of Hearing at the University of Cambridge. He is currently a Research Scientist in the Machine Hearing Group at Google.


Prof. Emmanuel Bigand




Music cognition and emotion: some challenges for MIR


Music is a sophisticated activity deeply rooted in the human brain. Understanding part of this activity with formal models is a challenging issue for MIR. The human brain constantly integrates bottom-up and top-down influences, and the latter are by far the most resistant to computation. Moreover, human brains use coding strategies (such as embodiment and mirror neurons) that have no equivalent in artificial systems. Some typical examples of challenges, belonging to either perception or performance, will be presented during this talk. The case of emotional response to music will be developed in a little more detail because of its large implications for the MIR community.


After a professional career as a double bassist in the Symphonic Orchestra of Marseille, Emmanuel Bigand completed a PhD on music cognition with R. Francès and M. Imberty in Paris and obtained a position as professor of cognitive psychology at the University of Dijon. He collaborated with S. McAdams at IRCAM on research projects with contemporary composers. He is now a member of the Institut Universitaire de France, with a chair on Music, Brain and Cognition. In Dijon, he is the director of a CNRS laboratory specialized in learning and development (http://leadserv.u-bourgogne.fr/fr/membres/emmanuel-bigand), and he coordinates the European project EBRAMUS on Music, Brain and Health (http://leadserv.u-bourgogne.fr/ebramus/). He specializes in the following issues: processing of musical syntax, musical rhetoric, implicit learning of new musical grammars, emotional response to music, and new technology for music cognition.


Prof. George Tzanetakis




Active Computational Musicianship


Music is typically treated as an artifact, both in traditional music information retrieval and, more generally, in computational auditory scene analysis. By that I mean that the signal content is given and the goal is to extract useful information from it. However, in most cases music is a process that unfolds over time, one that musicians actively create as they listen to it at the same time. Active vision is an area of computer vision in which the emphasis is on systems that have a purpose or goal and manipulate their perception to achieve this goal. For example, a robotic camera might pan or zoom adaptively to better recognize a particular object. In some ways this emphasis on intention is related to, but goes beyond, the more familiar top-down (prediction-driven) perception model, as distinct from the more common bottom-up (sensory-driven) model. In this talk I will discuss why I think this is an interesting direction for CASA research related to music, especially in the context of live music performance. I will describe two systems that can be viewed as steps in this active musicianship direction: 1) teaching a virtual violinist to bow a physical model by automatic listening, and 2) self-calibrating musical robots that adapt their behavior by listening to themselves. Even though one would be hard pressed to label such systems as examples of CASA, I believe they help frame a vision of where advances in CASA could have fascinating applications in the process of making music in real time, rather than in the after-the-fact processing of a recording.


George Tzanetakis is an Associate Professor in the Department of Computer Science, with cross-listed appointments in ECE and Music, at the University of Victoria, Canada. He is Canada Research Chair (Tier II) in the Computer Analysis of Audio and Music. In 2011 he was Visiting Faculty at Google Research. He received his PhD in Computer Science at Princeton University in 2002 and was a Post-Doctoral Fellow at Carnegie Mellon University in 2002-2003. His research spans all stages of audio content analysis, such as feature extraction, segmentation, and classification, with specific emphasis on music information retrieval. He is also the primary designer and developer of Marsyas, an open-source framework for audio processing with specific emphasis on music information retrieval applications. His pioneering work on musical genre classification received an IEEE Signal Processing Society Young Author Award and is frequently cited. More recently he has been exploring new interfaces for musical expression, music robotics, computational ethnomusicology, and computer-assisted music instrument tutoring. These interdisciplinary activities combine ideas from signal processing, perception, machine learning, sensors, actuators and human-computer interaction, with the connecting theme of making computers better understand music in order to create more effective interactions with musicians and listeners.


This work has been partially supported by

  • ANR, the French national funding agency, in the scope of the HOULE project