From ASA to CASA, what does the “C” stand for anyway?
Satellite Workshop of the DAFx Conference, organized by Mathieu Lagrange and Luis Gustavo Martins at IRCAM, Paris.
September 23 2011 (Afternoon)
Auditory Scene Analysis (ASA) is the process by which the human auditory system organizes sound into perceptually meaningful elements. Inspired by the seminal work of Al Bregman (1990) and other researchers in perception and cognition, early computational systems were built by engineers and computer scientists such as David Mellinger (1991) and Dan Ellis (1996).
Strictly speaking, a CASA or “machine listening” system is a computational system whose general architecture or key component designs are motivated by findings from ASA. However, since ASA is a Gestaltist theory that focuses on describing rather than explaining the studied phenomenon, computational enthusiasts are left with a largely open field of investigation.
Perhaps this lack of definition does not fit the way we do research nowadays, since papers strictly tackling this issue are relatively scarce. However, informal discussions with experts in sound and musical audio processing confirm that making sense of strongly polyphonic signals is a fundamental problem, interesting from both methodological and application points of view. Consequently, we (the organisers of this workshop) believe that there are fundamental questions that need to be raised and discussed in order to better pave the way for research in this field.
Among others, these questions include:
- From ASA to CASA: only insights?
- Is the knowledge transfer from ASA to CASA only qualitative?
- Are there other approaches in scientific fields such as biology, cognition, etc. that are also potentially meaningful for building powerful computational systems?
- What is CASA?
- Is CASA a goal in itself?
- Can it be decomposed into well-defined tasks?
- Is CASA worth pursuing?
- What are the major obstacles in contemporary CASA?
- How does it relate to other sound processing areas such as Blind Source Separation (BSS) or Music Information Retrieval (MIR)?
This workshop aims to bring the audience some background and new topics on ASA and CASA. Questions such as those cited above will then be raised and discussed with the help of the invited speakers.
We are delighted to have four confirmed invited speakers (in order of appearance; please click on a talk title to jump to the corresponding section on this page):
- Trevor Agus (ENS): Perceptual learning of novel sounds
- Josh McDermott (NYU): Sound texture perception via statistics of the auditory periphery
- Jon Barker (Sheffield Univ.): Probabilistic frameworks for Scene understanding
- Boris Defreville (Orelia): Machine listening in everyday life
The workshop concludes with a Q&A session.
Trevor Agus (ENS): Perceptual learning of novel sounds
To recognize a sound, it is necessary to extract its auditory features, although we do not yet know what these auditory features are. Here, we show that there is not a simple set of features, but rather that new features can be learnt to recognize previously unheard sounds. The behavioral measure [Agus, Thorpe, & Pressnitzer, Neuron, 2010] was based on the detection of repetitions in 1s-long noises, some of which re-occurred throughout an experimental block. Repetitions in the re-occurring noises were detected more frequently, showing learning of otherwise meaningless sounds. The learning was unsupervised, resilient to interference, rapid, and generalizable to similar noises. Multiple noises were remembered for several weeks, and listeners also learnt unrepeated noises. A second set of experiments showed faster selective responses to voices than to acoustically comparable sounds. These results collectively point towards an active mechanism that learns auditory features, affecting the everyday recognition of sounds.
Trevor Agus is a researcher studying hearing in the Laboratoire Psychologie de la Perception and at the Département d’études cognitives at the École normale supérieure. He is particularly interested in how the auditory system processes complex sounds, typical of those encountered in everyday life. He believes that by better understanding the features used to recognize and segregate sound objects, it may be possible to better manage the effects of hearing impairment.
Trevor obtained a BA in mathematics at the University of Cambridge, and an MSc in Music Technology at the University of York. His PhD in Psychology was obtained at the University of Strathclyde in conjunction with the Institute of Hearing Research Scottish Section, based in Glasgow, before starting his post-doctoral research with Daniel Pressnitzer in Paris.
Presentation slides: agusCasa11.pdf
Josh McDermott (NYU): Sound texture perception via statistics of the auditory periphery
Rainstorms, insect swarms, and galloping horses produce “sound textures” – the collective result of many similar acoustic events. Sound textures are distinguished by temporal homogeneity, suggesting they could be recognized with time-averaged statistics. To test this hypothesis, we processed real-world textures with an auditory model containing filters tuned for sound frequencies and their modulations, and measured statistics of the resulting decomposition. We then assessed the realism and recognizability of novel sounds synthesized to have matching statistics. Statistics of individual frequency channels, capturing spectral power and sparsity, generally failed to produce compelling synthetic textures. However, combining them with correlations between channels produced identifiable and natural-sounding textures. Synthesis quality declined if statistics were computed from biologically implausible auditory models. The results suggest that sound texture perception is mediated by relatively simple statistics of early auditory representations, presumably computed by downstream neural populations. The synthesis methodology offers a powerful tool for their further investigation.
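The statistics described above can be illustrated with a toy sketch. The code below is not McDermott’s actual model (which uses cochlear and modulation filterbanks); it is a minimal illustration, assuming a crude FFT-mask filterbank and rectification for envelope extraction, of the kind of time-averaged quantities involved: per-channel marginal statistics (power, a sparsity proxy via kurtosis) and cross-channel envelope correlations.

```python
import numpy as np

def texture_statistics(signal, sr=16000, n_bands=8):
    """Toy time-averaged texture statistics from a crude filterbank.

    Illustrative only: band-passes the signal through log-spaced
    FFT masks, takes rectified subbands as 'envelopes', and returns
    per-channel marginals plus cross-channel correlations.
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    edges = np.geomspace(50, sr / 2, n_bands + 1)
    envelopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = np.where((freqs >= lo) & (freqs < hi), spectrum, 0)
        sub = np.fft.irfft(band, n=len(signal))
        envelopes.append(np.abs(sub))  # crude envelope via rectification
    env = np.array(envelopes)
    mean = env.mean(axis=1)                       # average power per channel
    var = env.var(axis=1)
    kurt = ((env - mean[:, None]) ** 4).mean(axis=1) / np.maximum(var, 1e-12) ** 2
    corr = np.corrcoef(env)                       # pairwise envelope correlations
    return {"mean": mean, "var": var, "kurtosis": kurt, "corr": corr}
```

In the synthesis method the talk describes, statistics like these are measured on a real texture and a noise signal is then iteratively adjusted until its statistics match; the sketch covers only the measurement half.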
Josh McDermott is a perceptual scientist studying sound, hearing, and music in the Center for Neural Science at New York University. His research in hearing addresses sound representation and auditory scene analysis using tools from experimental psychology, engineering, and neuroscience. He is particularly interested in using the gap between human and machine competence to both better understand biological hearing and design better algorithms for analyzing sound. His interests in music stem from the desire to understand why music is pleasurable, why some things sound good while others do not, and why we have music to begin with.
McDermott obtained a BA in Brain and Cognitive Science from Harvard, an MPhil in Computational Neuroscience from the Gatsby Unit at University College London, a PhD in Brain and Cognitive Science from MIT, and postdoctoral training in psychoacoustics at the University of Minnesota. He currently works in the Lab for Computational Vision at NYU, using computational tools from image processing and computer vision to explore auditory representation.
Presentation slides: mcdermottCasa11.pdf
Jon Barker (Sheffield Univ.): Probabilistic frameworks for Scene understanding
The talk will consider (hearing-inspired) probabilistic frameworks for scene understanding. In particular, the talk will consider how individual sound sources can be understood in the presence of multiple competing sound sources. Bregman’s Auditory Scene Analysis account presents this demixing problem as a two-stage process in which innate primitive grouping ‘rules’ are balanced by the role of learnt schema-driven processes. The manner in which we perceive a scene is determined by the poorly understood balance between these processes. For example, our interpretation of speech in noise is a product of universally applied grouping forces, such as cross-frequency pitch grouping, and ‘softer’ expectations that depend on our learnt (and personal) knowledge of the patterns of speech and language. This talk will discuss the difficulty of integrating these contrasting organisational principles in a common probabilistic framework. As an example, the talk will feature an ASA-inspired approach to robust speech recognition, ‘fragment decoding’. The talk will use the shortcomings of this approach to demonstrate what future frameworks need to do better.
The talk will also defend the adoption of a human-inspired approach to scene understanding, i.e. why we might want to build planes with flapping wings even if they can’t fly as fast.
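To make the idea of fragment decoding concrete, here is a heavily simplified sketch. It is not Barker and Cooke’s actual decoder (which couples the search with a hidden Markov model speech recogniser); it only illustrates the core combinatorial step under toy assumptions: given spectro-temporal fragments (binary masks), search over foreground/background labellings and keep the one whose reconstructed foreground best fits a speech model, here stood in for by an arbitrary scoring function.

```python
from itertools import product
import numpy as np

def fragment_decode(fragments, speech_model_score):
    """Toy fragment decoding: exhaustively label each fragment as
    foreground (speech) or background, and keep the labelling whose
    foreground mask scores highest under a supplied speech model.

    fragments: list of boolean masks over a spectrogram grid.
    speech_model_score: callable scoring a foreground mask
        (higher = more speech-like); a stand-in for a real model.
    """
    best_score, best_labels = -np.inf, None
    for labels in product([0, 1], repeat=len(fragments)):
        fg = np.zeros_like(fragments[0])
        for lab, mask in zip(labels, fragments):
            if lab:
                fg = fg | mask  # foreground = union of speech-labelled fragments
        score = speech_model_score(fg)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels, best_score
```

The exhaustive search makes the top-down role of the model explicit: bottom-up grouping proposes fragments, and the learnt model arbitrates between candidate organisations. A practical system would prune this exponential search rather than enumerate it.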
Jon Barker is a Senior Lecturer in the Speech and Hearing Research group of the Computer Science department at the University of Sheffield. He has a long-standing interest in the perceptual organisation of complex acoustic scenes and in machine listening systems inspired by our understanding of Auditory Scene Analysis. He has a particular interest in the robust processing of speech in non-stationary noise environments. This interest faces in two directions: using insights gained from ASA to construct noise-robust automatic speech recognition systems, and using statistical modelling techniques — adopted from the speech recognition community — as a basis for understanding speech intelligibility.
Barker obtained a BA in Electrical and Information Science from the University of Cambridge and a PhD in Computer Science from the University of Sheffield under the supervision of Prof Martin Cooke. He spent some time working on audio-visual speech perception at ICP in Grenoble before returning to Sheffield as a post-doctoral researcher. He has spent some twenty years researching speech and hearing.
Presentation slides: barkerCasa11.pdf
Boris Defreville (Orelia): Machine listening in everyday life
In this presentation, we discuss some real Machine Listening applications inspired by work done at the ORELIA company. First, we present the wide and diverse range of existing applications in the environment, industry, and security areas. Then we focus on technical highlights and challenging tasks driven by real audio analytics applications. In conclusion, we defend the future of Machine Listening systems and their utility in our everyday life.
Boris Defreville is a scientist and an entrepreneur. He co-founded ORELIA in 2007, one of the first audio analytics companies, using Machine Listening technology for everyday-life applications.
Prior to ORELIA, Boris worked for five years as an acoustic consultant and as a scientist on R&D programs dedicated to environmental noise assessment. He received a PhD in Psychoacoustics from Cergy-Pontoise University. You can follow Boris on Twitter: @AudioSense
Presentation slides: defrevilleCasa11.pdf
Transcript of the main questions raised during the panel (thanks to Mathias Rossignol)
Q – (mainly as a follow-up to Jon’s talk, so chiefly concerning speech) In the real world, humans operate based on expectations; they use a model and take context into account. In automatic processing, how can models be introduced? How much prior knowledge? How to find a good balance?
Jon – this is why I backed off from having “understanding” in the title of my talk. Dealing with semantics, cognition, is a distinct and tricky matter. It should probably be added on top of our system, but we’re avoiding it at the moment.
What we do is try to factorize, on one side, what’s dependent on the environment/context (including cognitive context) and, on the other, what isn’t, and see what’s the best we can do with the second part.
Boris – context information is something we really need in real-life applications. Temporal context — what happens right before and after a sound — is potentially very interesting to disambiguate.
Jon – one thing that I can see being introduced realistically is the notion of the “domain” of background noise; for example, we know we’re in a house, so we expect household noises in the background.
Josh – in texture recognition, context is necessary: rain, for example, is very similar to applause. But we don’t really understand yet how that works.
Q – Mathieu, in your introductory talk you accused people in connected fields of using ideas from CASA without ever getting into it in depth. For me, as an MIR person, what would be a good way to be more “serious” about CASA?
Mathieu – CASA is often seen as a “dangerous game”: the goal is somewhat ill-defined, and it’s very hard to evaluate. But even for me, who claims to be centrally interested in it, it’s not perfectly clear, and I would like to take this occasion to forward the question to our experts here:
Sub-Q – if CASA can be defined as a goal in itself, how can we evaluate it?
Josh – It’s hard to answer this question directly; there is obviously a cultural problem here: the technological community is focused on performance improvement, but there is also a need for fundamental research, to which CASA belongs.
Sub-Q – so are we not yet to the point where we can harness ASA?
Josh: no, that’s not it, I wouldn’t be that categorical. But if you’re going to work on it, it’s got to be long term.
Jon: there are some tasks where CASA is fundamental, especially when you need to mimic human behaviour, including human weaknesses. For example, in robotics, a robot that communicates like you and has the same insufficiencies is better because it more easily triggers a feeling of empathy.
Q – (Josh -> Jon) In your system you gather fragments into sources, so the assumption is that the grouping cues used to form fragments are always right; you don’t put your fragments into question once they’re formed. How could we take fallibility into account?
Jon – we use probabilistic grouping rules to form sources, but that’s possible because we have a model to guide us. Having probabilistic fragments raises a problem of efficiency. And if you don’t have a model, then you’re stuck.
Luis – for music could stream formation be the guiding rule? It’s hard to make a model for each instrument!
Mathieu – worse than that, you’d need a model for each instrument *and* playing technique! To go back to Jon’s work, that’s the good point I appreciate about it: the way it manages to combine in some way a bottom-up and a top-down approach.
Concerning speech and music, I’d like to mention the work of Barbara Tillmann, in Lyon: it seems that every result on the perception of speech has a correspondence in music perception. So probably there are, if not dedicated brain areas, at least similar mechanisms in play.
Q (audience) – what about detecting genre on an excerpt and applying corresponding priors to analyze the whole?
Summarized answer – why not, but that raises the question of what tools you use to detect the genre; a somewhat “chicken and egg” problem.
Q – what about special attention phenomena, such as recognizing when your name is spoken, even in noise or in the background?
Jon – that’s a typical keyword spotting task; so we can imagine having a “name spotting process” continually running in the background.
The interesting thing is that this suggests there must be, contrary to our assumptions, at least some processing of the “audio background” going on.
Josh (spontaneous remark, not question-related) – at the moment, research in psychoacoustics is unfortunately very detached from machine listening, partly because psychoacoustics now deals mostly with artificial signals, whereas machine listening is interested in real-world sounds.
It seems to me essential now to try and understand how people listen to real world sounds. There are real insights to be gained there.
Mathieu – maybe music could be a good example? It’s a sound organization system.
Josh – but it’s engineered (by the composer) to make the listener able to perform good scene analysis. So it’s a special case, dangerous for generalization.
Trevor – not always, though: an orchestra can sometimes “try” to sound like a single instrument.
Public – modern music increasingly uses sounds without any physical counterpart, which makes it an interesting challenge: it means you have to work on purely perceptual cues.
Public – CASA is hard for music notably because of synchronized onsets, etc., and that’s also why MIR people are “scared” of CASA.
Jon – Josh’s remark is also true of speech: it’s also made to be listened to.
Luis – CASA is not necessarily source separation, but more the separation/identification of perceptual sound objects.
Public (Mathias) – one difference between speech and music, though, is that in speech you’ll more commonly have active listening — moving your head, asking to repeat.
Public – The problem of physically active listening is indeed important, but the notion of intellectually/perceptually active listening must be kept in mind too.
This work has been partially supported by
- the ANR, the French funding agency, in the scope of the HOULE project
- “Fundação para a Ciência e Tecnologia” and by the Portuguese Government, in the scope of the Project “A Computational Framework for Sound Segregation in Music Signals”, with reference PTDC/EIA-CCO/111050/2009.