Loudspeaker monitoring is the reference when audio professionals evaluate content. Headphones are also important quality-checking tools, and many consumers enjoy music using “close-fitting listening devices”, as all flavours of headphones are called in recent standards writing.
We discuss the two reproduction methods from perceptual, recording and mastering perspectives, focusing on differences in timbre, imaging and auditory envelopment when listening to stereo. Practical applications of headphones in recording, such as setting up and trimming stereo or 3D microphone arrays, are also covered in detail.
In the last part of the workshop, attendees are invited to compare the two domains for themselves on the qualities and applications discussed, with guided listening to audio examples on a pair of precision nearfield monitors (Genelec 8351B) and a pair of excellent headphones (Audeze CRBN2).
Stefan Bock, born 20.08.1964 in southern Germany, started his career as an audio engineer in 1987. After freelancing at different facilities in Munich, he co-founded msm-studios in 1991, where he was Chief Mastering Engineer and General Manager.
Recording Producer and Balance Engineer with 50 GRAMMY-nominations, 42 of these in craft categories Best Engineered Album, Best Surround Sound Album, Best Immersive Audio Album and Producer of the Year. Founder and CEO of the record label 2L. Grammy Award-winner 2020 and 2026. Immersive...
Current deep learning approaches to speech enhancement rely heavily on objective measures such as mean squared error or scale-invariant signal-to-distortion ratio as both training objectives and evaluation metrics. While analytically convenient, these benchmarks often fail to capture the nuances of human perception or actual intelligibility. Furthermore, the inconsistent integration of metrics like Short-Term Objective Intelligibility or Perceptual Evaluation of Speech Quality into training and evaluation pipelines leaves a gap between algorithmic performance and perceptual reality. This paper proposes a transition towards evaluation methodologies grounded in psychoacoustics and audiological modeling. Our study explores two distinct methods to characterise enhanced signals. On one hand, we employ a perceptual approach based on the Cambridge loudness model to assess the preservation of spectral excitation patterns and perceived intensity. On the other hand, we adopt a biophysical approach by utilising CoNNear, a convolutional model of the human auditory periphery. This allows us to simulate representations of responses at different stages of the auditory periphery to observe how speech enhancement processing affects the physiological representation of speech. We analyse pre-trained speech enhancement models using automatic speech recognition and Short-Term Objective Intelligibility as an additional proxy for human intelligibility. By mapping automatic speech recognition performance against loudness and peripheral response patterns, we investigate the extent to which current enhancement strategies maintain the perceptual and physiological integrity of the speech signal. This work aims to identify features predictive of intelligibility, providing a foundation for speech enhancement systems optimised for the human listener rather than purely signal-based objective functions.
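As a point of reference for the signal-based objectives mentioned above, the following is a minimal sketch of scale-invariant signal-to-distortion ratio. It assumes the common definition in which the estimate is first projected onto the reference and the energy ratio of projection to residual is then expressed in dB. The function name `si_sdr` and the toy signals are illustrative and are not taken from the paper.

```python
import math


def si_sdr(reference, estimate):
    """Scale-invariant SDR in dB for two equal-length sample sequences."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0:
        raise ValueError("reference must not be silent")

    # Optimal gain that projects the estimate onto the reference.
    scale = sum(r * e for r, e in zip(reference, estimate)) / ref_energy
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    if residual_energy == 0:
        return math.inf   # estimate is a scaled copy of the reference
    if target_energy == 0:
        return -math.inf  # estimate is orthogonal to the reference
    return 10 * math.log10(target_energy / residual_energy)


# Toy example: a clean signal and an estimate with a small additive error.
clean = [1.0, 0.0, -1.0, 0.0]
enhanced = [1.0, 0.1, -1.0, -0.1]
score = si_sdr(clean, enhanced)
```

Because the estimate is rescaled before the ratio is taken, multiplying `enhanced` by any non-zero gain leaves the score unchanged, which is the property that distinguishes this measure from plain SDR.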
Objective quality evaluation is widely used in speech coding, yet objective estimates often show limited agreement with subjective listening-test results. Rather than focusing on absolute score accuracy, this paper evaluates objective speech quality models from a decision-making perspective, defined as their ability to support comparative judgments between speech codecs or codec configurations. A formal ITU-T P.800 Absolute Category Rating (ACR) listening test was conducted with 30 listeners across 24 conditions, covering conventional and neural monophonic speech codecs operating under clear-channel conditions at sampling frequencies from 16 to 48 kHz and bit rates ranging from below 1 kbps to above 16 kbps. The speech material consisted of internally recorded, clean French-language speech that was not used in the development or training of any of the evaluated codecs or objective quality models. Seven objective quality models, namely PESQ, ViSQOL Speech, ViSQOL Audio, WARP-Q, NISQA, UTMOS, and DistillMOS, were evaluated on the same material. Decision-making performance was assessed by comparing subjective and objective rankings using Kendall’s rank correlation coefficient and by analyzing pairwise codec comparisons using t-tests at a 95% confidence level. The results show that some objective quality models are effective for comparing bit rate variations within a given speech coding technology, provided that all other codec parameters (e.g., sampling frequency) remain unchanged. However, all models exhibit limitations, including tendencies toward over- or underestimation for certain technologies, as well as reduced reliability when applied across different sampling frequencies. Despite its conventional origins, PESQ remains capable of supporting decision-making even when applied to neural speech codecs.
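The ranking comparison described above can be sketched as follows. This is a minimal illustration of Kendall's rank correlation in its tau-a form, where tied pairs count as neither concordant nor discordant. The abstract does not say which variant the authors used, and the score lists below are made-up placeholder values, not results from the study.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1   # both measures order this pair the same way
        elif product < 0:
            discordant += 1   # the two measures disagree on this pair
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical per-condition scores (one entry per codec condition).
subjective_mos = [2.1, 3.0, 3.8, 4.3]
objective_score = [1.9, 3.4, 3.2, 4.1]
tau = kendall_tau_a(subjective_mos, objective_score)
```

With these placeholder values, five of the six condition pairs are ordered the same way by both measures and one pair is swapped, so `tau` equals (5 - 1) / 6, about 0.67. A value of 1 would mean the objective model reproduces the subjective ranking exactly.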
Music source separation (MSS) systems are commonly used in production, remixing, and audio analysis work, yet questions remain about how well objective evaluations of model performance align with human perceptual evaluations, particularly for non-traditional source material (in this case, heavily processed electronic music). This study seeks to set out a framework for evaluating three machine learning approaches to MSS: a spectrogram-domain model (Spleeter), a waveform-domain model (Demucs v2), and a hybrid-domain model (HTDemucs). Subjective evaluations of model performance were collected via a MUSHRA-style listening test, while objective evaluations used signal-to-distortion ratio (SDR) and Fréchet Audio Distance (FAD). Results showed consistent agreement across objective metrics, with the hybrid-domain model outperforming the two single-domain models. Perceptual ratings also favored the hybrid model, with listeners occasionally rating the model output as equal to or better in quality than the original reference. Preliminary analysis indicates moderate but statistically non-significant correlations between the two assessment paths, reinforcing concerns about relying solely on numerical evaluations when discussing MSS model performance. Implications for model design and future evaluation procedures are discussed.
Recent advances in large-scale multichannel loudspeaker systems have enabled immersive concert formats that extend spatial control beyond conventional stereo and small multichannel configurations. High-density loudspeaker arrays (HDLAs) allow sound to be distributed across complex architectural spaces, challenging established distinctions between composition, performance, and live sound practice. In live contexts, however, the realization of spatial attributes is often constrained by system complexity, limited rehearsal time, and the lack of artist-facing spatial control interfaces. As a result, spatial realization and sound diffusion are frequently delegated to sound engineers, who translate artistic material to the acoustic and architectural conditions of the venue in real time.
This paper examines three immersive concerts presented during Sonic Days 2025 in Denmark, realized on both large-scale and small-scale multichannel loudspeaker systems. The concerts represent contrasting production contexts, including a site-specific spatial composition conceived explicitly for a high-density loudspeaker array and performances by artists whose practices are typically oriented toward stereo or small multichannel formats. Across these cases, spatialization functioned variously as compositional material, interpretive layer, and adaptive live-mixing practice.
The paper analyzes how control over spatial attributes is negotiated between artists and sound engineers in live immersive concert settings, and how this negotiation affects the interpretation of artistic intent and audience experience. Particular attention is given to the role of sound engineers as active mediators whose decisions shape spatial form, listening perspective, and the relationship between sound and architecture. The findings suggest that immersive concert formats redistribute creative agency across artists, technicians, and technological infrastructures, and point toward the need for revised conceptual frameworks for authorship, performance, and listening in large-scale spatial audio environments.
This presentation develops a conceptual framework for understanding how visitors make sense of sound in museum exhibitions. While sound increasingly features in museum practice, research has focused primarily on measuring visitor enjoyment and engagement rather than examining the specific meanings sound generates. This gap reflects the absence of a framework conceptualizing sound's meaning-making capacities to guide empirical investigation. Drawing on scholarship from music studies, semiotics, phenomenology, and embodied cognition, I propose a seven-component spectrum identifying distinct yet interrelated meanings that sound can convey in museums: aesthetic, representational, emotional, sensorial, imaginative, social, and political. These meanings can be apprehended independently or in combination, typically through emergent, pre-conscious perception rather than deliberate awareness. The spectrum builds on the premise that museum sound meaning-making unfolds through dynamics internalized from early childhood as we attune to the world sonically. It draws on the notion of sound as a "sonic aggregate" (Grimshaw and Garner 2015), encompassing social, contextual, temporal, and embodied experiences, rather than reducing sound to wave phenomena. Visitors actively co-produce meanings by drawing on their moods, memories, knowledge, and imagination during exhibition encounters. Each meaning category is illustrated with exhibition case studies, demonstrating the spectrum's applicability across diverse sound-based multimodal museum practices, from popular music exhibitions to sound art installations. The spectrum aims to catalyze research through varied methodological approaches and establish analytical standards for studying sound in museums, with potential adoption by international standardization bodies.
Sound Studies Researcher, INET-md | NOVA University Lisbon
A PhD in ethnomusicology and museum studies and a curator, I am committed to exploring the diverse meaning-making capabilities of sound when exhibited in museums, encompassing the representational, emotional, sensorial, and social, as well as its ability to foster imagination and...
Friday May 29, 2026 10:30am - 11:00am CEST, Aud 43, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
Prof. Li Dakang is a preeminent recording engineer and a pioneer of 3D recording of Chinese traditional music, ancient instruments and spaces. Attendees are treated to a selection of unique 3D recordings, including a new and glorious version of China’s National Anthem. Prof. Li describes the LDK-Cube for capturing the envelopment of an acoustic space, and questions whether this important quality can be reproduced reliably.
This masterclass series, featuring remarkable recording artists, is a chance to hear 3D audio at its best, as we discuss the qualities that make it truly worth the effort.
In each masterclass, we explore the new spatial possibilities in recording and production, and also describe this specific listening room with regard to ITU-R BS.1116 compliance and auditory envelopment (AEV) transparency. Seats are limited to keep playback variation at bay.
This paper presents the perceptual evaluation of the Open Binaural Renderer (OBR), an open-source library developed for headphone-based rendering of Immersive Audio Model and Formats (IAMF) content. The evaluation followed an iterative framework in which findings from a pilot listening study informed the tuning of rendering profiles, and the resulting renderer was benchmarked against established proprietary solutions. In the pilot study, 19 expert listeners rated the Overall Listening Experience (OLE) of the initial prototype (OBRv1) and five external renderers across diverse audio content. Qualitative feedback was analysed using inductive coding to identify salient perceptual dimensions. The pilot revealed content-dependent performance and showed that a single default profile was inadequate, yielding mixed responses in both the numerical scale and the qualitative feedback and motivating the development of multiple rendering profiles in OBRv2. The main study evaluated two OBRv2 profiles targeting different reverberation characteristics (Direct and Ambient) alongside three top-performing external renderers. A total of 39 participants, divided into expert and non-expert groups, rated five perceptual attributes: Voice Quality, Envelopment, Externalisation, Overall Listening Experience, and Timbral Balance. Mixed-design ANOVA revealed significant main effects of renderer condition on all attributes. Pairwise comparisons showed that the OBRv2 Ambient profile achieved significantly higher OLE ratings than one proprietary renderer and reached statistical parity with the remaining two, representing a measurable improvement over the prototype. A trade-off between Voice Quality and Externalisation was observed, driven by the level of reverberation in each renderer. The results demonstrate that iterative, perceptually informed tuning can yield competitive binaural rendering quality in an open-source framework.
Professor of Audio Engineering, University of York
Gavin Kearney graduated from Dublin Institute of Technology in 2002 with an Honors degree in Electronic Engineering and has since obtained MSc and PhD degrees in Audio Signal Processing from Trinity College Dublin. He joined the University of York as Lecturer in Sound Design in January...
With 25+ years of media industry product development, Jani Huoponen is a seasoned expert in developing cutting-edge audio and video technologies for consumer devices and streaming systems. Joining Google in 2010, he’s served as a product manager across key multimedia initiatives...
Despite the growing number of hearing-impaired workers wearing hearing aids in occupational settings, understanding speech in multi-talker situations remains challenging. This difficulty is particularly pronounced in open-plan offices, where simultaneous talkers and room reverberation are prone to degrade speech intelligibility. While spatial cues are essential for segregating target speech from competing sources, hearing-aid signal processing may alter the binaural information that supports spatial hearing. Accurate evaluation of hearing-aid performance is therefore crucial. Objective speech intelligibility metrics offer an efficient alternative to time-consuming listening tests; however, their validity in complex spatial scenarios involving hearing-impaired listeners remains unclear. Monaural metrics such as HASPI account for individual hearing loss but neglect spatial information, whereas binaural metrics such as MBSTOI incorporate spatial cues but are primarily designed for normal-hearing listeners. This study evaluates the ability of existing objective metrics to predict speech intelligibility for hearing-aid users in multi-talker spatial environments. Listening tests are conducted with 20 hearing-impaired participants fitted with binaural hearing aids. Four types of multi-talker auditory scenes representative of open-plan offices are reproduced using a loudspeaker array. They involve target speech combined with diffuse noise and a localized competing speech source. Objective measurements are performed using an acoustic mannequin fitted with the participants’ hearing aids. HASPI and MBSTOI values are computed from the binaural signals recorded at the eardrums and incorporating individual hearing losses. Objective predictions are compared with subjective intelligibility scores, and an ablation analysis is conducted to distinguish the effects of hearing loss modeling from those of binaural processing.
Morten describes his excellent recording techniques, and attendees are treated to a unique selection of high resolution 3D music listening examples.
Recording Producer and Balance Engineer with 50 GRAMMY-nominations, 42 of these in craft categories Best Engineered Album, Best Surround Sound Album, Best Immersive Audio Album and Producer of the Year. Founder and CEO of the record label 2L. Grammy Award-winner 2020 and 2026. Immersive...
Friday May 29, 2026 12:30pm - 1:30pm CEST, Aud 31, Technical University of Denmark, Asmussens Alle, Building 306, DK-2800 Kgs. Lyngby, Denmark
Speech intelligibility is a key factor in successful communication across various domains, including research, post-production for film and television, live sound reinforcement, and audio production. Traditional assessment methods often lack objectivity or fail to capture the listener’s experience in real-world scenarios. In this workshop, we introduce an innovative approach to measuring speech intelligibility based on the concept of “Listening Effort.” We will present the underlying technology, share practical examples from different application areas, and demonstrate how this method can be integrated into workflows to optimize intelligibility. Attendees will have the opportunity to participate in a hands-on demonstration and discuss potential use cases relevant to their own work. This session is designed for professionals and researchers seeking reliable and actionable tools for evaluating and improving speech intelligibility in diverse environments. In this workshop, we present a new technology for measuring speech intelligibility (“Listening Effort”). The method is used in research, post-production (film/TV), live sound, and audio production. The session is aimed at professionals from both academia and industry who are interested in objectively assessing and optimizing speech intelligibility.
Participants will be able to join a short demo/exercise and ask questions.
Introduction & Relevance: Overview of the importance of speech intelligibility across different fields
Technology & Methodology: Presentation of the measurement method and underlying concepts
Practical Examples: Case studies from research, post-production (film/TV), live sound, and production
Live Demo / Interactive Exercise: Practical demonstration and opportunity for active participation
Discussion & Outlook: Q&A, exchange of ideas, and future perspectives
Situational awareness is a multisensory ability that enables individuals to perceive and appropriately take into account their immediate environment. This perception of the world through our senses is carried out continuously and unconsciously throughout the day. When auditory perception is degraded, an individual may no longer correctly perceive a doorbell, a water leak, or an alarm signal, which negatively affects quality of life and may lead to dangerous situations. Auditory perception can in particular be degraded by hearing loss, a widespread condition. The most common treatment consists of wearing hearing aids, which are mainly designed to improve speech intelligibility, especially in noisy environments. Feedback from hearing-impaired people and hearing-aid users indicates that, although auditory situational awareness has been recognised as an essential component of well-being, it remains insufficiently studied and requires further investigation. There is currently no standard method for assessing to what extent one's situational awareness is affected by hearing impairment and the use of hearing aids. This is a complex process that requires assessing the perception of relevant sound events within a continuous stream of multisensory information, by individuals who have different subjective preferences. Most existing methods are limited to evaluating only a subset of the problem, such as identification and localisation of non-speech sound events. The rise of new technologies, such as virtual reality, enables the development of assessment methods within more realistic yet controlled environments. This study aims to review existing methods in order to highlight their limitations in addressing the issue at hand.
The purpose of this study was to evaluate a sonification system that maps live heart rate data to real-time spectral filtering of a runner's preferred music. Assessed using a within-subjects design (n = 13), the system employs high-pass and low-pass filters to indicate deviations from target heart rate zones, providing instantaneous biofeedback without requiring visual attention. Quantitative analysis revealed no statistically significant differences in target zone accuracy or response time between auditory, visual, and combined conditions. However, qualitative thematic analysis identified a clear division in user preference. Participants favouring the auditory condition demonstrated faster mean response times to audio biofeedback. Findings suggest that while sonification promotes environmental focus and "gamifies" training, its efficacy is highly dependent on individual processing styles and music familiarity.
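The mapping from heart rate to filtering can be pictured with a small sketch. Everything specific here is an assumption for illustration only: the abstract does not say which filter signals which direction of deviation, nor what cutoff ranges were used. In this hypothetical version, a heart rate below the zone engages a low-pass filter, a heart rate above it engages a high-pass filter, and the cutoff moves further into the spectrum as the deviation grows.

```python
def filter_for_heart_rate(bpm, zone_low, zone_high, max_deviation=30.0):
    """Return (filter_type, cutoff_hz) for a heart rate reading.

    Hypothetical mapping: all constants below are illustrative assumptions.
    """
    if zone_low > zone_high:
        raise ValueError("zone_low must not exceed zone_high")
    if max_deviation <= 0:
        raise ValueError("max_deviation must be positive")

    if bpm < zone_low:
        # Below the zone: low-pass, cutoff glides from 20 kHz down to 500 Hz.
        amount = min((zone_low - bpm) / max_deviation, 1.0)
        return ("lowpass", 20000.0 * (500.0 / 20000.0) ** amount)
    if bpm > zone_high:
        # Above the zone: high-pass, cutoff glides from 20 Hz up to 2 kHz.
        amount = min((bpm - zone_high) / max_deviation, 1.0)
        return ("highpass", 20.0 * (2000.0 / 20.0) ** amount)
    # Inside the zone: leave the music untouched.
    return ("bypass", None)


# Example: target zone of 140 to 155 bpm.
state = filter_for_heart_rate(150, 140, 155)
```

The exponential interpolation keeps the cutoff movement roughly even on a logarithmic frequency scale, which is one plausible way to make equal heart rate steps sound like equal changes.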
This paper proposes a psychoacoustic-based audio-visual mapping framework for intelligent vehicle cabins to enhance immersion and stabilize spatial auditory perception. By establishing mappings between auditory descriptors (such as Direction of Arrival (DOA), spectral centroid, and temporal envelope) and ambient lighting parameters, the framework leverages "ambient vision" to augment the perceptual experience without increasing the driver's cognitive load. Theoretical analysis based on Stevens’ Power Law indicates that the proposed mapping strategies effectively synchronize audio-visual intensities and mitigate perceptual fatigue, providing a conceptual reference for future multisensory HMI design.
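For readers unfamiliar with Stevens’ Power Law, the relation below shows its general form and how it could be used to match intensities across the two modalities. This is a generic sketch, not the paper's own derivation: the subscripts A (audio) and V (visual), the constants k, and the exponents a and b are placeholder symbols whose values the abstract does not give.

```latex
% Stevens' power law: perceived magnitude \psi as a function of
% physical stimulus magnitude \varphi.
\begin{align}
  \psi &= k\,\varphi^{a} \\
  % Applied separately to the audio and visual stimuli:
  \psi_A &= k_A\,\varphi_A^{a}, \qquad \psi_V = k_V\,\varphi_V^{b} \\
  % Requiring equal perceived magnitude, \psi_A = \psi_V, and solving
  % for the light level that matches a given sound level:
  \varphi_V &= \left(\frac{k_A}{k_V}\right)^{1/b} \varphi_A^{\,a/b}
\end{align}
```

Under these assumptions the matching light level is itself a power function of the sound level, with exponent a/b, so a lighting mapping that is linear in the physical quantities would generally not keep the two perceived intensities in step.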
Headphone listening has become an integral part of everyday life, spanning music consumption, communication, online media, and increasingly, computer gaming. These diverse listening contexts make individual sound exposure highly variable and difficult to quantify. While music listening and occupational headphone use have been widely studied, sound exposure from gaming remains comparatively undocumented. This study investigated the relationship between self‑reported exposure through headphones and cochlear function assessed using transient evoked otoacoustic emissions (TEOAE). Forty‑one university students completed a detailed questionnaire on listening habits, and TEOAEs were recorded in both ears across five half‑octave frequency bands. Estimated weekly exposure levels were derived from participants’ reported durations and contexts of use. TEOAE amplitude, signal‑to‑noise ratio (SNR), and reproducibility showed clear frequency‑dependent patterns and small ear asymmetries, consistent with typical OAE behaviour. Only limited associations were found between self‑reported exposure and TEOAE measures, with significant effects emerging primarily for SNR and reproducibility in the highest‑exposure group. No consistent differences were observed between long‑term gamers and non‑gamers. These findings suggest that self‑reported exposure alone may be insufficient to detect subtle cochlear changes in young adults, and underscore the need for more precise exposure‑monitoring methods when evaluating recreational sound exposure risks.
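The abstract does not say how the weekly exposure levels were computed. One common way to combine reported durations and listening levels, shown here purely as an assumption, is the equal-energy rule: each activity contributes sound energy in proportion to its duration, and the total is normalised to a reference week. The function name, the 40-hour reference and the example figures are all illustrative.

```python
import math


def weekly_exposure_level(activities, reference_hours=40.0):
    """Combine (hours_per_week, level_db) pairs into one weekly level in dB.

    Assumes the equal-energy rule with a 40-hour reference week; both are
    illustrative assumptions, not details taken from the study.
    """
    if reference_hours <= 0:
        raise ValueError("reference_hours must be positive")
    # Energy-weighted sum: hours multiplied by relative sound energy.
    energy = sum(hours * 10 ** (level_db / 10) for hours, level_db in activities)
    if energy <= 0:
        raise ValueError("no exposure reported")
    return 10 * math.log10(energy / reference_hours)


# Hypothetical questionnaire answers: (hours per week, estimated level in dB).
reported = [
    (10.0, 80.0),  # music on the commute
    (14.0, 75.0),  # gaming with a headset
    (5.0, 70.0),   # video calls
]
level = weekly_exposure_level(reported)
```

Under this rule, halving the listening time at a fixed level lowers the result by about 3 dB, which makes the estimate sensitive to how accurately participants recall both duration and volume.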
Binaural rendering is typically assessed via timbre and localization accuracy, while its intrinsic spatial resolution remains rarely quantified. This paper proposes a perceptual evaluation method based on Minimum Audible Angle (MAA) measurements to estimate the azimuthal just-noticeable difference (JND) introduced by binaural rendering algorithms. We systematically compared several rendering algorithms across eight reference azimuths using two participant-allocation paradigms. The results show that spatial resolution is significantly influenced by Ambisonic order and the choice of rendering algorithm, with MAA thresholds systematically decreasing as the truncation order increases. Furthermore, the proposed method successfully captures physiological spatial characteristics and identifies resolution limits imposed by reference angles. While both participant-allocation paradigms yield consistent qualitative trends, the repeated-measures design provides superior data stability. These findings demonstrate that the proposed MAA-based method is an effective tool for quantifying the spatial resolution of binaural rendering algorithms.
Richard is a multiple Grammy Award–winning recording engineer and a specialist in acoustic music recording. His work is focused primarily on classical, jazz, and film score music. A selection of his immersive recordings will be presented, accompanied by a discussion of the microphone configurations and mixing decisions employed in each example.
This study evaluates three Next-Generation Audio (NGA) rendering systems through listening tests using real-life audio content. The testing paradigm prioritized subjective preference over adherence to a ground-truth reference. Participants assessed perceptual spatial audio attributes in both 5.1 and 7.1.4 loudspeaker setups. The findings suggest that strict adherence to the rendering algorithm used during content creation is not mandatory in terms of listener preference. While not advocating disregarding artistic intent without consideration, this study proposes that such flexibility in reproduction can be an acceptable compromise.
Toni Hirvonen studied acoustics at the Helsinki University of Technology (now Aalto University), where he obtained a PhD in audio signal processing and spatial audio. After a position as a Marie Curie fellow, he has worked internationally in the audio industry since 2010. His projects...