Many types of distortion can be measured, ranging from linear to non-linear. The two are often intertwined, and linear distortion influences non-linear distortion. Distortion also depends strongly on signal and level, which makes it hard to compare one type of distortion measurement with another. There are many non-linear distortion metrics; THD, THD+N and IMD are the most classic ones and use sine tones as the test signal. But how can we measure distortion with real signals such as speech, music or even noise, and compare the results to audibility? This tutorial covers a wide range of distortion measurements and discusses what is audible and what distortion sounds like.
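As a point of reference for the sine-tone metrics named above, here is a minimal sketch of the classic THD ratio. It assumes the RMS amplitudes of the fundamental and its harmonics have already been read off a spectrum; the function name and inputs are illustrative and not taken from the tutorial.

```python
import math


def thd_percent(fundamental_rms, harmonic_rms):
    """THD in percent: RMS sum of the harmonics relative to the fundamental.

    fundamental_rms: RMS amplitude of the test tone (assumed already measured).
    harmonic_rms: RMS amplitudes of the 2nd, 3rd, ... harmonics.
    """
    if fundamental_rms <= 0:
        raise ValueError("fundamental amplitude must be positive")
    harmonics_total = math.sqrt(sum(h * h for h in harmonic_rms))
    return 100.0 * harmonics_total / fundamental_rms


# Hypothetical reading: harmonics at 3% and 4% of the fundamental.
print(round(thd_percent(1.0, [0.03, 0.04]), 3))
```

Note that THD is sometimes normalised to the total signal rather than to the fundamental; this sketch uses the fundamental-referenced form.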
Steve Temme is founder and President of Listen, Inc., manufacturer of the SoundCheck audio test system. Steve founded the company in 1995, and for the past 30 years the company has remained on the cutting edge of research into audio measurement, regularly introducing new measurement...
Thursday May 28, 2026 10:00am - 11:00am CEST, Aud 49, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
This work presents the results of a perceptual study investigating how a virtual acoustics system installed in the live room of a professional recording studio influences musicians. The study analyzed relationships between a selection of objective acoustic parameters (T30, STLate, LJ) and the subjective perceptions of 19 solo musicians performing under 11 different acoustic conditions. The experiment was conducted using the VAT (Virtual Acoustic Technology) system and the VAT Suite software developed at the Immersive Media Laboratory (IMLab) in the Sound Recording Department at McGill University. Correlations between quantitative and qualitative analyses show that musicians’ preferences converge on conditions with T30 ≈ 1 s, and that late and lateral energy increases the perception of spatiality, providing a positive balance between clarity and acoustic support. However, longer reverberation reduces comfort and executive control.
Kseniya Kawko, a Munich- and London-based Tonmeister and recording engineer specializing in classical music and jazz, shares selections from her recent live and studio recording and mixing projects, featuring leading orchestras and jazz ensembles, and provides an introduction to the artistic and production considerations behind immersive formats.
This masterclass series, featuring remarkable recording artists, is a chance to hear 3D audio at its best, as we discuss the qualities that make it truly worth the effort.
In each masterclass, we explore the new spatial possibilities in recording and production, and also describe this specific listening room with regard to ITU-R BS.1116 compliance and auditory envelopment (AEV) transparency. Seats are limited to keep playback variation at bay.
Kseniya Kawko is a producer and recording engineer specialized in classical music and jazz. She holds Master of Music degrees from two world-renowned audio programs: Sound Recording, McGill University (Montréal, Canada) and Musikregie / Tonmeister, Hochschule für Musik Detmold (Germany...
Multichannel audio formats require attention to the correlation between channels and sometimes a special approach. In this workshop, we would like to continue the discussion started at AES Show 2025 in LA and show how different measurement tools can be used to avoid certain problems in the final mix. Examples include the mutual influence between the upper and main beds in an immersive layout, problems in the LFE channel, and how to check the mix for correlation issues outside the sweet spot.
A “phantom image” is the illusion of an independent sound source created by two or more loudspeakers. Most often created by manipulating level differences between stereophonic channels (aka “panning”), the effect is used to create a sense of auditory space between loudspeakers and is largely taken for granted. In recent years, surround and immersive audio systems have attempted to use phantom image processing to render audio objects in desired positions across multiple loudspeaker arrays. This research examined the efficacy of phantom image perception horizontally and vertically from an active listener perspective. After listening to a target loudspeaker, listeners (n = 442) were asked to move a phantom sound to a position matching that of the target loudspeaker. The listener’s phantom placement was then compared to the target, and subjects were allowed to “correct” their phantom position. The horizontal experiment was based on a standard stereophonic 60° loudspeaker array with the target loudspeaker at 15° off center. The vertical experiment used elevated loudspeakers in a 60° arc with the target loudspeaker elevated 10° above the horizon (lower loudspeaker). Results show nearly universal “undershoot” in horizontal placement error on first attempts, with gradual improvement over trials that coalesced around the projected target location. However, after repeated tries, final perceptual image locations were spread over 2/3 of the sound field around the target loudspeaker. In the vertical trials, perceptual locations were spread across the entire sound field in all three trials and failed to show any pattern of coalescence around the target loudspeaker.
Associate Professor of Audio Engineering Technology, interested in the perception and cognition of music and sound, especially timbre and attention. An amateur historical keyboardist. And my first name sounds like "song-he" as in "The song he sang was beautiful."
Target curves for the sound signature of headphones are a helpful design aid during the development process. While a lot of attention has been paid to finding target curves that match the listening preference of consumers, equivalents for studio headphones date back to the 1990s. In the context of music production, a mutual target or even a standard is essential to make mixing and mastering more gear-independent. This becomes even more important as the main tool for sound engineers shifts from loudspeakers in professional environments, such as acoustically treated studios, to headphones, often additionally equipped with virtualization algorithms. This enables engineers to be more flexible and to rely less on potentially expensive loudspeaker setups. The diffuse field target curve, currently still the only standardized target curve for studio headphones, is often reported not to match a real loudspeaker equivalent in studio environments. In this paper, we aim to find a new standard target curve for studio headphones that emulates the frequency response of a loudspeaker setup in modern studio environments. For this, we give an overview of current target curves and match them to their equivalent loudspeaker setups. Based on that, we propose a new methodology for a measurement-based target curve that incorporates typical panning paradigms of music signals, based on measurements inside multiple control rooms. To verify the results, we conduct listening tests with professionals in multiple studio environments.
We present Binaspect, an open-source Python library for binaural audio analysis, visualization, and feature generation. Binaspect generates interpretable “azimuth maps” by calculating modified interaural time and level difference spectrograms, and clustering those time-frequency (TF) bins into stable time-azimuth histogram representations. This allows multiple active sources to appear as distinct azimuthal clusters, while degradations manifest as broadened, diffused, or shifted distributions. Crucially, Binaspect operates blindly on audio, requiring no prior knowledge of head models. These visualizations enable researchers and engineers to observe how binaural cues are degraded by codec and renderer design choices, among other downstream processes. We demonstrate the tool on bitrate ladders, ambisonic rendering, and VBAP source positioning, where degradations are clearly revealed. In addition to their diagnostic value, the proposed representations can be exported as structured features suitable for training machine learning models in quality prediction, spatial audio classification, and other binaural tasks. Binaspect is released under an open-source license with full reproducibility scripts at: (link removed for blind review)
Thursday May 28, 2026 1:30pm - 3:30pm CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
The recently finalized ISO international standard (IS) on MPEG-I immersive audio enables interactive six-degrees-of-freedom (6DoF) audio rendering for a multitude of virtual-reality and augmented-reality (VR/AR) acoustic scenarios and applications, with comprehensive modeling of room acoustics and intricate acoustic phenomena, including occlusion, reflection, transmission and diffraction caused by sound obstacles, the Doppler effect, and dynamic environment changes triggered by user interactivity. This paper describes the concept, methodology and results of the final verification test of this standard. In the verification test, the perceptual quality of the renderer was assessed in an interactive listening test using different indoor and outdoor acoustic scenes that exercise the above-mentioned features of the standard. More than 50 listeners, distributed across six labs, participated in the test, which used the ITU‑R BS.2132 [1] multi‑stimulus method on a 100‑point scale for three conditions (IS, mid and low anchor) in 10 VR scenes plus two repetitions. The results of several anchor processing configurations are presented. The selected mid and low anchors demonstrated stable quality across diverse scenes with progressive timbre and spatial degradations. The listening test results show a clear separation of the conditions (IS > mid > low); the low anchor was stable (around 16 points median value) while the mid anchor varied by scene (around 47 points). The IS is rated with a median of 84 points across all labs, which lies in the “excellent” region of the scale. The individual scenes are rated differently, and the quartile range for some scenes can reach 20 points. The median value for the IS varied between labs, with some being a bit more critical than others.
Sascha Disch received his Dipl.-Ing. degree in electrical engineering from the Technical University Hamburg-Harburg (TUHH) in 1999 and joined the Fraunhofer Institute for Integrated Circuits (IIS) the same year. Ever since he has been working in research and development of perceptual...
Thursday May 28, 2026 1:30pm - 3:30pm CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
This paper presents a comparative analysis of two immersive recording techniques for classical music: the PCMA-3D (Perspective Control Microphone Array) and the Decca Cuboid. While the Decca Cuboid relies primarily on time-of-arrival differences to generate spatial impressions, the PCMA-3D utilises intensity differences and separates ambience from direct sound. A recording session was conducted in a concert hall using a classical guitar soloist and two distinct folk music ensembles to capture performances simultaneously with both arrays. Subjective evaluation was performed using a MUSHRA listening test with 18 participants, assessing parameters such as sensation of space, localisation precision, and sound quality. Statistical analysis reveals that while both systems provide high-quality immersive experiences, the PCMA-3D scored significantly higher in the sensation of space (p
Thursday May 28, 2026 1:30pm - 3:30pm CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
Deep learning has significantly improved speech enhancement performance in controlled laboratory conditions, yet these advances rarely translate into robust real-world benefit for hearing aid users. Current algorithms are trained and evaluated in simplified acoustic scenarios, neglecting multimodal cues, user interaction, environmental dynamics, and the strict latency and power constraints of embedded devices. As a result, a persistent gap remains between algorithmic performance and everyday listening experience. This position paper reviews recent progress in speech enhancement, embedded Artificial Intelligence hardware, and hearing aid systems, and argues for a shift toward ecologically valid evaluation and hardware-aware design. We propose virtual reality as a reproducible, multisensory benchmarking platform enabling joint assessment of human perception and algorithmic processing. This perspective outlines a research roadmap toward adaptive, context-aware, and practically deployable hearing technologies.
Few studies exist on the perception and measurement of nonlinear distortion in headphones. This paper reports the detection thresholds and perceived sound quality of real distortion in headphones. Five different distortion measurements were made on the headphones to determine how well they predict audibility and quality. Music samples were binaurally recorded on six headphones at playback levels ranging from 85 to +110 dBA at 3 dB increments. The recordings were reproduced at a normal playback level (83 dBA) through a reference headphone with low distortion. The headphone recordings were post-processed to remove both level and frequency response differences so that only nonlinear distortions and residual noise remained. In a second test, listeners rated the similarity in quality of headphones relative to an undistorted reference and a hidden version of it. The results provide evidence that audible distortion in headphones with music occurs at significantly higher playback levels (104 to 112 dBA SPL) than what is considered typical and safe. The percentage of measured THD in the headphone had the highest correlation with the detection thresholds, while the non-coherent distortion with music best predicted the similarity ratings. We discuss the results and the practical implications they might have for future headphone design, testing and measurement.
This work presents a perceptual model based on a complex IIR filterbank. The filterbank has a frequency resolution of 4 bands per Bark and consists of 104 filters whose slopes are designed to take spectral masking effects into account. The filter outputs are post-processed to obtain masking thresholds. To obtain reasonable masking thresholds from the spreading outputs, a postmasking stage is required. Here, we propose a comodulation-dependent adaptation of the postmasking decay to model Comodulation Masking Release (CMR) effects. This approach explicitly considers the dip-listening effect known from the literature. The final masking thresholds are obtained by weighting the postmasking outputs with a tonality-dependent gain, controlled using spectral flatness estimation. A listening test compares the proposed method to an already known approach that uses direct CMR-based modification of the masking threshold gains.
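The spectral flatness estimate mentioned above can be illustrated with the standard flatness measure: the geometric mean of the power spectrum divided by its arithmetic mean. The sketch below is only that textbook measure. How the authors map flatness to a tonality-dependent gain is not stated in the abstract and is not modelled here; the function name and the `eps` guard are assumptions.

```python
import math


def spectral_flatness(power_bins, eps=1e-12):
    """Geometric mean over arithmetic mean of a power spectrum.

    Returns a value near 1 for noise-like (flat) spectra and near 0 for
    tonal (peaky) spectra. eps guards against log(0) on empty bins.
    """
    if not power_bins:
        raise ValueError("need at least one spectral bin")
    n = len(power_bins)
    log_mean = sum(math.log(p + eps) for p in power_bins) / n
    arithmetic_mean = sum(power_bins) / n + eps
    return math.exp(log_mean) / arithmetic_mean


# Hypothetical spectra: a flat one and one dominated by a single bin.
flat = spectral_flatness([1.0, 1.0, 1.0, 1.0])
tonal = spectral_flatness([1e-6, 1e-6, 1e-6, 1.0])
print(flat > tonal)
```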
Florian details the design of his brilliant and durable Double-Ufix 3D mic array, capable of high resolution outdoor recording. Attendees are treated to memorable listening examples from natural and rural environments in Austria and the Nordics.
EMORSION is an exploratory study examining how film audio design shapes audience emotion and immersion. It was conducted using scenes from four films in the horror (2) and drama (2) genres, with two mainstream and two independent productions. For each scene, multiple alternative audio mixes were created by systematically manipulating three core aspects of audio design: frequency (pitch), dynamics (loudness), and directionality (spatial placement). Three audience groups were exposed to the scenes in a cinema setting, with each group experiencing either one manipulated audio mix or a control mix. Audience responses were assessed through a multimodal framework combining self-reported emotion and immersion via a questionnaire, and physiological measures, including heart rate monitoring and video-based motion tracking. Results show that subtle changes in audio design significantly affect emotional perception and immersion. Unconventional mixes produced greater variability in interpretation, while conventional immersive mixes led to stronger agreement across audiences. Notably, participants often reported perceived visual changes despite no alterations to the visual content.
Josh Reiss is Professor of Audio Engineering with the Centre for Digital Music at Queen Mary University of London. He has published more than 200 scientific papers (including over 50 in premier journals and 6 best paper awards) and co-authored two books. His research has been featured...
I'm Nelly Garcia. I'm an engineer in communications and electronics with a specialty in acoustics. I am now a PhD Researcher at the Centre for Digital Music (C4DM) at Queen Mary University of London. My main interest is sound design, ways to create sounds from scratch, optimize the workflow of a sound designer and innovative ways to label, categorise or access samples...
Identifying robust headphone target curves is challenging when preference data from untrained listeners are interpreted without explicit perceptual structure. This work presents a methodological framework in which deep-learning-driven sensory-profile analysis serves as the primary interpretive layer for listening data. Candidate target curves are generated using an Interactive Differential Evolution (IDE) listening experiment that combines paired comparisons with a second-stage absolute-rating task, enabling continuous exploration of the perceptually relevant tuning space while reducing cognitive load. Converged gain sets are analyzed using a Virtual Listener Panel (VLP), a Deep Learning (DL) model trained on large-scale expert evaluations to predict perceptual attributes from rendered musical material. Predicted attributes are reported as relative scores along key sensory dimensions, including bass strength, timbral balance, and brilliance, enabling exploration of sensory clusters, perceptual trade-offs, and potential families of target tunings. Adaptive listening data from three culturally distinct listener panels (Denmark, Japan, and Colombia; 20 participants per site) support the DL-based interpretation. Convergence is quantified as a reduction in population variance, and cross-site analyses assess the similarity of clustering structures and the consistency of relationships between preference and sensory attributes. Overall, the framework provides a scalable, perceptually grounded approach to interpreting listener-preference data when developing headphone target curves.
Perceptual Audio Evaluation Specialist, FORCE Technology
▪ Acoustics, psychoacoustics, product development, and digital communication as an Audio Engineer in the consumer electronics industry. ▪ Currently employed as a specialist at FORCE Technology's SenseLab department, contributing to enhancing sound quality in a wide range of consumer electronics products, collaborating with audio companies from across the globe...
Sa quintina is a distinctive emergent vocal phenomenon almost exclusively associated with the sacred polyphonic singing tradition of Castelsardo, perceived as an autonomous “fifth voice” arising during collective performance by four male singers. Although widely acknowledged in ethnomusicological literature, its formation mechanisms remain only partially explored within audio engineering and acoustical research. This paper presents an early-stage, descriptive sonological case study proposing new hypotheses on the formation and spatial reinforcement of sa quintina. The phenomenon is interpreted as a physically grounded, measurable outcome of harmonic fusion and spatial interference, observable through spectral energy distribution and coherence. It is hypothesized to emerge from a converging set of conditions (including non-tempered harmonic textures, differentiated vocal emission techniques, intentional formant tuning, and circular spatial configuration), none of which is assumed to be strictly sufficient in isolation. Building upon previous spectral coherence analyses, the study introduces a Quintina Directionality Index (QDI) to quantify the spatial dimension of the phenomenon. QDI is defined as the ratio between the spectral energy in two frequency bands associated with sa quintina (600–750 Hz and 1200–1400 Hz) and the total spectral energy. The index is evaluated as a function of direction using ambisonic recordings in an anechoic chamber and as a function of microphone position in a controlled field setting. Preliminary observations suggest that sa quintina corresponds to localized regions of enhanced spectral coherence and energy reinforcement, supporting its interpretation as an emergent physical phenomenon that precedes and enables its perceptual salience, rather than a purely auditory illusion.
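The band-energy ratio behind the QDI can be sketched in a few lines. This is an illustration of the definition as stated in the abstract (energy in 600–750 Hz and 1200–1400 Hz over total energy), not the authors' implementation. The plain DFT, the one-sided spectrum, the inclusive band edges and the function names are all assumptions, and the directional evaluation from ambisonic recordings is not covered.

```python
import cmath
import math


def power_spectrum(samples, fs):
    """One-sided power spectrum via a plain DFT (fine for short frames)."""
    n = len(samples)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        freqs.append(k * fs / n)
        power.append(abs(acc) ** 2)
    return freqs, power


def band_energy_ratio(samples, fs,
                      bands=((600.0, 750.0), (1200.0, 1400.0))):
    """Energy inside the given bands divided by total spectral energy."""
    freqs, power = power_spectrum(samples, fs)
    total = sum(power)
    if total == 0:
        return 0.0
    in_band = sum(p for f, p in zip(freqs, power)
                  if any(lo <= f <= hi for lo, hi in bands))
    return in_band / total


# Hypothetical test tones: 700 Hz lies inside the first band, 1000 Hz outside both.
fs, n = 8000, 400
tone_in = [math.sin(2 * math.pi * 700 * i / fs) for i in range(n)]
tone_out = [math.sin(2 * math.pi * 1000 * i / fs) for i in range(n)]
print(band_energy_ratio(tone_in, fs) > band_energy_ratio(tone_out, fs))
```

Evaluating such a ratio per direction or per microphone position would then give the spatial profile the abstract describes.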
Jim and Ulrike have been recording in and for immersive audio for broadcast, film and audiophile staples for decades. They specialize in turning traditional acoustic New York studio recordings into vast spatial experiences. Attendees will experience the breathtaking virtuosity of the likes of Jane Ira Bloom, the Secret Trio, Donald Vega and large-format ensembles under Franco Ambrosetti and Jim Pugh.
Jim has been the President of the AES Educational Foundation since 2020 and is a professor of recorded music with the Clive Davis Institute of Recorded Music in the Tisch School of the Arts at New York University. Jim was the Institute’s Chair from 2004 – 2008. A graduate of the...
Live music environments can be simulated and evaluated through spatial audio and augmented reality (AR) technology. However, conducting perceptual studies on AR environments can be challenging, as multiple design considerations and uncontrolled variables come into play. Hence, we developed Naviqual, a tool to create a spatial audio quality map for a virtual live music environment. We generated objective quality contour and polar maps to predict the quality of experience (QoE) across listener locations and directions respectively. We found that these maps strongly aligned with perceptual evaluations by normal-hearing listeners through listening tests. We also found that binaural objective metrics and signal-to-noise ratio both strongly predict QoE across listener translations, with the former outperforming the latter in predicting QoE across listener directions. Overall, Naviqual provides a QoE map for virtual live music environments that is robust across various listener locations and directions, noise locations, music content, and room acoustics.
Audio engineering often implicitly assumes a uniformity in hearing across listeners; this is an assumption that does not reflect real-world diversity. How could technologies and practices in production, mixing, and reproduction be adapted to create music that is more inclusive? While the AES has a conference series on Audio and Music Induced Hearing Disorders, this has focused on the causes of hearing loss with little on audio engineering for listeners who have a hearing loss.
In western countries, about one in three adults are deaf, have hearing loss or suffer from tinnitus. Hearing loss can lead to many challenges with music, such as inaudibility of quieter passages, distortion, degraded pitch perception, and difficulty in identifying and picking out lyrics and instruments. The most common intervention for mild to moderately severe hearing loss is hearing aids. But while many of these devices have music programs, their efficacy is mixed, to the point that many opt not to use them. With the rise of machine learning within Audio Engineering, there are opportunities to better personalise music, and therefore address issues listeners face. Consumer devices are also increasingly having audio accessibility features added, but the usefulness of these features lacks independent testing. This workshop will consider opportunities for making music more accessible.
The workshop will start by exploring how hearing loss harms the experience of listening to music and how this varies between people. This will lead to a discussion of why no technology can fully ‘correct’ music to achieve a ‘perfect’ listening experience for those with hearing loss. There is no technology to recreate a ‘golden-ears’ experience. This leads to a key research question: what is the best rendition of a piece of music for someone who has hearing loss? What do listeners want from music, and how can we get closest to achieving that?
We will bring in findings from research projects and listening tests to explore what is known, and also to highlight that there are significant gaps in knowledge that require further research. We will then explore state-of-the-art in wearables such as hearing aids and sound reproduction systems. This will include the current Cadenza project, which has been running a series of machine learning challenges to improve music for those with hearing loss.
Throughout, we will encourage questions and engagement from delegates. We want to hear about lived experience of hearing difference and how that has changed professional practice and personal lives. We are also keen to hear suggestions from delegates on what approaches might be used to improve music for those with hearing loss.
We aim to raise awareness of the importance of considering diverse audiences in Audio Engineering practice. Where possible, the workshop will provide practical guidance for audio engineers, highlighting techniques and emerging technologies that can better support listeners with diverse hearing profiles.
The workshop will be organised by the Cadenza Project Team (https://cadenzachallenge.org/), a large UK-funded project about improving music for those with hearing loss.
Josh Reiss is Professor of Audio Engineering with the Centre for Digital Music at Queen Mary University of London. He has published more than 200 scientific papers (including over 50 in premier journals and 6 best paper awards) and co-authored two books. His research has been featured...
The phenomenon in which listeners’ impressions of music are unintentionally altered even when the same sound source is played back remains an important issue. Previous research has shown that the state and combination of audio equipment affect the characteristics of nonlinear distortion in music playback. Hence, we conducted a subjective evaluation of auditory and musical impressions using sound sources with various nonlinear distortions. However, the subjective evaluation was unstable and difficult to assess. The reason was that the sound change was perceived emotionally, as a slight change in sound image and musicality, and that the interpretation of evaluation terms varies widely among subjects due to the difficulty of verbalizing the impression. Therefore, we evaluated the change in listeners’ stress caused by nonlinear distortion in music playback using photoplethysmography (PPG). In this study, we conducted a follow-up experiment with improved accuracy. In the experiment, 41 subjects listened to sound sources with even-order harmonic distortion at 2.69% THD, odd-order harmonic distortion at 2.69% THD, and no distortion. The musical piece used for the sound sources is an original composition, chosen to eliminate familiarity and bias toward existing music. We evaluated changes in subjects’ stress states using the mean pulse-pulse interval (PPI) and the root mean square of successive differences (RMSSD), computed from the PPG signal, as indicators of stress. These results reconfirm that nonlinear distortion in music playback affects listeners’ vital responses, as evidenced by significant differences in both mean PPI and RMSSD, as assessed by Cochran's Q test at the 5% significance level.
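The two indicators named above follow directly from a series of pulse-to-pulse intervals. The sketch below assumes the intervals have already been extracted from the PPG signal (peak detection is not shown) and are given in milliseconds; the function names and sample values are illustrative.

```python
import math


def mean_ppi(intervals_ms):
    """Mean pulse-pulse interval in milliseconds."""
    if not intervals_ms:
        raise ValueError("need at least one interval")
    return sum(intervals_ms) / len(intervals_ms)


def rmssd(intervals_ms):
    """Root mean square of successive differences between intervals."""
    if len(intervals_ms) < 2:
        raise ValueError("need at least two intervals")
    diffs = [b - a for a, b in zip(intervals_ms, intervals_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical interval series in ms (not data from the study).
ppi = [800, 810, 790, 800]
print(mean_ppi(ppi), round(rmssd(ppi), 2))
```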
Stefan reports from the front lines of recording, mixing, and live streaming immersive music, highlighting the technical and creative challenges of delivering three-dimensional sound in real time. He shares practical insights into spatial mixing, format compatibility, and the realities of reliable immersive streaming across diverse playback environments.
Stefan Bock, born on 20 August 1964 in southern Germany, started his career as an audio engineer in 1987. After freelancing in different facilities in Munich, he co-founded msm-studios in 1991, where he was the Chief Mastering Engineer and General Manager.
This paper presents Part 2 of our study on personalized timbre optimization for stereophonic sound reproduction via earphones, following our previous work presented at the AES International Conference on Headphone Technology in 2025. While Part 1 established a novel auditory-model-based framework for reproducing a listener’s natural timbre reference and demonstrated its perceptual validity under controlled conditions, the present study focuses on the practical implementation and validation of this approach for real-world use with consumer True Wireless Stereo (TWS) earphones.
Conventional headphone and earphone personalization techniques primarily target spatial audio reproduction or rely on preference-based equalization, often overlooking the accurate reproduction of natural timbre in stereophonic content. Our approach explicitly addresses this limitation by isolating and optimizing perceptually relevant timbral cues while excluding spatial encoding components, thereby improving timbral fidelity without degrading stereo imaging.
The proposed method originally consists of four stages: high-resolution anatomical scanning of the listener’s upper body, including the pinnae; individualized HRTF computation using the boundary element method; selective removal of spatial encoding components to derive a personalized reference target response curve (PR-TRC); and perceptual optimization using a listener-specific weighting coefficient grounded in auditory reference fidelity rather than preference. In this paper, each stage is simplified and automated using smartphone-based scanning and AI-assisted processing, enabling end users to complete the entire personalization process via a smartphone connected to a cloud-based server. The resulting personalized target response curve is implemented within the computational and memory constraints of the DSP pipeline of commercial consumer TWS earphones.
A subjective evaluation using the Semantic Differential Method was conducted to assess the perceptual impact of the simplified implementation. Twenty-four listeners evaluated personalized target curves generated by both the original and simplified methods, as well as two non-personalized target curves commonly used in commercial TWS earphones. The results show that both personalized methods consistently outperform non-personalized conditions in overall sound quality and listener preference. Importantly, no statistically significant degradation in perceived timbral naturalness was observed between the simplified and original methods.
These findings demonstrate that auditory-model-based personalized timbre optimization can be effectively translated into a practical, consumer-ready technology. The proposed approach represents a foundational contribution to future audio personalization and has broad applicability across headphone and earphone systems for stereophonic sound reproduction.
Kimio Hamasaki, an AES Fellow, is a producer and balance engineer for music recordings, a researcher in spatial audio, an educator in audio engineering and acoustics, and a consultant in audio engineering. He has recorded and produced numerous orchestral and operatic works with the Vienna Philharmonic...
Audio engineering standards often present as objective, yet they frequently rely on a systemic data bias which Perez characterises as the 'default male bias' [1]. This paper examines the hegemony of the male ear, a system of norms that privileges masculine modes of hearing by prioritizing technical structure and text over affective experience and timbre [2]. By transitioning from a visual-centric auditory gaze toward an embodied sonic gnosis, researchers can recover haptic and physiological ways of knowing sound. Drawing on the feminist listening praxis of the Female Ear [3], this work explores the recording studio as an analytical space where sonic microaggressions [4] enforce rigid technical standards. The author argues for a new audio praxis that centers ear pleasures [5], validating subjective and affective sensory data as legitimate engineering input. This approach seeks to dismantle the regulatory fiction [6] of a universal hearing standard, promoting a pluralistic understanding of musicking [7] that is inclusive of non-normative perspectives.
Loudspeaker monitoring is the reference when audio professionals evaluate content. Headphones are also important quality-checking tools; and many consumers enjoy music using “close-fitting listening devices”, as all different flavours of headphones are known in recent standards writing.
We discuss the two reproduction methods from perceptual, recording and mastering perspectives; especially differences in timbre, imaging and auditory envelopment when listening to stereo. Applications of headphones in recording, when setting up and trimming stereo or 3D microphone arrays, are also practically detailed.
In the last part of the workshop, attendees are invited to personally compare the two domains on the qualities and applications discussed; with guided listening to audio examples between a pair of precision nearfield monitors, Genelec 8351B, and a pair of excellent headphones, Audeze CRBN2.
Stefan Bock, born 20.08.1964 in southern Germany, started his career in 1987 as an audio engineer. After freelancing in different facilities in Munich, he co-founded msm-studios in 1991, where he was the Chief Mastering Engineer and General Manager.
Recording Producer and Balance Engineer with 50 GRAMMY-nominations, 42 of these in craft categories Best Engineered Album, Best Surround Sound Album, Best Immersive Audio Album and Producer of the Year. Founder and CEO of the record label 2L. Grammy Award-winner 2020 and 2026. Immersive...
Current deep learning approaches to speech enhancement rely heavily on objective measures like mean squared error or scale-invariant signal-to-distortion ratio as both training objectives and evaluation metrics. While analytically convenient, these benchmarks often fail to capture the nuances of human perception or actual intelligibility. Furthermore, the inconsistent integration of metrics like Short-Term Objective Intelligibility or Perceptual Evaluation of Speech Quality into training and evaluation pipelines leaves a gap between algorithmic performance and perceptual reality. This paper proposes a transition towards evaluation methodologies grounded in psychoacoustics and audiological modeling. Our study explores two distinct methods to characterise enhanced signals. On one hand, we employ a perceptual approach based on the Cambridge loudness model to assess the preservation of spectral excitation patterns and perceived intensity. On the other hand, we adopt a biophysical approach by utilising CoNNear, a convolutional model of the human auditory periphery. This allows us to simulate representations of responses at different stages of the auditory periphery to observe how speech enhancement processing affects the physiological representation of speech. We analyse pre-trained speech enhancement models using automatic speech recognition and Short-Term Objective Intelligibility as an additional proxy for human intelligibility. By mapping automatic speech recognition performance against loudness and peripheral response patterns, we investigate the extent to which current enhancement strategies maintain the perceptual and physiological integrity of the speech signal. This work aims to identify features predictive of intelligibility, providing a foundation for speech enhancement systems optimised for the human listener rather than purely signal-based objective functions.
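For readers unfamiliar with the scale-invariant signal-to-distortion ratio mentioned in the abstract above, the following is a minimal sketch of its usual textbook definition, not the authors' implementation. The function name `si_sdr` and the toy signals are illustrative assumptions, and both signals are assumed to be zero-mean and of equal length.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB (textbook form; inputs assumed zero-mean)."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    # Project the estimate onto the reference to find the optimal scaling.
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy
    target = [scale * r for r in reference]
    # Everything in the estimate that is not the scaled reference is distortion.
    distortion = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    distortion_energy = sum(d * d for d in distortion)
    if distortion_energy == 0.0:
        return math.inf
    return 10.0 * math.log10(target_energy / distortion_energy)


# Hypothetical toy signals: rescaling the estimate leaves the score unchanged.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 1.0, -1.0, 1.0]
score = si_sdr(estimate, reference)
score_scaled = si_sdr([3.0 * e for e in estimate], reference)
```

Because the measure depends only on the energy ratio after optimal rescaling, it says nothing about loudness or intelligibility as such, which is the gap the abstract points to.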
Objective quality evaluation is widely used in speech coding, yet objective estimates often show limited agreement with subjective listening-test results. Rather than focusing on absolute score accuracy, this paper evaluates objective speech quality models from a decision-making perspective, defined as their ability to support comparative judgments between speech codecs or codec configurations. A formal ITU-T P.800 Absolute Category Rating (ACR) listening test was conducted with 30 listeners across 24 conditions, covering conventional and neural monophonic speech codecs operating under clear-channel conditions at sampling frequencies from 16 to 48 kHz and bit rates ranging from below 1 kbps to above 16 kbps. The speech material consisted of internally recorded, clean French-language speech that was not used in the development or training of any of the evaluated codecs or objective quality models. Seven objective quality models, namely PESQ, VISQOL Speech, VISQOL Audio, WARP-Q, NISQA, UTMOS, and DistillMOS, were evaluated on the same material. Decision-making performance was assessed by comparing subjective and objective rankings using Kendall’s rank correlation coefficient and by analyzing pairwise codec comparisons using t-tests at a 95% confidence level. The results show that some objective quality models are effective for comparing bit rate variations within a given speech coding technology, provided that all other codec parameters (e.g., sampling frequency) remain unchanged. However, all models exhibit limitations, including tendencies toward over- or underestimation for certain technologies, as well as reduced reliability when applied across different sampling frequencies. Despite its conventional origins, PESQ remains capable of supporting decision-making even when applied to neural speech codecs.
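The ranking comparison described above can be illustrated with a minimal sketch of Kendall's rank correlation (the tau-a variant, which assumes no tied scores). The function name and the four condition scores are hypothetical and are not data from the paper.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both measures order the pair the same way
        elif sign < 0:
            discordant += 1   # the measures disagree on the order
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four codec conditions (illustrative only).
subjective_mos = [4.2, 3.1, 2.5, 1.8]
objective_score = [3.9, 3.3, 2.0, 2.2]
tau = kendall_tau_a(subjective_mos, objective_score)
```

In this toy case five of the six condition pairs are ordered identically by both measures and one is reversed, giving tau = 2/3. A model can therefore rank codecs usefully even when its absolute scores are offset from the listening-test means.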
Music source separation (MSS) systems are commonly used in production, remixing, and audio analysis work, yet questions remain about the extent to which objective evaluations of model performance align with human perceptual evaluations, particularly for non-traditional source material (in this case, heavily processed electronic music). This study sets out a framework for evaluating three machine learning approaches to MSS: a spectrogram-domain model (Spleeter), a waveform-domain model (Demucs v2), and a hybrid-domain model (HTDemucs). Subjective evaluations of model performance were collected via a MUSHRA-style listening test, while objective evaluations used signal-to-distortion ratio (SDR) and Frechet Audio Distance (FAD). Results showed consistent agreement across objective metrics, with the hybrid-domain model outperforming the two single-domain models. Perceptual ratings also favored the hybrid model, with listeners occasionally rating the model output as equal to or better than the original reference. Preliminary analysis indicates moderate but statistically non-significant correlations between the two assessment paths, reinforcing concerns about relying solely on numerical evaluations when discussing MSS model performance. Implications for model design and future evaluation procedures are discussed.
Recent advances in large-scale multichannel loudspeaker systems have enabled immersive concert formats that extend spatial control beyond conventional stereo and small multichannel configurations. High-density loudspeaker arrays (HDLAs) allow sound to be distributed across complex architectural spaces, challenging established distinctions between composition, performance, and live sound practice. In live contexts, however, the realization of spatial attributes is often constrained by system complexity, limited rehearsal time, and the lack of artist-facing spatial control interfaces. As a result, spatial realization and sound diffusion are frequently delegated to sound engineers, who translate artistic material to the acoustic and architectural conditions of the venue in real time.
This paper examines three immersive concerts presented during Sonic Days 2025 in Denmark, realized on both large-scale and small-scale multichannel loudspeaker systems. The concerts represent contrasting production contexts, including a site-specific spatial composition conceived explicitly for a high-density loudspeaker array and performances by artists whose practices are typically oriented toward stereo or small multichannel formats. Across these cases, spatialization functioned variously as compositional material, interpretive layer, and adaptive live-mixing practice.
The paper analyzes how control over spatial attributes is negotiated between artists and sound engineers in live immersive concert settings, and how this negotiation affects the interpretation of artistic intent and audience experience. Particular attention is given to the role of sound engineers as active mediators whose decisions shape spatial form, listening perspective, and the relationship between sound and architecture. The findings suggest that immersive concert formats redistribute creative agency across artists, technicians, and technological infrastructures, and point toward the need for revised conceptual frameworks for authorship, performance, and listening in large-scale spatial audio environments.
This presentation develops a conceptual framework for understanding how visitors cognize sound in museum exhibitions. While sound increasingly features in museum practice, research has focused primarily on measuring visitor enjoyment and engagement rather than examining the specific meanings sound generates. This gap reflects the absence of a framework conceptualizing sound's meaning-making capacities to guide empirical investigation. Drawing on scholarship from music studies, semiotics, phenomenology, and embodied cognition, I propose a seven-component spectrum identifying distinct yet interrelated meanings that sound can convey in museums: aesthetic, representational, emotional, sensorial, imaginative, social, and political. These meanings can be apprehended independently or in combination, typically through emergent, pre-conscious perception rather than deliberate awareness. The spectrum builds on the premise that museum sound meaning-making unfolds through dynamics internalized from early childhood as we attune to the world sonically. It draws on the notion of sound as a "sonic aggregate" (Grimshaw and Garner 2015), encompassing social, contextual, temporal, and embodied experiences, rather than reducing sound to wave phenomena. Visitors actively co-produce meanings by drawing on their moods, memories, knowledge, and imagination during exhibition encounters. Each meaning category is illustrated with exhibition case studies, demonstrating the spectrum's applicability across diverse sound-based multimodal museum practices, from popular music exhibitions to sound art installations. The spectrum aims to catalyze research through varied methodological approaches and establish analytical standards for studying sound in museums, with potential adoption by international standardization bodies.
Sound Studies Researcher, INET-md | NOVA University Lisbon
A PhD in ethnomusicology and museum studies and a curator, I am committed to exploring the diverse meaning-making capabilities of sound when exhibited in museums, encompassing the representational, emotional, sensorial, and social, as well as its ability to foster imagination and...
Friday May 29, 2026 10:30am - 11:00am CEST Aud 43, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
Prof. Li Dakang is a preeminent recording engineer and pioneer of 3D recording of Chinese traditional music, ancient instruments and spaces. Attendees are treated to a selection of unique 3D recordings, including a new and glorious version of China’s National Anthem. Prof. Li describes the LDK-Cube for capturing the envelopment of an acoustic space, and questions reliable reproduction of this important quality.
This masterclass series, featuring remarkable recording artists, is a chance to hear 3D audio at its best; as we discuss qualities that make it truly worth the effort.
In each masterclass, we explore the new spatial possibilities in recording and production, detailing also this specific listening room, regarding ITU-R BS.1116 compliance and auditory envelopment (AEV) transparency. Seats are limited to keep playback variation at bay.
This paper presents the perceptual evaluation of the Open Binaural Renderer (OBR), an open-source library developed for headphone-based rendering of Immersive Audio Model and Formats (IAMF) content. The evaluation followed an iterative framework in which findings from a pilot listening study informed the tuning of rendering profiles, and the resulting renderer was benchmarked against established proprietary solutions. In the pilot study, 19 expert listeners rated the Overall Listening Experience (OLE) of the initial prototype (OBRv1) and five external renderers across diverse audio content. Qualitative feedback was analysed using inductive coding to identify salient perceptual dimensions. The pilot revealed content-dependent performance and showed that a single default profile was inadequate, yielding mixed responses in both the numerical scale and in the qualitative feedback and motivating the development of multiple rendering profiles in OBRv2. The main study evaluated two OBRv2 profiles targeting different reverberation characteristics (Direct and Ambient) alongside three top-performing external renderers. A total of 39 participants, divided into expert and non-expert groups, rated five perceptual attributes: Voice Quality, Envelopment, Externalisation, Overall Listening Experience, and Timbral Balance. Mixed-design ANOVA revealed significant main effects of renderer condition on all attributes. Pairwise comparisons showed that the OBRv2 Ambient profile achieved significantly higher OLE ratings than one proprietary renderer and reached statistical parity with the remaining two, representing a measurable improvement over the prototype. A trade-off between Voice Quality and Externalisation was observed, driven by the level of reverberation in each renderer. The results demonstrate that iterative, perceptually informed tuning can yield competitive binaural rendering quality in an open-source framework.
Professor of Audio Engineering, University of York
Gavin Kearney graduated from Dublin Institute of Technology in 2002 with an Honors degree in Electronic Engineering and has since obtained MSc and PhD degrees in Audio Signal Processing from Trinity College Dublin. He joined the University of York as Lecturer in Sound Design in January...
With 25+ years of media industry product development, Jani Huoponen is a seasoned expert in developing cutting-edge audio and video technologies for consumer devices and streaming systems. Joining Google in 2010, he’s served as a product manager across key multimedia initiatives...
Despite the growing number of hearing-impaired workers wearing hearing aids in occupational settings, understanding speech in multi-talker situations remains challenging. This difficulty is particularly pronounced in open-plan offices, where simultaneous talkers and room reverberation are prone to degrade speech intelligibility. While spatial cues are essential for segregating target speech from competing sources, hearing-aid signal processing may alter binaural information that supports spatial hearing. Accurate evaluation of hearing-aid performance is therefore crucial. Objective speech intelligibility metrics offer an efficient alternative to time-consuming listening tests; however, their validity in complex spatial scenarios involving hearing-impaired listeners remains unclear. Monaural metrics such as HASPI account for individual hearing loss but neglect spatial information, whereas binaural metrics such as MBSTOI incorporate spatial cues but are primarily designed for normal-hearing listeners. This study evaluates the ability of existing objective metrics to predict speech intelligibility for hearing-aid users in multi-talker spatial environments. Listening tests are conducted on 20 hearing-impaired participants fitted with binaural hearing aids. Four types of multi-talker auditory scenes representative of open-plan offices are reproduced using a loudspeaker array. They involve a target speech, combined with diffuse noise and a localized competing speech source. Objective measurements are performed using an acoustic mannequin fitted with the participants’ hearing aids. HASPI and MBSTOI values are computed from the binaural signals recorded at the eardrums and incorporating individual hearing losses. Objective predictions are compared with subjective intelligibility scores, and an ablation analysis is conducted to distinguish the effects of hearing loss modeling from those of binaural processing.
Morten describes his excellent recording techniques, and attendees are treated to a unique selection of high resolution 3D music listening examples.
This masterclass series, featuring remarkable recording artists, is a chance to hear 3D audio at its best; as we discuss qualities that make it truly worth the effort.
In each masterclass, we explore the new spatial possibilities in recording and production, detailing also this specific listening room, regarding ITU-R BS.1116 compliance and auditory envelopment (AEV) transparency.
Recording Producer and Balance Engineer with 50 GRAMMY-nominations, 42 of these in craft categories Best Engineered Album, Best Surround Sound Album, Best Immersive Audio Album and Producer of the Year. Founder and CEO of the record label 2L. Grammy Award-winner 2020 and 2026. Immersive...
Friday May 29, 2026 12:30pm - 1:30pm CEST Aud 31, Technical University of Denmark, Asmussens Alle, Building 306, DK-2800 Kgs. Lyngby, Denmark
Speech intelligibility is a key factor in successful communication across various domains, including research, post-production for film and television, live sound reinforcement, and audio production. Traditional assessment methods often lack objectivity or fail to capture the listener’s experience in real-world scenarios. In this workshop, we introduce an innovative approach to measuring speech intelligibility based on the concept of “Listening Effort.” We will present the underlying technology, share practical examples from different application areas, and demonstrate how this method can be integrated into workflows to optimize intelligibility. Attendees will have the opportunity to participate in a hands-on demonstration and discuss potential use cases relevant to their own work. This session is designed for professionals and researchers seeking reliable and actionable tools for evaluating and improving speech intelligibility in diverse environments. In this workshop, we present a new technology for measuring speech intelligibility (“Listening Effort”). The method is used in research, post-production (film/TV), live sound, and audio production. The session is aimed at professionals from both academia and industry who are interested in objectively assessing and optimizing speech intelligibility.
Participants will be able to join a short demo/exercise and ask questions.
Introduction & Relevance: Overview of the importance of speech intelligibility across different fields
Technology & Methodology: Presentation of the measurement method and underlying concepts
Practical Examples: Case studies from research, post-production (film/TV), live sound, and production
Live Demo / Interactive Exercise: Practical demonstration and opportunity for active participation
Discussion & Outlook: Q&A, exchange of ideas, and future perspectives
Situational awareness is a multisensory ability that enables individuals to perceive and appropriately take into account their immediate environment. This perception of the world through our senses is carried out continuously and unconsciously throughout the day. When auditory perception is degraded, an individual may no longer correctly perceive a doorbell, a water leak, or an alarm signal, which negatively affects quality of life and may lead to dangerous situations. Auditory perception can in particular be degraded by hearing loss, a common and widespread condition. The most common treatment consists of wearing hearing aids, which are mainly designed to improve speech intelligibility, especially in noisy environments. Feedback from hearing-impaired people and hearing-aid users indicates that, although auditory situational awareness has been recognised as an essential component of well-being, it remains insufficiently studied and requires further investigation. There is currently no standard method for assessing the extent to which one's situational awareness is affected by hearing impairment and the use of hearing aids. This is a complex process that requires assessing the perception of relevant sound events within a continuous stream of multisensorial information, by individuals who have different subjective preferences. Most existing methods are limited to evaluating only a subset of the problem, such as identification and localisation of non-speech sound events. The rise of new technologies, such as virtual reality, enables the development of assessment methods within more realistic yet controlled environments. This study aims to review existing methods in order to highlight their limitations in addressing the issue at hand.
The purpose of this study was to evaluate a sonification system that maps live heart rate data to real-time spectral filtering of a runner's preferred music. Assessed using a within-subjects design (n = 13), the system employs high-pass and low-pass filters to indicate deviations from target heart rate zones, providing instantaneous biofeedback without requiring visual attention. Quantitative analysis revealed no statistically significant differences in target zone accuracy or response time between auditory, visual, and combined conditions. However, qualitative thematic analysis identified a clear division in user preference. Participants favouring the auditory condition demonstrated faster mean response times to audio biofeedback. Findings suggest that while sonification promotes environmental focus and "gamifies" training, its efficacy is highly dependent on individual processing styles and music familiarity.
This paper proposes a psychoacoustic-based audio-visual mapping framework for intelligent vehicle cabins to enhance immersion and stabilize spatial auditory perception. By establishing mappings between auditory descriptors (such as Direction of Arrival (DOA), spectral centroid, and temporal envelope) and ambient lighting parameters, the framework leverages "ambient vision" to augment the perceptual experience without increasing the driver's cognitive load. Theoretical analysis based on Stevens’ Power Law indicates that the proposed mapping strategies effectively synchronize audio-visual intensities and mitigate perceptual fatigue, providing a conceptual reference for future multisensory HMI design.
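As background to the abstract above, Stevens' Power Law relates perceived magnitude to physical intensity through a modality-specific exponent. The short derivation below is a generic sketch of how matching perceived magnitudes across two modalities yields a mapping from sound intensity to light intensity. The symbols (subscript A for audio, L for light, constants k and exponents a, b) are assumptions introduced here for illustration and are not taken from the paper.

```latex
% Stevens' Power Law for each modality (k and the exponents are modality-specific)
\psi_A = k_A \, I_A^{\,a}, \qquad \psi_L = k_L \, I_L^{\,b}

% Requiring equal perceived magnitude, \psi_A = \psi_L, and solving for I_L:
k_A \, I_A^{\,a} = k_L \, I_L^{\,b}
\quad\Longrightarrow\quad
I_L = \left( \frac{k_A}{k_L} \right)^{1/b} I_A^{\,a/b}
```

On log-log axes this mapping is a straight line with slope a/b, so a fixed change in sound level in decibels corresponds to a proportional change in light level.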
Headphone listening has become an integral part of everyday life, spanning music consumption, communication, online media, and increasingly, computer gaming. These diverse listening contexts make individual sound exposure highly variable and difficult to quantify. While music listening and occupational headphone use have been widely studied, sound exposure from gaming remains comparatively undocumented. This study investigated the relationship between self-reported exposure through headphones and cochlear function assessed using transient evoked otoacoustic emissions (TEOAE). Forty-one university students completed a detailed questionnaire on listening habits, and TEOAEs were recorded in both ears across five half-octave frequency bands. Estimated weekly exposure levels were derived from participants’ reported durations and contexts of use. TEOAE amplitude, signal-to-noise ratio (SNR), and reproducibility showed clear frequency-dependent patterns and small ear asymmetries, consistent with typical OAE behaviour. Only limited associations were found between self-reported exposure and TEOAE measures, with significant effects emerging primarily for SNR and reproducibility in the highest-exposure group. No consistent differences were observed between long-term gamers and non-gamers. These findings suggest that self-reported exposure alone may be insufficient to detect subtle cochlear changes in young adults, and underscore the need for more precise exposure-monitoring methods when evaluating recreational sound exposure risks.
Binaural rendering is typically assessed via timbre and localization accuracy, while its intrinsic spatial resolution remains rarely quantified. This paper proposes a perceptual evaluation method based on Minimum Audible Angle (MAA) measurements to estimate the azimuthal just-noticeable difference (JND) introduced by binaural rendering algorithms. We systematically compared several rendering algorithms across eight reference azimuths using two participant-allocation paradigms. The results show that spatial resolution is significantly influenced by Ambisonic order and the choice of rendering algorithm, with MAA thresholds systematically decreasing as the truncation order increases. Furthermore, the proposed method successfully captures physiological spatial characteristics and identifies resolution limits imposed by reference angles. While both participant-allocation paradigms yield consistent qualitative trends, the repeated-measures design provides superior data stability. These findings demonstrate that the proposed MAA-based method is an effective tool for quantifying the spatial resolution of binaural rendering algorithms.
Richard is a multiple Grammy Award–winning recording engineer and a specialist in acoustic music recording. His work is focused primarily on classical, jazz, and film score music. A selection of his immersive recordings will be presented, accompanied by a discussion of the microphone configurations and mixing decisions employed in each example.
This masterclass series, featuring remarkable recording artists, is a chance to hear 3D audio at its best; as we discuss qualities that make it truly worth the effort.
In each masterclass, we explore the new spatial possibilities in recording and production, detailing also this specific listening room, regarding ITU-R BS.1116 compliance and auditory envelopment (AEV) transparency. Seats are limited to keep playback variation at bay.
This study evaluates three Next-Generation Audio (NGA) rendering systems through listening tests using real-life audio content. The testing paradigm prioritized subjective preference over adherence to a ground-truth reference. Participants assessed perceptual spatial audio attributes in both 5.1 and 7.1.4 loudspeaker setups. The findings suggest that strict adherence to the rendering algorithm used during content creation is not mandatory in terms of listener preference. While not advocating disregarding artistic intent without consideration, this study proposes that such flexibility in reproduction can be an acceptable compromise.
Toni Hirvonen studied acoustics at the Helsinki University of Technology (now Aalto University), where he obtained a PhD in audio signal processing and spatial audio. After a position as a Marie Curie fellow, he has worked internationally in the audio industry since 2010. His projects...
With the omnipresence of immersive audio the loudness agenda has been pushed out of the spotlight. While there are important areas (like TV) where the introduction has been a resounding success with a complete paradigm shift, others have not yet fully embraced the "auditory cease fire" (Radio) or even searched for ways to counteract loudness normalisation or still gain a loudness advantage (pop music). In this workshop, two veterans of the EBU loudness group PLOUD will elaborate on potential meta-reasons for the resistance in the latter areas as well as survey recent developments and challenges.
String ageing is a familiar and perceptually important phenomenon for guitarists and players of other stringed instruments. From the moment a new set of strings is installed, the sound they produce when excited begins to change due to a combination of chemical degradation, corrosion, and mechanical wear arising from playing. Musicians commonly report that aged strings sound dull, lack sustain, and feel less responsive compared to new strings. String ageing is a function of both elapsed time and accumulated playing time, with repeated playing accelerating degradation through contamination and repeated mechanical stress.
Previous studies have investigated individual aspects of string ageing by artificially accelerating wear and performing controlled acoustic measurements, identifying effects such as increased damping of higher partials and increased inharmonicity. While these approaches provide valuable physical insight, the tightly constrained experimental conditions differ significantly from real-world playing conditions.
This paper presents a dataset of audio recordings of guitar playing over a four-week period, starting from the point of new strings being installed. Audio performance data from different sets of electric guitar strings is recorded daily over a four-week period, using strictly fixed musical exercises that are repeated multiple times per session. By collecting many takes of identical material at each stage of string age, the dataset enables statistical analysis of ageing-related changes while accounting for natural performance variability.
The dataset is intended to support exploratory machine learning investigations into string ageing, including questions of how ageing manifests over time and playing duration, whether string age can be predicted from audio alone, and which audio features or learned representations capture perceptually relevant aspects of the ageing process.
Thomas McKenzie is a Lecturer in Acoustics and Architectural Acoustics at the Reid School of Music, Edinburgh College of Art, University of Edinburgh, UK. He completed a B.Sc. in Music, Multimedia, and Electronics at the University of Leeds, UK, in 2013, before completing his M.Sc...
Saturday May 30, 2026 9:00am - 11:00am CEST Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
Historically, music has developed primarily as a frontal phenomenon, thus limiting the expressive; perceptual potential related to sound space. The recent development of immersive audio systems opens new creative possibilities by expanding the artistic action space from a narrow frontal area to a complete sphere around the listener. The Ambisonic system (Scene-Based Audio), together with Object-Based formats; hybrid solutions, represents fertile ground for creative experimentation; the redefinition of workflows in the field of spatialized sound. In this new context, what is the role of the sound engineer, as an electroacoustic interpreter, in immersive musical artistic creation? The research is based on a multidisciplinary analysis that combines an in-depth study of current immersive audio technologies; their performance, with observations of existing compositional; production approaches. Additionally, a comparative study is conducted on the design choices of the sound engineer as an interpreter, investigating workflows, emerging musical semantics, available tools,; the recovery of the historical repertoire. Particular attention is paid to the experiment aimed at investigating a correlation between the position of a sound ; an emotional trigger in the listener. New directions emerge in the creative role of the sound engineer, who goes beyond the mere technical aspect to become an integral part of the compositional; interpretative process, harmonizing the relationship between technique; art.
Mashup is a distinctive form of music composition which integrates elements from existing songs to create a cohesive audio experience. The digital music landscape, with various audio processing tools and sharing platforms, has facilitated the creation and propagation of mashups by musicians, remixers, audio engineers, and automated systems. While most prior research and studies focus on mashups created by combining elements from individual audio tracks, typically using pop songs, there exist other types of mashups, for example those that incorporate phrases from base melodies into a new arrangement. In this study, we examined listener enjoyment ratings for this type of mashup, utilizing well-known Western classical melodies. A listening test was conducted to assess whether variations in pitch, tempo, and familiarity with the source material correlate with enhanced enjoyment. This paper presents our preliminary findings, with plans for future studies and additional survey responses to strengthen the results and uncover insights for crafting more engaging classical mashups.
Kimio-san designed NHK’s 22.2 audio system and the Hamasaki-cube, which is brilliant at capturing spatial qualities of a concert hall. Attendees are treated to a selection of high resolution 3D recordings from glorious Japanese concert halls.
This masterclass series, featuring remarkable recording artists, is a chance to hear 3D audio at its best; as we discuss qualities that make it truly worth the effort.
In each masterclass, we explore the new spatial possibilities in recording and production, detailing also this specific listening room, regarding ITU-R BS.1116 compliance and auditory envelopment (AEV) transparency. Seats are limited to keep playback variation at bay.
Kimio Hamasaki, an AES Fellow, is a producer and balance engineer for music recordings, a researcher in spatial audio, an educator in audio engineering and acoustics, and a consultant in audio engineering. He has recorded and produced numerous orchestral and operatic works with the Vienna Philharmonic...
Most contemporary immersive audio production workflows are centered on discrete channel-based loudspeaker formats such as 7.1.4. These formats are rarely experienced by most consumers and listeners, particularly in music playback. In practice, spatial audio is predominantly delivered via binaural reproduction. Beyond headphones, head-tracked loudspeaker array systems now enable convincing binaural reproduction in a practical, listener-centric manner, unlocking spatial audio over loudspeakers for ordinary listeners. This positions binaural reproduction not as a secondary translation, but as the core delivery format for immersive audio consumption.
Creating primarily for fixed speaker layouts can impose creative and technical constraints often resulting in restrained spatial design when content is later rendered binaurally. This workshop advocates a binaural-centric approach to spatial audio creation, treating binaural as the main deliverable, while preserving compatibility with discrete channel-based systems. Through discussion and practical examples, we will explore how designing with binaural in mind enables more expressive, perceptually robust, and immersive experiences across both headphone and loudspeaker-based binaural playback, without relying on traditional 7.1.4-centric production models.
Dialogue intelligibility is a fundamental aspect of audio post-production. Ensuring speech clarity in complex sound mixes remains challenging across different playback systems. Selective auditory attention plays a central role in how listeners track dialogue in busy mixes, so small changes in spectral or spatial structure can influence perceived clarity in unexpected ways. This study investigates the effectiveness of psychoacoustically informed techniques, equalisation and spatialisation, in reducing auditory masking and improving the clarity of dialogue. The listening test was completed on participants’ own playback systems, which reflects typical domestic viewing conditions and aligns the study with real-world listening environments. The techniques were tested individually and in combination to assess their impact. Results show that equalisation was more effective than spatialisation in reducing masking, while their combination produced a significant improvement in intelligibility, clarity, and reduced interference. The effectiveness of these methods varied between the two groups of clips, suggesting that their application should be adapted to the specific acoustic context of each scene.
Dialogue and sound editor with 3+ years' experience and 30+ credits in film across feature film, animation, documentary and TV series. Contributed to award-winning and festival-recognised productions, including films screened at the Venice Film Festival and the David di Donatello Awards...
Sound plays a critical role in virtual reality (VR), shaping attention, narrative comprehension, emotional engagement, and experiential plausibility under conditions of embodiment and user agency. Although a growing body of research addresses VR audio techniques, perceptual effects, and sound taxonomies, existing approaches remain fragmented and largely descriptive. In particular, they do not provide a unifying, VR-specific account of how sound meaning and emotional intent are operationally linked to user agency and non-linear narrative progression. This paper presents a narrative review of selected literature spanning game audio frameworks, immersive sound design, narrative theory, and plausibility-related research in games and VR. Through synthesis of these perspectives, the review identifies a conceptual gap in current research, namely the absence of a VR-specific, agency-coupled sound design framework for structuring sound meaning and emotional intent in support of experiential plausibility as users actively shape events in interactive VR environments.
Senior Lecturer, Music Technology & Popular Music, The University of Queensland, School of Music
Dr Eve Klein is a lecturer in music technology at the University of Queensland, Australia. She is also an operatic mezzo soprano, a composer, and an Ableton Live Certified Trainer. Eve's research is concentrated on music technology, recording cultures and contemporary music. Her current...
George plays high resolution stereo, 5.1 and 3D recordings from his fabulous back catalogue, commenting on production tools and techniques, including his own excellent dynamics processor.
This masterclass series, featuring remarkable recording artists, is a chance to hear 3D audio at its best; as we discuss qualities that make it truly worth the effort.
In each masterclass, we explore the new spatial possibilities in recording and production, detailing also this specific listening room, regarding ITU-R BS.1116 compliance and auditory envelopment (AEV) transparency. Seats are limited to keep playback variation at bay.
Imagine that you have just finished designing, and are now managing, your dream immersive audio mix room for a client, with an array of 64 speakers, and it functions beautifully - then COVID-19 wreaks global havoc. You find yourself suddenly isolated in a new country, forced into retirement with its budgetary restrictions, and your dream studio has become an early victim of the pandemic. What would be your next move?
In this real-life story, follow the adventures of an intrepid audio engineer and his quest to build a personal version of that immersive studio that was lost – all within a fixed-income retiree’s budget.
In this tutorial, an immersive studio design and construction will be described, including: inspiration from prior work by the author and colleagues; room design goals; equipment choices; custom electronics design; speaker design considerations; speaker support and position alignment; construction steps; VBAP, Ambisonics, and WFS approaches; test mixes.
The ability to objectively measure listener emotion is a critical frontier for adaptive audio systems, healthcare, and personalized music therapy. While music is a powerful driver of affect, traditional self-reporting is often intrusive or inaccessible for users in wellbeing settings who may struggle to articulate their mood. This paper introduces JoyCam, a multimodal system that estimates subtle moments of joyful engagement by blending lightweight brain-wave monitoring (wearable EEG) with facial-expression sensing. By capturing physiological reactions that occur below the threshold of conscious awareness, the system creates a more stable emotional profile than single-modality methods. In our system, facial joy is estimated via MediaPipe landmark analysis, focusing on normalized mouth-width deviations. Simultaneously, neurological engagement is tracked through Frontal Alpha Asymmetry (FAA) using an OpenBCI Cyton system. To address the sensitivity of EEG to movement, a dynamic artefact index down-weights neural signals during high-frequency interference. The system was tested in a pilot study with five participants. Preliminary results indicate that baseline-corrected physiological scores align closely with self-reported music impact and valence ratings across joyful and sad conditions. These findings suggest that JoyCam offers a robust framework for responsive musical companions that can adjust playlists or production parameters based on a listener’s real-time physiological state.
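As a rough illustration of the signal path the abstract describes, the sketch below computes Frontal Alpha Asymmetry in its commonly used form, ln(right alpha power) minus ln(left alpha power), and blends it with a facial score while down-weighting the EEG term when an artefact index is high. It is not the JoyCam implementation: the 8-13 Hz band limits, the periodogram estimator, the 0.5 maximum EEG weight, and the function names `band_power`, `frontal_alpha_asymmetry` and `fused_score` are all assumptions made for this example.

```python
import numpy as np

def band_power(signal, fs, lo=8.0, hi=13.0):
    """Mean power of `signal` in the [lo, hi] Hz band (windowed periodogram).

    The 8-13 Hz alpha band limits are an assumption for this sketch.
    """
    signal = np.asarray(signal, dtype=float)
    windowed = signal * np.hanning(len(signal))
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    mask = (freqs >= lo) & (freqs <= hi)
    return spectrum[mask].mean()

def frontal_alpha_asymmetry(left, right, fs):
    """FAA as ln(right alpha power) - ln(left alpha power)."""
    return np.log(band_power(right, fs)) - np.log(band_power(left, fs))

def fused_score(face_score, faa_score, artefact_index):
    """Blend facial and EEG cues (hypothetical weighting scheme).

    `artefact_index` in [0, 1]: 0 means clean EEG, 1 means fully contaminated.
    The EEG weight shrinks linearly from 0.5 to 0 as the index rises.
    """
    w_eeg = 0.5 * (1.0 - np.clip(artefact_index, 0.0, 1.0))
    return (1.0 - w_eeg) * face_score + w_eeg * faa_score

# Synthetic check: a 10 Hz tone with twice the amplitude on the right channel
# has four times the alpha power there.
fs = 250
t = np.arange(0, 4, 1.0 / fs)
left = np.sin(2 * np.pi * 10 * t)
right = 2.0 * np.sin(2 * np.pi * 10 * t)
faa = frontal_alpha_asymmetry(left, right, fs)
```

With the synthetic signals above, `faa` equals ln 4, since doubling the amplitude quadruples the power. A real pipeline would add baseline correction and channel selection, which the abstract mentions but does not specify.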
Senior Lecturer, Acoustics Research Centre, University of Salford
Saturday May 30, 2026 1:00pm - 3:00pm CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
Tinnitus has been described as 'the conscious awareness of a tonal or composite noise for which there is no identifiable corresponding external sound source' and is experienced by ~15% of the European population. Tinnitus may be experienced in one ear, both ears, or perceived as originating from within the head. It can present as tonal sounds, noise-like sounds, or a combination of both. The perception can lead to emotional and/or cognitive dysfunction, autonomic arousal, behavioural changes, and/or functional disability (DeRidder 2021, Biswas 2022, Jarach 2022). There is no standard test for tinnitus in the medical literature; audiologists typically test pitch (to within half an octave) and perceived loudness of the tone using standard clinical equipment for testing hearing loss. The underlying causes of tinnitus are not yet fully understood, and the most effective treatments not yet identified. We present the first release of an extended tinnitus matching app that includes a highly individualizable tinnitus tone-matching tool and a comprehensive questionnaire for mobile health tracking. The app facilitates large data collection on tinnitus sounds across aetiologies, co-occurring symptoms, and demographics. Our intentions are threefold: 1) to provide those experiencing tinnitus with a way to communicate what they hear more precisely, 2) to understand how tinnitus sounds vary across demographics and how these relate to co-occurring symptoms, and eventually 3) to provide a means of individualising any sound-based approach to symptom amelioration. We present the approach and validation of the tinnitus matching tool against common clinical measures.
Department of Engineering Technology and Didactics, DTU
Saturday May 30, 2026 1:00pm - 3:00pm CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
The application of binaural cue perception mechanisms to multichannel audio compression technology can reduce spatial parameter redundancy and effectively lower the encoding bitrate. Binaural cues play a critical role in sound source localization, and their frequency-dependent characteristics yield varied perceptual localization effects. However, current understanding of the specific behavior of binaural cues at low frequencies, as well as the similarities and differences between interaural time difference (ITD) and interaural level difference (ILD), remains incomplete. To explore the relationship between ITD-based and ILD-based azimuth perception, this study non-uniformly selected nine ITD values and twelve ILD values within the 300–1480 Hz frequency range to test ITD and ILD perceptual azimuths, respectively. The experimental method involved using fixed binaural cue stimuli while varying the audio with known horizontal azimuth angles to approach the target binaural cue stimulus. Test results indicate that both ITD and ILD perceptual effects are significantly influenced by frequency, with the minimum perceptual azimuth values for both ITD and ILD observed at 700 Hz, suggesting that binaural cue perception azimuths are closer to the median plane at this frequency. Furthermore, surface fitting was applied to the perceptual azimuths of ITD and ILD, revealing relatively similar patterns. Based on the experimental findings, this paper analyzes the perceptual correlation between ITD-based and ILD-based azimuth perception. Applying these data in spatial audio coding contributes to the efficient transmission and fidelity preservation of audio signals. This study provides valuable insights for optimizing binaural cue-based compression techniques, ultimately supporting high-fidelity spatial audio reproduction.
Saturday May 30, 2026 1:00pm - 3:00pm CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
The authors proposed a stereo-width shrinkage method for headphone reproduction, in which crosstalk from loudspeaker reproduction is added to the original stereo sources. In this study, we investigate the sound quality of stereo-width-shrunken sources with different parameter settings. A Semantic Differential method is employed to quantify the subjective characteristics with five adjective pairs, and the naturalness of the stereo-width-shrunken sources is evaluated in detail with Scheffé’s paired comparison. The results of the Semantic Differential method comprehensively rank the sound sources. Interestingly, the results of the paired comparison are not reversed in the natural and unnatural evaluations, whereas the negative evaluation yields reasonable results. These results provide valuable insights for practical sound-quality evaluation.
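The general idea of adding loudspeaker-style crosstalk to a stereo pair can be sketched as a simple crossfeed: each output channel receives an attenuated, delayed copy of the opposite channel. The abstract does not give the authors' filter design or parameter values, so the broadband gain, the 12-sample delay (about 0.25 ms at 48 kHz) and the function name `add_crosstalk` below are assumptions made purely for illustration.

```python
import numpy as np

def add_crosstalk(left, right, gain=0.3, delay_samples=12):
    """Narrow the stereo image by mixing a delayed, attenuated copy of the
    opposite channel into each channel.

    `gain` and `delay_samples` are hypothetical parameters, not values
    taken from the study.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)

    def delayed(x):
        # Prepend zeros, then trim so the output keeps the input length.
        return np.concatenate([np.zeros(delay_samples), x])[: len(x)]

    out_left = left + gain * delayed(right)
    out_right = right + gain * delayed(left)
    return out_left, out_right

# An impulse in the left channel only: the right output receives a copy
# scaled by `gain` and shifted by `delay_samples`.
impulse = np.array([1.0, 0.0, 0.0, 0.0])
silence = np.zeros(4)
out_l, out_r = add_crosstalk(impulse, silence, gain=0.5, delay_samples=1)
```

Raising `gain` towards 1 pulls the image towards mono, which is one way different parameter settings could produce different degrees of width shrinkage. A closer model of loudspeaker crosstalk would also apply head-shadow filtering to the crossfed path.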
Mitsunori Mizumachi graduated from the Department of Acoustic Design, Kyushu Institute of Design, in 1995 and received his Ph.D. degree in Information Science from Japan Advanced Institute of Science and Technology in 2000. From 2000 to 2004, he worked as a researcher at Advanced...
Saturday May 30, 2026 1:00pm - 3:00pm CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
Hyunkook describes his fine, compact 3D mic array, used for outdoor and indoor capture of 3D atmosphere and music. Attendees are treated to examples from the excellent ECHO project, and to a selection of new high resolution recordings.
This masterclass series, featuring remarkable recording artists, is a chance to hear 3D audio at its best; as we discuss qualities that make it truly worth the effort.
In each masterclass, we explore the new spatial possibilities in recording and production, detailing also this specific listening room, regarding ITU-R BS.1116 compliance and auditory envelopment (AEV) transparency. Seats are limited to keep playback variation at bay.