We present Binaspect, an open-source Python library for binaural audio analysis, visualization, and feature generation. Binaspect generates interpretable “azimuth maps” by calculating modified interaural time and level difference spectrograms, and clustering those time-frequency (TF) bins into stable time-azimuth histogram representations. This allows multiple active sources to appear as distinct azimuthal clusters, while degradations manifest as broadened, diffused, or shifted distributions. Crucially, Binaspect operates blindly on audio, requiring no prior knowledge of head models. These visualizations enable researchers and engineers to observe how binaural cues are degraded by codec and renderer design choices, among other downstream processes. We demonstrate the tool on bitrate ladders, ambisonic rendering, and VBAP source positioning, where degradations are clearly revealed. In addition to their diagnostic value, the proposed representations can be exported as structured features suitable for training machine learning models in quality prediction, spatial audio classification, and other binaural tasks. Binaspect is released under an open-source license with full reproducibility scripts at: (link removed for blind review)
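As a rough illustration of the idea of turning per-bin interaural cues into a time-indexed histogram, the sketch below computes only the interaural level difference (ILD) for each time-frequency bin and counts the values per frame. It is not Binaspect's API or algorithm: the function names, frame length, bin count, and magnitude floor are assumptions made for this example, the ITD cue and the clustering step are omitted, and ILD is left in dB rather than mapped to azimuth.

```python
import cmath
import math


def dft(frame):
    """Naive O(N^2) DFT returning the non-negative-frequency bins."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def ild_histograms(left, right, frame_len=64, n_bins=21, max_db=20.0, floor=1e-6):
    """Return one ILD histogram (list of counts) per non-overlapping frame.

    ILD is 20*log10(|L|/|R|) per TF bin, so positive values mean the left
    channel is louder. Bins whose magnitude is below `floor` in either
    channel are skipped. All parameter defaults are illustrative assumptions.
    """
    histograms = []
    for start in range(0, len(left) - frame_len + 1, frame_len):
        spec_l = dft(left[start:start + frame_len])
        spec_r = dft(right[start:start + frame_len])
        counts = [0] * n_bins
        for bin_l, bin_r in zip(spec_l, spec_r):
            if abs(bin_l) < floor or abs(bin_r) < floor:
                continue
            ild_db = 20.0 * math.log10(abs(bin_l) / abs(bin_r))
            ild_db = max(-max_db, min(max_db, ild_db))  # clip to histogram range
            index = round((ild_db + max_db) / (2 * max_db) * (n_bins - 1))
            counts[index] += 1
        histograms.append(counts)
    return histograms


# Synthetic example: a tone whose right channel is half the left amplitude.
left = [math.sin(2 * math.pi * 8 * t / 64) for t in range(128)]
right = [0.5 * sample for sample in left]
hists = ild_histograms(left, right)
```

With the default 21 bins spanning -20 dB to +20 dB, each histogram bin covers 2 dB, so a single source with a fixed level difference collects in one bin, while a degraded signal would be expected to spread across neighbouring bins.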
Thursday May 28, 2026 1:30pm - 3:30pm CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
The recently finalized ISO international standard (IS) on MPEG-I immersive audio enables interactive six-degrees-of-freedom (6DoF) audio rendering for a multitude of virtual-reality and augmented-reality (VR/AR) acoustic scenarios and applications, with comprehensive modeling of room acoustics and intricate acoustic phenomena, including occlusion, reflection, transmission and diffraction caused by sound obstacles, Doppler effect, and dynamic environment changes triggered by user interactivity. This paper describes the concept, methodology and results of the final verification test of this standard. In the verification test, the perceptual quality of the renderer was assessed in an interactive listening test using different indoor and outdoor acoustic scenes, testing the above-mentioned features of the standard. More than 50 listeners, distributed across six labs, participated in the test, which used the ITU-R BS.2132 [1] multi-stimulus method on a 100-point scale for three conditions (IS, mid anchor and low anchor) in 10 VR scenes plus two repetitions. The results of several anchor processing configurations are presented. The selected mid and low anchors demonstrated stable quality across diverse scenes with progressive timbre and spatial degradations. The listening test results show a clear separation of the conditions (IS > mid > low); the low anchor was stable (around 16 points median value) while the mid anchor varied by scene (around 47 points). The IS is rated with a median of 84 points across all labs, which lies in the “excellent” region of the scale. The individual scenes are rated differently, and the quartile range for some scenes can reach 20 points. The median IS rating also varied between labs; some were somewhat more critical than others.
Sascha Disch received his Dipl.-Ing. degree in electrical engineering from the Technical University Hamburg-Harburg (TUHH) in 1999 and joined the Fraunhofer Institute for Integrated Circuits (IIS) the same year. Ever since he has been working in research and development of perceptual...
Thursday May 28, 2026 1:30pm - 3:30pm CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
This paper presents a comparative analysis of two immersive recording techniques for classical music: the PCMA-3D (Perspective Control Microphone Array) and the Decca Cuboid. While the Decca Cuboid relies primarily on time-of-arrival differences to generate spatial impressions, the PCMA-3D utilises intensity differences and separates ambience from direct sound. A recording session was conducted in a concert hall using a classical guitar soloist and two distinct folk music ensembles to capture performances simultaneously with both arrays. Subjective evaluation was performed using a MUSHRA listening test with 18 participants, assessing parameters such as sensation of space, localisation precision, and sound quality. Statistical analysis reveals that while both systems provide high-quality immersive experiences, the PCMA-3D scored significantly higher in the sensation of space (p
Thursday May 28, 2026 1:30pm - 3:30pm CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
Current deep learning approaches to speech enhancement rely heavily on objective measures like mean squared error or scale-invariant signal-to-distortion ratio as both training objectives and evaluation metrics. While analytically convenient, these benchmarks often fail to capture the nuances of human perception or actual intelligibility. Furthermore, the inconsistent integration of metrics like Short-Term Objective Intelligibility or Perceptual Evaluation of Speech Quality into training and evaluation pipelines leaves a gap between algorithmic performance and perceptual reality. This paper proposes a transition towards evaluation methodologies grounded in psychoacoustics and audiological modeling. Our study explores two distinct methods to characterise enhanced signals. On one hand, we employ a perceptual approach based on the Cambridge loudness model to assess the preservation of spectral excitation patterns and perceived intensity. On the other hand, we adopt a biophysical approach by utilising CoNNear, a convolutional model of the human auditory periphery. This allows us to simulate representations of responses at different stages of the auditory periphery to observe how speech enhancement processing affects the physiological representation of speech. We analyse pre-trained speech enhancement models using automatic speech recognition and Short-Term Objective Intelligibility as an additional proxy for human intelligibility. By mapping automatic speech recognition performance against loudness and peripheral response patterns, we investigate the extent to which current enhancement strategies maintain the perceptual and physiological integrity of the speech signal. This work aims to identify features predictive of intelligibility, providing a foundation for speech enhancement systems optimised for the human listener rather than purely signal-based objective functions.
Objective quality evaluation is widely used in speech coding, yet objective estimates often show limited agreement with subjective listening-test results. Rather than focusing on absolute score accuracy, this paper evaluates objective speech quality models from a decision-making perspective, defined as their ability to support comparative judgments between speech codecs or codec configurations. A formal ITU-T P.800 Absolute Category Rating (ACR) listening test was conducted with 30 listeners across 24 conditions, covering conventional and neural monophonic speech codecs operating under clear-channel conditions at sampling frequencies from 16 to 48 kHz and bit rates ranging from below 1 kbps to above 16 kbps. The speech material consisted of internally recorded, clean French-language speech that was not used in the development or training of any of the evaluated codecs or objective quality models. Seven objective quality models, namely PESQ, VISQOL Speech, VISQOL Audio, WARP-Q, NISQA, UTMOS, and DistillMOS, were evaluated on the same material. Decision-making performance was assessed by comparing subjective and objective rankings using Kendall’s rank correlation coefficient and by analyzing pairwise codec comparisons using t-tests at a 95% confidence level. The results show that some objective quality models are effective for comparing bit rate variations within a given speech coding technology, provided that all other codec parameters (e.g., sampling frequency) remain unchanged. However, all models exhibit limitations, including tendencies toward over- or underestimation for certain technologies, as well as reduced reliability when applied across different sampling frequencies. Despite its conventional origins, PESQ remains capable of supporting decision-making even when applied to neural speech codecs.
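To make the ranking comparison concrete, the following sketch computes Kendall's rank correlation (the tau-a variant, which ignores tied pairs) between a set of subjective scores and the scores of one objective model. The function name and all numbers are hypothetical and are not data from the paper; the abstract does not state which tau variant was used.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Pairs tied in either sequence count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-condition scores for four codec conditions.
subjective_mos = [1.8, 2.9, 3.6, 4.2]
objective_score = [2.1, 2.0, 3.9, 4.4]
tau = kendall_tau_a(subjective_mos, objective_score)
```

In this made-up example the objective model swaps the order of the two lowest conditions, giving five concordant pairs and one discordant pair out of six. A value of 1.0 would mean the model ranks every pair of conditions the same way as the listeners did.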
Music source separation (MSS) systems are commonly used in production, remixing, and audio analysis work, yet questions arise regarding the extent to which objective evaluations of model performance align with human perceptual evaluations, particularly when tasked with non-traditional source material (in this case, heavily processed electronic music). This study seeks to set a framework for an evaluation of three machine learning approaches to MSS: a spectrogram-domain model (Spleeter), a waveform-domain model (Demucs v2), and a hybrid-domain model (HTDemucs). Subjective evaluations of model performance were collected via a MUSHRA-style listening test, while objective evaluations used signal-to-distortion ratio (SDR) and Fréchet Audio Distance (FAD). Results showed consistent agreement across objective metrics, with the hybrid-domain model outperforming the two single-domain models. Perceptual ratings also favored the hybrid model; interestingly, listeners occasionally rated the model output as equal to or better in quality than the original reference. Preliminary analysis indicates moderate but non-significant correlations between the two assessment paths, reinforcing concerns about relying solely on numerical evaluations when discussing MSS model performance. Implications for model design and future evaluation procedures are discussed.
The purpose of this study was to evaluate a sonification system that maps live heart rate data to real-time spectral filtering of a runner's preferred music. Assessed using a within-subjects design (n = 13), the system employs high-pass and low-pass filters to indicate deviations from target heart rate zones, providing instantaneous biofeedback without requiring visual attention. Quantitative analysis revealed no statistically significant differences in target zone accuracy or response time between auditory, visual, and combined conditions. However, qualitative thematic analysis identified a clear division in user preference. Participants favouring the auditory condition demonstrated faster mean response times to audio biofeedback. Findings suggest that while sonification promotes environmental focus and "gamifies" training, its efficacy is highly dependent on individual processing styles and music familiarity.
This paper proposes a psychoacoustic-based audio-visual mapping framework for intelligent vehicle cabins to enhance immersion and stabilize spatial auditory perception. By establishing mappings between auditory descriptors (such as Direction of Arrival (DOA), spectral centroid, and temporal envelope) and ambient lighting parameters, the framework leverages "ambient vision" to augment the perceptual experience without increasing the driver's cognitive load. Theoretical analysis based on Stevens’ Power Law indicates that the proposed mapping strategies effectively synchronize audio-visual intensities and mitigate perceptual fatigue, providing a conceptual reference for future multisensory HMI design.
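Stevens' power law models perceived magnitude as ψ = k·φ^a, where φ is the physical stimulus intensity and a is a modality-dependent exponent. The sketch below shows one way such a law could be used for cross-modal matching: it solves for the light intensity whose perceived magnitude equals that of a given sound intensity. The function names, the exponents, and the scaling constants are placeholders chosen for this example, not values from the paper.

```python
def perceived_magnitude(intensity, exponent, k=1.0):
    """Stevens' power law: psi = k * intensity ** exponent."""
    return k * intensity ** exponent


def matched_light_intensity(sound_intensity, a_sound, a_light,
                            k_sound=1.0, k_light=1.0):
    """Light intensity whose perceived magnitude equals that of the sound.

    Solves k_light * I_light ** a_light == k_sound * I_sound ** a_sound
    for I_light. Exponents and constants are illustrative assumptions.
    """
    target = perceived_magnitude(sound_intensity, a_sound, k_sound)
    return (target / k_light) ** (1.0 / a_light)


# Placeholder exponents: the light exponent is half the sound exponent,
# so the matched light intensity grows as the square of the sound intensity.
light = matched_light_intensity(4.0, a_sound=0.6, a_light=0.3)
```

Because the matched intensity scales as the sound intensity raised to a_sound / a_light, the ratio of the two exponents determines how strongly the lighting should respond to changes in the audio.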
String ageing is a familiar and perceptually important phenomenon for guitarists and players of other stringed instruments. From the moment a new set of strings is installed, the sound they produce when excited begins to change due to a combination of chemical degradation, corrosion, and mechanical wear arising from playing. Musicians commonly report that aged strings sound dull, lack sustain, and feel less responsive compared to new strings. String ageing is a function of both elapsed time and accumulated playing time, with repeated playing accelerating degradation through contamination and repeated mechanical stress.
Previous studies have investigated individual aspects of string ageing by artificially accelerating wear and performing controlled acoustic measurements, identifying effects such as increased damping of higher partials and increased inharmonicity. While these approaches provide valuable physical insight, the tightly constrained experimental conditions differ significantly from real-world playing conditions.
This paper presents a dataset of audio recordings of guitar playing over a four-week period, starting from the point of new strings being installed. Audio performance data from different sets of electric guitar strings is recorded daily, using strictly fixed musical exercises that are repeated multiple times per session. By collecting many takes of identical material at each stage of string age, the dataset enables statistical analysis of ageing-related changes while accounting for natural performance variability.
The dataset is intended to support exploratory machine learning investigations into string ageing, including questions of how ageing manifests over time and playing duration, whether string age can be predicted from audio alone, and which audio features or learned representations capture perceptually relevant aspects of the ageing process.
Thomas McKenzie is a Lecturer in Acoustics and Architectural Acoustics at the Reid School of Music, Edinburgh College of Art, University of Edinburgh, UK. He completed a B.Sc. in Music, Multimedia, and Electronics at the University of Leeds, UK, in 2013, before completing his M.Sc...
Saturday May 30, 2026 9:00am - 11:00am CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
The ability to objectively measure listener emotion is a critical frontier for adaptive audio systems, healthcare, and personalized music therapy. While music is a powerful driver of affect, traditional self-reporting is often intrusive or inaccessible for users in wellbeing settings who may struggle to articulate their mood. This paper introduces JoyCam, a multimodal system that estimates subtle moments of joyful engagement by blending lightweight brain-wave monitoring (wearable EEG) with facial-expression sensing. By capturing physiological reactions that occur below the threshold of conscious awareness, the system creates a more stable emotional profile than single-modality methods. In our system, facial joy is estimated via MediaPipe landmark analysis, focusing on normalized mouth-width deviations. Simultaneously, neurological engagement is tracked through Frontal Alpha Asymmetry (FAA) using an OpenBCI Cyton system. To address the sensitivity of EEG to movement, a dynamic artefact index down-weights neural signals during high-frequency interference. The system was tested in a pilot study with five participants. Preliminary results indicate that baseline-corrected physiological scores align closely with self-reported music impact and valence ratings across joyful and sad conditions. These findings suggest that JoyCam offers a robust framework for responsive musical companions that can adjust playlists or production parameters based on a listener’s real-time physiological state.
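The two cues described above can be sketched in a few lines. This is not JoyCam's implementation: the FAA convention used here (natural log of right-hemisphere alpha power minus natural log of left-hemisphere alpha power) is one common definition and is assumed, and the fusion rule, the weights, and all function names are hypothetical, since the abstract does not specify them.

```python
import math


def frontal_alpha_asymmetry(alpha_power_left, alpha_power_right):
    """FAA as ln(right alpha power) - ln(left alpha power) (assumed convention)."""
    return math.log(alpha_power_right) - math.log(alpha_power_left)


def normalized_mouth_width(mouth_left, mouth_right, face_left, face_right):
    """Mouth-corner distance divided by face width, so the value does not
    depend on how far the face is from the camera. Points are (x, y) pairs."""
    return math.dist(mouth_left, mouth_right) / math.dist(face_left, face_right)


def blended_score(faa, mouth_deviation, artefact_index, eeg_weight=0.5):
    """Hypothetical fusion: the EEG weight shrinks linearly as the artefact
    index rises from 0 (clean signal) to 1 (unusable signal)."""
    w = eeg_weight * (1.0 - artefact_index)
    return w * faa + (1.0 - w) * mouth_deviation


# Made-up values: deviation of the current mouth width from a neutral baseline.
baseline = normalized_mouth_width((40, 60), (60, 60), (0, 50), (100, 50))
current = normalized_mouth_width((35, 60), (65, 60), (0, 50), (100, 50))
score = blended_score(frontal_alpha_asymmetry(2.0, 3.0),
                      current - baseline, artefact_index=0.2)
```

With an artefact index of 1.0 the blended score falls back entirely on the facial cue, which is the behaviour a down-weighting scheme of this kind is meant to provide during movement.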
Senior Lecturer, Acoustics Research Centre, University of Salford
Saturday May 30, 2026 1:00pm - 3:00pm CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
Tinnitus has been described as ‘the conscious awareness of a tonal or composite noise for which there is no identifiable corresponding external sound source’ and is experienced by ~15% of the European population. Tinnitus may be experienced in one ear, both ears, or perceived as originating from within the head. It can present as tonal sounds, noise-like sounds, or a combination of both. The perception can lead to emotional and/or cognitive dysfunction, autonomic arousal, behavioural changes, and/or functional disability (DeRidder 2021, Biswas 2022, Jarach 2022). There is no standard test for tinnitus in the medical literature; audiologists typically test pitch (to within half an octave) and perceived loudness of the tone using standard clinical equipment for testing hearing loss. The underlying causes of tinnitus are not yet fully understood, and the most effective treatments have not yet been identified. We present the first release of an extended tinnitus matching app that includes a highly individualizable tinnitus tone-matching tool and a comprehensive questionnaire for mobile health tracking. The app facilitates large-scale data collection on tinnitus sounds across aetiologies, co-occurring symptoms, and demographics. Our intentions are threefold: 1) to provide those experiencing tinnitus with a way to communicate what they hear more precisely, 2) to understand how tinnitus sounds vary across demographics and how these relate to co-occurring symptoms, and, eventually, 3) to provide a means of individualising any sound-based approach to symptom amelioration. We present the approach and validation of the tinnitus matching tool against common clinical measures.
Department of Engineering Technology and Didactics, DTU
Saturday May 30, 2026 1:00pm - 3:00pm CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
The application of binaural cue perception mechanisms to multichannel audio compression technology can reduce spatial parameter redundancy and effectively lower the encoding bitrate. Binaural cues play a critical role in sound source localization, and their frequency-dependent characteristics yield varied perceptual localization effects. However, current understanding of the specific behavior of binaural cues at low frequencies, as well as the similarities and differences between interaural time difference (ITD) and interaural level difference (ILD), remains incomplete. To explore the relationship between ITD-based and ILD-based azimuth perception, this study non-uniformly selected nine ITD values and twelve ILD values within the 300–1480 Hz frequency range to test ITD and ILD perceptual azimuths, respectively. In the experiment, a stimulus carrying a fixed binaural cue served as the target, and audio with a known horizontal azimuth angle was varied until it approached the target. Test results indicate that both ITD and ILD perceptual effects are significantly influenced by frequency, with the minimum perceptual azimuth values for both ITD and ILD observed at 700 Hz, suggesting that binaural cue perception azimuths are closer to the median plane at this frequency. Furthermore, surface fitting was applied to the perceptual azimuths of ITD and ILD, revealing relatively similar patterns. Based on the experimental findings, this paper analyzes the perceptual correlation between ITD-based and ILD-based azimuth perception. Applying these data in spatial audio coding contributes to the efficient transmission and fidelity preservation of audio signals. This study provides valuable insights for optimizing binaural cue-based compression techniques, ultimately supporting high-fidelity spatial audio reproduction.
Saturday May 30, 2026 1:00pm - 3:00pm CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
The authors proposed a stereo-width shrinkage method for headphone reproduction, in which crosstalk from loudspeaker reproduction is added to the original stereo sources. In this study, we investigate the sound quality of stereo-width-shrunken sources with different parameter settings. A Semantic Differential method is employed to quantify the subjective characteristics with five adjective pairs, and the naturalness of the stereo-width-shrunken sources is evaluated in detail with Scheffé’s paired comparison. The results of the Semantic Differential method comprehensively rank the sound sources. Interestingly, the results of the paired comparison are not reversed in the natural and unnatural evaluations, whereas the negative evaluation yields reasonable results. These results provide valuable insights for practical sound-quality evaluation.
Mitsunori Mizumachi graduated from the Department of Acoustic Design, Kyushu Institute of Design, in 1995 and received his Ph.D. degree in Information Science from Japan Advanced Institute of Science and Technology in 2000. From 2000 to 2004, he worked as a researcher at Advanced...
Saturday May 30, 2026 1:00pm - 3:00pm CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark