We present Binaspect, an open-source Python library for binaural audio analysis, visualization, and feature generation. Binaspect generates interpretable “azimuth maps” by calculating modified interaural time and level difference spectrograms, and clustering those time-frequency (TF) bins into stable time-azimuth histogram representations. This allows multiple active sources to appear as distinct azimuthal clusters, while degradations manifest as broadened, diffused, or shifted distributions. Crucially, Binaspect operates blindly on audio, requiring no prior knowledge of head models. These visualizations enable researchers and engineers to observe how binaural cues are degraded by codec and renderer design choices, among other downstream processes. We demonstrate the tool on bitrate ladders, ambisonic rendering, and VBAP source positioning, where degradations are clearly revealed. In addition to their diagnostic value, the proposed representations can be exported as structured features suitable for training machine learning models in quality prediction, spatial audio classification, and other binaural tasks. Binaspect is released under an open-source license with full reproducibility scripts at: (link removed for blind review)
Thursday May 28, 2026 1:30pm - 3:30pm CEST Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
Realistic spatial audio consistent with visual information is essential for providing high immersion in Augmented Reality (AR) environments. However, conventional high-precision real-time acoustic simulations require significant computational power, limiting their implementation on standalone mobile VR devices such as the Meta Quest. This study proposes a practical method to enhance reverb realism using only a standalone VR HMD, without the need for additional external equipment. By measuring impulse responses using a few hand claps in the physical space, we interpolate room acoustic parameters (specifically RT60 and early/late energy ratios) to reflect the environment's unique characteristics. These extracted parameters are then applied to the VR engine's built-in reverb effects, enabling dynamic, location-aware real-time rendering with minimal computational load. The proposed method demonstrates that a brief calibration period of 3 to 5 minutes yields significantly improved realism compared to static reverb templates, offering an efficient and practical spatial audio solution for mobile AR environments.
Thursday May 28, 2026 1:30pm - 3:30pm CEST Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
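The RT60 extraction mentioned in the hand-clap calibration abstract above can be illustrated with a minimal sketch. It assumes a clean, already-trimmed impulse response and uses Schroeder backward integration with a T20-style line fit; this is a common textbook approach, not necessarily the authors' method, and the function names and fit range are hypothetical.

```python
import math


def schroeder_decay_db(ir):
    """Energy decay curve (dB, 0 dB at t=0) via Schroeder backward integration."""
    energy = [x * x for x in ir]
    total = sum(energy)
    if total <= 0.0:
        raise ValueError("impulse response has no energy")
    edc = []
    remaining = total
    for e in energy:
        # Energy still to come from this sample onward, relative to the total.
        edc.append(10.0 * math.log10(max(remaining, 1e-30) / total))
        remaining -= e
    return edc


def rt60_from_ir(ir, fs, start_db=-5.0, end_db=-25.0):
    """Fit a line to the decay curve between start_db and end_db (assumed
    T20-style range) and extrapolate the slope to a 60 dB decay."""
    edc = schroeder_decay_db(ir)
    pts = [(i / fs, d) for i, d in enumerate(edc) if end_db <= d <= start_db]
    if len(pts) < 2:
        raise ValueError("not enough decay range for a fit")
    mean_t = sum(t for t, _ in pts) / len(pts)
    mean_d = sum(d for _, d in pts) / len(pts)
    num = sum((t - mean_t) * (d - mean_d) for t, d in pts)
    den = sum((t - mean_t) ** 2 for t, _ in pts)
    slope_db_per_s = num / den  # negative for a decaying response
    return -60.0 / slope_db_per_s


# Synthetic check: an ideal exponential decay with a known RT60 of 0.5 s.
fs = 8000
ir = [10 ** (-3.0 * (n / fs) / 0.5) for n in range(fs)]
print(round(rt60_from_ir(ir, fs), 3))
```

A real hand-clap recording would also need onset detection, band filtering, and noise-floor handling before the fit, none of which is shown here.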
This study presents a voice-centered machine learning framework for detecting mental fatigue in military personnel, integrating acoustic analysis with physiological biosensors to enhance detection robustness. Mental fatigue poses critical safety and performance challenges in military operations, yet cultural stigma often prevents self-reporting. We collected multi-modal data from 23 participants across two fatigue states, extracting comprehensive acoustic features including sound pressure level (SPL), formants, mel-frequency cepstral coefficients (MFCCs), jitter, shimmer, harmonic-to-noise ratio (HNR), and temporal speech characteristics. These voice features were combined with electroencephalography (EEG), photoplethysmography (PPG), and temperature data to train multiple machine learning classifiers. The voice-based models achieved accuracies between 82% and 85%, with support vector machines (SVM) and long short-term memory (LSTM) networks demonstrating superior performance. When acoustic features were combined with physiological markers, classification accuracy improved to 92%, with Classification and Regression Trees (CART) and Linear Discriminant Analysis (LDA) emerging as top performers. Statistical analysis identified SPL and formant variance as the most discriminative voice features, while Lempel-Ziv Complexity (LZC) and theta/beta ratio proved most reliable for EEG. Evaluation on new participants yielded 67% accuracy, revealing model generalization challenges that inform future research directions. This work demonstrates that voice-based machine learning systems, when augmented with physiological data, offer a promising non-invasive approach to real-time fatigue monitoring in operational military environments.
I’m a creative technologist and interaction designer exploring how sound, technology, and human experience meet. With an MScEng in Sound & Music Computing, I prototype audio interactions, build ML‑driven tools, and design experiments around perception. My background spans music...
Friday May 29, 2026 9:00am - 11:00am CEST Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
Current deep learning approaches to speech enhancement rely heavily on objective measures like mean squared error or scale-invariant signal-to-distortion ratio as both training objectives and evaluation metrics. While analytically convenient, these benchmarks often fail to capture the nuances of human perception or actual intelligibility. Furthermore, the inconsistent integration of metrics like Short-Term Objective Intelligibility or Perceptual Evaluation of Speech Quality into training and evaluation pipelines leaves a gap between algorithmic performance and perceptual reality. This paper proposes a transition towards evaluation methodologies grounded in psychoacoustics and audiological modeling. Our study explores two distinct methods to characterise enhanced signals. On one hand, we employ a perceptual approach based on the Cambridge loudness model to assess the preservation of spectral excitation patterns and perceived intensity. On the other hand, we adopt a biophysical approach by utilising CoNNear, a convolutional model of the human auditory periphery. This allows us to simulate representations of responses at different stages of the auditory periphery to observe how speech enhancement processing affects the physiological representation of speech. We analyse pre-trained speech enhancement models using automatic speech recognition and Short-Term Objective Intelligibility as an additional proxy for human intelligibility. By mapping automatic speech recognition performance against loudness and peripheral response patterns, we investigate the extent to which current enhancement strategies maintain the perceptual and physiological integrity of the speech signal. This work aims to identify features predictive of intelligibility, providing a foundation for speech enhancement systems optimised for the human listener rather than purely signal-based objective functions.
Objective quality evaluation is widely used in speech coding, yet objective estimates often show limited agreement with subjective listening-test results. Rather than focusing on absolute score accuracy, this paper evaluates objective speech quality models from a decision-making perspective, defined as their ability to support comparative judgments between speech codecs or codec configurations. A formal ITU-T P.800 Absolute Category Rating (ACR) listening test was conducted with 30 listeners across 24 conditions, covering conventional and neural monophonic speech codecs operating under clear-channel conditions at sampling frequencies from 16 to 48 kHz and bit rates ranging from below 1 kbps to above 16 kbps. The speech material consisted of internally recorded, clean French-language speech that was not used in the development or training of any of the evaluated codecs or objective quality models. Seven objective quality models, namely PESQ, VISQOL Speech, VISQOL Audio, WARP-Q, NISQA, UTMOS, and DistillMOS, were evaluated on the same material. Decision-making performance was assessed by comparing subjective and objective rankings using Kendall’s rank correlation coefficient and by analyzing pairwise codec comparisons using t-tests at a 95% confidence level. The results show that some objective quality models are effective for comparing bit rate variations within a given speech coding technology, provided that all other codec parameters (e.g., sampling frequency) remain unchanged. However, all models exhibit limitations, including tendencies toward over- or underestimation for certain technologies, as well as reduced reliability when applied across different sampling frequencies. Despite its conventional origins, PESQ remains capable of supporting decision-making even when applied to neural speech codecs.
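The ranking comparison described in the abstract above relies on Kendall's rank correlation. As a minimal sketch of that statistic (the tau-b variant, which accounts for ties), the following uses made-up scores for four hypothetical codec conditions; the numbers are illustrative only and are not results from the paper.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: excluded from both denominator terms
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1  # both lists order this pair the same way
            else:
                discordant += 1  # the lists disagree on this pair
    base = concordant + discordant
    denom = math.sqrt((base + tied_x_only) * (base + tied_y_only))
    if denom == 0:
        raise ValueError("tau-b is undefined when one list is constant")
    return (concordant - discordant) / denom


# Hypothetical subjective MOS and objective scores for four conditions.
subjective = [4.2, 3.1, 2.5, 3.8]
objective = [4.0, 3.3, 2.0, 3.2]
print(round(kendall_tau_b(subjective, objective), 4))  # 5 concordant, 1 discordant pair
```

A value near 1 would indicate that the objective model ranks conditions almost exactly as listeners do, which is the decision-making property the abstract focuses on.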
Spatial audio recording using higher-order Ambisonics offers rich directional information for medical speech capture, yet challenging hospital acoustic environments motivate preprocessing with neural denoising algorithms. This study investigates whether U-Net-based denoising of third-order ambisonic recordings improves automatic speech recognition (ASR) quality for medical applications. We developed the Medical Immersive Audio Corpus (MIAC), comprising 1,759 utterances (6.43 hours) of Polish medical speech recorded with a Zylia ZM-1 microphone in uncontrolled hospital environments, capturing 16-channel third-order Ambisonics across multiple specializations including thyroid ultrasonography, surgical procedures, and general diagnostics. We applied a U-Net architecture with dual attention mechanisms trained using the Noise2Noise paradigm to denoise the corpus, then evaluated transcription quality using ten Whisper ASR models ranging from 39 million to 1.55 billion parameters, including domain-adapted medical variants. Surprisingly, we discovered a "noise reduction paradox" where denoising degraded transcription quality for seven of ten models, with statistically significant increases in Word Error Rate (WER) and Character Error Rate (CER) for general-purpose base, small, and medium models. Only the domain-adapted whisper-medium-68000-abbr model showed statistically significant improvement (p=0.0008), while large-scale models (large-v2, large-v3) exhibited robustness with negligible changes. Effect sizes remained small (Cohen's d < 0.2) across all models. These counterintuitive findings suggest modern ASR systems implicitly utilize background noise characteristics as informative features, and that preprocessing pipelines should be reconsidered for domain-specific applications. Our results provide practical guidance for medical speech processing system design.
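For readers less familiar with the WER and CER metrics used in the abstract above, here is a minimal sketch of how both are commonly computed from a Levenshtein edit distance. The function names and the English example sentence are hypothetical; the paper's own scoring pipeline and text normalization are not described in the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions, and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """CER: character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Illustrative transcript pair: one word dropped out of five.
ref = "the thyroid gland appears normal"
hyp = "the thyroid appears normal"
print(word_error_rate(ref, hyp))
```

Because WER normalizes by the reference length, it can exceed 1.0 when the hypothesis contains many insertions.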
Virtual Microphone Array techniques are being investigated by the authors to support room acoustics optimisation in live sound environments. In our recent AES paper, “Room Acoustics Optimisation Using Virtual Microphone Arrays”, a notable outcome was that a compact four-microphone tetrahedral array performed strongly relative to its low sensor count. Recent virtual sensing and Remote Microphone Technique research treats microphone placement as an explicit design variable. It reports improved remote estimation performance when microphone layouts are deliberately chosen for the task, rather than adopted as fixed, standard configurations. This submission builds on our prior VMA work by focusing on the four-microphone case, where geometry choices are especially constrained. We compare a tetrahedral baseline with an ensemble of stochastically generated spherical layouts at the same array aperture using Monte Carlo simulation. We apply a consistent evaluation protocol across multiple listening-region offsets and standard beamforming estimators to isolate variability due to geometry alone. The central proposition is that, for low-count VMAs, geometry is a first-order design parameter. Tetrahedral remains a credible baseline, but lightweight stochastic exploration can reveal alternative layouts that are competitive and, in some cases, superior without increasing channel count.
Brian de Brit is a lecturer in the School of Electrical and Electronic Engineering at Technological University Dublin. He holds a B.Sc. in Mathematical Physics (University College Dublin), an M.Phil. in Music and Media Technologies (Trinity College Dublin), and a Master of Engineering...
This paper introduces clustered virtual microphone arrays as a step toward improving listener-level virtual microphone estimation for live sound. Multiple compact microphone sub-arrays are placed around a nominal overhead position. Each sub-array produces a virtual microphone estimate, and the estimates are fused. The aim is to attack the estimation problem from multiple viewpoints and reduce sensitivity to any one array placement or geometry. The work builds on our earlier paper, “Room Acoustics Optimisation Using Virtual Microphone Arrays”. That paper proposed virtual microphones estimated from an overhead array as a measurement layer for live sound optimisation. It also highlighted a key limitation: in its initial form, virtual microphone estimation quality was not yet strong enough for reliable use across positions. The present paper targets that limitation. We outline the clustered array idea and treat cluster count and inter-cluster spacing as design parameters. Virtual microphones are estimated using beamforming and combined using simple fusion. Performance is assessed with objective signal measures, including SNR and frequency- and phase-related error measures, across multiple listener-level target positions. The results support further refinement under more realistic room conditions and further study of the link between improved estimation quality and FIR-based correction outcomes.
Brian de Brit is a lecturer in the School of Electrical and Electronic Engineering at Technological University Dublin. He holds a B.Sc. in Mathematical Physics (University College Dublin), an M.Phil. in Music and Media Technologies (Trinity College Dublin), and a Master of Engineering...
Loudspeaker array beamforming technology has been widely used; however, current frequency-domain and time-domain design methods for calculating FIR filters face challenges, including the need for modeling delay and high computational complexity. To address these issues, this paper proposes a time–frequency integrated framework. This framework supports both pressure matching and amplitude matching methods, enabling not only the realization of traditional superdirective beams but also the design of frequency-invariant beams. For the nonlinear optimization problem in amplitude matching, an efficient solving algorithm based on the Alternating Direction Method of Multipliers (ADMM) is introduced. Experimental results demonstrate that the proposed method combines the advantages of existing frequency-domain and time-domain approaches, directly computing FIR filter coefficients without delay modeling while maintaining high computational efficiency. This provides an effective solution for beam control in loudspeaker arrays.
The Exponential Sine Sweep (ESS) technique, popularized by Angelo Farina, has become a cornerstone of modern electroacoustic measurement due to its unique capability to simultaneously extract a system’s linear impulse response and its individual harmonic distortion components. Standard implementation of this method almost exclusively utilizes a low-to-high (upward) exponential sine sweep. However, during a technical Q&A session at the AES Europe 2025 Convention in Warsaw, a question was raised: what are the practical consequences of reversing the sweep direction? This inquiry is particularly relevant given that several industry-standard measurement platforms often employ high-to-low (downward) sweeps to optimize the mechanical and thermal stability of the device under test (DUT) while performing stepped or swept sinusoidal analysis. This paper provides an investigation into the temporal behavior of nonlinearities when the frequency gradient of an exponential sweep is inverted. Through formal mathematical derivation and numerical simulations, the study proves that while the spacing between distortion orders remains identical in magnitude, the polarity and time distribution of these impulses is reversed. Specifically, we demonstrate that in a downward sweep, the distortion products shift from the "pre-causal" negative time region to the "post-causal" positive time region. This shift causes harmonic distortion pulses to emerge within the reverberant tail of the impulse response, leading to significant contamination of decay measurements and energy-time curves. By contrasting the "tracking filter" paradigm with "time-domain deconvolution," this work clarifies why sweep direction is a critical parameter that must be aligned with the specific goals of the measurement protocol.
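The time placement of the distortion products discussed above can be made concrete with a small calculation. This sketch uses the standard relation for an exponential sweep of duration T between a low and high frequency, where the N-th harmonic response is displaced from the linear impulse response by T·ln(N)/ln(f_high/f_low). The sign convention for the downward sweep follows the abstract's statement rather than an independent derivation, and the function name and parameter values are hypothetical.

```python
import math


def harmonic_offsets(duration_s, f_start_hz, f_end_hz, orders=(2, 3, 4, 5)):
    """Signed time offsets (seconds) of harmonic distortion responses
    relative to the linear impulse response after deconvolution.

    Upward sweep (f_end > f_start): offsets are negative ("pre-causal").
    Downward sweep (f_end < f_start): offsets are positive, as stated
    in the abstract above (assumption taken from the text).
    """
    f_low, f_high = sorted((f_start_hz, f_end_hz))
    if f_low <= 0 or f_low == f_high:
        raise ValueError("need two distinct positive frequencies")
    sign = -1.0 if f_end_hz > f_start_hz else 1.0
    log_ratio = math.log(f_high / f_low)
    return {n: sign * duration_s * math.log(n) / log_ratio for n in orders}


# Illustrative values: a 10 s sweep spanning exactly 10 octaves (20 Hz to 20480 Hz),
# so each octave takes 1 s and the 2nd harmonic sits 1 s from the linear response.
up = harmonic_offsets(10.0, 20.0, 20480.0)
down = harmonic_offsets(10.0, 20480.0, 20.0)
for order in sorted(up):
    print(order, round(up[order], 3), round(down[order], 3))
```

With these numbers, a downward sweep would place the second-harmonic response about 1 s after the linear impulse response, which lands inside the decay tail of any room or device with a response longer than that.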
Perceptual models play an important role in balancing data compression and fidelity in audio encoders by exploiting the masking effects of human auditory perception. Deriving suitable masking thresholds requires taking tonality into account. In this study, a novel filter bank is proposed, which uses narrow complex-valued all-pole gammatone filters followed by non-linear spectral spreading. With an appropriate non-linear mapping before spreading, and the inverse non-linear mapping afterwards, differences between the masking strengths of tonal and noise-like maskers can be obtained directly, without explicit tonality estimation. By employing a suitable non-linearity, the level dependency of spectral spreading in the human auditory system can also be modeled. The performance of the proposed approach is evaluated through subjective listening tests, which include comparisons with results obtained using partial spectral flatness measures.
PhD student, International Audio Laboratories Erlangen
Saturday May 30, 2026 1:00pm - 3:00pm CEST Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
The application of binaural cue perception mechanisms to multichannel audio compression technology can reduce spatial parameter redundancy and effectively lower the encoding bitrate. Binaural cues play a critical role in sound source localization, and their frequency-dependent characteristics yield varied perceptual localization effects. However, current understanding of the specific behavior of binaural cues at low frequencies, as well as the similarities and differences between interaural time difference (ITD) and interaural level difference (ILD), remains incomplete. To explore the relationship between ITD-based and ILD-based azimuth perception, this study non-uniformly selected nine ITD values and twelve ILD values within the 300–1480 Hz frequency range to test ITD and ILD perceptual azimuths, respectively. The experimental method involved using fixed binaural cue stimuli while varying the audio with known horizontal azimuth angles to approach the target binaural cue stimulus. Test results indicate that both ITD and ILD perceptual effects are significantly influenced by frequency, with the minimum perceptual azimuth values for both ITD and ILD observed at 700 Hz, suggesting that binaural cue perception azimuths are closer to the median plane at this frequency. Furthermore, surface fitting was applied to the perceptual azimuths of ITD and ILD, revealing relatively similar patterns. Based on the experimental findings, this paper analyzes the perceptual correlation between ITD-based and ILD-based azimuth perception. Applying these data in spatial audio coding contributes to the efficient transmission and fidelity preservation of audio signals. This study provides valuable insights for optimizing binaural cue-based compression techniques, ultimately supporting high-fidelity spatial audio reproduction.
Saturday May 30, 2026 1:00pm - 3:00pm CEST Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark