Schedule as of May 16, 2022 - subject to change

Default Time Zone is CEST - Central European Summer Time
You can change your view to your time zone (look for "Timezone" on the right)


Type: Audio Processing
Thursday, May 28
 

1:30pm CEST

Binaspect: A Python Library for Binaural Audio Analysis, Visualization & Feature Generation
Thursday May 28, 2026 1:30pm - 3:30pm CEST
We present Binaspect, an open-source Python library for
binaural audio analysis, visualization, and feature
generation. Binaspect generates interpretable “azimuth
maps” by calculating modified interaural time and level
difference spectrograms, and clustering those
time-frequency (TF) bins into stable time-azimuth histogram
representations. This allows multiple active sources to
appear as distinct azimuthal clusters, while degradations
manifest as broadened, diffused, or shifted distributions.
Crucially, Binaspect operates blindly on audio, requiring
no prior knowledge of head models. These visualizations
enable researchers and engineers to observe how binaural
cues are degraded by codec and renderer design choices,
among other downstream processes. We demonstrate the tool
on bitrate ladders, ambisonic rendering, and VBAP source
positioning, where degradations are clearly revealed. In
addition to their diagnostic value, the proposed
representations can be exported as structured features
suitable for training machine learning models in quality
prediction, spatial audio classification, and other
binaural tasks. Binaspect is released under an open-source
license with full reproducibility scripts at: (link removed
for blind review)
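
For readers unfamiliar with the two cues named in the abstract, the following is a minimal broadband sketch of interaural level difference (ILD) and interaural time difference (ITD) estimation. It is not Binaspect's API or its "modified" per-bin computation; the function names `ild_db` and `itd_seconds`, the test signal, and all parameter values are assumptions made for illustration only.

```python
import math
import random

def ild_db(left, right):
    """Broadband ILD in dB: positive when the left channel is louder."""
    rms_l = math.sqrt(sum(s * s for s in left) / len(left))
    rms_r = math.sqrt(sum(s * s for s in right) / len(right))
    return 20.0 * math.log10(rms_l / rms_r)

def itd_seconds(left, right, fs, max_lag):
    """ITD from the cross-correlation peak: positive when right lags left."""
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        start = max(0, -lag)
        stop = min(len(left), len(right) - lag)
        score = sum(left[n] * right[n + lag] for n in range(start, stop))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / fs

# Hypothetical test signal: the right ear receives a copy of the left ear
# signal delayed by 5 samples and attenuated by a factor of 2.
fs = 48000
rng = random.Random(0)
left = [rng.uniform(-1.0, 1.0) for _ in range(512)]
right = [0.0] * 5 + [0.5 * s for s in left[:-5]]

ild = ild_db(left, right)
itd = itd_seconds(left, right, fs, max_lag=20)
```

A per-bin variant would apply the same two measurements to each time-frequency bin of a short-time transform rather than to the whole signal.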
Authors
AR

Alessandro Ragano

University College Dublin
DB

Dan Barry

University College Dublin
DS

Davoud Shariat Panah

University College Dublin

Jan Skoglund

Google

Thursday May 28, 2026 1:30pm - 3:30pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:30pm CEST

Lightweight Real-time Spatial Audio Interpolation for Standalone VR using Hand Claps
Thursday May 28, 2026 1:30pm - 3:30pm CEST
Realistic spatial audio consistent with visual information
is essential for providing high immersion in Augmented
Reality (AR) environments. However, conventional
high-precision real-time acoustic simulations require
significant computational power, limiting their
implementation on standalone mobile VR devices such as the
Meta Quest. This study proposes a practical method to
enhance reverb realism using solely a standalone VR HMD,
without the need for additional external equipment. By
measuring impulse responses using a few hand claps in the
physical space, we interpolate room acoustic
parameters, specifically RT60 and early/late energy
ratios, to reflect the environment's unique characteristics.
These extracted parameters are then applied to the VR
engine's built-in reverb effects, enabling dynamic,
location-aware real-time rendering with minimal
computational load. The proposed method demonstrates that a
brief calibration period of 3 to 5 minutes yields
significantly improved realism compared to static reverb
templates, offering an efficient and practical spatial
audio solution for mobile AR environments.
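
As background to the RT60 parameter mentioned above, here is a minimal sketch of one common way to estimate RT60 from a measured impulse response: Schroeder backward integration followed by a line fit over the -5 dB to -25 dB range (a T20 estimate). The abstract does not state which estimator the authors use, so this technique, the function name `schroeder_rt60`, and the synthetic exponential decay standing in for a hand-clap recording are all assumptions.

```python
import math

def schroeder_rt60(ir, fs, lo_db=-5.0, hi_db=-25.0):
    """Estimate RT60 in seconds from an impulse response (T20-style fit)."""
    # Backward-integrate the squared response to get the energy decay curve.
    edc = [0.0] * len(ir)
    total = 0.0
    for i in range(len(ir) - 1, -1, -1):
        total += ir[i] * ir[i]
        edc[i] = total
    # Keep the points whose level (dB re. total energy) lies in the fit range.
    points = []
    for i, energy in enumerate(edc):
        if energy <= 0.0:
            continue
        level = 10.0 * math.log10(energy / edc[0])
        if hi_db <= level <= lo_db:
            points.append((i / fs, level))
    # Least-squares slope of level against time, in dB per second.
    mean_t = sum(t for t, _ in points) / len(points)
    mean_l = sum(l for _, l in points) / len(points)
    slope = (sum((t - mean_t) * (l - mean_l) for t, l in points)
             / sum((t - mean_t) ** 2 for t, _ in points))
    # Extrapolate the fitted slope to a 60 dB decay.
    return -60.0 / slope

# Synthetic, noise-free decay that drops by 60 dB in 0.5 s (hypothetical data).
fs = 8000
true_rt60 = 0.5
k = 3.0 * math.log(10.0) / true_rt60
ir = [math.exp(-k * n / fs) for n in range(fs)]
estimate = schroeder_rt60(ir, fs)
```

A real hand clap is not an ideal impulse and carries background noise, so a practical version would likely need band filtering and noise-floor truncation before the fit.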
Authors
MK

Minsu Kim

Seoul National University
Thursday May 28, 2026 1:30pm - 3:30pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark
 
Friday, May 29
 

9:00am CEST

Voice-Based Fatigue Detection for Military Personnel: A Multi-Modal Machine Learning Framework with Acoustic Feature Emphasis
Friday May 29, 2026 9:00am - 11:00am CEST
This study presents a voice-centered machine learning
framework for detecting mental fatigue in military
personnel, integrating acoustic analysis with physiological
biosensors to enhance detection robustness. Mental fatigue
poses critical safety and performance challenges in
military operations, yet cultural stigma often prevents
self-reporting. We collected multi-modal data from 23
participants across two fatigue states, extracting
comprehensive acoustic features including sound pressure
level (SPL), formants, mel-frequency cepstral coefficients
(MFCCs), jitter, shimmer, harmonic-to-noise ratio (HNR),
and temporal speech characteristics. These voice features
were combined with electroencephalography (EEG),
photoplethysmography (PPG), and temperature data to train
multiple machine learning classifiers. The voice-based
models achieved accuracies between 82% and 85%, with support
vector machines (SVM) and long short-term memory (LSTM)
networks demonstrating superior performance. When acoustic
features were combined with physiological markers,
classification accuracy improved to 92%, with
Classification and Regression Trees (CART) and Linear
Discriminant Analysis (LDA) emerging as top performers.
Statistical analysis identified SPL and formant variance as
the most discriminative voice features, while Lempel-Ziv
Complexity (LZC) and theta/beta ratio proved most reliable
for EEG. Evaluation on new participants yielded 67%
accuracy, revealing model generalization challenges that
inform future research directions. This work demonstrates
that voice-based machine learning systems, when augmented
with physiological data, offer a promising non-invasive
approach to real-time fatigue monitoring in operational
military environments.
Authors
CC

Claire Courchene

Applied Perception Associate Engineer, GN
I’m a creative technologist and interaction designer exploring how sound, technology, and human experience meet. With an MScEng in Sound & Music Computing, I prototype audio interactions, build ML‑driven tools, and design experiments around perception. My background spans music...
Friday May 29, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

9:00am CEST

Exploring Perceptual and Physiological Auditory Models for Assessing Speech Intelligibility in Enhanced Signals
Friday May 29, 2026 9:00am - 11:00am CEST
Current deep learning approaches to speech enhancement rely
heavily on objective measures like mean squared error or
scale-invariant signal-to-distortion ratio as both training
objectives and evaluation metrics. While analytically
convenient, these benchmarks often fail to capture the
nuances of human perception or actual intelligibility.
Furthermore, the inconsistent integration of metrics like
Short-Term Objective Intelligibility or Perceptual
Evaluation of Speech Quality into training and evaluation
pipelines leaves a gap between algorithmic performance and
perceptual reality. This paper proposes a transition
towards evaluation methodologies grounded in
psychoacoustics and audiological modeling. Our study
explores two distinct methods to characterise enhanced
signals. On one hand, we employ a perceptual approach based
on the Cambridge loudness model to assess the preservation
of spectral excitation patterns and perceived intensity. On
the other hand, we adopt a biophysical approach by
utilising CoNNear, a convolutional model of the human
auditory periphery. This allows us to simulate
representations of responses at different stages of the
auditory periphery to observe how speech enhancement
processing affects the physiological representation of
speech. We analyse pre-trained speech enhancement models
using automatic speech recognition and Short-Term Objective
Intelligibility as an additional proxy for human
intelligibility. By mapping automatic speech recognition
performance against loudness and peripheral response
patterns, we investigate the extent to which current
enhancement strategies maintain the perceptual and
physiological integrity of the speech signal. This work
aims to identify features predictive of intelligibility,
providing a foundation for speech enhancement systems
optimised for the human listener rather than purely
signal-based objective functions.
Authors
FE

François Effa

Université de Lorraine, CNRS, Inria, Loria, Nancy, France
RS

Romain Serizel

LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Friday May 29, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

9:00am CEST

Objective Quality Models for Decision-Making in Speech Coding
Friday May 29, 2026 9:00am - 11:00am CEST
Objective quality evaluation is widely used in speech
coding, yet objective estimates often show limited
agreement with subjective listening-test results. Rather
than focusing on absolute score accuracy, this paper
evaluates objective speech quality models from a
decision-making perspective, defined as their ability to
support comparative judgments between speech codecs or
codec configurations. A formal ITU-T P.800 Absolute
Category Rating (ACR) listening test was conducted with 30
listeners across 24 conditions, covering conventional and
neural monophonic speech codecs operating under
clear-channel conditions at sampling frequencies from 16 to
48 kHz and bit rates ranging from below 1 kbps to above 16
kbps. The speech material consisted of internally recorded,
clean French-language speech that was not used in the
development or training of any of the evaluated codecs or
objective quality models. Seven objective quality models,
namely PESQ, VISQOL Speech, VISQOL Audio, WARP-Q, NISQA,
UTMOS, and DistillMOS, were evaluated on the same material.
Decision-making performance was assessed by comparing
subjective and objective rankings using Kendall’s rank
correlation coefficient and by analyzing pairwise codec
comparisons using t-tests at a 95% confidence level. The
results show that some objective quality models are
effective for comparing bit rate variations within a given
speech coding technology, provided that all other codec
parameters remain unchanged (e.g., sampling frequency).
However, all models exhibit limitations, including
tendencies toward over- or underestimation for certain
technologies, as well as reduced reliability when applied
across different sampling frequencies. Despite its
conventional origins, PESQ remains capable of supporting
decision-making even when applied to neural speech codecs.
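
To make the ranking comparison concrete, the following is a minimal sketch of Kendall's rank correlation (the tau-a variant, which ignores tie corrections) between subjective and objective scores. The abstract does not say which tau variant was used, and the function name `kendall_tau_a` and the four score pairs are hypothetical, not data from the paper.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1   # both measures order this pair the same way
        elif sign < 0:
            discordant += 1   # the two measures disagree on this pair
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs

# Hypothetical per-condition scores for four codec conditions.
subjective_mos = [1.8, 2.6, 3.4, 4.1]
objective_score = [1.5, 3.0, 2.9, 4.2]

tau = kendall_tau_a(subjective_mos, objective_score)
```

Here five of the six condition pairs are ordered consistently and one is swapped, so tau is (5 - 1) / 6. A value of 1 would mean the objective model reproduces every pairwise decision of the listening test.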
Authors
CL

Clémence Lamballe

Universite de Sherbrooke
PG

Philippe Gournay

Universite de Sherbrooke
RL

Roch Lefebvre

Universite de Sherbrooke
Friday May 29, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

9:00am CEST

The Ambisonic Denoising Paradox: U-Net Processing Degrades ASR Transcription Quality for Medical Speech
Friday May 29, 2026 9:00am - 11:00am CEST
Spatial audio recording using higher-order Ambisonics
offers rich directional information for medical speech
capture, yet challenging hospital acoustic environments
motivate preprocessing with neural denoising algorithms.
This study investigates whether U-Net-based denoising of
third-order ambisonic recordings improves automatic speech
recognition (ASR) quality for medical applications. We
developed the Medical Immersive Audio Corpus (MIAC),
comprising 1,759 utterances (6.43 hours) of Polish medical
speech recorded with a Zylia ZM-1 microphone in
uncontrolled hospital environments, capturing 16-channel
third-order Ambisonics across multiple specializations
including thyroid ultrasonography, surgical procedures, and
general diagnostics. We applied a U-Net architecture with
dual attention mechanisms trained using the Noise2Noise
paradigm to denoise the corpus, then evaluated
transcription quality using ten Whisper ASR models ranging
from 39 million to 1.55 billion parameters, including
domain-adapted medical variants. Surprisingly, we
discovered a "noise reduction paradox" where denoising
degraded transcription quality for seven of ten models,
with statistically significant increases in Word Error Rate
(WER) and Character Error Rate (CER) for general-purpose
base, small, and medium models. Only the domain-adapted
whisper-medium-68000-abbr model showed statistically
significant improvement (p=0.0008), while large-scale
models (large-v2, large-v3) exhibited robustness with
negligible changes. Effect sizes remained small (Cohen's d
< 0.2) across all models. These counterintuitive findings
suggest modern ASR systems implicitly utilize background
noise characteristics as informative features, and that
preprocessing pipelines should be reconsidered for
domain-specific applications. Our results provide practical
guidance for medical speech processing system design.
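
For reference, Word Error Rate is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below is a plain implementation of that definition, not the authors' evaluation pipeline; the function name `word_error_rate` and the example sentences are hypothetical (the corpus itself is Polish).

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)

# One substitution ("has" -> "had") and one deletion ("a") over five words.
wer = word_error_rate("the patient has a fever", "the patient had fever")
```

Character Error Rate follows the same recurrence with characters in place of words.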
Authors

Bartlomiej Mroz

Assistant Professor, Gdańsk University of Technology
PhD, Spatial Audio & Immersive Media Researcher, Recording Engineer, Statistics enthusiast
SZ

Szymon Zaporowski

Gdańsk University of Technology
Friday May 29, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:00pm CEST

Geometry Sensitivity in Low-Count Virtual Microphone Arrays: From Tetrahedral Baselines to Stochastic Spherical Layouts
Friday May 29, 2026 1:00pm - 3:00pm CEST
Virtual Microphone Array techniques are being investigated
by the authors to support room acoustics optimisation in
live sound environments. In our recent AES paper, “Room
Acoustics Optimisation Using Virtual Microphone Arrays”, a
notable outcome was that a compact four-microphone
tetrahedral array performed strongly relative to its low
sensor count. Recent virtual sensing and Remote Microphone
Technique research treats microphone placement as an
explicit design variable. It reports improved remote
estimation performance when microphone layouts are
deliberately chosen for the task, rather than adopted as
fixed, standard configurations.
This submission builds on our prior VMA work by focusing on
the four-microphone case, where geometry choices are
especially constrained. We compare a tetrahedral baseline
with an ensemble of stochastically generated spherical
layouts at the same array aperture using Monte Carlo
simulation. We apply a consistent evaluation protocol
across multiple listening-region offsets and standard
beamforming estimators to isolate variability due to
geometry alone. The central proposition is that, for
low-count VMAs, geometry is a first-order design parameter.
Tetrahedral remains a credible baseline, but lightweight
stochastic exploration can reveal alternative layouts that
are competitive and, in some cases, superior without
increasing channel count.
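
The layout-generation step described above can be sketched as follows: a regular tetrahedron and an ensemble of random four-point layouts, all on a sphere of the same radius. Treating "same array aperture" as "same sphere radius" is an assumption, as are the function names, the 5 cm radius, and the ensemble size. The sketch only generates geometries; it does not perform the beamforming evaluation.

```python
import math
import random

def tetrahedral_layout(radius):
    """Vertices of a regular tetrahedron inscribed in a sphere."""
    base = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
    scale = radius / math.sqrt(3.0)
    return [(x * scale, y * scale, z * scale) for x, y, z in base]

def random_spherical_layout(radius, count, rng):
    """Points drawn uniformly on a sphere by normalising Gaussian vectors."""
    points = []
    while len(points) < count:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:  # skip the (vanishingly rare) near-zero draw
            points.append(tuple(radius * c / norm for c in v))
    return points

def min_spacing(layout):
    """Smallest distance between any two microphones in the layout."""
    return min(math.dist(a, b)
               for i, a in enumerate(layout) for b in layout[i + 1:])

rng = random.Random(0)          # fixed seed so the ensemble is reproducible
radius = 0.05                   # hypothetical 5 cm array radius
tetra = tetrahedral_layout(radius)
ensemble = [random_spherical_layout(radius, 4, rng) for _ in range(200)]
```

Each layout in `ensemble` would then be scored with the same estimator and target positions as the tetrahedral baseline, so that any spread in the results is attributable to geometry.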
Authors

Brian de Brit

Lecturer, Technological University Dublin
Brian de Brit is a lecturer in the School of Electrical and Electronic Engineering at Technological University Dublin. He holds a B.Sc. in Mathematical Physics (University College Dublin), an M.Phil. in Music and Media Technologies (Trinity College Dublin), and a Master of Engineering...
DD

David Dorran

Technological University Dublin
Friday May 29, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:00pm CEST

Clustered Virtual Microphone Arrays for Listener-Level Monitoring and Room-Correction in Live Sound
Friday May 29, 2026 1:00pm - 3:00pm CEST
This paper introduces clustered virtual microphone arrays
as a step toward improving listener-level virtual
microphone estimation for live sound. Multiple compact
microphone sub-arrays are placed around a nominal overhead
position. Each sub-array produces a virtual microphone
estimate, and the estimates are fused. The aim is to attack
the estimation problem from multiple viewpoints and reduce
sensitivity to any one array placement or geometry.
The work builds on our earlier paper, “Room Acoustics
Optimisation Using Virtual Microphone Arrays”. That paper
proposed virtual microphones estimated from an overhead
array as a measurement layer for live sound optimisation.
It also highlighted a key limitation: in its initial form,
virtual microphone estimation quality was not yet strong
enough for reliable use across positions. The present paper
targets that limitation. We outline the clustered array
idea and treat cluster count and inter-cluster spacing as
design parameters. Virtual microphones are estimated using
beamforming and combined using simple fusion. Performance
is assessed with objective signal measures, including SNR
and frequency- and phase-related error measures, across
multiple listener-level target positions. The results
support further refinement under more realistic room
conditions and further study of the link between improved
estimation quality and FIR-based correction outcomes.
Authors

Brian de Brit

Lecturer, Technological University Dublin
DD

David Dorran

Technological University Dublin
Friday May 29, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:00pm CEST

A Time–Frequency Integrated Framework for Frequency-Invariant Beamforming in Loudspeaker Arrays
Friday May 29, 2026 1:00pm - 3:00pm CEST
Loudspeaker array beamforming technology has been widely
used; however, current frequency-domain and time-domain
design methods for calculating FIR filters face challenges,
including the need for modeling delay and high
computational complexity. To address these issues, this
paper proposes a time–frequency integrated framework. This
framework supports both pressure matching and amplitude
matching methods, enabling not only the realization of
traditional superdirective beams but also the design of
frequency-invariant beams. For the nonlinear optimization
problem in amplitude matching, an efficient solving
algorithm based on the Alternating Direction Method of
Multipliers (ADMM) is introduced. Experimental results
demonstrate that the proposed method combines the
advantages of existing frequency-domain and time-domain
approaches, directly computing FIR filter coefficients
without delay modeling while maintaining high computational
efficiency. This provides an effective solution for beam
control in loudspeaker arrays.
Authors
JY

Jianbin Yang

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark
KP

Keyu Pan

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark
NC

Ning Cong

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark
XT

Xing Tian

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark
Friday May 29, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:00pm CEST

The Impact of Frequency Gradient on Nonlinear Pulse Distribution in the Farina Technique
Friday May 29, 2026 1:00pm - 3:00pm CEST
The Exponential Sine Sweep (ESS) technique, popularized by
Angelo Farina, has become a cornerstone of modern
electroacoustic measurement due to its unique capability to
simultaneously extract a system’s linear impulse response
and its individual harmonic distortion components. Standard
implementation of this method almost exclusively utilizes a
low-to-high (upward) exponential sine sweep. However,
during a technical Q&A session at the AES Europe 2025
Convention in Warsaw, a question was raised: what are the
practical consequences of reversing the sweep direction?
This inquiry is particularly relevant given that several
industry-standard measurement platforms often employ
high-to-low (downward) sweeps to optimize the mechanical
and thermal stability of the device under test (DUT) while
performing stepped or swept sinusoidal analysis.
This paper provides an investigation into the temporal
behavior of nonlinearities when the frequency gradient of
an exponential sweep is inverted. Through formal
mathematical derivation and numerical simulations, the study
proves that while the spacing between distortion orders
remains identical in magnitude, the polarity and time
distribution of these impulses is reversed. Specifically,
we demonstrate that in a downward sweep, the distortion
products shift from the "pre-causal" negative time region
to the "post-causal" positive time region. This shift
causes harmonic distortion pulses to emerge within the
reverberant tail of the impulse response, leading to
significant contamination of decay measurements and
energy-time curves. By contrasting the "tracking filter"
paradigm with "time-domain deconvolution," this work
clarifies why sweep direction is a critical parameter that
must be aligned with the specific goals of the measurement
protocol.
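
The sign reversal described above follows from the standard exponential-sweep relation: for a sweep of duration T from f_start to f_end, the N-th harmonic lands at a time offset of -T ln(N) / ln(f_end / f_start) relative to the linear impulse response. The sketch below evaluates that textbook expression for both sweep directions; it is not the paper's derivation, and the function name `harmonic_offsets` and the sweep parameters are hypothetical.

```python
import math

def harmonic_offsets(duration, f_start, f_end, orders):
    """Time offset (s) of each harmonic order relative to the linear response.

    Negative values fall before the linear impulse response, positive
    values fall after it (inside the decay tail).
    """
    # Sweep rate constant: positive for an upward sweep, negative for downward.
    rate = duration / math.log(f_end / f_start)
    return {n: -rate * math.log(n) for n in orders}

# Hypothetical 10 s sweeps covering 20 Hz to 20 kHz in each direction.
upward = harmonic_offsets(10.0, 20.0, 20000.0, orders=[2, 3, 4])
downward = harmonic_offsets(10.0, 20000.0, 20.0, orders=[2, 3, 4])
```

With these values the second-harmonic offset has a magnitude of about 1 s in both cases, negative for the upward sweep and positive for the downward one, which is the placement inside the reverberant tail that the abstract warns about.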
Authors

Daniele Ponteggia

Materiacustica Srl
Friday May 29, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark
 
Saturday, May 30
 

1:00pm CEST

Investigations on Nonlinearity in a Gammatone Filter Bank Based Perceptual Model
Saturday May 30, 2026 1:00pm - 3:00pm CEST
Perceptual models play an important role in
effectively balancing data compression and fidelity in
audio encoders by leveraging the masking effects in human
auditory perception. For deriving suitable masking
thresholds, considering tonality is important. In this
study, a novel filter bank is proposed, which uses narrow
complex-valued all-pole gammatone filters followed by a
non-linear spectral spreading processing. With an
appropriate non-linear mapping before spreading, and
inverse non-linear mapping afterwards, differences between
masking strengths of tonal and noise-like maskers can be
directly obtained without explicit tonality estimation.
By employing a suitable non-linearity, level-dependency of
spectral spreading in the human auditory system can also be
modeled. The performance of the proposed approach is
evaluated through subjective listening tests, which include
comparisons with results obtained using partial spectral
flatness measures.
Authors
BE

Bernd Edler

International Audio Laboratories Erlangen, Germany
FS

Fabian Schaller

Fraunhofer IIS, Erlangen, Germany
PE

Paul Emil Meier

International Audio Laboratories Erlangen

PS

Paula Schäfer

Fraunhofer-Institut für Integrierte Schaltungen IIS
YH

Yaqiong Hou

PhD student, International Audio Laboratories Erlangen
Saturday May 30, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark
  Audio Processing, Poster

1:00pm CEST

Measurement and Analysis of Perceptual Characteristics of Binaural Cues
Saturday May 30, 2026 1:00pm - 3:00pm CEST
The application of binaural cue perception mechanisms to
multichannel audio compression technology can reduce
spatial parameter redundancy and effectively lower the
encoding bitrate. Binaural cues play a critical role in
sound source localization, and their frequency-dependent
characteristics yield varied perceptual localization
effects. However, current understanding of the specific
behavior of binaural cues at low frequencies, as well as
the similarities and differences between interaural time
difference (ITD) and interaural level difference (ILD),
remains incomplete. To explore the relationship between
ITD-based and ILD-based azimuth perception, this study
non-uniformly selected nine ITD values and twelve ILD
values within the 300–1480 Hz frequency range to test ITD
and ILD perceptual azimuths, respectively. The experimental
method involved using fixed binaural cue stimuli while
varying the audio with known horizontal azimuth angles to
approach the target binaural cue stimulus. Test results
indicate that both ITD and ILD perceptual effects are
significantly influenced by frequency, with the minimum
perceptual azimuth values for both ITD and ILD observed at
700 Hz, suggesting that binaural cue perception azimuths
are closer to the median plane at this frequency.
Furthermore, surface fitting was applied to the perceptual
azimuths of ITD and ILD, revealing relatively similar
patterns. Based on experimental findings, this paper
analyzes the explorable perceptual correlation between
ITD-based and ILD-based azimuth perception. The application
of data in spatial audio coding contributes to the
efficient transmission and fidelity preservation of audio
signals. This study provides valuable insights for
optimizing binaural cue-based compression techniques,
ultimately supporting high-fidelity spatial audio
reproduction.
Authors
HW

Heng Wang

Wuhan Polytechnic University
MG

Mingyan Gao

Wuhan Polytechnic University
YX

Yiming Xu

Wuhan Polytechnic University, Wuhan, China
Saturday May 30, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark
 

