Schedule as of May 16, 2022 - subject to change

Default Time Zone is CEST - Central European Summer Time
You can change your view to your time zone (look for "Timezone" on the right)


 
Subject: Poster
Thursday, May 28
 

1:30pm CEST

Binaspect: A Python Library for Binaural Audio Analysis, Visualization & Feature Generation
Thursday May 28, 2026 1:30pm - 3:30pm CEST
We present Binaspect, an open-source Python library for
binaural audio analysis, visualization, and feature
generation. Binaspect generates interpretable “azimuth
maps” by calculating modified interaural time and level
difference spectrograms, and clustering those
time-frequency (TF) bins into stable time-azimuth histogram
representations. This allows multiple active sources to
appear as distinct azimuthal clusters, while degradations
manifest as broadened, diffused, or shifted distributions.
Crucially, Binaspect operates blindly on audio, requiring
no prior knowledge of head models. These visualizations
enable researchers and engineers to observe how binaural
cues are degraded by codec and renderer design choices,
among other downstream processes. We demonstrate the tool
on bitrate ladders, ambisonic rendering, and VBAP source
positioning, where degradations are clearly revealed. In
addition to their diagnostic value, the proposed
representations can be exported as structured features
suitable for training machine learning models in quality
prediction, spatial audio classification, and other
binaural tasks. Binaspect is released under an open-source
license with full reproducibility scripts at: (link removed
for blind review)
Authors
Alessandro Ragano

University College Dublin

Dan Barry

University College Dublin

Davoud Shariat Panah

University College Dublin

Jan Skoglund

Google

Thursday May 28, 2026 1:30pm - 3:30pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:30pm CEST

Lightweight Real-time Spatial Audio Interpolation for Standalone VR using Hand Claps
Thursday May 28, 2026 1:30pm - 3:30pm CEST
Realistic spatial audio consistent with visual information
is essential for providing high immersion in Augmented
Reality (AR) environments. However, conventional
high-precision real-time acoustic simulations require
significant computational power, limiting their
implementation on standalone mobile VR devices such as the
Meta Quest. This study proposes a practical method to
enhance reverb realism using solely a standalone VR HMD,
without the need for additional external equipment. By
measuring impulse responses using a few hand claps in the
physical space, we interpolate room acoustic
parameters, specifically RT60 and early/late energy
ratios, to reflect the environment's unique characteristics.
These extracted parameters are then applied to the VR
engine's built-in reverb effects, enabling dynamic,
location-aware real-time rendering with minimal
computational load. The proposed method demonstrates that a
brief calibration period of 3 to 5 minutes yields
significantly improved realism compared to static reverb
templates, offering an efficient and practical spatial
audio solution for mobile AR environments.
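As a rough illustration of how a reverberation time can be read from a measured impulse response, the sketch below uses Schroeder backward integration followed by a straight-line fit to the decay curve. This is a textbook approach and not necessarily the method used in the poster. The function names, the -5 to -25 dB fit range, and the synthetic impulse response are assumptions made for the example; a real hand clap is only an approximate impulse and would need band filtering and noise handling first.

```python
import math


def schroeder_decay_db(impulse_response):
    """Return the energy decay curve in dB (Schroeder backward integration)."""
    energy = [sample * sample for sample in impulse_response]
    total = sum(energy)
    if total <= 0.0:
        raise ValueError("impulse response carries no energy")
    curve = []
    remaining = total
    for value in energy:
        # Energy still to arrive from this sample onward, relative to the total.
        if remaining > 0.0:
            curve.append(10.0 * math.log10(remaining / total))
        else:
            curve.append(float("-inf"))
        remaining -= value
    return curve


def estimate_rt60(impulse_response, sample_rate, start_db=-5.0, end_db=-25.0):
    """Fit a line to the decay between start_db and end_db, extrapolate to 60 dB."""
    curve = schroeder_decay_db(impulse_response)
    points = [(n / sample_rate, level) for n, level in enumerate(curve)
              if end_db <= level <= start_db]
    if len(points) < 2:
        raise ValueError("decay range too short for a line fit")
    mean_t = sum(t for t, _ in points) / len(points)
    mean_l = sum(l for _, l in points) / len(points)
    # Least-squares slope of level over time, in dB per second.
    slope = (sum((t - mean_t) * (l - mean_l) for t, l in points)
             / sum((t - mean_t) ** 2 for t, _ in points))
    return -60.0 / slope


# Hypothetical stand-in for a clap recording: an envelope that drops 60 dB in 0.5 s.
SAMPLE_RATE = 8000
synthetic_ir = [10.0 ** (-3.0 * (n / SAMPLE_RATE) / 0.5) for n in range(12000)]
print(round(estimate_rt60(synthetic_ir, SAMPLE_RATE), 3))
```

Fitting over a limited range and extrapolating to 60 dB is a common choice when the recording's noise floor prevents observing the full decay.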
Authors

Minsu Kim

Seoul National University
Thursday May 28, 2026 1:30pm - 3:30pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:30pm CEST

Perceptual Evaluation of the MPEG-I Immersive Audio Standard
Thursday May 28, 2026 1:30pm - 3:30pm CEST
The recently finalized ISO international standard (IS) on
MPEG-I immersive audio enables interactive
six-degrees-of-freedom (6DoF) audio rendering for a
multitude of virtual-reality and augmented-reality (VR/AR)
acoustic scenarios and applications with comprehensive
modeling of room acoustics and intricate acoustic
phenomena, including e.g. occlusion, reflection,
transmission and diffraction caused by sound obstacles,
Doppler effect, and dynamic environment changes triggered
by user interactivity. This paper describes the concept,
methodology and results of the final verification test of
this standard. In the verification test, the perceptual
quality of the renderer was assessed in an interactive
listening test using different in- and outdoor acoustic
scenes, testing the above-mentioned features of the
standard. More than 50 listeners participated in the test
distributed across six labs using the ITU‑R BS.2132 [1]
multi‑stimulus method on a 100‑point scale for three
conditions (IS, mid- and low anchor) in 10 VR scenes plus
two repetitions. The results of several anchor processing
configurations are presented. The selected mid and low
anchors have demonstrated stable quality across diverse
scenes with progressive timbre and spatial degradations.
The listening test results show a clear separation of the
conditions (IS > mid > low); the low anchor was stable
(around 16 points median value) while the mid anchor varied
by scene (around 47 points). The IS is rated with a median
of 84 points among all labs, which is the “excellent”
region of the scale. Ratings differ between individual
scenes, and the quartile range can span 20 points for some
scenes. The median value for the IS also varied between
labs, with some being a bit more critical than others.
Authors
Andreas Silzle

Fraunhofer IIS
Germany

Leon Terentiv

Dolby
Germany

Pablo Delgado

Fraunhofer IIS
Erlangen, DE

Sam Jelfs

Philips

Sascha Disch

Fraunhofer IIS
Sascha Disch received his Dipl.-Ing. degree in electrical engineering from the Technical University Hamburg-Harburg (TUHH) in 1999 and joined the Fraunhofer Institute for Integrated Circuits (IIS) the same year. Ever since he has been working in research and development of perceptual...
Thursday May 28, 2026 1:30pm - 3:30pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:30pm CEST

Can the individual winner HRTFs be determined in a shooting task during onboarding for an Audio Only VR?
Thursday May 28, 2026 1:30pm - 3:30pm CEST
The significance of individual versus generic HRTFs in
Virtual Audio can be difficult to ascertain given the
variety of scenarios and tasks related to the spatial
listening experience. Are we working on the most
significant 80% of the success or fine-tuning the last 5%
of the sound quality? When the VR users are blind it is
fair to assume that the quality of the spatial audio
becomes a critical and more important factor. This is the
challenge as we see it. In the present project, we will
investigate options for powerful game components relying on
spatialized sound, using effects that are natural for the
blind gamer. As a first step, we have implemented a test
platform, where different options for HRTFs will exist, and
where the on-boarding process shall reveal the optimal
solution for the given user. The test scenario is inspired
by a “classical” shooting down sound sources scenario,
where we will vary e.g. the task definition, success
criteria (hit zone, number of attempts and elapsed time) as
well as eavesdropping game internal parameters of more
complex nature (e.g. navigation trajectories). The results
will display the variation in normal seeing listeners and
produce normative data for later comparisons with blind
participants. The platform also includes options for simple
mirror-image room models and standardized reverberation,
which will be used in later tests to learn whether the
room acoustics may play a stronger role for the blind
gamers’ navigation and source identification than for
normal seeing listeners.
Authors
Dorte Hammershøi

Professor, Acoustics and Hearing, AI and Sound, Department of Electronic Systems, Aalborg University

Flemming Christensen

Acoustics and Hearing, AI and Sound, Department of Electronic Systems, Aalborg University

Max Væhrens

PhD Fellow, Acoustics and Hearing, AI and Sound, Department of Electronic Systems, Aalbor...

Stefania Serafin

Department of Engineering Technology and Didactics, Technical University of Denmark
Thursday May 28, 2026 1:30pm - 3:30pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark
  Immersive Audio, Poster

1:30pm CEST

Exploiting Source Directivity for Robust Asymmetric Crosstalk Cancellation
Thursday May 28, 2026 1:30pm - 3:30pm CEST
This study investigates the relationship between the
robustness of crosstalk cancellation and the symmetry of
system configuration. Analytical results show that, when
the positions of the sound sources are fixed, increasing
asymmetry caused by deviations in the listener’s head
position or orientation leads to a reduction in system
robustness, whereas optimal performance is consistently
achieved in symmetric layouts. For asymmetric
configurations, we propose a method to optimize the axial
angles of the sound sources. This method leverages source
directivity patterns to adjust level differences along the
acoustic propagation paths, thereby improving system
robustness. Experiments confirm the effectiveness of the
proposed method in asymmetric crosstalk cancellation
systems, demonstrating enhanced robustness and yielding
higher binaural channel separation under slight listener
head movements.
Authors
Jianbin Yang

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark

Keyu Pan

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark

Ning Cong

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark

Xing Tian

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark
Thursday May 28, 2026 1:30pm - 3:30pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark
  Immersive Audio, Poster

1:30pm CEST

Capturing Immersive Sound in Concert Halls: A Comparative Analysis of PCMA-3D and Decca Cuboid Recording Techniques
Thursday May 28, 2026 1:30pm - 3:30pm CEST
This paper presents a comparative analysis of two immersive
recording techniques for classical music: the PCMA-3D
(Perspective Control Microphone Array) and the Decca
Cuboid. While the Decca Cuboid relies primarily on
time-of-arrival differences to generate spatial
impressions, the PCMA-3D utilises intensity differences and
separates ambience from direct sound. A recording session
was conducted in a concert hall using a classical guitar
soloist and two distinct folk music ensembles to capture
performances simultaneously with both arrays. Subjective
evaluation was performed using a MUSHRA listening test with
18 participants, assessing parameters such as sensation of
space, localisation precision, and sound quality.
Statistical analysis reveals that while both systems
provide high-quality immersive experiences, the PCMA-3D
scored significantly higher in the sensation of space (p
Authors

Zechen Wang

University of York
Thursday May 28, 2026 1:30pm - 3:30pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark
 
Friday, May 29
 

9:00am CEST

Music generation model based on global emotional feature perception
Friday May 29, 2026 9:00am - 11:00am CEST
The rapid development of artificial intelligence
composition technology has brought innovation to music
creation. However, current deep learning music generation
models often neglect the global correlation of emotional
features, resulting in fragmented emotional expression in
generated works and insufficient alignment with human
emotional perception, making it difficult to meet the core
demand for emotional conveyance in diverse music creation.
This study aims to propose a music generation method that
integrates a global perception mechanism for emotional
features. Taking the EMOPIA and VGMIDI preprocessed
datasets as the research objects, an improved model based
on EMelodyGen (EMelodyGen-PPO) is constructed: a GLU
network layer is introduced in the feature extraction stage
to enhance the model's ability to filter and represent
emotion-related features; an improved PPO-Clip algorithm is
integrated in the training process, and a multi-dimensional
emotional reward function is designed to achieve global
dynamic perception and optimization of emotional features.
Experimental results show that the music21 parsing rate of
the EMelodyGen-PPO model on the target dataset is 3% and 4%
higher than that of the baseline model, respectively. An
automated quality assessment system based on fluency,
rhythm stability, harmony richness, melodic smoothness, and
structural integrity verifies that the comprehensive score
of the model's generated works is significantly better than
that of the comparative model. This study provides an
efficient technical path for emotion-oriented music
generation, which can empower grassroots cultural workers
and independent musicians at low cost, facilitate diverse
music creation practices and emotional audio content
dissemination, and align with the diversity and innovative
development concept of the AES audio community.
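For readers unfamiliar with PPO-Clip, the following minimal sketch shows the standard clipped surrogate objective for a single sample. It illustrates the generic algorithm only; the abstract does not specify what the "improved" variant changes, and the function name and the default clipping range of 0.2 are assumptions made for this example.

```python
def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Standard PPO-Clip surrogate for one sample.

    ratio: new-policy probability divided by old-policy probability.
    advantage: estimated advantage of the sampled action.
    epsilon: clipping range (0.2 is a commonly used default, assumed here).
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any incentive to push the ratio outside the
    # clipping range in the direction that would increase the objective.
    return min(ratio * advantage, clipped_ratio * advantage)


def mean_ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Average the clipped surrogate over a batch of samples."""
    if len(ratios) != len(advantages) or not ratios:
        raise ValueError("ratios and advantages must be non-empty and equal in length")
    values = [ppo_clip_objective(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(values) / len(values)
```

In a reward-driven setup such as the one described, the advantage would be derived from the reward function; how that is done is not stated in the abstract.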
Authors
Chen Li

Wuhan Polytechnic University

Heng Wang

Wuhan Polytechnic University

Lingzhi Chen

Wuhan Polytechnic University

Mingyan Gao

Wuhan Polytechnic University

Xueting Wang

Wuhan Polytechnic University
Friday May 29, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

9:00am CEST

A General Model for Deepfake Speech Detection: Diverse Bonafide Resources or Diverse AI-Based Generators
Friday May 29, 2026 9:00am - 11:00am CEST
In this paper, we analyze two main factors of Bonafide
Resource (BR) or AI-based Generator (AG) which affect the
performance and the generality of a Deepfake Speech
Detection (DSD) model. To this end, we first propose a
deep-learning based model, referred to as the baseline.
Then, we conducted experiments on the baseline by which
we indicate how Bonafide Resource (BR) and AI-based
Generator (AG) factors affect the threshold score used to
detect fake or bonafide input audio in the inference
process. Given the experimental results, a dataset, which
re-uses public Deepfake Speech Detection (DSD) datasets and
shows a balance between Bonafide Resource (BR) or AI-based
Generator (AG), is proposed. We then train various
deep-learning based models on the proposed dataset and
conduct cross-dataset evaluation on different benchmark
datasets. The cross-dataset evaluation results prove that
the balance of Bonafide Resources (BR) and AI-based
Generators (AG) is the key factor to train and achieve a
general Deepfake Speech Detection (DSD) model.
Authors
Dat Tran

FPT University

David Fischinger

Austrian Institute of Technology

Davide Antonutti

Austrian Institute of Technology

Ian McLoughlin

Singapore Institute of Technology

Khoi Vu

FPT University

Lam Pham

Austrian Institute of Technology

Marcel Hasenbalg

Austrian Institute of Technology

Martin Boyer

Austrian Institute of Technology

Simon Freitter

Austrian Institute of Technology

Friday May 29, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

9:00am CEST

Semantic Audio Encoders from EQ Parameters Alone: Effects of Training Data Composition on Limited-Data Learning
Friday May 29, 2026 9:00am - 11:00am CEST
We investigate how training data composition influences
semantic audio encoders that learn perceptual descriptors
such as "warm," "bright," and "muddy" from equalization
(EQ) parameter datasets without labeled audio examples.
Using the SAFE-DB dataset of 1,369 labeled EQ settings, we
train audio encoders via an inverse problem formulation in
which labeled EQ parameters are applied to source audio and
the encoder is trained to recognize the resulting semantic
characteristics. Three training configurations are
compared, varying both class sampling strategy (uniform
versus balanced) and source audio type (pink noise versus
real music). Despite severe class imbalance in SAFE-DB,
where 76 percent of examples are labeled "bright" or
"warm," balanced class sampling combined with mixed-source
training (50 percent pink noise and 50 percent FMA music)
successfully learns physically meaningful semantic-spectral
relationships: "warm" and "muddy" show negative correlation
with spectral centroid (r = -0.56), while "bright" and
"thin" show positive correlation (r = +0.49). However,
prediction confidence decreases substantially (from 0.96 to
0.76 to 0.86), and top-1 predictions remain dominated by
the "bright" class across all evaluated music genres,
reflecting inherent dataset bias rather than training
failure. These results demonstrate that training data
composition significantly affects model calibration but
cannot fully overcome fundamental bias in the underlying
label distribution, highlighting key challenges for
semantic audio understanding systems.
Authors
Friday May 29, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

9:00am CEST

Voice-Based Fatigue Detection for Military Personnel: A Multi-Modal Machine Learning Framework with Acoustic Feature Emphasis
Friday May 29, 2026 9:00am - 11:00am CEST
This study presents a voice-centered machine learning
framework for detecting mental fatigue in military
personnel, integrating acoustic analysis with physiological
biosensors to enhance detection robustness. Mental fatigue
poses critical safety and performance challenges in
military operations, yet cultural stigma often prevents
self-reporting. We collected multi-modal data from 23
participants across two fatigue states, extracting
comprehensive acoustic features including sound pressure
level (SPL), formants, mel-frequency cepstral coefficients
(MFCCs), jitter, shimmer, harmonic-to-noise ratio (HNR),
and temporal speech characteristics. These voice features
were combined with electroencephalography (EEG),
photoplethysmography (PPG), and temperature data to train
multiple machine learning classifiers. The voice-based
models achieved accuracies of 82-85%, with support
vector machines (SVM) and long short-term memory (LSTM)
networks demonstrating superior performance. When acoustic
features were combined with physiological markers,
classification accuracy improved to 92%, with
Classification and Regression Trees (CART) and Linear
Discriminant Analysis (LDA) emerging as top performers.
Statistical analysis identified SPL and formant variance as
the most discriminative voice features, while Lempel-Ziv
Complexity (LZC) and theta/beta ratio proved most reliable
for EEG. Evaluation on new participants yielded 67%
accuracy, revealing model generalization challenges that
inform future research directions. This work demonstrates
that voice-based machine learning systems, when augmented
with physiological data, offer a promising non-invasive
approach to real-time fatigue monitoring in operational
military environments.
Authors

Claire Courchene

Applied Perception Associate Engineer, GN
I’m a creative technologist and interaction designer exploring how sound, technology, and human experience meet. With an MScEng in Sound & Music Computing, I prototype audio interactions, build ML‑driven tools, and design experiments around perception. My background spans music...
Friday May 29, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

9:00am CEST

Exploring Perceptual and Physiological Auditory Models for Assessing Speech Intelligibility in Enhanced Signals
Friday May 29, 2026 9:00am - 11:00am CEST
Current deep learning approaches to speech enhancement rely
heavily on objective measures like mean squared error or
scale-invariant signal-to-distortion ratio as both training
objectives and evaluation metrics. While analytically
convenient, these benchmarks often fail to capture the
nuances of human perception or actual intelligibility.
Furthermore, the inconsistent integration of metrics like
Short-Term Objective Intelligibility or Perceptual
Evaluation of Speech Quality into training and evaluation
pipelines leaves a gap between algorithmic performance and
perceptual reality. This paper proposes a transition
towards evaluation methodologies grounded in
psychoacoustics and audiological modeling. Our study
explores two distinct methods to characterise enhanced
signals. On one hand, we employ a perceptual approach based
on the Cambridge loudness model to assess the preservation
of spectral excitation patterns and perceived intensity. On
the other hand, we adopt a biophysical approach by
utilising CoNNear, a convolutional model of the human
auditory periphery. This allows us to simulate
representations of responses at different stages of the
auditory periphery to observe how speech enhancement
processing affects the physiological representation of
speech. We analyse pre-trained speech enhancement models
using automatic speech recognition and Short-Term Objective
Intelligibility as an additional proxy for human
intelligibility. By mapping automatic speech recognition
performance against loudness and peripheral response
patterns, we investigate the extent to which current
enhancement strategies maintain the perceptual and
physiological integrity of the speech signal. This work
aims to identify features predictive of intelligibility,
providing a foundation for speech enhancement systems
optimised for the human listener rather than purely
signal-based objective functions.
Authors
François Effa

Université de Lorraine, CNRS, Inria, Loria, Nancy, France

Romain Serizel

LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Friday May 29, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

9:00am CEST

Objective Quality Models for Decision-Making in Speech Coding
Friday May 29, 2026 9:00am - 11:00am CEST
Objective quality evaluation is widely used in speech
coding, yet objective estimates often show limited
agreement with subjective listening-test results. Rather
than focusing on absolute score accuracy, this paper
evaluates objective speech quality models from a
decision-making perspective, defined as their ability to
support comparative judgments between speech codecs or
codec configurations. A formal ITU-T P.800 Absolute
Category Rating (ACR) listening test was conducted with 30
listeners across 24 conditions, covering conventional and
neural monophonic speech codecs operating under
clear-channel conditions at sampling frequencies from 16 to
48 kHz and bit rates ranging from below 1 kbps to above 16
kbps. The speech material consisted of internally recorded,
clean French-language speech that was not used in the
development or training of any of the evaluated codecs or
objective quality models. Seven objective quality models,
namely PESQ, VISQOL Speech, VISQOL Audio, WARP-Q, NISQA,
UTMOS, and DistillMOS, were evaluated on the same material.
Decision-making performance was assessed by comparing
subjective and objective rankings using Kendall’s rank
correlation coefficient and by analyzing pairwise codec
comparisons using t-tests at a 95% confidence level. The
results show that some objective quality models are
effective for comparing bit rate variations within a given
speech coding technology, provided that all other codec
parameters remain unchanged (e.g., sampling frequency).
However, all models exhibit limitations, including
tendencies toward over- or underestimation for certain
technologies, as well as reduced reliability when applied
across different sampling frequencies. Despite its
conventional origins, PESQ remains capable of supporting
decision-making even when applied to neural speech codecs.
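To make the ranking comparison concrete, here is a minimal sketch of Kendall's rank correlation in its tau-a form, which counts concordant and discordant pairs and applies no tie correction. The abstract does not say which variant or software the authors used, so the variant, the function name, and the example scores below are assumptions for illustration only.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a between two equally long score lists (no tie correction)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two items")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1   # both measures order this pair the same way
        elif product < 0:
            discordant += 1   # the two measures disagree on this pair
        # product == 0 is a tie in x or y and contributes to neither count
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical per-condition scores, not data from the paper.
subjective_mos = [4.2, 3.1, 2.5, 1.8]
objective_score = [3.9, 3.3, 2.0, 2.2]
print(kendall_tau_a(subjective_mos, objective_score))
```

A value near 1 means the objective model orders the codec conditions almost exactly as the listeners did, which is the property that matters for comparative decisions rather than absolute score accuracy.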
Authors
Clémence Lamballe

Universite de Sherbrooke

Philippe Gournay

Universite de Sherbrooke

Roch Lefebvre

Universite de Sherbrooke
Friday May 29, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

9:00am CEST

The Ambisonic Denoising Paradox: U-Net Processing Degrades ASR Transcription Quality for Medical Speech
Friday May 29, 2026 9:00am - 11:00am CEST
Spatial audio recording using higher-order Ambisonics
offers rich directional information for medical speech
capture, yet challenging hospital acoustic environments
motivate preprocessing with neural denoising algorithms.
This study investigates whether U-Net-based denoising of
third-order ambisonic recordings improves automatic speech
recognition (ASR) quality for medical applications. We
developed the Medical Immersive Audio Corpus (MIAC),
comprising 1,759 utterances (6.43 hours) of Polish medical
speech recorded with a Zylia ZM-1 microphone in
uncontrolled hospital environments, capturing 16-channel
third-order Ambisonics across multiple specializations
including thyroid ultrasonography, surgical procedures, and
general diagnostics. We applied a U-Net architecture with
dual attention mechanisms trained using the Noise2Noise
paradigm to denoise the corpus, then evaluated
transcription quality using ten Whisper ASR models ranging
from 39 million to 1.55 billion parameters, including
domain-adapted medical variants. Surprisingly, we
discovered a "noise reduction paradox" where denoising
degraded transcription quality for seven of ten models,
with statistically significant increases in Word Error Rate
(WER) and Character Error Rate (CER) for general-purpose
base, small, and medium models. Only the domain-adapted
whisper-medium-68000-abbr model showed statistically
significant improvement (p=0.0008), while large-scale
models (large-v2, large-v3) exhibited robustness with
negligible changes. Effect sizes remained small (Cohen's d
< 0.2) across all models. These counterintuitive findings
suggest modern ASR systems implicitly utilize background
noise characteristics as informative features, and that
preprocessing pipelines should be reconsidered for
domain-specific applications. Our results provide practical
guidance for medical speech processing system design.
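Word Error Rate, the main metric in the abstract above, is conventionally the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below implements that definition with plain whitespace tokenization. The function name and the English example sentences are hypothetical, and the text normalization used in the actual study is not stated in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between the reference words processed so
    # far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1        # reference word missing from hypothesis
            insertion = current[j - 1] + 1    # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# Hypothetical transcripts: one deleted word and one substituted word.
reference_text = "the patient has a thyroid nodule"
hypothesis_text = "the patient has thyroid module"
print(word_error_rate(reference_text, hypothesis_text))
```

Character Error Rate follows the same recipe with characters in place of words.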
Authors
Bartlomiej Mroz

Assistant Professor, Gdańsk University of Technology
PhD, Spatial Audio & Immersive Media Researcher, Recording Engineer, Statistics enthusiast

Szymon Zaporowski

Gdańsk University of Technology
Friday May 29, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

9:00am CEST

A perceptual evaluation of various commercial models of music source separation, with a focus on model performance against non-traditional source material
Friday May 29, 2026 9:00am - 11:00am CEST
Music source separation (MSS) systems are commonly used in
production, remixing, and audio analysis work, yet
questions arise regarding the extent to which objective
evaluations of model performance align with human
perceptual evaluations, particularly when tasked with
non-traditional source material (in this case, heavily
processed electronic music). This study seeks to set a
framework for an evaluation of three machine learning
approaches to MSS: a spectrogram-domain model (Spleeter), a
waveform-domain model (Demucs v2), and a hybrid-domain
model (HTDemucs). Subjective evaluations of model
performance were accumulated via a MUSHRA-style listening
test, while objective evaluations were assessed using
signal-to-distortion ratio (SDR) and Fréchet Audio Distance
(FAD). Results showed consistent agreement across objective
metrics, with the hybrid-domain model outperforming the
other singular-domain models. Perceptual ratings also
favored the hybrid model; notably, listeners occasionally
rated the model output as equal to or better than the
original reference in quality. Preliminary analysis
indicates some moderate but insignificant correlations
between the two assessment paths, reinforcing concerns
about relying solely on numerical evaluations when
discussing MSS model performance. Implications for model
design and future evaluation procedures are discussed.
Authors

Sahan Wijewardane

University of Miami
Friday May 29, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

9:00am CEST

Automating sound design for adaptive video game narration
Friday May 29, 2026 9:00am - 11:00am CEST
HAMLET is a research project that investigates the
integration of Artificial Intelligence and co-creation
practices within the creative industries. The project
proposes AI-driven enablers to support artists through
collaborative workflows between creative practitioners and
technology providers. This work focuses on an automated
sound design framework for text-based role-playing games,
where the game narration is dynamically generated through
player textual interaction with an LLM. To address this
unpredictability, the proposed system generates adaptive
soundscapes automatically from textual scene descriptions.
An LLM identifies semantically relevant sound sources,
which are then matched to audio libraries through metadata
alignment. The files are assessed for quality and are fed
to an automated mixing module. The framework addresses
challenges related to semantic alignment, audio quality,
aesthetic balance, and file size constraints.
Authors
Charalampos Dimoulas

Aristotle University of Thessaloniki

George Kalliris

Aristotle University of Thessaloniki

Lazaros Vrysis

Aristotle University of Thessaloniki

Marina Eirini Stamatiadou

Aristotle University of Thessaloniki

Nikolaos Vryzas

Aristotle University of Thessaloniki
Dr. Nikolaos Vryzas was born in Thessaloniki in 1990. He studied Electrical & Computer Engineering in the Aristotle University of Thessaloniki (AUTh). After graduating, he received his master degrees on Information and Communication Audio Video Technologies for Education & Production from the Interdepartme...
Friday May 29, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:00pm CEST

Geometry Sensitivity in Low-Count Virtual Microphone Arrays: From Tetrahedral Baselines to Stochastic Spherical Layouts
Friday May 29, 2026 1:00pm - 3:00pm CEST
Virtual Microphone Array techniques are being investigated
by the authors to support room acoustics optimisation in
live sound environments. In our recent AES paper, “Room
Acoustics Optimisation Using Virtual Microphone Arrays”, a
notable outcome was that a compact four-microphone
tetrahedral array performed strongly relative to its low
sensor count. Recent virtual sensing and Remote Microphone
Technique research treats microphone placement as an
explicit design variable. It reports improved remote
estimation performance when microphone layouts are
deliberately chosen for the task, rather than adopted as
fixed, standard configurations.
This submission builds on our prior VMA work by focusing on
the four-microphone case, where geometry choices are
especially constrained. We compare a tetrahedral baseline
with an ensemble of stochastically generated spherical
layouts at the same array aperture using Monte Carlo
simulation. We apply a consistent evaluation protocol
across multiple listening-region offsets and standard
beamforming estimators to isolate variability due to
geometry alone. The central proposition is that, for
low-count VMAs, geometry is a first-order design parameter.
Tetrahedral remains a credible baseline, but lightweight
stochastic exploration can reveal alternative layouts that
are competitive and, in some cases, superior without
increasing channel count.
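The geometry comparison described above can be pictured with a small sketch that builds a regular tetrahedral layout and an ensemble of random four-point layouts on a sphere of the same radius. It covers only the layout generation step, not the beamforming or evaluation. Treating "array aperture" as the sphere radius, sampling points uniformly via normalized Gaussian vectors, and all function names and the 5 cm radius are assumptions made for this example.

```python
import math
import random


def tetrahedral_layout(radius):
    """Vertices of a regular tetrahedron inscribed in a sphere of the given radius."""
    s = radius / math.sqrt(3.0)
    return [(s, s, s), (s, -s, -s), (-s, s, -s), (-s, -s, s)]


def random_spherical_layout(radius, count, rng):
    """Draw `count` points uniformly on the sphere by normalizing Gaussian vectors."""
    points = []
    while len(points) < count:
        vector = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in vector))
        if norm < 1e-12:
            continue  # practically never happens; avoids dividing by ~zero
        points.append(tuple(radius * c / norm for c in vector))
    return points


def layout_ensemble(radius, count, trials, seed=0):
    """Monte Carlo ensemble of random layouts, reproducible through the seed."""
    rng = random.Random(seed)
    return [random_spherical_layout(radius, count, rng) for _ in range(trials)]


# Hypothetical setup: 5 cm radius, four microphones, ten random candidates.
baseline = tetrahedral_layout(0.05)
candidates = layout_ensemble(0.05, 4, 10)
print(len(candidates), "random layouts generated alongside the tetrahedral baseline")
```

Each candidate would then be scored with the same estimator and listening-region offsets as the baseline, so that any performance difference can be attributed to geometry.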
Authors

Brian de Brit

Lecturer, Technological University Dublin
Brian de Brit is a lecturer in the School of Electrical and Electronic Engineering at Technological University Dublin. He holds a B.Sc. in Mathematical Physics (University College Dublin), an M.Phil. in Music and Media Technologies (Trinity College Dublin), and a Master of Engineering... Read More →
DD

David Dorran

Technological University Dublin
Friday May 29, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:00pm CEST

Clustered Virtual Microphone Arrays for Listener-Level Monitoring and Room-Correction in Live Sound
Friday May 29, 2026 1:00pm - 3:00pm CEST
This paper introduces clustered virtual microphone arrays
as a step toward improving listener-level virtual
microphone estimation for live sound. Multiple compact
microphone sub-arrays are placed around a nominal overhead
position. Each sub-array produces a virtual microphone
estimate, and the estimates are fused. The aim is to attack
the estimation problem from multiple viewpoints and reduce
sensitivity to any one array placement or geometry.
The work builds on our earlier paper, “Room Acoustics
Optimisation Using Virtual Microphone Arrays”. That paper
proposed virtual microphones estimated from an overhead
array as a measurement layer for live sound optimisation.
It also highlighted a key limitation: in its initial form,
virtual microphone estimation quality was not yet strong
enough for reliable use across positions. The present paper
targets that limitation. We outline the clustered array
idea and treat cluster count and inter-cluster spacing as
design parameters. Virtual microphones are estimated using
beamforming and combined using simple fusion. Performance
is assessed with objective signal measures, including SNR
and frequency- and phase-related error measures, across
multiple listener-level target positions. The results
support further refinement under more realistic room
conditions and further study of the link between improved
estimation quality and FIR-based correction outcomes.
Authors

Brian de Brit

Lecturer, Technological University Dublin
Brian de Brit is a lecturer in the School of Electrical and Electronic Engineering at Technological University Dublin. He holds a B.Sc. in Mathematical Physics (University College Dublin), an M.Phil. in Music and Media Technologies (Trinity College Dublin), and a Master of Engineering... Read More →
DD

David Dorran

Technological University Dublin
Friday May 29, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:00pm CEST

A Time–Frequency Integrated Framework for Frequency-Invariant Beamforming in Loudspeaker Arrays
Friday May 29, 2026 1:00pm - 3:00pm CEST
Loudspeaker array beamforming technology has been widely
used; however, current frequency-domain and time-domain
design methods for calculating FIR filters face challenges,
including the need for modeling delay and high
computational complexity. To address these issues, this
paper proposes a time–frequency integrated framework. This
framework supports both pressure matching and amplitude
matching methods, enabling not only the realization of
traditional superdirective beams but also the design of
frequency-invariant beams. For the nonlinear optimization
problem in amplitude matching, an efficient solving
algorithm based on the Alternating Direction Method of
Multipliers (ADMM) is introduced. Experimental results
demonstrate that the proposed method combines the
advantages of existing frequency-domain and time-domain
approaches, directly computing FIR filter coefficients
without delay modeling while maintaining high computational
efficiency. This provides an effective solution for beam
control in loudspeaker arrays.
Authors
JY

Jianbin Yang

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark
KP

Keyu Pan

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark
NC

Ning Cong

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark
XT

Xing Tian

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark
Friday May 29, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:00pm CEST

The Impact of Frequency Gradient on Nonlinear Pulse Distribution in the Farina Technique
Friday May 29, 2026 1:00pm - 3:00pm CEST
The Exponential Sine Sweep (ESS) technique, popularized by
Angelo Farina, has become a cornerstone of modern
electroacoustic measurement due to its unique capability to
simultaneously extract a system’s linear impulse response
and its individual harmonic distortion components. Standard
implementation of this method almost exclusively utilizes a
low-to-high (upward) exponential sine sweep. However,
during a technical Q&A session at the AES Europe 2025
Convention in Warsaw, a question was raised: what are the
practical consequences of reversing the sweep direction?
This inquiry is particularly relevant given that several
industry-standard measurement platforms often employ
high-to-low (downward) sweeps to optimize the mechanical
and thermal stability of the device under test (DUT) while
performing stepped or swept sinusoidal analysis.
This paper provides an investigation into the temporal
behavior of nonlinearities when the frequency gradient of
an exponential sweep is inverted. Through formal
mathematical derivation and numerical simulations, the study
proves that while the spacing between distortion orders
remains identical in magnitude, the polarity and time
distribution of these impulses is reversed. Specifically,
we demonstrate that in a downward sweep, the distortion
products shift from the "pre-causal" negative time region
to the "post-causal" positive time region. This shift
causes harmonic distortion pulses to emerge within the
reverberant tail of the impulse response, leading to
significant contamination of decay measurements and
energy-time curves. By contrasting the "tracking filter"
paradigm with "time-domain deconvolution," this work
clarifies why sweep direction is a critical parameter that
must be aligned with the specific goals of the measurement
protocol.
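
The sign reversal described above can be checked with the standard exponential-sweep relation, in which the N-th harmonic impulse is displaced from the linear impulse response by T·ln(N)/ln(f_end/f_start). The sketch below is a minimal illustration of that textbook relation, not the paper's derivation; the function name and the 20 Hz to 20 kHz, 10 s sweep are assumed example values.

```python
import math

def harmonic_offsets(f_start, f_end, duration, max_order=5):
    """Arrival time (s) of each harmonic-order impulse relative to the linear one.

    Negative values fall before the linear impulse response, positive values after it.
    """
    # Sweep rate constant L = T / ln(f_end / f_start); it is negative for a downward sweep.
    rate = duration / math.log(f_end / f_start)
    return {order: -rate * math.log(order) for order in range(2, max_order + 1)}

up = harmonic_offsets(20.0, 20000.0, 10.0)     # low-to-high sweep
down = harmonic_offsets(20000.0, 20.0, 10.0)   # high-to-low sweep

for order in up:
    print(f"order {order}: upward {up[order]:+.3f} s, downward {down[order]:+.3f} s")
```

For every order the two offsets have equal magnitude and opposite sign, so with the downward sweep the distortion impulses land inside the decay of the linear response.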
Authors

Daniele Ponteggia

Materiacustica Srl
Friday May 29, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:00pm CEST

Real-Time Heart Rate Sonification Using Spectral Filtering of Preferred Music for Running Training
Friday May 29, 2026 1:00pm - 3:00pm CEST
The purpose of this study was to evaluate a sonification
system that maps live heart rate data to real-time spectral
filtering of a runner's preferred music. Assessed using a
within-subjects design (n = 13), the system employs
high-pass and low-pass filters to indicate deviations from
target heart rate zones, providing instantaneous
biofeedback without requiring visual attention.
Quantitative analysis revealed no statistically significant
differences in target zone accuracy or response time
between auditory, visual, and combined conditions. However,
qualitative thematic analysis identified a clear division
in user preference. Participants favouring the auditory
condition demonstrated faster mean response times to audio
biofeedback. Findings suggest that while sonification
promotes environmental focus and "gamifies" training, its
efficacy is highly dependent on individual processing
styles and music familiarity.
Authors

Duncan Williams

Senior Lecturer, Acoustics Research Centre, University of Salford
JS

Jay Steel

Acoustics Research Centre, University of Salford
NR

Nicholas Ripley

School of Health and Society, University of Salford
Friday May 29, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:00pm CEST

A Psychoacoustic Framework for In-Vehicle Audio-Light Mapping
Friday May 29, 2026 1:00pm - 3:00pm CEST
This paper proposes a psychoacoustic-based audio-visual
mapping framework for intelligent vehicle cabins to enhance
immersion and stabilize spatial auditory perception. By
establishing mappings between auditory descriptors—such as
Direction of Arrival (DOA), spectral centroid, and temporal
envelope—and ambient lighting parameters, the framework
leverages "ambient vision" to augment the perceptual
experience without increasing the driver's cognitive load.
Theoretical analysis based on Stevens’ Power Law indicates
that the proposed mapping strategies effectively
synchronize audio-visual intensities and mitigate
perceptual fatigue, providing a conceptual reference for
future multisensory HMI design.
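
To make the Stevens' Power Law reasoning concrete, the sketch below matches a change in light intensity to a change in sound pressure so that both produce the same perceived change. The exponents (0.67 for loudness versus sound pressure, 0.33 for brightness) are commonly quoted textbook values and the function names are hypothetical; the abstract does not state which exponents or mapping the author uses.

```python
def perceived_magnitude(stimulus_ratio, exponent, k=1.0):
    """Stevens' Power Law: perceived magnitude = k * stimulus ** exponent."""
    return k * stimulus_ratio ** exponent

def matching_luminance_ratio(sound_pressure_ratio, a_sound=0.67, a_light=0.33):
    """Luminance ratio whose perceived change equals that of the sound-pressure change.

    Solves light ** a_light == sound ** a_sound for the light ratio.
    """
    return sound_pressure_ratio ** (a_sound / a_light)

# Doubling sound pressure calls for roughly a fourfold luminance increase
# under these assumed exponents.
light_ratio = matching_luminance_ratio(2.0)
```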
Authors

Kangwei Wang

Acoustic System Engineer, GoerDynamics Lab2
Friday May 29, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:00pm CEST

Sound field creation with a cube-like loudspeaker array designed using Lamé function based on virtual sound source distribution
Friday May 29, 2026 1:00pm - 3:00pm CEST
The diversification of audio content production has
increased the demand for realistic, immersive sound field
reproduction. Conventional methods struggle to separate
direct and reflected sounds, limiting accuracy. To address
this issue, this study proposes a method for sound field
reproduction that identifies the arrival directions of
reflected sounds based on the virtual sound source
distribution. In this study, the virtual sound source
distribution was calculated using the closely located
four-point microphone method. Assuming that spherical waves
emitted from distant virtual sound sources arrive as plane
waves within the listening area, the target sound field is
generated through plane wave synthesis, enabling more
accurate and flexible sound field generation. Furthermore,
considering practical systems and typical room shapes, we
investigated the reproducibility of plane wave sound fields
using not only a spherical array but also a cube-like
loudspeaker array configured by the Lamé function, which
allows continuous geometric transformation from a sphere to
a cube-like form. In this study, the ideal plane wave sound
field derived from the wave equation was regarded as the
reference, and the sound fields generated by the
loudspeaker arrays were evaluated and compared using mean
square error (MSE). Furthermore, the evaluation was
extended beyond a single time instant, enabling assessment
that also accounts for temporal variations. The results
indicated that changing the order of the Lamé function
maintained the desired level of reproducibility.
Consequently, it was confirmed that cube-like loudspeaker
arrays can achieve a level of reproducibility equivalent to
that of the spherical array.
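
The sphere-to-cube transformation can be sketched with the usual Lamé (superellipsoid) surface |x|^p + |y|^p + |z|^p = r^p, where p = 2 gives a sphere and larger p approaches a cube. The code below projects a loudspeaker direction onto that surface; whether the authors use exactly this parametrisation is an assumption, and `lame_point` is a hypothetical helper name.

```python
import math

def lame_point(direction, radius, order):
    """Point where a ray from the origin along `direction` meets the Lamé surface
    |x|^p + |y|^p + |z|^p = radius^p, with p = order."""
    dx, dy, dz = direction
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    ux, uy, uz = dx / norm, dy / norm, dz / norm
    # Scale the unit direction so that it satisfies the surface equation.
    scale = radius / (abs(ux) ** order + abs(uy) ** order + abs(uz) ** order) ** (1.0 / order)
    return (scale * ux, scale * uy, scale * uz)

on_sphere = lame_point((1.0, 1.0, 1.0), 1.0, 2)    # order 2: ordinary sphere
near_cube = lame_point((1.0, 1.0, 1.0), 1.0, 20)   # high order: close to a cube corner
```

Along the coordinate axes the position does not change with the order; the diagonal directions move outward toward the cube corners as the order grows.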
Authors
TS

Tomohiro Sakaguchi

Doctoral student, Waseda University
YO

Yasuhiro Oikawa

Waseda University

YE

Yuzuki Eriguchi

Waseda University
Friday May 29, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:00pm CEST

Spatial Sound Field Reproduction Systems for Cabin Noise in Rail Vehicles: Performance Evaluation Based on Sound Quality Indices
Friday May 29, 2026 1:00pm - 3:00pm CEST
Innovative railway vehicle systems such as high-speed rail,
maglev, and emerging transportation concepts are expected
to reduce conventional noise sources related to wheel–rail
and aerodynamic interactions. As these changes alter the
acoustic characteristics inside railway cabins, reliable
laboratory reproduction of interior noise becomes
increasingly important for evaluating passenger acoustic
comfort and guiding sound design during vehicle
development. The study focuses on practical methods for
assessing reproduction accuracy.
Conventional validation of reproduced sound fields
typically relies on sound pressure level and spectral
matching; however, these metrics alone may not fully
reflect perceptually relevant differences between in-situ
and reproduced environments. In this work, sound quality
indices are employed as complementary evaluation metrics to
examine whether reproduced sound fields maintain
perceptually meaningful characteristics of the original
cabin noise. Comparisons between in-situ recordings and
reproduced sound fields were conducted in terms of overall
sound pressure level, frequency characteristics, and
selected sound quality indices. In addition, the influence
of loudspeaker number and spatial configuration on
reproduction performance was examined. The results show
that sound quality–based evaluation provides useful
additional information for assessing perceptual fidelity
and for optimizing spatial sound reproduction systems for
railway cabin noise. The proposed reproduction platform
supports laboratory-based assessment of interior railway
noise and provides a practical framework for perceptually
informed acoustic evaluation and noise control during the
design of next-generation railway vehicles.
Authors
HK

Hyo-In Koh

Korea Railroad Research Institute
JH

Jiyoung Hong

Korea Railroad Research Institute
WS

Wooseok Song

University of Science and Technology

Yonghee Lee

Research Associate, Changwon National University
Yonghee Lee
Ph.D., Mechanical Engineering.
Ultrasonic, Acoustic, SHM, NDE, fNIRS, and Bio-medical engineering.
Institute: Changwon National University, South Korea
Friday May 29, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark
 
Saturday, May 30
 

9:00am CEST

Differentiated Wavefront Modulation for Directivity Control at High Frequencies
Saturday May 30, 2026 9:00am - 11:00am CEST
The inherent narrowing of directivity at high frequencies
in compact tweeters limits the spatial uniformity of sound
reproduction in modern audio systems. Conventional passive
solutions, such as waveguides and acoustic lenses,
partially mitigate this issue but typically rely on bulky
geometries and treat the diaphragm as a unitary radiator,
neglecting localized vibration behavior. This study
proposes a Matrix Wavefront Modulator (MWM), a compact
passive device that implements a differentiated
wavefront-shaping strategy based on vibration-aware
radiation control. Sound radiation from the piston-like
diaphragm dome and the breakup-prone surround is processed
independently by combining guided wavefront steering with
targeted scattering compensation. The geometry of the MWM
is optimized to adapt to the radiation characteristics of
the tweeter. Numerical simulations show that the optimized
MWM reshapes the high-frequency wavefront toward a more
spherical distribution and significantly reduces off-axis
attenuation above 10 kHz. Experimental measurements confirm
significant improvements in high-frequency directivity over
wide radiation angles.
Authors
JY

Jianbin Yang

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark
JG

Jun Gu

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark
ZL

Zhi Li

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark
Saturday May 30, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark
  Audio Equipment, Poster

9:00am CEST

Mechanical Characterization and Geometry Optimization of Loudspeaker Spider Suspensions
Saturday May 30, 2026 9:00am - 11:00am CEST
Loudspeaker spider suspensions play a crucial role in
defining the compliance and stability of electrodynamic
transducers. Due to their woven structure impregnated with
thermosetting resins, spiders exhibit a nonlinear and
viscoelastic mechanical response, resulting in stiffness
dependence on displacement and excitation rate, as well as
energy dissipation during operation. However, viscoelastic
effects are often simplified during early loudspeaker
design stages.
This work presents a combined numerical–experimental study
aimed at characterizing the mechanical behaviour of
loudspeaker spiders and assessing its influence on
optimization choices during the pre-design phase. An
experimental campaign was conducted on spider samples with
fixed geometry and varying materials. Loading–unloading
cycle measurements were performed at different displacement
rates to capture nonlinear stiffness and hysteresis effects.
A finite element modelling framework was developed using a
2D axisymmetric formulation. Viscoelastic material
behaviour was first described through time-dependent
simulations, with model parameters identified by fitting
simulated loading–unloading curves to experimental data. A
parametric geometry optimization model based on linear
elastic assumptions was then implemented using quasi-static
simulations. Finally, the optimized spider geometries were
re-evaluated using time-dependent simulations incorporating
the identified viscoelastic material properties.
Results show that the spider material may influence its
mechanical behaviour, in particular the suspension
stiffness and hysteresis effects. Viscoelasticity mainly
affects the magnitude of the stiffness curve rather than
its overall shape, particularly at small displacements.
These findings support the use of quasi-static linear
elastic simulations for geometry optimization in early
loudspeaker design, while highlighting the importance of
material characterization for accurate performance
prediction.
Authors

Chiara Corsini

R&D engineer, FAITAL S.P.A. ALPS ALPINE GROUP
Chiara joined Faital S.p.A. in 2018, working as a FEM analyst in the R&D Department. Her research activities are focused on thermal phenomena associated with loudspeaker functioning, and mechanical behavior of the speaker moving parts. To this end, she uses FEM and lumped parameter... Read More →
LV

Luca Villa

FAITAL S.P.A. ALPS ALPINE GROUP
NC

Nicolò Chillè

Politecnico di Milano
Saturday May 30, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark
  Audio Equipment, Poster

9:00am CEST

Quasi-Anechoic Loudspeaker Measurements: a “Step” Forward
Saturday May 30, 2026 9:00am - 11:00am CEST
Measuring the anechoic response of a loudspeaker system
requires space and facilities that are not commonly
available. The evolution of measurement instruments has
made it possible to visualize the time response of the
system under analysis, enabling the identification of
reflected signals and their elimination through time-gating
(windowing) of the impulse response. However, this comes at
the cost of a loss of resolution and characterization of
the system's response at lower frequencies. To correctly
characterize the system's response at the lowest
frequencies, the most widely used technique is the one
described by Keele in his AES paper "Low-Frequency
Loudspeaker Assessment by Nearfield Sound-Pressure
Measurement".
To obtain the overall system response, the appropriately
windowed far-field response and the near-field response are
combined, as described by Struck and Temme in their paper
"Simulated Free Field Measurements".
This operation is performed in the frequency domain, but
what happens when applied in the time domain?
The goal of this work is to use the near-field impulse
response to reconstruct the far-field portion of the
impulse response affected by environmental reflections. As
already stated, it’s quite easy to identify the first
reflection point on a far-field impulse response; this
can be used as a merging point to reconstruct the
reflection-affected impulse tail using the corresponding
part of the near-field impulse measurement. Once the
far-field impulse tail is reconstructed, it is possible to
obtain the full-range frequency response of the system
under test while maintaining maximum measurement
resolution. The steps required to achieve a full-range
frequency response are fewer than those required for the
frequency-domain technique. For example, it is not
necessary to add the baffle diffraction step effect, as
demonstrated in the paper.
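
The low-frequency penalty of time-gating mentioned above follows from the gate length: a reflection-free window of T seconds gives a frequency resolution of about 1/T. The sketch below estimates both for a single floor reflection; the function name, the floor-only geometry, the 343 m/s speed of sound and the example distances are illustrative assumptions rather than values from the paper.

```python
import math

def gate_limits(distance, source_height, mic_height, speed_of_sound=343.0):
    """Reflection-free gate length (s) and resulting frequency resolution (Hz).

    Considers only the floor reflection; heights must be greater than zero.
    """
    direct = math.hypot(distance, source_height - mic_height)
    # Mirror-image source below the floor gives the reflected path length.
    reflected = math.hypot(distance, source_height + mic_height)
    gate = (reflected - direct) / speed_of_sound
    return gate, 1.0 / gate

# Loudspeaker and microphone both 1 m above the floor, 1 m apart.
gate_s, resolution_hz = gate_limits(1.0, 1.0, 1.0)
print(f"gate {gate_s * 1000:.2f} ms, resolution about {resolution_hz:.0f} Hz")
```

In this example the windowed far-field response is only meaningful above a few hundred hertz, which is the range the near-field measurement has to fill in.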
Authors
DS

Davide Saronni

Independent
Saturday May 30, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark
  Audio Equipment, Poster

9:00am CEST

Reduction of Mid-to-High-Frequency Distortion in Loudspeakers through Structural Magnetic Circuit Modification
Saturday May 30, 2026 9:00am - 11:00am CEST
This paper investigates mid-to-high-frequency distortion in
traditional electrodynamic loudspeakers arising from
current-dependent nonlinearity in the magnetic circuit.
Through theoretical analysis, finite-element simulations
and experimental validation, the dominant distortion
mechanisms are identified. To mitigate distortion while
maintaining a stable frequency response, an improved
magnetic circuit is proposed, which introduces longitudinal
slits to suppress surface-concentrated eddy currents.
Experimental results demonstrate that the modified circuit
achieves greater distortion reduction compared with
conventional designs. As the improvement relies solely on
structural modifications without changing the ferromagnetic
materials, the proposed design offers a practical and
cost-effective solution for engineering applications.
Authors
HX

He Xiao

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark
JY

Jianbin Yang

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark
ZL

Zhi Li

Dynaudio Lab, Gammel Lundtoftevej 3B, Copenhagen, Denmark
Saturday May 30, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark
  Audio Equipment, Poster

9:00am CEST

Sound Diffusion Properties of a Bending-Wave Loudspeaker Compared with a Conventional Speaker
Saturday May 30, 2026 9:00am - 11:00am CEST
The Panel-shaped Bending Wave Loudspeaker was proposed
recently by Kawahara. The authors conducted an objective
evaluation of the diffusion characteristics of Bending Wave
Loudspeakers (BWL) using the degree of interaural
cross-correlation (DICC) in this paper.
Conventional speakers exhibit strong directionality and
rely on room reflections to create a spatial impression. In
contrast, BWLs are considered less susceptible to room
reflections due to complex mode vibrations across the
entire diaphragm.
To quantify this characteristic, the authors recorded sound
in a real-world environment using a head-and-torso
simulator (HATS) and compared the BWL's DICC with that of a
conventional speaker.
The results showed that the BWL exhibited significantly
lower DICC values than the conventional loudspeaker at the
front position (Center) under both broadband noise and
music conditions, confirming its high diffusivity.
Furthermore, this difference exceeded the Just Noticeable
Difference (JND) for spatial perception, suggesting it is
also significant to the human ear. In addition, analysis
separating early reflections and late reflections suggested
differences in diffusion characteristics between
conventional speakers and BWLs.
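
For readers unfamiliar with the measure, the sketch below computes an interaural cross-correlation coefficient in the conventional way: the maximum of the normalised cross-correlation between the two ear signals over lags of ±1 ms. It is assumed here that the DICC used in the paper corresponds to this common definition; the function name and test signals are illustrative only.

```python
import math

def iacc(left, right, sample_rate, max_lag_ms=1.0):
    """Maximum absolute normalised cross-correlation within +/- max_lag_ms."""
    n = min(len(left), len(right))
    left, right = left[:n], right[:n]
    norm = math.sqrt(sum(x * x for x in left) * sum(y * y for y in right))
    if norm == 0.0:
        return 0.0  # at least one silent channel
    max_lag = int(round(sample_rate * max_lag_ms / 1000.0))
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        acc = sum(left[i] * right[i + lag] for i in range(n) if 0 <= i + lag < n)
        best = max(best, abs(acc) / norm)
    return best

pulse = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
delayed = pulse[-2:] + pulse[:-2]      # same pulse arriving two samples later
alternating = [1.0, -1.0] * 6          # signal unrelated to the pulse shape
```

Values near 1 indicate nearly identical ear signals (a compact, directional image), while lower values indicate a more diffuse sound field.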
Authors

Kazuhiko Kawahara

Associate Professor, Faculty of Design, Kyushu University
Dr. Kazuhiko Kawahara is an Associate Professor at the Department of Acoustic...

Rina Mizukami

Graduate School of Design, Kyushu University
Saturday May 30, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark
  Audio Equipment, Poster

9:00am CEST

Zylia ZM-1 vs. Harpex Spcmic: A Case Study of Higher-Order Ambisonic Recording Performance
Saturday May 30, 2026 9:00am - 11:00am CEST
The Zylia ZM-1 (19 MEMS capsules, spherical array, 88 mm
diameter, 3rd-order) and Harpex Spcmic (84 MEMS capsules,
planar array, 230 mm diameter, 5th-order capable) represent
two distinct geometrical approaches to higher-order
Ambisonics capture. Despite widespread adoption in research
and production, systematic comparison of their performance
in real-world recordings remains absent from published
literature. This case study presents a controlled
comparison through simultaneous recordings of piano
recitals in the same concert hall.

Two arrays (Zylia ZM-1 and Harpex Spcmic) were mounted on a
single stereo bar (17 cm apart) ensuring acoustically
identical capture positions. Recording sessions occurred in
Aula Politechniki Gdańskiej (370-seat hall, RT60 = 1.97 s)
on two dates: August 15, 2024 (Franck: Prélude, Choral et
Fugue; Prokofiev: Piano Sonata No. 4, 35.6 minutes
total) and April 30, 2024 (Ginastera: Sonata No. 1, Op. 22,
15.4 minutes). Both arrays recorded simultaneously; files
were processed through manufacturer A-to-B conversion
software and peak-normalized to −0.5 dBTP. The Spcmic was
encoded to both native 5th-order and truncated 3rd-order
formats for direct comparison with the ZM-1.

Four metrics were analyzed: (1) W-channel spectral
response, (2) integrated loudness (LUFS-I per ITU-R
BS.1770-5), (3) spatial energy distribution across
Ambisonics orders, and (4) first-order directional
component ratios.

Spectral analysis reveals the ZM-1 exhibits 5–8 dB
elevation at 200–600 Hz relative to the Spcmic. Loudness
measurements show the Spcmic 3rd-order yields 2.3–3.3 dB
higher LUFS-I than the ZM-1 despite identical peak
normalization.

The primary finding concerns spatial energy: the ZM-1
exhibits 27.4 dB attenuation from 0th to 3rd order, while
the Spcmic shows only 8.4 dB—a 19 dB difference despite
both producing "3rd-order Ambisonics" format. Analysis of
both recording sessions confirms consistency across
different repertoire (romantic, 20th-century,
contemporary). Directional analysis shows the Spcmic
exhibits stronger first-order components (X/Y/Z ratios
0.68–0.83) versus the ZM-1 (0.42–0.55).

Results demonstrate that nominal Ambisonics order
inadequately characterizes spatial resolution in real
recordings. The substantial higher-order energy deficit in
compact spherical arrays has implications for reproduction
quality, decoder design, and archival standards. Arrays
with steeper rolloff may require order-dependent gain
compensation to match spatial impression of larger systems.
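
One plausible way to obtain a per-order energy figure like the one reported above is to group the Ambisonic channels by order and compare their mean energy with that of the 0th-order (W) channel. The sketch below assumes ACN channel ordering and a per-channel mean; the abstract does not specify the authors' exact computation or normalisation convention, so the result here is only illustrative.

```python
import math

def order_energy_db(channels):
    """Mean channel energy per Ambisonic order, in dB relative to order 0.

    `channels` is a list of per-channel sample lists in ACN order, so that
    channel index acn belongs to order floor(sqrt(acn)).
    """
    max_order = math.isqrt(len(channels)) - 1
    totals = [0.0] * (max_order + 1)
    counts = [0] * (max_order + 1)
    for acn, samples in enumerate(channels[: (max_order + 1) ** 2]):
        order = math.isqrt(acn)
        totals[order] += sum(s * s for s in samples)
        counts[order] += 1
    # A tiny floor avoids log(0) for silent orders.
    means = [max(t / c, 1e-30) for t, c in zip(totals, counts)]
    return [10.0 * math.log10(m / means[0]) for m in means]

# Toy first-order signal: W at full level, X/Y/Z at half amplitude.
toy = [[1.0, 1.0], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
levels = order_energy_db(toy)
```

A steeper fall in these values from order 0 to order 3 corresponds to the higher-order energy deficit discussed in the abstract.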

This case study complements existing anechoic validation by
demonstrating performance differences in authentic
recording conditions. Recordings are part of a publicly
available HOA corpus (Gdańsk University of Technology
repository).
Authors

Bartlomiej Mroz

Assistant Professor, Gdańsk University of Technology
PhD, Spatial Audio & Immersive Media Researcher, Recording Engineer, Statistics enthusiast
SZ

Szymon Zaporowski

Gdańsk University of Technology
Saturday May 30, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

9:00am CEST

A Longitudinal Dataset for Guitar String Ageing
Saturday May 30, 2026 9:00am - 11:00am CEST
String ageing is a familiar and perceptually important
phenomenon for guitarists and players of other stringed
instruments. From the moment a new set of strings is
installed, the sound they produce when excited begins to
change due to a combination of chemical degradation,
corrosion, and mechanical wear arising from playing.
Musicians commonly report that aged strings sound dull,
lack sustain, and feel less responsive compared to new
strings. String ageing is a function of both elapsed time
and accumulated playing time, with repeated playing
accelerating degradation through contamination and repeated
mechanical stress.

Previous studies have investigated individual aspects of
string ageing by artificially accelerating wear and
performing controlled acoustic measurements, identifying
effects such as increased damping of higher partials and
increased inharmonicity. While these approaches provide
valuable physical insight, the tightly constrained
experimental conditions differ significantly from
real-world playing conditions.

This paper presents a dataset of audio recordings of guitar
playing over a four-week period, starting from the point of
new strings being installed.
Audio performance data from different sets of electric
guitar strings is recorded daily throughout this period,
using strictly fixed musical exercises that are repeated
multiple times per session. By collecting many takes of
identical material at each stage of string age, the dataset
enables statistical analysis of ageing-related changes
while accounting for natural performance variability.

The dataset is intended to support exploratory machine
learning investigations into string ageing, including
questions of how ageing manifests over time and playing
duration, whether string age can be predicted from audio
alone, and which audio features or learned representations
capture perceptually relevant aspects of the ageing process.
Authors
AW

Alec Wright

University of Edinburgh
MH

Matthew Hamilton

University of Bologna

Thomas McKenzie

Lecturer in Acoustics, University of Edinburgh
Thomas McKenzie is a Lecturer in Acoustics and Architectural Acoustics at the Reid School of Music, Edinburgh College of Art, University of Edinburgh, UK. He completed a B.Sc. in Music, Multimedia, and Electronics at the University of Leeds, UK, in 2013, before completing his M.Sc... Read More →
Saturday May 30, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

9:00am CEST

Modulation Noise in Tape Recording
Saturday May 30, 2026 9:00am - 11:00am CEST
Tape recording of audio programme produces significant
noise signals underlying the audio signal. Measurements
show that total modulation noise is significant and often
around 25 dB down from a sinusoidal audio signal, although
historical measurement methods give numbers that may exceed
50 dB. The persistent popularity of tape in the audio
industry may indicate a preference for some of the more
salient tape characteristics and perhaps even modulation
noise. Measurements on a variety of tapes and machines are
presented in an attempt to understand the basic principles.
A model of modulation noise is developed which provides a
broad steepening spectral peak centred on the signal
frequency and captures much of the tape noise character.
This could be the basis of a plug-in to simulate such
noise. A new measurement method is presented culminating
in a single plot which gives a useful, more complete picture
of modulation noise.
Authors
JV

John Vanderkooy

University of Waterloo
Saturday May 30, 2026 9:00am - 11:00am CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:00pm CEST

JoyCam: Blending Facial Recognition with Neural Activity measurement for Real-time Estimation of Listener Emotion
Saturday May 30, 2026 1:00pm - 3:00pm CEST
The ability to objectively measure listener emotion is a
critical frontier for adaptive audio systems, healthcare,
and personalized music therapy. While music is a powerful
driver of affect, traditional self-reporting is often
intrusive or inaccessible for users in wellbeing settings
who may struggle to articulate their mood. This paper
introduces JoyCam, a multimodal system that estimates
subtle moments of joyful engagement by blending lightweight
brain-wave monitoring (wearable EEG) with facial-expression
sensing. By capturing physiological reactions that occur
below the threshold of conscious awareness, the system
creates a more stable emotional profile than
single-modality methods. In our system, facial joy is
estimated via MediaPipe landmark analysis, focusing on
normalized mouth-width deviations. Simultaneously,
neurological engagement is tracked through Frontal Alpha
Asymmetry (FAA) using an OpenBCI Cyton system. To address
the sensitivity of EEG to movement, a dynamic artefact
index down-weights neural signals during high-frequency
interference. The system was tested in a pilot study with
five participants. Preliminary results indicate that
baseline-corrected physiological scores align closely with
self-reported music impact and valence ratings across
joyful and sad conditions. These findings suggest that
JoyCam offers a robust framework for responsive musical
companions that can adjust playlists or production
parameters based on a listener’s real-time physiological
state.
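
The Frontal Alpha Asymmetry measure mentioned above is commonly computed as the difference between the log alpha-band power of a right and a left frontal electrode. The sketch below shows that computation with a plain DFT, followed by one possible way an artefact index could down-weight the EEG term. The 8 to 13 Hz band, the function names, and the blending rule in `blend_scores` are assumptions for illustration and are not taken from the JoyCam system.

```python
import cmath
import math


def band_power(signal, fs, f_lo, f_hi):
    """Mean power of the DFT bins whose frequency lies in [f_lo, f_hi] Hz."""
    n = len(signal)
    mean = sum(signal) / n
    centred = [x - mean for x in signal]
    powers = []
    for k in range(1, n // 2 + 1):
        if f_lo <= k * fs / n <= f_hi:
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(centred)
            )
            powers.append(abs(coeff) ** 2 / n ** 2)
    if not powers:
        raise ValueError("no DFT bins fall inside the requested band")
    return sum(powers) / len(powers)


def frontal_alpha_asymmetry(left, right, fs, band=(8.0, 13.0)):
    """ln(right alpha power) - ln(left alpha power), a common FAA convention."""
    return math.log(band_power(right, fs, *band)) - math.log(
        band_power(left, fs, *band)
    )


def blend_scores(face_score, eeg_score, artefact_index, eeg_weight=0.5):
    """Hypothetical blend: artefact_index in [0, 1] shrinks the EEG weight."""
    w = eeg_weight * (1.0 - artefact_index)
    return (1.0 - w) * face_score + w * eeg_score


if __name__ == "__main__":
    fs = 128
    left = [2.0 * math.sin(2 * math.pi * 10 * i / fs) for i in range(fs)]
    right = [1.0 * math.sin(2 * math.pi * 10 * i / fs) for i in range(fs)]
    print(frontal_alpha_asymmetry(left, right, fs))
    print(blend_scores(0.8, 0.2, artefact_index=1.0))
```

With an artefact index of 1 the blend falls back to the facial score alone, which is one simple way to keep movement-contaminated EEG from disturbing the estimate.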
Authors
Duncan Williams

Senior Lecturer, Acoustics Research Centre, University of Salford
Saturday May 30, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:00pm CEST

Smartphone-based tinnitus matching: Implementation and Validation
Saturday May 30, 2026 1:00pm - 3:00pm CEST
Tinnitus has been described as 'the conscious awareness of
a tonal or composite noise for which there is no
identifiable corresponding external sound source' and is
experienced by ~15% of the European population. Tinnitus
may be experienced in one ear, both ears, or perceived as
originating from within the head. It can present as tonal
sounds, noise-like sounds, or a combination of both. The
perception can lead to emotional and/or cognitive
dysfunction, autonomic arousal, behavioural changes, and/or
functional disability (DeRidder 2021, Biswas 2022, Jarach
2022). There is no standard test for tinnitus in the
medical literature; audiologists typically test pitch (to
within half an octave) and perceived loudness of the tone
using standard clinical equipment for testing hearing loss.
The underlying causes of tinnitus are not yet fully
understood, and the most effective treatments have not yet
been identified. We present the first release of an extended
tinnitus matching app that includes a highly
individualizable tinnitus tone-matching tool and a
comprehensive questionnaire for mobile health tracking. The
app facilitates large-scale data collection on tinnitus
sounds across aetiologies, co-occurring symptoms, and
demographics. Our intentions are threefold: 1) to provide
those experiencing tinnitus with a way to communicate what
they hear more precisely, 2) to understand how tinnitus
sounds vary across demographics and how these relate to
co-occurring symptoms, and, eventually, 3) to provide a
means of individualising any sound-based approach to symptom
amelioration. We present the approach and validation of the
tinnitus matching tool against common clinical measures.
Authors
CJ

Cheol-Ho Jeong

Acoustic Technology, Department of Electrical and Photonics Engineering, DTU
IO

Izabela Ossowska

Hearing Systems, DTU HealthTech
MB

Mark Bo Jensen

Department of Engineering Technology and Didactics, DTU
ML

Mie Lærkegård Jørgensen

Hearing Systems, DTU HealthTech

MB

Mikkel Brunstedt Nørgaard

Department of Engineering Technology and Didactics, DTU
Saturday May 30, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:00pm CEST

Investigations on Nonlinearity in a Gammatone Filter Bank Based Perceptual Model
Saturday May 30, 2026 1:00pm - 3:00pm CEST
Perceptual models play an important role in effectively
balancing data compression and fidelity in audio encoders
by leveraging the masking effects in human auditory
perception. For deriving suitable masking thresholds,
considering tonality is important. In this study, a novel
filter bank is proposed, which uses narrow complex-valued
all-pole gammatone filters followed by a non-linear
spectral spreading stage. With an appropriate non-linear
mapping before spreading, and an inverse non-linear mapping
afterwards, differences between
masking strengths of tonal and noise-like maskers can be
directly obtained without explicit tonality estimation.
By employing a suitable non-linearity, level-dependency of
spectral spreading in the human auditory system can also be
modeled. The performance of the proposed approach is
evaluated through subjective listening tests, which include
comparisons with results obtained using partial spectral
flatness measures.
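
For readers unfamiliar with the filter type, a complex-valued all-pole gammatone filter can be realised as a cascade of identical one-pole complex resonators. The sketch below is one such realisation under assumed parameters: the filter order, the ERB-based bandwidth, and the names `erb_hz` and `complex_gammatone` are illustrative choices, not the configuration used in the study, and the non-linear spreading stage is not shown.

```python
import cmath
import math


def erb_hz(fc):
    """Equivalent rectangular bandwidth (Glasberg and Moore) in Hz."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)


def complex_gammatone(signal, fs, fc, order=4, bw_scale=1.019):
    """Cascade of `order` one-pole complex resonators centred on fc.

    Returns a complex signal whose magnitude is the band envelope.
    The gain is normalised to unity at the centre frequency.
    """
    decay = 2.0 * math.pi * bw_scale * erb_hz(fc) / fs
    pole = cmath.exp(complex(-decay, 2.0 * math.pi * fc / fs))
    gain = (1.0 - abs(pole)) ** order
    out = [complex(x) for x in signal]
    for _ in range(order):
        state = 0j
        stage = []
        for x in out:
            state = x + pole * state  # y[n] = x[n] + p * y[n-1]
            stage.append(state)
        out = stage
    return [gain * y for y in out]


if __name__ == "__main__":
    fs, n = 16000, 2000
    on_band = [math.sin(2 * math.pi * 1000 * i / fs) for i in range(n)]
    off_band = [math.sin(2 * math.pi * 3000 * i / fs) for i in range(n)]
    print(abs(complex_gammatone(on_band, fs, 1000.0)[-1]))
    print(abs(complex_gammatone(off_band, fs, 1000.0)[-1]))
```

A real sine at the centre frequency yields an envelope of about half its amplitude, because only the positive-frequency half of the signal passes through the complex filter.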
Authors
BE

Bernd Edler

International Audio Laboratories Erlangen, Germany
FS

Fabian Schaller

Fraunhofer IIS, Erlangen, Germany
PE

Paul Emil Meier

International Audio Laboratories Erlangen

PS

Paula Schäfer

Fraunhofer-Institut für Integrierte Schaltungen IIS
YH

Yaqiong Hou

PhD student, International Audio Laboratories Erlangen
Saturday May 30, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark
  Audio Processing, Poster

1:00pm CEST

Measurement and Analysis of Perceptual Characteristics of Binaural Cues
Saturday May 30, 2026 1:00pm - 3:00pm CEST
The application of binaural cue perception mechanisms to
multichannel audio compression technology can reduce
spatial parameter redundancy and effectively lower the
encoding bitrate. Binaural cues play a critical role in
sound source localization, and their frequency-dependent
characteristics yield varied perceptual localization
effects. However, current understanding of the specific
behavior of binaural cues at low frequencies, as well as
the similarities and differences between interaural time
difference (ITD) and interaural level difference (ILD),
remains incomplete. To explore the relationship between
ITD-based and ILD-based azimuth perception, this study
non-uniformly selected nine ITD values and twelve ILD
values within the 300–1480 Hz frequency range to test ITD
and ILD perceptual azimuths, respectively. The experimental
method involved using fixed binaural cue stimuli while
varying the audio with known horizontal azimuth angles to
approach the target binaural cue stimulus. Test results
indicate that both ITD and ILD perceptual effects are
significantly influenced by frequency, with the minimum
perceptual azimuth values for both ITD and ILD observed at
700 Hz, suggesting that binaural cue perception azimuths
are closer to the median plane at this frequency.
Furthermore, surface fitting was applied to the perceptual
azimuths of ITD and ILD, revealing relatively similar
patterns. Based on the experimental findings, this paper
analyzes the perceptual correlation between
ITD-based and ILD-based azimuth perception. Applying these
data in spatial audio coding contributes to the
efficient transmission and fidelity preservation of audio
signals. This study provides valuable insights for
optimizing binaural cue-based compression techniques,
ultimately supporting high-fidelity spatial audio
reproduction.
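
As a point of reference for the measured ITD-to-azimuth mappings, the classical Woodworth spherical-head approximation relates ITD to azimuth as ITD = (r / c)(θ + sin θ). The sketch below evaluates and inverts that formula. It is a frequency-independent textbook model, so it cannot reproduce the frequency dependence reported in the study; the head radius, speed of sound, and function names are assumed values for illustration.

```python
import math


def woodworth_itd(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
    """ITD in seconds for a far-field source at 0..90 degrees azimuth."""
    theta = math.radians(azimuth_deg)
    return head_radius_m / speed_of_sound * (theta + math.sin(theta))


def azimuth_from_itd(itd_s, head_radius_m=0.0875, speed_of_sound=343.0):
    """Invert the Woodworth formula by bisection; returns degrees."""
    max_itd = woodworth_itd(90.0, head_radius_m, speed_of_sound)
    if not 0.0 <= itd_s <= max_itd:
        raise ValueError("ITD outside the range of the spherical-head model")
    lo, hi = 0.0, 90.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if woodworth_itd(mid, head_radius_m, speed_of_sound) < itd_s:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for itd_us in (100, 300, 500):
        print(itd_us, "us ->", round(azimuth_from_itd(itd_us * 1e-6), 1), "deg")
```

Comparing measured perceptual azimuths against such a baseline is one way to see how far listeners deviate from a purely geometric prediction at each frequency.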
Authors
HW

Heng Wang

Wuhan Polytechnic University
MG

Mingyan Gao

Wuhan Polytechnic University
YX

Yiming Xu

Wuhan Polytechnic University, Wuhan, China
Saturday May 30, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

1:00pm CEST

Subjective Evaluation of Stereo Width Shrinkage Method Using Semantic Differential Method and Scheffé’s Paired Comparison
Saturday May 30, 2026 1:00pm - 3:00pm CEST
The authors proposed a stereo-width shrinkage method for
headphone reproduction, in which crosstalk from loudspeaker
reproduction is added to the original stereo sources. In
this study, we investigate the sound quality of
stereo-width-shrunken sources with different parameter
settings. A Semantic Differential method is employed to
quantify the subjective characteristics with five adjective
pairs, and the naturalness of the stereo-width-shrunken
sources is evaluated in detail with Scheffé’s paired
comparison. The results of the Semantic Differential method
comprehensively rank the sound sources. Interestingly, the
results of the paired comparison are not reversed in the
natural and unnatural evaluations, whereas the negative
evaluation yields reasonable results. These results provide
valuable insights for practical sound-quality evaluation.
Authors
MA

Matsumoto Arisa

Kyushu Institute of Technology
Mitsunori Mizumachi

Professor, Kyushu Institute of Technology
Mitsunori Mizumachi graduated from the Department of Acoustic Design, Kyushu Institute of Design, in 1995 and received his Ph.D. degree in Information Science from Japan Advanced Institute of Science and Technology in 2000. From 2000 to 2004, he worked as a researcher at Advanced...
Saturday May 30, 2026 1:00pm - 3:00pm CEST
Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark
  Perception, Poster
 

