Name: The Ambisonic Denoising Paradox: U-Net Processing Degrades ASR Transcription Quality for Medical Speech
Start: 2026-05-29T09:00:00+0200
End: 2026-05-29T11:00:00+0200

The Ambisonic Denoising Paradox: U-Net Processing Degrades ASR Transcription Quality for Medical Speech

Friday May 29, 2026 9:00am - 11:00am CEST

Foyer Building 303A

Spatial audio recording using higher-order Ambisonics
offers rich directional information for medical speech
capture, yet challenging hospital acoustic environments
motivate preprocessing with neural denoising algorithms.
This study investigates whether U-Net-based denoising of
third-order ambisonic recordings improves automatic speech
recognition (ASR) quality for medical applications. We
developed the Medical Immersive Audio Corpus (MIAC),
comprising 1,759 utterances (6.43 hours) of Polish medical
speech recorded with a Zylia ZM-1 microphone in
uncontrolled hospital environments, capturing 16-channel
third-order Ambisonics across multiple specializations
including thyroid ultrasonography, surgical procedures, and
general diagnostics. We applied a U-Net architecture with
dual attention mechanisms trained using the Noise2Noise
paradigm to denoise the corpus, then evaluated
transcription quality using ten Whisper ASR models ranging
from 39 million to 1.55 billion parameters, including
domain-adapted medical variants. Surprisingly, we
discovered a "noise reduction paradox" where denoising
degraded transcription quality for seven of ten models,
with statistically significant increases in Word Error Rate
(WER) and Character Error Rate (CER) for general-purpose
base, small, and medium models. Only the domain-adapted
whisper-medium-68000-abbr model showed statistically
significant improvement (p=0.0008), while large-scale
models (large-v2, large-v3) exhibited robustness with
negligible changes. Effect sizes remained small (Cohen's d
< 0.2) across all models. These counterintuitive findings
suggest modern ASR systems implicitly utilize background
noise characteristics as informative features, and that
preprocessing pipelines should be reconsidered for
domain-specific applications. Our results provide practical
guidance for medical speech processing system design.
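
The abstract does not say how WER, CER, or Cohen's d were computed. The following is a minimal sketch, assuming the standard Levenshtein-based definitions of WER and CER and the paired-differences form of Cohen's d (mean difference divided by the standard deviation of the differences). The function names and the example sentences are hypothetical and are not taken from MIAC or from the authors' pipeline.

```python
from statistics import mean, stdev


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count.

    Assumes a non-empty reference transcript.
    """
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length.

    Assumes a non-empty reference transcript.
    """
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def cohens_d_paired(before, after):
    """Cohen's d for paired samples (assumed variant): mean(diff) / stdev(diff)."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / stdev(diffs)


# Hypothetical example: one reference and one ASR hypothesis with a single word error.
reference = "the thyroid shows no focal lesions"
hypothesis = "the thyroid shows no vocal lesions"
example_wer = wer(reference, hypothesis)
example_cer = cer(reference, hypothesis)

# Hypothetical per-utterance WER before and after denoising (not study data).
example_d = cohens_d_paired([0.10, 0.20, 0.30], [0.12, 0.21, 0.33])
```

With per-utterance scores for the raw and denoised versions of the same recordings, a positive paired d would indicate that denoising raised the error rate. The abstract describes effects below 0.2 in magnitude as small.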

Authors

Bartlomiej Mroz

Assistant Professor, Gdańsk University of Technology

PhD, Spatial Audio & Immersive Media Researcher, Recording Engineer, Statistics enthusiast

Szymon Zaporowski

Gdańsk University of Technology

Foyer Building 303A Technical University of Denmark Asmussens Alle, Building 303A DK-2800 Kgs. Lyngby Denmark

AI and Machine Learning in Audio, Poster | Audio Processing, Poster | Recording Production and Reproduction, Poster

Presentation Type Poster

AES Europe 2026
