The rapid development of artificial intelligence composition technology has brought innovation to music creation. However, current deep learning music generation models often neglect the global correlation of emotional features, resulting in fragmented emotional expression in generated works and insufficient alignment with human emotional perception, making it difficult to meet the core demand for emotional conveyance in diverse music creation. This study proposes a music generation method that integrates a global perception mechanism for emotional features. Using the preprocessed EMOPIA and VGMIDI datasets, an improved model based on EMelodyGen (EMelodyGen-PPO) is constructed: a GLU network layer is introduced in the feature extraction stage to enhance the model's ability to filter and represent emotion-related features; an improved PPO-Clip algorithm is integrated in the training process, and a multi-dimensional emotional reward function is designed to achieve global dynamic perception and optimization of emotional features. Experimental results show that the music21 parsing rate of the EMelodyGen-PPO model on the target datasets is 3% and 4% higher than that of the baseline model, respectively. An automated quality assessment system based on fluency, rhythm stability, harmony richness, melodic smoothness, and structural integrity verifies that the comprehensive score of the model's generated works is significantly better than that of the comparative model. This study provides an efficient technical path for emotion-oriented music generation, which can empower grassroots cultural workers and independent musicians at low cost, facilitate diverse music creation practices and emotional audio content dissemination, and align with the diversity and innovative development concept of the AES audio community.
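As a point of reference for the training step mentioned above, the following is a minimal sketch of the standard PPO-Clip surrogate objective. It is not the improved variant the abstract refers to, whose details are not given here; the function names and the default clipping range of 0.2 are assumptions for illustration.

```python
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    """Standard PPO-Clip surrogate for one sample (a quantity to maximise).

    ratio     : new-policy probability / old-policy probability for the action
    advantage : estimated advantage of that action
    epsilon   : clipping range (0.2 is an assumed, commonly used default)
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes the incentive to move the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Mean clipped surrogate over a batch of samples."""
    terms = [ppo_clip_term(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)
```

In an emotion-oriented setting, the advantage would be derived from whatever reward the model designer defines; the multi-dimensional emotional reward named in the abstract would enter at that point.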
In this paper, we analyze two main factors, Bonafide Resource (BR) and AI-based Generator (AG), that affect the performance and the generality of a Deepfake Speech Detection (DSD) model. To this end, we first propose a deep-learning based model, referred to as the baseline. We then conduct experiments on the baseline to show how the BR and AG factors affect the threshold score used to classify input audio as fake or bonafide at inference. Based on the experimental results, we propose a dataset that re-uses public DSD datasets and balances BR and AG. We then train various deep-learning based models on the proposed dataset and conduct cross-dataset evaluation on different benchmark datasets. The cross-dataset evaluation results show that the balance of BR and AG is the key factor in training a general DSD model.
We investigate how training data composition influences semantic audio encoders that learn perceptual descriptors such as "warm," "bright," and "muddy" from equalization (EQ) parameter datasets without labeled audio examples. Using the SAFE-DB dataset of 1,369 labeled EQ settings, we train audio encoders via an inverse problem formulation in which labeled EQ parameters are applied to source audio and the encoder is trained to recognize the resulting semantic characteristics. Three training configurations are compared, varying both class sampling strategy (uniform versus balanced) and source audio type (pink noise versus real music). Despite severe class imbalance in SAFE-DB, where 76 percent of examples are labeled "bright" or "warm," balanced class sampling combined with mixed-source training (50 percent pink noise and 50 percent FMA music) successfully learns physically meaningful semantic-spectral relationships: "warm" and "muddy" show negative correlation with spectral centroid (r = -0.56), while "bright" and "thin" show positive correlation (r = +0.49). However, prediction confidence decreases substantially (from 0.96 to 0.76 to 0.86), and top-1 predictions remain dominated by the "bright" class across all evaluated music genres, reflecting inherent dataset bias rather than training failure. These results demonstrate that training data composition significantly affects model calibration but cannot fully overcome fundamental bias in the underlying label distribution, highlighting key challenges for semantic audio understanding systems.
This study presents a voice-centered machine learning framework for detecting mental fatigue in military personnel, integrating acoustic analysis with physiological biosensors to enhance detection robustness. Mental fatigue poses critical safety and performance challenges in military operations, yet cultural stigma often prevents self-reporting. We collected multi-modal data from 23 participants across two fatigue states, extracting comprehensive acoustic features including sound pressure level (SPL), formants, mel-frequency cepstral coefficients (MFCCs), jitter, shimmer, harmonic-to-noise ratio (HNR), and temporal speech characteristics. These voice features were combined with electroencephalography (EEG), photoplethysmography (PPG), and temperature data to train multiple machine learning classifiers. The voice-based models achieved accuracies between 82% and 85%, with support vector machines (SVM) and long short-term memory (LSTM) networks demonstrating superior performance. When acoustic features were combined with physiological markers, classification accuracy improved to 92%, with Classification and Regression Trees (CART) and Linear Discriminant Analysis (LDA) emerging as top performers. Statistical analysis identified SPL and formant variance as the most discriminative voice features, while Lempel-Ziv Complexity (LZC) and theta/beta ratio proved most reliable for EEG. Evaluation on new participants yielded 67% accuracy, revealing model generalization challenges that inform future research directions. This work demonstrates that voice-based machine learning systems, when augmented with physiological data, offer a promising non-invasive approach to real-time fatigue monitoring in operational military environments.
I’m a creative technologist and interaction designer exploring how sound, technology, and human experience meet. With an MScEng in Sound & Music Computing, I prototype audio interactions, build ML‑driven tools, and design experiments around perception. My background spans music...
Friday May 29, 2026, 9:00am - 11:00am CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
Current deep learning approaches to speech enhancement rely heavily on objective measures like mean squared error or scale-invariant signal-to-distortion ratio as both training objectives and evaluation metrics. While analytically convenient, these benchmarks often fail to capture the nuances of human perception or actual intelligibility. Furthermore, the inconsistent integration of metrics like Short-Term Objective Intelligibility or Perceptual Evaluation of Speech Quality into training and evaluation pipelines leaves a gap between algorithmic performance and perceptual reality. This paper proposes a transition towards evaluation methodologies grounded in psychoacoustics and audiological modeling. Our study explores two distinct methods to characterise enhanced signals. On one hand, we employ a perceptual approach based on the Cambridge loudness model to assess the preservation of spectral excitation patterns and perceived intensity. On the other hand, we adopt a biophysical approach by utilising CoNNear, a convolutional model of the human auditory periphery. This allows us to simulate representations of responses at different stages of the auditory periphery to observe how speech enhancement processing affects the physiological representation of speech. We analyse pre-trained speech enhancement models using automatic speech recognition and Short-Term Objective Intelligibility as an additional proxy for human intelligibility. By mapping automatic speech recognition performance against loudness and peripheral response patterns, we investigate the extent to which current enhancement strategies maintain the perceptual and physiological integrity of the speech signal. This work aims to identify features predictive of intelligibility, providing a foundation for speech enhancement systems optimised for the human listener rather than purely signal-based objective functions.
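For readers unfamiliar with the signal-based measures this abstract contrasts with perceptual models, here is a minimal sketch of the scale-invariant signal-to-distortion ratio (SI-SDR) in its usual definition. The function name and the plain-list signal representation are assumptions for illustration; the sketch assumes the estimate is not a perfect scaled copy of the reference, since that would make the distortion energy zero.

```python
import math


def si_sdr(reference, estimate):
    """Scale-invariant SDR in dB between two equal-length signals (lists of floats)."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    reference_energy = sum(r * r for r in reference)
    # Optimal scaling of the reference so that the residual is orthogonal to it.
    alpha = dot / reference_energy
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10(target_energy / residual_energy)
```

Because of the rescaling step, multiplying the estimate by any non-zero gain leaves the value unchanged, which is one reason the measure says little about perceived loudness or excitation patterns.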
Objective quality evaluation is widely used in speech coding, yet objective estimates often show limited agreement with subjective listening-test results. Rather than focusing on absolute score accuracy, this paper evaluates objective speech quality models from a decision-making perspective, defined as their ability to support comparative judgments between speech codecs or codec configurations. A formal ITU-T P.800 Absolute Category Rating (ACR) listening test was conducted with 30 listeners across 24 conditions, covering conventional and neural monophonic speech codecs operating under clear-channel conditions at sampling frequencies from 16 to 48 kHz and bit rates ranging from below 1 kbps to above 16 kbps. The speech material consisted of internally recorded, clean French-language speech that was not used in the development or training of any of the evaluated codecs or objective quality models. Seven objective quality models, namely PESQ, VISQOL Speech, VISQOL Audio, WARP-Q, NISQA, UTMOS, and DistillMOS, were evaluated on the same material. Decision-making performance was assessed by comparing subjective and objective rankings using Kendall’s rank correlation coefficient and by analyzing pairwise codec comparisons using t-tests at a 95% confidence level. The results show that some objective quality models are effective for comparing bit rate variations within a given speech coding technology, provided that all other codec parameters (e.g., sampling frequency) remain unchanged. However, all models exhibit limitations, including tendencies toward over- or underestimation for certain technologies, as well as reduced reliability when applied across different sampling frequencies. Despite its conventional origins, PESQ remains capable of supporting decision-making even when applied to neural speech codecs.
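The ranking comparison described above can be illustrated with a minimal sketch of Kendall's rank correlation. This is the tau-a form, which assumes no tied scores; the abstract does not say which variant was used, and the function name and example scores below are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(subjective, objective):
    """Kendall's tau-a between two equal-length score lists (ties count as neither)."""
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(subjective)), 2):
        # A pair is concordant when both score lists order conditions i and j the same way.
        product = (subjective[i] - subjective[j]) * (objective[i] - objective[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pair_count = len(subjective) * (len(subjective) - 1) // 2
    return (concordant - discordant) / pair_count


# Hypothetical scores for four codec conditions (not data from the study).
listening_test_mos = [4.2, 3.1, 2.5, 1.8]
model_prediction = [4.0, 3.3, 2.0, 2.2]
tau = kendall_tau_a(listening_test_mos, model_prediction)
```

A value of 1 means the objective model ranks every pair of conditions the same way as the listeners, which is the property that matters when the model is used to choose between codecs rather than to predict absolute scores.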
Spatial audio recording using higher-order Ambisonics offers rich directional information for medical speech capture, yet challenging hospital acoustic environments motivate preprocessing with neural denoising algorithms. This study investigates whether U-Net-based denoising of third-order ambisonic recordings improves automatic speech recognition (ASR) quality for medical applications. We developed the Medical Immersive Audio Corpus (MIAC), comprising 1,759 utterances (6.43 hours) of Polish medical speech recorded with a Zylia ZM-1 microphone in uncontrolled hospital environments, capturing 16-channel third-order Ambisonics across multiple specializations including thyroid ultrasonography, surgical procedures, and general diagnostics. We applied a U-Net architecture with dual attention mechanisms trained using the Noise2Noise paradigm to denoise the corpus, then evaluated transcription quality using ten Whisper ASR models ranging from 39 million to 1.55 billion parameters, including domain-adapted medical variants. Surprisingly, we discovered a "noise reduction paradox" where denoising degraded transcription quality for seven of ten models, with statistically significant increases in Word Error Rate (WER) and Character Error Rate (CER) for general-purpose base, small, and medium models. Only the domain-adapted whisper-medium-68000-abbr model showed statistically significant improvement (p=0.0008), while large-scale models (large-v2, large-v3) exhibited robustness with negligible changes. Effect sizes remained small (Cohen's d < 0.2) across all models. These counterintuitive findings suggest that modern ASR systems implicitly utilize background noise characteristics as informative features, and that preprocessing pipelines should be reconsidered for domain-specific applications. Our results provide practical guidance for medical speech processing system design.
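The Word Error Rate used above is conventionally the word-level edit distance divided by the number of reference words. The sketch below shows that standard definition; the function name and the English example strings are hypothetical, and any text normalisation applied in the study is not reproduced. It assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    reference_words = reference.split()
    hypothesis_words = hypothesis.split()
    # previous[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hypothesis_words) + 1))
    for i, reference_word in enumerate(reference_words, start=1):
        current = [i]
        for j, hypothesis_word in enumerate(hypothesis_words, start=1):
            substitution = previous[j - 1] + (reference_word != hypothesis_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(reference_words)
```

Character Error Rate follows the same recurrence applied to characters instead of words.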
Music source separation (MSS) systems are commonly used in production, remixing, and audio analysis work, yet questions arise regarding the extent to which objective evaluations of model performance align with human perceptual evaluations, particularly when tasked with non-traditional source material (in this case, heavily processed electronic music). This study seeks to set a framework for an evaluation of three machine learning approaches to MSS: a spectrogram-domain model (spleeter), a waveform-domain model (Demucs v2), and a hybrid-domain model (HTDemucs). Subjective evaluations of model performance were collected via a MUSHRA-style listening test, while objective evaluations used signal-to-distortion ratio (SDR) and Fréchet Audio Distance (FAD). Results showed consistent agreement across objective metrics, with the hybrid-domain model outperforming the two single-domain models. Perceptual ratings also favored the hybrid model; notably, listeners occasionally rated the model output as equal to or better in quality than the original reference. Preliminary analysis indicates some moderate but statistically non-significant correlations between the two assessment paths, reinforcing concerns about relying solely on numerical evaluations when discussing MSS model performance. Implications for model design and future evaluation procedures are discussed.
HAMLET is a research project that investigates the integration of Artificial Intelligence and co-creation practices within the creative industries. The project proposes AI-driven enablers to support artists through collaborative workflows between creative practitioners and technology providers. This work focuses on an automated sound design framework for text-based role-playing games, where the game narration is dynamically generated through player textual interaction with an LLM. To address this unpredictability, the proposed system generates adaptive soundscapes automatically from textual scene descriptions. An LLM identifies semantically relevant sound sources, which are then matched to audio libraries through metadata alignment. The files are assessed for quality and fed to an automated mixing module. The framework addresses challenges related to semantic alignment, audio quality, aesthetic balance, and file size constraints.
Dr. Nikolaos Vryzas was born in Thessaloniki in 1990. He studied Electrical & Computer Engineering in the Aristotle University of Thessaloniki (AUTh). After graduating, he received his master degrees on Information and Communication Audio Video Technologies for Education & Production from the Interdepartme...
Friday May 29, 2026, 9:00am - 11:00am CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark
Virtual Microphone Array techniques are being investigated by the authors to support room acoustics optimisation in live sound environments. In our recent AES paper, “Room Acoustics Optimisation Using Virtual Microphone Arrays”, a notable outcome was that a compact four-microphone tetrahedral array performed strongly relative to its low sensor count. Recent virtual sensing and Remote Microphone Technique research treats microphone placement as an explicit design variable. It reports improved remote estimation performance when microphone layouts are deliberately chosen for the task, rather than adopted as fixed, standard configurations. This submission builds on our prior VMA work by focusing on the four-microphone case, where geometry choices are especially constrained. We compare a tetrahedral baseline with an ensemble of stochastically generated spherical layouts at the same array aperture using Monte Carlo simulation. We apply a consistent evaluation protocol across multiple listening-region offsets and standard beamforming estimators to isolate variability due to geometry alone. The central proposition is that, for low-count VMAs, geometry is a first-order design parameter. The tetrahedral layout remains a credible baseline, but lightweight stochastic exploration can reveal alternative layouts that are competitive and, in some cases, superior without increasing channel count.
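The geometry comparison above can be sketched as follows, assuming that "same array aperture" means all four capsules lie on a sphere of the same radius and that random layouts are drawn uniformly over that sphere. Both are assumptions, as are all function names and the seed handling; the beamforming evaluation itself is not reproduced here.

```python
import math
import random


def tetrahedral_layout(radius):
    """Four capsule positions at the vertices of a regular tetrahedron on a sphere."""
    scale = radius / math.sqrt(3.0)
    corners = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
    return [tuple(scale * c for c in corner) for corner in corners]


def random_spherical_layout(radius, rng, count=4):
    """Positions drawn uniformly on the sphere by normalising Gaussian vectors."""
    layout = []
    while len(layout) < count:
        vector = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in vector))
        if norm > 1e-12:  # skip the (practically impossible) zero vector
            layout.append(tuple(radius * c / norm for c in vector))
    return layout


def monte_carlo_layouts(radius, trials, seed=0):
    """Ensemble of candidate layouts to be scored by the same evaluation protocol."""
    rng = random.Random(seed)
    return [random_spherical_layout(radius, rng) for _ in range(trials)]
```

Each candidate in the ensemble would then be passed through the same estimator and listening-region offsets as the tetrahedral baseline, so that any difference in the score is attributable to geometry.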
Brian de Brit is a lecturer in the School of Electrical and Electronic Engineering at Technological University Dublin. He holds a B.Sc. in Mathematical Physics (University College Dublin), an M.Phil. in Music and Media Technologies (Trinity College Dublin), and a Master of Engineering...
This paper introduces clustered virtual microphone arrays as a step toward improving listener-level virtual microphone estimation for live sound. Multiple compact microphone sub-arrays are placed around a nominal overhead position. Each sub-array produces a virtual microphone estimate, and the estimates are fused. The aim is to attack the estimation problem from multiple viewpoints and reduce sensitivity to any one array placement or geometry. The work builds on our earlier paper, “Room Acoustics Optimisation Using Virtual Microphone Arrays”. That paper proposed virtual microphones estimated from an overhead array as a measurement layer for live sound optimisation. It also highlighted a key limitation: in its initial form, virtual microphone estimation quality was not yet strong enough for reliable use across positions. The present paper targets that limitation. We outline the clustered array idea and treat cluster count and inter-cluster spacing as design parameters. Virtual microphones are estimated using beamforming and combined using simple fusion. Performance is assessed with objective signal measures, including SNR and frequency- and phase-related error measures, across multiple listener-level target positions. The results support further refinement under more realistic room conditions and further study of the link between improved estimation quality and FIR-based correction outcomes.
Brian de Brit is a lecturer in the School of Electrical and Electronic Engineering at Technological University Dublin. He holds a B.Sc. in Mathematical Physics (University College Dublin), an M.Phil. in Music and Media Technologies (Trinity College Dublin), and a Master of Engineering...
Loudspeaker array beamforming technology has been widely used; however, current frequency-domain and time-domain design methods for calculating FIR filters face challenges, including the need for modeling delay and high computational complexity. To address these issues, this paper proposes a time–frequency integrated framework. This framework supports both pressure matching and amplitude matching methods, enabling not only the realization of traditional superdirective beams but also the design of frequency-invariant beams. For the nonlinear optimization problem in amplitude matching, an efficient solving algorithm based on the Alternating Direction Method of Multipliers (ADMM) is introduced. Experimental results demonstrate that the proposed method combines the advantages of existing frequency-domain and time-domain approaches, directly computing FIR filter coefficients without delay modeling while maintaining high computational efficiency. This provides an effective solution for beam control in loudspeaker arrays.
The Exponential Sine Sweep (ESS) technique, popularized by Angelo Farina, has become a cornerstone of modern electroacoustic measurement due to its unique capability to simultaneously extract a system’s linear impulse response and its individual harmonic distortion components. Standard implementation of this method almost exclusively utilizes a low-to-high (upward) exponential sine sweep. However, during a technical Q&A session at the AES Europe 2025 Convention in Warsaw, a question was raised: what are the practical consequences of reversing the sweep direction? This inquiry is particularly relevant given that several industry-standard measurement platforms often employ high-to-low (downward) sweeps to optimize the mechanical and thermal stability of the device under test (DUT) while performing stepped or swept sinusoidal analysis. This paper provides an investigation into the temporal behavior of nonlinearities when the frequency gradient of an exponential sweep is inverted. Through formal mathematical derivation and numerical simulations, the study proves that while the spacing between distortion orders remains identical in magnitude, the polarity and time distribution of these impulses is reversed. Specifically, we demonstrate that in a downward sweep, the distortion products shift from the "pre-causal" negative time region to the "post-causal" positive time region. This shift causes harmonic distortion pulses to emerge within the reverberant tail of the impulse response, leading to significant contamination of decay measurements and energy-time curves. By contrasting the "tracking filter" paradigm with "time-domain deconvolution," this work clarifies why sweep direction is a critical parameter that must be aligned with the specific goals of the measurement protocol.
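The time shift described above can be made concrete with the well-known spacing for an exponential sweep of duration T between f_low and f_high: the N-th order harmonic response is displaced from the linear response by T·ln(N)/ln(f_high/f_low). The sketch below applies that magnitude with the sign convention stated in the abstract (negative time for an upward sweep, positive for a downward sweep); the function name and the example sweep parameters are assumptions.

```python
import math


def harmonic_offset_seconds(order, duration, f_low, f_high, upward=True):
    """Time position of the N-th harmonic response relative to the linear response.

    order    : harmonic order N (1 is the linear response itself)
    duration : sweep length T in seconds
    f_low, f_high : sweep frequency limits in Hz
    upward   : True for a low-to-high sweep, False for a high-to-low sweep
    """
    spacing = duration * math.log(order) / math.log(f_high / f_low)
    # Upward sweep: distortion lands before the linear response (negative time).
    # Downward sweep: same spacing, but after it (positive time).
    return -spacing if upward else spacing


# Assumed example: a 10 s sweep covering 20 Hz to 20 kHz.
second_harmonic_up = harmonic_offset_seconds(2, 10.0, 20.0, 20000.0, upward=True)
second_harmonic_down = harmonic_offset_seconds(2, 10.0, 20.0, 20000.0, upward=False)
```

With these assumed parameters the second-harmonic response sits roughly one second from the linear response, which for the downward sweep places it well inside the decay of a typical room response.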
The purpose of this study was to evaluate a sonification system that maps live heart rate data to real-time spectral filtering of a runner's preferred music. Assessed using a within-subjects design (n = 13), the system employs high-pass and low-pass filters to indicate deviations from target heart rate zones, providing instantaneous biofeedback without requiring visual attention. Quantitative analysis revealed no statistically significant differences in target zone accuracy or response time between auditory, visual, and combined conditions. However, qualitative thematic analysis identified a clear division in user preference. Participants favouring the auditory condition demonstrated faster mean response times to audio biofeedback. Findings suggest that while sonification promotes environmental focus and "gamifies" training, its efficacy is highly dependent on individual processing styles and music familiarity.
This paper proposes a psychoacoustic-based audio-visual mapping framework for intelligent vehicle cabins to enhance immersion and stabilize spatial auditory perception. By establishing mappings between auditory descriptors, such as Direction of Arrival (DOA), spectral centroid, and temporal envelope, and ambient lighting parameters, the framework leverages "ambient vision" to augment the perceptual experience without increasing the driver's cognitive load. Theoretical analysis based on Stevens’ Power Law indicates that the proposed mapping strategies effectively synchronize audio-visual intensities and mitigate perceptual fatigue, providing a conceptual reference for future multisensory HMI design.
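Stevens' Power Law states that perceived magnitude grows as a power of stimulus intensity, ψ = k·I^a. The sketch below shows one way an intensity match between sound and light could be derived from it; this is an illustration, not the mapping proposed in the paper. The exponents 0.67 (loudness) and 0.33 (brightness) are commonly quoted textbook values used here as assumptions, as are the unit scaling constants and all function names.

```python
def perceived_magnitude(intensity, exponent, k=1.0):
    """Stevens' Power Law: psi = k * intensity ** exponent."""
    return k * intensity ** exponent


def matched_light_intensity(sound_intensity, sound_exponent=0.67, light_exponent=0.33):
    """Light intensity whose perceived magnitude equals that of the given sound.

    Both scaling constants k are assumed to be 1, so intensities are in
    arbitrary relative units.
    """
    target = perceived_magnitude(sound_intensity, sound_exponent)
    # Invert the power law for the visual channel.
    return target ** (1.0 / light_exponent)
```

Because the assumed brightness exponent is smaller than the loudness exponent, a given relative increase in sound intensity calls for a larger relative increase in light intensity to be perceived as an equal step.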
The diversification of audio content production has increased the demand for realistic, immersive sound field reproduction. Conventional methods struggle to separate direct and reflected sounds, limiting accuracy. To address this issue, this study proposes a method for sound field reproduction that identifies the arrival directions of reflected sounds based on the virtual sound source distribution. The virtual sound source distribution was calculated using the closely located four-point microphone method. Assuming that spherical waves emitted from distant virtual sound sources arrive as plane waves within the listening area, the target sound field is generated through plane wave synthesis, enabling more accurate and flexible sound field generation. Furthermore, considering practical systems and typical room shapes, we investigated the reproducibility of plane wave sound fields using not only a spherical array but also a cube-like loudspeaker array configured by the Lamé function, which allows continuous geometric transformation from a sphere to a cube-like form. The ideal plane wave sound field derived from the wave equation was regarded as the reference, and the sound fields generated by the loudspeaker arrays were evaluated and compared using mean square error (MSE). The evaluation was also extended beyond a single time instant, enabling assessment that accounts for temporal variations. The results indicated that changing the order of the Lamé function maintained the desired level of reproducibility. Consequently, it was confirmed that cube-like loudspeaker arrays can achieve a level of reproducibility equivalent to that of the spherical array.
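The sphere-to-cube transformation can be sketched by assuming the Lamé surface takes the superellipsoid form |x|^p + |y|^p + |z|^p = r^p, where order p = 2 gives a sphere and larger p approaches a cube. That form, the function names, and the plain-list field representation are assumptions; the abstract does not give the exact parameterisation or the sound field computation.

```python
def lame_surface_point(direction, radius, order):
    """Scale a direction vector so it lies on |x|^p + |y|^p + |z|^p = radius^p."""
    x, y, z = direction
    p_norm = (abs(x) ** order + abs(y) ** order + abs(z) ** order) ** (1.0 / order)
    scale = radius / p_norm
    return (x * scale, y * scale, z * scale)


def mean_square_error(reference, reproduced):
    """MSE between reference and reproduced field samples at the same points."""
    squared = [(a - b) ** 2 for a, b in zip(reference, reproduced)]
    return sum(squared) / len(squared)
```

Applying `lame_surface_point` to one fixed set of loudspeaker directions with increasing order moves the array smoothly from spherical towards cube-like, so that the MSE against the ideal plane wave field can be tracked as a function of order.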
Innovative railway vehicle systems such as high-speed rail, maglev, and emerging transportation concepts are expected to reduce conventional noise sources related to wheel–rail and aerodynamic interactions. As these changes alter the acoustic characteristics inside railway cabins, reliable laboratory reproduction of interior noise becomes increasingly important for evaluating passenger acoustic comfort and guiding sound design during vehicle development. The study focuses on practical methods for assessing reproduction accuracy. Conventional validation of reproduced sound fields typically relies on sound pressure level and spectral matching; however, these metrics alone may not fully reflect perceptually relevant differences between in-situ and reproduced environments. In this work, sound quality indices are employed as complementary evaluation metrics to examine whether reproduced sound fields maintain perceptually meaningful characteristics of the original cabin noise. Comparisons between in-situ recordings and reproduced sound fields were conducted in terms of overall sound pressure level, frequency characteristics, and selected sound quality indices. In addition, the influence of loudspeaker number and spatial configuration on reproduction performance was examined. The results show that sound quality–based evaluation provides useful additional information for assessing perceptual fidelity and for optimizing spatial sound reproduction systems for railway cabin noise.
The proposed reproduction platform supports laboratory-based assessment of interior railway noise and provides a practical framework for perceptually informed acoustic evaluation and noise control during the design of next-generation railway vehicles.
Yonghee Lee, Ph.D., Mechanical Engineering. Ultrasonic, Acoustic, SHM, NDE, fNIRS, and Bio-medical engineering. Contact: [email protected] Institute: Changwon National University, South Korea
Friday May 29, 2026, 1:00pm - 3:00pm CEST, Foyer Building 303A, Technical University of Denmark, Asmussens Alle, Building 303A, DK-2800 Kgs. Lyngby, Denmark