Current deep learning approaches to speech enhancement rely heavily on objective measures like mean squared error or scale-invariant signal-to-distortion ratio as both training objectives and evaluation metrics. While analytically convenient, these benchmarks often fail to capture the nuances of human perception or actual intelligibility. Furthermore, the inconsistent integration of metrics like Short-Term Objective Intelligibility or Perceptual Evaluation of Speech Quality into training and evaluation pipelines leaves a gap between algorithmic performance and perceptual reality. This paper proposes a transition towards evaluation methodologies grounded in psychoacoustics and audiological modeling. Our study explores two distinct methods to characterise enhanced signals. On one hand, we employ a perceptual approach based on the Cambridge loudness model to assess the preservation of spectral excitation patterns and perceived intensity. On the other hand, we adopt a biophysical approach by utilising CoNNear, a convolutional model of the human auditory periphery. This allows us to simulate representations of responses at different stages of the auditory periphery to observe how speech enhancement processing affects the physiological representation of speech. We analyse pre-trained speech enhancement models using automatic speech recognition and Short-Term Objective Intelligibility as an additional proxy for human intelligibility. By mapping automatic speech recognition performance against loudness and peripheral response patterns, we investigate the extent to which current enhancement strategies maintain the perceptual and physiological integrity of the speech signal. This work aims to identify features predictive of intelligibility, providing a foundation for speech enhancement systems optimised for the human listener rather than purely signal-based objective functions.
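To make the two signal-based objectives named above concrete, the following is a minimal sketch of mean squared error and scale-invariant signal-to-distortion ratio (SI-SDR) in plain Python. It is not taken from the paper: the function names, the `eps` stabiliser, and the assumption that both signals are equal-length, already zero-mean sample sequences are illustrative choices.

```python
import math


def mse(estimate, reference):
    """Mean squared error between two equal-length sample sequences."""
    return sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB.

    Assumes (hypothetically) that both inputs are zero-mean and have
    the same length; no mean removal is performed here.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0:
        raise ValueError("reference signal has zero energy")

    # Project the estimate onto the reference to get the scaled target.
    alpha = sum(e * r for e, r in zip(estimate, reference)) / ref_energy
    target = [alpha * r for r in reference]

    # Everything in the estimate that is not the scaled target is distortion.
    target_energy = sum(t * t for t in target)
    noise_energy = sum((e - t) ** 2 for e, t in zip(estimate, target))

    # eps guards against division by zero for a perfect estimate.
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
```

Rescaling the estimate changes the MSE but leaves the SI-SDR unchanged. Neither quantity encodes anything about loudness or auditory-periphery responses, which is the gap the abstract points to.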
Objective quality evaluation is widely used in speech coding, yet objective estimates often show limited agreement with subjective listening-test results. Rather than focusing on absolute score accuracy, this paper evaluates objective speech quality models from a decision-making perspective, defined as their ability to support comparative judgments between speech codecs or codec configurations. A formal ITU-T P.800 Absolute Category Rating (ACR) listening test was conducted with 30 listeners across 24 conditions, covering conventional and neural monophonic speech codecs operating under clear-channel conditions at sampling frequencies from 16 to 48 kHz and bit rates ranging from below 1 kbps to above 16 kbps. The speech material consisted of internally recorded, clean French-language speech that was not used in the development or training of any of the evaluated codecs or objective quality models. Seven objective quality models, namely PESQ, ViSQOL Speech, ViSQOL Audio, WARP-Q, NISQA, UTMOS, and DistillMOS, were evaluated on the same material. Decision-making performance was assessed by comparing subjective and objective rankings using Kendall’s rank correlation coefficient and by analyzing pairwise codec comparisons using t-tests at a 95% confidence level. The results show that some objective quality models are effective for comparing bit rate variations within a given speech coding technology, provided that all other codec parameters (e.g., sampling frequency) remain unchanged. However, all models exhibit limitations, including tendencies toward over- or underestimation for certain technologies, as well as reduced reliability when applied across different sampling frequencies. Despite its conventional origins, PESQ remains capable of supporting decision-making even when applied to neural speech codecs.
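The ranking comparison described above can be sketched as follows. This is an illustrative plain-Python implementation of Kendall's rank correlation in its tau-b form, which corrects for ties. The abstract does not state which variant was used, so the choice of tau-b and the function name are assumptions.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists.

    x and y hold one score per condition, e.g. subjective MOS and an
    objective model's prediction (hypothetical usage, not the paper's code).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")

    concordant = discordant = ties_x = ties_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:      # both scores order the pair the same way
            concordant += 1
        elif dx * dy < 0:    # the two scores disagree on the ordering
            discordant += 1

    n_pairs = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denom
```

A value of 1.0 means the objective model orders every pair of conditions as the listeners did, and -1.0 means the ordering is fully reversed. Each discordant pair corresponds to a codec comparison where the model would support the opposite decision.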
Music source separation (MSS) systems are commonly used in production, remixing, and audio analysis work, yet questions remain about the extent to which objective evaluations of model performance align with human perceptual evaluations, particularly when the models are applied to non-traditional source material (in this case, heavily processed electronic music). This study seeks to set a framework for an evaluation of three machine learning approaches to MSS: a spectrogram-domain model (Spleeter), a waveform-domain model (Demucs v2), and a hybrid-domain model (HTDemucs). Subjective evaluations of model performance were collected via a MUSHRA-style listening test, while objective performance was assessed using signal-to-distortion ratio (SDR) and Fréchet Audio Distance (FAD). Results showed consistent agreement across objective metrics, with the hybrid-domain model outperforming the two single-domain models. Perceptual ratings also favored the hybrid model; notably, listeners occasionally rated its output as equal to or better than the original reference in quality. Preliminary analysis indicates moderate but statistically non-significant correlations between the two assessment paths, reinforcing concerns about relying solely on numerical evaluations when discussing MSS model performance. Implications for model design and future evaluation procedures are discussed.
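As a rough illustration of the idea behind FAD, the sketch below computes the squared Fréchet distance between two Gaussians fitted to sets of embedding vectors, under the simplifying assumption that both covariance matrices are diagonal. This is not the full FAD computation, which uses full covariances and embeddings from a pretrained audio model. The function name and input layout are hypothetical.

```python
import statistics


def diagonal_frechet_distance(ref_embeddings, test_embeddings):
    """Squared Frechet distance between two diagonal-covariance Gaussians.

    Each argument is a list of equal-length embedding vectors (assumed
    to come from some audio embedding model, not specified here).
    With diagonal covariances, the trace term of the general formula
    reduces to a per-dimension squared difference of standard deviations.
    """
    dims = len(ref_embeddings[0])
    total = 0.0
    for d in range(dims):
        ref_col = [row[d] for row in ref_embeddings]
        test_col = [row[d] for row in test_embeddings]

        mean_diff = statistics.fmean(ref_col) - statistics.fmean(test_col)
        std_diff = statistics.pstdev(ref_col) - statistics.pstdev(test_col)

        total += mean_diff ** 2 + std_diff ** 2
    return total
```

A distance of zero means the two embedding sets have identical per-dimension means and variances. Unlike SDR, the measure compares distributions rather than a separated stem against its sample-aligned reference.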
The purpose of this study was to evaluate a sonification system that maps live heart rate data to real-time spectral filtering of a runner's preferred music. Assessed using a within-subjects design (n = 13), the system employs high-pass and low-pass filters to indicate deviations from target heart rate zones, providing instantaneous biofeedback without requiring visual attention. Quantitative analysis revealed no statistically significant differences in target zone accuracy or response time between auditory, visual, and combined conditions. However, qualitative thematic analysis identified a clear division in user preference. Participants favouring the auditory condition demonstrated faster mean response times to audio biofeedback. Findings suggest that while sonification promotes environmental focus and "gamifies" training, its efficacy is highly dependent on individual processing styles and music familiarity.
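One way such a heart-rate-to-filter mapping could work is sketched below. The abstract does not specify which filter signals which direction of deviation, nor any cutoff values, so everything here is an assumption made for illustration: the direction (low-pass when below the zone, high-pass when above), the cutoff ranges, the `max_deviation` parameter, and the function name.

```python
def filter_for_heart_rate(bpm, zone_low, zone_high, max_deviation=30.0):
    """Return (filter_type, cutoff_hz) for the current heart rate.

    Hypothetical mapping: inside the target zone the music is left
    unfiltered; the further the heart rate strays, the more audible
    the filtering becomes.
    """
    if zone_low <= bpm <= zone_high:
        return ("none", None)

    if bpm < zone_low:
        # Below the zone: low-pass, cutoff sweeps 20 kHz -> 500 Hz.
        depth = min((zone_low - bpm) / max_deviation, 1.0)
        cutoff = 20000.0 * (500.0 / 20000.0) ** depth
        return ("lowpass", cutoff)

    # Above the zone: high-pass, cutoff sweeps 20 Hz -> 2 kHz.
    depth = min((bpm - zone_high) / max_deviation, 1.0)
    cutoff = 20.0 * (2000.0 / 20.0) ** depth
    return ("highpass", cutoff)
```

The cutoff is interpolated on a logarithmic frequency scale, so equal steps in heart-rate deviation move the cutoff by equal musical intervals rather than by equal numbers of hertz.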
This paper proposes a psychoacoustic-based audio-visual mapping framework for intelligent vehicle cabins to enhance immersion and stabilize spatial auditory perception. By establishing mappings between ambient lighting parameters and auditory descriptors such as Direction of Arrival (DOA), spectral centroid, and temporal envelope, the framework leverages "ambient vision" to augment the perceptual experience without increasing the driver's cognitive load. Theoretical analysis based on Stevens’ Power Law indicates that the proposed mapping strategies effectively synchronize audio-visual intensities and mitigate perceptual fatigue, providing a conceptual reference for future multisensory HMI design.
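Stevens' Power Law models perceived magnitude as psi = k * I^a for a stimulus of physical intensity I. The sketch below shows how a sound level could be matched to a light level of equal perceived magnitude under that law. It is not the paper's mapping. The default exponents (0.67 for loudness against sound pressure, 0.33 for brightness) are commonly cited textbook values used here as placeholders, and the function names and unit-free scaling constants are assumptions.

```python
def perceived_magnitude(intensity, exponent, k=1.0):
    """Stevens' Power Law: psi = k * intensity ** exponent."""
    return k * intensity ** exponent


def matched_light_intensity(sound_pressure, sound_exp=0.67, light_exp=0.33,
                            k_sound=1.0, k_light=1.0):
    """Light intensity whose perceived magnitude equals that of the sound.

    Solves k_light * L ** light_exp == k_sound * S ** sound_exp for L.
    Exponents and constants are illustrative placeholders, not values
    reported in the abstract.
    """
    target = perceived_magnitude(sound_pressure, sound_exp, k_sound)
    return (target / k_light) ** (1.0 / light_exp)
```

Because the brightness exponent is smaller than the loudness exponent in this parameterisation, a given relative change in sound pressure calls for a larger relative change in light intensity to be perceived as equally strong.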