We investigate how training data composition influences semantic audio encoders that learn perceptual descriptors such as "warm," "bright," and "muddy" from equalization (EQ) parameter datasets without labeled audio examples. Using the SAFE-DB dataset of 1,369 labeled EQ settings, we train audio encoders via an inverse problem formulation in which labeled EQ parameters are applied to source audio and the encoder is trained to recognize the resulting semantic characteristics. Three training configurations are compared, varying both class sampling strategy (uniform versus balanced) and source audio type (pink noise versus real music). Despite severe class imbalance in SAFE-DB, where 76 percent of examples are labeled "bright" or "warm," an encoder trained with balanced class sampling and mixed-source audio (50 percent pink noise and 50 percent FMA music) learns physically meaningful semantic-spectral relationships: "warm" and "muddy" show negative correlation with spectral centroid (r = -0.56), while "bright" and "thin" show positive correlation (r = +0.49). However, prediction confidence decreases substantially (from 0.96 to 0.76 to 0.86), and top-1 predictions remain dominated by the "bright" class across all evaluated music genres, reflecting inherent dataset bias rather than training failure. These results demonstrate that training data composition significantly affects model calibration but cannot fully overcome fundamental bias in the underlying label distribution, highlighting key challenges for semantic audio understanding systems.