Speech Quality and Environmental Noise Classifier
This is a binary audio classification model that determines if a speech recording is clean or if it is degraded by environmental noise.
It is specifically trained to tell source artifacts (such as hiss, clipping, or a poor microphone) apart from actual background noise (like cars, music, or other people talking).
- LABEL_0 (clean): The audio contains speech with no significant environmental noise. This includes high-quality recordings as well as recordings with source artifacts like hiss, clipping, or "bad microphone" quality.
- LABEL_1 (noisy): The audio contains speech that is obscured by external, environmental background noise.
Intended Uses & Limitations
This model is ideal for:
- Pre-processing a large audio dataset to filter for clean samples.
- Automatically tagging audio clips for quality control.
- Gating input to ASR (Automatic Speech Recognition) systems that perform better on clean audio.
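The filtering and gating use cases above can be sketched as a small partition step. This is a minimal sketch, not the author's pipeline: `split_by_quality` and `fake_classify` are hypothetical names, `classify` is assumed to be any callable that returns a list of `{"label", "score"}` dictionaries (the format shown under How to Use), and the 0.7 default is borrowed from the author's note in that section.

```python
def split_by_quality(paths, classify, threshold=0.7):
    """Partition audio paths into (clean, noisy) lists.

    `classify` is assumed to map a path to a list of
    {"label": ..., "score": ...} dictionaries.
    """
    clean, noisy = [], []
    for path in paths:
        # Index scores by label so the result does not depend on list order.
        scores = {item["label"]: item["score"] for item in classify(path)}
        if scores.get("clean", 0.0) > threshold:
            clean.append(path)
        else:
            noisy.append(path)
    return clean, noisy


# Hypothetical stand-in classifier so the sketch runs without the model.
def fake_classify(path):
    score = 0.95 if "studio" in path else 0.10
    return [
        {"label": "clean", "score": score},
        {"label": "noisy", "score": 1.0 - score},
    ]


kept, rejected = split_by_quality(
    ["studio_take.wav", "street_take.wav"], fake_classify
)
print(kept, rejected)
```

With the real model, the pipeline object itself can be passed as `classify`; only the files in the first list would then be forwarded to the ASR system or kept in the dataset.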
Limitations:
- This model is a classifier, not a noise-reduction tool. It only tells you if environmental noise is present.
- Its definition of "noisy" is based on environmental sounds. It is trained to classify audio with only source artifacts (like microphone hum or pure static) as clean.
How to Use
The easiest way to use this model is with a pipeline.
Install the dependencies:
pip install transformers torch
Then run the classifier from Python:
from transformers import pipeline
classifier = pipeline("audio-classification", model="Etherll/NoisySpeechDetection-v0.2")
# Classify a local audio file (must be a WAV or other supported format)
# The pipeline automatically handles resampling to 16kHz.
results = classifier("path/to/your_audio_file.wav")
# The result is a list of dictionaries
# [{'score': 0.9979726672172546, 'label': 'clean'},
# {'score': 0.002027299487963319, 'label': 'noisy'}]
print(results)
Note: The model outputs a confidence score for each label. In my use case, I consider audio to be clean if the score for the clean label is greater than 0.7.
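The threshold rule in the note can be written as a small helper. This is a sketch under stated assumptions: `clean_score` and `is_clean` are hypothetical names, and the code assumes the pipeline reports the labels as `clean` and `noisy`, as in the example output above.

```python
def clean_score(results):
    """Return the score attached to the 'clean' label, or 0.0 if it is absent."""
    # Look the label up by name instead of relying on the order of the list.
    for item in results:
        if item["label"] == "clean":
            return item["score"]
    return 0.0


def is_clean(results, threshold=0.7):
    """Treat the audio as clean only if the 'clean' score exceeds the threshold."""
    return clean_score(results) > threshold


# Example output copied from the usage snippet above.
example = [
    {"score": 0.9979726672172546, "label": "clean"},
    {"score": 0.002027299487963319, "label": "noisy"},
]
print(is_clean(example))  # True
```

The 0.7 cut-off is a choice for one use case rather than a property of the model; a stricter value rejects more borderline clips, a looser one keeps more of them.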
Training Data
This model was trained on a custom-built dataset of ~55,000 audio clips, designed to teach the difference between source artifacts and environmental noise.
This Whisper model was trained 2x faster with Unsloth and Hugging Face's TRL library.