Speech Quality and Environmental Noise Classifier

This is a binary audio classification model that determines whether a speech recording is clean or degraded by environmental noise.

It is trained specifically to distinguish clean audio from audio with actual background noise (such as cars, music, or other people talking).

  • LABEL_0: clean: The audio contains speech with no significant environmental noise. This includes high-quality recordings as well as recordings with source artifacts like hiss, clipping, or "bad microphone" quality.
  • LABEL_1: noisy: The audio contains speech that is obscured by external, environmental background noise.

Intended Uses & Limitations

This model is ideal for:

  • Pre-processing a large audio dataset to filter for clean samples.
  • Automatically tagging audio clips for quality control.
  • Gating input to ASR (Automatic Speech Recognition) systems that perform better on clean audio.
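
A minimal sketch of the dataset-filtering use case follows. It is an illustration, not the author's pipeline: `keep_clean` and `classify` are hypothetical names, and `classify` is assumed to be any callable that takes a file path and returns the list of label/score dictionaries shown under How to Use (for example, the pipeline object itself). The label string "clean" is taken from the sample output and is also an assumption.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical type alias: a classifier maps a file path to
# [{"label": ..., "score": ...}, ...], as in the sample output.
Classifier = Callable[[str], List[Dict]]


def keep_clean(paths: Iterable[str], classify: Classifier) -> List[str]:
    """Return the paths whose highest-scoring label is 'clean'."""
    kept = []
    for path in paths:
        scores = classify(path)
        # Pick the entry with the highest confidence score.
        top = max(scores, key=lambda item: item["score"])
        if top["label"] == "clean":  # assumed label name
            kept.append(path)
    return kept
```

Taking the classifier as a parameter keeps the filtering logic independent of model loading, so the same function could sit in front of an ASR system as a gate.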

Limitations:

  • This model is a classifier, not a noise-reduction tool. It only tells you if environmental noise is present.
  • Its definition of "noisy" is based on environmental sounds. It is trained to classify audio with only source artifacts (like microphone hum or pure static) as clean.

How to Use

The easiest way to use this model is with a pipeline. First install the dependencies:

pip install transformers torch

Then load the model and classify a file:

from transformers import pipeline

classifier = pipeline("audio-classification", model="Etherll/NoisySpeechDetection-v0.2")

# Classify a local audio file (must be a WAV or other supported format)
# The pipeline automatically handles resampling to 16kHz.
results = classifier("path/to/your_audio_file.wav")

# The result is a list of dictionaries
# [{'score': 0.9979726672172546, 'label': 'clean'},
# {'score': 0.002027299487963319, 'label': 'noisy'}]
print(results)

Note: The model outputs a confidence score for each label. In my use case, I consider audio to be clean if the score for the clean label is greater than 0.7.
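
The threshold rule in the note can be sketched as a small helper. This is only an illustration: `is_clean` is a hypothetical name, the 0.7 default is the value from the note rather than a tuned recommendation, and the label string "clean" is assumed from the sample output.

```python
def is_clean(results, threshold=0.7):
    """Return True if the 'clean' score exceeds the threshold.

    `results` is the list of {"score": ..., "label": ...} dictionaries
    returned by the classifier for one audio file.
    """
    scores = {item["label"]: item["score"] for item in results}
    # A missing 'clean' label is treated as a score of 0.0.
    return scores.get("clean", 0.0) > threshold


example = [
    {"score": 0.9979726672172546, "label": "clean"},
    {"score": 0.002027299487963319, "label": "noisy"},
]
print(is_clean(example))  # True, since 0.9979... > 0.7
```

Raising the threshold keeps fewer but more confidently clean clips. Lowering it keeps more clips at the cost of admitting some noisy ones.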

Training Data

This model was trained on a custom-built dataset of ~55,000 audio clips, designed to teach the difference between environmental noise and source artifacts.

This Whisper model was trained 2x faster with Unsloth and Hugging Face's TRL library.

Model size: 88.4M params (Safetensors, tensor type BF16)