Voice Activity Detection¶
Frame-level voice activity detection with adaptive feature normalization and onset/hangover smoothing.
The detector extracts five features per frame:
| Index | Feature |
|---|---|
| 0 | Energy |
| 1 | Zero-crossing rate |
| 2 | Spectral entropy |
| 3 | Spectral flatness |
| 4 | Band energy ratio |
Features are normalized to [0.0, 1.0] using adaptive EMA tracking, then combined into a single weighted score (a weighted sum of the normalized features, with per-feature weights set by the weights parameter).
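The idea behind adaptive EMA normalization can be sketched as follows. This is not the library's internal implementation; it is a minimal illustration of tracking running low/high envelopes per feature with an exponential moving average (adaptation_rate as the smoothing coefficient) and rescaling into [0.0, 1.0].

```python
import numpy as np

class EmaNormalizer:
    """Illustrative adaptive normalizer: EMA-tracked min/max envelopes."""

    def __init__(self, num_features=5, adaptation_rate=0.05):
        self.rate = adaptation_rate
        self.lo = np.full(num_features, np.inf)   # running low envelope
        self.hi = np.full(num_features, -np.inf)  # running high envelope

    def update(self, features):
        features = np.asarray(features, dtype=np.float64)
        # Expand the envelopes immediately when a new extreme appears
        self.lo = np.minimum(self.lo, features)
        self.hi = np.maximum(self.hi, features)
        # Let both envelopes drift slowly toward the current value (EMA),
        # so the normalizer adapts if signal conditions change
        self.lo += self.rate * (features - self.lo)
        self.hi += self.rate * (features - self.hi)
        span = np.maximum(self.hi - self.lo, 1e-12)
        return np.clip((features - self.lo) / span, 0.0, 1.0)
```

This adaptivity is also why calibrate() exists: feeding known-silence frames first seeds the envelopes so the earliest live frames are already normalized sensibly.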
An onset/hangover state machine smooths the final decision.
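An onset/hangover state machine of this kind can be sketched as below. The parameter defaults shown are hypothetical (the library's actual defaults are applied when the constructor arguments are None); the logic is the standard one: speech is declared only after onset_frames consecutive frames score above the threshold, and once active, the decision is held for hangover_frames frames after the score falls back below it.

```python
def smooth_decisions(scores, threshold=0.4, onset_frames=3, hangover_frames=8):
    """Illustrative onset/hangover smoothing of a per-frame score sequence."""
    decisions = []
    onset_count = 0      # consecutive frames above threshold
    hangover_count = 0   # frames of hold remaining after score drops
    active = False
    for score in scores:
        if score >= threshold:
            onset_count += 1
            if onset_count >= onset_frames:
                # Enough consecutive loud frames: declare speech and
                # arm the hangover counter
                active = True
                hangover_count = hangover_frames
        else:
            onset_count = 0
            if active:
                if hangover_count > 0:
                    hangover_count -= 1  # hold the speech decision
                else:
                    active = False       # hangover expired: back to silence
        decisions.append(1 if active else 0)
    return decisions
```

The onset requirement suppresses single-frame clicks, while the hangover bridges short intra-word pauses so speech segments are not chopped up.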
- class pyminidsp.VAD(*, threshold=None, onset_frames=None, hangover_frames=None, adaptation_rate=None, band_low_hz=None, band_high_hz=None, weights=None)[source]¶
Voice activity detector with adaptive feature normalization and onset/hangover smoothing.
Wraps the miniDSP C library’s stateful VAD API. The detector extracts five features per frame (energy, ZCR, spectral entropy, spectral flatness, band energy ratio), normalizes them adaptively, computes a weighted score, and applies an onset/hangover state machine.
Example
>>> detector = VAD(threshold=0.4)
>>> detector.calibrate(silence_frame, sample_rate=16000.0)
>>> decision, score, features = detector.process_frame(frame, 16000.0)
```python
import pyminidsp as md

detector = md.VAD(threshold=0.4)

# Calibrate with silence
silence = md.white_noise(320, amplitude=0.0, seed=0)
for _ in range(10):
    detector.calibrate(silence, sample_rate=16000.0)

# Process a single frame
frame = md.sine_wave(320, amplitude=1.0, freq=1000.0, sample_rate=16000.0)
decision, score, features = detector.process_frame(frame, 16000.0)

# Batch-process an entire signal
signal = md.sine_wave(16000, amplitude=1.0, freq=1000.0, sample_rate=16000.0)
decisions, scores, features = detector.process(signal, 16000.0, frame_len=320)
```
- Parameters:
- calibrate(signal, sample_rate)[source]¶
Feed a known-silence frame to seed the adaptive normalization.
Call this on several silence frames before live processing to improve initial accuracy.
- Parameters:
signal (ArrayLike) – Frame samples (1-D array).
sample_rate (float) – Sample rate in Hz.
- Return type:
None
- process_frame(signal, sample_rate)[source]¶
Process one audio frame and return the VAD decision.
- Parameters:
signal (ArrayLike) – Frame samples (1-D array).
sample_rate (float) – Sample rate in Hz.
- Returns:
A tuple of (decision, score, features) where decision is 1 (speech) or 0 (silence), score is the combined weighted score in [0.0, 1.0], and features is a float64 array of length 5 containing the normalized feature values.
- Return type:
tuple[int, float, ndarray[tuple[Any, …], dtype[float64]]]
- process(signal, sample_rate, frame_len)[source]¶
Segment a signal into non-overlapping frames and process each one.
- Parameters:
signal (ArrayLike) – Signal samples (1-D array).
sample_rate (float) – Sample rate in Hz.
frame_len (int) – Frame length in samples.
- Returns:
A tuple of (decisions, scores, features) where decisions is an int32 array of length num_frames, scores is a float64 array of length num_frames, and features is a float64 array of shape (num_frames, 5).
- Return type:
tuple[ndarray[tuple[Any, …], dtype[int32]], ndarray[tuple[Any, …], dtype[float64]], ndarray[tuple[Any, …], dtype[float64]]]
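The segmentation step of process() can be sketched as follows. This assumes any trailing partial frame is dropped, which is a common convention for non-overlapping framing; the library's handling of the remainder is not documented here and may differ.

```python
import numpy as np

def segment_frames(signal, frame_len):
    """Split a 1-D signal into non-overlapping frames of frame_len samples,
    discarding any trailing samples that do not fill a whole frame."""
    signal = np.asarray(signal, dtype=np.float64)
    num_frames = len(signal) // frame_len
    # Truncate to a whole number of frames, then view as (num_frames, frame_len)
    return signal[: num_frames * frame_len].reshape(num_frames, frame_len)
```

For a 16000-sample signal with frame_len=320, this yields 50 frames, matching the lengths of the decisions and scores arrays returned by process().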
- pyminidsp.VAD_NUM_FEATURES¶
Number of features extracted per frame (5).