Voice Activity Detection

Frame-level voice activity detection with adaptive feature normalization and onset/hangover smoothing.

The detector extracts five features per frame:

  Index   Feature
  0       Energy
  1       Zero-crossing rate
  2       Spectral entropy
  3       Spectral flatness
  4       Band energy ratio

Features are normalized to [0.0, 1.0] using adaptive exponential-moving-average (EMA) tracking, then combined into a weighted score:

\[S = \sum_{i=0}^{4} w_i \cdot \hat{f}_i\]
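One common way to realize this kind of adaptive normalization is to track running per-feature minima and maxima with an EMA and rescale each frame against that range. The sketch below is a hypothetical construction for illustration — the library's actual update rule and weights may differ:

```python
import numpy as np

class AdaptiveNormalizer:
    """EMA-tracked min/max normalization (illustrative, not the library's rule)."""

    def __init__(self, num_features=5, adaptation_rate=0.05):
        self.rate = adaptation_rate
        self.lo = np.full(num_features, np.inf)
        self.hi = np.full(num_features, -np.inf)

    def __call__(self, raw):
        raw = np.asarray(raw, dtype=np.float64)
        # New extremes are adopted immediately; otherwise the tracked range
        # relaxes toward recent values at the adaptation rate.
        self.lo = np.where(raw < self.lo, raw, self.lo + self.rate * (raw - self.lo))
        self.hi = np.where(raw > self.hi, raw, self.hi + self.rate * (raw - self.hi))
        span = np.maximum(self.hi - self.lo, 1e-12)
        return np.clip((raw - self.lo) / span, 0.0, 1.0)

def weighted_score(normalized, weights):
    # S = sum_i w_i * f_hat_i
    return float(np.dot(weights, normalized))
```

Because the tracked range adapts continuously, the normalizer recovers from level changes (e.g. a quieter microphone) without re-calibration.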

An onset/hangover state machine smooths the final decision.
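A minimal version of such a state machine requires several consecutive above-threshold frames before declaring speech, then holds the speech decision for a hangover period after the score drops. The parameter defaults below are illustrative assumptions, not the C library's values:

```python
class OnsetHangover:
    """Minimal onset/hangover smoothing (illustrative defaults)."""

    def __init__(self, threshold=0.4, onset_frames=3, hangover_frames=8):
        self.threshold = threshold
        self.onset_frames = onset_frames
        self.hangover_frames = hangover_frames
        self.onset_count = 0
        self.hangover_count = 0
        self.in_speech = False

    def update(self, score):
        if score >= self.threshold:
            # Require onset_frames consecutive hits before entering speech;
            # every hit while in speech refreshes the hangover countdown.
            self.onset_count += 1
            if self.onset_count >= self.onset_frames:
                self.in_speech = True
                self.hangover_count = self.hangover_frames
        else:
            self.onset_count = 0
            if self.in_speech:
                if self.hangover_count > 0:
                    self.hangover_count -= 1
                else:
                    self.in_speech = False
        return 1 if self.in_speech else 0
```

The onset requirement suppresses isolated score spikes, while the hangover bridges brief dips inside an utterance (e.g. stop consonants).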

class pyminidsp.VAD(*, threshold=None, onset_frames=None, hangover_frames=None, adaptation_rate=None, band_low_hz=None, band_high_hz=None, weights=None)[source]

Voice activity detector with adaptive feature normalization and onset/hangover smoothing.

Wraps the miniDSP C library’s stateful VAD API. The detector extracts five features per frame (energy, ZCR, spectral entropy, spectral flatness, band energy ratio), normalizes them adaptively, computes a weighted score, and applies an onset/hangover state machine.

Example

>>> from pyminidsp import VAD
>>> detector = VAD(threshold=0.4)
>>> detector.calibrate(silence_frame, sample_rate=16000.0)
>>> decision, score, features = detector.process_frame(frame, 16000.0)

A complete, self-contained example:

import pyminidsp as md

detector = md.VAD(threshold=0.4)

# Calibrate with silence
silence = md.white_noise(320, amplitude=0.0, seed=0)
for _ in range(10):
    detector.calibrate(silence, sample_rate=16000.0)

# Process a single frame
frame = md.sine_wave(320, amplitude=1.0, freq=1000.0, sample_rate=16000.0)
decision, score, features = detector.process_frame(frame, 16000.0)

# Batch-process an entire signal
signal = md.sine_wave(16000, amplitude=1.0, freq=1000.0, sample_rate=16000.0)
decisions, scores, features = detector.process(signal, 16000.0, frame_len=320)
Parameters:
  • threshold (float | None) – Decision threshold applied to the combined score S.

  • onset_frames (int | None) – Number of consecutive above-threshold frames required before declaring speech.

  • hangover_frames (int | None) – Number of frames the speech decision is held after the score falls below the threshold.

  • adaptation_rate (float | None) – EMA rate used for adaptive feature normalization.

  • band_low_hz (float | None) – Lower band edge in Hz for the band energy ratio feature.

  • band_high_hz (float | None) – Upper band edge in Hz for the band energy ratio feature.

  • weights (Sequence[float] | None) – Five per-feature weights w_i for the combined score.

For any parameter left as None, the underlying C library's default is used.

calibrate(signal, sample_rate)[source]

Feed a known-silence frame to seed the adaptive normalization.

Call this on several silence frames before live processing to improve initial accuracy.

Parameters:
  • signal (ArrayLike) – Frame samples (1-D array).

  • sample_rate (float) – Sample rate in Hz.

Return type:

None

process_frame(signal, sample_rate)[source]

Process one audio frame and return the VAD decision.

Parameters:
  • signal (ArrayLike) – Frame samples (1-D array).

  • sample_rate (float) – Sample rate in Hz.

Returns:

A tuple of (decision, score, features) where decision is 1 (speech) or 0 (silence), score is the combined weighted score in [0.0, 1.0], and features is a float64 array of length 5 containing the normalized feature values.

Return type:

tuple[int, float, ndarray[tuple[Any, …], dtype[float64]]]

process(signal, sample_rate, frame_len)[source]

Segment a signal into non-overlapping frames and process each one.

Parameters:
  • signal (ArrayLike) – Input signal (1-D array).

  • sample_rate (float) – Sample rate in Hz.

  • frame_len (int) – Number of samples per frame.

Returns:

A tuple of (decisions, scores, features) where decisions is an int32 array of length num_frames, scores is a float64 array of length num_frames, and features is a float64 array of shape (num_frames, 5).

Return type:

tuple[ndarray[tuple[Any, …], dtype[int32]], ndarray[tuple[Any, …], dtype[float64]], ndarray[tuple[Any, …], dtype[float64]]]
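The non-overlapping segmentation that process() performs can be pictured with the sketch below; treating a trailing partial frame as dropped is an assumption about the edge handling, not documented library behavior:

```python
import numpy as np

def segment_frames(signal, frame_len):
    """Split a 1-D signal into non-overlapping frames (trailing remainder dropped)."""
    signal = np.asarray(signal, dtype=np.float64)
    num_frames = len(signal) // frame_len
    return signal[: num_frames * frame_len].reshape(num_frames, frame_len)
```

Each row of the result is one frame, so per-frame processing becomes a loop (or vectorized pass) over the first axis, yielding the length-num_frames decision and score arrays described above.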

pyminidsp.VAD_NUM_FEATURES

Number of features extracted per frame (5).