miniDSP
A small C library for audio DSP
Voice Activity Detection

This tutorial builds a voice activity detector (VAD) that combines five normalized audio features into a weighted score, applies a threshold, and smooths the binary decision with onset gating and hangover. VAD is a fundamental preprocessing step for streaming audio, ASR pipelines, and noise reduction.

All features are computed from existing miniDSP primitives. The detector adapts to different microphones, gain settings, and noise floors without manual tuning.

Build and run the example from the repository root:

make -C examples vad
cd examples && ./vad
open vad_plot.html

Energy

Frame energy measures overall loudness — the simplest voice activity cue. Speech frames carry more energy than silence or low-level background noise.

\[E = \sum_{n=0}^{N-1} x[n]^2 \]

Reading the formula in C:

// x[n] -> signal[n], N -> frame length
double energy = 0.0;
for (unsigned n = 0; n < N; n++) {
    energy += signal[n] * signal[n]; // x[n]^2
}

Intuition: silence has near-zero energy; speech has high energy. Energy alone fails in moderate noise because the noise floor raises the baseline.


Zero-crossing rate

The zero-crossing rate counts how often the signal changes sign, normalized by the number of adjacent pairs:

\[\mathrm{ZCR} = \frac{1}{N-1}\sum_{n=1}^{N-1} \mathbf{1}\!\bigl[\mathrm{sgn}(x[n]) \ne \mathrm{sgn}(x[n-1])\bigr] \]

Reading the formula in C:

// x[n] -> signal[n], N -> frame length
unsigned crossings = 0;
for (unsigned n = 1; n < N; n++) {
    // 1[sgn(x[n]) != sgn(x[n-1])]
    if ((signal[n] >= 0.0) != (signal[n - 1] >= 0.0))
        crossings++;
}
double zcr = (double)crossings / (double)(N - 1); // normalize to [0, 1]

Intuition: voiced speech has a low ZCR (periodic, low-frequency fundamental); unvoiced fricatives have high ZCR; silence has low ZCR. ZCR helps distinguish voiced speech from broadband noise.


Spectral entropy

Spectral entropy measures how uniformly energy is spread across frequency bins. Low entropy means energy is concentrated (tonal); high entropy means energy is diffuse (noise-like).

\[H = -\frac{1}{\ln(K)} \sum_{k=0}^{K-1} p_k \ln(p_k), \qquad p_k = \frac{\mathrm{PSD}[k]}{\sum_{j=0}^{K-1} \mathrm{PSD}[j]} \]

where \(K = N/2 + 1\) is the number of one-sided PSD bins.

Reading the formula in C:

// PSD[k] -> psd[k], K -> num_bins
double total = 0.0;
for (unsigned k = 0; k < num_bins; k++)
    total += psd[k]; // denominator of p_k
double entropy = 0.0;
for (unsigned k = 0; k < num_bins; k++) {
    double p_k = psd[k] / total; // p_k = PSD[k] / sum(PSD)
    if (p_k > 0.0)
        entropy -= p_k * log(p_k); // -sum(p_k * ln(p_k))
}
entropy /= log((double)num_bins); // normalize by ln(K) -> [0, 1]

Intuition: speech has lower spectral entropy (energy concentrated at harmonics); noise has higher spectral entropy (energy spread uniformly).


Spectral flatness

Spectral flatness (also called the Wiener entropy) is the ratio of the geometric mean to the arithmetic mean of PSD bins:

\[\mathrm{SF} = \frac{\left(\prod_{k=0}^{K-1} \mathrm{PSD}[k]\right)^{1/K}} {\frac{1}{K}\sum_{k=0}^{K-1} \mathrm{PSD}[k]} \]

Reading the formula in C:

// PSD[k] -> psd[k], K -> num_bins
// Compute in the log domain to avoid overflow in the product
double log_sum = 0.0;
double arith_sum = 0.0;
for (unsigned k = 0; k < num_bins; k++) {
    double val = psd[k] > 0.0 ? psd[k] : 1e-30;
    log_sum += log(val);   // sum of ln(PSD[k])
    arith_sum += psd[k];   // sum of PSD[k]
}
double log_geo_mean = log_sum / (double)num_bins;  // (1/K) * sum(ln(PSD[k]))
double geo_mean = exp(log_geo_mean);               // geometric mean
double arith_mean = arith_sum / (double)num_bins;  // arithmetic mean
double flatness = geo_mean / arith_mean;           // SF in [0, 1]

Intuition: SF = 1.0 for white noise (perfectly flat spectrum); SF approaches 0 for a pure tone (all energy in one bin). Speech falls between these extremes, with lower flatness than background noise.


Band energy ratio

The band energy ratio measures the concentration of energy in the speech band (typically 300–3400 Hz, the telephone bandwidth):

\[\mathrm{BER} = \frac{\sum_{k : f_k \in [f_{\mathrm{lo}},\, f_{\mathrm{hi}}]} \mathrm{PSD}[k]} {\sum_{k=0}^{K-1} \mathrm{PSD}[k]} \]

where \(f_k = k \cdot f_s / N\) is the frequency of bin \(k\).

Reading the formula in C:

// PSD[k] -> psd[k], K -> num_bins
// f_lo -> band_low_hz, f_hi -> band_high_hz
double freq_per_bin = sample_rate / (double)N; // f_s / N
double total = 0.0;
double band = 0.0;
for (unsigned k = 0; k < num_bins; k++) {
    double freq = k * freq_per_bin; // f_k
    total += psd[k];
    if (freq >= band_low_hz && freq <= band_high_hz)
        band += psd[k]; // sum PSD in speech band
}
double ber = band / total; // BER in [0, 1]

Intuition: speech concentrates energy in the 300–3400 Hz band. Background noise typically has a flatter distribution, yielding a lower BER.


Adaptive normalization

Raw feature values vary by orders of magnitude across microphones and gain settings. The VAD normalizes each feature to [0, 1] using per-feature min/max estimates tracked via exponential moving average (EMA):

\[m_i \leftarrow m_i + \alpha \cdot (f_i - m_i), \quad M_i \leftarrow M_i + \alpha \cdot (f_i - M_i) \]

where \(m_i\) and \(M_i\) are the running minimum and maximum for feature \(i\), \(f_i\) is the current raw value, and \(\alpha\) is the adaptation rate. Each bound is updated only when the current value moves past it (below the running minimum or above the running maximum). The normalized feature is:

\[\hat{f}_i = \frac{f_i - m_i}{M_i - m_i} \]

clamped to [0, 1].

Reading the formula in C:

// alpha -> adaptation_rate, f_i -> raw[i]
// m_i -> feat_min[i], M_i -> feat_max[i]
// Update min/max via EMA (only when the raw value exceeds the current bound)
if (raw[i] < feat_min[i])
    feat_min[i] = feat_min[i] + alpha * (raw[i] - feat_min[i]);
if (raw[i] > feat_max[i])
    feat_max[i] = feat_max[i] + alpha * (raw[i] - feat_max[i]);
// Normalize to [0, 1]
double range = feat_max[i] - feat_min[i];
if (range < 1e-12) range = 1e-12; // prevent division by zero
double norm = (raw[i] - feat_min[i]) / range;
if (norm < 0.0) norm = 0.0;
if (norm > 1.0) norm = 1.0;

Weighted scoring

The five normalized features are combined into a single score via weighted sum:

\[S = \sum_{i=0}^{4} w_i \cdot \hat{f}_i \]

where \(w_i\) are caller-tunable weights; setting each to 0.2 gives equal weighting.

Reading the formula in C:

// w_i -> weights[i], f_hat_i -> norm[i]
double score = 0.0;
for (int i = 0; i < 5; i++) {
    score += weights[i] * norm[i]; // S = sum(w_i * f_hat_i)
}

Intuition: equal weights give each feature equal vote. Setting one weight to 1.0 and the rest to 0.0 reduces the detector to a single-feature VAD.


State machine

The detector uses an onset + hangover mechanism to smooth the binary decision:

State Condition Action
SILENCE score >= threshold Increment onset counter
SILENCE onset counter >= onset_frames Transition to SPEECH
SILENCE score < threshold Reset onset counter
SPEECH score >= threshold Reset hangover to hangover_frames
SPEECH score < threshold Decrement hangover counter
SPEECH hangover counter == 0 Transition to SILENCE

Onset gating prevents transient clicks from triggering false positives — the score must exceed the threshold for onset_frames consecutive frames before the detector declares speech.

Hangover bridges brief dips mid-utterance — after the score drops below threshold, the speech decision holds for hangover_frames additional frames.
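The transition table above can be condensed into a small update function. The following is an illustrative sketch of the mechanism only, using hypothetical names (vad_sm, vad_sm_update) rather than the library's internal API:

```c
#include <stdbool.h>

/* Illustrative onset/hangover state machine -- names are hypothetical,
 * not part of the miniDSP API. */
typedef struct {
    bool in_speech;     /* current state: SPEECH (true) or SILENCE */
    unsigned onset;     /* consecutive above-threshold frames so far */
    unsigned hangover;  /* frames remaining before dropping to SILENCE */
} vad_sm;

/* Returns the smoothed binary decision for one frame's score. */
static bool vad_sm_update(vad_sm *sm, double score, double threshold,
                          unsigned onset_frames, unsigned hangover_frames)
{
    if (!sm->in_speech) {
        if (score >= threshold) {
            if (++sm->onset >= onset_frames) {   /* onset gating passed */
                sm->in_speech = true;
                sm->hangover = hangover_frames;
            }
        } else {
            sm->onset = 0;                       /* any dip resets onset */
        }
    } else {
        if (score >= threshold) {
            sm->hangover = hangover_frames;      /* refresh hangover */
        } else if (sm->hangover == 0 || --sm->hangover == 0) {
            sm->in_speech = false;               /* hangover expired */
            sm->onset = 0;
        }
    }
    return sm->in_speech;
}
```

With onset_frames = 2 and hangover_frames = 3, a single above-threshold frame is ignored, and a single below-threshold frame mid-utterance does not end the speech segment.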


Visualization

The interactive plot below shows the VAD processing a synthesized signal with two speech segments (sine bursts at 1000 Hz) separated by silence:

The four panels show:

  1. Waveform — peak envelope per frame, showing speech and silence regions.
  2. Normalized features — all five features mapped to [0, 1], showing how they respond differently to speech vs. silence.
  3. Combined score — the weighted sum of features compared against the threshold (dashed red line).
  4. Decision — the final binary output after onset gating and hangover smoothing.

Default parameter optimization

The default parameters returned by MD_vad_default_params() are not hand-picked — they were found by systematic hyperparameter optimization.

Motivation

The initial defaults used equal feature weights (0.2 each), a threshold of 0.5, and conservative onset/hangover settings. These worked in clean conditions but degraded significantly in noise, yielding an F2 score of only 0.837 on a standardized benchmark.

Dataset

The optimization used the LibriVAD corpus, specifically the train-clean-100 split:

  • 7 560 files across 9 noise types (babble, city, domestic, nature, office, public, SSN, street, transport) and 6 SNR levels (-5, 0, 5, 10, 15, 20 dB).
  • 242 197 seconds of audio with frame-level speech/non-speech labels.
  • 66.4% of frames are speech.

Methodology

The search used Optuna with the TPE (Tree-structured Parzen Estimator) sampler:

  • 300 trials, each evaluating all 7 560 files.
  • 10 parallel workers for evaluation.
  • Objective: maximize F2 (beta=2), which weights recall twice as heavily as precision — appropriate for VAD where missing speech is costlier than false alarms.
  • Search space: all 11 parameters in MD_vad_params (5 weights, threshold, onset frames, hangover frames, adaptation rate, band low/high Hz).
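Since F2 drives the whole search, it helps to see the objective concretely. This is the standard F-beta formula, not code from the optimization script; f_beta is an illustrative helper:

```c
/* F-beta from precision and recall (standard definition).
 * beta = 2 makes recall twice as important as precision. */
static double f_beta(double precision, double recall, double beta)
{
    double b2 = beta * beta;
    double denom = b2 * precision + recall;
    return denom > 0.0 ? (1.0 + b2) * precision * recall / denom : 0.0;
}
```

For example, precision 0.835 and recall 0.838 give F2 of about 0.837, the baseline score quoted above.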

Results

Metric Baseline Optimized Change
F2 0.837 0.933 +0.096
Precision 0.835 0.782 -0.053
Recall 0.838 0.981 +0.143

The optimizer traded some precision for a large recall gain — the system now catches 98% of speech frames.

Key observations

  • Energy dominates — weight 0.723 (72% of the score). This is expected: frame energy is the most reliable single feature for speech detection.
  • Band energy ratio is second — weight 0.158. Speech concentrates energy in the 126–2899 Hz band more than most noise types.
  • Spectral entropy contributes little — weight 0.006. Its discriminative power is largely redundant with energy and flatness.
  • Low threshold + long hangover — the combination of threshold 0.245 and 22-frame hangover biases toward recall: speech is detected aggressively and held through brief pauses.
  • Onset of 1 frame — minimal onset gating, consistent with the recall bias.

Per-condition performance

The optimized parameters are robust across noise types and SNR levels:

Condition F2 range (across SNRs) Worst SNR
Babble noise 0.907 – 0.955 -5 dB
City noise 0.907 – 0.957 -5 dB
Domestic noise 0.906 – 0.957 -5 dB
Nature noise 0.906 – 0.958 -5 dB
Office noise 0.891 – 0.957 -5 dB
Public noise 0.904 – 0.957 -5 dB
SSN noise 0.906 – 0.956 -5 dB
Street noise 0.908 – 0.957 -5 dB
Transport noise 0.907 – 0.958 -5 dB

Even at -5 dB SNR (noise louder than speech), F2 stays above 0.89. Office noise at -5 dB is the hardest condition (F2=0.891), likely because office noise has speech-like spectral characteristics.

Re-tuning for your data

If your domain differs significantly from LibriVAD (e.g., telephony, far-field, non-speech audio), you can re-run the optimization on your own data. The optimization script is in optimize/VAD/:

cd optimize/VAD
uv run python optimize_vad.py --n-trials 300 --breakdown \
    --output best_params.json \
    librivad /path/to/your/LibriVAD --dataset YourDataset --split your-split

See optimize/VAD/README.md for full usage, including support for Audacity label files.


Comparison with ViT-MFCC baseline

How does the miniDSP VAD compare to a published neural baseline? We evaluated both systems on the LibriVAD test-clean split (702 files, 9 noise types × 6 SNR levels) using the methodology from the LibriVAD paper.

Systems under test

Property miniDSP VAD ViT-MFCC (small)
Type Hand-crafted features + weighted threshold + state machine Vision Transformer (KWT) on MFCCs
Language C (via pyminidsp CFFI bindings) Python / PyTorch
Trainable parameters 0 (10 tuned hyperparameters) ~3.5 M
Frame rate 20 ms 10 ms
Training data Optuna optimization on train-clean-100 (small) Supervised training on LibriSpeechConcat (small)

Both systems used the same-sized small training split, so the comparison isolates the modeling approach (hand-crafted features vs learned ViT) rather than data scale. Ground-truth labels are downsampled independently to each system's native frame rate via majority voting. F2 (beta=2) weights recall twice as heavily as precision. The ViT uses a fixed 0.5 decision threshold; the miniDSP VAD uses the Optuna-optimized defaults from the C library.
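The majority-voting step can be sketched as follows. This is illustrative C (the actual comparison harness is Python); majority_label is a hypothetical helper, and resolving ties toward speech is an assumption:

```c
/* Downsample fine-grained binary labels to one coarse frame by
 * majority vote. Illustrative sketch; ties resolve to speech (1),
 * which is an assumption, not the documented behavior. */
static int majority_label(const int *labels, unsigned start, unsigned count)
{
    unsigned ones = 0;
    for (unsigned i = 0; i < count; i++)
        ones += labels[start + i] != 0;
    return 2 * ones >= count;
}
```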

Overall results

System F2 Precision Recall AUC (macro) AUC (pooled) Wall time
miniDSP VAD 0.8440 0.8274 0.8482 0.6519 0.6517 3.6 s
ViT-MFCC (small) 0.9614 0.9390 0.9672 0.9712 0.9819 77.8 s

The ViT leads by +0.117 F2 and +0.319 AUC (macro) but takes 22× longer to process the same data. Our ViT AUC of 0.9712 closely reproduces the 0.9710 reported in the LibriVAD paper (Table 7).

Per-SNR breakdown

SNR (dB) miniDSP F2 ViT F2 F2 Gap miniDSP AUC ViT AUC AUC Gap
-5 0.7993 0.9321 0.133 0.5785 0.9145 0.336
0 0.8358 0.9523 0.117 0.6205 0.9564 0.336
5 0.8512 0.9606 0.109 0.6539 0.9787 0.325
10 0.8515 0.9685 0.117 0.6730 0.9888 0.316
15 0.8588 0.9752 0.116 0.6870 0.9934 0.306
20 0.8688 0.9801 0.111 0.6985 0.9955 0.297

The F2 gap is consistent across SNR levels (~0.11–0.13), widening slightly at -5 dB. The AUC gap is larger (~0.30–0.34) and narrows with increasing SNR, showing that miniDSP's continuous scores improve more with cleaner signals than its binary decisions suggest.

Noise-type observations

miniDSP performs best on noise types with distinctive spectral characteristics: babble (F2 gap 0.050) and SSN (F2 gap 0.080), where its energy and band-energy-ratio features are most discriminative. The largest F2 gaps appear with transport (0.178) and office (0.169) noise, which overlap more with the speech band.

AUC tells a different story: miniDSP's best AUC is on domestic noise (0.717), while babble and SSN — where F2 is strongest — have relatively low AUC (0.60–0.62). This indicates the good F2 on those noise types relies on threshold tuning rather than score quality.

Parameter generalization

The results above use the library's default parameters (optimized on train-clean-100). To check for overfitting, we compared three Optuna-optimized parameter sets, each tuned on a different data split, all evaluated on test-clean:

Parameter set F2 Precision Recall
train-clean-100 (small) 0.9340 0.7687 0.9870
dev-clean 0.9347 0.7704 0.9873
test-clean (cheat) 0.9348 0.7779 0.9844
ViT-MFCC (small) 0.9614 0.9390 0.9672

All three miniDSP parameter sets produce nearly identical F2 (~0.934–0.935). Even optimizing directly on the evaluation data ("cheating") yields only +0.001 F2 over the train-clean-100 parameters. The F2 ceiling from threshold/weight tuning alone is ~0.935 — the remaining gap to the ViT (0.935 → 0.961) requires richer features or a learned model.

Interpreting AUC vs F2

AUC measures how well a system's continuous scores separate speech from non-speech across all thresholds. F2 measures binary decision quality at a specific operating point.

miniDSP's low AUC (~0.65) vs the ViT's high AUC (~0.97) means the miniDSP's raw scores have limited discriminative power — its good F2 comes from Optuna finding the right operating point, not from intrinsically well-separated scores. If the threshold needs re-tuning for different conditions, there is limited headroom in the score distribution.
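For intuition, AUC equals the probability that a randomly chosen speech frame scores higher than a randomly chosen non-speech frame (the Mann-Whitney statistic). A brute-force sketch, assuming a hypothetical helper auc_brute (not a miniDSP or comparison-script function):

```c
/* AUC as P(score of random speech frame > score of random non-speech
 * frame), ties counted as half. O(n^2) brute force -- fine for a
 * sketch, too slow for a full corpus. */
static double auc_brute(const double *scores, const int *labels, unsigned n)
{
    double wins = 0.0;
    unsigned pairs = 0;
    for (unsigned i = 0; i < n; i++) {
        if (!labels[i]) continue;            /* i ranges over speech */
        for (unsigned j = 0; j < n; j++) {
            if (labels[j]) continue;         /* j over non-speech */
            pairs++;
            if (scores[i] > scores[j])       wins += 1.0;
            else if (scores[i] == scores[j]) wins += 0.5;
        }
    }
    return pairs ? wins / pairs : 0.5;
}
```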

Practical implications: For embedded devices, real-time pipelines, or latency-sensitive applications where conditions are reasonably known, the miniDSP VAD offers a strong cost/performance tradeoff (97% of the ViT's F2 at 22× the speed with zero learned parameters). For maximum accuracy with compute budget available, especially across diverse or unknown noise conditions, the ViT-MFCC is clearly superior.

Reproducing the comparison

The comparison script and full per-condition breakdowns are in compare/VAD/:

cd compare/VAD
uv run python compare_vad.py /path/to/LibriVAD --breakdown

See compare/VAD/README.md for setup and compare/VAD/README_RESULTS.md for the complete results including per-noise-type breakdowns for all parameter sets.


API summary


Quick example — initialize and process:

MD_vad_params params;
MD_vad_default_params(&params); /* start from the library defaults */
params.threshold = 0.3;
MD_vad_state state;
MD_vad_init(&state, &params);

FILE *csv = fopen("vad.csv", "w");
fprintf(csv, "frame,time,decision,score,energy,zcr,"
             "spectral_entropy,spectral_flatness,band_energy_ratio\n");
for (unsigned i = 0; i < num_frames; i++) {
    double *frame = signal + i * FRAME_SIZE;
    double score;
    double features[MD_VAD_NUM_FEATURES];
    int decision = MD_vad_process_frame(&state, frame, FRAME_SIZE,
                                        SAMPLE_RATE, &score, features);
    double t = (double)(i * FRAME_SIZE) / SAMPLE_RATE;
    times[i] = t;
    decisions[i] = decision;
    scores[i] = score;

    /* Peak absolute amplitude in this frame */
    double peak = 0.0;
    for (unsigned s = 0; s < FRAME_SIZE; s++) {
        double a = fabs(frame[s]);
        if (a > peak) peak = a;
    }
    peak_env[i] = peak;

    for (int f = 0; f < MD_VAD_NUM_FEATURES; f++)
        feat_matrix[i * MD_VAD_NUM_FEATURES + f] = features[f];

    fprintf(csv, "%u,%.4f,%d,%.6f,%.6f,%.6f,%.6f,%.6f,%.6f\n",
            i, t, decision, score,
            features[0], features[1], features[2],
            features[3], features[4]);
}
fclose(csv);
printf("  CSV: vad.csv (%u frames)\n", num_frames);

Calibration — optional, improves accuracy if silence frames are available:

unsigned cal_frames = 10;
for (unsigned i = 0; i < cal_frames; i++) {
    double *frame = signal + i * FRAME_SIZE;
    MD_vad_calibrate(&state, frame, FRAME_SIZE, SAMPLE_RATE);
}

Custom weights — use a single feature:

MD_vad_params energy_only_params;
MD_vad_default_params(&energy_only_params);
for (int i = 0; i < MD_VAD_NUM_FEATURES; i++)
    energy_only_params.weights[i] = 0.0;
energy_only_params.weights[MD_VAD_FEAT_ENERGY] = 1.0;
MD_vad_state energy_only_state;
MD_vad_init(&energy_only_state, &energy_only_params);