miniDSP
A small C library for audio DSP
This tutorial builds a voice activity detector (VAD) that combines five normalized audio features into a weighted score, applies a threshold, and smooths the binary decision with onset gating and hangover. VAD is a fundamental preprocessing step for streaming audio, ASR pipelines, and noise reduction.
All features are computed from existing miniDSP primitives. The detector adapts to different microphones, gain settings, and noise floors without manual tuning.
Build and run the example from the repository root:
Frame energy measures overall loudness — the simplest voice activity cue. Speech frames carry more energy than silence or low-level background noise.
\[E = \sum_{n=0}^{N-1} x[n]^2 \]
Reading the formula in C:
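A direct transcription of the sum (an illustrative sketch; `frame_energy` is a helper written for this tutorial, not a miniDSP API):

```c
#include <stddef.h>

/* Frame energy: E = sum over n of x[n]^2 */
static float frame_energy(const float *x, size_t n)
{
    float e = 0.0f;
    for (size_t i = 0; i < n; i++)
        e += x[i] * x[i];
    return e;
}
```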
Intuition: silence has near-zero energy; speech has high energy. Energy alone fails in moderate noise because the noise floor raises the baseline.
The zero-crossing rate counts how often the signal changes sign, normalized by the number of adjacent pairs:
\[\mathrm{ZCR} = \frac{1}{N-1}\sum_{n=1}^{N-1} \mathbf{1}\!\bigl[\mathrm{sgn}(x[n]) \ne \mathrm{sgn}(x[n-1])\bigr] \]
Reading the formula in C:
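A sketch of the count (illustrative, not the library's code; treating zero samples as positive is a convention for the sign function):

```c
#include <stddef.h>

/* Fraction of adjacent sample pairs whose signs differ.
   Samples equal to zero are counted as positive. */
static float zero_crossing_rate(const float *x, size_t n)
{
    if (n < 2)
        return 0.0f;
    size_t crossings = 0;
    for (size_t i = 1; i < n; i++)
        if ((x[i] >= 0.0f) != (x[i - 1] >= 0.0f))
            crossings++;
    return (float)crossings / (float)(n - 1);
}
```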
Intuition: voiced speech has a low ZCR (periodic, low-frequency fundamental); unvoiced fricatives have high ZCR; silence has low ZCR. ZCR helps distinguish voiced speech from broadband noise.
Spectral entropy measures how uniformly energy is spread across frequency bins. Low entropy means energy is concentrated (tonal); high entropy means energy is diffuse (noise-like).
\[H = -\frac{1}{\ln(K)} \sum_{k=0}^{K-1} p_k \ln(p_k), \qquad p_k = \frac{\mathrm{PSD}[k]}{\sum_{j=0}^{K-1} \mathrm{PSD}[j]} \]
where \(K = N/2 + 1\) is the number of one-sided PSD bins.
Reading the formula in C:
Intuition: speech has lower spectral entropy (energy concentrated at harmonics); noise has higher spectral entropy (energy spread uniformly).
Spectral flatness (also called the Wiener entropy) is the ratio of the geometric mean to the arithmetic mean of PSD bins:
\[\mathrm{SF} = \frac{\left(\prod_{k=0}^{K-1} \mathrm{PSD}[k]\right)^{1/K}} {\frac{1}{K}\sum_{k=0}^{K-1} \mathrm{PSD}[k]} \]
Reading the formula in C:
Intuition: SF = 1.0 for white noise (perfectly flat spectrum); SF approaches 0 for a pure tone (all energy in one bin). Speech falls between these extremes, with lower flatness than background noise.
The band energy ratio measures the concentration of energy in the speech band (typically 300–3400 Hz, the telephone bandwidth):
\[\mathrm{BER} = \frac{\sum_{k : f_k \in [f_{\mathrm{lo}},\, f_{\mathrm{hi}}]} \mathrm{PSD}[k]} {\sum_{k=0}^{K-1} \mathrm{PSD}[k]} \]
where \(f_k = k \cdot f_s / N\) is the frequency of bin \(k\).
Reading the formula in C:
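A sketch of the ratio (illustrative, not the library's code), using the bin-to-frequency mapping \(f_k = k \cdot f_s / N\) from above:

```c
#include <stddef.h>

/* Fraction of one-sided PSD energy inside [f_lo, f_hi].
   Bin i corresponds to frequency i * fs / n_fft. */
static float band_energy_ratio(const float *psd, size_t k,
                               float fs, size_t n_fft,
                               float f_lo, float f_hi)
{
    float band = 0.0f, total = 0.0f;
    for (size_t i = 0; i < k; i++) {
        float f = (float)i * fs / (float)n_fft;
        total += psd[i];
        if (f >= f_lo && f <= f_hi)
            band += psd[i];
    }
    return total > 0.0f ? band / total : 0.0f;
}
```

With fs = 8000 and an 8-point FFT, the 5 one-sided bins sit at 0, 1000, 2000, 3000, and 4000 Hz; a flat PSD then yields BER = 3/5 for the 300–3400 Hz band.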
Intuition: speech concentrates energy in the 300–3400 Hz band. Background noise typically has a flatter distribution, yielding a lower BER.
Raw feature values vary by orders of magnitude across microphones and gain settings. The VAD normalizes each feature to [0, 1] using per-feature min/max estimates tracked via exponential moving average (EMA):
\[m_i \leftarrow m_i + \alpha \cdot (f_i - m_i), \quad M_i \leftarrow M_i + \alpha \cdot (f_i - M_i) \]
where \(m_i\) and \(M_i\) are the running minimum and maximum for feature \(i\), \(f_i\) is the current raw value, and \(\alpha\) is the adaptation rate. The normalized feature is:
\[\hat{f}_i = \frac{f_i - m_i}{M_i - m_i} \]
clamped to [0, 1].
Reading the formula in C:
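One note before the sketch: applied literally to both bounds, the EMA update above would pull \(m_i\) and \(M_i\) toward the same value. Min/max trackers therefore usually update asymmetrically: snap outward when a new extreme arrives, and drift via EMA otherwise. This sketch uses that asymmetric variant as an assumption about the intended behavior; it is illustrative, not the library's code:

```c
/* Running min/max with EMA relaxation: snap outward on a new extreme,
   otherwise drift toward the current value at rate alpha. */
typedef struct { float min, max; } range_t;

static float normalize_feature(range_t *r, float f, float alpha)
{
    if (f < r->min) r->min = f;                       /* snap down  */
    else            r->min += alpha * (f - r->min);   /* drift up   */
    if (f > r->max) r->max = f;                       /* snap up    */
    else            r->max += alpha * (f - r->max);   /* drift down */

    float span = r->max - r->min;
    if (span <= 1e-9f)                                /* degenerate range */
        return 0.0f;
    float fn = (f - r->min) / span;
    return fn < 0.0f ? 0.0f : (fn > 1.0f ? 1.0f : fn);  /* clamp to [0,1] */
}
```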
The five normalized features are combined into a single score via weighted sum:
\[S = \sum_{i=0}^{4} w_i \cdot \hat{f}_i \]
where \(w_i\) are caller-tunable weights (default: 0.2 each for equal weighting).
Reading the formula in C:
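A sketch of the dot product (illustrative, not the library's code):

```c
/* Weighted sum of the five normalized features. */
static float vad_score(const float fhat[5], const float w[5])
{
    float s = 0.0f;
    for (int i = 0; i < 5; i++)
        s += w[i] * fhat[i];
    return s;
}
```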
Intuition: equal weights give each feature equal vote. Setting one weight to 1.0 and the rest to 0.0 reduces the detector to a single-feature VAD.
The detector uses an onset + hangover mechanism to smooth the binary decision:
| State | Condition | Action |
|---|---|---|
| SILENCE | score >= threshold | Increment onset counter |
| SILENCE | onset counter >= onset_frames | Transition to SPEECH |
| SILENCE | score < threshold | Reset onset counter |
| SPEECH | score >= threshold | Reset hangover to hangover_frames |
| SPEECH | score < threshold | Decrement hangover counter |
| SPEECH | hangover counter == 0 | Transition to SILENCE |
Onset gating prevents transient clicks from triggering false positives — the score must exceed the threshold for onset_frames consecutive frames before the detector declares speech.
Hangover bridges brief dips mid-utterance — after the score drops below threshold, the speech decision holds for hangover_frames additional frames.
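The transition table can be sketched as a small state machine (the struct and field names here are illustrative, not the library's):

```c
/* Onset + hangover smoothing of the per-frame score. */
typedef struct {
    int in_speech;       /* 0 = SILENCE, 1 = SPEECH                */
    int onset_count;     /* consecutive frames above threshold     */
    int hangover_count;  /* frames left before dropping to silence */
} vad_state_t;

static int vad_update(vad_state_t *st, float score, float threshold,
                      int onset_frames, int hangover_frames)
{
    if (!st->in_speech) {
        if (score >= threshold) {
            if (++st->onset_count >= onset_frames) {
                st->in_speech = 1;
                st->hangover_count = hangover_frames;
            }
        } else {
            st->onset_count = 0;       /* a transient broke the run */
        }
    } else {
        if (score >= threshold) {
            st->hangover_count = hangover_frames;  /* refresh hangover */
        } else if (--st->hangover_count <= 0) {
            st->in_speech = 0;
            st->onset_count = 0;
        }
    }
    return st->in_speech;
}
```

With `onset_frames = 2` and `hangover_frames = 2`, a single above-threshold frame does not trigger speech, and a single below-threshold dip mid-utterance does not end it.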
The interactive plot below shows the VAD processing a synthesized signal with two speech segments (sine bursts at 1000 Hz) separated by silence:
The four panels show:
The default parameters returned by MD_vad_default_params() are not hand-picked — they were found by systematic hyperparameter optimization.
The initial defaults used equal feature weights (0.2 each), a threshold of 0.5, and conservative onset/hangover settings. These worked in clean conditions but degraded significantly in noise, yielding an F2 score of only 0.837 on a standardized benchmark.
The optimization used the LibriVAD corpus, specifically the train-clean-100 split:
The search used Optuna with the TPE (Tree-structured Parzen Estimator) sampler:
| Metric | Baseline | Optimized | Change |
|---|---|---|---|
| F2 | 0.837 | 0.933 | +0.096 |
| Precision | 0.835 | 0.782 | -0.053 |
| Recall | 0.838 | 0.981 | +0.143 |
The optimizer traded some precision for a large recall gain — the system now catches 98% of speech frames.
The optimized parameters are robust across noise types and SNR levels:
| Condition | F2 range (across SNRs) | Worst SNR |
|---|---|---|
| Babble noise | 0.907 – 0.955 | -5 dB |
| City noise | 0.907 – 0.957 | -5 dB |
| Domestic noise | 0.906 – 0.957 | -5 dB |
| Nature noise | 0.906 – 0.958 | -5 dB |
| Office noise | 0.891 – 0.957 | -5 dB |
| Public noise | 0.904 – 0.957 | -5 dB |
| SSN noise | 0.906 – 0.956 | -5 dB |
| Street noise | 0.908 – 0.957 | -5 dB |
| Transport noise | 0.907 – 0.958 | -5 dB |
Even at -5 dB SNR (noise louder than speech), F2 stays above 0.89. Office noise at -5 dB is the hardest condition (F2=0.891), likely because office noise has speech-like spectral characteristics.
If your domain differs significantly from LibriVAD (e.g., telephony, far-field, non-speech audio), you can re-run the optimization on your own data. The optimization script is in optimize/VAD/:
See optimize/VAD/README.md for full usage, including support for Audacity label files.
How does the miniDSP VAD compare to a published neural baseline? We evaluated both systems on the LibriVAD test-clean split (702 files, 9 noise types × 6 SNR levels) using the methodology from the LibriVAD paper.
| Property | miniDSP VAD | ViT-MFCC (small) |
|---|---|---|
| Type | Hand-crafted features + weighted threshold + state machine | Vision Transformer (KWT) on MFCCs |
| Language | C (via pyminidsp CFFI bindings) | Python / PyTorch |
| Trainable parameters | 0 (10 tuned hyperparameters) | ~3.5 M |
| Frame length | 20 ms | 10 ms |
| Training data | Optuna optimization on train-clean-100 (small) | Supervised training on LibriSpeechConcat (small) |
Both systems used the same-sized small training split, so the comparison isolates the modeling approach (hand-crafted features vs learned ViT) rather than data scale. Ground-truth labels are downsampled independently to each system's native frame rate via majority voting. F2 (beta=2) weights recall twice as heavily as precision. The ViT uses a fixed 0.5 decision threshold; the miniDSP VAD uses the Optuna-optimized defaults from the C library.
| System | F2 | Precision | Recall | AUC (macro) | AUC (pooled) | Wall time |
|---|---|---|---|---|---|---|
| miniDSP VAD | 0.8440 | 0.8274 | 0.8482 | 0.6519 | 0.6517 | 3.6 s |
| ViT-MFCC (small) | 0.9614 | 0.9390 | 0.9672 | 0.9712 | 0.9819 | 77.8 s |
The ViT leads by +0.117 F2 and +0.319 AUC (macro) but takes 22× longer to process the same data. Our ViT AUC of 0.9712 closely reproduces the 0.9710 reported in the LibriVAD paper (Table 7).
| SNR (dB) | miniDSP F2 | ViT F2 | F2 Gap | miniDSP AUC | ViT AUC | AUC Gap |
|---|---|---|---|---|---|---|
| -5 | 0.7993 | 0.9321 | 0.133 | 0.5785 | 0.9145 | 0.336 |
| 0 | 0.8358 | 0.9523 | 0.117 | 0.6205 | 0.9564 | 0.336 |
| 5 | 0.8512 | 0.9606 | 0.109 | 0.6539 | 0.9787 | 0.325 |
| 10 | 0.8515 | 0.9685 | 0.117 | 0.6730 | 0.9888 | 0.316 |
| 15 | 0.8588 | 0.9752 | 0.116 | 0.6870 | 0.9934 | 0.306 |
| 20 | 0.8688 | 0.9801 | 0.111 | 0.6985 | 0.9955 | 0.297 |
The F2 gap is consistent across SNR levels (~0.11–0.13), widening slightly at -5 dB. The AUC gap is larger (~0.30–0.34) and narrows with increasing SNR, showing that miniDSP's continuous scores improve more with cleaner signals than its binary decisions suggest.
miniDSP performs best on noise types with distinctive spectral characteristics: babble (F2 gap 0.050) and SSN (F2 gap 0.080), where its energy and band-energy-ratio features are most discriminative. The largest F2 gaps appear with transport (0.178) and office (0.169) noise, which overlap more with the speech band.
AUC tells a different story: miniDSP's best AUC is on domestic noise (0.717), while babble and SSN — where F2 is strongest — have relatively low AUC (0.60–0.62). This indicates the good F2 on those noise types relies on threshold tuning rather than score quality.
The results above use the library's default parameters (optimized on train-clean-100). To check for overfitting, we compared three Optuna-optimized parameter sets, each tuned on a different data split, all evaluated on test-clean:
| Parameter set | F2 | Precision | Recall |
|---|---|---|---|
| train-clean-100 (small) | 0.9340 | 0.7687 | 0.9870 |
| dev-clean | 0.9347 | 0.7704 | 0.9873 |
| test-clean (cheat) | 0.9348 | 0.7779 | 0.9844 |
| ViT-MFCC (small) | 0.9614 | 0.9390 | 0.9672 |
All three miniDSP parameter sets produce nearly identical F2 (~0.934–0.935). Even optimizing directly on the evaluation data ("cheating") yields only +0.001 F2 over the train-clean-100 parameters. The F2 ceiling from threshold/weight tuning alone is ~0.935 — the remaining gap to the ViT (0.935 → 0.961) requires richer features or a learned model.
AUC measures how well a system's continuous scores separate speech from non-speech across all thresholds. F2 measures binary decision quality at a specific operating point.
miniDSP's low AUC (~0.65) vs the ViT's high AUC (~0.97) means the miniDSP's raw scores have limited discriminative power — its good F2 comes from Optuna finding the right operating point, not from intrinsically well-separated scores. If the threshold needs re-tuning for different conditions, there is limited headroom in the score distribution.
Practical implications: For embedded devices, real-time pipelines, or latency-sensitive applications where conditions are reasonably known, the miniDSP VAD offers a strong cost/performance tradeoff (97% of the ViT's F2 at 22× the speed with zero learned parameters). For maximum accuracy with compute budget available, especially across diverse or unknown noise conditions, the ViT-MFCC is clearly superior.
The comparison script and full per-condition breakdowns are in compare/VAD/:
See compare/VAD/README.md for setup and compare/VAD/README_RESULTS.md for the complete results including per-noise-type breakdowns for all parameter sets.
API:
Quick example — initialize and process:
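A sketch of the shape of such a loop. Only `MD_vad_default_params()` appears elsewhere in this tutorial; `MD_vad_params`, `MD_VAD`, `MD_vad_init`, and `MD_vad_process` are hypothetical names used for illustration, so check the library header for the actual types and signatures:

```c
/* Sketch only: MD_vad_params, MD_VAD, MD_vad_init, and MD_vad_process
   are hypothetical names; see the miniDSP header for the real API. */
#define FRAME 320   /* e.g. 20 ms at 16 kHz */

MD_vad_params params = MD_vad_default_params();  /* Optuna-tuned defaults */
MD_VAD vad;
MD_vad_init(&vad, 16000.0f, FRAME, &params);

float frame[FRAME];                   /* fill from your audio source */
int is_speech = MD_vad_process(&vad, frame, FRAME);
```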
Calibration — optional, improves accuracy if silence frames are available:
Custom weights — use a single feature:
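Following the weighting scheme above, zeroing all but one weight reduces the detector to a single-feature VAD. The `weights` field name below is a hypothetical illustration of the parameter struct, not the verified layout:

```c
/* Sketch only: the weights field name is hypothetical. Setting one
   weight to 1.0 and the rest to 0.0 yields an energy-only detector. */
MD_vad_params p = MD_vad_default_params();
p.weights[0] = 1.0f;              /* frame energy */
for (int i = 1; i < 5; i++)
    p.weights[i] = 0.0f;
```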