# pyminidsp > Python bindings to the miniDSP C library — a comprehensive DSP toolkit providing signal generation, spectral analysis, filtering, effects, and more. All functions accept and return NumPy arrays. Source: https://github.com/wooters/pyminidsp Documentation: https://wooters.github.io/pyminidsp/ # API Reference ## Signal measurement, analysis, and scaling functions ### `bessel_i0(x: 'float') -> 'float'` Compute the zeroth-order modified Bessel function of the first kind. --- ### `sinc(x: 'float') -> 'float'` Compute the normalized sinc function: sin(pi*x) / (pi*x), with sinc(0) = 1. --- ### `dot(a: 'npt.ArrayLike', b: 'npt.ArrayLike') -> 'float'` Compute the dot product of two vectors. --- ### `entropy(a: 'npt.ArrayLike', clip: 'bool' = False) -> 'float'` Compute the normalized entropy of a distribution. --- ### `energy(a: 'npt.ArrayLike') -> 'float'` Compute signal energy: sum of squared samples. --- ### `power(a: 'npt.ArrayLike') -> 'float'` Compute signal power: energy / N. --- ### `power_db(a: 'npt.ArrayLike') -> 'float'` Compute signal power in decibels. --- ### `rms(a: 'npt.ArrayLike') -> 'float'` Compute the root mean square (RMS) of a signal. --- ### `zero_crossing_rate(a: 'npt.ArrayLike') -> 'float'` Compute the zero-crossing rate of a signal. --- ### `autocorrelation(a: 'npt.ArrayLike', max_lag: 'int') -> 'npt.NDArray[np.float64]'` Compute the normalised autocorrelation of a signal. Args: a: Input signal. max_lag: Number of lag values to compute. Returns: numpy array of autocorrelation values, length max_lag. --- ### `peak_detect(a: 'npt.ArrayLike', threshold: 'float' = 0.0, min_distance: 'int' = 1) -> 'npt.NDArray[np.uint32]'` Detect peaks (local maxima) in a signal. Args: a: Input signal. threshold: Minimum value for a peak. min_distance: Minimum index gap between peaks. Returns: numpy array of peak indices. 
--- ### `f0_autocorrelation(signal: 'npt.ArrayLike', sample_rate: 'float', min_freq_hz: 'float' = 80.0, max_freq_hz: 'float' = 400.0) -> 'float'` Estimate F0 using autocorrelation. --- ### `f0_fft(signal: 'npt.ArrayLike', sample_rate: 'float', min_freq_hz: 'float' = 80.0, max_freq_hz: 'float' = 400.0) -> 'float'` Estimate F0 using FFT peak picking. --- ### `mix(a: 'npt.ArrayLike', b: 'npt.ArrayLike', w_a: 'float' = 0.5, w_b: 'float' = 0.5) -> 'npt.NDArray[np.float64]'` Mix (weighted sum) two signals. Args: a, b: Input signals of the same length. w_a, w_b: Weights for signals a and b. Returns: numpy array of the mixed signal. --- ### `scale(value: 'float', oldmin: 'float', oldmax: 'float', newmin: 'float', newmax: 'float') -> 'float'` Map a single value from one range to another. --- ### `scale_vec(a: 'npt.ArrayLike', oldmin: 'float', oldmax: 'float', newmin: 'float', newmax: 'float') -> 'npt.NDArray[np.float64]'` Map every element of a vector from one range to another. --- ### `fit_within_range(a: 'npt.ArrayLike', newmin: 'float', newmax: 'float') -> 'npt.NDArray[np.float64]'` Fit values within [newmin, newmax]. --- ### `adjust_dblevel(signal: 'npt.ArrayLike', dblevel: 'float') -> 'npt.NDArray[np.float64]'` Automatic Gain Control: scale signal to target dB level, clip to [-1, 1]. --- --- ## DTMF tone generation and detection ### `dtmf_detect(signal: 'npt.ArrayLike', sample_rate: 'float' = 8000.0, max_tones: 'int' = 64) -> 'list[tuple[str, float, float]]'` Detect DTMF tones in an audio signal. Args: signal: Audio samples. sample_rate: Sampling rate in Hz. max_tones: Maximum number of tones to detect. Returns: List of (digit, start_s, end_s) tuples. --- ### `dtmf_generate(digits: 'str', sample_rate: 'float' = 8000.0, tone_ms: 'int' = 70, pause_ms: 'int' = 70) -> 'npt.NDArray[np.float64]'` Generate a DTMF tone sequence. Args: digits: String of DTMF characters ('0'-'9', 'A'-'D', '*', '#'). sample_rate: Sampling rate in Hz. tone_ms: Duration of each tone in ms (>= 40). 
pause_ms: Duration of silence between tones in ms (>= 40). Returns: numpy array of audio samples. --- ### `dtmf_signal_length(num_digits: 'int', sample_rate: 'float' = 8000.0, tone_ms: 'int' = 70, pause_ms: 'int' = 70) -> 'int'` Calculate the number of samples needed for dtmf_generate(). --- --- ## Simple audio effects: delay/echo, tremolo, comb reverb ### `delay_echo(signal: 'npt.ArrayLike', delay_samples: 'int', feedback: 'float' = 0.5, dry: 'float' = 1.0, wet: 'float' = 0.5) -> 'npt.NDArray[np.float64]'` Apply a delay/echo effect. Args: signal: Input signal. delay_samples: Delay length in samples. feedback: Echo feedback gain (abs(feedback) < 1). dry: Dry mix weight. wet: Wet mix weight. Returns: numpy array of the processed signal. --- ### `tremolo(signal: 'npt.ArrayLike', rate_hz: 'float', depth: 'float' = 0.5, sample_rate: 'float' = 44100.0) -> 'npt.NDArray[np.float64]'` Apply a tremolo effect (amplitude modulation). Args: signal: Input signal. rate_hz: LFO rate in Hz. depth: Modulation depth in [0, 1]. sample_rate: Sampling rate in Hz. Returns: numpy array of the processed signal. --- ### `comb_reverb(signal: 'npt.ArrayLike', delay_samples: 'int', feedback: 'float' = 0.5, dry: 'float' = 1.0, wet: 'float' = 0.3) -> 'npt.NDArray[np.float64]'` Apply a comb-filter reverb effect. Args: signal: Input signal. delay_samples: Comb delay in samples. feedback: Feedback gain (abs(feedback) < 1). dry: Dry mix weight. wet: Wet mix weight. Returns: numpy array of the processed signal. --- --- ## FIR filters, convolution, and biquad (IIR) filtering ### `convolution_num_samples(signal_len: 'int', kernel_len: 'int') -> 'int'` Compute the output length of a full linear convolution. --- ### `convolution_time(signal: 'npt.ArrayLike', kernel: 'npt.ArrayLike') -> 'npt.NDArray[np.float64]'` Time-domain full linear convolution. Returns: numpy array of length signal_len + kernel_len - 1. 
--- ### `moving_average(signal: 'npt.ArrayLike', window_len: 'int') -> 'npt.NDArray[np.float64]'` Causal moving-average FIR filter. Returns: numpy array of the same length as the input. --- ### `fir_filter(signal: 'npt.ArrayLike', coeffs: 'npt.ArrayLike') -> 'npt.NDArray[np.float64]'` Apply a causal FIR filter with arbitrary coefficients. Returns: numpy array of the same length as the input. --- ### `design_lowpass_fir(num_taps: 'int', cutoff_freq: 'float', sample_rate: 'float', kaiser_beta: 'float' = 5.0) -> 'npt.NDArray[np.float64]'` Design a Kaiser-windowed sinc lowpass FIR filter. Args: num_taps: Number of filter coefficients (filter order + 1). cutoff_freq: Cutoff frequency in Hz. sample_rate: Sampling rate in Hz. kaiser_beta: Kaiser window shape parameter (default 5.0). Returns: numpy array of length num_taps containing the filter coefficients. --- ### `convolution_fft_ola(signal: 'npt.ArrayLike', kernel: 'npt.ArrayLike') -> 'npt.NDArray[np.float64]'` Full linear convolution using FFT overlap-add. Returns: numpy array of length signal_len + kernel_len - 1. --- ### `class BiquadFilter(filter_type: 'int', freq: 'float', sample_rate: 'float', db_gain: 'float' = 0.0, bandwidth: 'float' = 1.0) -> 'None'` Biquad (second-order IIR) filter. Supports low-pass, high-pass, band-pass, notch, peaking EQ, low shelf, and high shelf filter types. Example: >>> filt = BiquadFilter(LPF, freq=1000.0, sample_rate=44100.0) >>> for sample in signal: ... output = filt.process(sample) --- --- ## Generalized Cross-Correlation (GCC) delay estimation ### `get_delay(sig_a: 'npt.ArrayLike', sig_b: 'npt.ArrayLike', margin: 'int', weighting: 'int' = 1) -> 'tuple[int, float]'` Estimate the delay between two signals using GCC. Args: sig_a: First signal. sig_b: Second signal. margin: Search +/- this many samples around zero-lag. weighting: GCC_SIMP or GCC_PHAT. Returns: (delay, entropy) tuple. Delay in samples (positive = sig_b lags sig_a). 
--- ### `get_multiple_delays(signals: 'list[npt.ArrayLike]', margin: 'int', weighting: 'int' = 1) -> 'npt.NDArray[np.int32]'` Estimate delays between a reference signal and M-1 other signals. Args: signals: List of numpy arrays (signals[0] is reference). margin: Search window in samples. weighting: GCC_SIMP or GCC_PHAT. Returns: numpy array of M-1 delay values. --- ### `gcc(sig_a: 'npt.ArrayLike', sig_b: 'npt.ArrayLike', weighting: 'int' = 1) -> 'npt.NDArray[np.float64]'` Compute the full generalized cross-correlation between two signals. Args: sig_a: First signal. sig_b: Second signal. weighting: GCC_SIMP or GCC_PHAT. Returns: numpy array of N doubles (zero-lag at index ceil(N/2)). --- --- ## Signal generators: sine, noise, impulse, chirps, and spectrogram text ### `sine_wave(n: 'int', amplitude: 'float' = 1.0, freq: 'float' = 440.0, sample_rate: 'float' = 44100.0) -> 'npt.NDArray[np.float64]'` Generate a sine wave. Args: n: Number of samples. amplitude: Peak amplitude. freq: Frequency in Hz. sample_rate: Sampling rate in Hz. Returns: numpy array of length n. --- ### `white_noise(n: 'int', amplitude: 'float' = 1.0, seed: 'int' = 42) -> 'npt.NDArray[np.float64]'` Generate Gaussian white noise. Args: n: Number of samples. amplitude: Standard deviation. seed: Random seed for reproducibility. Returns: numpy array of length n. --- ### `impulse(n: 'int', amplitude: 'float' = 1.0, position: 'int' = 0) -> 'npt.NDArray[np.float64]'` Generate a discrete impulse (Kronecker delta). Args: n: Number of samples. amplitude: Spike amplitude. position: Sample index of the spike. Returns: numpy array of length n. --- ### `chirp_linear(n: 'int', amplitude: 'float' = 1.0, f_start: 'float' = 200.0, f_end: 'float' = 4000.0, sample_rate: 'float' = 16000.0) -> 'npt.NDArray[np.float64]'` Generate a linear chirp (swept sine). Args: n: Number of samples. amplitude: Peak amplitude. f_start: Starting frequency in Hz. f_end: Ending frequency in Hz. sample_rate: Sampling rate in Hz. 
Returns: numpy array of length n. --- ### `chirp_log(n: 'int', amplitude: 'float' = 1.0, f_start: 'float' = 20.0, f_end: 'float' = 20000.0, sample_rate: 'float' = 44100.0) -> 'npt.NDArray[np.float64]'` Generate a logarithmic chirp. Args: n: Number of samples. amplitude: Peak amplitude. f_start: Starting frequency in Hz (must be > 0). f_end: Ending frequency in Hz (must be > 0, != f_start). sample_rate: Sampling rate in Hz. Returns: numpy array of length n. --- ### `square_wave(n: 'int', amplitude: 'float' = 1.0, freq: 'float' = 440.0, sample_rate: 'float' = 44100.0) -> 'npt.NDArray[np.float64]'` Generate a square wave. --- ### `sawtooth_wave(n: 'int', amplitude: 'float' = 1.0, freq: 'float' = 440.0, sample_rate: 'float' = 44100.0) -> 'npt.NDArray[np.float64]'` Generate a sawtooth wave. --- ### `shepard_tone(n: 'int', amplitude: 'float' = 0.8, base_freq: 'float' = 440.0, sample_rate: 'float' = 44100.0, rate_octaves_per_sec: 'float' = 0.5, num_octaves: 'int' = 8) -> 'npt.NDArray[np.float64]'` Generate a Shepard tone (auditory illusion of endlessly rising/falling pitch). Args: n: Number of samples. amplitude: Peak amplitude. base_freq: Centre frequency of the Gaussian envelope in Hz. sample_rate: Sampling rate in Hz. rate_octaves_per_sec: Glissando rate (positive=rising, negative=falling). num_octaves: Number of audible octave layers. Returns: numpy array of length n. --- ### `spectrogram_text(text: 'str', freq_lo: 'float' = 200.0, freq_hi: 'float' = 7500.0, duration_sec: 'float' = 2.0, sample_rate: 'float' = 16000.0) -> 'npt.NDArray[np.float64]'` Synthesise audio that displays readable text in a spectrogram. Args: text: ASCII string to render. freq_lo: Lowest frequency in Hz. freq_hi: Highest frequency in Hz. duration_sec: Total duration in seconds. sample_rate: Sample rate in Hz. Returns: numpy array of audio samples. 
--- --- ## Shared constants, CFFI helpers, and cleanup for pyminidsp submodules ### `class MiniDSPError(code: 'int', func_name: 'str', message: 'str') -> 'None'` Raised when the miniDSP C library reports an error. --- - `ERR_NULL_POINTER = 1` - `ERR_INVALID_SIZE = 2` - `ERR_INVALID_RANGE = 3` - `ERR_ALLOC_FAILED = 4` ### `shutdown() -> 'None'` Free all internally cached FFT plans and buffers. --- - `LPF = 0` - `HPF = 1` - `BPF = 2` - `NOTCH = 3` - `PEQ = 4` - `LSH = 5` - `HSH = 6` - `STEG_LSB = 0` - `STEG_FREQ_BAND = 1` - `STEG_SPECTEXT = 2` - `STEG_TYPE_TEXT = 0` - `STEG_TYPE_BINARY = 1` - `GCC_SIMP = 0` - `GCC_PHAT = 1` - `VAD_NUM_FEATURES = 5` --- ## Polyphase sinc resampling (sample rate conversion) ### `resample_output_len(input_len: 'int', in_rate: 'float', out_rate: 'float') -> 'int'` Compute the number of output samples for a given resampling operation. --- ### `resample(signal: 'npt.ArrayLike', in_rate: 'float', out_rate: 'float', num_zero_crossings: 'int' = 13, kaiser_beta: 'float' = 5.0) -> 'npt.NDArray[np.float64]'` Resample a signal using a polyphase sinc resampler with Kaiser-windowed anti-aliasing filter. Args: signal: Input signal. in_rate: Input sample rate in Hz. out_rate: Output sample rate in Hz. num_zero_crossings: Number of zero crossings in the sinc kernel (default 13). kaiser_beta: Kaiser window shape parameter (default 5.0). Returns: numpy array of resampled signal. --- --- ## FFT-based spectrum analysis, STFT, mel filterbanks, MFCCs, and window functions ### `lowpass_brickwall(signal: 'npt.ArrayLike', cutoff_hz: 'float', sample_rate: 'float') -> 'npt.NDArray[np.float64]'` Apply an FFT-based ideal (brickwall) lowpass filter. Zeroes all frequency bins above the cutoff frequency. Operates by copying the signal, applying the filter in-place, and returning the result. Args: signal: Input signal. cutoff_hz: Cutoff frequency in Hz. sample_rate: Sampling rate in Hz. Returns: numpy array of the same length as the input. 
--- ### `magnitude_spectrum(signal: 'npt.ArrayLike') -> 'npt.NDArray[np.float64]'` Compute the magnitude spectrum of a real-valued signal. Returns: numpy array of length N/2 + 1 containing magnitudes. --- ### `power_spectral_density(signal: 'npt.ArrayLike') -> 'npt.NDArray[np.float64]'` Compute the power spectral density (PSD) of a signal. Returns: numpy array of length N/2 + 1 containing power values. --- ### `phase_spectrum(signal: 'npt.ArrayLike') -> 'npt.NDArray[np.float64]'` Compute the one-sided phase spectrum in radians. Returns: numpy array of length N/2 + 1 with phase in [-pi, pi]. --- ### `stft_num_frames(signal_len: 'int', n: 'int', hop: 'int') -> 'int'` Compute the number of STFT frames. --- ### `stft(signal: 'npt.ArrayLike', n: 'int', hop: 'int') -> 'npt.NDArray[np.float64]'` Compute the Short-Time Fourier Transform (STFT). Args: signal: Input signal. n: FFT window size. hop: Hop size in samples. Returns: 2D numpy array of shape (num_frames, n//2+1) containing magnitudes. --- ### `mel_filterbank(n: 'int', sample_rate: 'float', num_mels: 'int' = 26, min_freq_hz: 'float' = 0.0, max_freq_hz: 'float | None' = None) -> 'npt.NDArray[np.float64]'` Build a mel-spaced triangular filterbank matrix. Args: n: FFT size. sample_rate: Sampling rate in Hz. num_mels: Number of mel filters. min_freq_hz: Lower frequency bound. max_freq_hz: Upper frequency bound (defaults to sample_rate/2). Returns: 2D numpy array of shape (num_mels, n//2+1). --- ### `mel_energies(signal: 'npt.ArrayLike', sample_rate: 'float', num_mels: 'int' = 26, min_freq_hz: 'float' = 0.0, max_freq_hz: 'float | None' = None) -> 'npt.NDArray[np.float64]'` Compute mel-band energies from a single frame. Returns: numpy array of length num_mels. --- ### `mfcc(signal: 'npt.ArrayLike', sample_rate: 'float', num_mels: 'int' = 26, num_coeffs: 'int' = 13, min_freq_hz: 'float' = 0.0, max_freq_hz: 'float | None' = None) -> 'npt.NDArray[np.float64]'` Compute MFCCs from a single frame. Args: signal: Input frame. 
sample_rate: Sampling rate in Hz. num_mels: Number of mel bands. num_coeffs: Number of cepstral coefficients to output. min_freq_hz: Lower frequency bound. max_freq_hz: Upper frequency bound (defaults to sample_rate/2). Returns: numpy array of length num_coeffs. --- ### `hann_window(n: 'int') -> 'npt.NDArray[np.float64]'` Generate a Hanning (Hann) window of length n. --- ### `hamming_window(n: 'int') -> 'npt.NDArray[np.float64]'` Generate a Hamming window of length n. --- ### `blackman_window(n: 'int') -> 'npt.NDArray[np.float64]'` Generate a Blackman window of length n. --- ### `rect_window(n: 'int') -> 'npt.NDArray[np.float64]'` Generate a rectangular window of length n (all ones). --- ### `kaiser_window(n: 'int', beta: 'float') -> 'npt.NDArray[np.float64]'` Generate a Kaiser window of length n with shape parameter beta. Unlike other window functions, Kaiser windows require a beta parameter that controls the trade-off between main-lobe width and side-lobe level. Higher beta gives lower sidelobes but a wider main lobe. Args: n: Window length. beta: Shape parameter (typical values: 5-14). Returns: numpy array of length n. --- --- ## Audio steganography: hide and recover data within audio signals ### `steg_capacity(signal_len: 'int', sample_rate: 'float', method: 'int' = 0) -> 'int'` Compute maximum message length that can be hidden. --- ### `steg_encode(host: 'npt.ArrayLike', message: 'str', sample_rate: 'float' = 44100.0, method: 'int' = 0) -> 'tuple[npt.NDArray[np.float64], int]'` Encode a secret text message into a host audio signal. Args: host: Host signal (not modified). message: String message to hide. sample_rate: Sample rate in Hz. method: STEG_LSB or STEG_FREQ_BAND. Returns: (stego_signal, num_bytes_encoded) tuple. --- ### `steg_decode(stego: 'npt.ArrayLike', sample_rate: 'float' = 44100.0, method: 'int' = 0, max_msg_len: 'int' = 4096) -> 'str'` Decode a secret text message from a stego audio signal. Returns: Decoded string message. 
--- ### `steg_encode_bytes(host: 'npt.ArrayLike', data: 'bytes', sample_rate: 'float' = 44100.0, method: 'int' = 0) -> 'tuple[npt.NDArray[np.float64], int]'` Encode arbitrary binary data into a host audio signal. Args: host: Host signal. data: bytes-like object to hide. sample_rate: Sample rate in Hz. method: STEG_LSB or STEG_FREQ_BAND. Returns: (stego_signal, num_bytes_encoded) tuple. --- ### `steg_decode_bytes(stego: 'npt.ArrayLike', sample_rate: 'float' = 44100.0, method: 'int' = 0, max_len: 'int' = 4096) -> 'bytes'` Decode binary data from a stego audio signal. Returns: bytes object containing the decoded data. --- ### `steg_detect(signal: 'npt.ArrayLike', sample_rate: 'float' = 44100.0) -> 'tuple[int | None, int | None]'` Detect which steganography method was used. Returns: (method, payload_type) tuple, or (None, None) if no steg detected. method is STEG_LSB, STEG_FREQ_BAND, or None. payload_type is STEG_TYPE_TEXT, STEG_TYPE_BINARY, or None. --- --- ## Voice activity detection (VAD) with adaptive normalization ### `class VAD(*, threshold: 'float | None' = None, onset_frames: 'int | None' = None, hangover_frames: 'int | None' = None, adaptation_rate: 'float | None' = None, band_low_hz: 'float | None' = None, band_high_hz: 'float | None' = None, weights: 'Sequence[float] | None' = None) -> 'None'` Voice activity detector with adaptive feature normalization and onset/hangover smoothing. Wraps the miniDSP C library's stateful VAD API. The detector extracts five features per frame (energy, ZCR, spectral entropy, spectral flatness, band energy ratio), normalizes them adaptively, computes a weighted score, and applies an onset/hangover state machine. Example: >>> detector = VAD(threshold=0.4) >>> detector.calibrate(silence_frame, sample_rate=16000.0) >>> decision, score, features = detector.process_frame(frame, 16000.0) --- --- # Tutorials # Signal Generators pyminidsp provides stateless signal generators for creating test signals. 
No audio input or microphone source is needed — just specify the parameters and get a NumPy array back.

## Sine wave

The fundamental test signal — a pure tone at a single frequency:

$$ x[n] = A \sin(2\pi f \, n / f_s) $$

```python
import pyminidsp as md

signal = md.sine_wave(44100, amplitude=1.0, freq=440.0, sample_rate=44100.0)

# Verify: the FFT peak should align with the expected frequency bin
mag = md.magnitude_spectrum(signal)
```

## Impulse (Kronecker delta)

A single spike at a given position, zeros everywhere else. The unit impulse (amplitude 1.0 at position 0) is the identity element of convolution and has a perfectly flat magnitude spectrum.

```python
imp = md.impulse(1024, amplitude=1.0, position=0)

# Flat spectrum — all bins have equal magnitude
mag = md.magnitude_spectrum(imp)
```

## Chirp (swept sine)

Two varieties:

**Linear chirp** — frequency sweeps at a constant rate. The instantaneous frequency traces a straight diagonal in the spectrogram.

```python
# 1-second sweep from 200 Hz to 4 kHz at 16 kHz sample rate
chirp = md.chirp_linear(16000, amplitude=1.0, f_start=200.0, f_end=4000.0,
                        sample_rate=16000.0)
```

**Logarithmic chirp** — exponential sweep, spending equal time per octave. Ideal for measuring systems on a log-frequency axis.

```python
# Full audible range sweep: 20 Hz to 20 kHz
chirp = md.chirp_log(44100, amplitude=1.0, f_start=20.0, f_end=20000.0,
                     sample_rate=44100.0)
```

## Square wave

Alternates between +amplitude and −amplitude. Its Fourier series contains only **odd harmonics** (1f, 3f, 5f, …) with amplitudes decaying as 1/k — a textbook demonstration of the Gibbs phenomenon.

```python
sq = md.square_wave(4096, amplitude=1.0, freq=440.0, sample_rate=44100.0)
```

## Sawtooth wave

Ramps linearly from −amplitude to +amplitude each period.
Contains **all integer harmonics** (1f, 2f, 3f, …) decaying as 1/k — richer harmonic content than the square wave's odd-only series. ```python saw = md.sawtooth_wave(4096, amplitude=1.0, freq=440.0, sample_rate=44100.0) ``` ## White noise Gaussian white noise has equal power at all frequencies — its PSD is approximately flat. Samples follow N(0, σ²) via the Box-Muller transform. A fixed seed gives reproducible output. ```python noise = md.white_noise(4096, amplitude=1.0, seed=42) # Same seed → same output noise2 = md.white_noise(4096, amplitude=1.0, seed=42) assert (noise == noise2).all() ``` ## Shepard tone See `shepard-tone` for a dedicated guide on this auditory illusion. --- # Basic Signal Operations Five fundamental time-domain analysis techniques that work alongside `pyminidsp.energy`, `pyminidsp.power`, and `pyminidsp.entropy`. ## RMS (Root Mean Square) The standard measure of signal "loudness": $$ \text{RMS} = \sqrt{\frac{1}{N}\sum_{n=0}^{N-1} x[n]^2} $$ A unit sine wave yields ≈ 0.707; a DC signal of value *c* has RMS = $|c|$. ```python import pyminidsp as md signal = md.sine_wave(44100, amplitude=1.0, freq=440.0, sample_rate=44100.0) print(md.rms(signal)) # ≈ 0.707 ``` ## Zero-crossing rate Counts how often the signal changes sign, normalised by the number of adjacent pairs. High ZCR → noise or high-frequency content. Low ZCR → tonal or low-frequency content. ```python signal = md.sine_wave(16000, freq=1000.0, sample_rate=16000.0) zcr = md.zero_crossing_rate(signal) # zcr ≈ 2 * 1000 / 16000 = 0.125 ``` ## Autocorrelation Measures the similarity between a signal and a delayed copy of itself. Periodic signals produce a strong peak at the fundamental period — the basis of autocorrelation-based pitch detection. 
```python signal = md.sine_wave(1024, freq=100.0, sample_rate=1000.0) acf = md.autocorrelation(signal, max_lag=50) # acf[0] = 1.0 # acf[10] ≈ 1.0 (lag 10 = one period of 100 Hz at 1 kHz sample rate) ``` ## Peak detection Finds local maxima above a threshold with a minimum distance constraint to suppress secondary peaks. ```python import numpy as np signal = np.array([0, 1, 3, 1, 0, 2, 5, 2, 0], dtype=float) peaks = md.peak_detect(signal, threshold=0.0, min_distance=1) print(peaks) # [2, 6] (values 3 and 5) ``` ## Signal mixing Element-wise weighted sum of two signals: $$ \text{out}[n] = w_a \cdot a[n] + w_b \cdot b[n] $$ ```python sine = md.sine_wave(1024, amplitude=1.0, freq=440.0, sample_rate=44100.0) noise = md.white_noise(1024, amplitude=0.1, seed=42) mixed = md.mix(sine, noise, w_a=0.8, w_b=0.2) ``` --- # Window Functions Window functions taper finite signal blocks before FFT processing to prevent **spectral leakage** — the spreading of energy into neighbouring frequency bins caused by discontinuities at block edges. The DFT assumes the input is one period of a periodic signal. When the signal doesn't have an integer number of cycles in the block, the endpoints are mismatched. A window smoothly tapers the signal to zero at the edges, greatly reducing this leakage. ## Four window types **Hanning (Hann)** — the default choice for FFT analysis. $$ w[n] = 0.5\bigl(1 - \cos(2\pi n / (N-1))\bigr) $$ ```python import pyminidsp as md win = md.hann_window(256) ``` **Hamming** — similar to Hanning but with a lower first sidelobe. $$ w[n] = 0.54 - 0.46\cos(2\pi n / (N-1)) $$ ```python win = md.hamming_window(256) ``` **Blackman** — strongest sidelobe suppression, widest main lobe. $$ w[n] = 0.42 - 0.5\cos(2\pi n/(N-1)) + 0.08\cos(4\pi n/(N-1)) $$ ```python win = md.blackman_window(256) ``` **Rectangular** — all ones (no tapering). Narrowest main lobe but maximum sidelobe leakage. 
```python win = md.rect_window(256) ``` ## Comparison | Window | Edge values | Sidelobe level | Main lobe width | |---|---|---|---| | Rectangular | 1.0 | Highest | Narrowest | | Hanning | 0.0 | Low | Medium | | Hamming | 0.08 | Lower first sidelobe | Medium | | Blackman | 0.0 | Lowest | Widest | **Rule of thumb:** start with Hanning. Use Blackman when minimising leakage matters more than frequency resolution. --- # Computing the Magnitude Spectrum The **magnitude spectrum** tells you the amplitude of each sinusoidal component present in a signal. ## Workflow 1. Generate (or load) a signal. 2. Apply a window function to reduce spectral leakage. 3. Compute the magnitude spectrum via `pyminidsp.magnitude_spectrum`. 4. Normalise if needed. ## Example ```python import pyminidsp as md import numpy as np sr = 44100.0 N = 1024 # Build a test signal: 440 Hz + 1000 Hz + 2500 Hz + DC offset t = np.arange(N) / sr signal = (0.1 + 1.0 * np.sin(2 * np.pi * 440.0 * t) + 0.5 * np.sin(2 * np.pi * 1000.0 * t) + 0.3 * np.sin(2 * np.pi * 2500.0 * t)) mag = md.magnitude_spectrum(signal) # mag has N//2 + 1 = 513 bins # bin k → frequency = k * sr / N ``` ## Normalisation The raw output is **not** normalised by *N*. Three steps to get single-sided amplitudes: 1. Divide all bins by *N*. 2. Double interior bins (k = 1 to N/2 − 1) to account for folded negative frequencies. 3. Leave DC (k = 0) and Nyquist (k = N/2) unchanged. ```python amp = mag / N amp[1:-1] *= 2 # double interior bins ``` ## Visualisation The linear plot shows distinct peaks at the input frequencies. The logarithmic (dB) scale reveals the Hanning window's sidelobes and low-level details that are invisible on a linear axis. ```python # Convert to dB (for plotting) mag_db = 20 * np.log10(amp + 1e-12) md.shutdown() ``` --- # Power Spectral Density The Power Spectral Density (PSD) measures how a signal's **power** is distributed across frequencies. 
While the magnitude spectrum tells you the *amplitude* at each frequency, the PSD tells you the *power* — useful for noise analysis, SNR estimation, and comparing signals of different lengths. ## Formula The periodogram estimator: $$ \text{PSD}[k] = \frac{|X(k)|^2}{N} $$ **Relationship to the magnitude spectrum:** ``PSD[k] = magnitude[k]**2 / N`` **dB conversion:** use ``10 * log10()`` for power (not ``20 * log10()`` as with amplitude), because power scales with amplitude squared: ``10 * log10(A²) = 20 * log10(A)``. ## Example ```python import pyminidsp as md import numpy as np sr = 44100.0 N = 1024 # Multi-tone test signal t = np.arange(N) / sr signal = (0.1 + 1.0 * np.sin(2 * np.pi * 440.0 * t) + 0.5 * np.sin(2 * np.pi * 1000.0 * t) + 0.3 * np.sin(2 * np.pi * 2500.0 * t)) psd = md.power_spectral_density(signal) ``` ## Parseval's theorem Total time-domain energy equals frequency-domain energy (validation): ```python time_energy = np.sum(signal ** 2) freq_energy = psd[0] + 2 * np.sum(psd[1:-1]) + psd[-1] np.testing.assert_allclose(time_energy, freq_energy, rtol=1e-10) ``` ## Visualisation ```python psd_db = 10 * np.log10(psd + 1e-12) md.shutdown() ``` --- # Phase Spectrum The phase spectrum describes the **timing** of frequency components. Each DFT coefficient is a complex number; while magnitude reveals energy distribution, phase reveals the angle or shift of that frequency component: $$ \phi(k) = \arg X(k) = \text{atan2}(\text{Im}\,X(k),\;\text{Re}\,X(k)) $$ Values span $[-\pi, \pi]$. ## Key intuitions - A **cosine** at an integer bin produces $\phi \approx 0$. - A **sine** at the same bin produces $\phi \approx -\pi/2$. - A **time-delayed** signal exhibits **linear phase**: $\phi(k) = -2\pi k d / N$, a principle underlying delay estimation (GCC-PHAT). 
## Example

```python
import pyminidsp as md
import numpy as np

N = 1024
sr = 44100.0
t = np.arange(N) / sr

# Two tones with known phases
signal = (1.0 * np.cos(2 * np.pi * 440.0 * t)      # phase ≈ 0
          + 0.5 * np.sin(2 * np.pi * 1000.0 * t))  # phase ≈ -π/2

phase = md.phase_spectrum(signal)
# phase has N//2 + 1 = 513 bins, values in [-π, π]
```

**IMPORTANT**: Phase is only meaningful at bins where the magnitude is significant. Always examine `pyminidsp.magnitude_spectrum` alongside the phase to identify significant bins.

## Visualisation

```python
md.shutdown()
```

---

# STFT & Spectrogram

The magnitude spectrum reveals frequency content across an entire signal, but cannot show how that content **changes over time**. The Short-Time Fourier Transform (STFT) solves this by dividing the signal into short, overlapping frames and computing the DFT of each one, producing a 2-D time-frequency representation called a **spectrogram**.

## Key parameters

**Window size** (*n*) — larger windows give better frequency resolution but worse time resolution. For audio at 16 kHz, ``n=512`` (32 ms) is a balanced starting point.

**Hop size** (*hop*) — controls frame overlap. 75% overlap (``hop = n // 4``) is the standard choice: smooth spectrograms without excessive computation.
## Example

```python
import pyminidsp as md
import numpy as np

sr = 16000.0
N = 16000  # 1 second

# Linear chirp — frequency rises from 200 Hz to 4 kHz
signal = md.chirp_linear(N, amplitude=1.0, f_start=200.0, f_end=4000.0,
                         sample_rate=sr)

n = 512
hop = 128
spec = md.stft(signal, n=n, hop=hop)
# spec.shape == (num_frames, n // 2 + 1)

num_frames = md.stft_num_frames(N, n, hop)

# Convert bin k to Hz:        freq_hz = k * sr / n
# Convert frame f to seconds: time_s = f * hop / sr
```

## Converting to dB

Normalise by *n* before taking the log so that a full-scale sine (amplitude 1) reads near 0 dB:

```python
spec_db = 20 * np.log10(spec / n + 1e-12)
```

## Visualisation

The linear chirp appears as a diagonal stripe rising across the time-frequency plane.

```python
md.shutdown()
```

---

# Mel Filterbanks & MFCCs

Two essential features for speech and audio machine learning:

1. **Mel filterbank energies** — triangular spectral bands spaced on the mel scale, which compresses frequency representation to match human hearing.
2. **MFCCs** — decorrelated coefficients derived from the log mel energies via a DCT, widely used in speech recognition and audio classification.

## Mel scale

The HTK mapping:

$$ \text{mel}(f) = 2595 \cdot \log_{10}\!\left(1 + \frac{f}{700}\right) $$

This densifies low frequencies and coarsens high frequencies, reflecting how humans perceive pitch.

## Building a mel filterbank

```python
import pyminidsp as md

fb = md.mel_filterbank(512, sample_rate=16000.0, num_mels=26)
# fb.shape == (26, 257) — 26 triangular filters over 257 FFT bins
```

## Computing mel energies

From a single frame:

```python
signal = md.sine_wave(512, freq=440.0, sample_rate=16000.0)
mel = md.mel_energies(signal, sample_rate=16000.0, num_mels=26)
# mel.shape == (26,)
```

Processing steps (internally):

1. Apply a Hann window.
2. Compute one-sided PSD bins via FFT: ``|X(k)|² / N``.
3. Apply mel filterbank weights and sum per band.
## Computing MFCCs ```python coeffs = md.mfcc(signal, sample_rate=16000.0, num_mels=26, num_coeffs=13) # coeffs.shape == (13,) ``` Conventions: - HTK mel mapping for filter placement. - Natural-log compression: ``log(max(E_mel, 1e-12))``. - DCT-II with HTK-C0 normalisation. - Coefficient C0 is in ``coeffs[0]``. ## Processing a full utterance To extract MFCCs from a longer signal, use the STFT to break it into frames first: ```python import numpy as np sr = 16000.0 frame_size = 512 hop = 128 # Load or generate a signal signal = md.chirp_linear(int(sr), f_start=200.0, f_end=4000.0, sample_rate=sr) num_frames = md.stft_num_frames(len(signal), frame_size, hop) all_mfcc = np.zeros((num_frames, 13)) for f in range(num_frames): start = f * hop frame = signal[start:start + frame_size] all_mfcc[f] = md.mfcc(frame, sample_rate=sr, num_mels=26, num_coeffs=13) md.shutdown() ``` --- # Pitch Detection Two methods for estimating the fundamental frequency (F0) of a signal. ## Autocorrelation method Searches for the strongest peak in the normalised autocorrelation: $$ f_0 = \frac{f_s}{\tau_\text{peak}} $$ More robust for noisy or strongly harmonic signals. ```python import pyminidsp as md signal = md.sine_wave(4096, freq=200.0, sample_rate=16000.0) f0 = md.f0_autocorrelation(signal, sample_rate=16000.0, min_freq_hz=80.0, max_freq_hz=400.0) print(f"Estimated F0: {f0:.1f} Hz") # ≈ 200.0 ``` ## FFT peak-picking method Applies a Hann window, computes the magnitude spectrum, and identifies the dominant peak in the requested frequency range: $$ f_0 = \frac{k_\text{peak} \cdot f_s}{N} $$ Simple and fast, but can lock onto harmonics (2f0, 3f0) when the fundamental is weak. ```python f0 = md.f0_fft(signal, sample_rate=16000.0, min_freq_hz=80.0, max_freq_hz=400.0) print(f"Estimated F0: {f0:.1f} Hz") # ≈ 200.0 ``` ## Practical notes - **Search range** is critical for both methods. Use prior knowledge of the expected pitch range (e.g. 80–400 Hz for speech). 
- A return value of **0.0** means no reliable F0 was found — typically silence, unvoiced speech, or noisy frames. - Longer frames improve frequency resolution but reduce time resolution. ```python md.shutdown() ``` --- # FIR Filters & Convolution Four complementary methods for filtering and convolution, from educational time-domain approaches to efficient FFT-based processing. ## Time-domain convolution For signals of length *N* and kernels of length *M*, computes the full linear convolution. Output length is ``N + M - 1``. ```python import pyminidsp as md import numpy as np signal = md.impulse(100, amplitude=1.0, position=0) kernel = np.array([1.0, 2.0, 3.0]) out = md.convolution_time(signal, kernel) # out[:3] == [1.0, 2.0, 3.0] # len(out) == 102 ``` ## Moving-average filter A simple low-pass filter that computes the running mean over a window. Output matches input length with zero-padded startup. ```python signal = md.sine_wave(1024, freq=440.0, sample_rate=44100.0) smoothed = md.moving_average(signal, window_len=5) ``` ## General FIR filter Apply a causal FIR filter with arbitrary coefficients: $$ \text{out}[n] = \sum_{k=0}^{T-1} \text{coeffs}[k] \cdot \text{signal}[n-k] $$ where *T* is the number of filter taps (``len(coeffs)``). Output matches input length. ```python coeffs = np.array([0.25, 0.5, 0.25]) filtered = md.fir_filter(signal, coeffs) ``` ## FFT overlap-add Same result as time-domain convolution but **much faster for long kernels** by processing blocks in the frequency domain. 
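The overlap-add scheme can be sketched in plain NumPy — an illustration of the idea, not the library's implementation (`convolution_fft_ola_sketch` and its block size handling are assumptions):

```python
import numpy as np

def convolution_fft_ola_sketch(signal, kernel, block_len=256):
    """Overlap-add FFT convolution: convolve each block in the
    frequency domain and add the overlapping tails together."""
    signal = np.asarray(signal, dtype=np.float64)
    kernel = np.asarray(kernel, dtype=np.float64)
    n, m = len(signal), len(kernel)
    fft_len = 1
    while fft_len < block_len + m - 1:  # next power of two
        fft_len *= 2
    H = np.fft.rfft(kernel, fft_len)    # kernel spectrum, computed once
    out = np.zeros(n + m - 1)
    for start in range(0, n, block_len):
        block = signal[start:start + block_len]
        # circular convolution of length fft_len >= len(block) + m - 1
        # equals the linear convolution of the block with the kernel
        seg = np.fft.irfft(np.fft.rfft(block, fft_len) * H, fft_len)
        end = min(start + len(block) + m - 1, len(out))
        out[start:end] += seg[:end - start]  # overlap-add the tail
    return out
```

Because `fft_len >= block_len + m - 1`, each block's circular convolution is free of wrap-around, and the tails that spill past each block are simply added into the next block's region.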
```python kernel = md.hann_window(256) out_time = md.convolution_time(signal, kernel) out_fft = md.convolution_fft_ola(signal, kernel) np.testing.assert_allclose(out_time, out_fft, atol=1e-10) ``` ## Comparison | Method | Complexity | Output length | Best for | |---|---|---|---| | ``convolution_time`` | O(NM) | N + M − 1 | Teaching, short kernels | | ``moving_average`` | O(N) | N | Simple smoothing | | ``fir_filter`` | O(NM) | N | Standard FIR design | | ``convolution_fft_ola`` | O(N log N) | N + M − 1 | Long kernels, production | ```python md.shutdown() ``` --- # Simple Audio Effects Three foundational audio effects built on delay lines. ## Delay / echo A circular buffer with feedback creates repeating echoes that decay geometrically: $$ \begin{aligned} s[n] &= x[n] + \text{feedback} \cdot s[n - D] \\ y[n] &= \text{dry} \cdot x[n] + \text{wet} \cdot s[n - D] \end{aligned} $$ ```python import pyminidsp as md signal = md.sine_wave(44100, freq=440.0, sample_rate=44100.0) echoed = md.delay_echo(signal, delay_samples=4410, feedback=0.5, dry=1.0, wet=0.5) ``` ## Tremolo Amplitude modulation by a sinusoidal LFO. The gain oscillates between ``1 - depth`` and ``1``: $$ g[n] = (1 - d) + d \cdot \frac{1 + \sin(2\pi f_\text{LFO} n / f_s)}{2} $$ ```python tremmed = md.tremolo(signal, rate_hz=5.0, depth=0.5, sample_rate=44100.0) ``` ## Comb-filter reverb Feeds delayed output back into itself, creating closely-spaced echoes that simulate reverberation: $$ \begin{aligned} c[n] &= x[n] + \text{feedback} \cdot c[n - D] \\ y[n] &= \text{dry} \cdot x[n] + \text{wet} \cdot c[n] \end{aligned} $$ ```python reverbed = md.comb_reverb(signal, delay_samples=1000, feedback=0.5, dry=1.0, wet=0.3) ``` ## Verification tips - **Impulse response:** feed an impulse through each effect. Echoes should decay predictably based on the feedback value. - **Parameter extremes:** ``depth=0`` for tremolo should return the original signal unchanged. 
- **Feedback = 0:** all effects should produce a single delayed copy (no ringing). ```python md.shutdown() ``` --- # DTMF Tone Detection & Generation Dual-Tone Multi-Frequency (DTMF) is the signalling system used by touch-tone telephones. Each keypad button is encoded as a pair of sinusoids — one from a low-frequency "row" group and one from a high-frequency "column" group: | | 1209 Hz | 1336 Hz | 1477 Hz | 1633 Hz | |---|---|---|---|---| | 697 Hz | 1 | 2 | 3 | A | | 770 Hz | 4 | 5 | 6 | B | | 852 Hz | 7 | 8 | 9 | C | | 941 Hz | \* | 0 | # | D | The frequencies were chosen to avoid harmonic relationships, preventing false detections from speech. ## Timing standards ITU-T Q.24 specifies: - Minimum tone duration: **40 ms** - Minimum inter-digit pause: **40 ms** Practical systems typically use 70–120 ms for both. ## Generating tones ```python import pyminidsp as md # Generate "5551234" at 8 kHz with 70 ms tones and pauses sig = md.dtmf_generate("5551234", sample_rate=8000.0, tone_ms=70, pause_ms=70) ``` Each digit is rendered as the sum of its row and column sinusoids at amplitude 0.5 (peak combined amplitude = 1.0). ## Detecting tones ```python tones = md.dtmf_detect(sig, sample_rate=8000.0) for digit, start_s, end_s in tones: print(f"{digit} {start_s:.3f}–{end_s:.3f} s") ``` Detection uses a sliding Hann-windowed FFT with a state machine that enforces ITU-T Q.24 minimum timing. The FFT size is the largest power of two fitting within 35 ms (e.g. 256 at 8 kHz, giving 31.25 Hz resolution). ## Round-trip verification ```python digits = "5551234" sig = md.dtmf_generate(digits, sample_rate=8000.0) detected = md.dtmf_detect(sig, sample_rate=8000.0) result = "".join(t[0] for t in detected) assert result == digits md.shutdown() ``` --- # Shepard Tone A Shepard tone is an acoustic illusion — a sound that appears to continuously rise (or fall) in pitch without ever actually leaving its frequency range. 
Cognitive scientist Roger Shepard first described this effect in 1964. It mirrors an M.C. Escher staircase: listeners perceive endless ascending motion that never reaches its destination. ## How it works The illusion relies on two principles: 1. **Octave equivalence** — the human ear perceives tones one octave apart as the "same note" at a different pitch height. 2. **Spectral envelope** — a fixed Gaussian curve in log-frequency space controls loudness. Tones near the centre are loud; those at edges fade nearly silent. Multiple sine waves — each separated by one octave — sound simultaneously while gliding upward. As tones fade at the upper edge, new tones enter at the bottom, fading in. The loudest tones always occupy the middle and move upward, so the sound seems to ascend perpetually. ## Signal model $$ x[n] = A_\text{norm}\sum_k \exp\!\left(-\frac{d_k(t)^2}{2\sigma^2}\right) \sin(\varphi_k(n)) $$ where the octave distance from the Gaussian centre is $$ d_k(t) = k - c + R\,t, \quad c = \frac{L-1}{2}, \quad \sigma = \frac{L}{4} $$ and the instantaneous frequency of layer *k* is $f_k(t) = f_\text{base} \cdot 2^{d_k(t)}$. Phase is accumulated sample-by-sample for smooth glides. ## Example ```python import pyminidsp as md # 5 seconds of endlessly rising Shepard tone at 44.1 kHz sig = md.shepard_tone(5 * 44100, amplitude=0.8, base_freq=440.0, sample_rate=44100.0, rate_octaves_per_sec=0.5, num_octaves=8) ``` ## Key parameters **Glissando rate** (``rate_octaves_per_sec``): - ``0.0`` — static chord (no motion) - ``0.5`` — moderate rise (default) - Negative values → falling Shepard tone **Number of octaves** (``num_octaves``): - 4–6 — narrow, organ-like quality - 8 — balanced (default) - 10–12 — ethereal, diffuse texture **Base frequency** (``base_freq``): centres the Gaussian envelope. Typical values: 200–600 Hz. 
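The signal model translates almost line-for-line into NumPy. The sketch below (hypothetical `shepard_tone_sketch`, not the library code) adds one assumption the equations leave implicit: each layer's octave distance wraps from the top edge back to the bottom so the glide can continue indefinitely.

```python
import numpy as np

def shepard_tone_sketch(num_samples, amplitude=0.8, base_freq=440.0,
                        sample_rate=44100.0, rate_octaves_per_sec=0.5,
                        num_octaves=8):
    L = num_octaves
    c = (L - 1) / 2.0          # Gaussian centre, in octaves
    sigma = L / 4.0            # Gaussian width
    t = np.arange(num_samples) / sample_rate
    out = np.zeros(num_samples)
    for k in range(L):
        d = k - c + rate_octaves_per_sec * t   # octave distance from centre
        # wrap layers past the top edge back to the bottom (assumption)
        d = ((d + L / 2.0) % L) - L / 2.0
        f = base_freq * 2.0 ** d               # instantaneous frequency
        phase = 2.0 * np.pi * np.cumsum(f) / sample_rate  # accumulated phase
        out += np.exp(-d ** 2 / (2.0 * sigma ** 2)) * np.sin(phase)
    return amplitude * out / np.max(np.abs(out))  # normalise to peak amplitude
```

Accumulating phase with `cumsum` keeps the waveform continuous even though the frequency of a wrapping layer jumps; the Gaussian envelope is smallest at the wrap point, which is what hides the transition.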
```python # Slowly falling Shepard tone falling = md.shepard_tone(44100 * 3, amplitude=0.8, base_freq=300.0, sample_rate=44100.0, rate_octaves_per_sec=-0.3, num_octaves=10) md.shutdown() ``` --- # Spectrogram Text Art Synthesise audio that displays **readable text** when viewed as a spectrogram — time runs horizontally, frequency vertically. ## How it works 1. Each ASCII character (32–126) is rasterised with a built-in 5 × 7 bitmap font, spaced 3 columns apart. 2. Each bitmap column becomes a time slice. 3. Each "on" pixel becomes a sine wave at the corresponding frequency between *freq_lo* and *freq_hi* (top row → highest frequency, bottom row → lowest, linearly interpolated). 4. A 3 ms raised-cosine crossfade at column boundaries suppresses clicks. 5. The output is normalised to 0.9 peak amplitude. ## Example ```python import pyminidsp as md sig = md.spectrogram_text("HELLO", freq_lo=200.0, freq_hi=7500.0, duration_sec=2.0, sample_rate=16000.0) # View the spectrogram of `sig` to see "HELLO" spelled out # in the frequency domain. ``` The result sounds like a buzzy chord, but when analysed with a spectrogram viewer (1024-point FFT at 16 kHz), the text is clearly visible. ## Tips - Use a sample rate of at least 16 kHz and keep *freq_hi* below Nyquist. - Longer *duration_sec* stretches the text horizontally — easier to read in spectrograms. - Short strings work best (the 5 × 7 font has limited resolution). ```python md.shutdown() ``` --- # Voice Activity Detection Voice activity detection (VAD) is the task of determining whether an audio frame contains speech or silence. It is a fundamental building block in speech processing pipelines — from automatic speech recognition to noise-aware audio analysis. pyminidsp provides a frame-level VAD that extracts five features per frame, normalizes them adaptively, computes a weighted score, and applies an onset/hangover state machine. 
## Features The detector extracts five features from each audio frame: **Energy** — sum of squared samples. Silence has near-zero energy; speech has high energy. Energy alone fails in moderate noise. $$ E = \sum_{n=0}^{N-1} x[n]^{2} $$ **Zero-crossing rate (ZCR)** — fraction of consecutive samples that cross zero. Voiced speech has low ZCR; unvoiced fricatives have high ZCR; silence has low ZCR. $$ \text{ZCR} = \frac{1}{N-1} \sum_{n=1}^{N-1} \mathbf{1}\!\bigl[\operatorname{sgn}(x[n]) \neq \operatorname{sgn}(x[n-1])\bigr] $$ **Spectral entropy** — how spread out the energy is across frequency bins. Speech has lower spectral entropy (energy concentrated in harmonics); noise has higher spectral entropy (energy spread evenly). $$ H = -\frac{1}{\ln K} \sum_{k=0}^{K-1} p_k \ln p_k \qquad\text{where } p_k = \frac{\text{PSD}[k]}{\sum_j \text{PSD}[j]} $$ **Spectral flatness** — ratio of the geometric mean to the arithmetic mean of the power spectrum. White noise gives SF ≈ 1; a pure tone gives SF ≈ 0. Speech falls between. $$ \text{SF} = \frac{\bigl(\prod_{k=0}^{K-1} \text{PSD}[k]\bigr)^{1/K}} {\frac{1}{K}\sum_{k=0}^{K-1} \text{PSD}[k]} $$ **Band energy ratio** — fraction of total energy that falls within the speech band (default 300–3400 Hz, telephone bandwidth). $$ \text{BER} = \frac{\sum_{k:\,f_k \in [f_\text{lo},\,f_\text{hi}]} \text{PSD}[k]} {\sum_k \text{PSD}[k]} $$ ## Adaptive normalization Raw feature values vary widely across recordings. The detector tracks per-feature minimums and maximums using an exponential moving average (EMA) and normalizes each feature to [0, 1]: $$ m_i \leftarrow m_i + \alpha\,(f_i - m_i) \qquad M_i \leftarrow M_i + \alpha\,(f_i - M_i) $$ $$ \hat{f}_i = \text{clamp}\!\Bigl(\frac{f_i - m_i}{M_i - m_i},\; 0,\; 1\Bigr) $$ The adaptation rate α (default 0.01) controls how fast the normalization adjusts. Calling `pyminidsp.VAD.calibrate` with known silence seeds the EMA estimates for faster convergence. 
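The adaptive normalization can be sketched as a standalone class. This is an illustration only: the snap-to-new-extremes behaviour on fresh minima/maxima is an assumption, and the C implementation's exact update policy may differ.

```python
class EmaNormalizer:
    """Track running min/max estimates with an EMA and map values to [0, 1]."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.m = None   # running minimum estimate
        self.M = None   # running maximum estimate

    def normalize(self, f):
        if self.m is None:              # seed on the first frame
            self.m, self.M = f, f
        # drift toward the new value; snap outward on a new extreme
        self.m = min(f, self.m + self.alpha * (f - self.m))
        self.M = max(f, self.M + self.alpha * (f - self.M))
        if self.M <= self.m:
            return 0.0
        return min(max((f - self.m) / (self.M - self.m), 0.0), 1.0)
```

Calibrating with known silence corresponds to calling `normalize` on silence-frame feature values first, so the minimum estimate is seeded before real audio arrives.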
## Weighted scoring The five normalized features are combined into a single score: $$ S = \sum_{i=0}^{4} w_i \cdot \hat{f}_i $$ By default all weights are equal (0.2 each). You can emphasize specific features — for example, weighting energy heavily for clean environments, or spectral entropy for noisy ones. ## State machine A raw score above the threshold does not immediately trigger speech. An onset/hangover state machine smooths the decision: | Current State | Condition | Action | |---|---|---| | SILENCE | score ≥ threshold | Increment onset counter | | SILENCE | onset counter ≥ ``onset_frames`` | Transition to SPEECH | | SILENCE | score < threshold | Reset onset counter | | SPEECH | score ≥ threshold | Reset hangover counter to ``hangover_frames`` | | SPEECH | score < threshold | Decrement hangover counter | | SPEECH | hangover counter reaches 0 | Transition to SILENCE | **Onset gating** prevents transient clicks from triggering false positives — the score must exceed the threshold for ``onset_frames`` consecutive frames (default 3). **Hangover** bridges brief dips mid-utterance, holding the speech state for ``hangover_frames`` frames (default 15) after activity drops. ## Creating a detector The `pyminidsp.VAD` class wraps the stateful C implementation. All parameters are optional — omitted values use sensible defaults. ```python import pyminidsp as md # Default parameters detector = md.VAD() # Custom threshold and hangover detector = md.VAD(threshold=0.4, hangover_frames=20) # Custom feature weights (energy, ZCR, spectral entropy, # spectral flatness, band energy ratio) detector = md.VAD(weights=[0.4, 0.1, 0.1, 0.1, 0.3]) ``` ## Calibrating with silence Before processing live audio, feed a few frames of known silence to seed the adaptive normalization. This improves accuracy, especially in the first few frames. 
```python import numpy as np sr = 16000.0 frame_len = 320 # 20 ms at 16 kHz silence = np.zeros(frame_len) for _ in range(10): detector.calibrate(silence, sample_rate=sr) ``` ## Frame-by-frame processing `pyminidsp.VAD.process_frame` processes a single frame and returns a ``(decision, score, features)`` tuple. ```python frame = md.sine_wave(frame_len, amplitude=1.0, freq=1000.0, sample_rate=sr) decision, score, features = detector.process_frame(frame, sr) print(f"Decision: {'speech' if decision else 'silence'}") print(f"Score: {score:.3f}") print(f"Features: {features}") ``` - **decision** — ``1`` for speech, ``0`` for silence. - **score** — weighted combination of normalized features in [0.0, 1.0]. - **features** — float64 array of length 5 with normalized feature values. ## Batch processing `pyminidsp.VAD.process` segments a signal into non-overlapping frames and processes each one, returning arrays. ```python # 1 second of signal at 16 kHz signal = md.sine_wave(16000, amplitude=1.0, freq=1000.0, sample_rate=sr) decisions, scores, features = detector.process(signal, sr, frame_len=320) print(f"Frames processed: {len(decisions)}") print(f"Speech frames: {decisions.sum()}") print(f"Features shape: {features.shape}") # (50, 5) ``` ## End-to-end example The following example creates a synthetic signal with two speech-like bursts separated by silence, runs the VAD, and prints per-frame results: ```python import numpy as np import pyminidsp as md sr = 16000.0 frame_len = 320 # 20 ms # Build signal: silence → tone → silence → tone → silence seg = int(0.3 * sr) # 300 ms segments signal = np.concatenate([ np.zeros(seg), md.sine_wave(seg, amplitude=0.8, freq=1000.0, sample_rate=sr), np.zeros(seg), md.sine_wave(seg, amplitude=0.8, freq=1000.0, sample_rate=sr), np.zeros(seg), ]) detector = md.VAD() # Calibrate with leading silence for i in range(10): frame = signal[i * frame_len:(i + 1) * frame_len] detector.calibrate(frame, sample_rate=sr) # Process decisions, scores, features = detector.process(signal, sr, frame_len) for i in range(len(decisions)): t = (i * frame_len + frame_len / 2) / sr label = "SPEECH" if decisions[i] else "silence" print(f" {t:6.3f} s score={scores[i]:.3f} {label}") ``` ## Visualisation Plotting the example above is the easiest way to inspect the detector. Four panels are useful: the per-frame peak envelope, all five normalized features, the combined score against the threshold, and the final binary decision. ## Tuning parameters The default parameters work well for clean speech at 16 kHz. For noisy environments, you may need to adjust: - **threshold** (default 0.5) — lower values increase sensitivity. - **onset_frames** (default 3) — number of consecutive above-threshold frames required to confirm speech; raise it to reject more transients. - **hangover_frames** (default 15) — how long to hold the speech state after activity drops. - **adaptation_rate** (default 0.01) — EMA learning rate for normalization. Lower values track slower-changing environments. - **band_low_hz / band_high_hz** (default 300–3400 Hz) — frequency band for the band energy ratio feature. - **weights** (default 0.2 each) — per-feature weights. Weight energy heavily for clean environments, or spectral entropy for noisy ones. --- # Audio Steganography Hide secret messages or binary data within audio signals so that casual listeners hear only the original sound, while decoders can extract the hidden payload. 
## Three methods | Method | Capacity | Audibility | Robustness | Requirement | |---|---|---|---|---| | **LSB** | ~1 bit/sample (~16 KB / 3 s @ 44.1 kHz) | Inaudible (≈ −90 dB) | Fragile (destroyed by lossy compression, resampling) | Any sample rate | | **Frequency-band** | ~2.6 kbit/s (~121 bytes / 3 s @ 44.1 kHz) | Above most listeners' hearing | Moderate (survives mild noise) | sample_rate ≥ 40 kHz | | **Spectrogram text** | ~1 bit/sample (same as LSB) | Audible as buzzy tones; visually readable in spectrogram | Fragile (same as LSB) | Any sample rate | **LSB** flips the least-significant bit of a 16-bit PCM representation — distortion ≈ −90 dB. Best for lossless pipelines (WAV, FLAC). **Frequency-band** encodes data as brief BFSK tone bursts at 18.5 kHz (bit 0) or 19.5 kHz (bit 1). Choose this when light interference is expected. **Spectrogram text** is a hybrid method that hides data via LSB encoding *and* renders the message as readable text in a spectrogram view. The message is rasterised with a built-in bitmap font, and sine waves at corresponding frequencies produce visible characters when viewed with a spectrogram analyser. **See also:** the Spectrogram Text Art guide for details on the synthesis function, and the upstream miniDSP C library's audio steganography documentation for algorithm details and C-level examples. ## Message structure All three methods prepend a **32-bit little-endian header**: bits 0–30 hold the byte count, bit 31 indicates payload type (0 = text, 1 = binary). This lets the decoder recover messages without prior knowledge of length. 
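The header layout can be illustrated with Python's `struct` module. This is a sketch of the bit layout only — the library handles this internally, and the helper names here are hypothetical:

```python
import struct

TYPE_TEXT, TYPE_BINARY = 0, 1

def pack_header(num_bytes, payload_type):
    """Bits 0-30: payload byte count; bit 31: payload type flag."""
    assert 0 <= num_bytes < (1 << 31)
    word = (payload_type << 31) | num_bytes
    return struct.pack("<I", word)            # little-endian 32-bit word

def unpack_header(raw):
    (word,) = struct.unpack("<I", raw)
    return word & 0x7FFFFFFF, word >> 31      # (byte count, payload type)
```
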
## Hiding text ```python import pyminidsp as md host = md.sine_wave(44100, amplitude=0.8, freq=440.0, sample_rate=44100.0) stego, n = md.steg_encode(host, "secret message", sample_rate=44100.0, method=md.STEG_LSB) print(f"Encoded {n} bytes") ``` The LSB-encoded output sounds identical to the 440 Hz host; frequency-band encoding adds faint high-frequency tones. Spectrogram text encoding works the same way — just pass ``method=md.STEG_SPECTEXT``: ```python stego_st, n = md.steg_encode(host, "HELLO", sample_rate=44100.0, method=md.STEG_SPECTEXT) print(f"Encoded {n} bytes (visible in spectrogram)") ``` ## Recovering text ```python recovered = md.steg_decode(stego, sample_rate=44100.0, method=md.STEG_LSB) print(recovered) # "secret message" # Recover from spectrogram-text encoded signal recovered_st = md.steg_decode(stego_st, sample_rate=44100.0, method=md.STEG_SPECTEXT) print(recovered_st) # "HELLO" ``` ## Binary data ```python data = b"\x00\x01\x02\xff\xfe\xfd" stego, n = md.steg_encode_bytes(host, data, sample_rate=44100.0) recovered = md.steg_decode_bytes(stego, sample_rate=44100.0) assert recovered == data ``` ## Automatic detection ```python method, payload_type = md.steg_detect(stego, sample_rate=44100.0) if method is not None: names = {md.STEG_LSB: "LSB", md.STEG_FREQ_BAND: "Freq-band", md.STEG_SPECTEXT: "Spectrogram-text"} print(f"Method: {names[method]}") print(f"Type: {'text' if payload_type == md.STEG_TYPE_TEXT else 'binary'}") ``` ## Capacity check ```python cap = md.steg_capacity(44100, sample_rate=44100.0, method=md.STEG_LSB) print(f"Can hide up to {cap} bytes") md.shutdown() ``` ---
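To see why LSB embedding is inaudible yet fragile, here is the principle in plain NumPy, independent of pyminidsp — quantise to 16-bit PCM, overwrite the least-significant bits with message bits, and convert back. `lsb_embed`/`lsb_extract` are hypothetical names, and the library's bit ordering and header handling are not reproduced here.

```python
import numpy as np

def lsb_embed(host, payload: bytes):
    """Hide payload bits in the LSBs of a 16-bit PCM rendering of host."""
    pcm = np.round(np.clip(host, -1.0, 1.0) * 32767).astype(np.int32)
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
    assert len(bits) <= len(pcm), "payload too large for host"
    pcm[:len(bits)] = (pcm[:len(bits)] & ~1) | bits   # overwrite the LSBs
    return pcm.astype(np.float64) / 32767.0

def lsb_extract(stego, num_bytes):
    """Re-quantise and read the LSBs back out."""
    pcm = np.round(np.asarray(stego) * 32767.0).astype(np.int32)
    bits = (pcm[:num_bytes * 8] & 1).astype(np.uint8)
    return np.packbits(bits).tobytes()
```

Each sample changes by at most one quantisation step (~1/32767, on the order of −90 dB), which is why the output sounds identical; any resampling or lossy re-encoding perturbs samples by more than that step, destroying the payload.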