Steganography is the practice of hiding a secret message inside an innocuous-looking cover medium. Audio steganography hides data inside an audio signal so that a casual listener hears only the original sound, while a decoder can extract the hidden payload.
miniDSP provides three complementary methods (LSB, frequency-band BFSK, and spectrogram text) in src/minidsp_steg.c, demonstrated in tools/audio_steg/audio_steg.c.
Build and run the self-test from the repository root:
make -C tools/audio_steg
cd tools/audio_steg && ./audio_steg
Message framing
Both bit-level encoders, LSB and BFSK, prepend a 32-bit little-endian header before the payload (spectext reuses the LSB channel, so it carries the same header). Bits 0–30 hold the message byte count; bit 31 is a payload type flag (0 = text, 1 = binary). This allows the decoder to recover the message without knowing its length in advance, and enables MD_steg_detect() to identify the payload type:
[ bit 31: type flag | bits 0-30: msg_len (LE) ] [ 8 * msg_len bits: payload ]
Each bit of the header and payload is encoded independently using the chosen method. Bits within each byte are transmitted LSB-first.
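The framing rules above can be sketched in C as follows. The helper name steg_frame_bit and its signature are hypothetical (they are not part of the miniDSP API); the sketch only illustrates the header layout and the LSB-first bit order described in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of miniDSP: returns bit k of the framed
 * stream [32-bit header][payload bytes]. */
static unsigned steg_frame_bit(const unsigned char *msg, uint32_t msg_len,
                               int is_binary, size_t k)
{
    assert(msg_len <= 0x7FFFFFFFu);           /* length must fit in bits 0-30 */
    assert(k < 32u + 8u * (size_t)msg_len);   /* stay inside the frame */

    if (k < 32) {
        /* Bits 0-30: byte count; bit 31: payload type flag. */
        uint32_t header = msg_len | (is_binary ? 0x80000000u : 0u);
        /* Little-endian bytes sent LSB-first: header bit k is the k-th bit out. */
        return (header >> k) & 1u;
    }

    k -= 32;
    return ((unsigned)msg[k / 8] >> (k % 8)) & 1u;   /* payload, LSB-first */
}
```

Because the header bytes are little-endian and each byte is sent LSB-first, the 32 header bits leave the encoder in plain ascending order, with the type flag last.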
Method 1: Least Significant Bit (LSB)
The idea
Audio samples are typically stored as 16-bit integers (-32768 to +32767). The least significant bit of each sample contributes only ±1 to a range of 65536 — a change of about -90 dB relative to full scale. By replacing the LSB of each sample with a message bit, we embed data that is completely inaudible.
Signal model
The host signal \(x[n] \in [-1, 1]\) is quantised to 16-bit PCM:
\[p[n] = \mathrm{round}(x[n] \times 32767)
\]
The LSB is then overwritten with message bit \(b_k\):
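A minimal C sketch of the quantise-and-overwrite step, assuming 16-bit two's-complement PCM. The function names are hypothetical and not part of the miniDSP API:

```c
#include <assert.h>
#include <stdint.h>

/* Quantise a host sample in [-1, 1] to 16-bit PCM: p[n] = round(x[n] * 32767).
 * Rounding is half away from zero. */
static int16_t pcm16_quantise(double x)
{
    assert(x >= -1.0 && x <= 1.0);
    return (int16_t)(x * 32767.0 + (x >= 0.0 ? 0.5 : -0.5));
}

/* Overwrite the least significant bit of p with message bit b (0 or 1). */
static int16_t lsb_embed(int16_t p, unsigned b)
{
    /* Work on the unsigned bit pattern so negative samples behave the same. */
    uint16_t u = (uint16_t)p;
    u = (uint16_t)((u & 0xFFFEu) | (b & 1u));
    return (int16_t)u;
}

/* Recover the message bit from a stego sample. */
static unsigned lsb_extract(int16_t p)
{
    return (uint16_t)p & 1u;
}
```

Each stego sample differs from the quantised host sample by at most one PCM step, which is where the roughly -90 dB figure comes from.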
The host and the stego signal are perceptually identical. The difference signal (host minus stego) is pure quantisation noise at about -90 dB.
Trade-offs
| Advantage | Disadvantage |
|---|---|
| Very high capacity | Destroyed by any lossy compression (MP3, AAC, Opus) |
| Zero audible distortion | Destroyed by resampling or sample-rate conversion |
| Simple, fast implementation | Destroyed by amplitude scaling or normalisation |
| Works at any sample rate | Requires lossless transport (WAV, FLAC) |
Method 2: Frequency-Band Modulation (BFSK)
The idea
Human hearing sensitivity falls off sharply above ~16 kHz, and most adults cannot hear tones above 18 kHz. By adding low-amplitude tones in the 18–20 kHz "near-ultrasonic" band, we can encode data that is effectively inaudible.
The encoding uses Binary Frequency-Shift Keying (BFSK): each bit is represented by a short burst ("chip") of a sinusoidal tone at one of two carrier frequencies.
Carrier frequencies
| Bit value | Carrier frequency |
|---|---|
| 0 | 18500 Hz |
| 1 | 19500 Hz |
Both carriers are above the typical hearing threshold, and the 1 kHz separation provides reliable discrimination during decoding.
Chip duration
Each bit occupies a 3 ms chip — a burst of \(C\) samples:
After frequency-band encoding (same host, message hidden via BFSK):
The added carriers at 18.5/19.5 kHz are above most listeners' hearing range.
Spectrogram showing the hidden BFSK signal above the 440 Hz host tone:
The faint horizontal bands near the top of the spectrogram are the BFSK carriers. The main 440 Hz tone dominates the audible range.
Trade-offs
| Advantage | Disadvantage |
|---|---|
| Survives mild additive noise | Lower capacity than LSB |
| Frequency-domain robustness | Requires sample_rate >= 40 kHz |
| Inaudible to most listeners | May be audible to young listeners with excellent high-frequency hearing |
| Amenable to spectral analysis | Vulnerable to low-pass filtering above 18 kHz |
Method 3: Spectrogram Text (spectext)
The idea
What if a hidden message were visible to the human eye as well as recoverable by machine? The spectext method combines LSB data encoding (for reliable machine decode) with spectrogram text art in the 18–23.5 kHz ultrasonic band (for visual verification). Open the stego file in any spectrogram viewer and the message is spelled out in the high frequencies — while a listener hears nothing unusual.
The spectrogram art also acts as a tamper indicator: if the text is intact, the LSB data likely is too.
Encode pipeline
host.wav (any SR)
              │
              ▼
┌───────────────────────────┐
│ MD_resample() to 48 kHz   │
└─────────────┬─────────────┘
              │ host @ 48 kHz
              ▼
┌───────────────────────────┐
│ MD_lowpass_brickwall()    │
│ cutoff = original_SR / 2  │
│ (eliminates resampler     │
│  spectral images)         │
└─────────────┬─────────────┘
              │
              │          "SECRET"
              │             │
              │             ▼
              │  ┌───────────────────────┐
              │  │ MD_spectrogram_text() │
              │  │ freq: 18–23.5 kHz     │
              │  │ 30 ms / column        │
              │  │ amplitude: 0.02       │
              │  └──────────┬────────────┘
              │  mix (add)  │
              ▼             ▼
┌───────────────────────────┐
│ host + spectrogram art    │
└─────────────┬─────────────┘
              │ LSB encode (last step)
              ▼
      stego.wav (48 kHz)
The spectrogram art is mixed into the host before LSB encoding, so the LSB bits remain undisturbed. Decode simply reads the LSB channel.
Automatic upsampling and spectral cleanup
The spectrogram art uses the 18–23.5 kHz band, which requires a Nyquist frequency of at least 23.5 kHz (sample rate >= 47 kHz). If the input host is below 48 kHz, the encoder automatically upsamples it using MD_resample(). After upsampling, MD_lowpass_brickwall() is applied at the original Nyquist frequency to eliminate any residual spectral images from the resampler's transition band. This ensures the 18–23.5 kHz band is completely clean before the spectrogram text is mixed in, so the hidden message is clearly readable in any spectrogram viewer. The output is always 48 kHz.
Fixed column width and capacity
Each character in the bitmap font is 8 columns wide (5 data + 3 spacing). Each column occupies a fixed 30 ms of audio, giving 240 ms per character:
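The arithmetic can be written out as a small C helper. The function name is hypothetical, and treating the visual capacity as the host duration divided by the per-character duration, rounded down, is an assumption of this sketch rather than a statement about miniDSP's exact formula.

```c
#define SPECTEXT_COLS_PER_CHAR 8    /* 5 data + 3 spacing columns */
#define SPECTEXT_COL_MS        30   /* fixed duration of one column */

/* Hypothetical helper: whole characters that fit in duration_ms of audio. */
static unsigned spectext_visual_capacity(unsigned duration_ms)
{
    /* 8 columns x 30 ms = 240 ms per character. */
    const unsigned char_ms = SPECTEXT_COLS_PER_CHAR * SPECTEXT_COL_MS;
    return duration_ms / char_ms;
}
```

Under these assumptions a 3-second host has room for 12 visible characters.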
The 7 rows of the 5x7 bitmap font are mapped linearly across the 18–23.5 kHz band. Row 0 (top of character) maps to 23.5 kHz; row 6 (bottom) maps to 18 kHz.
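A short C sketch of the linear row-to-frequency mapping described above. The helper name is hypothetical, and placing each row exactly on an evenly spaced frequency between the two endpoints is an assumption about how "mapped linearly" is realised.

```c
#include <assert.h>

#define SPECTEXT_F_LO_HZ 18000.0   /* bottom row (row 6) */
#define SPECTEXT_F_HI_HZ 23500.0   /* top row (row 0) */
#define SPECTEXT_ROWS    7         /* rows in the 5x7 font */

/* Hypothetical helper: frequency assigned to font row `row` (0 = top). */
static double spectext_row_freq(int row)
{
    assert(row >= 0 && row < SPECTEXT_ROWS);
    /* Six equal steps of about 916.7 Hz span the 5.5 kHz band. */
    const double step = (SPECTEXT_F_HI_HZ - SPECTEXT_F_LO_HZ) /
                        (double)(SPECTEXT_ROWS - 1);
    return SPECTEXT_F_HI_HZ - step * (double)row;
}
```

With this mapping the middle row (row 3) sits at 20.75 kHz.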
The spectrogram text is generated at full amplitude by MD_spectrogram_text() (normalised to 0.9 peak), then scaled to 0.02 (~-34 dB) before mixing. This is loud enough to be clearly visible in a spectrogram but completely inaudible — most adults cannot hear above 18 kHz.
Visual truncation
If the message is longer than the visual capacity, the spectrogram art shows only the first N characters that fit. The full message is always recoverable via the LSB data channel, which has much higher capacity: at one bit per sample, about 6 KB/s at 48 kHz. For binary payloads, the spectrogram art shows [BIN <N>B] as a label.
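The LSB capacity follows from one payload bit per sample, less the 32-bit header. A minimal C sketch, with a hypothetical function name and the assumption that only whole bytes count:

```c
#include <stddef.h>

#define STEG_HEADER_BITS 32   /* framing header: one bit per sample */

/* Hypothetical helper: payload bytes that fit in num_samples via LSB. */
static size_t lsb_capacity_bytes(size_t num_samples)
{
    if (num_samples <= STEG_HEADER_BITS)
        return 0;                                   /* no room for a payload */
    return (num_samples - STEG_HEADER_BITS) / 8;    /* 8 samples per byte */
}
```

Under these assumptions, one second of 48 kHz audio (48000 samples) holds 5996 payload bytes, far more than the handful of characters the spectrogram art can show in the same time.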
Listening comparison
After spectext encoding (same host, message "miniDSP" hidden via spectext):
The tones at 18–23.5 kHz lie above the hearing range of most adults.
Spectrogram showing "miniDSP" rendered as text art in the ultrasonic band:
The host audio — a TIMIT sentence ("Don't ask me to carry an oily rag like that.") — is visible at the bottom of the spectrogram. The text "miniDSP" is rendered in the 18–23.5 kHz band near the top, using the 5x7 bitmap font from MD_spectrogram_text().
Trade-offs
| Advantage | Disadvantage |
|---|---|
| Human-readable visual watermark | Lower visual capacity than LSB data capacity |
| Machine-readable round-trip via LSB | Requires 48 kHz output (auto-upsampled) |
| Visual tamper indicator | Destroyed by lossy compression (like LSB) |
| Completely inaudible | Destroyed by low-pass filtering above 18 kHz |
Embedding binary data
The string-based API (MD_steg_encode / MD_steg_decode) uses null-terminated C strings, which cannot represent binary data containing 0x00 bytes. To hide arbitrary binary payloads — images, compressed archives, cryptographic keys — use the byte-oriented API:
QR code — a 165x165 1-bit grayscale PNG (486 bytes) containing a URL to this repository. This is a "double encoding": data encoded as a QR code, then the QR code hidden inside audio:
How it works: MD_steg_detect() probes the first 32 samples (LSB) or the first 32 BFSK chips (frequency-band) to extract the header. A header is considered valid when the decoded length is positive and fits the signal capacity. For BFSK, the average correlation must also exceed a minimum threshold to avoid false positives. If both methods claim a valid header, BFSK wins, because a valid BFSK header is harder to trigger by accident.
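The validity test and the tie-break described above can be sketched in C as follows. Everything here is illustrative: the enum values, function names, and the explicit threshold parameter are hypothetical, and miniDSP's actual constants and internals may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical method codes for illustration only. */
enum steg_method { STEG_NONE = 0, STEG_LSB = 1, STEG_BFSK = 2 };

/* A header is plausible when the length (bits 0-30) is positive and fits. */
static int steg_header_valid(uint32_t header, size_t capacity_bytes)
{
    const uint32_t len = header & 0x7FFFFFFFu;
    return len > 0 && len <= capacity_bytes;
}

/* Payload type flag from bit 31: 0 = text, 1 = binary. */
static int steg_header_type(uint32_t header)
{
    return (int)(header >> 31);
}

/* Combine the two probes; BFSK wins when both headers look valid. */
static enum steg_method steg_pick_method(uint32_t lsb_header,
                                         size_t lsb_capacity,
                                         uint32_t bfsk_header,
                                         size_t bfsk_capacity,
                                         double bfsk_avg_corr,
                                         double min_corr)
{
    const int lsb_ok = steg_header_valid(lsb_header, lsb_capacity);
    /* BFSK additionally needs enough average carrier correlation. */
    const int bfsk_ok = steg_header_valid(bfsk_header, bfsk_capacity) &&
                        bfsk_avg_corr > min_corr;

    if (bfsk_ok)
        return STEG_BFSK;
    if (lsb_ok)
        return STEG_LSB;
    return STEG_NONE;
}
```

Checking BFSK first encodes the tie-break directly: a random LSB pattern can look like a plausible length by chance, whereas a BFSK header also has to pass the correlation threshold.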
Quick example:
int payload_type;  /* receives the detected payload type (text or binary) */
int method = MD_steg_detect(signal, signal_len, 44100.0, &payload_type);  /* identifies the method in use */
For maximum capacity and fidelity in lossless pipelines, use LSB. For slightly more robust hiding in near-ultrasonic bands, use frequency-band. For a human-readable visual watermark with machine-readable data recovery, use spectext.