Generative Music in the Browser
I wanted to build a Brain.fm competitor in the browser. Web Audio API. FM synthesis. Binaural entrainment. The whole thing, from oscillators to deployment.
21 hours later, I have a working engine that produces sound. Whether that sound qualifies as music is debatable. What I learned about procedural audio generation is probably more valuable than the code.
The Premise
Focus music apps like Brain.fm claim to improve concentration through auditory entrainment — modulating audio at specific frequencies that correspond to brain states. Beta range (12-30 Hz) for focus. Alpha (8-12 Hz) for relaxation. Theta (4-8 Hz) for meditation. Delta (0.5-4 Hz) for sleep.
The modulation is typically inaudible — applied as subtle amplitude changes, spectral shifts, or stereo panning at the target frequency. You don't consciously hear 16 Hz pulsing. Your brain (allegedly) responds to it anyway.
The question I wanted to answer: can you generate the music itself procedurally in the browser, then apply the entrainment layer on top?
Short answer: the entrainment layer works. The music generation doesn't. Not yet.
The Architecture That Worked
My engine uses a 3-layer modulation chain that mirrors Brain.fm's patented approach:
Layer 1: Amplitude Modulation. A low-frequency oscillator modulates the master gain at the target entrainment frequency. For focus mode, that's somewhere in the beta range. The depth is subtle — maybe 10-15% of the signal amplitude. Enough for the brain to detect, not enough for the ear to notice.
Layer 2: Spectral Modulation. A second LFO shifts the center frequency of a bandpass filter at the same entrainment rate. This creates a subtle brightness oscillation — the tonal color breathes at the target frequency.
Layer 3: Stereo Panning with Drift. A third LFO pans the signal left-right at a slightly offset frequency, creating binaural-adjacent stimulation. The drift prevents the panning from feeling mechanical.
All three layers are inaudible to casual listening. They don't change the musical content. They modulate it. This distinction matters enormously, and it took me most of the session to understand why.
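As a pure-math sketch of the three layers (in the real engine these are OscillatorNodes driving AudioParams; the 16 Hz target, 12% depth, filter center, and drift amount here are illustrative assumptions):

```typescript
// Instantaneous values of the 3-layer modulation chain at time t (seconds).
// In Web Audio, three LFOs would connect to GainNode.gain,
// BiquadFilterNode.frequency, and StereoPannerNode.pan respectively.
function modulationAt(t: number) {
  const entrainHz = 16;  // beta-range entrainment target (assumed)
  const depth = 0.12;    // ~12% amplitude depth: detectable, not audible
  const gain = 1 - depth / 2 + (depth / 2) * Math.sin(2 * Math.PI * entrainHz * t);
  const filterHz = 1200 + 200 * Math.sin(2 * Math.PI * entrainHz * t); // spectral "breathing"
  const pan = 0.3 * Math.sin(2 * Math.PI * (entrainHz + 0.1) * t);     // drifted panning
  return { gain, filterHz, pan };
}
```

The 0.1 Hz offset on the panning layer is what keeps the left-right motion from locking into a mechanical loop with the other two.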
The 5-Mood System
Each mood maps to an entrainment frequency range:
- Focus — Beta range, optimized for sustained attention
- Deep Work — Upper alpha/low beta, for flow states
- Relax — Alpha range, for unwinding
- Meditate — Theta range, for mindfulness
- Sleep — Delta range, for drifting off
Each session randomizes within the mood's Hz range, picks a random key transposition, and seeds the pattern generator differently. No two sessions are identical in theory. In practice, with only 10 pentatonic notes and 3 chord progressions per mood, repetition shows up fast.
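A sketch of that per-session randomization (the interface shape and ranges are my assumptions, not the actual engine code):

```typescript
interface SessionSeed {
  entrainHz: number;  // randomized within the mood's Hz range
  transpose: number;  // semitone offset for the key
  seed: number;       // feeds the pattern generator
}

// rng is injectable so a session can be reproduced (or tested) deterministically.
function newSession(hzRange: [number, number], rng: () => number = Math.random): SessionSeed {
  const [lo, hi] = hzRange;
  return {
    entrainHz: lo + rng() * (hi - lo),
    transpose: Math.floor(rng() * 12) - 6,       // -6..+5 semitones
    seed: Math.floor(rng() * 2 ** 32) >>> 0,     // 32-bit pattern seed
  };
}
```

Randomized parameters give you distinct sessions on paper; they don't give you distinct-sounding sessions when the underlying motif pool is tiny.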
The UI presents this as 5 mood cards with a Soundscapes/Both/Music toggle, a circular visualizer, a progress ring, and transport controls. Session timer handles fade-in and fade-out transitions. A completion chime plays when the session ends. All deployed to Cloudflare Pages via a GitHub mirror that auto-triggers builds from the primary Gitea repo.
Straightforward on paper. The complexity is entirely in what happens between clicking "Focus" and hearing something worth focusing to.
FM Synthesis: The Modulation Index Problem
FM (frequency modulation) synthesis generates complex timbres from simple oscillators. A carrier oscillator produces the base tone. A modulator oscillator changes the carrier's frequency rapidly, creating harmonic sidebands that give the sound its character.
The modulation index — the ratio of the modulator's amplitude to its frequency — controls how bright and complex the sound is. And this is where I burned hours.
Index 2.0-4.0 is standard DX7 keyboard territory. Bright, metallic, harmonically rich. Perfect for a synthesizer lead. Absolutely terrible for background focus music. The harmonic complexity demands attention. That's the opposite of what focus music should do.
Index 0.7-1.5 is warm. Gentle. The sidebands are present but subdued. The tone has character without aggression. This is the usable range for ambient and background applications.
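The relationship is easy to state in closed form. This is textbook two-oscillator FM, not my engine's code; in Web Audio you'd wire a modulator OscillatorNode through a GainNode into the carrier's frequency AudioParam:

```typescript
// One sample of 2-oscillator FM at time t (seconds).
// index = peak frequency deviation / modulator frequency.
// index ~0.7-1.5 → warm and subdued; index 2-4 → bright DX7 territory.
function fmSample(t: number, carrierHz: number, modHz: number, index: number): number {
  return Math.sin(2 * Math.PI * carrierHz * t + index * Math.sin(2 * Math.PI * modHz * t));
}
```

The index multiplies the inner sine directly, which is why small changes to it swing the timbre so hard: every sideband's energy depends on it.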
My engine uses a Rhodes-like patch with two carrier-modulator pairs:
- Body: 1:1 ratio (fundamental reinforcement), index ~0.8
- Bell: 14:1 ratio (inharmonic brightness), index ~0.3
The 14:1 bell ratio creates a problem at higher frequencies. A note at 1500 Hz times 14 equals 21,000 Hz — above human hearing but within the Web Audio API's range. Push higher and you get 8000 Hz times 14 = 112,000 Hz, which exceeds the browser's Nyquist limit. The oscillator doesn't fail gracefully — the out-of-range frequency aliases, folding back into the audible band as inharmonic garbage. I had to clamp the bell partial to notes below 1500 Hz.
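The clamp itself is a one-liner (the 48 kHz sample rate is an assumption; a real implementation would read `AudioContext.sampleRate`):

```typescript
const SAMPLE_RATE = 48_000;        // typical AudioContext.sampleRate (assumed)
const NYQUIST = SAMPLE_RATE / 2;   // 24 kHz
const BELL_RATIO = 14;

// Only add the bell carrier-modulator pair when its partial stays below Nyquist.
// 1500 Hz * 14 = 21 kHz is safe; 8000 Hz * 14 = 112 kHz would alias.
function useBellPartial(noteHz: number): boolean {
  return noteHz * BELL_RATIO < NYQUIST;
}
```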
The Singing Bowls: What Inharmonic Synthesis Gets Right
Meditate mode uses singing bowl synthesis with 5 inharmonic partials at ratios: 1.0, 2.71, 5.04, 8.09, 11.79.
These are not integer ratios. That's the point. Musical instruments use harmonic series (1, 2, 3, 4, 5...) where each partial is a whole-number multiple of the fundamental. Singing bowls, bells, and gongs use inharmonic spectra where the partials don't align to a harmonic series.
The result is a tone that shimmers. The partials beat against each other at non-obvious intervals, creating slow phase interference patterns that evolve over time. A single struck singing bowl note sounds different at 1 second, 5 seconds, and 15 seconds — not because the amplitude changes, but because the phase relationships between the inharmonic partials keep shifting.
This is the one thing that sounded genuinely good from my session. Not "good for procedural generation" — actually pleasant to listen to. Something about those specific partial ratios hits right. The synthesis was straightforward: 5 oscillators per note, each at the appropriate ratio, with independent decay envelopes.
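A sketch of the partial layout (the ratios are the ones above; the gain and decay weightings are my assumptions about a reasonable falloff, not the exact envelope values):

```typescript
const BOWL_RATIOS = [1.0, 2.71, 5.04, 8.09, 11.79]; // inharmonic, non-integer

// One oscillator spec per partial, each with an independent decay envelope.
function bowlPartials(fundamentalHz: number) {
  return BOWL_RATIOS.map((ratio, i) => ({
    freqHz: fundamentalHz * ratio,
    gain: 1 / (i + 1),       // upper partials quieter (assumed weighting)
    decaySec: 10 / (i + 1),  // upper partials die off faster (assumed)
  }));
}
```

Because the ratios never line up with a harmonic series, the five oscillators drift in and out of phase with each other, which is where the shimmer comes from.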
Where Procedural Generation Fails
Nature Sounds From Filtered Noise
I tried synthesizing rain, ocean, forest, and cafe soundscapes from filtered noise buffers. Different filter shapes, different modulation rates, different stereo treatments.
They all sounded like colored static. Because that's what they are. A lowpass-filtered noise buffer doesn't sound like rain. It sounds like a lowpass-filtered noise buffer. The acoustic complexity of real rain — the spatial distribution of individual drops, the resonance of surfaces, the micro-rhythms — cannot be approximated by spectral shaping alone.
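For reference, this is the kind of thing I was generating — a minimal one-pole lowpass over white noise (my actual synthesis code had more filter shapes and modulation, but the character is the same):

```typescript
// "Rain": white noise through a one-pole lowpass. It sounds like what it is.
function filteredNoise(samples: number, alpha = 0.05): Float32Array {
  const out = new Float32Array(samples);
  let y = 0;
  for (let i = 0; i < samples; i++) {
    const white = Math.random() * 2 - 1; // uniform white noise in [-1, 1]
    y += alpha * (white - y);            // one-pole lowpass smoothing
    out[i] = y;
  }
  return out;
}
```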
Real audio recordings are mandatory for soundscapes. I have 572 MB of them. They can't go in a git repo. They need CDN hosting on something like Cloudflare R2.
Oscillator Drums
I synthesized hi-hats as 8000 Hz filtered noise and snares as 200 Hz FM patches, through the same engine that generates melodic content. The result is not drums. It's clicking and buzzing.
Real drum machines use samples for a reason. Even the most synthetic-sounding drum machines of the 1980s used carefully crafted analog circuits, not naive FM synthesis. I have 18 CC0 drum and instrument samples — 1.3 MB total — that should replace all oscillator-based percussion.
The Uniqueness Problem
10 pentatonic notes. 2-octave range. 3 chord progressions per mood. Do the math on how many unique 4-bar phrases you can generate from that constraint space. It's not enough.
By minute 8 of a focus session, you've heard every motif. The engine shuffles the order, but shuffling a small deck is not the same as drawing from a large one. You need 20+ chord progressions per mood, wider pitch ranges, variable rhythms, different instrument textures per session, and BPM variation — not just reordering the same 3 progressions.
The Brain.fm Revelation
Brain.fm's patent tells the story. Their approach: human composers write the music. Then their system applies amplitude modulation at the target brainwave frequency to the composed audio.
The music quality comes from humans. The therapeutic effect comes from the modulation layer. They are explicitly not generating music procedurally. The research paper in Nature Communications Biology (2024) validated that the entrainment modulation works — 16 Hz AM in the beta range measurably affects focus. ADHD participants showed greater benefit, with the effect covarying with ASRS scores.
My 3-layer modulation chain mirrors their patented approach. That part of my architecture is correct. What feeds into the chain is the problem. They feed in human-composed music. I'm feeding in procedurally generated noise.
The Lo-Fi Effects Chain
One thing that does help: a lo-fi effects chain applied after synthesis.
- Reverb — convolution reverb or algorithmic, with a medium decay. Smooths out FM synthesis harshness.
- Delay — short slapback, gives depth to thin melodic lines.
- Tape saturation — soft clipping that rounds off peaks and adds even harmonics.
- Vinyl crackle — subtle noise layer that masks synthesis artifacts and adds warmth.
- Lowpass filter — rolling off everything above 3-8 kHz. Research suggests lo-fi focus music works best in the 65-85 BPM range with jazz chord extensions (min7, maj7, 9th) and this frequency ceiling.
The effects chain makes mediocre synthesis sound less mediocre. It doesn't make it good. But the difference between raw FM output and the same signal through reverb + saturation + lowpass is significant.
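The tape-saturation stage, for instance, is just a WaveShaperNode fed a tanh curve — a common soft-clipping approach, though the drive value here is an assumption:

```typescript
// Normalized tanh soft-clip curve for a Web Audio WaveShaperNode.
// Rounds off peaks and adds low-order harmonics ("tape saturation").
function makeSaturationCurve(samples = 1024, drive = 2): Float32Array {
  const curve = new Float32Array(samples);
  for (let i = 0; i < samples; i++) {
    const x = (i / (samples - 1)) * 2 - 1;               // map index to [-1, 1]
    curve[i] = Math.tanh(drive * x) / Math.tanh(drive);  // soft clip, peaks at ±1
  }
  return curve;
}
// Usage in the chain: shaper.curve = makeSaturationCurve(); shaper.oversample = "2x";
```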
What Actually Needs to Happen
The architecture is right. The modulation science is right. The content generation is wrong.
Short term: Use the 18 downloaded samples for all instruments and drums. Expand chord progressions from 3 to 20+ per mood. Add variable BPM. Upload soundscape audio to R2. Add a skip/next button so users can re-roll a bad seed.
Medium term: Pre-generate a library of music stems offline using something like MusicGen or Suno. Batch them, store them, stream them. The modulation chain can work on any audio input — it doesn't care if the source is procedural or pre-composed. Or look at Mubert's API for real-time generation at $49-199/month.
Long term: Eno-style phase loops — 5-7 loops of prime lengths creating evolving interference patterns that never exactly repeat. Markov chain melody generation for more natural phrase construction. Generative.fm proves this can work in the browser with Tone.js and Web Audio API.
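The math behind why prime-length loops don't repeat, with illustrative lengths (not Generative.fm's or Eno's actual values):

```typescript
// Loops with pairwise-coprime lengths (in seconds) only realign when every
// loop completes a whole number of cycles: the product of the lengths.
const loopLengths = [7, 11, 13, 17, 19];
const realignSec = loopLengths.reduce((acc, len) => acc * len, 1); // 323,323 s
const realignDays = realignSec / 86_400;                           // ≈ 3.7 days
```

Five short loops, and the combined texture doesn't exactly repeat for days. That's far more perceived variety per byte of content than shuffling 3 chord progressions.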
Or, honestly, just accept that Brain.fm solved this with $7/month subscriptions and human composers. There might be a reason nobody else has cracked procedural focus music.
The Volume Problem Nobody Warns You About
Four gain stages in series: blend gain, pattern velocity, synthesis gain, effects chain gain. Each one makes sense in isolation. Together they compound multiplicatively.
If blend is at 0.7, velocity at 0.8, synth at 0.6, and effects at 0.9, the signal is at 0.7 * 0.8 * 0.6 * 0.9 = 0.3024 of its original amplitude. Crushed to a whisper. Bump any one stage to compensate and you overshoot into clipping on loud notes while quiet notes stay inaudible.
The fix is thinking about total gain path holistically — normalizing at the end of the chain, not at each stage. But at 3 AM with 6 gain nodes open in Web Audio Inspector, "think holistically" is not the thought that surfaces. "Make this louder" is. And that's how you get 4 hours of chasing volume issues that are actually multiplication issues.
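The arithmetic that bit me, plus the fix — a single makeup gain computed from the whole path and applied at one final node (the 0.8 target level is an assumption):

```typescript
// Gain stages in series multiply, not add.
const stages = [0.7, 0.8, 0.6, 0.9]; // blend, velocity, synth, effects
const pathGain = stages.reduce((acc, g) => acc * g, 1); // 0.3024: a whisper

// Normalize once at the end of the chain instead of nudging each stage.
const targetLevel = 0.8;
const makeupGain = targetLevel / pathGain; // ≈ 2.65, set on one master GainNode
```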
The Caching Gotcha
CF Pages caching is aggressive enough to matter during development. Push a new build, visit the URL, hear the old version. The JS bundle filename changes, but the browser serves the cached HTML that references the old filename.
Hard refresh (Ctrl+Shift+R) fixes it. But it means testing deployments requires discipline — you have to verify the JS filename in the browser console actually changed. Otherwise you're debugging code that isn't the code you just pushed. I spent at least 30 minutes on this before realizing I was hearing a build from 2 deploys ago.
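One cheap guard is to extract and compare the bundle filename the served HTML actually references. The asset-path pattern below is a guess at a typical Vite-style hashed build, not my exact setup:

```typescript
// Pull the hashed JS bundle name out of a served index.html string.
function bundleName(html: string): string | null {
  const match = html.match(/src="[^"]*\/(index-[A-Za-z0-9]+\.js)"/);
  return match ? match[1] : null;
}
// Compare bundleName(fetchedHtml) against the filename in your local dist/
// before concluding that your new code is broken.
```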
The Technical Leftovers
Some debt from the marathon:
- variations.ts — 2,153 lines of procedural soundscape variations, 92 definitions, completely unused since switching to audio files
- synthesizer.ts — 35 KB of procedural noise synthesis, also unused
- audio-raw/ directory — 642 MB on my local machine, needs cleanup or intentional storage
- CF Pages caching is aggressive — users see old JS bundles after deploys, need hard refresh with Ctrl+Shift+R
The code lives on Gitea at git.argobox.com/KeyArgo/argobeat with a GitHub mirror that triggers CF Pages deployment. 15+ commits from the session, latest being FM gentleness fixes and frequency clamping.
Prior Art: What Other People Built
Worth mentioning what else exists in this space.
Generative.fm proves that browser-based generative music is viable. Built on Tone.js and Web Audio API. Ambient, evolving pieces that run indefinitely. The approach works — but the music is ambient texture, not structured focus compositions.
Mubert offers a real-time API for generating music in specific moods and BPM ranges. $49-199/month depending on tier. Designed for apps exactly like this. The cost is non-trivial for a side project, but the output quality is orders of magnitude better than my FM oscillators.
MusicGen and Suno can pre-generate stems offline. Batch processing. Store the output. Stream it through the modulation chain at runtime. The entrainment layer doesn't care where the audio comes from — it modulates whatever signal it receives. Pre-generated human-quality stems plus real-time entrainment modulation might be the actual viable architecture.
The research is encouraging too. The lo-fi aesthetic has a documented sweet spot: 65-85 BPM, jazz chord extensions (min7, maj7, 9th), lowpass filter at 3-8 kHz. Those parameters narrow the generation problem significantly — you're not trying to generate all music, just one specific genre with one specific purpose.
The Takeaway
Procedural music generation in the browser is possible. The Web Audio API is powerful enough. FM synthesis can produce usable tones if you keep the modulation index low. Inharmonic synthesis does beautiful things for bells and bowls. The entrainment modulation layer works and has peer-reviewed support.
But generating music that a human wants to listen to for 45 minutes? That's a content problem, not an engineering problem. And maybe the answer is to stop trying to replace human composers and start building a better pipeline for the modulation layer they'd compose for.
The singing bowls, though. Those really do sound good. I'll keep those.
And maybe next time I'll check Brain.fm's patent before spending 12 hours solving a problem they solved by not solving it.