[Following up on my Kokoro TTS demo. Same idea, different engine.]
Kokoro on desktop is fantastic, but on phones not so much. I needed something small and fast enough to run anywhere, and Piper fits that brief. Quality isn’t Kokoro-tier, but when latency matters more than fidelity, it’s the better tool. Additionally, Piper has much broader multi-language support, with ~80 voices across ~30 languages.
A project I’m working on has Eastern European characters who need to sound the part. Surprisingly hard with current browser TTS. Voices are either fully native or fully neutral, nothing in between. You can’t tell an English voice “read this with a Romanian accent”…
So my workaround is: run English through a phonemizer to get IPA, map the IPA into Romanian orthography, then feed that into Piper’s Romanian voice. It produces English with a thick Romanian accent, exactly what a voice actor would do, and exactly what I hoped and needed.
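For illustration, here is a minimal sketch of the middle step, mapping IPA into Romanian orthography. It is not the demo's actual code: only the ʃ → ș and tʃ → ci pairs come from this post, and every other pair, along with the names `IPA_TO_RO` and `ipaToRomanian`, is an assumption made up for the example.

```javascript
// Hypothetical IPA -> Romanian orthography table. Only ʃ -> ș and tʃ -> ci
// come from the post; every other pair is an illustrative assumption.
const IPA_TO_RO = [
  ["tʃ", "ci"],
  ["dʒ", "gi"],
  ["ʃ", "ș"],
  ["ʒ", "j"],
  ["θ", "t"],
  ["ð", "d"],
  ["ŋ", "n"],
  ["ə", "ă"],
  ["ɪ", "i"],
  ["ʊ", "u"],
  ["j", "i"],
  ["w", "u"],
  ["ˈ", ""], // stress and length marks are dropped
  ["ˌ", ""],
  ["ː", ""],
];

// Escape regex metacharacters so any symbol can go into the alternation.
const escapeRegex = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Longest symbols first, so a digraph like "tʃ" would be tried before any
// single-symbol rule that shares its first character.
const IPA_PATTERN = new RegExp(
  IPA_TO_RO.map(([ipa]) => ipa)
    .sort((a, b) => b.length - a.length)
    .map(escapeRegex)
    .join("|"),
  "g"
);
const IPA_LOOKUP = new Map(IPA_TO_RO);

// Single pass: each IPA symbol is replaced once and the output is never
// re-scanned, so "ʒ" -> "j" is not then rewritten by the "j" -> "i" rule.
// Symbols missing from the table pass through unchanged.
function ipaToRomanian(ipa) {
  return ipa.replace(IPA_PATTERN, (symbol) => IPA_LOOKUP.get(symbol));
}
```

Doing the replacement in one regex pass, rather than chaining `.replace()` calls, keeps the rules from feeding into each other, which matters once the table has letters like j on both sides.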
The demo also speaks the original English with an English voice.
In both modes, the active word is highlighted in sync across both textboxes. That’s handy for debugging the IPA mapping, and it makes the whole thing easier to follow.
Stack:
- phonemize (npm) converts English words to IPA in the browser
- A short custom function maps each IPA symbol to its closest Romanian letter, e.g., ʃ becomes ș, tʃ becomes ci. About 40 lines of regex.
- @diffusionstudio/vits-web runs the Piper voice model
- For word-by-word highlighting I needed to know when each word is spoken, but Piper just hands back a finished WAV file with no timing data. So I resort to heuristics… not perfect, but close enough that the highlight feels in sync.
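The post doesn't say which timing heuristic the demo uses, so the following is only one plausible sketch: spread the clip's total duration across the words in proportion to their length. The function names and the "+1 per word" weighting are assumptions for illustration.

```javascript
// Estimate when each word is spoken by splitting the clip's duration
// proportionally to word length. The +1 per word is an assumed allowance
// for the gap between words, not something measured from Piper.
function estimateWordTimings(text, totalSeconds) {
  const words = text.split(/\s+/).filter(Boolean);
  const weights = words.map((w) => w.length + 1);
  const totalWeight = weights.reduce((sum, w) => sum + w, 0);

  let elapsed = 0;
  return words.map((word, i) => {
    const start = (elapsed / totalWeight) * totalSeconds;
    elapsed += weights[i];
    const end = (elapsed / totalWeight) * totalSeconds;
    return { word, start, end };
  });
}

// Index of the word being spoken at `currentTime` (seconds), or -1 if the
// time falls outside the clip.
function activeWordIndex(timings, currentTime) {
  return timings.findIndex(
    (t) => currentTime >= t.start && currentTime < t.end
  );
}
```

In the browser this would typically be driven from the `<audio>` element: compute the timings once `audio.duration` is known, then call `activeWordIndex(timings, audio.currentTime)` from a `timeupdate` handler or a `requestAnimationFrame` loop and move the highlight when the index changes.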
Gotchas if you go down this road:
I deliberately constrained myself to a single HTML file. No build step, no bundler, no npm install, just open it in a browser. That’s how the Kokoro demo works, and I wanted to keep things consistent. Most of the friction below comes from that constraint: when you load packages straight from a CDN, you’re at the mercy of how that CDN happens to bundle them.
- The default esm.sh and @mintplex-labs/piper-tts-web paths both break in different ways (unenv polyfill errors, missing WASM files). Loading @diffusionstudio/vits-web via jsdelivr’s +esm endpoint sidesteps both.
- iOS needs the standard audio-unlock dance on first user gesture, and it’ll kill the tab if you try to synthesize a long passage in one go. The fix is to chunk the text into sentences, synthesize each separately, and reuse a single <audio> element for playback (a fresh Audio() per chunk loses iOS’s gesture grace and .play() gets refused).
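The chunk-and-reuse approach from the iOS gotcha can be sketched roughly as follows. This is an illustration, not the demo's code: `splitIntoSentences` and `speakInChunks` are hypothetical names, the sentence splitter is deliberately naive (abbreviations like "Dr." will split early), and `synthesize` is assumed to be any async function that returns a WAV Blob for a piece of text.

```javascript
// Naive sentence splitter: a run of non-terminator characters followed by
// any trailing ., ! or ?. Avoids regex lookbehind, which older iOS Safari
// versions do not support.
function splitIntoSentences(text) {
  const parts = text.match(/[^.!?]+[.!?]*/g) || [];
  return parts.map((s) => s.trim()).filter(Boolean);
}

// Synthesize and play one sentence at a time through a single, reused
// <audio> element. `synthesize(text)` is assumed to resolve to a WAV Blob.
async function speakInChunks(text, synthesize, audio) {
  for (const sentence of splitIntoSentences(text)) {
    const wav = await synthesize(sentence);
    const url = URL.createObjectURL(wav);
    try {
      audio.src = url;
      // Resolve when this chunk finishes; reject if playback is refused.
      await new Promise((resolve, reject) => {
        audio.onended = resolve;
        audio.onerror = () => reject(new Error("audio playback failed"));
        audio.play().catch(reject);
      });
    } finally {
      URL.revokeObjectURL(url); // free the Blob URL for this chunk
    }
  }
}

// Browser wiring, left as comments so the functions above stay
// environment-neutral. The import URL follows jsdelivr's +esm convention;
// `predict({ text, voiceId })` is the call as I understand it from the
// vits-web README, and the voice id is an assumption:
//
//   import * as tts from
//     "https://cdn.jsdelivr.net/npm/@diffusionstudio/vits-web/+esm";
//   const audio = document.querySelector("audio");
//   button.onclick = () =>
//     speakInChunks(textarea.value,
//       (t) => tts.predict({ text: t, voiceId: "ro_RO-mihai-medium" }),
//       audio);
```

Starting the whole loop from a click handler and only ever swapping `src` on the same element is what keeps later `.play()` calls tied to the original user gesture.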
CodePen: https://codepen.io/the-red-reddington/full/VYmmdGp
GitHub Pages: https://red-reddington.github.io/web-demos/browser-local-tts-piper