I wanted realistic voice dialogue for browser-based games and interactive scenes without relying on large pre-recorded audio libraries, so I adapted part of one of my production systems into a single-file, fully local browser TTS demo.
This uses Kokoro TTS running entirely inside the browser with:
no server
no API calls
no cloud processing
real-time synthesis
word-level subtitle timing
Potential use cases for Three.js projects:
NPC dialogue
dynamic quest narration
reactive combat barks
procedural storytelling
accessibility narration
immersive environmental voiceovers
All audio is generated locally on-device, so there is no backend infrastructure and no need to pre-generate hundreds of audio assets.
It also opens the door to spatialized AI voices in 3D scenes. Pairing this with positional audio (e.g., Howler.js) would let you attach generated speech directly to world-space entities and have voices originate from actual locations in the scene.
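To make the spatialized-voice idea concrete, here is a minimal sketch that uses Three.js's built-in PositionalAudio rather than Howler.js (a swap made only because it keeps everything inside Three.js). The helper name `attachSpeech` is hypothetical, and it assumes the synthesizer hands back mono PCM samples as a Float32Array together with a sample rate.

```javascript
// Sketch: attach synthesized speech to a scene object with THREE.PositionalAudio.
// THREE is passed in as a parameter so the helper does not hard-code an import path.
// Assumption: `samples` is a mono Float32Array and `sampleRate` is its rate in Hz.
function attachSpeech(THREE, listener, object3d, samples, sampleRate) {
  // Wrap the raw PCM samples in a Web Audio buffer owned by the listener's context.
  const buffer = listener.context.createBuffer(1, samples.length, sampleRate);
  buffer.copyToChannel(samples, 0);

  const voice = new THREE.PositionalAudio(listener);
  voice.setBuffer(buffer);
  voice.setRefDistance(2); // distance at which volume starts to fall off (tune per scene)
  object3d.add(voice);     // the sound now follows the object's world position
  voice.play();
  return voice;
}
```

In a scene you would create one `THREE.AudioListener`, add it to the camera, and call the helper with the NPC's mesh each time a line finishes generating.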
I know this isn't strictly a Three.js demo, but browser-local AI audio is likely to become increasingly relevant for immersive web experiences and game development.
Firefox works for me! However, it seems to take a while longer to synthesize the audio. Interestingly, I get some onnxruntime errors in the console.
Changing these lines here: dtype:${isMobile ? '"q4"' : '"fp32"'}, device:${isMobile ? '"wasm"' : '"webgpu"'},
to dtype: "q4", device: "wasm",
gets rid of those errors.
As a wild guess, this may be related to Firefox's WebGPU support being newer and less mature than Chrome's.
If and when you get a chance, can you please open the Codepen, make the above edits on lines 973-974, and let me know if that solves the problem for you in Firefox?
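For anyone who wants to try that change without hard-coding it, a small options picker might look like the sketch below. The `env` object and the function name are hypothetical, and routing Firefox to wasm reflects only the report in this post, not a confirmed fix. The `dtype` and `device` values are ones that `KokoroTTS.from_pretrained` in kokoro-js accepts.

```javascript
// Sketch: choose kokoro-js load options conservatively.
// `env` is a hypothetical descriptor; in a page you might fill it from
// navigator.userAgent and ("gpu" in navigator).
function pickKokoroOptions(env) {
  const useWebGPU = env.hasWebGPU && !env.isMobile && !env.isFirefox;
  return useWebGPU
    ? { dtype: "fp32", device: "webgpu" } // full precision on the GPU path
    : { dtype: "q4", device: "wasm" };    // quantized model on the CPU path
}

// Usage in the page or worker (not executed here):
// const tts = await KokoroTTS.from_pretrained(modelId, pickKokoroOptions(env));
```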
This could be very useful. I added voice narration in one of my demos and it was a chore! I had to collect and arrange samples of specific phrases so that I could access them when needed.
This could eliminate a lot of that work. In my case the voices do not have to be extremely great quality because they are spoken over a radio with static.
In this program, there is sometimes a bit of a delay, especially at the beginning. Is there a way to pre-load phrases?
@phil_crowther That's the nice thing about using a worker for the TTS: you can have it running in the background and preload/warm-up phrases before they're actually needed.
So for something like radio chatter or repeated NPC lines, you could generate common phrases ahead of time during loading screens or gameplay downtime, then play them instantly later from cache instead of generating them on demand.
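As a rough sketch of that preload idea, a phrase cache can be as small as a Map of promises keyed by text. `PhraseCache` and the `synthesize` callback are hypothetical names; `synthesize` stands in for whatever async call reaches the TTS worker.

```javascript
// Sketch: cache synthesized phrases so repeated lines play without regenerating.
class PhraseCache {
  constructor(synthesize) {
    this.synthesize = synthesize; // hypothetical: async (text) => audio
    this.entries = new Map();     // text -> Promise<audio>
  }

  // Start generating a phrase now; safe to call repeatedly for the same text.
  preload(text) {
    if (!this.entries.has(text)) {
      const pending = this.synthesize(text);
      // Evict failed generations so a later call can retry.
      pending.catch(() => {
        if (this.entries.get(text) === pending) this.entries.delete(text);
      });
      this.entries.set(text, pending);
    }
    return this.entries.get(text);
  }

  // Resolve from cache when available, otherwise generate on demand.
  get(text) {
    return this.preload(text);
  }
}
```

During a loading screen you could call `cache.preload(line)` for each common line, then `await cache.get(line)` at the moment it needs to play. Storing promises rather than finished audio means a line requested while it is still generating does not trigger a second generation.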
@jrlazz I think you need to host this with https for it to work (I could be wrong).
@red-reddington I got this to work in my toy project, and it's rad. The model caches well. The only issue I've had is that my implementation sometimes drops the first part of a sentence, usually on a cold start, so there might be a race condition between the model loading and it actually being ready to generate. Or it's just something in my crappy wrapper.
Another cool thing: when I stepped into the code, it looks like it also generates the list of phonemes and their timings internally, which could be awesome for syncing talking avatars and such. I haven't dug in enough to see whether that list is surfaced with the generation, but yeah, this is super fun and cool!!
@jrlazz as @manthrax says you definitely need to host it, otherwise the worker usually won't start because of browser security restrictions around modules/workers.
Yeah, try doing a tiny warm-up/silent sentence right after init and see if that helps the cold-start clipping issue. Since the worker/model stay resident afterward, subsequent generations are usually much smoother.
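One way to wire up that warm-up, sketched under the assumption that the model object exposes an async `generate(text, options)` the way kokoro-js does: gate every real request behind a `ready` promise that resolves only after the model has loaded and one throwaway sentence has been generated. `createSpeaker` and `loadModel` are hypothetical names.

```javascript
// Sketch: make "ready" mean "loaded AND warmed up" to avoid cold-start clipping.
// `loadModel` is a hypothetical async factory, e.g. a wrapper around
// KokoroTTS.from_pretrained(...).
function createSpeaker(loadModel, warmupText = "Hi.") {
  const ready = (async () => {
    const tts = await loadModel();
    await tts.generate(warmupText); // throwaway generation; the result is discarded
    return tts;
  })();

  return {
    ready,
    // Every real request waits for the warm-up, so none can race the model load.
    async speak(text, options) {
      const tts = await ready;
      return tts.generate(text, options);
    },
  };
}
```

Whether this removes the clipping depends on where the race actually is; if the first samples are being lost at playback rather than at generation, the warm-up alone will not help.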
And yep, the phoneme timing data is exactly what makes avatar/lip-sync stuff possible. That's actually how my project EdgeSpeaker works:
It uses a very similar approach to sync speech to a talking avatar, plus local/browser AI generation for dialogue.
The demo code I posted here is basically the same core idea, just using Kokoro-JS directly instead of wiring up ONNX Runtime + Transformers manually.
Also, if you're serious about this space, I strongly recommend checking out @met4citizen on GitHub, especially his TalkingHead, HeadTTS, and HeadAudio repos. To me he's the authority on browser-based realtime avatars/TTS and very much at the bleeding edge of what's currently possible in-browser.
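On the phoneme timing point above, the lookup side of lip-sync or subtitle highlighting is simple once you have a sorted list of timed cues. The sketch below assumes a hypothetical cue shape of `{ label, start, end }` in seconds; it is not the format Kokoro-JS produces internally, which this thread has not confirmed.

```javascript
// Sketch: find which cue (phoneme, viseme, or word) is active at playback time t.
// Assumption: `timings` is sorted by start time and cues do not overlap.
function activeCueAt(timings, t) {
  let lo = 0;
  let hi = timings.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const cue = timings[mid];
    if (t < cue.start) hi = mid - 1;      // t is before this cue
    else if (t >= cue.end) lo = mid + 1;  // t is after this cue
    else return cue;                      // start <= t < end
  }
  return null; // t falls in a gap or outside the clip
}
```

Calling this once per frame with the audio's current playback time gives you the mouth shape or subtitle word to show.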
Those strings are getting doubly wrapped, so instead of passing "webgpu" it's passing '"webgpu"', which isn't recognized, and it falls back to wasm generation on desktop.
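If double wrapping is indeed the cause, a defensive check before the value reaches the loader would surface it immediately. This is only an illustrative sketch: `normalizeDevice` and the allow-list are hypothetical and not part of the demo or of kokoro-js.

```javascript
// Sketch: strip stray quote characters and reject anything unexpected.
const DEVICES = ["webgpu", "wasm"];

function normalizeDevice(value) {
  // Remove quote characters left over from building the string twice.
  const cleaned = String(value).trim().replace(/^["']+|["']+$/g, "");
  if (!DEVICES.includes(cleaned)) {
    throw new Error(`Unknown device: ${value}`);
  }
  return cleaned;
}
```

Throwing on an unknown value makes a bad setting visible at startup instead of letting it fall back silently.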
I'm looking into it as well. My code is closely aligned with the Kokoro-JS online demo. That demo also has some errors... so the issue might be further upstream, in transformers.js. I think the "webgpu" request may not always be honored. Maybe transformers.js decides to fall back to wasm. I'm not sure yet. Thanks so much for pitching in to help!
@manthrax I don't think I'm going down this rabbit hole further. At this point, the errors I'm seeing don't appear to affect speech synthesis, and they are consistent with the official demo here: https://huggingface.co/spaces/webml-community/kokoro-webgpu (same logs/errors).
A more hands-on approach would be to bypass the library and own the entire synthesis pipeline. That's the approach used in projects like HeadTTS (by @met4citizen), and similar to what I've done in EdgeSpeaker.com. The downside is that it comes with significantly more code, more direct dependencies, and a lot more maintenance overhead.
Because of that, it may not be a good fit for a starter-style package, which is what I'm trying to provide here: something game developers can pick up and use quickly without diving into the full ONNX pipeline.
I hope this is a reasonable conclusion. If anyone does end up digging deeper and resolving the console errors, I'd appreciate a shout. Thanks.
Same single-file, fully-local idea. Lower-quality voices than Kokoro, but small and fast enough to actually work on phones, and with ~80 voices across ~30 languages. Different niche: Kokoro for hero characters, Piper for "the guard says hello" five hundred times a level.