by modeless 10 hours ago

IMO STT -> LLM -> TTS is a dead end. The future is end-to-end. I played with this two years ago and even made a demo you can install locally on a gaming GPU: https://github.com/jdarpinian/chirpy, but concluded that making something worth using for real tasks would require training end-to-end models. A really interesting problem I would love to tackle, but out of my budget for a side project.

mountainriver 9 hours ago | [-0 more]

Yeah except moshi doesn’t sound good at all

ilaksh 2 hours ago | [-0 more]

It just about works for our current use case but can't comprehend the concept of an outgoing call, so I am trying to fine-tune it. The tricky thing is that personaplex forked some of the kyutai code and hasn't integrated the LoRA stuff they added, so we tried to update personaplex with the fine-tuning stuff. Going to find out tonight or tomorrow whether it's actually feasible, when I finish debugging/testing.

cootsnuck 3 hours ago | [-0 more]

I've been working solely on voice agents for the past couple years (and have worked at one of the frontier voice AI companies).

The cascading model (STT -> LLM -> TTS) is unlikely to go away anytime soon, for a whole lot of reasons. A big one is observability. The people paying for voice agents are enterprises, and enterprises care about reliability and liability. The cascading approach is much more amenable to specialization (rather than raw flexibility/generality) and auditability.

Organizations in regulated industries (e.g. healthcare, finance, education) need to be able to see what a voice agent "heard" before it tries to "act" on transcribed text, and the same goes for seeing what LLM output text is going to be "said" before it's actually synthesized and played back.
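To make the observability point concrete, here's a minimal sketch of a cascaded pipeline whose intermediate artifacts (transcript in, reply text out) are logged before the next stage acts on them. The stage functions here are stubs standing in for real STT/LLM/TTS services, not any particular vendor's API:

```python
import json
import time
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class AuditLog:
    """Records every intermediate artifact so a regulated deployment can review it."""
    events: List[dict] = field(default_factory=list)

    def record(self, stage: str, payload: str) -> None:
        self.events.append({"t": time.time(), "stage": stage, "payload": payload})

def run_turn(audio: bytes,
             stt: Callable[[bytes], str],
             llm: Callable[[str], str],
             tts: Callable[[str], bytes],
             log: AuditLog) -> bytes:
    # Each hop is inspectable *before* the next stage acts on it.
    transcript = stt(audio)
    log.record("stt", transcript)    # what the agent "heard"
    reply_text = llm(transcript)
    log.record("llm", reply_text)    # what it intends to "say"
    return tts(reply_text)

# Stub stages in place of real services (illustrative only).
log = AuditLog()
out = run_turn(b"...pcm...",
               stt=lambda a: "what is my balance",
               llm=lambda t: "Your balance is available in the app.",
               tts=lambda t: t.encode(),
               log=log)
print(json.dumps([e["stage"] for e in log.events]))
```

An end-to-end speech model has no equivalent seam: the "transcript" and "planned reply" never exist as inspectable text.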

Speech-to-Speech (end-to-end) models definitely have a place for more "narrative" use cases (think interviewing, conducting surveys / polls, etc.).

But in my experience working with clients, they are clamoring for systems and orchestration that actually use some good ol' fashioned engineering and don't rely solely on the latest-and-greatest SoTA ML models.

rockwotj 6 hours ago | [-0 more]

Fundamentally, the "guessing when it's your turn" thing needs to be baked into the model. I think the full-duplex mode that Moshi pioneered is probably where the puck is going to end up: https://arxiv.org/abs/2410.00037
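A toy illustration of what full duplex means operationally: listening and speaking run as concurrent tasks rather than in strict alternation, so the agent keeps "hearing" even mid-utterance. Frames here are just strings; this is a sketch of the concurrency shape, not of Moshi's actual architecture:

```python
import asyncio

async def listen(inbox: asyncio.Queue, heard: list) -> None:
    # In full duplex, listening never pauses while the agent is speaking.
    while True:
        frame = await inbox.get()
        if frame is None:   # sentinel: end of session
            return
        heard.append(frame)

async def speak(spoken: list) -> None:
    for frame in ("hi", "there"):
        spoken.append(frame)
        await asyncio.sleep(0.01)  # simulate audio playback time

async def main() -> tuple:
    inbox: asyncio.Queue = asyncio.Queue()
    heard, spoken = [], []
    listener = asyncio.create_task(listen(inbox, heard))
    speaker = asyncio.create_task(speak(spoken))
    await inbox.put("user-frame-1")  # user talks *while* the agent speaks
    await speaker
    await inbox.put(None)
    await listener
    return heard, spoken

result = asyncio.run(main())
print(result)
```

In a half-duplex (cascaded) setup, the incoming frame would be dropped or buffered until the agent finished its turn.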

com2kid 8 hours ago | [-1 more]

The advantage is being able to plug in new models to each piece of the pipeline.

Is it super sexy? No. But each type of model develops at a different rate (TTS moves really fast, low-latency STT/ASR has moved slower, LLMs move at a pretty good pace).
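That plug-and-play property can be sketched with a simple interface: the rest of the pipeline codes against a protocol, so any one stage can be swapped out as its model family improves. The two STT classes here are hypothetical stand-ins, not real models:

```python
from typing import Protocol

class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class BatchSTT:
    """Stand-in for a slower, higher-accuracy STT model."""
    def transcribe(self, audio: bytes) -> str:
        return "hello from model a"

class StreamingSTT:
    """Stand-in for a lower-latency streaming STT model."""
    def transcribe(self, audio: bytes) -> str:
        return "hello from model b"

def pipeline(stt: SpeechToText, audio: bytes) -> str:
    # Downstream stages are untouched when the STT implementation changes.
    return stt.transcribe(audio).upper()

print(pipeline(BatchSTT(), b""))
print(pipeline(StreamingSTT(), b""))
```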

eru 4 hours ago | [-0 more]

You should probably split it up: an end-to-end model for great latency (especially for baked-in turn taking), but under the hood it can call out to any old text-based model to answer more intricate questions. You just need to teach the speech model to stall for a bit while the LLM is busy.

Just use the same tricks humans are using for that.
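One of those human tricks, emitting filler while the slow part works, might look roughly like this. `slow_llm`, the delays, and the filler phrases are all illustrative assumptions, not any real product's behavior:

```python
import asyncio

async def slow_llm(question: str) -> str:
    # Stand-in for a text LLM backend that takes a while (illustrative).
    await asyncio.sleep(0.3)
    return f"The answer to '{question}' is 42."

async def speak_with_stalling(question: str) -> list:
    """Emit filler utterances until the backend LLM's answer is ready."""
    utterances = []
    task = asyncio.create_task(slow_llm(question))
    fillers = ["Hmm,", "let me think...", "one moment..."]
    i = 0
    while not task.done():
        utterances.append(fillers[i % len(fillers)])  # human-style stalling
        i += 1
        await asyncio.sleep(0.1)  # roughly one filler per speech "beat"
    utterances.append(task.result())
    return utterances

out = asyncio.run(speak_with_stalling("meaning of life"))
print(out[-1])
```

In a real system the fillers would be generated by the speech model itself (and prosodically natural), but the orchestration shape is the same: keep talking until the answer arrives.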

coppsilgold 3 hours ago | [-0 more]

Some of the best current voice tokenizers run at ~12 Hz; that's many more tokens than a regular LLM would use for ultimately the same content.
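Back-of-envelope arithmetic for the overhead, under assumed rates (12 audio tokens/sec from the tokenizer; ~2.5 spoken words/sec; ~1.3 BPE tokens per word — all rough figures, not measurements):

```python
# Compare token counts for one minute of speech, audio tokens vs. text tokens.
duration_s = 60
audio_tokens = 12 * duration_s        # ~12 Hz audio tokenizer
words = 2.5 * duration_s              # typical conversational speaking rate
text_tokens = round(words * 1.3)      # rough BPE tokens-per-word ratio
print(audio_tokens, text_tokens, round(audio_tokens / text_tokens, 1))
```

Under these assumptions the audio representation costs roughly 3–4x the tokens of the equivalent text, which compounds into compute and context-length pressure.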

russdill 3 hours ago | [-0 more]

At least running things locally, such a model completely blows up your latency

donpark 7 hours ago | [-1 more]

But I've read somewhere that the KV cache for speech-to-speech models explodes in size with each turn, which could make on-device full-duplex S2S unusable except for quick chats.
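A rough sketch of why that growth hurts on-device, using illustrative transformer dimensions (the layer count, KV-head count, head size, and fp16 storage are assumptions, not any specific model's config):

```python
def kv_cache_bytes(seq_len: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # 2x for keys and values; all dimensions are illustrative assumptions.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

# At ~12 audio tokens/sec, each minute of conversation adds 720 tokens,
# and the cache only ever grows across turns.
tokens_per_minute = 12 * 60
for minutes in (1, 5, 15):
    mb = kv_cache_bytes(minutes * tokens_per_minute) / 1e6
    print(f"{minutes:2d} min -> {mb:.0f} MB")
```

Under these assumptions a 15-minute session's cache alone exceeds a gigabyte, which is a lot of a phone's memory budget before you count the weights.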

tmzt 6 hours ago | [-0 more]

Gemini Nano is supposedly doing it on device. It looks like something similar should work with Apple GPU and ANE.