by brody_hamer 9 hours ago

> Voice is a turn-taking problem

It really feels to me like there’s some low-hanging fruit with voice that no one is capitalizing on: filler words and pacing. When the LLM notices a silence, it fills it with a contextually aware filler word while the real response generates. Just an “mhmm” or a “right, right”. It’d go a long way toward making the back and forth feel more like a conversation, and if the speaker wasn’t done speaking, there’s none of that talking-over-the-user garbage. (Say the filler word, then continue listening.)
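A minimal sketch of that loop (all names and thresholds here are hypothetical, just to show the shape of the idea): the audio loop ticks periodically, and if the real response isn't ready and the silence has grown past a threshold, play a filler and keep listening instead of interrupting.

```python
import random

# Hypothetical sketch: fill awkward silences with a short acknowledgement
# while the real LLM response is still generating, instead of talking
# over the user.

FILLERS = ["mhmm", "right, right", "okay", "hmm"]

class TurnTaker:
    def __init__(self, silence_threshold=0.6, filler_cooldown=2.0):
        self.silence_threshold = silence_threshold  # seconds of silence before a filler
        self.filler_cooldown = filler_cooldown      # don't spam fillers back to back
        self.last_filler_at = float("-inf")

    def on_tick(self, now, silence_for, response_ready):
        """Called periodically by the audio loop.

        Returns what to play right now: the full response if it's ready,
        a filler word if the silence has grown awkward, else None
        (keep listening -- the user may still be mid-thought).
        """
        if response_ready:
            return "RESPONSE"
        if (silence_for >= self.silence_threshold
                and now - self.last_filler_at >= self.filler_cooldown):
            self.last_filler_at = now
            return random.choice(FILLERS)
        return None
```

The cooldown matters: backchannel cues only feel natural if they are occasional, so one filler per pause, then back to listening.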

nicktikhonov 9 hours ago | [-2 more]

100% - I thought about that shortly after writing this up. One way to make this work is to have a tiny, lower latency model generate that first reply out of a set of options, then aggressively cache TTS responses to get the latency super low. Responses like "Hmm, let me think about that..." would be served within milliseconds.
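Since the fast model only picks from a fixed set of canned openers, the TTS audio for every option can be synthesized once at startup and served from a cache. A hedged sketch (the `synthesize` function is a stand-in for a real TTS call):

```python
# Sketch of the cached-TTS idea: a fixed menu of canned openers is
# synthesized ahead of time, so serving one is just a dict lookup.

CANNED_OPENERS = [
    "Hmm, let me think about that...",
    "Good question, one sec...",
    "Right, so...",
]

def synthesize(text):
    # Stand-in for a real TTS call; returns fake audio bytes.
    return f"<audio:{text}>".encode()

# Warm the cache once at startup -- the expensive TTS work happens here.
TTS_CACHE = {text: synthesize(text) for text in CANNED_OPENERS}

def first_reply(opener_choice):
    """Serve the pre-synthesized opener audio for the chosen text."""
    return TTS_CACHE[opener_choice]
```

The fast model's only job at runtime is choosing which key to look up, which is what makes millisecond-level serving plausible.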

dotancohen 7 hours ago | [-0 more]

Years ago I wrote a system that would generate Lucene queries on the fly and return results. The ~250 ms response time was deemed too long, so I added some information about where the response data originated, and started returning "According to..." within 50 ms of the end of user input. So the actual information got to the user after a longer delay, but it felt almost as fast as conversation.
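The trick above can be sketched as: start the slow query in the background, emit the fast prefix immediately, then splice in the real result when it arrives. (A rough illustration, not the original system; `slow_search` and the 250 ms sleep are stand-ins.)

```python
import threading
import time

def slow_search(query, out):
    """Stand-in for the ~250 ms Lucene query."""
    time.sleep(0.25)
    out["result"] = f"results for {query!r}"

def answer(query, source="the product index"):
    """Emit a fast 'According to ...' prefix, then the real answer."""
    out = {}
    worker = threading.Thread(target=slow_search, args=(query, out))
    worker.start()
    prefix = f"According to {source}, "  # this part can go out within ~50 ms
    worker.join()                        # the real data arrives later
    return prefix + out["result"]
```

In a real voice pipeline the prefix would be spoken while the query runs, buying back most of the perceived latency.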

eru 4 hours ago | [-0 more]

See also any public speaker who starts every answer to a question from the audience (or in a verbal interview) with something like 'that is a good question!' or "thank you for asking me that!"

Same strategy but employed by humans.

Rohunyyy 4 hours ago | [-0 more]

I am not sure about the low-hanging fruit. It's not easy to make something robotic more human. Based on personal experience: I thought it would be low-hanging fruit for text. Take a simple LLM answer to anything and replace the "-" and the "it's not x, it's y" thing that people almost always associate with LLMs with something else. Guess what? Now those answers sound even MORE robotic. Obviously this was a pet project that I cooked up in less than an hour, but the more I tried to make it human, the more it became AI.

DoctorOetker 4 hours ago | [-0 more]

1) If the system misdetected end-of-turn and realizes its error too late, and if we collect ~90% of English syllables and keep a filler word that starts with each one, it could turn the commitment to interrupt the speaker into background filler instead.

2) If end-of-turn was detected very late, we can randomly select a first phonetic syllable, start speaking it immediately, and add an instruction to the prompt that the reply should start with that syllable!
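Idea (1) can be sketched with a toy syllable-to-filler table (the table and function names here are hypothetical, just to illustrate the rescue): if the system has already uttered a first syllable by mistake, pick a filler word that begins with that same syllable, so the false start becomes backchannel noise rather than an interruption.

```python
# Toy lookup: first syllable -> filler words starting with that syllable.
SYLLABLE_FILLERS = {
    "o": ["oh, okay", "oh right"],
    "m": ["mhmm", "mm, yeah"],
    "ri": ["right, right"],
    "so": ["so, yeah"],
}

def rescue_false_start(uttered_syllable):
    """Turn an accidentally uttered first syllable into a plausible filler.

    Returns a filler word sharing the syllable's prefix, or None if no
    rescue is available (in which case the system should just go quiet).
    """
    for prefix, fillers in SYLLABLE_FILLERS.items():
        if uttered_syllable.startswith(prefix) or prefix.startswith(uttered_syllable):
            return fillers[0]
    return None
```

A production table would need good coverage of the syllables the TTS voice can plausibly begin a word with, which is the "collect ~90% of English syllables" part of the idea.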

phkahler 8 hours ago | [-3 more]

Better if it can anticipate its response before you're done speaking. That would be subject to change depending on what the speaker says, but it might be able to start immediately.

fragmede 7 hours ago | [-2 more]

it's bad enough dealing with people who don't think before they speak, now we gotta make the computers do it as well‽

eru 4 hours ago | [-1 more]

Huh, the grandfather was suggesting to have the computer think while you speak.

That's different from banning the computer from thinking before they speak, ain't it?

fragmede an hour ago | [-0 more]

Thinking while I'm speaking means it isn't listening to everything I've said before thinking what to say. If I start my reply with "no, because...", and it's already formulating its response based on the "no" and not what comes after the because, then it's not thinking before it speaks.