Sesame | Full-time | SF/NYC/Bellevue | On-site | https://www.sesame.com/
Sesame believes in a future where computers are lifelike - with the ability to see, hear, and collaborate with us in ways that feel natural and human. With this vision, we're designing a new kind of computer, focused on making personal voice agents part of our daily lives. More details from Sequoia: https://www.sequoiacap.com/article/partnering-with-sesame-a-...
Our team brings together founders from Oculus and Ubiquity6, alongside proven leaders from Meta, Google, and Apple, with deep expertise spanning hardware and software.
Open Roles: https://jobs.ashbyhq.com/sesame
- ML Engineers
- Product Designers
- Product Managers
- iOS & Android Engineers
- ML Model Serving Engineer
- Embedded OS Architect
- Mechanical Engineer, Product Design
- Embedded Engineers
- Electrical Engineer
- Audio Systems Engineer
What do y'all think about the latency/quality tradeoff with LLMs?
Human voices don't take 30 seconds to think, retrieve, research, and summarize a high-quality answer. Humans are calibrated in their knowledge: they know what they understand and what they don't, and they can converse in real time without bullshitting.
Frontier real-time-ish LLM-generated voice systems are still plagued by 2024-era LLM nonsense, like the inability to count the Rs in "strawberry". [1]
I'd personally love a voice interface that, constrained by the technology of today, takes the latency hit to deliver quality.
[1] https://www.instagram.com/reel/DTYBpa7AHSJ/?igsh=MzRlODBiNWF...
Not affiliated with Sesame, but this is what the realtime models are trying to solve. If you look at NVIDIA’s PersonaPlex release [0], it uses a duplex architecture. It’s based on Moshi [1], which aims to address this problem by allowing the model to listen and generate audio at the same time.
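To make the duplex idea concrete: instead of a strict turn-taking pipeline (record, detect end of turn, then generate), a full-duplex model consumes one input frame and emits one output frame at every timestep, so listening and speaking can overlap. A toy sketch of that loop, with made-up frame strings and a trivial reply rule (nothing here reflects Moshi's or PersonaPlex's actual APIs):

```python
from dataclasses import dataclass, field

SILENCE = "<sil>"  # placeholder for a silent audio frame


@dataclass
class DuplexAgent:
    """Toy full-duplex loop: each step ingests one mic frame and emits
    one output frame, so the model never stops listening to speak."""
    heard: list = field(default_factory=list)

    def step(self, in_frame: str) -> str:
        # Always ingest the incoming frame, even mid-response.
        if in_frame != SILENCE:
            self.heard.append(in_frame)
            return SILENCE  # stay quiet while the user is talking
        # User paused: start replying immediately, with no separate
        # end-of-turn detection pass adding latency.
        if self.heard:
            return "reply:" + self.heard.pop(0)
        return SILENCE


def run(agent: DuplexAgent, mic_frames: list) -> list:
    """Drive the agent one frame at a time, like a realtime audio loop."""
    return [agent.step(f) for f in mic_frames]
```

The point of the structure is that "listen" and "generate" are a single interleaved stream, which is what lets duplex models handle barge-in and start answering the instant the user pauses, rather than after a full transcribe-then-generate round trip.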