I need to stop researching before starting projects
Conversational AI is AI that can simulate human conversation.
There are a ton of benefits to this, but my favorite is that people can listen and speak faster than they can read/type1.
There are four major parts in conversational AI.
But real conversation is much more than just back-and-forth text. To make it feel natural, a system also needs a conversation supervisor: something that can stop the AI from talking when the user interrupts. Voice activity detection (VAD) is its own component; it isn't part of the speech recognition system.
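To make the supervisor idea concrete, here's a rough sketch of barge-in detection using webrtcvad (a real VAD library). The `playback` object and the mic frame source are stand-ins I made up; the sample rate and frame size are just what webrtcvad expects.

```python
# Minimal sketch of a "conversation supervisor": watch the mic while the AI is
# speaking and cut playback off when the user barges in.
# Assumes 16 kHz, 16-bit mono PCM in 30 ms frames; `playback` is a hypothetical
# object with an is_playing flag and a stop() method.
import webrtcvad

vad = webrtcvad.Vad(2)      # aggressiveness: 0 (least) to 3 (most)
SAMPLE_RATE = 16000
FRAME_MS = 30

def supervise(mic_frames, playback, min_speech_frames=5):
    """Stop AI playback once the user has spoken for ~150 ms straight."""
    consecutive_speech = 0
    for frame in mic_frames:               # each frame: 30 ms of raw PCM bytes
        if not playback.is_playing:
            consecutive_speech = 0
            continue
        if vad.is_speech(frame, SAMPLE_RATE):
            consecutive_speech += 1
        else:
            consecutive_speech = 0
        if consecutive_speech >= min_speech_frames:
            playback.stop()                # user interrupted: stop talking
            consecutive_speech = 0
```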
The whole system's design should follow the rules of human conversation.
But what are conversational rules?
Quick psychology detour!
Human conversation has some technical dynamics. For example, when people talk, they naturally leave small gaps between sentences. This space is where others nod, say "uh-huh", or drop in quick affirmations. Ideally, the VAD ignores these affirmations rather than treating them as an interruption that ends the AI's turn.
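A crude way to sketch that: treat very short, known affirmations as backchannels and don't let them count as an interruption. The word list and the 700 ms cutoff below are guesses on my part, not researched numbers.

```python
# Rough sketch of a backchannel filter: short affirmations shouldn't count as
# the user taking the turn.
BACKCHANNELS = {"uh-huh", "mhm", "mm-hmm", "yeah", "right", "ok", "okay", "sure", "yep"}

def is_backchannel(transcript: str, duration_ms: float) -> bool:
    words = transcript.lower().strip(".,!? ").split()
    return duration_ms < 700 and len(words) <= 2 and all(w in BACKCHANNELS for w in words)

def should_treat_as_interruption(transcript: str, duration_ms: float) -> bool:
    # Only count user speech as an interruption if it's *not* a quick affirmation.
    return not is_backchannel(transcript, duration_ms)
```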
Timing is critical too. On average, people leave about 200 milliseconds between turns, and a pause of just two seconds already feels awkward. Thankfully, people have a higher tolerance for delay when talking to a machine than to another person.
The process of converting speech into text, getting a response from an LLM, and then converting that text back into speech takes a hot second. On top of that, there's all the added latency of the network too.
The gold standard people are held to is 200ms; conversational flow starts breaking down around 300-500ms, and anything past 500ms feels like awkward silence.
So where is time spent?
Ideally, ingest and VAD latency together take about 10ms. The ASR usually takes about 100-200ms. The real time monster is the LLM: using a thinking model or calling MCP tools eats precious time. Text-to-speech (TTS) generation takes another 100-200ms or so. It's important to understand that the LLM and TTS can run concurrently, streaming output from one directly into the other. Sending the audio back to the user can add another 100ms on top of all of this!
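To show what that concurrency looks like, here's a minimal sketch that buffers LLM tokens and hands each finished sentence to TTS while the LLM keeps generating. `stream_llm_tokens` and `synthesize_and_play` are hypothetical stand-ins for whatever LLM and TTS clients you actually use.

```python
# Sketch of LLM -> TTS streaming: speak each sentence as soon as it's complete
# instead of waiting for the full response.
import re

SENTENCE_END = re.compile(r"[.!?]\s")

async def speak_streaming(prompt, stream_llm_tokens, synthesize_and_play):
    buffer = ""
    async for token in stream_llm_tokens(prompt):
        buffer += token
        # Flush every complete sentence to TTS while the LLM keeps going.
        while (match := SENTENCE_END.search(buffer)):
            sentence, buffer = buffer[:match.end()], buffer[match.end():]
            await synthesize_and_play(sentence)
    if buffer.strip():
        await synthesize_and_play(buffer)   # whatever's left at the end
```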
That's a long time. How can we optimize this to be closer to a human response time?
FYI, the rest of this post is me theorizing about how to hit that 200ms target. Good hardware and fast LLMs solve most of these problems (if you're okay with 500ms+ of latency).
LLMs are also surprisingly good at responding without full context or when words are dropped. In English2, the beginning of a sentence typically introduces given information (something the listener might already know), the middle carries the new information, the core content, and the tail is often not even needed. Something like "Can you add milk to my list for when I go to the store?" could be trimmed, by cutting off listening early, to just "Can you add milk to my list".
Humans do this too during conversation, forming thoughts, ideas, and responses while the other party is still speaking. In a perfect world, the VAD stops the listening turn early and LLM inference starts before the user finishes talking.
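Here's a rough sketch of that early-endpointing idea: if the partial transcript already looks like a complete request, end the turn after a short pause instead of waiting for the usual silence timeout. The regex "completeness" check and the thresholds are placeholders; a tiny classifier model would do this job far better.

```python
# Sketch of ending the listening turn early when the request already sounds complete.
import re

EARLY_PAUSE_MS = 150    # cut in after a short pause...
NORMAL_PAUSE_MS = 600   # ...instead of waiting for the usual endpoint

def looks_complete(partial_transcript: str) -> bool:
    # Naive stand-in: a request-like verb followed by an object,
    # e.g. "add milk to my list".
    return bool(re.search(r"\b(add|set|play|call|remind|turn)\b .+", partial_transcript.lower()))

def should_end_turn(partial_transcript: str, trailing_silence_ms: float) -> bool:
    if looks_complete(partial_transcript):
        return trailing_silence_ms >= EARLY_PAUSE_MS
    return trailing_silence_ms >= NORMAL_PAUSE_MS
```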
Truncating the input and thinking about the answer before the speaker finishes carries a multitude of risks: you could miss a negation ("don't!"), lose context, or miss a self-correction.
Some possible ways of implementing these ideas:
What if we answer while half the question/thought is still being spoken, then use a small LLM to check whether that answer still stands once the full question is known? If it's wrong, both inputs can be combined to answer correctly. And if it's a simple question, we'd already have the answer.
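A sketch of how that speculative flow could look, with `big_llm` and `small_llm` as hypothetical async clients that take a prompt and return text:

```python
# Sketch of speculative answering: draft a reply from the half-heard question,
# then have a small/cheap model check the draft once the full question arrives.
async def speculative_answer(partial_question, full_question_future, big_llm, small_llm):
    draft = await big_llm(f"Answer this (possibly truncated) request: {partial_question}")
    full_question = await full_question_future    # resolves when the user stops talking
    verdict = await small_llm(
        f"Question: {full_question}\nDraft answer: {draft}\n"
        "Does the draft fully answer the question? Reply YES or NO."
    )
    if verdict.strip().upper().startswith("YES"):
        return draft                               # speculation paid off: reply instantly
    # Otherwise fall back and answer with everything we know.
    return await big_llm(f"Answer this request: {full_question}")
```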
Another option is streaming the ASR output to the LLM, so the LLM gets continuous, chunked context updates.
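One way to sketch that: push each partial transcript into the conversation as it arrives, so most of the prompt is already with the model (and, if the provider caches prompt prefixes, already processed) by the time the turn ends. `llm_generate` and the chunk objects here are assumptions, not any particular API.

```python
# Sketch of feeding ASR chunks to the LLM as they arrive.
async def stream_asr_into_llm(asr_chunks, llm_generate):
    messages = [{"role": "system", "content": "You are a low-latency voice assistant."}]
    async for chunk in asr_chunks:                 # e.g. "can you add", "milk to my", ...
        messages.append({"role": "user", "content": chunk.text})
        if chunk.is_final:                         # end of the user's turn
            return await llm_generate(messages)    # generate as soon as the turn closes
```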
There are like 100 more fucking magic techniques that can be used to optimize this, but I'm not actually an AI engineer so I'm not going to go over them.3
This is great; however, some people don't talk to machines the way they talk to humans. They often get straight to the point. There are also a million edge cases, so this might all be for nothing.
One thing's for sure: all of this adds up to excessive token usage.
If you have thoughts/ideas about any of this, please reach out. I would love to hear your thoughts and opinions about all of this.