Voice Pipelines vs Speech-to-Speech Models: What to Ship for Voice Agents
Voice agents are back, but the hard parts have not changed: latency, interruptions, transcription mistakes, background noise, and the uncomfortable truth that users will judge your product by the worst 200 milliseconds of a call.
If you are building a voice agent today, you are usually choosing between two architectures:
- Cascaded (chained) pipeline: ASR → LLM → TTS
- Speech-to-speech (voice-to-voice) model: audio in → audio out
Both can work. The right choice depends on what you are optimizing for: control, debuggability, prosody, cost, compliance, and the type of user experience you need.
I will use OpenAI’s terminology because it maps cleanly to how teams talk about this in practice: a chained system versus a speech-to-speech system. Start with their overview if you want a canonical framing. https://platform.openai.com/docs/guides/audio
TL;DR
- If you need tight control, predictable outputs, easy debugging, and clear policy boundaries, ship a cascaded pipeline.
- If you need lowest perceived latency and the most natural conversational feel, and you can tolerate less explicit control, consider speech-to-speech.
- Many teams end up with a hybrid: speech-to-speech for the “front of house” conversation, and a text layer in the middle for logging, tools, and policy.
If you are doing speech-to-speech in the browser, WebRTC should be the default
If you connect to speech-to-speech models from a browser client, you want two things:
- The lowest possible media latency (audio is the product).
- Minimal server involvement once the session is established (your server should not be on the hot path for every packet).
OpenAI’s Realtime WebRTC guide is explicit: for client connections, “we recommend using WebRTC rather than WebSockets for more consistent performance.” https://platform.openai.com/docs/guides/realtime-webrtc
The other important detail is auth. The WebRTC docs describe two browser connection patterns:
- Ephemeral client secrets minted by your server, then the client connects directly to OpenAI.
- The unified interface, which is simpler but “puts your application server in the critical path for session initialization.” https://platform.openai.com/docs/guides/realtime-webrtc
In practice, if your goal is to shave milliseconds and reduce server load, WebRTC plus ephemeral client secrets is the direction most teams should plan around.
1) Cascaded voice pipelines (ASR → LLM → TTS)
This is the classic “voice stack”:
Mic → VAD / endpointing → ASR → LLM (+ tools) → text post-processing → TTS → Speaker
Common building blocks:
- ASR: OpenAI Whisper, Deepgram, AssemblyAI, Google, on-device ASR
- LLM: a text model with tool calling and guardrails
- TTS: ElevenLabs, PlayHT, Azure, Google, on-device TTS
Concrete example patterns:
- OpenAI’s own chained approach (transcriptions + text model + speech) is described in their Audio guide. https://platform.openai.com/docs/guides/audio
- Real-time transcription via Deepgram Live Audio (WebSocket streaming). https://deepgram.com/learn/build-a-real-time-transcription-app-with-react-and-deepgram
- Streaming TTS via ElevenLabs WebSockets. https://elevenlabs.io/docs/developers/websockets
Why teams like cascades
- Observability is straightforward. You can log transcripts, prompts, tool calls, and final text.
- Control is explicit. You can sanitize text, apply policies, and gate sensitive actions before speech happens.
- Incremental upgrades are easy. Swap ASR or TTS without retraining the whole system.
- Failure modes are legible. You can usually answer: was it ASR, reasoning, or TTS?
Where cascades hurt
- Latency stacks up. Even good components add up, and the user feels it.
- Errors compound. A misheard entity can cascade into the wrong tool call and a confident response.
- Prosody is bolted on. Most “emotion” is heuristics in the TTS layer, not model-native conversation state.
The low-latency trick: keep the cascade, but make the LLM step fast
If you want the control of a cascade, the biggest latency lever is usually the LLM step in the middle (especially time-to-first-token).
Two “ultrafast” inference providers teams often consider for text generation are:
- Groq (not to be confused with xAI’s Grok). Pricing and supported models: https://groq.com/pricing/
- Cerebras Inference, with pay-per-token pricing and model details: https://inference-docs.cerebras.ai/support/pricing
Both expose OpenAI-compatible chat completion endpoints:
- Groq OpenAI compatibility: https://console.groq.com/docs/openai
- Cerebras chat completions (see authentication + endpoint examples): https://inference-docs.cerebras.ai/api-reference/authentication
This lets you keep your ASR and TTS choices, keep strict text policy gates, and still get very fast LLM turn latency.
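Because these endpoints support streaming, you can measure the latency lever directly. Below is a minimal, provider-agnostic sketch of a time-to-first-token probe; the stream is modeled as an `AsyncIterable<string>` of content deltas, which is how you would adapt the chunks of a streamed chat completion from the OpenAI SDK (or a Groq/Cerebras-compatible client). The helper name is mine, not from any SDK.

```typescript
// ttft-probe.ts
// Measure time-to-first-token (TTFT) for any stream of text deltas.
// With an OpenAI-compatible SDK you would adapt the chunks of
// `client.chat.completions.create({ stream: true, ... })` into this shape.

interface TtftResult {
  ttftMs: number;  // request start -> first delta (what the user feels)
  totalMs: number; // request start -> stream end
  text: string;    // accumulated reply text
}

async function measureTTFT(deltas: AsyncIterable<string>): Promise<TtftResult> {
  const start = Date.now();
  let ttftMs = -1;
  let text = "";
  for await (const delta of deltas) {
    if (ttftMs < 0) ttftMs = Date.now() - start; // first token arrived
    text += delta;
  }
  return { ttftMs, totalMs: Date.now() - start, text };
}
```

For a cascade, the number users actually feel is LLM TTFT plus TTS time-to-first-audio, so measure both before deciding where to optimize.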
2) Speech-to-speech (voice-to-voice) models (audio in → audio out)
Speech-to-speech models are built to take live audio and respond with audio directly. OpenAI’s Realtime models are one concrete example: voice in, voice out, without an intermediate text-only agent loop. https://platform.openai.com/docs/guides/realtime-model-capabilities and https://openai.com/index/introducing-the-realtime-api/
You can think of it as:
Mic → audio-to-audio model (+ tools) → Speaker
Notable examples to know about:
- OpenAI Realtime API (speech-to-speech voice agents): https://platform.openai.com/docs/guides/realtime
- Google Gemini Live / native audio direction (live voice agents): https://blog.google/products/gemini/gemini-audio-model-updates/
- Speech-to-speech translation research lineage: Google Translatotron and Meta SeamlessM4T (not agent stacks, but useful intuition for end-to-end audio). https://research.google/blog/introducing-translatotron-an-end-to-end-speech-to-speech-translation-model/ and https://about.fb.com/news/2023/08/seamlessm4t-ai-translation-model/
Why teams like speech-to-speech
- Lower perceived latency. Even small reductions in turn-taking delay feel huge.
- More natural conversation. Better handling of interruptions, pacing, and backchannels (when implemented well).
- Richer signals. Models can use non-text cues (tone, timing) when supported by the stack.
Where speech-to-speech hurts
- Control is harder. You can still do safety and policy, but the “what will it say next” surface is less explicit than a text draft you can inspect.
- Debugging takes new tools. You will want audio logging, event traces, and sometimes transcripts for analysis anyway.
- Statefulness can surprise you. Realtime sessions are stateful and have their own limits, which you need to design around. https://platform.openai.com/docs/guides/realtime-model-capabilities
A practical comparison (what matters when you ship)
| Dimension | Cascaded pipeline | Speech-to-speech |
|---|---|---|
| Latency | Higher (sum of parts) | Lower (single model loop) |
| Conversational feel | Often “call center” | Often more natural |
| Observability | Best-in-class (text logs) | Requires stronger telemetry |
| Output control | Strong (text gates) | Harder (still possible) |
| Tool calling | Very strong | Strong, but design matters |
| Compliance and review | Easier to reason about | Possible, but needs discipline |
| Component swapping | Easy | Harder |
| Cost tuning | Many knobs | Fewer knobs, but higher leverage |
The punchline: cascades win when you need predictable, inspectable behavior. Speech-to-speech wins when the experience is the product.
More detailed code: cascaded pipeline (streaming STT → fast LLM → TTS)
Below is a more realistic “starter” skeleton. It assumes:
- Browser streams audio frames to your server over WebSocket.
- Server streams those frames to an ASR provider (for example, Deepgram Live).
- On final transcripts, server calls a text LLM (optionally on Groq or Cerebras for speed).
- Server synthesizes TTS and streams audio back to the browser.
Browser: stream mic audio to your server
This uses MediaRecorder to send small chunks. It is not sample-accurate like an AudioWorklet pipeline, but it is a good first milestone because it lets you build the rest of the system.
```ts
// browser-mic.ts
const ws = new WebSocket("wss://your-domain.example/voice");

const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const rec = new MediaRecorder(stream, { mimeType: "audio/webm" });

rec.ondataavailable = async (e) => {
  if (ws.readyState !== WebSocket.OPEN) return;
  ws.send(await e.data.arrayBuffer());
};

// 250ms chunks is a common starting point for streaming STT demos.
rec.start(250);

ws.onmessage = async (e) => {
  const msg = JSON.parse(e.data);
  if (msg.type === "assistant.text") {
    console.log("Assistant:", msg.text);
  }
  if (msg.type === "assistant.audio_base64") {
    const audio = new Audio(`data:audio/mpeg;base64,${msg.audio}`);
    await audio.play();
  }
};
```
Server: stitch the pieces together
```ts
// server.ts (sketch)
//
// Deepgram Live Audio streaming example reference:
// https://deepgram.com/learn/build-a-real-time-transcription-app-with-react-and-deepgram
//
// Groq OpenAI-compatible docs:
// https://console.groq.com/docs/openai
//
// Cerebras API docs:
// https://inference-docs.cerebras.ai/api-reference/authentication
import { WebSocketServer } from "ws";
import OpenAI from "openai";

type TranscriptEvent = { text: string; isFinal: boolean };
type LlmProvider = "openai" | "groq" | "cerebras";

function llmClient(provider: LlmProvider) {
  if (provider === "groq") {
    return new OpenAI({
      apiKey: process.env.GROQ_API_KEY,
      baseURL: "https://api.groq.com/openai/v1",
    });
  }
  if (provider === "cerebras") {
    return new OpenAI({
      apiKey: process.env.CEREBRAS_API_KEY,
      baseURL: "https://api.cerebras.ai/v1",
    });
  }
  return new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
}

async function runLLM(provider: LlmProvider, userText: string) {
  const client = llmClient(provider);
  const resp = await client.chat.completions.create({
    // Pick a model you have evaluated. The "fast provider" path is usually open weights.
    model:
      process.env.LLM_MODEL ||
      (provider === "openai" ? "gpt-4o-mini" : "llama-3.1-8b-instant"),
    messages: [
      { role: "system", content: "You are a helpful voice agent. Be concise." },
      { role: "user", content: userText },
    ],
  });
  return resp.choices?.[0]?.message?.content ?? "";
}

const wss = new WebSocketServer({ port: 8787 });

wss.on("connection", async (clientSocket) => {
  // TODO: create a WS connection to your ASR provider here and pipe audio frames through.
  // When ASR yields final transcript events, call runLLM() and then your TTS provider.
  async function onTranscript(e: TranscriptEvent) {
    if (!e.isFinal) return;
    const provider = (process.env.LLM_PROVIDER as LlmProvider) || "groq";
    const replyText = await runLLM(provider, e.text);
    // TODO: synthesize speech and stream audio bytes back to the browser.
    clientSocket.send(JSON.stringify({ type: "assistant.text", text: replyText }));
  }

  clientSocket.on("message", async (data) => {
    // data is an audio chunk from the browser; forward to ASR.
  });
});
```
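One practical detail for the TTS TODO above: chunk the reply into sentence-sized pieces so synthesis can start before the full reply exists. A minimal sketch (the helper name and the 200-character cap are mine, not a provider requirement):

```typescript
// Split an LLM reply into sentence-sized chunks so TTS can start early.
// maxLen guards against run-on sentences; 200 chars is an arbitrary default.
function sentenceChunks(text: string, maxLen = 200): string[] {
  const chunks: string[] = [];
  let current = "";
  // Split on sentence-ending punctuation, keeping the punctuation attached.
  for (const piece of text.split(/(?<=[.!?])\s+/)) {
    if (current && current.length + piece.length + 1 > maxLen) {
      chunks.push(current);
      current = piece;
    } else {
      current = current ? `${current} ${piece}` : piece;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Feed each chunk to your streaming TTS provider as it becomes available; the first chunk's synthesis latency is what the user hears as response time.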
If you want runnable reference points for the parts:
- Deepgram Live Audio streaming example: https://deepgram.com/learn/build-a-real-time-transcription-app-with-react-and-deepgram
- OpenAI chained voice agent overview: https://platform.openai.com/docs/guides/audio
- ElevenLabs streaming TTS (WebSocket): https://elevenlabs.io/docs/developers/websockets
If you prefer OpenAI TTS for the first working version, pricing and the Audio guide are here: https://platform.openai.com/pricing and https://platform.openai.com/docs/guides/audio
More detailed code: speech-to-speech with OpenAI Realtime over WebRTC
For browser clients, start with WebRTC. OpenAI’s docs cover two WebRTC connection patterns; the one below uses ephemeral client secrets, which keeps your standard API key off the client. https://platform.openai.com/docs/guides/realtime-webrtc
1) Server: mint a short-lived client secret
```ts
// server-token.ts
import express from "express";

const app = express();

app.get("/token", async (_req, res) => {
  const r = await fetch("https://api.openai.com/v1/realtime/client_secrets", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      session: {
        type: "realtime",
        model: "gpt-realtime-mini",
        instructions: "Be concise. Ask clarifying questions when needed.",
        audio: { output: { voice: "marin" } },
      },
    }),
  });
  if (!r.ok) {
    res.status(500).json({ error: "Failed to mint client secret" });
    return;
  }
  res.json(await r.json());
});

app.listen(3000);
```
Client secrets endpoint reference: https://platform.openai.com/docs/api-reference/realtime-sessions/create-secret-response
2) Browser: connect to Realtime via WebRTC, directly
```ts
// browser-webrtc.ts
const tokenRes = await fetch("/token");
const tokenJson = await tokenRes.json();
const EPHEMERAL_KEY = tokenJson?.value;

const pc = new RTCPeerConnection();

// audio playback
const audioEl = document.createElement("audio");
audioEl.autoplay = true;
pc.ontrack = (e) => (audioEl.srcObject = e.streams[0]);

// microphone capture
const ms = await navigator.mediaDevices.getUserMedia({ audio: true });
pc.addTrack(ms.getTracks()[0], ms);

// events (tool calls, transcripts, debug)
const dc = pc.createDataChannel("oai-events");
dc.onmessage = (e) => console.log(JSON.parse(e.data));

const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

const sdpResponse = await fetch("https://api.openai.com/v1/realtime/calls", {
  method: "POST",
  body: offer.sdp,
  headers: {
    Authorization: `Bearer ${EPHEMERAL_KEY}`,
    "Content-Type": "application/sdp",
  },
});

await pc.setRemoteDescription({
  type: "answer",
  sdp: await sdpResponse.text(),
});
```
Full WebRTC walkthrough (including unified interface): https://platform.openai.com/docs/guides/realtime-webrtc
Cost analysis: what the architectures look like on a bill
Pricing changes frequently, so treat the numbers below as “shape of costs” guidance and validate against current pricing pages.
Cascade cost model
A cascade usually looks like:
Cost ≈ STT(minutes) + LLM(input_tokens, output_tokens) + TTS(characters) + infra
Some pricing references (as of late January 2026):
- OpenAI Whisper transcription: $0.006 / minute. https://platform.openai.com/pricing
- Deepgram streaming ASR (Nova-3): $0.0077 / minute. https://deepgram.com/pricing
- OpenAI TTS: $15 / 1M characters. https://platform.openai.com/pricing
- Groq text tokens (example Llama 3.1 8B): $0.05 / 1M input tokens, $0.08 / 1M output tokens. https://groq.com/pricing/
- Cerebras text tokens (example Llama 3.1 8B): $0.10 / 1M input tokens, $0.10 / 1M output tokens. https://inference-docs.cerebras.ai/support/pricing
Worked example (simple voice agent turn):
- User speaks for 60 seconds.
- Assistant replies with about 120 words (roughly 700 characters).
- LLM uses 1,000 input tokens and 250 output tokens.
Option A: Deepgram Nova-3 + Groq Llama 3.1 8B + OpenAI TTS
- STT: 1.0 min × $0.0077/min = $0.0077
- LLM: (1,000 × $0.05 / 1,000,000) + (250 × $0.08 / 1,000,000) ≈ $0.00007
- TTS: 700 × $15 / 1,000,000 ≈ $0.0105
- Total (not counting infra): ~$0.018 per turn
Option B: OpenAI Whisper + Cerebras Llama 3.1 8B + OpenAI TTS
- STT: 1.0 min × $0.006/min = $0.006
- LLM: (1,000 × $0.10 / 1,000,000) + (250 × $0.10 / 1,000,000) ≈ $0.00013
- TTS: same $0.0105
- Total: ~$0.017 per turn
Two takeaways:
- In many cascades, TTS dominates unless you are generating long answers.
- A faster LLM provider can improve latency without materially changing cost, because the LLM token cost is often not the big line item for short turns.
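The worked examples above generalize into a simple per-turn cost function. The rates here are the late-January-2026 numbers quoted above (illustrative; revalidate against the pricing pages):

```typescript
// Per-turn cascade cost: STT minutes + LLM tokens + TTS characters.
// Rates are illustrative, taken from the pricing pages cited above.
interface CascadeRates {
  sttPerMin: number;     // $/minute of audio
  llmPerMInput: number;  // $/1M input tokens
  llmPerMOutput: number; // $/1M output tokens
  ttsPerMChars: number;  // $/1M characters
}

function cascadeTurnCost(
  r: CascadeRates,
  sttMinutes: number,
  inputTokens: number,
  outputTokens: number,
  ttsChars: number
): number {
  return (
    sttMinutes * r.sttPerMin +
    (inputTokens * r.llmPerMInput + outputTokens * r.llmPerMOutput) / 1e6 +
    (ttsChars * r.ttsPerMChars) / 1e6
  );
}

// Option A from above: Deepgram Nova-3 + Groq Llama 3.1 8B + OpenAI TTS
const optionA = cascadeTurnCost(
  { sttPerMin: 0.0077, llmPerMInput: 0.05, llmPerMOutput: 0.08, ttsPerMChars: 15 },
  1.0, 1000, 250, 700
);
console.log(optionA.toFixed(4)); // ≈ 0.0183 per turn, before infra
```

Plugging in different rates makes the "TTS dominates" takeaway easy to verify for your own traffic shape.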
One more practical note: many teams buy TTS in “minutes” rather than per-character. ElevenLabs advertises low-latency TTS “as low as $0.05/min” on their pricing page (often an enterprise-scale signal, not a default self-serve rate). https://elevenlabs.io/pricing
Speech-to-speech (Realtime) cost model
Realtime speech-to-speech is billed as audio tokens in and out, not minutes directly. OpenAI publishes a conversion guideline:
- Audio input: roughly 1 token per 100ms (about 10 tokens/sec).
- Audio output: roughly 1 token per 50ms (about 20 tokens/sec). https://platform.openai.com/docs/guides/realtime/costs
And pricing (Realtime models):
- gpt-realtime: $32 / 1M input audio tokens, $64 / 1M output audio tokens. https://platform.openai.com/pricing
- gpt-realtime-mini: $10 / 1M input audio tokens, $20 / 1M output audio tokens. https://platform.openai.com/pricing
Worked example (same 60s user audio, 30s assistant audio):
- Input audio tokens: 60s × 10 tokens/s = 600 tokens
- Output audio tokens: 30s × 20 tokens/s = 600 tokens
gpt-realtime:
- Input: 600 × $32 / 1,000,000 ≈ $0.0192
- Output: 600 × $64 / 1,000,000 ≈ $0.0384
- Total: ~$0.058 per turn
gpt-realtime-mini:
- Input: 600 × $10 / 1,000,000 = $0.006
- Output: 600 × $20 / 1,000,000 = $0.012
- Total: ~$0.018 per turn
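The same arithmetic as a function, using the token-per-second conversions above (roughly 10 input tokens/s and 20 output tokens/s; the rates are the pricing-page numbers quoted earlier):

```typescript
// Per-turn Realtime audio cost from seconds of audio in each direction.
// Token conversion per OpenAI's guideline: ~10 input tokens/s, ~20 output tokens/s.
function realtimeTurnCost(
  inputSec: number,
  outputSec: number,
  perMInput: number,  // $/1M input audio tokens
  perMOutput: number  // $/1M output audio tokens
): number {
  const inputTokens = inputSec * 10;
  const outputTokens = outputSec * 20;
  return (inputTokens * perMInput + outputTokens * perMOutput) / 1e6;
}

const full = realtimeTurnCost(60, 30, 32, 64); // gpt-realtime
const mini = realtimeTurnCost(60, 30, 10, 20); // gpt-realtime-mini
console.log(full.toFixed(4), mini.toFixed(4)); // 0.0576 0.0180
```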
The headline: speech-to-speech can be more expensive on the premium model tier, but the smaller realtime tier can land in the same range as many cascades. Let experience and control requirements drive the architecture decision first; then tune cost by choosing the right tier and tightening turn lengths.
Optional: when to use WebSockets for Realtime
WebRTC should be your default for browser clients. WebSockets still show up when you need a server-side bridge (telephony, SIP, or custom audio routing). If you do WebSockets, you are responsible for chunking and base64 encoding audio and sending input_audio_buffer.append events. Start with the event reference: https://platform.openai.com/docs/api-reference/realtime-client-events/input_audio_buffer and the WebSocket guide: https://platform.openai.com/docs/guides/realtime-websocket
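A sketch of the chunk-and-append step that path requires, assuming a Node server bridging 16-bit little-endian PCM frames (the `input_audio_buffer.append` event type is from the client events reference above; the helper name is mine):

```typescript
// Build an input_audio_buffer.append event from one PCM16 audio frame.
// Assumes Node (Buffer available); frame holds little-endian 16-bit samples.
function appendEvent(frame: Int16Array): string {
  const bytes = Buffer.from(frame.buffer, frame.byteOffset, frame.byteLength);
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: bytes.toString("base64"),
  });
}

// Usage: for each small frame from your telephony/SIP bridge, something like:
// ws.send(appendEvent(frame));
const ev = JSON.parse(appendEvent(new Int16Array([0, 1, -1])));
console.log(ev.type); // "input_audio_buffer.append"
```

Check the event reference for the expected audio format and commit/clear events before relying on this shape; the point here is only that your server owns framing and base64 encoding on the WebSocket path.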
When to use what (my rule-of-thumb)
Choose cascaded if:
- You need auditability (transcripts, prompts, text outputs, tool decisions).
- You need strict policy gates (PII handling, regulated actions, scripted disclosures).
- You want to iterate fast by swapping components and measuring impact.
- Your product can tolerate a slightly more “turn-based” feel.
Choose speech-to-speech if:
- Your product lives or dies by interruptions and turn-taking (sales, coaching, companions).
- You need the most natural prosody and pacing you can get.
- You are comfortable investing in new observability: audio logs, event traces, and careful evaluation.
My personal additions from production voice work:
- Most “voice agent” bugs are not model bugs. They are endpointing and state bugs: barge-in, half-duplex audio paths, and when you decide a user is done speaking.
- You should treat latency as a budget, not a number. Measure it per stage. Decide where you can spend it.
- Plan for failure: dropouts, partial transcripts, tool timeouts, and what you do when the agent is wrong.
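The "latency as a budget" point can be made concrete with a small per-stage tracker; the stage names and the 800ms budget below are illustrative, not a standard:

```typescript
// Track per-stage latency against a total turn budget.
class LatencyBudget {
  private stages = new Map<string, number>();

  constructor(private budgetMs: number) {}

  record(stage: string, ms: number): void {
    this.stages.set(stage, (this.stages.get(stage) ?? 0) + ms);
  }

  totalMs(): number {
    let total = 0;
    this.stages.forEach((ms) => { total += ms; });
    return total;
  }

  overBudget(): boolean {
    return this.totalMs() > this.budgetMs;
  }

  report(): string {
    const parts: string[] = [];
    this.stages.forEach((ms, stage) => parts.push(`${stage}=${ms}ms`));
    return `${parts.join(" ")} total=${this.totalMs()}ms budget=${this.budgetMs}ms`;
  }
}

// Example: spending an 800ms turn budget across stages.
const budget = new LatencyBudget(800);
budget.record("endpointing", 200);
budget.record("asr_final", 150);
budget.record("llm_ttft", 180);
budget.record("tts_first_audio", 220);
console.log(budget.report(), budget.overBudget()); // total=750ms, within budget
```

Once every turn emits a report like this, "the agent feels slow" becomes "endpointing is eating 40% of the budget," which is an actionable bug.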
If you ship with those constraints in mind, both architectures can deliver a great experience. The difference is where you want the complexity to live.