Voice Pipelines vs Speech-to-Speech Models: What to Ship for Voice Agents
Voice agents are back, but the hard parts have not changed: latency, interruptions, transcription mistakes, background noise, and the uncomfortable truth that users will judge your product by the worst 200 milliseconds of a call.
If you are building a voice agent today, you are usually choosing between two architectures:
- Cascaded (chained) pipeline: ASR → LLM → TTS
- Speech-to-speech (voice-to-voice) model: audio in → audio out
Both can work. The right choice depends on what you are optimizing for: control, debuggability, prosody, cost, compliance, and the type of user experience you need.
I will use OpenAI’s terminology because it maps cleanly to how teams talk about this in practice: a chained system versus a speech-to-speech system. Start with their overview if you want a canonical framing. https://platform.openai.com/docs/guides/audio
TL;DR
- If you need tight control, predictable outputs, easy debugging, and clear policy boundaries, ship a cascaded pipeline.
- If you need lowest perceived latency and the most natural conversational feel, and you can tolerate less explicit control, consider speech-to-speech.
- Many teams end up with a hybrid: speech-to-speech for the “front of house” conversation, and a text layer in the middle for logging, tools, and policy.
If you are doing speech-to-speech in the browser, WebRTC should be the default
If you connect to speech-to-speech models from a browser client, you want two things:
- The lowest possible media latency (audio is the product).
- Minimal server involvement once the session is established (your server should not be on the hot path for every packet).
OpenAI’s Realtime WebRTC guide is explicit: for client connections, “we recommend using WebRTC rather than WebSockets for more consistent performance.” https://platform.openai.com/docs/guides/realtime-webrtc
The other important detail is auth. The WebRTC docs describe two browser connection patterns:
- Ephemeral client secrets minted by your server, then the client connects directly to OpenAI.
- The unified interface, which is simpler but “puts your application server in the critical path for session initialization.” https://platform.openai.com/docs/guides/realtime-webrtc
In practice, if your goal is to shave milliseconds and reduce server load, WebRTC plus ephemeral client secrets is the direction most teams should plan around.
1) Cascaded voice pipelines (ASR → LLM → TTS)
This is the classic “voice stack”:
Mic → VAD / endpointing → ASR → LLM (+ tools) → text post-processing → TTS → Speaker
Common building blocks:
- ASR: OpenAI Whisper, Deepgram, AssemblyAI, Google, on-device ASR
- LLM: a text model with tool calling and guardrails
- TTS: ElevenLabs, PlayHT, Azure, Google, on-device TTS
Concrete example patterns:
- OpenAI’s own chained approach (transcriptions + text model + speech) is described in their Audio guide. https://platform.openai.com/docs/guides/audio
- Real-time transcription via Deepgram Live Audio (WebSocket streaming). https://deepgram.com/learn/build-a-real-time-transcription-app-with-react-and-deepgram
- Streaming TTS via ElevenLabs WebSockets. https://elevenlabs.io/docs/developers/websockets
Why teams like cascades
- Observability is straightforward. You can log transcripts, prompts, tool calls, and final text.
- Control is explicit. You can sanitize text, apply policies, and gate sensitive actions before speech happens.
- Incremental upgrades are easy. Swap ASR or TTS without retraining the whole system.
- Failure modes are legible. You can usually answer: was it ASR, reasoning, or TTS?
Where cascades hurt
- Latency stacks up. Even good components add up, and the user feels it.
- Errors compound. A misheard entity can cascade into the wrong tool call and a confident response.
- Prosody is bolted on. Most “emotion” is heuristics in the TTS layer, not model-native conversation state.
The low-latency trick: keep the cascade, but make the LLM step fast
If you want the control of a cascade, the biggest latency lever is usually the LLM step in the middle (especially time-to-first-token).
Two “ultrafast” inference providers teams often consider for text generation are:
- Groq (not to be confused with xAI’s Grok). Pricing and supported models: https://groq.com/pricing/
- Cerebras Inference, with pay-per-token pricing and model details: https://inference-docs.cerebras.ai/support/pricing
Both expose OpenAI-compatible chat completion endpoints:
- Groq OpenAI compatibility: https://console.groq.com/docs/openai
- Cerebras chat completions (see authentication + endpoint examples): https://inference-docs.cerebras.ai/api-reference/authentication
This lets you keep your ASR and TTS choices, keep strict text policy gates, and still get very fast LLM turn latency.
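Because these endpoints support streaming, you can measure the latency lever directly. Below is a minimal, provider-agnostic sketch of a time-to-first-token probe; the stream is modeled as an `AsyncIterable<string>` of content deltas, which is how you would adapt the chunks of a streamed chat completion from the OpenAI SDK (or a Groq/Cerebras-compatible client). The helper name is mine, not from any SDK.

```typescript
// ttft-probe.ts
// Measure time-to-first-token (TTFT) for any stream of text deltas.
// With an OpenAI-compatible SDK you would adapt the chunks of
// `client.chat.completions.create({ stream: true, ... })` into this shape.

interface TtftResult {
  ttftMs: number;  // request start -> first delta (what the user feels)
  totalMs: number; // request start -> stream end
  text: string;    // accumulated reply text
}

async function measureTTFT(deltas: AsyncIterable<string>): Promise<TtftResult> {
  const start = Date.now();
  let ttftMs = -1;
  let text = "";
  for await (const delta of deltas) {
    if (ttftMs < 0) ttftMs = Date.now() - start; // first token arrived
    text += delta;
  }
  return { ttftMs, totalMs: Date.now() - start, text };
}
```

For a cascade, the number users actually feel is LLM TTFT plus TTS time-to-first-audio, so measure both before deciding where to optimize.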
2) Speech-to-speech (voice-to-voice) models (audio in → audio out)
Speech-to-speech models are built to take live audio and respond with audio directly. OpenAI’s Realtime models are one concrete example: voice in, voice out, without an intermediate text-only agent loop. https://platform.openai.com/docs/guides/realtime-model-capabilities and https://openai.com/index/introducing-the-realtime-api/
You can think of it as:
Mic → audio-to-audio model (+ tools) → Speaker
Notable examples to know about:
- OpenAI Realtime API (speech-to-speech voice agents): https://platform.openai.com/docs/guides/realtime
- Google Gemini Live / native audio direction (live voice agents): https://blog.google/products/gemini/gemini-audio-model-updates/
- Speech-to-speech translation research lineage: Google Translatotron and Meta SeamlessM4T (not agent stacks, but useful intuition for end-to-end audio). https://research.google/blog/introducing-translatotron-an-end-to-end-speech-to-speech-translation-model/ and https://about.fb.com/news/2023/08/seamlessm4t-ai-translation-model/
Why teams like speech-to-speech
- Lower perceived latency. Even small reductions in turn-taking delay feel huge.
- More natural conversation. Better handling of interruptions, pacing, and backchannels (when implemented well).
- Richer signals. Models can use non-text cues (tone, timing) when supported by the stack.
Where speech-to-speech hurts
- Control is harder. You can still do safety and policy, but the “what will it say next” surface is less explicit than a text draft you can inspect.
- Debugging takes new tools. You will want audio logging, event traces, and sometimes transcripts for analysis anyway.
- Statefulness can surprise you. Realtime sessions are stateful and have their own limits, which you need to design around. https://platform.openai.com/docs/guides/realtime-model-capabilities
A practical comparison (what matters when you ship)
| Dimension | Cascaded pipeline | Speech-to-speech |
|---|---|---|
| Latency | Higher (sum of parts) | Lower (single model loop) |
| Conversational feel | Often “call center” | Often more natural |
| Observability | Best-in-class (text logs) | Requires stronger telemetry |
| Output control | Strong (text gates) | Harder (still possible) |
| Tool calling | Very strong | Strong, but design matters |
| Compliance and review | Easier to reason about | Possible, but needs discipline |
| Component swapping | Easy | Harder |
| Cost tuning | Many knobs | Fewer knobs, but higher leverage |
The punchline: cascades win when you need predictable, inspectable behavior. Speech-to-speech wins when the experience is the product.
More detailed code: cascaded pipeline (streaming STT → fast LLM → TTS)
Below is a more realistic “starter” skeleton. It assumes:
- Browser streams audio frames to your server over WebSocket.
- Server streams those frames to an ASR provider (for example, Deepgram Live).
- On final transcripts, server calls a text LLM (optionally on Groq or Cerebras for speed).
- Server synthesizes TTS and streams audio back to the browser.
Browser: stream mic audio to your server
This uses MediaRecorder to send small chunks. It is not sample-accurate like an AudioWorklet pipeline, but it is a good first milestone because it lets you build the rest of the system.
```ts
// browser-mic.ts
const ws = new WebSocket("wss://your-domain.example/voice");

const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const rec = new MediaRecorder(stream, { mimeType: "audio/webm" });

rec.ondataavailable = async (e) => {
  if (ws.readyState !== WebSocket.OPEN) return;
  ws.send(await e.data.arrayBuffer());
};

// 250ms chunks is a common starting point for streaming STT demos.
rec.start(250);

ws.onmessage = async (e) => {
  const msg = JSON.parse(e.data);
  if (msg.type === "assistant.text") {
    console.log("Assistant:", msg.text);
  }
  if (msg.type === "assistant.audio_base64") {
    const audio = new Audio(`data:audio/mpeg;base64,${msg.audio}`);
    await audio.play();
  }
};
```
Server: stitch the pieces together
```ts
// server.ts (sketch)
//
// Deepgram Live Audio streaming example reference:
// https://deepgram.com/learn/build-a-real-time-transcription-app-with-react-and-deepgram
//
// Groq OpenAI-compatible docs:
// https://console.groq.com/docs/openai
//
// Cerebras API docs:
// https://inference-docs.cerebras.ai/api-reference/authentication
import { WebSocketServer } from "ws";
import OpenAI from "openai";

type TranscriptEvent = { text: string; isFinal: boolean };
type LlmProvider = "openai" | "groq" | "cerebras";

function llmClient(provider: LlmProvider) {
  if (provider === "groq") {
    return new OpenAI({
      apiKey: process.env.GROQ_API_KEY,
      baseURL: "https://api.groq.com/openai/v1",
    });
  }
  if (provider === "cerebras") {
    return new OpenAI({
      apiKey: process.env.CEREBRAS_API_KEY,
      baseURL: "https://api.cerebras.ai/v1",
    });
  }
  return new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
}

async function runLLM(provider: LlmProvider, userText: string) {
  const client = llmClient(provider);
  const resp = await client.chat.completions.create({
    // Pick a model you have evaluated. The "fast provider" path is usually open weights.
    model:
      process.env.LLM_MODEL ||
      (provider === "openai" ? "gpt-4o-mini" : "llama-3.1-8b-instant"),
    messages: [
      { role: "system", content: "You are a helpful voice agent. Be concise." },
      { role: "user", content: userText },
    ],
  });
  return resp.choices?.[0]?.message?.content ?? "";
}

const wss = new WebSocketServer({ port: 8787 });

wss.on("connection", async (clientSocket) => {
  // TODO: create a WS connection to your ASR provider here and pipe audio frames through.
  // When ASR yields final transcript events, call runLLM() and then your TTS provider.
  async function onTranscript(e: TranscriptEvent) {
    if (!e.isFinal) return;
    const provider = (process.env.LLM_PROVIDER as LlmProvider) || "groq";
    const replyText = await runLLM(provider, e.text);
    // TODO: synthesize speech and stream audio bytes back to the browser.
    clientSocket.send(JSON.stringify({ type: "assistant.text", text: replyText }));
  }

  clientSocket.on("message", async (data) => {
    // data is an audio chunk from the browser; forward to ASR.
  });
});
```
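One practical detail for the TTS TODO above: chunk the reply into sentence-sized pieces so synthesis can start before the full reply exists. A minimal sketch (the helper name and the 200-character cap are mine, not a provider requirement):

```typescript
// Split an LLM reply into sentence-sized chunks so TTS can start early.
// maxLen guards against run-on sentences; 200 chars is an arbitrary default.
function sentenceChunks(text: string, maxLen = 200): string[] {
  const chunks: string[] = [];
  let current = "";
  // Split on sentence-ending punctuation, keeping the punctuation attached.
  for (const piece of text.split(/(?<=[.!?])\s+/)) {
    if (current && current.length + piece.length + 1 > maxLen) {
      chunks.push(current);
      current = piece;
    } else {
      current = current ? `${current} ${piece}` : piece;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Feed each chunk to your streaming TTS provider as it becomes available; the first chunk's synthesis latency is what the user hears as response time.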
If you want runnable reference points for the parts:
- Deepgram Live Audio streaming example: https://deepgram.com/learn/build-a-real-time-transcription-app-with-react-and-deepgram
- OpenAI chained voice agent overview: https://platform.openai.com/docs/guides/audio
- ElevenLabs streaming TTS (WebSocket): https://elevenlabs.io/docs/developers/websockets
If you prefer OpenAI TTS for the first working version, pricing and the Audio guide are here: https://platform.openai.com/pricing and https://platform.openai.com/docs/guides/audio
More detailed code: speech-to-speech with OpenAI Realtime over WebRTC
For browser clients, start with WebRTC. OpenAI’s docs cover two WebRTC connection patterns; the one below uses ephemeral client secrets, which keeps your standard API key off the client. https://platform.openai.com/docs/guides/realtime-webrtc
1) Server: mint a short-lived client secret
```ts
// server-token.ts
import express from "express";

const app = express();

app.get("/token", async (_req, res) => {
  const r = await fetch("https://api.openai.com/v1/realtime/client_secrets", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      session: {
        type: "realtime",
        model: "gpt-realtime-mini",
        instructions: "Be concise. Ask clarifying questions when needed.",
        audio: { output: { voice: "marin" } },
      },
    }),
  });
  if (!r.ok) {
    res.status(500).json({ error: "Failed to mint client secret" });
    return;
  }
  res.json(await r.json());
});

app.listen(3000);
```
Client secrets endpoint reference: https://platform.openai.com/docs/api-reference/realtime-sessions/create-secret-response
2) Browser: connect to Realtime via WebRTC, directly
```ts
// browser-webrtc.ts
const tokenRes = await fetch("/token");
const tokenJson = await tokenRes.json();
const EPHEMERAL_KEY = tokenJson?.value;

const pc = new RTCPeerConnection();

// audio playback
const audioEl = document.createElement("audio");
audioEl.autoplay = true;
pc.ontrack = (e) => (audioEl.srcObject = e.streams[0]);

// microphone capture
const ms = await navigator.mediaDevices.getUserMedia({ audio: true });
pc.addTrack(ms.getTracks()[0], ms);

// events (tool calls, transcripts, debug)
const dc = pc.createDataChannel("oai-events");
dc.onmessage = (e) => console.log(JSON.parse(e.data));

const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

const sdpResponse = await fetch("https://api.openai.com/v1/realtime/calls", {
  method: "POST",
  body: offer.sdp,
  headers: {
    Authorization: `Bearer ${EPHEMERAL_KEY}`,
    "Content-Type": "application/sdp",
  },
});

await pc.setRemoteDescription({
  type: "answer",
  sdp: await sdpResponse.text(),
});
```
Full WebRTC walkthrough (including unified interface): https://platform.openai.com/docs/guides/realtime-webrtc
Cost analysis: what the architectures look like on a bill
Pricing changes frequently, so treat the numbers below as “shape of costs” guidance and validate against current pricing pages.
Cascade cost model
A cascade usually looks like:
Cost ≈ STT(minutes) + LLM(input_tokens, output_tokens) + TTS(characters) + infra
Some pricing references (as of late January 2026):
- OpenAI Whisper transcription: $0.006 / minute. https://platform.openai.com/pricing
- Deepgram streaming ASR (Nova-3): $0.0077 / minute. https://deepgram.com/pricing
- OpenAI TTS: $15 / 1M characters. https://platform.openai.com/pricing
- Groq text tokens (example Llama 3.1 8B): $0.05 / 1M input tokens, $0.08 / 1M output tokens. https://groq.com/pricing/
- Cerebras text tokens (example Llama 3.1 8B): $0.10 / 1M input tokens, $0.10 / 1M output tokens. https://inference-docs.cerebras.ai/support/pricing
Worked example (simple voice agent turn):
- User speaks for 60 seconds.
- Assistant replies with about 120 words (roughly 700 characters).
- LLM uses 1,000 input tokens and 250 output tokens.
Option A: Deepgram Nova-3 + Groq Llama 3.1 8B + OpenAI TTS
- STT: 1.0 min × $0.0077/min = $0.0077
- LLM: (1,000 × $0.05 / 1,000,000) + (250 × $0.08 / 1,000,000) ≈ $0.00007
- TTS: 700 × $15 / 1,000,000 ≈ $0.0105
- Total (not counting infra): ~$0.018 per turn
Option B: OpenAI Whisper + Cerebras Llama 3.1 8B + OpenAI TTS
- STT: 1.0 min × $0.006/min = $0.006
- LLM: (1,000 × $0.10 / 1,000,000) + (250 × $0.10 / 1,000,000) ≈ $0.00013
- TTS: same $0.0105
- Total: ~$0.017 per turn
Two takeaways:
- In many cascades, TTS dominates unless you are generating long answers.
- A faster LLM provider can improve latency without materially changing cost, because the LLM token cost is often not the big line item for short turns.
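The worked examples above generalize into a simple per-turn cost function. The rates here are the late-January-2026 numbers quoted above (illustrative; revalidate against the pricing pages):

```typescript
// Per-turn cascade cost: STT minutes + LLM tokens + TTS characters.
// Rates are illustrative, taken from the pricing pages cited above.
interface CascadeRates {
  sttPerMin: number;     // $/minute of audio
  llmPerMInput: number;  // $/1M input tokens
  llmPerMOutput: number; // $/1M output tokens
  ttsPerMChars: number;  // $/1M characters
}

function cascadeTurnCost(
  r: CascadeRates,
  sttMinutes: number,
  inputTokens: number,
  outputTokens: number,
  ttsChars: number
): number {
  return (
    sttMinutes * r.sttPerMin +
    (inputTokens * r.llmPerMInput + outputTokens * r.llmPerMOutput) / 1e6 +
    (ttsChars * r.ttsPerMChars) / 1e6
  );
}

// Option A from above: Deepgram Nova-3 + Groq Llama 3.1 8B + OpenAI TTS
const optionA = cascadeTurnCost(
  { sttPerMin: 0.0077, llmPerMInput: 0.05, llmPerMOutput: 0.08, ttsPerMChars: 15 },
  1.0, 1000, 250, 700
);
console.log(optionA.toFixed(4)); // ≈ 0.0183 per turn, before infra
```

Plugging in different rates makes the "TTS dominates" takeaway easy to verify for your own traffic shape.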
One more practical note: many teams buy TTS in “minutes” rather than per-character. ElevenLabs advertises low-latency TTS “as low as $0.05/min” on their pricing page (often an enterprise-scale signal, not a default self-serve rate). https://elevenlabs.io/pricing
Speech-to-speech (Realtime) cost model
Realtime speech-to-speech is billed as audio tokens in and out, not minutes directly. OpenAI publishes a conversion guideline:
- Audio input: roughly 1 token per 100ms (about 10 tokens/sec).
- Audio output: roughly 1 token per 50ms (about 20 tokens/sec). https://platform.openai.com/docs/guides/realtime/costs
And pricing (Realtime models):
- gpt-realtime: $32 / 1M input audio tokens, $64 / 1M output audio tokens. https://platform.openai.com/pricing
- gpt-realtime-mini: $10 / 1M input audio tokens, $20 / 1M output audio tokens. https://platform.openai.com/pricing
Worked example (same 60s user audio, 30s assistant audio):
- Input audio tokens: 60s × 10 tokens/s = 600 tokens
- Output audio tokens: 30s × 20 tokens/s = 600 tokens
gpt-realtime:
- Input: 600 × $32 / 1,000,000 ≈ $0.0192
- Output: 600 × $64 / 1,000,000 ≈ $0.0384
- Total: ~$0.058 per turn
gpt-realtime-mini:
- Input: 600 × $10 / 1,000,000 = $0.006
- Output: 600 × $20 / 1,000,000 = $0.012
- Total: ~$0.018 per turn
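The same arithmetic as a function, using the token-per-second conversions above (roughly 10 input tokens/s and 20 output tokens/s; the rates are the pricing-page numbers quoted earlier):

```typescript
// Per-turn Realtime audio cost from seconds of audio in each direction.
// Token conversion per OpenAI's guideline: ~10 input tokens/s, ~20 output tokens/s.
function realtimeTurnCost(
  inputSec: number,
  outputSec: number,
  perMInput: number,  // $/1M input audio tokens
  perMOutput: number  // $/1M output audio tokens
): number {
  const inputTokens = inputSec * 10;
  const outputTokens = outputSec * 20;
  return (inputTokens * perMInput + outputTokens * perMOutput) / 1e6;
}

const full = realtimeTurnCost(60, 30, 32, 64); // gpt-realtime
const mini = realtimeTurnCost(60, 30, 10, 20); // gpt-realtime-mini
console.log(full.toFixed(4), mini.toFixed(4)); // 0.0576 0.0180
```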
The headline: speech-to-speech can be more expensive on the premium model tier, but the smaller realtime tier can land in the same range as many cascades. Let experience and control requirements drive the architecture decision first; then tune cost by choosing the right tier and tightening turn lengths.
Optional: when to use WebSockets for Realtime
WebRTC should be your default for browser clients. WebSockets still show up when you need a server-side bridge (telephony, SIP, or custom audio routing). If you do WebSockets, you are responsible for chunking and base64 encoding audio and sending input_audio_buffer.append events. Start with the event reference: https://platform.openai.com/docs/api-reference/realtime-client-events/input_audio_buffer and the WebSocket guide: https://platform.openai.com/docs/guides/realtime-websocket
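A sketch of the chunk-and-append step that path requires, assuming a Node server bridging 16-bit little-endian PCM frames (the `input_audio_buffer.append` event type is from the client events reference above; the helper name is mine):

```typescript
// Build an input_audio_buffer.append event from one PCM16 audio frame.
// Assumes Node (Buffer available); frame holds little-endian 16-bit samples.
function appendEvent(frame: Int16Array): string {
  const bytes = Buffer.from(frame.buffer, frame.byteOffset, frame.byteLength);
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: bytes.toString("base64"),
  });
}

// Usage: for each small frame from your telephony/SIP bridge, something like:
// ws.send(appendEvent(frame));
const ev = JSON.parse(appendEvent(new Int16Array([0, 1, -1])));
console.log(ev.type); // "input_audio_buffer.append"
```

Check the event reference for the expected audio format and commit/clear events before relying on this shape; the point here is only that your server owns framing and base64 encoding on the WebSocket path.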
When to use what (my rule-of-thumb)
Choose cascaded if:
- You need auditability (transcripts, prompts, text outputs, tool decisions).
- You need strict policy gates (PII handling, regulated actions, scripted disclosures).
- You want to iterate fast by swapping components and measuring impact.
- Your product can tolerate a slightly more “turn-based” feel.
Choose speech-to-speech if:
- Your product lives or dies by interruptions and turn-taking (sales, coaching, companions).
- You need the most natural prosody and pacing you can get.
- You are comfortable investing in new observability: audio logs, event traces, and careful evaluation.
My personal additions from production voice work:
- Most “voice agent” bugs are not model bugs. They are endpointing and state bugs: barge-in, half-duplex audio paths, and when you decide a user is done speaking.
- You should treat latency as a budget, not a number. Measure it per stage. Decide where you can spend it.
- Plan for failure: dropouts, partial transcripts, tool timeouts, and what you do when the agent is wrong.
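The "latency as a budget" point can be made concrete with a small per-stage tracker; the stage names and the 800ms budget below are illustrative, not a standard:

```typescript
// Track per-stage latency against a total turn budget.
class LatencyBudget {
  private stages = new Map<string, number>();

  constructor(private budgetMs: number) {}

  record(stage: string, ms: number): void {
    this.stages.set(stage, (this.stages.get(stage) ?? 0) + ms);
  }

  totalMs(): number {
    let total = 0;
    this.stages.forEach((ms) => { total += ms; });
    return total;
  }

  overBudget(): boolean {
    return this.totalMs() > this.budgetMs;
  }

  report(): string {
    const parts: string[] = [];
    this.stages.forEach((ms, stage) => parts.push(`${stage}=${ms}ms`));
    return `${parts.join(" ")} total=${this.totalMs()}ms budget=${this.budgetMs}ms`;
  }
}

// Example: spending an 800ms turn budget across stages.
const budget = new LatencyBudget(800);
budget.record("endpointing", 200);
budget.record("asr_final", 150);
budget.record("llm_ttft", 180);
budget.record("tts_first_audio", 220);
console.log(budget.report(), budget.overBudget()); // total=750ms, within budget
```

Once every turn emits a report like this, "the agent feels slow" becomes "endpointing is eating 40% of the budget," which is an actionable bug.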
If you ship with those constraints in mind, both architectures can deliver a great experience. The difference is where you want the complexity to live.