OpenAI Realtime: voice agents are finally production-ready

By Nick Beaugeard · 7 minute read

For a decade, "AI voice agent" meant a recorded menu followed by a transcription step, a text model and a TTS layer, plus the unmistakable feeling of talking to a robot. The Realtime API collapses that pipeline into one continuous, low-latency, speech-in speech-out stream. In real customer conversations, the gap between a stitched pipeline and an end-to-end voice model is the gap between "interesting demo" and "I forgot it was AI."

We have been building on Realtime for several months. Here is what we have learned.

Why Realtime is different

The headline improvements that matter in production:

Latency. Time-to-first-audio of a few hundred milliseconds, end-to-end. Conversational pacing instead of walkie-talkie pauses.
Interruptibility. Users can talk over the model and it stops. This single feature does more for natural conversation than any prompt engineering.
Tone preservation. Sarcasm, uncertainty, urgency - the model picks up the user's tone and matches it. Stitched STT-LLM-TTS pipelines flatten everything to text and lose this entirely.
Tool use mid-conversation. The model can call your tools (look up a customer, book a slot, check a balance) without breaking flow. The "let me check that for you" pause is now optional and short.
Voice variety. Multiple, distinct voices, with adjustable style. Useful for matching brand tone.
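
To make the tool-use point concrete, here is a minimal sketch of the server-side half: taking a tool name and JSON arguments requested by the model, running the matching handler, and always handing back something the model can speak from. It deliberately leaves out the Realtime connection itself. The names `TOOLS`, `lookup_customer` and `dispatch_tool` are hypothetical, and the default timeout is an assumed value chosen to mirror the lookup budget given later in this post.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable

# Hypothetical registry: tool name -> async handler taking keyword arguments.
ToolHandler = Callable[..., Awaitable[dict[str, Any]]]


async def lookup_customer(phone: str) -> dict[str, Any]:
    # Stand-in for a real CRM lookup; the sleep simulates network latency.
    await asyncio.sleep(0.05)
    return {"phone": phone, "name": "Example Customer"}


TOOLS: dict[str, ToolHandler] = {"lookup_customer": lookup_customer}


async def dispatch_tool(
    name: str, arguments_json: str, timeout_s: float = 0.4
) -> dict[str, Any]:
    """Run one tool call; never raise, so the conversation can keep flowing."""
    handler = TOOLS.get(name)
    if handler is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        args = json.loads(arguments_json)
        # Enforce the latency budget: a slow tool becomes a spoken fallback.
        result = await asyncio.wait_for(handler(**args), timeout=timeout_s)
        return {"ok": True, "result": result}
    except asyncio.TimeoutError:
        return {"ok": False, "error": "timeout"}
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, or arguments that do not match the handler signature.
        return {"ok": False, "error": f"bad arguments: {exc}"}
```

Returning a structured error instead of raising is a design choice: the model can say "I couldn't find that, let me get someone for you" rather than going silent.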

The latency budget

Voice is unforgiving. A 1.5-second pause feels thoughtful. A 4-second pause feels broken. We work to this budget for any new Realtime deployment:

Slice	Target	Why
User stops speaking to first model audio	< 700 ms	Conversational responsiveness
Tool call round trip (lookup)	< 400 ms	Avoids the awkward "thinking" pause
Interrupt response (model stops)	< 200 ms	Below this is unnoticeable; above this feels rude
End-of-conversation hangup	< 1.5 s after intent	Don't leave users hanging

Hit those numbers and the agent feels human. Miss them and the agent feels like every IVR you have ever cursed.
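
One way to keep a deployment honest against this budget is to check measured latencies automatically. The sketch below is illustrative only: the slice keys, the helper names and the choice of the 95th percentile are assumptions, and only the millisecond targets come from the table above.

```python
import math

# Targets in milliseconds, copied from the budget table.
# The slice keys are hypothetical names for this sketch.
BUDGET_MS = {
    "first_audio": 700,
    "tool_round_trip": 400,
    "interrupt_stop": 200,
    "hangup": 1500,
}


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, with pct in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_budget(samples: dict[str, list[float]], pct: float = 95) -> dict[str, bool]:
    """Map each measured slice to True if its percentile is under target."""
    return {
        name: percentile(values, pct) < BUDGET_MS[name]
        for name, values in samples.items()
    }


if __name__ == "__main__":
    measured = {
        "first_audio": [500, 600, 650, 900],  # one slow turn blows the p95
        "interrupt_stop": [100, 120, 150],
    }
    print(check_budget(measured))
```

Checking a high percentile rather than the mean is a judgment call: callers remember the worst pauses, not the average one.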

Where it falls down

Realtime is not magic. The failure modes worth designing around:

Background noise. Cafes, car interiors, open-plan offices, kids in the next room. Push-to-talk or VAD tuning matters more than people expect.
Accents and code-switching. The model handles standard English well but still wobbles on heavy regional accents and bilingual switching mid-sentence.
Numbers, dates, IDs. Use read-back confirmation for anything where a digit matters. "Did you say four-seven or four-seven-five?" should be built into the flow.
Hallucination at length. Long calls drift, like any LLM. Tight system prompts, explicit goal management, and forced summarisation at hand-off points all help.
Identity and consent. Australian privacy law and professional standards require care here. Always identify the agent as AI, always confirm consent to recording, always have a human escalation path.
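
The read-back idea for numbers and IDs can be sketched in a few lines. This assumes the agent already holds the value it thinks it heard as text; `spoken_digits` and `confirmation_prompt` are hypothetical helper names, and the prompt wording is only an example.

```python
DIGIT_WORDS = [
    "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine",
]


def spoken_digits(value: str) -> str:
    """Spell out each ASCII digit in value, ignoring spaces and punctuation."""
    digits = [ch for ch in value if ch in "0123456789"]
    if not digits:
        raise ValueError("no digits to read back")
    return "-".join(DIGIT_WORDS[int(ch)] for ch in digits)


def confirmation_prompt(heard: str) -> str:
    """Build the sentence the agent says before acting on a number."""
    return f"I heard {spoken_digits(heard)}. Is that right?"


if __name__ == "__main__":
    print(confirmation_prompt("475"))
```

Reading digits one at a time avoids the ambiguity of grouped numbers such as "forty-seven five".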

The shape of a Realtime deployment

The pattern that works, in our experience:

One clear job. Booking, intake, qualification, FAQ - not "help with anything."
Three to five tools, no more. Lookup, write-back, schedule, escalate, end. Each one tested in isolation before being wired in.
Hard guardrails in the system prompt. Identity, scope, refusal patterns, escalation triggers. Tight, not chatty.
A live transcript to a human dashboard. So a supervisor can take over when the agent says "let me get someone for you."
Eval set for every prompt change. A fixed bank of recorded calls, replayed offline, scored against expected outcomes. Without this, "I tweaked the prompt" is a roll of the dice.
Recording and review. Every call captured, reviewed weekly, fed back into the eval set.
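
The eval-set step is the easiest of these to illustrate. The sketch below assumes each recorded call has been reduced to a transcript and one expected outcome label, and that the agent under test can be wrapped as a function from transcript to outcome. `EvalCase`, `run_evals` and the outcome labels are hypothetical; a real harness would replay audio and score richer outcomes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    call_id: str
    transcript: str        # stand-in for a recorded call
    expected_outcome: str  # e.g. "booked" or "escalated"


def run_evals(cases: list[EvalCase], agent: Callable[[str], str]) -> dict:
    """Replay every case through the agent and report which ones regressed."""
    failures = [
        case.call_id
        for case in cases
        if agent(case.transcript) != case.expected_outcome
    ]
    return {
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "failures": failures,
    }


if __name__ == "__main__":
    bank = [
        EvalCase("c1", "I'd like an appointment on Tuesday", "booked"),
        EvalCase("c2", "I want to speak to a person", "escalated"),
    ]

    # Toy agent for demonstration; a real one would wrap the deployed prompt.
    def toy_agent(transcript: str) -> str:
        return "escalated" if "person" in transcript else "booked"

    print(run_evals(bank, toy_agent))
```

Running this before and after every prompt change, and blocking the change when `failures` is non-empty, is one way to turn "I tweaked the prompt" into a measurable step.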

What to build first

The Realtime use cases that have produced real ROI for our clients, in order of how quickly they paid back:

Inbound booking and scheduling. Clinics, trades, professional services. The agent answers, qualifies, books, hangs up. Replaces a receptionist's bottleneck without replacing the receptionist.
After-hours intake. Law firms, conveyancers, accountants. Capture the matter, the urgency, the contact details. Email a structured summary to the duty solicitor in the morning.
Outbound qualification. Warm leads called within minutes of submitting a form, qualified, booked into a calendar before they have closed the tab.
Voice-driven internal tools. "Tell me the status of job 4172." For tradies, field staff, ops teams. The hands-free interface is a genuine productivity unlock.
Customer support triage. Take the first 60 seconds, route or resolve. The cheap calls finish themselves. The hard calls land at a human with context already captured.

What we are doing with it

Inside Symphony we treat voice agents the same way we treat any other production component. Evals first, observability on, tool use scoped, escalation paths wired, recordings retained. The same engineering discipline that turns a vibe-coded demo into a system you can sell.

For clients, the projects we are seeing demand for are concentrated in scheduling, intake and qualification. The economics are obvious once you do the maths on receptionist hours per booked appointment.

Voice was the last interface AI could not do well. As of now, it can. The opportunity is not to replace humans on phones. It is to use AI to take the boring 60 percent of calls that should never have needed a human.

The short version

OpenAI's Realtime API has moved voice AI from "impressive demo" to "deployable product." It is fast enough, interruptible enough, and good enough at tool use to power real customer-facing scenarios. The engineering required to ship it well is the same engineering required to ship any other AI feature: evals, observability, narrow scope, tight tools, human escalation, and an honest eye for failure modes.

If you are looking at where to spend your AI budget in the next two quarters, voice should be on the shortlist. If you want a senior view on whether your use case is a good fit, get in touch.