OpenAI Realtime: voice agents are finally production-ready
The Realtime API turns voice AI from a flashy demo into something you can deploy to customers. Here is what makes it different, where it breaks, and what we have shipped on it so far.
For a decade, "AI voice agent" meant a recorded menu followed by a transcription step, a text model, a TTS layer, and the unmistakable feeling of talking to a robot. The Realtime API collapses that pipeline into one continuous, low-latency, speech-in, speech-out stream. In real customer conversations, the difference between a stitched pipeline and an end-to-end voice model is the difference between "interesting demo" and "I forgot it was AI."
We have been building on Realtime for several months. Here is what we have learned.
Why Realtime is different
The headline improvements that matter in production:
- Latency. Time-to-first-audio of a few hundred milliseconds, end-to-end. Conversational pacing instead of walkie-talkie pauses.
- Interruptibility. Users can talk over the model and it stops. This single feature does more for natural conversation than any prompt engineering.
- Tone preservation. Sarcasm, uncertainty, urgency — the model picks up the user's tone and matches it. Stitched STT-LLM-TTS pipelines flatten everything to text and lose this entirely.
- Tool use mid-conversation. The model can call your tools (look up a customer, book a slot, check a balance) without breaking flow. The "let me check that for you" pause is now optional and short.
- Voice variety. Multiple, distinct voices, with adjustable style. Useful for matching brand tone.
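To make the interruptibility point concrete, here is a minimal sketch of client-side barge-in handling. It assumes the event type names `response.created`, `response.done`, `input_audio_buffer.speech_started` and `response.cancel` as we understand them from the Realtime API event reference; check them against the current docs. The `send` and `flush_playback` callables are hypothetical hooks into your own transport and audio player. Depending on turn-detection settings, the server may cancel the response itself, in which case only the local flush matters.

```python
from typing import Callable


class BargeInHandler:
    """Stops the model's reply when the user starts talking over it."""

    def __init__(self, send: Callable[[dict], None], flush_playback: Callable[[], None]):
        self.send = send                      # hypothetical: sends a client event to the API
        self.flush_playback = flush_playback  # hypothetical: drops audio queued in the local player
        self.response_active = False

    def on_server_event(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "response.created":
            self.response_active = True
        elif kind == "response.done":
            self.response_active = False
        elif kind == "input_audio_buffer.speech_started" and self.response_active:
            # User spoke while the model was answering: cancel generation,
            # then discard audio that was buffered but not yet played.
            self.send({"type": "response.cancel"})
            self.flush_playback()
            self.response_active = False
```

The local flush is the part that is easy to forget: cancelling generation does nothing for audio already sitting in the playback buffer.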
The latency budget
Voice is unforgiving. A 1.5-second pause feels thoughtful. A 4-second pause feels broken. We work to this budget for any new Realtime deployment:
| Slice | Target | Why |
|---|---|---|
| User stops speaking to first model audio | < 700 ms | Conversational responsiveness |
| Tool call round trip (lookup) | < 400 ms | Avoids the awkward "thinking" pause |
| Interrupt response (model stops) | < 200 ms | Under 200 ms goes unnoticed; anything slower feels rude |
| End-of-conversation hangup | < 1.5 s after intent | Don't leave users hanging |
Hit those numbers and the agent feels human. Miss them and the agent feels like every IVR you have ever cursed.
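One way to keep that budget honest is to check measured timings against it automatically. The sketch below encodes the four targets from the table; the slice names and the `budget_violations` helper are our own illustrative choices, and how you measure each slice depends on your stack.

```python
# Targets from the budget table, in milliseconds. Key names are illustrative.
BUDGET_MS = {
    "first_audio": 700,       # user stops speaking to first model audio
    "tool_round_trip": 400,   # lookup-style tool call
    "interrupt_stop": 200,    # model stops after the user interrupts
    "hangup": 1500,           # hangup after end-of-conversation intent
}


def budget_violations(measured_ms: dict[str, float]) -> dict[str, float]:
    """Return {slice: overshoot_ms} for each measured slice that misses its target.

    The targets are strict ("< 700 ms"), so a value equal to the limit
    counts as a miss with an overshoot of 0.
    """
    return {
        name: measured_ms[name] - limit
        for name, limit in BUDGET_MS.items()
        if name in measured_ms and measured_ms[name] >= limit
    }
```

Run it over the timings from each recorded call and alert on any non-empty result, rather than waiting for a user to tell you the agent feels slow.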
Where it falls down
Realtime is not magic. The failure modes worth designing around:
- Background noise. Cafes, car interiors, open-plan offices, kids in the next room. Push-to-talk or VAD tuning matters more than people expect.
- Accents and code-switching. The model handles standard English well but still wobbles on heavy regional accents and on bilingual switching mid-sentence.
- Numbers, dates, IDs. Read the value back and confirm it wherever a digit matters. "Did you say four-seven or four-seven-five?" should be built into the flow.
- Hallucination at length. Long calls drift, like any LLM. Tight system prompts, explicit goal management, and forced summarisation at hand-off points all help.
- Identity and consent. Australian privacy law and professional standards require care here. Always identify the agent as AI, always confirm consent to recording, always have a human escalation path.
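The read-back step for numbers and IDs can be sketched as a small helper that spells out each digit, so the agent confirms "four-seven-five" rather than "four hundred and seventy-five". The function names and the prompt wording are illustrative assumptions, not part of any API.

```python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spoken_digits(value: str) -> str:
    """Spell out the ASCII digits in value one at a time, ignoring spaces and dashes."""
    return "-".join(DIGIT_WORDS[int(ch)] for ch in value if ch in "0123456789")


def confirmation_prompt(label: str, value: str) -> str:
    """Build the sentence the agent says before acting on a number."""
    return f"Just to confirm, your {label} is {spoken_digits(value)}. Is that right?"
```

For example, `confirmation_prompt("job number", "4172")` yields "Just to confirm, your job number is four-one-seven-two. Is that right?"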
The shape of a Realtime deployment
The pattern that works, in our experience:
- One clear job. Booking, intake, qualification, FAQ — not "help with anything."
- Three to five tools, no more. Lookup, write-back, schedule, escalate, end. Each one tested in isolation before being wired in.
- Hard guardrails in the system prompt. Identity, scope, refusal patterns, escalation triggers. Tight, not chatty.
- A live transcript streamed to a human dashboard, so a supervisor can take over when the agent says "let me get someone for you."
- Eval set for every prompt change. A fixed bank of recorded calls, replayed offline, scored against expected outcomes. Without this, "I tweaked the prompt" is a roll of the dice.
- Recording and review. Every call captured, reviewed weekly, fed back into the eval set.
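The eval-set idea can be sketched in a few lines. This is a minimal harness under stated assumptions: each case pairs a recorded call (represented here by its transcript) with the outcome we expect, and `agent` is a hypothetical callable that replays the call offline and returns the outcome the agent reached. A real harness would replay audio and score more than a single label.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    call_id: str
    transcript: str        # stand-in for the recorded call
    expected_outcome: str  # e.g. "booked", "escalated"


def run_evals(cases: list[EvalCase], agent: Callable[[str], str]) -> dict:
    """Replay every case through the agent and report the pass rate and failing call IDs."""
    failures = [c.call_id for c in cases if agent(c.transcript) != c.expected_outcome]
    passed = len(cases) - len(failures)
    return {
        "pass_rate": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }
```

Run it before and after every prompt change and compare the failure lists, not just the pass rate: a change that fixes two calls and breaks two others leaves the headline number unchanged.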
What to build first
The Realtime use cases that have produced real ROI for our clients, in order of how quickly they paid back:
- Inbound booking and scheduling. Clinics, trades, professional services. The agent answers, qualifies, books, hangs up. Replaces a receptionist's bottleneck without replacing the receptionist.
- After-hours intake. Law firms, conveyancers, accountants. Capture the matter, the urgency, the contact details. Email a structured summary to the duty solicitor in the morning.
- Outbound qualification. Warm leads called within minutes of submitting a form, qualified, booked into a calendar before they have closed the tab.
- Voice-driven internal tools. "Tell me the status of job 4172." For tradies, field staff, ops teams. The hands-free interface is a genuine productivity unlock.
- Customer support triage. Take the first 60 seconds, route or resolve. The cheap calls finish themselves. The hard calls land at a human with context already captured.
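For the after-hours intake case, the "structured summary" can be as simple as a fixed record that the agent fills in during the call and that is rendered into an email afterwards. The schema below is a hypothetical example; the field names and the urgency values are assumptions to adapt to your own practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntakeSummary:
    """Fields the agent captures during an after-hours call (hypothetical schema)."""
    caller_name: str
    phone: str
    matter: str
    urgency: str  # assumed values: "routine" or "urgent"

    def subject(self) -> str:
        # Flag urgent matters so they stand out in the morning inbox.
        prefix = "[URGENT] " if self.urgency == "urgent" else ""
        return f"{prefix}After-hours intake: {self.matter}"

    def body(self) -> str:
        return "\n".join([
            f"Caller: {self.caller_name}",
            f"Phone: {self.phone}",
            f"Matter: {self.matter}",
            f"Urgency: {self.urgency}",
        ])
```

A fixed schema also gives the agent an explicit goal: the call is not finished until every field is filled or the caller has declined to give it.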
What we are doing with it
Inside Symphony we treat voice agents the same way we treat any other production component. Evals first, observability on, tool use scoped, escalation paths wired, recordings retained. The same engineering discipline that turns a vibe-coded demo into a system you can sell.
For clients, the projects we are seeing demand for are concentrated in scheduling, intake and qualification. The economics are obvious once you do the maths on receptionist hours per booked appointment.
Voice was the last interface AI could not do well. As of now, it can. The opportunity is not to replace humans on phones. It is to use AI to take the boring 60 percent of calls that should never have needed a human.
The short version
OpenAI's Realtime API has moved voice AI from "impressive demo" to "deployable product." It is fast enough, interruptible enough, and good enough at tool use to power real customer-facing scenarios. The engineering required to ship it well is the same engineering required to ship any other AI feature: evals, observability, narrow scope, tight tools, human escalation, and an honest eye for failure modes.
If you are looking at where to spend your AI budget in the next two quarters, voice should be on the shortlist. If you want a senior view on whether your use case is a good fit, get in touch.