Notes from the Sydney OpenAI hackathon

By Nick Beaugeard · 6 minute read

I spent the day at the Sydney OpenAI hackathon. One day, one room, a few hundred builders, and the same models everyone else has. What separates the projects that win from the projects that almost work is no longer the model. It is the engineering around it - and, more than I expected, the polish on top.

These are my notes, written for the engineers and founders who keep asking me what is actually changing at the coalface of AI product work.

What teams built

The popular categories were predictable and revealing:

Vertical agents. Tax, conveyancing, claims handling, employment law, payroll. The pattern was always the same: one workflow, real data, a tight loop with a human in the chair.
Internal copilots. Sales call prep, RFP response, ticket triage. The teams who scoped this hard shipped something working before lunch.
Voice and Realtime. Several teams pushed on OpenAI's Realtime API for live, low-latency voice agents. The good ones felt like a colleague. The bad ones felt like 2014's IVR with a personality.
Multi-agent orchestrations. Specialist agents handing off to each other. Beautiful when they worked. Spectacular failures when they did not.

What won

With only one day on the clock, the teams that took home prizes had a very particular shape. Five things separated them from the rest of the room.

Demo craft beats feature count

The winners obsessed over the demo. Not the product behind it - the demo. They picked a three-minute story, rehearsed it, removed every step that did not advance the narrative, and pre-loaded happy-path data so nothing flaky touched the stage. The teams that tried to show off the breadth of what they had built got asked about the one feature that broke. The teams that showed one rehearsed flow, end to end, got asked when they could sell it. On a one-day clock, the demo is the product.

Look and feel did serious work

The judging skewed hard toward look and feel. Clean typography, considered spacing, a recognisable visual identity, a polished landing page, smooth transitions, voice and tone that matched the brand they had invented that morning. Teams that spent the last two hours on Tailwind, a colour palette and a punchy headline outscored teams that spent those hours on a marginally better backend. Unfair? Maybe. But entirely consistent with how real products get adopted.

Narrow scope

Every winning project did one thing. Not a platform. Not a "personal assistant for X." A single workflow, end to end. Narrow scope was what made the demo craft and the visual polish achievable inside a single day.

Tool use, not chat

The best products were not chatbots. They were agents with three or four well-defined tools, called in the right order, with clear failure modes. A search tool, a calculation tool, a write-back tool, and a human-escalation tool. That is the shape of useful, not the shape of impressive.
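
A minimal sketch of that shape, in Python. This is not any team's code: the `ToolRegistry` class, the four tool functions and the invoice data are all hypothetical, and the model call that would normally pick the next tool is replaced by a fixed `run_workflow` sequence so the example runs on its own. The point is the structure: a small fixed set of tools, a dispatcher that refuses unknown names, and escalation to a human as the explicit failure path.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ToolResult:
    ok: bool
    value: Any = None
    error: str = ""


class ToolRegistry:
    """Holds a small, fixed set of tools and refuses anything else."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> ToolResult:
        fn = self._tools.get(name)
        if fn is None:
            # The caller asked for a tool that does not exist: fail loudly.
            return ToolResult(ok=False, error=f"unknown tool: {name}")
        try:
            return ToolResult(ok=True, value=fn(**kwargs))
        except Exception as exc:
            # Any tool failure becomes a structured result, not a crash.
            return ToolResult(ok=False, error=f"{name} failed: {exc}")


# Hypothetical tools, for illustration only.
def search(query: str) -> list:
    records = {"invoice 42": {"amount": 120.0, "tax_rate": 0.1}}
    return [records[query]] if query in records else []


def calculate_tax(amount: float, tax_rate: float) -> float:
    return round(amount * tax_rate, 2)


def write_back(record_id: str, field: str, value: Any) -> str:
    return f"{record_id}.{field}={value}"


def escalate(reason: str) -> str:
    return f"escalated to human: {reason}"


registry = ToolRegistry()
for tool_name, tool_fn in [
    ("search", search),
    ("calculate_tax", calculate_tax),
    ("write_back", write_back),
    ("escalate", escalate),
]:
    registry.register(tool_name, tool_fn)


def run_workflow(query: str) -> str:
    """Search, calculate, write back; escalate whenever a step cannot proceed."""
    found = registry.call("search", query=query)
    if not found.ok or not found.value:
        return registry.call("escalate", reason=f"no record for {query!r}").value
    record = found.value[0]
    tax = registry.call(
        "calculate_tax", amount=record["amount"], tax_rate=record["tax_rate"]
    )
    if not tax.ok:
        return registry.call("escalate", reason=tax.error).value
    return registry.call(
        "write_back", record_id=query, field="tax", value=tax.value
    ).value
```

With this sketch, `run_workflow("invoice 42")` returns `"invoice 42.tax=12.0"`, while `run_workflow("missing")` lands on the escalation tool instead of guessing.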

Evals where there was time

A handful of teams built a tiny evaluation set in the first hour. Even a dozen handcrafted examples with expected outputs gave them a way to iterate on quality instead of vibes. With one day on the clock, most teams skipped this - understandably - but the ones who did not skip it shipped noticeably more reliable demos. We use the same discipline inside Symphony for the same reason.
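
For a sense of how small such a set can be, here is a hedged sketch in Python. It is illustrative only, not the harness any team or Symphony uses: the ticket-triage cases, the `keyword_triage` stand-in for a model call and the `run_evals` helper are all assumptions made up for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical handcrafted cases: (input, expected label).
EVAL_SET: List[Tuple[str, str]] = [
    ("My invoice shows the wrong amount", "billing"),
    ("I cannot reset my password", "access"),
    ("Please refund my last payment", "billing"),
    ("The login page keeps timing out", "access"),
    ("Where can I find your opening hours?", "general"),
    ("I was charged twice this month", "billing"),
]


def keyword_triage(ticket: str) -> str:
    """Stand-in for the system under test; a real one would call a model."""
    text = ticket.lower()
    if "invoice" in text or "refund" in text:
        return "billing"
    if "password" in text or "login" in text:
        return "access"
    return "general"


def run_evals(
    system: Callable[[str], str], cases: List[Tuple[str, str]]
) -> Tuple[int, List[Tuple[str, str, str]]]:
    """Return the number of passes and a list of (input, expected, actual) failures."""
    failures = []
    for prompt, expected in cases:
        actual = system(prompt).strip().lower()
        if actual != expected:
            failures.append((prompt, expected, actual))
    return len(cases) - len(failures), failures


passed, failures = run_evals(keyword_triage, EVAL_SET)
print(f"{passed}/{len(EVAL_SET)} passed")
for prompt, expected, actual in failures:
    print(f"FAIL: {prompt!r} expected {expected!r}, got {actual!r}")
```

Run as written, the stand-in passes 5 of 6 cases and fails on "I was charged twice this month". That one named failure is the useful part: a prompt or logic change can be re-scored against the same cases in seconds.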

What broke

The failures had common roots:

Cold-start latency on Realtime. Live voice is unforgiving. A 1.2-second time-to-first-token feels conversational. A 4-second one feels broken. Several teams discovered this on stage.
Tool-call drift. Agents started inventing tool names that did not exist when prompts got long. The fix is boring and known: tighter system prompts, fewer tools, explicit tool descriptions, evals that catch this.
Context window overflows. Long-running agents quietly truncated their own context and started forgetting earlier decisions. Most teams had not noticed until a judge asked an awkward follow-up.
Eval-free debugging. Without evals, "fix the bug" looked like changing a prompt and hoping. Teams with evals could measure whether their fix actually moved quality.
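
The context overflow failure is the easiest of these to guard against in code. The sketch below, in Python, is one possible approach and not something I saw a team ship: it trims history against an explicit budget, never drops messages marked as pinned decisions, and reports how many turns were dropped so the truncation is not silent. The `Message` type, the `pinned` flag, `fit_to_budget` and the one-token-per-word estimate are all assumptions for illustration; a real system would use the model's own tokenizer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Message:
    role: str
    text: str
    pinned: bool = False  # decisions the agent must not forget


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def fit_to_budget(history: List[Message], budget: int) -> Tuple[List[Message], int]:
    """Keep every pinned message, then the most recent unpinned turns that fit.

    Returns the kept messages in their original order and the number dropped.
    """
    pinned_cost = sum(estimate_tokens(m.text) for m in history if m.pinned)
    if pinned_cost > budget:
        raise ValueError("pinned messages alone exceed the context budget")

    remaining = budget - pinned_cost
    keep = [m.pinned for m in history]
    # Walk backwards so the newest unpinned turns are kept first.
    for i in range(len(history) - 1, -1, -1):
        if history[i].pinned:
            continue
        cost = estimate_tokens(history[i].text)
        if cost > remaining:
            break  # stop here so the kept recent turns stay contiguous
        remaining -= cost
        keep[i] = True

    kept = [m for m, k in zip(history, keep) if k]
    return kept, len(history) - len(kept)


history = [
    Message("system", "You are a claims assistant", pinned=True),
    Message("user", "Customer prefers email contact only", pinned=True),
    Message("assistant", "Noted I will use email"),
    Message("user", "What is the status of my claim"),
    Message("assistant", "It is awaiting assessor review"),
]

kept, dropped = fit_to_budget(history, budget=22)
print(f"kept {len(kept)} messages, dropped {dropped}")
```

With a budget of 22 estimated tokens, the two pinned messages and the two most recent turns survive and one older turn is dropped. The drop count is worth logging: it is the signal most teams were missing until a judge asked the awkward follow-up.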

What it says about the market

Three takeaways for anyone trying to commercialise AI in the next 12 months:

The model is the cheapest part of the stack

Everyone had access to the same models, the same APIs, and roughly the same budget. The differentiator was always the engineering: evals, observability, tool design, fallback paths, human-in-the-loop UX. If your moat is "we use the best model", you do not have a moat.

Voice is suddenly serious

OpenAI's Realtime stack is good enough to ship into customer-facing scenarios. I have written separately about what we have been doing with it. Expect a wave of voice-first products in Australia over the next two quarters, particularly in scheduling, support and intake.

Vertical beats horizontal

The teams who picked a profession, a workflow and a data shape outshipped the teams trying to build a general assistant. This matches what we see in client work. A focused tool for one trade is almost always more valuable than a generic one for everyone.

What I took home

I came in expecting flashier models and left more convinced than ever that the work is in the wrapping. Evals, tool design, latency budgets, human-in-the-loop UX, failure modes, observability. The things that make AI products feel reliable are the things software engineers have always done well. We just have a new and powerful component to wrap.

The model is a battery, not a product. Whoever builds the best chassis wins.

If you are working on something in this space and want a second pair of senior eyes on it, get in touch. We do a lot of this work inside Symphony, and I am always happy to talk to founders trying to get from demo to production.