With new speech models in the OpenAI API, OpenAI is expanding what’s possible with voice: conversations, translations, and transcriptions can now be embedded in applications closer to real time. For organisations, this is most relevant wherever spoken content currently requires extra steps — in support workflows, cross-border collaboration, or anywhere spoken information needs to be captured quickly. This isn’t a ready-made feature for end users to click through; it’s aimed at teams who want to integrate voice capabilities directly into their own processes. Here’s what’s known so far and what to keep in mind before getting started.
OpenAI is positioning the new speech models as API building blocks for real-time audio, translation, and transcription. According to OpenAI, the lineup includes GPT-Realtime-2 for spoken interaction, GPT-Realtime-Translate for live translation, and GPT-Realtime-Whisper for transcription. The core value isn’t a single new interface — it’s the ability to embed voice-based steps directly into existing applications and workflows.
One important framing note: this sits outside a standard Microsoft 365 UI and requires technical implementation via the OpenAI API or Microsoft Foundry.
There is no end-user click path: access is technical, through the Playground, the API documentation, and the relevant real-time endpoints. For GPT-Realtime-2 the endpoint is v1/realtime; for translation it is v1/realtime/translations; and for transcription it is v1/realtime/transcription_sessions.
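To make the endpoint split concrete, here is a minimal Python sketch that maps each use case to the path named above and builds a full URL from it. The base URL is an assumption (the standard OpenAI API host; a Microsoft Foundry deployment would use its own), and the use-case keys and the helper `endpoint_url` are hypothetical names chosen for illustration. It opens no connection and handles no authentication or session setup.

```python
from urllib.parse import urljoin

# Assumption: the standard OpenAI API host. A Microsoft Foundry
# deployment would expose a different base URL.
BASE_URL = "https://api.openai.com/"

# Endpoint paths as named in the article, keyed by (hypothetical) use-case labels.
REALTIME_ENDPOINTS = {
    "conversation": "v1/realtime",
    "translation": "v1/realtime/translations",
    "transcription": "v1/realtime/transcription_sessions",
}


def endpoint_url(use_case: str, base_url: str = BASE_URL) -> str:
    """Return the full URL for a use case; raise ValueError if it is unknown."""
    try:
        path = REALTIME_ENDPOINTS[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None
    return urljoin(base_url, path)


if __name__ == "__main__":
    for name in REALTIME_ENDPOINTS:
        print(f"{name}: {endpoint_url(name)}")
```

Keeping the paths in one table means a pilot can switch between conversation, translation, and transcription without scattering URL strings through the codebase.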
Room for improvement:
The new speech models in the OpenAI API are most relevant for organisations that want to make spoken information usable faster. The clearest value is in live translation, conversation notes, and voice-based assistance built into existing applications. At the same time, this isn’t a one-click solution for end users — it needs technical integration, clear data handling rules, and solid testing with real audio.
For teams that regularly deal with multilingual conversations or large volumes of spoken content, a small pilot is worth running. A practical starting point: transcribe a real meeting, run a short passage through translation, and review the results together — checking for clarity, speed, and how much correction the output actually needs.
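One way to put a number on "how much correction the output actually needs" is word error rate (WER): the word-level edit distance between the raw transcript and a human-corrected version, divided by the length of the corrected version. The sketch below is an illustration only; the article does not prescribe this metric, and the function name and sample sentences are hypothetical. It lowercases and splits on whitespace, so punctuation attached to a word counts as part of that word.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference,
    divided by the number of words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    corrected = "the meeting starts at nine"   # human-reviewed text
    raw = "the meeting start at nine"          # hypothetical model output
    print(f"WER: {word_error_rate(corrected, raw):.2f}")
```

Tracking this figure across a few real recordings gives the review a shared baseline, alongside the subjective checks for clarity and speed.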