Platform · Voice Engine

The Voice OS for Every Agent.

Streaming speech pipeline - hear, reason, act, speak - in real time on every call. Built on Pipecat AI with Deepgram STT, mid-call tool calling, ElevenLabs TTS, and full OpenTelemetry tracing per call session.

Overview

The Voice Engine is the real-time speech pipeline that turns the Core Engine into a phone agent. Built on Pipecat AI, it handles the full lifecycle of a phone call: receive audio from Twilio via TwiML WebSocket, transcribe it in real time using Deepgram, pass it through the Core Engine for reasoning and tool calling, synthesise the response with ElevenLabs or Deepgram TTS, and stream the audio back to the caller.

The pipeline is optimised for sub-second turn-taking. The caller hears a natural conversation while the agent is simultaneously looking up their record, creating a support ticket, and sending a confirmation email - all before the end of the agent's spoken reply.

The Voice Engine runs as its own Azure Container App on port 8001, separate from the Management API. It scales independently based on concurrent WebSocket connections and shares the same PostgreSQL, Redis, and RabbitMQ infrastructure as the rest of the platform.
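As a rough illustration of scaling on concurrent connections, the sketch below clamps a replica count to the 1 to 20 range given in the scaling section. The proportional rule and the `CALLS_PER_REPLICA` target are assumptions made for this example; the document does not state the per-replica target or how the scale rule is actually configured.

```python
import math

MIN_REPLICAS = 1        # scale-to-zero is disabled for voice
MAX_REPLICAS = 20       # upper bound given in the scaling section
CALLS_PER_REPLICA = 25  # assumption: target concurrent calls per replica


def desired_replicas(active_calls: int,
                     calls_per_replica: int = CALLS_PER_REPLICA) -> int:
    """Return a replica count proportional to active calls, clamped to the bounds."""
    if active_calls < 0 or calls_per_replica <= 0:
        raise ValueError("active_calls must be >= 0 and calls_per_replica > 0")
    wanted = math.ceil(active_calls / calls_per_replica)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, wanted))
```

With these assumed numbers, an idle engine still keeps one replica warm, and any load above 500 concurrent calls is capped at 20 replicas.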

Live call pipeline - action inside the conversation
  • Caller - PSTN
  • Twilio - TwiML
  • WSS - port 8001
  • Hear - STT
  • Think + Act - tools mid-call
  • Speak - TTS
  • Caller - natural response

Call evidence captured below every stream

  • Transcript - per call
  • Sentiment - signal
  • Outcome - resolved · escalated
  • Trace - end-to-end

Call Flow - Step By Step

  1. Call arrives - Twilio receives the inbound call (or the outbound scheduler places it) and opens a TwiML WebSocket to the Voice Engine on port 8001.
  2. Real-time transcription - the audio stream passes through Deepgram STT, which returns partial and final transcriptions in real time.
  3. Core Engine reasoning - the transcript is passed to the Core Engine (with conversation history and system prompt). The LLM reasons about the intent and may emit one or more tool calls.
  4. Mid-call tool calling - tool calls are routed via MCPRegistry to the configured connectors. The connector calls the external system (CRM lookup, ticket creation, email send) and returns a normalised result - while the call is live.
  5. Response synthesis - the LLM generates the spoken reply. ElevenLabs or Deepgram TTS synthesises it into audio.
  6. Audio delivery - the synthesised audio streams back to the caller through the WebSocket and Twilio.
  7. Call ends - transcript, outcome tags, and sentiment score are written to PostgreSQL.
  8. Trace persisted - a full OpenTelemetry trace is written for the session: one span per HTTP call, DB query, RabbitMQ message, and connector dispatch.
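Steps 2 to 6 can be sketched as a single asynchronous turn handler. This is a minimal illustration using only the standard library, not the Pipecat, Deepgram, or ElevenLabs APIs: every function name, the `TOOL_REGISTRY` dictionary (standing in for MCPRegistry), and the stubbed behaviour are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Hypothetical connector signature: takes arguments, returns a normalised result.
Connector = Callable[[dict], Awaitable[dict]]


async def crm_lookup(args: dict) -> dict:
    # Stand-in for a real CRM connector call.
    return {"customer": args["phone"], "tier": "gold"}


# Stand-in for the tool registry: maps tool names to connectors.
TOOL_REGISTRY: Dict[str, Connector] = {"crm_lookup": crm_lookup}


async def transcribe(audio: bytes) -> str:
    # Stand-in for streaming STT; here the "audio" is just UTF-8 text.
    return audio.decode("utf-8")


async def reason(transcript: str, history: List[str]) -> List[dict]:
    # Stand-in for the LLM: emits one tool call if the caller mentions an account.
    if "account" in transcript.lower():
        return [{"tool": "crm_lookup", "args": {"phone": "+15550100"}}]
    return []


async def dispatch(tool_calls: List[dict]) -> List[dict]:
    # Run all requested tool calls concurrently while the call is live.
    if not tool_calls:
        return []
    coros = [TOOL_REGISTRY[call["tool"]](call["args"]) for call in tool_calls]
    return list(await asyncio.gather(*coros))


async def synthesise(text: str) -> bytes:
    # Stand-in for TTS; returns the reply as bytes instead of audio.
    return text.encode("utf-8")


async def handle_turn(audio: bytes, history: List[str]) -> bytes:
    transcript = await transcribe(audio)            # step 2
    tool_calls = await reason(transcript, history)  # step 3
    results = await dispatch(tool_calls)            # step 4
    reply = f"I found {len(results)} record(s)." if results else "How can I help?"
    history.extend([transcript, reply])
    return await synthesise(reply)                  # steps 5 and 6
```

Calling `asyncio.run(handle_turn(b"Check my account", []))` walks one turn through all four stages. A real pipeline would stream audio frames between stages rather than pass whole buffers.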

Key Features

Feature - What it does

  • Streaming STT - Real-time transcription via Deepgram. Partial results are used for faster turn-taking; final results trigger reasoning.
  • Sub-second turns - The pipeline is optimised for low latency from end-of-speech detection to start-of-agent-speech. Typical latency is under one second for simple turns.
  • Mid-call tool calling - The agent can look up records, create tickets, and send emails while the call is live - before the end of the agent's spoken reply.
  • Inbound + outbound - Answers incoming calls via Twilio TwiML. Runs scheduled outbound campaigns from a CSV upload or trigger. The same agent configuration handles both.
  • Human handoff - Transfers to a live agent with the full transcript, context summary, and sentiment score. The receiving agent is never starting cold.
  • Call recording - Recordings stored on persistent volume claims (PVC) and linked to the call session record in the database.
  • Sentiment analysis - Per-call sentiment scoring (positive / neutral / negative) attached to each session and available in the analytics dashboard.
  • Per-call OTel trace - One OpenTelemetry trace per call session spanning HTTP, DB, RabbitMQ, and every connector call. Available in Grafana Tempo.
  • Outbound scheduler - APScheduler with PostgreSQL job store - scheduled outbound jobs survive pod restarts and infrastructure events.
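To make the human handoff concrete, the sketch below shows one possible shape for the data passed to the receiving agent: transcript, context summary, and sentiment. The class name, field names, and the "last few turns" summary are hypothetical. A real context summary would more plausibly be generated by the LLM.

```python
from dataclasses import dataclass, asdict
from typing import List

# The three sentiment labels named in the feature list.
SENTIMENTS = ("positive", "neutral", "negative")


@dataclass(frozen=True)
class HandoffPayload:
    call_id: str
    transcript: List[str]
    context_summary: str
    sentiment: str


def build_handoff(call_id: str, transcript: List[str], sentiment: str,
                  summary_turns: int = 2) -> HandoffPayload:
    """Bundle what the receiving agent needs so they do not start cold."""
    if sentiment not in SENTIMENTS:
        raise ValueError(f"sentiment must be one of {SENTIMENTS}")
    # Placeholder summary: simply the last few turns of the conversation.
    summary = " | ".join(transcript[-summary_turns:])
    return HandoffPayload(call_id, list(transcript), summary, sentiment)
```

`asdict(build_handoff(...))` yields a plain dictionary that could be serialised and attached to the transfer.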

Scaling & Performance

  • Separate ACA app - the Voice Engine runs independently from the Management API, scaling on concurrent WebSocket connections (the active_calls_total metric).
  • Min 1 / max 20 replicas - scale-to-zero is disabled for voice to prevent cold-start latency on inbound calls.
  • Redis cache - agent configuration is cached in Redis. If Redis is unavailable the engine degrades gracefully, reading from the database directly.
  • PostgreSQL job store - the outbound scheduler uses PostgreSQL for job persistence, ensuring no scheduled calls are lost across pod restarts.
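The graceful degradation described for the Redis cache can be sketched as a cache-aside read that falls back to the database when the cache is unreachable. The `CacheUnavailable` exception and the callables passed in are hypothetical stand-ins for illustration, not the platform's actual Redis or PostgreSQL clients.

```python
from typing import Callable, Optional


class CacheUnavailable(Exception):
    """Raised by the hypothetical cache client when Redis cannot be reached."""


def load_agent_config(
    agent_id: str,
    cache_get: Callable[[str], Optional[dict]],
    db_get: Callable[[str], dict],
    cache_set: Optional[Callable[[str, dict], None]] = None,
) -> dict:
    """Read agent configuration from the cache, falling back to the database."""
    try:
        cached = cache_get(agent_id)
        if cached is not None:
            return cached  # cache hit
    except CacheUnavailable:
        # Cache is down: degrade gracefully and read from the database directly.
        return db_get(agent_id)

    # Cache miss: read from the database and try to repopulate the cache.
    config = db_get(agent_id)
    if cache_set is not None:
        try:
            cache_set(agent_id, config)
        except CacheUnavailable:
            pass  # a failed cache write must not fail the call
    return config
```

The point of this shape is that a cache outage costs latency, not availability: every code path still returns a configuration as long as the database is reachable.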

Want to see a live voice agent demo?

We'll run a live inbound or outbound call with a configured agent, walk through the call trace in real time, and discuss what mid-call tool dispatch looks like for your systems.
