The Voice OS for Every Agent.

Streaming speech pipeline - hear, reason, act, speak - in real time on every call. Built on Pipecat AI with Deepgram STT, mid-call tool calling, ElevenLabs TTS, and full OpenTelemetry tracing per call session.

On this page

Overview
Call Flow - Step By Step
Key Features
Scaling & Performance
Related

Overview

The Voice Engine is the real-time speech pipeline that turns the Core Engine into a phone agent. Built on Pipecat AI, it handles the full lifecycle of a phone call: receive audio from Twilio via TwiML WebSocket, transcribe it in real time using Deepgram, pass it through the Core Engine for reasoning and tool calling, synthesize the response with ElevenLabs or Deepgram TTS, and stream the audio back to the caller.

The pipeline is optimized for sub-second turn-taking. The caller hears a natural conversation while the agent is simultaneously looking up their record, creating a support ticket, and sending a confirmation email - all before the end of the agent's spoken reply.

The Voice Engine runs as its own Azure Container App on port 8001, separate from the Management API. It scales independently based on concurrent WebSocket connections and shares the same PostgreSQL, Redis, and RabbitMQ infrastructure as the rest of the platform.
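
As a rough illustration of connection-based scaling, the sketch below computes a replica count from the number of live calls. It is a simplified model, not the platform's actual scaling rule: the function name `desired_replicas` and the `calls_per_replica` target are assumptions made for this example, and the default bounds of 1 and 20 are the limits listed under Scaling & Performance.

```python
import math


def desired_replicas(active_calls: int, calls_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Return a replica count for the current number of live calls.

    calls_per_replica is a hypothetical per-replica target, not a
    documented platform value.
    """
    if calls_per_replica <= 0:
        raise ValueError("calls_per_replica must be positive")
    # Round up so a partially filled replica still counts as a whole one.
    needed = math.ceil(max(active_calls, 0) / calls_per_replica)
    # Clamp to the configured bounds; the floor of 1 means no scale-to-zero.
    return max(min_replicas, min(max_replicas, needed))
```

With a hypothetical target of 25 calls per replica, an idle system stays at one replica, 60 live calls need three, and any load above 500 calls is capped at 20.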

Live call pipeline - action inside the conversation

Caller (PSTN) → Twilio (TwiML) → WSS:8001 → Hear (STT) → Think + Act (tools mid-call) → Speak (TTS) → Caller (natural response)

Call evidence captured below every stream

Transcript (per call) · Sentiment (signal) · Outcome (resolved · escalated) · Trace (end-to-end)

Call Flow - Step By Step

1. Call arrives - Twilio receives the inbound call (or the outbound scheduler places it) and opens a TwiML WebSocket to the Voice Engine at port 8001.
2. Real-time transcription - the audio stream passes through Deepgram STT, which returns partial and final transcriptions in real time.
3. Core Engine reasoning - the transcript is passed to the Core Engine (with conversation history and system prompt). The LLM reasons about the intent and may emit one or more tool calls.
4. Mid-call tool calling - tool calls are routed via MCPRegistry to the configured connectors. The connector calls the external system (CRM lookup, ticket creation, email send) and returns a normalized result - while the call is live.
5. Response synthesis - the LLM generates the spoken reply. ElevenLabs or Deepgram TTS synthesizes it into audio.
6. Audio delivery - the synthesized audio streams back to the caller through the WebSocket and Twilio.
7. Call ends - transcript, outcome tags, and sentiment score are written to PostgreSQL.
8. Trace persisted - a full OpenTelemetry trace is written for the session: one span per HTTP call, DB query, RabbitMQ message, and connector dispatch.
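
The steps above can be sketched as a single conversational turn. This is a minimal, self-contained illustration using only the Python standard library; it does not use the Pipecat, Deepgram, or ElevenLabs APIs. Every name in it (`CallSession`, `transcribe`, `plan`, `dispatch`, `compose_reply`, `synthesize`, `handle_turn`) is a hypothetical stand-in, and dispatching the tool calls concurrently is an assumption about one reasonable design, not a description of how the platform does it.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class CallSession:
    """Evidence collected for one call."""
    transcript: list = field(default_factory=list)
    tool_results: dict = field(default_factory=dict)


async def transcribe(audio: bytes) -> str:
    # Stand-in for streaming STT: treat the bytes as already-recognized text.
    return audio.decode("utf-8")


async def plan(text: str) -> list:
    # Stand-in for the LLM reasoning step: pick the tools the request needs.
    return ["crm_lookup", "create_ticket"] if "order" in text else []


async def dispatch(tool: str) -> tuple:
    # Stand-in for a connector call to an external system.
    await asyncio.sleep(0.01)  # simulated network latency
    return tool, "ok"


async def compose_reply(results: dict) -> str:
    # Stand-in for the LLM turning tool results into a spoken reply.
    if results:
        return "I found your record and opened a ticket."
    return "How can I help you today?"


async def synthesize(reply: str) -> bytes:
    # Stand-in for TTS: encode the reply as "audio".
    return reply.encode("utf-8")


async def handle_turn(session: CallSession, audio: bytes) -> bytes:
    text = await transcribe(audio)                      # step 2
    session.transcript.append(("caller", text))
    tools = await plan(text)                            # step 3
    # Step 4: run all requested tools concurrently while the call is live.
    results = dict(await asyncio.gather(*(dispatch(t) for t in tools)))
    session.tool_results.update(results)
    reply = await compose_reply(results)                # step 5
    session.transcript.append(("agent", reply))
    return await synthesize(reply)                      # audio for step 6
```

Calling `asyncio.run(handle_turn(CallSession(), b"where is my order"))` runs one turn end to end. In a real deployment each stub would be replaced by a streaming service call, and audio would flow in small frames rather than one complete utterance.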

Key Features

Feature	What it does
Streaming STT	Real-time transcription via Deepgram. Partial results are used for faster turn-taking; final results trigger reasoning.
Sub-second turns	The pipeline is optimized for low latency from end-of-speech detection to start-of-agent-speech. Typical latency is under one second for simple turns.
Mid-call tool calling	The agent can look up records, create tickets, and send emails while the call is live - before the end of the agent's spoken reply.
Inbound + outbound	Answers incoming calls via Twilio TwiML. Runs scheduled outbound campaigns from a CSV upload or trigger. The same agent configuration handles both.
Human handoff	Transfers to a live agent with the full transcript, context summary, and sentiment score, so the receiving agent never starts cold.
Call recording	Recordings stored on persistent volume claims (PVC) and linked to the call session record in the database.
Sentiment analysis	Per-call sentiment scoring (positive / neutral / negative) attached to each session and available in the analytics dashboard.
Per-call OTel trace	One OpenTelemetry trace per call session spanning HTTP, DB, RabbitMQ, and every connector call. Available in Grafana Tempo.
Outbound scheduler	APScheduler with PostgreSQL job store - scheduled outbound jobs survive pod restarts and infrastructure events.
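
The Streaming STT row separates partial results (used for turn-taking) from final results (which trigger reasoning). The sketch below shows one way that split could look. The `TranscriptEvent` shape, the `TurnTracker` class, and the `on_final` callback are hypothetical names invented for this example; a real STT client delivers richer events than a text string and a flag.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TranscriptEvent:
    """Simplified STT result: recognized text plus a finality flag."""
    text: str
    is_final: bool


@dataclass
class TurnTracker:
    """Routes partial results to turn-taking state and finals to reasoning."""
    on_final: Callable[[str], None]
    live_text: str = ""
    caller_speaking: bool = False

    def handle(self, event: TranscriptEvent) -> None:
        if event.is_final:
            # The utterance is complete: reset live state, then reason on it.
            self.live_text = ""
            self.caller_speaking = False
            if event.text.strip():
                self.on_final(event.text)
        else:
            # Partial result: only update what the turn-taking logic sees.
            self.live_text = event.text
            self.caller_speaking = bool(event.text.strip())
```

For example, with `tracker = TurnTracker(on_final=print)`, a partial event only sets `tracker.caller_speaking`, while a final event calls `print` with the finished utterance. Keeping reasoning off the partial path avoids running the LLM on text that the recognizer may still revise.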

Scaling & Performance

Separate ACA app - the Voice Engine runs independently from the Management API, scaling on concurrent WebSocket connections (the active_calls_total metric).
Min 1 / max 20 replicas - scale-to-zero is disabled for voice to prevent cold-start latency on inbound calls.
Redis cache - agent configuration is cached in Redis. If Redis is unavailable the engine degrades gracefully, reading from the database directly.
PostgreSQL job store - the outbound scheduler uses PostgreSQL for job persistence, ensuring no scheduled calls are lost across pod restarts.
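
The graceful degradation described for the Redis cache can be illustrated with a cache-aside read that falls back to the database. This is a sketch under stated assumptions, not the platform's code: `CacheUnavailable`, `InMemoryCache`, `InMemoryDb`, `load_agent_config`, and the `agent-config:` key prefix are all hypothetical, and the in-memory classes stand in for real Redis and PostgreSQL clients.

```python
import json


class CacheUnavailable(Exception):
    """Hypothetical error raised when the cache cannot be reached."""


class InMemoryCache:
    """Stand-in for a Redis client; set down=True to simulate an outage."""

    def __init__(self, down: bool = False):
        self.down = down
        self.store = {}

    def get(self, key):
        if self.down:
            raise CacheUnavailable(key)
        return self.store.get(key)

    def set(self, key, value):
        if self.down:
            raise CacheUnavailable(key)
        self.store[key] = value


class InMemoryDb:
    """Stand-in for the agent-configuration table in PostgreSQL."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def fetch_agent_config(self, agent_id):
        self.reads += 1
        return self.rows[agent_id]


def load_agent_config(agent_id, cache, db):
    key = f"agent-config:{agent_id}"
    try:
        cached = cache.get(key)
    except CacheUnavailable:
        # Cache outage: degrade to a direct database read.
        return db.fetch_agent_config(agent_id)
    if cached is not None:
        return json.loads(cached)
    # Cache miss: read from the database and try to populate the cache.
    config = db.fetch_agent_config(agent_id)
    try:
        cache.set(key, json.dumps(config))
    except CacheUnavailable:
        pass  # a failed cache write must not fail the call
    return config
```

In this sketch the first lookup reads the database and fills the cache, later lookups are served from the cache, and setting `cache.down = True` sends every lookup straight to the database without raising to the caller.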

Related

Core Engine - the reasoning runtime the Voice Engine sits on
Connectors & Adapters - the systems agents call mid-call (CRM, ticketing, email)
Analytics & ROI - the KPIs that measure voice agent performance (containment, sentiment, cost per call)
Platform overview - all four engines and four trust pillars

Want to see a live voice agent demo?

We'll run a live inbound or outbound call with a configured agent, walk through the call trace in real time, and discuss what mid-call tool dispatch looks like for your systems.

Book a live voice demo →