AIMOCS · Stack guides

The ElevenLabs voice-agent stack for branded customer operations

ElevenLabs renders voice that doesn't sound like an IVR. The surrounding stack is what turns that into a customer-facing operator with reasoning, memory, and a real audit trail.

The stack

  • ElevenLabs
  • Anthropic Claude
  • Hermes
  • Glama
  • Supabase
  • MongoDB
  • Stripe

Updated · 2026-05-21

01 TL;DR
02 The stack
  • L/01 Voice synthesis

    ElevenLabs renders the operator's voice in the brand's tone: calibrated, consistent across calls, and latency-budgeted for real-time interaction.

    • ElevenLabs
  • L/02 Call surface

    Hermes carries the telephony, manages turn-taking, and handles recording with consent. ElevenLabs feeds it the synthesised audio.

    • Hermes
  • L/03 Reasoning core

    Anthropic Claude decides what to say, when to escalate, and which tool to call. The model version is pinned per workflow.

    • Anthropic Claude
  • L/04 Tool gateway

    Glama presents CRM, billing, scheduling, and ticketing tools to the operator over a uniform MCP interface.

    • Glama
  • L/05 Memory + money + audit

    Supabase holds account state. Stripe runs the payments. MongoDB stores every transcript, reasoning trace, tool call, and audio reference.

    • Supabase
    • Stripe
    • MongoDB
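
The five layers can be pictured as a per-workflow configuration in which the reasoning model version is pinned explicitly. The sketch below is illustrative only: `WorkflowConfig`, its field names, the `validate` rule, and every identifier string are hypothetical placeholders, not an AIMOCS, Anthropic, or ElevenLabs schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowConfig:
    """Hypothetical wiring of the five layers for one workflow."""
    name: str
    voice_id: str             # calibrated brand voice (placeholder value)
    reasoning_model: str      # exact, pinned model version (placeholder value)
    tools: tuple[str, ...]    # tool names exposed through the tool gateway
    audit_collection: str     # collection that receives the call log


def validate(cfg: WorkflowConfig) -> list[str]:
    """Return a list of problems; an empty list means the config is usable."""
    problems = []
    # A floating alias would let the model change underneath the workflow.
    if not cfg.reasoning_model or cfg.reasoning_model.endswith("latest"):
        problems.append("reasoning_model must be pinned to an exact version")
    if not cfg.voice_id:
        problems.append("voice_id is required")
    if not cfg.tools:
        problems.append("at least one tool is required")
    return problems


CONFIRMATIONS = WorkflowConfig(
    name="appointment-confirmation",
    voice_id="brand-voice-placeholder",
    reasoning_model="model-version-placeholder-2026-01-01",
    tools=("scheduling.lookup", "scheduling.reschedule"),
    audit_collection="call_audit",
)
```

Keeping the config immutable (`frozen=True`) means a version change has to be a deliberate new config rather than an in-place edit during a call.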
03 Why this stack

Brand-quality audio

ElevenLabs is one of the few voice stacks that crosses the uncanny valley for sustained conversation. Customers stop noticing they're talking to an operator.

Latency budget that works for calls

The combined ElevenLabs + Hermes + Anthropic round trip stays inside the natural pause budget of a phone conversation. The operator doesn't feel slow.

Consistent voice across thousands of calls

Voice cloning means the operator sounds the same every time. Brands stop having to apologise for the IVR tone of voice that was good enough five years ago.

Audit including audio

The MongoDB log carries audio references next to the reasoning. Disputes, training reviews, and compliance audits all read from the same record.
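
As a rough illustration of what "audio references next to reasoning" could look like, the sketch below builds one document per call using only the standard library. The field names (`call_id`, `turns`, `audio_ref`, and so on) and the storage paths are assumptions made for this example; the guide does not specify the actual schema.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ToolCall:
    name: str
    arguments: dict
    result_summary: str


@dataclass
class Turn:
    speaker: str                  # "customer" or "operator"
    transcript: str
    audio_ref: str                # pointer to stored audio, not the audio itself
    reasoning: str = ""           # filled in for operator turns only
    tool_calls: list = field(default_factory=list)


def build_call_document(call_id, turns, started_at=None):
    """Assemble a single audit document covering the whole call."""
    started_at = started_at or datetime.now(timezone.utc)
    return {
        "call_id": call_id,
        "started_at": started_at.isoformat(),
        # asdict() also converts the nested ToolCall entries to plain dicts.
        "turns": [asdict(turn) for turn in turns],
    }


example = build_call_document(
    "call-001",
    [
        Turn("customer", "I need to move my appointment.", "audio/call-001/0001.wav"),
        Turn(
            "operator",
            "I can offer Friday at 10:00.",
            "audio/call-001/0002.wav",
            reasoning="Customer asked to reschedule; look up open slots.",
            tool_calls=[ToolCall("scheduling.find_slots", {"day": "friday"}, "3 slots")],
        ),
    ],
    started_at=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
```

Because the result is a plain dictionary, it could be handed to a MongoDB driver as one document, so a reviewer reads transcript, reasoning, tool calls, and audio pointers from the same record.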

04 Where it shines
  • ◇/01

    Outbound collections where voice closes faster than email

  • ◇/02

    Appointment confirmations and reminders where the customer-experience bar is high

  • ◇/03

    Onboarding and intake calls where the brand needs to sound like itself

  • ◇/04

    Field-service dispatching where the dispatcher voice is the brand to the technician

05 Comparison

ElevenLabs in the production voice stack

Pros

  • Brand-quality voice with consistent tone
  • Latency budget tuned for real-time conversation
  • Audio + reasoning audit in one log

Cons

  • Higher per-minute synthesis cost than commodity TTS

Commodity TTS (Google, Amazon Polly)

Pros

  • Lower per-minute cost
  • Wide language coverage

Cons

  • Audibly synthetic, so customers disengage on long calls
  • No brand-voice control without significant tuning

Pre-recorded human voiceover library

Pros

  • Highest possible quality per asset

Cons

  • Cannot handle dynamic content (names, amounts, dates)
  • Coverage gaps force a fallback to commodity TTS, which combines the drawbacks of both
06 Implementation notes
  1. 01

    Calibrate the brand voice with a curated training set — 20-30 minutes of clean studio audio in the target tone. Skipping this is the most common reason voice operators sound generic.

  2. 02

    Pre-cache common phrases (greetings, confirmations) so the per-call synthesis cost scales with the unique content rather than with the total spoken text.

  3. 03

    Hermes handles silence detection and barge-in. ElevenLabs feeds it the audio one chunk at a time so the operator can interrupt itself cleanly when the customer speaks.

  4. 04

    Record consent explicitly at call start in the customer's preferred language. The audio of the consent goes into the same MongoDB record as the rest of the call.

  5. 05

    For Arabic or multi-dialect operations, calibrate per dialect and pin the model. A Saudi operator should not switch to Egyptian Arabic mid-call.

  6. 06

    Watermark synthesised audio inaudibly so downstream review tools can confirm what was machine-generated vs. human.
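
The phrase pre-caching in note 02 can be sketched as a small cache keyed on voice and text. This is a minimal illustration under stated assumptions: `PhraseCache` and `fake_synthesize` are hypothetical, the synthesis call is a stand-in rather than the ElevenLabs API, and counting characters is only a proxy for whatever unit is actually billed.

```python
import hashlib


class PhraseCache:
    """Synthesise each (voice, phrase) pair once and reuse the audio afterwards."""

    def __init__(self, synthesize):
        self._synthesize = synthesize      # callable: (text, voice_id) -> bytes
        self._store = {}
        self.synthesized_characters = 0    # proxy for synthesis cost

    @staticmethod
    def _key(text, voice_id):
        # Collapse whitespace so trivially different spacing shares one entry.
        normalized = " ".join(text.split())
        return hashlib.sha256(f"{voice_id}:{normalized}".encode("utf-8")).hexdigest()

    def get_audio(self, text, voice_id):
        key = self._key(text, voice_id)
        if key not in self._store:
            # Cache miss: pay for synthesis once, then keep the result.
            self._store[key] = self._synthesize(text, voice_id)
            self.synthesized_characters += len(text)
        return self._store[key]


def fake_synthesize(text, voice_id):
    """Stand-in for a real text-to-speech call; returns placeholder bytes."""
    return f"<{voice_id}:{text}>".encode("utf-8")


cache = PhraseCache(fake_synthesize)
greeting = "Thanks for calling."
balance = "Your balance is 42.10."

for _ in range(3):                         # three calls share the same greeting
    cache.get_audio(greeting, "brand-voice")
cache.get_audio(balance, "brand-voice")    # unique content is synthesised once
```

After three greetings and one balance line, `cache.synthesized_characters` equals the length of the two distinct phrases, not of the four utterances spoken. The voice identifier is part of the key so that a phrase cached for one dialect is never replayed in another.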

08 Questions
  • Can a voice operator handle a complex multi-turn customer conversation?

    For tightly scoped workflows, yes: confirmation, collection, intake, dispatching. For open-ended customer support or sensitive escalations, no; the operator is built to recognise that and hand off to a human with context.

  • How does the stack handle regional accents and languages?

    ElevenLabs supports voice calibration per language and dialect. AIMOCS configures one voice for each language and dialect the operator handles, and the reasoning model selects the voice for each call based on the customer's locale.

  • What about emotional tone — does the operator notice when a customer is upset?

    Hermes provides sentiment signal alongside the transcript. The reasoning model uses it to escalate ("this is going badly, route to human") rather than to manipulate ("speak softer"). Escalation is the response, not synthetic empathy.

  • Are consent and recording handled legally?

    Yes. The operator obtains explicit verbal consent at call start and stores it as audio, with retention tied to the customer's jurisdiction. In two-party-consent states, the operator obtains consent before recording starts.

  • How long to deploy a voice operator with this stack?

    Four to six weeks. Week one is voice calibration and workflow mapping. Weeks two and three integrate Hermes, Glama, and the downstream tools. Week four runs the operator in shadow mode against real calls. Weeks five and six are a staged handover with weekly review.
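
Two of the answers above describe simple per-call decisions: picking one voice from the customer's locale, and escalating on negative sentiment instead of adjusting tone. The sketch below illustrates both under explicit assumptions. The voice identifiers are placeholders, a lookup table stands in for the reasoning model's selection, and the numeric sentiment scale and the "two consecutive negative turns" rule are invented for illustration; the guide does not say what form the Hermes sentiment signal takes.

```python
from dataclasses import dataclass, field

# Hypothetical voice identifiers, one per language and dialect handled.
VOICES_BY_LOCALE = {
    "ar-SA": "voice-ar-saudi",
    "ar-EG": "voice-ar-egyptian",
    "en-GB": "voice-en-british",
}


def select_voice(locale: str) -> str:
    """Pick the voice once per call; fail loudly rather than swap dialects."""
    try:
        return VOICES_BY_LOCALE[locale]
    except KeyError:
        raise LookupError(f"no calibrated voice for locale {locale!r}") from None


@dataclass
class SentimentWindow:
    """Escalate after several consecutive negative turns (assumed rule)."""
    threshold: float = -0.5      # assumed scale: -1.0 (upset) to 1.0 (content)
    consecutive: int = 2
    _streak: int = field(default=0, init=False)

    def should_escalate(self, score: float) -> bool:
        # A single non-negative turn resets the streak.
        self._streak = self._streak + 1 if score <= self.threshold else 0
        return self._streak >= self.consecutive


window = SentimentWindow()
decisions = [window.should_escalate(s) for s in (0.2, -0.7, -0.1, -0.6, -0.9)]
# decisions is [False, False, False, False, True]: only the final pair of
# consecutive negative turns triggers a hand-off to a human.
```

Raising on an unknown locale, rather than falling back to the nearest dialect, reflects the rule that a Saudi operator should not drift into Egyptian Arabic.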

09 Begin

We don't advise on AI. We run it for you.