AIMOCS

White paper

Best tool stack for Hermes voice agents in commercial workflows

A long-form companion to the Hermes stack guide — the design decisions, failure modes, and operating disciplines behind a production voice operator that customers do not hang up on.

Updated · 2026-05-21

  • <800 ms round-trip latency budget for a voice operator on a real call
  • 5% of calls escalate to a human under typical workflow scope
  • 100% of calls audited end-to-end with audio + reasoning

01 Abstract

02 Context

Why voice, why now, why Hermes

Three things changed at roughly the same time. First, voice synthesis crossed the uncanny valley for sustained business conversation — ElevenLabs and a handful of peers produce voice that does not get hung up on. Second, language models got fast and reliable enough that real-time turn-taking became possible without the operator sounding broken. Third, the outbound-voice surface — telephony, recording, consent, sentiment — was packaged for agent use rather than for human call-center workflows. Hermes is the most mature of those.

The case for voice in commercial workflows is empirical: outbound voice closes faster than email for accounts-receivable, faster than SMS for appointment confirmation, and faster than chat for after-hours intake. The case is not theoretical, and the technology is now production-grade — but only if the surrounding stack is built deliberately.

03 Architecture

The reference stack

Hermes — the call surface

Hermes carries the telephony, manages turn-taking, handles consent at call start, records the audio with regional-jurisdiction handling, provides sentiment signal alongside the transcript, and exposes the entire call surface to the operator container via webhooks and a real-time stream.

ElevenLabs — the voice

ElevenLabs synthesises the operator's voice from a calibrated brand-tone profile. We pre-cache common phrases (greetings, confirmations, escalation handoffs) so the marginal cost of synthesis tracks unique spoken content rather than total speech.
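The pre-caching idea can be sketched as a cache keyed by the exact phrase text. This is a minimal illustration, not the production integration: `PhraseCache`, `fake_synthesize`, and the character-count cost proxy are all assumptions made for the example, and the real synthesis call would replace the stand-in function.

```python
from typing import Callable, Dict, List


class PhraseCache:
    """Caches synthesized audio by exact text so repeated phrases are billed once."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize      # hypothetical TTS call, injected
        self._audio: Dict[str, bytes] = {}
        self.billed_characters = 0         # characters actually sent to synthesis

    def warm(self, phrases: List[str]) -> None:
        """Pre-synthesize common phrases before any call starts."""
        for phrase in phrases:
            self.speak(phrase)

    def speak(self, text: str) -> bytes:
        """Return audio for text, synthesizing only on a cache miss."""
        if text not in self._audio:
            self._audio[text] = self._synthesize(text)
            self.billed_characters += len(text)
        return self._audio[text]


def fake_synthesize(text: str) -> bytes:
    # Stand-in for a real TTS request; returns placeholder bytes.
    return text.encode("utf-8")


cache = PhraseCache(fake_synthesize)
cache.warm(["Hello, this is the accounts team.", "Yes, go ahead."])
warm_cost = cache.billed_characters
cache.speak("Yes, go ahead.")                 # cache hit, no new cost
cache.speak("Your invoice 1042 is overdue.")  # unique content, billed
```

After warm-up, only the unique sentence adds to `billed_characters`, which is the sense in which cost tracks unique spoken content rather than total speech.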

Anthropic Claude — reasoning

Claude does the planning and makes the decisions. We pin the model version per workflow. Claude's predictable tool use across multi-turn workflows is the practical reason it is our default for voice: in production, an unpredictable reasoning core makes calls feel broken even when the surface technology is good.

Glama — the tool gateway

Every downstream tool (the CRM, the billing system, the scheduling platform, the ticketing system) is exposed to the operator through Glama's MCP (Model Context Protocol) gateway. Glama holds the scoped credentials, applies rate limits, and writes a tool-call audit feed straight into MongoDB.

Stripe — money movement

When a workflow involves payment, the operator acts by creating a Stripe payment intent. The intent is created idempotently, so a retry never doubles a charge. The intent ID lives in the MongoDB audit record next to the call ID, so finance reconciliation is a straightforward join on those IDs.
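The retry-safety property can be illustrated without touching a real payment API. In the sketch below, `idempotency_key` and `FakePaymentGateway` are hypothetical names, and the in-memory gateway only imitates the behaviour of a provider that honours idempotency keys; the key derivation from call ID, invoice ID, and amount is one possible choice, not a documented AIMOCS convention.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


def idempotency_key(call_id: str, invoice_id: str, amount_cents: int) -> str:
    """Derive a stable key so every retry of the same action reuses it."""
    raw = f"{call_id}:{invoice_id}:{amount_cents}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class PaymentIntent:
    intent_id: str
    amount_cents: int


class FakePaymentGateway:
    """In-memory stand-in for a payment provider that honours idempotency keys."""

    def __init__(self) -> None:
        self._by_key: Dict[str, PaymentIntent] = {}

    def create_intent(self, amount_cents: int, key: str) -> PaymentIntent:
        # A repeated key returns the original intent instead of creating a new one.
        if key not in self._by_key:
            intent_id = f"pi_demo_{len(self._by_key) + 1}"
            self._by_key[key] = PaymentIntent(intent_id, amount_cents)
        return self._by_key[key]

    @property
    def intents_created(self) -> int:
        return len(self._by_key)


gateway = FakePaymentGateway()
key = idempotency_key("call_001", "inv_1042", 25_000)
first = gateway.create_intent(25_000, key)
retry = gateway.create_intent(25_000, key)   # e.g. after a network timeout
```

Because the key is derived from the call and the action rather than generated per request, a retry after a timeout resolves to the same intent and the same intent ID ends up in the audit record.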

Supabase — memory

Per-account state — payer tier, prior promises, communication preferences, last contact — lives in Supabase. The operator reads it at call open and writes back at call close.

MongoDB — audit

Append-only record of every action plus the reasoning that produced it, plus references to the audio file the action came from. The audit log is the single artifact that lets the workflow owner trust the operator over time.
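One way to picture the audit record is as an immutable entry in a log that offers no update or delete path. The field names and the `AppendOnlyLog` class below are assumptions for illustration; they mirror the description above (action, reasoning, audio reference, and the payment intent ID where one exists) but are not the actual MongoDB schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AuditRecord:
    call_id: str
    action: str
    reasoning: str
    audio_ref: str                       # pointer to the recording, not the audio itself
    payment_intent_id: Optional[str] = None


class AppendOnlyLog:
    """Exposes append and read only; there is no update or delete method."""

    def __init__(self) -> None:
        self._records: List[AuditRecord] = []

    def append(self, record: AuditRecord) -> int:
        self._records.append(record)
        return len(self._records) - 1    # position of the new record

    def records(self) -> Tuple[AuditRecord, ...]:
        return tuple(self._records)      # immutable snapshot for readers

    def for_call(self, call_id: str) -> List[AuditRecord]:
        return [r for r in self._records if r.call_id == call_id]


log = AppendOnlyLog()
log.append(AuditRecord("call_001", "greet", "call opened", "audio/call_001.wav"))
log.append(AuditRecord("call_001", "create_payment_intent",
                       "customer agreed to pay invoice in full",
                       "audio/call_001.wav", payment_intent_id="pi_demo_1"))
log.append(AuditRecord("call_002", "escalate", "customer asked for a human",
                       "audio/call_002.wav"))
call_001 = log.for_call("call_001")
```

Keeping the audio reference and the intent ID on the same record is what lets a reviewer move from an action to its reasoning, its recording, and its financial effect without a second lookup table.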

04 Real-time discipline

Latency budget

A voice operator that takes more than about 800 milliseconds to start its turn after the customer stops talking feels broken on a phone call. That budget is finite and consumed by every layer in the stack. We allocate roughly as follows:

  • Hermes inbound speech-to-text: 150-200 ms
  • Anthropic Claude reasoning: 350-500 ms with a tight prompt
  • ElevenLabs synthesis of the first audio chunk: 100-150 ms
  • Glama tool call (when needed): 50-100 ms with the gateway colocated
  • Network and orchestration overhead: 50-100 ms

The disciplines that keep the budget intact: keep the system prompt small, pre-warm Glama tool definitions, colocate the operator container with the gateway, use ElevenLabs chunked-output mode so the operator can start speaking before its full response is synthesised, and never make a synchronous tool call mid-turn when an asynchronous read at the beginning of the turn would do.
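A quick arithmetic check of the allocation is worth doing. The sketch below sums the low and high ends of the ranges listed above; the dictionary keys are labels chosen for the example. The low ends fit inside 800 ms, but the high ends do not, even on turns with no tool call, so the ranges are best read as per-layer ceilings that cannot all be reached in the same turn.

```python
BUDGET_MS = 800

# (low, high) allocations in milliseconds, taken from the list above.
ALLOCATIONS = {
    "speech_to_text": (150, 200),
    "reasoning": (350, 500),
    "first_audio_chunk": (100, 150),
    "tool_call": (50, 100),
    "network_and_orchestration": (50, 100),
}


def total(allocations, include_tool_call=True):
    """Return (sum of low ends, sum of high ends) in milliseconds."""
    parts = [
        bounds for name, bounds in allocations.items()
        if include_tool_call or name != "tool_call"
    ]
    return sum(low for low, _ in parts), sum(high for _, high in parts)


best, worst = total(ALLOCATIONS)
best_no_tool, worst_no_tool = total(ALLOCATIONS, include_tool_call=False)

print(f"with tool call:    {best}-{worst} ms (budget {BUDGET_MS} ms)")
print(f"without tool call: {best_no_tool}-{worst_no_tool} ms")
print(f"headroom at low end: {BUDGET_MS - best} ms")
print(f"overrun at high end: {worst - BUDGET_MS} ms")
```

The totals are 700 to 1050 ms with a tool call and 650 to 950 ms without one, which is why the disciplines above focus on keeping reasoning near the bottom of its range and moving tool reads off the critical path.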

05 Conversation design

Turn-taking and barge-in

Most voice operators that feel broken feel broken because of turn-taking. They talk over the customer, they pause too long after the customer stops, or they cannot be interrupted gracefully. The Hermes stack solves this if it is configured correctly.

  • Voice activity detection on the customer side runs in parallel with speech-to-text. The operator does not wait for an end-of-utterance signal that might arrive 1500 ms late; it begins its turn at the natural pause.
  • Barge-in interrupts the operator mid-turn cleanly. ElevenLabs feeds audio in chunks so the operator can stop speaking on the chunk boundary rather than mid-syllable.
  • When the customer interrupts, the operator does not just stop — it acknowledges. "Yes, go ahead." The acknowledgement is a short pre-cached phrase to keep latency under the conversational pause budget.
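The barge-in behaviour in the last two bullets can be sketched as a playback loop that checks for customer speech between chunks. Everything here is hypothetical scaffolding: `speak_with_barge_in`, the callback parameters, and the simulated voice-activity signal stand in for whatever the call surface and synthesis stream actually expose.

```python
from typing import Callable, Iterable, List

ACK_PHRASE = "Yes, go ahead."


def speak_with_barge_in(
    chunks: Iterable[bytes],
    customer_is_speaking: Callable[[], bool],
    play: Callable[[bytes], None],
    play_cached: Callable[[str], None],
) -> bool:
    """Play audio chunk by chunk; stop at a chunk boundary on barge-in.

    Returns True if the whole response was spoken, False if interrupted.
    """
    for chunk in chunks:
        if customer_is_speaking():       # checked between chunks, never mid-chunk
            play_cached(ACK_PHRASE)      # acknowledge instead of going silent
            return False
        play(chunk)
    return True


# Simulated call: the customer starts talking after two chunks have played.
played: List[str] = []
vad_signal = iter([False, False, True])
completed = speak_with_barge_in(
    chunks=[b"chunk-1", b"chunk-2", b"chunk-3", b"chunk-4"],
    customer_is_speaking=lambda: next(vad_signal),
    play=lambda c: played.append(c.decode()),
    play_cached=lambda phrase: played.append(f"[cached] {phrase}"),
)
```

In the simulation the operator plays two chunks, stops before the third, and plays the cached acknowledgement. Shorter chunks give a faster stop at the cost of more boundary checks.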

06 When the operator stops

Escalation and the human handover

A trustworthy voice operator escalates often at first, then progressively less as the workflow tightens. The escalation surface needs three things to work:

  1. Clear escalation triggers in the reasoning prompt: sensitive language, ambiguous intent, requests outside the authority bar, customer asking explicitly for a human. The bar is enforced at the tool gateway, not just in the model.
  2. A warm handover. The operator says "I am going to bring in a colleague who can help with that. Please hold for a moment." The handover happens via the existing customer-service queue with the full conversation transcript and the operator's reasoning attached.
  3. The human picks up with context. They see the transcript, the operator's notes, the customer's account state. They do not start cold.
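Enforcing the authority bar at the gateway, rather than only in the prompt, can be pictured as a plain check that runs before any tool call is forwarded. The sketch below is an assumption-laden illustration: `AuthorityBar`, `check_commitment`, the tool names, and the limits are invented for the example, with the bar's fields following the parameters this paper names elsewhere (maximum amount, eligible payment methods, eligible customer tier).

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AuthorityBar:
    allowed_tools: FrozenSet[str]        # exact tool names; a "*" entry grants nothing
    max_amount_cents: int
    payment_methods: FrozenSet[str]
    auto_commit_tiers: FrozenSet[str]


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def check_commitment(bar: AuthorityBar, tool: str, amount_cents: int,
                     method: str, tier: str) -> Decision:
    """Gateway-side check; anything outside the bar escalates to a human."""
    if tool not in bar.allowed_tools:    # exact-name match, no wildcard expansion
        return Decision(False, "escalate: tool not in signed authority bar")
    if amount_cents > bar.max_amount_cents:
        return Decision(False, "escalate: amount above maximum")
    if method not in bar.payment_methods:
        return Decision(False, "escalate: payment method not eligible")
    if tier not in bar.auto_commit_tiers:
        return Decision(False, "escalate: customer tier not eligible")
    return Decision(True, "ok")


bar = AuthorityBar(
    allowed_tools=frozenset({"create_payment_plan"}),
    max_amount_cents=50_000,
    payment_methods=frozenset({"card", "bank_transfer"}),
    auto_commit_tiers=frozenset({"standard"}),
)
ok = check_commitment(bar, "create_payment_plan", 20_000, "card", "standard")
too_large = check_commitment(bar, "create_payment_plan", 80_000, "card", "standard")
new_tool = check_commitment(bar, "issue_refund", 1_000, "card", "standard")
```

Because the check matches tool names exactly, a workflow added without updating the bar is rejected by default, regardless of what the model decides.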

07 Keeping the unit economics right

Cost discipline

A voice operator can pay for itself on its first deployment if the call cadence is right, and lose money if it is wrong. Three disciplines:

  • Pre-cache synthesis for common phrases. The marginal cost of an ElevenLabs call should track unique content, not greetings.
  • Cap per-workflow daily spend at the Glama gateway. A runaway operator that loops on retries is the most common cost surprise.
  • Tier customers by expected response value. High-value institutional accounts get the full voice-first cadence; low-value accounts may get an SMS-first cadence that escalates to voice only on engagement.
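The daily spend cap in the second bullet can be sketched as a counter keyed by workflow and day that refuses further spend once the cap would be exceeded. `DailySpendCap` and the cent amounts are hypothetical; a real gateway would persist the counter and take costs from actual usage metering.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple


class DailySpendCap:
    """Tracks spend per (workflow, day) and refuses calls once the cap is hit."""

    def __init__(self, cap_cents: int) -> None:
        self.cap_cents = cap_cents
        self._spent: DefaultDict[Tuple[str, str], int] = defaultdict(int)

    def try_spend(self, workflow: str, day: str, cost_cents: int) -> bool:
        key = (workflow, day)
        if self._spent[key] + cost_cents > self.cap_cents:
            return False                 # over the cap: reject, do not record
        self._spent[key] += cost_cents
        return True

    def spent(self, workflow: str, day: str) -> int:
        return self._spent[(workflow, day)]


cap = DailySpendCap(cap_cents=1_000)
attempts = 0
# A runaway retry loop: each attempt costs 300 cents and never succeeds.
while cap.try_spend("accounts-receivable", "2026-05-21", 300):
    attempts += 1
```

The loop stops after three attempts with 900 cents spent, so the retry bug is bounded by the cap rather than by how long it takes someone to notice.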

08 What goes wrong

Failure modes we have seen

  • Synthetic voice drift. ElevenLabs model updates can subtly change the brand voice. Re-validate the calibration against the brand-tone reference set monthly.
  • Turn-taking regression after a model upgrade. New Claude version, slightly different pacing, suddenly the operator is interrupting customers. Caught by the regression suite if you replay last week's calls; missed otherwise.
  • Authority-bar gap. The operator handles a workflow that was added quietly without updating the signed authority bar. The tool gateway should reject the new call; if the gateway has a wildcard scope, it will not.
  • Audio-audit drift. The audit log accumulates the reasoning but the audio reference points to a file that has been pruned by the retention policy. Set the audio retention to match the audit-log retention.
  • Consent skip. The operator skips the recording-consent line because a system prompt update accidentally removed it. The regression suite must include a consent-line check; manual sampling is not enough.
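The consent-line check from the last bullet is simple enough to automate across every replayed call. The sketch below assumes transcripts are available as lists of operator turns and that the consent wording contains one of a few known marker phrases; both the marker strings and the function names are assumptions for illustration.

```python
from typing import Dict, List

# Hypothetical marker phrases; the real list would come from the approved script.
CONSENT_MARKERS = ("this call is recorded", "this call may be recorded")


def has_consent_line(operator_turns: List[str], within_first: int = 2) -> bool:
    """True if one of the opening operator turns contains a consent marker."""
    opening = " ".join(operator_turns[:within_first]).lower()
    return any(marker in opening for marker in CONSENT_MARKERS)


def calls_missing_consent(replayed: Dict[str, List[str]]) -> List[str]:
    """Return the IDs of replayed calls whose opening lacks the consent line."""
    return sorted(
        call_id for call_id, turns in replayed.items()
        if not has_consent_line(turns)
    )


replayed_calls = {
    "call_001": ["Hello, this is the accounts team. This call is recorded.",
                 "I am calling about an overdue invoice."],
    "call_002": ["Hello, this is the accounts team.",
                 "I am calling about an overdue invoice."],
    "call_003": ["Hello.", "Before we continue, this call may be recorded."],
}
missing = calls_missing_consent(replayed_calls)
```

A regression suite would fail the build whenever `missing` is non-empty, which catches a prompt edit that drops the consent line on the first replay rather than in a later manual sample.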
Questions
  • Is Hermes available in Arabic?

    Yes — for both inbound and outbound. We calibrate the voice per dialect (Saudi, Khaleeji, Levantine, MSA) and pin the reasoning model so behavior stays consistent across calls in the same market.

  • What is the typical call cost?

    Per-call cost varies with duration, synthesis ratio, and tool calls per call. For a typical accounts-receivable workflow we see total per-call cost in the low single-digit dollar range — well below the cost of a human SDR making the same call, and well above commodity TTS-only outbound dialers.

  • How does the operator handle a customer who is upset?

    Hermes provides sentiment signal alongside the transcript. The reasoning prompt uses it as an escalation trigger, not as an instruction to manipulate. An upset customer triggers a warm handover to a human, with the transcript and context attached.

  • Can the operator make payment commitments on a call?

    Yes, within the signed authority bar. The authority bar specifies the maximum payment amount, the eligible payment methods, and the tier of customer for whom auto-commitment is allowed. Outside those parameters the call escalates.

  • Does AIMOCS run the voice operator end-to-end?

    Yes. Customer owns the workflow, the audit log, and the data; AIMOCS owns the operations — model version discipline, calibration drift checks, the escalation queue triage, the monthly log review.

Citations
  1. Hermes stack guide: aimocs.com/stack/hermes.
  2. ElevenLabs voice-agent stack guide: aimocs.com/stack/elevenlabs.
  3. Glama MCP stack guide: aimocs.com/stack/glama.
  4. Accounts receivable workflow: aimocs.com/operator/workflow/accounts-receivable.