The Claude Code production stack for engineering workflows

Claude Code, the in-terminal agent, is one half. The other half is the boring scaffolding that lets it run safely against your real repos, your real CI, and your real customers.

Updated · 2026-05-21

02 The stack

L/01 Agent surface
Claude Code is the in-terminal interface and tool runner. It's where the human developer and the agent share a context window.
L/02 Reasoning core
Anthropic Claude with a project-pinned system prompt and a frozen tool surface. We swap the underlying model in deliberate releases, never per-run.
L/03 Tool gateway
Glama exposes Linear, GitHub, PagerDuty, Datadog, and internal admin tools over MCP so Claude Code calls one interface for everything outside the editor.
L/04 Isolation
Every non-trivial agent run goes inside a Docker container with the minimal toolchain. Mistakes can't escape the container, and runs are repeatable across machines.
L/05 Deploy + memory + audit
Vercel ships the result. Supabase holds per-project state and learned preferences. MongoDB carries the append-only log: every shell command, file edit, and the reasoning behind it.

03 Why this stack

Sandbox by default

Docker means a misfired rm -rf is contained, and so is a bad migration as long as the container holds no production credentials. The blast radius is one container, not your workstation or your prod cluster.
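
A minimal sketch of what "sandbox by default" could look like when launching a run. It only builds the argument list for `docker run` with standard Docker flags. The image tag, checkout path, and resource limits are placeholders of ours, not values from this stack.

```python
import shlex
from pathlib import PurePosixPath


def build_docker_run_args(image, repo_dir, command, memory="2g", pids_limit=256):
    """Return an argv list for a locked-down, throwaway agent container."""
    if not PurePosixPath(repo_dir).is_absolute():
        raise ValueError("repo_dir must be an absolute path")
    return [
        "docker", "run",
        "--rm",                                 # discard the container after the run
        "--read-only",                          # root filesystem is immutable
        "--tmpfs", "/tmp",                      # scratch space that dies with the container
        "--cap-drop", "ALL",                    # drop every Linux capability
        "--security-opt", "no-new-privileges",  # block privilege escalation via setuid
        "--memory", memory,
        "--pids-limit", str(pids_limit),
        "--volume", f"{repo_dir}:/workspace",   # the only writable host path
        "--workdir", "/workspace",
        image,
        *command,
    ]


args = build_docker_run_args(
    image="agent-toolchain:example",    # hypothetical image tag
    repo_dir="/srv/checkouts/my-repo",  # hypothetical disposable checkout
    command=["make", "test"],
)
print(shlex.join(args))
```

The mounted checkout is still writable, so the container only limits the damage if that checkout is a disposable clone rather than your working copy.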

Predictable upgrades

Pin the Claude model version per project. New model releases go through a regression set before they touch a live repo — same discipline as a database upgrade.
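
One way to make the pin hard to bypass is to resolve the model through a single function that refuses per-run overrides and unvetted pins. This is a sketch under our own assumptions: the `ProjectPin` shape, the `regression_passed` flag, and the model identifier are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProjectPin:
    project: str
    model: str               # exact model identifier, never an alias such as "latest"
    regression_passed: bool  # set by the regression job, not by hand


def resolve_model(pin: ProjectPin, requested: Optional[str] = None) -> str:
    """Return the pinned model, refusing per-run overrides and unvetted pins."""
    if requested is not None and requested != pin.model:
        raise ValueError(
            f"{pin.project} is pinned to {pin.model!r}; override {requested!r} refused"
        )
    if not pin.regression_passed:
        raise RuntimeError(
            f"{pin.model!r} has not passed the regression set for {pin.project}"
        )
    return pin.model


# Placeholder identifier, not a real model name.
pin = ProjectPin(project="billing-api", model="example-model-2026-01-01",
                 regression_passed=True)
print(resolve_model(pin))
```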

One log to investigate

Combine the Glama tool log and the agent reasoning log in MongoDB and you can reconstruct any change end-to-end: what the agent saw, what it decided, what it did, and what shipped.
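
A rough illustration of that reconstruction, using in-memory lists as stand-ins for the two MongoDB collections. The field names (`run_id`, `ts`, `kind`, `detail`) are assumptions for the example, not the actual log schema.

```python
from datetime import datetime

# Stand-ins for the tool log and the reasoning log; the schema is hypothetical.
tool_log = [
    {"run_id": "run-42", "ts": "2026-05-20T10:00:02+00:00", "kind": "tool",
     "detail": "github.open_pr"},
    {"run_id": "run-42", "ts": "2026-05-20T10:00:00+00:00", "kind": "tool",
     "detail": "shell: pytest -q"},
    {"run_id": "run-07", "ts": "2026-05-20T09:00:00+00:00", "kind": "tool",
     "detail": "shell: ls"},
]
reasoning_log = [
    {"run_id": "run-42", "ts": "2026-05-20T10:00:01+00:00", "kind": "reasoning",
     "detail": "tests pass, open a PR"},
]


def reconstruct(run_id, *logs):
    """Merge the entries for one run from several logs into a single timeline."""
    entries = [e for log in logs for e in log if e["run_id"] == run_id]
    return sorted(entries, key=lambda e: datetime.fromisoformat(e["ts"]))


timeline = reconstruct("run-42", tool_log, reasoning_log)
for entry in timeline:
    print(entry["ts"], entry["kind"], entry["detail"])
```

Sorting on a shared run identifier and timestamp is what makes the two logs read as one story, so both writers need to agree on those two fields.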

Real CI integration

The same agent surface that wrote the patch can open the PR, watch the checks, and stage the deploy on Vercel, with each step gated by human approval rather than bypassing it.

04 Where it shines

◇/01
Repetitive engineering work (codemods, dependency bumps, lint fixes) where humans get bored and miss steps
◇/02
Tier-2 production incidents where the rote investigation (read logs, check deploys, query DB) is the bottleneck
◇/03
Internal tools and admin scripts that get written once and then break silently when the schema moves
◇/04
Test-suite maintenance — generating and pruning tests against a known coverage target

05 Comparison

Claude Code in a hardened stack

Pros

· Containerised — safe by default, repeatable everywhere
· Tool surface gated through Glama with per-tool authority
· Full append-only audit log for every action

Cons

· Setup cost on day one (Docker images, MCP wiring, eval set)

Claude Code on a developer laptop

Pros

· Zero setup, friction-free
· Works for one developer on one project

Cons

· No shared state across the team
· No audit trail — "who let the agent rm that file?"
· Easy to escalate from a useful tool into a liability

GitHub Copilot Workspace or Codex CLI

Pros

· Tighter integration with one specific platform

Cons

· Locked into one vendor's tooling assumptions
· Less flexibility to wire arbitrary internal MCP tools

06 Implementation notes

01
Build per-team Docker images, not one giant image. The smaller the agent's toolchain, the fewer destructive choices it can make.
02
Use Glama's scoped tokens for every tool. The agent's GitHub token can open PRs but never merge to main; the deploy token can stage but never promote.
03
Capture the full transcript and the diff in MongoDB. "What changed and why" is the only artifact that justifies trust over time.
04
Run a nightly regression suite that replays last week's agent runs against the candidate model version. You catch drift before it catches a customer.
05
Wire Datadog or Sentry into the MCP layer so the agent can read errors directly during an incident, not over Slack.
06
Keep a hard list of "agent never touches" paths — billing config, secrets, prod migrations — enforced at the container level, not the prompt level.
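
The scoped-token rule in note 02 amounts to a deny-by-default authority check. The sketch below models it in plain Python. The token names and action strings are hypothetical, and it does not represent Glama's actual configuration format, which the notes above don't describe.

```python
# Hypothetical scope table: token name -> actions that token may perform.
TOKEN_SCOPES = {
    "github-agent": frozenset({"pr.open", "pr.comment"}),  # no "pr.merge"
    "deploy-agent": frozenset({"deploy.stage"}),           # no "deploy.promote"
}


def authorize(token, action):
    """Deny by default: unknown tokens and unlisted actions are refused."""
    return action in TOKEN_SCOPES.get(token, frozenset())


for token, action in [
    ("github-agent", "pr.open"),
    ("github-agent", "pr.merge"),
    ("deploy-agent", "deploy.stage"),
    ("deploy-agent", "deploy.promote"),
]:
    verdict = "allowed" if authorize(token, action) else "refused"
    print(f"{token:13} {action:15} {verdict}")
```

Listing what each token may do, rather than what it may not, means a newly added action is refused until someone grants it.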

07 Related

Industries it fits

◇ Operator · Agencies & professional services

08 Questions

Why wrap Claude Code in Docker — isn't the agent already sandboxed?
Claude Code runs in your shell with your permissions. That's fine for exploration, dangerous for production. Docker is the cheap, definite boundary that means a wrong tool call costs you a container restart, not a backup restore.
How does this stack handle secrets?
Secrets never live in the prompt or the agent's memory. They're injected into the container at run time from Vercel or a vault, and the audit log records that they were requested — never the values themselves.
Does the audit log slow agents down?
The log writes are async into MongoDB and add a single-digit-millisecond cost per tool call. The slowest part of any agent is the model itself, not the log.
Can a human review what the agent did before it merges?
Yes — that's the default. The agent opens the PR, runs the checks, and stops. A human (or another agent with merge authority) approves. The audit log shows every step of that approval chain.
When should we not use Claude Code in production?
When the workflow is high-context judgment work — system design, customer escalation calls, security postmortems. The agent helps with the boring scaffolding around those, but humans still own the call.
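
The answers on secrets and log overhead can be sketched together: an audit logger that queues events on the caller's thread, writes them from a background worker, and records secret requests by name only. This is an illustration under our own assumptions. The class, method names, and event fields are hypothetical, and a plain list stands in for the MongoDB collection.

```python
import queue
import threading
import time


class AuditLog:
    """Queue events on the caller's thread; a worker thread does the slow write."""

    def __init__(self, sink):
        self._sink = sink            # callable that persists one event
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, kind, **fields):
        # Returns immediately, so the tool call does not wait for storage.
        self._queue.put({"ts": time.time(), "kind": kind, **fields})

    def record_secret_request(self, name):
        # Takes the secret's name only; the value never reaches the log API.
        self.record("secret_requested", secret_name=name)

    def flush(self):
        self._queue.join()           # block until every queued event is written

    def _drain(self):
        while True:
            event = self._queue.get()
            try:
                self._sink(event)
            finally:
                self._queue.task_done()


stored = []                          # stand-in for a MongoDB collection
log = AuditLog(sink=stored.append)
log.record("shell", command="pytest -q")
log.record_secret_request("DATABASE_URL")
log.flush()
print([event["kind"] for event in stored])
```

Because `record_secret_request` has no parameter for the value, a caller cannot leak it by accident. A production sink would also need retry and failure handling, which this sketch leaves out.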
