Operations: Overview
How to run ZeroClaw in production. The surface is intentionally small: one binary, one config file, one SQLite workspace. Most “operations” is “systemd and journald”.
This section covers:
- Service & daemon: keeping the process alive
- Logs & observability: reading what the agent did
- Cost tracking: token spend and per-model cost
- Troubleshooting: when things break
- Network deployment: exposing the gateway, tunnels, reverse proxies
The shape of a deployment
A typical always-on ZeroClaw install is:
zeroclaw service                      — systemd / launchctl / Windows Service
└── zeroclaw daemon                   — the single long-running process
    ├── gateway listener :42617       — REST / WebSocket / webhook intake
    ├── channel pollers               — Telegram, IMAP, Nostr relays (outbound poll)
    ├── channel listeners             — Discord / Slack / Matrix / WebSocket (inbound stream)
    ├── cron scheduler                — scheduled SOPs and jobs
    └── agent loop (one per session)  — provider call + tool execution
                                        ▲ driven by any listener, poller,
                                          gateway request, or cron fire

on disk (everything but the binary can move)
├── ~/.zeroclaw/config.toml           — configuration
├── ~/.zeroclaw/.secret_key           — master key for the encrypted secrets store
└── ~/.zeroclaw/data/                 — runtime state
    ├── memory/                       — agent memory backend
    ├── sessions/                     — per-session conversation stores
    └── state/                        — scheduler, cost, health, misc runtime state

logs                                  — journald / launchctl / Windows Event Log (platform-native)
Everything except the binary can move. The data dir defaults to ~/.zeroclaw/data/ (the legacy ~/.zeroclaw/workspace/ name is still accepted); config paths resolve per environment (Homebrew vs. bootstrap vs. XDG), and log destinations are platform-native by default.
What to monitor
Four signals matter:
1. Service liveness
Is the process running?
Linux
systemctl --user is-active zeroclaw
macOS
launchctl list | grep -c com.zeroclaw.daemon
Windows
schtasks /Query /TN "ZeroClaw Daemon" /FO LIST | findstr Status
If it’s dying repeatedly, check Troubleshooting → Daemon keeps restarting.
2. Channel and component health
The gateway exposes a component health snapshot at /health (public, no secrets) and /api/health (authenticated). Channels, providers, and other long-running components register themselves in the components map and update their entry as they start, report OK, or hit an error.
curl -s http://localhost:42617/health | jq
{
  "status": "ok",
  "paired": true,
  "require_pairing": true,
  "runtime": {
    "pid": 4821,
    "updated_at": "2026-06-08T09:00:00+00:00",
    "uptime_seconds": 3600,
    "components": {
      "channel:telegram": {"status": "ok", "updated_at": "…", "last_ok": "…", "last_error": null, "restart_count": 0},
      "channel:matrix": {"status": "error", "updated_at": "…", "last_ok": "…", "last_error": "401 Unauthorized", "restart_count": 3}
    }
  }
}
Each component carries status (starting / ok / error), last_ok, last_error, and restart_count. Watch for status: "error" and climbing restart_count.
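Watching for errors and climbing restart counts can be scripted. The following is a minimal sketch, assuming jq is installed and that the snapshot has the shape of the sample response; the function name failing_components and the restart threshold are hypothetical and not part of ZeroClaw.

```sh
# Hypothetical helper: reads a /health snapshot on stdin and prints one
# tab-separated line per component whose status is "error" or whose
# restart_count is above a threshold. Assumes the JSON shape of the sample.
failing_components() {
  max_restarts="${1:-2}"   # illustrative threshold, not a ZeroClaw default
  jq -r --argjson max "$max_restarts" '
    .runtime.components // {}
    | to_entries[]
    | select(.value.status == "error" or (.value.restart_count // 0) > $max)
    | "\(.key)\t\(.value.status)\trestarts=\(.value.restart_count // 0)\t\(.value.last_error // "-")"
  '
}
```

Piping the endpoint into it, for example `curl -s http://localhost:42617/health | failing_components 2`, prints nothing when every component is healthy, so an empty result can serve as the "all clear" condition in a cron job or alert rule.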
3. Provider reliability
Providers surface as components in the same /health snapshot. For request-level signal (latency, success rate, token counts), scrape /metrics (see below) and read zeroclaw_llm_requests_total and zeroclaw_request_latency_seconds.
4. Tool-call volume and metrics
/metrics returns Prometheus text exposition. It requires [observability] backend = "prometheus" in config; without it the endpoint returns a one-line “backend not enabled” hint.
curl -s http://localhost:42617/metrics
zeroclaw_tool_calls_total{success="true",tool="shell"} 342
zeroclaw_tool_calls_total{success="false",tool="shell"} 6
zeroclaw_tool_calls_total{success="true",tool="file_write"} 89
The zeroclaw_tool_calls_total counter is labelled by tool and success ("true"/"false"). A rising success="false" count for one tool is worth looking at: either a policy block, a misbehaving agent, or a flaky tool. Other useful series include zeroclaw_llm_requests_total, zeroclaw_errors_total, zeroclaw_active_sessions, and zeroclaw_tokens_input_total / zeroclaw_tokens_output_total.
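To turn the raw counters into a per-tool failure rate, something like the sketch below may help. It assumes the label layout shown in the sample output and sample lines without a trailing timestamp; the function name tool_failure_rates is hypothetical.

```sh
# Hypothetical helper: reads Prometheus text exposition on stdin and prints
# "<tool> <failed> <total> <failure %>" for each tool seen in
# zeroclaw_tool_calls_total. Assumes the counter value is the last field.
tool_failure_rates() {
  awk '
    index($0, "zeroclaw_tool_calls_total{") == 1 {
      # Pull the tool and success labels out of the label set, in either order.
      tool = $0; sub(/.*tool="/, "", tool); sub(/".*/, "", tool)
      ok = $0;   sub(/.*success="/, "", ok); sub(/".*/, "", ok)
      total[tool] += $NF
      if (ok == "false") failed[tool] += $NF
    }
    END {
      for (t in total)
        printf "%s %d %d %.1f\n", t, failed[t], total[t], (total[t] ? 100 * failed[t] / total[t] : 0)
    }
  ' | sort
}
```

Run it as `curl -s http://localhost:42617/metrics | tool_failure_rates`. Because these are cumulative counters, the percentage covers the whole lifetime of the process; comparing two scrapes taken some minutes apart gives a better picture of a recent spike.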
Capacity
A single ZeroClaw instance can handle:
- Multiple concurrent conversations across all channels
- Tool calls at whatever rate the provider and sandbox allow
- Long-running agent loops (tool chains of 20+ calls)
Scale horizontally by running one instance per workspace. Don’t run two daemons against the same workspace: SQLite allows only one writer at a time, so the two processes will contend for locks and can ultimately corrupt the data.
For multi-tenant hosting, see the proposal in #2765 (closed, kept for historical reference), which describes the architecture for in-process multi-workspace routing.
Backups
What to back up:
- ~/.zeroclaw/data/memory/*.db: SQLite conversation memory (brain.db, plus audit.db)
- ~/.zeroclaw/data/sessions/: persisted session state
- ~/.zeroclaw/.secret_key: master key for the encrypted secrets store (if used). Without it, the config’s encrypted secrets are unrecoverable.
A plain tar czf zeroclaw-$(date +%F).tar.gz ~/.zeroclaw covers everything. Restic, borg, or Duplicacy work fine for incremental backups.
~/.zeroclaw/data/memory/response_cache.db is a regenerable LLM response cache; it’s safe to include in a full-directory backup or to exclude to save space. Tool receipts are in-band HMAC tokens in the conversation history (see Tool receipts), not an on-disk log, so there is nothing separate to back up for them.
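For a backup that skips the regenerable cache, a sketch along these lines could work. It assumes a tar that supports --exclude (GNU tar and bsdtar both do); the function name backup_zeroclaw and its arguments are hypothetical.

```sh
# Hypothetical helper: archive a ZeroClaw home directory, skipping the
# regenerable response cache and any sidecar files that share its name prefix.
# Arguments: <zeroclaw home> <output directory>; prints the archive path.
backup_zeroclaw() {
  src="${1:-$HOME/.zeroclaw}"
  dest="${2:-.}"
  archive="$dest/zeroclaw-$(date +%F).tar.gz"
  # -C keeps archive paths relative to the parent of the home directory.
  tar -czf "$archive" \
      --exclude='response_cache.db*' \
      -C "$(dirname "$src")" "$(basename "$src")" &&
    printf '%s\n' "$archive"
}
```

Calling `backup_zeroclaw ~/.zeroclaw /var/backups` writes a dated archive and prints its path. Copying a SQLite file while it is being written can capture an inconsistent snapshot, so it may be safer to stop the service before archiving.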
Updates
The service does not auto-update. Subscribe to the release feed (GitHub releases or the Discord #releases channel; see Contributing → Communication). A typical update goes:
- Read the release notes
- Back up ~/.zeroclaw/
- Update the binary (brew upgrade, bootstrap re-run, or cargo install --force)
- zeroclaw service restart
- Verify the /health endpoint reports status: "ok" with no component in error
If the new version requires config migrations, the startup log emits a warning and the binary usually migrates the config automatically. After upgrading, run zeroclaw config list to spot-check values, or zeroclaw config migrate to apply any pending schema migrations manually.
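The post-restart health verification can be automated with a small polling loop. This is a sketch only: it assumes jq and curl are installed and that /health has the shape shown under "Channel and component health"; wait_healthy and the HEALTH_CMD override are hypothetical names introduced here.

```sh
# Hypothetical helper: poll a health snapshot until status is "ok" and no
# component reports "error", or give up after a number of attempts.
# HEALTH_CMD is an assumption that lets the fetch command be swapped out.
HEALTH_CMD="${HEALTH_CMD:-curl -fsS http://localhost:42617/health}"

wait_healthy() {
  attempts="${1:-10}"   # illustrative defaults, not ZeroClaw settings
  delay="${2:-3}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # jq -e exits non-zero when the expression evaluates to false or null.
    if $HEALTH_CMD 2>/dev/null | jq -e '
         .status == "ok"
         and ([.runtime.components[]? | select(.status == "error")] | length == 0)
       ' >/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

After `zeroclaw service restart`, running `wait_healthy 10 3 || echo "still unhealthy"` gives components roughly half a minute to come up before the upgrade is treated as failed.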
See also
- Setup → Service management: install/remove/logs per platform
- Logs & observability
- Troubleshooting
- Network deployment