Operations: Overview

How to run ZeroClaw in production. The surface is intentionally small: one binary, one config file, and one install root with a handful of runtime stores. Most “operations” is “systemd and journald”.

This section covers:

Service & daemon: keeping the process alive
Logs & observability: reading what the agent did
Cost tracking: token spend and per-model cost
Troubleshooting: when things break
Network deployment: exposing the gateway, tunnels, reverse proxies

The shape of a deployment

A typical always-on ZeroClaw install is:

zeroclaw service                          — systemd / launchctl / Windows Service
└── zeroclaw daemon                       — the single long-running process
    ├── gateway listener  :42617          — REST / WebSocket / webhook intake
    ├── channel pollers                   — Telegram, IMAP, Nostr relays (outbound poll)
    ├── channel listeners                 — Discord / Slack / Matrix / WebSocket (inbound stream)
    ├── cron scheduler                    — scheduled SOPs and jobs
    └── agent loop  (one per session)     — provider call + tool execution
                                            ▲ driven by any listener, poller,
                                              gateway request, or cron fire

on disk (everything but the binary can move)
├── ~/.zeroclaw/config.toml               — configuration
├── ~/.zeroclaw/.secret_key               — master key for the encrypted secrets store
└── ~/.zeroclaw/data/                     — runtime state
    ├── memory/                           — agent memory backend
    ├── sessions/                         — per-session conversation stores
    └── state/                            — scheduler, cost, health, misc runtime state

logs                                      — journald / launchctl / Windows Event Log (platform-native)

Everything except the binary can move. The data dir defaults to ~/.zeroclaw/data/ (the legacy ~/.zeroclaw/workspace/ name is still accepted); config paths resolve per environment (Homebrew vs. bootstrap vs. XDG), and log destinations are platform-native by default. For the full store map, see Runtime state and persistence.

What to monitor

Four signals matter:

1. Service liveness

Is the process running?

Linux

systemctl --user is-active zeroclaw

macOS

launchctl list | grep -c com.zeroclaw.daemon

Windows

schtasks /Query /TN "ZeroClaw Daemon" /FO LIST | findstr Status

If it’s dying repeatedly, check Troubleshooting → Daemon keeps restarting.

2. Channel and component health

The gateway exposes a component health snapshot at /health (public, no secrets) and /api/health (authenticated). Channels, providers, and other long-running components register themselves in the components map as they start, report OK, or error.

sh

curl -s http://localhost:42617/health | jq

{
  "status": "ok",
  "paired": true,
  "require_pairing": true,
  "runtime": {
    "pid": 4821,
    "updated_at": "2026-06-08T09:00:00+00:00",
    "uptime_seconds": 3600,
    "components": {
      "channel:telegram": {"status": "ok", "updated_at": "…", "last_ok": "…", "last_error": null, "restart_count": 0},
      "channel:matrix":   {"status": "error", "updated_at": "…", "last_ok": "…", "last_error": "401 Unauthorized", "restart_count": 3}
    }
  }
}

Each component carries status (starting / ok / error), last_ok, last_error, and restart_count. Watch for status: "error" and climbing restart_count.

3. Provider reliability

Providers surface as components in the same /health snapshot. For request-level signal (latency, success rate, token counts), scrape /metrics (see below) and read zeroclaw_llm_requests_total and zeroclaw_request_latency_seconds.

4. Tool-call volume and metrics

/metrics returns Prometheus text exposition. It requires [observability] backend = "prometheus" in config; without it the endpoint returns a one-line “backend not enabled” hint.

sh

curl -s http://localhost:42617/metrics

zeroclaw_tool_calls_total{success="true",tool="shell"} 342
zeroclaw_tool_calls_total{success="false",tool="shell"} 6
zeroclaw_tool_calls_total{success="true",tool="file_write"} 89

The zeroclaw_tool_calls_total counter is labelled by tool and success ("true"/"false"). A rising success="false" count for one tool is worth looking at: either a policy block, a misbehaving agent, or a flaky tool. Other useful series include zeroclaw_llm_requests_total, zeroclaw_errors_total, zeroclaw_active_sessions, and zeroclaw_tokens_input_total / zeroclaw_tokens_output_total.

Capacity

A single ZeroClaw instance can handle:

Multiple concurrent conversations across all channels
Tool calls at whatever rate the provider and sandbox allow
Long-running agent loops (tool chains of 20+ calls)

Scale laterally by running one instance per workspace. Don’t try to run two daemons on the same workspace: SQLite’s single-writer model will produce lock contention and ultimately corruption.

For multi-tenant hosting, see the proposal in #2765 (closed, historical, the architecture for in-process multi-workspace routing).

Backups

What to back up:

~/.zeroclaw/data/memory/*.db: SQLite conversation memory (brain.db, plus audit.db)
~/.zeroclaw/data/sessions/: persisted session state
~/.zeroclaw/.secret_key: master key for the encrypted secrets store (if used). Without it, the config’s encrypted secrets are unrecoverable.

A plain tar czf zeroclaw-$(date +%F).tar.gz ~/.zeroclaw covers everything. Restic, borg, or Duplicacy work fine for incremental backups.

~/.zeroclaw/data/memory/response_cache.db is a regenerable LLM response cache; it’s safe to include in a full-directory backup or to exclude to save space. Tool receipts are in-band HMAC tokens in the conversation history (see Tool receipts), not an on-disk log, so there is nothing separate to back up for them.

Updates

The service does not auto-update. Subscribe to the release feed (GitHub releases or the Discord #releases channel: see Contributing → Communication). Typical update cadence:

Read the release notes
Back up ~/.zeroclaw/
Update the binary (brew upgrade, bootstrap re-run, or cargo install --force)
zeroclaw service restart
Verify the /health endpoint reports status: "ok" with no component in error

If the new version requires config migrations, the startup log emits a warning and the binary usually auto-migrates. Check zeroclaw config list to spot-check values after upgrade, and zeroclaw config migrate to apply any pending schema migrations manually.

ZeroClaw Docs