Pain analysis: why restart feels like a lie
Pain one: version skew without obvious errors. The config file updates silently during first boot of a new build. Operators assume success because JSON still parses, yet service actions fail because the supervising binary is older than the stamp.
Pain two: PATH schizophrenia. Homebrew, global npm, and manual symlinks frequently disagree between interactive shells and launchd plists. which openclaw in Terminal is not authoritative for the service.
Pain three: duplicate units. Deep scans reveal parallel gateway-like services from older experiments. Doctor and install fight each other until duplicates are retired.
Pain four: pairing red herrings. Pairing loops after upgrades should only be attempted once versions and PATH align; otherwise you chase identity while the real issue is binaries.
Pain five: mixing with remote URL issues. If gateway.remote.url still targets a decommissioned host, symptoms resemble split-brain. Split the hypotheses with which -a before touching pairing.
Layered signals: old binary refusal versus real corruption
L0 capture openclaw --version and any gateway subcommand version lines plus the gateway status summary.
L1 compare meta.lastTouchedVersion with the CLI major line. If the file was touched by a newer build, stop attempting service mutations from an older binary.
L2 run openclaw gateway status --deep and archive hints about system-level units, extra launchd labels, or schtasks on Windows hosts.
L3 run openclaw doctor after PATH alignment; use doctor --fix only when versions match, respecting security audit behavior from the 4.14 article.
L4 only after RPC is healthy should you run channels status --probe; otherwise follow the channel silence playbook.
Evidence bundle for tickets
Paste into every incident: both version strings, Config(cli) versus Config(service) paths, meta.lastTouchedVersion, which -a openclaw, supervisor unit names, and the deep scan excerpt. Hourly deltas during upgrade windows make regressions obvious.
On remote Mac farms also attach CPU and IO snapshots so disk starvation is not misread as RPC failure.
Matrix: PATH first or doctor first
| Signature | Meaning | First move | Avoid |
|---|---|---|---|
| Service mutation refused | lastTouchedVersion newer than CLI | Fix PATH to new dist then install --force | Running doctor --fix on stale binary |
| Deep lists duplicates | Stale plist or systemd unit | Disable extras per official hints | Keeping two launchd labels |
| Config paths differ only | HOME drift in plist | Align EnvironmentFile and shell | Editing JSON without reloading service |
| RPC fails with bad remote URL | Wrong host | Switch to remote gateway article | Pairing resets while URL is wrong |
Seven-step runbook
- Freeze evidence with full
gateway statusand--deepoutput. - List binaries via
which -a openclawand symlink targets. - Align PATH for shell and supervisor to the same dist prefix; absolute paths in plists when needed.
- Verify versions match before service mutations.
- Reinstall metadata with
openclaw gateway install --forcethen cold restart. - Doctor loop to clear blocking checks.
- Regression probe record RPC latency and a successful UI action as the new baseline.
Example skeleton
which -a openclaw
openclaw --version
openclaw gateway status --deep
openclaw doctorDrill scenarios before the next upgrade
Quarterly drills should simulate PATH drift: temporarily prepend a dummy directory with an older wrapper script and confirm monitoring catches the wrong resolution before production traffic does. Another drill removes read permissions on the new dist to ensure alerts distinguish permission failures from version skew.
Inject duplicate plist labels in a staging Mac and verify that deep output lists both, and that your runbook states which label is canonical. Rotate engineers through the drill so muscle memory is not single-person dependent.
Document rollback: if install --force fails, which tarball or package version is known good, and how long it takes to restore. Split-brain incidents are stressful; rehearsed rollback shortens mean time to recovery even when root cause is still ambiguous.
Ownership and upgrade hygiene
Assign one owner for PATH and one for supervisor units. Without ownership, every upgrade becomes a scavenger hunt. Keep a changelog entry whenever plists change, including the exact PATH string embedded.
Track upstream release notes for breaking changes to config keys; align doctor templates before rollout day. Pair release notes with automated smoke tests that assert version strings and RPC probes from CI, not only from laptops.
When multiple teams share a gateway host, namespace working directories and ensure CI jobs never mutate global npm prefixes concurrently with interactive upgrades.
Observability and SLOs: make split-brain visible before users notice
Treat version skew as a first-class metric. Emit a nightly job that compares the resolved openclaw binary path from a non-interactive shell with the path embedded in the supervisor unit. Alert when they diverge, even if RPC still appears green. Pair that check with a lightweight RPC probe from the same non-interactive context used by automation.
Service level objectives should include time-to-detect PATH drift, not only uptime of the gateway process. Many teams only page on process death, while split-brain manifests as partial failures that confuse support but never crash the binary.
Log shipping should preserve the first lines of gateway status output after each upgrade, hashed for deduplication, so you can diff across releases without opening laptops. Store redacted token prefixes separately from full secrets to keep dashboards safe.
When using remote Mac fleets, aggregate skew counts per host to see whether a subset of machines missed a rollout script. That pattern often reveals a broken Ansible task or a manual fix applied only to staging.
Finally, annotate dashboards with the expected OpenClaw version for the week. Humans spot mismatches faster when the expected number is visible next to the observed number.
Packaging ecosystems: Homebrew, npm, and manual tarballs
Homebrew upgrades frequently reshuffle Cellar paths while leaving launchd plist PATH variables stale. After every brew upgrade that touches OpenClaw, rerun gateway install --force from the brew-linked binary and verify the plist references the new prefix.
Global npm installs can shadow brew if PATH lists npm before brew. Document the canonical order for production hosts and enforce it with configuration management. For developer laptops, accept divergence but never use laptop PATH assumptions when debugging server gateways.
Manual tarballs extracted under /opt need explicit versioned directories and symlinks similar to app bundles. Avoid unpacking multiple versions into the same directory tree; collisions there mimic split-brain without touching meta stamps at all.
Containerized gateways should pin image digests and rebuild when base images rotate. Record the digest in the same evidence bundle as openclaw versions so upgrades across orchestrators remain auditable.
Windows hosts mixing WSL and native services need parallel discipline: ensure only one side supervises the gateway, and document which side owns upgrades. Deep scans on macOS have analogs on Windows via scheduled tasks inventory exports.
Communication patterns: how to write the postmortem
Postmortems for split-brain should name the supervising binary path explicitly, not only the semantic version number. Readers six months later need to know whether the failure was a packaging bug, a PATH regression, or a partial rollout. Attach the diff of plist or unit files when available.
Avoid blameful language toward individuals who ran brew upgrade during an incident; instead capture the missing guardrail that should have alerted earlier. Organizations that reward early detection over heroic late-night fixes see fewer repeats.
Include customer impact in minutes, even if external users saw no outage, because internal developer hours lost to confusion still count as cost. Quantify retry storms on CI jobs that attempted gateway restarts from stale binaries.
Close the loop by linking the postmortem to a concrete ticket that adds automation: a nightly PATH check, a stricter install script, or documentation updates. Split-brain repeats when lessons stay tribal.
Finally, schedule a follow-up review thirty days later to confirm metrics improved. Without that checkpoint, regressions creep back alongside the next rapid release train.
Security intersections: why doctor is not always safe first
Security-hardened releases may reject certain config patches until audits pass. Running doctor --fix from an older binary can attempt transformations that the newer security layer would block, producing confusing partial writes. Align versions first, then read release notes about audit tooling changes before automated fixes.
When workspace access tightened in recent releases, doctor may suggest narrowing scopes. Apply those suggestions only after confirming dependent skills still load; snapshot the workspace tree before and after.
Tokens should rotate on a schedule independent of OpenClaw upgrades, but document overlaps where token rotation and binary upgrades happen the same weekend. Two simultaneous changes multiply ambiguity in logs.
Keep a short compatibility matrix in your internal wiki listing which doctor flags are safe on which OpenClaw minor lines. That table becomes invaluable when multiple teams pin different minor versions across environments.
Add one line to every upgrade checklist: confirm the resolved binary path before declaring success. That single habit prevents most split-brain surprises.
When in doubt, capture five minutes of openclaw logs --follow around the failed command; timestamps beside PATH resolution errors often tell the story faster than theory.
Share those snippets in the ticket so the next engineer starts from evidence, not folklore about mysterious restarts that nobody can reproduce twice in production reliably across teams without finger-pointing. That is what good observability buys you.
Related reading
Remote gateway URL issues: remote gateway matrix. Pairing after upgrades: pairing article. Telegram silent init: Telegram restart guide.
Laptops sleep and multiply installs; a rented remote Mac with a single supervised gateway often lowers total cost of ownership when you need seven-by-twenty-four stability plus file delivery on the same machine class.
FAQ
Does reinstall alone fix split-brain?
Only if PATH and duplicate units are cleaned first; otherwise reinstall rebinds the wrong binary.
Docker plus bare metal?
Pick one supervisor pattern; deep output should list a single intended path.
Value and limits
This article covers same-host skew, not provider outages or model rate limits.
Why SFTPMAC remote Mac
Single truth for binaries, stable ingress, and colocated artifact workflows reduce PATH and plist drift while you keep tokens and runbooks.
