Split-brain vs remote gateway drift?

Split-brain is same-host dual binaries and supervisors; remote drift is wrong gateway.remote.url or tokens. Use which -a and deep output first.

Why does old CLI refuse restart?

Newer binary stamped the config; older binaries refuse mutating service state until PATH points to the newer build.

When rent a remote Mac?

When laptops, sleep, and multiple installs keep reintroducing PATH and plist drift while you need 7x24 stability.

2026 OpenClaw Upgrade Split-Brain: meta.lastTouchedVersion, gateway status --deep, PATH Repair

Pain analysis: why restart feels like a lie

Pain one: version skew without obvious errors. The config file updates silently during first boot of a new build. Operators assume success because JSON still parses, yet service actions fail because the supervising binary is older than the stamp.

Pain two: PATH schizophrenia. Homebrew, global npm, and manual symlinks frequently disagree between interactive shells and launchd plists. which openclaw in Terminal is not authoritative for the service.

Pain three: duplicate units. Deep scans reveal parallel gateway-like services from older experiments. Doctor and install fight each other until duplicates are retired.

Pain four: pairing red herrings. Pairing loops after upgrades should only be attempted once versions and PATH align; otherwise you chase identity while the real issue is binaries.

Pain five: mixing with remote URL issues. If gateway.remote.url still targets a decommissioned host, symptoms resemble split-brain. Split the hypotheses with which -a before touching pairing.

Layered signals: old binary refusal versus real corruption

L0 capture openclaw --version and any gateway subcommand version lines plus the gateway status summary.

L1 compare meta.lastTouchedVersion with the CLI major line. If the file was touched by a newer build, stop attempting service mutations from an older binary.

L2 run openclaw gateway status --deep and archive hints about system-level units, extra launchd labels, or schtasks on Windows hosts.

L3 run openclaw doctor after PATH alignment; use doctor --fix only when versions match, respecting security audit behavior from the 4.14 article.

L4 only after RPC is healthy should you run channels status --probe; otherwise follow the channel silence playbook.

Evidence bundle for tickets

Paste into every incident: both version strings, Config(cli) versus Config(service) paths, meta.lastTouchedVersion, which -a openclaw, supervisor unit names, and the deep scan excerpt. Hourly deltas during upgrade windows make regressions obvious.

On remote Mac farms also attach CPU and IO snapshots so disk starvation is not misread as RPC failure.

Matrix: PATH first or doctor first

Signature	Meaning	First move	Avoid
Service mutation refused	lastTouchedVersion newer than CLI	Fix PATH to new dist then install --force	Running doctor --fix on stale binary
Deep lists duplicates	Stale plist or systemd unit	Disable extras per official hints	Keeping two launchd labels
Config paths differ only	HOME drift in plist	Align EnvironmentFile and shell	Editing JSON without reloading service
RPC fails with bad remote URL	Wrong host	Switch to remote gateway article	Pairing resets while URL is wrong

Seven-step runbook

Freeze evidence with full gateway status and --deep output.
List binaries via which -a openclaw and symlink targets.
Align PATH for shell and supervisor to the same dist prefix; absolute paths in plists when needed.
Verify versions match before service mutations.
Reinstall metadata with openclaw gateway install --force then cold restart.
Doctor loop to clear blocking checks.
Regression probe record RPC latency and a successful UI action as the new baseline.

Example skeleton

which -a openclaw
openclaw --version
openclaw gateway status --deep
openclaw doctor

Drill scenarios before the next upgrade

Quarterly drills should simulate PATH drift: temporarily prepend a dummy directory with an older wrapper script and confirm monitoring catches the wrong resolution before production traffic does. Another drill removes read permissions on the new dist to ensure alerts distinguish permission failures from version skew.

Inject duplicate plist labels in a staging Mac and verify that deep output lists both, and that your runbook states which label is canonical. Rotate engineers through the drill so muscle memory is not single-person dependent.

Document rollback: if install --force fails, which tarball or package version is known good, and how long it takes to restore. Split-brain incidents are stressful; rehearsed rollback shortens mean time to recovery even when root cause is still ambiguous.

Ownership and upgrade hygiene

Assign one owner for PATH and one for supervisor units. Without ownership, every upgrade becomes a scavenger hunt. Keep a changelog entry whenever plists change, including the exact PATH string embedded.

Track upstream release notes for breaking changes to config keys; align doctor templates before rollout day. Pair release notes with automated smoke tests that assert version strings and RPC probes from CI, not only from laptops.

When multiple teams share a gateway host, namespace working directories and ensure CI jobs never mutate global npm prefixes concurrently with interactive upgrades.

Observability and SLOs: make split-brain visible before users notice

Treat version skew as a first-class metric. Emit a nightly job that compares the resolved openclaw binary path from a non-interactive shell with the path embedded in the supervisor unit. Alert when they diverge, even if RPC still appears green. Pair that check with a lightweight RPC probe from the same non-interactive context used by automation.

Service level objectives should include time-to-detect PATH drift, not only uptime of the gateway process. Many teams only page on process death, while split-brain manifests as partial failures that confuse support but never crash the binary.

Log shipping should preserve the first lines of gateway status output after each upgrade, hashed for deduplication, so you can diff across releases without opening laptops. Store redacted token prefixes separately from full secrets to keep dashboards safe.

When using remote Mac fleets, aggregate skew counts per host to see whether a subset of machines missed a rollout script. That pattern often reveals a broken Ansible task or a manual fix applied only to staging.

Finally, annotate dashboards with the expected OpenClaw version for the week. Humans spot mismatches faster when the expected number is visible next to the observed number.

Packaging ecosystems: Homebrew, npm, and manual tarballs

Homebrew upgrades frequently reshuffle Cellar paths while leaving launchd plist PATH variables stale. After every brew upgrade that touches OpenClaw, rerun gateway install --force from the brew-linked binary and verify the plist references the new prefix.

Global npm installs can shadow brew if PATH lists npm before brew. Document the canonical order for production hosts and enforce it with configuration management. For developer laptops, accept divergence but never use laptop PATH assumptions when debugging server gateways.

Manual tarballs extracted under /opt need explicit versioned directories and symlinks similar to app bundles. Avoid unpacking multiple versions into the same directory tree; collisions there mimic split-brain without touching meta stamps at all.

Containerized gateways should pin image digests and rebuild when base images rotate. Record the digest in the same evidence bundle as openclaw versions so upgrades across orchestrators remain auditable.

Windows hosts mixing WSL and native services need parallel discipline: ensure only one side supervises the gateway, and document which side owns upgrades. Deep scans on macOS have analogs on Windows via scheduled tasks inventory exports.

Communication patterns: how to write the postmortem

Postmortems for split-brain should name the supervising binary path explicitly, not only the semantic version number. Readers six months later need to know whether the failure was a packaging bug, a PATH regression, or a partial rollout. Attach the diff of plist or unit files when available.

Avoid blameful language toward individuals who ran brew upgrade during an incident; instead capture the missing guardrail that should have alerted earlier. Organizations that reward early detection over heroic late-night fixes see fewer repeats.

Include customer impact in minutes, even if external users saw no outage, because internal developer hours lost to confusion still count as cost. Quantify retry storms on CI jobs that attempted gateway restarts from stale binaries.

Close the loop by linking the postmortem to a concrete ticket that adds automation: a nightly PATH check, a stricter install script, or documentation updates. Split-brain repeats when lessons stay tribal.

Finally, schedule a follow-up review thirty days later to confirm metrics improved. Without that checkpoint, regressions creep back alongside the next rapid release train.

Security intersections: why doctor is not always safe first

Security-hardened releases may reject certain config patches until audits pass. Running doctor --fix from an older binary can attempt transformations that the newer security layer would block, producing confusing partial writes. Align versions first, then read release notes about audit tooling changes before automated fixes.

When workspace access tightened in recent releases, doctor may suggest narrowing scopes. Apply those suggestions only after confirming dependent skills still load; snapshot the workspace tree before and after.

Tokens should rotate on a schedule independent of OpenClaw upgrades, but document overlaps where token rotation and binary upgrades happen the same weekend. Two simultaneous changes multiply ambiguity in logs.

Keep a short compatibility matrix in your internal wiki listing which doctor flags are safe on which OpenClaw minor lines. That table becomes invaluable when multiple teams pin different minor versions across environments.

Add one line to every upgrade checklist: confirm the resolved binary path before declaring success. That single habit prevents most split-brain surprises.

When in doubt, capture five minutes of openclaw logs --follow around the failed command; timestamps beside PATH resolution errors often tell the story faster than theory.

Share those snippets in the ticket so the next engineer starts from evidence, not folklore about mysterious restarts that nobody can reproduce twice in production reliably across teams without finger-pointing. That is what good observability buys you.

FAQ

Does reinstall alone fix split-brain?

Only if PATH and duplicate units are cleaned first; otherwise reinstall rebinds the wrong binary.

Docker plus bare metal?

Pick one supervisor pattern; deep output should list a single intended path.

Value and limits

This article covers same-host skew, not provider outages or model rate limits.

Why SFTPMAC remote Mac

Single truth for binaries, stable ingress, and colocated artifact workflows reduce PATH and plist drift while you keep tokens and runbooks.

2026 OpenClaw Upgrade Split-Brain: meta.lastTouchedVersion, gateway status --deep, and PATH Alignment