2026 OpenClaw on Linux systemd after upgrade: HOME user drift, empty-looking gateway config, merge recovery runbook
After upgrades, user services may still bind while reading a skeleton tree under a different HOME or XDG path than your SSH session. Treat the symptom as an environment and merge problem first, then run the official troubleshooting ladder. This note targets bare-metal or VM Linux with systemd user units, not container-only installs.
1. Three-layer symptoms: process up, RPC odd, UI skeleton
Process layer: systemctl --user reports active while the gateway still reads a skeleton openclaw.json because the unit never exported the same HOME as your SSH session.
RPC layer: gateway status may show a build id that disagrees with the CLI; read the split-brain article before deleting directories.
UI layer: the browser UI can load static assets while websocket auth still points at an empty credential store; run the official probe ladder first.
- Capture the first two hundred journal lines after restart and grep for ENOENT, HOME, and config path tokens.
- Diff three environments: interactive login, sudo non-login, and systemctl --user show.
- Always tarball old and new trees before merging human-edited JSON.
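The journal capture in the first bullet can be sketched as a small filter. This is a minimal sketch, not an official probe: the function name drift_hints is hypothetical, and the token list is an assumption built from the paths this note mentions.

```shell
# Sketch: keep only journal lines that hint at HOME or config-path drift.
# Reads journal text on stdin so it can be exercised without systemd.
# drift_hints is a hypothetical helper name, not an OpenClaw command.
drift_hints() {
  # Look only at the first 200 lines, then match the tokens named above.
  # "|| true" keeps an empty result from looking like a failure.
  head -n 200 | grep -E 'ENOENT|HOME|openclaw\.json|\.config/openclaw|\.openclaw' || true
}

# Example, assuming the unit name used later in this note:
#   journalctl --user -u openclaw-gateway.service --since "5 minutes ago" --no-pager | drift_hints
```

Adjust the --since window to the restart you are investigating; an empty result only means none of these tokens appeared in that window.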
Seeing a listening TCP port from ss -lntp only proves a process bound the port, not that it opened the same openclaw.json you edited under your personal home directory. Many teams install the CLI with sudo for a shared service account while still editing configs as root or as a deploy user; each identity owns a different HOME, so the gateway can happily serve traffic from a skeleton tree while your manual edits never load.
The XDG base directory variables matter when upstream changes default search order between minors. If XDG_CONFIG_HOME is empty in the unit but set in your shell, the gateway may prefer ~/.config/openclaw in one context and ~/.openclaw in another. Write the resolved absolute paths into the change ticket, not shorthand like tilde paths that depend on whichever user expanded them.
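To make "resolved absolute paths" concrete, the following sketch prints both candidate directories for a given HOME and XDG_CONFIG_HOME. The function name config_candidates is hypothetical, and the sketch does not claim to know which of the two directories a given OpenClaw version prefers; it only applies the XDG rule that an empty or unset XDG_CONFIG_HOME falls back to $HOME/.config.

```shell
# Sketch: print the absolute config directories a process could consider.
# Arguments: $1 = HOME of the identity, $2 = its XDG_CONFIG_HOME (may be empty).
# The search order between the two lines is NOT asserted here.
config_candidates() {
  local home="$1" xdg="${2:-}"
  # XDG Base Directory rule: empty or unset XDG_CONFIG_HOME means $HOME/.config.
  printf '%s\n' "${xdg:-$home/.config}/openclaw" "$home/.openclaw"
}

# Example: record what the unit's identity and your shell would each resolve.
#   config_candidates /home/deploy ""
#   config_candidates "$HOME" "${XDG_CONFIG_HOME:-}"
```

Paste the printed lines into the change ticket for each identity so reviewers compare absolute paths rather than tildes.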
When plugins disappear from the UI but files still exist on disk, suspect a second process started by cron or a leftover root-owned unit. Use systemctl --user list-units 'openclaw*' together with pgrep -af openclaw to ensure you are not debugging the wrong PID while the real gateway reads a different tree.
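Once pgrep returns more than one PID, it helps to ask each process which HOME it actually started with. On Linux that is recorded in /proc/PID/environ as NUL-separated pairs. The helper name home_from_environ is hypothetical, and reading another account's environ file normally requires the same UID or root.

```shell
# Sketch: read HOME from a NUL-separated environ file such as /proc/PID/environ.
# Takes the file path as an argument so it can be tested against a fixture.
home_from_environ() {
  # Convert NUL separators to newlines, then keep the HOME value only.
  tr '\0' '\n' < "$1" | sed -n 's/^HOME=//p' | head -n 1
}

# Example: list every matching PID with the HOME it was started under.
#   for pid in $(pgrep -f openclaw); do
#     echo "$pid $(home_from_environ "/proc/$pid/environ")"
#   done
```

Two PIDs reporting different HOME values is strong evidence that you have a stray gateway reading a second tree.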
2. status to gateway status to logs to doctor
Follow the same ordering as the gateway probe article: prove listener and RPC before blaming upstream model providers.
A systemd --user unit does not inherit your interactive shell profile. PAM exports for graphical login, SSH AcceptEnv, and ~/.bashrc can all disagree with the slice that starts openclaw gateway. When HOME is unset, Node and many CLIs still pick a writable directory, which is why you can see a fresh skeleton openclaw.json while your real file lives under the account you use over SSH.
Use systemctl --user show UNIT -p Environment -p WorkingDirectory -p FragmentPath as the authoritative truth. Compare it with printenv HOME XDG_CONFIG_HOME XDG_STATE_HOME from a login shell, a non-login shell started with plain su (su - starts a login shell), and any CI runner that might have installed the package. If only one row disagrees, fix that row with a drop-in instead of copying trees blindly.
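The comparison is easier when the unit side is reduced to a single value. The sketch below pulls one variable out of the Environment= line that systemctl show prints. The helper name unit_env_value is hypothetical, and values containing spaces (which systemd quotes) are deliberately not handled.

```shell
# Sketch: extract one variable from the "Environment=" line printed by
#   systemctl --user show UNIT -p Environment
# Limitation: quoted values containing spaces are not parsed correctly.
unit_env_value() {
  local name="$1"
  sed -n 's/^Environment=//p' | tr ' ' '\n' | sed -n "s/^${name}=//p" | head -n 1
}

# Example, assuming the unit name used later in this note:
#   systemctl --user show openclaw-gateway.service -p Environment | unit_env_value HOME
#   printenv HOME
```

An empty result means the unit does not set the variable explicitly; the process may still inherit a value from the user manager, which systemctl --user show-environment lists.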
WorkingDirectory changes how relative paths inside ExecStart resolve. If your unit still points at an old checkout after an npm global move, the gateway may boot with default plugins while the UI looks empty. Record the absolute path returned by openclaw gateway status after each restart so auditors can diff it against the tarball you took before merge.
Headless hosts need loginctl enable-linger for the owning user; otherwise the user manager exits when the last session closes and you chase ghosts across SSH reconnects. Pair linger checks with journalctl --user -u UNIT -b so you can correlate restarts with package upgrades instead of guessing from wall-clock time.
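A linger check is easy to script so it can gate a restart. loginctl show-user prints a Linger=yes or Linger=no line; the helper name linger_enabled below is hypothetical.

```shell
# Sketch: succeed only when the input contains the exact line "Linger=yes".
# Reads stdin so the check can be tested without a running logind.
linger_enabled() {
  grep -qx 'Linger=yes'
}

# Example:
#   if ! loginctl show-user "$USER" -p Linger | linger_enabled; then
#     echo "linger is off; consider: loginctl enable-linger $USER"
#   fi
```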
When semver between CLI and gateway diverges, read the split-brain runbook first. HOME drift and version drift stack: fixing only one leaves channels half-registered. Capture openclaw gateway status --deep output when available, then run openclaw doctor, then capture status again so the ticket shows a before and after pair.
Merge credentials last. Tokens and OAuth material should move only after channels and plugin blocks validate, because a partial merge that touches secrets first can lock you out of chat providers while the UI still renders. Keep two tarballs: one pre-merge and one post-doctor, each named with UTC timestamp and host short name.
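The naming rule for the two tarballs can be captured in one helper so both archives follow the same pattern. This is a sketch: the function name snapshot_name and the stage labels premerge and postdoctor are illustrative choices, not a convention defined by OpenClaw.

```shell
# Sketch: build an archive name from a stage label, a UTC timestamp,
# and the short host name (everything before the first dot).
snapshot_name() {
  local stage="$1"
  local host
  host="$(uname -n)"
  host="${host%%.*}"
  printf 'openclaw-%s-%s-%s.tgz\n' "$stage" "$(date -u +%Y%m%dT%H%M%SZ)" "$host"
}

# Example:
#   tar czf "$HOME/$(snapshot_name premerge)" -C "$HOME" .openclaw
#   ... merge, run doctor ...
#   tar czf "$HOME/$(snapshot_name postdoctor)" -C "$HOME" .openclaw
```

Using date -u keeps names comparable across hosts in different time zones.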
For fleets, freeze the golden image before rolling an OpenClaw minor that changes default config search order. Track drop-in SHA256 and package version in the same change record. If you must run gateway install --force, do it inside a declared window with a rollback unit file in version control, not from muscle memory on a Friday evening.
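Recording the drop-in hash next to the package version takes one line per host. The sketch below assumes sha256sum from coreutils is available; the function name change_record_line and the output layout are hypothetical, and the version string is whatever your package manager reports.

```shell
# Sketch: emit one change-record line with the drop-in path, its SHA256,
# and the package version supplied by the caller.
change_record_line() {
  local dropin="$1" version="$2"
  local sum
  sum="$(sha256sum "$dropin" | awk '{print $1}')"
  printf '%s sha256=%s version=%s\n' "$dropin" "$sum" "$version"
}

# Example (path follows the usual user drop-in layout; version is a placeholder):
#   change_record_line "$HOME/.config/systemd/user/openclaw-gateway.service.d/override.conf" "X.Y.Z"
```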
Docker deployments live on a different plane: environment comes from compose or Kubernetes, not from ~/.config/systemd/user. If you mix bare-metal user units with compose on one host, label tickets clearly so reviewers do not apply Linux drop-in advice to a container that already sets HOME=/root.
macOS operators should read the launchd gateway restart article for plist and Node prefix issues. Linux systemd focuses on User=, WorkingDirectory, linger, and explicit Environment= lines. The symptoms look alike; the repair steps differ, so keep cross-links instead of copying commands between platforms.
Finally, budget fifteen minutes after any merge for channel probes and a single synthetic message per provider. Skipping that step is how silent regressions reach production even when doctor exits zero. If probes fail, capture HTTP status codes and websocket close codes before opening a vendor ticket so support can reproduce your path quickly.
3. Decision matrix: restart only versus merge versus forced install
Use this matrix inside a maintenance window; freeze images when touching fleets. Re-evaluate the row you picked if journal noise spikes right after daemon-reload, because a bad drop-in can pass syntax checks yet still export the wrong path.
| Signature | First action | Risk |
|---|---|---|
| Environment is missing HOME; gateway writes under the package tree | Add a drop-in setting HOME and WorkingDirectory, then restart | Low |
| Old tree has the full JSON; new tree is a skeleton | Merge blocks, then record the meta bump | Medium |
| CLI semver diverges and doctor requests an install | Run gateway install, or --force, inside a window | High |
For fleets, freeze the base image hash and package lockfile before rolling a gateway change, then roll one canary host first. If the canary shows a new skeleton path, stop the rollout and attach the three-way environment diff to the incident instead of pushing the same unit file everywhere.
Risk labels are relative: a low-risk drop-in edit can still brick automation if you forget to reload the user manager. Always pair systemctl --user daemon-reload with a controlled restart and a ready rollback tarball. High-risk gateway install --force should only run after semver alignment and with a second engineer reviewing the diff of generated unit fragments.
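For the low-risk row, a drop-in is three lines. The sketch below renders one from two arguments so the same text can be reviewed, hashed, and written identically on every host. The function name render_dropin and the file name override.conf are illustrative; the directory layout is the standard systemd user drop-in location.

```shell
# Sketch: print a drop-in that pins HOME and WorkingDirectory for the unit.
# Arguments: $1 = absolute HOME, $2 = absolute working directory.
render_dropin() {
  local home="$1" workdir="$2"
  printf '[Service]\nEnvironment=HOME=%s\nWorkingDirectory=%s\n' "$home" "$workdir"
}

# Example: write it, reload the user manager, then restart in a controlled way.
#   d="$HOME/.config/systemd/user/openclaw-gateway.service.d"
#   mkdir -p "$d" && render_dropin "$HOME" "$HOME" > "$d/override.conf"
#   systemctl --user daemon-reload
#   systemctl --user restart openclaw-gateway.service
```

Always pass absolute paths; a tilde written into the file is not expanded by your shell at service start.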
4. Seven-step merge checklist
systemctl --user show openclaw-gateway.service -p Environment -p WorkingDirectory
echo "$HOME" "$XDG_CONFIG_HOME"
tar czf ~/openclaw-premerge-$(date +%Y%m%d%H%M).tgz ~/.openclaw ~/.config/openclaw 2>/dev/null
- Freeze concurrent edits: stop CI jobs and ask humans not to run parallel npm i -g or editors against the same JSON while you merge.
- Export environment dumps for root, login shell, and the user unit into three text files so you can diff them side by side without retyping commands during stress.
- Use openclaw gateway status hints about config root as the source of truth for which tree the running gateway adopted, not whichever path you prefer emotionally.
- Merge channels first, then plugins, then credentials, validating each block with a quick JSON parse and a minimal smoke test before touching tokens.
- Run openclaw doctor, capture stdout and stderr, then repeat gateway status so the ticket shows a paired before and after snapshot.
- Validate linger with loginctl show-user "$USER" -p Linger on headless automation hosts where SSH sessions are ephemeral.
- Record semver, drop-in SHA256, rollback tarball path, and restart timestamp in the change system so the next responder inherits context instead of rediscovering drift.
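The environment-dump step in the checklist above can be reduced to a diff that ignores everything except HOME and the XDG variables. This is a sketch with hypothetical helper names (env_keys, env_drift); it expects plain KEY=value dumps such as the output of printenv or systemctl --user show-environment.

```shell
# Sketch: keep only HOME and XDG_* lines from an environment dump, sorted.
env_keys() {
  grep -E '^(HOME|XDG_[A-Z_]+)=' "$1" | sort
}

# Show lines that differ between two dumps; prints nothing when they agree.
env_drift() {
  left="$(mktemp)"
  right="$(mktemp)"
  env_keys "$1" > "$left"
  env_keys "$2" > "$right"
  # diff exits non-zero on differences; that is expected here.
  diff "$left" "$right" || true
  rm -f "$left" "$right"
}

# Example:
#   printenv > login.env
#   sudo printenv > root.env
#   systemctl --user show-environment > unit.env
#   env_drift login.env unit.env
#   env_drift login.env root.env
```

Lines prefixed with "<" come from the first file and ">" from the second, so the ticket shows exactly which identity disagrees.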
5. Printable checklist with concrete thresholds
Carry these rows into your runbook binder. Numbers are suggestions for small teams; tighten them if you operate regulated workloads.
| Check | Command | Expected |
|---|---|---|
| User unit HOME | systemctl --user show | Matches owning account home |
| linger | loginctl show-user | Headless sessions survive SSH exit |
| journal hints | journalctl --user -u | No repeating ENOENT on config |
| Snapshot size sanity | ls -lh ~/openclaw-premerge-*.tgz | Non-zero archive before destructive edits |
| Restart budget | Change calendar | At most one forced install attempt per host per window |
Archive at least seven days of journal excerpts after a successful merge so you can compare spike patterns if drift returns on the next upgrade. Store tarball checksums alongside unit file hashes so integrity checks survive ticket migrations between systems.
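The snapshot sanity row can be tightened beyond a visual ls check: an archive should be non-empty and readable end to end before any destructive edit. A minimal sketch, with the hypothetical helper name snapshot_ok:

```shell
# Sketch: succeed only if the archive is non-empty and tar can list it.
snapshot_ok() {
  # -s: file exists and is larger than zero bytes.
  # tar tzf: the gzip stream and tar index can be read without error.
  [ -s "$1" ] && tar tzf "$1" > /dev/null 2>&1
}

# Example: refuse to continue when any pre-merge snapshot is unusable.
#   for f in "$HOME"/openclaw-premerge-*.tgz; do
#     snapshot_ok "$f" || { echo "bad snapshot: $f" >&2; exit 1; }
#   done
```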
6. Boundaries with split brain, Docker, and macOS guides
The split-brain article covers semver alignment, meta.lastTouchedVersion, and deep gateway status output when CLI and daemon disagree. Use it when doctor keeps flagging version skew even after HOME is fixed.
The Docker Compose token article covers OPENCLAW_GATEWAY_TOKEN alignment and websocket pairing inside containers. Do not copy Linux drop-in snippets into compose files verbatim; instead map the same variables through compose syntax and keep secrets out of git.
The macOS gateway restart article covers launchd plist reload and Node path churn. Symptoms resemble Linux drift, but the remediation targets launchctl and plist ownership, not loginctl linger.
Linux systemd work centers on explicit User=, WorkingDirectory=, EnvironmentFile=, linger, and journal correlation with package upgrades. Keep those concerns separated in tickets so future readers pick the right playbook within five minutes.
7. FAQ
Q: Can I delete the skeleton tree and copy old files? A: Stop the service and snapshot first; partial writes while the gateway runs can corrupt JSON. Prefer block merges with validation over wholesale deletes.
Q: What if root and a normal user each start a gateway? A: You will fight over ports and config roots. Pick one identity, set User= explicitly, and disable the stray unit.
Q: Should I bump OpenClaw weekly on production? A: Only with automation for snapshots and canaries; otherwise drift incidents outpace your merge capacity.
8. Conclusion and when hosted remote Mac helps
Moving from a green process indicator to a correct configuration requires aligning HOME and XDG variables with the systemd unit truth, then proving RPC health with the official ladder instead of blind restarts.
Self-hosted Linux still demands disciplined owners for drop-ins, golden images, and merge hygiene; rapid upstream cadence makes silent drift likely if those habits slip.
If you need long-running gateways with Apple Silicon compatibility, predictable paths, and less weekly merge toil, compare your Linux baseline with SFTPMAC hosted remote Mac plans and the help center runbooks so operators spend time on agents, not on rediscovering which HOME the daemon adopted.