Why does the gateway stop when I disconnect SSH on Linux?

systemd --user services are tied to the login session unless lingering is enabled for that user. Use loginctl enable-linger and verify the user manager stays active after logout.

Is gateway install --force the same as uninstalling?

No. Force repair refreshes gateway-managed artifacts in place while keeping broader install choices. Uninstall removes integration surfaces and is meant for path collisions or untrusted residue.

Does a healthy process imply a healthy gateway?

No. A binary can remain alive while RPC routing, TLS termination, or plugin registration is broken; combine gateway status, logs, and synthetic RPC checks.

2026 OpenClaw Gateway Daemon Install and Reinstall Runbook: gateway install --force, launchd, systemd --user, loginctl Linger, and Official Troubleshooting Ladder

Pain points when daemons lie about gateway readiness

Pain 1: treating a running PID as end state. A supervisor can keep a crashed child in a rapid restart loop while upstream clients see timeouts. Without the ladder from gateway operations and doctor channel troubleshooting, teams burn hours tuning models instead of validating RPC surfaces.

Pain 2: skipping linger on Linux user services. Operators start the gateway over SSH, verify once, disconnect, and discover the unit vanished because the user session ended. This is predictable under systemd user mode and is unrelated to OpenClaw bugs.

Pain 3: mixing force repair with uninstall semantics. Force refresh targets gateway-managed artifacts; uninstall removes broader integration. Collapsing the two strategies invites either insufficient cleanup or unnecessary downtime.

Pain 4: launchd stdout paths that rotate incorrectly after upgrades. When binaries move between npm global prefixes, log file paths in plist dictionaries may point at stale locations, hiding errors during reconnect storms.

Pain 5: ignoring reverse proxy symptoms. TLS and WebSocket mismatches masquerade as gateway failures. Correlate with Nginx and Caddy production guidance before reinstalling binaries.

Pain 6: bundled mistakes that read as random flakiness. Teams archive JSON but skip plist or unit SHA256, enable loginctl enable-linger on shared human accounts, or validate RPC only on 127.0.0.1 while clients use another TLS name. Tie gateway install --force to the same ticket as semver plus 4.x doctor stabilization, WSL2 caveats, and MCP cold restart guidance. Log CLI and gateway semver, doctor output hash, linger yes or no, synthetic RPC p95, certificate notAfter, supervisor restart counters, and free space on log volumes so postmortems stay evidence-led.
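Pain 1 can be made measurable on systemd hosts by sampling the supervisor restart counter twice. The following is a minimal sketch, assuming the unit is named openclaw-gateway.service as in the examples in this runbook and that the installed systemd exposes the NRestarts property. The helper names and the default threshold of 3 are illustrative, not OpenClaw defaults.

```shell
# restart_count: read `systemctl show` output on stdin and print the NRestarts value.
# Exits non-zero when the property is absent.
restart_count() {
  awk -F= '$1 == "NRestarts" { print $2; found = 1 } END { if (!found) exit 1 }'
}

# is_restart_loop BEFORE AFTER [THRESHOLD]: succeed when the counter grew by at
# least THRESHOLD (assumed default: 3) between two samples.
is_restart_loop() {
  [ $(( $2 - $1 )) -ge "${3:-3}" ]
}

# Usage on a real host (unit name is an assumption; use the one your package emits):
#   before=$(systemctl --user show openclaw-gateway.service --property=NRestarts | restart_count)
#   sleep 60
#   after=$(systemctl --user show openclaw-gateway.service --property=NRestarts | restart_count)
#   is_restart_loop "$before" "$after" && echo "restart loop suspected"
```

A growing counter alongside a live PID is the signal to move up the ladder to logs and RPC probes rather than to reinstall.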

Official troubleshooting ladder and RPC probes versus a healthy process

Begin with a global summary such as openclaw status, when your distribution documents it, then narrow to gateway-scoped status that exposes listen addresses, build identifiers, and channel adapters. Only after those layers look coherent should you tail structured logs for reconnect loops, TLS errors, or plugin exceptions. Run openclaw doctor without automatic fixes to classify configuration drift, deprecated aliases, and permission issues referenced in 4.x upgrade notes. Introduce doctor --fix only after taking snapshots, and only during windows recorded in rollback guidance.

A healthy process only proves the supervisor invoked a binary. An RPC probe proves request routing: authentication, serialization, routing to channel workers, and consistent build tags. Combine localhost checks with the same path external clients use through reverse proxy TLS termination so split-brain between edge and loopback surfaces early. Document expected latency ceilings, allowed error rates, and backoff windows so on-call can compare live metrics with baselines instead of vibes.
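To compare live probes against a documented latency ceiling, the samples need to be reduced to one number. The sketch below computes a nearest-rank p95 from latency samples in milliseconds, one per line. How the samples are collected is deliberately left out, because this runbook does not name a specific probe command; the function name p95 is illustrative.

```shell
# p95: read numeric latency samples (one per line) on stdin and print the
# nearest-rank 95th percentile. Exits non-zero on empty input.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # Nearest-rank index: ceil(0.95 * NR), computed with integer-safe arithmetic.
      r = int((95 * NR + 99) / 100)
      print v[r]
    }'
}

# Usage: feed one file of samples taken on loopback and one taken through the
# public TLS name, then compare the two results (file names are placeholders):
#   p95 < loopback_ms.txt
#   p95 < edge_ms.txt
```

A large gap between the loopback and edge values points at the proxy or network path rather than at the gateway binary.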

When logs show partial initialization, resist reinstalling until you verify environment variables injected by launchd or systemd drop-ins, because those layers disappear from JSON-only reviews. Cross-check file descriptor limits with guidance in daemon restart matrices after any force repair that touches plist or unit generators.

macOS launchd install and reinstall posture for OpenClaw gateway

On Apple Silicon and Intel Macs, launchd remains the durable interface between user expectations and background work. Store plist Label values that match internal dashboards, pin ProgramArguments to the exact Node or bundled binary that your chosen install channel resolves to (see the npm or Docker install comparison), and set WorkingDirectory to the workspace root that also houses skills and context files. Use KeepAlive with bounded restart policies so a poisoned config does not loop indefinitely without anyone noticing. Route StandardOutPath and StandardErrorPath into rotated files monitored by your log platform.

After gateway install --force, revalidate the plist because installers may rewrite helper paths. Load with launchctl bootstrap domains appropriate to system versus user scope, then inspect exit codes through launchctl print summaries. When reinstalling from scratch, unload cleanly, remove stale sockets, reconcile file permissions with least privilege guidance, and only then bootstrap again. Tie these checks to the channel layering in gateway operations guide so macOS-specific fixes do not fork from Linux runbooks unnecessarily.
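The unload, bootstrap, and inspect sequence depends on addressing the right launchd domain. The sketch below builds the service target string for user versus system scope. It assumes the per-user domain is gui/<uid>, which applies to a logged-in GUI session; some headless setups use a different domain. The label com.example.openclaw.gateway is hypothetical, so substitute the Label from your own plist.

```shell
# launchd_target SCOPE LABEL [UID]: print the launchctl service target.
# SCOPE is "user" (gui/<uid>/<label>) or "system" (system/<label>).
launchd_target() {
  scope=$1; label=$2; uid=${3:-$(id -u)}
  case "$scope" in
    user)   printf 'gui/%s/%s\n' "$uid" "$label" ;;
    system) printf 'system/%s\n' "$label" ;;
    *)      echo "scope must be user or system" >&2; return 2 ;;
  esac
}

# Usage on a Mac (label and plist path are assumptions):
#   LABEL=com.example.openclaw.gateway
#   launchctl bootout   "$(launchd_target user "$LABEL")"    # unload; fails harmlessly if not loaded
#   launchctl bootstrap "gui/$(id -u)" "$HOME/Library/LaunchAgents/$LABEL.plist"
#   launchctl print     "$(launchd_target user "$LABEL")"    # inspect state and last exit code
```

Keeping the target in one helper avoids bootstrapping into one domain and inspecting another, which can make a loaded service look missing.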

Linux systemd user units, loginctl linger, and headless SSH disconnect

User services follow the user manager. When an administrator connects over SSH as the service user, runs systemctl --user enable --now openclaw-gateway.service, and exits, systemd may stop the user manager and its services unless linger is enabled. Run loginctl enable-linger serviceuser for the dedicated account (the argument is the account name, not the unit name), then confirm with loginctl show-user serviceuser that Linger=yes. This is not optional folklore for unattended servers; it is the difference between a demo and production.
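The linger check is easy to script so it can be logged next to other host baselines. This is a minimal sketch; serviceuser is a placeholder account name and linger_state is an illustrative helper, not part of any OpenClaw tooling.

```shell
# linger_state: read `loginctl show-user` output on stdin and print the
# Linger value (yes or no), or "unknown" when the property is missing.
linger_state() {
  awk -F= '$1 == "Linger" { print $2; found = 1 } END { if (!found) print "unknown" }'
}

# Usage on a real host (serviceuser is a placeholder):
#   sudo loginctl enable-linger serviceuser
#   loginctl show-user serviceuser --property=Linger | linger_state
```

Recording the value before and after the change gives the ticket the evidence that the decision matrix asks for.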

Pair linger with an explicit WantedBy=default.target in the [Install] section of the user unit, document XDG_RUNTIME_DIR expectations, and avoid mixing root-level system units with user units unless your security review demands that split. After enabling linger, reboot once to prove the gateway returns without an interactive login. For Windows-adjacent deployments, validate whether you are truly on bare Linux or inside WSL2 using WSL2 guidance before copying unit snippets.
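For reviewing what a user unit with these properties looks like, the sketch below writes an illustrative unit file into a directory of your choice. It is not the unit the OpenClaw installer generates. The start command is passed in as a parameter because this runbook does not document it, and the Restart settings are assumptions chosen to keep restarts bounded.

```shell
# write_user_unit DIR EXEC: write an illustrative openclaw-gateway.service into DIR.
# EXEC is the full start command; it is an input, not a documented OpenClaw value.
write_user_unit() {
  dir=$1; exec_cmd=$2
  mkdir -p "$dir" || return 1
  cat > "$dir/openclaw-gateway.service" <<EOF
[Unit]
Description=OpenClaw gateway (user service, illustrative)

[Service]
ExecStart=$exec_cmd
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target
EOF
}

# Usage (the start command is a placeholder):
#   write_user_unit "$HOME/.config/systemd/user" "/path/to/gateway-start-command"
#   systemctl --user daemon-reload
#   systemctl --user enable --now openclaw-gateway.service
```

Writing into a scratch directory first and diffing against the installed unit is a low-risk way to spot ExecStart drift after a force repair.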

SSH hardening interacts with automation: short ClientAliveInterval values and jump hosts can interrupt long-running interactive doctor sessions but should not stop lingered services. When they do, investigate forced cgroup teardown or home directory mounts from networked storage that fail after VPN changes.

Capacity planning should treat gateway log append rates and inode consumption as first-class metrics. A successful gateway install --force can temporarily double structured log volume while adapters reconnect, so filesystems that already hover near ninety percent utilization may cross rotation thresholds during the same maintenance window. Retain at least two dated checkpoints that bracket both a clean doctor run and a failed RPC attempt so diffs show whether corruption predates or follows repair. When CI shares the host, pin CPU and IO budgets so compile storms cannot starve the supervisor during rolling upgrades.
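A pre-flight check on the log filesystem can gate the maintenance window. The sketch below parses POSIX-format df output and compares the result with a floor. It approximates free space as 100 minus the Capacity column, which can differ slightly from true free space on filesystems with reserved blocks, and it assumes the filesystem name contains no spaces. The 15 percent default floor and the helper names are illustrative.

```shell
# free_pct: read `df -P <path>` output on stdin and print 100 minus the
# Capacity column of the last data row. Exits non-zero if there is no data row.
free_pct() {
  awk 'NR > 1 { gsub(/%/, "", $5); used = $5 }
       END { if (used == "") exit 1; print 100 - used }'
}

# meets_floor PCT [FLOOR]: succeed when PCT is at least FLOOR (assumed default: 15).
meets_floor() {
  [ "$1" -ge "${2:-15}" ]
}

# Usage (the log path is a placeholder):
#   pct=$(df -P /var/log | free_pct)
#   meets_floor "$pct" || echo "log volume below floor: ${pct}% free"
#   df -Pi /var/log    # inode usage deserves the same review; not all platforms support -i with -P
```

Running the check before and after a force repair shows whether the reconnect burst consumed the headroom.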

Measurable baselines to log on every gateway host (2026)

Store the listen address from openclaw gateway status. Upstream documentation often references port 18789 for the local console, but treat whatever value your host prints as authoritative. Align the Node major version with the minimum that openclaw doctor enforces. Keep at least 15% free space on the filesystem that holds long-lived logs. After enabling linger, perform one cold reboot plus one RPC check using the same public hostname and TLS name production clients use.

Field	Example	Why it matters
CLI / gateway semver	2026.4.x	Links incidents to binaries
plist or unit SHA256	64 hex	Catches silent ExecStart drift after force
`Linger`	yes / no	Proves headless survival
RPC latency p95	example <300 ms LAN edge	Splits network from application faults
Log volume free	≥15%	Preserves evidence during rotations
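The plist or unit SHA256 row can be filled in with a small helper that works on both Linux and macOS. This is a sketch under the assumption that either sha256sum or shasum is installed; the function name and the example paths are illustrative.

```shell
# file_sha256 PATH: print the 64-hex SHA256 digest of PATH using whichever
# of sha256sum (common on Linux) or shasum (common on macOS) is available.
file_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{ print $1 }'
  elif command -v shasum >/dev/null 2>&1; then
    shasum -a 256 "$1" | awk '{ print $1 }'
  else
    echo "no SHA256 tool found" >&2
    return 1
  fi
}

# Usage (paths are placeholders; use the files your installer wrote):
#   file_sha256 "$HOME/.config/systemd/user/openclaw-gateway.service"
#   file_sha256 "$HOME/Library/LaunchAgents/com.example.openclaw.gateway.plist"
```

Capturing the digest before and after a force repair makes silent changes to the start command visible as a one-line diff.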

gateway install --force, uninstall, reinstall, and a numbered operator path

Use the following ordered steps as a minimum contract for change management. Replace example unit names with those your package emits.

# 0) prove RPC failure versus cosmetic log noise first
# 1) openclaw status
# 2) openclaw gateway status   # per upstream docs
# 3) tail gateway logs with channel identifiers
# 4) openclaw doctor
# 5) gateway install --force   # only after snapshot approval
# 6) systemctl --user daemon-reload && systemctl --user restart openclaw-gateway.service
# 7) loginctl enable-linger serviceuser && verify Linger=yes

1. Snapshot JSON, secrets, plist or unit fragments, and proxy blocks with a dated directory.
2. Execute the official ladder through doctor without fix to classify errors.
3. If corruption is limited to gateway artifacts, run gateway install --force inside an approved window.
4. Reload supervisors and perform cold restart if MCP children are enabled.
5. Enable linger for user services on Linux and reboot test a headless host.
6. Send synthetic RPC and channel probes documented in gateway operations.
7. If collisions persist across channels, escalate to uninstall per install channel guide, prune paths, reinstall cleanly, and restore snapshots selectively.
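The snapshot step that opens this path can be sketched as a helper that copies the chosen files into a UTC-dated directory. The function name, the directory layout, and the restrictive permissions are assumptions; which files to include depends on your install channel.

```shell
# snapshot_files DEST_ROOT FILE...: copy each FILE into DEST_ROOT/<UTC timestamp>
# and print the directory that was created.
snapshot_files() {
  root=$1; shift
  dest="$root/$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$dest" || return 1
  chmod 700 "$dest" || return 1        # snapshots may contain secrets
  for f in "$@"; do
    cp -p "$f" "$dest/" || return 1    # -p keeps timestamps and modes for later diffs
  done
  printf '%s\n' "$dest"
}

# Usage (all paths are placeholders):
#   snapshot_files "$HOME/gateway-snapshots" \
#     /path/to/gateway-config.json \
#     "$HOME/.config/systemd/user/openclaw-gateway.service"
```

Printing the destination lets the same ticket record the snapshot name next to the force repair it protects.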

Uninstall paths matter when multiple semver trees coexist, ownership on plugin directories is inconsistent, or compromise is suspected. Reinstall after uninstall only when filesystem hygiene and environment variables are reconciled; otherwise you recreate the same fault vector.

Decision matrix, measurable parameters, FAQ, and SFTPMAC hosted remote Mac

Scenario	Preferred action	Evidence	Risk
Gateway binary present but RPC errors	Ladder first, then targeted restart	Structured logs with stack traces	False negative if only loopback tested
Corrupt artifacts after partial upgrade	`gateway install --force`	Doctor warnings about missing templates	Overwriting local hotfixes
Multi-channel failure behind TLS edge	Proxy and certificate review	Edge access logs plus gateway logs	Reinstall noise masks proxy bug
User service dies on SSH exit	`loginctl enable-linger`	`Linger=no` before change	Misapplied root scope units
Path collisions across npm and Docker	Uninstall one channel	`which openclaw` ambiguity	Long downtime if unplanned

Concrete parameters to record per host: semver for CLI and gateway, plist or unit SHA256, last doctor output hash, linger boolean, last synthetic RPC latency, certificate expiry for public edges, restart counter from supervisor, and disk free percentage for log volumes. Keep them on a single operations row updated after every force repair.

Should force repair run automatically in CI?

No. Treat it as a privileged mutation tied to snapshots and human or policy approval except in ephemeral lab hosts.

Does launchd need linger?

No. Linger is a systemd user-session concept; on macOS, focus on domain, bootstrap context, and user versus system scope.

When is uninstall mandatory?

When integrity cannot be proven, paths are ambiguous, or compromise is suspected; pair with clean reinstall from a single channel.

Summary: Gateway reliability is a systems problem: supervisors, session semantics, TLS edges, and RPC truth must align before binaries change. Use the ladder, prefer force repair for bounded corruption, enable linger for Linux user services, and validate RPC the way clients experience it.

Limits: Self-managed hosts still require you to own patching cadence, disk hygiene, VPN stability, and operator access. When teams need Apple-native, always-on automation without turning every engineer into a part-time systems administrator, SFTPMAC hosted remote Mac capacity pairs resilient gateway images with disciplined file delivery over SFTP and rsync so upgrades stay repeatable.

SFTPMAC bridge: Colocate long-running gateways with workspace sync patterns that auditors recognize, reduce one-off workstation variance, and keep delivery paths consistent for CI and humans alike.

Publish gateway version, snapshot name, linger state, and last RPC probe on one panel so force repair decisions stay evidence-driven.