2026 OpenClaw Remote Gateway Mode: CLI versus Service Config Drift, gateway.remote.url, and Health Probe Failure Playbook

OpenClaw remote gateway mode splits the CLI laptop from the long-lived gateway host. Success hinges on a single source of configuration truth: the service must read the same JSON and secrets your operators edit, and RPC probes must target the same URL your reverse proxy terminates.

Security and finance stakeholders should read along too: remote URLs belong on egress allow lists, and misaligned tokens create audit noise as well as downtime.

The pragmatic contrast remains: self-built gateways are viable when staffed; hosted remote Mac ingress shortens incident timelines when laptops, VPNs, and drifting plist metadata dominate your variables.

Pain points: remote gateway is an operations contract

Pain 1: split brains. Remote gateway mode intentionally separates the machine that runs openclaw CLIs from the machine that hosts the gateway process. When openclaw gateway status prints different Config (cli) and Config (service) paths, you are editing one JSON while systemd or launchd still points the daemon at another copy under a different HOME or EnvironmentFile snapshot.
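
A fast confirmation of split brain is to diff the file the CLI reads against the file the service reads. The paths below are placeholders; copy the exact Config (cli) and Config (service) values that openclaw gateway status prints on your hosts.

# Hypothetical paths -- substitute the values from openclaw gateway status.
CLI_CONFIG="$HOME/.openclaw/gateway.json"
SVC_CONFIG="/var/lib/openclaw/gateway.json"

# Any diff output means the daemon is not reading the JSON you are editing.
diff -u "$SVC_CONFIG" "$CLI_CONFIG" || echo "config drift detected"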

Pain 2: stale remote URLs. Operators rotate VMs, recycle container IDs, or change reverse-proxy upstreams but forget to update gateway.remote.url. The UI may intermittently reach a CDN edge while RPC probes still aim at a decommissioned private IP, producing the worst class of heisenbugs.

Pain 3: mixing remote misconfig with pairing drift. Pairing and device identity issues also disconnect dashboards. Run the official status ladder first so you do not burn hours re-onboarding when the root cause is simply a mismatched token on the service side.

Pain 4: conflating RPC failures with channel silence. When RPC probes fail, fix transport, auth, and proxy first. When RPC is green yet Telegram is quiet, pivot to the channels runbook instead of toggling remote URLs randomly.

Layered triage: CLI, RPC, channels

L0 versions. Record openclaw --version and the gateway build identifier your organization standardizes on. Laptop upgrades without server upgrades recreate subtle RPC mismatches.

L1 gateway status. Capture Runtime, the probe target URL, and whether the probe used the TLS SNI you expect. Compare the normalized URL against gateway.remote.url.

L2 doctor. Clear blocking items before declaring networking innocent. Security-hardening releases may tighten bind and token rules simultaneously.

L3 channels. Only after L1 is green should you run channels status --probe and interpret Telegram or Slack readiness separately from RPC.
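
Captured in order, the ladder looks like the block below; every command here already appears in this runbook, so only the ordering and comments are new.

# L0: record versions on both the operator laptop and the gateway host
openclaw --version

# L1: capture Runtime, the probe target URL, and both config paths
openclaw gateway status
openclaw config get gateway.remote.url

# L2: clear blocking items before declaring networking innocent
openclaw doctor

# L3: only once RPC is green, check channel readiness separately
openclaw channels status --probe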

Quantified ticket fields

The ticket template should always include dual config summaries, the remote URL, masked token prefixes, unit names, the last successful RPC timestamp, TLS notAfter, and the RTT between operator laptop and gateway host.

During upgrades, append hourly snapshots so you can diff the last known healthy tuple against the first failing tuple. That discipline shortens executive escalations.
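
One way to keep those snapshots diffable is a small timestamped capture script run from cron during the upgrade window; the output directory and filename scheme below are arbitrary examples.

#!/usr/bin/env bash
# Hourly snapshot of the status tuple during an upgrade window.
# Output directory and filenames are examples; adapt to your ticketing flow.
set -euo pipefail
OUT_DIR="/var/tmp/openclaw-snapshots"
STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$OUT_DIR"
{
  openclaw --version
  openclaw gateway status
  openclaw config get gateway.remote.url
} > "$OUT_DIR/status-$STAMP.txt" 2>&1
# Later, diff the last known healthy capture against the first failing one.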

Capacity planning should include how many concurrent operators attach to the same remote gateway; bursts of reconnects can trip rate limits unrelated to OpenClaw bugs.

Decision matrix

Symptom | Likely root | First action | Rollback note
Config (cli) ≠ Config (service) | Stale service metadata | gateway install --force, then cold restart | Back up JSON and credentials dirs first
Runtime running, probe failed | Token, bind, or proxy | Align tokens; verify upstream; curl or ws probe | Log proxy reload ordering
URL changed but client still old | DNS or UI cache | Flush DNS; restart UI; temporary explicit host | Do not pin IPs forever in prod
Channels quiet with green RPC | Not the remote layer | Follow channels and model quota guides | Avoid parallel token and webhook edits

How-to: seven ordered steps

  1. Freeze variables: screenshot full gateway status with redaction before edits.
  2. Verify remote semantics: confirm gateway.mode and gateway.remote.url follow current docs without legacy aliases.
  3. Align tokens across CLI, UI, and service readers; non-loopback binds refuse anonymous listeners per the vendor FAQ.
  4. Validate the reverse proxy quartet: TLS termination, WebSocket upgrade headers, allowedOrigins, upstream keepalive tuning (see the probe sketch after this list).
  5. Reinstall service metadata with gateway install --force, then gateway restart, documenting unit names.
  6. Run doctor until no blocking items remain; attach summarized output to the ticket.
  7. Record probe latency p95 and one successful UI interaction as the regression baseline for the next upgrade.
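
For step 4, a handshake probe through the proxy confirms TLS termination and WebSocket upgrade handling without touching gateway config. The hostname is a placeholder and the /ws path is an assumption about your deployment; substitute the real values from gateway.remote.url.

# Placeholder host and an assumed /ws upgrade path; adjust to your deployment.
GATEWAY_HOST="gateway.example.internal"

# TLS termination: confirm the certificate subject and expiry the proxy serves.
curl -sv "https://$GATEWAY_HOST/" -o /dev/null 2>&1 | grep -E "subject:|expire date"

# WebSocket upgrade: expect HTTP 101 if upgrade headers pass through intact.
curl -si --http1.1 --max-time 5 "https://$GATEWAY_HOST/ws" \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: $(openssl rand -base64 16)" | head -n 1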

Example capture block for tickets

openclaw --version
openclaw gateway status
openclaw config get gateway.remote.url
openclaw doctor

In tabletop exercises, revoke gateway tokens mid-session to ensure failures are loud and logged with an HTTP status instead of a vague UI disconnect.
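
A drill assertion can be as simple as the probe below; the Authorization header and root path are assumptions about how your gateway accepts tokens, so adapt them to your actual auth scheme.

# Assumed bearer-token header and endpoint; adjust to your gateway's auth scheme.
GATEWAY_URL="https://gateway.example.internal"
REVOKED_TOKEN="paste-the-just-revoked-token-here"

STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  -H "Authorization: Bearer $REVOKED_TOKEN" "$GATEWAY_URL/")
# A loud, logged failure surfaces as 401 or 403, not a silent hang or a vague UI disconnect.
echo "revoked-token probe returned HTTP $STATUS"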

Revisit vendor FAQ changes about loopback token defaults quarterly, because defaults evolve faster than internal wikis.

FinOps should translate repeated multi-hour remote triage into dollars; that framing often unlocks budget for a dedicated always-on Mac ingress instead of laptops sleeping through alerts.

Documentation should name one canonical staging hostname per environment; three synonyms in three runbooks guarantee miswired scripts during incidents.

Disaster drills should include deleting a stale EnvironmentFile path to confirm the service fails closed rather than silently falling back to an unexpected home directory.
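
On systemd hosts the drill can be scripted roughly as below; the unit name and env file path are placeholders, and note that an EnvironmentFile= entry prefixed with a dash is optional and will not fail the unit.

# Placeholder unit name and env file path; use the values from your install.
UNIT="openclaw-gateway.service"

# Show which EnvironmentFile and environment overrides the daemon resolves.
systemctl cat "$UNIT" | grep -E "EnvironmentFile|Environment="

# Drill: move the env file aside, restart, and expect a hard failure.
sudo mv /etc/openclaw/gateway.env /etc/openclaw/gateway.env.bak
sudo systemctl restart "$UNIT"
systemctl is-failed "$UNIT" && echo "fails closed as expected"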

Cross-team vocabulary matters: calling the same directory inbox, staging, and dropbox in different Slack threads guarantees wrong rsync targets beside OpenClaw gateways.

When evaluating VPS versus remote Mac hosting, weigh Apple-native expectations for future file automation alongside gateway uptime; teams mixing SFTP delivery with agents benefit from colocation stories.

Security scanning should treat remote URLs like egress allow-list entries; unexpected hosts in config diffs deserve alerts before merge.

Performance reviewers should chart probe latency separately from model latency so dashboards do not misattribute slow LLM calls to infrastructure.

Finally, align on-call ownership: network, platform, and application rotations each need a first step in the ladder to avoid parallel conflicting experiments.

Incident commanders should forbid simultaneous edits to JSON, proxy, and DNS during sev1 windows unless a single owner coordinates ordering: DNS first, proxy second, gateway third, clients fourth. Random reordering extends outages.

Blue-green style gateway migrations deserve explicit cutover timestamps recorded in both config repos and status pages so humans know which URL is authoritative for each five-minute slice.

Automated config tests can assert that gateway.remote.url resolves to the same A and AAAA records your Terraform outputs expect, catching split-brain DNS weeks before users notice.
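
A minimal sketch of such a test, assuming the expected addresses come from Terraform outputs or another source of truth, and that dig is available on the runner:

# Expected addresses are illustrative; feed them from your Terraform outputs.
EXPECTED_A="203.0.113.10"
EXPECTED_AAAA="2001:db8::10"

# Strip scheme, port, and path from the configured remote URL to get the host.
HOST="$(openclaw config get gateway.remote.url | sed -E 's#^[a-z]+://##; s#[/:].*$##')"

ACTUAL_A="$(dig +short A "$HOST" | sort | head -n 1)"
ACTUAL_AAAA="$(dig +short AAAA "$HOST" | sort | head -n 1)"

[ "$ACTUAL_A" = "$EXPECTED_A" ] || echo "A record drift for $HOST: $ACTUAL_A"
[ "$ACTUAL_AAAA" = "$EXPECTED_AAAA" ] || echo "AAAA record drift for $HOST: $ACTUAL_AAAA"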

When teams multiplex personal laptops as bastions, enforce jump host images with read-only tooling and session recording instead of ad hoc SSH tunnels that disappear when someone closes a lid.

Observability vendors should receive synthetic probes from the same subnets as CI runners, not only from office networks, because path asymmetry is common with split tunnel VPNs.

Certificate transparency logs can validate that the public name in remote URLs matches the TLS certificate actually served, catching misissued intermediates faster than browser caches.
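
Independent of CT monitoring, a quick local check that the certificate actually served covers the configured name looks like the lines below; the hostname is a placeholder for the host in gateway.remote.url.

# Placeholder hostname; substitute the host from gateway.remote.url.
HOST="gateway.example.internal"

# Print subject, SANs, and notAfter for the certificate the proxy actually serves.
echo | openssl s_client -connect "$HOST:443" -servername "$HOST" 2>/dev/null \
  | openssl x509 -noout -subject -ext subjectAltName -enddate
# Compare the SANs and expiry against what your CT monitoring and ticket fields expect.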

Chaos experiments should include revoking refresh tokens for dashboard sessions to ensure operators reauthenticate cleanly without wedging long-running CLIs unexpectedly.

Runbooks should document how to roll back gateway.remote.url to the previous known-good tuple within five minutes, including which systemd drop-ins to revert.
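
A pre-staged rollback script keeps the five-minute target realistic; the config set subcommand, unit name, and drop-in filename below are assumptions, so verify each against your installed CLI and unit layout before an incident.

# Assumed subcommand, unit name, and drop-in path; confirm against your build.
KNOWN_GOOD_URL="https://gateway-prev.example.internal"
UNIT="openclaw-gateway.service"

# 1. Point the config back at the last known-good URL (config set is an assumption).
openclaw config set gateway.remote.url "$KNOWN_GOOD_URL"

# 2. Revert the drop-in that overrode the environment, then reload and restart.
sudo rm -f "/etc/systemd/system/$UNIT.d/50-remote-url.conf"
sudo systemctl daemon-reload
sudo systemctl restart "$UNIT"

# 3. Confirm the probe target matches before closing the change window.
openclaw gateway status
openclaw config get gateway.remote.url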

Product managers should treat remote gateway reliability as a feature with SLOs, not an implementation detail, because customer-facing agent latency often traces to control-plane health.

Compliance teams may require dual-control approvals for token rotation events; bake that approval latency into maintenance windows so engineers do not surprise executives with midnight downtime.

Telemetry cardinality explodes when every laptop emits unique client IDs; standardize client naming schemes so dashboards remain readable after org growth.

Latency budgets should separate TLS handshake time from websocket upgrade time from first RPC payload, so a regression can be pinned to the layer that actually moved.
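
curl's timing variables give a cheap first cut at that breakdown for the HTTPS leg; WebSocket upgrade time and first RPC payload still need client-side instrumentation, so treat this as a partial sketch with a placeholder URL.

# Placeholder URL; substitute gateway.remote.url.
GATEWAY_URL="https://gateway.example.internal/"

curl -s -o /dev/null "$GATEWAY_URL" \
  -w "dns %{time_namelookup}s\ntcp %{time_connect}s\ntls %{time_appconnect}s\nfirst-byte %{time_starttransfer}s\ntotal %{time_total}s\n"
# TLS handshake cost is roughly time_appconnect minus time_connect; chart it
# separately from whatever your client logs for upgrade and first RPC payload.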

Vendor support bundles should include anonymized gateway status, doctor summaries, and reverse-proxy error logs zipped with consistent filenames to shorten ticket round trips.
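
A small bundling helper keeps filenames consistent between tickets; the sed redaction is a crude illustration and the proxy log path is an example, so review the archive manually before sharing.

#!/usr/bin/env bash
# Build a support bundle with predictable filenames; redaction is illustrative only.
set -euo pipefail
STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
BUNDLE="openclaw-support-$STAMP"
mkdir -p "$BUNDLE"

REDACT='s/(Bearer |token=|token: )[^[:space:]]+/\1REDACTED/g'
openclaw gateway status | sed -E "$REDACT" > "$BUNDLE/gateway-status.txt"
openclaw doctor | sed -E "$REDACT" > "$BUNDLE/doctor.txt" || true
sudo cp /var/log/nginx/error.log "$BUNDLE/proxy-error.log" 2>/dev/null || true

tar -czf "$BUNDLE.tar.gz" "$BUNDLE"
echo "created $BUNDLE.tar.gz"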

Architecture review boards should reject proposals that silently embed environment-specific hostnames inside shared JSON templates without templating discipline.

Gradual rollout using feature flags for remote endpoints is safer than big bang DNS flips when multiple downstream automations consume the same hostname.

Finally, celebrate boring weeks: if probes stay green and drift metrics stay flat, publish a short internal note reinforcing which controls prevented regressions so teams do not dismantle them accidentally.

Runbook writers should include explicit negative examples: screenshots of misleading healthy states where HTML loads but RPC fails, so new responders do not repeat classic mistakes.

Change management should require a preflight checklist that references both gateway host and operator laptop patch levels because mismatched TLS cipher suites still appear in legacy corporate networks.

When integrating identity providers, document clock skew tolerances because JWT validation errors sometimes masquerade as generic unauthorized responses in dashboards.

Capacity reviews should estimate websocket fan-out during incident bridges when dozens of engineers attach simultaneously, stressing gateways beyond steady-state engineering load.

Backup strategies must include encrypted offline copies of gateway JSON and secrets metadata, not only database dumps, because rebuild time dominates recovery when configs are lost.

Penetration testers should validate that remote URLs cannot be swapped to attacker-controlled hosts without multi-party approval, treating config files as part of the attack surface.

Localization teams should ensure error strings surfaced to operators translate cleanly without breaking log parsers that rely on English keywords today.

Mobile operator tethering paths deserve explicit warnings in runbooks because NAT resets can drop long-lived RPC sessions during long triage calls from parking lots.

Dependency scanners should flag outdated Node runtimes on gateway hosts with the same severity as application vulnerabilities because runtime drift breaks subtle websocket behaviors.

Executive dashboards should show trend lines for probe failures per thousand sessions, not only raw counts, so seasonal traffic spikes do not hide creeping error rates.

Partner integrations that embed iframes pointing at gateway UIs need third-party cookie policies validated after every browser major release to avoid silent auth regressions.

Regional failover drills should rehearse switching remote URLs between continents, including the DNS TTL math, so operators know how long partial user populations may straddle both regions.
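
The TTL math is simple but worth writing down next to the drill; the sketch reads the remaining TTL from your resolver, and the comment states the worst-case straddle window, assuming no resolver ignores TTLs.

# Placeholder hostname; substitute the host from gateway.remote.url.
HOST="gateway.example.internal"

# Remaining TTL is the second column of the answer section.
dig +noall +answer A "$HOST"

# Worst case after a record flip: clients may keep the old address for up to the
# zone TTL (a 300s TTL means roughly a five-minute straddle), plus any OS or
# application caches layered on top.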

Cost attribution tags on gateway hosts help finance understand which product lines consume support hours when incidents trace back to shared infrastructure.

Accessibility reviewers should confirm that critical status badges remain visible under high-contrast themes because operations centers often enforce strict visual modes.

Long-term archival of gateway logs must respect retention laws; anonymize tokens before shipping logs to cold storage buckets shared across jurisdictions.

Developer experience teams should supply copy-paste snippets for common remote debugging tasks to reduce variance between senior and junior responders.

Finally, schedule an annual game day that assumes complete loss of the primary gateway host image, forcing teams to rebuild from infrastructure-as-code plus secret managers only.

Related reading

Pairing and version alignment: pairing runbook. Gateway install and daemons: install runbook. TLS and WebSocket: reverse proxy guide. Channel silence: channels runbook. Security audit: 4.14 audit runbook.

Remote gateway triage fixes transport and auth alignment; it does not replace channel or model troubleshooting when probes are already green.

FAQ and hosted remote Mac angle

Is SSH port-forwarding enough?

It works short-term, but still document canonical URLs and tokens so teams do not depend on one engineer's laptop staying awake.

When should you pick a VPS versus a remote Mac?

A VPS fits fixed Linux automation; a remote Mac fits Apple-native builds and file delivery colocated with agents.

Why consider SFTPMAC leasing?

When runbooks exist but hardware churn and ingress drift still burn on-call time, leasing bundles always-on bare metal with observability defaults while you keep your own token policies.