Three recurring install pains that look like “OpenClaw is broken”
Teams rarely open tickets titled “wrong packaging choice.” They open “the dashboard worked yesterday” or “npm install never finishes on the bastion.” Those complaints map to three engineering patterns that cut across install.sh, npm, pnpm, and Docker equally until you standardize expectations.
1) Path mismatch between operators. One teammate follows the script path, another installs a global CLI with npm, a third pulls Compose from a forked gist. Each method places configuration, caches, and credentials in subtly different locations. When the gateway fails after an OS patch, nobody agrees which directory to back up or which systemd unit actually launches the process. The incident becomes a scavenger hunt across home directories instead of a rehearsed rollback.
2) Upgrade without a data contract. OpenClaw gateways accumulate state: local config, model cache directories, channel tokens, and downloaded assets. Promoting a new version without copying the right folders forward produces “clean install” behavior that looks like data loss. Rolling back without snapshots yields the opposite pain: stale binaries reading incompatible schemas. Treat the data directory as a first-class migration surface, not an afterthought.
3) Skipping the triage ladder. When messages stop flowing, reflex reinstalls waste hours. The productive sequence verifies the process and listening port, confirms gateway HTTP reachability and environment injection, runs openclaw doctor for static consistency, then captures logs while reproducing the failure. Jumping to “delete node_modules” before you know whether the container even mounts the config file you edited simply hides the fault layer.
Instrument installs the same way you instrument production: record Node version, package manager lockfiles or image digests, the exact commit of any install script, and a tarball of config before you touch upgrades. That discipline pays off the first time you must prove whether a regression came from application code, provider quotas, or an accidental permission change on the data volume.
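The record-keeping above can be sketched as a small script. The manifest path and the tools probed are illustrative assumptions; append your own specifics (lockfile hashes, image digests, install-script commit) to the same file.

```shell
#!/bin/sh
# Sketch: capture an environment manifest before touching an upgrade.
# The output path and tool list are assumptions; extend with lockfile
# hashes and image digests from your own repository.
set -eu
MANIFEST=/tmp/openclaw-install-manifest.txt
{
  echo "captured: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "node: $(command -v node >/dev/null 2>&1 && node --version || echo 'not installed')"
  echo "npm: $(command -v npm >/dev/null 2>&1 && npm --version || echo 'not installed')"
  echo "docker: $(command -v docker >/dev/null 2>&1 && docker --version || echo 'not installed')"
} > "$MANIFEST"
cat "$MANIFEST"
```

Commit the manifest alongside the incident runbook so a later regression can be diffed against the exact environment that last worked.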
Path selection matrix: trial laptop versus team standard versus production host
Use the matrix during architecture reviews, not only during emergencies. The qualitative risks stay stable across Apple Silicon Macs, Linux cloud VMs, and Windows dev machines even though absolute paths differ.
| Path | Best for | Primary risk | Minimum controls |
|---|---|---|---|
| install.sh curated flow | Solo first success, demos, golden-path onboarding | Opaque script changes between releases | Pin script URL or checksum, log stdout, snapshot config dir |
| npm or pnpm CLI install | Fast iteration, monorepo engineers, CI agents | Global versus local CLI confusion, Node drift | Volta or fnm pinned Node, non-root service user, lockfile in repo |
| Docker Compose stack | Production parity, repeatable ports and volumes | Volume mount mistakes, image tag drift | Named volumes or bind mounts with documented ownership, digest-pinned images |
When two options tie, prefer more explicit filesystem boundaries over fewer moving parts. Extra containers or an additional volume mount often reduce weekend pages compared with a single overloaded user account that mixes personal projects and the production gateway.
install.sh path: what it optimizes and where it still needs your discipline
Curated install scripts exist to collapse dozens of README steps into one auditable command. They typically handle shell detection, dependency hints, and initial directory scaffolding. They do not replace your responsibility to verify checksums, store secrets outside shell history, or align with corporate proxy rules. Before running any script on a shared bastion, read the sections that mutate shell profiles or install global binaries; those side effects persist for every future user who logs in.
Treat the script as versioned infrastructure: download to a content-addressed filename, record the SHA-256 in your change log, and execute during a maintenance window when possible. Capture the entire terminal transcript. If the script offers non-interactive flags, use them in automation so CI and humans produce identical trees. After completion, immediately run openclaw doctor and a local HTTP probe against the gateway port documented for your release train (commonly 18789 in 2026 guides); adjust if your team standardizes on another ingress port behind a reverse proxy.
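The checksum discipline can be sketched as follows. The download URL is a placeholder, and a stand-in file simulates the fetched script so the verify-then-run flow is demonstrable end to end; in real use you would pin the hash from a trusted release note, not compute it from the file you just downloaded.

```shell
#!/bin/sh
# Sketch of checksum-pinned script execution; URL and hash source are
# placeholders, and the stand-in file below substitutes for a real download.
set -eu
SCRIPT=/tmp/install-openclaw.sh
# curl -fsSL "https://example.invalid/install.sh" -o "$SCRIPT"   # real download step
printf '#!/bin/sh\necho "installer ran"\n' > "$SCRIPT"           # stand-in for demo only
PINNED=$(sha256sum "$SCRIPT" | awk '{print $1}')                 # in practice, pin from release notes
echo "$PINNED  $SCRIPT" | sha256sum -c -                         # refuse to run on any drift
sh "$SCRIPT" | tee /tmp/install-transcript.log                   # keep the full transcript
```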
Common failure modes include stale sudo caches masking permission errors, corporate SSL inspection breaking curl mid-script, and locale settings altering path parsing on minimal cloud images. When the script completes but the gateway does not listen, compare the user that ran the script with the user configured in your process supervisor. Mismatched accounts are a top source of “works in SSH session, dead after reboot.”
npm and pnpm path: Node pinning, global CLI trade-offs, and permission hygiene
Package-manager installs shine when engineers need rapid upgrade cycles and tight integration with existing JavaScript repositories. They hurt when teams confuse a global npm install -g with a project-local toolchain, or when multiple Node versions fight on one host. Standardize on a version manager and document the exact major Node release your OpenClaw distribution supports; drifting ahead for unrelated projects is how production gateways pick up incompatible native modules.
pnpm’s content-addressable store reduces disk churn when several services share dependencies, which matters on small cloud disks. npm remains ubiquitous for contractors who should not learn another toolchain under incident pressure. Either way, run the gateway under a dedicated Unix account with home directory and data paths you control, not under a personal login that also runs browsers and sleeps the machine.
Mitigate npm registry timeouts and geographic latency with an internal mirror or a documented mirror flag; those operational details belong in the same runbook as firewall rules. After install, verify which openclaw resolves to the intended location, then run doctor before exposing any public listener. If your organization forbids global installs, wrap the CLI with npx or a checked-in package script, but keep the wrapper stable across releases so automation does not surprise on-call.
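One way to combine Node pinning with a project-local CLI wrapper is a package.json like the sketch below, assuming Volta for version pinning. The package name and version numbers are placeholders, not confirmed npm artifacts; swap in whatever your distribution actually publishes.

```json
{
  "devDependencies": {
    "openclaw": "1.2.3"
  },
  "volta": {
    "node": "20.11.0"
  },
  "scripts": {
    "openclaw": "openclaw",
    "doctor": "openclaw doctor"
  }
}
```

With this in the repo, `npm run doctor` resolves the project-local binary regardless of what is globally installed, and Volta guarantees every engineer and CI agent runs the pinned Node major.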
Docker Compose path: volumes, ports, restart policies, and resource floors
Compose bundles the reproducibility win: the same YAML describes images, environment files, published ports, and restart policies. The failure mode shifts to mis-mounted configuration and UID mismatches between container users and host bind mounts. Name every volume in documentation and map it to a backup job. Treat .env files as secrets-bearing artifacts with rotation playbooks, not as informal notes.
Publish only the ports your ingress actually needs; put TLS termination on a reverse proxy with health checks rather than exposing raw Node listeners broadly. Set memory limits high enough for conversational bursts: roughly 1.5 GiB of free RAM remains a practical small-node baseline alongside the thresholds discussed in gateway operations guides, with additional headroom left for the host kernel. Use unless-stopped or an equivalent restart policy so transient failures recover without manual SSH.
Inside the container namespace, run the same openclaw status, doctor, and log tail sequence you would on bare metal. If doctor passes on the laptop but fails in the container, suspect different working directories, missing bind mounts, or environment variables injected only in interactive shells. Align Compose health checks with the HTTP probe your load balancer uses so false negatives do not flap production traffic.
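The Compose controls above can be sketched as one service definition. The image reference, volume name, data path, and health endpoint are placeholders for illustration, not official OpenClaw artifacts; adapt each to your registry and release train.

```yaml
# Hypothetical Compose sketch; image, paths, and volume names are placeholders.
services:
  openclaw:
    image: registry.example.com/openclaw@sha256:<pinned-digest>  # digest, not a floating tag
    restart: unless-stopped                 # transient failures recover without SSH
    env_file: .env                          # secrets-bearing artifact; rotate and restrict
    ports:
      - "127.0.0.1:18789:18789"             # publish only toward the reverse proxy
    volumes:
      - openclaw-data:/var/lib/openclaw     # named volume mapped to a backup job
    mem_limit: 2g                           # burst room plus host-kernel headroom
    healthcheck:                            # align with the probe your load balancer uses
      test: ["CMD", "curl", "-fsS", "http://127.0.0.1:18789/health"]
      interval: 30s
      timeout: 5s
      retries: 3

volumes:
  openclaw-data:
```

Binding the published port to 127.0.0.1 keeps the raw Node listener off the public interface, matching the reverse-proxy guidance above.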
Example triage sequence (host or container)
```
openclaw status
curl -sS -m 5 http://127.0.0.1:18789/health || echo "gateway probe failed"
openclaw doctor
openclaw health --json > /tmp/openclaw-health-$(date +%Y%m%d%H%M).json
openclaw logs --follow
```
Adapt the port if your reverse proxy terminates externally; keep the logical order: process, gateway HTTP, static validation, runtime health, then narrative logs.
Upgrade, rollback, triage order, and when a hosted remote Mac wins
Before any upgrade, back up the configuration directory, environment files, channel tokens export (from your secret manager, never from chat logs), and the model or cache directories your team considers expensive to rebuild. Store a tarball with a dated name alongside the previous container image digest or npm lockfile. Roll forward in staging with the same compose file or script flags, rerun doctor and a health JSON capture, then promote during a low-traffic window.
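The pre-upgrade backup can be sketched as a dated tarball of the configuration directory. The directory path and file contents below are demo stand-ins; substitute your real config, token-export, and cache locations, and store the result next to the previous image digest or lockfile.

```shell
#!/bin/sh
# Sketch of the pre-upgrade backup; the config directory and its contents
# are stand-ins for demonstration, not real OpenClaw paths.
set -eu
CONFIG_DIR=/tmp/demo-openclaw-config
mkdir -p "$CONFIG_DIR"
echo "port: 18789" > "$CONFIG_DIR/gateway.yaml"       # stand-in config file
STAMP=$(date +%Y%m%d%H%M)
BACKUP=/tmp/openclaw-backup-$STAMP.tar.gz
tar -czf "$BACKUP" -C "$(dirname "$CONFIG_DIR")" "$(basename "$CONFIG_DIR")"
tar -tzf "$BACKUP"                                    # verify the archive is readable
echo "backup written: $BACKUP"
```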
Rollback means restoring those artifacts and restarting with the prior image tag or package version, not reinstalling from memory. If schema migrations ever appear between releases, read release notes before you downgrade; partial migrations may require manual SQL or file repairs that doctor alone cannot undo.
When incidents strike, follow the ladder: status proves a process owns the port; gateway HTTP proves listeners and proxies align; doctor catches static misconfiguration; logs explain what happened for a real user event. Escalate to provider status pages only after you timestamp-correlate local evidence. This ordering matches the operational story in the dedicated gateway article while staying honest that HTTP health and “gateway layer” checks belong between raw process state and static doctor validation.
- npm ci hangs: suspect registry mirrors, MTU issues, or CPU throttling on tiny instances before blaming OpenClaw.
- Doctor clean but channels silent: shift to bridge logs and channel tokens; a UI that still serves static assets can mislead you into assuming delivery works.
- Compose upgrade broke mounts: diff old and new volume paths before touching data directories.
Summary: Three install paths trade onboarding speed, iteration ergonomics, and reproducibility; all of them fail gracefully only when backups, Node pinning, and triage order are explicit team contracts.
Limitation: Laptops that sleep, personal accounts shared with production, and undocumented manual edits remain the dominant failure class even when packaging is perfect.
SFTPMAC angle: A hosted remote Mac offers stable power, Apple-native toolchains, and colocation with SFTP-based artifact flows many teams already use beside agents. When your OpenClaw gateway must stay online next to the same rsync or SFTP delivery paths your release pipeline trusts, moving off a fragile personal machine reduces sleep-induced disconnects and permission drift without sacrificing macOS compatibility.
We focus on reachable nodes and predictable file permissions so doctor output and health JSON remain comparable across environments. If reliability matters more than repurposing retired hardware, standardize on infrastructure designed for continuous operation and rehearsed rollback.
Which path for a two-day hackathon?
Favor install.sh or a documented npm script with minimal global side effects; snapshot config before you leave the venue.
Which path for regulated production?
Favor Compose with pinned digests, secret injection from a vault, and automated backups of named volumes.
Do I rerun doctor after every env change?
Yes; doctor is cheap compared to an hour of tailing logs without a hypothesis.
Need a stable Mac host for OpenClaw next to managed file sync workflows? Compare SFTPMAC plans and baseline your gateway there.
