Pain points: unattended is not copy-paste from Terminal
Interactive success, daemon failure. Terminal sessions inherit login shell PATH additions, agent sockets, and sometimes TouchID-backed key access. launchd plist jobs and cron entries inherit almost none of that unless you encode it explicitly. The failure signature is a hang after TCP connects or during incremental listing, not an immediate permission denied from the remote Mac.
openrsync stalls on large trees. Reports cluster around transfers that progress through many files then stop between files without error. That pattern differs from WAN congestion, which tends to throttle gradually, and differs from disk full, which eventually surfaces I/O errors. Mitigations include protocol flags, rsync binary alignment, and splitting very large trees into multiple jobs with independent logs.
SFTP batch waits forever. sftp -b in non-interactive mode cannot answer prompts. A permissions mismatch or a relative path typo can wedge the session silently unless verbose logging and stderr redirection are configured in the plist.
Decoupled from atomic publishing. Fixing rsync without fixing release mechanics means partial files become visible. Pair this article with staged directories and checksum gates so success means verified bytes, not merely exit code zero.
Layered symptoms: environment versus protocol stack
L0 proves SSH with BatchMode=yes under the same user context as the job. Failures here precede rsync tuning. L1 distinguishes list-building stalls from mid-file stalls. L2 addresses openrsync interoperability with the remote rsync implementation, including explicit --rsync-path and optional --protocol=28.
For launchd, validate UserName, WorkingDirectory, StandardOutPath, and StandardErrorPath. LaunchDaemons versus LaunchAgents also change keychain access patterns. Prefer dedicated CI keys without passphrases for unattended paths, and restrict them to upload-only accounts on the remote Mac.
Document how GitHub Actions self-hosted runners differ again: they inject another PATH layer and another home directory. The same wrapper script should print its identity at the top so triage stays factual instead of anecdotal.
When multiple teams share one Mac ingress, combine this runbook with concurrency caps so retries from one team do not stampede sshd while another team's job is mid-checksum.
Security reviewers sometimes ask whether exporting a broader PATH in plist XML weakens integrity guarantees. The pragmatic answer is that daemons already inherit an implicit policy whether you document it or not. Making PATH explicit is therefore a security improvement because it removes accidental reliance on whatever a human engineer typed last Friday. Pair explicit PATH with file permissions on the wrapper script itself so only the service account can execute it.
Capacity planning also benefits from honest accounting of retry storms. If openrsync occasionally needs three attempts to finish a snapshot, your concurrency model must reserve headroom for those attempts instead of assuming one-shot success. Otherwise a benign statistical tail becomes an outage when three pipelines align on the same bad minute.
Quantified baselines: measure stall seconds, not vibes
Before changing flags, capture a twenty-four hour baseline: start time, end time, file counts, bytes moved, exit codes, and seconds without byte progress. If progress idles longer than roughly three hundred seconds, wrap the command with timeout and alert. Correlate with RTT and loss on the WAN path so you do not misclassify L2 stalls as L0 network flaps.
For CI, define success as three explicit phases: transport complete, checksum verified, atomic pointer switched. Give each phase its own exit contract so flaky openrsync retries show up as measurable retry rates instead of binary pass-fail noise.
When executives ask whether the Mac or the network is broken, answer with distributions, not single anecdotes. Histograms of stall durations after protocol alignment show whether you are chasing a rare tail or a systemic regression tied to a macOS minor upgrade.
Revisit baselines after every Sequoia point release because Apple adjusts bundled tooling behavior across updates. A job that survived fifteen dot releases can still shift when openrsync defaults change.
When you present numbers to finance, translate stall minutes into developer wait minutes. A ten minute nightly stall across twenty engineers is not a networking footnote; it is hundreds of dollars of attention tax every week. Framing the problem in money often unlocks the modest infrastructure budget needed to pin GNU rsync consistently.
Also chart seasonality. End-of-quarter release windows concentrate uploads and amplify latent concurrency bugs. A matrix that works in February may still fail in December unless you rehearse peak shapes with synthetic load.
Finally, keep a one-page escalation ladder taped to the incident channel topic: first owner checks plist logs, second owner checks SSH BatchMode probes, third owner checks rsync versions, fourth owner engages network about NAT. Without that ladder, well-meaning responders parallelize conflicting experiments and extend downtime.
Decision matrix: first actions
| Symptom | Likely root cause | First action | Risk or rollback |
|---|---|---|---|
| Only daemons hang | PATH, missing agent socket, no TTY | Export PATH explicitly; preload keys; fail fast with BatchMode | Over-broad env exports can collide across plists; isolate per job |
| Stalls mid-tree | openrsync pairing skew | Add --protocol=28; set --rsync-path to GNU rsync | Wrong remote path fails fast; test in staging |
| SFTP batch silent | Prompt or permission | Run with -v; split batches; explicit bye | Verbose logs may leak paths; scrub archives |
| Flaky after OS update | Toolchain drift | Re-pin rsync versions; rerun baseline suite | Short-term dual binaries increase support surface |
How-to: seven ordered steps
#!/bin/bash
set -euo pipefail
export PATH="/usr/local/bin:/opt/homebrew/bin:/usr/bin:/bin:/usr/sbin:/sbin"
export RSYNC_RSH="ssh -o BatchMode=yes -o ServerAliveInterval=30"
/usr/bin/rsync -av --protocol=28 --rsync-path=/opt/homebrew/bin/rsync \
./artifacts/ "[email protected]:/data/inbox/staging/"
Step 1: Print id, pwd, and a sorted redacted env inside the plist context.
Step 2: Validate SSH with ssh -o BatchMode=yes -o ConnectTimeout=10 before rsync.
Step 3: Align rsync implementations and set --rsync-path; keep version numbers in internal docs.
Step 4: Wire keys: dedicated CI key or managed ssh-agent; never rely on interactive passphrase entry.
Step 5: Add timeout and bounded exponential backoff; log retry ordinal for openrsync recovery.
Step 6: Dry-run against staging; only then attach atomic switch steps from the release guide.
Step 7: Archive per-run summaries for monthly review of retry rates and tail latency.
After mechanical steps, rehearse one failure injection quarterly: kill rsync mid-transfer and verify gates mark the build bad without publishing partial artifacts. Exercises should include a runner reboot mid-job because that stresses both agent wiring and partial file cleanup.
Keep wrapper scripts in version control next to pipeline definitions so diffs ride normal code review. The goal is to stop secret per-host tweaks that never return to documentation.
If legal requires passphrase-protected keys everywhere, invest in an approved non-interactive unlock path rather than pretending cron can prompt. Split human keys from machine keys at the account boundary on the remote Mac.
When logs grow large, rotate them with the same discipline as application logs. Silent disks full of logs have caused more than one false attribution to openrsync.
Document ownership for each flag you add. A --protocol=28 workaround that lives only in one engineer's forked gist will disappear when that engineer rotates teams. Centralize in git with a short rationale comment so future readers know whether the flag is compatibility debt or a permanent requirement.
If you operate both Intel and Apple Silicon Macs in one pool, verify rsync paths separately per architecture. Homebrew prefixes differ; hardcoding only one path creates subtle arch-specific failures that confuse dashboards because half the fleet looks healthy.
Testing strategy should include downgrade drills: temporarily remove Homebrew rsync from staging and confirm alarms fire when the job falls back to openrsync without the mitigation bundle. Drills prove monitoring is wired, not merely configured.
Lastly, align naming: call the plist label, the log file prefix, and the Grafana series the same string so midnight triage does not require mental translation between three vocabularies.
Related reading
Parallel sessions and keepalive tuning live in concurrent SFTP. Handshake optimization for dense CI loops remains in ControlMaster matrix. Directory isolation belongs with chroot SFTP and multi-team collaboration. The homepage summarizes SFTPMAC plans for always-on Mac ingress.
Self-managed remote Mac operators carry hardware lifecycle, macOS minor upgrades, plist drift, and multi-tenant permission design simultaneously. When that total cost exceeds the value of bespoke control, leasing a dedicated ingress with documented defaults frequently shortens incident timelines.
FAQ and hosted remote Mac contrast
Should I ban system rsync entirely?
No blanket ban. Small local copies can stay on system tools. Remote, high-file-count, unattended workloads justify pinning GNU rsync and documenting versions.
Is cron still acceptable?
Apple prefers launchd. If cron remains, treat its minimal environment as default and encode assumptions explicitly.
Summary: Unattended success on Sequoia requires explicit PATH and agent context, deterministic non-interactive SSH, aligned rsync implementations, timeouts, and staged releases.
Limitation: This runbook does not replace network design for dual-stack or corporate proxies; those remain separate L0 programs.
Contrast: SFTPMAC hosted remote Mac offerings package stable ingress, permission boundaries, and operational defaults so teams spend fewer nights correlating plist logs with openrsync restarts. You keep pipeline ownership while outsourcing node hygiene and baseline drift, which is often the cheapest path when on-call time is scarce.
