Replication
PutFS is single-node by design: there is no built-in replication protocol. Instead, replication is delegated to existing tools that copy filesystems or files over the network. The primary node accepts writes; secondaries hold read-only copies and can be promoted on failure. Since all data is plain files on disk, any tool that can copy files can replicate PutFS.

The trade-off is asynchronous replication: writes made after the last snapshot are not yet replicated when the primary fails, but they can be recovered from the failed node's filesystem once it comes back (see disaster recovery below). Read downtime is zero.
Suitable tools include zrepl, rsync, rclone, and DRBD.
zrepl (ZFS)
Continuous incremental replication via ZFS send/receive:
# /etc/zrepl/zrepl.yml
jobs:
  - name: putfs-push
    type: push
    connect:
      type: tcp
      address: "backup-host:8888"
    filesystems:
      "tank/putfs<": true   # the dataset and all children
    snapshotting:
      type: periodic
      interval: 15m
      prefix: zrepl_
    # zrepl requires a pruning policy; the keep counts here are illustrative
    pruning:
      keep_sender:
        - type: not_replicated
        - type: last_n
          count: 10
      keep_receiver:
        - type: last_n
          count: 30
rsync
Works on any filesystem:
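A minimal push could look like this; the data directory, host name, and schedule are assumptions, not part of PutFS itself:

```shell
# Assumed layout: PutFS data under /srv/putfs, secondary reachable as backup-host.
# -a preserves permissions and mtimes; --partial resumes interrupted transfers.
# PutFS data is write-once, so --delete is not needed: the copy is purely additive.
rsync -a --partial /srv/putfs/ backup-host:/srv/putfs/
```

Run it from cron at whatever interval matches your tolerable replication lag, e.g. every 15 minutes.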
rclone
Sync between PutFS instances over HTTP or WebDAV:
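As a sketch, a secondary could pull from the primary's HTTP interface like this; the remote name, port, and paths are assumptions:

```shell
# One-off remote definition pointing rclone's read-only http backend at the primary.
rclone config create putfs-primary http url=http://primary-host:8000
# copy (not sync) because PutFS data is write-once and purely additive.
rclone copy putfs-primary: /srv/putfs
```

For push-style replication over WebDAV, the same pattern works with rclone's webdav backend, which supports writes.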
Load balancer failover
Writes go to the primary; reads are balanced round-robin across all nodes. If a secondary returns 404 (replication lag), the load balancer retries the request on the primary.
# Route by method: GET/HEAD go to the reader pool, everything else to the primary.
# (limit_except only allows access-phase directives, so method routing uses a map.)
map $request_method $putfs_upstream {
    default primary;
    GET     readers;
    HEAD    readers;
}

upstream primary {
    server primary-host:8000;
}

upstream readers {
    server primary-host:8000;
    server secondary-host:8000;
}

server {
    listen 80;

    location / {
        # With a variable target, nginx passes the original request URI unchanged.
        proxy_pass http://$putfs_upstream;
        proxy_next_upstream error timeout http_404;
    }
}
The same policy in HAProxy (retry-on with an HTTP status requires HAProxy 2.0+):

global
    maxconn 4096

defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend putfs
    bind *:80
    acl is_write method PUT DELETE
    use_backend writers if is_write
    default_backend readers

backend writers
    server primary primary-host:8000 check

backend readers
    balance roundrobin
    retry-on conn-failure 404
    retries 1
    server primary primary-host:8000 check
    server secondary secondary-host:8000 check
Disaster recovery (zrepl)
For a two-node zrepl setup (primary pushes snapshots to secondary), the recovery path depends on what failed.
In most cases the ZFS pool itself is intact: the data disks are mirrored and survive an OS disk failure, a software crash, or a network outage. The right response is to get the primary back up, not to fail over. During that window (ideally well under an hour) the write path is unavailable, but reads continue from the secondary. Applications using PutFS should handle write errors with retry logic anyway: a temporary 5xx or connection refused is expected behaviour during recovery and not a reason to trigger a full failover.
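For a client that shells out, curl's built-in retry handling covers this case; the URL and retry budget below are illustrative:

```shell
# Retry transient failures during primary recovery:
#   --retry           retries on timeouts and retryable HTTP statuses (5xx, 429, ...)
#   --retry-connrefused  also treats "connection refused" as transient
#   --fail            turns HTTP errors into a non-zero exit code
curl --retry 5 --retry-delay 2 --retry-connrefused --fail \
     -T ./report.pdf http://putfs.example/reports/report.pdf
```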
Decision tree
Primary fails
│
├─ Pool intact (OS disk failure, software crash)
│ → Recover primary, do NOT failover
│ → Secondary serves reads during recovery
│ → zrepl resumes automatically once primary is back
│
└─ Unrecoverable (hardware dead, days to replace)
→ Promote secondary to primary
→ When old primary recovers:
1. Transfer gap files (written after last snapshot, before failure)
2. Rollback old primary to shared ancestor snapshot
3. Reverse zrepl direction (new primary pushes to recovered node)
4. Optional: flip back to original topology once synced
Failover: promote secondary
- Stop zrepl sink on secondary
- Start PutFS write path on secondary
- Update DNS / load balancer to point to secondary
- Record the last shared snapshot name – this is the common ancestor for recovery
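The steps above can be sketched as shell; the service names and dataset are assumptions about your deployment:

```shell
# Promote the secondary (hedged sketch).
systemctl stop zrepl                # stop the sink so nothing overwrites local state
zfs set readonly=off tank/putfs     # received datasets are usually kept read-only
systemctl start putfs               # bring up the write path on this node
# Record the common ancestor for later recovery of the old primary:
zfs list -Ht snapshot -o name -S creation tank/putfs | head -1 \
    > /root/putfs-shared-ancestor
```

Update DNS or the load balancer separately; that step depends on your infrastructure.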
Recovery: old primary comes back
The old primary's filesystem may contain gap files – writes that happened after the last replicated snapshot but before the failure. These were never snapshotted.
- Find gap files on the old primary: files newer than the shared ancestor snapshot (find . -newer .zfs/snapshot/<shared>)
- Transfer the gap files to the new primary via rsync (WORM = no conflicts, purely additive)
- Roll back the old primary to the shared ancestor snapshot (zfs rollback -r)
- Reverse zrepl: configure the new primary to push to the recovered node
- zrepl identifies the shared ancestor and sends only the delta
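The gap-file transfer and rollback can be sketched as follows; the mountpoint, snapshot name, and host are assumptions:

```shell
# Run on the recovered old primary (hedged sketch).
SHARED=zrepl_20250101_000000        # last snapshot common to both nodes
cd /srv/putfs
# Files modified after the shared snapshot were never replicated; ship them over.
# --files-from reads the null-separated list from stdin (--from0).
find . -type f -newer ".zfs/snapshot/$SHARED" -print0 |
  rsync -a --from0 --files-from=- . new-primary:/srv/putfs/
# Discard local divergence so zrepl can resume incrementally from the ancestor:
zfs rollback -r "tank/putfs@$SHARED"
```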
Monitoring replication lag
Alert if replication falls behind – larger lag means more gap files on failure:
With Prometheus (zrepl exposes metrics):
- alert: ZreplReplicationLag
  expr: time() - zrepl_replication_last_successful_step_timestamp > 900
  for: 5m
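Without Prometheus, a cron job can approximate lag from the age of the newest snapshot; the dataset name and threshold are assumptions, and this measures snapshotting, not delivery:

```shell
# Exit non-zero if the newest snapshot of tank/putfs is older than 15 minutes.
# -H: no header, -p: creation as a unix timestamp, -d 1: this dataset only.
NEWEST=$(zfs list -Hpt snapshot -d 1 -o creation -s creation tank/putfs | tail -1)
AGE=$(( $(date +%s) - NEWEST ))
if [ "$AGE" -gt 900 ]; then
  echo "zrepl lag: ${AGE}s since last snapshot of tank/putfs" >&2
  exit 1   # non-zero exit lets cron or a wrapper raise the alert
fi
```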