
Replication

PutFS is single-node by design – there is no built-in replication protocol. Instead, it delegates to existing tools that replicate filesystems or files over the network. The primary node accepts writes; secondaries hold read-only copies and can be promoted on failure. Since all data is plain files on disk, any tool that can copy files can replicate PutFS. The trade-off is asynchronous replication: writes made after the last snapshot are not yet on the secondary when the primary fails, but they can be recovered from the failed node's filesystem once it comes back (see disaster recovery below). Read downtime is zero.

The sections below show example setups with zrepl, rsync, and rclone; a block-level tool such as DRBD works on the same principle.

zrepl (ZFS)

Continuous incremental replication via ZFS send/receive:

# /etc/zrepl/zrepl.yml
jobs:
  - name: putfs-push
    type: push
    connect:
      type: tcp
      address: backup-host:8888
    filesystems:
      "tank/putfs<": true
    snapshotting:
      type: periodic
      interval: 15m
      prefix: zrepl_
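
For completeness, the receiving side runs a matching sink job. A sketch – the replica dataset name and the primary's IP address are placeholder assumptions:

```yaml
# /etc/zrepl/zrepl.yml on backup-host -- dataset and client IP are
# placeholders, adjust for your pool layout and network
jobs:
  - name: putfs-sink
    type: sink
    serve:
      type: tcp
      listen: ":8888"
      clients:
        "192.0.2.10": "putfs-primary"   # primary's address -> client identity
    root_fs: "tank/replicas"            # snapshots land under this dataset
```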

rsync

Works on any filesystem:

*/15 * * * * rsync -a --delete /srv/putfs/ backup-host:/srv/putfs/
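
If a sync can outlast the 15-minute interval, overlapping rsync runs waste bandwidth and can race. One way to guard against that is to point the cron entry at a small wrapper – a sketch, paths and host as above:

```shell
#!/bin/sh
# putfs-rsync.sh -- flock skips this run if the previous one still holds
# the lock, so at most one rsync is in flight at a time
exec flock -n /run/putfs-rsync.lock \
    rsync -a --delete /srv/putfs/ backup-host:/srv/putfs/
```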

rclone

Sync between PutFS instances over HTTP or WebDAV:

rclone sync :webdav,url=https://primary/dav/ :webdav,url=https://secondary/dav/

Load balancer failover

Writes to primary, reads round-robin. If a secondary returns 404 (replication lag), the load balancer retries on the primary.

nginx

# http context. nginx does not allow proxy_pass inside limit_except,
# so a map on the request method selects the upstream instead.
map $request_method $putfs_backend {
    default primary;     # writes (and anything non-read) go to the primary
    GET     readers;
    HEAD    readers;
}

upstream primary {
    server primary-host:8000;
}

upstream readers {
    server primary-host:8000;
    server secondary-host:8000;
}

server {
    listen 80;

    location / {
        proxy_pass http://$putfs_backend;
        # a 404 from a lagging secondary is retried on the next server
        proxy_next_upstream error timeout http_404;
    }
}

HAProxy
global
    maxconn 4096

defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s

frontend putfs
    bind *:80
    acl is_write method PUT DELETE
    use_backend writers if is_write
    default_backend readers

backend writers
    server primary primary-host:8000 check

backend readers
    balance roundrobin
    option redispatch   # let the retry go to the other server, not the same one
    retry-on 404
    retries 1
    server primary primary-host:8000 check
    server secondary secondary-host:8000 check

Disaster recovery (zrepl)

For a two-node zrepl setup (primary pushes snapshots to secondary), the recovery path depends on what failed.

In most cases, the ZFS pool itself is intact – the data disks are mirrored and survive an OS disk failure, a software crash, or a network outage. The right response is to bring the primary back up, not to fail over. During that window (ideally well under an hour) the write path is unavailable, but reads continue from the secondary. Applications using PutFS should handle write errors with retry logic anyway – a temporary 5xx or connection refused is expected behaviour during recovery, not a reason to trigger a full failover.
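
That client-side retry logic can be as small as a shell helper that retries any command with exponential backoff – a sketch; the curl invocation in the usage comment is illustrative, not a PutFS-specific API:

```shell
#!/bin/sh
# retry CMD ARGS... -- rerun the command until it succeeds, backing off
# 2s, 4s, 8s, 16s between attempts, and give up after 5 tries
retry() {
    n=0
    until "$@"; do
        n=$((n + 1))
        [ "$n" -ge 5 ] && return 1
        sleep $((1 << n))
    done
}

# usage: retry curl -sf -T report.pdf https://primary/report.pdf
```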

Decision tree

Primary fails
    ├─ Pool intact (OS disk failure, software crash)
    │    → Recover primary, do NOT failover
    │    → Secondary serves reads during recovery
    │    → zrepl resumes automatically once primary is back
    └─ Unrecoverable (hardware dead, days to replace)
         → Promote secondary to primary
         → When old primary recovers:
              1. Transfer gap files (written after last snapshot, before failure)
              2. Rollback old primary to shared ancestor snapshot
              3. Reverse zrepl direction (new primary pushes to recovered node)
              4. Optional: flip back to original topology once synced

Failover: promote secondary

  1. Stop zrepl sink on secondary
  2. Start PutFS write path on secondary
  3. Update DNS / load balancer to point to secondary
  4. Record the last shared snapshot name – this is the common ancestor for recovery
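
On the secondary, the steps above might look like this – a sketch; the service names, dataset, and how your deployment enables the PutFS write path are all assumptions:

```shell
#!/bin/sh
# Promote the secondary -- service and dataset names are placeholders
systemctl stop zrepl       # 1. stop receiving snapshots
systemctl start putfs      # 2. start the write path (deployment-specific)
# 3. repoint DNS / the load balancer at this host (site-specific)
# 4. record the newest replicated snapshot: the shared ancestor for recovery
zfs list -Hp -t snapshot -o name -s creation tank/putfs | tail -n 1
```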

Recovery: old primary comes back

The old primary's filesystem may contain gap files – writes that happened after the last replicated snapshot but before the failure. These were never snapshotted.

  1. Find gap files on old primary: files newer than the shared ancestor snapshot (find -newer .zfs/snapshot/<shared>)
  2. Transfer gap files to new primary via rsync (WORM = no conflicts, purely additive)
  3. Rollback old primary to the shared ancestor snapshot (zfs rollback -r)
  4. Reverse zrepl: configure new primary to push to the recovered node
  5. zrepl identifies the shared ancestor and sends only the delta
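
A condensed runbook for steps 1–3 on the old primary – a sketch; the mount point, dataset, and new-primary hostname are assumptions, and the snapshot name is whatever you recorded at failover time:

```shell
#!/bin/sh
# Recover the old primary -- names are placeholders
SHARED="<shared-ancestor-snapshot>"   # recorded when the secondary was promoted

# 1-2. copy gap files (modified after the shared snapshot) to the new
#      primary; WORM means this is purely additive, no conflicts
find /srv/putfs -type f -newer "/srv/putfs/.zfs/snapshot/$SHARED" -print0 \
    | rsync -a --from0 --files-from=- / new-primary:/

# 3. discard local state that diverged from the replication stream
zfs rollback -r "tank/putfs@$SHARED"
```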

Monitoring replication lag

Alert if replication falls behind – larger lag means more gap files on failure:

zrepl status | grep -A5 "push_to_replica"

With Prometheus (zrepl exposes metrics):

- alert: ZreplReplicationLag
  expr: time() - zrepl_replication_last_successful_step_timestamp > 900
  for: 5m
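
Without Prometheus, a cron-driven check on the secondary can compare the newest received snapshot's creation time against the clock – a sketch; the replica dataset name and the 30-minute threshold are assumptions:

```shell
#!/bin/sh
# Emit a warning if the newest snapshot on the replica is older than 30
# minutes. Dataset name is a placeholder.
newest=$(zfs list -Hp -t snapshot -o creation -s creation tank/replicas/putfs \
    | tail -n 1)
lag=$(( $(date +%s) - newest ))
[ "$lag" -gt 1800 ] && echo "zrepl replication lag: ${lag}s" >&2
```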

Further reading