
ZFS tuning

This guide covers ZFS pool and dataset configuration optimized for PutFS workloads – primarily small-file random reads and writes, and prefix listing.

See Data integrity and Scaling for what ZFS brings to PutFS.

Pool topology

Special vdev (the biggest win)

A special vdev stores ZFS metadata and small files (anything under special_small_blocks) on a dedicated device – typically NVMe. This is the single most impactful optimization for PutFS.

# Create pool with mirrored data vdevs and mirrored NVMe special vdev
zpool create tank \
    mirror /dev/sda /dev/sdb \
    mirror /dev/sdc /dev/sdd \
    special mirror /dev/nvme0n1 /dev/nvme1n1

Warning

Always mirror the special vdev. If it's a single device and it fails, you lose the pool.

The special vdev always stores metadata (dnodes, indirect blocks, directory entries). To also store small files on it, set special_small_blocks:

# Files up to 128K go to the special vdev (NVMe)
zfs set special_small_blocks=128K tank/putfs

With special_small_blocks=0 (the default), only metadata lives on the special vdev – no file data. This still accelerates directory listings and stat calls, but reads still hit the HDD pool. Setting it to 128K moves small files to NVMe too, which is where PutFS gets the biggest read speedup. Note that the threshold is inclusive and applies per block: with the default recordsize of 128K, this value routes all file data to the special vdev. To keep large files on the HDDs, either raise recordsize (e.g. to 1M) on large-file datasets or set special_small_blocks below recordsize (e.g. 64K).

With this, every small file read is an NVMe read. Directory lookups, inodes, and metadata are also on NVMe. The spinning disks only handle large files.
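To verify where blocks are actually landing, per-vdev capacity and I/O can be inspected (a sketch against the example pool above; the pool and device names are assumptions):

```shell
# Capacity per vdev – the 'special' mirror line shows how much
# metadata and small-file data has been routed to NVMe.
zpool list -v tank

# Live per-vdev I/O at 5-second intervals – small-file reads should
# land on the special mirror, not the HDD mirrors.
zpool iostat -v tank 5
```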

SLOG

A SLOG (separate log device) holds the ZFS intent log (ZIL) on its own device and absorbs synchronous write latency. PutFS uploads are synchronous writes – without a SLOG, every PUT waits for the data to hit spinning rust.

# Add NVMe SLOG
zpool add tank log /dev/nvme2n1

A small, fast NVMe device (even 16–32 GB) is enough. The SLOG only holds in-flight transactions, not persistent data.
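The 16–32 GB figure follows from what the SLOG actually holds: roughly two transaction-group intervals of incoming sync writes. A sizing sketch, assuming a 10 Gbit/s ingest link and the default 5 s zfs_txg_timeout:

```shell
# Worst case: the link runs flat out for two txg intervals before
# the pool flushes – the SLOG never needs to hold more than that.
link_Bps=$(( 10 * 1000 * 1000 * 1000 / 8 ))   # 10 Gbit/s in bytes/s
txg_s=5                                        # zfs_txg_timeout default
echo $(( link_Bps * txg_s * 2 ))               # 12500000000 ≈ 12.5 GB
```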

L2ARC

L2ARC extends the ARC (RAM cache) onto SSD. Files that fall out of ARC but are still warm get served from SSD instead of spinning disk.

# Add SSD L2ARC
zpool add tank cache /dev/sde

L2ARC is most useful when your working set exceeds available RAM but fits on SSD. If your entire dataset fits in ARC, L2ARC adds nothing.
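One caveat: L2ARC is not free RAM-wise – every block cached on SSD pins a header in ARC. A back-of-envelope sketch, assuming roughly 70 bytes of header per block (the exact figure varies by OpenZFS release) and 128K blocks:

```shell
# RAM consumed by ARC headers for 1 TiB of L2ARC at 128K block size.
l2_bytes=$(( 1024 ** 4 ))       # 1 TiB of L2ARC
block=$(( 128 * 1024 ))         # 128K blocks
hdr=70                          # ~70 bytes per header (approximate)
echo $(( l2_bytes / block * hdr ))   # 587202560 bytes ≈ 560 MiB
```

Smaller blocks multiply this overhead, which is another reason not to lower recordsize without cause.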

NVMe 1+2 (mirror)  → special vdev  (metadata + small files)
NVMe 3             → SLOG          (write intent log)
SSD 1              → L2ARC         (read cache overflow)
HDD 1+2 (mirror)   → data vdevs   (large files)
HDD 3+4 (mirror)   → data vdevs   (large files)

Synchronous vs asynchronous writes

ZFS has a per-dataset sync property that controls when a write is acknowledged to the client:

  • sync=standard (default) – Synchronous. Data is written to the ZFS intent log (ZIL) before acknowledging. A PutFS PUT returns once the data is on the SLOG (or the main pool if there is no SLOG). Safe against power loss.
  • sync=always – Forces synchronous behavior even for writes that don't request it. For PutFS this is the same as standard, since all PutFS writes are already sync.
  • sync=disabled – Asynchronous. Acknowledges immediately; data is flushed to disk later (up to zfs_txg_timeout, default 5 s). A PUT returns while the data is still in RAM only. Faster, but up to 5 seconds of writes can be lost on power failure.

Trade-offs

sync=standard + SLOG (recommended): Every PUT waits for the SLOG write (~0.1ms on NVMe). This is the safe default – no data loss on power failure, and the SLOG absorbs the latency that would otherwise hit the HDD pool (~5-15ms per sync write on spinning disk).

sync=disabled: Writes return immediately from RAM. Dramatically faster for small files (the PUT latency drops to sub-millisecond), but a power failure or kernel panic loses any writes not yet flushed to disk. Acceptable for scratch data, processing caches, or datasets that can be regenerated. Not acceptable for primary storage.

sync=standard without SLOG: Every sync write goes to the main pool. On HDDs, this means 5-15ms per PUT. On all-SSD pools, this is fine – no SLOG needed.
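To see what these trade-offs cost on your own hardware, sync-write latency can be measured directly with fio (a sketch; the benchmark directory is an assumption, and fio must be installed):

```shell
# 4K random writes with an fsync after every write – this exercises
# the same sync path a PutFS PUT takes. Compare completion latencies
# across sync=standard (with/without SLOG) and sync=disabled.
fio --name=sync-put --directory=/tank/putfs/bench \
    --rw=randwrite --bs=4k --size=256m --fsync=1 \
    --runtime=30 --time_based --group_reporting
```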

# Default: safe, uses SLOG if available
zfs set sync=standard tank/putfs

# Fast but unsafe: scratch data, regenerable datasets
zfs set sync=disabled tank/putfs/acme-corp/cache

# Per-dataset: safe for important data, fast for temp
zfs set sync=standard tank/putfs/acme-corp/documents
zfs set sync=disabled tank/putfs/acme-corp/processing-tmp

Warning

sync=disabled is a per-dataset footgun. Only use it on datasets where you accept data loss on crash. Never set it on the parent dataset if child datasets need durability – children inherit the parent's sync property unless explicitly overridden.
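Because of that inheritance, it's worth auditing the effective value across the tree – the SOURCE column distinguishes explicit settings from inherited ones:

```shell
# Recursively list sync for every dataset under tank/putfs.
# SOURCE shows 'local' where it was set explicitly, and
# 'inherited from ...' everywhere else.
zfs get -r -o name,value,source sync tank/putfs
```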

Dataset properties

One ZFS dataset per PutFS dataset

Give each PutFS dataset its own ZFS dataset so it can be tuned (and quota'd) independently:

zfs create tank/putfs
zfs create tank/putfs/acme-corp
zfs create tank/putfs/acme-corp/invoices

recordsize

The default recordsize is 128K – optimal for large sequential files. It doesn't waste space on small files (it's a cap, not an allocation size), but for datasets whose files are read or rewritten in small pieces, a smaller value can help:

# Small file dataset – 4K records
zfs set recordsize=4K tank/putfs/acme-corp/thumbnails

# Mixed dataset – keep default 128K
# ZFS only allocates what's needed, 128K is the *maximum*

Note

recordsize is the maximum block size, not the minimum. A file smaller than recordsize is stored in a single block sized to the file and rounded up to the sector size – a 2 KB file in a 128K-recordsize dataset occupies one 4K block (with ashift=12), not 128K. Lowering recordsize therefore doesn't shrink small files; it mainly pays off when files larger than the new recordsize are read or rewritten in small chunks, at the cost of more indirect-block metadata.
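The rounding in the note can be sketched as shell arithmetic (assuming a 4K sector size, i.e. ashift=12; compression and metadata ignored):

```shell
# A sub-recordsize file occupies one block, rounded up to the
# sector size – recordsize never enters the calculation.
file_size=2048    # 2 KB file
sector=4096       # ashift=12
echo $(( (file_size + sector - 1) / sector * sector ))   # 4096
```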

Compression

Always enable compression. With zstd, the effective ARC and L2ARC size increases because cached data is compressed:

zfs set compression=zstd tank/putfs

For datasets with already-compressed content (JPEG, video, ZIP):

zfs set compression=off tank/putfs/acme-corp/media

Extended attributes

Pack extended attributes into the dnode itself to avoid extra I/O:

zfs set xattr=sa tank/putfs
zfs set dnodesize=auto tank/putfs

atime

Disable access time updates – every GET would otherwise trigger a metadata write:

zfs set atime=off tank/putfs

Quotas

Per-dataset storage limits:

zfs set quota=100G tank/putfs/acme-corp/invoices
zfs set quota=1T tank/putfs/acme-corp/documents

PutFS returns 507 Insufficient Storage when the filesystem is full.
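To catch datasets approaching their quota before clients start seeing 507s, usage and limits can be pulled in machine-readable form (a sketch; the dataset name follows the examples above):

```shell
# -H drops headers, -p prints exact byte values – suitable for
# feeding into a monitoring check that compares used against quota.
zfs get -Hp -o name,property,value used,quota tank/putfs/acme-corp/invoices
```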

ARC tuning

Size

By default, ZFS uses up to 50% of system RAM for ARC. For a dedicated PutFS server, give it more:

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=34359738368  # 32 GB
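The byte value is just GiB expressed in bytes. A sketch of deriving it in shell, plus the standard runtime knob so a reboot isn't required (the /sys path is standard on Linux OpenZFS):

```shell
# Convert a GiB target to the byte value zfs_arc_max expects.
arc_gib=32
arc_bytes=$(( arc_gib * 1024 * 1024 * 1024 ))
echo "options zfs zfs_arc_max=${arc_bytes}"   # zfs_arc_max=34359738368

# Apply immediately on a running system (as root):
# echo "${arc_bytes}" > /sys/module/zfs/parameters/zfs_arc_max
```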

Monitoring

Check ARC hit rate – anything above 90% means most reads are served from RAM:

arc_summary | grep -A5 "ARC size"
arcstat 1  # live monitoring

Key metrics:

  • ARC hit rate – above 90% is good, above 98% is excellent
  • Demand metadata hits – directory lookups and stat calls. If this drops, small file performance degrades.
  • L2ARC hit rate – if low, your working set exceeds SSD + RAM

Complete example

A PutFS server with 64 GB RAM, 2x NVMe, 1x SSD, 4x HDD:

# Create pool
zpool create tank \
    mirror /dev/sda /dev/sdb \
    mirror /dev/sdc /dev/sdd \
    special mirror /dev/nvme0n1 /dev/nvme1n1 \
    cache /dev/sde

# Root dataset
zfs create tank/putfs
zfs set compression=zstd tank/putfs
zfs set atime=off tank/putfs
zfs set xattr=sa tank/putfs
zfs set dnodesize=auto tank/putfs
zfs set special_small_blocks=128K tank/putfs

# ARC – give it 48 GB of the 64 GB
echo 'options zfs zfs_arc_max=51539607552' > /etc/modprobe.d/zfs.conf

# One ZFS dataset per PutFS dataset
zfs create tank/putfs/acme-corp
zfs create -o recordsize=4K tank/putfs/acme-corp/thumbnails
zfs create -o quota=500G tank/putfs/acme-corp/documents

With this setup:

  • Small files (< 128K) are stored and read from NVMe via the special vdev
  • Hot files are served from RAM (ARC)
  • Warm files fall back to SSD (L2ARC)
  • Large files live on mirrored HDDs
  • Compressed blocks stay compressed in ARC and L2ARC, effectively enlarging both caches
  • No access time writes on reads

Further reading