ZFS tuning
This guide covers ZFS pool and dataset configuration optimized for PutFS workloads – primarily small file random reads, writes, and prefix listing.
See Data integrity and Scaling for what ZFS brings to PutFS.
Pool topology
Special vdev (the biggest win)
A special vdev stores ZFS metadata and small files (anything under special_small_blocks) on a dedicated device – typically NVMe. This is the single most impactful optimization for PutFS.
# Create pool with mirrored data vdevs and mirrored NVMe special vdev
zpool create tank \
mirror /dev/sda /dev/sdb \
mirror /dev/sdc /dev/sdd \
special mirror /dev/nvme0n1 /dev/nvme1n1
Warning
Always mirror the special vdev. If it's a single device and it fails, you lose the pool.
The special vdev always stores metadata (dnodes, indirect blocks, directory entries). To also store small files on it, set special_small_blocks:
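For example, to route blocks up to 128K onto the special vdev for the PutFS root dataset:

```shell
# Blocks up to 128K (file data and metadata) are allocated on the special vdev
zfs set special_small_blocks=128K tank/putfs
```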
With special_small_blocks=0 (the default), only metadata lives on the special vdev – no file data. This still accelerates directory listings and stat calls, but file reads still hit the HDD pool. Setting it to 128K moves small files to NVMe too, which is where PutFS gets the biggest read speedup. Note that any block less than or equal to the threshold qualifies: with the default 128K recordsize, a 128K threshold routes essentially all newly written blocks to the special vdev, so either size the NVMe mirror for that, or raise recordsize (e.g. to 1M) on datasets holding large files.
With this, every small file read is an NVMe read. Directory lookups, inodes, and metadata are also on NVMe. The spinning disks only handle large files.
SLOG
A SLOG (Separate intent LOG) absorbs synchronous write latency. PutFS uploads are synchronous writes – without a SLOG, every PUT waits for the data to hit spinning rust.
A small, fast NVMe device (even 16–32 GB) is enough. The SLOG only holds in-flight transactions, not persistent data.
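A SLOG can be attached to an existing pool at any time (the device name here is illustrative):

```shell
# Add a dedicated intent-log device; removable later with `zpool remove`
zpool add tank log /dev/nvme2n1
```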
L2ARC
L2ARC extends the ARC (RAM cache) onto SSD. Files that fall out of ARC but are still warm get served from SSD instead of spinning disk.
L2ARC is most useful when your working set exceeds available RAM but fits on SSD. If your entire dataset fits in ARC, L2ARC adds nothing.
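Adding an L2ARC device is likewise non-destructive (device name is illustrative):

```shell
# Add an SSD as L2ARC read cache; cache contents are expendable, no mirror needed
zpool add tank cache /dev/sde
```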
Recommended topology summary
NVMe 1+2 (mirror) → special vdev (metadata + small files)
NVMe 3 → SLOG (write intent log)
SSD 1 → L2ARC (read cache overflow)
HDD 1+2 (mirror) → data vdevs (large files)
HDD 3+4 (mirror) → data vdevs (large files)
Synchronous vs asynchronous writes
ZFS has a per-dataset sync property that controls when a write is acknowledged to the client:
| Value | Behavior | PutFS PUT returns when... |
|---|---|---|
| `sync=standard` (default) | Synchronous. Data is written to the ZFS intent log (ZIL) before acknowledging. | Data is on the SLOG (or main pool if no SLOG). Safe against power loss. |
| `sync=always` | Forces synchronous even for writes that don't request it. | Same as `standard` for PutFS (all writes are sync). |
| `sync=disabled` | Asynchronous. Acknowledges immediately; data is flushed to disk later (up to `zfs_txg_timeout`, default 5s). | Data is in RAM only. Faster, but up to 5 seconds of writes can be lost on power failure. |
Trade-offs
sync=standard + SLOG (recommended): Every PUT waits for the SLOG write (~0.1ms on NVMe). This is the safe default – no data loss on power failure, and the SLOG absorbs the latency that would otherwise hit the HDD pool (~5-15ms per sync write on spinning disk).
sync=disabled: Writes return immediately from RAM. Dramatically faster for small files (the PUT latency drops to sub-millisecond), but a power failure or kernel panic loses any writes not yet flushed to disk. Acceptable for scratch data, processing caches, or datasets that can be regenerated. Not acceptable for primary storage.
sync=standard without SLOG: Every sync write goes to the main pool. On HDDs, this means 5-15ms per PUT. On all-SSD pools, this is fine – no SLOG needed.
# Default: safe, uses SLOG if available
zfs set sync=standard tank/putfs
# Fast but unsafe: scratch data, regenerable datasets
zfs set sync=disabled tank/putfs/acme-corp/cache
# Per-dataset: safe for important data, fast for temp
zfs set sync=standard tank/putfs/acme-corp/documents
zfs set sync=disabled tank/putfs/acme-corp/processing-tmp
Warning
sync=disabled is a per-dataset footgun. Only use it on datasets where you accept data loss on crash. Never set it on the parent dataset if child datasets need durability – children inherit the parent's sync property unless explicitly overridden.
Dataset properties
Per-dataset datasets
Create a ZFS dataset for each PutFS dataset, so that recordsize, sync, quota, and compression can be tuned independently:
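For example, mirroring the layout used in the complete example later in this guide:

```shell
# One ZFS dataset per PutFS dataset – each can carry its own properties
zfs create tank/putfs/acme-corp
zfs create -o recordsize=4K tank/putfs/acme-corp/thumbnails
zfs create -o quota=500G tank/putfs/acme-corp/documents
```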
recordsize
The default recordsize is 128K – optimal for large sequential files but wastes space for small files. For datasets that primarily store small files:
# Small file dataset – 4K records
zfs set recordsize=4K tank/putfs/acme-corp/thumbnails
# Mixed dataset – keep default 128K
# ZFS only allocates what's needed, 128K is the *maximum*
Note
recordsize is the maximum block size, not the minimum. A 2KB file in a 128K-recordsize dataset uses a single 4K block (the minimum at the common ashift=12), not 128K. Lowering recordsize primarily helps when you have many files between 4K and 128K, as it reduces metadata overhead per block.
Compression
Always enable compression. With zstd, the effective ARC and L2ARC size increases because cached data is compressed:
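For the PutFS root dataset (child datasets inherit the property):

```shell
# zstd: good ratio at modest CPU cost; blocks are cached compressed in ARC/L2ARC
zfs set compression=zstd tank/putfs
```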
For datasets with already-compressed content (JPEG, video, ZIP):
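lz4's early-abort makes it nearly free on incompressible blocks, so it is a reasonable choice there (the `media` dataset name is illustrative):

```shell
# lz4 gives up quickly on incompressible data, so it costs almost nothing
zfs set compression=lz4 tank/putfs/acme-corp/media
```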
Extended attributes
Pack extended attributes into the dnode itself to avoid extra I/O:
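Both properties appear in the complete example below; applied to the root dataset they are inherited by children:

```shell
# Store xattrs in the dnode (system attributes) instead of hidden directories
zfs set xattr=sa tank/putfs
# Let ZFS grow dnodes as needed so packed xattrs fit
zfs set dnodesize=auto tank/putfs
```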
atime
Disable access time updates – every GET would otherwise trigger a metadata write:
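For the PutFS root dataset:

```shell
# Reads no longer dirty metadata just to record an access timestamp
zfs set atime=off tank/putfs
```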
Quotas
Per-dataset storage limits:
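For example:

```shell
# Hard cap: the dataset (including snapshots and descendants) cannot exceed 500G
zfs set quota=500G tank/putfs/acme-corp/documents
```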
PutFS returns 507 Insufficient Storage when the filesystem is full.
ARC tuning
Size
By default, ZFS uses up to 50% of system RAM for ARC. For a dedicated PutFS server, give it more:
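A sketch for the 64 GB server used in the complete example below, giving the ARC 48 GB:

```shell
# 48 GB = 48 * 2^30 bytes; takes effect after a module reload or reboot
echo 'options zfs zfs_arc_max=51539607552' > /etc/modprobe.d/zfs.conf
```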
Monitoring
Check ARC hit rate – anything above 90% means most reads are served from RAM:
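OpenZFS ships two tools for this; either works:

```shell
# One-shot summary of ARC/L2ARC statistics, including hit rates
arc_summary
# Live view, one line per second
arcstat 1
```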
Key metrics:
- ARC hit rate – above 90% is good, above 98% is excellent
- Demand metadata hits – directory lookups and stat calls. If this drops, small file performance degrades.
- L2ARC hit rate – if low, your working set exceeds SSD + RAM
Complete example
A PutFS server with 64 GB RAM, 2x NVMe, 1x SSD, 4x HDD:
# Create pool
zpool create tank \
mirror /dev/sda /dev/sdb \
mirror /dev/sdc /dev/sdd \
special mirror /dev/nvme0n1 /dev/nvme1n1 \
cache /dev/sde
# Root dataset
zfs create tank/putfs
zfs set compression=zstd tank/putfs
zfs set atime=off tank/putfs
zfs set xattr=sa tank/putfs
zfs set dnodesize=auto tank/putfs
zfs set special_small_blocks=128K tank/putfs
# ARC – give it 48 GB of the 64 GB
echo 'options zfs zfs_arc_max=51539607552' > /etc/modprobe.d/zfs.conf
# One ZFS dataset per PutFS dataset
zfs create tank/putfs/acme-corp
zfs create -o recordsize=4K tank/putfs/acme-corp/thumbnails
zfs create -o quota=500G tank/putfs/acme-corp/documents
With this setup:
- Small files (< 128K) are stored and read from NVMe via the special vdev
- Hot files are served from RAM (ARC)
- Warm files fall back to SSD (L2ARC)
- Large files live on mirrored HDDs
- Compression effectively enlarges the caches – ARC and L2ARC hold compressed blocks
- No access time writes on reads
Further reading
- OpenZFS documentation – official docs covering all properties and commands
- OpenZFS `zfsprops` man page – all dataset properties (`recordsize`, `compression`, `xattr`, `special_small_blocks`, etc.)
- OpenZFS `zpoolprops` man page – pool-level properties
- Klara Systems: Performance tuning ARC, L2ARC & SLOG – ARC sizing, L2ARC and SLOG configuration
- Brendan Gregg: ZFS L2ARC – L2ARC internals and performance characteristics
- 45Drives: ZFS Caching – practical guide to ARC and L2ARC tuning
- Jim Salter: ZFS 101 – comprehensive introduction to ZFS pool topologies and performance tradeoffs