
ZFS: the basics#

24 Feb 2022 - by Christian Quest


A bit of history#

  • Written about 20 years ago by Sun Microsystems.
  • For some time it had licensing problems, now solved.
  • Many forks have since joined forces again (OpenZFS) and more features are being developed.

ZFS glossary#

Forget partitions / filesystems / LVM / md, etc. ZFS has its own concepts:

  1. storage space (whole disks, partitions, even files; S3 planned!)
  2. vdev - virtual device (assembles several storage spaces together, defines redundancy)
  3. pool - gathers several virtual devices into one big unified space
  4. datasets - they are like filesystems (but they are not filesystems 😉)
  • ⚠️ there is no traditional notion of partitions in zfs
  • see also: https://openzfs.github.io/openzfs-docs/man/7/zfsconcepts.7.html

ZFS virtual device (vDev)#

  • vDev = a set of storage spaces:

    • Recommended: use whole disks in a vDev.
    • But you can also use a partition (part of a disk).
    • You can even use a file (good for testing or dev, not for prod).
    • (In a future ZFS version, even S3 storage can be added.)
  • Important:
    • A vdev has only one redundancy configuration (but a pool can contain vdevs with different redundancy).
    • In a vdev, all disks must have the same size.

vDev manipulation#

  • Most of the time, you create a vdev by adding it to a pool.
  • You can't extend a vDev by adding more disks (so far).
  • But you can replace the disks of a vdev one by one with bigger disks; the vdev expands once all disks have grown (see the example below).
  • There are special vdevs:
    • read cache (L2ARC)
    • write journal (ZIL / SLOG)
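
A rough sketch of the disk-by-disk replacement (pool and disk names are hypothetical):

  • zpool set autoexpand=on mypool

    Lets the pool grow automatically once all the disks of a vdev are bigger.

  • zpool replace mypool sda sdf

    Replaces sda by the bigger sdf; repeat for each disk, waiting for resilvering to finish each time.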

ZFS: pools#

  • There can be several virtual devices in a pool.
  • E.g.:
    • a pool with one vdev of 5 disks in RAIDZ1
    • and, later, another vdev with 5 more disks in RAIDZ2
  • You can remove some vdevs from a pool, but others cannot be removed, for example RAIDZ vdevs (see the sketch below).

    So think carefully before adding them.
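
A minimal sketch of vdev removal, which only works for some vdev types such as mirrors or single disks (names are hypothetical):

  • zpool remove mypool mirror-1

    Asks ZFS to evacuate the data and remove that top-level vdev; RAIDZ vdevs cannot be removed this way.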


ZFS vdev redundancy#

  • Example:
    • 5 x 4 TB disks give you 20 TB of raw storage space.
    • To get redundancy (mirrors, RAID), you rearrange them using virtual devices (vdev).
  • For each vdev you choose the level of redundancy (the number of disks that can disappear without data loss):
    • RAIDZ1 - RAIDZ2 - RAIDZ3 (1, 2 or 3 disks of redundancy)
    • mirror: all data written on all disks (max redundancy)
    • or no redundancy at all…

Create a zpool and vdev inside it#

  • zpool create mypool raidz1 sda sdb sdc sdd sde

    Creates a pool named mypool with one vdev in RAIDZ1 with the five disks

  • zpool add mypool mirror sdf sdg

    Adds a new vdev in mirror mode with two disks

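
To inspect the result (standard commands, output not shown here):

  • zpool status mypool

    Shows the vdevs, their disks and their state.

  • zpool list mypool

    Shows the size, free space and health of the pool.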

Spare disks#

  • You can add spare disks to your pool. A spare is a disk that is not in a vdev, but can be used as a replacement if a disk fails in any vdev (replacement is automatic). See the sketch below.
  • E.g. with 3 RAIDZ1 vdevs, a single spare may go into any of the vdevs.
  • A disk that was considered lost may in fact still be usable (after some tests) and can become the new spare.
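
A minimal sketch of adding a spare (disk name is hypothetical):

  • zpool add mypool spare sdh

    Adds sdh as a hot spare shared by all the vdevs of the pool.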

Datasets#

  • Each pool can have as many datasets as wanted.
  • Each dataset has its own settings: it can be compressed, encrypted, etc.
  • You don't control how a dataset is laid out in the pool.
  • Datasets have no defined size - they can grow (but you may place quotas).
  • Datasets are created in the pool, not on a specific vdev.

Datasets (2)#

  • zfs is the command to manage datasets.
  • ZFS is a copy-on-write (COW) storage system: data is always written to a new place, not where you read it previously. Thanks to that we can create snapshots (freeze a dataset at a point in time, while still writing to it).

Dataset creation example#

  • zfs create mypool/mydataset

    Creates a dataset with default settings (it can potentially use the whole pool)

  • zfs set compression=on mypool/mydataset

    Compression will be active for everything added after that
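
To check which settings apply (standard property query):

  • zfs get compression mypool/mydataset

    Shows the value of the property and where it is inherited from.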


Dataset size#

  • For dataset size:
    • either set quota (total size, snapshots included)
    • or set refquota (referenced data only, snapshots excluded)
  • zfs set refquota=100G mypool/mydataset

    Limits the quantity of data to 100G, without taking snapshots into account.
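
A sketch of the other option, snapshots included (value is hypothetical):

  • zfs set quota=200G mypool/mydataset

    Limits the total space used by the dataset, snapshots included.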


Snapshots#

  • You can diff two snapshots. This enables very cheap backups (see the sketch below).
  • Really helpful on Open Food Facts for the products (/path/bar/code/version.sto): millions of files were created, and they were hard to back up because rsync must look at every file (more than 2-3 hours). With ZFS we are down to below a minute, and we are able to back up every half hour.
  • A snapshot is also immediately usable (no restore needed); it's already the dataset.
  • When you use snapshots + diff, you can access any snapshot version thanks to virtual folders (the hidden .zfs/snapshot directory).
  • You can also remove some diffs (by destroying intermediate snapshots).
  • To sync ZFS you can snapshot at regular intervals.
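
A minimal sketch of diffing two snapshots (snapshot names are hypothetical):

  • zfs snapshot mypool/mydataset@monday

    Takes a first snapshot; take another one later (e.g. @tuesday).

  • zfs diff mypool/mydataset@monday mypool/mydataset@tuesday

    Lists the files created, modified, renamed or removed between the two snapshots.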

Clone#

  • You can create writable snapshots, known as clones. It's like a fork of the filesystem.
  • For example, if you want to test a script, you can test it on a clone (see the sketch below).
  • If you remove the clone, your changes are lost, but you can also promote the clone to replace the main dataset.
  • At Open Food Facts, staging areas use clones of backup datasets (and mount them through NFS).
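
A minimal sketch of the clone workflow (names are hypothetical):

  • zfs clone mypool/mydataset@mysnap mypool/myclone

    Creates a writable clone from a snapshot (instant, no data copied).

  • zfs promote mypool/myclone

    Makes the clone independent, so it can replace the original dataset.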

ZFS snapshot and sync#

  • zfs send mypool/mydataset@mysnap | zfs recv otherpool/otherdataset

    Generates a stream of data from a snapshot; store it in a file or, as here, receive it into another dataset.

  • zfs snapshot mypool/mydataset@mysnapshotname

    Creates a snapshot.

  • zfs send -I mypool/mydataset@oldsnap mypool/mydataset@newsnap

    Creates an incremental stream (-i sends only the difference between two snapshots; -I also sends all the snapshots in between).

  • Note: on the receiving side, you can keep a resume token (zfs recv -s), to be able to resume a send at the point where it broke (zfs send -t <token>).
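
A rough sketch of syncing to another machine over SSH (host and dataset names are hypothetical):

  • zfs send -I mypool/mydataset@oldsnap mypool/mydataset@newsnap | ssh backuphost zfs recv -s backuppool/mydataset

    Sends the incremental stream to a remote pool; -s keeps a resume token in case the transfer breaks.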

Block storage datasets#

  • Datasets can also be block devices, called volumes (zvols). You can format that block storage as ext4, etc. All reads/writes go through the dataset, so you still get snapshots, compression, etc.
  • ZFS has shortcuts for this option (see the sketch below).
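
A minimal sketch of a block storage dataset (size and names are hypothetical):

  • zfs create -V 10G mypool/myvolume

    Creates a 10 GB volume, exposed as /dev/zvol/mypool/myvolume.

  • mkfs.ext4 /dev/zvol/mypool/myvolume

    Formats the volume as ext4; it can then be mounted like any block device.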

Some features#

  • Encryption

    Datasets can be encrypted. Thanks to that you can have encrypted snapshots and do backups without deciphering the data.

  • Compression

    Datasets can use different compression algorithms: LZ4, gzip, Zstandard…

  • NFS sharing is integrated into ZFS, which is very handy (see the sketch below).
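
A rough sketch of these features (names are hypothetical; note that encryption can only be set when a dataset is created):

  • zfs create -o encryption=on -o keyformat=passphrase mypool/secret

    Creates an encrypted dataset, asking for a passphrase.

  • zfs set compression=zstd mypool/mydataset

    Switches compression to Zstandard for everything written afterwards.

  • zfs set sharenfs=on mypool/mydataset

    Exports the dataset through NFS.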

Performance / Data safety tradeoff#

  • ZFS is not the champion of performance because it favours data safety.
  • Real life example: checksum errors once happened because of a failing SSD cache; when the SSD was removed, everything went back to normal.
  • zpool scrub verifies data integrity (see the sketch below). If a checksum is wrong, the data is rewritten elsewhere (auto-repair).
  • If a disk has bad sectors, an error is reported. Thanks to redundancy, ZFS will write the data again (to restore redundancy). ZFS also manages "pending sectors": sectors that cannot be read, but that come back (reallocated) after a new successful write.
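
A minimal sketch of running and checking a scrub (pool name is hypothetical; scrubs are typically scheduled, e.g. monthly):

  • zpool scrub mypool

    Reads and verifies every checksum in the pool, repairing from redundancy when possible.

  • zpool status mypool

    Shows the scrub progress and any errors found.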

ZFS Read Cache#

  • The cache works at pool level.
  • ZFS has a RAM cache (ARC) and possibly a 2nd level on SSD (L2ARC), added as a cache vdev (see the sketch below).
  • The cache balances most recently used (MRU) and most frequently used (MFU) data, which avoids cache poisoning (which happens with a pure LRU cache when you read one big file and lose all the interesting cache).
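
A minimal sketch of adding an SSD read cache (device name is hypothetical):

  • zpool add mypool cache sdi

    Adds sdi as an L2ARC cache vdev; losing it is harmless, as it only holds copies of data.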

ZFS write with journaling#

  • Works at pool level.
  • Organizing data on the disk means grouping writes, which may make them wait.
  • ZFS has a write cache in RAM, but it's not crash-safe. You can add an SSD to keep a journal of the last writes, used only if we lose the RAM content after a crash (see the sketch below). You need a fast disk that tolerates a lot of rewrites, but only a few gigabytes are needed.
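
A minimal sketch of adding a mirrored log device (device names are hypothetical; mirroring the journal protects in-flight writes):

  • zpool add mypool log mirror sdj sdk

    Adds a mirrored SLOG vdev holding the write journal (ZIL).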

Deduplication (at pool level)#

  • You can deduplicate data in a pool.
  • It uses checksums to avoid storing the same block twice.
  • It saves space but needs a lot of RAM and slows down writes quite a lot.
  • As a consequence, few people use it.
  • It's real-time dedup (no async dedup, yet).
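
A minimal sketch (dataset name is hypothetical):

  • zfs set dedup=on mypool/mydataset

    Enables deduplication for new writes on that dataset; the dedup table must fit in RAM to stay fast.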

ZFS daemon#

  • ZED is the ZFS Event Daemon.
  • If there are important errors, it sends a mail to the administrator (see the sketch below).
  • Notifications from ZED are really important.
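
A rough sketch of the mail setup (the configuration file path may vary by distribution):

  • /etc/zfs/zed.d/zed.rc

    Contains ZED_EMAIL_ADDR, the address that receives alerts about pool errors.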

SSD management#

  • SSD trimming can be automatic (autotrim pool property), but you can also launch it manually with zpool trim (see the sketch below).
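
A minimal sketch (pool name is hypothetical):

  • zpool set autotrim=on mypool

    Enables continuous TRIM of freed blocks.

  • zpool trim mypool

    Launches a manual TRIM pass.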

ZFS limits#

It is very hard to hit the limits on the number of directories and files (ZFS is a 128-bit filesystem).

Same for file size.


Summary#

  • ZFS - very stable
  • focused on data safety
    • with checksums
    • transparent bad-sector handling, etc.
    • ECC memory is recommended
  • not the best for performance, but still good
  • if you lose too much redundancy you lose the entire pool (as data is spread everywhere)
    • for very large spaces, multiple pools might be considered
  • https://cq94.medium.com/zfs-vous-connaissez-vous-devriez-1d2611e7dad6

Using ZFS on your own machine#

  • It makes sense to use ZFS to sync prod data on your own machine
    • maybe using a partition or a large file (see the sketch below)
  • Btrfs might be a good option
    • more flexible on pool / vdev definitions
    • ⚠️ but beware of its striped RAID modes (RAID5/6), which are a bit buggy
  • ZFS for root - only if you know it well
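
A minimal sketch of a file-backed pool for testing (file path and size are hypothetical):

  • truncate -s 10G /var/tmp/zfs-test.img

    Creates a sparse 10 GB file to act as storage space.

  • zpool create testpool /var/tmp/zfs-test.img

    Creates a pool on that file - good for testing or dev, not for prod.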