ZFS overview#
Why#
We use ZFS extensively for our data because of its reliability and powerful capabilities. The most important feature is data synchronization through snapshots. Clones also make it easy to get the same data as production for tests.
Learning resources#
To learn about ZFS, see the onboarding made by Christian
Also see the official OpenZFS documentation: it's very complete, but a bit hard to navigate.
- misc man page section contains a lot of useful information on properties and features
- Basic Concepts is a good place to return to from time to time to better grasp the concepts
- Man pages on system administration commands are really useful
- Performance and Tuning is also valuable to dig into some options
Proxmox ZFS documentation is another valuable resource.
Tutorial about ZFS snapshots and clone: https://ubuntu.com/tutorials/using-zfs-snapshots-clones#1-overview
A good cheat-sheet: https://blog.mikesulsenti.com/zfs-cheat-sheet-and-guide/
Some useful commands#
- `zpool status` — to check for any errors
- `zpool list -v` — to see all devices. Note: there is a quirk with ALLOC, which differs between mirror pools and raidz pools: on the former it's space allocated to datasets, on the latter it's used space.
- `zfs list -r` — to get all datasets and their mountpoints
- `zpool iostat` — to see stats about reads / writes. `zpool iostat -vl 5` is particularly useful; `zpool iostat -w` helps you understand the time taken for data to be read / written.
- `zpool history` — to list all operations done on a pool
- `zpool list -o name,size,usedbysnapshots,allocated` — to see allocated space (equivalent to `df`). Note: `df` on a dataset does not really work because free space is shared between the datasets. You can still see dataset usage with: `zfs list -o name,used,usedbydataset,usedbysnapshots,available -r <pool_name>`
- `zdb` is also worth knowing (`zdb -s <poolname>` for example)
Proxmox#
Proxmox uses ZFS to replicate containers and VMs between servers. It also uses it to back up data.
Using sanoid#
We use sanoid / syncoid to sync ZFS datasets between servers (also to back them up).
See sanoid
How to create a zpool#
Ensure the disks are available and not mounted (e.g. using `lsblk`).
You can optionally split them into different partitions if needed.
Just create the pool with the resources:
zpool create <pool_name> <pool-type> <device1> <device2> ...
For pool name use something like zfs-something (zfs-hdd, zfs-nvme, etc.)
For pool-type, use the one you need: none (omit it for a simple stripe), mirror, or raidz.
You may add a cache and / or a log device.
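For illustration, a minimal sketch (device names are hypothetical) of creating a mirrored pool and then adding cache and log devices:

```
# create a mirrored pool named zfs-hdd from two whole disks
zpool create zfs-hdd mirror /dev/sdb /dev/sdc

# optionally add a read cache (L2ARC) device and a separate log (SLOG) device
zpool add zfs-hdd cache /dev/nvme0n1
zpool add zfs-hdd log /dev/nvme1n1
```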
How to NFS mount a zfs dataset#
ZFS integrates directly with the NFS server.
To share a dataset over NFS, you have to set the sharenfs property to the allowed addresses and other options.
Example:
zfs set sharenfs="rw=@10.0.0.0/28,no_root_squash" <pool_name>/<dataset_name>
zfs set sharenfs="rw=@10.0.0.1/32,rw=@10.1.0.200/32,no_root_squash" <pool_name>/<dataset_name>
Very important: always filter on an internal subnetwork, otherwise your dataset is exposed to the internet!
Beware that all descendant datasets inherit the property and will also be shared.
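As a quick sanity check, you can inspect the property and mount the export from an allowed client. A sketch, assuming hypothetical names and that 10.0.0.2 stands for the NFS server's address:

```
# check the sharenfs property on the dataset
zfs get sharenfs <pool_name>/<dataset_name>

# on a client inside the allowed subnet, mount the export (the path is the dataset mountpoint)
mount -t nfs 10.0.0.2:/<pool_name>/<dataset_name> /mnt/<dataset_name>
```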
How to replace a disk in zpool#
To replace a disk in a zpool:
- Get a list of devices using `zpool status <pool_name>`
- Put the disk offline: `zpool offline <pool_name> <device_name>` (e.g. `zpool offline rpool sdf`)
- Replace the disk physically
- Ask zpool to replace the drive: `zpool replace <pool_name> /dev/<device_name>` (e.g. `zpool replace rpool /dev/sdf`)
- Verify the disk is back and being resilvered: `zpool status <pool_name>`

        …
          state: DEGRADED
        status: One or more devices is currently being resilvered. The pool will
                continue to function, possibly in a degraded state.
        action: Wait for the resilver to complete.
          scan: resilver in progress since …
        …
                replacing-5  DEGRADED     0     0     0
                  old        OFFLINE      0     0     0
                  sdf        ONLINE       0     0     0  (resilvering)
- After the resilver finishes, you can optionally run a scrub: `zpool scrub <pool_name>`
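Putting the steps together, a consolidated sketch using the pool and device names from the example above:

```
zpool status rpool            # identify the failing device
zpool offline rpool sdf       # take it offline before swapping the hardware
# ... physically replace the disk ...
zpool replace rpool /dev/sdf  # start resilvering onto the new disk
zpool status rpool            # watch the resilver progress
zpool scrub rpool             # optional scrub once the resilver completes
```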
How to Sync ZFS datasets#
To sync ZFS datasets, you take snapshots on the source at specific intervals (we use cron jobs). You then use zfs-send and zfs-recv through ssh to sync the remote server (sending the snapshots).
Doing it automatically#
We normally do this using sanoid and syncoid.
Proxmox may also replicate containers across the cluster (its storage replication feature relies on ZFS).
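For reference, a minimal syncoid invocation could look like this (dataset and host names are hypothetical; see the sanoid page for our actual setup):

```
# replicate a dataset to a remote host over ssh
syncoid zfs-hdd/mydata root@backup-host:zfs-hdd/mydata-copy
```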
Doing it manually#
zfs send -i <previous-snap> <dataset_name>@<last-snap> \
  | ssh <hostname> zfs recv -F <target_dataset_name>
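For example, a minimal sketch with hypothetical dataset, snapshot, and host names:

```
# take a new snapshot on the source
zfs snapshot zfs-hdd/mydata@2024-01-02

# send the increment between the previously synced snapshot and the new one
zfs send -i zfs-hdd/mydata@2024-01-01 zfs-hdd/mydata@2024-01-02 \
  | ssh backup-host zfs recv -F zfs-hdd/mydata-copy
```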
ZFS sync of sto files from off1 to off2:
You also have to clean snapshots from time to time to avoid retaining too much useless data.
On ovh3: snapshot-purge.sh
How to Docker mount a zfs dataset#
If the ZFS dataset is on the same machine, we can use bind mounts to mount a folder from the ZFS dataset into the container.
For remote machines, ZFS datasets can be exposed as NFS shares. Docker has an integrated driver to mount remote NFS shares as volumes.
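A sketch of the NFS approach (the server address, dataset path, and volume name are hypothetical):

```
# create a volume backed by an NFS-exported dataset
docker volume create \
  --driver local \
  --opt type=nfs \
  --opt o=addr=10.0.0.2,rw \
  --opt device=:/zfs-hdd/mydata \
  mydata-nfs

# use it in a container
docker run --rm -v mydata-nfs:/data alpine ls /data
```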
How to Mount datasets in a proxmox container#
To mount a dataset in a Proxmox container, you have two options:
- to use a shared disk on proxmox
- or to mount them as bind volumes
Use a shared disk on proxmox#
(not really tested yet, but it could have the advantage of enabling replication)
On your VM/CT, under Resources, add a disk.
Add the mountpoint to your disk and declare it shared.
In another VM/CT, add the same disk.
Mount datasets as bind volumes#
See: https://pve.proxmox.com/wiki/Linux_Container#_bind_mount_points
and https://pve.proxmox.com/wiki/Unprivileged_LXC_containers
Edit /etc/pve/lxc/<container_id>.conf
and add volumes with mount points.
For example:
# volumes
mp0: /zfs-hdd/opff,mp=/mnt/opff
mp1: /zfs-hdd/opff/products/,mp=/mnt/opff/products
mp2: /zfs-hdd/off/users/,mp=/mnt/opff/users
mp3: /zfs-hdd/opff/images,mp=/mnt/opff/images
mp4: /zfs-hdd/opff/html_data,mp=/mnt/opff/html_data
mp5: /zfs-hdd/opff/cache,mp=/mnt/opff/cache
Important: if you have nested mount points, the order is very important. First the outermost, then the inner ones.
To take changes into account, you have to reboot the container: pct reboot <container_id>
Getting uids and gids right#
LXC maps uids inside the container to specific ids outside, most of the time by adding a large value. It's a way to ensure security.
If you want files in the ZFS mount to belong to, say, uid 1000, you will have to tweak the mapping:
We edit /etc/subuid and /etc/subgid to add root:1000:10
. This allows containers started by root to map the 10 ids starting at 1000 to the same ids on the host system.
Edit /etc/pve/lxc/<machine_id>.conf
to add the id mappings:
# uid map: map container uids/gids 0..999 to the range starting at 100000 on the host
# so 0..999 (ct) → 100000..100999 (host)
lxc.idmap = u 0 100000 1000
lxc.idmap = g 0 100000 1000
# map the 10 ids starting at uid 1000 onto 1000, so 1000..1009 → 1000..1009
lxc.idmap = u 1000 1000 10
lxc.idmap = g 1000 1000 10
# map the rest of the 65536 ids, so 1010..65535 → 101010..165535
lxc.idmap = u 1010 101010 64526
lxc.idmap = g 1010 101010 64526
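To sanity-check the mapping, assuming a hypothetical container id 110 and the mount points from the example above, you can compare numeric ownership on the host and inside the container:

```
# on the host: check numeric ownership of the dataset backing the bind mount
ls -ln /zfs-hdd/opff

# inside the container: files owned by uid 1000 on the host should also show uid 1000 here
pct exec 110 -- ls -ln /mnt/opff
```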