# 2025-06-30 OVH3 disk full post-mortem
## Symptoms
While the general assembly was running on 2025-06-30, the wiki was no longer accessible.
After checking, we got a "no space left on device" error.
There was a steady increase in used disk space, and the zpool was effectively saturated (`zpool list rpool` showed it).
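To see which datasets and snapshots hold the space, a per-dataset breakdown helps (a sketch; `rpool` is the pool above):

```bash
# USEDSNAP is the space held by snapshots, USEDDS the space
# used by the live data of each dataset
zfs list -o space -r rpool
```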
## Remediation 1
I decided to remove some snapshots of the vm200 disk that were already synchronized, gaining maybe 30G.
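A sketch of that cleanup (the snapshot name is an example following sanoid's naming; per the footnote, vm-200-disk-2 is the disk that was checked at this point):

```bash
# List the disk's snapshots, smallest to largest
zfs list -t snapshot -o name,used -s used rpool/vm-200-disk-2
# Destroy one that is already synchronized elsewhere (example name)
zfs destroy rpool/vm-200-disk-2@autosnap_2025-06-23_00:00:02_daily
```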
I knew that after this kind of incident you have to restart everything, because otherwise many services might be in degraded mode.
So I restarted all the containers, but a VM did not want to restart; I finally went for a full restart of the server (which is the right way to go).
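For reference, containers and VMs are restarted through Proxmox's `pct` and `qm` CLIs (a sketch; the container id is an example, 200 is the staging VM):

```bash
# List and restart containers (101 is an example id)
pct list
pct start 101

# VMs are managed with qm
qm list
qm start 200
```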
But after a while (30 minutes or so), the disk was full again.
## Remediation 2
Not seeing any obvious fix[^vm-200-disk], I decided to remove the monitoring disk.
I had already stopped the VM (as monitoring had been moved). I did not remove the VM itself, in case we need to spin it up again, but just removed its disk using ZFS. Before that, I checked that we had synced data on ovh3.
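That check can be done from ovh3 (a sketch; the replicated dataset path on ovh3 is an assumption):

```bash
# Check that the dataset and its snapshots exist on the backup host
ssh ovh3 zfs list -t snapshot -o name,used rpool/ovh1/vm-203-disk-0
```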
```bash
# USE WITH CAUTION
zfs destroy -r rpool/vm-203-disk-0
```

The space was freed (`zpool list` shows it).
I modified vm203's options to tell it not to automatically start at boot time.
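This is a single `qm` command (203 being the VM id):

```bash
# Do not start vm203 automatically when the host boots
qm set 203 --onboot 0
```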
At this point, I rebooted the host (ovh1) and waited until it was back up.
As a safety measure, though, I went to the staging VM (200) and stopped the search-a-licious service (in /home/off/searchalicious-net).
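A sketch of the stop, assuming the service is deployed there with docker compose:

```bash
# Stop the search-a-licious containers without removing them
cd /home/off/searchalicious-net
docker compose stop
```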
## Cleaning
On ovh3, I removed the synchronization line for ovh1 vm-203-disk-0 in syncoid.conf, and modified sanoid.conf to say that this dataset must follow the local_data policy. This way we can keep it for some time, until we decide to remove it completely.
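The sanoid.conf entry could look like this (a sketch; the dataset path on ovh3 is an assumption, `local_data` is the policy named above):

```ini
# Keep local snapshots of the now-unsynchronized dataset
# under the local_data retention policy
[rpool/ovh1/vm-203-disk-0]
        use_template = local_data
```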
See commit 7867b48d5daa061ad3e3c40be3f924244d07a964
## Lessons learned
We have to be careful with ES and ZFS snapshots: ES systematically rewrites index files to disk, leading to a rapid increase in used disk space if we keep snapshots.
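A quick way to watch for this is to track how much space snapshots pin on the data disk (a sketch; vm-200-disk-0 is the data disk mentioned in the footnote):

```bash
# Space that would be freed if all snapshots of this dataset were destroyed
zfs get usedbysnapshots rpool/vm-200-disk-0
```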
[^vm-200-disk]: I listed the snapshots of vm-200-disk-2 and did not find a big variation, but this was the wrong disk to check… for historical reasons, the disk with the data is vm-200-disk-0. Had I checked the right disk, I might have been able to remove more snapshots and recover disk space.