# 2025-07-02 OVH3 Disk Replace
## Symptoms

A few days ago, while the resilver from the last disk replacement was still running, ZFS reported errors on /dev/sdb in the rpool pool:
```
The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.

 impact: Fault tolerance of the pool may be compromised.
    eid: 12619
  class: statechange
  state: FAULTED
   host: ovh3
   time: 2025-06-28 13:26:17+0000
  vpath: /dev/sdb1
  vphys: pci-0000:00:1f.2-ata-2
  vguid: 0xEF1AC8F34C562144
  devid: ata-HGST_HUH721212ALE601_5PG578XF-part1
   pool: 0x0600B76D0B2E72DB
```
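This notification comes from the ZFS Event Daemon (zed); the recent event history can also be inspected directly on the host, for example:

```sh
# list recent ZFS events with full details (class, vdev path, guid, ...)
zpool events -v | less
```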
I can see it with `zpool status`:
```
zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 8.56T in 4 days 16:34:51 with 0 errors on Wed Jul 2 06:45:23 2025
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdb     FAULTED     31     0     0  too many errors
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0

errors: No known data errors
```
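Before asking for a replacement it can be worth cross-checking the SMART side of the story; a quick sketch (attribute names vary by drive model):

```sh
smartctl -H /dev/sdb                 # overall SMART health self-assessment
smartctl -a /dev/sdb | grep -iE 'reallocated|pending|uncorrect'
dmesg | grep -i 'sdb' | tail         # recent kernel-level I/O errors
```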
## Asking disk replacement
At 17:13, I asked OVH for a disk replacement through their support (ticket CS11289820).

To get the serial number of the faulty disk:

```
smartctl -i /dev/sdb | grep "Serial Number"
```
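The form also asks for the serials of the disks to keep; they can be collected in one pass. A minimal sketch, assuming the six drives are /dev/sda through /dev/sdf:

```sh
# print each drive's serial number (device list is an assumption)
for d in /dev/sd{a..f}; do
  printf '%s: ' "$d"
  smartctl -i "$d" | awk -F': *' '/Serial Number/ {print $2}'
done
```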
**Serial number of the faulty disk(s):**

5PG578XF

**If the disk(s) to replace are not detected, provide here the serial numbers of the disks to keep:**

5PHH715D
8DJ8YH1H
5PHAYLUF
8CKERNSE
8DH4VTRH

**For HG or FS servers only, if LED identification is not available, please confirm the part replacement by shutting down your server (cold swap):**

No

**Do you have a backup of your data?**

Yes

**What is the state of your RAID volumes?**

We don't have hardware RAID; we use ZFS raidz2-0.

**Additional information:**

We want to replace the sdb device ASAP.

**What is the result of the smartctl tests?**
(the same `zpool status` output as shown in the Symptoms section above)
OVH automatically created a concurrent ticket (CS11289821) at 17:20, and at 22:26 (five hours later) the disk was replaced.
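Before re-adding the device to the pool, it's worth checking that the kernel sees the new drive and that its serial differs from the old 5PG578XF; for example:

```sh
smartctl -i /dev/sdb | grep -E 'Device Model|Serial Number'
ls -l /dev/disk/by-id/ | grep sdb    # stable by-id names for the new device
```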
## Resilvering
The next morning, I logged in to ovh3 and replaced the old disk with the new one in the pool:

```
zpool replace rpool /dev/sdb
cannot open '/dev/sdb1': Device or resource busy
```

The first attempt failed as above, but retrying shortly afterwards worked (there might be some race condition with the partition creation).
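A sketch of the retry, assuming the busy state only needs udev to finish processing the freshly created partitions (the delay is an arbitrary guess):

```sh
udevadm settle    # wait for pending udev events to be processed
sleep 30
zpool replace rpool /dev/sdb
```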
I sped up the resilver to the max, since restoring redundancy is vital:

```
# default is 3000
echo 15000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms
```
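This trades normal pool I/O latency for resilver throughput, so the value should go back to its default once the resilver completes:

```sh
# restore the default after the resilver finishes
echo 3000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms
```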
zpool estimates it will take about 12 hours, but I expect it to take around 4 days, as for the last resilver:
```
zpool status
...
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jul 3 08:22:23 2025
        536G scanned at 4.92G/s, 129G issued at 1.18G/s, 52.5T total
        3.41G resilvered, 0.24% done, 12:38:31 to go
...
```
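To keep an eye on the progress without retyping the command:

```sh
# refresh the pool status every minute while the resilver runs
watch -n 60 zpool status rpool
```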