2025-06-27 OVH3 Disk Replace

Symptoms

ZFS reported errors for /dev/sde in the rpool ZFS pool.

The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.

 impact: Fault tolerance of the pool may be compromised.
    eid: 4000890
  class: statechange
  state: FAULTED
   host: ovh3
   time: 2025-05-31 13:16:36+0000
  vpath: /dev/sde1
  vphys: pci-0000:00:1f.2-ata-5
  vguid: 0xBE814F46F0DCFA0A
  devid: ata-HGST_HUH721212ALE601_5PHAYBUF-part1
   pool: 0x0600B76D0B2E72DB
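
The same event can also be read back later from the ZFS event log; a minimal sketch to list recent events:

# show recent ZFS events in verbose form
zpool events -v | tail -n 40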

I could see it with zpool status:

zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: resilvered 38.5G in 03:21:49 with 0 errors on Fri Jun  6 09:12:33 2025
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       DEGRADED     0     0     0
      raidz2-0  DEGRADED     0     0     0
        sda     ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     FAULTED    457     0     0  too many errors
        sdf     ONLINE       0     0     0
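
To double-check which physical disk that is before opening a ticket, the persistent names under /dev/disk/by-id can be matched against the sde device; the ata-HGST_HUH721212ALE601_5PHAYBUF entry from the event above should show up here:

# list persistent disk names that point at sde
ls -l /dev/disk/by-id/ | grep sde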

Asking for a disk replacement

At 11:48, I asked OVH for a disk replacement via their support form.

To get the serial number of the faulty disk:

smartctl -i /dev/sde | grep "Serial Number"
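
To fill in the "disks to keep" field below, every serial can be dumped in one go; a sketch, assuming bash and that the six pool disks are sda through sdf:

# print each disk's serial number; all but 5PHAYBUF are to keep
for d in /dev/sd{a..f}; do
  echo -n "$d: "; smartctl -i "$d" | grep "Serial Number"
done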

Serial number of the faulty disk(s):
5PHAYBUF

If the disk(s) to be replaced are not detected, provide here the serial numbers of the disks to keep:
5PHH715D
5PG578XF
8DJ8YH1H
5PHAYLUF
8DH4VTRH

For HG or FS servers only, if LED identification is not available, please confirm the part replacement by shutting down your server (coldswap):
No

Do you have a backup of your data?
yes

What is the state of your RAID volumes?
We don't have hardware RAID.

We use ZFS raidz2-0.

zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
 Sufficient replicas exist for the pool to continue functioning in a
 degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
 repaired.
  scan: resilvered 38.5G in 03:21:49 with 0 errors on Fri Jun  6 09:12:33 2025
config:

 NAME        STATE     READ WRITE CKSUM
 rpool       DEGRADED     0     0     0
   raidz2-0  DEGRADED     0     0     0
     sda     ONLINE       0     0     0
     sdb     ONLINE       0     0     0
     sdc     ONLINE       0     0     0
     sdd     ONLINE       0     0     0
     sde     FAULTED    457     0     0  too many errors
     sdf     ONLINE       0     0     0

errors: No known data errors

Additional information:
We want to replace the sde device ASAP.

OVH support responded promptly, and at 15:22 (3.5 hours later) the disk was replaced.

Resilvering

Then I logged in to ovh3 and replaced the old disk with the new one:

zpool replace rpool /dev/sde
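
This single-argument form works because the replacement disk appeared under the same /dev/sde name; when no new device is given, zpool replace reuses the old device path. A quick sanity check beforehand (the serial should now differ from the old 5PHAYBUF):

smartctl -i /dev/sde | grep "Serial Number"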

I sped the resilver up as much as possible, since restoring redundancy quickly is vital:

echo 15000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms
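
For reference, the stock default for this tunable is 3000 ms; it sets how much time each transaction group must spend on resilver I/O, so raising it favors the resilver over regular pool traffic. A sketch to check the value and put it back once the resilver completes:

# current value (stock default is 3000)
cat /sys/module/zfs/parameters/zfs_resilver_min_time_ms
# restore the default afterwards so normal I/O is not starved
echo 3000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms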

zpool estimated it would take 1 day and 6 hours:

zpool status
  ...
  scan: resilver in progress since Fri Jun 27 14:10:32 2025
    1.64T scanned at 1.16G/s, 690G issued at 490M/s, 52.2T total
    109G resilvered, 1.29% done, 1 days 06:36:49 to go
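
To follow progress without retyping the command, a simple sketch:

# refresh the pool status every 60 seconds
watch -n 60 zpool status rpool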