# 2025-11-20 ovh3 disk replace

We once again had a disk failure on ovh3, resulting in the disk being taken offline.

We got an email:

ZFS device fault for pool 0x0600B76D0B2E72DB on ovh3

The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.

 impact: Fault tolerance of the pool may be compromised.
    eid: 2804785
  class: statechange
  state: FAULTED
   host: ovh3
   time: 2025-11-17 03:01:30+0000
  vpath: /dev/sdf1
  vphys: pci-0000:00:1f.2-ata-6
  vguid: 0xCC41DA46A0AC20A4
  devid: ata-HGST_HUH721212ALE601_5PG0KKAF-part1
   pool: 0x0600B76D0B2E72DB
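
Before opening a ticket, the fault can be confirmed directly on the host. A minimal sketch of the usual checks (the eid and vdev guid from the mail should show up in the ZFS event log):

# show only unhealthy pools and their faulted devices
zpool status -x

# dump recent ZFS events, including the statechange event referenced by the eid above
zpool events -v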

As always, I asked OVH support for a replacement on our mecenat account, on 19.11.2025 at 16:49:27.

Here is the content of the ticket.

CS12863462

Serial number of the defective disk(s):
5PG0KKAF

If the disk(s) to be replaced are not detected, enter here the serial numbers of the disks to keep:
5PHH715D
8DKZYH9H
8DJ8YH1H
5PHAYLUF
8CKERNSE

For HG or FS type servers only, if LED identification is not available, please confirm the part replacement by shutting down your server (coldswap):
Yes

Do you have a backup of your data?
Yes

What is the state of your RAID volumes?
We don't have hardware RAID.

# zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
 Sufficient replicas exist for the pool to continue functioning in a
 degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
 repaired.
  scan: scrub repaired 0B in 5 days 09:36:29 with 0 errors on Fri Nov 14 10:00:30 2025
config:

 NAME        STATE     READ WRITE CKSUM
 rpool       DEGRADED     0     0     0
   raidz2-0  DEGRADED     0     0     0
     sda     ONLINE       0     0     0
     sdb     ONLINE       0     0     0
     sdc     ONLINE       0     0     0
     sdd     ONLINE       0     0     0
     sde     ONLINE       0     0     0
     sdf     FAULTED      0    21     0  too many errors

errors: No known data errors

Additional information:
We want to replace the device asap.

What is the result of the smartctl tests?
# smartctl -l error /dev/sdf
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.203-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 1
 CR = Command Register [HEX]
 FR = Features Register [HEX]
 SC = Sector Count Register [HEX]
 SN = Sector Number Register [HEX]
 CL = Cylinder Low Register [HEX]
 CH = Cylinder High Register [HEX]
 DH = Device/Head Register [HEX]
 DC = Device Command Register [HEX]
 ER = Error register [HEX]
 ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 45587 hours (1899 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 68 b0 e0 90 9a 40 08  33d+19:08:36.193  WRITE FPDMA QUEUED
  61 80 d0 80 a2 93 40 08  33d+19:08:36.192  WRITE FPDMA QUEUED
  61 40 48 f0 a0 93 40 08  33d+19:08:36.192  WRITE FPDMA QUEUED
  61 40 68 50 9f 93 40 08  33d+19:08:36.191  WRITE FPDMA QUEUED
  61 40 80 10 9f 93 40 08  33d+19:08:36.191  WRITE FPDMA QUEUED
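
For reference, the serial numbers requested in the ticket (the faulty disk and the disks to keep) can be collected with smartctl or lsblk; a minimal sketch:

# serial number of the faulted disk
smartctl -i /dev/sdf | grep -i 'serial number'

# serial numbers of all disks, to list the ones to keep
lsblk -d -o NAME,SERIAL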

We got the disk replaced in less than an hour!

So the next morning I replaced the disk and gave the resilvering a high priority:

zpool replace rpool /dev/sdf
echo 15000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms
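
For the record, zfs_resilver_min_time_ms is the minimum time (in ms) ZFS spends on resilver work between transaction group commits; the default is 3000, so 15000 gives the resilver a much bigger share of the I/O at the expense of regular pool traffic. A quick sketch to check the value and the progress:

# current value of the tunable (3000 by default)
cat /sys/module/zfs/parameters/zfs_resilver_min_time_ms

# follow the resilver progress
zpool status rpool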

As of Friday at 11:11, the zpool status report was:

  scan: resilver in progress since Thu Nov 20 08:44:25 2025
    12.2T scanned at 139M/s, 12.0T issued at 137M/s, 56.3T total
    1.78T resilvered, 21.30% done, 3 days 22:02:50 to go
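
Once the resilver completes, the plan (a sketch, to be adapted) is to put the tunable back to its default and check that the pool is healthy again:

# restore the default resilver priority (3000 ms)
echo 3000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms

# the pool should report no faulted device anymore
zpool status -x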