2025-12-09 scaleway-01 disk problem

Symptoms

On 28/11, Alex ran into an apt upgrade problem (involving zfs-dkms) on scaleway-01. He decided to reboot in order to switch to the new kernel and get rid of the old one, but scaleway-01 never came back online. Unfortunately, the IPMI was not reachable at that point (it could have been, but we did not know how to reach it).

Resolution

On 09/12, Christian went to the Scaleway data center and tried to reboot the server. It started to boot but never finished, and one disk was reporting a lot of errors. We had in fact received error mails about this disk, but Alex had not realized the importance of the alert (because of a communication problem with Christian).

Christian was originally going to the Scaleway DC to replace a disk on off1 (which is close by), but he decided to use that disk for scaleway-01 instead. He removed the old disk, installed the new one, and ran a resilver. He also did a dist-upgrade.
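For readers unfamiliar with the procedure, a ZFS disk swap followed by a resilver can be sketched roughly as below. This is an illustration only, not a record of the commands Christian ran: the pool name and device names are placeholders passed as arguments, and the helper function names are hypothetical.

```shell
# Replace a failed device in a ZFS pool; ZFS then resilvers onto the new device.
# $1: pool name, $2: old device, $3: new device (all placeholders, not the real values)
replace_pool_disk() {
    zpool replace "$1" "$2" "$3"
}

# Show pool health, including resilver progress while one is running.
pool_status() {
    zpool status "$1"
}

# Example usage (as root, with hypothetical names):
#   replace_pool_disk mypool /dev/disk/by-id/old-disk /dev/disk/by-id/new-disk
#   pool_status mypool
```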

Missed warnings

As mentioned above, smartd had reported the disk as being in a bad state. The alerts looked like this:

Subject: SMART error (CurrentPendingSector) detected on host: scaleway-01

This message was generated by the smartd daemon running on:

   host name:  scaleway-01
   DNS domain: infra.openfoodfacts.org

The following warning/error was logged by the smartd daemon:

Device: /dev/sdd [SAT], 19 Offline uncorrectable sectors

Device info:
WDC  WUH721414ALE604, S/N:9JH5NR2T, WWN:5-000cca-258d0ab1c, FW:LDGNW400, 14.0 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Mon Nov 17 05:05:02 2025 CET
Another message will be sent in 24 hours if the problem persists.

The relatively small number of sectors did not raise a high level of alarm for Alex.

Alex did launch a smartctl long test on the disk, but on 28/11 it had not finished yet (it takes several days).
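For reference, a long self-test is typically started and followed up with smartctl as sketched below. The device path /dev/sdd comes from the alert mail; the helper function names are only illustrative, and the commands normally need root.

```shell
# Start an extended (long) SMART self-test; it runs in the background on the drive.
start_long_test() {
    smartctl -t long "$1"
}

# Show the self-test log: results of past tests and, on most drives,
# the status of a test that is still running.
show_selftest_log() {
    smartctl -l selftest "$1"
}

# Show the SMART attribute table (Current_Pending_Sector, Offline_Uncorrectable, ...).
show_attributes() {
    smartctl -A "$1"
}

# Example usage (as root):
#   start_long_test /dev/sdd
#   show_selftest_log /dev/sdd
#   show_attributes /dev/sdd
```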

Note that, while searching for this mail in my archive, I found that scaleway-01 sent further mails with this error after the incident, by which time the number of offline uncorrectable sectors was far higher (160).
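Since the trend in this count (19, then 160) is what mattered, it can help to pull the number out of each alert mail and compare it over time. The following is a minimal sketch, assuming the mail body keeps the format shown above; the function name and the file name in the usage example are hypothetical.

```shell
# Print the number of offline uncorrectable sectors found in a smartd mail read from stdin.
# Prints nothing if the mail contains no such line.
offline_uncorrectable_count() {
    sed -n 's/^Device: .*, \([0-9][0-9]*\) Offline uncorrectable sectors.*$/\1/p'
}

# Example usage (alert-mail.txt is a hypothetical saved copy of the mail):
#   offline_uncorrectable_count < alert-mail.txt
```

Run against the alert quoted above, this prints 19. A count that keeps growing between two mails is a strong hint that the disk should be replaced soon.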

Afterthoughts

From this incident:

We gained experience about how serious this kind of disk error is, although it was hard to predict that a reboot would turn out to be such a problem. These things happen.
We are setting up access to IPMI. This access must go through an SSH SOCKS proxy so that you reach IPMI on the right IP address (otherwise it will refuse to connect).
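A minimal sketch of such a proxy, assuming OpenSSH on the client side. The host name, the local port 1080 and the function name are placeholders, not the actual setup; the real jump host is whichever machine is allowed to reach the IPMI interface.

```shell
# Open a local SOCKS proxy through an SSH host that can reach the IPMI interface.
#   -D <port>  dynamic (SOCKS) port forwarding on localhost
#   -N         do not run a remote command, only forward
# $1: SSH destination (placeholder), $2: local SOCKS port (defaults to 1080, an arbitrary choice)
open_ipmi_socks_proxy() {
    ssh -N -D "${2:-1080}" "$1"
}

# Example usage (hypothetical host name):
#   open_ipmi_socks_proxy user@jump-host.example.org 1080
# Then point the browser at SOCKS5 proxy localhost:1080 and open the IPMI web interface.
```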