
# 2024-09-30 OVH3 backups (wrong approach)

VERY IMPORTANT: this approach does not work, and at the time of writing we are in the process of changing how we do it.

We need an intervention to change a disk on ovh3.

We still have very few backups for OVH services.

Before the operation, I want to at least have replication of OVH backups on the new MOJI server.

Ideally I would like to use sanoid for every kind of backup, but I don't want to disrupt the current setup, as I don't have the time to.

What I wanted to do:

- add sanoid-managed snapshots to the currently replicated volumes of ovh1 / ovh2 containers and VMs
- add sanoid-managed snapshots to the current volumes of ovh3 containers, and add a backup of the ovh3 system (it is not on ZFS)
- synchronize all those ZFS datasets to MOJI; this may need some tweaking because of the replication snapshot, which should not be replicated

This is not feasible, because of the replication: replication must start from the last replication snapshot and cannot be done in reverse.

What we can do instead:

- add sanoid snapshots on the ovh1 / ovh2 servers
- let the container replication sync those snapshots to ovh3
- keep very few snapshots (1 or 2) on the ovh1 / ovh2 side (we have very little space left)
- keep snapshots for longer on the ovh3 side

## Moving replication to a pve sub dataset (abandoned)

NOTE: this was ultimately not done, because I did not manage to make it work and was not really confident about the procedure.

Currently replications land directly in rpool. This is annoying because it prevents configuring sanoid with the recursive property (which would also ensure that new volumes come under sanoid control). So I would like to move them to a pve sub-dataset.
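
For illustration, here is a minimal sanoid.conf sketch of what the recursive property allows once volumes live under a common parent dataset (the dataset name rpool/pve and the retention values are placeholders, not an actual configuration):

```ini
# Hypothetical sketch: one stanza covers every volume under rpool/pve,
# including volumes created later, thanks to recursive = yes.
[rpool/pve]
        use_template = production
        recursive = yes

[template_production]
        hourly = 2
        daily = 0
        monthly = 0
        yearly = 0
        autosnap = yes
        autoprune = yes
```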

To do this (a sketch of the corresponding commands follows the list):

- First, replication of container 106 had been stalled for a long time, so I deleted the replication job and re-created it.
- Using the interface, I disabled replication of all VMs and containers to ovh3. It can also be done using `pvesr disable <id>`.
- I also stopped the two containers on ovh3 (100 (Munin) and 150 (gdrive-backup)).
- I then created a new dataset: `zfs create rpool/pve`
- Then I changed `/etc/pve/storage.cfg` to change the pool and mountpoint of the rpool storage.
- I tried to move a first replication by using `zfs rename` to move a subvol from rpool to rpool/pve, and then re-enabled the replication… but it failed with a zfs allow/unallow error.
- As I was not able to understand the error (there was no particular allowed user before, as shown by `zfs allow rpool`), I stepped back:
  - disabled replication on the container where I had re-enabled it
  - renamed the volume back to a child of rpool
  - restored `/etc/pve/storage.cfg` to its original state
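
For reference, a minimal sketch of the move attempt on a single container, assuming a hypothetical job id 101-0 and dataset subvol-101-disk-0 (not the actual ones touched):

```bash
# Move one replicated subvolume under rpool/pve, then re-enable its replication.
pvesr disable 101-0
zfs rename rpool/subvol-101-disk-0 rpool/pve/subvol-101-disk-0
pvesr enable 101-0        # this is the step that failed with a zfs allow/unallow error

# Rollback when it does not work:
pvesr disable 101-0
zfs rename rpool/pve/subvol-101-disk-0 rpool/subvol-101-disk-0
```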

## Adding sanoid snapshots to replicated volumes

### Adding sanoid on ovh1 and ovh2

I installed sanoid using the .deb that was on ovh3:

```bash
apt install libcapture-tiny-perl libconfig-inifiles-perl pv lzop mbuffer
dpkg -i /opt/sanoid_2.2.0_all.deb
```

I then:

- created the email-on-failure unit
- personalized the sanoid systemd unit

```bash
cd /opt/openfoodfacts-infrastructure/confs/$HOSTNAME
mkdir -p systemd/system
cd systemd/system
ln -s ../../../common/systemd/system/email-failures\@.service .
ln -s ../../../common/systemd/system/sanoid.service.d .

ln -s /opt/openfoodfacts-infrastructure/confs/$HOSTNAME/systemd/system/email-failures\@.service /etc/systemd/system
ln -s /opt/openfoodfacts-infrastructure/confs/$HOSTNAME/systemd/system/sanoid.service.d /etc/systemd/system
systemctl daemon-reload
```

Then I added the sanoid.conf, telling it to snapshot the volumes once an hour but to keep only 2 snapshots.
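
As an illustration, a sketch of what such a configuration can look like (dataset and template names are placeholders; the real file lives in the confs directory of the repository):

```ini
# Hypothetical sketch: hourly snapshots, but keep only 2 because space is tight.
[rpool/subvol-101-disk-0]
        use_template = replicated

[rpool/subvol-102-disk-0]
        use_template = replicated

[template_replicated]
        frequently = 0
        hourly = 2
        daily = 0
        monthly = 0
        yearly = 0
        autosnap = yes
        autoprune = yes
```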

Then we activate:

```bash
ln -s /opt/openfoodfacts-infrastructure/confs/$HOSTNAME/sanoid/sanoid.conf /etc/sanoid/
systemctl enable --now sanoid.timer
```

### Configuring sanoid on ovh3

On ovh3 we want to keep more snapshots than on ovh1 and ovh2, so we configure sanoid accordingly.
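
A sketch of the kind of template this implies on ovh3 (the retention numbers are placeholders, and the idea that sanoid only prunes on ovh3, the snapshots arriving through replication, is my assumption):

```ini
# Hypothetical sketch: longer retention on the backup side.
[rpool/subvol-101-disk-0]
        use_template = backupside

[template_backupside]
        frequently = 0
        hourly = 24
        daily = 14
        monthly = 3
        yearly = 0
        autosnap = no      # assumption: snapshots come in through replication
        autoprune = yes
```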

## Syncing to MOJI

On Moji, we don't currently sync data from ovh3.

I set up an operator account on ovh3 for moji.

I created the syncoid-args.conf file.
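
A sketch of what entries in syncoid-args.conf could look like, one set of syncoid arguments per line (user, host and dataset names below are placeholders, not the real entries):

```bash
# Hypothetical syncoid-args.conf: each non-comment line is passed to syncoid as-is.
--no-sync-snap --no-privilege-elevation operator@ovh3.example.org:rpool/subvol-101-disk-0 backup-pool/ovh3/subvol-101-disk-0
--no-sync-snap --no-privilege-elevation operator@ovh3.example.org:rpool/subvol-102-disk-0 backup-pool/ovh3/subvol-102-disk-0
```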

I did a first sync using:

```bash
grep -v "^#" syncoid-args.conf | while read -a sync_args;do [[ -n "$sync_args" ]] && time syncoid "${sync_args[@]}" </dev/null;done
```

I set up the syncoid service and timer, and enabled them.
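
A minimal sketch of the kind of service / timer pair this refers to (unit contents and schedule are assumptions, not the actual files from the confs directory):

```ini
# Hypothetical syncoid.service: a oneshot that runs the sync loop shown above.
[Unit]
Description=Sync ZFS datasets from ovh3 with syncoid
OnFailure=email-failures@%n.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/syncoid-all.sh

# Hypothetical syncoid.timer
[Unit]
Description=Run syncoid regularly

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

They would then be enabled the same way as the sanoid timer above, with `systemctl enable --now syncoid.timer`.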

## Side fix: fixing VM 200 replication

Replication of VM 200 (docker staging) was stalled on ovh1.

I tried to remove the replication job, but it failed. To remove it, I ran `pvesr delete 200-0 -force` and it worked.

I then recreated the replication job.
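
For reference, the CLI equivalent would look something like this (the recreation may well have been done from the web interface instead; the schedule value is a placeholder):

```bash
# Force-remove the stalled job, then recreate replication of VM 200 towards ovh3.
pvesr delete 200-0 -force
pvesr create-local-job 200-0 ovh3 --schedule '*/30'
```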

## Side fix: removing old volumes

There are volumes remaining on ovh3 from containers that were deleted.

To get an idea of which container a volume belonged to, I can cat its /etc/hostname. For example, for container 112:

```bash
cat /rpool/subvol-112-disk-0/etc/hostname
mongo2
```

```bash
for num in 109 115 116 117 119 120 122;do echo $num; cat /rpool/subvol-$num-disk-0/etc/hostname;done
109
slack
115
robotoff-dev
116
mongo-dev
117
tensorflow-xp
119
robotoff-net
120
impact-estimator
122
off-net2
```

I then destroyed the following volumes:
```bash
# slack
zfs destroy rpool/subvol-109-disk-0 -r
# mongo2
zfs destroy rpool/subvol-112-disk-0 -r
# robotoff-dev
zfs destroy rpool/subvol-115-disk-0 -r
# mongo-dev
zfs destroy rpool/subvol-116-disk-0 -r
# tensorflow-xp
zfs destroy rpool/subvol-117-disk-0 -r
# robotoff-net
zfs destroy rpool/subvol-119-disk-0 -r
# impact estimator
zfs destroy rpool/subvol-120-disk-0 -r
# off-net2
zfs destroy rpool/subvol-122-disk-0 -r
```
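
As a quick sanity check afterwards, the remaining datasets directly under rpool can be listed to confirm that only live volumes are left:

```bash
# List datasets one level below rpool.
zfs list -d 1 rpool
```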