# 2024-09-30 OVH3 backups (wrong approach)
VERY IMPORTANT: this approach does not work; at the time of writing, we are in the process of changing the way we do it.
We need an intervention to change a disk on ovh3.
We still have very few backups for OVH services.
Before the operation, I want to at least have replication of OVH backups on the new MOJI server.
Ideally I would like to use sanoid for every kind of backup, but I don't want to disrupt the current setup, as I don't have the time to.
What I wanted to do:
- add sanoid-managed snapshots to the currently replicated volumes of ovh1 / ovh2 containers and VMs
- add sanoid-managed snapshots to the current ovh3 container volumes, plus a backup of the ovh3 system (it's not on ZFS)
- synchronize all those ZFS datasets to MOJI; it may need some tweaking because of the replication snapshot, which should not be replicated
This is not feasible, because of the replication: replication must start from the last replication snapshot and cannot be done in reverse.
What we can do instead:
- add sanoid snapshots on the ovh1 / ovh2 servers
- let replication of the container sync those snapshots to ovh3
- keep very few snapshots (1 or 2) on the ovh1 / ovh2 side (we have very little space left)
- keep snapshots for longer on the ovh3 side
## Moving replication to a pve sub dataset (abandoned)
NOTE: in the end this was not done, because I didn't manage to make it work, and was not really confident in the procedure.
Currently we have replications landing in `rpool`;
this is annoying because it does not let us configure sanoid
using the recursive property (which would also ensure new volumes are under sanoid control).
So I would like to move them to a `pve` sub dataset.
To do this:
- First, replication of 106 had been stalled for a long time, so I deleted the replication job and re-created it.
- Using the interface, I first disabled replication of all VMs / containers to ovh3. It can also be done using `pvesr disable <id>`.
- I also stopped the two containers on ovh3 (100 (Munin) and 150 (gdrive-backup)).
- I then created a new dataset: `zfs create rpool/pve`
- Then I changed `/etc/pve/storage.cfg` to change the pool and mountpoint of the rpool storage.
- I tried to move a first replication by using `zfs rename` to move a subvol from `rpool` to `rpool/pve`, and then re-enabled the replication… but it failed with a zfs allow/unallow error.
- As I was not able to understand the error (there was no particular allowed user before, as shown by `zfs allow rpool`), I stepped back:
  - disabled replication on the container where I had re-enabled it
  - renamed the volume back to a child of `rpool`
  - restored `/etc/pve/storage.cfg` to its original state
## Adding sanoid snapshots to replicated volumes
### Adding sanoid on ovh1 and ovh2
I installed sanoid using the .deb that was on ovh3:

```bash
apt install libcapture-tiny-perl libconfig-inifiles-perl pv lzop mbuffer
dpkg -i /opt/sanoid_2.2.0_all.deb
```
I then:
- created the email-on-failure unit
- personalized the sanoid systemd unit
```bash
cd /opt/openfoodfacts-infrastructure/confs/$HOSTNAME
mkdir -p systemd/system
cd systemd/system
ln -s ../../../common/systemd/system/email-failures\@.service .
ln -s ../../../common/systemd/system/sanoid.service.d .
ln -s /opt/openfoodfacts-infrastructure/confs/$HOSTNAME/systemd/system/email-failures\@.service /etc/systemd/system
ln -s /opt/openfoodfacts-infrastructure/confs/$HOSTNAME/systemd/system/sanoid.service.d /etc/systemd/system
systemctl daemon-reload
```
Then I added the sanoid.conf, telling sanoid to snapshot the volumes once an hour but keep only 2 snapshots.
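For illustration, such a sanoid.conf could look like this (a sketch, not the actual file: the dataset name and template name are examples):

```ini
# one section per replicated volume (example dataset name)
[rpool/subvol-101-disk-0]
  use_template = replication_source

# snapshot hourly but keep only 2: ovh1 / ovh2 are short on space
[template_replication_source]
  frequently = 0
  hourly = 2
  daily = 0
  monthly = 0
  yearly = 0
  autosnap = yes
  autoprune = yes
```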
Then we activate:
```bash
ln -s /opt/openfoodfacts-infrastructure/confs/$HOSTNAME/sanoid/sanoid.conf /etc/sanoid/
systemctl enable --now sanoid.timer
```
### Configuring sanoid on ovh3
On ovh3 we want to keep more snapshots than on ovh1 and ovh2. So we configure sanoid to do so.
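For illustration, the ovh3 side could use a template like this one — a sketch assuming sanoid on ovh3 only prunes the snapshots that arrive through replication (`autosnap = no`) and keeps them longer (dataset name and retention values are examples):

```ini
[rpool/subvol-101-disk-0]
  use_template = replication_target

# don't take snapshots here, only prune the replicated ones
[template_replication_target]
  hourly = 24
  daily = 30
  monthly = 3
  autosnap = no
  autoprune = yes
```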
## Syncing to MOJI
On Moji, we don't currently sync data from ovh3.
I set up an operator account on ovh3 for moji,
and created the `syncoid-args.conf` file.
I did a first sync using:
```bash
grep -v "^#" syncoid-args.conf | while read -a sync_args; do
  [[ -n "$sync_args" ]] && time syncoid "${sync_args[@]}" </dev/null
done
```
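To illustrate how this loop parses `syncoid-args.conf`, here is a self-contained sketch with a sample file (the dataset names and the file format are assumptions) and `echo` standing in for the real syncoid call:

```shell
# Build a sample syncoid-args.conf: one syncoid argument list per line,
# "#" lines are comments. Names below are hypothetical.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# lines starting with "#" are skipped by the grep
operator@ovh3:rpool/subvol-100-disk-0 backups/ovh3/subvol-100-disk-0
operator@ovh3:rpool/subvol-150-disk-0 backups/ovh3/subvol-150-disk-0
EOF

# Same loop shape as above: each non-comment, non-empty line becomes
# one invocation (echo replaces syncoid for the demonstration).
grep -v "^#" "$conf" | while read -a sync_args; do
  [[ -n "$sync_args" ]] && echo syncoid "${sync_args[@]}"
done

rm -f "$conf"
```

`read -a` splits each line on whitespace into an array, so extra per-line flags (e.g. `--no-privilege-elevation`) are passed through unchanged.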
I set up the syncoid service and timer, and enabled them.
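A sketch of what such units could look like (the `ExecStart` script path and the schedule are assumptions; the failure notification reuses the same `email-failures@.service` pattern as sanoid):

```ini
# syncoid.service (sketch; the ExecStart script is hypothetical —
# it would run the syncoid-args.conf loop shown above)
[Unit]
Description=Sync ZFS backups from ovh3
OnFailure=email-failures@%n.service

[Service]
Type=oneshot
ExecStart=/path/to/run-syncoid.sh

# syncoid.timer (sketch)
[Unit]
Description=Run syncoid regularly

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```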
## Side fix: fixing vm 200 replication
VM 200 (docker staging) was stalled on ovh1.
I tried to remove the replication job but it failed.
To remove it I did:

```bash
pvesr delete 200-0 -force
```

and it worked.
I then recreated the replication job.
## Side fix: removing old volumes
There are volumes remaining on ovh3 from containers that were deleted.
To get an idea of which container a volume came from, I can cat its `/etc/hostname`.
For example, for container 112:

```bash
cat /rpool/subvol-112-disk-0/etc/hostname
mongo2
```
```bash
for num in 109 115 116 117 119 120 122; do echo $num; cat /rpool/subvol-$num-disk-0/etc/hostname; done
109
slack
115
robotoff-dev
116
mongo-dev
117
tensorflow-xp
119
robotoff-net
120
impact-estimator
122
off-net2
```
I then destroyed the following volumes:
```bash
# slack
zfs destroy rpool/subvol-109-disk-0 -r
# mongo2
zfs destroy rpool/subvol-112-disk-0 -r
# robotoff-dev
zfs destroy rpool/subvol-115-disk-0 -r
# mongo-dev
zfs destroy rpool/subvol-116-disk-0 -r
# tensorflow-xp
zfs destroy rpool/subvol-117-disk-0 -r
# robotoff-net
zfs destroy rpool/subvol-119-disk-0 -r
# impact estimator
zfs destroy rpool/subvol-120-disk-0 -r
# off-net2
zfs destroy rpool/subvol-122-disk-0 -r
```