# 2026-02-17 monitoring-01, Elasticsearch disk full
## Symptoms
After a deployment on monitoring-01, we got an alert that kibana was down.
Looking at monitoring-01, I saw in kibana's log that it is unable to start because a required migration cannot happen: the disk is full.
Looking at the ES log, I see that the index is read-only because disk usage is above the flood-stage watermark.
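This block is ES protecting itself: when disk usage crosses the flood-stage watermark (95% by default), it sets `index.blocks.read_only_allow_delete` on indices. On recent ES versions the block is released automatically once usage drops back below the watermark, but it can also be cleared manually. A minimal sketch, assuming ES answers on localhost:9200 without auth:

```shell
# Minimal sketch, assuming ES answers on localhost:9200 without auth.
ES="${ES:-localhost:9200}"
# Clear the flood-stage read-only block on all indices once space is freed:
curl -s -X PUT "$ES/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}' || true
```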
Indeed, the disk is at 96% capacity!
```
zfs list data-zfs -r
NAME                                                     USED  AVAIL  REFER  MOUNTPOINT
data-zfs                                                 190G  2.79G    96K  /data-zfs
data-zfs/docker-volumes                                  190G  2.79G   128K  /data-zfs/docker-volumes
data-zfs/docker-volumes/monitoring_alertmanager-data      96K  2.79G    96K  /data-zfs/docker-volumes/monitoring_alertmanager-data
data-zfs/docker-volumes/monitoring_elasticsearch-data    164G  2.79G   164G  /data-zfs/docker-volumes/monitoring_elasticsearch-data
data-zfs/docker-volumes/monitoring_grafana-data         46.5M  2.79G  17.8M  /data-zfs/docker-volumes/monitoring_grafana-data
data-zfs/docker-volumes/monitoring_influxdb-data        1.73G  2.79G   110M  /data-zfs/docker-volumes/monitoring_influxdb-data
data-zfs/docker-volumes/monitoring_prometheus-data      24.1G  2.79G  2.13G  /data-zfs/docker-volumes/monitoring_prometheus-data
```
```
df -h /data-zfs/docker-volumes/monitoring_elasticsearch-data
Filesystem                                             Size  Used Avail Use% Mounted on
data-zfs/docker-volumes/monitoring_elasticsearch-data  167G  164G  2.8G  99% /data-zfs/docker-volumes/monitoring_elasticsearch-data
```
## Resolution
My plan is to:
- list indexes that I can remove (old logs)
- ~~remove the watermark temporarily~~
- remove indexes
- ~~set back the watermark~~
I will then:
- check whether some logging is too verbose
- check whether ILM should be adjusted to discard logs earlier
- investigate why we did not get an alert on the disk

To ease the work, as kibana is not accessible, I will use the elasticvue app in standalone mode.
## Listing and removing indexes
Elasticvue offers a view of all indices.
I can see that some indices are very big:
in particular, logs-current-logs-2025.09.01-000308 is 18G and really old.
I am able to remove it (no need to change the threshold).
It seems ILM is not working correctly, as there are a lot of old indexes. I will remove them: I use the filter to get all indexes of 2024, and use the bulk action to remove them.
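For reference, the same cleanup can be done through the ES REST API (elasticvue wraps calls like these). A sketch, again assuming ES on localhost:9200 without auth:

```shell
ES="${ES:-localhost:9200}"
# Review the 2024 log indices, biggest first, before deleting anything:
curl -s "$ES/_cat/indices/logs-*2024*?v&h=index,store.size&s=store.size:desc" || true
# Delete them; wildcard deletes require action.destructive_requires_name=false:
curl -s -X DELETE "$ES/logs-*2024*" || true
```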
It has an immediate effect on disk use:
```
data-zfs/docker-volumes/monitoring_elasticsearch-data  128G  125G  2.8G  98% /data-zfs/docker-volumes/monitoring_elasticsearch-data
```
I do the same for the indexes before 2025-10, which are even more numerous and huge.
After that we are back to a very reasonable amount of data:
```
data-zfs/docker-volumes/monitoring_elasticsearch-data  2.8G   51M  2.8G   2% /data-zfs/docker-volumes/monitoring_elasticsearch-data
```
Kibana is finally back!
## Looking at indices content
I connect to kibana.openfoodfacts.org.
In Management > Stack Management > Index Management,
I see that I finally have only one log index remaining: `logs-current`.
It is not under ILM…
Using Analytics > Visualize library:
- I create a datatable
- index pattern: `logs-*`
- I add a split-row bucket on terms of `com_docker_compose_project_working_dir`, by count, showing 50 items
- I also have to take care of the time window in the top right corner and set it to the last 3 years
I see:
```
/home/off/monitoring                 54,179
/home/off/off-search-org             25,799
/home/off/open-prices-org            22,166
/home/off/filebeat                   18,558
/home/off/robotoff-net               16,598
/home/off/off-search-net             15,547
/home/off/nutripatrol-org            13,217
/home/off/argilla-org                12,305
/home/off/metrics-net/docker/metrics 10,651
/home/off/open-prices-net            10,276
```
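The same breakdown can also be obtained without the UI, as a terms aggregation over the log indices. A sketch; the host and the `.keyword` suffix (which depends on the field mapping) are assumptions:

```shell
ES="${ES:-localhost:9200}"
# Count documents per docker compose project working dir, top 50:
curl -s "$ES/logs-*/_search" -H 'Content-Type: application/json' -d '{
  "size": 0,
  "aggs": {
    "by_project": {
      "terms": {
        "field": "com_docker_compose_project_working_dir.keyword",
        "size": 50
      }
    }
  }
}' || true
```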
## Setting Index Lifecycle Policy
As we saw that `logs-current` is not under ILM, we want to add it.
Normally filebeat should have done that itself, but it's not the case.
The `configs/filebeat/config.yml` tells us the ILM policy name is `logs`.
In Management > Stack Management > Index Lifecycle Policies,
I can see that the policy `logs` exists
and that its content conforms to `configs/filebeat/ilm-config.json`.
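For reference, an ILM policy of this kind typically looks like the following (illustrative values only, not the actual content of `configs/filebeat/ilm-config.json`):

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "5gb", "max_age": "7d" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```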
Before I can set up the policy, there is a problem to fix:
`logs-current` is an index, when it should be an alias pointing to an index.
Also, in ES we can't rename an index.
So first I do the following:
- put `logs-current` in read-only mode (there is an action on the index in the elasticvue interface)
- clone it into `logs-2025.03.15-000001` (there is an action in elasticvue)
- remove `logs-current`
- ~~add alias `logs-current` to the created index~~
The last step does not work because filebeat immediately recreates the index…
To disable this, here is what I did:
- I generated a password hash using `openssl passwd -5 -stdin`
- I edited `/opt/reverse_proxy/nginx/passwords/elasticsearch.openfoodfacts.org`:
  - commented out the current off password
  - added my new one
- I changed the password in elasticvue

Then I:
- cloned `logs-current` into `logs-2026.02.17-000001` (there is an action in elasticvue)
- removed `logs-current`
- added the alias `logs-current` to `logs-2026.02.17-000001`

Finally I edited `/opt/reverse_proxy/nginx/passwords/elasticsearch.openfoodfacts.org` again to put back the old password, and put it back in elasticvue as well.
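Behind the elasticvue actions, the clone/remove/alias sequence corresponds to these ES API calls (a sketch, assuming localhost:9200 without auth):

```shell
ES="${ES:-localhost:9200}"
# Cloning requires the source index to be write-blocked first:
curl -s -X PUT "$ES/logs-current/_settings" -H 'Content-Type: application/json' \
  -d '{"index.blocks.write": true}' || true
curl -s -X PUT "$ES/logs-current/_clone/logs-2026.02.17-000001" || true
# Remove the original index, then recreate its name as an alias:
curl -s -X DELETE "$ES/logs-current" || true
curl -s -X PUT "$ES/logs-2026.02.17-000001/_alias/logs-current" || true
```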
Now I must apply the `logs` policy to my logs indices. For this, in kibana, I go to Management > Stack Management > Index Management:
- I search `logs-` and select `logs-2026.02.17-000001`
- I use the action "Add lifecycle policy"
- I select the `logs` lifecycle policy, with `logs-current` as the rollover alias, and apply
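Equivalently, the policy and rollover alias can be attached through the index settings API (a sketch under the same localhost assumption):

```shell
ES="${ES:-localhost:9200}"
# Attach the "logs" ILM policy and its rollover alias to the new index:
curl -s -X PUT "$ES/logs-2026.02.17-000001/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.lifecycle.name": "logs", "index.lifecycle.rollover_alias": "logs-current"}' || true
```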
## About the alert on disk threshold
It would have been important to have an alert as the disk got full.
### Diagnosis
In `configs/prometheus/alerts.yml`, we have a VMOutOfDiskSpace alert,
but it looks at a specific device (`/dev/sda1`) and mountpoint (`/`),
which only works for a very specific setup!
So we need to change this.
Testing in prometheus, the query `node_filesystem_avail_bytes{device!="tmpfs"}`
only returns two entries, both from OVH.
Why don't we have the ones from monitoring-01?
### Fixing node metrics on monitoring-01
On monitoring-01 itself, `curl localhost:8281/metrics | grep filesystem_avail` does return the filesystem metrics, so node_exporter is fine.
But if I go in /home/off/monitoring and use:
```
sudo -u off docker compose exec --user root prometheus sh
wget -O - host.docker.internal:8281/metrics
Connecting to host.docker.internal:8281 (172.17.0.1:8281)
wget: can't connect to remote host (172.17.0.1): Connection timed out
```
Yet the host answers ping from inside the container:
```
ping 172.17.0.1
PING 172.17.0.1 (172.17.0.1): 56 data bytes
64 bytes from 172.17.0.1: seq=0 ttl=64 time=0.201 ms
```
And node_exporter is indeed published on port 8281:
```
docker ps|grep node
53ebef18c7a6   prom/node-exporter:v1.2.2   "/bin/node_exporter …"   7 months ago   Up 7 months   0.0.0.0:8281->9100/tcp, [::]:8281->9100/tcp   filebeat-node_exporter-1
```
So the firewall must be blocking these requests.
I modified `off_private_networks_v4` in `group_vars/all/common.yml`
and ran:
```
ansible-playbook jobs/configure.yml -l monitoring-01 --tags firewall
```
Back in my prometheus container on monitoring-01,
`wget -O - host.docker.internal:8281/metrics` now works correctly!
The change is contributed in https://github.com/openfoodfacts/openfoodfacts-infrastructure/pull/595
The strange thing is that I see no errors in prometheus logs
about host.docker.internal not being reachable…
### Fixing the alert
Now I want my alert to be more general.
Playing around in the prometheus graph interface, I came up with this expression:
```
node_filesystem_avail_bytes{device!="tmpfs",device!="ramfs"} / on (app,device,env,service,fstype,mountpoint) node_filesystem_size_bytes{device!="tmpfs",device!="ramfs"}
```
The first term selects the node_filesystem_avail_bytes series that are not tmpfs or ramfs;
we then divide by the node_filesystem_size_bytes series
having the same app, device, env, service, fstype and mountpoint labels.
But it turns out the `on` is not mandatory: by default the division matches on all labels, so this simpler expression works:
```
(node_filesystem_avail_bytes{device!="tmpfs",device!="ramfs"} / node_filesystem_size_bytes)
```
This is contributed in https://github.com/openfoodfacts/openfoodfacts-monitoring/pull/129
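A generalized alert rule built on this expression could look like the following sketch (threshold, duration and annotations are illustrative assumptions, not the exact content of the PR):

```yaml
groups:
  - name: disk
    rules:
      - alert: VMOutOfDiskSpace
        # Fires when any real filesystem has less than 10% space left:
        expr: (node_filesystem_avail_bytes{device!="tmpfs",device!="ramfs"} / node_filesystem_size_bytes) < 0.10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }} ({{ $labels.mountpoint }})"
```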