Ingredients Spellcheck
A key element of the Open Food Facts database is the parsing of the product ingredients. These lists of ingredients either come from contributors' annotations or from OCR-extracted text from packaging pictures.
However, text typos or incorrect OCR extraction lead to ingredients that are not recognized by the Product Opener service. You can read more about this process in the wiki.
For this reason, the Ingredients Spellcheck was developed to solve this issue and improve the quality of ingredient parsing.
TL;DR
Mistral-7B-Base was fine-tuned on lists of ingredients extracted from the Open Food Facts database. This dataset was synthetically generated using closed-source LLMs (GPT-3.5-Turbo) and manually reviewed with Argilla, an open-source annotation tool.
The current model (v1) outperforms the closed-source LLMs on our benchmark. A custom evaluation algorithm was created to correctly estimate the Spellcheck performance.
Model | Correction Precision | Correction Recall | Correction F1 |
---|---|---|---|
GPT-3.5-Turbo | 0.557 | 0.727 | 0.631 |
GPT-4o | 0.311 | 0.702 | 0.431 |
Gemini-1.5-flash | 0.544 | 0.596 | 0.569 |
Claude3-Sonnet-3.5 | 0.178 | 0.810 | 0.292 |
Our model | 0.664 | 0.630 | 0.647 |
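For reference, the Correction F1 column is the harmonic mean of the Correction Precision and Correction Recall columns. A quick sanity check in Python (illustrative only, not part of the project code):

```python
# Harmonic mean of precision and recall for the "Our model" row.
precision, recall = 0.664, 0.630
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.647, matching the table
```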
The model is integrated into Robotoff and runs as batch inference using Google Batch Job.
Evaluation algorithm
Our solution is very specific: correct errors in lists of ingredients to enable the Ingredients Parser to accurately identify the composition of each product.
However, since the corrections are later added to the database, we need to ensure the model doesn't correct an ingredient by mistake. In other words, we minimize the number of False Positives while maximizing the overall Recall.
Traditional evaluation metrics such as ROUGE, BLEU, or METEOR fall short in assessing the quality of the spellcheck process: they don't provide a detailed analysis of how many words were correctly rectified versus those that weren't.
Therefore, we developed an algorithm that takes 3 inputs for each list of ingredients: the original, the reference, and the prediction.
Example:
Original: "Th cat si on the fride,"
Reference: "The cat is on the fridge."
Prediction: "Th big cat is in the fridge."
We transform each text into a sequence of tokens and apply a sequence alignment method to align identical tokens between the original and the reference, and between the original and the prediction. Each token is assigned 1 if it was modified and 0 otherwise.
By comparing these two pairs of aligned sequences, we count the number of True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN), and from these the overall Precision and Recall.
```
Orig-Ref:  1  0  0  1  0  1  1  1  1
Orig-Pred: 0  1  0  1  1  1  1  1  1
Meaning:   FN FP TN TP FP TP TP TP TP
```
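As an illustration of the idea, here is a simplified sketch (not the project's actual implementation) of the alignment and counting, using Python's difflib with naive whitespace tokenization. For simplicity, it only checks whether a token was modified, not whether the correction actually matches the reference, which the real algorithm also needs to account for:

```python
import difflib


def modified_flags(original_tokens: list[str], edited_tokens: list[str]) -> list[int]:
    """Flag each original token with 1 if it was changed in the edited text, else 0."""
    flags = [0] * len(original_tokens)
    matcher = difflib.SequenceMatcher(a=original_tokens, b=edited_tokens)
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        if tag != "equal":  # replace / delete (pure insertions have i1 == i2)
            for i in range(i1, i2):
                flags[i] = 1
    return flags


def spellcheck_metrics(original: str, reference: str, prediction: str) -> tuple[float, float]:
    orig = original.split()
    should_fix = modified_flags(orig, reference.split())  # tokens that ought to change
    did_fix = modified_flags(orig, prediction.split())    # tokens the model changed
    tp = sum(s and d for s, d in zip(should_fix, did_fix))
    fp = sum(not s and d for s, d in zip(should_fix, did_fix))
    fn = sum(s and not d for s, d in zip(should_fix, did_fix))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


print(spellcheck_metrics(
    "Th cat si on the fride,",
    "The cat is on the fridge.",
    "Th big cat is in the fridge.",
))  # ≈ (0.667, 0.667) with this naive whitespace tokenization
```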
Coupled with a benchmark carefully prepared using the Spellcheck Guidelines, the algorithm is capable of evaluating any solution, from Regular Expression techniques to LLMs.
You'll find more details about the evaluation algorithm in the project README.
Guidelines
The Guidelines are a set of rules defined to guide and restrict the corrections made by the Spellcheck.
They were also used to create the benchmark and to generate the training dataset with proprietary LLMs (GPT-3.5-Turbo) for synthetic data generation.
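As a rough illustration of that generation step (not the actual data-generation script), synthetic corrections could be produced by prompting GPT-3.5-Turbo with the guidelines as a system message; the guideline text, prompt wording, and example input below are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GUIDELINES = "..."  # the Spellcheck Guidelines, pasted or loaded from a file


def generate_correction(ingredient_list: str) -> str:
    """Ask GPT-3.5-Turbo for a corrected list of ingredients following the guidelines."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": GUIDELINES},
            {"role": "user", "content": f"Correct this list of ingredients:\n{ingredient_list}"},
        ],
    )
    return response.choices[0].message.content


print(generate_correction("water, sugr, salt"))
```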
Model
The model is accessible on Hugging Face, along with its demo.
A text instruction is provided to the model during training and inference; you can find it in the same model repository.
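For example, such a fine-tuned causal LM could be loaded with the transformers library. The repository id and prompt below are placeholders; check the model card for the actual repository name and instruction format:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "openfoodfacts/spellcheck-mistral-7b"  # placeholder: use the actual repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# The instruction wording is illustrative; the real one ships with the model repository.
prompt = "###Correct the list of ingredients:\nwater, sugr, salt\n\n###Correction:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```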
Training pipeline
The model training consists of a succession of steps, each requiring different resource allocations, such as cloud GPUs, data validation, and logging. For this reason, we decided to orchestrate the training using Metaflow, an orchestrator designed for Data Science and Machine Learning projects.
The training pipeline is composed as follows (see the sketch after this list):
- Configurations and hyperparameters are imported into the pipeline from config YAML files.
- The training job is launched in the cloud using AWS SageMaker. The `spellcheck/src/` package, containing the different modules, is imported along with the training script. Once the job is done, the model artifact is stored in a private AWS S3 bucket. All training details are tracked in the experiment tracker Comet ML.
- The fine-tuned model is then evaluated on the benchmark using the custom evaluation algorithm. vLLM is used to accelerate the evaluation. Currently, this process is handled manually, but further work is needed to fully integrate it into the pipeline.
- The predictions against the benchmark, also stored in AWS S3, are sent to Argilla for human evaluation under a unique ID: the experiment key.
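As a rough sketch of how these steps could map onto Metaflow (the flow name, step names, config path, and placeholders below are illustrative, not the project's actual flow):

```python
from metaflow import FlowSpec, Parameter, step


class SpellcheckTrainingFlow(FlowSpec):
    """Illustrative outline of the training pipeline described above."""

    config_path = Parameter(
        "config",
        default="config/training.yaml",  # placeholder path
        help="Path to the config YAML file with hyperparameters.",
    )

    @step
    def start(self):
        # Import configurations and hyperparameters from the YAML file.
        import yaml  # requires pyyaml
        with open(self.config_path) as f:
            self.config = yaml.safe_load(f)
        self.next(self.train)

    @step
    def train(self):
        # In the real pipeline, a SageMaker training job is launched here: it
        # imports spellcheck/src/ and the training script, stores the model
        # artifact in a private S3 bucket, and logs everything to Comet ML.
        self.model_artifact_uri = "s3://..."  # placeholder
        self.next(self.evaluate)

    @step
    def evaluate(self):
        # Evaluate the fine-tuned model on the benchmark with the custom
        # algorithm (vLLM-accelerated). Currently a manual step in practice.
        self.next(self.end)

    @step
    def end(self):
        # Predictions are pushed to Argilla under an experiment key for
        # human evaluation.
        pass


if __name__ == "__main__":
    SpellcheckTrainingFlow()
```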
Human evaluation with Argilla
The model and dataset versions are handled in their Hugging Face repositories as branches (v1, v2) and commits (v1.1, v1.2). You can easily access any version using the datasets library from Hugging Face.
```python
from datasets import load_dataset

dataset = load_dataset(
    path="openfoodfacts/spellcheck-dataset",
    revision="v8",
    split="train+test",
)
```
Integration with Batch Job
Once the model is selected, the inference script and its dependencies are containerized in a Docker image before being pushed to the image registry (currently Google Artifact Registry). The image is then used within the batch job pipeline, defined by the batch job type `ingredients-spellcheck`.
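For illustration, submitting such a container image as a Google Batch job with the google-cloud-batch Python client could look like the sketch below. The project id, region, image URI, and job id are placeholders, and Robotoff's actual batch job pipeline may be configured differently:

```python
from google.cloud import batch_v1

PROJECT_ID = "my-project"  # placeholder
REGION = "europe-west1"    # placeholder
IMAGE_URI = "europe-west1-docker.pkg.dev/my-project/robotoff/spellcheck:latest"  # placeholder

client = batch_v1.BatchServiceClient()

# The container to run for each task.
runnable = batch_v1.Runnable()
runnable.container = batch_v1.Runnable.Container()
runnable.container.image_uri = IMAGE_URI

task = batch_v1.TaskSpec()
task.runnables = [runnable]

group = batch_v1.TaskGroup()
group.task_count = 1
group.task_spec = task

job = batch_v1.Job()
job.task_groups = [group]
job.logs_policy = batch_v1.LogsPolicy()
job.logs_policy.destination = batch_v1.LogsPolicy.Destination.CLOUD_LOGGING

request = batch_v1.CreateJobRequest()
request.parent = f"projects/{PROJECT_ID}/locations/{REGION}"
request.job_id = "ingredients-spellcheck"
request.job = job

client.create_job(request)
```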