Nutrition extraction
Dataset on Hugging Face - Model on Hugging Face
We developped a ML model to automatically extract nutrition information from photos of product packaging where nutrition facts are displayed.
This model detects the most common nutrition values (proteins, salt, energy-kj,...), either for 100g or per serving. We use LayoutLMv3, an architecture used in Document AI to perform various tasks on structured documents (bills, receipts, reports,...). The model expects the input image, the tokens (=words) and the spatial position of each token on the image. As the model requires token text and position as input, an OCR must be performed beforehand. We use Google Cloud Vision to extract text content from the image.
LayoutLMv3 architecture can perform several tasks, we frame the problem as a token classification task. The model must predict the class of each token, among a predefined set of classes. We follow the IOB format for entity classes. Here is a complete list of the token classes detected by the model:
- O
- B-ENERGY_KJ_SERVING
- I-ENERGY_KJ_SERVING
- B-CARBOHYDRATES_100G
- I-CARBOHYDRATES_100G
- B-CHOLESTEROL_SERVING
- I-CHOLESTEROL_SERVING
- B-ENERGY_KCAL_100G
- I-ENERGY_KCAL_100G
- B-SALT_SERVING
- I-SALT_SERVING
- B-SALT_100G
- I-SALT_100G
- B-SERVING_SIZE
- I-SERVING_SIZE
- B-CALCIUM_100G
- I-CALCIUM_100G
- B-SODIUM_SERVING
- I-SODIUM_SERVING
- B-FIBER_100G
- I-FIBER_100G
- B-IRON_SERVING
- I-IRON_SERVING
- B-IRON_100G
- I-IRON_100G
- B-POTASSIUM_100G
- I-POTASSIUM_100G
- B-CALCIUM_SERVING
- I-CALCIUM_SERVING
- B-TRANS_FAT_100G
- I-TRANS_FAT_100G
- B-SATURATED_FAT_100G
- I-SATURATED_FAT_100G
- B-PROTEINS_SERVING
- I-PROTEINS_SERVING
- B-SATURATED_FAT_SERVING
- I-SATURATED_FAT_SERVING
- B-VITAMIN_D_100G
- I-VITAMIN_D_100G
- B-ENERGY_KJ_100G
- I-ENERGY_KJ_100G
- B-FAT_100G
- I-FAT_100G
- B-PROTEINS_100G
- I-PROTEINS_100G
- B-VITAMIN_D_SERVING
- I-VITAMIN_D_SERVING
- B-ADDED_SUGARS_SERVING
- I-ADDED_SUGARS_SERVING
- B-CHOLESTEROL_100G
- I-CHOLESTEROL_100G
- B-SUGARS_100G
- I-SUGARS_100G
- B-CARBOHYDRATES_SERVING
- I-CARBOHYDRATES_SERVING
- B-ADDED_SUGARS_100G
- I-ADDED_SUGARS_100G
- B-SODIUM_100G
- I-SODIUM_100G
- B-FIBER_SERVING
- I-FIBER_SERVING
- B-SUGARS_SERVING
- I-SUGARS_SERVING
- B-ENERGY_KCAL_SERVING
- I-ENERGY_KCAL_SERVING
- B-FAT_SERVING
- I-FAT_SERVING
- B-TRANS_FAT_SERVING
- I-TRANS_FAT_SERVING
- B-POTASSIUM_SERVING
- I-POTASSIUM_SERVING
Nutrients that are not in this list are detected as O
1.
Dataset
Random images selected as nutrition images were picked for annotation. Using the list of labels above, more than 3500 images were manually annotated. To learn more about the dataset, have a look at the description of the dataset on Hugging Face.
Robotoff integration
Pre-processing, inference and post-processing
The model was exported to ONNX and is served by Triton server. The model integration in Robotoff can be found in robotoff.prediction.nutrition_extraction
module. The predict
function 2 takes as input the image (as a Pillow Image) and the Google Cloud Vision OCR result (as a OCRResult
object).
When extracting nutrient information from an image, we perform the following steps:
- extract the words and their coordinates from the OCR result
- preprocess the image, the words and their coordinates using the LayoutLMv3 preprocessor, that takes care of preprocessing the data in the right format for the LayoutLMv3 model
- perform the inference: the request is sent to Triton server through gRPC
- postprocess the results
Postprocessing includes the following steps:
- gather pre-entities from individual labels. There is one pre-entities for each input token.
- aggregate entities: the 'O' (OTHER) entity is ignored, and pre-entities with the same entity class are merged together.
- post-process entities: we post-process the detected text to correct some known limitations of the model,
and we extract the value (ex:
5
) and the unit (ex:g
) from the entity text.
The predict
function returns a NutritionExtractionPrediction
dataclass that has two fields:
nutrients
contains postprocessed entities that were considered valid during post-processing (thevalid
field described below is therefore not present).entities
contains the raw pre-entities, the aggregated entities and the post-processed entities (respectively in theraw
,aggregated
andpostprocessed
fields). This field is useful for debugging and understanding model predictions.
Postprocessed entities contain the following fields:
entity
: the nutrient name, in Product Opener format (ex:energy-kcal_100g
orsalt_serving
)text
: the text of the entity (ex:125 kJ
)value
: the nutrient value. It's either a number ortraces
unit
: the nutrient unit, eitherg
,mg
,µg
,kj
,kcal
ornull
. Sometimes the nutrient unit is not present after the value, or the OCR didn't detect the corresponding word. You can either infer a plausible unit given the entity (ex:g
for proteins, carbohydrates,...) or ignore this entity.score
: The entity score. We use the score of the first pre-entity as the aggregated entity score.start
: the word start index of the entity, with respect to the original OCR JSONend
: the word end index of the entity, with respect to the original OCR JSONchar_start
: the character start index of the entity, with respect to the original OCR JSONchar_end
: the character end index of the entity, with respect to the original OCR JSONvalid
: whether the extracted entity is valid. We consider an entity invalid if we couldn't extract nutrient value from thetext
field, or if there are more than one entity for a single nutrient. For example, twoproteins_100g
entities are both considered invalid, but oneproteins_100g
and oneproteins_serving
are considered valid.
Integration
For every new uploaded image, the model is run on this image 3. As for all computer vision models, we save the model prediction in the image_prediction
table.
If some entities were detected, we create a Prediction
in DB using the usual import mechanism 4, under the type nutrient_extraction
.
We only create an insight if we detected at least one nutrient value that is not in the product nutrients 5.
-
Using a fixed set of classes is not the best approach when we have many classes. It however allows us to use LayoutLM architecture, which is very performant for this task, even when the nutrition table is hard to read due to packaging deformations or alterations. To detect the long-tail of nutrients, approaches using graph-based approach, where we would map a nutrient mention to its value, could be explored in the future. ↩
-
In
robotoff.prediction.nutrition_extraction
module ↩ -
See function
robotoff.workers.tasks.import_image.extract_nutrition_job
↩ -
See
NutrientExtractionImporter.generate_candidates
for implementation ↩