ProductOpener::Ingredients - process and analyze ingredients lists
ProductOpener::Ingredients processes, normalizes, parses and analyzes ingredients lists to extract and recognize individual ingredients, additives and allergens, and to compute product properties related to ingredients (is the product vegetarian or vegan, does it contain palm oil, etc.).
use ProductOpener::Ingredients qw/:all/;
[..]
clean_ingredients_text($product_ref);
extract_ingredients_from_text($product_ref);
extract_additives_from_text($product_ref);
detect_allergens_from_text($product_ref);
[..]
This function initializes regular expressions needed to parse traces and allergens in ingredients lists.
This function creates regular expressions that match the quantity or percentage of an ingredient, including localized strings like "minimum".
This function creates regular expressions that match all variations of labels that we want to recognize in ingredients lists, such as organic and fair trade.
Check if the specific ingredients structure (extracted from the end of the ingredients list and product labels) contains a property for an ingredient. (e.g. do we have an origin specified for a specific ingredient)
If the ingredient_id parameter is undef, then we return the value for any specific ingredient. (useful for products for which we do not have ingredients, but for which we have a label like "French eggs": we can still derive the origin of the ingredients, e.g. for the Environmental-Score)
e.g. "origins"
- undef if we don't have a specific ingredient with the requested property matching the requested ingredient
- otherwise the value for the matching specific ingredient
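The lookup described above can be sketched as follows. This is a minimal illustration, not the module's code: the function name `lookup_specific_ingredient_property` and the assumption that each specific ingredient is a hash with an `id` key are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical layout: each specific ingredient is a hash with an "id"
# and optional properties such as "origins".
sub lookup_specific_ingredient_property {
    my ($specific_ingredients_ref, $ingredient_id, $property) = @_;

    foreach my $specific_ref (@{$specific_ingredients_ref // []}) {
        # Skip entries that do not carry the requested property
        next unless defined $specific_ref->{$property};
        # An undef $ingredient_id matches any specific ingredient
        if (!defined($ingredient_id) || ($specific_ref->{id} // '') eq $ingredient_id) {
            return $specific_ref->{$property};
        }
    }
    return;    # undef in scalar context
}

my $specific_ingredients_ref = [
    {id => "en:egg", origins => "en:france"},
    {id => "en:pork", origins => "en:spain"},
];
my $egg_origins = lookup_specific_ingredient_property($specific_ingredients_ref, "en:egg", "origins");
my $any_origins = lookup_specific_ingredient_property($specific_ingredients_ref, undef, "origins");
my $milk_origins = lookup_specific_ingredient_property($specific_ingredients_ref, "en:milk", "origins");
```

Here `$egg_origins` and `$any_origins` both hold "en:france", and `$milk_origins` is undef because no specific ingredient matches.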
Go through the ingredients structure, and add properties to ingredients that match specific ingredients for which we have extra information (e.g. origins from a label).
Add a percent_max value for salt and sugar ingredients, based on the nutrition facts.
Check if the product has labels that indicate properties (e.g. origins) for specific ingredients.
e.g.
en:French pork
fr:Viande Porcine Française, VPF, viande de porc française, Le Porc Français, Porc Origine France, porc français, porc 100% France
origins:en: en:france
ingredients:en: en:pork
This function extracts those mentions and adds them to the specific_ingredients structure.
Array of specific ingredients.
Lists of ingredients sometimes include extra mentions for specific ingredients at the end of the ingredients list, e.g. "Prepared with 50g of fruits for 100g of finished product".
This function extracts those mentions and adds them to the specific_ingredients structure.
This function is also used to parse the origins of ingredients field.
Used to find % values, language specific.
Pass undef in order to skip % recognition. This is useful if we know the text is only for the origins of ingredients.
Array of specific ingredients.
This function extracts processing methods from one ingredient. If processing methods are found and some ingredient text remains once they are removed, it returns:
- $processing (concatenated if there is more than one),
- $ingredient (without processing), and
- $ingredient_id (without processing).
If the remaining text is not a known ingredient, the same values are returned unchanged.
language abbreviation (en for English, for example)
string ("pear", for example)
reference to an array of processings
updated ingredient without processing methods
English first element for that ingredient (en:pear, for example)
0 or 1
This function parses the origins of ingredients field to extract the origins of specific ingredients. The origins are stored in the specific_ingredients structure of the product.
Note: this function is similar to parse_specific_ingredients_from_text() that operates on ingredients lists. The difference is that parse_specific_ingredients_from_text() only extracts and recognizes text that is an extra mention at the end of an ingredient list (e.g. "Origin of strawberries: Spain"), while parse_origins_from_text() will also recognize text like "Strawberries: Spain".
In most cases it is the same as $product_ref->{ingredients_lc}, except if there are no ingredients listed, in which case we can have origins listed in the main language of the product.
Array of specific ingredients.
Select, set and return the `ingredients_lc` field in $product_ref.
This is the language that will be used to parse ingredients. We first check that ingredients_text_{lang} exists and is non-empty for the product main language (`lc`), and return that language if it does. Otherwise we look at all languages defined in `languages_codes` for a non-empty ingredients_text_{lang}.
If we find a language with non-empty ingredients in ingredients_text_{lang}:
- we copy the value to the `ingredients_text` field in the product
- we set the `ingredients_lc` field in the product and return it
Otherwise we unset `ingredients_lc`.
Language code for ingredients parsing.
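The selection order can be illustrated with a small sketch. The function name `pick_ingredients_lc` is hypothetical, and the sketch assumes that `languages_codes` is a hash keyed by language code.

```perl
use strict;
use warnings;

sub pick_ingredients_lc {
    my ($product_ref) = @_;

    # Main language first, then the other languages of the product
    # (assumption: languages_codes is a hash keyed by language code)
    my @candidates = grep { defined $_ }
        ($product_ref->{lc}, sort keys %{$product_ref->{languages_codes} // {}});

    foreach my $lang (@candidates) {
        my $text = $product_ref->{"ingredients_text_$lang"};
        if (defined $text and $text ne '') {
            $product_ref->{ingredients_text} = $text;
            $product_ref->{ingredients_lc} = $lang;
            return $lang;
        }
    }
    # No language has a non-empty ingredients list
    delete $product_ref->{ingredients_lc};
    return;
}

my $product_ref = {
    lc => "fr",
    languages_codes => {fr => 1, en => 1},
    ingredients_text_fr => "",
    ingredients_text_en => "sugar, cocoa",
};
my $selected_lc = pick_ingredients_lc($product_ref);
```

In this example the French text is empty, so the sketch falls back to English and copies "sugar, cocoa" into `ingredients_text`.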
Return the ingredients_lc field if already set, otherwise call select_ingredients_lc() to select it.
This function is used in ingredients related services, to ensure that the ingredients_lc field is set.
Used to assign percent or quantity for strings parsed with $percent_or_quantity_regexp.
If the percent_or_quantity_unit is %, we return a defined value for percent, otherwise we return quantity and quantity_g
If the unit is not %, quantity is a concatenation of the quantity value and unit
Normalized quantity in grams.
$ingredient = "100% cocoa";    # or "milk 10cl"
if ($ingredient =~ /\s$percent_or_quantity_regexp$/i) {
	$percent_or_quantity_value = $1;
	$percent_or_quantity_unit = $2;
	my ($percent, $quantity, $quantity_g)
		= get_percent_or_quantity_and_normalized_quantity($percent_or_quantity_value, $percent_or_quantity_unit);
}
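To make the three return values concrete, here is a simplified sketch of the same contract. It is not the module's implementation: `percent_or_quantity_sketch` and the `%grams_per_unit` table are hypothetical, only mass units are normalized, and joining the value and unit with a space is an assumption.

```perl
use strict;
use warnings;

# Hypothetical conversion table, limited to mass units for this sketch
my %grams_per_unit = (mg => 0.001, g => 1, kg => 1000);

sub percent_or_quantity_sketch {
    my ($value, $unit) = @_;
    my ($percent, $quantity, $quantity_g);

    $value =~ s/,/./;    # accept a decimal comma, e.g. "0,4"
    if ($unit eq '%') {
        $percent = $value + 0;
    }
    else {
        # Quantity is the value and unit joined together
        $quantity = $value . " " . $unit;
        my $factor = $grams_per_unit{lc $unit};
        # Units missing from the table (e.g. cl) leave quantity_g undefined
        $quantity_g = $value * $factor if defined $factor;
    }
    return ($percent, $quantity, $quantity_g);
}

my ($cocoa_percent, $cocoa_quantity, $cocoa_quantity_g) = percent_or_quantity_sketch("100", "%");
my ($sugar_percent, $sugar_quantity, $sugar_quantity_g) = percent_or_quantity_sketch("250", "g");
my ($milk_percent, $milk_quantity, $milk_quantity_g) = percent_or_quantity_sketch("10", "cl");
```

Only `$cocoa_percent` is defined for the first call, while the second call yields a quantity of "250 g" and a normalized quantity of 250.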
Parse the ingredients_text field to extract individual ingredients.
This function is a product service that can be run through ProductOpener::ApiProductServices
product object reference
reference to a hash of product fields that have been created or updated
reference to an array of error messages
After the nested ingredients structure has been built with the parse_ingredients_text_service, this service adds some properties to the ingredients:
- Origins, labels etc. that have been extracted from other fields
- Ciqual and Ecobalyse codes
product object reference
reference to a hash of product fields that have been created or updated
reference to an array of error messages
Flatten the nested list of ingredients.
Go through the nested ingredients and:
Compute ingredients_original_tags and ingredients_tags.
Compute the total % of "leaf" ingredients (without sub-ingredients) with a specified %, and unspecified %.
- ingredients_with_specified_percent_n : number of "leaf" ingredients with a specified %
- ingredients_with_specified_percent_sum : % sum of "leaf" ingredients with a specified %
- ingredients_with_unspecified_percent_n
- ingredients_with_unspecified_percent_sum
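A minimal sketch of how such counters could be accumulated over a nested structure is shown below. The function name `count_leaf_percents`, the shortened counter keys and the use of `percent_estimate` for the unspecified sum are assumptions made for the illustration.

```perl
use strict;
use warnings;

sub count_leaf_percents {
    my ($ingredients_ref, $stats_ref) = @_;

    foreach my $ingredient_ref (@{$ingredients_ref}) {
        if (defined $ingredient_ref->{ingredients}) {
            # Not a leaf: only its sub-ingredients are counted
            count_leaf_percents($ingredient_ref->{ingredients}, $stats_ref);
        }
        elsif (defined $ingredient_ref->{percent}) {
            $stats_ref->{specified_n}++;
            $stats_ref->{specified_sum} += $ingredient_ref->{percent};
        }
        else {
            $stats_ref->{unspecified_n}++;
            # Assumption: the unspecified sum uses the estimated percent
            $stats_ref->{unspecified_sum} += $ingredient_ref->{percent_estimate} // 0;
        }
    }
    return $stats_ref;
}

# "Sugar, fruits 40% (pear 30%, apple 10%)"
my $ingredients_ref = [
    {id => "en:sugar", percent_estimate => 60},
    {
        id => "en:fruit",
        percent => 40,
        ingredients => [{id => "en:pear", percent => 30}, {id => "en:apple", percent => 10}],
    },
];
my $stats_ref = count_leaf_percents($ingredients_ref,
    {specified_n => 0, specified_sum => 0, unspecified_n => 0, unspecified_sum => 0});
```

The parent "fruits 40%" is not a leaf, so only pear and apple contribute to the specified counters.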
This function calls:
- parse_ingredients_text_service() to parse the ingredients text in the main language of the product to extract individual ingredients and sub-ingredients
- compute_ingredients_percent_min_max_values() to create the ingredients array with nested sub-ingredients arrays
- compute_ingredients_tags() to create a flat array ingredients_original_tags and ingredients_tags (with parents)
- analyze_ingredients_service() to analyze ingredients to see the ones that are vegan, vegetarian, from palm oil etc. and to compute the resulting value for the complete product
Assign a ciqual_food_code or a ciqual_proxy_food_code to ingredients and sub ingredients.
reference to an array of ingredients
Assign an ecobalyse_code or an ecobalyse_proxy_code to ingredients and sub ingredients. (NOTE: this is a first version that will soon be improved)
reference to an array of ingredients
Retrieve the geographical area for ecobalyse. (NOTE: this is a first version that will soon be improved)
reference to the name of the country
Compute minimum and maximum percent ranges and percent estimates for each ingredient and sub ingredient.
This function is a product service that can be run through ProductOpener::ApiProductServices
product object reference
reference to a hash of product fields that have been created or updated
reference to an array of error messages
Count ingredients with specified percent, including sub-ingredients.
Number of ingredients.
Number of ingredients with a specified percent value.
Sum of the specified percent values.
Note: this can be greater than 100 if percent values are specified for ingredients and their sub ingredients.
This function deletes the percent_min and percent_max values of all ingredients.
It is called if compute_ingredients_percent_min_max_values() encounters impossible values (e.g. "Water, Sugar 80%": the % of Water should be at least 80%, but the total would then be more than 100%).
The function is recursive to also delete values for sub-ingredients.
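The recursion can be sketched in a few lines. The name `delete_percent_ranges` is hypothetical; the sketch only assumes that sub-ingredients are stored under an `ingredients` key.

```perl
use strict;
use warnings;

sub delete_percent_ranges {
    my ($ingredients_ref) = @_;

    foreach my $ingredient_ref (@{$ingredients_ref}) {
        delete $ingredient_ref->{percent_min};
        delete $ingredient_ref->{percent_max};
        # Recurse into sub-ingredients, if any
        delete_percent_ranges($ingredient_ref->{ingredients})
            if defined $ingredient_ref->{ingredients};
    }
    return;
}

my $ingredients_ref = [
    {
        id => "en:fruit",
        percent_min => 40,
        percent_max => 40,
        ingredients => [{id => "en:pear", percent_min => 20, percent_max => 40}],
    },
];
delete_percent_ranges($ingredients_ref);
```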
This function computes the possible minimum and maximum ranges for the percent values of each ingredient and sub-ingredients.
Ingredients lists sometimes specify the percent value for some ingredients, but usually not all. This function computes minimum and maximum percent values for all other ingredients.
Ingredients lists are ordered by descending quantity.
This function is recursive: it calls itself for each ingredient that has sub-ingredients.
0 when the function is called on all ingredients of a product, but can be different than 0 if called on sub-ingredients of an ingredient that has a minimum value set.
100 when the function is called on all ingredients of a product, but can be different than 100 if called on sub-ingredients of an ingredient that has a maximum value set.
The analysis encountered an impossible value. e.g. "Flour, Sugar 80%": the % of Flour must be greater than the % of Sugar, but the sum would then be above 100%.
Or there were too many loops to analyze the values.
The return value is the number of times we adjusted min and max values for ingredients and sub ingredients.
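The following sketch shows one way such ranges can be tightened on a flat list (no sub-ingredients). It is a simplified illustration under stated assumptions, not the module's algorithm: the name `percent_ranges_sketch`, the loop limit of 20 and the use of -1 as the error return value are all hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(sum);

sub percent_ranges_sketch {
    my ($ingredients_ref, $total) = @_;
    my @list = @{$ingredients_ref};
    my ($adjustments, $changed, $loops) = (0, 1, 0);

    foreach my $ing (@list) {
        $ing->{percent_min} = $ing->{percent} // 0;
        $ing->{percent_max} = $ing->{percent} // $total;
    }

    # Tighten one bound, and record that something changed
    my $tighten = sub {
        my ($ing, $key, $value) = @_;
        my $tighter = ($key eq 'percent_max') ? ($value < $ing->{$key}) : ($value > $ing->{$key});
        return unless $tighter;
        $ing->{$key} = $value;
        $changed = 1;
        $adjustments++;
    };

    while ($changed) {
        return -1 if ++$loops > 20;    # too many loops
        $changed = 0;
        # Descending order: an ingredient cannot exceed the one before it...
        for my $i (1 .. $#list) {
            $tighten->($list[$i], 'percent_max', $list[$i - 1]{percent_max});
        }
        # ... and cannot be smaller than the one after it
        for my $i (reverse 0 .. $#list - 1) {
            $tighten->($list[$i], 'percent_min', $list[$i + 1]{percent_min});
        }
        # The whole list must add up to $total
        for my $i (0 .. $#list) {
            my @others = grep { $_ != $i } 0 .. $#list;
            my $others_min = sum(0, map { $list[$_]{percent_min} } @others);
            my $others_max = sum(0, map { $list[$_]{percent_max} } @others);
            $tighten->($list[$i], 'percent_max', $total - $others_min);
            $tighten->($list[$i], 'percent_min', $total - $others_max);
            return -1 if $list[$i]{percent_min} > $list[$i]{percent_max};    # impossible values
        }
    }
    return $adjustments;
}

# "Flour, Sugar 20%, Salt"
my $possible_ref = [{id => "en:flour"}, {id => "en:sugar", percent => 20}, {id => "en:salt"}];
my $possible_result = percent_ranges_sketch($possible_ref, 100);

# "Water, Sugar 80%" cannot add up to 100%
my $impossible_ref = [{id => "en:water"}, {id => "en:sugar", percent => 80}];
my $impossible_result = percent_ranges_sketch($impossible_ref, 100);
```

For "Flour, Sugar 20%, Salt" the sketch ends with flour between 60 and 80, and salt between 0 and 20. For "Water, Sugar 80%" it returns -1.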
Initialize the percent, percent_min and percent_max value for each ingredient in list.
$ingredients_ref is the list of ingredients (as hash), where parsed percent are already set.
$total_min and $total_max might be set if we have a parent ingredient and are parsing a sub list.
When a percent is specifically set, use this value for percent_min and percent_max.
Warning: percent listed for sub-ingredients can be absolute (e.g. "Sugar, fruits 40% (pear 30%, apple 10%)") or they can be relative to the parent ingredient (e.g. "Sugar, fruits 40% (pear 75%, apple 25%)"). We try to detect those cases and rescale the percent accordingly.
Otherwise use 0 for percent_min and total_max for percent_max.
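The rescaling of relative percents can be illustrated as follows. The detection rule used here (the sub-ingredient percents add up to more than the parent percent) is an assumption chosen for the sketch, as is the function name `rescale_relative_percents`; the module may detect these cases differently.

```perl
use strict;
use warnings;

sub rescale_relative_percents {
    my ($parent_ref) = @_;
    my @subs = @{$parent_ref->{ingredients} // []};
    return 0 unless defined $parent_ref->{percent} and @subs;

    my $sum = 0;
    $sum += $_->{percent} // 0 foreach @subs;
    # Assumed heuristic: absolute sub-percents cannot exceed the parent percent
    return 0 unless $sum > $parent_ref->{percent};

    foreach my $sub_ref (@subs) {
        next unless defined $sub_ref->{percent};
        $sub_ref->{percent} = $sub_ref->{percent} * $parent_ref->{percent} / 100;
    }
    return 1;
}

# "fruits 40% (pear 75%, apple 25%)": relative percents
my $relative_ref = {
    id => "en:fruit",
    percent => 40,
    ingredients => [{id => "en:pear", percent => 75}, {id => "en:apple", percent => 25}],
};
my $relative_rescaled = rescale_relative_percents($relative_ref);

# "fruits 40% (pear 30%, apple 10%)": already absolute
my $absolute_ref = {
    id => "en:fruit",
    percent => 40,
    ingredients => [{id => "en:pear", percent => 30}, {id => "en:apple", percent => 10}],
};
my $absolute_rescaled = rescale_relative_percents($absolute_ref);
```

After the call, the relative list holds pear 30 and apple 10, the same values as the absolute list, which is left untouched.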
Set the percentage maximum for ingredients like flavouring where this is defined on the Ingredients taxonomy. The percent_max will not be applied in the following cases:
- if applying the percent_max would mean that it is not possible for the ingredient total to add up to 100%
- if a later ingredient has a higher percentage than the percent_max of the restricted ingredient
This function computes a possible estimate for the percent values of each ingredient and sub-ingredients.
The sum of all estimates must be 100%, and the estimates try to match the min and max constraints computed previously with the compute_ingredients_percent_min_max_values() function.
100 when the function is called on all ingredients of a product, but can be different than 100 if called on sub-ingredients of an ingredient.
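One simple way to derive estimates that add up to the total is sketched below. The strategy (midpoint of the range capped by what remains, with the last ingredient taking the remainder) and the name `estimate_percents_sketch` are assumptions for illustration; the sketch does not guarantee that every estimate stays inside its range for all inputs.

```perl
use strict;
use warnings;
use List::Util qw(min);

sub estimate_percents_sketch {
    my ($ingredients_ref, $total) = @_;
    my @list = @{$ingredients_ref};
    my $remaining = $total;

    foreach my $i (0 .. $#list) {
        my $ing = $list[$i];
        # The last ingredient takes what is left, so the sum equals $total
        my $estimate
            = ($i == $#list)
            ? $remaining
            : ($ing->{percent_min} + min($ing->{percent_max}, $remaining)) / 2;
        $ing->{percent_estimate} = $estimate;
        $remaining -= $estimate;
    }
    return;
}

# Ranges for "Flour, Sugar 20%, Salt"
my $ingredients_ref = [
    {id => "en:flour", percent_min => 60, percent_max => 80},
    {id => "en:sugar", percent_min => 20, percent_max => 20},
    {id => "en:salt", percent_min => 0, percent_max => 20},
];
estimate_percents_sketch($ingredients_ref, 100);
```

With these ranges the estimates are 70 for flour, 20 for sugar and 10 for salt.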
Analyzes ingredients to see the ones that are vegan, vegetarian, from palm oil etc. and computes the resulting value for the complete product.
The results are overridden by labels like "Vegan", "Vegetarian" or "Palm oil free"
Results are stored in the ingredients_analysis_tags array.
This function is a product service that can be run through ProductOpener::ApiProductServices
product object reference
reference to a hash of product fields that have been created or updated
reference to an array of error messages
This function is called by normalize_enumeration()
Given a category ($a) and a type ($b), it will return the ingredient that result from the combination of these two.
English: oil, olive -> olive oil
Croatian: ječmeni, slad -> ječmeni slad
French: huile, olive -> huile d'olive
Russian: масло растительное, пальмовое -> масло растительное пальмовое
language abbreviation (en for English, for example)
string, category as defined in %ingredients_categories_and_types, example: 'oil' for 'oil (sunflower, olive and palm)'
string, type as defined in %ingredients_categories_and_types, example: 'sunflower' or 'olive' or 'palm' for 'oil (sunflower, olive and palm)'
e.g. in French we combine "huile" and "olive" to "huile d'olive" but we combine "poivron" and "rouge" to "poivron rouge".
Reference to an array of alternate names for the category
string, comma-joined category and type, example: 'palm vegetal oil' or 'sunflower vegetal oil' or 'olive vegetal oil'
This function is called by develop_ingredients_categories_and_types()
Some ingredients are specified by an ingredient "category" (e.g. "oil") and a "types" string (e.g. "sunflower, palm").
This function combines the category with all elements of the types string. For example, $category = "Vegetal oil" and $types = "palm, sunflower and olive" will return "vegetal oil (palm vegetal oil, sunflower vegetal oil, olive vegetal oil)".
language abbreviation (en for English, for example)
string, as matched from definition in %ingredients_categories_and_types, example: 'Vegetal oil' for 'Vegetal oil (sunflower, olive and palm)'
string, as matched from definition in %ingredients_categories_and_types, example: 'sunflower, olive and palm' for 'Vegetal oil (sunflower, olive and palm)'
e.g. in French we combine "huile" and "olive" to "huile d'olive" but we combine "poivron" and "rouge" to "poivron rouge".
Reference to an array of alternate names for the category
e.g. for "carbonates d'ammonium et de sodium", we want only "carbonates d'ammonium, carbonates de sodium" and not "carbonates (carbonates d'ammonium, carbonates de sodium)" as "carbonates" is another additive
string, with the type + a list of comma-joined category with all elements of the types example: 'vegetal oils (sunflower vegetal oil, olive vegetal oil, palm vegetal oil)'
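For English, where the type simply precedes the category, the combination can be sketched as below. The name `develop_category_and_types_sketch` is hypothetical, and the sketch ignores language-specific rules (such as French "huile d'olive") and alternate names.

```perl
use strict;
use warnings;

sub develop_category_and_types_sketch {
    my ($category, $types) = @_;
    my $lc_category = lc $category;

    # Split "palm, sunflower and olive" on commas and on the word "and"
    my @types = grep { length $_ } split /\s*(?:,|\band\b)\s*/, $types;
    # English rule only: "<type> <category>"
    my @combined = map { "$_ $lc_category" } @types;

    return $lc_category . " (" . join(", ", @combined) . ")";
}

my $developed = develop_category_and_types_sketch("Vegetal oil", "palm, sunflower and olive");
# $developed is "vegetal oil (palm vegetal oil, sunflower vegetal oil, olive vegetal oil)"
```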
Some producers send us an ingredients list that starts with the generic name followed by the actual ingredients list.
e.g. "Pâtes de fruits aromatisées à la fraise et à la canneberge, contenant de la maltodextrine et de l'acérola. Source de vitamines B1, B6, B12 et C. Ingrédients : Pulpe de fruits 50% (poire William 25%, fraise 15%, canneberge 10%), sucre, sirop de glucose de blé, maltodextrine 5%, stabilisant : glycérol, gélifiant : pectine, acidifiant : acide citrique, arôme naturel de fraise, arôme naturel de canneberge, poudre d'acérola (acérola, maltodextrine) 0,4%, vitamines : B1, B6 et B12. Fabriqué dans un atelier utilisant: GLUTEN*, FRUITS A COQUE*. * Allergènes"
This function splits the list to put the generic name in the generic_name_[lc] field and the ingredients list in the ingredients_text_[lc] field.
If there is already a generic name, it is not overridden.
WARNING: This function should be called only during the import of data from producers. It should not be called on lists that can be the result of an OCR, as there is no guarantee that the text before the ingredients list is the generic name. It should also not be called when we import product data from the producers platform to the public database.
Perform some cleaning of the ingredients list.
The operations included in the cleaning must be 100% safe.
The function can be applied multiple times on the ingredients list.
This function should be called once when getting text data from the OCR that includes an ingredients list.
It tries to remove phrases before and after the list that are not part of the ingredients list.
It MUST NOT be applied multiple times on the ingredients list, as it could otherwise remove parts of the ingredients list (e.g. it looks for "Ingredients: " and removes everything before it; if there are multiple "Ingredients:" listed, it would keep only the last one if called multiple times).
This function is used inside regular expressions to turn additives to a normalized form.
Using a function to concatenate the E-number, letter and variant makes it possible to deal with undefined $letter or $variant without triggering an undefined warning.
$text =~ s/(\b)e( |-|\.)?$additivesregexp(\b|\s|,|\.|;|\/|-|\\|$)/replace_additive($3,$6,$9) . $12/ieg;
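The benefit of using a function in the replacement can be seen in a reduced example. The regular expression below is a deliberately simplified stand-in for `$additivesregexp`, and `normalize_additive_sketch` is a hypothetical name; only the idea of guarding against undefined captures comes from the description above.

```perl
use strict;
use warnings;

sub normalize_additive_sketch {
    my ($number, $letter, $variant) = @_;
    my $additive = "e" . $number;
    # Optional parts may be undef when their capture group did not match
    $additive .= lc $letter if defined $letter;
    $additive .= lc $variant if defined $variant;
    return $additive;
}

my $text = "colour: E 160 a, E-330";
# Simplified pattern: "E", optional space or dash, 3 digits, optional letter
$text =~ s/\be[ -]?(\d{3})(?: ?([a-z])\b)?/normalize_additive_sketch($1, $2)/ieg;
# $text is now "colour: e160a, e330"
```

For "E-330" the letter group does not match, so `$2` is undef. Interpolating it directly into a replacement string would trigger a warning, while the function handles it silently.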
Some ingredients are specified by an ingredient "category" (e.g. "oil", "flavouring") and a "type" (e.g. "sunflower", "palm" or "strawberry", "vanilla").
Sometimes, the category is mentioned only once for several types: "strawberry and vanilla flavourings", "vegetable oil (palm, sunflower)".
This function lists each individual ingredient: "oil (sunflower, olive and palm)" becomes "sunflower oil, olive oil, palm oil"
For each language, we list the categories and types of ingredients that can be combined when the ingredient list contains something like "<category> (<type1>, <type2> and <type3>)"
We can also provide a list of alternate_names, so that we can have a category like "oils and fats" and generate entries like "sunflower oil", "cocoa fat" when the ingredients list contains "oils and fats (sunflower, cocoa)".
Alternate names need to contain "<type>" which will be replaced by the type.
This can be especially useful in languages like German, where we can create compound words with the type and the category, like "Kokosnussöl" or "Sonnenblumenfett":
de => [
	{
		categories => ["pflanzliches Fett", "pflanzliche Öle", "pflanzliche Öle und Fette", "Fett", "Öle"],
		types => ["Avocado", "Baumwolle", "Distel", "Kokosnuss", "Palm", "Palmkern", "Raps", "Shea", "Sonnenblumen",],
		# Kokosnussöl, Sonnenblumenfett
		alternate_names => ["<type>fett", "<type>öl"],
	},
],
Simple plural (just an additional "s" at the end) will be added in the regexp.
Note that a "<categories> ([list of types])" enumeration will be developed only if all the types can be matched to the specified types in ingredients_categories_and_types.
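The "<type>" substitution in alternate names can be sketched as follows. The function name `expand_alternate_names` is hypothetical, and the entry is a shortened version of the German example above.

```perl
use strict;
use warnings;
use utf8;

sub expand_alternate_names {
    my ($entry_ref) = @_;
    my @names;

    foreach my $type (@{$entry_ref->{types}}) {
        foreach my $alternate (@{$entry_ref->{alternate_names}}) {
            # Replace the "<type>" placeholder with the actual type
            (my $name = $alternate) =~ s/<type>/$type/g;
            push @names, $name;
        }
    }
    return @names;
}

my $entry_ref = {
    types => ["Kokosnuss", "Sonnenblumen"],
    alternate_names => ["<type>fett", "<type>öl"],
};
my @names = expand_alternate_names($entry_ref);
```

The sketch generates four names here, including "Kokosnussöl" and "Sonnenblumenfett".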
This function transforms the ingredients list into a more normalized list that is easier to parse.
It does the following:
- Normalize quote characters
- Replace abbreviations by their full name
- Remove extra spaces in compound words with dashes (e.g. céléri - rave -> céléri-rave)
- Split vitamins enumerations
- Normalize additives and split additives enumerations
- Split other enumerations (e.g. oils, some minerals)
- Split allergens and traces
- Deal with signs like * to indicate labels (e.g. *: Organic)
This function extracts additives from the ingredients text and adds them to the product_ref in the additives_tags array.
TODO: this function is independent of the ingredient parsing, we should combine the two.
Check if the product contains sweeteners and non nutritive sweeteners (used for the Nutri-Score for beverages)
The NNS / Non nutritive sweeteners listed in the Nutri-Score Update report beverages_31 01 2023-voted have been added as a non_nutritive_sweetener:en:yes property in the additives taxonomy.
The function sets the following fields in the product_ref hash.
If there are no ingredients specified for the product, the fields are not set.
Detects allergens from the ingredients extracted from the ingredients text, using the "allergens:en" property associated to some ingredients in the ingredients taxonomy.
This function needs to be run after the $product_ref->{ingredients} array is populated from the ingredients text.
It is called by detect_allergens_from_text(). Allergens are added to $product_ref->{"allergens_from_ingredients"} which is then used by detect_allergens_from_text() to populate the allergens_tags field.
In the allergens provided by users, we may get ingredients that are not in the allergens taxonomy, but that are in the ingredients taxonomy and have an inherited allergens:en property. (e.g. the allergens taxonomy has an en:fish entry, but users may indicate specific fish species)
This function tries to match the ingredient with an allergen in the allergens taxonomy, and otherwise return the taxonomy id for the original ingredient.
The language code of $ingredient_or_allergen.
The ingredient or allergen to match. Can also be an ingredient id or allergens id prefixed with a language code.
The taxonomy id for the allergen, or the original ingredient if no allergen was found.
This function:
- combines all the ways we have to detect allergens in order to populate the allergens_tags and traces_tags fields.
- creates the ingredients_text_with_allergens_[lc] fields with added HTML <span class="allergen"> tags
Allergens are recognized in the following ways:
1. using the list of ingredients that have been recognized through ingredients analysis, by looking at the allergens:en property in the ingredients taxonomy. This is done with the function detect_allergens_from_ingredients()
2. when entered in ALL CAPS, or between underscores
3. when matching exact entries or synonyms of the allergens taxonomy
Allergens detected using 2. or 3. are marked with <span class="allergen">
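Method 2 (ALL CAPS or underscores) can be illustrated with a purely lexical sketch. The name `mark_allergens_sketch` and the minimum of three capital letters are assumptions; the sketch does not check candidates against the allergens taxonomy.

```perl
use strict;
use warnings;

sub mark_allergens_sketch {
    my ($text) = @_;
    my @found;

    my $wrap = sub {
        my ($allergen) = @_;
        push @found, lc $allergen;
        return '<span class="allergen">' . $allergen . '</span>';
    };

    # Single pass: either _underscored_ text or a word of 3+ capital letters
    $text =~ s/_([^_]+)_|\b([A-Z]{3,})\b/$wrap->(defined $1 ? $1 : $2)/eg;

    return ($text, \@found);
}

my ($marked_text, $found_ref) = mark_allergens_sketch("wheat flour, _milk_ powder, EGGS");
# $marked_text is:
# wheat flour, <span class="allergen">milk</span> powder, <span class="allergen">EGGS</span>
```

Doing both substitutions in a single pass avoids wrapping the same word twice, for example when an allergen is written as `_MILK_`.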
Recursive function to compute the percentage of ingredients that match a specific function.
The match function takes 2 arguments:
- ingredient id
- processing (comma separated list of ingredients_processing taxonomy entries)
Used to compute % of fruits and vegetables, % of milk etc., which is needed by some algorithms like the Nutri-Score.
Sum of matching ingredients percent.
Percent of water (used to recompute the percentage for categories of products that are consumed after removing water)
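A minimal sketch of such a recursive sum is shown below. The name `sum_matching_percent`, the use of the `percent_estimate` field and the `en:water` check are assumptions for the illustration.

```perl
use strict;
use warnings;

sub sum_matching_percent {
    my ($ingredients_ref, $match_function_ref) = @_;
    my ($matching, $water) = (0, 0);

    foreach my $ingredient_ref (@{$ingredients_ref}) {
        # Assumption: the estimated percent is the value to sum
        my $percent = $ingredient_ref->{percent_estimate} // 0;
        if ($match_function_ref->($ingredient_ref->{id}, $ingredient_ref->{processing})) {
            $matching += $percent;
        }
        elsif ($ingredient_ref->{id} eq "en:water") {
            $water += $percent;
        }
        elsif (defined $ingredient_ref->{ingredients}) {
            # The parent does not match: look at its sub-ingredients
            my ($sub_matching, $sub_water)
                = sum_matching_percent($ingredient_ref->{ingredients}, $match_function_ref);
            $matching += $sub_matching;
            $water += $sub_water;
        }
    }
    return ($matching, $water);
}

# Hypothetical match function: unprocessed apples and pears only
my $is_fruit_ref = sub {
    my ($id, $processing) = @_;
    return ($id =~ /^en:(?:apple|pear)$/ && !defined $processing);
};

my $ingredients_ref = [
    {id => "en:water", percent_estimate => 50},
    {
        id => "en:fruit-juice",
        percent_estimate => 30,
        ingredients => [
            {id => "en:apple", percent_estimate => 20},
            {id => "en:pear", percent_estimate => 10, processing => "en:powder"},
        ],
    },
    {id => "en:sugar", percent_estimate => 20},
];
my ($fruit_percent, $water_percent) = sum_matching_percent($ingredients_ref, $is_fruit_ref);
```

Only the unprocessed apple matches, so the sketch returns 20 for the matching percent and 50 for water.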
This function analyzes the ingredients to estimate the percentage of ingredients of a specific type (e.g. fruits/vegetables/legumes for the Nutri-Score).
Reference to a function that matches specific ingredients (e.g. fruits/vegetables/legumes)
If the $nutrient_id argument is defined, we also store the nutrient value in $product_ref->{nutriments}.
Estimated percentage of ingredients matching the function.
Determine if an ingredient should be counted as "fruits, vegetables, nuts, olive / walnut / rapeseed oils" in Nutriscore 2021 algorithm.
- we use the nutriscore_fruits_vegetables_nuts:en property to identify qualifying ingredients
- we check that the parent of those ingredients is not a flour
- we check that the ingredient does not have a processing like en:powder
NUTRI-SCORE FREQUENTLY ASKED QUESTIONS - UPDATED 27/09/2022:
"However, fruits, vegetables and pulses that are subject to further processing (e.g. concentrated fruit juice sugars, powders, freeze-drying, candied fruits, fruits in stick form, flours leading to loss of water) do not count. As an example, corn in the form of popcorn or soy proteins cannot be considered as vegetables. Regarding the frying process, fried vegetables which are thick and only partially dehydrated by the process can be taken into account, whereas crisps which are thin and completely dehydrated are excluded."
This function analyzes the ingredients to estimate the minimum percentage of fruits, vegetables, nuts, olive / walnut / rapeseed oil, so that we can compute the Nutri-Score fruit points if we don't have a value given by the manufacturer or estimated by users.
Results are stored in $product_ref->{nutriments}{"fruits-vegetables-nuts-estimate-from-ingredients_100g"} (and _serving)
Determine if an ingredient should be counted as "fruits, vegetables, legumes" in Nutriscore 2023 algorithm.
- we use the eurocode_2_group_1:en and eurocode_2_group_2:en properties to identify qualifying ingredients
- we check that the parent of those ingredients is not a flour
- we check that the ingredient does not have a processing like en:powder
1.2.2. Ingredients contributing to the "Fruit, vegetables and legumes" component
The list of ingredients qualifying for the "Fruit, vegetables and legumes" component has been revised to include the following Eurocodes:
- Vegetables groups
  - 8.10 (Leaf vegetables)
  - 8.15 (Brassicas)
  - 8.20 (Stalk vegetables)
  - 8.25 (Shoot vegetables)
  - 8.30 (Onion-family vegetables)
  - 8.38 (Root vegetables)
  - 8.40 (Fruit vegetables)
  - 8.42 (Flower-head vegetables)
  - 8.45 (Seed vegetables and immature pulses)
  - 8.50 (Edible fungi)
  - 8.55 (Seaweeds and algae)
  - 8.60 (Vegetable mixtures)
- Fruits groups
  - 9.10 (Malaceous fruit)
  - 9.20 (Prunus species fruit)
  - 9.25 (Other stone fruit)
  - 9.30 (Berries)
  - 9.40 (Citrus fruit)
  - 9.50 (Miscellaneous fruit)
  - 9.60 (Fruit mixtures)
- Pulses groups
  - 7.10 (Pulses)
Additionally, in the fats and oils category specifically, oils derived from ingredients in the list qualify for the component (e.g. olive and avocado).
--
NUTRI-SCORE FREQUENTLY ASKED QUESTIONS - UPDATED 27/09/2022:
"However, fruits, vegetables and pulses that are subject to further processing (e.g. concentrated fruit juice sugars, powders, freeze-drying, candied fruits, fruits in stick form, flours leading to loss of water) do not count. As an example, corn in the form of popcorn or soy proteins cannot be considered as vegetables. Regarding the frying process, fried vegetables which are thick and only partially dehydrated by the process can be taken into account, whereas crisps which are thin and completely dehydrated are excluded."
This function analyzes the ingredients to estimate the minimum percentage of fruits, vegetables, legumes, so that we can compute the Nutri-Score (2023) fruit points.
Results are stored in $product_ref->{nutriments}{"fruits-vegetables-legumes-estimate-from-ingredients_100g"} (and _serving)
Determine if an ingredient should be counted as milk in Nutriscore 2021 algorithm
This function analyzes the ingredients to estimate the percentage of milk in a product, in order to know if a dairy drink should be considered as a food (at least 80% of milk) or a beverage.
Return value: estimated % of milk.
Determine if an ingredient should be counted as red meat in Nutriscore 2023 algorithm
This function analyzes the ingredients to estimate the percentage of red meat, so that we can determine if the maximum limit of 2 points for proteins should be applied in the Nutri-Score 2023 algorithm.
Returns a list of ingredients that have a specific property value.