Brand Prediction
Brand prediction identifies the brand name of a product from OCR text or logo detection.
Robotoff uses multiple data sources to detect brands:
- Curated brand list: A manually curated list of common brands
- Brand taxonomy: Brands extracted from the Open Food Facts brand taxonomy
- Logo detection: Google Cloud Vision logo annotations
Data Sources
Curated Brand List
A manually curated list of brand names stored in OCR_BRANDS_PATH (brand.txt). This list contains well-known brands that are frequently found on product packaging.
This source is deprecated and will be removed in favor of the more comprehensive taxonomy-based approach.
Brand Taxonomy
Brands extracted from the Open Food Facts brand taxonomy. The taxonomy is processed to:
- Filter out numeric-only entries
- Apply a minimum product count threshold (configured via BRAND_MATCHING_MIN_COUNT)
- Apply a minimum name length filter (configured via BRAND_MATCHING_MIN_LENGTH)
- Exclude blacklisted brands
The processed taxonomy is stored in OCR_TAXONOMY_BRANDS_PATH (brand_from_taxonomy.gz).
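The four filters above can be sketched as a single predicate. This is a minimal illustration, not the Robotoff implementation: the threshold values, the (brand_tag, name, product_count) tuple layout, and the function names keep_brand and filter_brands are all assumptions made for the example.

```python
# Hypothetical thresholds; the real values are defined in robotoff/settings.py.
BRAND_MATCHING_MIN_COUNT = 15
BRAND_MATCHING_MIN_LENGTH = 4


def keep_brand(
    brand_tag: str,
    name: str,
    product_count: int,
    blacklist: set,
    min_count: int = BRAND_MATCHING_MIN_COUNT,
    min_length: int = BRAND_MATCHING_MIN_LENGTH,
) -> bool:
    """Return True if a taxonomy brand passes all four filters."""
    if name.isdigit():  # numeric-only entry
        return False
    if product_count < min_count:  # too few products carry this brand
        return False
    if len(name) < min_length:  # name too short to match reliably
        return False
    if brand_tag in blacklist:  # assumption: the blacklist is keyed on the tag
        return False
    return True


def filter_brands(brands, blacklist):
    """Keep (brand_tag, name) pairs for entries that pass every filter."""
    return [
        (tag, name)
        for tag, name, count in brands
        if keep_brand(tag, name, count, blacklist)
    ]
```

Short names and low product counts are filtered because they tend to match unrelated OCR text, which is also why the thresholds are configurable.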
Logo Annotation
Google Cloud Vision logo detection results are matched against a mapping file stored in OCR_LOGO_ANNOTATION_BRANDS_DATA_PATH (brand_logo_annotation.txt). The file uses || as the field separator:
logo_description||brand_tag
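A file in this format can be read with a few lines of Python. The sketch below assumes one mapping per line and skips blank lines; the function name parse_logo_annotations and the sample rows in the usage note are made up for illustration.

```python
def parse_logo_annotations(lines):
    """Build a {logo_description: brand_tag} mapping from '||'-separated lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the first '||' only; a line without the separator
        # raises ValueError, which surfaces malformed input early.
        description, brand_tag = line.split("||", maxsplit=1)
        mapping[description] = brand_tag
    return mapping
```

For example, parse_logo_annotations(["Some Logo||some-brand"]) returns a dictionary with a single entry mapping "Some Logo" to "some-brand". The same function accepts an open file object, since iterating over a text file yields its lines.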
Prediction Process
The brand prediction process (defined in robotoff/prediction/ocr/brand.py) works as follows:
- OCR text extraction: Extract text from product images using OCR
- Keyword matching: Use flashtext's KeywordProcessor to find brand mentions in the text
- Bounding box extraction: When available, extract the bounding box coordinates for the matched brand text
- Prediction generation: Create predictions with:
  - value: The matched brand name. Example: "Nestlé"
  - value_tag: A normalized tag (e.g., en:nestle for "Nestlé")
  - predictor: The data source used (curated-list, taxonomy, or google-cloud-vision)
  - automatic_processing: Whether the insight can be applied automatically
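The matching and prediction steps can be illustrated with the standard library alone. Note that this sketch swaps flashtext's KeywordProcessor for a plain regular-expression search so that it runs without third-party dependencies, and it leaves out bounding boxes. The BrandPrediction dataclass and the find_brands function are hypothetical names, not part of Robotoff.

```python
import re
from dataclasses import dataclass


@dataclass
class BrandPrediction:
    value: str                  # matched brand name
    value_tag: str              # normalized tag
    predictor: str              # data source that produced the match
    automatic_processing: bool  # whether the insight can be applied automatically


def find_brands(text, brands, predictor, automatic_processing):
    """Return one prediction per brand name found as a whole word in text.

    brands maps a brand name to its tag. A regex stands in for flashtext's
    KeywordProcessor here; matching is case-insensitive.
    """
    predictions = []
    for name, tag in brands.items():
        # Lookarounds prevent matches inside longer words.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            predictions.append(
                BrandPrediction(name, tag, predictor, automatic_processing)
            )
    return predictions
```

With the text "Chocolat au lait NESTLÉ 100g" and the brands {"Nestlé": "en:nestle", "Choco": "en:choco"}, only the Nestlé entry is returned, because "Choco" appears only as part of the longer word "Chocolat".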
Predictors
| Predictor | Source | Automatic Processing | Description |
|---|---|---|---|
| curated-list | Manually curated brand list | Yes | High-confidence brands from curated list |
| taxonomy | Open Food Facts taxonomy | No | Brands from the taxonomy (need validation) |
| google-cloud-vision | Logo detection | No | Brands detected via logo recognition |
Validation
Brand predictions undergo validation before becoming insights.
These checks are applied only to predictions from the taxonomy and curated-list predictors.
Blacklist Check
Certain brands are blacklisted (stored in OCR_TAXONOMY_BRANDS_BLACKLIST_PATH) and excluded from automatic detection. This includes:
- Brands that are too generic
- Brands that cause frequent false positives
Barcode Range Validation
Each brand is associated with a set of barcode prefixes it typically uses. This is computed from existing product data and stored in BRAND_PREFIX_PATH. The validation:
- Generates a barcode prefix (first 7 digits for EAN-13)
- Checks if the (brand_tag, barcode_prefix) pair exists in the brand prefix dataset
- Rejects predictions where the barcode is outside the expected range
This prevents incorrect brand assignments, for example assigning a chocolate brand to a dairy product.
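A minimal sketch of this check follows, assuming the prefix dataset is loaded as a set of (brand_tag, barcode_prefix) pairs. The function names, the plain 7-digit prefix representation, and the choice to reject barcodes that are not 13 digits long are assumptions made for illustration; the actual storage format in BRAND_PREFIX_PATH may differ.

```python
from typing import Optional, Set, Tuple


def barcode_prefix(barcode: str, length: int = 7) -> Optional[str]:
    """Return the first `length` digits of an EAN-13 barcode, else None."""
    if len(barcode) != 13 or not barcode.isdigit():
        return None  # assumption: only EAN-13 barcodes are handled here
    return barcode[:length]


def is_brand_in_barcode_range(
    brand_tag: str,
    barcode: str,
    brand_prefixes: Set[Tuple[str, str]],
) -> bool:
    """Accept the prediction only if (brand_tag, prefix) is a known pair."""
    prefix = barcode_prefix(barcode)
    if prefix is None:
        return False
    return (brand_tag, prefix) in brand_prefixes
```

Storing the pairs in a set keeps each lookup constant-time, which matters when the check runs for every candidate prediction.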
Insight Generation
The BrandInsightImporter class (robotoff/insights/importer.py) handles converting predictions to insights:
- Existing brand check: If the product already has brands filled in, no new insights are created
- Validation: Apply blacklist and barcode range checks for taxonomy and curated-list predictors
- Automatic processing: Set based on the predictor type and data source
Conflict Resolution
Insights conflict if they have the same value_tag. When conflicts occur, the system prioritizes:
- Insights from the most recent source image
- Insights with automatic processing enabled
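One way to picture this prioritization is as a sort key applied within each group of conflicting candidates. The sketch below is hypothetical: it assumes that a higher numeric image ID means a more recent image and that recency takes precedence over automatic processing, neither of which is confirmed here. The Candidate class and resolve_conflicts function are illustrative names only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    value_tag: str
    image_id: int               # assumption: higher ID means more recent image
    automatic_processing: bool


def resolve_conflicts(candidates):
    """Keep one candidate per value_tag, preferring recent images,
    then candidates with automatic processing enabled."""
    best = {}
    for candidate in candidates:
        current = best.get(candidate.value_tag)
        key = (candidate.image_id, candidate.automatic_processing)
        if current is None or key > (current.image_id, current.automatic_processing):
            best[candidate.value_tag] = candidate
    return list(best.values())
```

Comparing tuples applies the criteria in order: the image ID decides first, and automatic processing only breaks ties between candidates from the same image.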
Data Files
| File | Path | Description |
|---|---|---|
| Brand list | data/ocr/brand.txt | Curated brand names |
| Taxonomy brands | data/ocr/brand_from_taxonomy.gz | Compressed taxonomy brands |
| Logo annotations | data/ocr/brand_logo_annotation.txt | Logo-to-brand mapping |
| Brand blacklist | data/ocr/brand_taxonomy_blacklist.txt | Blacklisted brands |
| Brand prefixes | data/BrandPrefixes.json.gz | Barcode prefix ranges |
Configuration
Key settings in robotoff/settings.py:
- BRAND_MATCHING_MIN_COUNT: Minimum product count for taxonomy brands
- BRAND_MATCHING_MIN_LENGTH: Minimum brand name length
- OCR_BRANDS_PATH: Path to curated brand list
- OCR_TAXONOMY_BRANDS_PATH: Path to taxonomy brands
- OCR_LOGO_ANNOTATION_BRANDS_DATA_PATH: Path to logo annotations
- OCR_TAXONOMY_BRANDS_BLACKLIST_PATH: Path to brand blacklist
See Also
- Brand Insight Importer module (robotoff.insights.importer) - Insight import logic
- Brand Module (robotoff.brands) - Brand data management and barcode range validation