Brand Prediction

Brand prediction identifies the brand name of a product from OCR text or logo detection.

Robotoff uses multiple data sources to detect brands:

Curated brand list: A manually curated list of common brands
Brand taxonomy: Brands extracted from the Open Food Facts brand taxonomy
Logo detection: Google Cloud Vision logo annotations

Data Sources

Curated Brand List

A manually curated list of brand names stored in OCR_BRANDS_PATH (brand.txt). This list contains well-known brands that are frequently found on product packaging.

This source is deprecated, and will be removed in favor of the more comprehensive taxonomy-based approach.

Brand Taxonomy

Brands extracted from the Open Food Facts brand taxonomy. The taxonomy is processed to:

Filter out numeric-only entries
Apply a minimum product count threshold (configured via BRAND_MATCHING_MIN_COUNT)
Apply a minimum name length filter (configured via BRAND_MATCHING_MIN_LENGTH)
Exclude blacklisted brands

The processed taxonomy is stored in OCR_TAXONOMY_BRANDS_PATH (brand_from_taxonomy.gz).

Logo Annotation

Google Cloud Vision logo detection results are matched against a mapping file (OCR_LOGO_ANNOTATION_BRANDS_DATA_PATH - brand_logo_annotation.txt). The file uses || as separator:

logo_description||brand_tag

Prediction Process

The brand prediction process (defined in robotoff/prediction/ocr/brand.py) works as follows:

OCR text extraction: Extract text from product images using OCR
Keyword matching: Use flashtext's KeywordProcessor to find brand mentions in the text
Bounding box extraction: When available, extract the bounding box coordinates for the matched brand text
Prediction generation: Create predictions with:
- value: The matched brand name. Example: "Nestlé"
- value_tag: A normalized tag (e.g., en:nestle for "Nestlé")
- predictor: The data source used (curated-list, taxonomy, or google-cloud-vision)
- automatic_processing: Whether the insight can be applied automatically

Predictors

Predictor	Source	Automatic Processing	Description
`curated-list`	Manually curated brand list	Yes	High-confidence brands from curated list
`taxonomy`	Open Food Facts taxonomy	No	Brands from the taxonomy (need validation)
`google-cloud-vision`	Logo detection	No	Brands detected via logo recognition

Validation

Brand predictions undergo validation before becoming insights. These checks are only applied to taxonomy and curated-list predictors.

Blacklist Check

Certain brands are blacklisted (stored in OCR_TAXONOMY_BRANDS_BLACKLIST_PATH) and excluded from automatic detection. This includes:

Brands that are too generic
Brands that cause frequent false positives

Barcode Range Validation

Each brand is associated with a set of barcode prefixes they typically use. This is computed from existing product data and stored in BRAND_PREFIX_PATH. The validation:

Generates a barcode prefix (first 7 digits for EAN-13)
Checks if the (brand_tag, barcode_prefix) pair exists in the brand prefix dataset
Rejects predictions where the barcode is outside the expected range

This prevents incorrect brand assignments, for example assigning a chocolate brand to a dairy product.

Insight Generation

The BrandInsightImporter class (robotoff/insights/importer.py) handles converting predictions to insights:

Existing brand check: If the product already has brands filled in, no new insights are created
Validation: Apply blacklist and barcode range checks for taxonomy and curated-list predictors
Automatic processing: Set based on the predictor type and data source

Conflict Resolution

Insights conflict if they have the same value_tag. When conflicts occur, the system prioritizes:

Insights from the most recent source image
Insights with automatic processing enabled

Data Files

File	Path	Description
Brand list	`data/ocr/brand.txt`	Curated brand names
Taxonomy brands	`data/ocr/brand_from_taxonomy.gz`	Compressed taxonomy brands
Logo annotations	`data/ocr/brand_logo_annotation.txt`	Logo-to-brand mapping
Brand blacklist	`data/ocr/brand_taxonomy_blacklist.txt`	Blacklisted brands
Brand prefixes	`data/BrandPrefixes.json.gz`	Barcode prefix ranges

Configuration

Key settings in robotoff/settings.py:

BRAND_MATCHING_MIN_COUNT: Minimum product count for taxonomy brands
BRAND_MATCHING_MIN_LENGTH: Minimum brand name length
OCR_BRANDS_PATH: Path to curated brand list
OCR_TAXONOMY_BRANDS_PATH: Path to taxonomy brands
OCR_LOGO_ANNOTATION_BRANDS_DATA_PATH: Path to logo annotations
OCR_TAXONOMY_BRANDS_BLACKLIST_PATH: Path to brand blacklist