Benchmark
When we released the first version of the logo detection pipeline, its performance was unknown: we didn't have any labeled logo dataset to measure it against.
This pipeline helped us build the first annotated logo dataset, so we can now measure how the original EfficientNet-b0 model (pretrained on ImageNet) performs compared to other pretrained models.
Logo embeddings were computed using each model. For each logo, two different distances were used to find the most similar logos among the rest of the dataset, and results were sorted by ascending distance. The first table shows the results obtained with the L2 distance and the second one shows the results we got with the cosine distance.
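For illustration, here is a minimal sketch (not the actual benchmark code) of how candidates can be ranked by ascending L2 or cosine distance, assuming the embeddings are stored in a NumPy array:

```python
# Minimal sketch, assuming `embeddings` is an (n_logos, dim) float32 array.
# Both functions return candidate indices sorted by ascending distance to the query.
import numpy as np

def rank_by_l2(embeddings: np.ndarray, query_idx: int) -> np.ndarray:
    dist = np.linalg.norm(embeddings - embeddings[query_idx], axis=1)
    order = np.argsort(dist)
    return order[order != query_idx]  # drop the query logo itself

def rank_by_cosine(embeddings: np.ndarray, query_idx: int) -> np.ndarray:
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - normed @ normed[query_idx]
    order = np.argsort(dist)
    return order[order != query_idx]
```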
To keep the comparison fair and avoid favoring classes with many samples, for each target image, we only considered at most 4 items of each class. These items were sampled at random among the class items. As each class contains at least 5 items, all classes (including the target class, i.e. the class of the target logo) have 4 candidates. With this setting, an oracle model would have a recall@4 of 1.
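A hypothetical sketch of this sampling scheme (function and variable names are ours, not the benchmark's):

```python
# Candidate sampling: for a given target logo, draw at most 4 random items per
# class; since every class has at least 5 items, each class (including the
# target's) ends up with exactly 4 candidates.
import random
from collections import defaultdict

def sample_candidates(labels, target_idx, per_class=4, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        if idx != target_idx:
            by_class[label].append(idx)
    candidates = []
    for items in by_class.values():
        candidates.extend(rng.sample(items, min(per_class, len(items))))
    return candidates
```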
The val split was used to perform this benchmark. The benchmark code can be found here.
We use the following metrics (a short computation sketch follows the list):
- micro-recall@4: skewed by classes with many samples.
- macro-recall@4: gives equal weight to all classes.
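The sketch below shows how the two aggregations differ, assuming recall@4 has already been computed per logo (e.g. the fraction of its 4 same-class candidates retrieved in the top 4):

```python
# Micro vs macro aggregation of per-logo recall@4 values.
import numpy as np

def micro_macro_recall(recalls, labels):
    recalls = np.asarray(recalls, dtype=float)
    labels = np.asarray(labels)
    micro = recalls.mean()  # every logo weighs the same -> dominated by large classes
    macro = float(np.mean([recalls[labels == c].mean() for c in np.unique(labels)]))
    return micro, macro
```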
All compared models were trained on ImageNet, except:

- `beit_large_patch16_224_in22k`, trained on ImageNet-22k
- `clip-vit-*`, trained on the proprietary dataset described in the CLIP paper.
Note that we use the `timm` library to generate embeddings (except for CLIP models, where the `transformers` library was used). The model weights mostly come from the `timm` author's training and differ from the original weights.
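As an illustration (a sketch under our own assumptions, not the exact benchmark script), embeddings can be generated as follows; for CLIP we take the pooled output of the vision tower, which is consistent with the 768/1024 embedding sizes reported below:

```python
# Hedged sketch of embedding generation. timm models are created without a
# classification head (num_classes=0) so the forward pass returns pooled
# features; for CLIP we use the vision tower's pooled output.
import timm
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPVisionModel

# timm backbone (e.g. the current efficientnet_b0)
backbone = timm.create_model("efficientnet_b0", pretrained=True, num_classes=0).eval()
config = timm.data.resolve_data_config({}, model=backbone)
transform = timm.data.create_transform(**config)

image = Image.open("logo_crop.png").convert("RGB")  # hypothetical logo crop
with torch.no_grad():
    timm_embedding = backbone(transform(image).unsqueeze(0))  # (1, 1280)

# CLIP vision tower through the transformers library
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()
with torch.no_grad():
    inputs = processor(images=image, return_tensors="pt")
    clip_embedding = clip_vision(**inputs).pooler_output  # (1, 768)
```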
Latency was measured on 50 batches of 8 samples with a Tesla T4 GPU.
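For reference, a rough sketch of how such per-sample latency numbers can be obtained (details such as warm-up and synchronization are our assumptions):

```python
# Time 50 batches of 8 samples on the GPU and report the mean per-sample latency.
import time
import torch

def per_sample_latency_ms(model, n_batches=50, batch_size=8, image_size=224):
    model = model.eval().cuda()
    batch = torch.randn(batch_size, 3, image_size, image_size, device="cuda")
    with torch.no_grad():
        model(batch)  # warm-up pass, excluded from the measurement
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_batches):
            model(batch)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return 1000 * elapsed / (n_batches * batch_size)
```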
With L2 distance:
model | micro-recall@4 | macro-recall@4 |
---|---|---|
random | 0.0083 | 0.0063 |
efficientnet_b1 | 0.4612 | 0.5070 |
resnest101e | 0.4322 | 0.5124 |
beit_large_patch16_384 | 0.4162 | 0.5233 |
efficientnet_b2 | 0.4707 | 0.5323 |
rexnet_100 | 0.5158 | 0.5340 |
efficientnet_b4 | 0.4807 | 0.5450 |
resnet50 | 0.4916 | 0.5609 |
efficientnet_b0 | 0.5420 | 0.5665 |
beit_base_patch16_384 | 0.4758 | 0.5666 |
resnet50d | 0.5313 | 0.6133 |
beit_large_patch16_224_in22k | 0.5723 | 0.6660 |
clip-vit-base-patch32 | 0.7006 | 0.8243 |
clip-vit-base-patch16 | 0.7295 | 0.8428 |
clip-vit-large-patch14 | 0.7706 | 0.8755 |
deit_base_patch16_384 | 0.3920 | 0.4375 |
With cosine distance:
model | micro-recall@4 | macro-recall@4 |
---|---|---|
random | 0.0091 | 0.0114 |
efficientnet_b1 | 0.4941 | 0.5334 |
resnest101e | 0.4777 | 0.5415 |
efficientnet_b2 | 0.4909 | 0.5466 |
rexnet_100 | 0.5461 | 0.5716 |
efficientnet_b4 | 0.5141 | 0.5861 |
resnet50 | 0.5241 | 0.5868 |
efficientnet_b0 | 0.5489 | 0.5838 |
resnet50d | 0.5791 | 0.6353 |
clip-vit-base-patch32 | 0.7030 | 0.8244 |
clip-vit-base-patch16 | 0.7297 | 0.8470 |
clip-vit-large-patch14 | 0.7753 | 0.8722 |
deit_base_patch16_384 | 0.4403 | 0.5070 |
Embedding size and per-sample latency (independent of the distance used):
model | embedding size | per-sample latency (ms) |
---|---|---|
random | - | - |
efficientnet_b1 | 1280 | 3.93 |
resnest101e | 1024 | 142.75 |
efficientnet_b2 | 1408 | 4.29 |
rexnet_100 | 1280 | 3.91 |
efficientnet_b4 | 1792 | 6.99 |
resnet50 | 2048 | 3.50 |
efficientnet_b0 | 1280 | 5.51 |
beit_base_patch16_384 | 768 | 41.88 |
resnet50d | 2048 | 4.01 |
beit_large_patch16_224_in22k | 1024 | 43.56 |
clip-vit-base-patch32 | 768 | 3.08 |
clip-vit-base-patch16 | 768 | 11.69 |
clip-vit-large-patch14 | 1024 | 56.68 |
deit_base_patch16_384 | 768 | 1.73 |
N.B.: we didn't use the cosine distance for the beit models, as they were no longer working when we ran the cosine-distance benchmark. Some explanations can be found here.
We note that the cosine distance tends to be slightly better than the L2 distance for each model (except for `clip-vit-large-patch14`), but the use of one distance or the other does not change the following analysis, as the ranking of the models remains the same.
As expected, the current model (`efficientnet-b0`) performs well above the random baseline. Its performance is competitive with most other architectures pretrained on ImageNet.
However, CLIP models largely outperform any other tested architecture on this benchmark: with `clip-vit-large-patch14`, we gain +22.8 points on micro-recall@4 and +30.9 points on macro-recall@4 compared to `efficientnet-b0`.
The performance of CLIP models increases as the model gets larger or the image patch size gets smaller. The prediction latency is however 3.8x and 18.4x higher for `clip-vit-base-patch16` and `clip-vit-large-patch14` respectively, compared to `clip-vit-base-patch32`.
In conclusion, CLIP models are very good candidates for an off-the-shelf replacement of the `efficientnet-b0` model currently used to generate logo embeddings. An additional benefit of this architecture is the smaller embedding size (768 or 1024, depending on the version) compared to the original `efficientnet-b0` model (1280).
To study the behaviour of each model in a setting closer to our use case, we performed another benchmark, this time based on the ability of each model to predict the right label for each logo in the val.txt file. The predicted label is the most represented one among the k closest neighbours.
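A minimal sketch of this prediction rule, assuming cosine distance and majority voting among the k nearest neighbours (k=4 is our assumption, mirroring the @4 in the metric names):

```python
# k-NN label prediction by majority vote over cosine distance.
from collections import Counter
import numpy as np

def predict_label(embeddings, labels, query_idx, k=4):
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - normed @ normed[query_idx]
    dist[query_idx] = np.inf                  # never pick the query itself
    neighbours = np.argsort(dist)[:k]
    return Counter(labels[i] for i in neighbours).most_common(1)[0][0]
```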
Here are the results we obtained with the cosine distance:
model | micro-recall@4 | macro-recall@4 |
---|---|---|
random | 0.0098 | 0.0098 |
efficientnet_b1 | 0.7347 | 0.7711 |
resnest101e | 0.7089 | 0.7653 |
efficientnet_b2 | 0.7379 | 0.7847 |
rexnet_100 | 0.7848 | 0.7896 |
efficientnet_b4 | 0.7590 | 0.8025 |
resnet50 | 0.7667 | 0.8050 |
efficientnet_b0 | 0.7835 | 0.8098 |
resnet50d | 0.8054 | 0.8229 |
clip-vit-base-patch32 | 0.9139 | 0.9196 |
clip-vit-base-patch16 | 0.9171 | 0.9219 |
clip-vit-large-patch14 | 0.9427 | 0.9417 |
deit_base_patch16_384 | 0.7025 | 0.7573 |
The CLIP models remain better than all the other models on this task as well.