Hello, I am Jiaqi Ning, a researcher from the human reasoning group of FUJITSU RESEARCH & DEVELOPMENT CENTER (FRDC), China. Recently, my colleagues from FRDC, FRJ, FRA and I formed a team to participate in the WACV Physical Retail AI challenge (https://physicalstoreworkshop.github.io/challenge.html). We won first place with a significant lead over the second-place team. In this blog, I will introduce the challenge, describe our solution, and offer a review of the event.
Overview of the challenge
The Physical Retail AI challenge was organized by Amazon. It aims to accelerate progress in developing AI-enabled technologies for physical retail shopping. In recent years, physical retail applications have been regarded as both promising and challenging by industry and academia. We participated in the second track, Appearance-Based Verification (ABV). The goal of this track is to assess whether a model can accurately match a query product image with its corresponding images in the gallery. For each query image, participating teams must output a ranking of gallery items based on the matches the model predicts. The task involves several real-world obstacles, such as occlusions caused by hands or other objects, background interference, and multiple viewing angles. Solving them is crucial for related applications, such as analyzing shoppers' interests during product selection.
Data - The organizer provided 74,200 images, each labeled with a product ID and taken from anonymized customers. These images were captured using a GoPro camera mounted on a standard U.S. shopping cart.
Metrics - The evaluation is based on Cumulative Matching Characteristic (CMC) curves, so both top-K accuracy and average precision play important roles.
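To make the metric concrete, here is a minimal sketch of how a CMC curve can be computed from ranked matching results. The function name and the toy IDs are my own illustration, not part of the challenge's official evaluation code.

```python
import numpy as np

def cmc_curve(ranked_gallery_ids, query_ids, max_rank=5):
    """CMC: the fraction of queries whose true product ID appears
    within the top-k ranked gallery items, for k = 1..max_rank."""
    hits = np.zeros(max_rank)
    for ranking, q_id in zip(ranked_gallery_ids, query_ids):
        # position of the first correct match in this query's ranking
        matches = np.where(np.asarray(ranking[:max_rank]) == q_id)[0]
        if matches.size > 0:
            hits[matches[0]:] += 1  # a hit at rank r counts for all k >= r
    return hits / len(query_ids)

# two queries: the first is matched at rank 1, the second at rank 2
cmc = cmc_curve([[3, 1, 2], [5, 4, 9]], [3, 4], max_rank=3)
print(cmc)  # -> [0.5 1.  1. ]
```

Top-1 accuracy is simply the first point of this curve; higher ranks show how quickly the correct product is recovered as K grows.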
We use a common pipeline for verification tasks. The core of our solution is to extract representative features for the products in the images. We then rank the matching results by the distance between the features of the query images and those of the gallery images. Figure 2 uses a single query image to illustrate the overall framework of our solution; the images shown in Figure 2 are examples from the challenge.
In this framework, the most important component is the feature extractor. For the distance calculator, common methods such as cosine distance are quite effective. Next, I'll delve into the details of the feature extraction method we adopted.
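The ranking step itself is straightforward. Below is a minimal sketch of ranking gallery items by cosine similarity to a query feature; the function name and the toy 2-D features are illustrative assumptions, not our actual feature dimensionality.

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Rank gallery items by cosine similarity to the query feature,
    most similar first."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q              # cosine similarity per gallery item
    return np.argsort(-sims)  # gallery indices, descending similarity

query = np.array([1.0, 0.0])
gallery = np.array([[0.0, 1.0],      # orthogonal to the query
                    [2.0, 0.1]])     # nearly parallel to the query
print(rank_gallery(query, gallery))  # -> [1 0]
```

Because the features are L2-normalized, ranking by cosine similarity is equivalent to ranking by Euclidean distance, so either works as the distance calculator.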
Feature extraction method
After examining the training data provided by the challenge organizer, we identified the following characteristics of the ABV task. First, the dataset contains hundreds of product categories. Some products from different categories look very similar, while the appearance of a product within the same category can vary significantly across viewing angles. Second, the product images usually have complex backgrounds, including customers' hands, shopping carts, and shelves. These backgrounds make it harder to extract discriminative features for each product. Thus, a fine-grained feature extraction method is required. Ours consists of three modules: the backbone, multi-scale feature fusion, and background suppression.
Backbone - Prior works [1-3] have demonstrated that Transformer-based backbones have advantages over convolutional neural network (CNN)-based backbones in many computer vision tasks, such as classification and object detection. To extract fine-grained, discriminative features, we preferred a Transformer-based backbone, and we also wanted features at multiple spatial resolutions to increase the discriminability of each product. Based on this analysis, we chose the Swin Transformer [1] as the backbone of our solution. The Swin Transformer constructs hierarchical feature maps, which conveniently support advanced multi-scale fusion techniques such as feature pyramid networks (FPN) [4].
Multi-scale feature fusion - Using the Swin Transformer as the backbone, we obtain features at different scales and resolutions from its different blocks. Specifically, we use the Swin-L configuration. Both top-down fusion and bottom-up fusion are applied to the multi-scale features output by the backbone blocks [5]. This can be regarded as an FPN with an additional bottom-up path, as shown in Figure 3. In the top-down fusion, the feature at level i is derived from the sum of the transformed outputs of levels i and i+1. Similarly, in the bottom-up fusion, the feature at level i is obtained by summing the transformed outputs of levels i and i-1. During training, we attach a classifier to each level's output in both the bottom-up and top-down fusion paths. In addition, we encourage the classification distributions from the bottom-up and top-down paths to be similar by using a Kullback-Leibler divergence loss as a constraint [5].
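The two fusion passes can be sketched as follows. This is a simplified, single-channel numpy illustration: nearest-neighbour upsampling and average pooling stand in for the learned transforms, and the three toy feature maps stand in for the Swin block outputs.

```python
import numpy as np

def upsample2x(x):
    # nearest-neighbour upsampling, standing in for a learned transform
    return x.repeat(2, axis=0).repeat(2, axis=1)

def downsample2x(x):
    # 2x2 average pooling, standing in for a strided transform
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def fuse(backbone_feats):
    """Top-down then bottom-up fusion over per-stage feature maps,
    ordered from highest to lowest spatial resolution."""
    # top-down: each level adds the upsampled, coarser level above it
    top_down = [backbone_feats[-1]]
    for feat in reversed(backbone_feats[:-1]):
        top_down.append(feat + upsample2x(top_down[-1]))
    top_down.reverse()
    # bottom-up: each level adds the downsampled, finer level below it
    bottom_up = [top_down[0]]
    for feat in top_down[1:]:
        bottom_up.append(feat + downsample2x(bottom_up[-1]))
    return bottom_up

feats = [np.ones((8, 8)), np.ones((4, 4)), np.ones((2, 2))]
fused = fuse(feats)
print([f.shape for f in fused])  # -> [(8, 8), (4, 4), (2, 2)]
```

Each fused level keeps its original resolution but mixes in context from the other scales, which is what lets per-level classifiers see both fine detail and global shape.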
Background suppression - For each feature map, some pixels belong to the foreground and the rest to the background. The classifier produces a classification result for the feature of each pixel in the feature map. The criterion for deciding whether a pixel is foreground or background is as follows: we rank the pixels by their classification response on the ground-truth category, from largest to smallest. The top p percent (e.g., p = 20) of pixels in a feature map are treated as foreground, and the rest as background. The classification result of the combined foreground-pixel features should match the ground-truth category of the image. In contrast, the classification result of each background pixel should not match any category present in the training data; we therefore assign background pixels a pseudo target when computing their classification loss [5]. Because we assume that background pixels belong to none of the training classes, this background suppression method also works well when extracting features from images whose classes do not exist in the training data. In our solution, regardless of whether an image's category appears in the training data, we apply average pooling to each feature map after bottom-up fusion, then concatenate the pooled features as the image's appearance feature.
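The foreground/background selection step can be sketched like this. The function name, the flattened pixel layout, and the toy logits are my own illustration of the criterion described above, not our training code.

```python
import numpy as np

def split_foreground(pixel_logits, gt_class, p=20):
    """Mark the top-p% of pixels, ranked by their classification
    response on the ground-truth class, as foreground."""
    responses = pixel_logits[:, gt_class]  # (num_pixels,)
    k = max(1, int(len(responses) * p / 100))
    order = np.argsort(-responses)         # strongest response first
    fg_mask = np.zeros(len(responses), dtype=bool)
    fg_mask[order[:k]] = True
    return fg_mask

# 5 pixels, 3 classes; the ground-truth class is 1
logits = np.array([[0.1, 0.9, 0.0],
                   [0.2, 0.1, 0.7],
                   [0.0, 0.8, 0.2],
                   [0.5, 0.2, 0.3],
                   [0.3, 0.0, 0.7]])
mask = split_foreground(logits, gt_class=1, p=40)
print(mask)  # -> [ True False  True False False]
```

During training, the foreground pixels (mask True) would feed the image-level classification loss, while the background pixels would be pushed toward the pseudo "none of the above" target.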
Review of the challenge
The challenge organizer didn't release the test data used for final evaluation, so we had to ensure the model generalized well. We achieved this through diversified data augmentation and a careful split of the dataset provided by the organizer, holding out part of it as a validation set that did not participate in training.
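A simple way to build such a held-out split is to reserve a fixed fraction of images per product ID, so every product appears in both splits. This is a minimal sketch of that idea; the function name and parameters are illustrative, not our exact splitting code.

```python
import numpy as np

def split_by_class(labels, val_fraction=0.2, seed=0):
    """Hold out a fixed fraction of images per product ID for
    validation, so each class is represented in both splits."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        idx = np.where(np.asarray(labels) == cls)[0]
        rng.shuffle(idx)
        n_val = max(1, int(len(idx) * val_fraction))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return sorted(train_idx), sorted(val_idx)

labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
train, val = split_by_class(labels, val_fraction=0.2)
print(len(train), len(val))  # -> 8 2
```

Fixing the random seed keeps the split reproducible across experiments, which matters when comparing augmentation strategies against the same validation set.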
I would also like to thank my colleagues from various departments of Fujitsu for their contributions to this challenge. The participants were Boix Xavier (FRA), Guo Zihao (FRDC), Kikuchi Takashi (FRJ), Li Fei (FRDC), Matsumoto Shinichi (FRA), Ning Jiaqi (FRDC), Pelat Guillaume (FRJ), Takeuchi Shun (FRJ) and Yamanaka Jin (FRA) (in alphabetical order). This was a successful global collaboration within Fujitsu.
1. Liu Z, Lin Y, Cao Y, et al. Swin transformer: Hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2021: 10012-10022.
2. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[J]. arXiv preprint arXiv:2010.11929, 2020.
3. Carion N, Massa F, Synnaeve G, et al. End-to-end object detection with transformers[C]//European conference on computer vision. Cham: Springer International Publishing, 2020: 213-229.
4. Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 2117-2125.
5. Chou P Y, Kao Y Y, Lin C H. Fine-grained visual classification with high-temperature refinement and background suppression[J]. arXiv preprint arXiv:2303.06442, 2023.