Hello, I am Ziqiang Shi from the team developing Self CheckOut Monitoring (SCOM) at FUJITSU RESEARCH & DEVELOPMENT CENTER in China. Recently, our SCOM technology won the 1st prize in the 7th AI CITY CHALLENGE (https://www.aicitychallenge.org/), leading the runner-up by an absolute margin of 6%. In this article, we introduce the competition, our solution and efforts during this event, and a final review.
Overview of the competition
The AI CITY CHALLENGE was initiated in 2017 by NVIDIA together with several universities (including Johns Hopkins University, Boston University, Iowa State University, and Indian Institute of Technology). It is currently one of the most authoritative competitions in the fields of smart retail systems and intelligent transportation systems. We participated in Track 4, "Multi-Class Product Counting & Recognition for Automated Retail Checkout", which evaluates the performance of current vision-based automatic checkout technology in smart retail. Customers hold one or more products in their hands and pass them through the tray area captured by the video. The participating teams need to accurately identify the category of each product and the time at which it enters the tray. The technology needs to overcome several real-world challenges, including occlusion, movement, and the similar appearance of different SKUs, so that no customer is charged for an extra product (customer dissatisfied) and no product goes unpaid (merchant dissatisfied).
Data - The organizer provided 116,500 synthetic images of 116 3D objects, generated with the pipeline from Yao et al. (2022). These are the only SKU image training data we were allowed to use.
Metrics - Performance is measured by the F1 score, which is computed from the numbers of True Positives (TP), False Positives (FP) and False Negatives (FN).
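As a quick reminder of how the metric behaves, here is a minimal sketch of the standard F1 formula. The counts in the example call are made up for illustration and are not results from the challenge.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    Equivalent to 2*TP / (2*TP + FP + FN). Returns 0.0 when there are
    no predictions and no ground-truth items at all.
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts: 8 correct detections, 2 spurious, 2 missed
print(f1_score(tp=8, fp=2, fn=2))  # 0.8
```

A false positive corresponds to a product the customer would be charged for by mistake, and a false negative to a product that goes unpaid, so both kinds of error lower the score equally.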
Our system consists of five parts, namely data preprocessing, detector, tracker, classifier, and post-processing as shown in Figure 2.
Controllable synthetic data optimization
Publicly available video data for automated retail checkouts is very scarce and difficult to annotate. Yao et al. (2022) propose building a 3D model for each product, so that unlimited product images from various angles and under various lighting can be generated. By using the product image as the foreground and pasting it onto a background, we can get unlimited data for training product detection and classification models. How to embed product images in the background therefore becomes the key to the success of the whole system.
We designed a controllable synthetic data optimization scheme governed by three hyperparameters: the number of products on the background, the degree of occlusion between two products, and the scaling of product sizes. Here the occlusion degree is measured by Intersection over Union (IoU), which lies in [0, 1], and the scaling factor is also within [0, 1]. These three parameters affect each other: if we put too many products on the background at one time and keep the original size of the products, the occlusion between the products will inevitably increase. The ranges of these parameters are specified empirically, for example 1-6 products, an occlusion degree below 0.3, and a scaling factor within [0.15, 0.5], to generate different training data. We adjusted the hyperparameters, conducted ablation experiments, and found that, with regard to occlusion, it is best either to use an upper bound of 0.5 or to have no occlusion at all, and that the scaling factor should be kept around 0.5 as much as possible. Under these settings the system achieves better performance. For detailed experiments, please refer to our paper (Shi et al., 2023).
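To make the role of the three hyperparameters concrete, here is a minimal sketch of how a layout could be sampled before pasting. It is not our actual pipeline: the function names, the rejection-sampling strategy, and the retry limit are assumptions made for illustration, and only the parameter ranges (1-6 products, IoU bound 0.3, scaling in [0.15, 0.5]) come from the text above.

```python
import random


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sample_layout(bg_w, bg_h, product_sizes, max_products=6, max_iou=0.3,
                  scale_range=(0.15, 0.5), max_tries=50, rng=None):
    """Pick paste boxes for 1..max_products product images (hypothetical helper).

    product_sizes: list of (width, height) of the original product images.
    A candidate box is rejected when its IoU with any box already placed
    exceeds max_iou; a product is skipped after max_tries failed attempts.
    """
    rng = rng or random.Random()
    boxes = []
    for _ in range(rng.randint(1, max_products)):
        w0, h0 = rng.choice(product_sizes)
        for _ in range(max_tries):
            s = rng.uniform(*scale_range)       # scaling hyperparameter
            w, h = w0 * s, h0 * s
            if w > bg_w or h > bg_h:
                continue                        # scaled product does not fit
            x = rng.uniform(0, bg_w - w)
            y = rng.uniform(0, bg_h - h)
            box = (x, y, x + w, y + h)
            if all(iou(box, other) <= max_iou for other in boxes):
                boxes.append(box)               # occlusion bound respected
                break
    return boxes


# Hypothetical background and product sizes, in pixels
layout = sample_layout(1920, 1080, [(400, 600), (300, 300)], rng=random.Random(0))
print(len(layout), "products placed")
```

Setting `max_iou=0.0` gives the "no occlusion" variant, and the coupling between the parameters shows up directly: with many products and a large scale, more candidates are rejected and fewer products end up on the background.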
We also conducted experiments to investigate the performance difference caused by using synthetic images and real images as backgrounds. Experimental results show that using real images improves performance. This also tells us that for automated retail checkouts, synthetic data is only a compromise, and it is better to use real data when conditions permit. Figure 3 shows some synthetic images for training with different scalings, IoUs, and backgrounds.
Tracking products with CheckSORT
Effectively tracking the moving product during checkout is the key to later recognition. In our implementation, StrongSORT (Du et al., 2023) is adopted, with several improvements, to associate the product bounding boxes across video frames. These improvements include decomposed Kalman filtering and dynamic tracklet feature sequences.
In most previous association algorithms, a single Kalman filter is used to model and predict the position and size of the target object, with state vector [x, y, a, h, x′, y′, a′, h′]. Here (x, y) is the center point of the bounding box, a is the aspect ratio, h is the height, and the prime (′) denotes the first derivative, that is, the rate of change. The position [x, y] and the size [a, h] of a product usually show completely different motion patterns. The movement of a product at checkout can be decomposed into two parts: the smooth movement of the center [x, y], and a nearly independent rigid-body motion, such as rotation of the product. The translational movement is relatively simple and almost linear, while the rotation corresponds to sharp nonlinear changes in [a, h] of the bounding box. Figure 4 compares the center movement with the aspect ratio change of a product in a test video provided by the challenge. It can be seen that the change in aspect ratio is much larger than the movement of the center. We therefore propose the decomposed Kalman filter (DKF), which models p = [x, y, x′, y′] and b = [a, h, a′, h′] of each product separately.
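The decomposition can be sketched as two independent constant-velocity Kalman filters, one for p and one for b. This is a simplified illustration rather than our released implementation: the class names, the tiny pure-Python matrix helpers, and all noise values are assumptions, chosen only to show that the shape filter can be given a much larger process noise than the position filter.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B, sign=1.0):
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def inv2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

# Constant-velocity model over [u, v, u', v'] with a time step of one frame
F = [[1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0], [0, 0, 0, 1]]
H = [[1, 0, 0, 0], [0, 1, 0, 0]]  # only u and v are observed

class PairKF:
    """Kalman filter for a 2-D quantity and its rate of change."""
    def __init__(self, z, q, r):
        self.x = [[z[0]], [z[1]], [0.0], [0.0]]
        self.P = diag([r, r, 10 * r, 10 * r])  # illustrative initial uncertainty
        self.Q = diag([q, q, q, q])            # process noise
        self.R = diag([r, r])                  # measurement noise

    def predict(self):
        self.x = matmul(F, self.x)
        self.P = add(matmul(matmul(F, self.P), transpose(F)), self.Q)
        return self.x[0][0], self.x[1][0]

    def update(self, z):
        y = add([[z[0]], [z[1]]], matmul(H, self.x), sign=-1.0)   # innovation
        S = add(matmul(matmul(H, self.P), transpose(H)), self.R)
        K = matmul(matmul(self.P, transpose(H)), inv2(S))         # Kalman gain
        self.x = add(self.x, matmul(K, y))
        self.P = add(self.P, matmul(matmul(K, H), self.P), sign=-1.0)

class DecomposedKF:
    """Tracks p = [x, y, x', y'] and b = [a, h, a', h'] with separate filters."""
    def __init__(self, box):
        x, y, a, h = box
        self.position = PairKF((x, y), q=1.0, r=1.0)   # smooth, near-linear motion
        self.shape = PairKF((a, h), q=25.0, r=1.0)     # abrupt changes from rotation

    def predict(self):
        return self.position.predict() + self.shape.predict()  # (x, y, a, h)

    def update(self, box):
        x, y, a, h = box
        self.position.update((x, y))
        self.shape.update((a, h))
```

Because the two filters share no covariance, a sudden jump in aspect ratio when a product is rotated does not disturb the velocity estimate of the center, which is the behavior the decomposition is meant to achieve.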
When matching the detection boxes of the current frame against the historical trajectories, we need the pairwise distances between them. In our DKF, the distance matrices for position and for aspect ratio are calculated separately. Similar to DeepSORT (Wojke et al., 2017), we obtain a gating matrix for each distance matrix and then the final cost matrix. The difference is that our cost matrix is a weighted sum of three different matrices.
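The gated, weighted combination can be sketched as follows. Note the assumptions: the text above does not name the three matrices, so this sketch takes them to be an appearance distance plus the position and shape distances from the two filters; the weights are placeholders; and the gate value 5.9915 is simply the 95% chi-square quantile for 2 degrees of freedom, used here because each sub-filter observes two values.

```python
INFEASIBLE = 1e5  # large cost marking a track/detection pair that failed a gate


def combine_costs(d_app, d_pos, d_shape, weights=(0.6, 0.2, 0.2),
                  gate_pos=5.9915, gate_shape=5.9915):
    """Weighted sum of three distance matrices with per-matrix gating.

    Each matrix is a list of rows (tracks) by columns (detections).
    A pair is ruled out when its position or shape distance exceeds
    the corresponding gate; otherwise the three distances are blended.
    """
    w_app, w_pos, w_shape = weights
    cost = []
    for row_app, row_pos, row_shape in zip(d_app, d_pos, d_shape):
        row = []
        for a, p, s in zip(row_app, row_pos, row_shape):
            if p > gate_pos or s > gate_shape:
                row.append(INFEASIBLE)
            else:
                row.append(w_app * a + w_pos * p + w_shape * s)
        cost.append(row)
    return cost


# Hypothetical distances for one track and two detections
cost = combine_costs(d_app=[[0.1, 0.9]], d_pos=[[1.0, 20.0]], d_shape=[[2.0, 1.0]])
print(cost)  # the second pair is gated out by its position distance
```

The resulting matrix would then be passed to an assignment step (DeepSORT uses the Hungarian algorithm for this), which is left out here to keep the sketch short.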
After obtaining all the trajectories from CheckSORT, we refine them to improve accuracy. After some empirical tuning, we settled on the following rules to process these raw trajectories:
- If a track is very short, or is classified as the background class and so does not belong to any product class, it is deleted.
- If a track has a gap of more than half a second in the middle, it is broken into two tracks.
- If several tracks have the same classification result and the time gap between two of them is less than 3 seconds, these tracks are merged.
Figure 5 shows an example. It can be seen that post-processing is very important and can significantly improve the results.
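The three rules above can be sketched in a few lines. This is an illustration under stated assumptions, not our exact code: a track is represented as a list of (time in seconds, predicted class id) pairs, the class of a track is taken by majority vote, the background id and the minimum length of 5 observations are made-up values, and the order in which the rules are applied (split, then delete, then merge) is a choice of this sketch.

```python
from collections import Counter

BACKGROUND = 0  # hypothetical id of the background class


def track_class(track):
    """Majority vote over the per-frame class predictions of a track."""
    return Counter(cls for _, cls in track).most_common(1)[0][0]


def split_on_gaps(track, max_gap=0.5):
    """Break a time-sorted track wherever consecutive observations are too far apart."""
    pieces, current = [], [track[0]]
    for prev, obs in zip(track, track[1:]):
        if obs[0] - prev[0] > max_gap:
            pieces.append(current)
            current = []
        current.append(obs)
    pieces.append(current)
    return pieces


def refine(tracks, min_len=5, max_gap=0.5, merge_window=3.0):
    # Split tracks that contain a gap longer than max_gap seconds
    pieces = [p for t in tracks if t for p in split_on_gaps(sorted(t), max_gap)]
    # Delete tracks that are very short or classified as background
    kept = [p for p in pieces
            if len(p) >= min_len and track_class(p) != BACKGROUND]
    # Merge tracks of the same class that are less than merge_window seconds apart
    kept.sort(key=lambda p: p[0][0])
    merged, last_by_class = [], {}
    for piece in kept:
        cls = track_class(piece)
        idx = last_by_class.get(cls)
        if idx is not None and piece[0][0] - merged[idx][-1][0] < merge_window:
            merged[idx].extend(piece)
        else:
            last_by_class[cls] = len(merged)
            merged.append(list(piece))
    return merged
```

With majority voting per piece, splitting is mainly useful when a tracker has glued two different products into one track: the two pieces get different classes and are not merged back, while pieces of the same product separated by a short dropout are rejoined by the merge rule.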
DetectoRS (Qiao et al., 2021), pretrained on the Microsoft COCO dataset (Lin et al., 2014), is used as the product detection model. The data prepared above are used to fine-tune this pretrained model. To obtain a robust classifier, in addition to the data provided by the organizer, product images with different backgrounds are extracted from the optimized synthetic data and used for fine-tuning. Three types of classification models are used, namely EfficientNet, ResNeSt-50, and ResNeSt-100, all pretrained on ImageNet (Deng et al., 2009).
Effect of synthetic data optimization
If we fix the number of products on the tray to the range 2-6, then only two hyperparameters remain adjustable: the upper bound of IoU and the range of scaling. Figure 6 shows the recognition performance on challenge test data ‘TestA 2022’ under different IoUs with scaling fixed in the range 0.15∼0.5. Both experiments show that larger upper bounds on IoU lead to better performance. In other words, training data with more occlusion between products, and hence greater complexity, is more conducive to multi-object detection, classification and tracking.
Figure 7 shows the performance under different scaling ranges when the upper bound of IoU is fixed at 0.1. It can be seen that scaling around 0.55 produces better performance. Larger scaling cannot generate enough training data, and scaling that is too small does not match the actual situation.
Effect of CheckSORT
Figure 8 shows the performance of the different trackers, DeepSORT, StrongSORT, and CheckSORT, on challenge test data ‘TestA 2023’. CheckSORT improves on both DeepSORT and StrongSORT, to different degrees.
Review of the competition
To bridge the large gap between training and testing data, we proposed a synthetic training data optimization paradigm. We also proposed CheckSORT, a tracking algorithm customized to the particular characteristics of product checkout scenarios. These methods worked very well for this competition.
My personal takeaways from this competition are to keep an open mind, pursue excellence, iterate quickly, brainstorm, and try every path to improvement.
- Milind Naphade, Shuo Wang, David C. Anastasiu, Zheng Tang, Ming-Ching Chang, Yue Yao, Liang Zheng, et al. The 7th AI City Challenge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5537–5547, 2023.
- Yue Yao, Liang Zheng, Xiaodong Yang, Milind Naphade, and Tom Gedeon. Attribute descent: Simulating object-centric datasets on the content level and beyond. arXiv preprint arXiv:2202.14034, 2022.
- Ziqiang Shi, Zhongling Liu, Liu Liu, Rujie Liu, Takuma Yamamoto, Xiaoyu Mi, and Daisuke Uchida. CheckSORT: Refined synthetic data combination and optimized SORT for automatic retail checkout. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5390–5397, 2023.
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- Yunhao Du, Zhicheng Zhao, Yang Song, Yanyun Zhao, Fei Su, Tao Gong, and Hongying Meng. StrongSORT: Make DeepSORT great again. IEEE Transactions on Multimedia, 2023.
- Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
- Siyuan Qiao, Liang-Chieh Chen, and Alan Yuille. DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10213–10224, 2021.
- Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3645–3649. IEEE, 2017.