Hello, I am Nan Zhang from Fujitsu Research & Development Center Co., Ltd. Recently, we participated in the WeatherProof Dataset Challenge (CVPR 2024 UG2+ Track 3) and won first place on the final leaderboard. In this blog, I will introduce the challenge and our solution.
Overview of the challenge
The WeatherProof Dataset Challenge [1] introduces the first collection of accurately paired clear and weather-degraded images. This challenge tackles real-world weather effects on semantic segmentation, aiming to spark new methods for handling these challenging images.
Our solution
To address the problems of semantic segmentation in adverse weather, namely poor image quality, interference from rain and snow noise, and large scene differences, we first build stronger semantic segmentation models, then introduce an extra dataset and data augmentation methods, and finally apply effective training strategies and an ensemble method to improve the final performance.
Semantic segmentation models
Semantic segmentation models based on Depth Anything - We use a Depth Anything [2] pre-trained backbone to improve our semantic segmentation models. Depth Anything is a monocular depth estimation (MDE) model. It inherits rich semantic priors from a pre-trained DINOv2 encoder via a simple feature alignment constraint. In addition, it has been trained on a large-scale unlabeled dataset and generalizes well, so we choose the Depth Anything pre-trained model as the backbone. We use this approach to improve the UperNet and SETR-MLA models. Figure 2 shows the key components of the UperNet model with the Depth Anything pre-trained backbone.
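For illustration, the following is a minimal sketch of how a Depth Anything encoder might be wrapped as a segmentation backbone that feeds multi-level features to a decoder such as UperNet. The checkpoint file name and its key layout are assumptions, and this is a sketch rather than our exact implementation; the DINOv2 ViT-L/14 architecture is loaded from torch.hub.

```python
import torch
import torch.nn as nn

class DepthAnythingBackbone(nn.Module):
    """ViT encoder (DINOv2 architecture) initialized from Depth Anything weights."""

    def __init__(self, ckpt_path="depth_anything_vitl14.pth"):  # hypothetical checkpoint file
        super().__init__()
        # Depth Anything's encoder shares the DINOv2 ViT-L/14 architecture.
        self.encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
        state = torch.load(ckpt_path, map_location="cpu")
        # Assumed layout: encoder weights are stored under a "pretrained." prefix.
        enc_state = {k.replace("pretrained.", ""): v
                     for k, v in state.items() if k.startswith("pretrained.")}
        self.encoder.load_state_dict(enc_state, strict=False)

    def forward(self, x):
        # x: (B, 3, H, W) with H and W multiples of the 14-pixel patch size.
        # Return four intermediate feature maps for a UperNet-style decoder.
        feats = self.encoder.get_intermediate_layers(x, n=4, reshape=True)
        return list(feats)
```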
Semantic segmentation model based on language guidance - To help the model adapt to different categories under different weather conditions, we leverage language guidance through the CLIP model, which learns a latent space shared by image and text encodings [3, 4, 5]. To enhance the model's performance, we make two improvements: enriching the prompt words and enhancing the CLIP image input methodology. By incorporating both weather and object category information, the model gains a more comprehensive understanding of the scene, leading to improved segmentation accuracy even under adverse weather conditions. Figure 3 shows an overview of the segmentation model based on CLIP guidance.
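To make the prompt-enrichment idea concrete, here is a minimal sketch of building CLIP text embeddings that combine weather and category words. The class names, weather words, and prompt template are illustrative assumptions, and how the embeddings are injected into the decoder (e.g. via cross-attention) is not shown.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

classes = ["tree", "road", "building", "sky"]   # assumed category names
weathers = ["rainy", "snowy", "foggy"]          # weather conditions in the prompts

# Enriched prompts: each combines an object category with a weather condition.
prompts = [f"a photo of a {c} in {w} weather" for c in classes for w in weathers]

with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(prompts).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)  # unit-normalized embeddings

print(text_emb.shape)  # (len(classes) * len(weathers), 512) for ViT-B/16
```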
Other semantic segmentation models - Besides the improved semantic segmentation models above, we also try other models and choose the following two for our solution. The first is the OneFormer model [6], a transformer-based multi-task universal image segmentation framework; we use the text "the task is semantic" to encode the task token and the semantic class text to encode the text token. The second is the InternImage model [5], which uses deformable convolutions to achieve the long-range dependencies of attention layers at low memory and computation cost.
Datasets and data augmentation
We use three datasets for training: WeatherProof, WeatherProofClean, and WeatherProofExtra. WeatherProof is the original dataset provided by the challenge organizer; during the testing phase, all of its data is used for training. WeatherProofClean contains the clean image for each scene in WeatherProof, with the same annotations. WeatherProofExtra is an extra dataset of 160 scenes, including rainy, snowy, and foggy conditions.
We apply augmentations to the extra data, including adverse-weather simulation and super-resolution; Figure 4 shows an example of the augmented images. We overlay rainy, snowy, or lightning weather watermarks on the original image to simulate various adverse weather conditions, and we use a super-resolution method to narrow the gap between training and testing resolutions.
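As an illustration of the weather-overlay augmentation, the sketch below alpha-blends a semi-transparent weather watermark over a clean training image; the file names and the blending factor are assumptions.

```python
from PIL import Image

def add_weather_overlay(image_path: str, overlay_path: str, alpha: float = 0.4) -> Image.Image:
    """Blend a weather watermark (rain streaks, snow, etc.) onto a clean image."""
    img = Image.open(image_path).convert("RGB")
    overlay = Image.open(overlay_path).convert("RGB").resize(img.size)
    # alpha controls the severity of the simulated weather.
    return Image.blend(img, overlay, alpha)

# Hypothetical usage:
# augmented = add_weather_overlay("scene_0001.png", "rain_streaks.png", alpha=0.4)
# augmented.save("scene_0001_rainy.png")
```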
Training strategy
We first train the models on the original WeatherProof and WeatherProofClean datasets, with basic data augmentation applied to the training set. After analyzing the training, validation, and test sets, we find that the test set has a higher resolution and a wider viewing angle. We therefore use the WeatherProofExtra dataset to fine-tune the models with a larger image size as the training input.
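For illustration, a fine-tuning pipeline with a larger crop size could look like the following mmsegmentation-style config fragment; the specific transforms, scales, and crop size are assumptions rather than our exact settings.

```python
# Hypothetical second-stage (fine-tuning) pipeline on WeatherProofExtra:
# larger crops to better match the higher-resolution, wider-angle test images.
crop_size = (1024, 1024)

train_pipeline = [
    dict(type="LoadImageFromFile"),
    dict(type="LoadAnnotations"),
    dict(type="RandomResize", scale=(2048, 1024), ratio_range=(0.5, 2.0), keep_ratio=True),
    dict(type="RandomCrop", crop_size=crop_size, cat_max_ratio=0.75),
    dict(type="RandomFlip", prob=0.5),
    dict(type="PhotoMetricDistortion"),  # basic photometric augmentation
    dict(type="PackSegInputs"),
]
```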
Results
Effect of language guidance and strategies
To evaluate the effectiveness of our prompt words for language guidance, we take InternImage-XL as the base model and compare the results of adding category prompts and weather prompts. Table 1 shows the results of adding different prompts on the WeatherProof validation set.
We further evaluate various strategies on the WeatherProof Dataset Challenge test set, with the InternImage-XL model as the baseline. "CLIP Guidance/C" denotes our model improved by adding language guidance; "Finetuning" denotes the strategy of fine-tuning with the extra data; "Inference enhancement" involves sliding-window inference and inference at multiple image scales, whose outputs are then averaged. The results are shown in Table 2.
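The multi-scale part of the inference enhancement can be sketched as follows; the model interface (per-pixel class logits and a num_classes attribute) and the scale set are assumptions, and sliding-window inference over large images can be combined with it in the same averaging fashion.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multiscale_inference(model, image, scales=(0.75, 1.0, 1.25)):
    """Average class probabilities over several input scales, then take the argmax."""
    h, w = image.shape[-2:]
    probs = torch.zeros(1, model.num_classes, h, w, device=image.device)  # assumed attribute
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        logits = model(scaled)  # assumed to return (1, C, h', w') logits
        logits = F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
        probs += logits.softmax(dim=1)
    return (probs / len(scales)).argmax(dim=1)  # (1, H, W) predicted labels
```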
Final results
We evaluate the performance of the different methods and finally choose six results from five models to ensemble. The ensemble method votes for each pixel across the six predicted results and selects the category with the highest number of votes as the final category. The results are listed in Table 3. As a single model, SETR-MLA-depth reaches an mIoU of 0.46; after ensembling the six results, our final mIoU is 0.47. Figure 5 shows two examples of visualization results from the different methods.
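The ensemble step can be written as a simple per-pixel majority vote over the six predicted label maps, as in the sketch below (inputs are assumed to be (H, W) arrays of class indices).

```python
import numpy as np

def majority_vote(predictions):
    """Per-pixel majority vote over a list of (H, W) label maps with integer class ids."""
    stack = np.stack(predictions, axis=0)  # (N, H, W)
    # apply_along_axis is simple but slow; it is enough for illustration.
    voted = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, stack)
    return voted.astype(stack.dtype)

# Hypothetical usage with six result maps:
# final_map = majority_vote([pred_a, pred_b, pred_c, pred_d, pred_e, pred_f])
```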
Review of the competition
To tackle the problems of semantic segmentation in adverse weather, we built an enhanced pipeline with improved segmentation models, an extra dataset, effective training strategies, and an ensemble method. Our method performed very well in this competition, and thanks to the efforts of all team members, we won first place.
This competition was a great experience for us. We explored how to use large models and language guidance to optimize visual foundation models. Next, we will consider how to apply this experience to actual products to improve their competitiveness.
References
- [1] WeatherProof Dataset Challenge: Semantic Segmentation in Adverse Weather (CVPR'24 UG²+ Track 3). https://codalab.lisn.upsaclay.fr/competitions/16728, 2024.
- [2] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. arXiv preprint arXiv:2401.10891, 2024.
- [3] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models from Natural Language Supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [4] Blake Gella, Howard Zhang, Rishi Upadhyay, Tiffany Chang, Matthew Waliman, Yunhao Ba, Alex Wong, and Achuta Kadambi. WeatherProof: A Paired-Dataset Approach to Semantic Segmentation in Adverse Weather. arXiv preprint arXiv:2312.09534, 2023.
- [5] Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14408–14419, 2023.
- [6] Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. OneFormer: One Transformer to Rule Universal Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.