Overview
Hi! I am Sho Takemori from the AI Innovation Core Project, Artificial Intelligence Laboratory. Recently, in our joint research with the Indian Institute of Science (IISc), we developed an AI technology termed SelMix that enables efficient optimization of real-world, complex performance measures for classification, and we presented our paper at ICLR 2024 as a spotlight presentation.
Paper
- Title: Selective Mixup Fine-Tuning for Optimizing Non-Decomposable Objectives
- Conference: The Twelfth International Conference on Learning Representations (ICLR 2024)
- Authors: Shrinivas Ramasubramanian (FRIPL), Harsh Rangwani (IISc), Sho Takemori (FRJ), Kunal Samanta (IISc), Yuhei Umeda (FRJ), Venkatesh Babu Radhakrishnan (IISc)
- Link to the paper
(Here, FRIPL: Fujitsu Research of India Private Limited, FRJ: Fujitsu Research Japan)
Motivation and Problem Formulation
In the paper, we propose a fine-tuning method for optimizing non-decomposable objectives in both the semi-supervised and supervised learning setups. In the following, we detail our motivation and problem formulation.
Long-tailed datasets and Non-decomposable objectives
Classification accuracy is the conventional metric for deep neural classifiers in computer vision tasks. However, in practical applications, datasets are often imbalanced (i.e., the distribution of class labels is long-tailed), and we have to optimize more nuanced metrics to ensure the robustness of the classifier. More precisely, we consider optimization of non-decomposable objectives (metrics) [Narasimhan et al., 2022]. A classification metric is called decomposable if there exists a score function defined on each instance such that the metric on a set of instances can be computed from the instance-wise scores; otherwise, we call the metric non-decomposable. A typical example of a decomposable metric is accuracy. Many practical metrics are non-decomposable; examples include the mean, minimum, and harmonic mean of class-wise recalls, the F1 score, and AUC. Optimizing these metrics is of practical importance, especially when the dataset is long-tailed.

More specifically, we consider non-decomposable metrics that can be written as a function of (the entries of) the confusion matrix of a classifier (or of a score function associated with the classifier). For a confusion matrix $C[h]$ of a classifier $h$, we assume the non-decomposable metric is given as $\psi(C[h])$ with a function $\psi$. We list examples of non-decomposable metrics as functions of the entries of a confusion matrix in the following table. For a constrained objective regarding coverage, we consider an unconstrained objective obtained by introducing Lagrange multipliers. Here, the classifier coverage for a given class label is the probability that the classifier predicts an instance as that class label, and coverage constraints are useful for representing fairness. The mean, min, and H-mean recall in the table denote the average, minimum, and harmonic mean of class-wise recalls, respectively.
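As a concrete illustration of how such metrics are computed from a confusion matrix, here is a small NumPy sketch (our own example for this post, not code from the paper). The minimum and harmonic mean of recalls, in particular, cannot be written as an average of per-instance scores, which is what makes them non-decomposable.

```python
import numpy as np

def classwise_recalls(conf: np.ndarray) -> np.ndarray:
    """Recall of class k = C[k, k] / sum_l C[k, l] (rows: true class, columns: predicted class)."""
    return np.diag(conf) / conf.sum(axis=1)

def mean_recall(conf: np.ndarray) -> float:
    return float(classwise_recalls(conf).mean())

def min_recall(conf: np.ndarray) -> float:
    return float(classwise_recalls(conf).min())

def hmean_recall(conf: np.ndarray) -> float:
    recalls = classwise_recalls(conf)
    return float(len(recalls) / np.sum(1.0 / recalls))

# A toy 3-class confusion matrix for a long-tailed dataset.
C = np.array([[80, 15,  5],
              [10, 60, 30],
              [ 2,  8, 90]], dtype=float)
print(mean_recall(C), min_recall(C), hmean_recall(C))
```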
A classifier with a feature extractor
In our previous work [Rangwani et al., 2022], we considered optimization of non-decomposable metrics (in the semi-supervised learning setting) and proposed a framework termed cost-sensitive self-training (CSST). However, CSST needs to train deep neural networks from scratch and is computationally expensive. In this work, for computational efficiency, we fine-tune a pre-trained model by "selective mixup", which we explain later. Let $K$ be the number of classes and $\mathcal{X}$ the instance space. We assume the classifier $h$ is computed through a score function $s$ that consists of a feature extractor $g: \mathcal{X} \to \mathbb{R}^d$ and a linear layer parametrized by a matrix $W \in \mathbb{R}^{K \times d}$, i.e., the classifier is given as $h(x) = \arg\max_{1 \le k \le K} s_k(x)$, where $s(x) = W g(x)$ for each instance $x \in \mathcal{X}$.
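The sketch below is a minimal PyTorch-style rendering of this setup. The class name and the choice to keep the backbone frozen during fine-tuning are our own illustrative simplifications, not the paper's code.

```python
import torch
import torch.nn as nn

class LinearProbeClassifier(nn.Module):
    """Score function s(x) = W g(x); the classifier predicts h(x) = argmax_k s_k(x)."""
    def __init__(self, feature_extractor: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.g = feature_extractor                             # pre-trained backbone g: X -> R^d
        self.W = nn.Linear(feat_dim, num_classes, bias=False)  # trainable linear head W

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():   # keep the backbone fixed in this sketch; fine-tuning it is also possible
            z = self.g(x)
        return self.W(z)        # logits s(x) = W g(x)

    def predict(self, x: torch.Tensor) -> torch.Tensor:
        return self.forward(x).argmax(dim=-1)                  # h(x) = argmax_k s_k(x)
```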
Semi-supervised and supervised learning
Similarly to the previous work [Rangwani et al., 2022], our method is able to utilize unlabeled instances (i.e., semi-supervised learning setting). Although our main focus is the semi-supervised learning (SSL) setting, our method can be easily extended to the supervised learning setting and we evaluate our method in both SSL and supervised settings.
Proposed Method
Feature Mixup and (i, j)-mixup
Mixup is a data augmentation method, and existing works show that it improves generalization, calibration, and robustness, especially in the standard setting where the metric is accuracy [Zhang et al., 2018, Zhang et al., 2021, Zhong et al., 2021]. In the most basic form of mixup, for two samples $(x_1, y_1)$ and $(x_2, y_2)$, mixup creates a new sample by a convex combination, i.e., the new sample is defined as $(\lambda x_1 + (1-\lambda) x_2,\ \lambda y_1 + (1-\lambda) y_2)$. Here, we identify a class label $y$ with a one-hot vector in $\{0, 1\}^K$ and $\lambda \in (0, 1)$. Our method is based on feature mixup [Verma et al., 2019], and we consider a convex combination of feature vectors instead of instances. Given two samples $(x_1, y_1)$ and $(x_2, y_2)$, we define the mixup loss as follows:

$$\mathcal{L}_{\mathrm{mixup}}\big((x_1, y_1), (x_2, y_2)\big) = \ell_{\mathrm{CE}}\big(W(\lambda g(x_1) + (1-\lambda) g(x_2)),\ \lambda y_1 + (1-\lambda) y_2\big),$$

where $\ell_{\mathrm{CE}}$ denotes the softmax cross-entropy loss and $\lambda \in (0, 1)$.
Conventional mixup methods draw the two samples uniformly at random. However, since our objective is more complex (optimization of a non-decomposable metric), such a naive sampling strategy leads to sub-optimal performance. Therefore, we consider $(i, j)$-mixups that take into account the class labels of the samples. We call the feature mixup of $(x_1, y_1)$ and $(x_2, y_2)$ an $(i, j)$-mixup if $y_1 = i$ and $y_2 = j$. Here, we consider a convex combination of the features only, and we use the class label $i$ for the mixup sample. For each class $k$, we denote by $X_k$ the subset of instances with the label $k$; in the SSL setup, $X_k$ denotes the subset of unlabeled instances with the pseudo label $k$. For each $k$, we denote by $\bar{z}_k$ the mean feature vector $\bar{z}_k = \frac{1}{|X_k|} \sum_{x \in X_k} g(x)$. Then, the representative of the expected loss due to $(i, j)$-mixups is defined as

$$\bar{\mathcal{L}}_{ij} = \ell_{\mathrm{CE}}\big(W(\lambda \bar{z}_i + (1-\lambda) \bar{z}_j),\ i\big).$$
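To make the $(i, j)$-mixup concrete, here is a minimal sketch (our own illustrative code, assuming the LinearProbeClassifier above): the mean features of classes $i$ and $j$ are mixed, and class $i$ is used as the target label, as described in the text.

```python
import torch
import torch.nn.functional as F

def ij_mixup_loss(model, feats_by_class, i: int, j: int, lam: float) -> torch.Tensor:
    """Representative (i, j)-mixup loss: mix the mean features of classes i and j
    and use class i as the target label."""
    z_i = feats_by_class[i].mean(dim=0)       # mean feature vector of class i, shape (d,)
    z_j = feats_by_class[j].mean(dim=0)       # mean feature vector of class j, shape (d,)
    z_mix = lam * z_i + (1.0 - lam) * z_j     # convex combination of features only
    logits = model.W(z_mix)                   # score of the mixed feature through the linear head
    target = torch.tensor(i)
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
```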
Gain due to mixup and approximated gain
If we perform an $(i, j)$-mixup, we update the weights $W$ of the linear layer along a directional vector $V_{ij}$ determined by the $(i, j)$-mixup loss. We consider the change of the metric $\psi$ due to this update of the weights. Let $\eta > 0$ be a small scalar. Then, by the Taylor expansion, we have

$$\psi(W + \eta V_{ij}) \approx \psi(W) + \eta \langle V_{ij}, \nabla_W \psi(W) \rangle,$$

where $\langle V_{ij}, \nabla_W \psi(W) \rangle$ denotes the directional derivative of $\psi$ along $V_{ij}$. We define the gain due to $(i, j)$-mixups by $G_{ij} = \psi(W + \eta V_{ij}) - \psi(W)$, i.e., $G_{ij}$ represents the change of the metric due to $(i, j)$-mixups.
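Conceptually, the gain is a one-step change of the metric evaluated on held-out data. The sketch below spells this out as a naive finite-difference estimate; it is only an illustration of the definition, not the approximation the paper actually uses (which is described next), and the helper metric_fn(model, val_loader) is a hypothetical function returning $\psi$ on a validation set.

```python
import copy
import torch

def gain_ij(model, metric_fn, val_loader, V_ij: torch.Tensor, eta: float) -> float:
    """G_ij = psi(W + eta * V_ij) - psi(W), evaluated on a validation set."""
    psi_before = metric_fn(model, val_loader)
    perturbed = copy.deepcopy(model)
    with torch.no_grad():
        perturbed.W.weight += eta * V_ij      # move the linear head along the (i, j)-mixup direction
    psi_after = metric_fn(perturbed, val_loader)
    return psi_after - psi_before
```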
Selective Mixup
Next, we explain how to select a pair $(i, j)$. We assume we select a mixup pair following a distribution $P = (P_{ij})_{1 \le i, j \le K}$ on $[K] \times [K]$, where $[K] = \{1, \dots, K\}$. Then, the expected change of the weights is given as $\mathbb{E}_{(i, j) \sim P}[V_{ij}] = \sum_{i, j} P_{ij} V_{ij}$, and the expected gain induced by the distribution $P$ is given as $\mathbb{E}_{(i, j) \sim P}[G_{ij}] = \sum_{i, j} P_{ij} G_{ij}$. To optimize the expected gain, we introduce a sampling strategy termed SelMix defined as

$$P_{ij} = \frac{\exp(\beta G_{ij})}{\sum_{k, l} \exp(\beta G_{kl})},$$

where $\beta > 0$ is an inverse temperature parameter.
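In code, the SelMix distribution is simply a softmax of the gain matrix with inverse temperature $\beta$. The sketch below is a minimal version; the numerical-stability shift and the flattening into pairs are our own implementation choices.

```python
import numpy as np

def selmix_distribution(gains: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """P_ij proportional to exp(beta * G_ij) over all (i, j) pairs; gains has shape (K, K)."""
    scaled = beta * gains
    scaled -= scaled.max()                    # shift for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

def sample_pair(probs: np.ndarray, rng: np.random.Generator) -> tuple[int, int]:
    """Draw a mixup pair (i, j) from the SelMix distribution."""
    K = probs.shape[0]
    idx = rng.choice(K * K, p=probs.flatten())
    return int(idx // K), int(idx % K)
```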
Estimation of Gains
To implement our algorithm, we need to compute the gains $G_{ij}$. However, the definition of the gains involves the confusion matrix $C[h]$, which may be non-smooth w.r.t. the weights $W$ (we assume the directional derivative of $\psi$ w.r.t. $W$ exists, or that the objective is defined through a smooth approximation of the confusion matrix). We assume there exists a matrix-valued function $\bar{C}$, called the unconstrained confusion matrix, such that $C_{kl}[h_W]$ is (approximately) equal to $\bar{C}_{kl}(W_l)$, where $W_l$ denotes the $l$-th row of the matrix $W$ and $1 \le k, l \le K$.
Then, we can approximately compute the gains $G_{ij}$ as follows (this is an informal statement; we refer to Sec. 4 and Sec. D of our paper for the precise statements):
Theorem 4.1 (informal). Assume $\eta$ is sufficiently small. Then, the gain $G_{ij}$ admits an approximation formula expressed in terms of the partial derivatives of $\psi$ with respect to the entries of the unconstrained confusion matrix $\bar{C}$ and the directional vector $V_{ij}$.
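The exact formula and its assumptions are given in Sec. 4 of the paper. To convey the general idea of working with a confusion matrix that is smooth in $W$, the sketch below estimates a softmax-based "soft" confusion matrix; this surrogate is our own illustrative stand-in, not the unconstrained confusion matrix constructed in the paper.

```python
import torch
import torch.nn.functional as F

def soft_confusion_matrix(model, x: torch.Tensor, y: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Differentiable surrogate of the confusion matrix:
    entry (k, l) accumulates the softmax probability of predicting l for instances with label k."""
    probs = F.softmax(model.W(model.g(x)), dim=-1)   # (N, K) predicted class probabilities
    onehot = F.one_hot(y, num_classes).float()       # (N, K) true (or pseudo) labels
    return onehot.t() @ probs / x.shape[0]           # (K, K), rows: true class, columns: predicted class

# Because every entry of this surrogate is differentiable in W, first-order (directional-derivative)
# approximations of the gain, in the spirit of Theorem 4.1, become well defined.
```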
Overview of Algorithm
In our algorithm, we perform $(i, j)$-mixups and update the weights $W$ using the gradient of the mixup loss. For each iteration $t$, the algorithm computes the SelMix distribution $P^{(t)}$ using the approximation formula and samples a mixup pair $(i, j)$ from the distribution. Then, instances are sampled uniformly from $X_i$ and $X_j$, and the weights are updated using the gradient of the mixup loss. We refer to our paper for the pseudo code.
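Putting the pieces together, a single fine-tuning iteration might look like the following high-level sketch, combining the illustrative helpers above. The optimizer is assumed to cover only the linear head, and batching and the gain estimation are simplified, so please see the paper's pseudo code for the actual algorithm.

```python
import torch
import torch.nn.functional as F

def selmix_finetune_step(model, optimizer, data_by_class, gains, beta, lam, rng):
    """One illustrative SelMix fine-tuning iteration (simplified)."""
    # 1. Build the SelMix distribution from the current gain estimates.
    probs = selmix_distribution(gains, beta)
    # 2. Sample a mixup pair (i, j) from the SelMix distribution.
    i, j = sample_pair(probs, rng)
    # 3. Sample instances uniformly from X_i and X_j and form the (i, j)-mixup loss.
    x_i = data_by_class[i][rng.integers(len(data_by_class[i]))]
    x_j = data_by_class[j][rng.integers(len(data_by_class[j]))]
    z_mix = lam * model.g(x_i.unsqueeze(0)) + (1.0 - lam) * model.g(x_j.unsqueeze(0))
    loss = F.cross_entropy(model.W(z_mix), torch.tensor([i]))
    # 4. Update the linear head with the gradient of the mixup loss.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```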
Theoretical Results
We briefly introduce a theoretical result (convergence analysis) that justifies our proposed method.
In the setting of our theoretical analysis, we assume that for each iteration $t$, our algorithm selects a mixup pair $(i, j)$ from the SelMix distribution $P^{(t)}$ and updates the weights by $W^{(t+1)} = W^{(t)} + \eta V^{(t)}$, where $V^{(t)}$ is the normalized directional vector $V_{ij} / \|V_{ij}\|$. We assume that $\psi$ is $L$-smooth and concave w.r.t. $W$ and that $V^{(t)}$ is a reasonable directional vector; more precisely, we assume there exists a constant $c > 0$ such that $\langle V^{(t)}, \nabla_W \psi(W^{(t)}) \rangle \ge c \|\nabla_W \psi(W^{(t)})\|$ holds for all $t$. Let $W^*$ be an optimal parameter $W^* \in \arg\max_W \psi(W)$. Then, under an additional boundedness assumption and further mild assumptions, a standard argument proves the following:
Theorem. For any $T \ge 1$, with a suitably chosen step size $\eta$, we have $\psi(W^*) - \mathbb{E}\big[\max_{1 \le t \le T} \psi(W^{(t)})\big] = \mathcal{O}(1/\sqrt{T})$.
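For intuition, such a bound follows from a standard first-order argument. The following is only a rough sketch of the key inequalities under the stated assumptions ($L$-smoothness, concavity, the directional condition with constant $c$, $\|V^{(t)}\| = 1$, and a boundedness assumption that we write here as $\|W^* - W^{(t)}\| \le R$, with $R$ our own notation); it is not the paper's exact proof.

```latex
% One-step improvement from L-smoothness of \psi and the directional condition:
\psi(W^{(t+1)}) \;\ge\; \psi(W^{(t)}) + \eta \,\langle \nabla_W \psi(W^{(t)}), V^{(t)} \rangle - \tfrac{L \eta^2}{2}
              \;\ge\; \psi(W^{(t)}) + \eta\, c\, \|\nabla_W \psi(W^{(t)})\| - \tfrac{L \eta^2}{2}.
% Concavity and boundedness relate the gradient norm to the suboptimality gap:
\psi(W^*) - \psi(W^{(t)}) \;\le\; \langle \nabla_W \psi(W^{(t)}),\, W^* - W^{(t)} \rangle
                          \;\le\; R\, \|\nabla_W \psi(W^{(t)})\|.
% Combining the two inequalities, telescoping over t = 1, ..., T, and choosing
% \eta \propto 1/\sqrt{T} yields a convergence rate of order 1/\sqrt{T}.
```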
Experimental Results
To show the effectiveness of our method, we conducted experiments on optimizing various non-decomposable objectives in both the semi-supervised and supervised learning settings. Here, we only present experimental results on a long-tailed version of the CIFAR-10 dataset (we refer to our paper for further experimental results).
Semi-Supervised Learning Setup
In the experiments, we consider optimization of the minimum coverage across classes, and the minimum (as well as the mean, the harmonic mean, and the geometric mean) of recalls across classes, which are important metrics for long-tailed datasets. The following figure shows a comparison of our method (SelMix) with SoTA semi-supervised methods: CSST [Rangwani et al., 2022], DASO [Oh et al., 2022], ABC [Lee et al., 2021], and FixMatch [Sohn et al., 2020] with the logit-adjusted loss. The results show that our method significantly improves over these SoTA methods.
Supervised Learning Setup
The following figure shows a comparison in the supervised setting. We compare our method with a SoTA method, MiSLAS [Zhong et al., 2021], which, similarly to SelMix, decouples the learning procedure into two stages: representation learning (i.e., learning of the feature extractor) and classifier learning (i.e., fine-tuning). In the experiments, we fine-tune the pre-trained model obtained by Stage 1 of MiSLAS. The experimental results also show a significant improvement over the SoTA method.
Conclusion
We proposed a novel fine-tuning method termed SelMix for the optimization of non-decomposable objectives, proved its validity theoretically, and demonstrated the effectiveness of the proposed method across various experimental settings. Our paper was accepted at ICLR 2024 as a spotlight presentation, and we presented this joint work at the conference.
References
- Harikrishna Narasimhan and Aditya K Menon. Training over-parameterized models with non-decomposable objectives. Advances in Neural Information Processing Systems, 34, 2021.
- Harikrishna Narasimhan, Harish G Ramaswamy, Shiv Kumar Tavker, Drona Khurana, Praneeth Netrapalli, and Shivani Agarwal. Consistent multiclass algorithms for complex metrics and constraints. arXiv preprint arXiv:2210.09695, 2022.
- Hyuck Lee, Seungjae Shin, and Heeyoung Kim. ABC: Auxiliary balanced classifier for class-imbalanced semi-supervised learning. Advances in Neural Information Processing Systems, 34:7082–7094, 2021.
- Youngtaek Oh, Dong-Jin Kim, and In So Kweon. Daso: Distribution-aware semantics-oriented pseudo-label for imbalanced semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9786–9796, 2022.
- Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. FixMatch: Simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems, 33:596–608, 2020.
- Harsh Rangwani, Shrinivas Ramasubramanian, Sho Takemori, Kato Takashi, Yuhei Umeda, and Venkatesh Babu Radhakrishnan. Cost-sensitive self-training for optimizing non-decomposable metrics. Advances in Neural Information Processing Systems, 2022.
- Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, pp. 6438–6447. PMLR, 2019.
- Linjun Zhang, Zhun Deng, Kenji Kawaguchi, Amirata Ghorbani, and James Zou. How does mixup help with robustness and generalization? In International Conference on Learning Representations, 2021.
- Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
- Zhisheng Zhong, Jiequan Cui, Shu Liu, and Jiaya Jia. Improving calibration for long-tailed recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16489–16498, 2021.