
fltech - Fujitsu Laboratories Technical Blog

A technical blog where researchers at Fujitsu Laboratories write about a variety of topics

"Exploring the Whole Rashomon Set of Sparse Decision Trees" presented on NeurIPS 2022

We are conducting research and development on "AI-based Knowledge Discovery". Recently, we started a joint research project with [Duke University's Interpretable Machine Learning Lab](https://users.cs.duke.edu/~cynthia/lab.html), and one result of this collaboration, "Exploring the Whole Rashomon Set of Sparse Decision Trees," has been accepted at NeurIPS 2022. This article briefly introduces the contents of the paper.

Paper

Overview

Knowledge discovery methods based on prediction models build human-interpretable white-box models (e.g., decision trees, rule sets, and Lasso models). Existing research aims to output a single model that minimizes an objective function. However, as Leo Breiman pointed out, there are often multiple models that explain the data equally well, a phenomenon known as the Rashomon Effect [1]. It is named after Akira Kurosawa's film "Rashomon," in which several accounts of the same event contradict one another. The Rashomon Effect matters for knowledge discovery in machine learning as well: there is not necessarily a single optimal model that explains the data, and by building a variety of good models we can uncover knowledge that a single model would miss.

The goal of this paper is to construct the Rashomon set, the set of all prediction models whose accuracy is nearly equal to that of the optimal model. Once the Rashomon set is available, one can specify properties such as interpretability, fairness, or whether a particular feature is used, and obtain the list of models that satisfy them. One can also select models in the Rashomon set that match one's own expertise, or conversely select models with non-trivial explanations for knowledge discovery. Outputting multiple models (the Rashomon set), rather than a single prediction model, will become increasingly important for explainability and knowledge discovery.

However, constructing Rashomon sets for nonlinear models is very difficult because of the computational complexity involved. Consider, for example, the sparse decision trees studied in this paper. Even for trees of depth 4 over only 10 binary features, the number of possible trees (the size of the hypothesis space) is  9.338 \times 10^{20} or more. It is therefore infeasible to construct the Rashomon set by building all of these trees, measuring their accuracy, and deciding one by one whether to keep them. Until now, no previous work had successfully constructed Rashomon sets for any class of nonlinear models, decision trees included.

The hypothesis space of sparse decision trees is huge, but the Rashomon set, the set of good decision trees, is a small subset of it. In fact, even when the entire hypothesis space is far too large to enumerate, the Rashomon set itself may be small enough to store. Therefore, instead of enumerating all sparse decision trees and keeping only those in the Rashomon set, we use analytical bounds to safely prune away trees that cannot be members of the Rashomon set. This lets us focus only on the part of the space that actually contains members of the Rashomon set. In addition, the Rashomon set is stored in a newly proposed data structure that is memory-efficient and gives easy access to its members. This combination of pruning and an efficient representation yields the first algorithm that completely enumerates the Rashomon set of sparse decision trees on real-world datasets.

The Rashomon set also gives rise to new metrics and questions: How big is the Rashomon set? Does its size depend on the dataset? How does the importance of a variable vary across the trees in the Rashomon set? Answering such questions can reveal how important a variable is not just to one model, but to the dataset as a whole.

Definition of Sparse Decision Tree and Rashomon Set

First, we define the sparse decision trees addressed in this paper. We denote the training dataset as  \{(\textbf{x}_i, y_i)\}_{i=1}^n, where  \textbf{x}_i \in \{0,1\}^p are binary features and  y_i \in \{0,1\} are binary labels; the method generalizes to multiclass classification as well. Let  \ell(t, \textbf{x}, \textbf{y}) = \frac{1}{n}\sum_{i=1}^n\mathbf{1}[\hat{y}_i \neq y_i] be the misclassification loss of tree  t on the training set. We define the objective function as the combination of this loss and a sparsity penalty on the number of leaves:  Obj(t, \textbf{x}, \textbf{y}) = \ell(t, \textbf{x}, \textbf{y}) + \lambda H_t, where  H_t is the number of leaves in tree  t. A decision tree optimized under this objective is called a sparse decision tree [2].
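To make the objective concrete, here is a minimal sketch of how it could be computed; `predict` and `num_leaves` are hypothetical stand-ins for a tree's prediction function and leaf count, not the interface of the paper's implementation.

```python
import numpy as np

def objective(predict, num_leaves, X, y, lam=0.01):
    """Sparse-decision-tree objective: misclassification loss + lambda * number of leaves."""
    y_hat = predict(X)                 # predicted binary labels, shape (n,)
    loss = np.mean(y_hat != y)         # (1/n) * sum_i 1[y_hat_i != y_i]
    return loss + lam * num_leaves     # Obj(t, x, y) = l(t, x, y) + lambda * H_t
```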

Next, we define the Rashomon set for this decision tree.

ε-Rashomon Set

Let  t_{\textrm{ref}} be a benchmark (reference) model from  \mathcal{T}, where  \mathcal{T} is the set of binary decision trees. The  \epsilon-Rashomon set  R_{\textrm{set}}(\epsilon, t_{\textrm{ref}}, \mathcal{T}) is the set of all trees  t\in \mathcal{T} with objective value at most  (1 + \epsilon) \times Obj(t_{\textrm{ref}},\textbf{x},\textbf{y}):

 \displaystyle
R_{\textrm{set}}(\epsilon, t_{\textrm{ref}}, \mathcal{T}) := \{ t \in \mathcal{T} : Obj(t,\textbf{x},\textbf{y}) \leq (1+\epsilon) \times Obj(t_{\textrm{ref}},\textbf{x},\textbf{y}) \}

In the following, the benchmark model is assumed to be the optimal model that minimizes the objective function for the training data.
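Spelled out in code, the definition reduces to a simple threshold test. The following sketch assumes the objective values have already been computed; the function names are illustrative only.

```python
def rashomon_threshold(obj_ref, epsilon):
    """theta_epsilon = (1 + epsilon) * Obj(t_ref, x, y)."""
    return (1.0 + epsilon) * obj_ref

def in_rashomon_set(obj_t, obj_ref, epsilon):
    """A tree t belongs to the epsilon-Rashomon set iff Obj(t) <= (1 + epsilon) * Obj(t_ref)."""
    return obj_t <= rashomon_threshold(obj_ref, epsilon)

# Example: with epsilon = 0.1, any tree whose objective is within 10% of the
# optimal tree's objective is a member of the Rashomon set.
```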

Branch-and-Bound Technique

We prune branches to reduce the search space explored while constructing the Rashomon set. When growing a decision tree in dynamic programming, some of its leaves are already determined (fixed), while others are not yet determined (unfixed). We optimistically assume that all unfixed leaves of the tree classify their data points perfectly, which yields a lower bound on the objective function. This lower bound is compared with the objective-value threshold of the Rashomon set,  \theta_{\epsilon}. If the lower bound exceeds  \theta_{\epsilon}, that is, if the performance is already too poor, then no matter how the decision tree is extended it can never become a member of the Rashomon set. Extending such a tree is therefore pointless, and that part of the search space can be discarded.

This is shown below as a theorem.

First, each subtree  t is represented by the following five variables: let  t_{fix} be the set of fixed leaves that will not be extended,  \delta_{fix} the predicted labels of the data points in the fixed leaves,  t_{split} the set of unfixed leaves that may still be split (with labels  \delta_{split}), and  H_t the number of leaves in the current tree. The current position in the search space can then be represented by the subtree  t = (t_{fix}, \delta_{fix}, t_{split}, \delta_{split}, H_t).

Theorem: Basic Rashomon Lower Bound

Let the threshold of the Rashomon set be  \theta_{\epsilon}. Let  t be a decision tree with leaves  t_{fix}, t_{split}, and let  b(t_{fix}, \textbf{x}, \textbf{y}) := \ell(t_{fix}, \textbf{x}, \textbf{y}) + \lambda H_t. If  b(t_{fix}, \textbf{x}, \textbf{y}) > \theta_{\epsilon}, then  t and all of its children (obtained by extending its leaves) are not members of the  \epsilon-Rashomon set.

This theorem could use a tighter lower bound, but we omit it here. See the paper for details.
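The following sketch shows how such a pruning test might look in code. The 5-tuple representation follows the paper, but encoding each leaf as a boolean mask over the training rows, as well as the function names, are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class Subtree:
    """A point in the search space: (t_fix, delta_fix, t_split, delta_split, H_t)."""
    fixed_masks: Tuple[np.ndarray, ...]   # t_fix: one boolean mask per fixed leaf
    fixed_labels: Tuple[int, ...]         # delta_fix: predicted label of each fixed leaf
    split_masks: Tuple[np.ndarray, ...]   # t_split: unfixed leaves that may still be split
    split_labels: Tuple[int, ...]         # delta_split: their current labels
    num_leaves: int                       # H_t

def basic_rashomon_lower_bound(sub: Subtree, y: np.ndarray, lam: float) -> float:
    """b(t_fix, x, y) = loss on the fixed leaves alone + lambda * H_t.
    Unfixed leaves are optimistically assumed to classify their points perfectly."""
    n = len(y)
    fixed_errors = sum(np.sum(y[mask] != label)
                       for mask, label in zip(sub.fixed_masks, sub.fixed_labels))
    return fixed_errors / n + lam * sub.num_leaves

def can_prune(sub: Subtree, y: np.ndarray, lam: float, theta_eps: float) -> bool:
    """If the lower bound already exceeds theta_epsilon, neither this subtree nor any
    extension of it can enter the epsilon-Rashomon set, so the whole branch is skipped."""
    return basic_rashomon_lower_bound(sub, y, lam) > theta_eps
```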

Experimental Results

Comparison with other enumeration methods

First, we compare our method, which can compute the Rashomon set exactly, with methods that construct multiple decision trees, such as random forests.

Figure 2: Decision Tree Enumeration

TreeFARMS is our proposed method. The vertical axis shows the objective values of the enumerated decision trees, and the horizontal axis shows the trees sorted in decreasing order of objective value. The black dashed line is the threshold of the Rashomon set: here we enumerate the decision trees whose objective value is at most 1.1 times that of the optimal decision tree (i.e.,  \epsilon = 0.1). The fraction in parentheses after each method name gives, in the denominator, the number of decision trees the method enumerated and, in the numerator, how many of them belong to the Rashomon set.

While the proposed method constructs the Rashomon set exactly, the baselines fare poorly. For example, even the decision tree built by Random Forest (RF) that achieves the best objective value is not in the Rashomon set (i.e., it is substantially less accurate than the optimal tree). The other methods also miss a significant number of decision trees that are members of the Rashomon set.

Feature Importance with Rashomon Sets

By constructing the Rashomon set, we can define a new feature-importance measure for a dataset, Model Class Reliance (MCR); simply put, MCR provides a range of feature importance. In previous work, MCR could be computed only for convex losses in linear models [3]. For decision trees the problem is nonconvex and could not be solved by previous methods. Using the results of this paper, however, MCR can be computed by evaluating the importance for every tree in the Rashomon set and taking the minimum and maximum values.
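The computation described above can be sketched as follows; `importance_fn` is a hypothetical callable (for instance, a permutation-importance routine), not the exact procedure used in the paper.

```python
def model_class_reliance(importance_fn, rashomon_trees, feature):
    """MCR over an already-constructed Rashomon set: evaluate a per-model importance
    for the given feature on every tree and report the [min, max] range."""
    scores = [importance_fn(tree, feature) for tree in rashomon_trees]
    return min(scores), max(scores)   # (MCR-, MCR+): the range of reliance on the feature
```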

MCR for the Bar data set

This dataset concerns predicting whether a customer will accept a coupon for a bar. The red dots are the importance values given by the optimal decision tree. For example, the feature childrennumber_0 (the number of children is 0) is not considered important by the optimal decision tree, but some other decision trees of nearly equal accuracy (members of the Rashomon set) do consider it somewhat important. By using the Rashomon set, we can thus expect to find important features that conventional methods would have missed.

References

  • [1]: Leo Breiman. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3):199–231, 2001.
  • [2]: Jimmy Lin, Chudi Zhong, Diane Hu, Cynthia Rudin, and Margo Seltzer. Generalized and scalable optimal sparse decision trees. In International Conference on Machine Learning, pages 6150–6160. PMLR, 2020.
  • [3]: Aaron Fisher, Cynthia Rudin, and Francesca Dominici. All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously. Journal of Machine Learning Research, 20(177):1–81, 2019.