Introduction
Hello! I'm Suzuki from the Artificial Intelligence Laboratory. We are conducting R&D on "Decision Making via Causal Discovery" and working on real-world decision-making tasks such as improving employee productivity based on engagement survey data. We are pleased to announce that our research paper on "Fast Causal Discovery on Linear Non-Gaussian Acyclic Model" was accepted at ECML-PKDD 2024, a major international conference in the field of machine learning and data mining. This article briefly introduces the contents of the paper.
Paper
- Title: LayeredLiNGAM: A Practical and Fast Method for Learning a Linear Non-Gaussian Structural Equation Model
- Authors: Hirofumi Suzuki (Fujitsu Limited)
- Conference: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD) 2024
Overview
TL;DR
In this paper, we focused on DirectLiNGAM, a representative causal discovery algorithm for the Linear Non-Gaussian Acyclic Model (LiNGAM), and proposed LayeredLiNGAM, a faster algorithm obtained by generalizing the theoretical framework of DirectLiNGAM.
Background
Causal discovery analyzes the causal relationships among variables from a given dataset consisting of multiple observations and variables. For example, given a dataset with three variables: precipitation, visitors, and sales, we can obtain causal relationships such as "more precipitation leads to fewer visitors" (causality from precipitation to visitors) and "more visitors lead to more sales" (causality from visitors to sales). Multiple causal relationships are often summarized as a causal graph. Given a causal graph, we can see how intervening in one variable (arbitrarily determining its value) affects other variables. This can be applied to decision making, such as performing causal discovery on employee productivity and multiple related variables, then analyzing the effects of interventions, and developing plans to improve employee productivity.
Causal graphs are often modeled with a structural equation model (SEM) and output as a directed acyclic graph (DAG). In the example above, we would expect to get a simple DAG such as "precipitation → visitors → sales", but in general, more complex DAGs can be obtained as the number of variables increases. LiNGAM is one of the most fundamental types of SEM and has important properties [1]. DirectLiNGAM is a representative causal discovery algorithm for LiNGAM [2], and it is widely used through the Python package lingam [3].
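As a concrete starting point, here is a minimal sketch of running DirectLiNGAM through the lingam package; the toy data below is only a stand-in for real observations.

```python
import numpy as np
import lingam

# Toy stand-in data: 1000 observations of 3 variables
# (replace with your own observation matrix in practice).
rng = np.random.default_rng(0)
X = rng.laplace(size=(1000, 3))

model = lingam.DirectLiNGAM()
model.fit(X)

print(model.causal_order_)      # estimated causal order of the variables
print(model.adjacency_matrix_)  # estimated weighted adjacency matrix
```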
LiNGAM
We denote the data of $d$ variables as a $d$-dimensional vector $x = (x_1, \dots, x_d)^\top$ and a weighted adjacency matrix representing a DAG of $d$ nodes as $B = (b_{ij})$. LiNGAM is a data generation process formulated by the following equation:

$$x = Bx + e,$$

where $e = (e_1, \dots, e_d)^\top$ is a random noise vector such that each $e_i$ follows a non-Gaussian distribution of zero mean. That is, LiNGAM represents each variable as a linear sum of the other variables and a non-Gaussian noise term.

Since $B$ represents a DAG, the variables can be ordered according to their precedence. We call this a causal order and denote it as $k(\cdot)$, where $k(i)$ is the order of a variable $x_i$: if $k(i) < k(j)$, then $x_i$ precedes $x_j$. By using the causal order, we can reformulate LiNGAM as follows:

$$x_i = \sum_{j:\, k(j) < k(i)} b_{ij} x_j + e_i.$$
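To make this generative view concrete, here is a minimal sketch that samples data from a small LiNGAM with the notation above: $B$ is strictly lower triangular so that the variable indices already follow a causal order, and the noise is Laplace, hence non-Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3

# Strictly lower-triangular B encodes the DAG x_1 -> x_2 -> x_3.
B = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],
              [0.0, 0.6, 0.0]])

# Non-Gaussian noise with zero mean (Laplace distribution).
E = rng.laplace(loc=0.0, scale=1.0, size=(n, d))

# x = Bx + e  <=>  x = (I - B)^{-1} e, applied to every observation.
X = E @ np.linalg.inv(np.eye(d) - B).T
```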
Variables that can be at the beginning of a causal order are specifically called exogenous variables and play an important role in the causal discovery algorithm described below.
LiNGAM has a very important property in practice called identifiability. This property ensures that "different model parameters always produce different data". In other words, the parameters of the model are uniquely determined for a given dataset. As a result, we obtain a unique estimate on a dataset satisfying the LiNGAM assumptions.
DirectLiNGAM
Causal discovery by DirectLiNGAM consists of two steps: (1) estimating a causal order $k(\cdot)$ and (2) estimating an adjacency matrix $B$. The causal order estimation is particularly important because once $k(\cdot)$ is determined, $B$ can be easily estimated by sparse regression according to the LiNGAM formula. There is a theoretical framework consisting of two lemmas and one corollary, listed below.
Lemma 1 ([2]). Let $r_i^{(j)}$ be the residual of regressing $x_i$ on $x_j$. Then, $x_j$ is an exogenous variable if and only if $x_j$ is independent of all residuals $r_i^{(j)}$ ($i \neq j$).

Lemma 2 ([2]). Let $r^{(j)}$ be the vector collecting the residuals $r_i^{(j)}$ ($i \neq j$) by an exogenous variable $x_j$. Then, $r^{(j)}$ follows LiNGAM.

Corollary 1 ([2]). For an exogenous variable $x_j$, $r^{(j)}$ preserves the original causal order.
First, since Lemma 1 indicates that we can identify exogenous variables, we can arbitrarily select one exogenous variable $x_j$. Then, if we remove the influence of the selected exogenous variable from the data, Lemma 1 can be applied recursively, since the residuals $r^{(j)}$ follow LiNGAM by Lemma 2. Repeating these operations eventually removes all variables. Now, because of Corollary 1, the order in which the variables are removed is equal to a causal order of the original data. This is the theoretical framework of the causal order estimation in DirectLiNGAM.
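A minimal sketch of this loop follows. Two caveats: the residuals come from plain pairwise least squares, and the dependence score is a crude nonlinear proxy I use only for illustration; DirectLiNGAM itself uses a pairwise likelihood-ratio measure [5].

```python
import numpy as np

def _residual(xi, xj):
    """Residual of regressing xi on xj (simple least squares)."""
    xj_c = xj - xj.mean()
    beta = np.dot(xi - xi.mean(), xj_c) / np.dot(xj_c, xj_c)
    return xi - beta * xj

def _dependence(u, v):
    """Crude nonlinear dependence proxy; a toy stand-in for the pairwise
    likelihood-ratio measure used by DirectLiNGAM in practice."""
    z = lambda a: np.tanh((a - a.mean()) / a.std())
    return abs(np.corrcoef(z(u), v)[0, 1]) + abs(np.corrcoef(u, z(v))[0, 1])

def direct_lingam_order(X):
    """Estimate a causal order by repeatedly removing the variable that
    looks most exogenous, i.e., most independent of its residuals (Lemma 1)."""
    X = X.copy()
    remaining = list(range(X.shape[1]))
    order = []
    while remaining:
        scores = {j: sum(_dependence(X[:, j], _residual(X[:, i], X[:, j]))
                         for i in remaining if i != j)
                  for j in remaining}
        j = min(scores, key=scores.get)  # most plausibly exogenous variable
        order.append(j)
        for i in remaining:
            if i != j:
                # Remove the influence of x_j from the data (Lemma 2).
                X[:, i] = _residual(X[:, i], X[:, j])
        remaining.remove(j)
    return order
```

On data from the generation sketch above, `direct_lingam_order(X)` should typically recover the order `[0, 1, 2]`, though the toy dependence proxy is far less reliable than the measure used in the paper.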
Contributions
In this paper, we showed that there is room for improvement in the computational speed of DirectLiNGAM and, by generalizing its theoretical framework, proposed LayeredLiNGAM, an algorithm that is faster in practice. The specific contributions can be summarized as follows.
- Focusing on the causal order estimation algorithm in DirectLiNGAM, we clarified its issues by analyzing the theoretical complexity and measuring the actual behavior in computer experiments.
- Using the concept of layers, we generalized the theoretical framework of DirectLiNGAM and derived LayeredLiNGAM. We have also clarified its advantages over DirectLiNGAM.
- We conducted experiments verifying that LayeredLiNGAM runs faster than DirectLiNGAM without accuracy degradation. Specifically, we observed speed-ups of up to 25 times on sparse random graphs and up to 5 times on real-world datasets.
Computational Complexity and Actual Behavior of DirectLiNGAM
Let's discuss the computational complexity of DirectLiNGAM, where the number of input observations is $n$. Causal order estimation removes exogenous variables one by one until all variables are removed, so the process of selecting an exogenous variable is repeated $d$ times. Selecting an exogenous variable requires a simple regression and an independence test on each pair of variables according to Lemma 1. The simple regression on a pair of variables takes $O(n)$ time. There are various methods for the independence test, but if we denote the computational complexity of the independence test for one pair of variables as $T$, it holds that $T = \Omega(n)$. Thus, selecting an exogenous variable takes $O(d^2 T)$ time, and therefore the causal order estimation takes $O(d^3 T)$ time. Estimating the adjacency matrix can be done in $O(nd^3)$ time by utilizing a fast sparse regression algorithm such as LARS [4]. Therefore, DirectLiNGAM takes $O(d^3 T)$ time overall.
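As a side note on step (2), here is one possible sketch of the adjacency matrix estimation given a causal order, using scikit-learn's LARS-based lasso; LassoLarsIC with the BIC criterion is my choice for illustration, not necessarily the exact regressor used in the paper or the lingam package.

```python
import numpy as np
from sklearn.linear_model import LassoLarsIC

def estimate_adjacency(X, order):
    """Fill in B row by row: each variable is sparsely regressed on
    all variables that precede it in the estimated causal order."""
    d = X.shape[1]
    B = np.zeros((d, d))
    for pos, i in enumerate(order):
        preds = order[:pos]      # variables preceding x_i in the order
        if not preds:
            continue             # exogenous variable: no parents
        reg = LassoLarsIC(criterion='bic')
        reg.fit(X[:, preds], X[:, i])
        B[i, preds] = reg.coef_
    return B
```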
From the discussion above, it is easy to see that the bottleneck in DirectLiNGAM is the causal order estimation. However, it has been shown that the independence test can achieve $T = O(n)$ using a technique called the maximum entropy approximation [5], under which DirectLiNGAM works in $O(nd^3)$ time. So, does the maximum entropy approximation solve the bottleneck of DirectLiNGAM? Actually, it does NOT. Our preliminary experimental results show that the bottleneck in DirectLiNGAM is still the causal order estimation, suggesting the presence of overhead in the maximum entropy approximation. This motivates us to improve the $O(d^3 T)$ causal order estimation.
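For reference, the maximum entropy approximation evaluates a differential entropy from two sample moments in $O(n)$ time. A sketch of the approximation for a standardized variable follows; the constants come from Hyvärinen's approximation as used in the lingam implementation, and the exact form should be treated as an assumption here.

```python
import numpy as np

def approx_entropy(u):
    """Maximum-entropy approximation of the differential entropy of a
    standardized (zero-mean, unit-variance) sample u, in O(n) time."""
    k1, k2, gamma = 79.047, 7.4129, 0.37457
    return ((1 + np.log(2 * np.pi)) / 2
            - k1 * (np.mean(np.log(np.cosh(u))) - gamma) ** 2
            - k2 * np.mean(u * np.exp(-u ** 2 / 2)) ** 2)
```

Independence between a variable and a residual can then be scored through differences of such entropies, which is what makes each pairwise test $O(n)$.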
LayeredLiNGAM
To improve the computational complexity of DirectLiNGAM, we consider reducing the number of loops, which is currently $d$. To do this, we need to order multiple variables together in a single loop. Therefore, we take advantage of the fact that DAGs can be decomposed into layers. Now, we consider a new ordering $l(\cdot)$ instead of $k(\cdot)$. $l(\cdot)$ allows multiple variables to have the same value, and a set of variables having the same value is regarded as a layer. We call $l(\cdot)$ a layered causal order when the following LiNGAM formula holds:

$$x_i = \sum_{j:\, l(j) < l(i)} b_{ij} x_j + e_i.$$
As can be seen from this formula, the layered causal order is a generalization of the causal order: a causal order is the special case in which every layer contains exactly one variable.
In this paper, we proposed an algorithm called LayeredLiNGAM that replaces the causal order estimation in DirectLiNGAM with a layered causal order estimation. LayeredLiNGAM is based on a new theoretical framework that generalizes Lemma 2 and Corollary 1.
Lemma 3. Let $r_i^{(S)}$ be the residual of the multiple regression of $x_i$ on a set of variables $S$. Then, if $S$ is a set of exogenous variables, $r^{(S)}$ follows LiNGAM.

Corollary 2. For a set of exogenous variables $S$, $r^{(S)}$ preserves the original layered causal order.
First, Lemma 1 used in DirectLiNGAM can be used not only to select one exogenous variable, but also to select a set of exogenous variables (a layer) $S$. Then, if we remove the influence of $S$ from the data, the residual vector $r^{(S)}$ follows LiNGAM by Lemma 3, so we can apply Lemma 1 recursively. Repeating these operations eventually removes all variables. Now, because of Corollary 2, the order in which the layers are removed is equal to a layered causal order of the original data. This is the theoretical framework of the layered causal order estimation in LayeredLiNGAM.
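A minimal sketch of the layered loop follows, reusing the toy helpers `_residual` and `_dependence` from the DirectLiNGAM sketch above. The threshold-based layer selection here is an assumption for illustration; the paper defines its own selection criterion with a hyperparameter.

```python
import numpy as np

# Reuses _residual and _dependence from the DirectLiNGAM sketch above.

def layered_lingam_order(X, threshold=0.05):
    """Estimate a layered causal order: peel off a whole layer of
    (apparently) exogenous variables per loop instead of a single one."""
    X = X.copy()
    remaining = list(range(X.shape[1]))
    layers = []
    while remaining:
        # Score candidates exactly as in DirectLiNGAM (Lemma 1) ...
        scores = {j: sum(_dependence(X[:, j], _residual(X[:, i], X[:, j]))
                         for i in remaining if i != j)
                  for j in remaining}
        best = min(scores.values())
        # ... but accept every variable scoring close to the best as one
        # layer (toy rule; the paper's criterion differs).
        layer = [j for j in remaining if scores[j] <= best + threshold]
        layers.append(layer)
        remaining = [i for i in remaining if i not in layer]
        if remaining:
            # Replace the rest by multiple-regression residuals on the
            # whole layer at once (Lemma 3).
            S = X[:, layer]
            coef, *_ = np.linalg.lstsq(S, X[:, remaining], rcond=None)
            X[:, remaining] -= S @ coef
    return layers
```

Each loop removes an entire layer, so the number of iterations drops from $d$ to the number of layers $L$.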
Let's discuss the computational complexity of LayeredLiNGAM. The differences of LayeredLiNGAM from DirectLiNGAM are the number of loops needed to select the exogenous variables, and that the residuals for updating the data are obtained by multiple regression. The identification of exogenous variables in each loop is equivalent in DirectLiNGAM and LayeredLiNGAM and takes $O(d^2 T)$ time. The multiple regression can be calculated using the least squares method in $O(nd^2 + d^3)$ time. Thus, given $L$ as the number of layers found by LayeredLiNGAM, estimating a layered causal order takes $O(L(d^2 T + nd^2 + d^3))$ time. Note that $n \geq d$ is generally preferable because LiNGAM involves regression problems, and that the number of layers is always less than or equal to the number of variables ($L \leq d$). It turns out that the computational complexity of estimating a layered causal order in LayeredLiNGAM is $O(L d^2 T)$ under the typical setting $n \geq d$, which is faster than the $O(d^3 T)$ causal order estimation in DirectLiNGAM whenever $L < d$.
Experimental Results
In computational experiments, we compared the performance of LayeredLiNGAM with DirectLiNGAM and the LiNGAM estimation algorithm by Zhao et al. [6], using synthetic datasets based on three types of sparse random graphs and two real-world datasets on gene expression and drug design. Here, we show experimental results on the synthetic datasets and briefly describe the behavior of LayeredLiNGAM. See the paper for the experiments on real-world datasets.
Each synthetic dataset is generated as follows: after randomly generating a DAG, we randomly set each non-zero entry $b_{ij}$ of $B$. Each random noise $e_i$ was generated from a Laplace distribution with mean 0 and variance 1, and data were sampled based on the LiNGAM formula. We used the following three random DAG generation methods (a generation sketch follows the list):
- RG4: We created an undirected random graph with an average degree of 4, i.e., a random graph whose number of edges is proportional to the number of nodes, and then converted it into a DAG by randomly ordering the nodes.
- BA4: We created an undirected graph based on the Barabási–Albert model with the model parameter set to 4, and then converted it into a DAG by randomly ordering the nodes.
- LA128: A DAG with a fixed number of nodes $d = 128$ and a varying number of layers $L$, with each layer having exactly $d/L$ nodes; each node is connected to one random node in each of the layers directly above and below it.
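As an illustration, here is a minimal sketch of RG4-style generation; the edge-sampling details and the weight range $[0.5, 1.0]$ (with random sign) are assumptions for illustration, and the paper specifies the exact settings.

```python
import numpy as np

def generate_rg4_dataset(d, n, avg_degree=4, seed=0):
    """Random DAG with a given average degree, then n LiNGAM samples
    with Laplace noise of zero mean and unit variance."""
    rng = np.random.default_rng(seed)
    # Random undirected graph with d * avg_degree / 2 edges, oriented
    # into a DAG by a random node ordering (edges point from earlier
    # to later nodes in the ordering).
    rank = rng.permutation(d)
    B = np.zeros((d, d))
    edges = set()
    while len(edges) < d * avg_degree // 2:
        i, j = rng.integers(0, d, size=2)
        if i != j:
            edges.add((min(i, j), max(i, j)))
    for a, b in edges:
        src, dst = sorted((a, b), key=lambda v: rank[v])
        B[dst, src] = rng.uniform(0.5, 1.0) * rng.choice([-1, 1])  # assumed range
    # Laplace with scale 1/sqrt(2) has variance 1.
    E = rng.laplace(scale=1 / np.sqrt(2), size=(n, d))
    X = E @ np.linalg.inv(np.eye(d) - B).T
    return X, B
```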
Figure 4 shows the results. For RG4 and BA4, we observed the behavior as the number of nodes changed. For LA128, we observed the behavior as the number of layers changed. We generated 10 datasets for each experimental parameter and plotted the mean and standard error of computation times and accuracies for each method.
For the causal order estimation, LayeredLiNGAM was up to 50 times faster than DirectLiNGAM on RG4 and BA4, and its speed was close to that of the adjacency matrix estimation. As a result, LayeredLiNGAM achieved up to 25 times faster computation for overall causal discovery on RG4 and BA4. For LA128, we observed that LayeredLiNGAM is faster when the DAG has fewer layers, and is comparable to DirectLiNGAM when $L = d$; that is, LayeredLiNGAM behaved faithfully to its computational complexity. We also observed no undesirable accuracy degradation of LayeredLiNGAM compared to DirectLiNGAM. From the above, we confirmed that LayeredLiNGAM can perform causal discovery on LiNGAM faster than DirectLiNGAM without accuracy degradation.
Conclusion
This article introduced the research on "Fast Causal Discovery on Linear Non-Gaussian Acyclic Model" accepted at ECML-PKDD 2024. Some contents, such as the proofs, a hyperparameter of LayeredLiNGAM, and the experiments on real-world datasets, are not covered here; please refer to the paper for details. As future work, it would be interesting to consider ways to make the output DAG of LayeredLiNGAM closer to that of DirectLiNGAM, or ways to improve the accuracy, while maintaining the speed of LayeredLiNGAM.
References
- [1] Shimizu et al., "A Linear Non-Gaussian Acyclic Model for Causal Discovery", Journal of Machine Learning Research 7, Pages: 2003–2030, 2006
- [2] Shimizu et al., "DirectLiNGAM: A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model", Journal of Machine Learning Research 12, Pages: 1225–1248, 2011
- [3] Ikeuchi et al., "Python package for causal discovery based on LiNGAM", Journal of Machine Learning Research 24, Pages: 14:1–14:8, 2023
- [4] Efron et al., "Least Angle Regression", The Annals of Statistics 32(2), Pages: 407–499, 2004
- [5] Hyvärinen et al., "Pairwise Likelihood Ratios for Estimation of Non-Gaussian Structural Equation Models", Journal of Machine Learning Research 14, Pages: 111–152, 2013
- [6] Zhao et al., "Learning linear non-Gaussian directed acyclic graph with diverging number of nodes", Journal of Machine Learning Research 23, Pages: 269:1–269:34, 2021