This vignette is a modification of (Tikka, Hyttinen, and Karvanen 2021).

A causal effect is defined as the distribution \(P(\mathbf{Y} { \, | \, }\textrm{do}(\mathbf{X}),\mathbf{Z})\) where variables \(\mathbf{Y}\) are observed, variables \(\mathbf{X}\) are intervened upon (forced to values irrespective of their natural causes) and variables \(\mathbf{Z}\) are conditioned on. Instead of placing various parametric restrictions based on background knowledge, we are interested in this paper in the question of identifiability: can the causal effect be uniquely determined from the distributions (data) we have and a graph representing our structural knowledge on the generating causal system.

In the most basic setting we are identifying causal effects from a
single observational input distribution, corresponding to passively
observed data. To solve such problems more generally than what is
possible with the back-door adjustment (Spirtes, Glymour, and Scheines 1993; Pearl 2009; Greenland, Robins, and Pearl 1999),
Pearl (1995) introduced
**do-calculus**, a set of three rules that together with
probability theory enable the manipulation of interventional
distributions. Shpitser and Pearl (2006) and Huang
and Valtorta (2006)
showed that do-calculus is complete by presenting polynomial-time
algorithms whose each step can be seen as a rule of do-calculus or as an
operation based on basic probability theory. The algorithms have a high
practical value because the rules of do-calculus do not by themselves
provide an indication on the order in which they should be applied. The
algorithms save us from manual application of do-calculus, which is a
tedious task in all but the simplest problems.

Since then many extensions of the basic identifiability problem have appeared. In identifiability using surrogate experiments (Bareinboim and Pearl 2012), or \(z\)-identifiability, an experimental distribution is available in addition to the observed probability distribution. For data observed in the presence of selection bias, both algorithmic and graphical identifiability results have been derived (Bareinboim and Tian 2015; Correa, Tian, and Bareinboim 2018). More generally, the presence of missing data necessitates the representation of the missingness mechanism, which poses additional challenges (Mohan, Pearl, and Tian 2013; Shpitser, Mohan, and Pearl 2015). Another dimension of complexity is the number of available data sources. Identification from a mixture of observational and interventional distributions that originate from multiple conceptual domains is known as transportability for which complete solutions exist in a specific setting (Bareinboim and Pearl 2014).

While completeness has been accomplished for a number of basic identifiability problems, there are still many challenging but important extensions to the identifiability problem that have not been studied so far. To find solutions to the more complicated identifiability problems, we present a unified approach to the identification of observational and interventional causal queries by constructing a search algorithm that directly applies the rules of do-calculus. We impose no restrictions to the number or type of known input distributions: we thus provide a solution to problems for which no other algorithmic solutions exist. We also extend to identifiability under missing data together with mechanisms related to selection bias and transportability.

To combat the inherent computational complexity of the search-based
approach, we derive rules and techniques that avoid unnecessary
computational steps. We are able to detect trivial queries where
non-identifiability can be determined directly from the inputs. We also
present a search heuristic that considerably speeds up the search in the
cases where the effect is indeed identifiable. The approach, called
**do-search**, is provably sound and it retains the
completeness in the cases previously proven to be solved by do-calculus
rules. We can easily scale up to the problems sizes commonly reported in
the literature. The R package `dosearch`

(R Core Team 2024; Tikka, Hyttinen, and Karvanen 2020)
provides an implementation of the search algorithm and is available on
CRAN. The
complete details of **do-search** can be found in (Tikka, Hyttinen, and Karvanen
2021).

Our presentation is based on Structural Causal Models (SCM) and the language of directed graphs. We assume the reader to be familiar with these concepts and refer them to detailed works on these topics for extended discussion and descriptions, such as (Pearl 2009) and (Koller and Friedman 2009).

Following the standard set-up of do-calculus (Pearl 1995),
we assume that the causal structure can be represented by a
*semi-Markovian causal graph* \(G\) over a set of vertices \(\mathbf{V}\). The directed edges correspond
to direct causal relations between the variables (relative to \(\mathbf{V}\)); directed edges do not form
any cycles. Confounding of any two observed variables in \(\mathbf{V}\) by some unobserved common
cause is represented by a bidirected edge between the variables.

In a non-parametric setting, the problem of expressing a causal quantity of interest in terms of available information has been be described in various ways depending on the context. When available data are affected by selection bias or missing data, a typical goal is to “recover” some joint or marginal distributions. If data are available from multiple conceptual domains, a distribution is “transported” from the source domains, from which a combination of both observational and experimental data are available, to a target domain. The aforementioned can be expressed in the SCM framework by equipping the graph of the model with special vertices. However, on a fundamental level these problems are simply variations of the original identifiability problem of causal effects and as such, our goal is to represent them as a single generalized identifiability problem.

The general form for a causal identifiability problem that we consider is formulated as follows.

*Input*: A set of known distributions of the form \(P(\mathbf{A}_i | \textrm{do}(\mathbf{B}_i), \mathbf{C}_i)\), a query \(P(\mathbf{Y} { \, | \, }\textrm{do}(\mathbf{X}), \mathbf{Z})\) and a semi-Markovian causal graph \(G\) over \(\mathbf{V}\).*Task*: Output a formula for the query \(P(\mathbf{Y} { \, | \, }\textrm{do}(\mathbf{X}),\mathbf{Z})\) over the input distributions, or decide that it is not identifiable.

Here \(\mathbf{A}_i,\mathbf{B}_i, \mathbf{C}_i\) are disjoint subsets of \(\mathbf{V}\) for all \(i\), and \(\mathbf{X},\mathbf{Y},\mathbf{Z}\) are disjoint subsets of \(\mathbf{V}\). The causal graph \(G\) may contain vertices which describe mechanisms related to transportability and selection bias. In the following subsections we explain several important special cases of this problem definition, some that have been considered in the literature and some which have not been.

The SCM framework can be extended to describe missing data mechanisms. For each variable \(V_i\), two special vertices are added to the causal graph. The vertex \(V_i^*\) is the observed proxy variable which is linked to the true variable \(V_i\) via the missingness mechanism (Little and Rubin 1986; Mohan, Pearl, and Tian 2013): \[\begin{equation} V_i^* = \begin{cases} V_i, & \mathrm{if}\; R_{V_i} = 1, \\ \textrm{NA}, & \mathrm{if}\; R_{V_i} = 0, \end{cases} \end{equation}\] where \(\textrm{NA}\) denotes a missing value and \(R_{V_i}\) is called the response indicator (of \(V_i\)). In other words, the variable \(V_i^*\) that is actually observed matches the true value \(V_i\) if it is not missing (\(R_{V_i} = 1\)). We note that in this formulation, each true variable has its own response indicator, meaning that we do not consider shared indicators between variables or multiple indicators for a single variable. Furthermore, if there is no missingness associated with a given variable \(V_i\) meaning that it is fully observed, the corresponding response indicator \(R_{V_i}\) always has the value \(1\). The omission of a proxy variable and a response indicators of a specific variable from a graph encodes the assumption that the variable in question if fully observed. Note that intervention nodes are added for true variables and response indicators but not for proxy variables. On a symbolic level one could intervene on proxy variables, however we are only interested in interventions that keep equation 1 intact.

We implemented **do-search** in C++ and constructed an R
interface using the `Rcpp`

package (Eddelbuettel and Francois 2011). This
interface is provided by the R package `dosearch`

.

`library("dosearch")`

Calling the search from R is straightforward via the primary function that carries the name of package.

```
dosearch(
data, query, graph,
transportability = NULL, selection_bias = NULL, missing_data = NULL,
control = list()
)
```

The required inputs of the function are `data`

,
`query`

and `graph`

. Argument `data`

is
used to encode the set of known input distributions in the general
identifiability problem as a character string, where each distribution
is separated by a new line. For example, if we have access to
distributions \(P(W), P(Y { \, | \,
}X)\), and \(P(Z { \, | \,
}\textrm{do}(X), W)\), we would write

```
data <- "
P(W)
P(Y|X)
P(Z|do(X),W)
"
```

The individual distributions can also be given as a list of character vectors of length one:

```
data <- list(
"P(W)",
"P(Y|X)",
"P(Z|do(X),W)"
)
```

The \(\textrm{do}(\cdot)\)-operator
can either precede or succeed conditioning variables, but it must appear
only once in a given term, meaning that expressions such as
`P(Y|do(A),B,do(C))`

are not allowed, but should instead be
given as `P(Y|B,do(A,C))`

or `P(Y|do(A,C),B)`

. If
variable sets are desired, each member of the set has to be included
explicitly.

Argument `query`

is used to describe the query of the
general identifiability problem as a character string, similarly as
`data`

. If we are interested in identifying \(P(Y { \, | \, }\textrm{do}(X), W)\) we
would write

`query <- "P(Y|do(X),W)"`

Instead of describing distributions via text, it is also possible to use the following structure that encodes the role of each variable via a numeric vector:

`query <- c(Y = 0, X = 1, W = 2)`

Given a distribution of the form \(P(\mathbf{A} { \, | \,
}\textrm{do}(\mathbf{B}),\mathbf{C})\) and a variable \(V\), a value 0 means that \(V \in \mathbf{A}\), value 1 means that
\(V \in \mathbf{B}\) and value 2 means
that \(V \in \mathbf{C}\). This format
can also be used to input `data`

as a list of numeric
vectors:

```
data <- list(
c(W = 0),
c(Y = 0, X = 2),
c(Z = 0, X = 1, W = 2)
)
```

Finally, `graph`

encodes the semi-Markovian graph \(G\) of the causal model as a character
string with each edge on its own line. A directed edge from \(X\) to \(Y\) is given as `X -> Y`

and
a bidirected edge between \(X\) and
\(Y\) is given as
`X <-> Y`

. Intervention nodes should not be given
explicitly, since they are added automatically after calling
`dosearch`

. Furthermore, only vertices with incoming or
outgoing edges should be included in `graph`

. As an example,
we can encode a simple back-door graph with an added unobserved
confounded between \(X\) and \(Y\) as follows:

```
graph <- "
X -> Y
Z -> X
Z -> Y
X <-> Y
"
```

Alternatively, one may use `igraph`

graphs (Csardi and Nepusz
2006) in the syntax of the `causaleffect`

package
(Tikka and
Karvanen 2017) or DAGs created using the `dagitty`

package.

```
library("igraph")
graph <- graph.formula(X -+ Y, Z -+ X, Z -+ Y, X -+ Y, Y -+ X)
graph <- set_edge_attr(graph, "description", 4:5, "U")
```

```
library("dagitty")
graph <- dagitty("dag{X -> Y; Z -> X; Z -> Y; X <-> Y}")
```

The next two optional parameters, and , are used to denote those
vertices of \(G\) that should be
understood as either transportability nodes or selection bias nodes,
respectively. Providing these parameters may increase search performance
in relevant problems. Both of these parameters should be given as
character strings, where individual variables are separated by a comma,
for example `transportability = "S,T"`

. Parameter
`missing_data`

, as the name suggests, is used to define
missingness mechanisms (1) as a character string, where individual
mechanisms are separated by a comma. In order to describe that \(R_X\) is the response indicator of \(X\) we would write `R_X : X`

,
which also implicitly defines that `X*`

is the proxy variable
of `X`

.

The list `control`

can be used to set various additional
parameters that are not directly related to the identifiability problem
itself, but more so to the output of the search and other auxiliary
details, such as benchmarking and obtaining derivations that show how
the query distribution can be reached from the inputs using do-calculus.
One such control parameter determines whether to use the search
heuristic or not. Documentation of the `dosearch`

package
contains detailed information on the full list of control
parameters.

The return object of `dosearch`

is a list with three
components by default. The first component, `identifiable`

,
is a logical value that takes the value `TRUE`

when the
target distribution described by `query`

is identifiable from
the inputs. The second component, `formula`

, is a character
string describing the target distribution in terms of the inputs in
LaTeX syntax if the target is identifiable. Otherwise this component is
just an empty character string. The third component `call`

contains the arguments of the original function call.

Bareinboim, E., and J. Pearl. 2012. “Causal Inference by Surrogate
Experiments: Z-Identifiability.” In *Proceedings of the 28th
Conference on Uncertainty in Artificial Intelligence*, edited by N.
de Freitas and K. Murphy, 113–20. AUAI Press.

———. 2014. “Transportability from Multiple Environments with
Limited Experiments: Completeness Results.” In *Proceedings of
the 27th Annual Conference on Neural Information Processing
Systems*, 280–88.

Bareinboim, E., and J. Tian. 2015. “Recovering Causal Effects from
Selection Bias.” In *Proceedings of the 29th AAAI Conference
on Artificial Intelligence*, 3475–81.

Correa, J., J. Tian, and E. Bareinboim. 2018. “Generalized
Adjustment Under Confounding and Selection Biases.” In
*Proceedings of the 32nd AAAI Conference on Artificial
Intelligence*.

Csardi, G., and T. Nepusz. 2006. “The igraph Software Package for Complex Network
Research.” *InterJournal* Complex Systems: 1695. https://igraph.org.

Eddelbuettel, D., and R. Francois. 2011. “Rcpp:
Seamless R and C++ Integration.”
*Journal of Statistical Software* 40 (8): 1–18. https://doi.org/10.18637/jss.v040.i08.

Greenland, S., J. M. Robins, and J. Pearl. 1999. “Confounding and
Collapsibility in Causal Inference.” *Statistical Science*
14 (1): 29–46. https://doi.org/10.1214/ss/1009211805.

Huang, Y., and M. Valtorta. 2006. “Pearl’s Calculus of
Intervention Is Complete.” In *Proceedings of the 22nd
Conference on Uncertainty in Artificial Intelligence*, 217–24. AUAI
Press.

Koller, D., and N. Friedman. 2009. *Probabilistic Graphical Models:
Principles and Techniques*. MIT Press.

Little, R. J. A., and D. B. Rubin. 1986. *Statistical Analysis with
Missing Data*. New York, NY, USA: John Wiley & Sons, Inc.

Mohan, K., J. Pearl, and J. Tian. 2013. “Graphical Models for
Inference with Missing Data.” In *Proceedings of the 26th
International Conference on Neural Information Processing Systems*,
1277–85.

Pearl, J. 1995. “Causal Diagrams for Empirical Research.”
*Biometrika* 82 (4): 669–88. https://doi.org/10.2307/2337329.

———. 2009. *Causality: Models, Reasoning, and Inference*. Second.
Cambridge University Press.

R Core Team. 2024. *R: A Language and Environment for
Statistical Computing*. Vienna, Austria: R Foundation for
Statistical Computing. https://www.R-project.org/.

Shpitser, I., K. Mohan, and J. Pearl. 2015. “Missing Data as a
Causal and Probabilistic Problem.” In *Proceedings of the 31st
Conference on Uncertainty in Artificial Intelligence*, edited by
Marina Meila and Tom Heskes, 802–11. AUAI Press.

Shpitser, I., and J. Pearl. 2006. “Identification of Joint
Interventional Distributions in Recursive Semi-Markovian
Causal Models.” In *Proceedings of the 21st National
Conference on Artificial Intelligence – Volume 2*, 1219–26. AAAI
Press.

Spirtes, P., C. Glymour, and R. Scheines. 1993. *Causation,
Prediction, and Search*. Springer-Verlag.

Tikka, S., A. Hyttinen, and J. Karvanen. 2020. *dosearch: Causal Effect Identification from
Multiple Incomplete Data Sources*. https://cran.r-project.org/package=dosearch.

———. 2021. “Causal Effect Identification from Multiple Incomplete
Data Sources: A General Search-Based Approach.” *Journal of
Statistical Software* 99 (5): 1–40. https://doi.org/10.18637/jss.v099.i05.

Tikka, S., and J. Karvanen. 2017. “Identifying Causal Effects with
the R Package causaleffect.” *Journal of Statistical
Software* 76 (12): 1–30. https://doi.org/10.18637/jss.v076.i12.