by Kevin Shu

- Correlation is not the same thing as causation.
- Two things that often occur together may nonetheless not be causally related, e.g. when an unobserved variable influences both.

- $G = (V, E)$ is a directed acyclic graph.
- $V$ is a set of random variables.
- Each $\textbf{V} \in V$ has a set of parents, $\text{pa}(\textbf{V})$, which have edges into $\textbf{V}$.
- For each $\textbf{V} \in V$, $\textbf{V}$ is conditionally independent of $V \setminus (\text{pa}(\textbf{V}) \cup \{\textbf{V}\})$ given $\text{pa}(\textbf{V})$.
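As a sketch of the factorization this definition implies, here is a toy three-node chain (variable names and numbers invented for illustration); the joint distribution is the product of each node's conditional given its parents:

```python
import itertools

# Toy three-node chain X -> Y -> Z (names and numbers are invented).
# Each variable is binary; tables give Pr(V = 1 | pa(V) = p).
p_x = 0.3                    # Pr(X = 1); X has no parents
p_y = {0: 0.2, 1: 0.9}       # Pr(Y = 1 | X = x)
p_z = {0: 0.4, 1: 0.7}       # Pr(Z = 1 | Y = y)

def bern(p, v):
    """Probability that a binary variable with Pr(v = 1) = p takes value v."""
    return p if v == 1 else 1 - p

def joint(x, y, z):
    # DAG factorization: Pr(X, Y, Z) = Pr(X) * Pr(Y | X) * Pr(Z | Y).
    return bern(p_x, x) * bern(p_y[x], y) * bern(p_z[y], z)

# The factorized joint is a genuine distribution: it sums to 1.
total = sum(joint(*v) for v in itertools.product([0, 1], repeat=3))
```

Here $Z$ is conditionally independent of $X$ given $Y$ by construction, since the factor for $Z$ only consults $Y$.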

- Each node $\textbf{V}$ depends 'causally' on only its parents.
- Each node $\textbf{V}$ has a conditional probability distribution given its parents: \[ \text{Pr}(\textbf{V} = x | \text{pa}(\textbf{V}) = y) \]

To intervene and fix $\textbf{V} = x$:

- Remove all of the arrows coming into $\textbf{V}$.
- Replace $\textbf{V}$ with the constant value $x$.
- All other conditional probabilities remain the same.
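These steps can be sketched on a toy chain $X \to Y \to Z$ (names and numbers hypothetical): intervening on $Y$ deletes the factor $\text{Pr}(Y | X)$ and treats $Y$ as a constant.

```python
# Toy chain X -> Y -> Z (invented numbers), with an intervention on Y.
p_x = 0.3                    # Pr(X = 1)
p_y = {0: 0.2, 1: 0.9}       # Pr(Y = 1 | X = x)
p_z = {0: 0.4, 1: 0.7}       # Pr(Z = 1 | Y = y)

def bern(p, v):
    return p if v == 1 else 1 - p

def do_y(y_star, x, z):
    # do(Y = y_star): the arrow X -> Y is removed, so the factor
    # Pr(Y | X) disappears and Y is treated as the constant y_star;
    # all other conditionals are unchanged.
    return bern(p_x, x) * bern(p_z[y_star], z)

# Pr(Z = 1 | do(Y = 1)): marginalize the interventional joint over X.
p_do = sum(do_y(1, x, 1) for x in [0, 1])
```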

A graph with marked unobserved variables is said to be **identifiable** if for any consistent distribution on the observed variables, there is a unique interventional distribution.

A complete classification of identifiable graphs was given by Tian and Pearl in 2002, including an algorithm for determining if a given network is identifiable. [1]

- Assume that the Bayesian network is semi-Markovian, so that every unobserved node has exactly two direct observed children.
- We draw undirected edges from $\textbf{V}_i$ to $\textbf{V}_j$ if there is an unobserved node with $\textbf{V}_i$ and $\textbf{V}_j$ as direct children.
- The causal effect of fixing $\textbf{X}$ on the rest of the variables is identifiable if there is no path of undirected edges from $\textbf{X}$ to any of its children [2].
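A minimal sketch of this criterion on an invented example graph (the full algorithm in [2] handles the general case):

```python
from collections import deque

# Invented example: directed edges X -> Y -> Z, plus one undirected
# confounding edge X -- Z from a shared unobserved parent.
children = {"X": ["Y"], "Y": ["Z"], "Z": []}
confounded = {("X", "Z"), ("Z", "X")}

def undirected_reachable(start, edges):
    """Nodes reachable from `start` along undirected confounding edges."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for a, b in edges:
            if a == u and b not in seen:
                seen.add(b)
                queue.append(b)
    return seen

def effect_identifiable(x, children, edges):
    # Identifiable iff no path of undirected edges joins x to a child of x.
    reach = undirected_reachable(x, edges)
    return not any(c in reach for c in children[x])
```

Here the confounding edge touches $Z$ but not $X$'s child $Y$, so fixing $X$ has an identifiable effect; a confounding edge between $X$ and $Y$ directly would break identifiability.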

- Not all networks of interest are identifiable, so the interventional distribution may be ambiguous.
- Exact answers are not needed; we might really just care about 'will this intervention have a positive effect?'.
- We need to think about a 'space' of possible interventional distributions, and try to control it.

- Recall that a Bayesian network is entirely specified by the parameters $\text{Pr}(\textbf{X} = x | \textbf{pa}(\textbf{X}) = p)$.
- These parameters are just numbers satisfying some equations: \[ \forall p, \;\sum_{x} \text{Pr}(\textbf{X} = x | \textbf{pa}(\textbf{X}) = p) = 1 \] \[ \forall x, p, \;\text{Pr}(\textbf{X} = x | \textbf{pa}(\textbf{X}) = p) \ge 0 \]
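These constraints are easy to check numerically for a toy conditional-probability table (numbers invented):

```python
# Toy table Pr(X = x | pa(X) = p), indexed by parent assignment p.
cpt = {0: [0.2, 0.8], 1: [0.9, 0.1]}

# For every parent assignment p: entries are nonnegative and sum to 1.
valid = all(
    abs(sum(row) - 1) < 1e-12 and all(q >= 0 for q in row)
    for row in cpt.values()
)
```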

- The upshot of the previous two slides is that the possible models consistent with this observed distribution are those satisfying a set of **polynomial** inequalities.
- Such a set is said to be **semialgebraic**.
- There has been a lot of work on optimization of polynomial functions over semialgebraic sets.

- Recall that the interventional distribution is a polynomial map of the model parameters.
- We immediately get that the space of interventional distributions is semialgebraic.
- There are examples where the space of interventional distributions requires exponentially many polynomial inequalities to define.

- We might want to know something like "prescribing this drug will improve a patient's chance of recovery".
- To get this result, we would want to establish a lower bound on the minimum of some function over the space of interventional distributions.
- If this function is linear, we get a polynomial optimization problem (which has a compact formulation).
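As a crude stand-in for such a program (real solvers use LP/SDP relaxations or sums-of-squares certificates rather than grid search, and the feasible set below is an invented toy, not an actual space of interventional distributions):

```python
import itertools

def feasible(v0, v1):
    # An invented semialgebraic set: the right half of the unit disc,
    # cut out by the polynomial inequalities 1 - v0^2 - v1^2 >= 0, v0 >= 0.
    return v0 ** 2 + v1 ** 2 <= 1 and v0 >= 0

# Lower-bound the linear function v0 - v1 over the set by brute force.
grid = [i / 200 for i in range(-200, 201)]
best = min(
    v0 - v1 for v0, v1 in itertools.product(grid, grid) if feasible(v0, v1)
)
# The true minimum is -1, attained at (v0, v1) = (0, 1).
```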

- Optimizing this kind of program is hard in general, though there has been a lot of work on the subject.
- But it can be a useful framework for understanding how some other conditions (besides conditional independence assumptions) can give us improved results.
- In particular, we can often use linear programming duality to construct bounds using this framework.

We will look at the following Bayesian network:

Recall that the mutual information between two variables is a measurement of how far they are from being independent.

This gives a quantitative version of the idea that if $X$ is independent of the confounder, then the true interventional distribution equals the conditional distribution of $Y$ given $X$.

**Comments on the proof:**

- Because the $L_1$ ball is polyhedral, bounding the $L_1$ distance is equivalent to bounding finitely many linear functions on the space of interventional distributions.
- Mutual information can be related to $L_1$ distance via Pinsker's inequality.
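A quick numerical sanity check of Pinsker's inequality, $\|P - Q\|_1 \le \sqrt{2\, D_{\mathrm{KL}}(P \| Q)}$ (recall that mutual information is the KL divergence between the joint and the product of the marginals); the distributions below are arbitrary:

```python
import math

# Two arbitrary distributions on three outcomes (numbers illustrative).
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

l1 = sum(abs(p - q) for p, q in zip(P, Q))
kl = sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

# Pinsker's inequality: ||P - Q||_1 <= sqrt(2 * KL(P || Q)).
pinsker_bound = math.sqrt(2 * kl)
```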

- Tian, J., & Pearl, J. (2002). A general identification condition for causal effects. In AAAI/IAAI (pp. 567-573).
- Bhattacharyya, A., Gayen, S., Kandasamy, S., Maran, A., & Variyam, V. N. (2020). Learning and sampling of atomic interventions from observations. Proceedings of the 37th International Conference on Machine Learning, PMLR 119:842-853. http://proceedings.mlr.press/v119/bhattacharyya20a.html
- Shu, K. (2021). Causal Channels. https://arxiv.org/abs/2103.02834