Causality and Polynomial Optimization

by Kevin Shu

Correlation and Causation

  • Correlation is not the same thing as causation.
  • Two things that occur often together can in fact not be causally related if there are unobserved variables.
Season Ice Cream Shark Attack

How do we find causation?

1. Experimentation

Hold some variables fixed, independent of all the other ones, and see what happens Season Ice Cream Shark Attack = X If we fix the amount of ice cream consumed, it no longer is correlated with the season. This is the main assumption being made in Pearl's causal theory.

2. Use Conditional Independence

image/svg+xml HealthConsciousness Toothbrush ToothHealth HeartHealth We can uniquely determine what the effect of brushing you teeth is on heart health assuming these conditional independence assumptions

Bayesian Networks

  • $G = (V, E)$ is a directed acyclic graph.
  • $V$ is a set of random variables.
  • Each $\textbf{V} \in V$ has a set of parents, $\text{pa}(\textbf{V})$, which have edges into $\textbf{V}$.
  • For each $\textbf{V} \in V$, $\textbf{V}$ is conditionally independent of $V \setminus (\text{pa}(\textbf{V}) \cup \{\textbf{V}\})$ given $\text{pa}(\textbf{V})$.
  • Each node $\textbf{V}$ depends 'causally' on only its parents.
  • have a conditional probability of $\textbf{V}$ given its parents: \[ \text{Pr}(\textbf{V} = x | \textbf{pa(v)} = y) \]


Given a Bayesian network, we can consider interventions on that network, where we fix some variable to take on a certain value.
image/svg+xml v pa(v) =x
  1. Remove all of the arrows coming into $\textbf{v}$
  2. Replace $\textbf{v}$ with the constant variable $x$.
  3. All other conditional probabilities remain the same.


We can write down what the interventional distribution is explicitly in terms of the model parameters (here $\textbf{W}$ is a subset of the variables) \[ \text{Pr}(\textbf{W} = w | \text{do}(\textbf{X} = x)) = \] \[ \sum_{\textbf{V}_i \not \in \textbf{W}}\prod_{\textbf{V}_i \in V \setminus X} \text{Pr}(\textbf{V}_i = v_i | \text{pa}(\textbf{V}_i) = \text{pa}(v_i), \textbf{X} = x) \]

Unobserved Variables

Not all variables are known in general. Unobserved variables can make the result of an intervention ambiguous. Patient Personality Drug Outcome

Can we conclude correlation from causation?

A graph with marked unobserved variables is said to be identifiable if for any consistent distribution on the observed variables, there is a unique interventional distribution.

A complete classification of identifiable graphs was given in Tian and Pearl in 2002, including an algorithm for determining if a given network is identifiable. [1]

An Identifiable Network

image/svg+xml HealthConsciousness Toothbrush ToothHealth HeartHealth

Identifiability Criterion

  • Assume that the Bayesian network is semi-Markovian, so that every unobserved node has 2 direct observed children.
  • We draw undirected edges from $\textbf{V}_i$ to $\textbf{V}_j$ if there is an unobserved node with $\textbf{V}_i$ and $\textbf{V}_j$ as direct children.
  • The causal effect of fixing $\textbf{X}$ on the rest of the variables is identifiable if there is no path of undirected edges from $\textbf{X}$ to any of its children [2].
Here, the causal effect of $X$ on $Y$ is unidentifiable (yellow edges represent undirected edges).

Limits of Identifiability

  • Not all networks of interest are identifiable, so the interventional distribution is ambiguous.
  • Exact answers are not needed; we might really just care about 'will this intervention have a positive effect?'.
  • We need to think about a 'space' of posssible interventional distributions, and try to control it.

Probability Equations

  • Recall that a Bayesian network is entirely specified by the parameters $\text{Pr}(\textbf{X} = x | \textbf{pa}(\textbf{X}) = p)$.
  • These parameters are just numbers satisfying some equations: \[ \forall p, \;\sum_{x} \text{Pr}(\textbf{X} = x | \textbf{pa}(\textbf{X}) = p) = 1 \] \[ \forall x, p \;\text{Pr}(\textbf{X} = x | \textbf{pa}(\textbf{X}) = p) \ge 0 \]

Observed Equations

Our knowledge of the marginal on observed variables also gives us some equations that we know these parameters satisfy: \[ \forall v, \; \sum_{u} \prod_{\textbf{V}_i \in V}\text{Pr}(\textbf{V}_i = v_i | \textbf{pa}(\textbf{V}_i) = \text{pa}(v_i))\] \[ = \text{Pr}(V_{obs} = v_{obs}) \] where this sum is over all settings to the unobserved variables, and $\text{Pr}(V_{obs})$ is the marginal distribution on the observed variables.

Semialgebraic Sets

  • The upshot of the previous two slides is that the possible models consistent with this observed distribution are those satisfying a set of polynomial inequalities.
  • Such a set is said to be semialgebraic.
  • There has been a lot of work on optimization of polynomial functions on semialgebraic sets.

Theorem 1: The Space of Interventional Distributions is Semialgebraic

Theorem (Tarski-Seidenberg) - The image of a semialgebraic set under a polynomial map is semialgebraic.
  • Recall that the interventional distribution is a polynomial map of the model parameters.
  • We immediately get that the space of interventional distributions is semialgebraic.
  • There are examples where the space of interventional distributions requires exponentially many polynomial inequalities to define.

Optimization over all Interventional Distributions

  • We might want to know something like "prescribing this drug will improve a patient's chance of recovery".
  • To get this result, we would want to show that a lower bound on the minimum of some function over the space of interventional distributions.
  • If this function is linear, we get a polynomial optimization problem (which has a compact formulation).
\[\text{minimize } \sum_{v,x} \alpha_{v,x} \text{Pr}(\textbf{W} = w | \text{do}(\textbf{X} = x))\] \[\text{subject to } \text{Pr}(\textbf{W} = w | \text{do}(\textbf{X} = x)) \] \[ \hspace{9em}= \sum_{\textbf{V}_i \not \in \textbf{W}}\prod_{\textbf{V}_i \in V \setminus X} \text{Pr}(\dots) \] \[ \hspace{7em} \forall v, \; \sum_{u} \prod_{\textbf{V}_i \in V}\text{Pr}(\textbf{V}_i = v_i | \textbf{pa}(\textbf{V}_i) = \text{pa}(v_i))\] \[\hspace{6.5em} = \text{Pr}(V_{obs} = v_{obs}) \] \[ \dots \]
  • Optimizing this kind of program is hard in general, though there has been a lot of work on the subject.
  • But it can be a useful framework for understanding how some other conditions (besides conditional independence assumptions) can give us improved results.
  • In particular, we can often use linear programming duality to construct bounds using this framework.

Theorem 2: Low Mutual Information Bound

We will look at the following Bayesian network:


Recall that the mutual information between two variables is a measurement of how far they are from being independent.

Theorem - Suppose that the conditional distribution of $Y$ given $X$ is $\eta$ and the interventional distribution of $Y$ given $X$ is $\zeta$, then \[ \|\zeta - \eta\|_1 \le \frac{1}{\min_{x} \text{Pr}(X = x)}\sqrt{\frac{1}{2}I(X ; U)} \]

This gives a quantitative version of the idea that if $X$ is independent of the confounder, then the true interventional distribution is equal to the conditional distribution of $X$ given $Y$.

Comments on the proof:

  • Because the $L_1$ ball is polyhedral, bounding $L_1$ distance is equivalent to minimizing some collection of linear functions on the space of interventional distributions.
  • Mutual information can be related to $L_1$ distance via Pinsker's inequality.


  1. Tian, J., & Pearl, J. (2002, August). A general identification condition for causal effects. In Aaai/iaai (pp. 567-573).
  2. Bhattacharyya, A., Gayen, S., Kandasamy, S., Maran, A. & Variyam, V.N.. (2020). Learning and Sampling of Atomic Interventions from Observations. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:842-853 Available from .
  3. Shu, K. (2021). Causal Channels.