Bayesian Graphical Models

Graphs as compact probability models: factorization, conditional independence, parameter learning, and structure scoring.

A Bayesian network combines a directed acyclic graph $G$ with local conditional distributions $\Theta$. The graph specifies the direct parents of each node, and the joint distribution factorizes as $p(x_1,\dots,x_n)=\prod_i p(x_i\mid \operatorname{pa}(x_i))$.
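As a minimal sketch of the factorization, the snippet below multiplies local conditional probabilities for a binary chain A → B → C and checks that the resulting joint sums to one. The network, the CPT numbers, and the helper names (`local_prob`, `joint`) are illustrative assumptions, not values from the figures.

```python
from itertools import product

# Hypothetical binary chain A -> B -> C; all probabilities are illustrative.
parents = {"A": (), "B": ("A",), "C": ("B",)}
# cpt[node][parent values] = p(node = 1 | parents)
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.2, (1,): 0.9},
    "C": {(0,): 0.1, (1,): 0.6},
}

def local_prob(node, value, assignment):
    """p(node = value | pa(node)) read from the node's CPT row."""
    key = tuple(assignment[p] for p in parents[node])
    p_one = cpt[node][key]
    return p_one if value == 1 else 1.0 - p_one

def joint(assignment):
    """Joint probability as the product of local conditionals."""
    prob = 1.0
    for node in parents:
        prob *= local_prob(node, assignment[node], assignment)
    return prob

# Sum the joint over all 2^3 assignments; a valid factorization gives 1.
total = sum(
    joint(dict(zip(parents, values)))
    for values in product((0, 1), repeat=len(parents))
)
print(total)
```

Only five numbers are stored here, against the seven free parameters of an unrestricted joint over three binary variables, which is the sense in which the graph makes the model compact.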

1. D-separation and explaining away

Click a node to mark it observed. Chains (A→B→C) and forks (A←B→C) are blocked by observing the middle variable; colliders (A→B←C) are blocked by default and opened by observing the common effect or any descendant of it. That second case is explaining away: once the common effect is known, its otherwise independent causes become dependent. The matrix on the right shows every pairwise conditional independence given the current evidence: green for independent, red for dependent.

Figure 1 · Toggle observed nodes; the matrix updates with all pairwise d-separations

Legend: unobserved node · observed node · X ⟂ Y | obs (independent) · X ⊥̸ Y | obs (dependent)
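The blocking rules above can be checked mechanically. The sketch below uses the moralized ancestral graph criterion, one standard way to test d-separation: keep only the query nodes, the evidence, and their ancestors; connect co-parents; drop edge directions; then ask whether the evidence cuts every path. The function names and the example graphs are assumptions made for illustration, and this is not necessarily how the figure computes its matrix.

```python
def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return seen

def d_separated(parents, x, y, observed):
    """True if x and y are d-separated given `observed` (x, y not observed)."""
    keep = ancestors(parents, {x, y} | set(observed))
    # Moralize the ancestral subgraph: undirected parent-child edges,
    # plus an edge between every pair of parents sharing a child.
    nbrs = {n: set() for n in keep}
    for child in keep:
        pas = list(parents[child])
        for p in pas:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for i, p in enumerate(pas):
            for q in pas[i + 1:]:
                nbrs[p].add(q)
                nbrs[q].add(p)
    # Depth-first search from x that may not pass through observed nodes.
    blocked = set(observed)
    seen, stack = {x}, [x]
    while stack:
        n = stack.pop()
        if n == y:
            return False
        for m in nbrs[n]:
            if m not in seen and m not in blocked:
                seen.add(m)
                stack.append(m)
    return True

# Graphs are given as child -> list of parents.
chain = {"A": [], "B": ["A"], "C": ["B"]}
fork = {"B": [], "A": ["B"], "C": ["B"]}
collider = {"A": [], "C": [], "B": ["A", "C"], "D": ["B"]}  # D is a descendant of B

print(d_separated(chain, "A", "C", ["B"]))     # middle of a chain observed
print(d_separated(collider, "A", "C", []))     # collider, nothing observed
print(d_separated(collider, "A", "C", ["D"]))  # descendant of the collider observed
```

Observing `D` in the collider graph pulls `B` into the ancestral set, which marries `A` and `C` during moralization; that added edge is the graph-level picture of explaining away.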

2. Dirichlet-multinomial CPT learning

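For a discrete node, each row of the conditional probability table (one row per parent configuration) is the parameter vector of a multinomial distribution over the node's values. A Dirichlet prior with parameters $\alpha_k$ acts like pseudo-counts added to the observed counts $n_k$. The posterior mean is $(\alpha_k+n_k)/(\sum_j\alpha_j+\sum_j n_j)$, so a stronger prior (larger $\sum_j\alpha_j$) needs more observations before the estimate moves away from the prior mean.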

Figure 2 · Pseudo-counts and posterior CPT rows

Legend: posterior dry · posterior wet · empirical rate

Controls: Dirichlet strength = 2 · # observations = 48
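A small sketch of the posterior-mean formula for a single CPT row follows. The counts (30 dry, 18 wet out of 48) are made up for illustration, and reading "Dirichlet strength" as the total pseudo-count $\sum_j\alpha_j$ split evenly across outcomes is an assumption about the control, not something the figure states.

```python
def posterior_mean(alpha, counts):
    """Posterior mean of one CPT row under a Dirichlet(alpha) prior."""
    total = sum(alpha) + sum(counts)
    return [(a + n) / total for a, n in zip(alpha, counts)]

# Hypothetical counts for one parent configuration: [dry, wet], 48 observations.
counts = [30, 18]
empirical = [n / sum(counts) for n in counts]

# Assumed reading of "strength": total pseudo-count, split evenly.
weak = posterior_mean([1.0, 1.0], counts)      # strength 2
strong = posterior_mean([50.0, 50.0], counts)  # strength 100

print(empirical, weak, strong)
```

With strength 2 the posterior mean sits close to the empirical rate; with strength 100 the same 48 observations leave it much nearer the uniform prior mean of 0.5.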

3. Linear Gaussian Bayesian network

When every node is a linear-Gaussian function of its parents, the joint distribution is multivariate Gaussian. For the chain $A\to B\to C$ with $B = \beta_1 A + \varepsilon_B$ and $C = \beta_2 B + \varepsilon_C$, the covariance $\Sigma$ is dense whenever both edge coefficients are nonzero, and its off-diagonal entries grow with them, but the precision $K=\Sigma^{-1}$ keeps the entry $K_{AC}=0$. A zero in $K$ corresponds exactly to conditional independence of that pair given all the other variables; here, $A\perp C\mid B$.

Normalizing $\Sigma$ and $K$ turns that contrast into one between two correlations. Pearson correlation is the normalized covariance, $r_{ij} = \Sigma_{ij}/\sqrt{\Sigma_{ii}\Sigma_{jj}}$, and partial correlation is the normalized negative precision, $\rho_{ij\cdot\text{rest}} = -K_{ij}/\sqrt{K_{ii}K_{jj}}$. The two answer different questions: $r_{AC}\neq 0$ because $A$ and $C$ are linked through the indirect path $A\!-\!B\!-\!C$, while $\rho_{AC\cdot B} = 0$ because conditioning on $B$ blocks that path. For a Gaussian, $\rho_{ij\cdot\text{rest}} = 0 \iff K_{ij} = 0 \iff X_i \perp X_j \mid \text{rest}$: the partial-correlation zeros are exactly the missing edges of the Gaussian graphical model. Stripping indirect paths out of a correlation network is the standard use of partial correlation; Distance Correlation §7 pits it against Pearson and distance correlation on that task.

Figure 3 · Pearson vs partial correlation in a linear-Gaussian chain

Legend: large positive entry · large negative entry · near-zero entry

Controls: $\beta_1$ (A → B) = 0.8 · $\beta_2$ (B → C) = 0.7 · noise variance = 0.5
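The contrast between $\Sigma$ and $K$ can be reproduced in a few lines. The sketch below writes the chain as $x = Bx + \varepsilon$, so that $\Sigma = (I-B)^{-1} D (I-B)^{-\top}$ and $K = (I-B)^{\top} D^{-1} (I-B)$ with $D$ the diagonal noise covariance. It uses the slider values 0.8, 0.7, and 0.5, and assumes that $A = \varepsilon_A$ and that all three noise terms share the same variance; the figure may be parameterized differently.

```python
import math

# Slider values from Figure 3; equal noise variance at every node is an assumption.
beta1, beta2, noise_var = 0.8, 0.7, 0.5

def matmul(X, Y):
    """Plain-Python matrix product for small dense matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

# Structural equations x = B x + eps in the variable order (A, B, C).
I_minus_B = [[1.0, 0.0, 0.0],
             [-beta1, 1.0, 0.0],
             [0.0, -beta2, 1.0]]
# Closed-form inverse of (I - B) for this three-node chain.
inv_I_minus_B = [[1.0, 0.0, 0.0],
                 [beta1, 1.0, 0.0],
                 [beta1 * beta2, beta2, 1.0]]

# Covariance and precision, with D = noise_var * I.
Sigma = [[noise_var * v for v in row]
         for row in matmul(inv_I_minus_B, transpose(inv_I_minus_B))]
K = [[v / noise_var for v in row]
     for row in matmul(transpose(I_minus_B), I_minus_B)]

def pearson(i, j):
    """Normalized covariance."""
    return Sigma[i][j] / math.sqrt(Sigma[i][i] * Sigma[j][j])

def partial_corr(i, j):
    """Normalized negative precision: correlation given all other variables."""
    return -K[i][j] / math.sqrt(K[i][i] * K[j][j])

A, B, C = 0, 1, 2
print("r_AC   =", pearson(A, C))
print("rho_AC =", partial_corr(A, C))
```

Under these assumptions the Pearson correlation between $A$ and $C$ is clearly nonzero, while the partial correlation vanishes because no row of $I-B$ has nonzero entries in both the $A$ and $C$ columns.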

4. Structure search sandbox

Structure learning trades fit against complexity. The score below is BIC for a Gaussian linear regression at each node: $\sum_i \left[\log p(\mathcal D_i \mid \operatorname{pa}(x_i),\hat w_i) - \tfrac12 k_i \log N\right]$, where $\hat w_i$ are the fitted regression parameters at node $i$, $k_i$ is their number, and $N$ is the sample size. Click an edge slot to toggle it; both directions are separate slots so you can reverse an edge. Toggles that would create a cycle are rejected. The true generative DAG is shown faintly for comparison with greedy search.

Figure 4 · Toggle edges; the score updates per node

Legend: included edge · cycle would form · matches true DAG · candidate slot
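The per-node score can be sketched as an ordinary least-squares fit followed by the Gaussian maximum log-likelihood, $-\tfrac N2\left[\log(2\pi\hat\sigma_i^2)+1\right]$, minus the penalty. Everything below is an illustrative assumption rather than the sandbox's actual code: the simulated chain data, the helper names, and the choice to count $k_i$ as regression weights plus intercept plus noise variance.

```python
import math
import random

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - tail) / M[i][i]
    return w

def node_bic(data, child, parent_names):
    """Maximized Gaussian log-likelihood of one node minus (k/2) log N."""
    y = data[child]
    N = len(y)
    # Design columns: an intercept plus one column per parent.
    cols = [[1.0] * N] + [data[p] for p in parent_names]
    gram = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    w = solve(gram, rhs)  # normal equations
    resid = [y[t] - sum(wj * cj[t] for wj, cj in zip(w, cols)) for t in range(N)]
    sigma2 = sum(r * r for r in resid) / N  # ML noise variance
    loglik = -0.5 * N * (math.log(2.0 * math.pi * sigma2) + 1.0)
    k = len(cols) + 1  # assumed count: weights + intercept + variance
    return loglik - 0.5 * k * math.log(N)

def bic(data, parents):
    """Total score: sum of per-node terms for a child -> parents mapping."""
    return sum(node_bic(data, child, pa) for child, pa in parents.items())

# Simulated data from a hypothetical chain A -> B -> C.
rng = random.Random(0)
N = 500
sd = math.sqrt(0.5)
A = [rng.gauss(0.0, sd) for _ in range(N)]
B = [0.8 * a + rng.gauss(0.0, sd) for a in A]
C = [0.7 * b + rng.gauss(0.0, sd) for b in B]
data = {"A": A, "B": B, "C": C}

chain = {"A": [], "B": ["A"], "C": ["B"]}
empty = {"A": [], "B": [], "C": []}
print(bic(data, chain), bic(data, empty))
```

Because the score is a sum over nodes, toggling one edge only changes the term of the node that edge points into, which is what makes per-node updates and greedy search cheap.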

What next

Static Bayesian networks connect to dynamic models, dependence measures, and Bayesian computation.

Dynamic BN

Hidden Markov Models

Repeat a small Bayesian network across time and compare filtering, smoothing, and Viterbi decoding.

Dependence

Distance Correlation

Pearson and partial correlations are common network-construction tools, but they miss nonlinear dependence.

Computation

Monte Carlo & MCMC

Use sampling when local messages or closed-form parameter updates are unavailable.