Bayesian Graphical Models

Graphs as compact probability models: factorization, conditional independence, parameter learning, and structure scoring.

A Bayesian network combines a directed acyclic graph $G$ with local conditional distributions $\Theta$. The graph says which variables are direct parents of each node; the joint distribution then factorizes as $p(x_1,\dots,x_n)=\prod_i p(x_i\mid \operatorname{pa}(x_i))$.
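As a minimal sketch of the factorization, the snippet below builds a joint distribution for a hypothetical three-node network (Rain → WetGrass ← Sprinkler) by multiplying local conditionals. The variable names and every probability are made-up illustration values, not taken from the figures on this page.

```python
from itertools import product

# Hypothetical network: Rain -> WetGrass <- Sprinkler.
# All numbers below are illustrative assumptions.
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: 0.3, False: 0.7}
# p(wet = True | rain, sprinkler), one entry per parent configuration
p_wet_given = {
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.05,
}


def joint(rain, sprinkler, wet):
    """p(rain, sprinkler, wet) as the product of the local conditionals."""
    p_w = p_wet_given[(rain, sprinkler)]
    return p_rain[rain] * p_sprinkler[sprinkler] * (p_w if wet else 1.0 - p_w)


# Summing the product over all 2^3 assignments must give 1.
total = sum(joint(r, s, w) for r, s, w in product([True, False], repeat=3))
```

Three small tables (1 + 1 + 4 free numbers) stand in for the 7 free numbers of an unrestricted joint over three binary variables; the saving grows quickly with the number of nodes.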

1. D-separation and explaining away

Click a node to mark it observed. Chains (A→B→C) and forks (A←B→C) are blocked by observing the middle variable; colliders (A→B←C) are opened by observing the common effect or any descendant of it. The matrix on the right shows every pairwise conditional independence given the current evidence: green for independent, red for dependent.

Figure 1 · Toggle observed nodes; the matrix updates with all pairwise d-separations
Legend: unobserved node · observed node · X ⟂ Y | obs (independent) · X ⊥̸ Y | obs (dependent)
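The blocking rules can also be checked programmatically. The sketch below uses one standard way to test d-separation (restrict to the ancestral graph, moralize, delete the observed nodes, test reachability); it is not necessarily how the figure computes its matrix. The function name `d_separated` and the dict-of-parent-sets representation are assumptions made for this example.

```python
def d_separated(parents, x, y, observed):
    """True if x and y are d-separated given `observed`.

    `parents` maps every node to the set of its parents (hypothetical format).
    Assumes x and y are distinct and not themselves observed.
    """
    observed = set(observed)

    # 1. Keep only x, y, the observed nodes, and all of their ancestors.
    keep, stack = set(), [x, y, *observed]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))

    # 2. Moralize: connect each child to its parents, "marry" co-parents,
    #    and forget edge directions.
    adj = {n: set() for n in keep}
    for child in keep:
        pas = list(parents.get(child, ()))
        for p in pas:
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(pas):
            for q in pas[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)

    # 3. Observed nodes block paths; x and y are d-separated
    #    iff y is unreachable from x in what remains.
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        if node in seen or node in observed:
            continue
        seen.add(node)
        stack.extend(adj[node])
    return True


chain = {"A": set(), "B": {"A"}, "C": {"B"}}
fork = {"B": set(), "A": {"B"}, "C": {"B"}}
# Collider A -> B <- C, with D a descendant of the common effect B.
collider = {"A": set(), "C": set(), "B": {"A", "C"}, "D": {"B"}}
```

For example, `d_separated(collider, "A", "C", [])` holds, while observing either `"B"` or its descendant `"D"` makes the two causes dependent, which is the explaining-away pattern.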

2. Dirichlet-multinomial CPT learning

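For a discrete node, each row of a conditional probability table is the parameter vector of a multinomial distribution over the node's values, one row per parent configuration. A Dirichlet prior with parameters $\alpha_k$ acts like pseudo-counts added to the observed counts $n_k$. The posterior mean is $(\alpha_k+n_k)/(\sum_j\alpha_j+\sum_j n_j)$, so the larger the total pseudo-count $\sum_j\alpha_j$, the more data it takes to pull the estimate away from the prior mean.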

Figure 2 · Pseudo-counts and posterior CPT rows
Legend: posterior dry · posterior wet · empirical rate
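The posterior-mean formula is short enough to evaluate directly. The sketch below applies it to one CPT row with two outcomes (dry, wet); the counts and the two prior strengths are hypothetical values chosen only to show the effect of the pseudo-counts.

```python
def posterior_mean(alpha, counts):
    """Dirichlet posterior mean: (alpha_k + n_k) / (sum(alpha) + sum(n))."""
    total = sum(alpha) + sum(counts)
    return [(a + n) / total for a, n in zip(alpha, counts)]


# Hypothetical data for one parent configuration: 3 dry days, 7 wet days.
counts = [3, 7]
empirical = [n / sum(counts) for n in counts]   # [0.3, 0.7]

weak = posterior_mean([1, 1], counts)           # 2 pseudo-counts in total
strong = posterior_mean([50, 50], counts)       # 100 pseudo-counts in total
```

With the weak prior the wet probability is 8/12, close to the empirical 0.7; with the strong prior it is 57/110, still close to the prior mean of 0.5 because ten observations are small next to one hundred pseudo-counts.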

3. Linear Gaussian Bayesian network

When every node is a linear-Gaussian function of its parents, the joint distribution is multivariate Gaussian. For the chain $A\to B\to C$ with $B = \beta_1 A + \varepsilon_B$ and $C = \beta_2 B + \varepsilon_C$, the covariance $\Sigma$ fills in as soon as both edge coefficients are nonzero (for instance $\Sigma_{AC}=\beta_1\beta_2\operatorname{Var}(A)$), and its off-diagonal entries grow with the coefficients, but the precision $K=\Sigma^{-1}$ keeps the entry $K_{AC}=0$. Zeros in $K$ correspond exactly to conditional independencies given the rest; here, $A\perp C\mid B$.
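A small numerical check of the dense-covariance, sparse-precision claim, written as a sketch under stated assumptions: zero means, unit noise variances for all three nodes (including $A$), and arbitrary coefficient values. Writing the model as $x = Wx + \varepsilon$ gives $\Sigma = (I-W)^{-1} D (I-W)^{-\top}$ and $K = (I-W)^{\top} D^{-1} (I-W)$, where $D$ is the diagonal noise covariance.

```python
import numpy as np

# Assumed illustration values; node order is (A, B, C).
beta1, beta2 = 0.8, -1.5
noise_var = np.ones(3)            # assumed unit noise variance at every node

# W[i, j] is the coefficient of parent j in the equation for child i.
W = np.array([
    [0.0,   0.0,   0.0],
    [beta1, 0.0,   0.0],
    [0.0,   beta2, 0.0],
])

M = np.eye(3) - W                 # x = W x + eps  =>  M x = eps
M_inv = np.linalg.inv(M)

Sigma = M_inv @ np.diag(noise_var) @ M_inv.T    # covariance: dense
K = M.T @ np.diag(1.0 / noise_var) @ M          # precision: K[A, C] is 0
```

Under these assumptions `Sigma[0, 2]` equals `beta1 * beta2`, while `K[0, 2]` is zero for any choice of coefficients because no row of `M` has nonzero entries in both the $A$ and $C$ columns.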

Normalizing puts the contrast between $\Sigma$ and $K$ on a correlation scale. Pearson correlation is the normalized covariance, $r_{ij} = \Sigma_{ij}/\sqrt{\Sigma_{ii}\Sigma_{jj}}$, and partial correlation is the normalized negative precision, $\rho_{ij\cdot\text{rest}} = -K_{ij}/\sqrt{K_{ii}K_{jj}}$. The two answer different questions: $r_{AC}\neq 0$ because $A$ and $C$ are linked through the indirect path $A\!-\!B\!-\!C$, while $\rho_{AC\cdot B} = 0$ because conditioning on $B$ blocks that path. For a Gaussian, $\rho_{ij\cdot\text{rest}} = 0 \iff K_{ij} = 0 \iff X_i \perp X_j \mid \text{rest}$: the partial-correlation zeros are exactly the missing edges of the Gaussian graphical model. Stripping indirect paths out of a correlation network is the standard use of partial correlation; Distance Correlation §7 pits it against Pearson and distance correlation on that task.

Figure 3 · Pearson vs partial correlation in a linear-Gaussian chain
Legend: large positive entry · large negative entry · near-zero entry
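The same comparison can be reproduced from simulated data. The sketch below samples the chain with assumed coefficients, unit-variance noise, and a fixed seed, then normalizes the sample covariance and its inverse; the helper name `normalize` is introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
beta1, beta2 = 0.8, 1.2           # assumed illustration values

# Sample the chain A -> B -> C with unit-variance Gaussian noise.
a = rng.normal(size=n)
b = beta1 * a + rng.normal(size=n)
c = beta2 * b + rng.normal(size=n)
X = np.column_stack([a, b, c])

Sigma = np.cov(X, rowvar=False)   # sample covariance, shape (3, 3)
K = np.linalg.inv(Sigma)          # sample precision


def normalize(M):
    """Divide each entry M[i, j] by sqrt(M[i, i] * M[j, j])."""
    d = np.sqrt(np.diag(M))
    return M / np.outer(d, d)


pearson = normalize(Sigma)        # r_ij
partial = -normalize(K)           # rho_ij given the rest
np.fill_diagonal(partial, 1.0)    # diagonal set to 1 by convention
```

In this setup `pearson[0, 2]` is clearly positive, while `partial[0, 2]` is zero up to sampling noise, which shrinks roughly like $1/\sqrt{n}$.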

4. Structure search sandbox

Structure learning trades fit against complexity. The score below is BIC for a Gaussian linear regression at each node, summed over nodes: $\sum_i \left[\log p(\mathcal D_i \mid \operatorname{pa}(x_i),\hat w_i) - \tfrac12 k_i \log N\right]$, where $k_i$ is the number of free parameters in node $i$'s regression and $N$ is the number of samples. Click an edge slot to toggle it; both directions are separate slots so you can reverse an edge. Cycles are rejected. The true generative DAG is shown faintly for comparison with greedy search.

Figure 4 · Toggle edges; the score updates per node
Legend: included edge · cycle would form · matches true DAG · candidate slot
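A rough sketch of such a decomposable score, not the sandbox's actual implementation. It assumes each node's regression has an intercept, one weight per parent, and a noise variance, so $k_i = |\operatorname{pa}(x_i)| + 2$; the function names, the dict-of-parent-sets format, and the simulated chain data are all hypothetical.

```python
import numpy as np


def node_bic(data, child, parents):
    """BIC term of one node: max log-likelihood minus 0.5 * k * log(N)."""
    n = data.shape[0]
    X = np.column_stack([np.ones(n)] + [data[:, p] for p in parents])
    y = data[:, child]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares weights
    resid = y - X @ w
    sigma2 = resid @ resid / n                    # ML noise variance
    loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    k = X.shape[1] + 1      # intercept + weights + variance (assumed count)
    return loglik - 0.5 * k * np.log(n)


def is_acyclic(parent_sets):
    """Peel off parentless nodes repeatedly; a leftover means a cycle."""
    remaining = {v: set(ps) for v, ps in parent_sets.items()}
    while remaining:
        roots = [v for v, ps in remaining.items() if not ps]
        if not roots:
            return False
        for r in roots:
            del remaining[r]
        for ps in remaining.values():
            ps.difference_update(roots)
    return True


def bic_score(data, parent_sets):
    """Sum of per-node BIC terms; rejects graphs that contain a cycle."""
    if not is_acyclic(parent_sets):
        raise ValueError("edge set contains a cycle")
    return sum(node_bic(data, v, sorted(ps)) for v, ps in parent_sets.items())


# Hypothetical data from the chain 0 -> 1 -> 2 (columns are nodes).
rng = np.random.default_rng(1)
n = 2_000
x0 = rng.normal(size=n)
x1 = 0.9 * x0 + rng.normal(size=n)
x2 = -1.1 * x1 + rng.normal(size=n)
data = np.column_stack([x0, x1, x2])

true_dag = {0: set(), 1: {0}, 2: {1}}
empty = {0: set(), 1: set(), 2: set()}
score_true = bic_score(data, true_dag)
score_empty = bic_score(data, empty)
```

Because the score is a sum of per-node terms, toggling one edge only changes the term of the node that edge points into, which is what makes greedy edge-by-edge search cheap. Graphs that imply the same conditional independencies, such as the reversed chain 2 → 1 → 0, fit the data equally well in the large-sample limit, so a high score alone does not identify edge directions.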

What next

Static Bayesian networks connect to dynamic models, dependence measures, and Bayesian computation.