Distance Correlation

Notes on Szekely, Rizzo & Bakirov (2007), "Measuring and Testing Dependence by Correlation of Distances," and Szekely & Rizzo (2014), "Partial Distance Correlation with Methods for Dissimilarities."

The 2007 paper introduced distance correlation, a scalar coefficient that is zero exactly when two random vectors are independent. Unlike Pearson’s $r$, it is not limited to linear dependence. The construction starts with pairwise distance matrices, subtracts row and column means to remove uninformative structure, and averages the entry-wise product.

The 2014 paper replaces the 2007 double-centering with U-centering. With that centering, the same bilinear form becomes a positive semidefinite inner product on a Hilbert space of centered matrices. That gives an unbiased estimator, lengths and angles, orthogonal projection for partial distance correlation, and a version that works for non-Euclidean dissimilarities.

Part I · Distance correlation (2007)

1. What Pearson’s r misses

Pearson’s $r$ is a coefficient of linear dependence: it measures whether the cloud of points stretches along a diagonal. Symmetric nonlinear relationships, such as a parabola, a sine wave, or a circle, can have $r \approx 0$ while $X$ and $Y$ are tightly coupled, even deterministic. Rank-based generalizations (Spearman, Kendall) catch monotone nonlinear relationships but still miss the symmetric ones. The 2007 paper’s motivating quote: “distance correlation is zero only if the random vectors are independent.”

[Interactive figure: scatter plot of a selectable relationship (noise 0.15), with readouts for Pearson $r$, Spearman $\rho$, and distance correlation.]

Spearman’s $\rho$ patches the linearity assumption for monotone relationships, but still collapses on the symmetric ones (parabola, circle, cross). The 2007 paper specifically calls out rank-based tests as “ineffective for testing nonmonotone types of dependence,” precisely the gap distance correlation closes.
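The gap described above can be checked numerically. The following is a minimal sketch, assuming a noisy parabola on uniform $x$ as the symmetric relationship; the sample size, seed, and noise level of 0.15 are illustrative choices, and `pearson` and `spearman` are helper names introduced here rather than anything from the papers.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

def pearson(x, y):
    """Pearson's r between two 1-D samples."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman's rho as Pearson's r on ranks.

    Argsort-of-argsort ranks are adequate here because continuous
    samples have no ties (an assumption of this sketch).
    """
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))

x = rng.uniform(-1.0, 1.0, size=2000)
y_parabola = x**2 + 0.15 * rng.normal(size=x.size)  # symmetric, non-monotone
y_line = x + 0.15 * rng.normal(size=x.size)         # linear reference case

r_parabola = pearson(x, y_parabola)
rho_parabola = spearman(x, y_parabola)
r_line = pearson(x, y_line)

print(f"parabola: r = {r_parabola:+.3f}, rho = {rho_parabola:+.3f}")
print(f"line:     r = {r_line:+.3f}")
```

Both coefficients should sit near zero on the parabola even though $Y$ is almost a function of $X$, while the linear case gives $r$ close to one.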

2. The 2007 definition

The population coefficient. Take a weighted $L^2$ distance between the joint characteristic function $\phi_{X,Y}$ and the product of marginals $\phi_X\phi_Y$:

$$V^2(X,Y) = \int |\phi_{X,Y}(t,s) - \phi_X(t)\phi_Y(s)|^2 w(t,s)\,dt\,ds$$

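Here $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$. The weight $w(t,s) \propto |t|^{-(1+p)} |s|^{-(1+q)}$ is non-integrable on purpose: with an integrable weight, the standardized coefficient of $(\epsilon X, \epsilon Y)$ tends to $\rho^2$ as $\epsilon \to 0$, so it is not scale invariant and can be arbitrarily close to zero for uncorrelated but dependent variables. Writing $(X',Y')$ and $(X'',Y'')$ for independent copies of $(X,Y)$, and assuming the expectations below are finite (finite second moments suffice), the integral reduces to a compact expected-distance identity: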

$$V^2(X,Y) = \mathbb{E}|X-X'||Y-Y'| + \mathbb{E}|X-X'|\,\mathbb{E}|Y-Y'| - 2\mathbb{E}|X-X'||Y-Y''|$$

which is what the sample statistic estimates. Form the matrix of pairwise distances $a_{ij} = |X_i - X_j|$, double-center it to get $A_{ij}$ (subtract the row mean and the column mean, then add back the grand mean), do the same for $Y$ to get $B_{ij}$, and take the entry-wise mean of the product. The sample distance variance is the special case $V_n^2(X) = V_n^2(X,X)$:

$$V_n^2(X,Y) = {1\over n^2}\sum_{i,j} A_{ij}B_{ij}, \qquad dCor_n = {V_n(X,Y)\over \sqrt{V_n(X)V_n(Y)}}$$

Under finite first moments, $V^2(X,Y) = 0$ iff $X$ and $Y$ are independent, in arbitrary, possibly different, Euclidean dimensions. This is the 2007 paper’s headline theorem; everything else follows from it.
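The sample recipe (pairwise distances, double-centering, entry-wise mean of the product) can be sketched in a few lines of NumPy. This is an illustrative $O(n^2)$ implementation, not the authors' code; the function names `pairwise_dist`, `double_center`, `dcov2`, and `dcor` are introduced here, and the test data are assumptions.

```python
import numpy as np

def pairwise_dist(x):
    """Euclidean distance matrix a_ij = |x_i - x_j| for n samples (rows)."""
    x = np.asarray(x, dtype=float)
    x = x.reshape(len(x), -1)  # treat a 1-D input as n points in R^1
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

def double_center(d):
    """Subtract row means and column means, then add back the grand mean."""
    return d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()

def dcov2(x, y):
    """Sample V_n^2(X, Y): entry-wise mean of the product A * B."""
    A = double_center(pairwise_dist(x))
    B = double_center(pairwise_dist(y))
    return float((A * B).mean())

def dcor(x, y):
    """Sample distance correlation; returns 0 when either distance variance is 0."""
    denom = np.sqrt(dcov2(x, x) * dcov2(y, y))
    if denom <= 0:
        return 0.0
    return float(np.sqrt(max(dcov2(x, y), 0.0) / denom))

rng = np.random.default_rng(0)  # illustrative seed
x = rng.uniform(-1.0, 1.0, size=500)
dcor_parabola = dcor(x, x**2 + 0.15 * rng.normal(size=x.size))
dcor_independent = dcor(x, rng.uniform(-1.0, 1.0, size=x.size))
dcor_self = dcor(x, x)

print(f"parabola: {dcor_parabola:.3f}  independent: {dcor_independent:.3f}  self: {dcor_self:.3f}")
```

The independent case does not print exactly zero: the double-centered statistic is biased upward in finite samples, which is one motivation for the U-centered estimator of Part II.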

For one classical reference point: when $(X,Y)$ is bivariate standard normal with correlation $\rho$, distance correlation has a closed form (Theorem 7). It never exceeds $|\rho|$, and the ratio $R(X,Y)/|\rho|$ bottoms out near $0.891$ as $\rho \to 0$:

[Interactive figure: bivariate normal sample (n = 400) at a chosen population correlation $\rho$ (default +0.50). The plot shows $R^2(\rho)$ and $\rho^2$ as closed form (solid) against empirical $dCor^2$ (dot), with readouts for sample Pearson $r$, closed-form $R$, sample dCor, and $R/|\rho|$.]

Even in the one case where Pearson is the gold standard, distance correlation is a smooth, slightly attenuated version of $|\rho|$. The interest of dCor isn’t this regime; it’s everywhere else, where $\rho$ fails entirely.
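The closed form for the bivariate normal case is easy to evaluate. The sketch below uses the commonly quoted expression for Theorem 7, $R^2 = \big(\rho\arcsin\rho + \sqrt{1-\rho^2} - \rho\arcsin(\rho/2) - \sqrt{4-\rho^2} + 1\big) / (1 + \pi/3 - \sqrt 3)$; it is transcribed here as an assumption and should be checked against the paper before being relied on. The function name is hypothetical.

```python
import numpy as np

# Normalising constant of the closed form (assumed transcription of Theorem 7).
_DENOM = 1.0 + np.pi / 3.0 - np.sqrt(3.0)

def dcor_bivariate_normal(rho):
    """Population distance correlation R for a standard bivariate normal."""
    rho = np.asarray(rho, dtype=float)
    num = (rho * np.arcsin(rho) + np.sqrt(1.0 - rho**2)
           - rho * np.arcsin(rho / 2.0) - np.sqrt(4.0 - rho**2) + 1.0)
    # Clip tiny negative round-off before taking the square root.
    return np.sqrt(np.maximum(num, 0.0) / _DENOM)

rhos = np.array([0.01, 0.25, 0.50, 0.75, 1.00])
R = dcor_bivariate_normal(rhos)
ratio = R / np.abs(rhos)
limit_ratio = 1.0 / (2.0 * np.sqrt(_DENOM))  # limit of R/|rho| as rho -> 0

for r_, R_, q_ in zip(rhos, R, ratio):
    print(f"rho = {r_:.2f}  R = {R_:.4f}  R/|rho| = {q_:.4f}")
print(f"limiting ratio: {limit_ratio:.4f}")
```

A second-order expansion of the numerator gives $\rho^2/4$, which is where the limiting ratio $1/\big(2\sqrt{1+\pi/3-\sqrt3}\big) \approx 0.891$ comes from.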

3. Testing independence

The sample statistic $V_n^2$ is non-negative but has no clean closed-form null distribution. The 2007 paper proved that under independence and finite first moments, $nV_n^2$ converges in distribution to a quadratic form $\sum_j \lambda_j Z_j^2$ in i.i.d. standard normals, with eigenvalues that depend on the distribution of $(X,Y)$. That limit is useful asymptotically, but it depends on the unknown distribution. The practical recommendation is a permutation test: hold $X$ fixed, shuffle the $Y$ values to break any real dependence, recompute the statistic, and report how often the shuffled value equals or exceeds the observed one.
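The permutation procedure can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `centered_dist` and `dcov_permutation_test`, the number of permutations, and the simulated data are all assumptions. It relies on the fact that shuffling the $Y$ sample is the same as applying one permutation to both the rows and columns of the centered matrix $B$, so the distances are computed only once.

```python
import numpy as np

def centered_dist(x):
    """Double-centered Euclidean distance matrix of a sample (rows = observations)."""
    x = np.asarray(x, dtype=float)
    x = x.reshape(len(x), -1)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()

def dcov_permutation_test(x, y, n_perm=199, seed=0):
    """Return (observed V_n^2, permutation p-value) for H0: X and Y independent."""
    rng = np.random.default_rng(seed)
    A, B = centered_dist(x), centered_dist(y)
    observed = (A * B).mean()
    exceed = 0
    for _ in range(n_perm):
        p = rng.permutation(len(B))
        # Permuting rows and columns of B together corresponds to shuffling Y.
        exceed += int((A * B[np.ix_(p, p)]).mean() >= observed)
    # The +1 terms count the observed arrangement as one of the permutations.
    return float(observed), (exceed + 1) / (n_perm + 1)

rng = np.random.default_rng(1)  # illustrative seed
x = rng.uniform(-1.0, 1.0, size=120)
v_dependent, p_dependent = dcov_permutation_test(x, x**2 + 0.2 * rng.normal(size=x.size))
v_independent, p_independent = dcov_permutation_test(x, rng.normal(size=x.size))

print(f"parabola:    V_n^2 = {v_dependent:.4f}, p = {p_dependent:.3f}")
print(f"independent: V_n^2 = {v_independent:.4f}, p = {p_independent:.3f}")
```

With 199 permutations the smallest attainable p-value is $1/200$, so the resolution of the test is set by `n_perm`.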

[Interactive figure: sample (n = 60) from a selectable shape with noise 0.20; histogram of the permutation null for $V_n^2$ with the observed value in red; readouts for observed $V_n^2$ and the permutation p-value.]

Power comparisons in the 2007 paper (Examples 1–3) show the dCov-based test matching Wilks’ LRT on linear Gaussian alternatives and clearly dominating it, along with rank-based tests, on multiplicative-noise and log-quadratic alternatives.

What it replaced

Before 2007, each standard way to test for dependence between two random vectors had its own narrow regime:

Pearson’s $r$ and Wilks’ likelihood-ratio test. Optimal under Gaussianity. Heavy tails or nonlinear coupling break them; the 2007 paper shows Wilks’ LRT with inflated Type-I error on $t_1$-distributed data, and near-zero power on multiplicative-noise alternatives.
Spearman’s $\rho$, Kendall’s $\tau$, Puri-Sen rank correlation. Distribution-free for monotone alternatives. Power flatlines on symmetric non-monotone dependence (parabolas, sinusoids, multiplicative noise), visible directly in § 1 above.
The Mantel test (1967). Permutation correlation between two raw distance matrices. Widely used in ecology, but its statistic does not double-center the matrices, so it is not a consistent test of independence; it can return zero when $X$ and $Y$ are dependent. The 2014 paper compares its partial-dCor test directly against partial Mantel and dominates it.
Hoeffding’s $D$, Blum-Kiefer-Rosenblatt. Genuine if-and-only-if independence coefficients via the joint vs. product CDFs, but defined only for bivariate continuous distributions. They do not extend to vectors in $\mathbb{R}^p \times \mathbb{R}^q$.
Mutual information estimators (Kraskov et al. 2004, kernel-based). Also characterize independence in any dimension, but require bandwidth or k-neighbour tuning and don’t produce a single canonical scalar.
HSIC (Gretton et al. 2005). A near-contemporaneous kernel-based independence criterion, conceptually very close: also an inner product of centered objects in a Hilbert space (an RKHS). The two were later shown to coincide for a particular distance-induced kernel (Sejdinovic et al. 2013).

What distance correlation offers is a single scalar in $[0,1]$ that is parameter-free, defined in arbitrary dimensions, characterizes independence exactly, has a tractable permutation test, and is competitive with the LRT in the Gaussian regime while dominating it elsewhere.

Part II · Partial dCor and dissimilarities (2014)

The 2007 sample statistic $V_n^2 = n^{-2}\sum A_{ij}B_{ij}$ is a bilinear form on double-centered matrices, but it is not a clean inner product on a Hilbert space; it is biased, and there is no natural projection geometry on which to build a partial-correlation analogue. The 2014 paper’s observation is that a slightly different centering rule resolves all of these issues. Replace double-centering with U-centering ($\widetilde A$), and the bilinear form $(\widetilde A \cdot \widetilde B) = {1\over n(n-3)}\sum_{i\ne j}\widetilde A_{ij}\widetilde B_{ij}$ becomes a positive semidefinite inner product on a Hilbert space $H_n$ of centered matrices. The general recipe: pick a centering that removes the nuisance row, column, and grand-mean structure so the residuals’ bilinear form (a) still captures dependence and (b) is PSD. Three things follow:

The inner product is an unbiased estimator of $V^2(X,Y)$ (§ 4 below).
Length, angle, orthogonality, and orthogonal projection become legitimate operations, and partial distance correlation is the cosine of the angle between residuals after projecting out a third matrix (§ 5).
The inner product is invariant to additive shifts of the underlying dissimilarities, so any symmetric zero-diagonal dissimilarity matrix, not just Euclidean distances, plugs straight in (§ 6).

4. The unbiased estimator

Where the double-centered estimator $V_n^2$ divides row and column sums by $n$ and the grand sum by $n^2$, U-centering uses $n-2$ and $(n-1)(n-2)$, the “leave-one-out” counts that make the expectation algebra come out cleanly:

$$\widetilde A_{ij} = a_{ij} - {a_{i\cdot}\over n-2} - {a_{\cdot j}\over n-2} + {a_{\cdot\cdot}\over (n-1)(n-2)}, \qquad i\ne j$$

Diagonal entries are set to zero. Proposition 1: $(\widetilde A \cdot \widetilde B)$ is an unbiased estimator of the population $V^2(X,Y)$. Below: a Monte Carlo comparing the 2007 estimator with its 2014 replacement as $n$ increases.

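[Interactive figure: Monte Carlo means as $n$ increases, for a selectable true relationship and number of replicates per $n$. Series: biased $V_n^2$ (double-centered), unbiased $(\widetilde A\cdot\widetilde B)$ (U-centered), and the long-run truth.]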

With $X \perp Y$ the truth is exactly zero, and the bias of $V_n^2$ is most visible: positive, decaying like $1/n$. The unbiased estimator fluctuates symmetrically around zero (it can be negative; remember, it is an inner product, not a squared length).
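A small Monte Carlo makes the bias comparison concrete. The sketch below assumes independent standard normal $X$ and $Y$ (so the truth is zero), with an arbitrary sample size and replicate count; `dist_matrix`, `u_center`, `u_inner`, and `dcov2_biased` are names introduced for this illustration.

```python
import numpy as np

def dist_matrix(x):
    """Euclidean distance matrix of a sample (rows = observations)."""
    x = np.asarray(x, dtype=float)
    x = x.reshape(len(x), -1)
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

def u_center(d):
    """U-centering: divisors n-2 and (n-1)(n-2), diagonal set to zero."""
    n = d.shape[0]
    u = (d - d.sum(axis=1, keepdims=True) / (n - 2)
           - d.sum(axis=0, keepdims=True) / (n - 2)
           + d.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(u, 0.0)
    return u

def u_inner(a_tilde, b_tilde):
    """Inner product (A~ . B~) = sum_{i != j} A~_ij B~_ij / (n (n - 3))."""
    n = a_tilde.shape[0]
    return float((a_tilde * b_tilde).sum() / (n * (n - 3)))  # diagonal is zero

def dcov2_biased(dx, dy):
    """2007 double-centered statistic V_n^2 from two distance matrices."""
    dc = lambda d: d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()
    return float((dc(dx) * dc(dy)).mean())

rng = np.random.default_rng(1)  # illustrative seed
n, reps = 20, 2000              # illustrative sizes
biased, unbiased = [], []
for _ in range(reps):
    dx = dist_matrix(rng.normal(size=n))
    dy = dist_matrix(rng.normal(size=n))  # drawn independently of x
    biased.append(dcov2_biased(dx, dy))
    unbiased.append(u_inner(u_center(dx), u_center(dy)))

mean_biased, mean_unbiased = float(np.mean(biased)), float(np.mean(unbiased))
print(f"mean V_n^2 (double-centered): {mean_biased:+.4f}")
print(f"mean (A~ . B~) (U-centered):  {mean_unbiased:+.4f}")
```

Individual U-centered values are frequently negative in this setting, which is expected for an unbiased estimate of a quantity whose true value is zero.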

5. The Hilbert-space picture

The U-centered matrices live in a Hilbert space $H_n$ with the inner product above. Once we have an inner product, every concept from Euclidean geometry (lengths, angles, orthogonal projections) transfers automatically. Partial distance correlation is just the standard geometric construction:

Form U-centered matrices $\widetilde A, \widetilde B, \widetilde C$ from $X, Y, Z$.
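Project $\widetilde A$ and $\widetilde B$ onto the orthogonal complement of $\widetilde C$. Call the residuals $P_{Z^\perp}(\widetilde A) = \widetilde A - \alpha\widetilde C$ and $P_{Z^\perp}(\widetilde B) = \widetilde B - \beta\widetilde C$, with $\alpha = (\widetilde A\cdot\widetilde C)/(\widetilde C\cdot\widetilde C)$ and $\beta = (\widetilde B\cdot\widetilde C)/(\widetilde C\cdot\widetilde C)$.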
$R^*(X,Y;Z)$ is the cosine of the angle between the residuals.

The picture below is drawn from actual U-centered inner products: vector lengths and pairwise angles all reflect what the math says. Slide the controls to reshape the data and watch the residual angle, the partial distance correlation, change.

[Interactive figure: sliders for Z → X strength (0.70), Z → Y strength (0.70), and direct X → Y (+0.00), with readouts for R*(X,Y), R*(X,Z), R*(Y,Z), and R*(X,Y;Z).]

Vectors live in $H_n$; they are sketched here in 2D using their true pairwise angles. The angle between $\widetilde A$ and $\widetilde B$ is exact, and the placement of $\widetilde A$ and $\widetilde B$ on either side of $\widetilde C$ is chosen so that the partial cosine is realised.

Legend: $\widetilde A$ (from X), $\widetilde B$ (from Y), $\widetilde C$ (from Z). Dashed arrows are residuals after projecting out $\widetilde C$.

With the direct link at zero, $X$ and $Y$ share only their $Z$ pathway: $\widetilde A$ and $\widetilde B$ both lean toward $\widetilde C$, and once that shared component is subtracted, the residuals are nearly perpendicular, with $R^*(X,Y;Z) \approx 0$. Move the direct slider, and the residuals swing back into alignment. The paper notes that $R^*(X,Y;Z) = 0$ does not in general imply conditional independence; it characterises orthogonality in $H_n$, which is a different condition and not equivalent to conditional independence.

$$R^*(X,Y;Z) = {R^*(X,Y) - R^*(X,Z)R^*(Y,Z)\over \sqrt{(1 - R^*(X,Z)^2)(1 - R^*(Y,Z)^2)}}$$

The familiar partial-correlation formula falls out as Proposition 2 of the paper, a direct consequence of the projection geometry. Note that here $R^*$ plays the role that squared dCor plays elsewhere, so its values are on the scale of $dCor^2$ and can be negative.
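The two routes to partial distance correlation, the projection in $H_n$ and the partial-correlation formula, can be compared directly. The sketch below is an illustration under assumed data ($X$ and $Y$ both driven by a common $Z$ with no direct link); the helper names and the coefficients are hypothetical.

```python
import numpy as np

def u_centered_dist(x):
    """U-centered Euclidean distance matrix of a sample."""
    x = np.asarray(x, dtype=float)
    x = x.reshape(len(x), -1)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    n = len(d)
    u = (d - d.sum(axis=1, keepdims=True) / (n - 2)
           - d.sum(axis=0, keepdims=True) / (n - 2)
           + d.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(u, 0.0)
    return u

def inner(a, b):
    """Inner product on H_n: sum over i != j divided by n (n - 3)."""
    n = len(a)
    return float((a * b).sum() / (n * (n - 3)))

def r_star(a, b):
    """Cosine of the angle between two elements of H_n."""
    return inner(a, b) / np.sqrt(inner(a, a) * inner(b, b))

def pdcor_projection(a, b, c):
    """Cosine between the residuals of a and b after projecting out c."""
    res_a = a - inner(a, c) / inner(c, c) * c
    res_b = b - inner(b, c) / inner(c, c) * c
    return r_star(res_a, res_b)

def pdcor_formula(a, b, c):
    """Same quantity via the partial-correlation formula in terms of R*."""
    rab, rac, rbc = r_star(a, b), r_star(a, c), r_star(b, c)
    return (rab - rac * rbc) / np.sqrt((1.0 - rac**2) * (1.0 - rbc**2))

rng = np.random.default_rng(0)  # illustrative seed
z = rng.normal(size=200)
x = z + 0.5 * rng.normal(size=z.size)  # Z -> X only
y = z + 0.5 * rng.normal(size=z.size)  # Z -> Y only, no direct X -> Y link
A, B, C = u_centered_dist(x), u_centered_dist(y), u_centered_dist(z)

r_xy = r_star(A, B)
pd_proj = pdcor_projection(A, B, C)
pd_form = pdcor_formula(A, B, C)
print(f"R*(X,Y) = {r_xy:.3f}  R*(X,Y;Z): projection {pd_proj:.4f}, formula {pd_form:.4f}")
```

The agreement between the two routes is algebraic, so it holds up to floating-point error for any data where the denominators are nonzero.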

6. Beyond Euclidean distances

In ecology, genetics, and psychometrics, “dissimilarities” often violate the triangle inequality; Bray-Curtis on species counts is a standard example. The paper’s second contribution: distance-correlation methods still work, because U-centering only sees the inner product, not the original dissimilarities.

Two facts make this work. Theorem 2: every element of $H_n$ is the U-centered distance matrix of some Euclidean point configuration in $\mathbb{R}^p$ ($p \le n - 2$), recoverable via classical multidimensional scaling. Lemma 1(iv): U-centering is invariant to adding a constant $c$ to every off-diagonal dissimilarity (a “Cailliez constant” commonly used to force Euclidean embedding). The recovered MDS configuration moves around as $c$ changes, but the inner product, and hence every dCor statistic, does not.
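The shift invariance is easy to verify numerically. The sketch below builds a Bray-Curtis matrix from simulated counts (eight sites and six species are assumptions, as are the Poisson rate and the value of $c$), adds a constant to every off-diagonal entry, and compares the U-centered matrices. The helper names are hypothetical.

```python
import numpy as np

def bray_curtis(counts):
    """Bray-Curtis dissimilarity: sum |u - v| / sum (u + v) for each pair of rows."""
    num = np.abs(counts[:, None, :] - counts[None, :, :]).sum(axis=-1)
    den = (counts[:, None, :] + counts[None, :, :]).sum(axis=-1)
    return num / den

def u_center(d):
    """U-centering with divisors n-2 and (n-1)(n-2); diagonal set to zero."""
    n = d.shape[0]
    u = (d - d.sum(axis=1, keepdims=True) / (n - 2)
           - d.sum(axis=0, keepdims=True) / (n - 2)
           + d.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(u, 0.0)
    return u

def mds_min_eigenvalue(d):
    """Smallest eigenvalue of the classical-MDS matrix -1/2 J D^2 J."""
    n = len(d)
    J = np.eye(n) - np.ones((n, n)) / n
    return float(np.linalg.eigvalsh(-0.5 * J @ (d**2) @ J).min())

rng = np.random.default_rng(0)                  # illustrative seed
counts = rng.poisson(lam=5.0, size=(8, 6)) + 1  # +1 keeps every site non-empty
D = bray_curtis(counts)

c = 20.0                                        # arbitrary additive constant
D_shift = D + c * (1.0 - np.eye(len(D)))        # add c off the diagonal only

U0, Uc = u_center(D), u_center(D_shift)
max_abs_diff = float(np.abs(U0 - Uc).max())
print(f"max |U(D) - U(D + c)| = {max_abs_diff:.2e}")
print(f"min MDS eigenvalue: c = 0 -> {mds_min_eigenvalue(D):.4f}, "
      f"c = {c:g} -> {mds_min_eigenvalue(D_shift):.4f}")
```

The cancellation is exact on paper: the shift contributes $c - 2(n-1)c/(n-2) + nc/(n-2) = 0$ to every off-diagonal entry of $\widetilde A$, so any difference printed is floating-point round-off.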

[Interactive figure: slider for the Cailliez constant $c$ (default +0.00). Panels: Bray-Curtis dissimilarities ($D + c$); U-centered $\widetilde A$, identical for every $c$; classical MDS recovery (2D). Readouts: minimum MDS eigenvalue, $|\widetilde A|^2$ (invariant), and R*(species, env).]

Eight sites with simulated species counts; the second matrix is built from a related environmental gradient. Bray-Curtis dissimilarity is not a true metric (it can violate the triangle inequality) and is not generally Euclidean, so classical MDS produces a negative eigenvalue at $c = 0$ (an exact Euclidean embedding is impossible). Sliding $c$ upward (the Cailliez additive constant) pushes that eigenvalue toward zero and rearranges the embedding, but the U-centered matrix and the R* readout are unchanged, as Lemma 1(iv) guarantees.

7. Network recovery: Pearson, partial, and distance correlation

Bayesian-network skeleton recovery from observational data is a classic use of correlation measures. Here the data come from a known 4-node DAG: $A\to B$, $A\to C$, $B\to D$, $C\to D$. Marginally, $A$ and $D$ are correlated through their two paths even though there is no direct edge between them. Pairwise Pearson correlation puts an edge wherever any two variables co-vary, so it cannot tell the indirect $A\!-\!D$ path from the direct ones. Partial correlation, read off the precision matrix (the inverse covariance), conditions each pair on the other variables and so removes the indirect $A\!-\!D$ association. One caveat: conditioning on all remaining variables includes the collider $D$, which can induce a partial correlation between $B$ and $C$ even though they share no edge. Distance correlation has greater power against non-linear dependence but, like raw Pearson, does not condition out indirect paths.
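The contrast between marginal and partial correlation on this DAG can be reproduced with a linear Gaussian simulation. The edge weights of 0.8, unit noise, sample size, and seed below are assumptions made for illustration, not values taken from the demo.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
N = 5000                        # large N so sampling noise is small

# Linear Gaussian structural equations for A -> B, A -> C, B -> D, C -> D.
A = rng.normal(size=N)
B = 0.8 * A + rng.normal(size=N)
C = 0.8 * A + rng.normal(size=N)
D = 0.8 * B + 0.8 * C + rng.normal(size=N)
data = np.column_stack([A, B, C, D])  # columns 0..3 are A, B, C, D

# Marginal (pairwise) Pearson correlations.
pearson = np.corrcoef(data, rowvar=False)

# Partial correlations given all other variables, from the precision matrix:
# partial_ij = -P_ij / sqrt(P_ii * P_jj).
precision = np.linalg.inv(np.cov(data, rowvar=False))
scale = np.sqrt(np.diag(precision))
partial = -precision / np.outer(scale, scale)
np.fill_diagonal(partial, 1.0)

names = "ABCD"
for i in range(4):
    for j in range(i + 1, 4):
        print(f"{names[i]}-{names[j]}: Pearson {pearson[i, j]:+.3f}  partial {partial[i, j]:+.3f}")
```

In this setup the $A$-$D$ partial correlation should be close to zero while its Pearson value is large. The $B$-$C$ partial correlation comes out clearly negative because $D$ is a common child of both, so thresholding the full precision matrix can add that pair back as an edge.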

[Interactive figure: controls for sample size N (200) and threshold τ (0.16). Panels: true skeleton; Pearson |r| ≥ τ; partial correlation |ρ| ≥ τ; dCor ≥ τ. Readouts: recovered edges for Pearson, partial, and dCor.]

The true skeleton has 4 edges (A–B, A–C, B–D, C–D). The spurious A–D appears in raw Pearson and dCor because $A$ and $D$ share two indirect paths. Partial correlation conditions on $B$ and $C$ and removes it. Stats show recovered edges as TP/FP out of the 6 candidate pairs at the current threshold. Click any edge (true, recovered, or omitted) to see its Pearson / partial / dCor numeric values.