Kernel von Mises Formula of the Influence Function

This paper replaces the old bottleneck—deriving influence functions (IF) by hand for every model—with a data-driven procedure built on kernels and spectral expansions. In particular it eases numerical ill-conditioning that often arises with point-mass perturbations, and through a regularized estimator it aims for both practical computability and theoretical consistency.
[Abstract & Intro] Three-sentence summary + problem
① Classical influence-function computation forces a fresh derivation whenever the model changes, so automation is difficult.
② The traditional approach—poking the distribution with a point mass—makes the response sharp and prone to numerical instability.
③ This paper splits the data into several smooth patterns, computes influence for each, and recombines them so a computer—not hand derivation—can estimate the IF more stably.
Everyday analogy: Imagine a complex hot-pot recipe and you want to know how one piece of firm tofu changes the broth. The old style jabs the pot like a needle, so readings swing wildly. This paper nudges gently in several directions like soft ripples and aggregates the responses—closer to a stable taste meter.
[Background] Essential concepts
- Influence function (IF, $\psi_P$): tells how much $\theta(P)$ moves when $P$ is perturbed slightly.
- Functional parameter $\theta(P)$: the statistical target that takes the whole distribution as input, not a single sample (mean, risk, estimator, etc.).
- Kernel / RKHS: a reproducing kernel Hilbert space, where function smoothness is controlled by the kernel; it offers more stable computational paths than rough point-mass perturbations.
- Spectral decomposition and orthogonal basis $e_j$: expand a complex change into modes and sum them for easier computation and interpretation.
- Pathwise derivative: measure the instantaneous rate of change at $t=0$ along a smooth path such as $P_t^j$.
[Proposed method: core idea]
Rather than perturbing by point masses, the paper computes pathwise derivatives of $\theta$ along path perturbations $P_t^j$ in eigenfunction directions and reconstructs the IF from them. The centerpiece is Theorem 3.3 (spectral von Mises formula), which expresses the IF as a sum of per-mode contributions. A regularization strength $\lambda$ suppresses the blow-up of small-eigenvalue modes and improves computational stability.
[Proposed method: dissecting the key formula]
Core identity:
$$\psi_{P,\lambda}(x)=\lim_{r\to\infty}\sum_{j=1}^{r}\frac{1}{1+2\lambda/\sigma_j}\left[\frac{d}{dt}\theta(P_t^j)\right]_{t=0}e_j(x)$$
Read simply: instead of asking for the total influence of input $x$ on the output in one shot, split the effect into several smooth wave-like modes, compute each contribution, then add. The sum $\sum_{j=1}^{r}$ means split by mode and sum; in practice the infinite sum is truncated to the top $r$ modes, so $r$ is the approximation rank balancing compute cost and accuracy.
The middle factor $\left[\frac{d}{dt}\theta(P_t^j)\right]_{t=0}$ is the instantaneous slope of how $\theta$ reacts when the distribution is nudged along that mode at $t=0$. Large values mean that mode shakes the model strongly. It is multiplied by $e_j(x)$, which records how much of mode $j$ is present in input $x$. So large sensitivity along a mode and a large $e_j(x)$ together inflate that mode's contribution.
The prefactor $\frac{1}{1+2\lambda/\sigma_j}$ is a safety valve (shrinkage). Modes with small $\sigma_j$ are often noise-sensitive and destabilize computation; this factor automatically shrinks their contribution. Increasing $\lambda$ strengthens shrinkage: curves get smoother and variance drops, but if $\lambda$ is too large, important signal is damped and bias can grow. In one line: keep mode-wise sensitivity where it helps, regularize unstable modes, and reconstruct a stable global IF.
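As a concrete illustration, here is a minimal numerical sketch of the truncated, regularized sum for the simplest target, the mean $\theta(P)=\mathbb{E}[X]$. The Gaussian kernel, the bandwidth `gamma`, and the tilt path $dP_t^j=(1+t\,e_j)\,dP$ (under which the pathwise derivative of the mean is $\mathbb{E}[X\,e_j(X)]$) are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def spectral_if_mean(X, lam=1e-2, r=10, gamma=1.0):
    """Sketch: regularized spectral von Mises IF for theta(P) = E[X].

    Assumes the path P_t^j tilts P by (1 + t * e_j), so the pathwise
    derivative of the mean is E[X * e_j(X)]; an illustrative choice.
    """
    n = X.shape[0]
    # Gaussian kernel matrix on the 1-d sample
    K = np.exp(-gamma * (X[:, None] - X[None, :]) ** 2)
    # Eigendecomposition of K/n, the empirical kernel integral operator
    sigmas, U = np.linalg.eigh(K / n)
    order = np.argsort(sigmas)[::-1]        # largest eigenvalues first
    sigmas, U = sigmas[order][:r], U[:, order][:, :r]
    E = np.sqrt(n) * U                      # eigenfunction values at sample points
    psi = np.zeros(n)
    for j in range(r):
        if sigmas[j] <= 0:
            continue
        shrink = 1.0 / (1.0 + 2.0 * lam / sigmas[j])  # regularized weight
        deriv = np.mean(X * E[:, j])                  # pathwise derivative of the mean
        psi += shrink * deriv * E[:, j]               # mode-j contribution
    return psi
```

Each loop iteration is one term of the sum: shrinkage weight times pathwise derivative times $e_j(x)$, evaluated at the sample points.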

Reading the formulas

$\theta(P)$: A functional target that takes a data distribution $P$ and returns the statistic the model cares about (e.g. mean, risk, coefficients). The input is the distribution itself, not a single sample.
$\psi_P$: The influence function (IF). It describes how much $\theta(P)$ changes when $P$ is perturbed slightly; intuitively, a map of each data point's leverage.
$\psi_{P,\lambda}(x)=\lim_{r\to\infty}\sum_{j=1}^{r}\frac{1}{1+2\lambda/\sigma_j}\left[\frac{d}{dt}\theta(P_t^j)\right]_{t=0}e_j(x)$: The paper's spectral von Mises formula. It builds the IF by combining contributions from eigenmodes.
$P_t^j$: The distribution obtained by smoothly shifting $P$ along the $j$-th eigenfunction $e_j$ by amount $t$. Uses a smooth path instead of a sharp point-mass spike for numerical stability.
$\left[\frac{d}{dt}\theta(P_t^j)\right]_{t=0}$: The pathwise derivative: the instantaneous rate at which $\theta$ changes as you move slightly along that direction near $t=0$.
$\frac{1}{1+2\lambda/\sigma_j}$: Regularized shrinkage. Modes with small $\sigma_j$ (often noise-sensitive) are down-weighted to curb instability.
$r$: Low-rank truncation. In practice we replace the infinite sum with the top $r$ modes to control cost.
$\lambda$: Regularization strength. Small $\lambda$ can increase variance; large $\lambda$ can increase bias: a bias–variance knob.
$\sigma_j$: The $j$-th eigenvalue. It measures how much energy or information that mode carries and pairs with the shrinkage factor.
$e_j(x)$: Value of the $j$-th eigenfunction at $x$: how aligned the input $x$ is with that mode.
$\left[\frac{d}{dt}\theta(P_t^j)\right]_{t=0}e_j(x)$: Like a gain pedal: a large pathwise derivative and a large $e_j(x)$ together boost that mode's contribution.
$\sum_{j=1}^{r}(\cdots)$: Aggregates many modes instead of a single sharp perturbation, which helps reconstruct the IF more stably.
[Experiments and results]
The paper builds toy Monte Carlo experiments around the simplest functional target—the mean—to show how the proposed spectral estimator behaves in a computational setting. Two takeaways matter.
First, the bias–variance trade-off via the regularization strength $\lambda$. If $\lambda$ is too small, small-eigenvalue modes dominate and estimates can oscillate (higher variance); if $\lambda$ is too large, important modes are over-suppressed and bias grows. The shrinkage $\frac{1}{1+2\lambda/\sigma_j}$ therefore acts as a practical knob between numerical stability and preserving information.
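To see the knob numerically, one can tabulate the weight $\frac{1}{1+2\lambda/\sigma_j}$ over a hypothetical eigenvalue decay; the eigenvalues below are made up for illustration.

```python
import numpy as np

# How the shrinkage weight 1/(1 + 2*lam/sigma) treats each mode:
# large-eigenvalue modes pass almost untouched, small-eigenvalue modes
# are damped, and raising lam damps everything harder.
sigmas = np.array([1.0, 0.1, 0.01, 0.001])   # hypothetical spectrum
for lam in (1e-4, 1e-2, 1e0):
    w = 1.0 / (1.0 + 2.0 * lam / sigmas)
    print(f"lam={lam:g}: weights={np.round(w, 3)}")
```

Reading the printed rows top to bottom shows exactly the trade-off described above: a small $\lambda$ keeps even the tiny modes (risking variance), while a large $\lambda$ shrinks almost everything (risking bias).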
Second, consistency as the sample size $n$ grows. As more data arrive, the estimator tracks the theoretical IF more closely, in line with results such as Theorem 4.7 in the paper; in plain words, the computer-estimated IF converges toward the mathematically expected IF.
From an engineering angle, this means sensitivity analyses need not be erratic run-to-run; with enough data and tuned regularization they can be reproducibly stable.
[Conclusion and limitations]
The main payoff is moving IF computation from an idiosyncratic pencil-and-paper derivation chore to a repeatable data-and-algorithm pipeline. Kernel-based spectral expansion plus pathwise derivatives supply a common computational frame; Nyström-style eigendecomposition estimates the modes $(\sigma_j, e_j)$, and a regularized weighted sum then reconstructs the IF: a clear implementation storyline.
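As a sketch of the eigendecomposition step, the following shows one common Nyström construction for estimating the top eigenpairs from an $m$-point subsample. The RBF kernel and all parameter values here are illustrative assumptions, not the paper's prescribed settings.

```python
import numpy as np

def nystrom_modes(X, m=50, r=5, gamma=1.0, seed=0):
    """Sketch: Nystrom estimation of the top-r kernel eigenpairs.

    An m-point landmark subsample stands in for the full kernel
    operator; the small m x m eigenproblem is then extended to all
    sample points.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=m, replace=False)
    Z = X[idx]                                       # landmark points
    Kmm = np.exp(-gamma * (Z[:, None] - Z[None, :]) ** 2)
    Knm = np.exp(-gamma * (X[:, None] - Z[None, :]) ** 2)
    vals, vecs = np.linalg.eigh(Kmm / m)             # small eigenproblem
    order = np.argsort(vals)[::-1][:r]               # keep top-r modes
    vals, vecs = vals[order], vecs[:, order]
    # Nystrom extension: approximate eigenfunction values on all of X
    E = (Knm @ vecs) / (np.sqrt(m) * vals)
    return vals, E
```

The extension line evaluates each approximate eigenfunction at every sample point, which is exactly what the regularized weighted sum in the reconstruction step consumes.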
Three practical uses stand out:
(1) flag training points that unduly drive predictions for label-error and outlier triage;
(2) compare how sample influence shifts before and after model updates for debugging;
(3) ground XAI or robust-ML narratives in data-level influence.
The limitations are stated openly. Sharp rates of convergence are still open: consistency is shown, but how fast the estimate approaches the truth needs further theory. Fully automatic pathwise differentiation (tight autodiff integration across diverse models) is another engineering frontier. Treat the paper as a strong milestone for practical IF estimation, not the final word.

Diagram: a stark contrast — limitations vs. proposal

The left block highlights the classical failure mode: point-mass spikes make sensitivity swing wildly. The right pipeline shows the paper’s fix: spectral modes plus regularized weighting rebuilds a smooth, suppressible influence curve—so the gap is hard to miss.
Classical limitation

Point-mass · spikes → volatile, ill-conditioned sensitivity

1) Point-mass perturbation

Large sensitivity swings from spikes

Paper’s proposal

Spectral split → regularized reconstruction → stable IF

2) Spectral decomposition
Per-mode $(\sigma_j, e_j)$
Small-$\sigma_j$ modes are down-weighted
3) Regularized reconstruction
Weighted sum restores a smooth IF
$\frac{1}{1+2\lambda/\sigma_j}$ suppresses noisy modes