Learn / AI Papers / Theory & math / CPAL 2026 / Kernel von Mises Formula of the Influence Function

Kernel von Mises Formula of the Influence Function

This paper replaces the old bottleneck—deriving influence functions (IF) by hand for every model—with a data-driven procedure built on kernels and spectral expansions. In particular it eases numerical ill-conditioning that often arises with point-mass perturbations, and through a regularized estimator it aims for both practical computability and theoretical consistency.
[Abstract & Intro] Three-sentence summary + problem
① Classical influence-function computation forces a fresh derivation whenever the model changes, so automation is difficult.
② The traditional approach—poking the distribution with a point mass—makes the response sharp and prone to numerical instability.
③ This paper splits the data into several smooth patterns, computes influence for each, and recombines them so a computer—not hand derivation—can estimate the IF more stably.
Everyday analogy: Imagine a complex hot-pot recipe and you want to know how one piece of firm tofu changes the broth. The old style jabs the pot like a needle, so readings swing wildly. This paper nudges gently in several directions like soft ripples and aggregates the responses—closer to a stable taste meter.

[Background] Essential concepts (expanded)

For each item: definition → intuition → why the paper needs it.
- Influence function (IF, ψ_P)
Definition: Describes how a functional target θ(P) changes when the data-generating distribution P is perturbed slightly; ψ_P(x) summarizes sensitivity in the direction associated with point x.
Intuition: Like Cook's distance thinking (how much the fit moves if an observation is stressed), but here θ may depend on the whole distribution, not just a finite weight vector.
In this paper: Classical von Mises formulas often use sharp point-mass perturbations, which can make operators ill-conditioned. The spectral rewrite smooths the story.
- Functional parameter θ(P)
Definition: A map from a distribution P to a quantity of interest (mean under P, ERM solution, risk, etc.).
Intuition: If the "world distribution" shifts, the objective itself shifts; the IF studies that distribution-level sensitivity.
In this paper: Motivates derivatives along smooth paths P_t^j rather than one-off spikes.
- Kernel / RKHS
Definition: A kernel k induces a Hilbert space of functions with controlled smoothness (RKHS).
Intuition: Without regularity, responses to perturbations can be spiky; RKHS acts like a VIP club of well-behaved functions so estimates stay stable.
In this paper: Replaces harsh point-mass pushes with kernel-induced smooth directions.
- Spectral decomposition & orthonormal basis e_j
Definition: Eigenvalues σ_j and eigenfunctions e_j decompose an operator; expansions Σ_j (⋯) e_j split a perturbation into modes, then recombine them.
Intuition: A Fourier-like split into components that can be weighted and summed with controlled energy.
In this paper: The sum Σ_{j=1}^r and the weights 1/(1 + 2λ/σ_j) implement mode-wise damping of noisy directions.
- Pathwise derivative
Definition: Along a smooth curve P_t^j, study the derivative [d/dt θ(P_t^j)]_{t=0} instead of an instantaneous mass injection.
Intuition: Measure response to a gentle tilt, not a hammer blow—closer to sensitivity analysis / ODE intuition.
In this paper: Central to the spectral von Mises formula under smooth mode directions.
[Proposed method: core idea]
The paper avoids point-mass perturbation; along path perturbations P_t^j in eigenfunction directions, it computes pathwise derivatives of θ to reconstruct the IF. The centerpiece is Theorem 3.3 (the spectral von Mises formula), which expresses the IF as a sum of per-mode contributions. A regularization strength λ suppresses blow-up of small-eigenvalue modes and improves computational stability.
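To make "pathwise derivative" concrete, here is a minimal numerical sketch (an illustration, not the paper's exact construction): the empirical distribution's weights are tilted along a stand-in mode direction e_j, and d/dt θ(P_t^j) at t = 0 is estimated by a central finite difference, with θ taken to be the mean. The tilt w_i(t) ∝ 1 + t·e_j(x_i) and the step size h are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)      # a sample from P
e_j = np.sin(x)               # stand-in for the j-th eigenfunction at the data points

def theta_along_path(t):
    """theta(P_t^j) for theta = mean, tilting empirical weights along e_j.

    Weights w_i(t) proportional to 1 + t * e_j(x_i) stay positive for small t,
    so P_t^j remains a valid distribution near t = 0."""
    w = 1.0 + t * e_j
    w = w / w.sum()
    return np.sum(w * x)

h = 1e-4                      # finite-difference step (illustrative choice)
d_theta = (theta_along_path(h) - theta_along_path(-h)) / (2 * h)

# Cross-check: for theta = mean under this tilt, the pathwise derivative
# equals the empirical covariance between X and e_j(X).
cov_check = np.mean(x * e_j) - np.mean(x) * np.mean(e_j)
print(d_theta, cov_check)
```

For θ = mean, the gentle tilt produces a finite, well-behaved slope, which is exactly the contrast with a point-mass spike drawn above.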
[Proposed method: dissecting the key formula]
Core identity:
ψ_{P,λ}(x) = lim_{r→∞} Σ_{j=1}^{r} [1 / (1 + 2λ/σ_j)] · [d/dt θ(P_t^j)]_{t=0} · e_j(x)
Read simply: instead of asking for the total influence of input x on the output in one shot, split the effect into several smooth wave-like modes, compute each contribution, then add. The sum Σ_{j=1}^{r} means split by mode and sum; in practice the infinite sum is truncated to the top r modes, so r is the approximation rank balancing compute cost and accuracy.
The middle factor [d/dt θ(P_t^j)]_{t=0} is the instantaneous slope of how θ reacts when the distribution is nudged along that mode at t = 0. Large values mean that mode shakes the model strongly. It is multiplied by e_j(x), which records how much of mode j is present in input x. So large sensitivity along a mode and a large e_j(x) together inflate that mode's contribution.
The prefactor 1/(1 + 2λ/σ_j) is a safety valve (shrinkage). Modes with small σ_j are often noise-sensitive and destabilize computation; this factor automatically shrinks their contribution. Increasing λ strengthens shrinkage: curves get smoother and variance drops, but if λ is too large, important signal is damped and bias can grow. In one line: keep mode-wise sensitivity where it helps, regularize unstable modes, and reconstruct a stable global IF.
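The whole identity can be sketched as a short loop. Everything below is synthetic for illustration (the eigenvalues σ_j, pathwise derivatives d_j, and cosine "eigenfunctions" are placeholder assumptions, not values from the paper); the point is the shape of the computation: shrinkage weight × mode sensitivity × mode value, summed over the top r modes.

```python
import numpy as np

def spectral_if(x, sigmas, mode_derivs, eigenfuns, lam):
    """Truncated, regularized spectral sum:
    psi(x) ~ sum_j 1/(1 + 2*lam/sigma_j) * d_j * e_j(x)."""
    total = 0.0
    for sigma_j, d_j, e_j in zip(sigmas, mode_derivs, eigenfuns):
        weight = 1.0 / (1.0 + 2.0 * lam / sigma_j)  # shrink small-sigma modes
        total += weight * d_j * e_j(x)
    return total

# Synthetic modes: geometrically decaying eigenvalues, cosine "eigenfunctions".
r = 5
sigmas = [2.0 ** -j for j in range(r)]            # sigma_j = 1, 0.5, 0.25, ...
mode_derivs = [1.0 / (j + 1) for j in range(r)]   # stand-in pathwise derivatives
eigenfuns = [(lambda x, j=j: np.cos((j + 1) * x)) for j in range(r)]

print(spectral_if(0.3, sigmas, mode_derivs, eigenfuns, lam=0.1))
print(spectral_if(0.3, sigmas, mode_derivs, eigenfuns, lam=10.0))
```

Raising λ from 0.1 to 10 shrinks every term, hitting the small-σ_j modes hardest, which is exactly the damping behavior of the prefactor.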
Symbol walkthrough (same section)
No separate card—one pass over the spectral formula’s notation.
- θ(P): A functional target that takes a data distribution P and returns the statistic the model cares about (e.g. mean, risk, coefficients). The input is the distribution itself, not a single sample.
- ψ_P: The influence function (IF). It describes how much θ(P) changes when P is perturbed slightly; intuitively, a map of each data point's leverage.
- ψ_{P,λ}(x) = lim_{r→∞} Σ_{j=1}^{r} [1/(1 + 2λ/σ_j)] [d/dt θ(P_t^j)]_{t=0} e_j(x): The paper's spectral von Mises formula. It builds the IF by combining contributions from eigenmodes.
- P_t^j: The distribution obtained by smoothly shifting P along the j-th eigenfunction e_j by amount t. Uses a smooth path instead of a sharp point-mass spike for numerical stability.
- [d/dt θ(P_t^j)]_{t=0}: The pathwise derivative: the instantaneous rate at which θ changes as you move slightly along that direction near t = 0.
- 1/(1 + 2λ/σ_j): Regularized shrinkage. Modes with small σ_j (often noise-sensitive) are down-weighted to curb instability.
- r: Low-rank truncation. In practice the infinite sum is replaced with the top r modes to control cost.
- λ: Regularization strength. Small λ can increase variance; large λ can increase bias: a bias–variance knob.
- σ_j: The j-th eigenvalue. It measures how much energy or information that mode carries and pairs with the shrinkage factor.
- e_j(x): Value of the j-th eigenfunction at x: how aligned the input x is with that mode.
- [d/dt θ(P_t^j)]_{t=0} · e_j(x): Like a gain pedal: a large pathwise derivative and a large e_j(x) together boost that mode's contribution.
- Σ_{j=1}^{r}(⋯): Aggregates many modes instead of a single sharp perturbation, which helps reconstruct the IF more stably.
[Experiments and results]
The paper builds toy Monte Carlo experiments around the simplest functional target—the mean—to show how the proposed spectral estimator behaves in a computational setting. Two takeaways matter.
First, the bias–variance trade-off via the regularization strength λ. If λ is too small, small-eigenvalue modes dominate and estimates can oscillate (higher variance); if λ is too large, important modes are over-suppressed and bias grows. The shrinkage factor 1/(1 + 2λ/σ_j) therefore acts as a practical knob between numerical stability and preserving information.
Second, consistency as the sample size n grows. As more data arrive, the estimator tracks the theoretical IF more closely, in line with results such as Theorem 4.7 in the paper; in plain words, the computer-estimated IF converges toward the mathematically expected IF.
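The shrinkage weights 1/(1 + 2λ/σ_j) make this knob visible directly. With a toy three-mode spectrum (an assumption for illustration, not taken from the paper's experiments), sweeping λ shows the weak modes being suppressed far faster than the leading mode:

```python
import numpy as np

sigmas = np.array([1.0, 0.1, 0.01])  # toy spectrum: one strong mode, two weak ones

def shrinkage(lam, sigmas):
    """Per-mode weight 1 / (1 + 2*lam/sigma_j)."""
    return 1.0 / (1.0 + 2.0 * lam / sigmas)

for lam in (0.001, 0.1, 1.0):
    print(lam, shrinkage(lam, sigmas).round(4))
```

At λ = 0.001 all three weights sit near 1 (little regularization, higher variance risk); at λ = 1 the σ_j = 0.01 mode is almost entirely zeroed out while the leading mode keeps about a third of its weight.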
From an engineering angle, this means sensitivity analyses need not be erratic run-to-run; with enough data and tuned regularization they can be reproducibly stable.
[Conclusion and limitations]
The main payoff is moving IF computation from an idiosyncratic pencil-and-paper derivation chore to a repeatable data-and-algorithm pipeline. Kernel-based spectral expansion plus pathwise derivatives supply a common computational frame; Nyström-style eigendecomposition estimates the modes (σ_j, e_j), then a regularized weighted sum reconstructs the IF: a clear implementation storyline.
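As a rough sketch of that storyline's first step, assuming an RBF kernel and a plain dense eigendecomposition (a real Nyström implementation would subsample landmark points): the eigenvalues of the scaled Gram matrix K/n estimate σ_j, and its scaled eigenvectors estimate the eigenfunction values e_j at the sample points.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=(200, 1))  # sample points from P

def rbf_gram(x, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

n = len(x)
K = rbf_gram(x)
vals, vecs = np.linalg.eigh(K / n)     # eigh returns eigenvalues in ascending order
order = np.argsort(vals)[::-1]
sigmas = vals[order]                   # estimated sigma_j, descending
e_hat = np.sqrt(n) * vecs[:, order]    # e_hat[:, j] ~ e_j at the sample points;
                                       # the sqrt(n) scaling normalizes in empirical L2
print(sigmas[:3])
```

The columns of e_hat are orthonormal in the empirical L2 sense (the mean of e_hat[:, j]**2 is 1), matching the orthonormal-basis role that e_j plays in the spectral formula.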
Three practical uses stand out:
(1) flag training points that unduly drive predictions for label-error and outlier triage;
(2) compare how sample influence shifts before and after model updates for debugging;
(3) ground XAI or robust-ML narratives in data-level influence.
Open limitations remain explicit. Sharp rates of convergence are still open: consistency is shown, but how fast the estimator approaches the truth needs further theory. Fully automatic pathwise derivatives (tight autodiff integration across diverse models) are another engineering frontier. Treat the paper as a strong milestone for practical IF estimation, not the final word.

Diagram: a stark contrast — limitations vs. proposal

The left block highlights the classical failure mode: point-mass spikes make sensitivity swing wildly. The right pipeline shows the paper's fix: spectral modes plus regularized weighting rebuild a smooth, stable influence curve, so the gap is hard to miss.
Classical limitation: point-mass spikes → volatile, ill-conditioned sensitivity.
1) Point-mass perturbation: large sensitivity swings from spikes.

vs.

Paper's proposal: spectral split → regularized reconstruction → stable IF.
2) Spectral decomposition: per-mode pairs (σ_j, e_j); small-σ_j modes are down-weighted.
3) Regularized reconstruction: a weighted sum restores a smooth IF; the factor 1/(1 + 2λ/σ_j) suppresses noisy modes.

Related AI papers

  • AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML
  • AlphaFormer: End-to-End Symbolic Regression of Alpha Factors with Transformers
  • The Curse of Depth in Large Language Models