
AlphaFormer: End-to-End Symbolic Regression of Alpha Factors with Transformers

In quant practice, alpha factors still sit awkwardly between hand-crafted formulas and black-box models. AlphaFormer pre-trains a Transformer on synthetic time series, then—given new market data—emits interpretable symbolic formulas end-to-end. This article dissects the linear alpha pool, IC-based metrics, and PPO-style stabilization line by line.
[Abstract & introduction] Three-line summary + problem statement
Three-line summary
- ① Fatal limitation of prior work: GP- or RL-based symbolic regression must restart search from scratch on every new dataset, barely reusing learned “formula grammar.” It is like reinventing the recipe every morning.
- ② Limits of classical tools: Tree boosters and LSTMs predict well but stay black boxes; fully manual factor design cannot scale the enormous search space.
- ③ Core idea: AlphaFormer pre-trains a Transformer on diverse synthetic price paths, then, given real $X_t$, instantly generates RPN-style alpha formulas—a chef who practiced in many fake kitchens before cooking in a new one.
Analogy: recipe-randomizing robot vs. master chef with grammar in muscle memory
Legacy symbolic search is a robot that re-samples spice ratios from scratch whenever the “kitchen” (market) changes. AlphaFormer pre-trains on synthetic kitchens, learns composition rules, then, seeing real ingredients $X_t$, plates a formula (alpha factor) on the spot—interpretable without giving up predictive pressure.

[Background] Concepts you truly need

Read each item as definition → intuition → role in this paper so the later formulas feel motivated, not arbitrary.
- Alpha factor
Plain-language definition first: Fix a single day (time $t$) and suppose we care about $S$ stocks. Each stock has $d$ numbers (e.g. close, volume, recent returns). An alpha factor is a rule that reads everything at once and prints one “relatively more attractive?” score per stock. The $S$ scores form a vector $z_t$.
Picture a table: $S$ rows (one stock per row) × $d$ columns (one feature type per column). Call the full input $X_t$; then $X_t \in \mathbb{R}^{S \times d}$ means “that day’s stock count × numbers per stock.” The output $z_t \in \mathbb{R}^S$ means component $i$ = score for stock $i$.
Intuition: This is not “track one stock through time only.” It is line up many stocks on the same day and ask who ranks higher today—the cross-section. Long–short and ranking portfolios read those scores to choose longs, shorts, and weights.
In this paper: The generator’s end product is an interpretable symbolic formula implementing such a factor—this definition is our starting point.
- Symbolic regression
Definition: Search for an explicit operator tree (e.g. `mean(close, 20d)`), not only numeric weights.
Intuition: Prefer a human-readable recipe over a black box—important for compliance and risk narratives even though the search space is huge.
In this paper: Contrasts with GP/RL pipelines that cold-start symbolic search on every new dataset.
- RPN (reverse Polish notation)
Definition: Infix (human) sketch: `mean(close, 20d)` = 20-day average of close. The model emits the same meaning as a left-to-right token stream: `close`, then `20d`, then operator `mean`, then delimiter `end` (closes this sub-formula chunk). Those are vocabulary tokens in order, not a programming-array literal—avoid reading `[volume, 20d, mean, end]`-style brackets as data-structure syntax. A stack fixes evaluation order without parentheses.
Intuition: Matches how Transformers autoregress tokens left-to-right.
In this paper: The model emits alpha formulas as RPN token sequences, not as infix math text.
- IC (information coefficient)
Definition: Typically the daily Pearson correlation between predicted scores and realized labels (forward returns).
Intuition: A daily report card on whether predicted ranks line up with realized ranks; Rank IC stresses ordering and is less swayed by outliers.
In this paper: IC is the quality signal for pooling and (optionally) RL-style fine-tuning.
- Synthetic data
Definition: Pre-train on time series fabricated by generative models (GRU, Transformer, diffusion, etc.), often ensembled for diversity.
Intuition: Real tape is noisy and label-scarce; synthetics provide a practice gym to learn operator grammar before touching live markets.
In this paper: Enables grammar pre-training + lighter adaptation on real $X_t$.
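Of the background concepts above, RPN is the easiest to make concrete in code. Below is a minimal stack evaluator for the article's example tokens (`close`, `20d`, `mean`, `end`); the operator set and its handling are an illustrative sketch under my own assumptions, not the paper's interpreter or vocabulary.

```python
import numpy as np

def eval_rpn(tokens, data):
    """Evaluate an RPN token stream with a stack.

    `data` maps feature names (e.g. 'close') to 1-D price arrays.
    Tiny illustrative token subset, not the paper's full vocabulary.
    """
    stack = []
    for tok in tokens:
        if tok == "end":                     # delimiter closing a sub-formula chunk
            continue
        if tok in data:                      # feature operand: push its series
            stack.append(np.asarray(data[tok], dtype=float))
        elif tok.endswith("d") and tok[:-1].isdigit():
            stack.append(int(tok[:-1]))      # window operand like '20d': push the int
        elif tok == "mean":                  # rolling mean over the last w steps
            w = stack.pop()
            x = stack.pop()
            stack.append(np.convolve(x, np.ones(w) / w, mode="valid"))
        else:
            raise ValueError(f"unknown token: {tok}")
    return stack.pop()

# `mean(close, 20d)` as the RPN stream from the article:
close = np.arange(1.0, 41.0)                 # toy price path 1..40
out = eval_rpn(["close", "20d", "mean", "end"], {"close": close})
# first window: mean of 1..20 = 10.5
```

Because operands are pushed and operators pop their arguments, no parentheses are ever needed, which is exactly why the left-to-right stream suits an autoregressive Transformer decoder.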
[Proposed method] Core formulation dissected
1) Alpha pool — mix many formulas
Given $m$ candidate factors $f_k$, aggregate linearly:
$$z_t = g(X_t) = \sum_{k=1}^{m} w_k\, f_k(X_t)$$
- Intuition: each $f_k$ is a “chef”; $w_k$ is vote weight; $g$ is the ensemble head over the pool.
2) Learning pool weights — accelerator vs. L1 brake
$$\mathcal{L}(w) = \frac{1}{ST} \sum_{t=1}^{T} \big\| g(X_t) - y_t \big\|_2^2 + \lambda \|w\|_1$$
- First term (accelerator): $y_t$ is the label vector (e.g. forward return). Minimize mean squared error over stock×time mass $ST$.
- $\lambda\|w\|_1$ (brake / scissors): L1 pushes many $w_k$ to exact zeros, pruning useless factors for a sparse, interpretable pool.
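A minimal sketch of how such pool weights could be fit: proximal gradient descent (ISTA), whose soft-threshold step is what zeroes out weights. The solver choice and the toy data are my assumptions, not the paper's optimizer.

```python
import numpy as np

def fit_pool_weights(F, y, lam=0.1, steps=500, lr=0.01):
    """Minimise (1/n)||F w - y||^2 + lam * ||w||_1 by ISTA.

    F: (n, m) matrix whose columns stack factor scores f_k over all
    stock-day cells; y: (n,) stacked labels. A sketch, not the paper's solver.
    """
    n, m = F.shape
    w = np.zeros(m)
    for _ in range(steps):
        grad = 2.0 / n * F.T @ (F @ w - y)   # gradient of the MSE term
        w = w - lr * grad
        # soft-threshold: the proximal operator of the L1 penalty
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

rng = np.random.default_rng(0)
F = rng.normal(size=(200, 3))                        # 3 candidate factors
y = 0.8 * F[:, 0] + rng.normal(scale=0.1, size=200)  # only factor 0 is useful
w = fit_pool_weights(F, y, lam=0.2)
# L1 shrinkage keeps w[0] large and pushes w[1], w[2] to (near) zero
```

The soft-threshold line is the whole “scissors” story: any weight whose gradient signal stays below the threshold gets clipped to exactly zero, dropping that factor from the pool.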
3) IC as daily taste test — average correlation
$$\bar{\sigma}(g(X), y) = \frac{1}{T} \sum_{t=1}^{T} \sigma\big(g(X_t), y_t\big)$$
$$\bar{\sigma}_{\mathrm{rank}}(g(X), y) = \frac{1}{T} \sum_{t=1}^{T} \sigma_{\mathrm{rank}}\big(g(X_t), y_t\big)$$
- $\sigma$ is Pearson; $\sigma_{\mathrm{rank}}$ is rank correlation—robust when you care about ordering more than raw magnitudes.
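The two averages above are cheap to compute directly. A numpy-only sketch, with rank correlation implemented as Pearson on ranks (ties ignored, which is fine for continuous scores but is an assumption of this toy):

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation of two 1-D arrays."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))

def ranks(v):
    """0..n-1 ranks of v (no tie handling in this sketch)."""
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(len(v))
    return r

def mean_ic(preds, labels, rank=False):
    """Average daily IC over T days: Pearson, or rank correlation for Rank IC."""
    if rank:
        return float(np.mean([pearson(ranks(p), ranks(y))
                              for p, y in zip(preds, labels)]))
    return float(np.mean([pearson(p, y) for p, y in zip(preds, labels)]))

rng = np.random.default_rng(1)
T, S = 50, 30                                     # 50 days, 30 stocks
labels = rng.normal(size=(T, S))                  # toy forward returns
preds = 0.3 * labels + rng.normal(size=(T, S))    # scores with real but weak signal
ic = mean_ic(preds, labels)
rank_ic = mean_ic(preds, labels, rank=True)
# both come out clearly positive: the per-day estimates are noisy,
# but averaging over T days smooths them, which is the point of the bar
```

Note how a daily IC on only 30 names is noisy (standard error roughly $1/\sqrt{S}$), while the $T$-day average is far steadier, which is why the paper reports the averaged metric.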
4) PPO to stabilize the generator — clipping + value head
$$\mathcal{L}(\theta, \phi) = \mathcal{L}^{\mathrm{CLIP}}(\theta) + \eta\, \mathcal{L}^{\mathrm{value}}(\phi)$$
$$\mathcal{L}^{\mathrm{CLIP}}(\theta) = -\, \hat{\mathbb{E}}\Big[ \min\big( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t \big) \Big]$$
$$\mathcal{L}^{\mathrm{value}}(\phi) = \big\| V_\phi(\mathcal{D}, \mathcal{P}) - r \big\|_2^2$$
- $r_t(\theta)$ (ratio): new-policy / old-policy probability ratio; clipped to $[1-\epsilon, 1+\epsilon]$ so updates stay smooth—a seat belt on policy steps.
- $\hat{A}_t$: advantage—how much better than baseline; in this paper’s story, tied to IC-based reward after pool refresh minus value estimate.
- $V_\phi$: critic predicting expected return from data $\mathcal{D}$ and pool state $\mathcal{P}$; $\mathcal{L}^{\mathrm{value}}$ fits observed reward $r$.
- $\eta$: balances actor vs. critic losses.
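The clipped surrogate plus value loss can be written out in a few lines. This is a generic PPO sketch following the equations above, with numpy standing in for an autodiff framework; shapes and the toy inputs are my assumptions, not the paper's implementation.

```python
import numpy as np

def ppo_losses(logp_new, logp_old, adv, value_pred, reward, eps=0.2, eta=0.5):
    """Clipped PPO surrogate plus critic loss: L = L_CLIP + eta * L_value.
    One entry per sampled token/action; eps is the clip width."""
    ratio = np.exp(logp_new - logp_old)              # r_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)   # the "seat belt"
    l_clip = -np.mean(np.minimum(ratio * adv, clipped * adv))
    l_value = np.mean((value_pred - reward) ** 2)    # critic regression to reward
    return l_clip + eta * l_value, l_clip, l_value

# Sanity check: if the new policy equals the old one (ratio = 1), the clip
# is inactive and L_CLIP reduces to -mean(advantage).
adv = np.array([0.5, -0.2, 0.1])
total, l_clip, l_value = ppo_losses(
    logp_new=np.zeros(3), logp_old=np.zeros(3),
    adv=adv, value_pred=np.zeros(3), reward=np.zeros(3))
```

The `min` of the raw and clipped terms means the objective never rewards pushing the ratio beyond $1 \pm \epsilon$ in the profitable direction, which is exactly the stabilization the article describes.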
How to read the symbols (same section)
Every symbol that appears in the four blocks above, in definition → role order—no separate glossary card.
- $X_t$, $S$, $d$: $X_t$ is the feature tensor at time $t$; shaping $X_t \in \mathbb{R}^{S \times d}$ means $S$ stocks and $d$ features per stock.
- $z_t$: in $\mathbb{R}^S$; entry $i$ is the alpha score for stock $i$.
- $f_k$, $m$, $w_k$, $g$: $f_k(X_t)$ is the score vector from candidate formula $k$; $m$ counts candidates in the pool; $w_k$ is how much we trust that formula; $g(X_t)$ is the linearly mixed final predictor.
- $y_t$, $T$, $ST$, $\|\cdot\|_2^2$: $y_t$ is the label vector (e.g. forward return) at $t$. $T$ is the number of days in the fit; $S$ is universe size; $ST$ is stock-day cells, so dividing by $ST$ averages error over the full panel. $\|v\|_2^2$ is the sum of squared components of $v$.
- $\lambda$, $\|w\|_1$: $\|w\|_1 = \sum_{k=1}^m |w_k|$. Raising $\lambda$ pushes more $w_k$ to exact zero, pruning factors (Lasso).
- $\sigma$, $\sigma_{\mathrm{rank}}$, $\bar{\sigma}$, $\bar{\sigma}_{\mathrm{rank}}$: per-day Pearson vs rank correlation between $g(X_t)$ and $y_t$; the overbar averages over $T$ days to smooth daily noise.
- $\theta$, $\phi$, PPO: $\theta$ parameterizes the policy (Transformer) that samples tokens; $\phi$ parameterizes the critic. $r_t(\theta)$ is the new-policy / old-policy probability ratio for the taken action; $\epsilon$ sets the clip interval $[1-\epsilon, 1+\epsilon]$. $\hat{A}_t$ is advantage (how much better than average). $V_\phi(\mathcal{D}, \mathcal{P})$ estimates expected return given data $\mathcal{D}$ and pool state $\mathcal{P}$; $\mathcal{L}^{\mathrm{value}}$ matches that estimate to realized reward $r$. $\eta$ weights the value loss against the policy loss.
One-sentence read
Pre-train grammar on synthetics, generate RPN formulas, linearly pool with Lasso, score with IC, and fine-tune generation with clipped PPO—that is the full loop.
[Toy walkthrough] Mental simulation—step-by-step
To connect each formula above to something concrete, imagine a toy market with only three stocks on one day $t$. Numbers are illustrative, not real quotes.
Setup: Stocks A, B, C. The generator already proposed $m=3$ symbolic factors $f_1, f_2, f_3$. Each $f_k$ outputs one score vector in $\mathbb{R}^3$ (one score per name). Example:
$$f_1 = (1,\, 0,\, {-}1), \quad f_2 = (0.5,\, 1,\, 0), \quad f_3 = (0.2,\, {-}0.3,\, 0.1)$$
(entries are $(A, B, C)$ in order).
1) Linear alpha pool—weighted blend
With weights $w = (0.5,\, 0.3,\, 0.2)$, the pooled score is $z_t = 0.5 f_1 + 0.3 f_2 + 0.2 f_3$. For stock A:
$$z_{t,A} = 0.5 \cdot 1 + 0.3 \cdot 0.5 + 0.2 \cdot 0.2 = 0.69.$$
Similarly $z_{t,B} = 0.24$, $z_{t,C} = -0.48$. So each name’s score is one weighted mix of the three factor rows—the vector $z_t$ is the cross-sectional ranking input for that day.
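The blend above is three lines of numpy; this simply re-runs the toy arithmetic so you can check the numbers yourself:

```python
import numpy as np

# Factor scores for stocks (A, B, C) and pool weights from the toy setup
f1 = np.array([1.0,  0.0, -1.0])
f2 = np.array([0.5,  1.0,  0.0])
f3 = np.array([0.2, -0.3,  0.1])
w = np.array([0.5, 0.3, 0.2])

z_t = w[0] * f1 + w[1] * f2 + w[2] * f3   # linear pool g(X_t)
# z_t -> [0.69, 0.24, -0.48]: A ranks first, C last
```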
2) Labels, MSE, and what L1 does
Let $y_t$ be realized forward returns (same length as $z_t$). The first loss term is roughly the average squared error between $z_t$ and $y_t$ over many stock-days—if predictions are wrong, gradient descent nudges $w$.
The $\lambda\|w\|_1$ term pushes small or redundant weights toward exact zero, which drops whole factors from the pool—like removing a sauce that no longer helps once you taste the full blend.
3) IC—a one-day “ranking exam”
If predicted ranks from $z_t$ line up with return ranks from $y_t$, the daily Pearson correlation $\sigma(g(X_t), y_t)$ moves positive. Say $0.08$ means “a decent day.” Comparing monthly/rolling means $\bar{\sigma}$ near 0.02 vs 0.06, the latter is materially healthier after smoothing noise.
4) PPO—intuition in one step
The policy proposes a new token sequence (a new formula), which changes the pool and IC; that improvement can be turned into reward $r$. The ratio $r_t(\theta)$ measures how aggressively the new policy changes action probabilities vs the old one; clipping caps that ratio so one update cannot jerk the policy too hard. That is what stabilizes learning.
5) Inference / deployment
After pre-training (and optional light RL), you mostly forward-pass new $X_t$ to obtain formulas—not restart a giant GP symbolic search every batch. That is why latency drops vs cold-start mining.
One line: Pool = mixer; L1 = trash unused factors; IC = daily rank exam; PPO = seat belt on policy updates.
[Experiments & results]
- Search efficiency: Strong baselines need far more candidate factors; AlphaFormer reaches top-tier IC / Rank IC on CSI300 & CSI500 with ~one-third the generation budget in the paper’s story—not a wider needle, but a steadier hand.
- Inference efficiency: No massive online parameter re-fit during inference—important for near-real-time stacks.
- Generalization: Ensembling multiple generative architectures for synthetics boosts IC; China-pretrained models zero-shot to US S&P 500 still compete—suggesting partial transfer of time-series / operator grammar, not only venue noise.
Practical read: If you want interpretable factors under GPU-hour budgets, “synthetic pre-train + bounded RL fine-tune” is an attractive MLOps compromise.
[Conclusion & limitations]
Takeaways for practitioners (≤3)
1. White-box signals: RPN / operator trees are easy to share with risk as literal formulas.
2. Lower search tax: Grammar compression means less cold-start symbolic search on every new tape.
3. End-to-end story: generate → pool → IC → (optional) PPO keeps pipelines short and reproducible.
Limitations / future work
- Hardware: GPU-centric training & inference may exclude CPU-only legacy stacks.
- Regimes: Impressive zero-shot transfer still may need retrain or domain adaptation after structural breaks.
- Labels: IC is only as honest as your forward-return definition and leakage controls.

Visualization plan: chaotic search vs. controlled generation

Left: a search-space scatter of trials plus a jagged path that barely approaches the IC goal—cold-start symbolic mining. Right: a single pipeline—synthetic series → pre-training → tokenized formula generation → IC/pool—for AlphaFormer’s end-to-end story.

Legacy: GP / RL symbolic search

Each new dataset restarts wide exploration; many candidates still yield noisy IC paths.

[Chart: IC vs. cumulative gain across trials 1–N; random search, over-exploration]

Proposed: AlphaFormer

Grammar from synthetics; fewer generations lift IC steadily and zero-shot transfer is plausible.

[Chart: IC vs. cumulative gain across trials 1–N with a pre-trained generator; few factors, high IC]
AlphaFormer reframes “restart symbolic search every market” as grammar pre-training + safely clipped RL fine-tuning. Pool, L1, IC, and PPO play roles like mixer, scissors, judges, seat belt. Respect GPU dependence and label hygiene when you pilot.

Related AI papers

  • AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML
  • The Curse of Depth in Large Language Models
  • Kernel von Mises Formula of the Influence Function