
SELA: Tree-Search Enhanced LLM Agents for Automated Machine Learning

Yizhou Chi, Yizhang Lin, Sirui Hong, Duyi Pan, Yaying Fei, Guanghao Mei, Bangbang Liu, Tianqi Pang, Jacky Kwok, Ceyao Zhang, Bang Liu, Chenglin Wu

ICLR 2025 · arXiv:2410.17238

LLM agents often produce low-diversity, suboptimal ML code even after many tries, while classical AutoML is tied to fixed pipelines and lacks flexibility.
MCTS (Monte Carlo Tree Search) builds a tree of decisions/experiments and uses rollouts plus validation scores to choose which branch to try next. UCT-DP tweaks the usual UCT score for picking the next node so that deep, expensive training steps are not crowded out by shallow, cheap exploration.
SELA represents pipelines as such a tree, schedules experiments with MCTS, and uses UCT-DP to prioritize deeper training-heavy branches. Formulas stay as in the paper; each block starts with plain-language notes.

Chapter 1: SELA and tree-search AutoML

Same as above in plain words: MCTS walks the tree using rollouts and validation scores to choose which branch to try next; UCT-DP changes the UCT score used when picking the next node so that deep, expensive training steps are less often pushed aside by shallow exploration.

What is Monte Carlo Tree Search (MCTS)?

In short: future experiments are arranged as a tree, and the same four steps repeat.
- ① Pick (selection): rules like UCT choose which node to visit next.
- ② Grow (expansion): attach a new child node (a new try) if it did not exist.
- ③ Roll (rollout): run code or a simulation on that branch to get a validation score.
- ④ Push up (backpropagation): send that score up to parents to update visit counts and averages.
SELA explores LLM-proposed pipeline branches with these four steps and validation scores.
What is UCT? (Upper Confidence Bound applied to Trees) It is the scoring rule for choosing which sibling child to visit next. It mixes the average reward so far (exploit) with how under‑visited a branch is (explore) in one formula, so you pick the next node by comparing numbers. The paper's UCT-DP tweaks this UCT so deep, training-heavy steps are not crowded out by shallow search.
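The four-step cycle plus UCT selection can be sketched as a minimal loop. This is a toy illustration, not SELA's implementation: the `Node` class, the random `rollout`, and the constant `c_explore = 1.4` are all assumptions made for the demo.

```python
import math
import random

class Node:
    """One experiment configuration in the search tree."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # accumulated validation scores

def uct(node, c_explore=1.4):
    # Standard UCT: average reward (exploit) + under-visited bonus (explore).
    if node.visits == 0:
        return float("inf")  # always try unvisited children first
    return (node.value / node.visits
            + c_explore * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts_iteration(root, rollout):
    # ① Pick: descend by UCT until a leaf.
    node = root
    while node.children:
        node = max(node.children, key=uct)
    # ② Grow: attach a new child (a new experiment variant).
    child = Node(node.name + "+", parent=node)
    node.children.append(child)
    # ③ Roll: run the experiment on that branch, read one validation score.
    score = rollout(child)
    # ④ Push up: propagate the score to the node and all its ancestors.
    while child is not None:
        child.visits += 1
        child.value += score
        child = child.parent
    return score

random.seed(0)
root = Node("root")
for _ in range(20):
    mcts_iteration(root, rollout=lambda n: random.random())  # fake score in [0, 1)
```

After the loop, every rollout's score has been pushed up to the root, so `root.visits` equals the number of iterations and `root.value` holds the running total that the averages are computed from.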

Four steps (one cycle)

① Pick → ② Grow → ③ Roll → ④ Push up
[Figure: a small tree (Root, Branch A, Branch B); a rollout on one branch returns validation score s.]

Purple dashed line: one example path. Repeated runs accumulate scores on each branch.

[Abstract & intro] Three-line summary

Summary
- LLM limits: Code is often low-diversity and fails to converge to a good solution.
- Classical AutoML: Fixed pipelines (e.g., Auto-sklearn-style) resist dynamic reconfiguration when tasks change.
- SELA: Represent pipelines as a tree, schedule experiments with MCTS, use validation scores as feedback. UCT-DP biases search toward deep, training-heavy nodes.
Analogy: Following only the factory service playbook ≈ classical AutoML. Changing suspension, engine map, and tire pressure all at once and doing a single lap ≈ one-shot LLM codegen. SELA is like a race engineer who reads sector times and telemetry (validation scores) and branches on what to tune next.

Chapter 2: Background — five concepts

[Background]

- AutoML: Automate preprocessing, models, hyperparameters; often try–measure–iterate.
- LLM agent: From natural-language task + data summary, generate and run code. SELA splits plan vs code/execute.
- Search space: All feasible preprocessing × model × hyperparameter tuples—too large to exhaust.
- MCTS: Rollouts + statistics on a tree; balances exploration vs exploitation.
- Exploration vs exploitation: Visit rare children vs deepen high-reward paths. UCT-DP adds a bias toward deep, training-heavy nodes.

Chapter 3: Method — formulas with plain notes

[Method] Five steps

Paper formulas unchanged; each step starts with “plain words.”
(1) Insight Proposer
Plain words: The LLM writes down candidate ideas per stage (preprocessing, model choice, …). No code runs yet—only widening options.
Given task text $p$, dataset info $d$, and LLM $M$, for each stage $\tau \in T$ sample candidate ideas.
$$\mathrm{InsightProposer}(p, d, M) \rightarrow \Lambda = \{\, \lambda_i^{\tau} \mid \tau \in T,\ i = 1,\ldots,m \,\}$$
- What the symbols mean: $p$ = what you want to solve; $d$ = data summary; $M$ = LLM; $\Lambda$ = full idea list; $\lambda_i^{\tau}$ = idea $i$ at stage $\tau$.
- Picture: From the task brief ($p$) and data summary ($d$), list candidate experiment ideas per stage; no code run or training job yet, only widening branches.
(2) Plan and code
Plain words: After the tree picks one experiment $c$,
(1) turn it into step-by-step instructions, then
(2) run code on real data and read one validation score $s$.
For configuration $c$, build instructions $I^{\tau}$, run on data $D$, obtain code $\sigma^{\tau}$ and score $s$.
$$E_{\mathrm{plan}}(p, d, c, M) \rightarrow I^{\tau \in T}$$
$$E_{\mathrm{code\&execute}}(I^{\tau \in T}, D, M) \rightarrow (\sigma^{\tau \in T}, s)$$
- What the symbols mean: $c$ = chosen combo; $E_{\mathrm{plan}}$ = plan → instructions; $I^{\tau}$ = tasks per stage; $E_{\mathrm{code\&execute}}$ = codegen + run; $D$ = data; $\sigma^{\tau}$ = code; $s$ = validation score.
- Picture: Top = pipeline / runbook plan; bottom = execute on data and read validation score $s$.
(3) UCT-DP
Plain words: Score for which child to try next. First term ≈ “how good on average so far”; second term ≈ “bonus for rarely visited paths.” UCT-DP tweaks the handling of unvisited nodes so search does not only skim shallow branches and can go deep where training happens.
$$\mathrm{UCTDP}(x) = \frac{v(x)}{n(x)} + \alpha_{\mathrm{explore}} \sqrt{\frac{\ln n_{\mathrm{visits}}(x_{\mathrm{parent}})}{n(x)}}$$
$$n(x) = \begin{cases} \alpha_{\mathrm{unvisited}} & \text{if } n_{\mathrm{visits}}(x) = 0 \\ n_{\mathrm{visits}}(x) & \text{otherwise} \end{cases}$$
- What the symbols mean: $x$ = current node; $v(x)$ = accumulated reward; $n_{\mathrm{visits}}$ = how often visited; $x_{\mathrm{parent}}$ = one level up; the $\alpha$’s = exploration / unvisited knobs.
- Picture: First term = average quality; second = exploration push. UCT-DP keeps denominators small but non-zero for unvisited nodes so deep, promising paths stay competitive.
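The two formulas translate almost line-for-line into code. A minimal sketch; the default α values here are illustrative, not the paper's settings:

```python
import math

def uct_dp(v, n_visits, n_parent_visits,
           alpha_explore=1.4, alpha_unvisited=0.5):
    """UCT-DP score for one child node.

    v: accumulated reward of the node
    n_visits: how many times the node was visited
    n_parent_visits: visit count of its parent

    alpha_unvisited replaces a zero visit count, so an unvisited node gets
    a large-but-finite score instead of infinity; deep, training-heavy
    branches stay competitive without dominating the search.
    """
    n = alpha_unvisited if n_visits == 0 else n_visits
    return v / n + alpha_explore * math.sqrt(math.log(n_parent_visits) / n)
```

With `alpha_unvisited < 1`, an unvisited sibling scores above a once-visited one under a shared parent, which is exactly the "small but non-zero denominator" effect described above.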
(4) NS
Plain words: Datasets use different scales. NS maps raw metrics so higher always means better before comparing benchmarks.
$$\mathrm{NS}(s_{\mathrm{raw}}) = \begin{cases} \dfrac{1}{1 + \log(1 + s_{\mathrm{raw}})} & \text{lower-is-better (e.g., RMSE)} \\ s_{\mathrm{raw}} & \text{higher-is-better (e.g., F1)} \end{cases}$$
- What $s_{\mathrm{raw}}$ means: the raw validation number your model produced.
(5) Rescaled NS
Plain words: Fix SELA’s NS = 1; other methods are ratios. > 1 means that method beat SELA on normalized score (for that setup).
$$\mathrm{RescaledNS}(f) = \frac{\mathrm{NS}_f}{\mathrm{NS}_{\mathrm{SELA}}}$$
One sentence: Build $\Lambda$ → pick $c$ → run for $s$ → move on the tree with UCT-DP → compare with NS.
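As a check on the two scoring rules, here is a small sketch. Assumptions: `log` is the natural log (the formula as quoted does not pin the base), and the metric values are made up.

```python
import math

def ns(s_raw, lower_is_better):
    """Normalized score: map any metric so higher always means better."""
    if lower_is_better:               # e.g., RMSE
        return 1.0 / (1.0 + math.log(1.0 + s_raw))
    return s_raw                      # e.g., F1, accuracy

def rescaled_ns(ns_method, ns_sela):
    """Ratio against SELA; > 1 means the method beat SELA on that setup."""
    return ns_method / ns_sela

# Made-up numbers: RMSE 2.5 (lower is better), F1 0.72 (higher is better).
ns_rmse = ns(2.5, lower_is_better=True)
ns_f1 = ns(0.72, lower_is_better=False)
ratio = rescaled_ns(0.65, ns_f1)      # baseline NS 0.65 vs SELA NS 0.72
```

With these made-up numbers the ratio comes out below 1, matching the reading that a sub-1 ratio favors SELA.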

Chapter 4: Toy walkthrough

[Toy simulation]

Numbers are illustrative—only the flow matters.
Frame 1 — Proposer
Plain words: From “binary classification + missing values” the LLM still outputs a short list of preprocessing ideas.
$p$ = binary classification, $d$ = tabular with missing values → $\Lambda$ contains imputation / scaling ideas.
Frame 2 — MCTS picks $c$
Plain words: Search picks one combo, e.g. scaling + logistic regression, and walks that branch.
e.g., scaling + logistic regression.
Frame 3 — $E_{\mathrm{plan}}$
Plain words: Turn the combo into a concrete pipeline order.
Fill `SimpleImputer` → `StandardScaler` → `LogisticRegression`.
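Frame 3's pipeline can be written out with scikit-learn. A sketch under assumptions: the synthetic data, the ~10% missingness, and the mean imputation strategy are all invented for the demo, not taken from the paper.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic tabular task with missing values (illustrative data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # simple linear signal
X[rng.random(X.shape) < 0.1] = np.nan     # inject ~10% missing cells

# The combo picked in Frame 2: impute → scale → logistic regression.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
pipe.fit(X_tr, y_tr)
s_raw = f1_score(y_val, pipe.predict(X_val))  # the validation score s
```

The single number `s_raw` is what Frame 4 reads off and what UCT-DP pushes back up the tree in Frame 5.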
Frame 4 — Execute
Plain words: Run code; read one validation score; here assume F1 $= 0.72$.
Suppose validation F1 $s_{\mathrm{raw}} = 0.72$ → NS $= 0.72$ for higher-is-better.
Frame 5 — UCT-DP update
Plain words: Feed the score back so this node competes with siblings next time; deep nodes stay fair thanks to unvisited handling.
Add reward to $v(x)$, increment visits; compare $\mathrm{UCTDP}$ among siblings.
Frame 6 — Rescaled NS
Plain words: SELA = 1; other method ÷ SELA. Below 1 here means SELA ahead.
If SELA NS $= 0.72$ and baseline NS $= 0.65$, the ratio is $\approx 0.90$ ($< 1$ favors SELA).

Chapter 5: Experiments

[Results]

On 20 ML datasets (per the arXiv abstract), SELA reports roughly a 65%–80% win rate against each baseline, a consistent advantage. MCTS beats random search, and more rollouts generally improve scores, which is useful when budgeting API calls and time.

Chapter 6: Conclusion & visualization guide

[Conclusion]

Takeaways (≤3)
1. Strong AutoML baselines without hand-picking every step.
2. Cache rollouts to cut API/GPU cost.
3. Tree logs explain which branch was taken.
Limits: Generalizing to robotics/SWE; larger search spaces need better sample efficiency; clearer interpretability needs UI work.

[Diagram] Summary

- Classic: Linear, one-shot flows; with weak feedback loops they may never reach the target quality.
- SELA: MCTS + UCT-DP on a tree, updated with validation scores. The left/right panels below only sketch this contrast.

At-a-glance comparison

Left: fixed order and one-shot generation—feedback can be weak. Right: tree search that uses validation scores to choose branches. Simplified figure below.

Baseline: fixed pipeline / one-shot generation

One-shot full pipelines or rule-only flows have weak feedback loops; points may not converge to a good score.

[Figure: an experiment-difficulty axis runs Data prep → Choose model → Train & validate; a one-shot "all-in-one code" path stops early with a low validation score, tracing a scattered, hard-to-control trial path from Start toward the target quality.]

SELA: tree search + UCT-DP

Branch per stage, update average reward from validation scores, and bias toward deep training when promising.

[Figure: an insight pool Λ (LLM) feeds controlled experiments via MCTS rollouts; branches per stage grow from Start, validation scores feed back upward to update running averages, a weaker branch falls behind, and the search converges on the tree close to the target score.]

SELA places LLM ideas on a tree and uses MCTS; UCT-DP avoids wasting time on shallow branches before real training. NS = “make every dataset’s score point the same way (higher = better)”; Rescaled NS = “set SELA to 1 and see others as multiples.” Caching and logs help cost and explainability. One line: search is experiment with feedback.