
AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML

Patara Trirat, Wonyong Jeong, Sung Ju Hwang

ICML 2025

AutoML-Agent goes beyond "helping with AutoML": it automates the whole loop, from data retrieval, preprocessing, model design, and HPO to code generation and deployment, using a multi-agent LLM framework. This article dissects the paper's core math (input → planning → decomposition → execution → verification) line by line.
[Abstract & Introduction] Three-line summary + problem statement
3-line summary:
- Fatal problem: Many AutoML tools are powerful but hard to set up; without expertise you struggle to even start.
- Classical limitation: LLM-based attempts often cover only a slice of the pipeline (e.g., preprocessing only) and use planning too shallowly, wasting exploration.
- Core fix and benefit: AutoML-Agent uses specialized multi-agent collaboration, Retrieval-Augmented Planning to generate better candidate plans, and multi-stage verification to ensure deployment-ready code.
Analogy:
- Traditional AutoML is like a meal kit: ingredients are there, but you still manage cooking order and “heat.”
- Some LLM helpers are like a toaster that occasionally reads a recipe—useful, but it does not cook and serve end-to-end.
- AutoML-Agent is the 5-star hotel service: multiple “kitchen roles” (data/model/implementation) collaborate so that one menu request becomes a complete pipeline from ingredients to serving.
Now let’s turn that full-loop automation into equations and steps.
[Background] Concepts you must know
These are the 5 ideas you need to read the math without getting lost.
- Full-Pipeline AutoML
- Definition: Automating the *entire* chain—data retrieval/selection, preprocessing, model design, HPO, code generation, and deployment.
- Why it matters: even perfect model tuning fails if the data pipeline is broken.
- Multi-Agent System
- Definition: Instead of forcing one LLM to do everything, the system splits roles (manager/planner/executor). Agents share intermediate outputs.
- Intuition: you get fewer “single-person bottlenecks” and clearer ownership.
- Retrieval-Augmented Planning (RAP)
- Definition: Planning is strengthened with retrieved external knowledge (papers, code repos, Kaggle-style examples), not only internal LLM memory.
- Core effect: it makes plan search *more efficient*.
- Plan Decomposition & Parallel Sub-Tasks
- Definition: Split a big plan into data and model sub-tasks (and implementation when needed) so parts can run in parallel with minimal dependencies.
- Intuition: prep and cooking proceed together, instead of waiting for one whole recipe.
- Multi-Stage Verification
- Definition: The system checks success progressively: does the code run, is the performance good, and is the solution deployable.
- Why it matters: “looks right” is not enough for real engineering.
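The staged-check idea can be sketched as a minimal gate: an artifact advances only if every stage passes. The stage names, the artifact fields, and the `verify` helper below are illustrative assumptions, not the paper's API.

```python
# Illustrative sketch of multi-stage verification (field names are assumptions):
# an artifact advances to deployment only if every stage passes in order.

def runs_ok(artifact):
    # Stage 1: did the generated code execute without errors?
    return artifact.get("exit_code") == 0

def performs_ok(artifact, max_error=0.3):
    # Stage 2: is the measured error within an acceptable budget?
    return artifact.get("error", 1.0) <= max_error

def deployable(artifact):
    # Stage 3: are deployment requirements (e.g. a serving entry point) met?
    return artifact.get("has_serving_entrypoint", False)

def verify(artifact):
    """Return the first failed stage name, or None if all stages pass."""
    for name, check in [("execution", runs_ok),
                        ("performance", performs_ok),
                        ("deployment", deployable)]:
        if not check(artifact):
            return name
    return None

candidate = {"exit_code": 0, "error": 0.25, "has_serving_entrypoint": True}
print(verify(candidate))  # prints None -> all stages passed
```

Returning the first failed stage (rather than a bare boolean) mirrors why progressive checks help: the system learns *where* a candidate broke, not just *that* it broke.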
With these pieces in place, the equations become a map.
[Proposed Method] Core formulation & perfect math dissection
This section treats AutoML-Agent as a mathematical pipeline: input → plan → decomposition → execution → final implementation.
Core formulation (the paper’s math-shaped story):
- Given a user instruction $I$, AutoML-Agent standardizes it into $R$.
- Using RAP, it generates a set of candidate plans $P$.
- For each plan $p_i$, it decomposes the work into data and model parts (with optional implementation), producing results $O_i$.
- Finally it selects $O^{*}$ and converts it into a deployable system $\mathcal{M}^{*}$.

(1) Prompt Parsing
$$R = \mathcal{A}_{p}(I)$$
- Intuition: $I$ is natural language. $\mathcal{A}_p$ converts it into an execution-friendly structure.
- Symbols here: $I$ is the user's natural-language instruction; $R$ is the standardized request after parsing; $\mathcal{A}_p$ (Prompt Agent) maps $I \to R$.
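The $I \to R$ mapping can be illustrated with a toy stand-in for $\mathcal{A}_p$. In the paper this agent is an LLM; the keyword heuristic, field names, and `parse_prompt` helper below are hypothetical, chosen only to show the shape of the transformation.

```python
# Toy stand-in for the Prompt Agent A_p: the real agent is an LLM;
# this keyword heuristic only illustrates the I -> R mapping.

def parse_prompt(instruction: str) -> dict:
    text = instruction.lower()
    task = ("classification" if "classif" in text
            else "regression" if "regress" in text
            else "unknown")
    return {
        "task": task,
        "modality": "image" if "image" in text else "tabular",
        "avoid_failures": "fail" in text or "runtime" in text,
        "deliverable": "deployable code" if "deploy" in text else "model",
    }

I = ("Find a dataset for image classification, preprocess it, train a "
     "fast model, and output deployable code, while avoiding runtime failures.")
R = parse_prompt(I)
print(R)
```

The point is the output shape: a free-form sentence becomes a structured request that downstream agents can consume without re-interpreting natural language.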
(2) RAP-based Candidate Plan Generation
$$P = \{p_{1}, \dots, p_{P}\} = \mathcal{A}_{mgr}(\mathrm{RAP}(R))$$
- Intuition: the manager uses retrieved hints to propose multiple plans, so exploration is guided.
- Symbols here: $P$ is the candidate plan set $\{p_1,\dots,p_P\}$; $p_i$ is the $i$-th plan; $\mathrm{RAP}(R)$ augments $R$ with retrieved knowledge; $\mathcal{A}_{mgr}$ (Manager Agent) outputs those candidates.
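A minimal sketch of the retrieve-then-plan step, under stated assumptions: the tiny corpus, the word-overlap scoring, and the `manager_plans` stub are all illustrative, not the paper's retrieval setup.

```python
# Minimal Retrieval-Augmented Planning sketch: score knowledge snippets
# against the request, then let a stubbed manager propose one candidate
# plan per retrieved hint. Corpus and scoring are illustrative assumptions.

KNOWLEDGE = [
    "image classification: start small, use a lightweight CNN backbone",
    "class imbalance: resample first, then use a steadier schedule",
    "time series forecasting: check stationarity before model choice",
]

def retrieve(request: dict, corpus, k=2):
    query = {request["task"], request["modality"]}
    def overlap(doc):
        # Crude relevance: shared words between query fields and the snippet.
        return len(query & set(doc.replace(":", "").split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def manager_plans(request, hints):
    # Stub for A_mgr: one candidate plan per retrieved hint.
    return [f"plan {i + 1}: {h.split(':')[1].strip()}"
            for i, h in enumerate(hints)]

R = {"task": "classification", "modality": "image"}
P = manager_plans(R, retrieve(R, KNOWLEDGE))
print(P)
```

Even this crude version shows the claimed efficiency effect: plans are seeded from relevant prior knowledge instead of sampled blindly from the LLM's memory.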
(3) Plan Decomposition & Execution
For each $p_i$:
- Data agent:
$$s_{i}^{d} = \mathrm{PD}(R, \mathcal{A}_{d}, p_{i}), \qquad O_{i}^{d} = \mathcal{A}_{d}(s_{i}^{d})$$
- Model agent:
$$s_{i}^{m} = \mathrm{PD}(R, \mathcal{A}_{m}, p_{i}, O_{i}^{d}), \qquad O_{i}^{m} = \mathcal{A}_{m}(s_{i}^{m})$$
- Intuition: $s$ is a state/summary of "what to do next," and $O$ is the agent's artifact.
- Symbols here: $\mathrm{PD}(\cdot)$ decomposes a plan into states and sub-tasks; $\mathcal{A}_d$ and $\mathcal{A}_m$ are the Data and Model agents; $s_i^d$, $s_i^m$ and $O_i^d$, $O_i^m$ are their states and outputs (preprocessing vs. model/HPO artifacts).
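The dependency structure can be sketched in a few lines: within one plan, the model sub-task waits for the data artifact $O_i^d$, but different plans can run concurrently. The agent bodies here are hypothetical stubs standing in for LLM calls.

```python
# Sketch of plan decomposition with cross-plan parallelism: within a plan
# the model step depends on the data artifact O_i^d, but plans themselves
# run concurrently. Agent bodies are illustrative stubs, not real agents.

from concurrent.futures import ThreadPoolExecutor

def data_agent(plan):        # A_d: produces the data artifact O_i^d
    return {"plan": plan, "splits": "train/val/test ready"}

def model_agent(plan, o_d):  # A_m: consumes O_i^d, produces O_i^m
    return {"plan": plan, "model": "lightweight-cnn", "data": o_d["splits"]}

def run_plan(plan):
    o_d = data_agent(plan)        # s_i^d -> O_i^d
    o_m = model_agent(plan, o_d)  # s_i^m -> O_i^m (needs O_i^d first)
    return o_m

plans = ["p1: small data + light model", "p2: resample + steady schedule"]
with ThreadPoolExecutor() as pool:
    outputs = list(pool.map(run_plan, plans))
print([o["plan"] for o in outputs])
```

This is the "prep and cooking proceed together" intuition from the background section: the sequential dependency lives inside each plan, and the parallelism lives across plans.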
(4) Final Implementation
$$\mathcal{M}^{*} = \mathcal{A}_{o}(I^{*})$$
- Intuition: once the best result $O^{*}$ is chosen, the Operation Agent generates deployment-ready code.
- Symbols here: $\mathcal{M}^{*}$ is the final deployable system; $O^{*}$ is the selected best outcome among candidates; $I^{*}$ is the input/settings for implementation (typically mapped from $O^{*}$); $\mathcal{A}_o$ (Operation Agent) turns this into code.
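The selection-then-implementation hand-off can be sketched as follows; the scoring field, `select_best`, and the `operation_agent` stub are hypothetical, standing in for the verified comparison and the real code-generating agent.

```python
# Hypothetical selection step: among verified candidate outcomes, pick
# O* by score, then hand the Operation Agent A_o its input I* (here the
# chosen outcome itself). The agent stub just emits a placeholder string.

def select_best(outcomes):
    # O* = argmax over verified candidate outcomes.
    return max(outcomes, key=lambda o: o["score"])

def operation_agent(impl_input):
    # A_o stub: in the real system this generates deployable code.
    return f"# deployable pipeline for {impl_input['model']}"

outcomes = [{"model": "light-cnn", "score": 0.78},
            {"model": "resampled-cnn", "score": 0.85}]
best = select_best(outcomes)  # O*
code = operation_agent(best)  # M* = A_o(I*)
print(code)
```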

(5) Evaluation: Comprehensive Score (CS)
$$CS = 0.5 \times SR + 0.5 \times NPS, \qquad NPS = \frac{1}{1+s}$$
- Intuition: blend "runs successfully" ($SR$) with "how good it is" ($NPS$).
- Symbols here: $CS$ combines the two; $SR$ is the success rate (run + deploy); $NPS$ is normalized performance computed from the error $s$ (smaller is better) via $NPS = 1/(1+s)$.
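The score is simple enough to compute directly from the two definitions above:

```python
# The Comprehensive Score from the definitions above:
# CS = 0.5 * SR + 0.5 * NPS, with NPS = 1 / (1 + s).

def nps(error: float) -> float:
    # Normalized performance: maps error s in [0, inf) to (0, 1].
    return 1.0 / (1.0 + error)

def comprehensive_score(success_rate: float, error: float) -> float:
    return 0.5 * success_rate + 0.5 * nps(error)

print(round(comprehensive_score(0.9, 0.25), 2))  # 0.85
```

Note the asymmetry this creates: a pipeline that never crashes but performs poorly and one that performs well but often fails can land on the same $CS$, which is exactly the trade-off the metric is designed to expose.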
[Math Working Simulation] Toy Data Walkthrough
We simulate the pipeline with small numbers (illustrative).
Setting:
- $I$: "Find a dataset for image classification, preprocess it, train a fast model, and output deployable code, while avoiding runtime failures."
Frame 1: Prompt Parsing
- $R=\mathcal{A}_p(I)$ gives structured constraints like task=classification and a failure-avoidance flag.
Frame 2: RAP generates candidates
- $P=\mathcal{A}_{mgr}(\mathrm{RAP}(R))$ produces two candidate plans:
- $p_1$: start with a small dataset, efficient preprocessing + lightweight model.
- $p_2$: handle class imbalance first with resampling and a steadier training schedule.
Frame 3: Data agent (for each plan)
- For $p_1$: $s_1^d=\mathrm{PD}(R,\mathcal{A}_d,p_1)$, then $O_1^d=\mathcal{A}_d(s_1^d)$ (split/augmentation/dataloader-ready outputs).
- Similarly obtain $O_2^d$ for $p_2$.
Frame 4: Model agent
- The model agent reads each $O_i^d$:
- $s_i^m=\mathrm{PD}(R,\mathcal{A}_m,p_i,O_i^d)$
- $O_i^m=\mathcal{A}_m(s_i^m)$ gives the model choice and HPO proposals.
Frame 5: Selection & implementation
- The manager compares verification results and chooses $O^{*}$ (say $O^{*}=O_2$).
- The Operation Agent generates $\mathcal{M}^{*}=\mathcal{A}_o(I^{*})$ (deployable training + inference code).
Frame 6: CS score (simple version)
- Success rate $SR=0.9$.
- If the error is $s=0.25$, then $NPS=\frac{1}{1.25}=0.8$.
- So
$$CS=0.5\times0.9+0.5\times0.8=0.85$$
One-line takeaway: RAP guides planning, decomposition enables parallel work, and verification closes the loop into deployable outcomes.
[Experiments & Results]
The paper evaluates end-to-end automation across 14 datasets grouped into 7 task/modality settings.
- Dataset snapshot:

| Modality | Task | Example datasets |
| --- | --- | --- |
| Image | Classification | Butterfly Image, Shopee-IET |
| Text | Classification | Ecommerce Text, Textual Entailment |
| Tabular | Classification | Banana Quality, Software Defects |
| Tabular | Regression | Crab Age, Crop Price |
| Tabular | Clustering | Smoker Status, Higher Education Students Performance |
| Time Series | Forecasting | Weather, Electricity |
| Graph | Node Classification | Cora, Citeseer |
Additional tabular datasets (for comparison):
- Smoker Status (Binary): predicts whether a person smokes (binary classification).
- Click Prediction Small: predicts ad click/CTR (binary classification).
- MFeat Factors: a tabular benchmark built from multiple factor features for ML evaluation.
- Wine Quality White: regression task predicting white-wine quality from chemical measurements.
- Colleges: tabular dataset using student/college attributes for admission/performance prediction.
- House Prices: regression benchmark predicting house sale prices from property features.
Key experimental results (numbers):
- Constraint-aware success rate: average 87.1%.
- Search speed: about 8x faster than SELA (MCTS).
- Time & cost efficiency: average 525 seconds and about $0.30 cost (based on GPT-4o).
The real implication: if you want interpretable engineering outcomes under budget limits, full-pipeline automation with retrieval-guided planning and verification is a compelling recipe.
[Conclusion & Limitations]
Final meaning & practical value (≤3):
1. Full-pipeline mindset: defines AutoML as a continuous pipeline, not a single step.
2. RAP + multi-agent: turns plan search from single-shot generation into guided candidate exploration.
3. Verification-first reliability: reduces the typical LLM failure mode—“looks right, but breaks.”
Limitations / Future work:
- Skeleton/template reliance: genuinely novel tasks may still require stronger base templates.
- Backbone LLM dependency: stronger LLMs usually produce better plans and code.
- Metric sensitivity: success/performance depends on how SR/NPS (and the verification criteria) are defined.
Finally, a single orchestration diagram summarizes the full pipeline.

[Diagram] Full-pipeline orchestration board

One flowchart panel: standardize the user instruction into $R$, strengthen planning with RAP, run data → model → code stages on decomposed sub-tasks in parallel, then advance only verified artifacts to deployment.

Panel: Full-Pipeline Control (AutoML-Agent: RAP + multi-agent + multi-stage verification).

Flow: user task (natural language) → parsed request (structured for tools) → RAP → data → model → code → checks → deploy.
Standardize the request $R$, generate candidate plans with RAP, decompose into data/model tasks, run them in parallel, and verify until deployment-ready.
AutoML-Agent treats automation as an end-to-end system: RAP accelerates planning, decomposition enables parallel execution, and multi-stage verification locks reliability. So even with long math, the whole story compresses into one flow: input standardization → candidate plans → parallel execution → deployable final code.
