
AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML

Patara Trirat, Wonyong Jeong, Sung Ju Hwang

ICML 2025

AutoML-Agent goes beyond "helping with AutoML": it automates the whole loop, from data retrieval, preprocessing, model design, and HPO to code generation and deployment, using a multi-agent LLM framework. This article dissects the paper's core math (input → planning → decomposition → execution → verification) line by line.
[Abstract & Introduction] Three-line summary + problem statement
3-line summary:
- Fatal problem: Many AutoML tools are powerful but hard to set up; without expertise you struggle to even start.
- Classical limitation: LLM-based attempts often cover only a slice of the pipeline (e.g., preprocessing only) and use planning too shallowly, wasting exploration.
- Core fix and benefit: AutoML-Agent uses specialized multi-agent collaboration, Retrieval-Augmented Planning to generate better candidate plans, and multi-stage verification to ensure deployment-ready code.
Analogy:
- Traditional AutoML is like a meal kit: ingredients are there, but you still manage cooking order and “heat.”
- Some LLM helpers are like a toaster that occasionally reads a recipe—useful, but it does not cook and serve end-to-end.
- AutoML-Agent is the 5-star hotel service: multiple “kitchen roles” (data/model/implementation) collaborate so that one menu request becomes a complete pipeline from ingredients to serving.
Now let’s turn that full-loop automation into equations and steps.
[Background] Concepts you must know
These are the 5 ideas you need to read the math without getting lost.
- Full-Pipeline AutoML
- Definition: Automating the *entire* chain—data retrieval/selection, preprocessing, model design, HPO, code generation, and deployment.
- Why it matters: even perfect model tuning fails if the data pipeline is broken.
- Multi-Agent System
- Definition: Instead of forcing one LLM to do everything, the system splits roles (manager/planner/executor). Agents share intermediate outputs.
- Intuition: you get fewer “single-person bottlenecks” and clearer ownership.
- Retrieval-Augmented Planning (RAP)
- Definition: Planning is strengthened with retrieved external knowledge (papers, code repos, Kaggle-style examples), not only internal LLM memory.
- Core effect: it makes plan search *more efficient*.
- Plan Decomposition & Parallel Sub-Tasks
- Definition: Split a big plan into data and model sub-tasks (and implementation when needed) so parts can run in parallel with minimal dependencies.
- Intuition: prep and cooking proceed together, instead of waiting for one whole recipe.
- Multi-Stage Verification
- Definition: The system checks success progressively: does the code run, is the performance good, and is the solution deployable.
- Why it matters: “looks right” is not enough for real engineering.
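The staged-check idea can be sketched as a minimal gate: an artifact advances only if every stage passes. The stage names, the artifact fields, and the `verify` helper below are illustrative assumptions, not the paper's API.

```python
# Illustrative sketch of multi-stage verification (field names are assumptions):
# an artifact advances to deployment only if every stage passes in order.

def runs_ok(artifact):
    # Stage 1: did the generated code execute without errors?
    return artifact.get("exit_code") == 0

def performs_ok(artifact, max_error=0.3):
    # Stage 2: is the measured error within an acceptable budget?
    return artifact.get("error", 1.0) <= max_error

def deployable(artifact):
    # Stage 3: are deployment requirements (e.g. a serving entry point) met?
    return artifact.get("has_serving_entrypoint", False)

def verify(artifact):
    """Return the first failed stage name, or None if all stages pass."""
    for name, check in [("execution", runs_ok),
                        ("performance", performs_ok),
                        ("deployment", deployable)]:
        if not check(artifact):
            return name
    return None

candidate = {"exit_code": 0, "error": 0.25, "has_serving_entrypoint": True}
print(verify(candidate))  # prints None -> all stages passed
```

Returning the first failed stage (rather than a bare boolean) mirrors why progressive checks help: the system learns *where* a candidate broke, not just *that* it broke.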
With these pieces in place, the equations become a map.
[Proposed Method] Core formulation & perfect math dissection
This section treats AutoML-Agent as a mathematical pipeline: input → plan → decomposition → execution → final implementation.
Core formulation (the paper’s math-shaped story):
- Given a user instruction $I$, AutoML-Agent standardizes it into $R$.
- Using RAP, it generates a set of candidate plans $P$.
- For each plan $p_i$, it decomposes the work into data and model parts (with optional implementation), producing results $O_i$.
- Finally it selects $O^{*}$ and converts it into a deployable system $\mathcal{M}^{*}$.

(1) Prompt Parsing
$$R = \mathcal{A}_{p}(I)$$
- Intuition: $I$ is natural language. $\mathcal{A}_p$ converts it into an execution-friendly structure.
- Symbols here: $I$ is the user's natural-language instruction; $R$ is the standardized request after parsing; $\mathcal{A}_p$ (Prompt Agent) maps $I \to R$.
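The $I \to R$ mapping can be illustrated with a toy stand-in for $\mathcal{A}_p$. In the paper this agent is an LLM; the keyword heuristic, field names, and `parse_prompt` helper below are hypothetical, chosen only to show the shape of the transformation.

```python
# Toy stand-in for the Prompt Agent A_p: the real agent is an LLM;
# this keyword heuristic only illustrates the I -> R mapping.

def parse_prompt(instruction: str) -> dict:
    text = instruction.lower()
    task = ("classification" if "classif" in text
            else "regression" if "regress" in text
            else "unknown")
    return {
        "task": task,
        "modality": "image" if "image" in text else "tabular",
        "avoid_failures": "fail" in text or "runtime" in text,
        "deliverable": "deployable code" if "deploy" in text else "model",
    }

I = ("Find a dataset for image classification, preprocess it, train a "
     "fast model, and output deployable code, while avoiding runtime failures.")
R = parse_prompt(I)
print(R)
```

The point is the output shape: a free-form sentence becomes a structured request that downstream agents can consume without re-interpreting natural language.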
(2) RAP-based Candidate Plan Generation
$$P = \{p_{1}, \dots, p_{P}\} = \mathcal{A}_{mgr}(\mathrm{RAP}(R))$$
- Intuition: the manager uses retrieved hints to propose multiple plans, so exploration is guided.
- Symbols here: $P$ is the candidate plan set $\{p_1,\dots,p_P\}$; $p_i$ is the $i$-th plan; $\mathrm{RAP}(R)$ augments $R$ with retrieved knowledge; $\mathcal{A}_{mgr}$ (Manager Agent) outputs those candidates.
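A minimal sketch of the retrieve-then-plan step, under stated assumptions: the tiny corpus, the word-overlap scoring, and the `manager_plans` stub are all illustrative, not the paper's retrieval setup.

```python
# Minimal Retrieval-Augmented Planning sketch: score knowledge snippets
# against the request, then let a stubbed manager propose one candidate
# plan per retrieved hint. Corpus and scoring are illustrative assumptions.

KNOWLEDGE = [
    "image classification: start small, use a lightweight CNN backbone",
    "class imbalance: resample first, then use a steadier schedule",
    "time series forecasting: check stationarity before model choice",
]

def retrieve(request: dict, corpus, k=2):
    query = {request["task"], request["modality"]}
    def overlap(doc):
        # Crude relevance: shared words between query fields and the snippet.
        return len(query & set(doc.replace(":", "").split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def manager_plans(request, hints):
    # Stub for A_mgr: one candidate plan per retrieved hint.
    return [f"plan {i + 1}: {h.split(':')[1].strip()}"
            for i, h in enumerate(hints)]

R = {"task": "classification", "modality": "image"}
P = manager_plans(R, retrieve(R, KNOWLEDGE))
print(P)
```

Even this crude version shows the claimed efficiency effect: plans are seeded from relevant prior knowledge instead of sampled blindly from the LLM's memory.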
(3) Plan Decomposition & Execution
For each $p_i$:
- Data agent:
$$s_{i}^{d} = \mathrm{PD}(R, \mathcal{A}_{d}, p_{i}), \qquad O_{i}^{d} = \mathcal{A}_{d}(s_{i}^{d})$$
- Model agent:
$$s_{i}^{m} = \mathrm{PD}(R, \mathcal{A}_{m}, p_{i}, O_{i}^{d}), \qquad O_{i}^{m} = \mathcal{A}_{m}(s_{i}^{m})$$
- Intuition: $s$ is a state/summary of "what to do next," and $O$ is the agent's artifact.
- Symbols here: $\mathrm{PD}(\cdot)$ decomposes a plan into states and sub-tasks; $\mathcal{A}_d$ and $\mathcal{A}_m$ are the Data and Model agents; $s_i^d$, $s_i^m$ and $O_i^d$, $O_i^m$ are their states and outputs (preprocessing vs. model/HPO artifacts).
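The dependency structure can be sketched in a few lines: within one plan, the model sub-task waits for the data artifact $O_i^d$, but different plans can run concurrently. The agent bodies here are hypothetical stubs standing in for LLM calls.

```python
# Sketch of plan decomposition with cross-plan parallelism: within a plan
# the model step depends on the data artifact O_i^d, but plans themselves
# run concurrently. Agent bodies are illustrative stubs, not real agents.

from concurrent.futures import ThreadPoolExecutor

def data_agent(plan):        # A_d: produces the data artifact O_i^d
    return {"plan": plan, "splits": "train/val/test ready"}

def model_agent(plan, o_d):  # A_m: consumes O_i^d, produces O_i^m
    return {"plan": plan, "model": "lightweight-cnn", "data": o_d["splits"]}

def run_plan(plan):
    o_d = data_agent(plan)        # s_i^d -> O_i^d
    o_m = model_agent(plan, o_d)  # s_i^m -> O_i^m (needs O_i^d first)
    return o_m

plans = ["p1: small data + light model", "p2: resample + steady schedule"]
with ThreadPoolExecutor() as pool:
    outputs = list(pool.map(run_plan, plans))
print([o["plan"] for o in outputs])
```

This is the "prep and cooking proceed together" intuition from the background section: the sequential dependency lives inside each plan, and the parallelism lives across plans.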
(4) Final Implementation
$$\mathcal{M}^{*} = \mathcal{A}_{o}(I^{*})$$
- Intuition: once the best result $O^{*}$ is chosen, the Operation Agent generates deployment-ready code.
- Symbols here: $\mathcal{M}^{*}$ is the final deployable system; $O^{*}$ is the selected best outcome among candidates; $I^{*}$ is the input/settings for implementation (typically mapped from $O^{*}$); $\mathcal{A}_o$ (Operation Agent) turns this into code.
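The selection-then-implementation hand-off can be sketched as follows; the scoring field, `select_best`, and the `operation_agent` stub are hypothetical, standing in for the verified comparison and the real code-generating agent.

```python
# Hypothetical selection step: among verified candidate outcomes, pick
# O* by score, then hand the Operation Agent A_o its input I* (here the
# chosen outcome itself). The agent stub just emits a placeholder string.

def select_best(outcomes):
    # O* = argmax over verified candidate outcomes.
    return max(outcomes, key=lambda o: o["score"])

def operation_agent(impl_input):
    # A_o stub: in the real system this generates deployable code.
    return f"# deployable pipeline for {impl_input['model']}"

outcomes = [{"model": "light-cnn", "score": 0.78},
            {"model": "resampled-cnn", "score": 0.85}]
best = select_best(outcomes)  # O*
code = operation_agent(best)  # M* = A_o(I*)
print(code)
```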

(5) Evaluation: Comprehensive Score (CS)
$$CS = 0.5 \times SR + 0.5 \times NPS, \qquad NPS = \frac{1}{1+s}$$
- Intuition: blend "runs successfully" ($SR$) with "how good it is" ($NPS$).
- Symbols here: $CS$ combines the two; $SR$ is the success rate (run + deploy); $NPS$ is normalized performance computed from the error $s$ (smaller is better) via $NPS = 1/(1+s)$.
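The score is simple enough to compute directly from the two definitions above:

```python
# The Comprehensive Score from the definitions above:
# CS = 0.5 * SR + 0.5 * NPS, with NPS = 1 / (1 + s).

def nps(error: float) -> float:
    # Normalized performance: maps error s in [0, inf) to (0, 1].
    return 1.0 / (1.0 + error)

def comprehensive_score(success_rate: float, error: float) -> float:
    return 0.5 * success_rate + 0.5 * nps(error)

print(round(comprehensive_score(0.9, 0.25), 2))  # 0.85
```

Note the asymmetry this creates: a pipeline that never crashes but performs poorly and one that performs well but often fails can land on the same $CS$, which is exactly the trade-off the metric is designed to expose.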
[Math Working Simulation] Toy Data Walkthrough
We simulate the pipeline with small numbers (illustrative).
Setting:
- $I$: "Find a dataset for image classification, preprocess it, train a fast model, and output deployable code, while avoiding runtime failures."
Frame 1: Prompt Parsing
- $R=\mathcal{A}_p(I)$ gives structured constraints like task=classification and a failure-avoidance flag.
Frame 2: RAP generates candidates
- $P=\mathcal{A}_{mgr}(\mathrm{RAP}(R))$ produces two candidate plans:
- $p_1$: start with a small dataset, efficient preprocessing + lightweight model.
- $p_2$: handle class imbalance first with resampling and a steadier training schedule.
Frame 3: Data agent (for each plan)
- For $p_1$: $s_1^d=\mathrm{PD}(R,\mathcal{A}_d,p_1)$, then $O_1^d=\mathcal{A}_d(s_1^d)$ (split/augmentation/dataloader-ready outputs).
- Similarly obtain $O_2^d$ for $p_2$.
Frame 4: Model agent
- The model agent reads each $O_i^d$:
- $s_i^m=\mathrm{PD}(R,\mathcal{A}_m,p_i,O_i^d)$
- $O_i^m=\mathcal{A}_m(s_i^m)$ gives the model choice and HPO proposals.
Frame 5: Selection & implementation
- The manager compares verification results and chooses $O^{*}$ (say $O^{*}=O_2$).
- The Operation Agent generates $\mathcal{M}^{*}=\mathcal{A}_o(I^{*})$ (deployable training + inference code).
Frame 6: CS score (simple version)
- Success rate $SR=0.9$.
- If the error is $s=0.25$, then $NPS=\frac{1}{1.25}=0.8$.
- So
$$CS=0.5\times0.9+0.5\times0.8=0.85$$
One-line takeaway: RAP guides planning, decomposition enables parallel work, and verification closes the loop into deployable outcomes.
[Experiments & Results]
The paper evaluates end-to-end automation across 14 datasets grouped into 7 task/modality settings.
- Dataset snapshot:

| Modality | Task | Example datasets |
| --- | --- | --- |
| Image | Classification | Butterfly Image, Shopee-IET |
| Text | Classification | Ecommerce Text, Textual Entailment |
| Tabular | Classification | Banana Quality, Software Defects |
| Tabular | Regression | Crab Age, Crop Price |
| Tabular | Clustering | Smoker Status, Higher Education Students Performance |
| Time Series | Forecasting | Weather, Electricity |
| Graph | Node Classification | Cora, Citeseer |
Additional tabular datasets (for comparison):
- Smoker Status (Binary): predicts whether a person smokes (binary classification).
- Click Prediction Small: predicts ad click/CTR (binary classification).
- MFeat Factors: a tabular benchmark built from multiple factor features for ML evaluation.
- Wine Quality White: regression task predicting white-wine quality from chemical measurements.
- Colleges: tabular dataset using student/college attributes for admission/performance prediction.
- House Prices: regression benchmark predicting house sale prices from property features.
Key experimental results (numbers):
- Constraint-aware success rate: average 87.1%.
- Search speed: about 8x faster than SELA (MCTS).
- Time & cost efficiency: average 525 seconds and about $0.30 cost (based on GPT-4o).
The real implication: if you want interpretable engineering outcomes under budget limits, full-pipeline automation with retrieval-guided planning and verification is a compelling recipe.
[Conclusion & Limitations]
Final meaning & practical value (≤3):
1. Full-pipeline mindset: defines AutoML as a continuous pipeline, not a single step.
2. RAP + multi-agent: turns plan search from single-shot generation into guided candidate exploration.
3. Verification-first reliability: reduces the typical LLM failure mode—“looks right, but breaks.”
Limitations / Future work:
- Skeleton/template reliance: genuinely novel tasks may still require stronger base templates.
- Backbone LLM dependency: stronger LLMs usually produce better plans and code.
- Metric sensitivity: success/performance depends on how SR/NPS (and the verification criteria) are defined.
Finally, a single orchestration diagram summarizes the full pipeline.

[Diagram] Full-pipeline orchestration board

One flowchart panel: standardize the user instruction into $R$, strengthen planning with RAP, run data → model → code stages on decomposed sub-tasks in parallel, then advance only verified artifacts to deployment.

Panel: Full-Pipeline Control (AutoML-Agent: RAP + multi-agent + multi-stage verification).

Flow: user task (natural language) → parsed request (structured for tools) → RAP → data → model → code → checks → deploy.
Standardize the request $R$, generate candidate plans with RAP, decompose into data/model tasks, run them in parallel, and verify until deployment-ready.
AutoML-Agent treats automation as an end-to-end system: RAP accelerates planning, decomposition enables parallel execution, and multi-stage verification locks reliability. So even with long math, the whole story compresses into one flow: input standardization → candidate plans → parallel execution → deployable final code.
