As large models push deeper into the regime governed by scaling laws, data is increasingly what decides a model's capability ceiling. Yet current "data engineering" still relies heavily on manual experience: which data should be selected? In what proportions? Should it be rewritten? How should it be filtered and verified?
Recently, the Shanghai AI Laboratory, in collaboration with Fudan University, officially launched and open-sourced DataChef, a data recipe generation model designed for Large Language Model (LLM) adaptation tasks. For the first time, DataChef models end-to-end data recipe generation as a global decision-making problem, forming an automatic optimization loop through Online Reinforcement Learning (Online RL). By simply inputting a target task, the AI can automatically generate complete, executable data processing code and data recipes.
Experiments show that across 6 held-out test tasks (Physics, Math, Code, Finance, Meteorology, and Chinese Idioms), the data recipe generation capability of DataChef-32B approaches that of the closed-source top-tier model Gemini-3-Pro. The data recipes it generates not only surpass SOTA filtering algorithms designed by human experts, such as DEITA, but also reach the level of industrial-grade expert data recipes on some complex tasks.
Relevant code, data, and models are now open source. Welcome to experience and explore!
GitHub Link:
https://github.com/yichengchen24/DataChef
HuggingFace Link:
https://huggingface.co/yichengchen24/DataChef-32B
Paper Link:
https://arxiv.org/abs/2602.11089
Demo Link:
https://huggingface.co/spaces/yichengchen24/DataChef
(a) Paradigm Definition: Given a task description (training an LLM to adapt to the mathematics domain), evaluation criteria (AIME'25), and available raw datasets (a list of math-related Hugging Face datasets), the model outputs a data recipe, including an executable data processing pipeline and the resulting training data, used to adapt the Base LLM to the target domain.
(b) Main Results: In 6 held-out test tasks (PHYSICS, AIME, LiveCodeBench, ClimaQA, OpenFinData, and CHID), DataChef's data recipe generation capability has approached that of the closed-source top-tier model Gemini-3-Pro.
Core Breakthrough: Turning "Data Alchemy" into an Evolvable Automated System
Traditional data engineering faces three major challenges:
Heavy Reliance on Expert Experience: Data selection, proportioning, and cleaning rules often rely on manual trial and error.
Extremely High Verification Costs: To evaluate the quality of a data recipe, one typically needs to complete expensive model training to see the results.
Combinatorially Explosive Search Space: The cross product of multiple data sources × multiple processing operators × multiple task objectives makes efficient manual traversal impossible.
In response to these industry bottlenecks, DataChef offers a completely new solution.
Paradigm Innovation: First Definition of End-to-End Data Recipe Generation
DataChef breaks away from traditional local heuristic rules, elevating "data recipe generation" to an end-to-end task. The model only needs to receive the target benchmark and available data sources as input to directly output complete Python data processing pipeline code, truly achieving "what you think is what you get."
Paradigm: Given a task description, evaluation criteria, and available raw datasets, the model outputs a data recipe, including an executable data processing pipeline and the resulting training data. During training, code executability and data quality serve as the Reward. During inference, the obtained training data is directly used for LLM adaptation.
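To make the paradigm concrete, here is a minimal, self-contained sketch of what a generated "data recipe" could look like: an executable pipeline that filters raw sources by task relevance and mixes them by ratio. All names (`build_recipe`, the toy sources) are illustrative assumptions, not DataChef's actual API.

```python
# Hypothetical sketch of a data recipe: an executable pipeline that
# turns raw sources into a task-adapted training set.

def build_recipe(task_keywords, sources, mix_ratio):
    """Filter each raw source by task keywords, then mix by ratio."""
    def pipeline():
        selected = []
        for name, rows in sources.items():
            # keep rows that mention any task-relevant keyword
            kept = [r for r in rows
                    if any(k in r["text"].lower() for k in task_keywords)]
            # keep only the share this source gets in the final mix
            n = int(len(kept) * mix_ratio.get(name, 1.0))
            selected.extend(kept[:n])
        return selected
    return pipeline

# Toy rows standing in for Hugging Face datasets
sources = {
    "math_qa": [{"text": "Solve the integral of x^2"},
                {"text": "Weather report for Tuesday"}],
    "forum":   [{"text": "Proof that sqrt(2) is irrational"}],
}
recipe = build_recipe(["integral", "proof"], sources,
                      {"math_qa": 1.0, "forum": 1.0})
train_data = recipe()
print(len(train_data))  # 2
```

In the real system the model emits the full pipeline code itself; this stub only illustrates the input/output contract described above.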
Infrastructure: Building Massive Multi-Domain Datasets
To support this new paradigm, the research team constructed a massive data foundation: covering 19 core domains including mathematics, code, finance, and medicine, containing 31 evaluation sets and 257 source datasets, providing systematic training and evaluation infrastructure for the open-source community.
Dataset Overview: Detailed display of domain information, benchmarks, and specific uses.
Mechanism Evolution: Online Reinforcement Learning Drives AI Self-Evolution
The research team introduced the Data Verifier mechanism, which predicts a dataset's downstream-task performance cheaply and in real time, and uses this prediction as the reward signal for reinforcement learning. This lets the model explore the vast space of code combinations rapidly, addressing the long-standing pain points of traditional schemes: long training feedback cycles and expensive trial-and-error.
Experiments show that, compared to traditional data evaluation metrics (IFD, RewardModel, VendiScore), the Data Verifier achieves better correlation and robustness.
Data Evaluation Metric Correlation Analysis: Compared to existing methods like DEITA, RewardModel, IFD, and VendiScore, Data Verifier demonstrates significantly better correlation and robustness. (Left) Box plot of correlation coefficients across 6 evaluation tasks; (Right) Scatter plot of the correlation between scores of various metrics and actual downstream performance in language and code tasks.
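The correlation analysis above boils down to rank-correlating a metric's scores against actual downstream performance. A minimal sketch of that computation, with entirely made-up numbers and a tie-free Spearman implementation written from scratch:

```python
# Rank-correlation sketch for the kind of analysis described above.
# Scores below are hypothetical, not results from the paper.

def rankdata(xs):
    """Ranks starting at 1; assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for r, i in enumerate(order):
        ranks[i] = r + 1
    return ranks

def spearman(x, y):
    """Spearman's rho for tie-free data."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

verifier_scores = [0.2, 0.5, 0.6, 0.9]    # hypothetical verifier predictions
downstream = [31.0, 44.0, 47.5, 60.2]     # hypothetical benchmark accuracies
print(round(spearman(verifier_scores, downstream), 2))  # 1.0
```

A metric whose scores rank datasets in the same order as their real downstream results gets rho near 1, which is the property the box plots compare across metrics.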
A Compact Open-Source Model Punches Above Its Weight
Performance Approaches Gemini-3-Pro
In multiple rigorous tests, DataChef, with only 32B parameters, demonstrated strong robustness and effectiveness, with overall performance approaching the level of the closed-source top-tier model Gemini-3-Pro. Specifically, in terms of average scores on In-domain and Out-of-domain tasks, DataChef-32B achieved high scores of 89.3 and 75.4 respectively, surpassing the open-source model Kimi-K2-Instruct-0905 with 1T parameters (83.7 / 58.2), and rivaling Gemini-3-Pro (91.2 / 76.6).
Main experimental results on 6 held-out test tasks: Whether in In-domain or Out-of-domain tasks, DataChef-32B demonstrated excellent data recipe generation capabilities. Its overall performance approaches the level of the closed-source top-tier model Gemini-3-Pro.
Surpassing Human Expert Data Recipes
DataChef is no longer limited to selecting the best subsets from existing data but constructs entirely new processing logic by automatically generating arbitrary code.
Surpassing Human Heuristic Data Selection SOTA: Compared to traditional data selection methods like SINGLE-SOURCE, IFD, and DEITA, DataChef achieved highly competitive performance.
Defeating Industrial Recipes: On the highly challenging AIME'25 and ClimaQA evaluation benchmarks, the data recipes produced by DataChef-32B even surpassed the industrial-grade expert recipes used by the official Qwen post-training models!
This shows that AI can learn superior data solutions within large-scale code spaces.
Real Case: Reconstructing an Automated Pipeline
Taking the ClimaQA task as an example, DataChef can accurately discern the target requirements and automatically generate an efficient data processing pipeline:
Intelligent Data Augmentation: Automatically calls LLMs to synthesize and augment samples in task-specific formats, precisely boosting target model capabilities;
Precise Feature Extraction: Through self-generated keyword logic, it extracts the most matching and relevant data subsets, significantly improving data validity.
Case Study: DataChef generating a data processing pipeline for the ClimaQA task.
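The two steps above can be sketched in a few lines of Python. The keyword set, function names, and prompt wording are illustrative assumptions; the actual pipeline is generated by the model and the LLM call for augmentation is not shown.

```python
# Hypothetical sketch of the ClimaQA-style pipeline steps:
# (1) keyword-based subset extraction, (2) an LLM augmentation prompt.

CLIMATE_KEYWORDS = {"climate", "precipitation", "temperature", "emission"}

def extract_subset(rows):
    """Keep rows whose text mentions any climate keyword."""
    return [r for r in rows
            if CLIMATE_KEYWORDS & set(r["text"].lower().split())]

def augmentation_prompt(example):
    """Build a prompt asking an LLM (call not shown) to synthesize
    a sample in the task-specific multiple-choice format."""
    return ("Rewrite the following passage as a multiple-choice climate "
            "question with four options and a correct answer:\n"
            + example["text"])

rows = [{"text": "Global temperature anomalies rose in 2023"},
        {"text": "Stock prices fell on Monday"}]
subset = extract_subset(rows)
print(len(subset))  # 1
```

The point is that both steps are ordinary, inspectable code: the generated recipe can be read, audited, and rerun, unlike an opaque learned selector.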
Summary
The emergence of DataChef marks the first time end-to-end data recipe generation has been modeled as an optimizable global decision task. It signals that large-model data engineering is moving away from the era of manual, workshop-style craftsmanship and toward a new paradigm of automation, scalability, and industrial intelligence. With the full open-sourcing of the related capabilities, DataChef offers new ideas and tool support for automated data engineering, frontier LLM training, automated AI research, and self-evolving AI.