30B Model for Scientific Research Outperforms GPT-5.4, Performance Increases from 1.7% to 33.3%

UniPat AI, in collaboration with Peking University, has developed a model named UniScientist specifically for cutting-edge scientific research, based on Qwen3-30B-A3B-Thinking-2507.

In the FrontierScience-Research benchmark test, it achieved 33.3%, surpassing the recently released top model GPT-5.4 (33.0%).

All this stems from the research team's deep effort in creating a dataset.

The original Qwen3-30B-A3B-Thinking-2507, when directly run on FrontierScience-Research, had only a 1.7% success rate.

After fine-tuning with the specially crafted dataset, it soared to 33.3%, an improvement of 31.6 percentage points.

Models and Human Experts Achieve Complementary Advantages

High-quality scientific data has always been the core bottleneck that restricts the development of artificial intelligence.

Data written entirely by humans is professional and rigorous, but extremely costly and difficult to scale. Data synthesized purely by algorithms may be vast in quantity, but it often lacks the judgment and domain grounding that human experts uniquely possess.

The research team recognized a useful asymmetry between the two.

Large language models have extremely broad cross-disciplinary knowledge reserves and can generate massive content with high efficiency.

Human experts, on the other hand, have irreplaceable acumen in judging right from wrong and verifying logic.

This has facilitated a brand-new data production cooperation model. Language models act as tireless creators, proposing various research ideas across disciplinary boundaries. Human experts then act as stringent reviewers, specifically responsible for auditing the accuracy and rationality of these ideas.

This division of labor has greatly improved both data quality and coverage.

The research team thus built an extremely large-scale research-grade training corpus. This corpus covers over 50 broad scientific fields and contains over 4,700 real research instances.

Each sample requires domain experts to invest 1 to 2 hours for meticulous annotation and polishing.

The data not only covers fundamental disciplines like quantum physics and organic chemistry but also extends to socio-cultural anthropology, computational linguistics, and even geophysics and immunology.

This vast dataset covers nearly all major branches of human scientific exploration. Each field's data comes with structured scoring criteria that serve as supervision signals. This high-quality data has become valuable training material for a new kind of research agent.

Dynamic Evidence Integration Reshapes the Scientific Research Process

The most critical step in enabling machines to conduct scientific research is to transform the act of doing science into a clear mathematical or logical model.

Traditional QA systems are only responsible for outputting answers, while real scientific research is a dynamic process of continuous trial and error and iteration.

The research team defines open-ended scientific research as active evidence integration and model abduction.

Under this framework, when facing a research task, the agent always maintains an evolving evidence library. This evidence library is like a clue board in a detective's hand, pinned with various proven information.

This evidence falls into two major categories.

One category is objective evidence based on external literature and authoritative sources, equivalent to the wisdom of predecessors that scientists gain by standing on the shoulders of giants.

The other category is derivative evidence obtained through symbolic analysis, numerical computation, and simulation experiments, representing the outcomes of scientists' own experimental work.
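The two-category evidence library described above can be sketched as a simple data structure. The following Python sketch is purely illustrative: the class and field names are assumptions, since the article does not describe UniScientist's internals.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Evidence:
    claim: str
    source: str   # e.g. a citation, or the tool run that produced it
    derived: bool  # False: from external literature; True: from computation/simulation

@dataclass
class EvidenceLibrary:
    """Hypothetical sketch of the 'evolving evidence library' the article describes."""
    items: List[Evidence] = field(default_factory=list)

    def add(self, claim: str, source: str, derived: bool = False) -> None:
        self.items.append(Evidence(claim, source, derived))

    def objective(self) -> List[Evidence]:
        # Evidence grounded in external literature and authoritative sources
        return [e for e in self.items if not e.derived]

    def derivative(self) -> List[Evidence]:
        # Evidence produced by the agent's own symbolic analysis or simulation
        return [e for e in self.items if e.derived]

lib = EvidenceLibrary()
lib.add("Compound X melts at 412 K", "Smith et al. 2019")
lib.add("Simulated melting point: 409 K", "MD simulation run", derived=True)
```

Keeping the two categories separate lets the agent weigh prior literature against its own freshly computed results when updating hypotheses.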

This process fully demonstrates the charm of autonomous exploration by intelligent agents.

To improve the evidence library at hand, the system must purposefully acquire information and design experiments under resource constraints.

Whenever a new intermediate result is obtained, the system dynamically adjusts its next plan.

The entire operating mechanism is like a precision gear system in constant operation.

The system first obtains and verifies new goal-oriented evidence, then derives new conclusions through reproducible inference.

The system updates existing scientific hypotheses and identifies the theory that best explains all current evidence.

When the evidence chain is sufficiently complete and stable, all findings are summarized into a rigorous scientific report.
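The gather–derive–update–report cycle above can be written as a control loop. This is a minimal sketch under assumptions; every callback name here is a placeholder, not UniScientist's actual interface.

```python
def research_loop(task, gather, derive, update, is_stable, max_steps=100):
    """Hypothetical sketch of the cycle described in the article:
    acquire evidence -> derive conclusions -> update hypotheses -> stop when stable."""
    evidence, hypotheses = [], []
    for _ in range(max_steps):
        evidence.extend(gather(task, evidence))    # goal-oriented acquisition
        evidence.extend(derive(evidence))          # reproducible inference
        hypotheses = update(hypotheses, evidence)  # keep the best explanation
        if is_stable(hypotheses, evidence):        # evidence chain complete
            break
    return {"task": task, "evidence": evidence, "hypotheses": hypotheses}

# Toy run with trivial placeholder callbacks:
result = research_loop(
    task="toy",
    gather=lambda t, ev: [1] if not ev else [],
    derive=lambda ev: [],
    update=lambda h, ev: ["h1"] if ev else h,
    is_stable=lambda h, ev: bool(h),
)
```

The loop terminates either when the hypothesis set stabilizes or when the resource budget (`max_steps`) runs out, mirroring the article's "under resource constraints" condition.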

To cope with such complex cycles, the system must possess a set of hardcore capabilities. When collecting evidence, it must accurately retrieve facts and verify authenticity. When building models, it must learn to use deductive reasoning to update hypotheses and generate new verifiable evidence.

This ability is closed-loop; every hypothesis update prompts the system to seek the most distinguishing information among different explanations.

To mass-produce such high-quality research questions, the team invented a progressive erudite synthesis engine. This engine is like an extremely complex factory, specifically expanding verified scientific conclusions into open-ended research topics.

The entire process is divided into four meticulous steps.

Search agents repeatedly search through vast papers and authoritative websites, continuously expanding the evidence pool.

Subsequently, the model constructs a coherent research background based on these materials, placing scattered knowledge in a specific scientific context.

Then the model condenses this knowledge into a comprehensive research topic containing multiple sub-questions.

Experts and algorithms jointly validate and refine the questions to ensure they have genuine scientific value.
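The four stages above (search, background construction, topic composition, validation) can be sketched as a simple pipeline. Function names and signatures here are illustrative assumptions, not the engine's real API.

```python
def synthesize_topic(seed_conclusion, search, build_background, compose, validate):
    """Hypothetical sketch of the four-stage synthesis engine described above."""
    pool = search(seed_conclusion)        # 1. search agents expand the evidence pool
    background = build_background(pool)   # 2. place knowledge in a scientific context
    topic = compose(background)           # 3. condense into a topic with sub-questions
    return validate(topic)                # 4. expert/algorithm validation and refinement

# Toy usage with stand-in callbacks:
topic = synthesize_topic(
    "conclusion C",
    search=lambda c: [c, "related fact"],
    build_background=lambda pool: " / ".join(pool),
    compose=lambda bg: {"background": bg, "sub_questions": ["Q1", "Q2"]},
    validate=lambda t: t,
)
```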

Objective Scoring System Enhances Research Quality

For open-ended scientific reports, traditional machine scoring often falls short. The research team took a novel approach, decomposing grand scientific reports into multiple closed and verifiable objective checkpoints.

This method is like devising an extremely strict set of health-check metrics for research reports.

Each scoring item must satisfy objective consistency. Faced with the same scientific report, using the same set of standards repeatedly, the results must be completely consistent. This effectively eliminates subjective, vague, or highly unstable judgment conditions.

The scoring standards must also have strong discriminability. When faced with research reports of different completion levels, these standards should show obvious score gaps, clearly distinguishing excellent insights from perfunctory nonsense.

Each standard must be atomic; it tests only a single knowledge point each time, never mixing multiple conclusions in judgment.
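The three properties above (objective consistency, discriminability, atomicity) can be illustrated with a toy scorer: each checkpoint is a pure, deterministic predicate over the report text that tests one knowledge point, so repeated scoring of the same report always agrees. The checkpoint contents are invented for illustration.

```python
def score_report(report: str, checkpoints) -> float:
    """Each checkpoint is (description, predicate). Predicates are pure functions
    of the report text, so the same report always receives the same score."""
    passed = sum(1 for _, check in checkpoints if check(report))
    return passed / len(checkpoints)

# Hypothetical atomic checkpoints, one knowledge point each:
checkpoints = [
    ("cites the melting point", lambda r: "412 K" in r),
    ("identifies the mechanism", lambda r: "SN2" in r),
]
full = score_report("Observed 412 K; the SN2 pathway dominates.", checkpoints)
half = score_report("Observed 412 K only.", checkpoints)
```

Because each checkpoint is atomic, a report that covers half the knowledge points scores exactly half, giving the score gap between strong and weak reports that the article calls discriminability.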

Domain experts extract the core knowledge points needed to solve each problem and initialize them as a required evidence list. Search agents further expand this list based on the research question. The resulting set of criteria works like a suite of unit tests over key knowledge points, turning an otherwise hard-to-measure open-ended task into a quantifiable score.

For example, the following case in the chemical field:

The evaluation standards do not just test whether the model has memorized a standard answer. They truly test whether the model can complete the full scientific closed-loop from consulting literature to proposing hypotheses, then designing experiments and performing sensitivity analysis.

Besides conventional supervised fine-tuning, the team also introduced a learning objective called report aggregation.

Given a research task and multiple candidate reports generated by different agents, the model must learn to keep the best parts and discard the weak ones, ultimately integrating them into a single report that synthesizes the strengths of all candidates.

The training references are obtained through score-based rejection sampling: only reports exceeding a preset score threshold are adopted.
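Score-based rejection sampling as described here reduces to a simple filter. The scorer and threshold below are illustrative assumptions; the article does not specify the actual threshold.

```python
def select_references(candidates, scorer, threshold=0.8):
    """Keep only candidate reports whose score clears a preset threshold;
    survivors become reference targets for the aggregation objective."""
    return [r for r in candidates if scorer(r) >= threshold]

# Toy usage: score by a stand-in quality proxy (here, just length / 3).
reports = ["A", "BB", "CCC"]
kept = select_references(reports, scorer=lambda r: len(r) / 3, threshold=0.8)
```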

This aggregation ability grants the agent the wisdom to examine research quality, rethink competing viewpoints, and reorganize evidence.

This is precisely the core of real scientific research, where scientists daily synthesize multiple sources of information, evaluate conflicting findings, and weave the best evidence into a coherent narrative.

Code Interpreter Empowers Scientific Computing

The research team used Qwen3-30B-A3B-Thinking-2507 as the base model and trained for about 1,200 GPU-hours on an NVIDIA H200 cluster to produce UniScientist.

UniScientist boasts an astonishing 128,000 token context length and allows calling tools up to 100 times per task.

Its toolbox is very rich, including web search, academic literature retrieval, web page scraping, and a crucially important code interpreter.
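The capabilities listed above can be summarized in a configuration sketch. Every key name below is invented for illustration; only the values (context length, tool-call budget, tool list, base model) come from the article.

```python
# Hypothetical configuration reflecting the capabilities the article lists;
# key names are assumptions, not UniScientist's real config schema.
AGENT_CONFIG = {
    "base_model": "Qwen3-30B-A3B-Thinking-2507",
    "context_length": 128_000,        # tokens
    "max_tool_calls_per_task": 100,
    "tools": ["web_search", "scholar_search", "page_fetch", "code_interpreter"],
}
```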

The code interpreter is far from a simple auxiliary accessory; it is the soul hub for agents to conduct reproducible computations.

In the past, language models facing complex scientific reasoning had to rely on vague mental simulation carried out entirely in text.

This purely textual reasoning is neither efficient nor accurate in many hard scientific fields; the complex dynamics of fluid mechanics, for instance, are difficult to describe accurately in text alone.

The addition of the code interpreter completely changes the game.

It transforms the research cycle from mere textual storytelling into a rigorous computational process of alternating testing and revision.

Hypotheses proposed by agents no longer stay on paper but are converted into executable lines of code.

The results of running this code can confirm, refute, or further sharpen various competing scientific explanations.
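As a toy illustration of "hypotheses converted into executable code": suppose an agent hypothesizes that the partial sums of the series 1/n² approach π²/6. A few lines of code confirm or refute the claim numerically. The example is entirely illustrative and not taken from the article.

```python
import math

def check_hypothesis() -> bool:
    """Numerically test the (true) hypothesis that sum(1/n^2) converges to pi^2/6."""
    partial = sum(1 / n**2 for n in range(1, 100_000))
    # The tail of the series beyond n = 100_000 is below 1e-4, so this
    # tolerance confirms convergence without needing the full infinite sum.
    return abs(partial - math.pi**2 / 6) < 1e-4

print(check_hypothesis())  # True
```

A false hypothesis (say, convergence to π²/4) would fail the same check, which is exactly the confirm-or-refute role the article assigns to the code interpreter.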

Breakthroughs in scientific research often depend on running targeted analysis and simulation under clear constraints.

The code interpreter gives agents the ability to verify truth firsthand, truly narrowing the gap between AI and real scientific operations.

After repeated reviews by the expert team, many research questions synthesized by this system have reached the proposal quality of mature project leaders.

They not only have clear directions and novel perspectives but also exhibit astonishing professional depth.

Model-synthesized questions have intricately mixed structures, where subsequent reasoning often builds on previous steps.

This step-by-step exploration process perfectly replicates the mental journey of human scientists solving difficult problems.

Top-Tier Benchmarks Show Exceptional Research Potential

The research team conducted rigorous evaluations of the system on five highly representative authoritative benchmarks.

These include professional frontier scientific research tests that are close to training data, frontier science Olympiad tests that examine pure scientific knowledge reserves, and deep research series benchmarks that assess general research and information integration capabilities.

This 30B parameter scale model scored a high 28.3 on the FrontierScience-Research benchmark. Through test-time scaling technology, its score can even soar to 33.3.

In the FrontierScience-Olympiad test, it achieved an excellent score of 66.0 without using any tools. With tool usage and aggregation technology, the score reached 71.0, completely matching the top-tier closed-source giant models.

Even in out-of-domain deep research benchmark tests, its performance remains outstanding.

It scored 46.0 on DeepResearch Bench, comparable to OpenAI Deep Research's score of 47.0.

In the DeepResearch Bench II test, it surpassed OpenAI's model with 45.4 and Gemini's model with 44.6 with a score of 48.0.

In the ResearchRubrics evaluation, it also achieved a high standard score of 59.9.

Even in bare tests without any external tools, it has made a qualitative leap from the base model.

This intrinsic enhancement of scientific research ability fully demonstrates the immense power of progressive erudite synthetic data.

Broad cross-disciplinary large-scale information acquisition capability has completely broken the limitations of single-domain experts in knowledge breadth.

Currently, the system's practical capabilities are mainly limited to reproducible reasoning and simulation-based computation.

It still cannot fully schedule real-world physical research resources, such as assigning tasks to large computing clusters or coordinating complex laboratory operations.

Connecting this intelligent brain to real experimental equipment and computing infrastructure will be the core direction for exploring automated scientific discovery in the future.

Reference Materials:

https://unipat.ai/blog/UniScientist

https://github.com/UniPat-AI/UniScientist

https://huggingface.co/UnipatAI/UniScientist-30B-A3B
