QuantCode-Bench: A Benchmark for Evaluating LLM-Generated Quant Code Quality

"QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies"

Large language models (LLMs) perform well on general programming tasks, but their ability to generate executable algorithmic trading strategies remains largely unexplored. This paper introduces QuantCode-Bench, a benchmark built around the Backtrader framework, containing 400 tasks with both single-turn and multi-turn interactive settings. It employs a four-stage evaluation pipeline that can distinguish between four specific model capabilities. Experiments show that cutting-edge models achieve high compilation rates in a single-turn setting but underperform in subsequent stages. Multi-turn settings improve performance, but some failures stem from misinterpretations of natural language specifications.

Abstract

Large language models perform well on general programming tasks, but their ability to generate executable algorithmic trading strategies remains largely unexplored. This paper introduces QuantCode-Bench, a benchmark for systematically evaluating the ability of modern LLMs to generate strategies for the Backtrader framework from English text descriptions. The benchmark includes 400 tasks of varying difficulty and evaluates model performance through a multi-stage pipeline, comparing single-turn and agentic multi-turn settings. The best single-turn Judge Pass rates are approximately 70–76%, rising to 95–98% in the multi-turn setting. Analysis indicates that the main limitations of current models lie in trading logic execution, API usage, and adherence to task semantics. Trading strategy generation is therefore a specialized code generation task that requires both technical correctness and consistency between the description, the code, and the strategy's observable behavior.

Introduction

Existing code benchmarks often focus on general programming tasks and do not fully reflect model behavior in domain-specific application scenarios. Generating algorithmic trading strategies is a distinctive task: it requires understanding of the subject matter, adherence to a specific API, and code that produces meaningful trading behavior. Existing benchmarks are not designed to evaluate these requirements.

QuantCode-Bench is built around the Backtrader framework and contains 400 tasks, evaluated in both single-turn and multi-turn interactive settings. Its four-stage evaluation pipeline separates four capabilities: producing syntactically valid code, running without errors in a backtest, generating at least one trade, and matching the intent of the task description. Experiments reveal that cutting-edge models achieve high compilation rates in the single-turn setting but lose much of that performance in the later stages. Multi-turn settings boost performance, but some failures persist because the model misinterprets the natural language specification.

This paper introduces a benchmark specifically designed for generating executable algorithmic trading strategies. It proposes a multi-level evaluation framework that distinguishes technical executability, the presence of trading behavior, and semantic alignment with task specifications. It compares the performance of modern models in single-turn and agent settings, conducts detailed error analysis to identify major failure modes at each stage of the pipeline, and releases the benchmark for reuse in finance-specific code generation research.

QuantCode-Bench

Task Definition

QuantCode-Bench evaluates the ability of models to generate Backtrader trading strategies from text descriptions, featuring four nested requirements: the strategy must be syntactically correct, execute successfully in a backtesting environment, complete at least one trade on historical data, and align with the trading concept. This benchmark is more stringent than general coding tasks, with verification stages progressing sequentially. The primary metric, Judge Pass, is the proportion of tasks where the strategy passes the entire evaluation pipeline.
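Because the requirements are nested, per-stage pass rates can only stay flat or fall from one stage to the next, and Judge Pass is simply the rate at the last stage. The following is a minimal sketch of that computation, not the benchmark's own code; the function name, the stage labels, and the sample outcomes are hypothetical.

```python
# Stage labels are illustrative names for the four nested requirements.
STAGES = ("compilation", "backtest", "trading", "judging")


def stage_pass_rates(stages_passed):
    """Return the share of tasks that cleared each stage.

    stages_passed holds, for every task, how many consecutive stages
    (0 to 4) the generated strategy cleared before its first failure.
    """
    total = len(stages_passed)
    if total == 0:
        raise ValueError("at least one task is required")
    return {
        name: sum(1 for n in stages_passed if n >= i) / total
        for i, name in enumerate(STAGES, start=1)
    }


# Hypothetical outcomes for five tasks (not data from the paper).
rates = stage_pass_rates([4, 4, 2, 1, 0])
judge_pass = rates["judging"]  # share of tasks that passed the whole pipeline
```

With these made-up outcomes, four of five tasks compile but only two clear the full pipeline, which mirrors the gap between compilation rate and Judge Pass discussed later.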

Dataset

The QuantCode-Bench dataset comprises 400 trading strategy generation tasks. The descriptions come from multiple sources and vary in form, structure, and level of detail. Each task is enriched with structured metadata extracted from its description (indicators, entry and exit conditions, and additional rules) and is assigned a difficulty level (Easy, Medium, or Hard).
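One way to picture an enriched task is as a small record that pairs the free-text description with the extracted fields. The sketch below is an assumption about the shape of such a record; the class name, field names, and the example task are hypothetical and do not come from the dataset.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Difficulty(Enum):
    EASY = "Easy"
    MEDIUM = "Medium"
    HARD = "Hard"


@dataclass
class StrategyTask:
    """Hypothetical record for one benchmark task."""
    description: str                      # free-text strategy description
    difficulty: Difficulty
    indicators: List[str] = field(default_factory=list)
    entry_conditions: List[str] = field(default_factory=list)
    exit_conditions: List[str] = field(default_factory=list)
    additional_rules: List[str] = field(default_factory=list)


# Made-up example task, for illustration only.
task = StrategyTask(
    description=(
        "Go long when the 10-day SMA crosses above the 30-day SMA; "
        "exit on the reverse cross."
    ),
    difficulty=Difficulty.EASY,
    indicators=["SMA(10)", "SMA(30)"],
    entry_conditions=["SMA(10) crosses above SMA(30)"],
    exit_conditions=["SMA(10) crosses below SMA(30)"],
)
```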

Backtrader

Backtrader was chosen because it is a widely used open-source framework for backtesting and prototyping trading strategies, and because its API is non-trivial. Models must correctly handle indicators, data feeds, order execution methods, and indexing conventions. This brings the benchmark closer to real-world code generation and makes it harder to succeed by simply filling in a template.

Evaluation Methodology

Verification Process

QuantCode-Bench evaluation uses a four-stage process:

1.) Compilation: The code is syntactically correct and can be interpreted.

2.) Backtest: The strategy executes without runtime errors on benchmark historical data.

3.) Trading: The strategy completes at least one trade.

4.) Judging: An LLM judge confirms the strategy matches the task description.

This process allows for pinpointing failure points and decomposing unsuccessful generations, avoiding a situation where a single metric masks different causes of failure for domain-specific tasks.
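The four stages can be sketched as a short-circuiting function that reports the first stage at which a generated strategy fails. This is an illustrative outline under stated assumptions, not the benchmark's harness: `run_backtest` and `judge` are hypothetical callables supplied by the caller, and the compilation check here covers Python syntax only.

```python
from dataclasses import dataclass
from typing import List, Optional

STAGES = ("compilation", "backtest", "trading", "judging")


@dataclass
class StageResult:
    passed_stages: List[str]
    failed_stage: Optional[str]  # None means all four stages passed
    message: str = ""


def evaluate(source, run_backtest, judge, task_description):
    """Run the four stages in order and stop at the first failure.

    run_backtest(source) -> number of completed trades (hypothetical hook)
    judge(task_description, source) -> bool (hypothetical hook)
    """
    passed = []

    # Stage 1: the code must at least be valid Python.
    try:
        compile(source, "<strategy>", "exec")
    except SyntaxError as exc:
        return StageResult(passed, "compilation", str(exc))
    passed.append("compilation")

    # Stage 2: the backtest must finish without a runtime error.
    try:
        trade_count = run_backtest(source)
    except Exception as exc:
        return StageResult(passed, "backtest", f"{type(exc).__name__}: {exc}")
    passed.append("backtest")

    # Stage 3: the strategy must complete at least one trade.
    if trade_count < 1:
        return StageResult(passed, "trading", "strategy produced no trades")
    passed.append("trading")

    # Stage 4: the judge must accept the strategy as matching the task.
    if not judge(task_description, source):
        return StageResult(passed, "judging", "judge rejected the strategy")
    passed.append("judging")

    return StageResult(passed, None)
```

Returning the first failed stage, rather than a single pass/fail flag, is what makes the later error analysis by stage possible.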

LLM Judge

The final stage of the pipeline verifies the semantic consistency of the generated strategy against the original task description, because a strategy can be technically valid yet substantively wrong. A large language model (LLM) is used to judge the code. The judgment criteria are: whether the indicators match or are equivalent to those in the original description, whether the strategy's key entry, exit, and behavioral logic is implemented, and whether the code is a relevant implementation rather than a generic template substitution. This approach is in line with the literature on "LLM-as-a-Judge."
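To make the three criteria concrete, the sketch below assembles a judge prompt and parses a verdict from the reply. The prompt wording, the PASS/FAIL reply format, and both function names are assumptions made for illustration; the source does not specify the actual judge prompt or the model used.

```python
# The three criteria paraphrase the ones listed in the text above.
JUDGE_CRITERIA = (
    "The indicators match, or are equivalent to, those in the description.",
    "The key entry, exit, and behavioral logic is implemented.",
    "The code is a relevant implementation, not a generic template.",
)


def build_judge_prompt(task_description, strategy_code):
    """Assemble a plain-text prompt for a judge model (hypothetical format)."""
    criteria = "\n".join(f"{i}. {c}" for i, c in enumerate(JUDGE_CRITERIA, 1))
    return (
        "You are reviewing a generated trading strategy.\n\n"
        f"Task description:\n{task_description}\n\n"
        f"Generated code:\n{strategy_code}\n\n"
        f"Check every criterion:\n{criteria}\n\n"
        "Answer PASS or FAIL on the first line, then give a short reason."
    )


def parse_verdict(reply):
    """Return True only if the first non-empty line starts with PASS."""
    stripped = reply.strip()
    if not stripped:
        return False
    return stripped.splitlines()[0].strip().upper().startswith("PASS")
```

Treating an empty or malformed reply as a rejection is a conservative design choice: it avoids counting a strategy as passed when the judge's answer cannot be read.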

Evaluation Settings

Two interaction settings are considered:

Single-turn Setting: The model attempts to generate a correct strategy on its first attempt after receiving the task description, testing one-shot generation quality. It is sensitive to the model's initial knowledge of the domain, library, and common strategy templates.

Agentic Multi-turn Setting: After each failure, the model receives structured feedback containing the error type and a system message, and is allowed up to 10 revision attempts. This tests the model's ability to iteratively correct errors, search locally around a failing solution, and make use of diagnostic information. Similar evaluation mechanisms have proven useful in code and software engineering benchmarks.
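The multi-turn setting amounts to a bounded generate-evaluate-repair loop. The following is a minimal sketch, assuming that the limit of 10 counts all attempts including the first; `generate` and `evaluate` are hypothetical callables standing in for the model call and the evaluation pipeline.

```python
MAX_ATTEMPTS = 10  # assumption: the limit counts every attempt, including the first


def repair_loop(task_description, generate, evaluate, max_attempts=MAX_ATTEMPTS):
    """Regenerate a strategy until it passes or the attempt budget runs out.

    generate(task_description, feedback) -> source code (hypothetical hook);
        feedback is None on the first attempt.
    evaluate(source) -> (error_type, message) (hypothetical hook);
        error_type is None when every stage passed.
    """
    feedback = None
    history = []
    for attempt in range(1, max_attempts + 1):
        source = generate(task_description, feedback)
        error_type, message = evaluate(source)
        history.append((attempt, error_type))
        if error_type is None:
            return source, history
        # Structured feedback handed to the model for the next revision.
        feedback = {"error_type": error_type, "message": message}
    return None, history
```

The returned history records the error type at every turn, so the final turn of a failed trajectory can be compared with single-turn failures.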

Results

Single-turn

Table 3 shows the single-turn results for QuantCode-Bench and reveals the core pattern of the benchmark: cutting-edge models are generally strong at the compilation stage but diverge sharply in the later stages of the evaluation pipeline. For most strong models, compilation is no longer a bottleneck, yet a high compilation rate does not imply a high Judge Pass rate. The main quality losses for modern cutting-edge models occur during the backtesting and trading stages.

Multi-turn

Results in Table 4 show that the main differences between models lie not in syntactic correctness but at the levels of execution, trading signal generation, and semantic compliance. Iterative feedback is particularly effective for strong models, where a large number of errors can be fixed locally within a few attempts.

Error Analysis

Failure Stage Distribution (Single-turn)

Table 5 summarizes the single-turn results by the first stage of failure, showing where the bottlenecks are concentrated in the overall pipeline. Compilation is no longer the main issue for modern models; the main failure points are in later stages like Backtest and No Trades. This indicates that the core difficulty of trading strategy generation has shifted from Python syntax to the correct implementation of the strategy within a specific execution environment.
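A breakdown by first failed stage can be computed with a simple tally. This is a sketch only; the function name, the labels, and the sample outcomes are hypothetical and are not the figures reported in Table 5.

```python
from collections import Counter


def failure_distribution(first_failures):
    """Share of tasks per outcome.

    first_failures holds, per task, the name of the first failed stage,
    or None if the strategy passed the whole pipeline.
    """
    counts = Counter("passed" if f is None else f for f in first_failures)
    total = len(first_failures)
    return {label: count / total for label, count in counts.items()}


# Made-up outcomes for four tasks.
distribution = failure_distribution([None, "backtest", "no_trades", "no_trades"])
```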

Classification of Backtest Errors and Late Failures

Table 6 provides a fine-grained classification of runtime and late-stage failures, revealing the dominant patterns within the Backtest, No Trades, and Late-Judge failure categories. The most common failure type is a strategy that compiles and backtests successfully but generates no trades, often due to overly strict entry conditions, insufficient historical context for feature calculation, or flawed indicator logic. The second most common is the __bool__ / Line object error, reflecting improper handling of Backtrader line objects under boolean conditions. The Missing attribute/method error accounts for a smaller proportion, indicating that direct API hallucinations are less prevalent than failures in logic activation and line object handling.
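The `__bool__` failure follows from how Python's `and` and `or` work: they call `bool()` on their left operand, which a lazily evaluated series cannot answer. The toy class below is a stand-in written for this illustration and is not Backtrader's implementation; `Line` and `line_and` are hypothetical names. In Backtrader itself, combinators such as `bt.And` serve this purpose when conditions are built from line objects.

```python
class Line:
    """Toy stand-in for a series-like object (NOT Backtrader's class)."""

    def __init__(self, values):
        self.values = list(values)

    def __gt__(self, other):
        # Element-wise comparison yields another Line, not a single bool.
        return Line(a > b for a, b in zip(self.values, other.values))

    def __lt__(self, other):
        return Line(a < b for a, b in zip(self.values, other.values))

    def __bool__(self):
        # A whole series has no single truth value.
        raise TypeError("a Line has no single truth value")


def line_and(left, right):
    """Explicit element-wise AND over two Lines."""
    return Line(a and b for a, b in zip(left.values, right.values))


fast = Line([1, 3, 5])
slow = Line([2, 2, 2])
rsi = Line([25, 40, 28])
limit = Line([30, 30, 30])

try:
    # Python's `and` calls bool() on the left operand, which fails here.
    signal = (fast > slow) and (rsi < limit)
    failed = False
except TypeError:
    failed = True

# Combining the conditions explicitly avoids the truth-value check.
signal = line_and(fast > slow, rsi < limit)
```

The same sketch hints at the "no trades" category: if the combined signal is never true on the available data, the code runs cleanly but the strategy never enters a position.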

Agent Setting Errors

Tables 7 and 8 summarize the final outcome distribution in the agent setting, comparing single-turn error categories with the composition of the final turn of failed agent trajectories. Compared to the single-turn case, the composition of unresolved failures in the final turn shifts towards categories that reflect persistent semantic and logical issues. Some error proportions increase, while "Missing attribute/method" remains uncommon. "Judge rejection" becomes the primary failure reason among strategies unresolved after 10 attempts. Iterative debugging is primarily effective at fixing technical errors but less effective when the model misunderstands the task. The agent setting mainly addresses program repair and does not completely eliminate limitations in semantic interpretation of natural language specifications.

Discussion

Algorithmic trading strategy generation involves programming, financial logic, and agentic search. QuantCode-Bench reveals that while powerful models have mastered parts of the problem, limitations remain.

Modern LLMs have largely solved surface-level syntax generation; the main challenge has shifted to formalizing the intended behavior as working logic, a shift that future code generation benchmarks need to account for.

A comparison between the single-turn and agentic multi-turn settings shows that a large share of errors are locally fixable, and that a model's usefulness depends on both its single-turn accuracy and its ability to iteratively repair code.

In benchmarks for natural-language-to-code or strategy generation, difficulty depends not only on the conceptual depth of a task but also on the quality of its specification.

Comparing general-purpose and code-specific models shows that programming specialization does not guarantee superiority in domain-specific strategy generation. General-purpose models perform better when their semantic understanding and instruction-following capabilities are strong.

The judging stage is crucial. Without semantic verification, success rates would be overestimated. Open-ended tasks require semantic verification as a primary evaluation procedure.

Limitations

While QuantCode-Bench covers important practical tasks, the current version has limitations:

Strategies are evaluated only within the Backtrader framework and environment, which limits how well the results transfer to other algorithmic trading libraries and ecosystems. The benchmark could be extended to QuantConnect/LEAN and Zipline. Multi-framework evaluation could separate a model's strategy synthesis ability from its API adaptation ability.
The final semantic evaluation relies on an LLM judge. Although stronger than pure technical verification, it cannot guarantee absolute semantic correctness; the judge may overlook subtle mismatches and is subject to common biases associated with LLM judging.
The profitability, risk robustness, and economic quality of the generated strategies were not assessed. The work's focus is the model's ability to generate executable strategies from descriptions, not the investment quality of the strategy.

Conclusion

This paper introduces QuantCode-Bench, a benchmark for evaluating the ability of LLMs to generate executable algorithmic trading strategies. It formalizes the task into nested requirements, enabling the assessment of code quality and the model's ability to translate natural language trading ideas into valid implementations.

Results show that cutting-edge models are far from fully solving the task in a single generation, with the maximum single-shot Judge Pass rate being around three-quarters. However, the agent setting with iterative feedback significantly improves performance, with the best model reaching 95–98%. This indicates that a large number of errors are fixable, and model behavior in an interactive debugging loop is at least as important as single-shot generation accuracy.

Trading strategy generation demands mastery of specialized APIs, construction of executable code, formulation of realistic trading logic, and adherence to the semantics of natural language specifications. Modern models perform well on syntax and foundational aspects but show limitations in accurately formalizing trading intent and precisely implementing strategies in one shot.

QuantCode-Bench can be used for future research in domain-specific code generation, agent-based software repair, and LLM evaluation within the financial domain.
