Published by Jiqizhixin
This article is authored by Wang Jie, an angel investor in Moore Threads and one of China's first-generation AI investors. In August and December 2025, he published two articles titled "The Emerging AI Economy" and "Forty Questions About the AI Economy", providing an outlook and interpretation of the upcoming AI economy. This is his third recent article, proposing an approach to evaluating AI large models from the perspective of economic productivity.
AI Production Capacity Function: Evaluating AI Models from the Perspective of Economic Productivity
Measuring Model Capability as Economic Productivity: A Production Capacity Function for Artificial Intelligence
1. Introduction
1.1 Background
• AI large models have moved beyond technology and products to affect the economy and society at large, creating the need for an indicator that evaluates AI's working capability on real economic tasks.
• Existing mainstream evaluation benchmarks include MMLU, BIG-bench, SWE-Bench, WebArena, GAIA, AgentBench, MiniWoB, etc. These benchmarks are widely used to measure model performance in knowledge understanding, reasoning, programming, and other tasks.
• However, existing evaluation benchmarks have limitations:
• Task Homogeneity Assumption: All tasks contribute equally to the total score, failing to distinguish differences in task value.
• Lack of Human and Organizational Acceptance: whether outputs would actually be accepted by people and organizations in real settings is not tested.
• Ignoring Inference Cost: Token consumption is not included in capability measurement and is only treated as an additional indicator.
• Inability to Reflect Economic Output: As a result, macroeconomic statistics (GDP, TFP) cannot isolate the true marginal contribution of AI.
• Key Gap: Lack of an expression function connecting "AI Model → Production Capacity → Productivity → Macroeconomy".
1.2 Research Question
• How can the "true production capacity" of AI for economic output be measured in a unified, quantifiable way?
• We propose an AI production capacity function that takes tasks as the basic unit, tokens as the input and measurement basis, and GDP as the output. It is defined as the upper limit of a model's capacity to stably convert computing resources (tokens) into economic value under given task space and social acceptance constraints. It explicitly includes the following elements:
• By introducing the "Economic Turing Test", tasks in the task set must reflect not only "technical correctness/feasibility" but also "acceptance by humans in the real economic environment/desirability".
• The economic value of tasks the model can successfully complete.
• The probability of successfully completing tasks.
• The inference resources (tokens) consumed to complete tasks.
• We hope to answer the following questions:
• How much economic value can an AI model create for every token consumed? That is, the GDP/token problem.
• Can AI model capability be transformed from "pointwise performance on several benchmarks" to "value-weighted expected output density in the entire economic task space (task-space integral)"?
• How can we compare different models, different economic tasks, the AI capabilities of different countries, and different development stages of AI large models?
2. Limitations of Current Model Capability Evaluation Schemes
2.1 AI Capability Evaluation and Benchmark
• Traditional benchmarks (such as MMLU, BIG-bench, SWE-Bench, AgentBench) only test success rates or pass rates.
• They cannot answer the question of "unit AI input → economic output".
• They lack an endogenous treatment of "economic-system acceptability".
2.2 Introducing AI Production Capacity Function C(M)
• C(M) simultaneously considers:
• Task Economic Value
• Task Heterogeneity
• Success Probability (Technical Capability)
• Resource Consumption (Cost Constraint)
• C(M) unifies task economic value, task heterogeneity, success probability, and resource consumption in one framework, serving as a measurement mapping from technical capability to economic productivity.
3. Basic Concepts and Definitions
3.1 Definition and Core Function
Model capability is defined as the expected economic value of task outcomes that a model can stably generate per unit token on its economically acceptable task set.
3.2 Numerator: Economic Output
• Meaning: The total economic value realizable by the model (m) on the acceptable task set. This economic value is obtained from statistical data.
• Key Property: Tasks in the task set must meet two conditions:
• The AI model can complete this type of task (solving the "can it be done" problem).
• The result of the AI model completing this type of task passes the "Economic Turing Test" (solving the problems of "is it done well" and "is it accepted by humans").
• When task values are heterogeneous, the formula automatically assigns weights to high-value tasks.
3.3 Denominator: Token Input
• The denominator is the total token consumption over the tasks in the task set, measured in millions of tokens. Single-task token consumption is defined as follows:
• Single-task token consumption = average token consumption per attempt / the model's success rate on that task, where average token consumption per attempt = total tokens consumed across all attempts at the task / number of attempts. These figures are obtained from statistical data.
• Single task token consumption endogenously reflects the model's cost efficiency.
• Success rate endogenously reflects:
• Capability Level
• Stability
• Reproducibility
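Assembling the numerator (3.2) and the denominator (3.3) gives a compact form of the production capacity function. The notation below is ours, introduced only for illustration:

```latex
% C(M): expected economic value per million tokens on the acceptable task set.
% Notation (ours): V_j = economic value of task j, \tau_j = average tokens
% per attempt, p_j = the model's success rate on task j, J_m = the model's
% acceptable task set (tasks that are feasible and pass the ETT).
C(M) \;=\; \frac{\sum_{j \in J_m} V_j}{\sum_{j \in J_m} \tau_j / p_j}
```

Each denominator term τ_j / p_j is the expected token cost of one successful completion of task j, matching the per-task definition in 3.3.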
3.4 Task
• Human work uses individual labor as the smallest execution unit, while AI work uses "task" as the smallest execution unit.
• For AI, a task is a clearly formalized goal instance. It defines expected results, action space, constraints, and completion criteria, enabling Agents to transform open-ended environmental problems into plannable, executable, and evaluable decision processes.
3.5 Task Set (J_m)
• A set of executable tasks defined for the model (m); it must meet two entry conditions:
1. Technical Feasibility
2. Passing the Economic Turing Test
• Construction of Task Sets: We need to construct task sets from all tasks in current economic activities adapted to the Agent's working mode and carry out statistical work on tasks based on this.
3.6 Economic Turing Test (ETT)
• Definition: If the output of the model completing a task is accepted by humans in the real economic environment, it is considered successful; otherwise, it is considered a failure.
• The value is binary, {0, 1}: "fail" is 0, "pass" is 1.
• Difference from the traditional Turing Test: the criterion is not whether the output is indistinguishable from a human's, but whether it is actually accepted in real economic activity.
• Role in the Production Function: The Economic Turing Test is equivalent to an institutional and preference constraint, determining which AI outputs can be counted in GDP.
3.7 Task Economic Value (V_j)
• Task economic value statistical methods include:
• Human Work Cost: The work cost/salary required for humans to complete the same task.
• Market Transaction Price: If the task is tradable in the market, what the market transaction price is.
• Shadow Price: For tasks without market prices but that affect social welfare, risk, or long-term output, estimate their implied marginal value.
• This data comes from statistics.
3.8 Dimensions and Interpretation
• Unit: Currency / Million tokens.
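As a concrete illustration, the definitions in Section 3 can be turned into a small computation. The function and all task figures below are illustrative assumptions of ours, not data from the article:

```python
# Sketch: computing the AI production capacity function C(M) for one model
# on a hypothetical task set. All numbers are illustrative assumptions.

def capacity(tasks):
    """C(M): economic value of accepted tasks per million tokens consumed.

    Each task is a dict with:
      value  - economic value V_j (currency units)
      tokens - average token consumption per attempt
      p      - the model's success rate on that task (0 < p <= 1)
      ett    - 1 if the output passes the Economic Turing Test, else 0
    Tasks failing the ETT are excluded from the task set J_m (Section 3.5).
    """
    accepted = [t for t in tasks if t["ett"] == 1]
    total_value = sum(t["value"] for t in accepted)
    # Expected tokens per successful completion = tokens per attempt / success rate
    total_tokens = sum(t["tokens"] / t["p"] for t in accepted)
    # Express as currency per million tokens, as in Section 3.8
    return total_value / (total_tokens / 1e6)

tasks_m = [
    {"value": 120.0, "tokens": 40_000,  "p": 0.9, "ett": 1},
    {"value": 300.0, "tokens": 250_000, "p": 0.6, "ett": 1},
    {"value": 80.0,  "tokens": 30_000,  "p": 0.8, "ett": 0},  # rejected by ETT
]

print(f"C(M) = {capacity(tasks_m):.2f} currency units per million tokens")
```

Note how the ETT acts as a hard filter: the third task is technically feasible but contributes nothing to C(M) because its output is not accepted.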
4. Relationship with Traditional Production Functions
4.1 AI as a New Production Factor
• AI = Task execution capability expressed in tokens; it is a productive factor driven by computing resources, manifested as task execution capability, and scalable in digital environments.
• In the macroeconomic field, this is Model-Level AI Capability in Macroeconomics; it can measure AI's contribution to total economic output at the macro level.
• In the economic growth field, this is Task-Based AI Capability Models for Economic Growth.
4.2 Embedding into Traditional Production Functions
• AI can be viewed as a "capital-replicable task execution capability": economically, it combines labor's function of completing tasks with capital's ability to scale through replication.
• In the AI economic stage, task execution capability is expressed in tokens, meaning tokens, as an intermediate variable of the production function, are a precisely measurable proxy variable.
• Relationship with TFP: It may cause TFP in traditional production functions to shift from a residual to an explainable component; AI productivity is an explainable TFP component.
4.3 Comparison with Labor Productivity
• In the industrial economy, labor productivity is usually expressed as "output created per unit of labor input", typically in the form of GDP/man-hour. Its intuitive meaning is: how much output labor can achieve per unit of time under given technology, capital, and organizational conditions.
• The AI production capacity function proposed in this paper has a clear structural correspondence with it: it characterizes AI's production capacity by "the economic value stably converted per unit token", typically in the form of GDP/token (or GDP/million tokens). The GDP/token form can enter a more general productivity analysis framework.
5. Applications and Extensions
The AI production capacity function C(M) given above primarily accomplishes two tasks: first, providing a formal definition of model capability as economic productivity; second, explaining how this definition establishes a connection with macro-production analysis. On this basis, this section further discusses the application and extension directions of this framework.
5.1 Model Comparison
• AI production capacity ranking among different models: Ranking different models by "economic value output capability per unit token".
• Comparison of working capabilities of different versions of the same model: Clearer distinction between whether "technical score improvement" and "economic production capacity improvement" are synchronized.
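The point that benchmark rankings and economic-productivity rankings can diverge is easy to demonstrate. Model names and all figures below are illustrative assumptions:

```python
# Sketch: ranking models by benchmark score vs. by C(M) (Section 5.1).
# Names and numbers are hypothetical, for illustration only.

models = {
    # name: (benchmark score, C(M) in currency units per million tokens)
    "model_a": (0.92, 800.0),   # higher score, but token-hungry
    "model_b": (0.85, 1400.0),  # lower score, far cheaper per success
}

by_score = max(models, key=lambda name: models[name][0])
by_capacity = max(models, key=lambda name: models[name][1])
print(by_score, by_capacity)  # the two rankings need not agree
```

Here the benchmark leader is not the production capacity leader, which is exactly the gap between "optimal technical performance" and "optimal economic productivity" discussed in Section 5.5.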
5.2 Time Dimension Comparison
• Comparison of model capabilities at different stages, dynamic characterization of technological progress:
• "Cost Decline": Average token consumption per attempt decreases, i.e., improved inference efficiency, more effective tool calling, or more compact strategies.
• "Quality Improvement": Increased success probability per unit task, i.e., enhanced capability level, stability, or reproducibility of the model on existing tasks.
• "Capability Boundary Expansion": Expansion of the task set (J_m), i.e., the model can cover more tasks, especially higher-value or more complex tasks.
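The three channels above can be separated numerically: hold the task set fixed to isolate cost decline and quality improvement, then add the new tasks to see boundary expansion. Version labels and all figures are illustrative assumptions:

```python
# Sketch: decomposing a change in C(M) between two model versions into the
# channels of Section 5.2. All task data are hypothetical.

def capacity(tasks):
    # C(M) in currency units per million tokens; see Section 3 definitions.
    value = sum(t["value"] for t in tasks)
    tokens = sum(t["tokens"] / t["p"] for t in tasks)
    return value / (tokens / 1e6)

# Version v1: two tasks in J_m
v1 = [
    {"value": 100.0, "tokens": 50_000, "p": 0.5},
    {"value": 200.0, "tokens": 80_000, "p": 0.8},
]
# Version v2: cheaper attempts (cost decline), higher success rates
# (quality improvement), plus one new high-value task (boundary expansion)
v2 = [
    {"value": 100.0,  "tokens": 40_000,  "p": 0.7},
    {"value": 200.0,  "tokens": 60_000,  "p": 0.9},
    {"value": 2000.0, "tokens": 300_000, "p": 0.6},
]

c1, c2 = capacity(v1), capacity(v2)
# Restricting v2 to v1's task set isolates the first two channels;
# the remaining gap to the full c2 is the boundary-expansion effect.
c2_same_tasks = capacity(v2[:2])
print(f"v1: {c1:.1f}  v2 (same tasks): {c2_same_tasks:.1f}  v2 (full): {c2:.1f}")
```

In this hypothetical case all three channels raise C(M); a newly covered task can also lower the average if its value per expected token is below the existing level, which is why the decomposition matters.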
5.3 Cost Structure Analysis
• C(M) can be used to analyze the commoditization process on the inference side: if the success rates of multiple models on certain task sets converge, inter-model competition often shifts from "can it be done" to "who can do it at lower cost and more stably". The production capacity function in this paper can provide a unified perspective for understanding model service price competition and inference optimization strategies.
• C(M) provides an intermediate variable for analyzing energy and computing power constraints: C(M) can be combined with "energy and computing power cost per token" to build a multi-layer mapping from resource constraints to model capability to economic output.
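The "multi-layer mapping" mentioned above can be sketched in a few lines by chaining C(M) with an assumed energy cost per token. Every figure here is a hypothetical placeholder:

```python
# Sketch: mapping resource constraints to economic output via C(M)
# (Section 5.3). All figures are illustrative assumptions.

c_m = 1500.0          # C(M): currency units per million tokens (assumed)
kwh_per_mtok = 0.4    # energy needed to serve one million tokens (assumed)
price_per_kwh = 0.10  # electricity price in currency units (assumed)

energy_cost_per_mtok = kwh_per_mtok * price_per_kwh
net_value_per_mtok = c_m - energy_cost_per_mtok  # value net of energy cost
value_per_kwh = c_m / kwh_per_mtok               # economic output per unit energy

print(net_value_per_mtok, value_per_kwh)
```

The last quantity, output per kWh, is the kind of intermediate variable that lets energy and computing-power constraints enter productivity analysis directly.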
5.4 Industry and National Level
• Differences in Industry Task Structure:
• Certain industries (such as software development, digital marketing, online customer service, standardized document processing) are more formalized and more compatible with digital environments, making it easier to form a stable (J_m) and achieve higher returns from applying C(M).
• Other industries (such as high-risk medical decision-making, complex on-site operations, heavily regulated processes) may limit the release of AI production capacity due to strict ETT constraints, complex task value assessment, or non-digital execution environments.
• "National-level AI Production Capacity", comparison of AI production capabilities of different economies:
• "National-level AI Production Capacity" is the comprehensive realization level of AI production capacity by that economy based on its accessible AI foundation models, task digitization degree, organizational adoption capability, institutional acceptance boundary, and infrastructure conditions.
• The significance of this framework lies in providing a unified conceptual and measurement interface for this layered analysis of "model capability—task structure—institutional environment—economic output".
5.5 Policy and Investment Guidance
The AI production capacity function proposed in this paper can provide quantitative tools for AI model R&D investment, model deployment selection, AI input-output accounting, public procurement, economic policy formulation, and investment analysis. It is a general measurement language connecting technical evaluation, deployment decisions, industry analysis, and macro policy.
• In the stage of rapid AI diffusion, relying solely on benchmark rankings for decision-making easily leads to resource allocation bias towards "optimal technical performance" rather than "optimal economic productivity".
• For enterprise users, model selection should not be based solely on public evaluation rankings but should compare based on C(M) or its approximate estimates under the target task set, thereby aligning procurement decisions with business value creation capability.
• In AI input-output accounting and public procurement, C(M) provides a more auditable quantitative framework.
• At the industrial policy level, policymakers can use this framework to identify which industries' task structures are more suitable for AI to penetrate first, which institutional constraints are limiting high-value tasks from entering (J_m), and which infrastructure bottlenecks (energy, computing power, data centers, organizational digitization) are constraining AI production capacity from converting into actual output.
• In investment analysis, C(M) and its constituent items can also provide a supplementary perspective for judging the competitive advantages of AI-related enterprises or industries.
6. Conclusion
• This paper proposes a model capability measurement method based on economic productivity, namely the upper limit of a model's capacity to stably convert computing resources (tokens) into economic value under given task space and social acceptance constraints.
• Traditional benchmark success rate rankings cannot accurately reflect the economic productivity of models; this paper provides an operable measurement framework to convert existing benchmark data into economic productivity measures.
• Shifting from pointwise performance to task-space integral; existing benchmarks measure the correctness rate of models at given task points; this paper measures the value-weighted expected output density of models over the entire economic task space.
• The model capability definition proposed in this paper incorporates task economic value, task heterogeneity, success probability, and resource consumption, thereby advancing AI model capability from "performance" in the technical evaluation context to "production capacity" and "productivity" measures in the economic analysis context.
• This paper provides an operational framework for enterprises, research institutions, and policymakers to measure, compare, and optimize AI productivity, making AI productivity observable, quantifiable, and comparable. It provides a theoretical and empirical basis and new quantitative tools for benchmark design, model R&D, model comparison, technological progress analysis, AI cost analysis, industry and national-level AI capability assessment, model deployment, resource allocation, economic policy, and macro productivity calculation. It also provides a bridge variable connecting micro model evaluation with macro productivity analysis for AI economics research, serving as an analytical foundation that can be further refined, substantiated, and institutionalized.
• As AI further penetrates the economic system, the datafication, standardization, and substantiation work surrounding C(M) is expected to become an important foundation for understanding the true economic contribution of AI and its stage evolution.
• This paper should be understood as a foundational measurement framework rather than a completed final empirical system. Its further implementation still relies on several key tasks: systematic construction of task sets, standardization of Economic Turing Test determination mechanisms, and continuous statistics of real deployment data. These issues do not diminish the theoretical significance of this paper's framework; rather, they indicate that its core value lies in providing a unified form for organizing subsequent measurement work. Rather than waiting theoretically for a perfect and closed AI economic indicator, it is better to first establish an iterative, scalable, and substantiable capability measurement framework so that model capability can be progressively observed, compared, and tested in economic analysis.
Author Profile
Author Wang Jie, one of China's first-generation AI investors, experienced first-hand the successive development and investment stages of the mobile internet. Since 2017, he has focused mainly on AI industry investment, backing companies such as Moore Threads, BYD Semiconductor, GDS, JD Technology, Carsome, Qi An Xin, and MiningLamp Technology. Author's email: jie_wang7@sina.com.