General-purpose large code models often report port errors when writing Verilog and blow straight past hardware limits when tuning CUDA kernels.
This is not a lack of capability but a fundamental ignorance of the rules governing industrial code.
Behind the InCoder-32B model, jointly released by Beihang University and several other institutions, are 2.5 million execution-verified industrial code samples generated in real simulation environments, covering five major industrial domains: chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling.
The paper has already received nearly 300 upvotes on Hugging Face Daily Papers, sparking intense interest in the open-source community. Both the full and quantized versions of the model weights have been open-sourced!
Why Do General-Purpose Code Models Still Struggle with Industrial Code?
In recent years, large code models have made significant progress in general programming tasks. Models like Claude continue to break records on benchmarks such as SWE-bench, demonstrating strong practical value in scenarios like algorithmic problem solving, web development, and automated GitHub issue repair.
However, there is a fundamental difference between general programming and industrial programming. Industrial code—including chip RTL design (Verilog/SystemVerilog), GPU kernel development (CUDA/Triton), embedded firmware writing (C/ARM), compiler-level assembly optimization (x86-64), and parametric 3D modeling (CadQuery)—not only involves specialized language constructs and domain-specific APIs but also requires the model to possess an accurate understanding of hardware semantics, resource constraints, and physical behaviors.
Taking GPU kernel optimization as an example, the paper presents a case study of CUDA RMS Normalization:
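The grid-configuration failure in that case study can be sketched in a few lines. This is an illustrative Python re-creation (not code from the paper): the only numbers taken from the case are spatial_size = 262144 and the 65535 gridDim.y ceiling; the batch size and block size are made up for the example.

```python
# Illustrative sketch of the CUDA grid-limit failure described in the case study.
CUDA_MAX_GRID_Y = 65535        # hardware cap on gridDim.y
CUDA_MAX_GRID_X = 2**31 - 1    # gridDim.x allows far larger values

def naive_grid(batch, spatial_size):
    """Grid as (batch, spatial_size): exceeds gridDim.y when spatial_size > 65535."""
    grid = (batch, spatial_size)
    return grid, grid[1] <= CUDA_MAX_GRID_Y

def flattened_grid(batch, spatial_size, block=256):
    """Flatten all spatial work into gridDim.x, the scheme described for InCoder-32B."""
    blocks = batch * ((spatial_size + block - 1) // block)
    grid = (blocks, 1)
    return grid, grid[0] <= CUDA_MAX_GRID_X

grid, ok = naive_grid(16, 262144)
print(ok)   # False: 262144 > 65535, an illegal launch parameter at runtime
grid, ok = flattened_grid(16, 262144)
print(ok)   # True: 16384 blocks fit easily in gridDim.x
```

The arithmetic is trivial, which is exactly the point: the bug is not logic but a hardware constraint the model must simply know.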
When configuring the CUDA grid, Claude assigned spatial_size (262144) directly to gridDim.y. CUDA hardware, however, caps gridDim.y at 65535, so the launch fails at runtime with an illegal-parameter error. This is not an algorithmic logic failure but a lack of awareness of GPU hardware constraints. InCoder-32B instead flattens all spatial dimensions into one and schedules via gridDim.x, sidestepping the limit.
The root cause lies in the essential distinction between industrial and general code: industrial code demands that the model understand hardware semantics, master specialized language constructs, and strictly adhere to resource constraints.
InCoder-32B: A Code Foundation Model Tailored for Industrial Code
Statistical data from the paper further confirms this gap: current state-of-the-art models achieve a function call success rate of only 28.80% on Triton operator generation tasks and an accuracy of merely 33.3% in formal equivalence verification of Verilog code. These figures indicate that existing large code models are built around general-purpose programming languages in both their training data and evaluation suites, leaving the industrial code domain severely under-covered.
InCoder-32B is the first code foundation model oriented towards industrial code intelligence. Adopting a 32-billion parameter Decoder-only Transformer architecture, it aims to serve multiple industrial code domains with a single model.
Unlike previous works focusing on single industrial sub-domains, such as RTLCoder for Verilog or Kevin for CUDA, InCoder-32B incorporates chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling into a unified training framework. While covering industrial code capabilities, the model maintains competitiveness in general code tasks, achieving a balance between industrial specialization and general utility.
Core Methodology: Scaling Industrial Code Data Production in Real Simulation Environments
The verification of correctness for industrial code differs fundamentally from general code. A Python function can be quickly validated via unit tests, but a Verilog module requires RTL simulation and logic synthesis to confirm its feasibility on real silicon; CUDA/Triton kernels must run on actual GPUs to verify numerical correctness and performance targets; embedded firmware needs to boot on target microcontrollers or their simulators to confirm register configurations and interrupt behaviors; CAD scripts require validation that the generated 3D entities are geometrically faithful to design specifications.
Therefore, the key insight of the paper is: the correctness of industrial code can only be verified through execution in real deployment environments. This implies that the prerequisite for producing high-quality industrial code training data at scale is the construction of a complete set of production-grade execution and verification infrastructure.
To this end, the team built four major categories of industrial simulation environments, adhering to one core principle: replicate the toolchains and execution semantics actually used by industrial engineers, rather than constructing simplified alternatives.
Chip Design Environment: Digital design in the semiconductor industry follows a strict process: RTL writing, behavioral simulation, logic synthesis, and physical implementation. The team reconstructed the first three stages using public EDA tools: Icarus Verilog executes Verilog behavioral simulation; Verilator translates SystemVerilog RTL into optimized C++ models for high-speed simulation, consistent with simulators used in open-source chip projects like CHIPS Alliance and lowRISC; Yosys maps RTL to gate-level netlists to verify synthesizability and extract area and timing estimates. These three are encapsulated in the same containerized image, ensuring that the quality judgment standards for training data are entirely consistent with the standards determining whether a design succeeds on real silicon.
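The three chip-design stages could be driven from a single harness along these lines. This is a hypothetical sketch, not the paper's pipeline: the tools (Icarus Verilog, Verilator, Yosys) are the ones named above, but the flags, file names, and structure are plausible defaults chosen for illustration.

```python
# Hypothetical command builder for the three chip-design verification stages.
def chip_design_checks(top, sources):
    """Build shell commands for simulation, C++ translation, and synthesis.
    `top` is the top module name; `sources` is a list of Verilog files."""
    return {
        # Behavioral simulation with Icarus Verilog (SystemVerilog-2012 mode)
        "simulate": ["iverilog", "-g2012", "-o", f"{top}.vvp", *sources],
        # Translate RTL into an optimized C++ model with Verilator
        "verilate": ["verilator", "--cc", "--build",
                     "--top-module", top, *sources],
        # Map to a gate-level netlist with Yosys; `stat` reports area estimates
        "synthesize": ["yosys", "-p",
                       f"read_verilog {' '.join(sources)}; synth -top {top}; stat"],
    }

cmds = chip_design_checks("rms_norm", ["rms_norm.v", "tb_rms_norm.v"])
```

Running all three inside one containerized image, as the paper describes, is what keeps the pass/fail criteria for training data identical to the criteria a real tape-out flow would apply.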
GPU Optimization Environment: Directly deployed on NVIDIA A100 nodes. The CUDA path integrates nvcc via PyTorch's runtime compilation interface, consistent with the compilation and loading methods of custom kernels in FlashAttention and xFormers; the Triton path uses the official compilation stack, where Python functions decorated with @triton.jit are compiled into GPU code and cached upon the first call, matching the path used by inference frameworks like vLLM and SGLang. Kernels are launched on A100 hardware identical to production loads, memory is managed by the standard CUDA allocator, and timing is measured using CUDA events, ensuring that performance signals obtained during data synthesis are directly transferable to real deployment.
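The correctness-then-performance check at the heart of this environment can be shown with a CPU stand-in. The sketch below substitutes NumPy for GPU kernels and perf_counter for CUDA events; the real harness runs candidates on an A100 and times them with CUDA events, as described above. The function names and tolerances are illustrative.

```python
import time
import numpy as np

def verify_kernel(candidate, reference, inputs, rtol=1e-3, atol=1e-3, iters=10):
    """Check numerical correctness against a reference, then time the candidate.
    CPU stand-in for the GPU harness: real verification uses CUDA events on A100s."""
    out, ref = candidate(*inputs), reference(*inputs)
    if not np.allclose(out, ref, rtol=rtol, atol=atol):
        return {"correct": False, "latency_ms": None}
    start = time.perf_counter()
    for _ in range(iters):
        candidate(*inputs)
    latency_ms = (time.perf_counter() - start) / iters * 1e3
    return {"correct": True, "latency_ms": latency_ms}

# Toy RMS normalization reference, matching the case study's operator
def rms_norm_ref(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

x = np.random.rand(4, 128).astype(np.float32)
result = verify_kernel(rms_norm_ref, rms_norm_ref, (x,))
```

The two-phase structure matters: a kernel that is fast but numerically wrong is rejected before it is ever timed, so performance signals only accumulate for correct code.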
3D Modeling: Built based on OpenCascade (a widely used solid modeling kernel in the industry supporting Boolean operations, fillets, extrusions, rotations, lofts, etc.) and CadQuery. Generated scripts run on the same version of OpenCascade as production tools like FreeCAD and KiCad. Geometric fidelity is evaluated by meshing the output entities and comparing volumes with reference bodies. The verification standard requires not just syntactic correctness but geometric faithfulness to design specifications.
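The volume-comparison criterion reduces to a simple relative-error check. A minimal sketch, with a made-up 1% tolerance (the paper does not state the threshold); the real pipeline first meshes the OpenCascade solid to obtain the candidate volume.

```python
def geometric_fidelity(candidate_volume, reference_volume, tol=0.01):
    """Pass if the candidate's volume is within `tol` (relative) of the
    reference body's volume. Stand-in for the mesh-and-compare check."""
    rel_err = abs(candidate_volume - reference_volume) / reference_volume
    return rel_err <= tol

# A 20 x 10 x 5 reference block (volume 1000) vs. two candidates
print(geometric_fidelity(1000.0, 1000.0))  # True: exact match
print(geometric_fidelity(920.0, 1000.0))   # False: 8% volume error
```

A script can be syntactically valid CadQuery and still fail this check, which is exactly the gap between "runs" and "geometrically faithful" the paper calls out.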
Code Optimization: For the embedded direction, the STM32F407 (ARM Cortex-M4) serves as the target platform, using the arm-none-eabi-gcc cross-compiler along with CMSIS device header files and chip memory layout linker scripts. Verification is executed on the Renode simulator, which provides a complete virtual replica of the STM32F407, including GPIO, UART, SPI/I2C buses, timers, ADC+DMA, and interrupt controllers. Each peripheral model replicates the register layout and interrupt behavior described in the reference manual. This fidelity is crucial for industrial code verification because defects in the embedded domain often stem from register configuration errors or interrupt priority conflicts, which can only be exposed on real or high-fidelity simulated hardware. The x86-64 assembly direction replicates standard compiler benchmarking processes, repeating measurements under fixed CPU frequencies and bound core affinity.
Based on the above simulation environments, the team constructed 2.5 million execution-verified SFT samples. The entire data production process consists of four steps:
Task Construction: Decomposes raw industrial code tasks into structured instructions, including natural language requirement descriptions, interface constraints (port lists, function signatures, APIs), target platform and toolchain configurations, dependencies, and verification scripts.
Candidate Generation: Generates diverse candidate solutions through complementary strategies such as template perturbation and cross-language migration, ensuring coverage of different implementation strategies and coding styles.
Execution Verification: Performs full-link verification of candidate solutions in the aforementioned simulation environments—compilation checks, simulation runs, test execution, performance analysis, and formal verification.
Feedback-Driven Repair: This is the most critical link in the data production process. When a candidate solution fails execution, the pipeline captures the complete feedback context—compilation error messages, runtime logs, counterexample inputs, waveform differences, and performance bottlenecks—and then appends this feedback to the failed solution to generate a repaired version. This closed-loop repair trajectory (failed solution + environment feedback + repaired solution) is also included in the SFT corpus, corresponding to the real workflow of engineers diagnosing problems from tool outputs and iteratively fixing them.
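The four steps above can be sketched as one loop. This is a schematic only: generate, verify, and repair are placeholders for the paper's model calls and simulation environments, and the three-round cap is an assumed parameter.

```python
# Schematic of the four-step data production pipeline described above.
def produce_samples(task, generate, verify, repair, max_rounds=3):
    """Return SFT samples: direct answers plus failure-feedback-repair
    trajectories, mirroring the closed-loop repair process."""
    samples = []
    for candidate in generate(task):              # step 2: candidate generation
        result = verify(task, candidate)          # step 3: execution verification
        if result["passed"]:
            samples.append({"type": "direct", "solution": candidate})
            continue
        for _ in range(max_rounds):               # step 4: feedback-driven repair
            fixed = repair(task, candidate, result["feedback"])
            result2 = verify(task, fixed)
            if result2["passed"]:
                samples.append({"type": "repair",
                                "failed": candidate,
                                "feedback": result["feedback"],
                                "solution": fixed})
                break
            candidate, result = fixed, result2    # iterate on the latest attempt
    return samples
```

The key design choice is that failed attempts are not discarded: the (failed solution, environment feedback, repaired solution) triple itself becomes training data.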
The final training samples comprise three types: direct answers (direct path from requirement to implementation), defect repairs (closed-loop trajectories of failure-feedback-repair), and performance/structural optimizations (functionally correct solutions further optimized for efficiency or architectural quality).
Three-Stage Training
InCoder-32B adopts a three-stage progressive training recipe. Pre-training uses 4,096 GPUs and 15 trillion tokens, fusing data from public code repositories, technical literature, and domain-specific websites, with curriculum learning from the function level up to the project level. Mid-training expands the context window from 8K to 128K in two steps while injecting reasoning QA, agent trajectories, and industrial artifact data. Post-training uses the aforementioned 2.5 million execution-verified industrial code SFT samples to specialize industrial capabilities.
Model Performance
InCoder-32B was comprehensively evaluated on 14 general code benchmarks and 9 industrial code benchmarks.
In terms of general code, the model maintains strong competitiveness: HumanEval 94.5%, MBPP 91.8%, and SWE-bench Verified 74.8% (leading level among open-source models of similar scale). It also performs outstandingly on agent tasks: Terminal-Bench 35.0, Mind2Web 55.8%, and tau-2-bench Telecom 86.8%.
In terms of industrial code, InCoder-32B achieved significant breakthroughs on multiple benchmarks:
Notably, as a 32B parameter open-source model, InCoder-32B's IoU on CAD-Coder (53.5%) significantly surpasses Claude-Sonnet-4.6 (32.4%), and it achieved the best results among open-source models across all three levels of KernelBench. This demonstrates that the specialized training route for industrial code is indeed effective.
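For readers unfamiliar with the IoU metric reported for CAD-Coder: one common way to compute IoU between 3D shapes is over voxel occupancy grids. The sketch below illustrates the metric style only; CAD-Coder's exact evaluation procedure is defined by the benchmark, not assumed here.

```python
import numpy as np

def voxel_iou(a, b):
    """Intersection-over-union of two boolean occupancy grids."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0

# Two 10x10x10 grids: one filled in slices 0-4, the other in slices 2-6
a = np.zeros((10, 10, 10), dtype=bool); a[:5] = True
b = np.zeros((10, 10, 10), dtype=bool); b[2:7] = True
print(round(voxel_iou(a, b), 3))  # 3 overlapping slices / 7 in the union ≈ 0.429
```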
Error Analysis: Where Industrial Code Proves Difficult
The team conducted a systematic manual error analysis on 1,882 failure samples across 9 industrial benchmarks, categorizing them into five core issues:
Compilation and Syntax Errors are the most prevalent failure type, particularly prominent in the chip design domain. In RealBench, 71% of failures stemmed from malformed literals, mismatched port declarations, and inconsistent bit widths; in ArchXBench, 51% of failures arose from misused named ports and uncertain bit widths of symbolic literals. Although the model has acquired extensive domain vocabulary, it has not yet fully internalized the strict grammatical rules of industrial code.
Insufficient Industrial API Knowledge is the second major issue. In EmbedCGen, 47% of failures were linking errors originating from undefined or type-incorrect HAL/CMSIS function calls; in TritonBench, 33% of failures were NameErrors and 24% were TypeErrors, all pointing to incorrect usage of the Triton API. These industrial-proprietary APIs appear infrequently in general code training corpora, leading to insufficient model coverage.
Insufficient Functional Correctness manifests as code that compiles but fails tests. In VeriRepair, 79% of failures belong to this category. The code is syntactically correct but contains implicit logical errors, such as incorrect state machine transition conditions or numerical semantic deviations. In CAD-Coder, 93% of geometric failures stem from systematic misunderstandings of Euler angle conventions. Such implicit logical errors represent the most challenging problem currently.
Output Format Violations account for 46% in VeriScope, where the model generated unparseable output, failing to follow the structured format required by the evaluation.
Insufficient Performance Optimization mainly appears in the GPU and compiler domains. In KernelBench, 33% of the failed code was functionally correct but did not meet execution speed targets; in SuperCoder, 83% of failures involved directly copying input assembly without any optimization. This reflects that the model's reasoning capabilities regarding underlying hardware behaviors such as memory hierarchy, instruction pipelines, and parallel scheduling still require improvement.
Open Source Information
The model and code are now open-sourced on Hugging Face and GitHub under the Apache 2.0 license:
Hugging Face: https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder
GitHub: https://github.com/CSJianYang/Industrial-Coder
arXiv: https://arxiv.org/abs/2603.16790
About the Authors:
Among the core contributors to this work are two undergraduate students from Beihang University: Wu Jiajun, a senior in the School of Computer Science, and Cheng Junhang, a junior at the Institute of Advanced Technology.