Kimi K2.6 Goes Open Source! Plus, 300 Agent Employees Included?

Moonshot AI released Kimi K2.6 last night, and once again, it's open source.

What's even more noteworthy is that its coding performance is not only state-of-the-art (SOTA) among open-source models but also ahead of two leading closed-source models.

It scored 58.6 on SWE-Bench Pro, exceeding GPT-5.4 (xhigh) and Claude Opus 4.6 (max effort).

In other words: an open-source model has beaten the two currently strongest closed-source models.

For open-source coding models, this is likely the first time one has taken a clear lead on a mainstream benchmark.

Of course, we know that benchmark scores are only half the story... K2.6 also features an Agent cluster function that combines brute force with elegance, which I will detail later.

01 Benchmark Scores

Let's look at the hard data first.

K2.6 leads almost across the board on programming and Agent-related benchmarks:

SWE-Bench Pro: 58.6 (Open Source SOTA)
SWE-Bench Verified: 80.2
SWE-Bench Multilingual: 76.7
Terminal-Bench 2.0: 66.7
HLE w/ tools: 54.0
BrowseComp: 83.2
LiveCodeBench v6: 89.6

It didn't fall behind in math and vision either, scoring 96.4 on AIME 2026 and 93.2 on MathVision w/ python.

Yuchen Jin reposted Kimi's official tweet and commented:

"Open Source SOTA! SWE-Bench Pro 58.6, surpassing GPT-5.4 (xhigh) and Claude Opus 4.6 (max effort). Kimi's release speed is accelerating; it truly qualifies as an S-tier open-source model team."

02 More Than Just Scores

Of course, we know that having high scores is one thing, but whether a model can withstand long hours of high-intensity work in real-world scenarios is another.

And just as we know this, Kimi certainly knows it too... Therefore, K2.6's progress in this aspect might be even more noteworthy than its benchmark scores.

It can work continuously for 12 hours without crashing.

An official case study: K2.6 was used to deploy the Qwen3.5-0.8B model locally on a Mac, using the Zig language. The entire process involved over 4,000 tool calls, spanned 14 iterations, and lasted 12 hours.

Ultimately, it achieved an inference speed of 193 tokens/sec, which is 20% faster than LM Studio.

Another case was even more hardcore: A comprehensive refactoring of the exchange-core financial matching engine. It took 13 hours, involved over 1,000 tool calls, and modified more than 4,000 lines of code. Medium-load throughput increased by 185%, and overall performance improved by 133%.

In other words, K2.6 can now work like a reliable engineer for over ten hours straight without dropping the ball.

Moreover, it isn't picky about languages. Rust, Go, Python, frontend, DevOps workflows—it can deliver stable output across the board. As the official statement puts it:

"Generalization capabilities across languages and frameworks."

Vercel stated that K2.6's performance on Next.js benchmarks improved by over 50%. CodeBuddy reported an 18% increase in long-context stability and a 96.60% tool call success rate.

Furthermore, K2.6 has a very practical improvement: The average number of steps is reduced by approximately 35% compared to K2.5.

Fewer steps mean less token consumption, fewer opportunities for errors, and faster speeds.

Finding the correct answer via a shorter path is actually a more intuitive measure of a model's "smartness."

Internal Kimi Code Bench results corroborate this: the score rose from 57.4 for K2.5 to 68.2 for K2.6, a relative gain of about 19%.
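That gain is relative to the K2.5 score, not a difference in points. A quick check, using only the two scores quoted above (the helper name `relative_gain` is my own, not something from Kimi):

```python
def relative_gain(old: float, new: float) -> float:
    """Return the relative change from old to new, as a percentage."""
    return (new - old) / old * 100.0


# Kimi Code Bench scores quoted in the article: K2.5 = 57.4, K2.6 = 68.2
gain = relative_gain(57.4, 68.2)
print(f"{gain:.1f}%")  # 18.8%
```

In points, the difference is 10.8; relative to the K2.5 baseline, it is just under 19%.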

03 300 Agents on Duty

Now, for the main event.

Although the Agent cluster function was introduced in K2.5, my feeling is that it has truly matured with K2.6.

We only need to give it a task, and it will automatically decompose it, creating a bunch of different "avatars" with specific roles to work in parallel.

While K2.5 was capped at 100 sub-agents and 1,500 steps, K2.6 has directly ramped this up to 300 sub-agents and 4,000 steps.
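To make the idea concrete, here is a minimal sketch of a capped fan-out. It is not Kimi's implementation, which the article does not show. The names `SubAgent`, `plan_cluster`, and `run_cluster` are hypothetical, and treating the 4,000 steps as one budget split evenly across sub-agents is my assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Caps quoted in the article for K2.6. How they are enforced is not documented here.
MAX_SUB_AGENTS = 300
MAX_TOTAL_STEPS = 4000


@dataclass
class SubAgent:
    role: str         # hypothetical role label, e.g. "pricing analyst"
    step_budget: int  # number of steps this sub-agent may spend


def plan_cluster(roles: list[str]) -> list[SubAgent]:
    """Create one sub-agent per role, never exceeding the agent cap."""
    roles = roles[:MAX_SUB_AGENTS]
    if not roles:
        return []
    budget = MAX_TOTAL_STEPS // len(roles)  # even split: an assumption
    return [SubAgent(role, budget) for role in roles]


def run_sub_agent(agent: SubAgent) -> str:
    """Placeholder for the real work (search, analysis, writing)."""
    return f"{agent.role}: done within {agent.step_budget} steps"


def run_cluster(roles: list[str]) -> list[str]:
    """Run all planned sub-agents in parallel and collect their results."""
    agents = plan_cluster(roles)
    if not agents:
        return []
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        # map() preserves input order, so results line up with the roles
        return list(pool.map(run_sub_agent, agents))


for line in run_cluster(["market landscape", "pricing", "security"]):
    print(line)
```

Threads stand in for sub-agents only to show the shape: one instruction goes in, a planner splits it into roles, and the roles run side by side under fixed caps.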

One person, one instruction, one team.

Of course, I had to try it out myself.

04 Programming Tool Analysis Test

I gave the K2.6 Agent cluster a single sentence:

"Please use the Agent cluster to complete a deliverable package regarding the '2025-2026 Global AI Programming Tool Market Analysis': a 10-page industry analysis PDF, an Excel data sheet, and a 15-page PPT."

Then, it began.

It spent a few minutes formulating an execution plan, breaking the task down into 12 dimensions:

Market landscape, competitive landscape, deep dive into Cursor, deep dive into GitHub Copilot, comparison of other major tools, open-source ecosystem, functional and technical comparison, pricing and business models, enterprise adoption, technical trends, security and trust governance, and regional market differences.

Each dimension required independent search, analysis, and writing.

Then, the era of infinite avatars began.

05 Assembling Its Own Team

K2.6 first automatically created 12 sub-agents, each with a name, an avatar, and a role definition.

Xiang Ge is the progress compilation expert, Qingzhi is the translation expert, Hemingway (yes, really named Hemingway) is the renowned writer responsible for drafting, Secretary Ma is the business consultant, Cui Hao is the data analyst, Ah Zhe is the quality control expert...

There were 12 in total, each performing their own duties.

Apologies for not capturing a GIF here; Kimi built a very cool interactive interface that you really should see for yourself. Seeing this lineup, I was slightly stunned: is this... building me a project team?

Then, these 12 Agents started working in parallel.

It opened "Kimi's Computer" (a built-in browser environment), and all 12 Agents simultaneously searched for information on different dimensions online, possibly browsing hundreds or even thousands of pages.

06 One-Hour Assembly Line

The entire workflow was divided into several major phases:

Phase 1: Landscape Scanning (Completed in 5 rounds of search)

Phase 2: Dimension Decomposition (12 dimensions defined)

Phase 3: Parallel Deep Research (12 sub-agents working simultaneously)

Phase 4-6: Cross-Validation and Insight Extraction

Then it entered the production phase:

Stage 2: Report Writing (9 chapters + Executive Summary)

Stage 3: Excel Data Sheet Creation

Stage 4: PDF Generation (12-page professional report)

Stage 5: PPT Generation (15-slide presentation)

During the production phase, it dispatched three sub-agents in parallel: Ba Tai was responsible for Excel, Chen Ye for PDF, and Jia Qing for PPT. All three started working simultaneously.

At this point, I noticed a detail:

When Chen Ye was creating the PDF, it was actually writing code in Python within a sandbox to generate the file. It installed Chromium and used an HTML-to-PDF conversion method to ensure layout quality.

There was even a small hiccup in the middle: an issue with the generated report image dimensions was discovered by an Agent, which then actively went to modify the CSS to fix it.
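The HTML-to-PDF route described above can be sketched in a few lines. This is not the code the Agent wrote, which the article does not show. It is a minimal sketch assuming a Chromium binary named `chromium` is available on the PATH; the function names and file names are placeholders of my own.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdf_command(html_path: str, pdf_path: str, binary: str = "chromium") -> list[str]:
    """Build a headless-Chromium command that prints an HTML file to PDF."""
    return [
        binary,
        "--headless",
        "--disable-gpu",
        f"--print-to-pdf={pdf_path}",
        Path(html_path).resolve().as_uri(),  # file:// URL of the report
    ]


def html_to_pdf(html_path: str, pdf_path: str, binary: str = "chromium") -> bool:
    """Run the conversion; return False if the binary is not installed."""
    if shutil.which(binary) is None:
        return False
    subprocess.run(build_pdf_command(html_path, pdf_path, binary), check=True)
    return True
```

With this approach, page size and image dimensions are controlled by the HTML's CSS, which fits the CSS fix described above.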

The entire process took about one hour.

07 Delivery Results

Finally, it delivered three complete sets of files to me:

A PDF industry report, with a cover design that... actually looks quite professional, complete with a table of contents, chapters, and data charts. The content covered market landscape (Copilot 42% vs Cursor $2B ARR), adoption rates (84% developer usage, 91% enterprise adoption but only 29% trust), technical trends (Agentic Coding revolution, MCP protocol standards), security challenges, the Chinese market (30% penetration rate, CAGR 38.4%), and more.

An Excel data sheet comparing the functions, pricing, and user scale of major AI programming tools.

Note that the Excel workbook contains multiple sheets.

A 15-page PPT, complete with charts, data, and an analytical framework.

Of course, looking at it with a critical eye (since this topic is obviously my comfort zone), there were no major flaws, but there were still some minor issues.

So if you were to hand this directly to a publisher for a book, you'd still need to review it. However, for daily reference, learning, or analysis, it is more than sufficient.

But the flaws aren't the point; the key here is: This is the result of one sentence, one hour, and zero human intervention.

If I handed this task to Claude Code, it would likely ask me: "Why don't you go to sleep first?" and then go on strike on its own... But now, I just typed one sentence, played a few games of Honor of Kings, and came back to find the files neatly arranged.

From One Sentence to Three Sets of Files

If I had to point out a downside, it's that the run took a while, but I can only blame myself for assigning such a complex task.

08 Full-Stack Capability Upgrade

Besides the Agent cluster, K2.6 also has significant upgrades in frontend generation.

The official team demonstrated several of K2.6 Agent's frontend capabilities:

WebGL Shader Animation: Directly writes GLSL/WGSL code to create liquid metal, caustics, and ray-tracing effects.

Video Hero Section: Calls video generation APIs to create movie-quality hero sections, synthesizing them into the page, synchronized with scrolling.

3D Scenes: Builds real 3D scenes using Three.js + React Three Fiber, paired with GSAP ScrollTrigger for scroll-driven animations.

Design Language Understanding: Brutalist, cinematic, Swiss grid, Y2K chrome, magazine layouts—K2.6 understands these design vocabularies, outputting webpages with their own atmosphere.

And it's not just frontend. More critically, it now supports backend work too: user registration and login plus a database, with both frontend and backend handled from a single prompt.

It has evolved from "Help me draw a page" to "Help me generate a complete application."

The official team also launched an internal Kimi Design Bench to measure frontend design capabilities. In a comparison between K2.6 Agent and Gemini 3.1 Pro on Google AI Studio, Kimi won 47.5%, tied 21.1%, and Google won 31.4%.
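One way to read these three numbers is to set ties aside and ask how often Kimi won when there was a clear winner. The sketch below uses only the percentages quoted above; the "decisive win rate" framing and the helper name are my own, not part of the official Kimi Design Bench.

```python
def decisive_win_rate(win: float, loss: float) -> float:
    """Share of wins among comparisons that did not end in a tie, in percent."""
    return win / (win + loss) * 100.0


# Figures quoted in the article: Kimi 47.5%, tie 21.1%, Google 31.4%
win, tie, loss = 47.5, 21.1, 31.4
assert abs(win + tie + loss - 100.0) < 1e-9  # the three shares sum to 100%

print(f"{decisive_win_rate(win, loss):.1f}%")  # 60.2%
```

So among the comparisons with a clear winner, Kimi took roughly six in ten.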

09 The Meaning of Open Source

Netizen SmartFind commented:

"The scores are indeed eye-catching, but the real shift is autonomy. When models can run continuously for several hours, coordinate multiple Agents, and deliver across tech stacks, the bottleneck shifts from 'how to write code' to 'what should be built'."

And all of this is open source.

The weights are on HuggingFace, the API is open, and there's a dedicated Kimi Code CLI tool. The price is one-sixth that of Claude Opus 4.6.

Netizens flooded the comments, and the reaction was overwhelmingly positive:

Alamin claimed:

"Open-source is no longer catching up, it's starting to set the pace."

Looking back at the timeline, K2.5 was released at the end of January this year, and K2.6 arrived in April. Less than three months for another major version iteration.

Yuchen Jin said "Kimi's release speed is accelerating," and indeed, it is.

10 Finally

K2.6 signals a shift: The competition among AI programming tools has moved from "whose model has higher scores" to "who can help you do more things."

Benchmark scores are the entry ticket; the Agent cluster is what makes the product competitive.

One person inputs one sentence, 300 Agents work in parallel for an hour, and deliver all the results you want.

For the first time, open-source models are not just chasers. So I'm even starting to look forward to it:

What will it look like when K3 arrives?

◇ ◆ ◇

Relevant Links:

Technical Blog: https://www.kimi.com/blog/kimi-k2-6
Model Weights: https://huggingface.co/moonshotai/Kimi-K2.6
Kimi Official Site: https://kimi.com
Kimi Code: https://kimi.com/code
API: https://platform.moonshot.ai
