Costs Cut by 90%, Accuracy Hits 100%! MIT's Counterintuitive Architecture Challenges Silicon Valley Dogma

There is a very interesting phenomenon in the AI world.

On one side, tech companies are frantically giving large models "hands and feet," enabling them to operate various software on behalf of humans; on the other, the enterprise customers who actually need AI are shaking their heads.

A group of researchers from the Technical University of Munich, TU Darmstadt, MIT, and other institutions has pulled back the curtain on this disconnect in their latest paper.

Chart showing performance comparison between RUBICON and other AI agents

The problem with implementing AI in enterprises is not that models aren't smart enough, but that the data is too messy.

In a simulated enterprise environment, cutting-edge Large Language Model (LLM) agents scored a devastating 0% accuracy, while a new architecture called RUBICON, built around a simple and straightforward query language, scored 100%. And it did so using smaller, cheaper models.

Enterprise data is scattered across various siloed systems. Today's agentic AI always wants the LLM to be the brain, understanding and operating everything, which results in chaos, high costs, and unreliability.

RUBICON has taken a different path. It hands operational control back to the user, allowing them to explicitly tell the system which data source to look in and what to look for, using a minimalist query language called AQL. The LLM is only responsible for working within these precisely defined boundaries.

Diagram illustrating the RUBICON architecture workflow

All intermediate steps are transparent and visible, allowing user intervention at any time. This method replaces the probabilistic nature of end-to-end reasoning with structured determinism.

Smart Brains Mismatched with Messy Data

Over the past few years, turning LLMs into agents has become the mainstream approach.

The thinking is that LLMs have strong reasoning capabilities and should act as the conductor, deciding for themselves when to query which database, when to call which tool, and finally synthesizing an answer.

This LLM-centric architecture looks great on the surface.

But a real-world enterprise is nothing like the clean test sets in a lab.

The paper bluntly points out that the difficulties enterprises face with AI are almost entirely data integration problems, not reasoning deficits.

Critical information is scattered across completely different systems like databases, document systems, email services, and external web pages.

Each system has its own query language, data structure, and access permissions.

These systems are data fortresses with rigorous architectures and extreme performance demands, while an LLM is a probabilistic master of wordplay. Letting the latter manage the former is like asking a poet to command an aircraft carrier fleet.

Why doesn't the currently popular Text-to-SQL work well in the enterprise?

The paper identifies four critical differences.

Public academic test sets are small in data volume; for an LLM, they are just a handful of documents.

However, enterprise data warehouses often store massive amounts of data, on an entirely different scale.

Academic test sets pursue a clean, single schema. In enterprises, to accelerate access, there are redundant views and materialized views everywhere. The same question can be queried in countless ways, immediately confusing the LLM.

Enterprise data is full of internal jargon. An abbreviation might represent a complex project, and a code string might underpin an entire business process. Relying on the LLM to guess is completely unrealistic.

Real-world business queries are also much more complex than those in test sets.

When these differences accumulate, the paper observes a precipitous drop of over 50% in LLM accuracy on real enterprise data warehouses compared to benchmarks. This goes directly from usable to unusable.

You Decide What to Find and Where to Look

Since that path leads nowhere, RUBICON's solution is elegantly simple.

Instead of obsessing over making the machine understand everything, it returns the steering wheel to the human driver.

The core of the architecture is a query algebra called AQL (Agentic Query Language).

This language is extremely concise, with just three core clauses: FIND (what to look for), FROM (where to find it), and WHERE (what the conditions are).

The condition part is written in natural language, while the user explicitly specifies the other parts.

For example, suppose you want to know which professors in a university's research labs have won a Turing Award or a Nobel Prize. For RUBICON, a possible AQL command looks like this:

Example of an AQL query for finding award-winning professors

As you can see, the user must clearly state that data should be fetched from two specific sources: Wikipedia and the university data warehouse, and specify the fields.
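The paper's actual query is the one shown in the figure above. As a rough stand-in, the three parts of an AQL command can be modeled as a small data structure. Everything in this sketch is an assumption made for illustration: the class name `AqlQuery`, the field names, the source names, and the rendered text are hypothetical and are not the paper's syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AqlQuery:
    """Hypothetical in-memory form of one AQL command (not the paper's syntax)."""

    find: tuple[str, ...]     # FIND: fields the user wants back
    sources: tuple[str, ...]  # FROM: data sources the user names explicitly
    where: str                # WHERE: free-text condition, the only part an LLM interprets

    def render(self) -> str:
        """Render the three parts as a single line of text."""
        return (
            f"FIND {', '.join(self.find)} "
            f"FROM {', '.join(self.sources)} "
            f"WHERE {self.where}"
        )


# Illustrative values only; the field and source names are invented.
query = AqlQuery(
    find=("name", "award"),
    sources=("wikipedia", "university_warehouse"),
    where="the person is a professor who won a Turing Award or a Nobel Prize",
)
print(query.render())
```

The point of the structure is that `find` and `sources` are fixed by the user before any model is called, so only the `where` string is left open to interpretation.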

The LLM's job is compressed into a very narrow scope: it is only responsible for understanding the natural language condition after WHERE and translating it into executable commands for each data source.

It no longer has to guess where to find data or worry about how to join it together. The translation is carried out by wrappers that connect to the individual data sources.

Each wrapper is responsible for transforming a data source (even an email system or video library) into a normalized relational table view.

This makes all data look like rows and columns in a database, making downstream operations extremely explicit.
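To make the wrapper idea concrete, here is a minimal sketch of two toy wrappers that each expose a very different source as the same shape: a list of flat rows. The `Wrapper` protocol, the class names, and the sample inputs are all assumptions for illustration; the article does not describe RUBICON's real wrapper interface.

```python
from typing import Protocol

Row = dict[str, str]


class Wrapper(Protocol):
    """Anything that can present its source as a list of flat rows (hypothetical interface)."""

    def rows(self) -> list[Row]: ...


class EmailWrapper:
    """Toy wrapper: turns raw 'Header: value' messages into rows."""

    def __init__(self, raw_messages: list[str]) -> None:
        self.raw_messages = raw_messages

    def rows(self) -> list[Row]:
        table: list[Row] = []
        for message in self.raw_messages:
            # Headers and body are separated by the first blank line.
            headers, _, body = message.partition("\n\n")
            row: Row = {"body": body.strip()}
            for line in headers.splitlines():
                key, _, value = line.partition(":")
                row[key.strip().lower()] = value.strip()
            table.append(row)
        return table


class AwardListWrapper:
    """Toy wrapper: turns a page of 'Name | Award' lines into rows."""

    def __init__(self, page_text: str) -> None:
        self.page_text = page_text

    def rows(self) -> list[Row]:
        table: list[Row] = []
        for line in self.page_text.splitlines():
            if "|" not in line:
                continue  # skip lines that are not list entries
            name, _, award = line.partition("|")
            table.append({"name": name.strip(), "award": award.strip()})
        return table


def preview(wrapper: Wrapper) -> list[Row]:
    """Downstream code sees only rows, whatever the source was."""
    return wrapper.rows()


# Invented sample inputs.
emails = EmailWrapper(
    ["From: dean@example.edu\nSubject: Award ceremony\n\nPlease join us on Friday."]
)
awards = AwardListWrapper("Ada Example | Turing Award\nBob Sample | Nobel Prize")
print(preview(emails))
print(preview(awards))
```

Because both wrappers return the same row shape, a later filter or join step does not need to know whether a row came from a mailbox or a web page.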

This design directly transforms the opaque chain-of-calls in LLM agents into an explicit, examinable relational operation pipeline.

RUBICON has two operating modes.

In interactive mode, after executing an AQL command, the user gets a visual, spreadsheet-like intermediate result.

The user can pause to check, immediately correct any errors found, and then save the result to send to the next command. Each step is concrete, traceable, and reproducible.

If you want to repeat a task, compilation mode packages the entire sequence of commands, much like a traditional database query plan, allowing an optimizer to find the most efficient execution path, at a far lower cost than repeated LLM calls.

0% vs. 100%

Talk is cheap. The team set up an interesting micro-benchmark between RUBICON and current agentic AI approaches.

They simulated a typically cluttered enterprise information environment, including Wikipedia, an anonymized university data warehouse with 97 tables, a university research lab website, the Gmail email system, and the LLM's own pre-training knowledge base.

They carefully designed seven questions. Each question had to span exactly two data sources to be answered, with the other three sources acting as complete distractions.

Table 1: The true data source relevance for seven benchmark queries. Green indicates mandatory data sources (R), yellow indicates optional data sources (O), and gray indicates irrelevant data sources (-).

Table showing Relevance matrix of data sources for seven benchmark queries

The models used were OpenAI's GPT-5-mini, Google's Gemini-3-flash-preview, and Anthropic's Claude-Sonnet-4.6.

They competed in two configurations: one was the standard chat mode without any tools (Vanilla LLM), and the other was a LangChain agent equipped with full data source access rights, using the currently popular ReAct (Reasoning-Acting) loop.

The result was a stunning clean sweep.

All Vanilla LLM and LangChain agent configurations achieved an accuracy of 0%. Not a single correct answer.

Bar chart showing 0% accuracy for all LLM agents vs. 100% for RUBICON

The failures did not come from the models hallucinating, but from a systematic breakdown in coordination.

The models either forgot to query a necessary source, stopped halfway through, or failed to correctly correlate results from different sources.

For the earlier example about award-winning professors, the LangChain agent would often just scrape a list of award winners from Wikipedia but fail to verify against the data warehouse whether these people were professors at the school, ultimately listing a bunch of outsiders.

Given the tools, the model couldn't control its own hands. The paper describes how giving models greater autonomy and stronger reasoning settings resulted in a wider spread of failure and higher costs.

In contrast, RUBICON achieved a 100% accuracy rate. These seven questions were just a step-by-step combination of AQL commands for it, with no possibility of missing a query or forgetting a join.

Restraining Hands and Feet Actually Saves Money

The efficiency comparison is equally stark. Let's look at the summarized cost and latency data in Table 3.

Table 3: Summary of average efficiency metrics for all queries. k̄ is the average number of tool calls per query (0 for Vanilla mode).

Table comparing cost, latency, and token usage between RUBICON and other models

The Vanilla mode is quiet, calling no tools, with an input token count of less than 80, resulting in extremely low costs.

Once the model becomes a ReAct agent, the situation immediately spirals out of control. The agents thrash through attempt after attempt and still end up at that sorry 0% accuracy.

The GPT agent's average input token count per query soared to 20,000 to 46,000.

The Gemini agent was even more dramatic. In natural language mode, it used over 270,000 input tokens, and in AQL mode, it surged to nearly 470,000 tokens, averaging 22.71 tool calls per query. The monetary cost per query was $0.28, and the time to first response exceeded 4 minutes.

Claude's situation was similar; its expensive per-token pricing, combined with extensive exploration, could lead to a single query costing over $0.50.

These models spent ever more reasoning, ever longer contexts, and ever more tool calls, and all of it bought the same score of 0.

RUBICON uses GPT-5-mini, maintaining costs at a consistently low level, calling exactly two tools per query (corresponding to the two mandatory data sources), with no aimless wandering or overthinking.

RUBICON returns the critical decision-making power of where to find data to the user. This ensures correct results and naturally bypasses a major pitfall that agentic AI finds hard to handle on its own: query plan selection.

The paper uses the same award-winning professor question as an example.

A user aiming for this goal can write two different AQL commands.

Plan A is "find winners first, then find people": first, find all Turing and Nobel laureates from Wikipedia (this list is not long), and then check in the university data warehouse to see which ones are professors.

Plan B is "find people first, then find prizes": first, pull up all professors (potentially many), and then browse each professor's Wikipedia page to see if they have an award record.

Both plans are logically correct, but the costs are vastly different. Plan B forces the system to perform a Wikipedia lookup for every single professor, causing the overhead to grow linearly with the number of professors. Plan A uses a highly selective filter, drastically reducing the number of subsequent lookups.
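The gap between the two plans can be shown with a small simulation that counts how many source calls each one makes. This is only a sketch under invented assumptions: the names, the 1,000-professor warehouse, the two-laureate list, and the rule that every lookup counts as one call are all made up for illustration and are not figures from the paper.

```python
from collections import Counter

# Toy data, invented for illustration only.
LAUREATES = {"Ada Example": "Turing Award", "Bob Sample": "Nobel Prize"}
PROFESSORS = ["Ada Example"] + [f"Professor {i}" for i in range(999)]

calls = Counter()  # number of calls made to each simulated source


def wiki_list_laureates():
    calls["wikipedia"] += 1
    return list(LAUREATES)


def wiki_award_for(name):
    calls["wikipedia"] += 1
    return LAUREATES.get(name)


def warehouse_list_professors():
    calls["warehouse"] += 1
    return list(PROFESSORS)


def warehouse_is_professor(name):
    calls["warehouse"] += 1
    return name in PROFESSORS


def plan_a():
    """Winners first: one list call, then one warehouse check per laureate."""
    return [n for n in wiki_list_laureates() if warehouse_is_professor(n)]


def plan_b():
    """People first: one list call, then one Wikipedia lookup per professor."""
    return [n for n in warehouse_list_professors() if wiki_award_for(n) is not None]


def run(plan):
    """Execute a plan and return its result with the total number of source calls."""
    calls.clear()
    result = plan()
    return result, sum(calls.values())


result_a, cost_a = run(plan_a)
result_b, cost_b = run(plan_b)
print(result_a, cost_a)  # ['Ada Example'] 3
print(result_b, cost_b)  # ['Ada Example'] 1001
```

Both plans return the same answer, but under these assumptions Plan A costs 1 + (number of laureates) calls while Plan B costs 1 + (number of professors) calls, which is the linear growth the paragraph above describes.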

In agentic AI, this directly translates to more or fewer tool calls, token consumption, and waiting time.

An LLM choosing its own path introduces a strong element of randomness. A poor choice can cause costs to explode and latency to climb to unacceptable levels.

RUBICON hands the choice to the user, or it can use a classic cost-based query optimizer to automatically pick the most efficient path. This is something current LLM agents are fundamentally incapable of doing.

The research concludes by citing an MIT report that tracked over 300 enterprise AI projects. The report found that fewer than 5% of custom AI projects achieved quantifiable returns.

Models are getting stronger, and their autonomy is widening, but the failure modes caused by hallucinations and omissions haven't fundamentally changed.

The researchers deliver a piece of ancient software engineering wisdom to this current boom.

First, sort out your data, manage your interfaces, and then talk about intelligence. This seemingly "clunky" architecture is, in fact, much closer to the AI that enterprises truly need.

Reference:

https://arxiv.org/pdf/2604.21413
