Turing Award Winner Yann LeCun Bets $1 Billion Against LLMs: What Is This New AI Architecture?

Turing Award winner Yann LeCun and Chinese scientist Saining Xie (co-author of DiT) have co-founded AMI Labs, with LeCun closely involved as co-founder and executive chairman. The company has completed a staggering $1.03 billion seed round. At that scale, AMI Labs is formally challenging the current LLM paradigm in a way almost unprecedented in capital markets.

Can simply piling up computing power and data truly lead to general intelligence that is capable of planning, understanding, and acting?

[Figure: Questioning the LLM path to general intelligence]

This article discusses another equally important route whose theoretical focus and engineering path are clearly different: the new AI architecture advocated by LeCun, centered on world models, joint-embedding predictive architectures, and representation learning.

Current generative artificial intelligence is dominated almost entirely by the narrative that "large language models equal general intelligence." But the alternative route, represented by LeCun, keeps asking: is autoregressive prediction at the token level enough to create an intelligent agent that truly understands the world, can plan over long horizons, and can act in a real-world environment?

[Figure: Conceptual visualization of new AI vs LLMs]

Around this question, a new technical vision is gradually taking shape. It no longer treats "generating the next pixel, the next frame, the next word" as the core of intelligence. Instead, the system learns the stable structures, predictable constraints, and action consequences of the world in an abstract representation space, and language, planning, and control are then built on top of this layer. The most representative current implementation of this route is the Joint-Embedding Predictive Architecture (JEPA) and its extension into video world models.

1. Why Large Language Models Are Not Enough

The starting point of this new architecture is not to deny the engineering value of large language models, but to point out: Language prediction excels at compressing knowledge that humans have already written down, but it does not automatically equate to a true mastery of the physical world, causal structures, bodily actions, and long-term goals. According to the position paper "A Path Towards Autonomous Machine Intelligence," if a machine is to learn like an animal or a human, it must possess at least three categories of abilities: forming hierarchical representations of world states, making predictions and plans across multiple time scales, and choosing actions in environments that are not fully observable and not fully predictable. This definition itself shifts the problem from "generating language" to "learning a world model."

From this perspective, the current mainstream generative models have two fundamental limitations.

First, they usually model directly in the data space, meaning they approximate conditional distributions over pixels, sound waves, or tokens.

Second, they often conflate the training objective with the ultimate goal of intelligence. The real world, however, is not a static corpus but a highly multimodal, partially observable dynamic system full of bifurcations. Given the same world state at one moment, multiple equally plausible outcomes may exist at the next. If a model is forced to give a single deterministic answer on raw pixels, the easiest thing for it to learn is not "why the future will be like this" but an average of the multiple possibilities. This is a key reason why early video prediction models often produced blurry outputs.
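The averaging effect can be shown with a toy calculation. The sketch below is purely illustrative and not taken from any of the cited papers: it assumes a hypothetical one-dimensional "world" where an object ends up at -1 or +1 with equal probability, and searches for the single point prediction with the lowest mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy world: from the same state, an object ends up at -1 or +1
# with equal probability (two equally plausible futures).
futures = rng.choice([-1.0, 1.0], size=10_000)

# Try many single deterministic predictions and score each by mean squared error.
candidates = np.linspace(-1.5, 1.5, 301)
mse = np.array([np.mean((futures - c) ** 2) for c in candidates])
best = candidates[np.argmin(mse)]

print(f"MSE-optimal point prediction: {best:.2f}")
print(f"sample mean of the futures:   {futures.mean():.2f}")
```

The lowest-error prediction sits near the mean of the two outcomes, a position the object never actually occupies. In pixel space, the same effect shows up as blur.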

2. Core Judgment

This route does not reject generation itself. It rejects the idea that "exhaustively generating at the lowest level of data details" should be the main path to intelligence. Its core claim is that an intelligent system should first learn to predict the abstract structures that are truly predictable and relevant to the task, while leaving high-frequency details, incidental noise, and irreducible randomness to latent variables, downstream decoders, or specialized generation modules. The I-JEPA paper states this clearly: it is a "non-generative" self-supervised method. Rather than reconstructing image pixels, it predicts the representations of other regions in the same image from the context of one part; to push the model toward semantic-level information, the target blocks must be large enough and the context must be sufficiently spatially distributed. In other words, the model is not memorizing every pixel but is forced to grasp "what this region generally is, what its relationship with the surrounding structure is, and which changes are worth predicting next."

In the video version, this point is amplified. V-JEPA's official introduction defines it as a "non-generative model" that predicts masked video segments in an abstract representation space, rather than directly filling in pixels; the purpose is to concentrate the model's computational resources on high-level conceptual information, rather than wasting it on inconsequential details irrelevant to downstream tasks. The official explanation uses a very intuitive example: If a tree appears in a video, what the system truly needs to grasp is "there is a tree in the scene, how it is moving, and its relationship with other objects," not predicting the tiny quiver of every single leaf.

3. From Siamese Networks to Anti-Collapse

To understand JEPA, one must first understand the tradition of representation learning it inherits. The key issue here is not "how to generate," but "how to obtain representations that are non-collapsed, transferable, and semantic." The Siamese network approach is very important here: provide two views of the same object, and require two encoders to produce consistent but not overly redundant representations. The real difficulty lies in representation collapse—i.e., the model maps different inputs to almost identical vectors, resulting in a superficially low loss but actually learning nothing. The Barlow Twins paper explains this very clearly: the recurring issue in self-supervised representation learning is trivial constant solutions; its solution is to measure the cross-correlation matrix between the outputs of two branches and push it towards the identity matrix, thus maintaining consistency between different views while reducing redundancy between different dimensions.
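The cross-correlation objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's reference implementation: the function name, the trade-off weight `lam`, the `eps` guard, and the toy data are all assumptions made for the example.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3, eps=1e-8):
    """Sketch of a Barlow Twins-style loss on two batches of embeddings (N, D)."""
    n = z_a.shape[0]
    # Standardize each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    # Cross-correlation matrix between the two branches, shape (D, D).
    c = z_a.T @ z_b / n
    diag = np.diag(c)
    on_diag = np.sum((diag - 1.0) ** 2)            # invariance: diagonal -> 1
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # redundancy: off-diagonal -> 0
    return float(on_diag + lam * off_diag)

rng = np.random.default_rng(0)

# Two noisy "views" of the same underlying embeddings (toy data).
base = rng.normal(size=(512, 8))
view_a = base + 0.05 * rng.normal(size=base.shape)
view_b = base + 0.05 * rng.normal(size=base.shape)
healthy = barlow_twins_loss(view_a, view_b)

# A collapsed encoder maps every input to the same vector.
constant = np.ones((512, 8))
collapsed = barlow_twins_loss(constant, constant)

print(f"healthy embeddings:   {healthy:.4f}")
print(f"collapsed embeddings: {collapsed:.4f}")
```

With the `eps` guard used here, constant outputs standardize to zero, so the diagonal of the cross-correlation matrix cannot reach 1 and the collapsed solution is penalized rather than rewarded.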

[Figure: Barlow Twins anti-collapse representation learning]

This idea has continued to develop in a series of non-contrastive self-supervised methods. The DINO paper further demonstrated that teacher-student self-distillation on Vision Transformers can produce surprisingly strong semantic structures, with unsupervised features even containing explicit semantic segmentation information, a property that does not emerge as clearly in supervised ViTs or convolutional networks. This step is crucial because it shows that, without relying on manual labels, models can learn highly structured semantic representations by "predicting another view of themselves." JEPA advances further down this path: it no longer just seeks consistency between two views, but directly makes conditional predictions on masked regions at the abstract representation level.

4. What Exactly Is JEPA: Predicting the World in Representation Space

The core of JEPA can be summarized in one sentence: Given a context representation, predict the latent representation of the target region. I-JEPA's method is this: first, use an encoder to map the visible context into a representation space, then let a predictor estimate the representation of the masked target block; the target representation comes from another encoder branch, but the training objective is not pixel reconstruction, but the consistency between the two representations. This design has two profound benefits. First, the model naturally favors the semantic, relational, and structural layers, because only this kind of information can be stably predicted when details are missing. Second, it separates "uncertainty" from the surface details: factors that cannot be deduced from the current context do not need to be forced into the main prediction, and can be left to latent variables, subsequent sampling, or more specialized generative components.
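The data flow described above can be sketched as follows. This is a heavily simplified illustration under stated assumptions, not the I-JEPA implementation: linear maps stand in for the Vision Transformer encoders and predictor, the context is mean-pooled, and the names `W_context`, `W_target`, `W_pred`, `pos_embed`, the sizes, and the momentum value are hypothetical. The gradient step on the context encoder and predictor is omitted; only the forward loss and the moving-average update of the target branch are shown.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PATCHES, D_IN, D_REP = 16, 32, 8   # hypothetical sizes

# Toy linear stand-ins for the three networks.
W_context = rng.normal(scale=0.1, size=(D_IN, D_REP))      # context encoder
W_target = W_context.copy()                                # target encoder branch
W_pred = rng.normal(scale=0.1, size=(D_REP, D_REP))        # predictor
pos_embed = rng.normal(scale=0.1, size=(N_PATCHES, D_REP)) # tells the predictor where to predict

def jepa_loss(patches, context_idx, target_idx):
    """Loss in representation space: no pixels are reconstructed."""
    s_context = patches[context_idx] @ W_context        # encode visible patches
    pooled = s_context.mean(axis=0)                     # summarize the context
    s_pred = (pooled + pos_embed[target_idx]) @ W_pred  # predicted target representations
    s_target = patches[target_idx] @ W_target           # targets, treated as constants
    return float(np.mean((s_pred - s_target) ** 2))

def ema_update(w_target, w_context, momentum=0.996):
    """Move the target branch slowly toward the context encoder."""
    return momentum * w_target + (1.0 - momentum) * w_context

patches = rng.normal(size=(N_PATCHES, D_IN))            # one toy "image" as patch features
context_idx = np.array([0, 1, 2, 3, 8, 9, 10, 11])      # visible context patches
target_idx = np.array([5, 6, 13, 14])                   # masked target block

loss = jepa_loss(patches, context_idx, target_idx)
W_target = ema_update(W_target, W_context)
print(f"representation-space loss: {loss:.4f}")
```

Note that both the prediction and the target have the representation dimension `D_REP`, not the input dimension `D_IN`: the model is never asked to output the masked patches themselves.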

In a more complete vision of a world model, this "abstract representation prediction" is also combined with latent variables. The position paper explicitly proposes: a world model must be able to represent multiple plausible futures, and latent variables are precisely used to represent those hidden factors that cannot be determined from current observations but will influence future evolution. If a vehicle ahead is about to turn at a fork, then "left" and "right" are both plausible predictions; a good world model should not output a blurred intermediate image but should represent this bifurcation as a latent structure that can be sampled, planned over, and searched.
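The fork example can be made concrete with a toy latent-conditioned predictor. Everything here is hypothetical (the function `predict_next`, the two-dimensional state, the binary latent `z`); it only illustrates how sampling a latent variable keeps the two branches distinct.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_next(state, z):
    """Hypothetical latent-conditioned predictor: z selects the branch taken."""
    heading = np.array([1.0, z])   # forward one unit, sideways by z
    return state + heading

state = np.array([0.0, 0.0])                      # vehicle at the fork
z_samples = rng.choice([-1.0, 1.0], size=1000)    # -1 = left, +1 = right
futures = np.array([predict_next(state, z) for z in z_samples])

branches = np.unique(futures, axis=0)   # the distinct futures the model can represent
averaged = futures.mean(axis=0)         # what a single latent-free prediction would give

print("distinct branches:\n", branches)
print("averaged prediction:", averaged)
```

Each sampled `z` yields one sharp future, which a planner can enumerate or search over, whereas the averaged prediction drives straight ahead between the two roads.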

5. From I-JEPA to V-JEPA 2: How This Route Advances Toward World Models and Planning

If I-JEPA primarily proved this method's feasibility for image representation learning, then V-JEPA and V-JEPA 2 attempt to extend it to temporal understanding, future prediction, and robotic planning. V-JEPA's official description emphasizes that it predicts masked spatiotemporal regions in a learned latent space rather than predicting raw video frames, thus allowing it to focus on motion, interactions, and event structures. With V-JEPA 2 in 2025, the goal expanded further: the paper proposes first pretraining in an action-agnostic way on over a million hours of internet videos and images, then incorporating a small amount of robot trajectory data to form a self-supervised video world model capable of "understanding, predicting, and planning."

Judging by the results, V-JEPA 2 is already more than just a "representation learner." The paper reports 77.3 top-1 accuracy on Something-Something v2 and 39.7 recall@5 on Epic-Kitchens-100 action anticipation. When aligned with a large language model, it achieved what was then state-of-the-art performance on several video question-answering tasks at the 8B-parameter scale. In the robotics section, the authors also trained an action-conditioned world model, V-JEPA 2-AC, on less than 62 hours of unlabeled robot video, and achieved zero-shot grasping, placing, and image-goal planning on Franka robot arms in two new labs.

But this result must be understood cautiously.

First, V-JEPA 2's current strongest evidence is still concentrated in visual world modeling, action anticipation, and controlled robotic scenarios. It has not yet proven that it can replace large language models for open-domain knowledge reasoning.

Second, the fact that the paper highlights strong "video Q&A performance" itself indicates that when a task requires a natural language interface, this route still needs to be coupled with a language model.

Therefore, a more accurate assessment is: it offers a potential world modeling foundation that could sit beneath, alongside, or before LLMs for next-generation intelligent systems. Language models could serve as interface layers, interpretation layers, or knowledge scheduling layers, but they may no longer be the core learning mechanism of the entire system.

6. The True New Architecture Is a Complete System of "World Model + Cost Module + Actor + Memory"

If this route is understood only as a new self-supervised algorithm, its ambition is underestimated. The position paper actually proposes a complete autonomous agent structure: a perception module extracts task-relevant state representations from sensors; a world model module fills in unobserved state information and predicts possible future world states; a cost module, composed of an "intrinsic cost" and a "trainable critic," measures the energy (discomfort) of the system in current or future states; an actor module proposes action sequences and optimizes them through the world model and cost module; a short-term memory stores past, current, and imagined future states; and a configurator, acting like an executive control system, reconfigures perception, world model, cost, and actor online based on the specific task.

[Figure: LeCun's proposed autonomous agent architecture]

This structure divides "seeing the world," "imagining the future," "evaluating consequences," and "choosing actions" into modular, interfaceable components, rather than compressing everything into a single next-word predictor. Notably, the paper explicitly describes the actor as a module that uses the world model and cost gradients for optimization and search, similar to model-predictive control; it even emphasizes that the actor must search not only for actions but also for latent variable configurations to plan under uncertainty. This creates a unified closed loop between classical control, planning, value learning, and world model learning. The role that JEPA plays here is mainly to ensure that the "world model" link is no longer built from the pixel level but is established on stable abstract representations.
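The actor loop can be sketched as a small model-predictive controller. This is an illustration under loud assumptions: the additive `world_model`, the squared-distance `cost`, and the random-shooting search are all stand-ins chosen for brevity. The position paper describes gradient-based optimization and search, and nothing here reproduces a planner from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(state, action):
    """Hypothetical learned dynamics in representation space (toy: additive)."""
    return state + action

def cost(state, goal):
    """Hypothetical cost module: squared distance to a goal representation."""
    return float(np.sum((state - goal) ** 2))

def plan(state, goal, horizon=5, n_candidates=256, max_step=0.5):
    """Random-shooting planner: imagine rollouts, keep the cheapest action sequence."""
    best_cost, best_seq = np.inf, None
    for _ in range(n_candidates):
        seq = rng.uniform(-max_step, max_step, size=(horizon, state.shape[0]))
        s, total = state, 0.0
        for a in seq:
            s = world_model(s, a)      # imagined next state
            total += cost(s, goal)     # accumulated imagined cost
        if total < best_cost:
            best_cost, best_seq = total, seq
    return best_seq, best_cost

state = np.zeros(2)
goal = np.array([2.0, -1.0])
start_cost = cost(state, goal)

# Receding-horizon control: re-plan at every step, execute only the first action.
for _ in range(10):
    seq, _ = plan(state, goal)
    state = world_model(state, seq[0])

final_cost = cost(state, goal)
print(f"cost before: {start_cost:.3f}, cost after: {final_cost:.3f}")
```

The separation matters more than the search method: swapping in a better world model, a different cost, or a gradient-based optimizer changes one function without touching the others.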

7. What This Route Has Truly Solved, and What It Has Not Yet Resolved

What this route has actually solved so far is "how to make models learn more semantic, transferable visual representations, better suited to prediction and planning, without large amounts of manual labels." I-JEPA proved that non-generative representation prediction can scale efficiently and produce high-quality semantic features on image tasks; V-JEPA and V-JEPA 2 showed that this idea can extend into video understanding, action anticipation, and even a degree of robotic planning. In this sense, this route indeed offers a technical pathway closer to world modeling than "relying entirely on pixel reconstruction or text generation."

But what it has yet to resolve is equally important.

First, long-term causal reasoning in open worlds, unified cross-modal representations, complex combinatorial generalization in language, hierarchical task decomposition, and lifelong memory remain unsolved problems.

Second, although JEPA-like methods emphasize uncertainty and multiple futures, in current mainstream implementations, uncertainty is often manifested more in the design of latent variables or subsequent planning structures, rather than maintaining explicit probabilistic beliefs throughout like some Bayesian architectures do.

Third, this route's success in robotics is still at the stage of "small amounts of action data + controlled tasks + short-term planning," still a significant distance from general embodied intelligence in open environments.

8. How Does It Differ from Karl Friston's Active Inference AI Architecture?

If we juxtapose Yann LeCun's JEPA-world model route with Karl Friston's active inference route, we find that, on the surface, both oppose the notion that "pure autoregressive generation equals intelligence," and both emphasize world models, prediction, action, embodiment, and uncertainty. However, their underlying philosophies and engineering focuses are different. Active inference originates from the variational free energy framework. Its core proposition is that an intelligent agent simultaneously accomplishes perception, learning, and action by minimizing variational free energy and expected free energy; in this process, explicit beliefs, Bayesian updating, risk, and information gain are unified. The JEPA route is more like an engineering blueprint for scalable learning systems: it emphasizes first learning high-quality world representations, then attaching action, cost, memory, and planning to this representation system.
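For readers unfamiliar with variational free energy, a small numerical example may help. The sketch below assumes a hypothetical two-state, two-observation generative model (the `prior` and `likelihood` values are made up) and uses the standard definition F(q) = E_q[ln q(s) - ln p(o, s)] for a categorical belief q.

```python
import numpy as np

# Hypothetical generative model: prior p(s) and likelihood p(o | s).
prior = np.array([0.5, 0.5])
likelihood = np.array([[0.9, 0.2],    # p(o=0 | s=0), p(o=0 | s=1)
                       [0.1, 0.8]])   # p(o=1 | s=0), p(o=1 | s=1)

def free_energy(q, obs):
    """Variational free energy F(q) = E_q[ln q(s) - ln p(obs, s)]."""
    joint = likelihood[obs] * prior   # p(obs, s) for each hidden state s
    return float(np.sum(q * (np.log(q) - np.log(joint))))

obs = 0
joint = likelihood[obs] * prior
evidence = joint.sum()                # p(obs)
posterior = joint / evidence          # exact Bayesian posterior p(s | obs)

F_prior = free_energy(prior, obs)     # belief not yet updated
F_post = free_energy(posterior, obs)  # belief after exact Bayesian updating
surprise = -np.log(evidence)

print(f"F(prior belief)     = {F_prior:.4f}")
print(f"F(posterior belief) = {F_post:.4f}")
print(f"-ln p(obs)          = {surprise:.4f}")
```

Free energy is an upper bound on surprise, -ln p(obs), and the bound is tight exactly when the belief equals the Bayesian posterior. Minimizing it over beliefs therefore performs perception as inference, which is the sense in which explicit beliefs are central to this route.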

Core Differences Between the Two Routes

Comparison Dimension	JEPA/World Model Route	Active Inference Route
Theoretical Starting Point	Core is self-supervised representation learning and world model engineering, aiming to build a scalable perception-prediction-planning foundation.	Core is the free energy principle and Bayesian inference, aiming to unify the explanation of perception, learning, action, and exploration.
Main Training Object	Predicting abstract representations of masked regions or future states, not directly generating pixels/tokens.	Maintaining and updating probabilistic beliefs about latent variables, states, and policies.
Handling of Uncertainty	Typically expressed through latent variables, multiple futures, or downstream planning mechanisms; can be strong or weak in engineering.	Uncertainty is a first-class citizen; risk and information gain are explicitly written into the objective function.
Action Selection	Optimized via world model + cost module + actor, approaching model-predictive control.	Unified handling of exploitation and exploration by minimizing expected free energy.
Relationship with LLMs	More like providing a lower-level world model for LLMs; language models can serve as interface layers or upper modules.	Can be combined with LLMs, but the focus is usually on explicit belief updates and message passing, not large-scale autoregressive language modeling.
System Style	Leaning towards scalable deep learning and representation learning engineering.	Leaning towards normative theory, probabilistic graphical models, and belief propagation.
Current Strong Evidence	Image/video representation, video understanding, action anticipation, controlled robotic planning.	Cognitive modeling, planning, navigation, exploration, and some active inference agent prototypes.

The difference between the two can be summarized in one sentence: The JEPA route asks "how to construct a representation-prediction system that won't be bogged down by data details and can learn the stable structures of the world"; the active inference route asks "how an agent, under explicit uncertainty, integrates perception, action, exploration, and preferences through a unified Bayesian objective function." The former is more like a learning paradigm leading to engineering scalability, the latter more like a normative framework leading to a unified theory of intelligence. The two are not mutually exclusive: it is entirely conceivable for a future system to use a JEPA-like world model at the bottom layer to learn abstract states, and an active inference-style belief update and policy selection mechanism on top to handle uncertain decisions.

Conclusion

Does intelligence fundamentally come from language generation first, or from world modeling first? If an intelligent agent must live in a real world that is partially observable, full of bifurcations, and requires action to validate predictions, then the answer is likely the latter.

Future, stronger intelligent systems are highly unlikely to have a single autoregressive language model monopolizing the core position. They will likely be jointly composed of a world model, memory, cost/value, action optimization, and a language interface. Among these, the JEPA route provides a new foundation for "how the world is represented and predicted," while the active inference route provides normative principles for "how beliefs are updated and how actions are chosen under uncertainty." In this sense, these new AI architectures are betting that understanding the world is, after all, closer to intelligence than merely paraphrasing it.
