Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a “levels × laws” taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes (physical, digital, social, and scientific) that determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level–regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.

Figure 1: Organizational structure of this survey. The paper is organized around three capability levels (L1 Predictor, L2 Simulator, L3 Evolver) and four governing-law regimes (physical, digital, social, scientific worlds), with supporting sections on evaluation, implementation, and open problems.

1 Introduction

"One may say the eternal mystery of the world is its comprehensibility." — Einstein (1936)

The ambition to build internal models of reality has a long intellectual history, appearing in philosophical accounts of mental models (Craik, 1943; Johnson-Laird, 1983) and in modern machine learning as learned latent dynamics that support prediction, control, simulation, and scientific reasoning (Ha and Schmidhuber, 2018; Hafner et al., 2020; Karniadakis et al., 2021). The phrase world model is now widely used across research communities, but its precise technical meaning varies considerably (Ding et al., 2025a; Zhu et al., 2024). In reinforcement learning, agents learn transition structure to imagine futures before acting (Sutton, 1991; Ha and Schmidhuber, 2018; Hafner et al., 2020; Schrittwieser et al., 2020). In computer vision, world models often denote video or 3D generators that maintain visual dynamics and temporal coherence (Brooks et al., 2024; Bruce et al., 2024; Agarwal et al., 2025; Liang et al., 2026a; Bian et al., 2025; Kong et al., 2025). In language modeling and agent systems, the term can refer to text-grounded simulation for planning, web interaction, and social environments (Wang et al., 2024d; Gu et al., 2025b; Park et al., 2023; Zhang et al., 2026d; Zhang et al., 2026c). In robotics, learned dynamics serve safe planning, data-efficient policy learning, and sim-to-real transfer (Wu et al., 2023a; Yang et al., 2024b; Min et al., 2024). For science, systems pair surrogate models with hypothesis-driven experimentation (Karniadakis et al., 2021; Lu et al., 2024a).

From a complementary perspective, world models and agents are closely coupled. At its core, a world model learns the state-transition dynamics of an environment: given a current state and an action, it predicts the resulting next state. An agent, conversely, selects actions given a task objective and its current observations. These two components are mutually supportive. Agents rely on world models to anticipate the consequences of candidate actions, enabling look-ahead planning and sample-efficient learning (Hafner et al., 2025; Schrittwieser et al., 2020; Dong et al., 2026; Dong et al., 2025). Conversely, world models benefit from agent-generated experience, which provides targeted, task-relevant trajectories that improve the model’s accuracy in decision-critical regions of the state space (Sutton, 1991). This close coupling motivates the capability-based perspective adopted in this survey: while world models serve many purposes, we operationally define their value by the quality of decisions they enable for downstream agents.

Because world models constitute a foundational component whose value extends beyond any single agent architecture, their growing importance makes conceptual clarity all the more urgent. Yet the diversity outlined above also creates conceptual fragmentation: a vision researcher may evaluate a world model by the visual fidelity of its generated frames, while a reinforcement learning practitioner judges a model bearing the same name by whether it improves task performance. As a result, papers may report strong progress under one interpretation of world model while remaining incomparable under another. This paper addresses that fragmentation by providing a common language that can align communities without erasing domain-specific differences.

1.1 Motivation

Current survey landscape. Several recent surveys have attempted to organize this rapidly growing literature. Ding et al. (2025a) propose a dual taxonomy of understanding versus predicting, mapping world models onto application domains such as autonomous driving, robotics, and social simulacra. Zhu et al. (2024) focus on the generative capabilities catalyzed by Sora, surveying world models for video generation, autonomous driving, and autonomous agents. Yue et al. (2025) provide a roadmap for 2D visual world modeling with a four-generation capability taxonomy (G1–G4) applied to robotics, autonomous driving, and gaming. Their G1–G4 taxonomy is useful for distinguishing increasingly interactive visual generation systems; our L1–L3 hierarchy is complementary rather than competing, because it abstracts away from the visual modality and asks whether a system supports local prediction, decision-usable simulation, or evidence-driven revision across physical, digital, social, and scientific regimes. Roughly, early G-levels emphasize appearance and action-conditioned prediction, whereas our L2/L3 boundary is determined by constraint-valid rollout and persistent model update. Domain-specific surveys have also proliferated: Li et al. (2025e) provide a three-axis framework (functionality, temporal modeling, spatial representation) specifically for embodied AI; Feng et al. (2025c) and Tu et al. (2025) survey world models for autonomous driving; Kong et al. (2025) examine 3D and 4D world modeling; Zhang et al. (2025d) survey world models for robotic manipulation; and a growing number of position papers question what it means for a learned model to “understand” physics (LeCun, 2022; Kang et al., 2025a). In AI for science, Wei et al. (2025b) survey autonomous scientific discovery across life sciences, chemistry, materials, and physics, unifying process-oriented, autonomy-oriented, and mechanism-oriented perspectives. 
A parallel line of surveys addresses agent planning and reasoning: Wei et al. (2025a) survey LLM planning capabilities across plan generation and verification, Huang et al. (2024c) taxonomize planning mechanisms into decomposition, selection, and reflection, Cao et al. (2025a) provide a systematic comparison of fine-tuning versus search-based planning methods, Zhao et al. (2025) organize agentic reasoning into single-agent, tool-based, and multi-agent frameworks, and Arunkumar et al. (2026) propose a unified agent taxonomy spanning perception, planning, action, and collaboration. These surveys complement ours: they focus on how agents decide and act, whereas we focus on the predictive substrate (the world model) that makes those decisions informed. Despite their valuable contributions, existing surveys share a common organizational principle that we argue is fundamentally limiting: they partition the field by modality or by application domain. Our work differs by organizing the field through a capability-based taxonomy that cuts across modalities, covering decision-making domains from embodied manipulation and autonomous driving to web agents, multi-agent coordination, and scientific discovery pipelines.

Figure 2: Positioning of this survey relative to existing world model and agent surveys. Four clusters (Embodied World Models, Generative World Models, Language Agents, and AI for Science) each cover subsets of the field. Our survey (center) integrates cross-domain coverage with a capability-based taxonomy (L1/L2/L3 × four regimes), bridging largely isolated communities.

Gaps in existing surveys. The modality-centric and domain-centric taxonomies leave two critical gaps. First, they fail to capture the capability progression that cuts across modalities. A key example is model-based reinforcement learning, where latent-space “imagination” rollouts can match or exceed model-free baselines across diverse domains such as Atari, continuous control, and Minecraft (Hafner et al., 2025; Schrittwieser et al., 2020; Hafner et al., 2020). We formalize this progression as a three-level capability hierarchy: one-step prediction, long-horizon simulation, and evidence-driven model revision. A second motivation for our framework is the intensifying debate over whether large-scale generative models are merely plausible generators or genuine world simulators. Existing surveys have surfaced this tension (Brooks et al., 2024; Bruce et al., 2024; Kang et al., 2025a; Ding et al., 2025a), but a capability-based taxonomy helps state the question more precisely in terms of rollout, intervention sensitivity, and constraint consistency. We identify four progressively stronger capabilities, namely rollout, intervention sensitivity, constraint consistency, and closed-loop use, that characterize world models and go beyond generic predictors (formalized in Section 2). Moreover, existing surveys underrepresent the role of world modeling in agentic AI applications, including web agents, tool-use agents, and multi-agent systems, where learned environment dynamics are essential for planning and action selection (Gu et al., 2025b; Wang et al., 2024d; Park et al., 2023). The goal of this paper is to establish a capability-based taxonomy with clear and testable boundary conditions, and to use it to connect research communities that currently evaluate world modeling systems with different assumptions, objectives, and metrics.

Figure 2 positions this survey relative to existing work along two axes: scope (domain-specific to cross-domain) and organizing principle (modality-centric to capability-centric). Figure 1 shows the organizational structure of the paper at a glance, grouping sections by the three capability levels (L1 Predictor, L2 Simulator, L3 Evolver) and the four governing-law regimes (physical, digital, social, and scientific worlds).

1.2 Scope and Organizing Principle

Governing principles across domains. We organize the paper along two orthogonal axes: (i) capability level (L1/L2/L3, defined formally in Section 2), and (ii) governing-law regime, the constraints that legitimate transitions must satisfy in a domain. These levels are stages of world-modeling capability rather than mutually exclusive model classes: the same system may invoke different levels at different moments depending on task demand. Figure 3 provides a schematic overview of these four regimes.

Laws of the Physical World: perception; physical interaction; robotic manipulation, navigation, autonomous driving, egocentric video prediction, action-conditioned video modeling, 3D world modeling.
Laws of the Digital World: program semantics; web navigation, software tool use, GUI environments.
Laws of the Social World: beliefs; goals; norms; social coordination, dialogue, multi-agent settings.
Laws of the Scientific World: latent mechanisms; experimental observables; causal structure; scientific discovery pipelines, measurement-coupled prediction, hypothesis-driven experimentation.

Figure 3: Schematic illustrations of the four governing-law regimes. Representative scenes for each regime: a humanoid agent manipulating blocks (Physical World), code and UI surfaces (Digital World), a network of interacting agents with speech acts (Social World), and instrumented experimentation with robotic microscope and pipette (Scientific World). Each regime's formal constraints are discussed in Section 2.5.

In particular, the physical and scientific regimes are separated by how constraints are accessed: physical-world systems often admit analytic or simulator-based verification of transitions, whereas scientific-world systems typically require empirical validation because the governing mechanisms are only partially known. Regimes are not “orthogonal modalities”; real systems mix them. The value of the taxonomy is diagnostic; it clarifies which invariants a method tries to preserve and which queries it can answer reliably.

More generally, a world model can predict transitions along any organizing dimension, such as spatial scales, frequency bands, or causal depth, provided it maintains the capability criteria along that axis. Throughout, we use world model to denote learned (or hybrid) operators that support intervention-aware transition queries, and world modeling to denote the staged process of strengthening those operators.

How an agent uses the three levels at runtime. The L1/L2/L3 taxonomy is not a static classification of systems but a description of the capability an agent invokes at any given moment. A single deployed system can operate at different levels depending on the task demand:

L1 (Predictor). The agent executes fast, reactive one-step predictions (such as perception, low-level motor control, or token-by-token generation) without maintaining a multi-step plan.
L2 (Simulator). The agent upgrades to this level when the task requires comparing candidate action sequences, reasoning counterfactually about alternative futures, or verifying that a planned trajectory respects governing-law constraints; here the agent rolls out a multi-step simulation before committing.
L3 (Evolver). The agent escalates to this level when its current model produces systematic prediction failures that cannot be resolved by re-planning within the existing model structure, that is, when the model itself must be revised, assets distilled, and updates validated before the next deployment.

This runtime dispatch view clarifies why L3 is not a replacement for L1/L2 but a governance layer that improves the stack when evidence demands it. Within a full agentic stack, world models are only one component: tool use determines how the agent acts on the environment, memory determines what evidence persists across episodes, multi-agent coordination shapes the effective transition dynamics in social settings, and reflection determines when failures trigger revision rather than mere re-planning. Our focus is the world-model substrate, but its role is always in service of these broader agentic loops.
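The runtime dispatch described above can be sketched as a small selection function. This is a minimal illustration, not a mechanism specified in the survey: the `TaskContext` fields, the `select_level` name, and both thresholds are hypothetical signals chosen only to mirror the three escalation conditions in the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Level(Enum):
    L1_PREDICTOR = 1
    L2_SIMULATOR = 2
    L3_EVOLVER = 3


@dataclass
class TaskContext:
    # Hypothetical signals; none of these fields is defined in the survey.
    needs_plan_comparison: bool      # must candidate action sequences be compared?
    needs_constraint_check: bool     # must a trajectory be checked against governing laws?
    recent_errors: Sequence[float]   # recent one-step prediction errors
    replans_attempted: int           # re-planning attempts under the current model


def select_level(ctx, error_threshold=0.5, max_replans=3):
    """Choose which capability level to invoke for the current step."""
    persistent_failure = (
        len(ctx.recent_errors) > 0
        and min(ctx.recent_errors) > error_threshold   # every recent error is large
        and ctx.replans_attempted >= max_replans       # re-planning has not helped
    )
    if persistent_failure:
        return Level.L3_EVOLVER      # revise the model itself
    if ctx.needs_plan_comparison or ctx.needs_constraint_check:
        return Level.L2_SIMULATOR    # roll out before committing
    return Level.L1_PREDICTOR        # fast reactive one-step prediction


# Example: a planning query with a healthy model dispatches to L2.
ctx = TaskContext(True, False, [0.1, 0.2], 0)
level = select_level(ctx)
```

Checking the L3 condition first reflects the governance reading of L3: when errors are systematic and re-planning has been exhausted, further L2 rollouts under the same model are unlikely to help.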

Figure 4: Timeline of representative world-modeling systems (2018–2026) organized by capability level. The roadmap shows 70 survey anchors, capped at five systems per year–level cell for readability. L1 Predictor denotes one-step dynamics, L2 Simulator denotes decision-usable multi-step rollout, and L3 Evolver denotes full evidence-driven model revision; partial L3 loops remain in Table 8. Each pill is colored by governing-law regime: Physical (blue), Digital (green), Social (orange), and Scientific (purple).

1.3 Contributions and Positioning

This paper makes three principal contributions (Figure 4):

Capability-based roadmap for world modeling in agentic AI (L1 → L2 → L3). We propose a three-level capability hierarchy with testable boundary conditions: L1 Predict World (one-step prediction), L2 Simulate World (long-horizon, action-conditioned rollout with constraint satisfaction), and L3 Modify World (evidence-driven model growth through autonomous data collection and dynamics revision). These are stages of capability, not types of models.
Cross-domain synthesis via governing laws. We unify computer vision, language modeling, model-based RL and robotics, and AI for science into a single capability coordinate system. Different governing laws (Section 2) define the types or partitions of world models, partially independent of the L1 → L2 → L3 capability axis. This two-dimensional organization (capability level × law regime) reveals shared principles across communities that have developed in isolation, while clarifying domain-specific challenges that make direct transfer non-trivial.
L3 as a distinct capability level. Evidence-driven model growth, where a system autonomously collects new evidence and revises its own dynamics model, has appeared in scattered forms across scientific discovery (Lu et al., 2024a), autonomous experimentation, and online adaptation. We argue this capability is qualitatively different from L2 rollout and formalize it as a distinct level, identifying the open problems that must be resolved to realize this capability at scale.

Positioning. We present this paper as a position-driven survey proposing a capability taxonomy for world modeling. It advances a specific conceptual framework, namely the L1/L2/L3 capability hierarchy paired with a governing-law regime taxonomy, and argues for its adoption across the world modeling community. Unlike a pure survey, it proposes testable boundary conditions and uses them to re-examine how existing systems are classified. Unlike a pure position paper, it substantiates each argument with a comprehensive literature review spanning computer vision, reinforcement learning, robotics, natural language processing, and AI for science. This paper does not introduce a new benchmark or leaderboard; instead, it offers a unifying conceptual framework for interpreting and comparing existing systems and evaluations.

Outline. Section 2 establishes the conceptual and notational foundations: it motivates the three capability stages from epistemological intuition, gives each a formal definition with testable boundary conditions, and clarifies the distinctions between world modeling and generic prediction, world models and planners, and world modeling and commonsense. Sections 3–5 present the three capability levels in detail with representative methods and cross-domain analysis. Section 6 discusses evaluation methodology, Section 7 addresses architectural and computational considerations, and Section 8 identifies emerging trends and open problems. Section 9 concludes. We note that L3 is not a terminal stage; Section 8 introduces meta-world modeling, in which the governing laws themselves become learnable, and identifies the open problems this entails.

2 Preliminaries

This section establishes the conceptual and notational foundations used throughout the paper. (1) From epistemology to a capability hierarchy draws on philosophical traditions to propose a three-level decomposition of world modeling capability (L1 Predictor, L2 Simulator, L3 Evolver) and to motivate why the boundaries fall where they do. (2) Notation and formal definitions fixes a unified symbol system and uses it to give each stage (L1, L2, L3) a precise definition with testable boundary conditions. (3) Conceptual boundaries clarifies the distinctions between world modeling and generic prediction, between world models and planners, and relates world modeling to the broader notion of commonsense reasoning that underwrites the reliable everyday action that agents must exhibit beyond narrow predictive tasks.

2.1 Philosophical Motivations

A natural question for any world-modeling survey is: what stages of understanding does a system pass through as it moves from pattern matching to genuine modeling? Epistemology, the study of what counts as knowledge and how knowledge grows, offers a useful lens. Different philosophical traditions identify qualitatively different kinds of epistemic achievement; we draw on these traditions to propose a three-level capability hierarchy for world models. These philosophical analogies are heuristic rather than historical or one-to-one. We do not claim that ML systems implement philosophical programmes, but that philosophical distinctions help us see why certain capability boundaries recur across domains and what design questions each stage foregrounds (Figure 5). Due to space constraints, detailed philosophical motivations and contemporary examples, together with extended historical context, are deferred to Appendix A.

Figure 5: From local prediction to evidence-driven revision: a hierarchical view of world modeling. Level 1 models empirical regularities for prediction, Level 2 supports possible-world semantics and counterfactual simulation, and Level 3 introduces evidence-driven revision through continual interaction with the environment. This hierarchy frames world modeling as an ascending process from pattern recognition, to temporal rollout, to adaptive model evolution in real-world practice.

L1 Predictor: from pattern to one-step forecast. The simplest epistemic achievement is learning patterns from data: given past observations, predict the next one. In philosophy, this is the terrain of Hume’s constant conjunction (Hume, 1739); an agent records statistical co-occurrences without certifying why they hold. When a model learns one-step latent transitions from trajectories, it occupies exactly this epistemic position: it extracts succession from data and bets that the pattern persists. This view aligns with the predictive coding framework in cognitive science (Rao and Ballard, 1999; Friston, 2010) and the “Bayesian brain” hypothesis that perception is probabilistic inference (Clark, 2015), motivating one-step latent forecasting as a computational primitive (Lake et al., 2017). We call this stage L1 (Predictor). This Humean stance has inherent fragility. The i.i.d. assumption underlying most ML is effectively Hume’s Uniformity Principle (the premise that the future will resemble the past), so when the distribution shifts, L1 models that rely on learned regularities fail to generalize. Nevertheless, this stance provides the most basic inductive bias, on which all further modeling is built.

L2 Simulator: rollout and counterfactual. Pattern matching alone does not answer what would happen if we acted differently. The next stage adds intervention and counterfactual reasoning: the ability to roll out coherent futures under chosen actions or hypothetical initial conditions and use the results for decision-making. David Lewis’s theory of closest possible worlds (Lewis, 1973) captures this jump: effective counterfactual reasoning explores worlds maximally similar to our own, where only a minimal intervention distinguishes actual from counterfactual outcomes, providing a principled basis for reasoning about what would have happened under alternative actions taken by the agent at decision points. We call this stage L2 (Simulator). Because L2 rollouts are model-relative, their reliability depends on the learned model’s own transition structure rather than on direct access to ground-truth dynamics. They therefore risk epistemic drift: trajectories that remain internally coherent while departing from the training manifold. Plato’s Allegory of the Cave (Plato, 1992) offers a vivid metaphor: a simulator excelling at predicting shadows on a wall may remain fundamentally bounded by the wall’s dimensions, unable to access the fire casting those shadows.

L3 Evolver: model revision from evidence. Even a powerful simulator eventually encounters situations where its predictions systematically fail, not because of parameter error but because the model class itself is too narrow. Epistemology offers a rich vocabulary for this transition. Lakatos’s distinction between a hard core (architecture, inductive biases) and a protective belt (learned parameters) (Lakatos, 1978) provides a useful parallel. Gradient steps mostly adjust the belt, while persistent structured errors may require changes to the core, such as new modules, parsers, constraints, or simulator hooks. We call this stage L3 (Evolver): the capacity to rebuild the laboratory when evidence demands it. This extends to the full design–execute–observe–reflect loop: the system not only simulates but actively designs experiments, executes them, observes outcomes, and reflects to revise its model stack. Duhem–Quine holism (Duhem, 1954; Quine, 1951) explains why blame assignment is non-trivial: errors redistribute across modules until diagnostics isolate the brittle component. Proposed revisions should yield measurable improvements on held-out probes, regression suites, or experimental outcomes, rather than post-hoc adjustments that preserve the existing model despite contrary evidence from the environment.
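The acceptance criterion in the last sentence can be illustrated with a small validation gate. This is a sketch under stated assumptions rather than a procedure from the survey: models are plain functions `model(z_prev, a) -> z_next`, error is mean absolute one-step error, and the names `mean_error`, `accept_revision`, `min_gain`, and `tol` are all hypothetical.

```python
def mean_error(model, cases):
    """Average absolute one-step error of `model` on (z_prev, a, z_next) cases."""
    return sum(abs(model(zp, a) - zn) for zp, a, zn in cases) / len(cases)


def accept_revision(current, candidate, held_out, regression,
                    min_gain=0.01, tol=1e-9):
    """Accept a revision only if it improves held-out probes by at least
    `min_gain` and does not make the regression suite worse (up to `tol`)."""
    gain = mean_error(current, held_out) - mean_error(candidate, held_out)
    regress = mean_error(candidate, regression) - mean_error(current, regression)
    return gain >= min_gain and regress <= tol


# Toy world: positions live in [0, 10]; the current model ignores the walls.
def current_model(z, a):
    return z + a


def with_walls(z, a):          # candidate revision: adds a constraint to the core
    return min(10.0, max(0.0, z + a))


def ignore_actions(z, a):      # candidate revision that explains nothing new
    return z


held_out = [(9.5, 1.0, 10.0), (0.5, -1.0, 0.0)]   # probes near the walls
regression = [(5.0, 1.0, 6.0), (3.0, -1.0, 2.0)]  # cases the current model already handles

ok = accept_revision(current_model, with_walls, held_out, regression)
bad = accept_revision(current_model, ignore_actions, held_out, regression)
```

Here the wall-aware candidate is accepted because it removes the held-out error without regressing, while the action-ignoring candidate is rejected because it yields no held-out gain.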

2.2 Representation in World Modeling: Lessons from Scientific Theories

The capability hierarchy in Section 2.1 addresses what a world model can do, but leaves open a prior question: in what form should the world model actually be represented? This question should not be treated solely as an implementation detail, because it determines whether the capabilities defined above, especially L3 revision, are realizable in practice across the diverse application domains covered in later sections.

Figure 6: Historical development of world modeling across four eras: Mathematical Principles (–1956), Symbolic Intelligence (1956–1986), Connectionist Resurgence (1986–2020), and Generative Revolution (2020–present). Two AI winters (1974–1980, 1987–1993) mark transitions between paradigms. This history argues that a good representation of a world model should be instantiation-agnostic.

Historically, symbolic approaches to machine intelligence struggled to scale (see Section 8.1), leading modern systems to adopt latent, implicit representations. Here scientific theories offer a telling contrast. Newton’s laws, Maxwell’s equations, and the Standard Model are instances of world models expressed in compact symbolic form, and arguably represent the most successful human instances of L3 systems: explicit, revisable, and composable. This contrast forces a question the field has largely avoided: is the endpoint of world modeling symbolic discovery, with neural latents as a scaffold, or are latent dynamics themselves the goal?

In scientific discovery, model updates arise at multiple scales: small anomalies trigger local modifications, while persistent discrepancies such as the “two dark clouds” in late 19th-century physics expose epistemic gaps that force revisions to a theory’s invariance structure. The shift from Newtonian to relativistic mechanics, for instance, replaced Galilean invariance with Lorentz invariance. Modern ML systems also encode invariances, such as translation equivariance in convolutions and shape bias in attention-based models, but do so implicitly, through architecture and training, rather than as explicitly modifiable structures. This suits L1 prediction and L2 simulation under a fixed model, but at L3 (where the task is to revise the model structure itself) it becomes a liability. Symbolic representations, by contrast, expose governing principles as first-class objects that can be directly inspected and modified.

We therefore take representation to be a foundational question about what a world model is, not a choice among interchangeable designs. Latent dynamics are indispensable as a scaffold for L1 and L2, but the endpoint of L3, namely genuine revision of governing laws, requires a symbolic substrate. On this view, L1→L2→L3 is a progression not only in rollout depth, but in how laws are discovered, composed, and revised. Practical instantiations across regimes are surveyed in Section 7. In Sections 2.3 and 2.4, we introduce a foundational formalism that is instantiation-agnostic.

The preceding section proposed three capability stages from epistemological intuition. We now fix a unified symbol system, and Section 2.4 uses it to give each stage a precise definition. To cover model-based RL, predictive representation learning, video/world simulation, and generative modeling, we ground the notation in a Partially Observable Markov Decision Process (POMDP). Figure 7 places this POMDP structure at the heart of the three-level taxonomy: each capability stage is visualized as a highlighted scope on the same graphical model. The environment is denoted by the tuple ℰ = (𝒳, 𝒜, Ω, T, O, R, γ), where 𝒳 is the (unobserved) state space, 𝒜 the action space, and Ω the observation space (pixels, tokens, audio, etc.). Transitions and observations follow x_{t+1} ∼ T(x_{t+1} | x_t, a_t), o_t ∼ O(o_t | x_t). Under partial observability, agents maintain a belief b_t or a learned latent state z_t. Classical belief updates are written b_{t+1} = Bel(b_t, a_t, o_{t+1}); we reserve the symbol τ for latent trajectories below. Learned systems infer latents from history: z_t = f_ϕ(o_{≤t}, a_{≤t-1}) or q_ϕ(z_t | o_{≤t}, a_{≤t-1}).

• T, O: environment transition and observation mechanisms.
• q_ϕ(·): inference (history → latent).
• p_θ(·): learned local predictive or generative factors (one-step dynamics, decoders, etc.), with parameters θ (and analogously ϕ, ψ for inference and rendering).
• p̂(·): trajectory-level (or otherwise composed) distributions; the hat marks an explicit approximate object, e.g. the rollout marginal induced by repeated application of p_θ.
• π, R, γ: planner / policy, reward, and discount. These consume world-model queries but are not part of the world-model factorization (q_ϕ, p_θ, p_ψ); the conceptual separation is discussed in Section B.2.

Convention: p̂ is reserved for composed objects such as p̂(τ | z_0, a_{1:H}, c); ordinary one-step dynamics are always written p_θ(z_t | z_{t-1}, a_t). Table 1 provides a concise reference for the symbols used in this paper.

a_{1:H} = (a_1, …, a_H) denotes an action sequence of length H applied starting immediately after an anchor state z_0. The future segment is τ = (z_1, z_2, …, z_H), so that p̂(τ | z_0, a_{1:H}, c) matches the L2 formalism in Section 4. From an arbitrary time index t, the same convention applies after a trivial shift: anchor at z_t, condition on a_{t+1:t+H}.
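The indexing convention can be illustrated with a short sketch of ancestral sampling: a one-step sampler is applied H times, with action a_t driving the edge z_{t-1} → z_t. The function names and the toy `noisy_integrator` dynamics are hypothetical stand-ins for a learned p_θ, and the constraint c is omitted for brevity.

```python
import random
from typing import Callable, List, Sequence

def rollout(
    z0: float,
    actions: Sequence[float],
    step: Callable[[float, float, random.Random], float],
    rng: random.Random,
) -> List[float]:
    """Sample tau = (z_1, ..., z_H) given the anchor z_0 and actions a_{1:H}."""
    traj: List[float] = []
    z = z0
    for a in actions:  # a_t drives the edge z_{t-1} -> z_t
        z = step(z, a, rng)
        traj.append(z)
    return traj

def noisy_integrator(z: float, a: float, rng: random.Random) -> float:
    """Toy one-step dynamics (an assumption): z_t = z_{t-1} + a_t + small noise."""
    return z + a + rng.gauss(0.0, 0.01)

tau = rollout(0.0, [1.0, 1.0, -0.5], noisy_integrator, random.Random(0))
```

To roll out from an arbitrary index t, pass z_t as the anchor and a_{t+1:t+H} as the action sequence; the anchor itself is never included in τ.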

With the symbol system established in Section 2.3, we now give each capability stage a precise definition with testable boundary conditions.

An L1 world model provides local predictive operators that factorize into up to four components:

Inference / filtering: q_ϕ(z_t | o_{≤t}, a_{≤t-1}), (1)

Forward dynamics: p_θ(z_t | z_{t-1}, a_t) or, without actions, p_θ(z_t | z_{t-1}), (2)

Observation decoder: p_ψ(o_t | z_t), (3)

Inverse dynamics: π_η(a_t | z_{t-1}, z_t). (4)

These operators target one-step (or short-horizon) accuracy under the training distribution; no guarantee is made about the coherence of multi-step composition. Section 3 presents representative methods in detail.
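A small worked example shows why one-step accuracy does not guarantee multi-step coherence. Assume, purely for illustration, scalar linear dynamics z_t = g·z_{t-1} and a learned model with gain ĝ. The relative error after H composed steps is (ĝ/g)^H − 1, so a 1% one-step error grows to roughly 64% after 50 steps.

```python
from typing import List

def rollout_relative_error(
    true_gain: float, learned_gain: float, z0: float, horizon: int
) -> List[float]:
    """Relative error of a composed rollout under a slightly wrong linear model."""
    z_true, z_model = z0, z0
    errors: List[float] = []
    for _ in range(horizon):
        z_true *= true_gain      # ground-truth one-step transition
        z_model *= learned_gain  # learned one-step transition, applied to its own output
        errors.append(abs(z_model - z_true) / abs(z_true))
    return errors

# Hypothetical numbers: the learned gain is off by 1% per step.
errors = rollout_relative_error(true_gain=1.0, learned_gain=1.01, z0=1.0, horizon=50)
```

Here `errors[0]` is 0.01 while `errors[-1]` equals 1.01^50 − 1, which is the compounding effect that the L1 objective leaves uncontrolled.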

An L2 world model extends L1 from local operators to decision-usable multi-step simulation. It must support trajectory-level queries of the form p̂(τ | z₀, a_{1:H}, c), τ = (z₁, …, z_H), subject to three boundary conditions that collectively mark L1 → L2:

1. Long-horizon coherence: rollouts remain usable over H steps rather than degrading immediately via compounding error.

2. Intervention sensitivity: counterfactual edits (action or premise changes) induce stable and directionally meaningful trajectory changes.

3. Constraint consistency: generated futures respect the governing laws of the target regime (the physical, digital, social, or scientific world).

The key difference from L1 is not one-step quality but rollout fidelity under composition.

The three L2 boundary conditions are complementary rather than redundant. Long-horizon coherence concerns whether rollout quality survives composition over time; intervention sensitivity concerns whether changes in actions or premises induce stable and directionally meaningful changes in the predicted future; and constraint consistency concerns whether the resulting trajectories remain valid under the governing laws of the target regime. None of these implies the others in general: a model may generate coherent but action-insensitive rollouts, or action-sensitive rollouts that still violate domain constraints. In practice they can also trade off against one another, for example when aggressive constraint enforcement stabilizes trajectories at the cost of reduced responsiveness to interventions.
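The independence of the three conditions can be sketched with toy diagnostics on a one-dimensional corridor with walls at 0 and 10. The metrics and the two deliberately flawed models below are illustrative assumptions, not evaluation protocols from the surveyed literature.

```python
from typing import Callable, List, Sequence

Step = Callable[[float, float], float]

def simulate(step: Step, z0: float, actions: Sequence[float]) -> List[float]:
    """Compose a deterministic one-step model into a trajectory (z_1, ..., z_H)."""
    traj, z = [], z0
    for a in actions:
        z = step(z, a)
        traj.append(z)
    return traj

def coherence_error(model_traj: Sequence[float], ref_traj: Sequence[float]) -> float:
    """Long-horizon coherence: worst deviation from a reference rollout."""
    return max(abs(m - r) for m, r in zip(model_traj, ref_traj))

def intervention_sensitivity(step: Step, z0: float,
                             actions: Sequence[float],
                             alt_actions: Sequence[float]) -> float:
    """Mean change in the predicted trajectory when the action sequence is edited."""
    base, alt = simulate(step, z0, actions), simulate(step, z0, alt_actions)
    return sum(abs(x - y) for x, y in zip(base, alt)) / len(base)

def violation_rate(traj: Sequence[float], is_valid: Callable[[float], bool]) -> float:
    """Constraint consistency: fraction of predicted states that break the law."""
    return sum(not is_valid(z) for z in traj) / len(traj)

def true_step(z: float, a: float) -> float:
    return min(10.0, max(0.0, z + a))  # reference dynamics: walls clamp position

def frozen_step(z: float, a: float) -> float:
    return z                           # flawed model 1: ignores actions

def unclamped_step(z: float, a: float) -> float:
    return z + a                       # flawed model 2: ignores the walls

def in_corridor(z: float) -> bool:
    return 0.0 <= z <= 10.0

forward, backward = [3.0] * 5, [-3.0] * 5
frozen_sens = intervention_sensitivity(frozen_step, 0.0, forward, backward)
frozen_viol = violation_rate(simulate(frozen_step, 0.0, forward), in_corridor)
free_sens = intervention_sensitivity(unclamped_step, 0.0, forward, backward)
free_viol = violation_rate(simulate(unclamped_step, 0.0, forward), in_corridor)
free_err = coherence_error(simulate(unclamped_step, 0.0, forward),
                           simulate(true_step, 0.0, forward))
```

The frozen model is perfectly constraint-consistent yet insensitive to interventions, while the unclamped model responds to actions but leaves the corridor on two of five steps, so no single score summarizes L2 readiness.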

A fourth capability, closed-loop use (supporting planning, acting, and self-improvement through interaction with the modeled environment), further separates world modeling from generic prediction but is orthogonal to L1/L2/L3: a weather emulator can be an L2 world model with no embedded planner (see Appendix B for extended discussion). The term “closed-loop” has two senses that must not be conflated: using a world model inside a control or planning loop is an orthogonal deployment property, whereas revising the world-model stack itself from deployment evidence is the defining hallmark of L3.

An L3 world model extends L2 from rollout over a fixed scaffold to evidence-driven model revision. In addition to simulation queries, an L3 system maintains an explicit update loop over model assets: (ℳ_t, d_t) → (diagnose + distill + validate) → ℳ_{t+1}, where ℳ_t is the current world-modeling stack at revision step t and d_t is new deployment evidence (trajectories, errors, counterexamples, tests). Three boundary conditions mark L2 → L3:

1. Evidence-grounded diagnosis: failures are attributed to actionable causes using replayable evidence.

2. Persistent asset update: fixes are promoted as reusable assets (skills, rules, parsers, tests), not only ephemeral in-context patches.

3. Governed validation: updates pass regression and robustness gates (including rollback and canary policies) before default enablement.

The key difference from L2 is that the model itself becomes an object of revision, not merely a fixed scaffold to be queried (Lu et al., 2024a; Boiko et al., 2023). Recapping the scopes in Figure 7: L1 (Predictor) is a single-step transition p_θ(z_t | z_{t-1}, a_t) with its supporting inference and decoding operators, acting locally on one edge of the latent chain; L2 (Simulator) composes those local operators into a trajectory p̂(τ | z_0, a_{1:H}, c) under a fixed model ℳ_t and governing-law constraint c; and L3 (Evolver) revises the model stack ℳ_t → ℳ_{t+1} from distilled evidence d_t, yielding a different latent graph whose effective environment ℰ' ∼ 𝒳' may differ from the original, whether because the world itself has shifted, because the agent has uncovered previously unmodeled structure, or because the hypothesis space has been expanded. The three levels form a containment hierarchy: L2 invokes L1 at each step, and L3 invokes L2 each time it probes the world for evidence before committing to a model update.
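The update loop (ℳ_t, d_t) → (diagnose + distill + validate) → ℳ_{t+1} can be sketched as follows. This is a deliberately small illustration under strong assumptions: the model stack is a lookup table of symbolic transition rules, evidence is a list of observed transitions, and the only gate is a regression suite with rollback. Canary policies and robustness tests are omitted, and all class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Transition = Tuple[str, str, str]  # (state, action, observed next state)

@dataclass
class ModelStack:
    """M_t: a versioned rule table plus the regression cases it must keep passing."""
    rules: Dict[Tuple[str, str], str]
    regression: List[Transition] = field(default_factory=list)
    version: int = 0

    def predict(self, state: str, action: str) -> Optional[str]:
        return self.rules.get((state, action))

def diagnose(model: ModelStack, evidence: List[Transition]) -> List[Transition]:
    """Evidence-grounded diagnosis: keep the replayable cases the model gets wrong."""
    return [e for e in evidence if model.predict(e[0], e[1]) != e[2]]

def distill(model: ModelStack, failures: List[Transition]) -> ModelStack:
    """Persistent asset update: promote failures into rules and regression tests."""
    patched = dict(model.rules)
    for state, action, nxt in failures:
        patched[(state, action)] = nxt
    return ModelStack(patched, model.regression + failures, model.version + 1)

def validate(candidate: ModelStack) -> bool:
    """Governed validation: every stored regression case must still pass."""
    return all(candidate.predict(s, a) == n for s, a, n in candidate.regression)

def revise(model: ModelStack, evidence: List[Transition]) -> ModelStack:
    failures = diagnose(model, evidence)
    if not failures:
        return model
    candidate = distill(model, failures)
    return candidate if validate(candidate) else model  # roll back on regression

m0 = ModelStack({("login", "submit"): "home"}, [("login", "submit", "home")])
m1 = revise(m0, [("home", "logout", "login")])   # new structure: accepted
m2 = revise(m1, [("login", "submit", "error")])  # contradicts a regression case
```

The second revision is rejected because it would break an existing regression case, so `m2` is the same object as `m1`; a fuller system would escalate such conflicts rather than silently keeping the old rule.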

The formal components above describe an agent whose decisions are determined by three elements: the state it believes the world to be in, the action it can execute, and the task (or constraint c) it must satisfy. This triple, not a flat observation-to-action mapping, defines the interface between world model and planner. Building a useful z_t involves two orthogonal challenges that structure Section 3: (i) spatial representation: compressing a high-dimensional observation o_t into a compact latent that retains decision-relevant structure (geometry, semantics, affordances), and (ii) temporal fusion: integrating history (o_{≤t}, a_{≤t-1}) so that z_t approximates a Markov belief even in partially observable settings. Actions are not flat variables: they can emerge from representation learning rather than being pre-defined, with the core dynamics captured by the latent representation and everything else serving as a decoder (LeCun, 2022). Real agent behavior decomposes across temporal scales and abstraction levels, including low-level motor primitives, mid-level skills, and high-level task plans. The world model must predict transitions at the granularity that matches the planner’s query horizon. This action hierarchy interacts directly with the L1→L2 boundary: local dynamics suffice for primitive-level prediction (Sun et al., 2025a), but skill- and task-level rollouts require the multi-step coherence that defines L2. At the L3 level, the agent must not only predict transitions across temporal scales but also decide when its own transition model is inadequate and initiate model revision. L3 treats the world-modeling stack itself as an object of action. Diagnostic probes, architecture modifications, and regression tests become “meta-actions” that operate on the model rather than on the environment itself, reshaping how the system learns rather than merely how it acts.

As introduced in Section 1.2, we organize the survey along two orthogonal axes: capability level (L1/L2/L3) and governing-law regime. This subsection elaborates the four regimes and the constraints each imposes on the learned transition function. We distinguish Laws of the Physical World (governing agents that perceive and act in physical environments), Laws of the Digital World (governing deterministic program semantics: code, APIs, and state machines), Laws of the Social World (governing the dynamics of minds and institutions: beliefs, goals, and norms), and Laws of the Scientific World (governing systems that exist independently of human design, whose dynamics must be discovered from empirical observation). These four regimes are representative, not exhaustive. Real-world systems often operate under multiple regimes simultaneously. For example, autonomous driving involves both physical dynamics and social norms, while drug design couples natural mechanisms with digital simulation pipelines.

Laws of the Physical World constrain transitions through the physical dynamics that embodied agents must respect: contact mechanics, collision response, gravitational acceleration, friction, and kinematic feasibility. In robotics manipulation, autonomous driving, and interactive 3D simulation, the learned transition p_θ(z_t | z_{t-1}, a_t) must encode these physical interactions faithfully (Todorov et al., 2012; Hu et al., 2023; Wang et al., 2024h). This regime is distinguished by analytically characterizable governing equations. A physics engine or analytic model can verify whether a predicted transition is consistent with rigid-body constraints and Newtonian mechanics. Constraint violations appear as objects passing through each other, gravity reversing mid-rollout, or physically impossible deformations. Such failures are immediately detectable because the ground-truth dynamics admit closed-form or numerically exact reference solutions.
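As a minimal illustration of checking a predicted transition against an analytic reference, the sketch below compares a predicted (height, velocity) pair with the exact constant-acceleration update for a point mass in free fall. The tolerance, the state layout, and the absence of contact or drag are simplifying assumptions.

```python
from typing import Tuple

G = 9.81  # gravitational acceleration in m/s^2

State = Tuple[float, float]  # (height in m, vertical velocity in m/s)

def free_fall_step(height: float, velocity: float, dt: float) -> State:
    """Exact update for a point mass under gravity (no contact, no drag)."""
    return height + velocity * dt - 0.5 * G * dt * dt, velocity - G * dt

def is_physically_consistent(state: State, predicted: State,
                             dt: float, tol: float = 1e-3) -> bool:
    """Accept a predicted transition only if it matches the analytic reference."""
    ref_h, ref_v = free_fall_step(state[0], state[1], dt)
    return abs(predicted[0] - ref_h) <= tol and abs(predicted[1] - ref_v) <= tol

state = (10.0, 0.0)
good_prediction = free_fall_step(10.0, 0.0, 0.1)
bad_prediction = (10.05, 0.981)  # gravity reversed mid-rollout
```

In practice the reference would come from a physics engine rather than a closed-form expression, but the verification pattern is the same.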

Laws of the Digital World constrain transitions through deterministic program semantics, including API contracts, UI state machines, file-system logic, and network protocols. In web navigation, code generation, and software testing, the transition function p_θ(z_t | z_{t-1}, a_t) is largely deterministic but branches heavily through error codes, permission checks, and edge cases (Gu et al., 2025b; Yao et al., 2022). This regime is defined by transitions that are both specifiable and verifiable. The program can be executed and its output compared against the model’s prediction. Constraint violations appear as producing an API call that does not exist, ignoring returned error codes, or violating type constraints. Because the underlying system is a formal artifact, such errors are mechanically checkable.
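Verification by execution can be sketched with a tiny reference state machine: the predicted next state is compared against what the system actually returns, and undefined actions surface as errors, much like a call to a nonexistent API. The `REFERENCE_UI` table and its state names are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical UI state machine standing in for the real executable system.
REFERENCE_UI: Dict[Tuple[str, str], str] = {
    ("logged_out", "submit_valid_login"): "dashboard",
    ("logged_out", "submit_invalid_login"): "error_401",
    ("dashboard", "logout"): "logged_out",
}

def execute(state: str, action: str) -> str:
    """Ground truth by execution; undefined actions raise, like a missing endpoint."""
    if (state, action) not in REFERENCE_UI:
        raise ValueError(f"action {action!r} is not defined in state {state!r}")
    return REFERENCE_UI[(state, action)]

def check_prediction(state: str, action: str, predicted_next: str) -> Tuple[bool, str]:
    """Return (is_correct, actual outcome or error message)."""
    try:
        actual = execute(state, action)
    except ValueError as err:
        return False, str(err)
    return actual == predicted_next, actual

ignored_error = check_prediction("logged_out", "submit_invalid_login", "dashboard")
correct = check_prediction("dashboard", "logout", "logged_out")
missing_api = check_prediction("dashboard", "delete_account", "logged_out")
```

The first check fails because the model ignored an error branch, and the third fails because the action does not exist, which are two of the violation types named above.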

Laws of the Social World constrain transitions through beliefs, goals, norms, social contracts, and institutional rules. In social simulation, dialogue systems, and multi-agent interaction, p_θ(z_t | z_{t-1}, a_t) maps joint actions and mental states to new mental states and social outcomes (Park et al., 2023; Zhou et al., 2025b). Two properties set this regime apart. Transitions are reflexive, meaning that agents’ beliefs about the state actively change the state itself. They are also normative, governed not only by what will happen but by what should happen according to shared conventions. Constraint violations appear as breaking a promise without consequence, forgetting a prior commitment, or ignoring established social norms. Such failures undermine coherence because social outcomes depend on mutual expectation.

Laws of the Scientific World constrain transitions through latent causal mechanisms that must be discovered from empirical observation rather than specified a priori. In weather prediction, molecular dynamics, protein folding, and drug design, p_θ(z_t | z_{t-1}, a_t) encodes atmospheric dynamics, chemical kinetics, or biological processes whose exact functional forms are unknown or too complex to write analytically (Karniadakis et al., 2021; Lam et al., 2023; Abramson et al., 2024). This regime differs in that the governing equations are not available in closed form. The world model must learn them from data and be validated against experimental measurement. Constraint violations appear as predicting physically impossible molecular configurations, violating conservation laws that hold empirically, or ignoring known causal dependencies. Detection typically requires comparison with laboratory or observational data rather than symbolic verification.
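One cheap diagnostic for this regime is to track an empirically conserved quantity along a predicted trajectory. The sketch below measures relative drift in total concentration for a hypothetical closed reaction A ⇌ B; the trajectories are made-up numbers for illustration, and passing such a check does not replace validation against experimental data.

```python
from typing import Callable, Sequence, Tuple

Concentrations = Tuple[float, float]  # (amount of A, amount of B)

def conservation_drift(
    trajectory: Sequence[Concentrations],
    conserved: Callable[[Concentrations], float],
) -> float:
    """Largest relative deviation of a conserved quantity from its initial value."""
    reference = conserved(trajectory[0])
    worst = max(abs(conserved(state) - reference) for state in trajectory)
    return worst / abs(reference)

def total_amount(state: Concentrations) -> float:
    return state[0] + state[1]  # closed system: A + B should stay constant

plausible = [(1.0, 0.0), (0.8, 0.2), (0.65, 0.35)]
implausible = [(1.0, 0.0), (0.8, 0.3), (0.65, 0.5)]  # matter appears from nowhere

drift_ok = conservation_drift(plausible, total_amount)
drift_bad = conservation_drift(implausible, total_amount)
```

A rollout with large drift can be rejected without any laboratory comparison, whereas a rollout with zero drift still needs empirical validation.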

With these foundations in place, the following sections instantiate each capability level in turn: Section 3 surveys L1 methods, Section 4 addresses L2 simulation, and Section 5 examines L3 model revision. Appendix B clarifies the distinctions between world modeling and generic prediction, world models and planners, and world modeling and the commonsense reasoning that agents rely on in unscripted settings.

The hierarchical structure begins with L1, which assesses a world model’s local predictive ability by requiring it to sustain a meaningful internal state and use local predictive mechanisms to anticipate the next state, including potential observations or actions. In the unified graphical model of Figure 7, L1 is the scope of a single edge z_{t-1} → z_t conditioned on action a_t (the indexing of Eq. 2); everything in this section elaborates the operators that populate this one-step transition and examines how they are realized in contemporary world-model systems.

L1 concerns the local predictive ability of a world model for an agent acting in an environment to accomplish a task or goal. More precisely, an agent is a system that, given observations, makes decisions and takes actions in order to satisfy an objective. In this paper, the role of an L1 world model is therefore not merely to predict the next signal, but to provide local predictive operators that support such decision-making at the granularity of one step (or a short fixed horizon). This epistemic stance aligns with Hume’s constant conjunction: regularities are extracted from observed data without claiming causal necessity (Section 2.1).

The POMDP formulation that underlies L1 originates from the reinforcement learning literature, where an agent must select actions under partial observability to maximize cumulative reward (Kaelbling et al., 1998; Puterman, 1994). In this setting, the agent maintains an internal belief over hidden states and crafts a policy π(a_t | b_t) that maps beliefs to actions. This formulation constitutes the prototypical agent–environment loop (Sutton, 1991). For an agent that interacts with the environment to accomplish a task, the POMDP decomposes into four local operators: state inference, forward dynamics, observation decoding, and inverse dynamics. Together these describe the foundational learning problems for world models at the L1 level.

Following this formulation (Section 2), L1 is characterized by local predictive operators acting on a learned internal state z_t (resembling a belief state), and the central modeling object is a one-step (or short fixed-horizon) transition operator. In practical terms, z_t is inferred from observations and actions and functions as a learned approximation to the latent environmental state and/or belief (Hafner et al., 2025; Schrittwieser et al., 2020). The idea of learning such latent dynamics can be traced back to locally linear latent models for control (Watter et al., 2015) and Gaussian-process dynamics (Deisenroth and Rasmussen, 2011), and has been extended by contemporary deep learning architectures (Ha and Schmidhuber, 2018; Hafner et al., 2020). The term “Markov” in L1 refers to the Markov property of the learned internal state z_t, meaning that z_t is sufficient (or nearly sufficient) for predicting the subsequent local step; it does not refer to direct observability of the environmental state (Hafner et al., 2019; 2025; Gelada et al., 2019).

At the model level, L1 factorizes into four local operators over z_t (Table 2). The core operator is latent dynamics (z_{t-1}→ z_t); the others are common supporting operators:

State inference (observation → state, Eq. 1): z_t = f_ϕ(o_{≤t}, a_{≤t-1}) or q_ϕ(z_t | o_{≤t}, a_{≤t-1}). The learned belief-like state summarizes relevant history for prediction (Hafner et al., 2019; Lesort et al., 2018).

Forward dynamics (state → next state; core L1 operator, Eq. 2): z_t ∼ p_θ(z_t | z_{t-1}, a_t) (action-conditioned) or z_t ∼ p_θ(z_t | z_{t-1}) (action-free).

Observation decoding (state → observation, Eq. 3): p_ψ(o_t | z_t), mapping latent state back to observation space (Kingma and Welling, 2014; Rezende et al., 2014).

Inverse dynamics (Eq. 4): π_η(a_t | z_{t-1}, z_t), used as an auxiliary objective or for representation shaping (Pathak et al., 2017; Hafner et al., 2020).
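The four operators can be read as an interface. The toy scalar model below is a sketch under strong assumptions (deterministic operators, additive actions, an identity decoder) and is not the architecture of any cited method; the class name `ToyL1Model` and the smoothing weight `alpha` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class ToyL1Model:
    alpha: float = 0.5  # weight given to each new observation by the filter

    def forward(self, z_prev: float, action: float) -> float:
        """Forward dynamics (Eq. 2): next latent from previous latent and action."""
        return z_prev + action

    def infer(self, observations: Sequence[float], actions: Sequence[float]) -> float:
        """State inference (Eq. 1): filter the history into a single latent.

        `actions` holds the actions taken between consecutive observations,
        so it has one fewer entry than `observations`.
        """
        z = observations[0]
        for obs, action in zip(observations[1:], actions):
            prior = self.forward(z, action)            # predict
            z = prior + self.alpha * (obs - prior)     # correct toward observation
        return z

    def decode(self, z: float) -> float:
        """Observation decoder (Eq. 3): identity map in this toy setting."""
        return z

    def inverse(self, z_prev: float, z: float) -> float:
        """Inverse dynamics (Eq. 4): recover the action linking two latents."""
        return z - z_prev

model = ToyL1Model()
z_t = model.infer(observations=[0.0, 1.1, 1.9], actions=[1.0, 1.0])
z_next = model.forward(z_t, 1.0)
recovered_action = model.inverse(z_t, z_next)
```

Learned systems replace each method with a neural network (and usually with distributions rather than point estimates), but the call pattern among the four operators is unchanged.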

Continuing our representative methods table, V-JEPA and DINOv2 exemplify the state-of-the-art in representation learning for state inference. JEPA variants predict masked patch embeddings directly in latent space, which encourages semantic understanding without low-level pixel reconstruction. DINOv2 uses self-distillation to produce robust visual features usable for downstream tasks. We then move into Model-Based RL. PILCO and E2C are pioneering works that use Gaussian processes and locally linear models, respectively, for latent dynamics. World Models and the Dreamer family (through DreamerV3) introduce the powerful Recurrent State Space Model (RSSM), which combines a deterministic recurrent path with a stochastic component to form a belief state. MuZero takes a different approach with deterministic MLP dynamics trained end-to-end for value prediction. Other methods like TD-MPC2 and MBPO demonstrate scalable or ensemble-based dynamics. The final category in this table is Token / Diffusion-Based methods. IRIS tokenizes observations with a VQ-VAE and uses an autoregressive Transformer, while DIAMOND uses a diffusion model directly in pixel space. The table summarizes key innovations and which local operators (state inference, forward dynamics, observation decoding, inverse dynamics) each method implements.

We can now categorize L1 techniques based on the four local operators. State inference is the

Table 4 maps the three L2 boundary conditions to concrete instantiations in each governing-law regime. More precisely, an L2 system supports trajectory-level queries of the form p̂(τ | z_0, a_{1:H}, c), τ = (z_1, …, z_H), where a_{1:H} denotes an action sequence and c denotes optional constraints imposed by the governing-law regime. Intervention-structured rollouts align with the interventional rung of Pearl’s causal hierarchy (Section 2.1). What separates L2 from L1 is not one-step predictive quality alone, but coherent multi-step rollout under the governing laws. L2 thus stitches per-edge L1 operators into a full trajectory z_0 → z_1 → ⋯ → z_H (top block of Figure 7).

Table 6 shifts focus to two other major application domains: Social World and Scientific World. These represent

Aurora | Paper | Code | ✔ | ✗ | ✔ | 3D Swin weather foundation
Lingshu-Cell | Paper | — | ✔ | ✗ | ✔ | Masked diffusion cellular WM

4.2.1 Laws of the Physical World

In the physical domain, L2 models should respect geometry, kinematics, and conservation laws. The governing constraints are contact, reachability, stability, and energy conservation; violations of any of these will mislead a planner into proposing actions that fail catastrophically in real execution.

Physics simulation.

Rigid-body control simulators. Classical physics simulators remain the foundation layer for executable transition validity in embodied world modeling. MuJoCo provides articulated rigid-body dynamics and contact-rich control, with dm_control packaging these capabilities into a standardized continuous-control suite. Brax pushes differentiable rigid-body simulation toward accelerator-scale throughput, while Isaac Gym and Isaac Lab emphasize massive GPU-parallel robotics simulation.

Scalable and general-purpose simulation platforms. Genesis positions itself as a generative and universal physics engine, reflecting the broader trend toward higher-throughput simulators that can jointly support both control and large-scale synthetic-data generation.

Interaction-centric embodied simulators. At the graphics-and-robotics interface, SAPIEN provides part-aware, interaction-centric simulation, and ManiSkill3 scales GPU-parallel rendering for generalizable embodied AI. These systems are not learned simulators; they are explicit law executors whose value lies in precise contact handling, articulated constraints, and reproducible rollouts.

Video generation models.

Appearance-first long-horizon video generation. A scalable route to physical-world simulation is the video interface: given current observations and optional actions, the model returns imagined future frames. This line begins with appearance-first rollout, where systems such as Sora, Lumiere, and VideoPoet demonstrate coherent visual dynamics over extended horizons, with geometry-aware structure increasingly emerging beyond pixel-level realism. FramePack and Self-Forcing reduce long-horizon drift, through frame-context packing and training on the model’s own rollouts, respectively.

Action-conditioned and interactive video worlds. A second direction moves from passive continuation toward intervention-aware generation. Genie learns latent action spaces from unlabeled Internet video, while GAIA-1 conditions future generation on explicit control signals for counterfactual evaluation. More recent systems push this line toward real-time, long-horizon, and streaming interaction: Oasis explores open-ended interactive generation in a unified transformer world; WorldPlay emphasizes long-term geometric consistency for real-time interactive world modeling; Matrix-Game 3.0 extends interactive generation to streaming settings with explicit long-horizon memory; Yume-1.5 studies text-controlled interactive world generation; and LongLive targets real-time interactive long video generation. Taken together, these systems mark a shift from passive video prediction toward controllable, intervention-aware, and temporally persistent video worlds.

Decision-oriented video world models. In model-based RL, SimPLe and DIAMOND make the decision-theoretic role of video world models explicit. In robotics, DreamZero and DreamDojo demonstrate zero-shot and generalist policy learning via video world models, while FutureVLA couples visuomotor prediction directly with Vision-Language-Action policies to unify perception and control.

Evaluation and limitations. Within our L2 framing, however, visual plausibility does not equal decision-usability. Intervention sensitivity remains fragile, long-horizon coherence is easily overstated when judged by perceptual quality alone, and constraint consistency is difficult to verify from rendered frames. Standard metrics such as FVD capture distributional realism; VBench-style suites better decompose controllability; VBench-2.0 extends evaluation to physics consistency and commonsense reasoning; and VChain introduces visual chain-of-thought for causal coherence. Video interfaces are the most scalable observation-layer entry point, but planner-critical structure remains implicit in pixels; Appendix C surveys geometry-carrying alternatives that make such structure explicit.

Robotics and sim-to-real transfer.

World models transferred to real robots. DayDreamer showed that Dreamer-family world models can transfer from simulation to physical robots while handling sensor noise, contact dynamics, and actuation delays. DreamZero achieves zero-shot policy learning via world action models that predict both next states and actions, and FutureVLA embeds visuomotor prediction within Vision-Language-Action models to improve action grounding.

Physics-grounded bridges for sim-to-real robustness. PIN-WM integrates differentiable physics with learned visual world modeling, creating “digital cousins” via physics-aware randomization.

Representation requirement. Across these systems, the key question is not whether richer representations are possible, but what is the weakest representation that still preserves planner-critical structure, such as object persistence, free space, contact onset, support relations, and action-conditioned change over useful horizons. Extended details on 3D-structured world models and autonomous driving appear in Appendix C.

4.2.2 Laws of the Digital World

The Laws of the Digital World govern transitions in systems defined by formal specifications, from finite automata (UI state machines) and context-free grammars (structured data formats) to Turing-complete programs (general software). Unlike the Laws of the Physical World or the Laws of the Social World, these constraints are explicitly specified and mechanically verifiable: a transition either satisfies the program’s semantics or it does not. Because software transitions approximate deterministic state machines and failures are loggable (error codes, popups, permission denials, timeouts), the core challenge for a Simulator in code worlds is structured state prediction (DOM trees, program state, game state) rather than visual fidelity.

Coding agents.

An emerging paradigm represents world models as executable programs rather than neural networks. CodeWM uses LLMs guided by Monte Carlo Tree Search to generate Python programs that serve as explicit, interpretable world models for reinforcement learning across 18 environments. WorldCoder takes a complementary approach, with an LLM agent building a Python world model incrementally through environment interaction for sample-efficient transfer. WKM provides both global task knowledge and dynamic state knowledge to guide LLM agent planning, while CWM, a 32B open-weights LLM specifically trained for code world model research, achieves 65.8% on SWE-bench Verified. A conceptually distinct variant pushes further: rather than using an LLM to generate a code world model, the world model is a running software system. Web World Models implement world state as ordinary web code (TypeScript modules, HTTP handlers, database schemas), delegating logical consistency to deterministic execution of the web stack while LLMs generate context and high-level decisions. These code-based approaches yield interpretable, composable, and verifiable world models that neural dynamics can only approximate.
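The idea of a program serving as the world model can be illustrated by scoring candidate transition programs against logged environment transitions, which is the kind of signal a search or refinement procedure could optimize. This sketch is not the actual CodeWM or WorldCoder pipeline; the key-pickup environment, both candidate programs, and the accuracy metric are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

State = Tuple[int, bool]               # (position on a line, holding the key?)
Transition = Tuple[State, str, State]  # (state, action, observed next state)
Program = Callable[[State, str], State]

KEY_POS = 2  # hypothetical location of the key

def candidate_v1(state: State, action: str) -> State:
    """First draft: movement is right, but 'pick' succeeds anywhere."""
    pos, has_key = state
    if action == "right":
        return (pos + 1, has_key)
    if action == "left":
        return (pos - 1, has_key)
    if action == "pick":
        return (pos, True)
    return state

def candidate_v2(state: State, action: str) -> State:
    """Revised draft: 'pick' only succeeds when standing on the key."""
    pos, has_key = state
    if action == "pick":
        return (pos, has_key or pos == KEY_POS)
    return candidate_v1(state, action)

def transition_accuracy(program: Program, logged: List[Transition]) -> float:
    """Fraction of logged transitions that the program reproduces exactly."""
    return sum(program(s, a) == nxt for s, a, nxt in logged) / len(logged)

logged = [
    ((0, False), "right", (1, False)),
    ((1, False), "pick", (1, False)),  # no key here, so nothing happens
    ((1, False), "right", (2, False)),
    ((2, False), "pick", (2, True)),
]
acc_v1 = transition_accuracy(candidate_v1, logged)
acc_v2 = transition_accuracy(candidate_v2, logged)
```

Because each candidate is ordinary code, a failing transition points directly at the offending branch, which is the interpretability benefit claimed for code-based world models.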

Web agents.

Web agents act by browsing websites, so modeling and simulating state transitions within a website is crucial for building effective web world models. WebDreamer introduced the idea of using an LLM as an implicit world model of the internet, but subsequent work showed that off-the-shelf LLMs are insufficient: dedicated training with transition-focused abstraction is needed. A growing body of work addresses the co-evolution of agent and world model. WebEvolver tightly couples the two in a mutual improvement loop, while DreamGym builds experience models with chain-of-thought reasoning, achieving over 30% improvement on WebArena. At larger scale, WebSynthesis combines world models with MCTS-based planning using entirely synthetic data, and WebWorld trains an open-web simulator on over one million trajectories supporting 30+ step simulation. AUI takes a different approach, employing a Coder to optimize websites by leveraging feedback from a Computer-Use Agent in an iterative collaboration loop. Orthogonal design choices include generating trajectories from tool specifications alone (Simia), adding a metacognitive layer that decides whether to consult the world model at each step (WAC), and using agent-collected data to handle out-of-distribution behaviors.

GUI agents.

GUI agents typically execute actions in real environments. When actions may be dangerous or lead to undesired outcomes, it is beneficial to estimate their effects beforehand: a GUI world model can simulate and evaluate candidate actions before they are executed. MobileDreamer transforms GUI images into task-related sketches for structured state prediction, while MobileWorldBench provides systematic evaluation with 1.4 million (state, action, future state) triplets. Complementary to explicit GUI world models, UI-AGILE shows that effective reinforcement learning and precise inference-time grounding remain equally important for strong downstream GUI-agent performance. A central design question is the output representation: ViMo generates future observations as images using symbolic text representation, while gWorld generates renderable web code as the predicted next state, suggesting that generating the code that renders the GUI can be more faithful than generating pixels directly. At the OS level, NeuralOS simulates desktop GUIs by predicting screen frames from user inputs, while CUWM targets desktop software where persistent document state must be preserved across long-horizon workflows. Code2World further extends this line by treating code as a renderable world, where generated programs directly produce visual states (e.g., HTML) upon execution. This enables modeling environment dynamics as executable code generation, tightly coupling perception, action, and state transition in interactive domains such as GUIs.

4.2.3 Laws of the Social World

Societal world models extend L2 to human interaction, where governing laws are beliefs, desires, intentions, norms, and institutions rather than physics. Social worlds exhibit three distinctive properties: opacity (agents cannot directly observe each other’s mental states), reflexivity (beliefs about social state create feedback loops), and normativity (transitions are governed partly by shared norms). These properties make the transition function partially constituted by collective agreement rather than natural law. A usable social simulator separates surface language from underlying social state: dialogue can vary, but core states (goals, beliefs, relations, norms) must remain consistent and yield interpretable transitions, as formalized by the Rational Speech Acts framework. Concretely, a social compatibility term can encode commitment consistency: if agent i promises action b at time t, later states receive low compatibility when i violates b without explanation, renegotiation, or sanction. Similar terms can score norm compliance, role consistency, or belief-state coherence over the trajectory.
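The commitment-consistency term described above can be sketched as a scoring function over a symbolic event log. The event schema, the multiplicative penalty, and the rule that any later explanation, renegotiation, or sanction excuses a violation are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class SocialEvent:
    t: int        # time step
    agent: str    # the agent who made (or is held to) the commitment
    kind: str     # "promise", "violate", "explain", "renegotiate", or "sanction"
    content: str  # the promised action b

EXCUSES = {"explain", "renegotiate", "sanction"}

def commitment_compatibility(events: List[SocialEvent], penalty: float = 0.1) -> float:
    """Score in (0, 1]: each unexcused broken promise multiplies the score by `penalty`."""
    ordered = sorted(events, key=lambda e: e.t)
    promised: Dict[Tuple[str, str], int] = {}  # (agent, content) -> time of promise
    score = 1.0
    for event in ordered:
        key = (event.agent, event.content)
        if event.kind == "promise":
            promised[key] = event.t
        elif event.kind == "violate" and key in promised:
            excused = any(
                other.kind in EXCUSES
                and (other.agent, other.content) == key
                and other.t >= promised[key]
                for other in ordered
            )
            if not excused:
                score *= penalty
    return score

broken = [
    SocialEvent(1, "alice", "promise", "deliver_report"),
    SocialEvent(3, "alice", "violate", "deliver_report"),
]
repaired = broken + [SocialEvent(2, "alice", "renegotiate", "deliver_report")]

score_broken = commitment_compatibility(broken)
score_repaired = commitment_compatibility(repaired)
```

A trajectory-level social simulator could multiply such terms for norms, roles, and beliefs into a single compatibility weight, although extracting the symbolic events from free-form dialogue is the hard part and is not addressed here.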

Theory of mind as social state.

The computational foundation was laid by Bayesian ToM (BToM), which formalizes mental state inference as probabilistic inverse planning over rational agents. Neural approaches began with ToMnet, whose character, mental state, and prediction networks jointly infer traits and beliefs, and recent work such as LaBToM bridges Bayesian inverse planning with formal epistemic language. However, current models lack robust mental state reasoning: FANToM reveals “illusory ToM” across all state-of-the-art LLMs, and on ExploreToM the accuracy of GPT-4o falls as low as 9%. A complementary challenge is the dual-structure problem: a social agent must simultaneously model others’ mental states (theory of mind) and maintain its own persistent internal state across long interactions, namely goals, persona, memory, and knowledge. Cognitive Architectures for Language Agents (CoALA) formalizes this dual structure as separate memory and action spaces that must remain mutually consistent, and provides a principled framework for understanding how current LLM agents do and do not achieve stable self-representation.
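Inverse planning can be illustrated in a few lines: assume the observed agent is approximately rational, then apply Bayes' rule to infer its goal from its actions, P(g | a_{1:T}) ∝ P(g) ∏_t P(a_t | s_t, g). The sketch below uses a one-dimensional world and a Boltzmann policy whose utility is negative distance to the goal; these modeling choices and the rationality parameter `beta` are assumptions for illustration, not the original BToM implementation.

```python
import math
from typing import Dict, List, Optional, Sequence

MOVES = {"left": -1, "right": 1}

def action_likelihood(pos: int, action: str, goal: int, beta: float = 2.0) -> float:
    """Boltzmann-rational policy: P(a | s, g) is proportional to exp(beta * utility)."""
    utilities = {a: -abs(pos + step - goal) for a, step in MOVES.items()}
    normalizer = sum(math.exp(beta * u) for u in utilities.values())
    return math.exp(beta * utilities[action]) / normalizer

def goal_posterior(
    start: int,
    actions: Sequence[str],
    goals: List[int],
    prior: Optional[Dict[int, float]] = None,
    beta: float = 2.0,
) -> Dict[int, float]:
    """Posterior over goals after observing a sequence of moves from `start`."""
    posterior = dict(prior) if prior else {g: 1.0 / len(goals) for g in goals}
    pos = start
    for action in actions:
        for goal in goals:
            posterior[goal] *= action_likelihood(pos, action, goal, beta)
        pos += MOVES[action]  # the observed agent actually moves
    total = sum(posterior.values())
    return {g: p / total for g, p in posterior.items()}

posterior = goal_posterior(start=0, actions=["right", "right"], goals=[-3, 3])
```

Two steps to the right shift nearly all posterior mass to the goal at +3; richer variants infer beliefs and desires jointly, which is where opacity makes the problem hard.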

Strategic interaction.

CICERO integrates a language model with piKL planning for Diplomacy, jointly optimizing game actions and dialogue while modeling second-order beliefs, achieving more than 2× the average human score. Deal or No Deal pioneered dialogue rollouts for forward simulation of negotiation dynamics. Werewolf and Avalon games serve as concentrated testbeds for deception, trust, and belief manipulation, revealing that deceivers consistently prevail by exploiting cognitive limitations.

Sandbox simulation.

Generative Agents demonstrated emergent social dynamics: a 25-agent simulation used memory-based state tracking and periodic reflection, while Sotopia formalized social simulation evaluation across seven dimensions. Scale has increased dramatically: Project Sid deployed 1,000 agents exhibiting emergent specialization and governance, and OASIS scaled to one million agents reproducing information spreading and group polarization. At the individual level, Argyle et al. demonstrate “silicon sampling”, which conditions LLMs on specific demographic profiles to simulate survey responses from targeted subpopulations and shows strong alignment with American National Election Studies data, opening a path toward individual social world modeling. Generative Social Choice extends this to democratic aggregation, using LLMs to generate representative statements from diverse synthetic participants and enabling deliberation.

4.2.4 Laws of the Scientific World

In AI for Science, the transition from L1 to L2 shifts the focus from modeling local states or structures to simulating dynamics over multiple steps. These dynamics arise along two axes. The first concerns the temporal evolution of a system, where the model predicts how a natural system unfolds over time under given conditions or interventions. The second concerns the scientific research itself, where the model simulates sequences of hypotheses, experiments, and outcomes to support reasoning and action. These two forms define the corresponding forms of simulation in scientific world models: forward simulation of system dynamics, and decision simulation based on surrogate evaluation of candidate experiments.

Forward simulation.

World models approximate the evolution of scientific systems by replacing expensive numerical solvers with learned transition operators. GNS showed that message passing on particle graphs can simulate fluids, rigid bodies, and deformable materials with generalizable dynamics. The Fourier Neural Operator established resolution-invariant operator learning via spectral convolutions, achieving 1000× speedup over traditional solvers and underpinning subsequent weather and fluid surrogates. At planetary scale, Pangu-Weather and GraphCast outperform the ECMWF operational system on 90% of verification targets. GenCast extends these to probabilistic forecasting via a diffusion architecture, outperforming the ensemble system on 97.2% of targets. NeuralGCM integrates learned parameterizations within a differentiable general circulation model, producing emergent phenomena such as tropical cyclones and illustrating the value of coupling mechanistic structure with learned components. Aurora further scales this paradigm to a foundation model of the Earth system, achieving strong performance across multiple forecasting tasks at substantially reduced computational cost. In molecular science, neural network potentials pioneered by Behler and Parrinello enabled orders-of-magnitude speedup over density functional theory for molecular dynamics, establishing the foundation for subsequent ML force fields.

Decision simulation.

World models reduce the cost of scientific discovery by simulating the experimental decision loop in-silico. Representative systems span molecular design (ChemBO), biological sequence optimization with population-based model ensembles and meta-level search reallocation (P3BO), and materials discovery guided by user-defined algorithmic objectives (BAX). Across these systems, the model simulates not only individual outcomes but the sequential process of experiment selection, maintaining and updating beliefs over candidates while identifying inconsistencies during optimization. However, these capabilities remain confined to a fixed data regime: the model cannot actively design and execute experiments to acquire new information that challenges its current assumptions. As a result, while such systems can correct optimization errors, they cannot resolve uncertainty arising from incomplete knowledge, leading to accumulated bias over long horizons. L3 world models (Section 5) overcome this by actively gathering evidence to revise the model.

4.2.5 Cross-Domain Analysis

Figure 8: Diagnostic map of the four governing-law regimes. The axes are schematic rather than metric: the horizontal axis reflects how formally specifiable and mechanically verifiable the transition rules are, while the vertical axis reflects how directly the relevant state and constraints are observable. The purpose of the figure is comparative rather than classificatory: it highlights why different regimes demand different forms of rollout validation even when all are instances of L2 simulation. Real systems are often mixed-regime and may sit between regions rather than inside a single box.

Table 7: Cross-domain comparison of L2 simulators. Each governing-law regime imposes different constraints, failure modes, and evaluation priorities.

| Regime | Governing laws | State types | Common failure modes | Evaluation focus |
|---|---|---|---|---|
| Physical | Geometry, kinematics | Continuous | Contact instability, drift | Stability; failure clustering |
| Social | Beliefs, norms, ToM | Goals, relations | Role drift, goal forgetting | Counterfactual sensitivity |
| Digital | API contracts, UIs | DOM, permissions | Grounding breaks, races | Error-branch coverage |
| Scientific | Mechanisms, evidence | Hypotheses | Hallucinated mechanisms | Evidence-chain repair |

Figure 8 positions the four regimes along two diagnostic axes: formalizability and observability of the governing constraints. Across all four regimes, a recurring pattern emerges: a good Simulator does not have to look more like the world; it must look more like the constraints. Physics uses geometry/contact constraints; software uses state machines and structured feedback channels; social worlds use role/norm consistency; science uses evidence chains and falsifiability. Making constraints explicit (loggable, replayable, regressable) often improves long-horizon stability more than increasing perceptual fidelity. Table 7 summarizes the governing laws, state types, common failure modes, and evaluation focus for each regime.

Cross-regime systems.

Many real-world deployments do not fall neatly into a single governing-law regime; instead, they require an L2 simulator to maintain coherent rollouts across multiple constraint families simultaneously. When regimes interact, a violation in one domain can cascade into another: a physically implausible vehicle maneuver may render a social-intent prediction meaningless, or a software bug may invalidate an otherwise sound experimental plan. Designing and evaluating cross-regime systems therefore demands joint constraint satisfaction rather than per-regime evaluation in isolation.

Autonomous driving: physical (vehicle dynamics, contact mechanics) + social (pedestrian intent prediction, traffic norm compliance).
Minecraft agents (Voyager): physical (3D navigation, combat dynamics) + digital (crafting recipes, inventory management, game-state logic).
Diplomacy (CICERO): social (negotiation, trust modeling, alliance formation) + digital (game-state management, rule enforcement).
Autonomous laboratories (A-Lab): scientific (experiment design, hypothesis evaluation) + physical (sample manipulation, instrument constraints).

The boundary conditions above can be unified as a single principle: whether the world model remains fixed or becomes plastic during deployment. This transition from L2 to L3 manifests in three aspects: whether the model can update its parameters and structure after deployment, how it accumulates new capabilities over time, and whether it passively consumes data or actively generates it through experimentation.

Fixed vs. adaptive. An L2 simulator is typically fixed post-training. It can generate arbitrarily many rollouts from what it learned during training, but its core transition function does not evolve; it explores the implications of its frozen knowledge. In contrast, an L3 system is adaptive post-deployment: it treats its own parameters or structure as a hypothesis to be updated as new evidence arrives.

Modes of growth. L3 growth goes beyond simple data buffering and encompasses three different modes:

Parameter update: modifying weights via gradient descent or Bayesian updates on new evidence, e.g., online learning, continual RL fine-tuning, and Bayesian model updates.
Architecture update: dynamically adding new modules, experts, or capacity to handle complexity, for example, expanding the context window or allocating new memory slots.
Hypothesis-space expansion: extending the model class to represent explanations that were previously inexpressible. This corresponds to introducing new variables, mechanisms, or abstractions, shifting from "I don't know which of these k options is true" to "the correct explanation is not among the current k options." This is the most challenging mode and is closely tied to abduction and genuine scientific discovery.
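The difference between a parameter-level update and hypothesis-space expansion can be made concrete with a small sketch. It is an illustration under stated assumptions, not a method taken from any cited system: each of the k hypotheses is assumed to be a Gaussian with a fixed mean, and the names `posterior_update` and `needs_expansion`, as well as the likelihood threshold, are hypothetical. A Bayesian update reweights the existing k options; expansion is flagged only when none of them assigns the observation a non-negligible likelihood.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior_update(prior, means, std, observation):
    """Bayes rule over k candidate hypotheses (assumed Gaussian, fixed means)."""
    likelihoods = [gaussian_pdf(observation, m, std) for m in means]
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized], likelihoods

def needs_expansion(likelihoods, threshold=1e-3):
    """True when no current hypothesis explains the observation adequately.
    The threshold is an arbitrary illustrative value."""
    return max(likelihoods) < threshold

means = [0.0, 1.0, 2.0]            # hypothetical candidate explanations
prior = [1 / 3, 1 / 3, 1 / 3]

# In-scope observation: the posterior shifts toward the nearest hypothesis.
posterior, liks = posterior_update(prior, means, std=0.5, observation=1.1)
print(posterior, needs_expansion(liks))

# Out-of-scope observation: every hypothesis assigns it negligible likelihood,
# so reweighting the existing options cannot help.
_, liks_far = posterior_update(prior, means, std=0.5, observation=9.0)
print(needs_expansion(liks_far))
```

In the second case the posterior is still well defined numerically, which is exactly why a separate check on absolute likelihood is needed: normalization hides the fact that all k options are poor.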

Passive vs. active. While L2 systems may support passive online learning (updating weights on a stream of incoming data) or decision simulation, L3 is characterized by an active trial-and-error loop. It does not just wait for data; it acts to generate data that maximizes information gain regarding a specific hypothesis or area of uncertainty. This active stance transforms the agent from a consumer of experience to a designer of experiments, a qualitative shift that connects directly to the philosophy of abduction and the scientific method. L3 should not be defined by closed-loop use in the generic planning sense; rather, it is defined by closing the evidence-to-revision loop, so that deployment outcomes are used to diagnose, update, and validate the world-modeling stack itself over successive iterations of use.
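One common way to formalize "acting to maximize information gain" is expected entropy reduction over a discrete hypothesis set. The sketch below assumes binary experiment outcomes and uses hypothetical experiment names and outcome probabilities; it is meant only to show why an active system prefers diagnostic probes over probes whose outcome is equally likely under every hypothesis.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_information_gain(prior, outcome_probs):
    """Expected entropy reduction for one experiment with a binary outcome.

    outcome_probs[h] is P(outcome = 1 | hypothesis h).
    """
    gain = entropy(prior)
    for outcome in (1, 0):
        lik = [q if outcome == 1 else 1.0 - q for q in outcome_probs]
        evidence = sum(p * l for p, l in zip(prior, lik))
        if evidence == 0:
            continue  # this outcome cannot occur under the prior
        posterior = [p * l / evidence for p, l in zip(prior, lik)]
        gain -= evidence * entropy(posterior)
    return gain

prior = [0.5, 0.5]
experiments = {                       # hypothetical candidate probes
    "uninformative_probe": [0.5, 0.5],
    "weak_probe": [0.6, 0.4],
    "diagnostic_probe": [0.95, 0.05],
}
gains = {name: expected_information_gain(prior, q) for name, q in experiments.items()}
best = max(gains, key=gains.get)
print(best, gains)
```

A passive learner would receive whichever of these observations the data stream happens to contain; an active one ranks them and runs the probe with the highest expected gain.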

Examples and Applications. L3 is most tractable in domains that are highly instrumented, offer rapid feedback, and provide well-defined evaluation criteria. Empirical support for L3 is uneven across domains: autonomous science and other highly instrumented settings provide the clearest demonstrations, whereas social, code, and embodied environments remain partly empirical and partly prospective design space. We illustrate this landscape, together with the characteristic evidence signals and failure modes in each, across four governing-law regimes in Figure 10.

Figure 10: L3 evolution across four governing-law regimes. Each panel illustrates the design–execute–observe–reflect loop in a representative domain: (a) Physical intelligence—adaptive probing revises contact dynamics; (b) Social intelligence—norm drift triggers social-model revision; (c) Digital intelligence—evaluator-driven program search with regression gates; (d) Scientific intelligence—closed-loop autonomous experimentation at a synchrotron beamline.

Physical intelligence. In embodied settings, L3 manifests as adaptive probing to infer and update dynamics models. When a robot encounters unexpected contact dynamics, such as a slippery surface or a deformable object, the system can actively execute diagnostic actions (small perturbations designed to disambiguate between hypotheses about the contact model) and use the resulting evidence to update its dynamics model. The anomaly signals in this regime are inherently physical: force/torque deviations, unexpected contact events, and discrepancies between predicted and observed end-effector trajectories provide quantitative evidence for model updates. Recent work demonstrates that robots can autonomously detect physical damage and re-train persistent self-models: Hu et al. (2025b) show that an egocentric visual self-model detects morphology changes via prediction-versus-observation mismatch and re-trains to recover locomotion. AdaptSim (Ren et al., 2023) meta-learns an adaptation policy that iteratively revises simulation parameters from small amounts of real-world task performance data, closing the sim-to-real gap through evidence-driven simulation revision rather than fixed domain randomization, with each real-world deployment informing the next round of simulation updates.

Digital intelligence. Software and web environments are naturally suited to L3 because state is fully observable, actions are deterministically replayable, and regression testing provides a built-in validation gate. Evaluator-driven discovery loops exemplify this regime. Romera-Paredes et al. (2024) pair a pretrained LLM with an automated evaluator in an evolutionary loop: the LLM generates candidate programs, the evaluator scores them against a formal specification, and high-scoring solutions are fed back for further refinement. This loop discovered new constructions for the cap set problem (a long-standing open problem in combinatorics) and new bin-packing heuristics that outperform known baselines. The evaluator serves as an automated regression gate, a key L3 property, although the system realizes only the design and observe components (program generation and automated scoring) without active information expansion or persistent model revision. Novikov et al. (2025) extend this evolutionary coding paradigm: by pairing LLM-generated program mutations with automated correctness evaluators, the system improved on Strassen's matrix multiplication algorithm after 56 years and improved on the prior state of the art for about 20% of the open mathematical problems it was applied to, illustrating the power of formal verification as an L3 gatekeeper in algorithmic domains. CodeIt (Butt et al., 2024) closes a tighter loop: the LLM is fine-tuned from its own search trajectories via prioritized hindsight replay, so that the generative model itself (serving as an implicit world model of program space) persistently improves across tasks. The AI Scientist-v2 (Yamada et al., 2025) pushes further into computational experiments by employing agentic tree search for experiment selection: the system autonomously formulates hypotheses, designs and executes experiments, analyzes results, and writes complete manuscripts. A VLM feedback loop iteratively refines figures and content. In 2025, this system produced an entirely AI-generated paper that passed peer review at an ICLR workshop. However, the system's experiments are computational (running ML training jobs), and its revision loop operates on paper quality rather than mechanistic understanding, illustrating the gap between L3 in well-instrumented computational domains and the harder challenge of genuine scientific discovery. In AUI (Lin et al., 2025a), a Coder–Computer-Use Agent loop instantiates this principle in website development: the Coder iteratively revises website implementations, while the CUA acts as an automated evaluator by executing task trajectories and verifying functional correctness (e.g., navigation success and task completion). The resulting feedback, grounded in executable interactions rather than static inspection, serves as a regression signal that guides subsequent code updates, forming a closed-loop optimization process aligned with L3 properties.
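The generate, score, and feed-back structure shared by these evaluator-driven loops can be sketched in a few lines. This is a toy illustration, not the implementation of any cited system: the LLM proposal step is replaced by a random mutation operator, the "program" is a short integer vector, and the names `evaluate`, `mutate`, and `evolve` are hypothetical. What it does show is the role of the evaluator as a gate: because selection keeps the top scorers among parents and children, the best score never regresses between generations.

```python
import random

def evaluate(candidate, target):
    """Automated evaluator: higher is better; 0 means the specification is met."""
    return -sum((c - t) ** 2 for c, t in zip(candidate, target))

def mutate(candidate, rng):
    """Stand-in for the LLM proposal step: perturb one coordinate by +/-1."""
    child = list(candidate)
    i = rng.randrange(len(child))
    child[i] += rng.choice([-1, 1])
    return child

def evolve(target, generations=200, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [[0] * len(target) for _ in range(population_size)]
    best_history = []
    for _ in range(generations):
        children = [mutate(rng.choice(population), rng) for _ in range(population_size)]
        # Elitist selection over parents and children acts as the regression gate.
        pool = population + children
        pool.sort(key=lambda c: evaluate(c, target), reverse=True)
        population = pool[:population_size]
        best_history.append(evaluate(population[0], target))
    return population[0], best_history

best, history = evolve(target=[3, -2, 5])
print(best, history[-1])
```

Note that nothing in this loop revises the proposal mechanism itself; making `mutate` improve from its own search history is the step that systems such as CodeIt add.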

Social intelligence. L3 in social domains requires revising the agent's social model when predicted behavior of other agents deviates from observed behavior, for example, when Theory-of-Mind predictions fail systematically or when social norms drift over time. This is currently the hardest regime for L3 because attribution is inherently ambiguous (a failed social prediction may reflect incorrect beliefs about the other agent's goals, an outdated norm model, or stochastic behavior) and because social experiments are ethically constrained. Early work on norm emergence and convention formation in multi-agent populations represents a preliminary step toward social L3, but persistent, validated revision of social world models from deployment evidence remains largely open. A more recent step in this direction is the evolutionary synthesis of multi-agent governance rules: Kumar et al. (2026) use LLM-driven genetic programming to evolve interpretable constitutions from societal stability scores, surpassing human-designed rules by 123%.

Scientific intelligence. The most complete current examples of L3 come from autonomous science, where the full design–execute–observe–reflect loop is closed by instrumentation. The paradigm of autonomous closed-loop scientific discovery was established by Robot Scientist Adam (Sparkes et al., 2010), the first machine to autonomously design experiments about gene function, execute them, observe the outcomes, and revise its model. Its successor system demonstrated closed-loop cycles of experiment design, execution, and model revision in yeast systems biology, accelerating biological model development (Coutant et al., 2019). CAMEO (Kusne et al., 2020) implements closed-loop materials discovery via Bayesian active learning at a synchrotron beamline: the system predicts which phase a candidate composition will form, synthesizes it, characterizes the product via X-ray diffraction, updates its Bayesian belief model, and actively selects the next experiment to maximize information gain. Each experimental cycle takes seconds to minutes, and the system discovered a novel phase-change memory material without additional human training. A-Lab (Szymanski et al., 2023) extends this to fully autonomous synthesis: three robotic arms automate powder dosing, heating, and XRD characterization, with an active-learning algorithm generating improved recipes when targets fail. In 17 days of closed-loop operation, A-Lab performed 353 experiments and realized 36 compounds from 57 targets. Crucially, analysis of failed syntheses provided structured evidence to refine future synthesis strategies; the failures were not discarded but distilled into persistent knowledge. Strieth-Kalthoff et al. (2024) extend the self-driving laboratory paradigm to distributed, multi-site operation: a delocalized SDL autonomously discovers novel organic laser emitters by iteratively updating a Bayesian surrogate from synthesis and characterization data across geographically separated facilities. 
BacterAI (Dama et al., 2023) demonstrates that L3 can operate with zero prior biological knowledge: the system iteratively designs and executes experiments to map microbial amino acid requirements, revising its metabolic model purely from experimental evidence. In computational chemistry, MOOSE-Chem (Yang et al., 2025e) demonstrates that an LLM-based framework can rediscover chemistry hypotheses published in Nature and Science in 2024 using only pre-2024 literature, providing evidence that the hypothesis-generation component of the L3 loop is already feasible for natural-science domains. Its successor, MOOSE-Chem2 (Yang et al., 2025d), introduces hierarchical search over fine-grained hypothesis components to improve both precision and novelty of generated discoveries. Appendix D presents worked examples spanning all four regimes. Broader agentic systems are pushing the L3 loop further into biomedicine. Biomni (Huang et al., 2025a) provides a general-purpose biomedical AI agent that integrates over 100 tools and 59 databases spanning 25 subfields, enabling autonomous execution of tasks from causal gene prioritization to drug repurposing. BioLab (Jin et al., 2025) extends this to end-to-end autonomous life-sciences research via a multi-agent system built on biological foundation models. OriGene (Zhang et al., 2025i) demonstrates a self-evolving virtual disease biologist that autonomously discovers therapeutic targets through iterative hypothesis refinement. The AI co-scientist system (Gottweis et al., 2025) employs a generate–debate–evolve approach to hypothesis generation, with multi-agent tournament processes that have been validated in drug repurposing and epigenetic target discovery. Complementing these systems, Yang et al. (2026) introduce a dynamic benchmark revealing that current LLMs still fall short on genuine biological knowledge derivation, underscoring the persistent gap between literature retrieval and true L3 revision that actually updates the underlying model.

Table 8 summarizes representative L3 systems across the four governing-law regimes, indicating which stages of the design–execute–observe–reflect loop each system realizes.

Evidence quality and falsifiability. The quality of evolution depends on the quality of evidence. Table 9 organizes the revision signals that trigger L3 model updates in each governing-law regime: what the agent detects, why it indicates the current model is wrong, and how falsifiable the signal is.

Epistemic gap detection (Dama et al., 2023) occurs when an observation falls outside the model's representational scope.

A useful principle is to prefer falsifiable evidence (Section 2.1). A screenshot combined with a DOM snapshot, error code, and action sequence is reproducible and refutable; "I think the page didn't load" is not. Human feedback should not be treated as a single falsifiability class: subjective or preference feedback is weakly falsifiable, whereas expert diagnostic feedback can be strongly falsifiable when its claims are subsequently checked by tests, experiments, or structured evaluation. An Evolver's progress depends on making lessons verifiable, and reversible when they turn out to be wrong. This requirement connects directly to the anomaly and epistemic-gap triggers defined in Section 5.1: an anomaly is actionable only when the deviation between prediction and observation can be quantified from recorded evidence, and an epistemic gap is recognizable only when the system can demonstrate that no existing hypothesis adequately accounts for the observation. In large-scale deployments, evidence must also be compressible and indexable. Practical systems maintain multi-resolution evidence: a compact error category combined with a state fingerprint and diff summary for fast retrieval, together with pointers to heavier artifacts (screenshots, DOM snapshots, full logs) for deep audits. Evidence quality is also tightly coupled to privacy and safety constraints: an Evolver pipeline must separate what is stored persistently (sanitized logs, hashed fingerprints) from what is kept transient or behind access controls, protecting sensitive data while retaining an audit trail (Xie et al., 2024; Yang et al., 2025a).
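A multi-resolution evidence record of the kind described above might look as follows. This is a minimal sketch under assumptions: the field names, the `EvidenceRecord` and `make_record` names, the truncated SHA-256 fingerprint, and the artifact paths are all hypothetical choices for illustration, not a schema prescribed by the cited work. The light fields support fast retrieval and can be stored persistently; the heavy artifacts are referenced only by pointer so they can sit behind access controls.

```python
import hashlib
import json
from dataclasses import dataclass, field

def fingerprint(state: dict) -> str:
    """Stable hash of a state snapshot; key order does not affect the result."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

@dataclass(frozen=True)
class EvidenceRecord:
    error_category: str      # compact label for fast retrieval
    state_fingerprint: str   # hashed state, suitable for persistent storage
    diff_summary: str        # short description of predicted vs. observed
    artifact_pointers: dict = field(default_factory=dict)  # paths to heavy artifacts

def make_record(category, predicted, observed, artifacts):
    """Build a record from a predicted and an observed state (both flat dicts)."""
    changed = sorted(k for k in set(predicted) | set(observed)
                     if predicted.get(k) != observed.get(k))
    return EvidenceRecord(
        error_category=category,
        state_fingerprint=fingerprint(observed),
        diff_summary="changed keys: " + ", ".join(changed),
        artifact_pointers=dict(artifacts),
    )

record = make_record(
    "grounding_break",
    predicted={"url": "/cart", "items": 2},
    observed={"url": "/login", "items": 2},
    artifacts={"screenshot": "artifacts/0001.png", "dom": "artifacts/0001.html"},
)
print(record)
```

Because the diff is computed from recorded states rather than from a free-text judgment, the deviation it reports can be re-derived and refuted later, which is the falsifiability property the paragraph asks for.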

Continuous self-improvement also introduces governance challenges, including benchmark overfitting, knowledge contamination, and misattribution of failures to wrong components. These risks and the practical measures to mitigate them (versioning, rollback, regression gates) are discussed as open problems in Section 8.

5.4 L3 in Context: Maturity, Governance, and Outlook

Having established the L3 evolution loop, its domain instantiations, and the role of evidence quality, we now examine its practical status and implications. This subsection addresses two complementary questions: maturity, i.e., where L3 systems have been successfully realized across governing-law regimes; and governance, i.e., what risks arise from persistent, automated model revision. Together, these perspectives characterize L3 both as a modeling paradigm and as a deployed system that must evolve reliably under real-world constraints.

Maturity across different domains.

We summarize maturity across the four governing-law regimes:

Scientific (Established). The most mature regime, offering fast, structured feedback, unambiguous anomaly signals (hypothesis falsification), and well-defined revision targets (surrogate model parameters, synthesis recipes) (Kusne et al., 2020; Szymanski et al., 2023; Sparkes et al., 2010; Dama et al., 2023). Primary bottleneck: instrument access and real-data budget.
Digital (Partial). Regression testing provides an automated validation gate, but many systems still lack the active information-expansion boundary condition (Romera-Paredes et al., 2024; Novikov et al., 2025; Butt et al., 2024). Primary bottleneck: active experiment design is often absent.
Physical (Emerging). Promising but limited by attribution difficulty: a failed manipulation can stem from perception, dynamics, actuation, or environmental change, and isolating the brittle component requires careful experimental design (Ren et al., 2023; Hu et al., 2025b). Primary bottleneck: failure attribution across perception, dynamics, and actuation.
Social (Aspirational). Social experiments are ethically constrained, attribution is inherently ambiguous, and behavioral ground truth is noisy (Kumar et al., 2026). Primary bottleneck: attribution ambiguity and ethical constraints on social experimentation.

Governance challenges.

Three governance risks arise specifically from persistent, automated model revision. Benchmark overfitting occurs when the regression gate is too close to the training distribution; the system learns to pass tests rather than improve genuinely. Knowledge contamination occurs when the revision loop incorporates evidence that is itself biased or adversarially constructed, silently degrading the model on OOD inputs. Misattribution cascades occur when a fix for one failure mode inadvertently degrades another component; without comprehensive regression suites, the net effect of an update can be negative. Mitigations include held-out probe sets that are refreshed independently of training data, canary deployment that surfaces regressions before full rollout, and causal ablations that isolate the contribution of each update.
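A regression gate with rollback, the simplest of these mitigations, can be sketched as follows. The probe sets, the toy models, and the `gated_update` name are hypothetical; the point is only that an update is accepted when it does not regress on any held-out probe set, so a fix for one failure mode cannot silently degrade another.

```python
def accuracy(model, probes):
    """Fraction of (input, expected) probe pairs the model gets right."""
    return sum(model(x) == y for x, y in probes) / len(probes)

def gated_update(current, candidate, probe_sets, tolerance=0.0):
    """Accept the candidate only if it does not regress on any held-out probe set."""
    report = {}
    for name, probes in probe_sets.items():
        before, after = accuracy(current, probes), accuracy(candidate, probes)
        report[name] = (before, after)
        if after < before - tolerance:
            return current, False, report   # roll back: keep the current version
    return candidate, True, report

# Hypothetical models: the candidate fixes negative inputs but breaks zero.
current = lambda x: "pos" if x > 0 else "zero"
candidate = lambda x: "pos" if x >= 0 else "neg"

probe_sets = {
    "negatives": [(-1, "neg"), (-5, "neg")],
    "zero": [(0, "zero")],
}
model, accepted, report = gated_update(current, candidate, probe_sets)
print(accepted, report)
```

Here the candidate improves the "negatives" probes from 0.0 to 1.0 but drops the "zero" probe from 1.0 to 0.0, so the gate rejects it. An aggregate score over all three probes would have shown a net improvement and hidden the regression.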

Relationship to Sections 6 and 7.

From an evaluation perspective (Section 6), assessing L3 requires protocols that go beyond single-episode accuracy: the key metric is whether the system improves across revision cycles without regressing on held-out probes. From an implementation perspective (Section 7), L3 places the heaviest demands on the system stack (persistent storage, replay infrastructure, regression harnesses, and rollback mechanisms) that are often underspecified in current architectures. Building toward L3 therefore means investing in evaluation infrastructure as much as in model capacity.

6 Evaluations

Evaluating world models for agentic AI requires moving beyond standard generative metrics toward decision-centric protocols organized around three boundary conditions: long-horizon coherence, intervention sensitivity, and constraint consistency. This section first motivates this shift (Section 6.1), then maps the benchmark landscape by governing-law regime and provides detailed evaluation protocols for each condition (Section 6.2), and finally shows how the same benchmark can test L1, L2, or L3 depending on the evaluation protocol (Section 6.3). World-model-specific evaluations show that even frontier models still suffer from substantial capability gaps, while no single benchmark fully captures the space of interest. For further clarification, Appendix E provides a capability coverage matrix and a Minimal Reproducible Evaluation Package (MREP).

6.1 From prediction-centric to decision-centric evaluation

While standard generative metrics such as Fréchet Inception Distance (FID), Fréchet Video Distance (FVD), SSIM, and per-pixel reconstruction loss capture perceptual quality, they are at best weak indicators (Brooks et al., 2024; Ball et al., 2025) of agentic capability and offer limited predictive power for the downstream decisions an agent must ultimately make once embedded in the real-world environment.

As a result, a world model can generate visually convincing rollouts while still breaking down during planning because of hallucinated object dynamics, action-insensitive transitions, or subtle physics violations. These errors are often invisible to distribution-level metrics but devastating for downstream decision-making.

The root cause is a mismatch between what is measured and what matters. The object of evaluation should not be a single-step prediction in isolation, but the trajectory-level rollout and specifically whether this rollout is reliable enough for a planner to act on (Section 4). Aggregate measures such as mean success rates further obscure the picture by masking high variance across task instances (Agarwal et al., 2021; Henderson et al., 2018).

We therefore organize evaluation around the three boundary conditions that mark the transition from L1 to L2 (Section 4.1):

Long-horizon coherence: rollouts remain decision-usable over multiple steps rather than degrading through compounding error.
Intervention sensitivity: counterfactual edits (action or premise changes) induce stable and directionally meaningful trajectory changes.
Constraint consistency: generated futures respect the governing laws of the target regime (Section 4).

These three conditions hold across the four governing-law regimes introduced in Section 4, and together they give a common framework within which we can organize the evaluation protocols, benchmark analyses, and reporting standards described in the remainder of this section.

World-model evaluation is ultimately meaningful only insofar as it reflects downstream decision quality. Benchmarks that capture long-horizon coherence, intervention sensitivity, or constraint consistency are valuable not merely as diagnostics, but because these properties should translate into better plan selection, fewer costly invalid actions, and greater task success under distribution shift. The relevant bridge is therefore not "Does the model look realistic?" but "Does improved model validity change what the agent chooses, and does that shift in choice in turn improve real-world task outcomes?"

Two aggregate metrics operationalize these conditions for downstream decision-making. The Action Success Rate (ASR) measures how often a planner that uses the world model's rollouts to select actions achieves the task goal in the real environment, and the Counterfactual Outcome Deviation (COD) measures intervention sensitivity by comparing rollout outcomes under two policies that differ at a single intervention step. When COD is low, a world model is largely unresponsive to changes in action, which makes it uninformative for counterfactual planning. Together, ASR and COD provide a more direct link between world-model quality and downstream agentic performance: ASR assesses whether the model supports good decisions, whereas COD assesses whether the model responds in a meaningful way to action-level interventions.
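The two metrics can be made concrete with a small sketch. The text above does not fix exact formulas, so the definitions used here are assumptions: ASR is taken as the fraction of real-environment episodes that reach the goal, and COD for a single intervention is taken as the absolute difference between final rollout states when one action in the sequence is swapped. The two one-dimensional world models are hypothetical.

```python
def rollout(model, state, actions):
    """Roll a one-step model forward over a sequence of actions."""
    for a in actions:
        state = model(state, a)
    return state

def action_success_rate(outcomes):
    """ASR (assumed form): fraction of real-environment episodes that reached the goal."""
    return sum(outcomes) / len(outcomes)

def counterfactual_outcome_deviation(model, state, actions, step, alt_action):
    """COD (assumed form) for one intervention: distance between final states
    when the action at a single step is replaced by alt_action."""
    edited = list(actions)
    edited[step] = alt_action
    return abs(rollout(model, state, actions) - rollout(model, state, edited))

# Two hypothetical 1-D world models.
responsive = lambda s, a: s + a    # the state moves with the action
insensitive = lambda s, a: s + 1   # the action is ignored entirely

actions = [1, 1, 1, 1]
cod_responsive = counterfactual_outcome_deviation(responsive, 0, actions, step=1, alt_action=-1)
cod_insensitive = counterfactual_outcome_deviation(insensitive, 0, actions, step=1, alt_action=-1)
asr = action_success_rate([True, True, False, True])
print(cod_responsive, cod_insensitive, asr)
```

The action-insensitive model produces the same future for both action sequences, so its COD is zero regardless of how plausible each individual rollout looks. In practice COD would be averaged over many start states and interventions and computed with a task-appropriate distance.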

The value of a taxonomy is not categorization for its own sake, but guiding system design. This section decomposes world-model implementations along three architectural axes, namely representation, dynamics, and control interface (Section 7.1), and examines how the governing-law regime constrains which combinations are viable in practice (Section 7.2). Deploying these systems raises cross-cutting engineering challenges: the choice between end-to-end and modular training, latency-compute tradeoffs, sim-to-real transfer, and graceful degradation under model uncertainty. A learned world model amortizes simulation cost into a fixed computation graph at inference time, whereas explicit simulation typically scales more directly with the number of entities, interactions, solver steps, or horizon length. This does not mean neural inference is literally O(1) in every relevant variable: its cost still depends on model size, input resolution, sequence length, and rollout depth. The practical advantage is instead that learned dynamics can offer near-constant-cost approximations with respect to aspects of system complexity that would otherwise require increasingly expensive explicit simulation. Efficiency techniques matter here not as generic deployment tricks but because they interact differently with the three capability levels. For L1 systems, compression mainly trades off against one-step predictive accuracy. For L2 systems, memory and rollout efficiency directly affect achievable horizon, counterfactual branching, and thus long-horizon coherence. For L3 systems, the same efficiency choices affect whether regression-gated update loops are cheap enough to run continuously in deployment. Scaling further demands efficiency techniques: few-step distillation for real-time planning, quantization and pruning under the constraint that compounding errors amplify even minor per-step degradation, and KV cache compression for long-horizon autoregressive dynamics. 
A more extended treatment of these deployment and efficiency topics appears in Appendix F.
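The remark that compounding errors amplify minor per-step degradation can be illustrated with a standard worst-case recursion. If the one-step map is L-Lipschitz and each step adds at most ε of error, the rollout error obeys e_{t+1} ≤ L·e_t + ε with e_0 = 0, which gives e_H ≤ ε(L^H − 1)/(L − 1) for L ≠ 1 and e_H ≤ εH for L = 1. The numbers below (ε, L, horizon, and the assumed effect of quantization on ε) are hypothetical and chosen only to show the scaling.

```python
def rollout_error_bound(eps, lipschitz, horizon):
    """Upper bound on H-step error from e_{t+1} <= L * e_t + eps, with e_0 = 0."""
    error = 0.0
    for _ in range(horizon):
        error = lipschitz * error + eps
    return error

# Hypothetical per-step errors before and after an efficiency technique
# such as quantization is applied.
full_precision = rollout_error_bound(eps=0.010, lipschitz=1.1, horizon=50)
quantized = rollout_error_bound(eps=0.012, lipschitz=1.1, horizon=50)
print(full_precision, quantized, quantized - full_precision)
```

With L = 1.1 and H = 50, the bound multiplies the per-step error by roughly a thousand, so an increase of 0.002 per step becomes a difference of more than 2 in the worst-case rollout error. This is a bound rather than a prediction, but it explains why compression that is harmless for L1 one-step accuracy can matter for L2 rollouts.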

7.1 Architectural building blocks: representation, dynamics, and control

Building a world-model system requires choosing components along three axes (Table 11). Each choice carries distinct tradeoffs that determine which capability level (L1/L2/L3) the resulting system can reach and in which governing-law regime it will be most effective.

Representation.

At one extreme, symbolic or programmatic states (e.g., VirtualHome) offer interpretability and enable hard constraint enforcement, but demand heavy manual engineering and cover only pre-specified state spaces; they are best evaluated by success rate and error-branch coverage. At the other extreme, latent continuous representations, such as the RSSM in DreamerV3 and V-JEPA2, handle high-dimensional multimodal inputs with relatively little hand-designed structure. Their weakness is that, over long horizons, they are more susceptible to semantic drift and state aliasing, making long-horizon consistency and failure attribution especially important for evaluation. VL-JEPA develops a joint embedding predictive architecture that predicts continuous embeddings of the target text. VLog uses a learnable token to retrieve narrations, which then serve as a video-centric vocabulary for long-video understanding. Between these two extremes lie structured 3D representations, including occupancy models such as RoboOccWorld and point-flow models such as PointWorld. They are appealing because they fit physical constraints more naturally, but this advantage often comes with reconstruction and computational bottlenecks. As a result, reachability and stability become particularly important in evaluation. Finally, discrete token representations (e.g., VQ-VAE codebooks in IRIS) enforce compositionality and enable exact likelihood training via cross-entropy, bridging continuous perception with autoregressive dynamics.

Dynamics.

Stochastic latent dynamics, exemplified by DreamerV3, express uncertainty and multimodality through principled ELBO training and uncertainty-aware rollouts, but may degrade or become miscalibrated over long horizons. Where uncertainty modeling is less critical, deterministic value-aware dynamics (MuZero, TD-MPC2) optimize the transition function directly for downstream value prediction, trading generative flexibility for tighter integration with the control objective. Autoregressive token dynamics (iVideoGPT, LWM) offer a unified scalable interface that handles multiple modalities through a shared vocabulary, though long-horizon logical consistency remains a weak point. Diffusion-based dynamics (the Sora technical line, DIAMOND, and interactive environments such as Genie) deliver photorealistic observation-level transitions, but the multi-step denoising they require at inference time often comes with weak action controllability.

Control interface.

Online MPC-style approaches (TD-MPC2, PETS) replan at every step using short-horizon rollouts, providing fast correction at the cost of compute and latency pressure. Tree search and expansion (MuZero, EfficientZero) enable counterfactual branching and systematic look-ahead, though they amplify model errors and can exploit benchmark loopholes. Rather than planning in the environment at all, imagined-rollout policy optimization (the Dreamer family) trains a policy entirely on model-generated trajectories, avoiding real interaction during learning but requiring highly accurate dynamics. At the deployment end, offline policy distillation (GR-1) enables cheap inference yet is fragile under distribution shift, motivating OOD stress tests. A distinct strategy altogether, replayable-environment interfaces (OSWorld, SWE-agent) sidestep learned dynamics entirely, treating the real environment as its own simulator and relying on receipt parsing and state fingerprinting. More broadly, part of the control problem is deciding when external computation should be invoked at all, rather than treating tool use as either mandatory or absent; adaptive tool-integration work provides a useful planner-side example of this distinction.
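
The replan-at-every-step pattern described above can be illustrated with random-shooting MPC. This is a minimal sketch under stated assumptions, not the planner used by TD-MPC2 or PETS: the 1-D dynamics `toy_model`, the cost `toy_cost`, and all parameter values are hypothetical stand-ins for a learned model and a task objective.

```python
import random

def toy_model(state, action):
    """Hypothetical learned dynamics: a 1-D point shifted by the action."""
    return state + action

def toy_cost(state, goal=0.0):
    """Hypothetical cost: squared distance to the goal."""
    return (state - goal) ** 2

def mpc_action(state, model, cost, horizon=5, n_candidates=64, rng=None):
    """Random-shooting MPC: sample action sequences, roll each out in the
    model, and return only the first action of the cheapest sequence."""
    rng = rng or random.Random(0)
    best_cost, best_first = float("inf"), 0.0
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            s = model(s, a)          # short-horizon imagined rollout
            total += cost(s)
        if total < best_cost:
            best_cost, best_first = total, actions[0]
    return best_first

def run_closed_loop(start, steps=20, seed=0):
    """Replan at every step. For simplicity the 'real' environment here is
    the toy model itself; in practice it would be the true system."""
    rng = random.Random(seed)
    state = start
    for _ in range(steps):
        action = mpc_action(state, toy_model, toy_cost, rng=rng)
        state = toy_model(state, action)
    return state

final_state = run_closed_loop(start=3.0)
```

Only the first action of each plan is executed, so every step pays for `n_candidates * horizon` model calls; this is the compute and latency pressure noted above, and it is also what allows the controller to correct for model error at the next step.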

Table 11: Architectural building blocks for world models. Three design axes (representation, dynamics, control interface) are cross-referenced with concrete options, representative systems, strengths, and dominant failure modes.

7.2 Design tradeoffs across governing-law regimes

The building blocks above are not interchangeable; the governing-law regime determines which combinations are viable and which failure modes dominate. Table 12 summarizes how deployment-regime latency budgets constrain the viable dynamics model classes and their control interfaces.

Physical-world systems.

Everything hinges on contact, reachability, and stability under continuous actions. Representations must preserve geometry and contact relations; dynamics must be stable over short-to-medium horizons; and the control interface must be fast enough for closed-loop correction. Latent or structured 3D representations paired with MPC or imagined-rollout policies dominate this regime. Short-horizon rollouts reduce compounding error, and MPC provides an online correction mechanism. The main pitfalls are the assumption that a usable 3D scene already exists, degraded 3D reconstruction, semantic drift in latent space, constraint violations that remain plausible in the learned representation, and the sim-to-real gap for contact-rich interactions. It is useful to distinguish at least three transfer curves in practice: transfer across input modalities, transfer across sensor suites, and transfer across environments, since each exposes a distinct failure mode of the learned dynamics and demands its own diagnostic instrumentation.

Digital-world systems.

State-machine and branch consistency, rather than learned dynamics, are the primary bottleneck. Symbolic or DOM-based states paired with replayable environments are the dominant design in this setting. Because they expose explicit state machines and support strong evidence logging, they make failures easier to trace and thus support Evolver-style asset distillation. This transparency, however, is not without cost: grounding may break under UI changes, loading variability and race conditions introduce non-deterministic noise, and benchmark artifacts remain vulnerable to reward gaming and to subtle shifts in the underlying software stack.

Social-world systems.

The dominant bottleneck is maintaining coherent agent identity and relational state across extended interactions. Persona state must persist over hundreds of turns without drift, yet Theory-of-Mind (ToM) inference, which updates beliefs about other agents’ goals, knowledge, and intentions, imposes per-step costs that grow with the number of modeled agents. Multi-agent communication compounds the problem: n-agent interactions generate O(n^2) pairwise belief updates per step, making naïve scaling infeasible for the 10,000+-agent simulations now appearing in the literature. Norm-consistency checking adds a further constraint: valid social rollouts must respect evolving norms (politeness conventions, negotiation protocols, institutional rules), and violations must be detectable at rollout time rather than post hoc. The overarching challenge is that agent identity is not a fixed state vector but an emergent property of interaction history; maintaining stable identity under multi-turn dynamics while still allowing genuine belief revision remains an open architectural problem that current LLM-based agents address only superficially through system-prompt conditioning.
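
The quadratic growth mentioned above can be made concrete with a small count. This sketch assumes that each agent maintains a belief about every other agent, giving n(n-1) ordered pairs per step; the neighbor cap `k_neighbors` is a hypothetical comparison point rather than a method from the surveyed work.

```python
def pairwise_belief_updates(n_agents: int) -> int:
    """Ordered (observer, target) pairs when every agent models every other."""
    return n_agents * (n_agents - 1)

def neighborhood_belief_updates(n_agents: int, k_neighbors: int) -> int:
    """Assumed alternative: each agent tracks at most k_neighbors others."""
    return n_agents * min(k_neighbors, n_agents - 1)

full = pairwise_belief_updates(10_000)            # 99,990,000 updates per step
sparse = neighborhood_belief_updates(10_000, 20)  # 200,000 updates per step
```

At 10,000 agents the dense scheme needs roughly 10^8 belief updates per simulated step, which illustrates why naïve scaling is described as infeasible; any sparsification trades this cost against missed long-range interactions.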

Generative simulation systems.

The central tension is between visual fidelity and action controllability. High-fidelity diffusion or autoregressive models excel at producing photorealistic outputs useful for demonstration and synthetic data generation, but action-conditioning is often unstable and long-horizon consistency is difficult. A system can be mistakenly treated as planning-ready when it is not decision-usable; evaluation should prioritize action-response consistency and long-horizon stability over raw perceptual realism.

Scientific-world systems.

Evidence-chain validity and falsifiability matter more than perceptual quality in this regime (cf. the Popperian reading in Section 2.1). Representations must be interpretable and traceable to experimental evidence; dynamics must respect known mechanism boundaries; and the control interface should support experiment selection and belief updates rather than action execution. The distinctive risks are hallucinated mechanisms that appear plausible but lack grounding, correlation mistaken for causation, and negative results that are silently discarded rather than propagated through the model.

VLA vs. native world models.

A crosscutting architectural question is whether to embed world-model capabilities inside a Vision-Language-Action (VLA) model or to build a dedicated world-model module. VLAs inherit the scaling infrastructure and pretraining data of large language models, but their world-modeling capacity is implicit and difficult to isolate or evaluate. Recent efforts to make this capacity more explicit include spatially guided training that injects geometric structure into VLA policy learning, aiming to bridge the gap between implicit visual knowledge and the explicit physical state awareness that world models require. Related work makes this implicit capacity more procedural than geometric: Pixel Reasoner equips VLMs with explicit visual operations such as zoom-in and select-frame for curiosity-driven evidence gathering, while Visual Rationale Learning treats such visual actions as core reasoning primitives rather than optional tools. Together, these efforts highlight a broader shift toward explicit perceptual control inside VLM-like agents even when no standalone transition model is exposed. Native world models, by contrast, expose an explicit transition function that can be queried, composed, and stress-tested independently. Competition between these paradigms is partly a sociotechnical question rather than a purely algorithmic one: the massive investment in LLM infrastructure creates path dependencies that favor VLA-style integration even when a dedicated module might be technically superior, and whether the field converges on native world models or VLA-style surrogates may depend on tool ecosystems, available datasets, and hardware compatibility as well as intrinsic modeling power. From an evaluation standpoint, the litmus test is whether the system’s predictions can be decoupled from its language generation and tested against the three boundaries (Section 4.1).

These regimes are not mutually exclusive. In practice, mature systems often stack multiple design patterns: symbolic or workflow planning at the top for high-level task decomposition, replayable environments in the middle for receipt validation and failure attribution, and short-horizon continuous control at the bottom for real-time correction. This suggests that the relevant unit of analysis is the composed system, not any single module in isolation. Representation, dynamics, and control should therefore be evaluated together, in light of the constraints they impose and the evidence they make available. Many apparent disagreements in the literature then look less like fundamental disputes about whether world models work, and more like differences in where systems land along these design axes.

Table 12: Deployment latency budgets and engineering bottlenecks by regime. Inference latency budgets range from sub-100 ms for real-time robotics to minutes for offline scientific planning; the table maps each regime to viable dynamics model classes and primary engineering bottlenecks. These are deployment budget ranges rather than measured benchmark results; empirical throughput depends on model size, hardware, batching, simulator implementation, and verification overhead.

7.3 Implementation Roadmap

Table 13 distills the architectural guidance from the preceding sections into a concise roadmap organized by capability level and governing-law regime. For each cell, we list the representation format that best preserves the regime’s planner-critical structure, the dynamics model class that is most tractable at that capability level, and the single most important engineering bottleneck that must be addressed to reach the next level.

Table 13: Design roadmap across governing-law regimes. For each regime, we summarize the representation, dynamics, and bottleneck at L1–L3.

Three cross-cutting engineering principles hold across all cells. First, separate what is learned from what is enforced: hard constraint layers (collision checkers, state-machine validators, regression gates) should be applied at inference time rather than learned implicitly, because soft enforcement through training loss cannot guarantee zero-violation rollouts. Second, instrument before you iterate: logging, replay, and failure attribution infrastructure should be built into the system from the start; without replay, L3 revision becomes anecdotal and ungovernable. Third, match the representation to the planner’s query: a representation that looks realistic but does not expose the variables the planner needs (free space, permission state, reaction rate) is worse than a lower-fidelity representation that does.
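
The first principle, enforcing constraints at inference time rather than learning them implicitly, can be sketched as a validator gate wrapped around a learned transition function. This is a minimal illustration under stated assumptions: `learned_step` and `in_workspace` are hypothetical stand-ins for a learned dynamics model and a hard constraint such as a collision checker or state-machine validator.

```python
from typing import Callable, List, Optional, Sequence, Tuple

State = float
Validator = Callable[[State], bool]

def learned_step(state: State, action: float) -> State:
    """Hypothetical stand-in for a learned transition model."""
    return state + action

def in_workspace(state: State) -> bool:
    """Hypothetical hard constraint: the state must stay within [0, 10]."""
    return 0.0 <= state <= 10.0

def gated_rollout(
    state: State,
    actions: Sequence[float],
    step: Callable[[State, float], State],
    validators: Sequence[Validator],
) -> Tuple[List[State], Optional[int]]:
    """Roll out the model and stop at the first predicted state that any
    validator rejects. Returns the accepted states and the index of the
    violating action (None if the whole rollout is valid)."""
    trajectory: List[State] = []
    for i, action in enumerate(actions):
        candidate = step(state, action)
        if not all(check(candidate) for check in validators):
            return trajectory, i      # rejected states never enter the rollout
        trajectory.append(candidate)
        state = candidate
    return trajectory, None

states, violation = gated_rollout(8.0, [1.0, 1.0, 1.0], learned_step, [in_workspace])
# states == [9.0, 10.0]; violation == 2 (the third action would leave the workspace)
```

Because the gate sits outside the model, a zero-violation guarantee depends only on the validators being correct, not on how well the training loss was minimized. The returned index also gives the planner a concrete point from which to replan.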

Early AI attempted to hand-code world models as logical rules and constraints. STRIPS introduced the first action-schema representation for robotic planning, but the Frame Problem revealed that every action requires explicit axioms specifying what does not change, a burden that grows combinatorially. The Lighthill Report catalyzed the first AI winter (1974–1980) by exposing the gap between laboratory demonstrations and real-world competence. The second winter (1987–1993) followed the brittleness of expert systems and the collapse of the Lisp-machine market: hand-crafted knowledge bases such as CYC could not gracefully handle uncertainty and commonsense exceptions. The overarching lesson was clear: purely symbolic world models do not scale to open-world domains.

Connectionist Resurgence (1986–2020).

The revival of neural networks, from backpropagation through deep convolutional networks and Transformers, shifted the paradigm from hand-coded rules to learned representations. World models re-emerged in model-based reinforcement learning, from latent dynamics models to general pixel-based control.

Generative Revolution (2020–present).

Diffusion models and large-scale language models such as GPT-3 have catalyzed a qualitative shift, building on the Transformer backbone established in the preceding era. Video generation models and LLM-based agents are blurring the boundary between prediction and simulation, though systematic physics violations persist. More broadly, the field is converging toward a neuro-symbolic frontier that combines neural dynamics modules for learning transition functions (L1/L2) with symbolic components for constraint enforcement and hypothesis-space expansion (L3).

Across all four eras, representation learning serves as shared infrastructure: the quality of the learned state z_t determines the ceiling for prediction (L1), simulation (L2), and revision (L3) alike. Whether the representation is a latent vector, a discrete token sequence, a 3D point cloud, or a program, the governing-law regime determines which invariants the representation must preserve.

This historical arc suggests a consistent lesson: progress in world modeling has come not from scale alone, but from changing what is represented, what is compositional over horizon, and what can be revised from evidence. The open problems below are organized around remaining bottlenecks at L1, L2, and L3.

8.2 Open Problems by Capability Level

The preceding sections reveal a clear trajectory: world models are progressing from isolated one-step predictors toward integrated, agent-facing simulators that must respect domain-specific governing laws over extended horizons. Across all four regimes, this progression exposes a common pattern. In embodied domains, visual plausibility is outpacing physical faithfulness: models generate convincing video but violate conservation laws and object permanence under rollout, with the best systems achieving only a 0.262 success rate on physical-consistency tests. In social domains, large-scale agent simulations reproduce emergent phenomena such as opinion polarization and governance formation, but LLM agents exhibit systematic biases toward consensus that diverge from human behavioral patterns. In code domains, agents treat software as deterministic state machines while real systems are partially observable, asynchronous, and multi-tenant. In scientific domains, neural surrogates trained on simulation data degrade when applied to real experimental measurements, exposing a surrogate-to-reality gap analogous to sim-to-real in robotics. The overarching theme is that the bottleneck has shifted from generating plausible futures to ensuring those futures are decision-usable: faithful to governing constraints, responsive to interventions, and calibrated against real-world evidence.

We organize ten concrete open problems by the capability level at which they most directly arise.

Representation and Local Prediction.

Physical faithfulness beyond visual plausibility. Current video and 3D world models achieve perceptual realism but fail physical-consistency tests: PhyWorldBench reports that the best of twelve frontier models attains only a 0.262 success rate on conservation-law and object-permanence probes, with long-horizon error accumulation as the core structural weakness. Closing this gap requires physically grounded representations that enforce constraints under counterfactual rollout, not merely pixel fidelity. Spatially guided training strategies that inject geometric supervision into vision-language-action models offer one promising direction.
Metric-aware video world modeling. Extending geometry-grounded editing from image pairs to temporally coherent video demands four coupled abilities: metric estimation across time, temporal composition of short-step predictions, identity and appearance preservation across frames, and instruction grounding that aligns predicted motion with semantic specifications. Moving to video provides denser temporal supervision and stronger identity constraints than image-pair approaches. Subject-faithful controllable editing methods such as RealCustom++ may supply useful interface components. Evaluation must measure metric controllability directly, not just perceptual quality.
Programmable visual representation. Current visual world models represent state as raw pixels or latent embeddings, neither of which is compositional or precisely editable. Code offers a structured alternative: VCode reconstructs images as SVG programs preserving symbolic semantics over pixel fidelity, Code2Video shows that executable Manim scripts outperform pixel-generation models on structured content by making every spatial and temporal element directly addressable, and VIGA extends the paradigm to 3D by reconstructing scenes and simulating physical interactions through generated Blender code. The open problem is unifying these code-based representations into a single world-model interface for both 2D and 3D compositional editing.

Simulation Fidelity and Intervention.

Partially observable software as a POMDP. No existing code world model maintains belief distributions over hidden backend state (server sessions, database rows, in-flight requests, and background processes), nor reasons about asynchronous transitions with variable latency. Injecting realistic asynchronous failures into standard benchmarks causes significant drops in task success across all state-of-the-art agents. Solving this requires temporal-belief architectures that jointly model what has happened, what is in progress, and what the agent cannot yet observe.
Concurrent multi-user state. Real software is multi-tenant; world models must predict state under a Dec-POMDP where concurrent users' actions are unobservable. Conflict-free replicated data types provide the formal substrate for merging concurrent updates, but no current world model integrates distributed-systems semantics with learned belief tracking over hidden users and pending writes. This problem sits at the intersection of the Digital World and Social World regimes, requiring joint reasoning about software state and multi-agent intent.
Agent-human behavioral alignment at scale. LLM agents exhibit systematic biases toward moderation and consensus, producing two failure modes: mode collapse, where diverse simulated populations converge to homogeneous behavior, and calibration inadequacy, where single-turn persona alignment fails under multi-turn dynamics. Language and cultural priors can inject diversity, but this effect diminishes as cultural distance between populations shrinks. Systematic methods for grounding simulated behavior in real human behavioral distributions are lacking.
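
The belief-tracking requirement in the partially observable software problem above can be illustrated with a discrete Bayes filter over the hidden status of a single in-flight request. This is a toy sketch, not an existing code world model: the three states, the transition probabilities in `TRANSITION`, and the observation likelihoods in `SPINNER_LIKELIHOOD` are all invented for illustration.

```python
def normalize(dist):
    """Rescale a dict of non-negative weights so that it sums to 1."""
    total = sum(dist.values())
    return {k: v / total for k, v in dist.items()}

def predict(belief, transition):
    """Propagate the belief one time step through the transition model."""
    nxt = {s: 0.0 for s in belief}
    for s, p in belief.items():
        for s2, q in transition[s].items():
            nxt[s2] += p * q
    return nxt

def update(belief, likelihood):
    """Bayes update using P(observation | hidden state)."""
    return normalize({s: belief[s] * likelihood[s] for s in belief})

# Hypothetical per-step dynamics of a request the agent cannot observe directly.
TRANSITION = {
    "pending":   {"pending": 0.6, "committed": 0.3, "failed": 0.1},
    "committed": {"pending": 0.0, "committed": 1.0, "failed": 0.0},
    "failed":    {"pending": 0.0, "committed": 0.0, "failed": 1.0},
}
# Hypothetical P(UI still shows a spinner | hidden state).
SPINNER_LIKELIHOOD = {"pending": 0.9, "committed": 0.2, "failed": 0.2}

belief = {"pending": 1.0, "committed": 0.0, "failed": 0.0}
belief = update(predict(belief, TRANSITION), SPINNER_LIKELIHOOD)
```

After one step in which the spinner is still visible, most probability mass remains on `pending`, but the agent now carries explicit mass on `committed` and `failed`. A realistic system would need such beliefs over many interacting backend objects and over variable latency, which is where the open problem lies.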

Evidence-Driven Revision and Self-Evolution.

Existing autonomous science flagships such as CAMEO and A-Lab demonstrate that closed-loop model revision is feasible in highly instrumented domains, while evaluator-guided algorithmic discovery systems such as FunSearch and AlphaEvolve demonstrate partial L3 loops with strong validation. Several open problems must be solved before L3 generalizes.

Continual learning of societal transition functions. Large-scale simulations with 10,000+ agents across millions of interactions reproduce emergent phenomena such as opinion polarization and governance formation, yet cannot autonomously detect when social dynamics have shifted. The core challenge is to identify an outdated transition model, acquire corrective evidence, and revise without catastrophic forgetting of stable patterns. This problem connects to L3, where model revision must be triggered by distributional evidence rather than supervised labels.
Closing the surrogate-to-reality gap. Scientific surrogates validated on simulation data degrade on real measurements; prediction error decreases as a power law with computational data but plateaus without real-data calibration. This gap is the scientific-regime analogue of sim-to-real transfer in robotics, so the implementation-side mitigations provide a useful template, even though the measurement bottlenecks and evidence budgets differ. Notably, L2 scientific simulators such as GraphCast, NeuralGCM, and Aurora provide the prediction substrate on which L3 revision operates; their fidelity sets the ceiling for downstream evidence-driven diagnosis. The OPAL-surrogate framework provides hierarchical Bayesian credibility gates that formalize when a surrogate is trustworthy. The central open question is how to allocate scarce real experimental observations optimally between model calibration and scientific discovery.
Modeling laws that themselves evolve. In biology, ecology, and climate, governing dynamics are non-stationary: viral fitness landscapes shift, climate forcing alters atmospheric dynamics, and evolutionary pressures create tipping points. World models must learn second-order meta-transition operators governing how the model itself drifts, together with revision triggers that detect law change from observational evidence. Causal discovery under non-stationarity provides identifiability results but treats change as variation within a fixed meta-model rather than as structural law replacement.
Harness designs for agentic world modeling. Agent performance has evolved through three successive abstractions: prompt engineering optimizes what the model is told, context engineering curates the information state across turns, and harness engineering designs the executable environment surrounding the model (tools, memory, feedback loops, and inter-agent topology). This progression implies that agent behavior is governed not by the model alone but also by the transition dynamics of its execution environment, making harness design a form of world modeling for software agents. The problem is how to learn and synthesize harnesses from interaction data, treating the execution environment itself as the object of modeling rather than a fixed engineering assumption.
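
The power-law-with-plateau behavior described in the surrogate-to-reality problem above can be illustrated numerically. The functional form below (a power-law term plus a constant floor) and every parameter value are assumptions chosen for illustration; they are not taken from the cited scaling-law study.

```python
def surrogate_error(n_sim, scale=1.0, exponent=0.5, floor=0.1):
    """Assumed error model: a power-law decay in the number of simulated
    samples plus an irreducible floor that only real data could lower."""
    return scale * n_sim ** (-exponent) + floor

errors = [surrogate_error(n) for n in (100, 10_000, 1_000_000)]
# Under these assumed parameters the errors are about 0.2, 0.11, and 0.101:
# each 100x increase in simulated data buys less, and none goes below the floor.
```

Under this assumed form, the allocation question in the text amounts to deciding whether the next real measurement should be spent lowering `floor` (calibration) or answering a scientific question directly.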

Cross-regime shared challenges.

Despite the diversity of governing-law regimes, three open problems recur across all four domains and constitute the deepest bottlenecks for world modeling in agentic AI. Deployment shift: world models trained on offline data or simulation systematically underperform when the environment drifts. UI layouts change, physical contact properties shift, social norms evolve, and scientific instruments recalibrate. Robust world modeling requires online mechanisms that detect distribution shift early and trigger targeted revision rather than waiting for catastrophic failure. Constraint enforcement: all four regimes have governing laws that valid trajectories must satisfy (contact stability, state-machine consistency, norm compliance, evidence-chain validity), yet current models enforce these constraints only softly through training objectives; hard enforcement at inference time, via symbolic layers, constrained rollout, or verification gates, remains an open architectural problem. Persistent update governance: L3 systems that revise themselves from evidence face a trilemma of stability (avoid regressing on past capabilities), plasticity (incorporate new evidence quickly), and auditability (trace every update to its evidence source); no current system resolves all three, and the governance infrastructure (versioning, canary deployment, rollback policies, and regression harnesses) is underspecified in most published architectures. The MREP framework offers a starting point for standardizing evaluation across these shared challenges by providing version-locked, reproducible evaluation packages that make cross-regime comparison tractable.
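
The deployment-shift requirement, detecting drift early and triggering targeted revision, can be sketched as a rolling monitor over one-step prediction error. This is one simple heuristic under stated assumptions, not a method from the surveyed systems: the class `DriftMonitor` and its parameters `baseline_error`, `window`, and `ratio` are hypothetical.

```python
from collections import deque

class DriftMonitor:
    """Flags drift when the mean prediction error over a full window
    exceeds `ratio` times the error measured at validation time."""

    def __init__(self, baseline_error, window=5, ratio=2.0):
        self.baseline = baseline_error
        self.ratio = ratio
        self.errors = deque(maxlen=window)   # keeps only the latest errors

    def observe(self, predicted, actual):
        """Record one prediction/outcome pair; return True if revision
        should be triggered."""
        self.errors.append(abs(predicted - actual))
        window_full = len(self.errors) == self.errors.maxlen
        mean_error = sum(self.errors) / len(self.errors)
        return window_full and mean_error > self.ratio * self.baseline

monitor = DriftMonitor(baseline_error=0.1)
# Five in-distribution steps (error about 0.1), then the environment shifts.
stream = [(1.0, 1.1)] * 5 + [(1.0, 2.0)] * 3
flags = [monitor.observe(p, a) for p, a in stream]
```

Here the monitor stays silent during the first five steps and raises a flag on the first shifted observation. Logging the window contents alongside each flag would supply the evidence trail that the auditability requirement above calls for.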

Lin et al. (2025c) K. Q. Lin, Y. Zheng, H. Ran, D. Zhu, D. Mao, L. Li, P. Torr, and A. J. Wang. VCode: a multimodal coding benchmark with SVG as symbolic visual representation. arXiv preprint arXiv:2511.02778, 2025c.

Lin et al. (2025d) S. Lin, X. Xia, Y. Ren, C. Yang, X. Xiao, and L. Jiang. Diffusion adversarial post-training for one-step video generation. arXiv preprint arXiv:2501.08316, 2025d.

Lin et al. (2023) Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, N. Smetanin, R. Verkuil, O. Kabeli, Y. Shmueli, A. Dos Santos Costa, M. Fazel-Zarandi, T. Sercu, S. Candido, and A. Rives. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.

Lipman et al. (2023) Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. In International Conference on Learning Representations, 2023.

Liu et al. (2025) H. Liu, W. Yan, M. Zaharia, and P. Abbeel. World model on million-length video and language with blockwise RingAttention. In International Conference on Learning Representations, 2025.

Liu et al. (2021) L. Liu, S. Zhang, Z. Kuang, A. Zhou, J.-H. Xue, X. Wang, Y. Chen, W. Yang, Q. Liao, and W. Zhang. Group Fisher pruning for practical network compression. In International Conference on Machine Learning, pages 7021–7032. PMLR, 2021.

Liu et al. (2024a) X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang. AgentBench: Evaluating LLMs as agents. In International Conference on Learning Representations, 2024a.

Liu et al. (2024b) Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. In International Conference on Machine Learning, pages 32332–32344, 2024b.

Lu and Song (2025) C. Lu and Y. Song. Simplifying, stabilizing and scaling continuous-time consistency models. In International Conference on Learning Representations, 2025.

Lu et al. (2022) C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Advances in Neural Information Processing Systems, volume 35, pages 5775–5787, 2022.

Lu et al. (2024a) C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha. The AI scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024a.

Lu et al. (2025a) C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, 22(4):730–751, 2025a.

Lu et al. (2025b) G. Lu, B. Jia, P. Li, Y. Chen, Z. Wang, Y. Tang, and S. Huang. GWM: Towards scalable Gaussian world models for robotic manipulation. In IEEE/CVF International Conference on Computer Vision, pages 9263–9274, 2025b.

Lu et al. (2024b) X. Lu, A. Zhou, Z. Lin, Q. Liu, Y. Xu, R. Zhang, X. Yang, J. Yan, P. Gao, and H. Li. TerDiT: Ternary diffusion models with transformers. arXiv preprint arXiv:2405.14854, 2024b.

Luo et al. (2025) D. Luo, B. Tang, K. Li, G. Papoudakis, J. Song, S. Gong, J. Hao, J. Wang, and K. Shao. ViMo: A generative visual GUI world model for app agents. arXiv preprint arXiv:2504.13936, 2025.

Ma et al. (2026) W. Ma, S. Sun, T. Yu, R. Wang, T.-S. Chua, and J. Bian. Thinking with blueprints: Assisting vision-language models in spatial reasoning via structured object representation. arXiv preprint arXiv:2601.01984, 2026.

Magne et al. (2026) L. Magne, A. Awadalla, G. Wang, Y. Xu, J. Belofsky, F. Hu, J. Kim, L. Schmidt, G. Gkioxari, J. Kautz, Y. Yue, Y. Choi, Y. Zhu, and L. Fan. NitroGen: An open foundation model for generalist gaming agents. arXiv preprint arXiv:2601.02427, 2026.

Majumder et al. (2025) B. P. Majumder, H. Surana, D. Agarwal, B. D. Mishra, A. Meena, A. Prakhar, T. Vora, T. Khot, A. Sabharwal, and P. Clark. DiscoveryBench: Towards data-driven discovery with large language models. In International Conference on Learning Representations, 2025.

Makoviychuk et al. (2021) V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, and G. State. Isaac Gym: High performance GPU-based physics simulation for robot learning. In Advances in Neural Information Processing Systems, 2021.

Mao et al. (2025) X. Mao, Z. Li, C. Li, X. Xu, K. Ying, T. He, J. Pang, Y. Qiao, and K. Zhang. Yume-1.5: A text-controlled interactive world generation model. arXiv preprint arXiv:2512.22096, 2025.

Mao et al. (2026a) Z. Mao, M. Huang, F. Ding, M. Liu, Q. He, and Y. Zhang. RealCustom++: Representing images as real textual word for real-time customization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 48(2):2078–2095, 2026a.

Mao et al. (2026b) Z. Mao, M. Huang, Y. Lin, Q. Wang, L. Zhang, and Y. Zhang. Toward accurate image generation via dynamic generative image transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 48(5):5910–5927, 2026b.

Marcus (2018) G. Marcus. Deep learning: A critical appraisal. arXiv preprint arXiv:1801.00631, 2018.

Marra et al. (2024) G. Marra, S. Dumančić, R. Manhaeve, and L. De Raedt. From statistical relational to neuro-symbolic artificial intelligence: A survey. Artificial Intelligence, 328:104062, 2024.

McCarthy and Hayes (1969) J. McCarthy and P. J. Hayes. Some philosophical problems from the standpoint of artificial intelligence. In Machine Intelligence, volume 4, pages 463–502. Edinburgh University Press, 1969.

Mees et al. (2022) O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022.

Micheli et al. (2023) V. Micheli, E. Alonso, and F. Fleuret. Transformers are sample-efficient world models. In International Conference on Learning Representations, 2023.

Micheli et al. (2024) V. Micheli, E. Alonso, and F. Fleuret. Efficient world models with context-aware tokenization. In International Conference on Machine Learning, pages 35623–35638. PMLR, 2024.

Min et al. (2024) C. Min, D. Zhao, L. Xiao, J. Zhao, X. Xu, Z. Zhu, L. Jin, J. Li, Y. Guo, J. Xing, L. Jing, Y. Nie, and B. Dai. DriveWorld: 4D pre-trained scene understanding via world models for autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15522–15533, 2024.

Minami et al. (2025) S. Minami, Y. Hayashi, S. Wu, K. Fukumizu, H. Sugisawa, M. Ishii, I. Kuwajima, K. Shiratori, and R. Yoshida. Scaling law of Sim2Real transfer learning in expanding computational materials databases for real-world predictions. npj Computational Materials, 11(146), 2025.

Mittal et al. (2025) M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Muñoz, X. Yao, R. Zurbrügg, N. Rudin, L. Wawrzyniak, M. Rakhsha, A. Denzler, E. Heiden, A. Borovicka, O. Ahmed, I. Akinola, A. Anwar, M. T. Carlson, J. Y. Feng, A. Garg, R. Gasoto, L. Gulich, Y. Guo, M. Gussert, A. Hansen, M. Kulkarni, C. Li, W. Liu, V. Makoviychuk, G. Malczyk, H. Mazhar, M. Moghani, A. Murali, M. Noseworthy, A. Poddubny, N. Ratliff, W. Rehberg, C. Schwarke, R. Singh, J. L. Smith, B. Tang, R. Thaker, M. Trepte, K. V. Wyk, F. Yu, A. Millane, V. Ramasamy, R. Steiner, S. Subramanian, C. Volk, C. Chen, N. Jawale, A. V. Kuruttukulam, M. A. Lin, A. Mandlekar, K. Patzwaldt, J. Welsh, H. Zhao, F. Anes, J.-F. Lafleche, N. Moënne-Loccoz, S. Park, R. Stepinski, D. V. Gelder, C. Amevor, J. Carius, J. Chang, A. H. Chen, P. de Heras Ciechomski, G. Daviet, M. Mohajerani, J. von Muralt, V. Reutskyy, M. Sauter, S. Schirm, E. L. Shi, P. Terdiman, K. Vilella, T. Widmer, G. Yeoman, T. Chen, S. Grizan, C. Li, L. Li, C. Smith, R. Wiltz, K. Alexis, Y. Chang, D. Chu, L. J. Fan, F. Farshidian, A. Handa, S. Huang, M. Hutter, Y. Narang, S. Pouya, S. Sheng, Y. Zhu, M. Macklin, A. Moravanszky, P. Reist, Y. Guo, D. Hoeller, and G. State. Isaac Lab: A GPU-accelerated simulation framework for multi-modal robot learning. arXiv preprint arXiv:2511.04831, 2025.

Moerland et al. (2023) T. M. Moerland, J. Broekens, A. Plaat, and C. M. Jonker. Model-based reinforcement learning: A survey. Foundations and Trends in Machine Learning, 16(1):1–118, 2023.

Nagabandi et al. (2018) A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In IEEE International Conference on Robotics and Automation, pages 7559–7566, 2018.

Nasiriany et al. (2024) S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y. Zhu. RoboCasa: Large-scale simulation of everyday tasks for generalist robots. In Robotics: Science and Systems, 2024.

Nasiriany et al. (2026) S. Nasiriany, S. Nasiriany, A. Maddukuri, and Y. Zhu. RoboCasa365: A large-scale simulation framework for training and benchmarking generalist robots. arXiv preprint arXiv:2603.04356, 2026.

Newton (1687) I. Newton. Philosophiæ Naturalis Principia Mathematica. Royal Society, London, 1687.

Ngo et al. (2013) H.-V. V. Ngo, T. Martinetz, J. Born, and M. Mölle. Auditory closed-loop stimulation of the sleep slow oscillation enhances memory. Neuron, 78(3):545–553, 2013.

Nguyen et al. (2023) T. Nguyen, J. Brandstetter, A. Kapoor, J. K. Gupta, and A. Grover. ClimaX: A foundation model for weather and climate. In International Conference on Machine Learning, pages 25904–25938. PMLR, 2023.

Nie et al. (2026) W. Nie, J. Berner, N. Ma, C. Liu, S. Xie, and A. Vahdat. Transition matching distillation for fast video generation. arXiv preprint arXiv:2601.09881, 2026.

Noé et al. (2019) F. Noé, S. Olsson, J. Köhler, and H. Wu. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science, 365(6457):eaaw1147, 2019.

Noé et al. (2020) F. Noé, A. Tkatchenko, K.-R. Müller, and C. Clementi. Machine learning for molecular simulation. Annual Review of Physical Chemistry, 71:361–390, 2020.

Novikov et al. (2025) A. Novikov, N. Vũ, M. Eisenberger, E. Dupont, P.-S. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. R. Ruiz, A. Mehrabian, M. P. Kumar, A. See, S. Chaudhuri, G. Holland, A. Davies, S. Nowozin, P. Kohli, and M. Balog. AlphaEvolve: A coding agent for scientific and algorithmic discovery. Technical report, Google DeepMind, 2025.

Okada and Taniguchi (2022) M. Okada and T. Taniguchi. DreamingV2: Reinforcement learning with discrete world models without reconstruction. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 985–991, 2022.

Oord et al. (2018) A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Oquab et al. (2024) M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.

Ouyang et al. (2026) M. Ouyang, S. Hu, K. Q. Lin, H. T. Ng, and M. Z. Shou. GameWorld: Towards standardized and verifiable evaluation of multimodal game agents. arXiv preprint arXiv:2604.07429, 2026.

Pan et al. (2026) L. Pan, L. Zou, S. Guo, J. Ni, and H.-T. Zheng. Natural-language agent harnesses. arXiv preprint arXiv:2603.25723, 2026.

Park et al. (2022) J. S. Park, L. Popowski, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Social simulacra: Creating populated prototyping communities for social computing research. In ACM Symposium on User Interface Software and Technology, 2022.

Park et al. (2023) J. S. Park, J. C. O'Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.

Park et al. (2020) S. Park, J. Lee, S. Mo, and J. Shin. Lookahead: A far-sighted alternative of magnitude-based pruning. arXiv preprint arXiv:2002.04809, 2020.

Pathak et al. (2017) D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, pages 2778–2787. PMLR, 2017.

Pearl (2009) J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009.

Peebles and Xie (2023) W. Peebles and S. Xie. Scalable diffusion models with transformers. In IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.

Peirce (1932) C. S. Peirce. Collected Papers of Charles Sanders Peirce, volume 2. Harvard University Press, Cambridge, MA, 1932.

Piao et al. (2025) J. Piao, Y. Yan, J. Zhang, N. Li, J. Yan, X. Lan, Z. Lu, Z. Zheng, J. Y. Wang, D. Zhou, C. Gao, F. Xu, F. Zhang, K. Rong, J. Su, and Y. Li. AgentSociety: Large-scale simulation of LLM-driven generative agents advances understanding of human behaviors and society. arXiv preprint arXiv:2502.08691, 2025.

Piatti et al. (2024) G. Piatti, Z. Jin, M. Kleiman-Weiner, B. Schölkopf, M. Sachan, and R. Mihalcea. Cooperate or collapse: Emergence of sustainable cooperation in a society of LLM agents. In Advances in Neural Information Processing Systems, volume 37, pages 111715–111759, 2024.

Plato (1992) Plato. Republic. Hackett Publishing, 1992.

Popper (1959) K. R. Popper. The Logic of Scientific Discovery. Hutchinson, 1959. English translation; original German 1935.

Price et al. (2024) I. Price, A. Sanchez-Gonzalez, F. Alet, T. R. Andersson, A. El-Kadi, D. Masters, T. Ewalds, J. Stott, S. Mohamed, P. Battaglia, R. Lam, and M. Willson. Probabilistic weather forecasting with machine learning. Nature, 637:84–90, 2024.

Puig et al. (2018) X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler, and A. Torralba. VirtualHome: Simulating household activities via programs. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8494–8502, 2018.

Puig et al. (2024) X. Puig, E. Undersander, A. Szot, M. D. Cote, T.-Y. Yang, R. Partsey, R. Desai, A. W. Clegg, M. Hlavac, S. Y. Min, V. Vondruš, T. Gervet, V.-P. Berges, J. M. Turner, O. Maksymets, Z. Kira, M. Kalakrishnan, J. Malik, D. S. Chaplot, U. Jain, D. Batra, A. Rai, and R. Mottaghi. Habitat 3.0: A co-habitat for humans, avatars, and robots. In International Conference on Learning Representations, 2024.

Puterman (1994) M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1994.

Qian et al. (2024) C. Qian, W. Liu, H. Liu, N. Chen, Y. Dang, J. Li, C. Yang, W. Chen, Y. Su, X. Cong, J. Xu, D. Li, Z. Liu, and M. Sun. ChatDev: Communicative agents for software development. In Annual Meeting of the Association for Computational Linguistics, pages 15174–15186, 2024.

Qiao et al. (2024) S. Qiao, R. Fang, N. Zhang, Y. Zhu, X. Chen, S. Deng, Y. Jiang, P. Xie, F. Huang, and H. Chen. Agent planning with world knowledge model. In Advances in Neural Information Processing Systems, volume 37, pages 114843–114871, 2024.

Qin et al. (2024) Y. Qin, Z. Shi, J. Yu, X. Wang, E. Zhou, L. Li, Z. Yin, X. Liu, L. Sheng, J. Shao, L. Bai, W. Ouyang, and R. Zhang. WorldSimBench: Towards video generation models as world simulators. arXiv preprint arXiv:2410.18072, 2024.

Qin et al. (2025) Y. Qin, Y. Ye, J. Fang, H. Wang, S. Liang, S. Tian, J. Zhang, J. Li, Y. Li, S. Huang, W. Zhong, K. Li, J. Yang, Y. Miao, W. Lin, L. Liu, X. Jiang, Q. Ma, J. Li, X. Xiao, K. Cai, C. Li, Y. Zheng, C. Jin, C. Li, X. Zhou, M. Wang, H. Chen, Z. Li, H. Yang, H. Liu, F. Lin, T. Peng, X. Liu, and G. Shi. UI-TARS: Pioneering automated GUI interaction with native agents. arXiv preprint arXiv:2501.12326, 2025.

Senior et al. (2020) A. W. Senior, R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Žídek, A. W. R. Nelson, A. Bridgland, H. Penedones, S. Petersen, K. Simonyan, S. Crossan, P. Kohli, D. T. Jones, D. Silver, K. Kavukcuoglu, and D. Hassabis. Improved protein structure prediction using potentials from deep learning. Nature, 577:706–710, 2020.

Shaj et al. (2023) V. Shaj, S. G. Zadeh, O. Demir, L. R. Douat, and G. Neumann. Multi time scale world models. In Advances in Neural Information Processing Systems, volume 36, pages 26764–26775, 2023.

Shanahan (1997) M. Shanahan. Solving the Frame Problem: A Mathematical Investigation of the Common Sense Law of Inertia. MIT Press, 1997.

Shanahan et al. (2023) M. Shanahan, K. McDonell, and L. Reynolds. Role play with large language models. Nature, 623:493–498, 2023.

Shang et al. (2023) Y. Shang, Z. Yuan, B. Xie, B. Wu, and Y. Yan. Post-training quantization on diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1972–1981, 2023.

Shang et al. (2025) Y. Shang, X. Zhang, Y. Tang, L. Jin, C. Gao, W. Wu, and Y. Li. RoboScape: Physics-informed embodied world model. arXiv preprint arXiv:2506.23135, 2025.

Shao et al. (2023) Y. Shao, L. Li, J. Dai, and X. Qiu. Character-LLM: A trainable agent for role-playing. In Conference on Empirical Methods in Natural Language Processing, pages 13153–13187, 2023.

Shen et al. (2026) Z. Shen, X. Hu, X. Li, T. Fang, J. Li, and S. Zhang. World-model-augmented web agents with action correction. arXiv preprint arXiv:2602.15384, 2026.

Shi et al. (2017) T. Shi, A. Karpathy, L. Fan, J. Hernandez, and P. Liang. World of Bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144, 2017.

Singh et al. (2024) P. K. Singh, K. A. Farrell-Maupin, and D. Faghihi. A framework for strategic discovery of credible neural network surrogate models under uncertainty. Computer Methods in Applied Mechanics and Engineering, 427:117061, 2024.

Song et al. (2023a) X. Song, W. Yao, Y. Fan, X. Dong, G. Chen, J. C. Niebles, E. Xing, and K. Zhang. Temporally disentangled representation learning under unknown nonstationarity. In Advances in Neural Information Processing Systems, volume 36, pages 8092–8113, 2023a.

Song and Dhariwal (2024) Y. Song and P. Dhariwal. Improved techniques for training consistency models. In International Conference on Learning Representations, 2024.

Song et al. (2023b) Y. Song, P. Dhariwal, M. Chen, and I. Sutskever. Consistency models. In International Conference on Machine Learning, volume 202, pages 32211–32252. PMLR, 2023b.

Sparkes et al. (2010) A. Sparkes, W. Aubrey, E. Byrne, A. Clare, M. N. Khan, M. Liakata, M. Markham, J. Rowland, L. N. Soldatova, K. E. Whelan, M. Young, and R. D. King. Towards robot scientists for autonomous scientific discovery. Automated Experimentation, 2(1):1, 2010.

Stalnaker (1968) R. C. Stalnaker. A theory of conditionals. In Studies in Logical Theory, volume 2 of American Philosophical Quarterly Monograph Series, pages 98–112. Blackwell, 1968.

Stanić et al. (2023) A. Stanić, Y. Tang, D. Ha, and J. Schmidhuber. Learning to generalize with object-centric agents in the open world survival game Crafter. IEEE Transactions on Games, 16(2):384–395, 2023.

Strieth-Kalthoff et al. (2024) F. Strieth-Kalthoff, H. Hao, V. Rathore, J. Derasp, T. Gaudin, N. H. Angello, M. Seifrid, E. Trushina, M. Guy, J. Liu, X. Tang, M. Mamada, et al. Delocalized, asynchronous, closed-loop discovery of organic laser emitters. Science, 384(6697):eadk9227, 2024.

Su et al. (2025a) A. Su, H. Wang, W. Ren, F. Lin, and W. Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025a.

Su et al. (2025b) Z. Su, Z. Chen, W. Shen, H. Wei, L. Li, H. Yu, and K. Yuan. RotateKV: Accurate and robust 2-bit KV cache quantization for LLMs via outlier-aware adaptive rotations. arXiv preprint arXiv:2501.16383, 2025b.

Sumers et al. (2024) T. R. Sumers, S. Yao, K. Narasimhan, and T. L. Griffiths. Cognitive architectures for language agents. Transactions on Machine Learning Research, 2024.

Sun et al. (2025a) Q. Sun, L. Yang, W. Tang, W. Huang, K. Xu, Y. Chen, M. Liu, J. Yang, H. Zhu, Y. Wang, T. He, Y. Chen, X. Dai, N. Ye, and Q. Gu. Learning primitive embodied world models: Towards scalable robotic learning. arXiv preprint arXiv:2508.20840, 2025a.

Sun et al. (2025b) W. Sun, H. Zhang, H. Wang, J. Wu, Z. Wang, Z. Wang, Y. Wang, J. Zhang, T. Wang, and C. Guo. WorldPlay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614, 2025b.

Sutton (1991) R. S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.

Szot et al. (2021) A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y. Zhao, J. Turner, N. Maestre, M. Mukadam, D. Chaplot, O. Maksymets, A. Gokaslan, V. Vondruš, S. Dharur, F. Meier, W. Galuba, A. Chang, Z. Kira, V. Koltun, J. Malik, M. Savva, and D. Batra. Habitat 2.0: Training home assistants to rearrange their habitat. In Advances in Neural Information Processing Systems, volume 34, pages 251–266, 2021.

Szymanski et al. (2023) N. J. Szymanski, B. Rendy, Y. Fei, R. E. Kumar, T. He, D. Milsted, M. J. McDermott, M. Gallant, E. D. Cubuk, A. Merchant, H. Kim, A. Jain, C. J. Bartel, K. Persson, Y. Zeng, and G. Ceder. An autonomous laboratory for the accelerated synthesis of inorganic materials. Nature, 624:86–91, 2023.

Tang et al. (2024) H. Tang, D. Key, and K. Ellis. WorldCoder, a model-based LLM agent: Building world models by writing code and interacting with the environment. In Advances in Neural Information Processing Systems, volume 37, pages 70148–70212, 2024.

Tao et al. (2024) S. Tao, F. Xiang, A. Shukla, Y. Qin, X. Hinrichsen, X. Yuan, C. Bao, X. Lin, Y. Liu, T.-K. Chan, Y. Gao, X. Li, T. Mu, N. Xiao, A. Gurha, V. N. Rajesh, Y. W. Choi, Y.-R. Chen, Z. Huang, R. Calandra, R. Chen, S. Luo, and H. Su. ManiSkill3: GPU parallelized robotics simulation and rendering for generalizable embodied AI. arXiv preprint arXiv:2410.00425, 2024.

Tassa et al. (2020) Y. Tassa, S. Tunyasuvunakool, A. Muldal, Y. Doron, P. Trochim, S. Liu, S. Bohez, J. Merel, T. Erez, T. Lillicrap, and N. Heess. dm_control: Software and tasks for continuous control. Software Impacts, 6:100022, 2020.

Taubenfeld et al. (2024) A. Taubenfeld, Y. Dover, R. Reichart, and A. Goldstein. Systematic biases in LLM simulations of debates. In Conference on Empirical Methods in Natural Language Processing, pages 251–267, 2024.

Telang et al. (2021) P. R. Telang, M. P. Singh, and N. Yorke-Smith. Maintenance of social commitments in multiagent systems. In AAAI Conference on Artificial Intelligence, volume 35, pages 11369–11377, 2021.

Tobin et al. (2017) J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2017.

Todorov et al. (2012) E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.

Tu et al. (2025) S. Tu, X. Zhou, D. Liang, X. Jiang, Y. Zhang, X. Li, and X. Bai. The role of world models in shaping autonomous driving: A comprehensive survey. arXiv preprint arXiv:2502.10498, 2025.

Turing (1950) A. M. Turing. Computing machinery and intelligence. Mind, 59(236):433–460, 1950.

Unterthiner et al. (2018) T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.

Vafa et al. (2024) K. Vafa, J. Y. Chen, A. Rambachan, J. Kleinberg, and S. Mullainathan. Evaluating the world model implicit in a generative model. In Advances in Neural Information Processing Systems, volume 37, pages 26941–26975, 2024.

Valevski et al. (2025) D. Valevski, Y. Leviathan, M. Arar, and S. Fruchter. Diffusion models are real-time game engines. In International Conference on Learning Representations, 2025.

Vallinder and Hughes (2025) A. Vallinder and E. Hughes. Cultural evolution of cooperation among LLM agents. In International Conference on Autonomous Agents and Multiagent Systems, pages 2771–2773, 2025.

van de Ven et al. (2024) G. M. van de Ven, N. Soures, and D. Kudithipudi. Continual learning and catastrophic forgetting. arXiv preprint arXiv:2403.05175, 2024.

van den Oord et al. (2017) A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems, 2017.

van Es et al. (2025) M. W. van Es, C. Higgins, C. Gohil, A. J. Quinn, D. Vidaurre, and M. W. Woolrich. Large-scale cortical functional networks are organized in structured cycles. Nature Neuroscience, 28(10):2118–2128, 2025.

Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.

Vidaurre et al. (2018) D. Vidaurre, L. T. Hunt, A. J. Quinn, B. A. Hunt, M. J. Brookes, A. C. Nobre, and M. W. Woolrich. Spontaneous cortical activity transiently organises into frequency specific phase-coupling networks. Nature Communications, 9(1):2987, 2018.

Wang et al. (2025a) C. Wang, H. Wang, X. Chen, J. Liu, T. Xue, C. Peng, D. Qi, F. Lin, and Y. Yan. From illusion to intention: Visual rationale learning for vision-language reasoning. arXiv preprint arXiv:2511.23031, 2025a.

Wang et al. (2024a) F.-Y. Wang, Z. Huang, A. W. Bergman, D. Shen, P. Gao, M. Lingelbach, K. Sun, W. Bian, G. Song, Y. Liu, X. Wang, and H. Li. Phased consistency models. In Advances in Neural Information Processing Systems, volume 37, pages 83951–84009, 2024a.

Wang et al. (2024b) G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024b.

Wang et al. (2025b) H. Wang, L. Li, C. Qu, W. Xu, F. Zhu, W. Chu, and F. Lin. To code or not to code? Adaptive tool integration for math language models via expectation-maximization. In Annual Meeting of the Association for Computational Linguistics, pages 3060–3075, 2025b.

Wang et al. (2025c) H. Wang, C. Qu, Z. Huang, W. Chu, F. Lin, and W. Chen. VL-Rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025c.

Wang et al. (2025d) H. Wang, X. Ye, F. Tao, C. Pan, A. Mallik, B. Yaman, L. Ren, and J. Zhang. AdaWM: Adaptive world model based planning for autonomous driving. In International Conference on Learning Representations, 2025d.

Wang et al. (2026a) J. Wang, Y. Jiang, T. He, J. Sun, Q. Zhang, J. He, J. Cao, Z. Gan, M. Sun, Q. Shao, and X. Yue. MVISTA-4D: View-consistent 4D world model with test-time action inference for robotic manipulation. arXiv preprint arXiv:2602.09878, 2026a.

Wang et al. (2024c) Q. Wang, J. Yang, Y. Wang, X. Jin, W. Zeng, and X. Yang. Making offline RL online: Collaborative world models for offline visual reinforcement learning. In Advances in Neural Information Processing Systems, volume 37, pages 97203–97230, 2024c.

Wang et al. (2022) R. Wang, P. Jansen, M.-A. Côté, and P. Ammanabrolu. ScienceWorld: Is your agent smarter than a 5th grader? In Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022.

Wang et al. (2024d) R. Wang, G. Todd, Z. Xiao, X. Yuan, M.-A. Côté, P. Clark, and P. Jansen. Can language models serve as text-based world simulators? In Annual Meeting of the Association for Computational Linguistics, 2024d.

Wang et al. (2024e) R. Wang, H. Yu, W. Zhang, Z. Qi, M. Sap, Y. Bisk, G. Neubig, and H. Zhu. Sotopia-π: Interactive learning of socially intelligent language agents. In Annual Meeting of the Association for Computational Linguistics, pages 12912–12940, 2024e.

Wang et al. (2024f) S. Wang, C. Liu, Z. Zheng, S. Qi, S. Chen, Q. Yang, A. Zhao, C. Wang, S. Song, and G. Huang. Boosting LLM agents with recursive contemplation for effective deception handling. In Annual Meeting of the Association for Computational Linguistics, pages 9909–9953, 2024f.

Wang et al. (2024g) T. Wang, H. Dong, Y. Jiang, D. C. Parkes, and M. Tambe. On diffusion models for multi-agent partial observability: Shared attractors, error bounds, and composite flow. arXiv preprint arXiv:2410.13953, 2024g.

Wang et al. (2019) X. Wang, W. Shi, R. Kim, Y. Oh, S. Yang, J. Zhang, and Z. Yu. Persuasion for good: Towards a personalized persuasive dialogue system for social good. In Annual Meeting of the Association for Computational Linguistics, pages 5635–5649, 2019.

Wang et al. (2024h) X. Wang, Z. Zhu, G. Huang, X. Chen, J. Zhu, and J. Lu. DriveDreamer: Towards real-world-driven world models for autonomous driving. In European Conference on Computer Vision, pages 55–72. Springer, 2024h.

Wang et al. (2025e) Y. Wang, W. Luo, J. Bai, Y. Cao, T. Che, K. Chen, Y. Chen, J. Diamond, Y. Ding, W. Ding, L. Feng, G. Heinrich, J. Huang, P. Karkus, B. Li, P. Li, T.-Y. Lin, D. Liu, M.-Y. Liu, L. Liu, Z. Liu, J. Lu, Y. Mao, P. Molchanov, L. Pavao, Z. Peng, M. Ranzinger, E. Schmerling, S. Shen, Y. Shi, S. Tariq, R. Tian, T. Wekel, X. Weng, T. Xiao, E. Yang, X. Yang, Y. You, X. Zeng, W. Zhang, B. Ivanovic, and M. Pavone. Alpamayo-R1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail. arXiv preprint arXiv:2511.00088, 2025e.

Wang et al. (2025f) Z. Wang, Y. Zhang, X. Yue, X. Yue, Y. Li, W. Ouyang, and L. Bai. Transition models: Rethinking the generative learning objective. arXiv preprint arXiv:2509.04394, 2025f.

Wang et al. (2026b) Z. Wang, Z. Liu, J. Li, K. Huang, B. Xu, F. Kang, M. An, P. Wang, B. Jiang, Y. Wei, Y. Xietian, J. Pei, L. Hu, B. Jiang, H. Xue, Z. Wang, H. Sun, W. Li, W. Ouyang, X. He, Y. Liu, Y. Li, and Y. Zhou. Matrix-Game 3.0: Real-time and streaming interactive world model with long-horizon memory. arXiv preprint arXiv:2604.08995, 2026b.

Wang et al. (2024i) Z. M. Wang, Z. Peng, H. Que, J. Liu, W. Zhou, Y. Wu, H. Guo, R. Gan, Z. Ni, J. Yang, M. Zhang, Z. Zhang, W. Ouyang, K. Xu, S. W. Huang, J. Fu, and J. Peng. RoleLLM: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. In Annual Meeting of the Association for Computational Linguistics, pages 14743–14777, 2024i.

Watter et al. (2015) M. Watter, J. T. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, volume 28, pages 2746–2754, 2015.

Wei et al. (2025a) H. Wei, Z. Zhang, S. He, T. Xia, S. Pan, and F. Liu. PlanGenLLMs: A modern survey of LLM planning capabilities. arXiv preprint arXiv:2502.11221, 2025a.

Wei et al. (2025b) J. Wei, Y. Yang, X. Zhang, Y. Chen, X. Zhuang, Z. Gao, D. Zhou, G. Wang, Z. Gao, J. Cao, Z. Qiu, M. Hu, C. Ma, S. Tang, J. He, C. Song, X. He, Q. Zhang, C. You, S. Zheng, N. Ding, W. Ouyang, N. Dong, Y. Cheng, S. Sun, L. Bai, and B. Zhou. From AI for science to agentic science: A survey on autonomous scientific discovery. arXiv preprint arXiv:2508.14111, 2025b.

Wilf et al. (2024) A. Wilf, S. Lee, P. P. Liang, and L.-P. Morency. Think twice: Perspective-taking improves large language models’ theory-of-mind capabilities. In Annual Meeting of the Association for Computational Linguistics, pages 8292–8308, 2024.

Wolpert (1996) D. H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390, 1996.

World Labs team (2025a) World Labs team. Marble: A multimodal world model. World Labs Technical Post, 2025a. URL https://www.worldlabs.ai/blog/marble-world-model.

World Labs team (2025b) World Labs team. RTFM: A real-time frame model. World Labs Research Preview, 2025b. URL https://www.worldlabs.ai/blog/rtfm.

Wu et al. (2024a) H. Wu, Y. Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. In International Conference on Learning Representations, 2024a.

Wu et al. (2024b) J. Wu, H. Wang, Y. Shang, M. Shah, and Y. Yan. PTQ4DiT: Post-training quantization for diffusion transformers. In Advances in Neural Information Processing Systems, volume 37, pages 62732–62755, 2024b.

Wu et al. (2024c) J. Wu, S. Yin, N. Feng, X. He, D. Li, J. Hao, and M. Long. iVideoGPT: Interactive VideoGPTs are scalable world models. In Advances in Neural Information Processing Systems, volume 37, pages 68082–68119, 2024c.

Wu et al. (2023a) P. Wu, A. Escontrela, D. Hafner, K. Goldberg, and P. Abbeel. DayDreamer: World models for physical robot learning. In Conference on Robot Learning. PMLR, 2023a.

Wu et al. (2023b) Y. Wu, Y. He, Y. Jia, R. Mihalcea, Y. Chen, and N. Deng. Hi-ToM: A benchmark for evaluating higher-order theory of mind reasoning in large language models. In Conference on Empirical Methods in Natural Language Processing, pages 10691–10706, 2023b.

Xia et al. (2024) H. Xia, Z.-H. Lin, W.-C. Ma, and S. Wang. Video2Game: Real-time, interactive, realistic and browser-compatible environment from a single video. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4588, 2024.

Xiang et al. (2020) F. Xiang, Y. Qin, K. Mo, Y. Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y. Yuan, H. Wang, L. Yi, A. X. Chang, L. J. Guibas, and H. Su. SAPIEN: A simulated part-based interactive environment. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11097–11107, 2020.

Xiao et al. (2024) G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, 2024.

Xiao et al. (2026) Z. Xiao, J. Tu, C. Zou, Y. Zuo, Z. Li, P. Wang, B. Yu, F. Huang, J. Lin, and Z. Liu. WebWorld: A large-scale world model for web agent training. arXiv preprint arXiv:2602.14721, 2026.

Xie et al. (2024) T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Advances in Neural Information Processing Systems, volume 37, pages 52040–52094, 2024.

Xing et al. (2024) J. Xing, M. Xia, Y. Zhang, H. Chen, W. Yu, H. Liu, X. Wang, T.-T. Wong, and Y. Shan. DynamiCrafter: Animating open-domain images with video diffusion priors. In European Conference on Computer Vision, pages 399–417. Springer, 2024.

Xu et al. (2024a) H. Xu, R. Zhao, L. Zhu, J. Du, and Y. He. OpenToM: A comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models. In Annual Meeting of the Association for Computational Linguistics, pages 8593–8623, 2024a.

Xu et al. (2026a) X. Xu, H. Li, J. Ye, Y. Chen, J. Zeng, X. Chen, L. Xu, D. Lin, W. Li, and J. Pang. FutureVLA: Joint visuomotor prediction for vision-language-action model. arXiv preprint arXiv:2603.10712, 2026a.

Xu et al. (2026b) X. Xu, A. Liang, Y. Liu, L. Li, L. Kong, Z. Liu, and Q. Liu. U4D: Uncertainty-aware 4D world modeling from LiDAR sequences. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026b.

Xu et al. (2024b) Y. Xu, Z. Wang, J. Wang, D. Lu, T. Xie, A. Saha, D. Sahoo, T. Yu, and C. Xiong. Aguvis: Unified pure vision agents for autonomous GUI interaction. arXiv preprint arXiv:2412.04454, 2024b.

Xu et al. (2023) Z. Xu, C. Yu, F. Fang, Y. Wang, and Y. Wu. Language agents with reinforcement learning for strategic play in the Werewolf game. arXiv preprint arXiv:2310.18940, 2023.

Yamada et al. (2025) Y. Yamada, R. T. Lange, C. Lu, S. Hu, C. Lu, J. Foerster, J. Clune, and D. Ha. The AI Scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025.

Yan et al. (2026) T. Yan, T. Tang, X. Gui, Y. Li, J. Zhesng, W. Huang, L. Kong, W. Han, X. Zhou, X. Zhang, Y. Zhan, K. Zhan, C.-Z. Xu, and J. Shen. AD-R1: Closed-loop reinforcement learning for end-to-end autonomous driving with impartial world models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026.

Yang et al. (2026) C. Yang, X. Lin, S. Li, W. Wang, R. Guo, F. Feng, and T.-S. Chua. Can large language models derive new knowledge? A dynamic benchmark for biological knowledge discovery. arXiv preprint arXiv:2603.03322, 2026.

Yang et al. (2024a) J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems, volume 37, pages 50528–50652, 2024a.

Yang et al. (2025a) P. Yang, H. Ci, and M. Z. Shou. macOSWorld: A multilingual interactive benchmark for GUI agents. arXiv preprint arXiv:2506.04135, 2025a.

Yang et al. (2024b) S. Yang, Y. Du, S. K. S. Ghasemipour, J. Tompson, L. P. Kaelbling, D. Schuurmans, and P. Abbeel. Learning interactive real-world simulators. In International Conference on Learning Representations, 2024b.

Yang et al. (2025b) S. Yang, W. Huang, R. Chu, Y. Xiao, Y. Zhao, X. Wang, M. Li, E. Xie, Y. Chen, Y. Lu, S. Han, and Y. Chen. LongLive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025b.

Yang et al. (2025c) S. Yang, H. Xi, Y. Zhao, M. Li, J. Zhang, H. Cai, Y. Lin, X. Li, C. Xu, K. Peng, J. Chen, S. Han, K. Keutzer, and I. Stoica. Sparse VideoGen2: Accelerate video generation with sparse attention via semantic-aware permutation. arXiv preprint arXiv:2505.18875, 2025c.

Yang et al. (2024c) Z. Yang, X. Du, J. Li, J. Zheng, S. Poria, and E. Cambria. Large language models for automated open-domain scientific hypotheses discovery. In Annual Meeting of the Association for Computational Linguistics, pages 13545–13565, 2024c.

Yang et al. (2024d) Z. Yang, Z. Zhang, Z. Zheng, Y. Jiang, Z. Gan, Z. Wang, Z. Ling, J. Chen, M. Ma, B. Dong, P. Gupta, S. Hu, Z. Yin, G. Li, X. Jia, L. Wang, B. Ghanem, H. Lu, C. Lu, W. Ouyang, Y. Qiao, P. Torr, and J. Shao. Oasis: Open agent social interaction simulations with one million agents. arXiv preprint arXiv:2411.11581, 2024d.

Yu et al. (2025a) J. Yu, Y. Qin, X. Wang, P. Wan, D. Zhang, and X. Liu. GameFactory: Creating new games with generative interactive videos. In IEEE/CVF International Conference on Computer Vision, pages 11590–11599, 2025a.

Yu et al. (2020) T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pages 1094–1100, 2020.

Yu et al. (2025b) X. Yu, X. Qi, Z. Li, K. Zhang, R. Zhang, Z. Lin, E. Shechtman, T. Wang, and Y. Nitzan. Self-evaluation unlocks any-step text-to-image generation. arXiv preprint arXiv:2512.22374, 2025b.

Yu et al. (2026) X. Yu, B. Peng, R. Xu, Y. Shen, P. He, S. Nath, N. Singh, J. Gao, and Z. Yu. Reinforcement world model learning for LLM-based agents. arXiv preprint arXiv:2602.05842, 2026.

Yue et al. (2025) J. Yue, Z. Huang, Z. Chen, X. Wang, P. Wan, and Z. Liu. Simulating the visual world with artificial intelligence: A roadmap. arXiv preprint arXiv:2511.08585, 2025.

Zeng et al. (2025) Z. Zeng, J. Liu, S. Chen, T. He, Y. Liao, Y. Tian, J. Wang, Z. Wang, Y. Yang, L. Yin, M. Yin, Z. Zhu, T. Cai, Z. Chen, J. Chen, Y. Du, X. Gao, J. Guo, L. Hu, J. Jiao, X. Li, J. Liu, S. Ni, Z. Wen, G. Zhang, K. Zhang, X. Zhou, J. Blanchet, X. Qiu, M. Wang, and W. Huang. FutureX: An advanced live benchmark for LLM agents in future prediction. arXiv preprint arXiv:2508.11987, 2025.

Zhang et al. (2025a) C. Zhang, Z. Yang, J. Liu, Y. Han, X. Chen, Z. Huang, B. Fu, and G. Yu. AppAgent: Multimodal agents as smartphone users. In CHI Conference on Human Factors in Computing Systems, 2025a.

Zhang et al. (2025b) D. Zhang, J. Lei, J. Li, X. Wang, Y. Liu, Z. Yang, J. Li, W. Wang, S. Yang, J. Wu, P. Ye, W. Ouyang, and D. Zhou. Critic-V: VLM critics help catch VLM errors in multimodal reasoning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9050–9061, 2025b.

Zhang et al. (2026a) H. Zhang, G.-H. Yuan, C. Yuan, T. Xu, T. Bian, H. Cheng, W. Huang, D. Zhao, and Y. Rong. Lingshu-Cell: A generative cellular world model for transcriptome modeling toward virtual cells. arXiv preprint arXiv:2603.25240, 2026a.

Zhang et al. (2024a) L. Zhang, Y. Xiong, Z. Yang, S. Casas, R. Hu, and R. Urtasun. Copilot4D: Learning unsupervised world models for autonomous driving via discrete diffusion. In International Conference on Learning Representations, 2024a.

Zhang et al. (2025c) L. Zhang, S. Cai, M. Li, G. Wetzstein, and M. Agrawala. Frame context packing and drift prevention in next-frame-prediction video diffusion models. arXiv preprint arXiv:2504.12626, 2025c.

Zhang et al. (2025d) P.-F. Zhang, Y. Cheng, X. Sun, S. Wang, F. Li, L. Zhu, and H. T. Shen. A step toward world models: A survey on robotic manipulation. arXiv preprint arXiv:2511.02097, 2025d.

Zhang and Chen (2023) Q. Zhang and Y. Chen. Fast sampling of diffusion models with exponential integrator. In International Conference on Learning Representations, 2023.

Zhang et al. (2023a) W. Zhang, G. Wang, J. Sun, Y. Yuan, and G. Huang. STORM: Efficient stochastic transformer based world models for reinforcement learning. In Advances in Neural Information Processing Systems, volume 36, 2023a.

Zhang et al. (2026b) W. Zhang, B. Terver, A. Zholus, S. Chitnis, H. Sutaria, M. Assran, A. Bar, R. Balestriero, A. Bardes, Y. LeCun, and N. Ballas. Hierarchical planning with latent world models. arXiv preprint arXiv:2604.03208, 2026b.

Zhang et al. (2025e) X. Zhang, Y. Huang, C. Ma, Z. Chen, L. Ma, Y. Du, S.-C. Zhu, Y. Yang, and X. Feng. Social world model-augmented mechanism design policy learning. arXiv preprint arXiv:2510.19270, 2025e.

Zhang et al. (2025f) X. Zhang, J. Lin, X. Mou, S. Yang, X. Liu, L. Sun, H. Lyu, Y. Yang, W. Qi, Y. Chen, G. Li, L. Yan, Y. Hu, S. Chen, Y. Wang, X. Huang, J. Luo, S. Tang, L. Wu, B. Zhou, and Z. Wei. SocioVerse: A world model for social simulation powered by LLM agents and a pool of 10 million real-world users. arXiv preprint arXiv:2504.10157, 2025f.

Zhang et al. (2025g) X. Zhang, W. Zhang, A. Wang, S.-K. Ng, and Y. Deng. MASim: Multilingual agent-based simulation for social science. arXiv preprint arXiv:2512.07195, 2025g.

Zhang et al. (2026c) X. Zhang, Z. He, Y. Zhu, S. Wu, S. Yu, M. Chu, W. Zhang, H. Tan, and J. Jia. SearchGym: Bootstrapping real-world search agents via cost-effective and high-fidelity environment simulation. arXiv preprint arXiv:2601.14615, 2026c.

Zhang et al. (2026d) X. Zhang, S. Wu, Y. Zhu, H. Tan, S. Yu, Z. He, and J. Jia. Scaf-GRPO: Scaffolded group relative policy optimization for enhancing LLM reasoning. In International Conference on Learning Representations, 2026d.

Zhang et al. (2025h) Y. Zhang, S. Mao, T. Ge, X. Wang, Y. Xia, M. Lan, and F. Wei. K-Level reasoning: Establishing higher order beliefs in large language models for strategic reasoning. In Findings of the Association for Computational Linguistics: NAACL, pages 7212–7234, 2025h.

Zhang et al. (2023b) Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, Z. Wang, and B. Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems, volume 36, pages 34661–34710, 2023b.

Zhang et al. (2024b) Z. Zhang, Y. Li, Y. Wu, Y. Xu, A. Kag, I. Skorokhodov, W. Menapace, A. Siarohin, J. Cao, D. Metaxas, S. Tulyakov, and J. Ren. SF-V: Single forward video generation model. In Advances in Neural Information Processing Systems, volume 37, pages 103599–103618, 2024b.

Zhang et al. (2025i) Z. Zhang, Z. Qiu, Y. Wu, S. Li, D. Wang, Z. Zhou, D. An, Y. Chen, Y. Li, Y. Wang, C. Ou, Z. Wang, J. X. Chen, B. Zhang, Y. Hu, W. Zhang, Z. Wei, R. Ma, Q. Liu, B. Dong, Y. He, Q. Feng, L. Bai, Q. Gao, S. Sun, and S. Zheng. OriGene: A self-evolving virtual disease biologist automating therapeutic target discovery. bioRxiv 2025.06.03.657658, 2025i.

Zhang et al. (2025j) Z. Zhang, Q. Zhang, W. Cui, S. Shi, Y. Guo, G. Han, W. Zhao, J. Sun, J. Cao, J. Wang, H. Cheng, X. Ju, Z. Che, R. Xu, and J. Tang. Occupancy world model for robots. arXiv preprint arXiv:2505.05512, 2025j.

Zhao et al. (2025) B. Zhao, L. G. Foo, P. Hu, C. Theobalt, H. Rahmani, and J. Liu. LLM-based agentic reasoning frameworks: A survey from methods to scenarios. arXiv preprint arXiv:2508.17692, 2025.

Zhao et al. (2026) H. Zhao, S. Zhou, H. Yang, Z. Qin, and T. Zhou. Neuro-symbolic synergy for interactive world modeling. arXiv preprint arXiv:2602.10480, 2026.

Zhao et al. (2024) T. Zhao, T. Fang, H. Huang, E. Liu, R. Wan, W. Soedarmadji, S. Li, Z. Lin, G. Dai, S. Yan, H. Yang, X. Ning, and Y. Wang. ViDiT-Q: Efficient and accurate quantization of diffusion transformers for image and video generation. arXiv preprint arXiv:2406.02540, 2024.

Zhen et al. (2025) H. Zhen, Q. Sun, H. Zhang, J. Li, S. Zhou, Y. Du, and C. Gan. TesserAct: Learning 4D embodied world models. arXiv preprint arXiv:2504.20995, 2025.

Zheng et al. (2025a) D. Zheng, Z. Huang, H. Liu, K. Zou, Y. He, F. Zhang, Y. Zhang, J. He, W.-S. Zheng, Y. Qiao, and Z. Liu. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025a.

Zheng et al. (2023) K. Zheng, C. Lu, J. Chen, and J. Zhu. DPM-Solver-v3: Improved diffusion ODE solver with empirical model statistics. In Advances in Neural Information Processing Systems, volume 36, pages 55502–55542, 2023.

Zheng et al. (2025b) K. Zheng, Y. Wang, Q. Ma, H. Chen, J. Zhang, Y. Balaji, J. Chen, M.-Y. Liu, J. Zhu, and Q. Zhang. Large scale diffusion distillation via score-regularized continuous-time consistency. arXiv preprint arXiv:2510.08431, 2025b.

Zheng et al. (2024) W. Zheng, W. Chen, Y. Huang, B. Zhang, Y. Duan, and J. Lu. OccWorld: Learning a 3D occupancy world model for autonomous driving. In European Conference on Computer Vision, pages 55–72. Springer, 2024.

Zheng et al. (2025c) X. Zheng, H. Lin, K. He, Z. Wang, Q. Fu, H. Fu, Z. Zheng, and Y. Liang. MCU: An evaluation framework for open-ended game agents. In International Conference on Machine Learning, pages 78221–78259. PMLR, 2025c.

Zheng et al. (2026) Y. Zheng, L. Zhong, Y. Wang, R. Dai, K. Liu, X. Chu, L. Lv, P. Torr, and K. Q. Lin. Code2World: A GUI world model via renderable code generation. arXiv preprint arXiv:2602.09856, 2026.

Zhou et al. (2024a) M. Zhou, H. Zheng, Z. Wang, M. Yin, and H. Huang. Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. In International Conference on Machine Learning, pages 62307–62331. PMLR, 2024a.

Zhou et al. (2024b) S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, and G. Neubig. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations, 2024b.

Zhou et al. (2024c) X. Zhou, H. Zhu, L. Mathur, R. Zhang, H. Yu, Z. Qi, L.-P. Morency, Y. Bisk, D. Fried, G. Neubig, and M. Sap. SOTOPIA: Interactive evaluation for social intelligence in language agents. In International Conference on Learning Representations, 2024c.

Zhou et al. (2025a) X. Zhou, D. Liang, S. Tu, X. Chen, Y. Ding, D. Zhang, F. Tan, H. Zhao, and X. Bai. HERMES: A unified self-driving world model for simultaneous 3D scene understanding and generation. In IEEE/CVF International Conference on Computer Vision, pages 27817–27827, 2025a.

Zhou et al. (2025b) X. Zhou, J. Liu, A. Yerukola, H. Kim, and M. Sap. Social world models. arXiv preprint arXiv:2509.00559, 2025b.

Zhu et al. (2025) H. Zhu, Y. Wang, J. Zhou, W. Chang, Y. Zhou, Z. Li, J. Chen, C. Shen, J. Pang, and T. He. Aether: Geometric-aware unified world modeling. In IEEE/CVF International Conference on Computer Vision, pages 8535–8546, 2025.

Zhu et al. (2024) Z. Zhu, X. Wang, W. Zhao, C. Min, B. Li, N. Deng, M. Dou, Y. Wang, B. Shi, K. Wang, C. Zhang, Y. You, Z. Zhang, D. Zhao, L. Xiao, J. Zhao, J. Lu, and G. Huang. Is Sora a world simulator? A comprehensive survey on general world models and beyond. arXiv preprint arXiv:2405.03520, 2024.

Autonomous driving is a particularly clear L2 setting because useful rollouts must jointly preserve geometric accuracy (lane structure, free space), dynamic consistency (vehicle kinematics, traffic flow), and counterfactual sensitivity: if the ego vehicle brakes earlier or changes lanes, surrounding trajectories and occupancy should update coherently rather than merely continuing the same scene (Hu et al., 2025a; Wang et al., 2025e; Liang et al., 2026a). Earlier systems like GAIA-1 (Hu et al., 2023) and DriveWorld (Min et al., 2024) established scene generation conditioned on control signals.

Subsequent work has branched along two main axes. Along the representation axis, Copilot4D (Zhang et al., 2024a) introduced unsupervised 4D modeling via discrete diffusion on LiDAR point clouds, OccWorld (Zheng et al., 2024) moved to 3D occupancy with a GPT-like spatial-temporal transformer, and HERMES (Zhou et al., 2025a) unified BEV scene understanding with future generation. Along the fidelity–controllability axis, VISTA (Gao et al., 2024) demonstrated 576×1024 resolution at 10 Hz with 15-second coherent rollouts, while DriveDreamer (Wang et al., 2024h) built world models entirely from naturalistic driving data using a diffusion backbone. AD-R1 (Yan et al., 2026) builds the first closed-loop simulator by combining impartial world modeling with a rich curriculum of plausible collisions and off-road events.

A further line of work concerns policy alignment under fine-tuning rather than base representation alone: AdaWM (Wang et al., 2025d) addresses representation degradation during RL fine-tuning via low-rank alignment that preserves pre-trained structure while adapting to new driving policies. This progression also marks a shift from open-loop scene generation to closed-loop control support, where actions are not merely conditioning variables but candidate interventions whose consequences must be compared before execution.
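The idea of treating actions as candidate interventions can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the two-number state, the `toy_step` dynamics (a stationary lead vehicle), and the cost terms are hypothetical stand-ins for a learned driving world model and are not taken from any of the cited systems.

```python
from typing import Callable, Sequence

# Hypothetical toy state: (gap to a stationary lead vehicle in m, ego speed in m/s).
State = tuple[float, float]


def toy_step(state: State, accel: float, dt: float = 0.5) -> State:
    """Stand-in for a learned one-step transition model (assumed dynamics)."""
    gap, speed = state
    new_speed = max(0.0, speed + accel * dt)
    return (gap - new_speed * dt, new_speed)


def rollout_cost(
    state: State,
    accel: float,
    step: Callable[[State, float], State],
    horizon: int = 10,
) -> float:
    """Roll the model forward under a constant action and score the imagined outcome."""
    cost = 0.0
    for _ in range(horizon):
        state = step(state, accel)
        gap, speed = state
        if gap <= 0.0:
            return float("inf")  # predicted collision: reject this intervention
        # Illustrative cost: penalize harsh control, reward forward progress.
        cost += 0.1 * abs(accel) - 0.01 * speed
    return cost


def choose_action(
    state: State,
    candidates: Sequence[float],
    step: Callable[[State, float], State] = toy_step,
) -> float:
    """Compare imagined consequences of each candidate before executing any of them."""
    return min(candidates, key=lambda a: rollout_cost(state, a, step))


if __name__ == "__main__":
    # Close to the obstacle, only hard braking avoids a predicted collision.
    print(choose_action((20.0, 10.0), [0.0, -2.0, -4.0]))
    # With ample free space, keeping speed has the lowest cost.
    print(choose_action((100.0, 10.0), [0.0, -2.0, -4.0]))
```

In a closed-loop setting the agent would execute only the first step of the selected action and then re-plan from the newly observed state, so model errors are corrected by fresh observations rather than accumulating over the full horizon.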

Software, Web, and Game Systems

Game world models.

Game worlds occupy a distinctive position at the intersection of physical and digital intelligence: visual dynamics follow physics-like rules (rendering, object motion, collision), yet transitions are ultimately governed by deterministic game logic (score updates, level triggers, inventory changes). This overlap makes games a natural testbed for world models that must integrate perceptual prediction with rule-based reasoning. NitroGen (Magne et al., 2026), NVIDIA’s open vision-action foundation model trained on 40K hours of gameplay across 1000+ games, achieves a 52% improvement on unseen games via large-scale behavior cloning. Earlier work at L1, including DIAMOND (Alonso et al., 2024) and Genie (Bruce et al., 2024) (Section 3), established frame-by-frame prediction; the L2 challenge is long-horizon, action-conditioned simulation respecting both visual dynamics and underlying game rules. GameNGen (Valevski et al., 2025) demonstrated that a diffusion model trained on DOOM gameplay can serve as a real-time neural game engine at 20 FPS, generating interactive frames indistinguishable from the original engine. Video2Game (Xia et al., 2024) converts a single video into an interactive 3D game-like environment with real-time physics and rendering, bridging passive video understanding with interactive world simulation. Across these domains, state includes DOM structure, focus, file system, and application state machines; evaluable tasks span OS (Xie et al., 2024; Yang et al., 2025a), web (Zhou et al., 2024b; Deng et al., 2023; Yao et al., 2022), and software debugging workflows (Jimenez et al., 2024; Yang et al., 2024a; Shi et al., 2017).

Social Simulation and Multi-Agent Systems

ToM prompting and reasoning.

Structured prompting strategies suggest the bottleneck in social reasoning is reasoning structure rather than knowledge. SymbolicToM (Sclar et al., 2023) constructs explicit per-character belief graphs after each story event, supporting up to third-order beliefs through graph traversal (ACL 2023 Outstanding Paper). SimToM (Wilf et al., 2024) implements perspective-taking as a two-stage process inspired by Simulation Theory from cognitive science: first filtering context to what the target character knows, then answering from that filtered view. K-Level Reasoning (Zhang et al., 2025h) implements the behavioral economics Level-K framework recursively in LLMs for negotiation. Thought-Tracing (Kim et al., 2025) implements approximate Bayesian inference via Sequential Monte Carlo-like hypothesis generation, significantly outperforming reasoning models like o3-mini, suggesting social reasoning may require fundamentally different computational mechanisms than mathematical deduction.
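The two-stage perspective-taking idea (filter the context to what a character knows, then answer from that filtered view) can be sketched symbolically. This is only an illustration under stated assumptions: SimToM performs both stages by prompting an LLM, whereas the sketch below replaces them with rule-based functions, and the `Event` record, the witness sets, and the Sally-Anne style story are hypothetical constructs introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical story event: an object ends up at a location, seen by some characters."""
    description: str
    obj: str
    location: str
    witnesses: frozenset


def filter_to_perspective(events: list, character: str) -> list:
    """Stage 1: keep only the events the target character actually witnessed."""
    return [e for e in events if character in e.witnesses]


def believed_location(events: list, character: str, obj: str) -> Optional[str]:
    """Stage 2: answer from the filtered view (last location the character saw)."""
    for event in reversed(filter_to_perspective(events, character)):
        if event.obj == obj:
            return event.location
    return None  # the character never saw the object


story = [
    Event("Sally puts the marble in the basket", "marble", "basket",
          frozenset({"Sally", "Anne"})),
    Event("Anne moves the marble to the box while Sally is away", "marble", "box",
          frozenset({"Anne"})),
]

if __name__ == "__main__":
    print(believed_location(story, "Sally", "marble"))  # Sally's (false) belief
    print(believed_location(story, "Anne", "marble"))   # Anne's (true) belief
```

Separating the filtering step from the answering step is what prevents the answer from leaking information the character never observed, which is the failure mode that single-pass prompting tends to exhibit on false-belief questions.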

Sandbox architectures and scale.

Project Sid (AL et al., 2024) deployed up to 1,000 agents across six towns in Minecraft using the PIANO architecture (Parallel Information Aggregation via Neural Orchestration), a brain-inspired modular design with separate concurrent modules for cognition, planning, motor execution, and speech. Emergent phenomena included autonomous professional specialization, personality-driven social network formation, democratic governance, and cultural transmission including spontaneous religious proselytization. Sotopia extensions include Sotopia-π (Wang et al., 2024e) (interactive self-reinforcement learning for social skills) and Lifelong-Sotopia (multi-episode long-term consistency evaluation). AgentSociety (Piao et al., 2025) simulated 10,000+ agents generating 5 million interactions in an integrated urban-social-economic environment with emotion and cognitive modeling inspired by Maslow’s hierarchy. Deployed platforms such as Moltbook provide persistent social environments where AI agents autonomously post, discuss, and form community norms, bridging the gap between simulation and real-world agent societies.

Emergent social phenomena.

Only 2 of 15 LLMs achieve sustainable cooperation in commons dilemma scenarios (Piatti et al., 2024), and cooperation evolution across generations of LLM agents proves strongly model-dependent (Vallinder and Hughes, 2025). Yet norms and conventions do emerge: Ren et al. (2024) document norm formation in LLM societies, and Ashery et al. (2025) find social conventions with critical-mass tipping points, where collective biases appear at the group level that do not exist in individual agents. Melting Pot (Leibo et al., 2021) provides 50+ substrates covering cooperation, competition, deception, and coordination for systematic evaluation of such dynamics. Role-playing systems such as RoleLLM (Wang et al., 2024i), CharacterLLM (Shao et al., 2023), and ChatHaruhi (Li et al., 2023a) probe character consistency through persona fine-tuning and memory-based maintenance. Shanahan et al. (2023) argue that LLMs maintain implicit world models of character situations through distributional representations. Werewolf and Avalon serve as concentrated testbeds for deception and trust: a comprehensive Avalon investigation (Lan et al., 2024) documented emergent leadership and camouflage strategies, ReCon (Wang et al., 2024f) introduced recursive perspective transitions for deception handling, and The Traitors (Curvo, 2025) found that deceivers consistently prevail by exploiting the cognitive limitations of honest participants.

Digital twin societies.

S³ (Gao et al., 2023) simulates information propagation, emotion contagion, and attitude polarization on social media platforms; an extended version successfully predicted the 2024 US presidential election results, demonstrating predictive validity for real-world phenomena. SocioVerse (Zhang et al., 2025f) validates social simulation against a pool of 10 million real-world users, enabling election prediction, breaking-news response, and economic survey replication at unprecedented scale. PersuasionForGood (Wang et al., 2019) modeled persuasion as a social state transition process, tracking how 10 distinct strategies shift attitudes, establishing that social dynamics are personalized rather than universal.

Institutional and formal approaches.

As Dignum and Dignum (2025) argue, current LLM-based agents exhibit behavioral autonomy without explicit reasoning structures. The BDI (Belief–Desire–Intention) architecture (Rao and Georgeff, 1995), normative multi-agent systems (Boella and van der Torre, 2007), electronic institutions (Esteva et al., 2001), and formal commitment models (Telang et al., 2021) provide the missing machinery: explicit, inspectable representations of mental states, social obligations, and institutional roles. MetaGPT (Hong et al., 2024) encodes organizational knowledge through Standardized Operating Procedures, and ChatDev (Qian et al., 2024) implements chat-chain architectures with communicative dehallucination, both showing that explicit institutional constraints outperform individual agent prompting for organizational coherence. Strategic dialogue systems further test social dynamics: CraigslistBargain (He et al., 2018) decoupled strategy from generation, NegotiationArena (Bianchi et al., 2024) quantifies irrational behaviors, the Consensus Game (Jacob et al., 2024) formalizes LM decoding as equilibrium search, and the Game-theoretic LLM framework (Hua et al., 2024) incorporates backward induction into agent workflows.

Memory footprint and KV cache growth are becoming critical bottlenecks as autoregressive video generation models scale up. During long rollouts, the key-value (KV) cache, which stores past attention keys and values to avoid recomputation, expands linearly with sequence length, resulting in substantial memory pressure and limiting output durations to roughly 60 seconds on current hardware. Researchers are pursuing four complementary strategies to tame this growth. First, token eviction techniques selectively discard low-importance cache entries, retaining only highly influential tokens (often termed "heavy hitters") and preserving attention-sink tokens that stabilize attention distributions; these methods bound the cache size while attempting to minimize degradation in generation quality. Second, chunk-level autoregressive generation sidesteps cache explosion by segmenting the output into blocks, although 60-second ceilings remain. Third, KV quantization schemes, which compress cache entries to lower bit widths (as demonstrated in LLM frameworks such as KIVI, KVQuant, QuaRot, and RotateKV), have matured in language model serving but cannot be directly ported to video diffusion models because different activation statistics lead to unacceptable quality loss. Fourth, novel spatiotemporal-aware compression frameworks explicitly exploit the inherent redundancy across frames and regions, achieving better tradeoffs than generic quantizers. On the memory management side, FlashAttention, PagedAttention, and Quest address the computational overhead, while RingAttention and LoongTrain enable distributed training across long sequences. The latest token-pivot framework achieves 32× compression on image/video tokens without fine-tuning, suggesting that memory and cache technologies will be indispensable for the next generation of video generation systems.
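The linear cache growth and the first strategy (token eviction) can be made concrete with a small sketch. It is a simplified illustration in the spirit of heavy-hitter eviction with preserved attention sinks, not the implementation of any cited system: the function names, the accumulated-attention `scores` input, and the default sink and recency window sizes are all assumptions introduced here.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of a standard KV cache: one key and one value vector per token, head, and layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


def select_tokens_to_keep(scores: list, budget: int,
                          num_sinks: int = 4, num_recent: int = 4) -> list:
    """Return sorted indices of cache entries to retain under a fixed budget.

    scores[i] is assumed to be the accumulated attention mass received by token i.
    The first `num_sinks` tokens (attention sinks) and the last `num_recent` tokens
    are always kept; remaining slots go to the highest-scoring "heavy hitters".
    """
    n = len(scores)
    if n <= budget:
        return list(range(n))  # nothing to evict yet
    keep = set(range(min(num_sinks, n))) | set(range(max(0, n - num_recent), n))
    remaining = budget - len(keep)
    if remaining < 0:
        raise ValueError("budget is smaller than the sink and recency windows")
    candidates = sorted((i for i in range(n) if i not in keep),
                        key=lambda i: scores[i], reverse=True)
    keep.update(candidates[:remaining])
    return sorted(keep)


if __name__ == "__main__":
    # Doubling the sequence length doubles the cache size.
    short = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=1000)
    long = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=2000)
    print(short, long)

    scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6, 0.4, 0.5]
    print(select_tokens_to_keep(scores, budget=6, num_sinks=2, num_recent=2))
```

Because the budget is fixed, the retained cache stays constant in size however long the rollout runs; the cost is that evicted tokens can never be attended to again, which is where quality degradation may enter.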