OpenAI's Jiayi Weng: Beyond Gradients, Is the Next AI Training Paradigm on the Horizon?

Machine Heart Editorial Team

Suppose that one day a plain program written by an AI coding tool, with no neural network, no gradient descent, and no “training” of any kind, scored the theoretical maximum in a classic game and matched Deep RL results in robot control. How would you explain it?

This is not a sci-fi premise, but a real experiment recently documented in a blog post by OpenAI research engineer Jiayi Weng. He originally intended to write a few cheap, small rules for game testing, but ended up creating something that shocked even him. This led him to re-examine a long-underestimated method—heuristics—and he believes it might be ushering in its own era.

Hand-written rules themselves are nothing new. Expert systems have existed for decades, and their problems are old as well: fix case A by adding a rule today, discover case B is broken tomorrow, patch it the day after, and eventually, no one dares to touch it. As the scale grows, maintenance costs crush people. Weng's core observation is: the coding agent changes precisely this cost curve. When an agent can automatically read logs, watch replays, modify code, run tests, and record experiments, a hand-written rule system gains the potential for continuous growth for the first time—what was previously just a patch now starts to be worth owning long-term.

This also directly touches on the old problem of Continual Learning. The root cause of catastrophic forgetting in neural networks is that old abilities are only implicitly stored in parameters and are easily overwritten by new data. In the Heuristic Learning framework he proposes, old abilities can be directly written into regression tests, replays with fixed seeds, and explicit failure records. The history is explicit, readable, and reconstructable. This doesn't solve the forgetting problem, but rather transforms “preventing forgetting” into a more engineering-oriented problem to handle.

Of course, Weng also points out the boundaries of this method: the expressive power of code is ultimately limited, and complex perception and long-range generalization remain the domain of neural networks. He believes the more promising direction is to combine the two—using a Heuristic System to quickly process online data and accumulate regressable experience, and then periodically internalizing this experience into a neural network.

He summarizes this vision in one sentence: Anything that can be continuously iterated upon starts to become solvable. This aligns with the logic of previous paradigm shifts—from pretrain to RLHF, and then to large-scale RL, each step has pushed the boundary of what is “verifiable” further outward. Heuristic Learning might be the next circle.

Jiayi Weng is one of the core engineers for OpenAI's post-training RL infrastructure. When he joined OpenAI in 2022, his interviewer was none other than John Schulman. Subsequently, he led the development of the core RL infrastructure for OpenAI's post-training phase, a system that supports the training iterations of the GPT series in RLHF, alignment, and inference optimization.

The following is the original text of Jiayi Weng's blog post "Learning Beyond Gradients," republished with permission from Machine Heart:

Original link: https://trinkle23897.github.io/learning-beyond-gradients/

Continual Learning has long been difficult to solve, primarily hindered by the catastrophic forgetting of neural networks: learning new things easily washes away old abilities. So, if we don't focus solely on neural network weights, are there other solutions?

As LLM agents get stronger, the speed and quality of coding are improving. But recently, I've been more intrigued by another phenomenon: a coding agent, without training a new network or updating weights, simply by continuously observing failures, modifying code, adding tests, and watching replays, can nurture a program system to become stronger and stronger.

This makes me re-evaluate heuristics, i.e., hand-written rules and programmatic strategies. In the past, many heuristics were not useless, but no one could afford to maintain them; what the coding agent changes is this maintenance cost curve. Thus, rules that were once only one-off patches are beginning to become code worth owning long-term.

Anything that can be continuously iterated upon starts to become solvable. This is precisely the problem Continual Learning has always aimed to solve. Could this be the next paradigm after Pretrain, RLHF, and Large-scale RL/RLVR?

An Anomaly

While maintaining EnvPool in my spare time, I wanted to use a cheaper strategy to test the correctness of game environments, as running a neural network every time for CI is a waste of testing resources.

The initial question was simply: Could I write some cheap, reproducible heuristics that perform much better than random, specifically designed to push the environment into informative states?

I tried using codex (gpt-5.4) to write a rule-based version, completely independent of NNs. Unexpectedly, after a few tweaks, the results were far more absurd than I anticipated:

In the brick-breaking game Atari Breakout, the strategy went from 387 -> 507 -> 839 -> 864, eventually reaching the theoretical maximum score;
In a simulated quadruped robot joint control task, MuJoCo Ant, a pure Python program strategy first learned a rhythmic gait, then incorporated a short-horizon model planner, ultimately achieving a score of 6000+, entering the magnitude of common Deep RL results;
In a simulated robot running task, MuJoCo HalfCheetah, relying on explainable gait/pose rules and online planning, it iterated to an average score of 11836.7 over 5 retests, also entering the magnitude of common Deep RL results;
Across the full Atari57 suite, 57 games x 2 input types x 3 runs = 342 coding-agent search trajectories were run, with mixed performance; still, at a matched number of environment interaction steps, the median HNS was already far above the curves of Deep RL algorithms such as PPO at around 1M environment steps.

Seeing these results for the first time was extremely shocking, but what concerned me more was this: codex did not train a neural network; it was maintaining a software system that could continue to grow.

The final Breakout strategy went far beyond a simple "ball is on the left, so move left." What grew was action probing, state reading, ball and paddle detection, landing prediction, stuck-loop detection, regression testing, video replay, and experiment recording. The Ant strategy also exceeded a single gait formula, containing rhythmic control, pose feedback, contact information, and short-horizon model rollouts.

Thus, I realized it's necessary to create a new concept here: the object being updated is no longer just a policy function, but a software system with memory, feedback entry points, and regression mechanisms.

Heuristic Learning

After interacting with codex a bit more, I wanted to define this process as Heuristic Learning (HL):

The main body of HL is composed of program code;
It shares the closed loop of state, action, feedback, and update common in today's Deep RL practice; but the update target shifts from neural network parameters to software structures;
Its feedback is processed by a coding agent and can come from environment rewards, test cases, logs, videos, replays, or human feedback;
Its updates do not go through backpropagation; the coding agent directly modifies the policy, state detectors, tests, configurations, or memory;
HL is the process of learning and updating; the object long-term maintained by HL is called a Heuristic System (HS);
An HS exceeds an isolated policy.py: it at least contains a programmatic strategy, state representation, feedback entry points, experiment records, replays or tests, memory, and an update mechanism executed by the coding agent. A single rule is not enough; only when rules, feedback, history, and the next round of updates are all connected does it become an HS.

To put it in a table:

Compared to Deep RL, Heuristic Learning has many favorable properties:

Explainability: Neural networks are hard to interpret; HL's code strategies can be translated into human language;
Sample Efficiency: A single effective code update can jump directly to a new strategy, without needing to slowly adjust the learning rate;
Regression-testable: Old abilities can become tests, replays, or golden cases;
Overfitting can be constrained: Code heuristics can also overfit to seeds, environmental details, or testing loopholes, but simplification, regression, and multi-seed checks can form a kind of engineering regularization;
It can partially avoid Catastrophic Forgetting: Old abilities don't need to be remembered entirely by the model; they can be written into the rule set and tests.

The key point is, a class of heuristics that were previously not worth writing due to high maintenance costs might suddenly become worth owning long-term.

Why Heuristic Learning Didn't Develop Earlier

If HL's predecessors are expert systems and rule systems, then before coding agents matured, the maintenance cost for such things was prohibitively high.

Manual maintenance of heuristics by humans easily turns into this: Add a rule to fix case A today. Discover case B is broken tomorrow. Add another if the day after. Eventually, no one dares to delete anything.

The problem is not that heuristics are useless, but that no human effort could afford to maintain them. Before, human-maintained expert systems were somewhat like hand-spun yarn before the Industrial Revolution: as scale grew, stability and maintenance costs crushed people. The spinning machine changed the production capacity curve; the coding agent changes the heuristic maintenance curve. It's like a nutrient pipeline that can deliver intelligence, continuously nourishing an HS, allowing it to iterate and evolve.

The common agentic feedback loop currently looks mainly like: feature request -> agent writes code -> passes test -> human gives a little feedback -> next round patch.

But as large model capabilities improve, the frequency of human intervention will gradually decrease, giving this feedback loop the opportunity to automatically close in certain clearly bounded systems, thus enabling automated mass production of HS using HL:

Environment feedback / test failure / log anomaly -> coding agent reads context -> modifies policy/test/memory -> re-runs -> writes results back to trials and summary -> next round continues.
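
The closed loop above can be sketched in a few lines of Python. This is only a toy illustration, not the author's implementation: `HeuristicSystem`, `hl_round`, `evaluate`, and `propose` are hypothetical names, and `propose` stands in for the coding agent, which in practice edits code rather than a parameter dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Params = Dict[str, float]


@dataclass
class HeuristicSystem:
    """Toy HS: a parameterised policy plus explicit history and regression checks."""
    params: Params
    trials: List[dict] = field(default_factory=list)
    regressions: List[Callable[[Params], bool]] = field(default_factory=list)


def hl_round(hs: HeuristicSystem,
             evaluate: Callable[[Params], float],
             propose: Callable[[Params, List[dict]], Params]) -> bool:
    """One update round: propose a change, re-run, keep it only if nothing regresses."""
    baseline = evaluate(hs.params)
    candidate = propose(hs.params, hs.trials)
    score = evaluate(candidate)
    passed = all(check(candidate) for check in hs.regressions)
    accepted = passed and score > baseline
    # Every attempt is written back to the trial history, accepted or not.
    hs.trials.append({"params": dict(candidate), "score": score, "accepted": accepted})
    if accepted:
        hs.params = candidate
    return accepted


def evaluate(params: Params) -> float:
    """Hypothetical environment feedback: best score at offset == 3."""
    return -(params["offset"] - 3.0) ** 2


def propose(params: Params, trials: List[dict]) -> Params:
    """Hypothetical stand-in for the coding agent: nudge one parameter."""
    return {**params, "offset": params["offset"] + 1.0}


hs = HeuristicSystem(params={"offset": 0.0},
                     regressions=[lambda p: p["offset"] <= 5.0])
for _ in range(5):
    hl_round(hs, evaluate, propose)
print(hs.params, len(hs.trials))
```

Rejected proposals stay in `trials`, which is the point of the sketch: the history is explicit and can inform the next round.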

How Heuristic Learning Does Continual Learning

Catastrophic forgetting in neural networks occurs when new data pushes parameters toward new tasks, overwriting old abilities. HL can also forget, for example:

A new rule fixes one failure mode while breaking an old scenario;
New memory repeatedly leads the agent in the wrong direction;
A new, overly narrow test causes the strategy to learn to exploit the loophole;
A new patch changes a public interface, silently breaking old callers;
Rules pile up until the agent itself can no longer maintain them.

So, HL does not automatically solve Continual Learning. It transforms "preventing forgetting" into something more engineering-oriented.

In HL, old abilities can be solidified as:

Regression tests;
Replays with fixed seeds;
Golden traces;
Failure videos;
Version diffs;
Explicitly written-down failure directions.
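
As a minimal sketch of how an old ability might be frozen into a fixed-seed regression check, consider the toy below. The environment, `GoldenCase`, `run_episode`, and `failed_cases` are all hypothetical and exist only for illustration; the article does not show its actual test harness.

```python
import random
from typing import Callable, List, NamedTuple


class GoldenCase(NamedTuple):
    seed: int         # fixed seed, so the replay is identical every time
    min_score: float  # score the old policy achieved; new versions must not drop below it


def run_episode(policy: Callable[[int], int], seed: int) -> float:
    """Toy environment: the seed fixes the order in which targets 0..9 appear."""
    targets = random.Random(seed).sample(range(10), 10)
    return float(sum(1 for t in targets if policy(t) == t))


def failed_cases(policy: Callable[[int], int], golden: List[GoldenCase]) -> List[GoldenCase]:
    """Return every golden case the given policy no longer satisfies."""
    return [case for case in golden if run_episode(policy, case.seed) < case.min_score]


def old_policy(target: int) -> int:
    return target


def new_policy(target: int) -> int:
    # A "fix" elsewhere that silently breaks the old behaviour for target 9.
    return target if target != 9 else 0


golden = [GoldenCase(seed=s, min_score=run_episode(old_policy, s)) for s in (0, 1)]
print(failed_cases(old_policy, golden))
print(failed_cases(new_policy, golden))
```

The second call reports both golden cases as broken, which is the kind of explicit signal that lets forgetting be caught before a patch is accepted.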

Unlike neural networks that compress experience into weights, HL's history is explicit, readable, deletable, and reconstructable. It is responsible for "remembering," and also for compressing a pile of local patches into a simpler representation.

(An HS that only grows without compression will eventually become an unmaintainable legacy codebase. It will "remember" many things, but in such a bad way that no one dares to touch it, thus rotting.)

Therefore, a healthy HS needs at least two operations to sustain itself:

Absorb feedback: Write new failures, new logs, new rewards back into the system.
Compress history: Fold a pile of local patches back into a simpler, more maintainable representation.

This transforms Continual Learning from "how to update parameters" to "how to maintain a software system that continuously absorbs feedback."

The Complexity of a Heuristic System

Here, I define Coupling Complexity as the level of policy complexity a coding agent can maintain to support HL. In other words, it's how many mutually dependent states, rules, tests, feedback, and history must be tended to simultaneously within a single update.

This quantity cannot be measured by lines of code. A 500-line strategy with clear module boundaries, comprehensive tests, and reproducible states might be very easy to maintain; an 80-line strategy where every line is interdependent, with no logging or replays, can be a ticking time bomb, collapsing at the slightest touch.

On the code side, coupling complexity is constrained by module boundaries, interface stability, test coverage, log observability, rollback cost, and state reproducibility. Good modularization cuts global coupling into local coupling, thereby reducing complexity; good tests allow the coding agent to avoid simulating the entire system in its head every time.

On the coding agent side, the amount of coupling complexity it can accept depends on model capability, context length, memory quality, tool quality, and overall iteration speed. A stronger model can handle more interactions simultaneously; longer context allows it to lose fewer clues; memory can retain experience across iterative rounds; tools for searching, locating, running, and replaying can offload some of the cognitive burden externally.

Putting these two sides together, we get a set of judgments:

The clearer the feedback, the higher the coupling complexity that a unit of agent intelligence can maintain;
Given identical tools and feedback, a stronger model can handle higher coupling complexity;
Modularization, testing, and replay transfer part of the coupling complexity into the environment;
Memory and tools increase the agent's effective context;
An HS that only grows without compression will have its coupling complexity continuously rise until it exceeds maintenance capacity.

The Breakout strategy could reach a perfect score of 864 partly because the rules are simple, and partly because failures could be replayed via video, locally reproduced, and verified by regression. Ant is much more complex, but it can be broken down into modules like rhythm, posture, contact, and residual MPC.

Montezuma is a good counter-example. An unattended run in Atari57 scored 400 points, but that route consisted of 86 macro-actions, essentially open-loop execution. This example shows that some environments need stronger program structures, like composable macro-actions, recoverable search states, and long-term memory. Ordinary if-else cannot solve all problems.

The Next Paradigm?

The current paradigm shift has progressed from the initial pretrain, to RLHF, and then to large-scale RL / RLVR. Anything that can be verified begins to be solvable.

Online Learning and Continual Learning can be partially solved via Heuristic Learning, using the agentic coding produced by current RLVR. From this vision, I would call it the next paradigm: Anything that can be continuously iterated upon starts to become solvable.

Why say "partially solved"? Because Heuristic Learning cannot do everything that neural networks can. It is limited by the expressive power of code, which falls short on tasks such as complex perception and long-range generalization. For instance, as far as I can currently see, I cannot imagine an agent piecing together a pure Python program, with no neural network, that solves ImageNet.

Thus, the question lies in how to combine neural networks and HL to simultaneously solve Online Learning and Continual Learning. The most promising direction is: using HL to process online data, quickly generating online experience, internalizing the online experience into trainable, regressable, and filterable data, and then periodically updating the neural network.

Taking a robot as an example, if we borrow the System 1/2 terminology, a possible division of labor might be as follows:

Dedicated, shallow NN: Serves as part of System 1, fast and cheap, responsible for perception, classification, and object state estimation;
HL: Also serves as part of System 1, responsible for processing the latest data, rules, tests, replays, memory, safety boundaries, and local recovery;
LLM agent: Acts as System 2, responsible for providing feedback to HL, improving data, and periodically extracting HL-generated data to update itself.

This setup can be further broken down into a hierarchical structure: Joint-level HL -> Limb-level HL -> Whole-body balance HL -> Task-level HL.

The lower layers are responsible for safety and low-latency control, the middle layers for gait and contact, and the higher layers for tasks, recovery, and long-term memory. The coding agent might not directly "understand walking"; it acts more like an update pipeline plugged into the system: continuously feeding failure videos, sensor streams, simulation results, and test results into the system, and then rewriting the feedback into code, parameters, protection rules, and memory.

LLM agents can be shared or isolated within the robot's body for self-learning. The problem here is: how can the specific data distribution provided by HL be kept from causing the LLM's periodic updates to collapse? This is a classic post-training problem with many established solutions, which I'll refrain from expanding on here for certain reasons.

Agentic coding has changed the speed of writing code and also redefined which code is worth owning long-term.

Many heuristics in the past seemed to have no future, often due to maintenance costs; they were not necessarily too weak themselves. What the coding agent changes is this maintenance cost curve. Rules, tests, logs, memory, and patches were originally just scattered engineering materials. Now, they can begin to form a continuously updating Heuristic System, capable of truly solving the problems that Online Learning and Continual Learning have failed to address.

Welcome to the next paradigm!

Appendix: Experiment Process and Reproduction Entry Points

The complete artifact repo is at https://github.com/Trinkle23897/learning-beyond-gradients. The commands below assume you have cloned this repo and are running from the repository root. The GitHub Pages only displays the article and necessary static files; complete scripts, CSVs, videos, and experiment materials are all in the repo.

In the following experiments, the codex model version was gpt-5.4; the latest model version has not been tested. The following experiment reports were all written by codex itself.

A.1 Summary of the Experiment Process

At first, I directly asked Codex: "Write a strategy to solve Breakout." The results were mediocre. Low scores lacked explanatory power: you couldn't tell if the action semantics were wrong, the state detection was wrong, the evaluation setup was wrong, or the strategy structure itself was inadequate. Later, I changed the task to another form: don't just hand in a policy.py; maintain a complete closed loop.

The loop roughly looked like this: Probe actions and observations -> Write state detectors -> Write strategy -> Run complete episodes -> Record trials.jsonl and summary.csv -> Generate videos or curves -> Look at failure modes -> Modify strategy -> Simplify code and perform regression.
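
The record-keeping step of this loop might look roughly like the sketch below. The file names trials.jsonl and summary.csv come from the article, but the schema is an assumption: the four fields simply mirror the columns of the Breakout experiment records shown in this appendix, and `record_trial` is a hypothetical helper.

```python
import csv
import json
import tempfile
from pathlib import Path

# Assumed schema, mirroring the columns of the article's experiment records.
FIELDS = ["trial_name", "score", "cumulative_env_steps", "note"]


def record_trial(log_path: Path, summary_path: Path, trial: dict) -> list:
    """Append one trial to the JSONL log, then rebuild the CSV summary from the log."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(trial) + "\n")
    rows = [json.loads(line)
            for line in log_path.read_text(encoding="utf-8").splitlines()]
    with summary_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows


workdir = Path(tempfile.mkdtemp())
log_path = workdir / "trials.jsonl"
summary_path = workdir / "summary.csv"

record_trial(log_path, summary_path, {
    "trial_name": "baseline_v0", "score": 99,
    "cumulative_env_steps": 16303, "note": "initial RAM intercept"})
rows = record_trial(log_path, summary_path, {
    "trial_name": "tunnel0_v1", "score": 387,
    "cumulative_env_steps": 43303, "note": "no tunnel offset"})

best = max(rows, key=lambda r: r["score"])
print(best["trial_name"], best["score"])
```

Keeping the append-only log as the source of truth and regenerating the summary from it means a failed or abandoned trial is never lost.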

At this point, the shape of the task had changed. The final output product shifted from a single policy file to an experimental system that could continue to be modified. It had detectors, records, replays, failure modes, and clues on what to change next round.

Breakout

Breakout seems like a geometry problem on the surface: where is the ball, where is the paddle, and where will the ball land after hitting a wall? The trouble comes in the later stages. The strategy can consistently catch the ball but no longer hits new bricks, and the score gets stuck in a stable loop.
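
The geometry part can be sketched as follows. This is an illustrative landing predictor under simple assumptions (constant velocity, perfectly reflecting side walls, screen coordinates with y growing downward); `predict_landing_x` and its parameters are hypothetical and are not taken from heuristic_breakout.py.

```python
from typing import Optional


def predict_landing_x(x: float, y: float, vx: float, vy: float,
                      paddle_y: float, left: float, right: float) -> Optional[float]:
    """Predict where the ball crosses the paddle row, folding wall bounces.

    Returns None while the ball is moving away from the paddle.
    """
    if vy <= 0:
        return None
    steps = (paddle_y - y) / vy          # time until the ball reaches the paddle row
    raw = x + vx * steps                 # position if there were no walls
    width = right - left
    # Mirror reflections turn the straight line into a triangle wave of period 2 * width.
    offset = (raw - left) % (2 * width)
    if offset > width:
        offset = 2 * width - offset
    return left + offset


# One bounce off the right wall, then one off the left wall followed by the right wall.
print(predict_landing_x(x=10, y=0, vx=2, vy=1, paddle_y=10, left=0, right=20))
print(predict_landing_x(x=5, y=0, vx=-3, vy=1, paddle_y=10, left=0, right=20))
```

The modulo fold avoids simulating each bounce step by step, which keeps the predictor cheap enough to call on every frame.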

In the first round, Codex confirmed the action space and observation shape, then looked for the colors of the paddle, ball, and bricks from the RGB image, and used those image tags to scan the 128 RAM bytes. Early experiment records looked something like this:

trial_name              | score | cumulative_env_steps | note
------------------------|-------|----------------------|------------------------
shape_action_probe      | -     | 32                   | inspect obs/info/action
ram_byte_corr_probe_v1  | -     | 5,032                | correlate RAM bytes
ram_fit_action_probe_v2 | -     | 9,532                | action 2=right, 3=left
baseline_v0             | 99    | 16,303               | initial RAM intercept
tunnel0_v1              | 387   | 43,303               | no tunnel offset

387 is the first local high score, and it is a deceptive one. The strategy could already catch the ball reliably, but it had sent the ball into a cycle: it wouldn't die, but it also wouldn't keep clearing bricks. A human who had written the strategy up to this point might easily keep tuning "ball-catching precision." After watching the video and the trajectory of the last few dozen steps, Codex pinpointed the problem as a lack of perturbation in the ball's path.

Video artifact: heuristic_breakout_score387_tunnel0_render210x160.mp4.

The first effective mechanism was breaking the cycle: if there's no reward for a long continuous period, periodically add an offset to the predicted landing point to knock the ball out of the local loop. This change pushed the score from 387 to 507.
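
A cycle-breaking rule of this kind could be sketched as below. The parameter names and default values mirror the --stuck-trigger-steps, --stuck-switch-steps, and --stuck-offset flags in the Breakout reproduction command, but the alternating-sign behaviour is an assumption made for illustration, not the confirmed logic of heuristic_breakout.py.

```python
def stuck_offset(steps_since_reward: int,
                 trigger_steps: int = 1024,
                 switch_steps: int = 256,
                 offset: float = 12.0) -> float:
    """Offset (in pixels) to add to the predicted landing point.

    Zero while rewards keep arriving; once no reward has been seen for
    `trigger_steps`, the offset is applied and flips sign every `switch_steps`
    (assumed behaviour) so the ball is pushed out of a stable loop.
    """
    if steps_since_reward < trigger_steps:
        return 0.0
    phase = (steps_since_reward - trigger_steps) // switch_steps
    return offset if phase % 2 == 0 else -offset


for steps in (100, 1024, 1280, 1536):
    print(steps, stuck_offset(steps))
```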

Later, another failure mode was encountered: for high-speed, low-altitude balls, if chased using normal intercept distances, the paddle would be misled by excessive look-ahead. Codex added fast_low_ball_lead_steps=3, and the score jumped from 507 to 839.

Moving from 839 to 864 was more like tending a system that had already become complex. Codex tried dead zones, serve offsets, stuck offsets, brick balance biases, and look-ahead steps; many directions were useless. What finally worked was a late-game condition: once the score passed the first wall, the stuck offset took effect only while the ball was still far from the paddle and was gradually withdrawn as the catch approached; otherwise it would mislead the paddle during the final few bricks. At the same time, Codex added a small paddle drift compensation to offset the one-step delay between the action and the paddle position.
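
The gradual withdrawal can be pictured as a simple ramp. In this sketch the name `release_horizon_steps` and its default of 8 echo the --stuck-release-horizon-steps flag in the reproduction command; the linear shape of the ramp and the `steps_to_intercept` input are assumptions for illustration only.

```python
def released_offset(offset: float, steps_to_intercept: float,
                    release_horizon_steps: float = 8.0) -> float:
    """Scale a stuck offset down to zero as the ball gets close to the paddle.

    Full offset while the catch is at least `release_horizon_steps` away,
    then a linear fade (assumed shape) so the final intercept is unperturbed.
    """
    scale = min(1.0, max(0.0, steps_to_intercept / release_horizon_steps))
    return offset * scale


for steps in (16, 8, 4, 0):
    print(steps, released_offset(12.0, steps))
```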

Video artifact: heuristic_breakout_ci3985ae2_score864_render210x160.mp4.

The final verification for the default RAM configuration was 864 / 864 / 864 over three runs. Later, Codex also migrated the same geometric control back to pure image input: no RAM, only RGB segmentation to find the paddle, the ball, and the brick balance. The pure image version first scored 310, then 428. Finally, Codex lowered the threshold so that the late-game "gradual withdrawal of stuck offset" stayed in effect throughout the entire session, and the score reached 864 for the first time after 7 local policy rounds, corresponding to 14,504 local policy environment steps.

It would be wrong to describe this as "pure image went from zero to perfect score in 14.5K steps." The real process was: Codex first figured out the geometric control, cycle-breaking, and late-game offset withdrawal structures in the RAM version; once the structure was stable, it swapped the state-reading layer from RAM to RGB detectors. The 14.5K for the pure image version is the migration budget.
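
Swapping only the state-reading layer is easiest to picture as two detectors behind one interface. The sketch below is hypothetical throughout: the RAM indices are made up, the character grid is a stand-in for RGB segmentation, and `choose_action` is far simpler than the real controller. The action codes 2 = right and 3 = left follow the early experiment records; treating 0 as a no-op is an assumption.

```python
from typing import NamedTuple, Protocol, Sequence


class GameState(NamedTuple):
    ball_x: float
    paddle_x: float


class Detector(Protocol):
    def read(self, obs) -> GameState: ...


class RamDetector:
    """Reads state from RAM bytes; the indices are hypothetical placeholders."""

    def __init__(self, ball_index: int, paddle_index: int) -> None:
        self.ball_index = ball_index
        self.paddle_index = paddle_index

    def read(self, obs: Sequence[int]) -> GameState:
        return GameState(float(obs[self.ball_index]), float(obs[self.paddle_index]))


class GridDetector:
    """Stand-in for RGB segmentation: 'b' cells are the ball, 'p' cells the paddle."""

    def read(self, obs: Sequence[str]) -> GameState:
        def mean_x(label: str) -> float:
            xs = [x for row in obs for x, cell in enumerate(row) if cell == label]
            return sum(xs) / len(xs)
        return GameState(mean_x("b"), mean_x("p"))


def choose_action(state: GameState, deadband: float = 1.0) -> int:
    """Shared control logic, unaware of where the state came from."""
    if state.ball_x > state.paddle_x + deadband:
        return 2  # right
    if state.ball_x < state.paddle_x - deadband:
        return 3  # left
    return 0      # assumed no-op


ram = [0] * 128
ram[10], ram[20] = 6, 2
frame = ["......b.", "........", ".ppp...."]

print(choose_action(RamDetector(10, 20).read(ram)))
print(choose_action(GridDetector().read(frame)))
```

Because the controller depends only on `GameState`, structure learned with one input type carries over when the detector is replaced.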

Ant and HalfCheetah

Ant's signal is different from Breakout's. Breakout's geometric structure is very intuitive; Ant involves continuous control with 8 joint actions, and failure modes shifted from "missing the ball" to body dynamics problems.

I didn't specify "use CPG" or "use MPC" from the start. There were only a few requirements: don't train a neural network, keep everything locally reproducible, leave records for each round of experiments, and keep pushing the score higher. Codex first read the EnvPool/Gymnasium Ant observations and rewards, confirmed the action order, root velocity, torso orientation, joint positions, and joint velocities, and then proposed its first version of a rhythmic gait.

The first version was a four-legged phase oscillator: left and right legs were anti-phase, hip and ankle joints tracked sinusoidal target angles, and actions were given by a PD controller. It wasn't elegant, but it was much better than random right out of the box, averaging a score of 2291 over 5 random seeds.
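
A phase oscillator with PD tracking of this kind could be sketched as follows. Everything specific here is an assumption for illustration: the leg ordering, the quarter-cycle lag between hip and ankle, the gains, and the amplitudes are not taken from heuristic_ant.py.

```python
import math
from typing import List, Sequence


def cpg_targets(t: float, freq_hz: float, hip_amp: float, ankle_amp: float,
                leg_phases: Sequence[float]) -> List[float]:
    """Sinusoidal target angles, two joints (hip, ankle) per leg."""
    targets = []
    for phase in leg_phases:
        angle = 2.0 * math.pi * freq_hz * t + phase
        targets.append(hip_amp * math.sin(angle))
        # Assumption: the ankle lags the hip by a quarter cycle.
        targets.append(ankle_amp * math.sin(angle + math.pi / 2.0))
    return targets


def pd_action(targets: Sequence[float], q: Sequence[float], qd: Sequence[float],
              kp: float, kd: float, limit: float = 1.0) -> List[float]:
    """PD controller: torque toward the target angle, damped by joint velocity, then clipped."""
    return [max(-limit, min(limit, kp * (tq - qi) - kd * qdi))
            for tq, qi, qdi in zip(targets, q, qd)]


# Hypothetical leg order: front-left, front-right, back-left, back-right.
# Left legs at phase 0, right legs at phase pi, i.e. anti-phase.
LEG_PHASES = (0.0, math.pi, 0.0, math.pi)

targets = cpg_targets(t=0.25, freq_hz=1.0, hip_amp=0.4, ankle_amp=0.3,
                      leg_phases=LEG_PHASES)
action = pd_action(targets, q=[0.0] * 8, qd=[0.0] * 8, kp=2.0, kd=0.1)
print([round(a, 3) for a in action])
```

Four legs with two joints each give the 8 action dimensions of Ant, and the anti-phase setup shows up as hip targets of equal size and opposite sign on the two sides.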

The subsequent early iterations were much like tuning a real controller: first, adding yaw feedback raised the score to 2718; then adjusting phase velocity, hip/ankle amplitudes, and yaw angular velocity gain brought it to 3025; then adding second- and third-order harmonics reached 3162. Codex also tried large-scale parameter searches, but the results didn't stably exceed the current rhythmic strategy, so it stopped expanding the search budget and turned to another representation.

The leap came from residual MPC. Roughly speaking, MPC means "thinking a short way into the future while walking." The rhythmic gait is kept as a base reflex. At each real environment step, the strategy samples dozens of small residual action sequences in a local MuJoCo model, scores them, and executes only the first residual action. At the next step, it re-assesses the state, re-plans, and uses the unexecuted part of the previous plan as a warm start.

This way, there's no need to plan how all 8 joints should move from scratch at each step. The strategy first has a stable gait, then uses a short-horizon model planner to correct it.
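
The idea can be sketched with a random-shooting planner on a toy one-dimensional model. This is a sketch under stated assumptions, not the Ant controller: the scalar dynamics, reward, and noise scale are invented, `plan_residual` is a hypothetical name, and only the defaults horizon=6 and candidates=32 echo the first MPC row of the experiment records.

```python
import random
from typing import Callable, List, Optional, Tuple


def plan_residual(state: float,
                  base_action: Callable[[float], float],
                  model_step: Callable[[float, float], float],
                  reward_fn: Callable[[float], float],
                  rng: random.Random,
                  warm_start: Optional[List[float]] = None,
                  horizon: int = 6,
                  candidates: int = 32,
                  sigma: float = 0.1) -> Tuple[float, List[float], float]:
    """Sample residual sequences around the warm start and keep the best one.

    Returns (first residual to execute, shifted plan for the next step, best return).
    """
    mean = list(warm_start) if warm_start is not None else [0.0] * horizon
    best_plan, best_return = mean, float("-inf")
    for k in range(candidates):
        # Candidate 0 is the unperturbed warm start, so planning never does worse than it.
        plan = mean if k == 0 else [m + rng.gauss(0.0, sigma) for m in mean]
        s, total = state, 0.0
        for h in range(horizon):
            s = model_step(s, base_action(s) + plan[h])  # base reflex + residual
            total += reward_fn(s)
        if total > best_return:
            best_plan, best_return = plan, total
    next_warm = best_plan[1:] + [0.0]  # drop the executed step, pad the tail
    return best_plan[0], next_warm, best_return


def base_action(s: float) -> float:   # toy "gait": does nothing on its own
    return 0.0


def model_step(s: float, a: float) -> float:   # toy dynamics
    return s + a


def reward_fn(s: float) -> float:     # toy objective: stay near 1.0
    return -(s - 1.0) ** 2


rng = random.Random(0)
state, warm = 0.0, None
for _ in range(30):
    residual, warm, _ = plan_residual(state, base_action, model_step, reward_fn, rng, warm)
    state = model_step(state, base_action(state) + residual)
print(round(state, 3))
```

Only the first residual of each plan is ever executed; the rest is reused as the starting point for the next replan, which is what the warm start refers to.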

trial_name                           | score_mean | cumulative_env_steps | note
-------------------------------------|------------|----------------------|-------------------------------------
ant_lr_cpgpd_v1                      | 2291.9     | 5,000                | left-right leg anti-phase CPG + PD
ant_yawaxis_grid_v2                  | 2857.9     | 20,000               | yaw feedback + retuning params
ant_h3_428_v1                        | 3162.0     | 50,000               | 2nd/3rd order harmonics
ant_mpc_residual_v1_ep1              | 3635.5     | 62,000               | horizon=6, candidates=32
ant_mpc_residual_cfg4_eval5          | 3964.7     | 67,000               | horizon=8, candidates=48
ant_mpc_residual_cand07_eval5        | 4647.1     | 73,000               | local search around MPC config
ant_mpc_residual_narrow04_eval5      | 4871.3     | 79,000               | lower z target, increase kp/candidates
ant_mpc_residual_warm02_eval5        | 5165.2     | 85,000               | warm start residual plan
ant_mpc_fast065x060_sigma008_clip012 | 5759.4     | 95,000               | faster gait + larger residual
ant_mpc_term001_ep1                  | 6054.5     | 100,000              | terminal velocity cost
ant_mpc_default_adaptive_ep1         | 6146.2     | 106,300              | speed adaptive phase + stance phase

By the end, the strategy included oscillator phases, stance phase proportions, speed adaptation, roll/pitch/yaw feedback, foot contacts, short-horizon model internal rollouts, residual smoothing, terminal velocity cost, and warm-start plan decay. A human could certainly write one or two of these modules, but simultaneously tending to experiment records, code, videos, and failure directions in a short time changes the difficulty completely.

Video artifact: heuristic_ant_mpc_default_6146_render480.mp4.

HalfCheetah is another data point for the same class of evidence. I re-ran the 5-round retest of mpc-staged-tree-asym-pd-cpg; the results for seeds 100..104 were: mean 11836.7, min 11735.0, max 12041.2. The strategy relied on explainable gait/pose rules and online staged-tree MPC: first using CPG/PD to form a high-scoring gait, then using short-horizon model scoring and a staged swing-amplitude schedule to correct actions.

Atari57

Breakout and Ant are single-point stories. Atari57 aims to see how much of this workflow remains after leaving a single beautiful case. The approach is straightforward: throw the same Codex workflow onto the entire Atari57 suite, running both ram and native_obs input types for each environment, with 3 independent repetitions for each input type. The total was: 57 games x 2 input types x 3 runs = 342 coding-agent search trajectories.

No human stood by during these experiments to offer hints. Each agent received the same template and different ENV_ID / OBS_MODE / REPEAT_INDEX, then executed until it stopped. Each run had to write a policy.py, trials.jsonl, summary.csv, sample_efficiency.png, and README.md.

The main constraints were:

No training neural networks.
No reading environment source code, tests, ROM details, or hidden states.
native_obs mode can only use the native obs returned by reset/step.
ram mode can use info["ram"].
Atari initialization parameters are fixed, including frame_skip=1, reward_clip=False, sticky action=0.
All probe/debug/trial steps that actually step through the environment must be counted into cumulative_env_steps.
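
One simple way to enforce the step-accounting constraint is to route every interaction through a counting wrapper. The sketch below assumes a single, non-batched environment with a Gymnasium-style reset/step interface; `CountingEnv` and `DummyEnv` are hypothetical names, and the batch experiments may have tracked steps differently.

```python
class CountingEnv:
    """Wraps an environment so every real step is charged to cumulative_env_steps."""

    def __init__(self, env) -> None:
        self.env = env
        self.cumulative_env_steps = 0

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action):
        # Probe, debug, and trial steps all pass through here, so none can be skipped.
        self.cumulative_env_steps += 1
        return self.env.step(action)


class DummyEnv:
    """Minimal stand-in environment used only to exercise the wrapper."""

    def reset(self, **kwargs):
        return 0, {}

    def step(self, action):
        return 0, 0.0, False, False, {}


env = CountingEnv(DummyEnv())
env.reset(seed=0)
for _ in range(3):        # action-probing phase
    env.step(0)
env.reset(seed=1)
for _ in range(2):        # actual trial
    env.step(1)
print(env.cumulative_env_steps)
```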

First, let's look at the environment step curve. HNS stands for human-normalized score, which normalizes each game score by a human baseline before comparison. In a batch run with absolutely no human intervention, the Atari median HNS for native_obs reached 0.32 at around 1M steps, and 0.26 for ram, significantly higher than the early curves for PPO2 / CleanRL EnvPool PPO in the figure; at around 9.7M steps, native_obs was 0.81, and ram was 0.59. In the same comparison, the PPO2 / CleanRL EnvPool PPO median HNS curve stored by the OpenRL Benchmark reaches approximately 0.88 / 0.92 at 10M steps.
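
For readers unfamiliar with the metric, the sketch below uses the commonly used definition of human-normalized score, which also subtracts a random-policy baseline: HNS = (score - random) / (human - random). Whether the article's curves use exactly this form is an assumption, and the per-game numbers are invented toy values, not results from the experiments.

```python
from statistics import median


def human_normalized_score(score: float, random_score: float, human_score: float) -> float:
    """0.0 means random-level play, 1.0 means human-level play."""
    return (score - random_score) / (human_score - random_score)


# Toy values only: (agent score, random baseline, human baseline) per hypothetical game.
toy_games = {
    "GameA": (50.0, 0.0, 100.0),
    "GameB": (30.0, 10.0, 20.0),
    "GameC": (5.0, 1.0, 17.0),
}

hns = {name: human_normalized_score(*values) for name, values in toy_games.items()}
print(hns)
print(median(hns.values()))
```

Taking the median rather than the mean keeps a single game with an extreme ratio from dominating the aggregate.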

What's being compared here is environment interaction efficiency; the overhead of the coding agent reading logs, writing code, and watching videos isn't factored into the total computational cost. The signal it gives is very specific: a still-crude coding agent batch workflow, without looking at any intermediate results, can already push the Atari57 median into a range close to these baselines.

If we switch to an aggregation method that takes the best input for each game, Codex median HNS is 0.83, OpenAI Baselines PPO2 is 0.80, and CleanRL EnvPool PPO is 0.98; and if we further relax to the best single run, Codex median HNS is 1.18. This aggregation can't replace a strict training curve comparison, but it more directly illustrates the level this batch of unattended search ultimately covers.

Aggregated curves compress differences into a median, so I also examined each game's own HNS. In games like Breakout, Krull, DoubleDunk, Boxing, and DemonAttack, both heuristic and Deep RL baselines could achieve scores significantly higher than the human baseline; in games like Asterix, Jamesbond, Centipede, Bowling, Skiing, and Tennis, heuristics stood out relatively; on Atlantis, VideoPinball, UpNDown, Assault, RoadRunner, and StarGunner, PPO was clearly much stronger.

The most interesting aspect of Atari57 is the changed source of sample efficiency. Traditional neural network Atari learning requires re-learning representation, credit assignment, and action meaning from high-dimensional inputs in each environment; what Codex does is break the environment into maintainable small program systems: aiming/dodging for shooters, bouncing for catching games, positional rules for avoidance games, environment wrapper details, and failure experiment records unique to each environment.

Montezuma

Some environments are not suitable for ordinary reactive heuristic strategies. Montezuma's Revenge is a typical example.

An earlier standalone search on Montezuma, using state graph search, could push the key distance from 72 to 28, but the reward was still 0. Later, in the pure image batch experiment on Atari57, one unattended Codex run reached a score of 400.0: the best repaired replay was repair_replay_r1_t19734 with seed 10001, using 1769 environment steps, essentially an open-loop route composed of 86 macro-actions.

Montezuma exposes an expressiveness problem. An ordinary policy.py state machine struggles to contain this kind of route: actions must be aligned with timing, must be able to recover after failure, and intermediate states must be re-enterable into the plan. Some environments need composable macro-actions, recoverable search states, and even a program structure more suitable for long-term planning than ordinary if-else.
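
To make "composable macro-actions" and "re-enterable" plans concrete, here is a minimal sketch. The `Macro` type, the action names, and the `resume` helper are hypothetical; the actual 86-macro route in heuristic_montezuma_400_policy.py may be represented quite differently.

```python
from typing import List, Tuple

Macro = Tuple[str, int]  # (primitive action name, number of steps to hold it)


def expand(route: List[Macro]) -> List[str]:
    """Flatten a route of macro-actions into the primitive actions to execute."""
    return [action for action, repeat in route for _ in range(repeat)]


def resume(route: List[Macro], completed_macros: int) -> List[str]:
    """Re-enter the plan after a checkpoint instead of replaying it from the start."""
    return expand(route[completed_macros:])


route = [("RIGHT", 3), ("JUMP", 1), ("DOWN", 2)]
print(expand(route))
print(resume(route, completed_macros=1))
```

A purely open-loop route only supports `expand`; making `resume` safe requires knowing that the environment is actually in the state the checkpoint assumes, which is where recoverable search states come in.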

This kind of failure is very valuable for HL. It tells us where the boundary lies and hints at what the next layer of abstraction should probably look like. Some feedback requires new representations and new program forms to even enter the system. The next-layer interface Montezuma points to will probably include macro-actions, recoverable states, search, and long-term memory.

A.2 Reproduction Entry Points

The commands below assume you run them from the directory containing this article, with dependencies installed from requirements.txt. They check the representative results mentioned earlier.

Pong 21

The reproduction entry point: heuristic_pong.py.

python heuristic_pong.py \
  --policy ram \
  --episodes 1 \
  --seed 0

The expected output should include episode=0 score=21.0 and mean=21.000.

Breakout 864

The reproduction entry point: heuristic_breakout.py.

rm -f /tmp/repro_breakout_864.jsonl /tmp/repro_breakout_864.csv
python heuristic_breakout.py \
  --policy ram \
  --episodes 1 \
  --seed 0 \
  --max-steps 108000 \
  --deadband 3 \
  --chase-lead-steps 6 \
  --tunnel-offset 0 \
  --launch-offset 24 \
  --fast-ball-min-vy 3 \
  --fast-low-ball-lead-steps 3 \
  --stuck-trigger-steps 1024 \
  --stuck-switch-steps 256 \
  --stuck-offset 12 \
  --stuck-release-horizon-steps 8 \
  --brick-balance-deadzone 0.01 \
  --brick-balance-bias-min-score 432 \
  --late-game-paddle-lag-px 2 \
  --late-game-lag-ball-y 170 \
  --trial-name repro_breakout_864 \
  --log-path /tmp/repro_breakout_864.jsonl \
  --summary-path /tmp/repro_breakout_864.csv

The expected output should include score=864.0 and mean=864.000.

Ant Default MPC Policy

The reproduction entry point: heuristic_ant.py, ant_envpool.xml.

rm -f /tmp/repro_ant_6146_eval5.jsonl /tmp/repro_ant_6146_eval5.csv
python heuristic_ant.py \
  --policy mpc \
  --episodes 5 \
  --seed 0 \
  --max-steps 1000 \
  --mujoco-xml-path ant_envpool.xml \
  --trial-name repro_ant_6146_eval5 \
  --log-path /tmp/repro_ant_6146_eval5.jsonl \
  --summary-path /tmp/repro_ant_6146_eval5.csv

When I re-ran locally, it was mean=6005.521, min=5776.805, max=6146.208.

HalfCheetah Staged-Tree MPC

The reproduction entry point: heuristic_halfcheetah_v5.py.

python heuristic_halfcheetah_v5.py \
  --policy mpc-staged-tree-asym-pd-cpg \
  --eval-episodes 5 \
  --eval-seed 100

When I re-ran locally, the mean over 5 episodes was 11836.693.

Montezuma 400-Point Replay

The reproduction entry point: heuristic_montezuma_400_policy.py.

python heuristic_montezuma_400_policy.py \
  --metadata-out /tmp/repro_montezuma_400.json

The expected output should include "score": 400.0 and "env_steps": 1769. This is a boundary case; don't interpret it as a general Montezuma strategy.
