
By Michael Psenka, Mike Rabbat, Aditi Krishnapriyan, Yann LeCun, Amir Bar
GRASP is a new gradient-based planner for learned dynamics (a "world model") that makes long-horizon planning practical by (1) lifting the trajectory into virtual states so optimization is parallel across time, (2) adding stochasticity directly to the state iterates for exploration, and (3) reshaping gradients so actions get clean signals while we avoid brittle "state-input" gradients through high-dimensional vision models.
Large, learned world models are becoming increasingly capable. They can predict long sequences of future observations in high-dimensional visual spaces and generalize across tasks in ways that were difficult to imagine a few years ago. As these models scale, they start to look less like task-specific predictors and more like general-purpose simulators.
But having a powerful predictive model isn't the same as being able to use it effectively for control/learning/planning. In practice, long-horizon planning with modern world models remains fragile: optimization becomes ill-conditioned, non-greedy structure creates bad local minima, and high-dimensional latent spaces introduce subtle failure modes.
In this blog post, I describe the problems that motivated this project and our approach to addressing them: why planning with modern world models can be surprisingly fragile, why long horizons are the real stress test, and what we changed to make gradient-based planning much more robust.
This blog post discusses work done with Mike Rabbat, Aditi Krishnapriyan, Yann LeCun, and Amir Bar (* denotes equal advisorship), where we propose GRASP.
What is a world model?
These days, the term "world model" is quite overloaded, and depending on the context can mean either an explicit dynamics model or some implicit, reliable internal state that a generative model relies on (e.g., when an LLM generates chess moves, whether there is some internal representation of the board). We give our loose working definition below.
Suppose you take actions $a_t \in \mathcal{A}$ and observe states $s_t \in \mathcal{S}$ (images, latent vectors, proprioception). A world model is a learned model that, given the current state and a sequence of future actions, predicts what will happen next. Formally, it defines a predictive distribution over a sequence of observed states $s_{1:T}$, conditioned on the current state $s_0$ and actions $a_{0:T-1}$:

$$p_\theta(s_{1:T} \mid s_0, a_{0:T-1})$$

that approximates the environment's true conditional $p(s_{1:T} \mid s_0, a_{0:T-1})$. For this blog post, we'll assume a Markovian model $p_\theta(s_{t+1} \mid s_t, a_t)$ for simplicity (all results here can be extended to the more general case), and when the model is deterministic it reduces to a map over states:

$$s_{t+1} = F_\theta(s_t, a_t).$$

In practice the state $s_t$ is often a learned latent representation (e.g., encoded from pixels), so the model operates in a (theoretically) compact, differentiable space. The key point is that a world model gives you a differentiable simulator; you can roll it forward under hypothetical action sequences and backpropagate through the predictions.
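As a minimal sketch of this "differentiable simulator" view (a toy linear map stands in for a learned neural $F_\theta$ here; all names are illustrative, not from the paper's code):

```python
import numpy as np

def F(s, a):
    """Toy deterministic world model s_{t+1} = F(s_t, a_t).

    A learned F_theta would be a neural network; a simple linear map
    keeps this sketch self-contained.
    """
    return 0.9 * s + a

def rollout(s0, actions):
    """Roll the world model forward under a hypothetical action sequence."""
    states = [s0]
    for a in actions:
        states.append(F(states[-1], a))
    return np.array(states)  # shape (T+1, state_dim)

s0 = np.zeros(2)
actions = [np.array([1.0, 0.0])] * 3
traj = rollout(s0, actions)
```

Because each step is differentiable, gradients of anything computed from `traj` flow back to the actions, which is what the planners below exploit.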
Planning: choosing actions by optimizing through the model
Given a start $s_0$ and a goal $g$, the simplest planner chooses an action sequence $\mathbf{a} = (a_0, \dots, a_{T-1})$ by rolling out the model and minimizing terminal error:

$$\min_{\mathbf{a}} \; \big\| \mathcal{F}_\theta^T(s_0, \mathbf{a}) - g \big\|_2^2.$$

Here we use $\mathcal{F}_\theta^T(s_0, \mathbf{a})$ as shorthand for the full rollout through the world model (dependence on model parameters $\theta$ is implicit):

$$\mathcal{F}_\theta^T(s_0, \mathbf{a}) = F_\theta\bigl(F_\theta(\cdots F_\theta(s_0, a_0)\cdots,\, a_{T-2}),\, a_{T-1}\bigr).$$
At short horizons and in low-dimensional systems, this can work quite well. But as horizons grow and models become larger and more expressive, its weaknesses become amplified.
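To make the simple shooting planner concrete, here is a sketch on the same toy linear model, where the chain-rule gradient through the rollout can be written analytically (hyperparameters and names are illustrative):

```python
import numpy as np

A = 0.9  # state Jacobian of the toy model F(s, a) = A*s + a

def F(s, a):
    return A * s + a

def rollout_final(s0, actions):
    s = s0
    for a in actions:
        s = F(s, a)
    return s

def plan(s0, g, T=10, steps=200, lr=0.05):
    """Shooting: gradient descent on terminal error through the rollout."""
    actions = np.zeros(T)
    for _ in range(steps):
        err = rollout_final(s0, actions) - g
        # chain rule: d(final)/d(a_t) = A^(T-1-t), i.e. each action's
        # gradient is the terminal error scaled by downstream Jacobians
        grads = 2.0 * err * A ** np.arange(T - 1, -1, -1)
        actions -= lr * grads
    return actions

actions = plan(s0=0.0, g=1.0)
final = rollout_final(0.0, actions)
```

For this convex toy problem gradient descent reaches the goal easily; the rest of the post is about why this stops being true for long horizons and learned, deep models.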
So why doesn't this just work at scale?
Why long-horizon planning is hard (even when everything is differentiable)
There are two separate pain points for the general world model, plus a third that is specific to learned, deep learning-based models.
1) Long-horizon rollouts create deep, ill-conditioned computation graphs
Those familiar with backpropagation through time (BPTT) may notice that we are differentiating through a model applied to itself repeatedly, which can lead to the exploding/vanishing gradients problem. In particular, if we take derivatives (note we are differentiating vector-valued functions, resulting in Jacobians that we denote with $D$) with respect to early actions (e.g., $a_0$):

$$D_{a_0} \mathcal{F}_{\theta}^{T}(s_0, \mathbf{a}) = \Bigl(\prod_{t=1}^T D_s F_\theta(s_t, a_t)\Bigr) D_{a_0}F_\theta(s_0, a_0).$$

We see that the Jacobian's conditioning can scale exponentially with time $T$:

$$\kappa\Bigl(\prod_{t=1}^T D_s F_\theta(s_t, a_t)\Bigr) \;\le\; \prod_{t=1}^T \kappa\bigl(D_s F_\theta(s_t, a_t)\bigr),$$

with the bound typically growing exponentially in $T$, leading to exploding or vanishing gradients.
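A deterministic toy illustrates the exponential blow-up. Take a hypothetical per-step state Jacobian with one expanding and one contracting direction; the product's condition number grows like $3^T$:

```python
import numpy as np

# Hypothetical per-step state Jacobian: one direction expands by 1.5x
# per step, another contracts by 0.5x per step.
J = np.diag([1.5, 0.5])

def cond_of_product(T):
    """Condition number of the product of T identical step Jacobians."""
    P = np.linalg.matrix_power(J, T)
    return np.linalg.cond(P)

# cond(J) = 3, while cond(J^T) = 3^T: exponential in the horizon, so
# gradients for early actions are simultaneously exploding in one
# direction and vanishing in the other.
```

At $T=10$ the condition number is already $3^{10} \approx 5.9 \times 10^4$; real learned Jacobians vary per step, but the multiplicative mechanism is the same.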
2) The landscape is non-greedy and full of traps
At short horizons, the greedy solution, where we move straight toward the goal at every step, is often good enough. If you only need to plan a few steps ahead, the optimal trajectory usually doesn't deviate much from "head toward $g$" at each step.

As horizons grow, two things happen. First, longer tasks are more likely to require non-greedy behavior: going around a wall, repositioning before pushing, backing up to take a better path. And as horizons grow, more of these non-greedy steps tend to be needed. Second, the optimization space itself scales with horizon: $\dim(\mathcal{A}) \cdot T$, further expanding the space of local minima for the optimization problem.

A long-horizon fix: lifting the dynamics constraint
Suppose we treat the dynamics constraint $s_{t+1} = F_\theta(s_t, a_t)$ as a soft constraint, and we instead optimize the following penalty function over both actions $\mathbf{a}$ and states $\mathbf{s}$:

$$\min_{\mathbf{s},\mathbf{a}} \mathcal{L}(\mathbf{s}, \mathbf{a}) = \sum_{t=0}^{T-1} \big\|F_\theta(s_t,a_t) - s_{t+1}\big\|_2^2, \quad \text{with } s_0 \text{ fixed and } s_T=g.$$
This is also commonly known as collocation in the planning/robotics literature. Note that the lifted formulation shares the same global minimizers as the original rollout objective (both are zero exactly when the trajectory is dynamically feasible). But the optimization landscapes are very different, and we get two immediate benefits:
- Each world model evaluation $F_\theta(s_t, a_t)$ depends only on local variables, so all $T$ terms can be computed in parallel across time, resulting in a large speed-up for longer horizons, and
- You no longer backpropagate through a single deep $T$-step composition to get a learning signal, since the previous product of Jacobians now splits into a sum of local terms, e.g.:

$$\nabla_{s_t} \mathcal{L}(\mathbf{s}, \mathbf{a}) = 2\, D_s F_\theta(s_t, a_t)^\top \bigl(F_\theta(s_t,a_t) - s_{t+1}\bigr) \;-\; 2\bigl(F_\theta(s_{t-1},a_{t-1}) - s_t\bigr).$$
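The "parallel across time" point is easy to see in code. In this sketch (toy linear $F$ again, names illustrative), every residual of the lifted objective uses only its own $(s_t, a_t, s_{t+1})$, so all timesteps are evaluated in one batched call:

```python
import numpy as np

def F(s, a):
    # toy stand-in for a learned world model, applied per timestep
    return 0.9 * s + a

def collocation_loss(states, actions):
    """Lifted objective: sum_t ||F(s_t, a_t) - s_{t+1}||^2.

    Each residual depends only on local variables, so the T terms are
    computed in one vectorized call -- no serial rollout needed.
    """
    s_t, s_next = states[:-1], states[1:]
    residuals = F(s_t, actions) - s_next   # all timesteps at once
    return np.sum(residuals ** 2)

# A dynamically feasible trajectory incurs zero penalty:
s0 = 1.0
actions = np.array([0.5, -0.2, 0.1])
states = [s0]
for a in actions:
    states.append(F(states[-1], a))
states = np.array(states)
feasible_loss = collocation_loss(states, actions)
```

Any infeasible virtual trajectory gets a positive penalty, which is exactly the signal the lifted planner descends on.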
Being able to optimize states directly also helps with exploration, as we can temporarily navigate through unphysical domains to find the optimal plan:

However, lunch is never free. And indeed, especially for deep learning-based world models, there is a critical issue that makes the above optimization quite difficult in practice.
A problem for deep learning-based world models: sensitivity of state-input gradients
The tl;dr of this section is: directly optimizing states through a deep learning-based $F_\theta$ is extremely brittle, à la adversarial robustness. Even if you train your world model in a lower-dimensional state space, the training process for the world model makes unseen state landscapes very sharp, whether it be an unseen state itself or simply a normal/orthogonal direction to the data manifold.
Adversarial robustness and the "dimpled manifold" model
Adversarial robustness originally looked at classification models $f_\theta$, and showed that by following the gradient of a chosen logit $c$ from a base image $x$ (not of class $c$), you didn't have to move far along $\nabla_x f_\theta(x)_c$ to make $f_\theta$ classify $x$ as $c$ (Szegedy et al., 2014; Goodfellow et al., 2015):

Later work has painted a geometric picture of what's happening: for data near a low-dimensional manifold $\mathcal{M}$, the training process controls behavior in tangential directions, but doesn't regularize behavior in orthogonal directions, thus leading to sensitive behavior (Stutz et al., 2019). Stated another way: $f_\theta$ has a reasonable Lipschitz constant when considering only directions tangential to the data manifold $\mathcal{M}$, but can have very high Lipschitz constants in normal directions. In fact, it often benefits the model to be sharper in these normal directions, so it can fit more complicated functions more precisely.

As a result, such adversarial examples are extremely common even for a single given model. Further, this isn't just a computer vision phenomenon; adversarial examples also appear in LLMs (Wallace et al., 2019) and in RL (Gleave et al., 2019).
While there are methods to train more adversarially robust models, there is a known trade-off between model performance and adversarial robustness (Tsipras et al., 2019): especially in the presence of many weakly-correlated variables, the model must be sharper to achieve higher performance. Indeed, most modern training algorithms, whether in computer vision or LLMs, don't train adversarial robustness out. Thus, at least until deep learning sees a major regime change, this is a problem we're stuck with.
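Input-gradient sensitivity is easy to exhibit even on a tiny random network (a toy stand-in for a trained $f_\theta$, not an actual adversarial attack on a trained classifier): nudging the input along its gradient moves the output at the full first-order rate $\varepsilon\,\|\nabla_x f\|$, which is what gradient-based state optimization exploits, for better or worse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny random 2-layer tanh network standing in for a model f_theta.
W1 = rng.normal(size=(64, 32)) / np.sqrt(32)
W2 = rng.normal(size=(1, 64)) / np.sqrt(64)

def f(x):
    return (W2 @ np.tanh(W1 @ x))[0]

def grad_f(x):
    h = np.tanh(W1 @ x)
    return W1.T @ ((1 - h ** 2) * W2[0])  # chain rule through tanh

x = rng.normal(size=32)
g = grad_f(x)
eps = 1e-3

# Aligned nudge: output moves at the first-order rate eps * ||g||.
delta_grad = f(x + eps * g / np.linalg.norm(g)) - f(x)

# Random nudge of the same size: projection onto the gradient is small,
# so the output barely moves in comparison.
u = rng.normal(size=32)
u /= np.linalg.norm(u)
delta_rand = f(x + eps * u) - f(x)
```

The gradient direction is, to first order, the single most effective way to change the output per unit of input movement; that is the same mechanism that makes directly-optimized states so good at "hacking" a deep model.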
Why is adversarial robustness an issue for world model planning?
Consider a single component of the dynamics loss we're optimizing in the lifted-state approach:

$$\big\|F_\theta(s_t, a_t) - s_{t+1}\big\|_2^2.$$

Let's further focus on just the base state $s_t$. Since world models are typically trained on state/action trajectories $(s_0, a_0, s_1, a_1, \dots)$, the state-data manifold for $F_\theta$ has dimensionality bounded by the action space:

$$\dim(\mathcal{M}) \;\le\; \dim(\mathcal{A}) + \dim(\mathcal{T}),$$

where $\mathcal{T}$ is some optional space of augmentations (e.g., translations/rotations). Thus, we can typically expect $\dim(\mathcal{M})$ to be much lower than $\dim(\mathcal{S})$, and thus: it is very easy to find adversarial examples that hack any state toward any other desired state.
As a result, the dynamics optimization

$$\sum_{t=0}^{T-1} \big\|F_\theta(s_t,a_t) - s_{t+1}\big\|_2^2$$

feels extremely "sticky," as the base points $s_t$ can easily trick $F_\theta$ into thinking it has already met its local goal.1

1. This adversarial robustness issue, while particularly bad for lifted-state approaches, isn't unique to them. Even for serial optimization methods that optimize through the full rollout map $\mathcal{F}_\theta^T$, it is possible to get into unseen states, where it is very easy to have a normal component fed into the sensitive normal directions of $F_\theta$. The action Jacobian's chain-rule expansion is

$$\Bigl(\prod_{t=1}^T D_s F_\theta(s_t, a_t)\Bigr) D_{a_0}F_\theta(s_0, a_0).$$

Consider what happens if any stage of the product has any component normal to the data manifold. ↩
Our fix
This is where our new planner GRASP comes in. The main observation: while $D_s F_\theta$ is untrustworthy and adversarial, the action space is usually low-dimensional and exhaustively trained, so $D_a F_\theta$ is actually reasonable to optimize through and doesn't suffer from the adversarial robustness issue!

At its core, GRASP builds a first-order lifted-state / collocation-based planner that depends only on action Jacobians through the world model. We thus exploit the differentiability of learned world models $F_\theta$, while not falling victim to the inherent sensitivity of the state Jacobians $D_s F_\theta$.
GRASP: Gradient RelAxed Stochastic Planner
As noted before, we start with the collocation planning objective, where we lift the states and relax the dynamics into a penalty:

$$\min_{\mathbf{s},\mathbf{a}} \mathcal{L}(\mathbf{s}, \mathbf{a}) = \sum_{t=0}^{T-1} \big\|F_\theta(s_t,a_t) - s_{t+1}\big\|_2^2, \quad \text{with } s_0 \text{ fixed and } s_T=g.$$
We then make two key additions.
Ingredient 1: Exploration by noising the state iterates
Even with a smoother objective, planning is nonconvex. We introduce exploration by injecting Gaussian noise into the virtual state updates during optimization.

A simple version:

$$s_t^{(k+1)} = s_t^{(k)} - \eta \, \nabla_{s_t} \mathcal{L}\bigl(\mathbf{s}^{(k)}, \mathbf{a}^{(k)}\bigr) + \sigma \epsilon_t^{(k)}, \quad \epsilon_t^{(k)} \sim \mathcal{N}(0, I).$$

Actions are still updated by non-stochastic descent:

$$a_t^{(k+1)} = a_t^{(k)} - \eta \, \nabla_{a_t} \mathcal{L}\bigl(\mathbf{s}^{(k)}, \mathbf{a}^{(k)}\bigr).$$

The state noise helps you "hop" between basins in the lifted space, while the actions remain guided by gradients. We found that specifically noising the states here (as opposed to the actions) strikes a good balance between exploration and the ability to find sharper minima.2
2. Because we only noise the states (and not the actions), the corresponding dynamics are not truly Langevin dynamics. ↩
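One noised update on the lifted objective can be sketched as follows (toy linear $F$ with hand-written gradients so the sketch stays self-contained; step sizes and noise scale are illustrative, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)

def F(s, a):          # toy stand-in for a learned world model
    return 0.9 * s + a

def grads(states, actions):
    """Analytic gradients of sum_t ||F(s_t,a_t) - s_{t+1}||^2 for the toy F."""
    r = F(states[:-1], actions) - states[1:]   # residuals, length T
    g_a = 2.0 * r                              # dF/da = 1 for this toy F
    g_s = np.zeros_like(states)
    g_s[:-1] += 2.0 * 0.9 * r                  # through F's state input
    g_s[1:] -= 2.0 * r                         # through the -s_{t+1} slot
    return g_s, g_a

def grasp_step(states, actions, lr=0.1, sigma=0.05):
    """States take a noisy descent step; actions a deterministic one.
    s_0 and s_T (the goal) stay pinned."""
    g_s, g_a = grads(states, actions)
    noise = sigma * rng.normal(size=states.shape)
    new_states = states - lr * g_s + noise
    new_states[0], new_states[-1] = states[0], states[-1]  # keep endpoints
    new_actions = actions - lr * g_a
    return new_states, new_actions
```

Only the state iterates receive noise, matching the update rules above; the action update is plain gradient descent.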
Ingredient 2: Reshape gradients: stop brittle state-input gradients, keep action gradients
As discussed, the fragile pathway is the gradient that flows into the state input of the world model, $D_s F_\theta$. The most straightforward initial fix is to simply stop state gradients into $F_\theta$ directly:

- Let $\bar{s}_t$ be the same value as $s_t$, but with gradients stopped.

Define the stop-gradient dynamics loss:

$$\mathcal{L}_{\text{dyn}}^{\text{sg}}(\mathbf{s},\mathbf{a}) = \sum_{t=0}^{T-1} \big\|F_\theta(\bar{s}_t, a_t) - s_{t+1}\big\|_2^2.$$
This alone doesn't work. Notice that now states only follow the previous state's step, with nothing forcing the base states to chase the next ones. As a result, there are trivial minima where the trajectory simply stalls at the start, with only the final action trying to reach the goal in a single step.
Dense goal shaping
We can view the above issue as the goal's signal being cut off entirely from earlier states. One way to fix this is to simply add a dense goal term throughout the prediction:

$$\mathcal{L}_{\text{goal}}^{\text{sg}}(\mathbf{s},\mathbf{a}) = \sum_{t=0}^{T-1} \big\|F_\theta(\bar{s}_t, a_t) - g\big\|_2^2.$$

In normal settings this would over-bias toward the greedy solution of chasing the goal directly, but in our setting this is balanced by the stop-gradient dynamics loss's bias toward feasible dynamics. The final objective is then a weighted combination of the two:

$$\mathcal{L}^{\text{sg}}(\mathbf{s},\mathbf{a}) = \mathcal{L}_{\text{dyn}}^{\text{sg}}(\mathbf{s},\mathbf{a}) + \lambda \, \mathcal{L}_{\text{goal}}^{\text{sg}}(\mathbf{s},\mathbf{a}).$$

The result is a planning optimization objective with no dependence on state gradients.
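The stop-gradient objective can be sketched on the toy linear model by routing gradients by hand: the brittle $D_s F_\theta$ pathway simply never appears in the accumulated gradients (the weight `lam` is an illustrative hyperparameter, not a value from the paper):

```python
import numpy as np

def F(s, a):                     # toy stand-in for a learned world model
    return 0.9 * s + a

def sg_loss_and_grads(states, actions, g, lam=0.5):
    """L_dyn^sg + lam * L_goal^sg for the toy model.

    The stop-gradient on F's state input means no dF/ds term is ever
    accumulated: states receive gradient only through the -s_{t+1}
    target slot, and actions through dF/da.
    """
    pred = F(states[:-1], actions)             # F(sg(s_t), a_t), all t
    r_dyn = pred - states[1:]
    r_goal = pred - g
    loss = np.sum(r_dyn ** 2) + lam * np.sum(r_goal ** 2)

    g_a = 2.0 * r_dyn + 2.0 * lam * r_goal     # dF/da = 1 for this toy F
    g_s = np.zeros_like(states)
    g_s[1:] = -2.0 * r_dyn                     # target pathway only;
    # note: no 0.9 * r term here -- the brittle dF/ds pathway is stopped
    return loss, g_s, g_a
```

In a framework like PyTorch the same routing would be a `detach()` on the state input; here the hand-written gradients make the stopped pathway explicit.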
Periodic "sync": briefly return to true rollout gradients
The lifted stop-gradient objective is great for fast, guided exploration, but it's still an approximation of the original serial rollout objective.

So every $K$ iterations, GRASP does a short refinement phase:

- Roll out from $s_0$ using the current actions $\mathbf{a}$, and take a few small gradient steps on the original serial loss:

$$\big\| \mathcal{F}_\theta^T(s_0, \mathbf{a}) - g \big\|_2^2.$$

The lifted-state optimization still provides the core of the optimization, while this refinement step adds support to keep states and actions grounded toward real trajectories. This refinement step can of course be replaced with a serial planner of your choice (e.g., CEM); the core idea is to still get some of the benefit of the full-path synchronization of serial planners, while largely retaining the benefits of lifted-state planning.
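Putting the ingredients together, here is a compact end-to-end sketch on a toy 1-D integrator (this is an illustration of the recipe described above, not the paper's implementation; model, hyperparameters, and schedule are all stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

def F(s, a):                             # toy integrator world model
    return s + a

def rollout_final(s0, actions):
    s = s0
    for a in actions:
        s = F(s, a)
    return s

def grasp_plan(s0, g, T=5, iters=60, K=10, lr=0.05, sigma=0.02, lam=0.5):
    states = np.linspace(s0, g, T + 1)   # init virtual states on a line
    actions = np.zeros(T)
    for k in range(1, iters + 1):
        # --- lifted stop-gradient step with state noise ---
        pred = F(states[:-1], actions)   # F(sg(s_t), a_t)
        r_dyn = pred - states[1:]
        r_goal = pred - g
        actions -= lr * (2 * r_dyn + 2 * lam * r_goal)  # dF/da = 1 here
        g_s = np.zeros_like(states)
        g_s[1:] = -2 * r_dyn             # state gradients: target slot only
        states += -lr * g_s + sigma * rng.normal(size=states.shape)
        states[0], states[-1] = s0, g    # endpoints stay pinned
        # --- periodic sync: a few steps on the true serial rollout loss ---
        if k % K == 0:
            for _ in range(10):
                err = rollout_final(s0, actions) - g
                actions -= lr * 2 * err * np.ones(T)    # d(final)/da_t = 1
    return actions

actions = grasp_plan(s0=0.0, g=1.0)
final_err = abs(rollout_final(0.0, actions) - 1.0)
```

The lifted steps do the parallel, noisy exploration; the sync phases ground the actions in a true rollout, so the returned plan actually reaches the goal under the real dynamics.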
How GRASP addresses long-range planning
Collocation-based planners offer a natural fix for long-horizon planning, but this optimization is quite difficult through modern world models due to adversarial robustness issues. GRASP provides a simple recipe for a smoother collocation-based planner, alongside continual stochasticity for exploration. As a result, longer-horizon planning ends up not only succeeding more often, but also finding those successes faster:

| Horizon | CEM | GD | LatCo | GRASP |
|---|---|---|---|---|
| H=40 | **61.4% / 35.3s** | 51.0% / 18.0s | 15.0% / 598.0s | 59.0% / 8.5s |
| H=50 | 30.2% / 96.2s | 37.6% / 76.3s | 4.2% / 1114.7s | **43.4% / 15.2s** |
| H=60 | 7.2% / 83.1s | 16.4% / 146.5s | 2.0% / 231.5s | **26.2% / 49.1s** |
| H=70 | 7.8% / 156.1s | 12.0% / 103.1s | 0.0% / — | **16.0% / 79.9s** |
| H=80 | 2.8% / 132.2s | 6.4% / 161.3s | 0.0% / — | **10.4% / 58.9s** |

Push-T results. Success rate (%) / median time to success. Bold = best in row. Note that median success time biases higher with higher success rates; GRASP manages to be faster despite a higher success rate.
What's next?
There's still plenty of work to be done on modern world model planners. We want to exploit the gradient structure of learned world models, and collocation (lifted-state optimization) is a natural approach for long-horizon planning, but it's crucial to understand the typical gradient structure here: smooth and informative action gradients, and brittle state gradients. We view GRASP as an initial iteration of such planners.

Extensions to diffusion-based world models (deeper latent timesteps can be viewed as smoothed versions of the world model itself), more sophisticated optimizers and noising strategies, and integrating GRASP into either a closed-loop system or RL policy learning for adaptive long-horizon planning are all natural and interesting next steps.

I genuinely think it's an exciting time to be working on world model planners. It's a funny sweet spot where the background literature (planning and control overall) is incredibly mature and well-developed, but the current setting (pure planning optimization over modern, large-scale world models) is still heavily underexplored. Once we figure out all the right ideas, though, world model planners will likely become as commonplace as RL.
For more details, read the full paper or visit the project website.
Citation
@article{psenka2026grasp,
  title={Parallel Stochastic Gradient-Based Planning for World Models},
  author={Michael Psenka and Michael Rabbat and Aditi Krishnapriyan and Yann LeCun and Amir Bar},
  year={2026},
  eprint={2602.00475},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.00475}
}
This article was originally published on the BAIR blog, and appears here with the authors' permission.
BAIR Blog is the official blog of the Berkeley Artificial Intelligence Research (BAIR) Lab.

