Many dexterous manipulation tasks are non-Markovian in nature — the right action depends not just on the current observation but on what has already happened. Yet today's vision-language-action (VLA) models are largely stateless, and they struggle on long-horizon, memory-dependent tasks.
We introduce Notes-to-Self, a simple recipe that augments a VLA
with a language scratchpad. The scratchpad is structured into
three sections — Grounding (initial object and end-effector positions),
Plan (the sub-tasks needed to solve the task), and Act
(which sub-tasks have been completed) — and is updated whenever the model
emits a special <done> token. This gives the model both
spatial memory (what was where) and temporal memory
(where we are in the plan).
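The three-section layout described above can be pictured as a small data structure. The following is a minimal sketch only: the class name `Scratchpad`, its field names, the helper methods, and the rendered text format are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Scratchpad:
    """Hypothetical three-section language scratchpad (names are assumptions)."""
    grounding: List[str] = field(default_factory=list)  # initial object / end-effector positions
    plan: List[str] = field(default_factory=list)       # sub-tasks needed to solve the task
    act: List[str] = field(default_factory=list)        # sub-tasks completed so far

    def mark_done(self, description: str) -> None:
        """Record a completed sub-task (called when a <done> token is emitted)."""
        self.act.append(description)

    def next_subtask(self) -> Optional[str]:
        """Temporal memory: the first planned sub-task not yet completed."""
        if len(self.act) < len(self.plan):
            return self.plan[len(self.act)]
        return None

    def render(self) -> str:
        """Serialize the scratchpad as text that could be fed back to the model."""
        sections = [("Grounding", self.grounding), ("Plan", self.plan), ("Act", self.act)]
        return "\n".join(f"{name}: " + "; ".join(items) for name, items in sections)


pad = Scratchpad(
    grounding=["tomato at (0.30, 0.10)"],  # made-up coordinates for illustration
    plan=["pick tomato", "place in bowl", "restore tomato"],
)
pad.mark_done("picked tomato")
print(pad.render())
```

In this sketch the Grounding section never changes after the first step, which is what lets it serve as spatial memory, while the growing Act section tracks progress through the Plan.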
We evaluate on ClevrSkills-Mem (a memory-dependent split of ClevrSkills), on MemoryBench, and on a real-world pick-place-restore task with a UFactory xArm 6. Across both transformer-based (T-VLA, PaliGemma-2 3B) and recurrent (R-VLA, Mamba 130M) backbones, the scratchpad significantly improves generalization on memory-dependent tasks. On the real robot, adding the scratchpad to OpenVLA raises the success rate from 0% to 65% over 20 rollouts.
ClevrSkills-Mem is a benchmark of five memory-dependent manipulation tasks built on top of ClevrSkills. Each task requires the agent to remember either the spatial state at the start of the episode, the temporal progress through a multi-stage plan, or both.
The scene is initialized with 2–3 objects on the table. The agent must touch a specified object, return to the initial position, and finally pick a different object.
Memory type: temporal. The agent must remember which sub-task comes next after the reset.
The scene is initialized with 2 or more objects. The agent must place one object next to another, then restore it to its original location.
Memory type: spatial. The agent must remember the original position of the moved object.
With 2–4 objects, the agent must stack them in a specific order and then topple the resulting stack.
Memory type: temporal. A long-horizon task that tests tracking of multi-step plan progress.
With 2 or more objects, the agent must swap the positions of two specified objects.
Memory type: spatial and temporal. Tests both initial-position recall and current-phase tracking.
With 2 or more objects, the agent must rotate a specified object by a predefined amount and then restore it to its original orientation.
Memory type: temporal. Requires very fine-grained tracking of relative rotation.
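We model the policy as $p(a_t, d_t \mid o_t, S_t, l)$: given the current observation $o_t$, the scratchpad $S_t$, and the language instruction $l$, the model emits not only the action $a_t$ but also a language description $d_t$ that updates the scratchpad $S_t = \{d_1, \dots, d_n\}$. The scratchpad has three sections: Grounding (initial object and end-effector positions), Plan (the sub-tasks needed to solve the task), and Act (which sub-tasks have been completed).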
The scratchpad is updated whenever the model emits a special
<done> token marking the end of a sub-task. The same recipe
applies to both transformer VLAs (T-VLA, built on PaliGemma-2)
and recurrent VLAs (R-VLA, built on Mamba), with the recurrent
variant interleaving instructions, observations, actions, and descriptions into
a single sequence trained with next-token prediction.
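The update rule above can be sketched as a rollout loop in which a description is appended to the scratchpad only when the model's text output contains the `<done>` token. This is a hedged illustration: `rollout`, `toy_policy`, the string observations, and the string-matching check are stand-ins invented for the example, and a real VLA would consume images and emit tokenized actions.

```python
from typing import Callable, List, Tuple

DONE_TOKEN = "<done>"

# Hypothetical policy signature: (observation, scratchpad, instruction) -> (action, text)
PolicyFn = Callable[[str, List[str], str], Tuple[str, str]]


def rollout(policy: PolicyFn, observations: List[str], instruction: str):
    """Run the policy over a sequence of observations, updating the scratchpad on <done>."""
    scratchpad: List[str] = []  # S_t: descriptions d_1..d_n accumulated so far
    actions: List[str] = []
    for obs in observations:
        action, text = policy(obs, scratchpad, instruction)
        actions.append(action)
        if DONE_TOKEN in text:
            # A sub-task just finished: store its description in the scratchpad.
            scratchpad.append(text.replace(DONE_TOKEN, "").strip())
    return actions, scratchpad


def toy_policy(obs: str, scratchpad: List[str], instruction: str) -> Tuple[str, str]:
    """Stand-in policy: picks the current sub-task from how many are already recorded."""
    plan = ["pick", "place", "restore"]
    current = plan[min(len(scratchpad), len(plan) - 1)]
    if obs == "subtask_finished":
        return current, f"{DONE_TOKEN} completed {current}"
    return current, ""


observations = ["moving", "subtask_finished", "moving", "subtask_finished", "subtask_finished"]
actions, scratchpad = rollout(toy_policy, observations, "pick, place, then restore the tomato")
print(actions)
print(scratchpad)
```

Note that the toy policy sees the same observation string for different sub-tasks and can only tell them apart through the scratchpad, which mirrors the ambiguity that the memory-dependent tasks are designed to create.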
Policy rollouts of our scratchpad-augmented VLA on each ClevrSkills-Mem task.
Real-world rollouts on Pick-Place-Restore: the robot must pick a tomato, place it in a bowl, and restore the tomato to its original location. Setup is a UFactory xArm 6 with a flexible two-fingered gripper and a single RealSense D435 camera. Models are LoRA-finetuned from pretrained OpenVLA on 200 tele-operated demonstrations. Without the scratchpad, the baseline picks the tomato but cannot tell whether it should next place it in the bowl or restore it — the start and end states look almost identical.
| Model | Avg. Success | Sub-task CR | Avg. Restore Distance |
|---|---|---|---|
| Human teleop | 100% | 3.0 | 3.2 cm |
| OpenVLA | 0% | 0.9 | — |
| OpenVLA + Scratchpad (ours) | 65% | 2.4 | 10.62 cm |
Results over 20 rollouts per model on Pick-Place-Restore.
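To relate the percentages in the table to rollout counts, the short sketch below converts a reported success rate into a number of successful rollouts. It also shows one plausible way a restore distance could be measured, as the Euclidean distance between the object's original and final positions; that definition, the function names, and the example coordinates are assumptions, since the page does not specify how the metric is computed.

```python
import math
from typing import Sequence


def success_count(rate_percent: float, n_rollouts: int) -> int:
    """Number of successful rollouts implied by a reported success rate."""
    return round(rate_percent / 100 * n_rollouts)


def restore_distance_cm(original: Sequence[float], final: Sequence[float]) -> float:
    """Assumed metric: Euclidean distance (cm) between original and restored positions."""
    return math.dist(original, final)


# 65% success over 20 rollouts corresponds to 13 successful rollouts.
print(success_count(65, 20))

# Made-up positions in cm, purely to illustrate the distance computation.
print(restore_distance_cm((0.0, 0.0), (3.0, 4.0)))
```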
@inproceedings{haresh2026notestoself,
title = {Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks},
author = {Haresh, Sanjay and Dijkman, Daniel and Bhattacharyya, Apratim and Memisevic, Roland},
booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
year = {2026}
}