Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks

Abstract

Many dexterous manipulation tasks are non-Markovian in nature — the right action depends not just on the current observation but on what has already happened. Yet today's vision-language-action (VLA) models are largely stateless, and they struggle on long-horizon, memory-dependent tasks.

We introduce Notes-to-Self, a simple recipe that augments a VLA with a language scratchpad. The scratchpad is structured into three sections — Grounding (initial object and end-effector positions), Plan (the sub-tasks needed to solve the task), and Act (which sub-tasks have been completed) — and is updated whenever the model emits a special <done> token. This gives the model both spatial memory (what was where) and temporal memory (where we are in the plan).

We evaluate on ClevrSkills-Mem, a memory-dependent split of ClevrSkills, on MemoryBench, and on a real-world pick-place-restore task with a UFactory xArm 6. Across both transformer-based (T-VLA, PaliGemma-2 3B) and recurrent (R-VLA, Mamba 130M) backbones, the scratchpad significantly improves generalization on memory-dependent tasks. On the real robot, scratchpad-augmented OpenVLA improves success rate from 0% to 65%.

ClevrSkills-Mem

ClevrSkills-Mem is a benchmark of five memory-dependent manipulation tasks built on top of ClevrSkills. Each task requires the agent to remember either the spatial state at the start of the episode, the temporal progress through a multi-stage plan, or both.

Touch-Reset-Pick
Place-Next-to-and-Restore
Stack-and-Topple
Swap
Rotate-Restore

Touch-Reset-Pick

The scene is initialized with 2–3 objects on the table. The agent must touch a specified object, return to the initial position, and finally pick a different object.

Memory type: temporal. The agent must remember which sub-task comes next after the reset.

Place-Next-to-and-Restore

The scene is initialized with 2 or more objects. The agent must place one object next to another, then restore it to its original location.

Memory type: spatial. The agent must remember the original position of the moved object.

Stack-and-Topple

With 2–4 objects, the agent must stack them in a specific order and then topple the resulting stack.

Memory type: temporal. A long-horizon task that tests tracking of multi-step plan progress.

Swap

With 2 or more objects, the agent must swap the positions of two specified objects.

Memory type: spatial and temporal. Tests both initial-position recall and current-phase tracking.

Rotate-Restore

With 2 or more objects, the agent must rotate a specified object by a predefined amount and then restore it to its original orientation.

Memory type: temporal. Requires very fine-grained tracking of relative rotation.

Method

We model the policy as $p(a_t, d_t \mid o_t, S_t, l)$, where, in addition to the action $a_t$, the model also emits a language description $d_t$ that updates a scratchpad $S_t = \{d_1, \dots, d_n\}$. The scratchpad has three sections:

Grounding — initialization conditions: object positions, end-effector position. Provides spatial memory.
Plan — the sub-tasks the model must complete to solve the task.
Act — the sub-tasks completed so far. Provides temporal memory.
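
The three sections above can be pictured as a small data structure that is rendered to text and given to the model alongside the observation. The following is a minimal sketch, not the authors' implementation: the class name `Scratchpad`, its field names, the rendered text layout, and the example coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Scratchpad:
    """Hypothetical container for the three scratchpad sections."""
    grounding: List[str] = field(default_factory=list)  # initial object / end-effector positions
    plan: List[str] = field(default_factory=list)       # sub-tasks needed to solve the task
    act: List[str] = field(default_factory=list)        # sub-tasks completed so far

    def render(self) -> str:
        """Serialize the scratchpad to plain text (the layout is an assumption)."""
        sections = [("Grounding", self.grounding), ("Plan", self.plan), ("Act", self.act)]
        lines = []
        for name, entries in sections:
            lines.append(f"{name}:")
            lines.extend(f"- {entry}" for entry in entries)
        return "\n".join(lines)


# Made-up example values, for illustration only.
pad = Scratchpad(
    grounding=["tomato at (0.42, 0.10)", "gripper at (0.30, 0.00)"],
    plan=["pick tomato", "place tomato in bowl", "restore tomato"],
)
pad.act.append("pick tomato")  # one sub-task has been completed
text = pad.render()
```

In this sketch, Grounding and Plan stay fixed after initialization while Act grows, which mirrors the split between spatial and temporal memory described above.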

The scratchpad is updated whenever the model emits a special <done> token marking the end of a sub-task. The same recipe applies to both transformer VLAs (T-VLA, built on PaliGemma-2) and recurrent VLAs (R-VLA, built on Mamba), with the recurrent variant interleaving instructions, observations, actions, and descriptions into a single sequence trained with next-token prediction.
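
The update rule can be sketched as a rollout loop in which a description is appended to the scratchpad only when the model's output carries the <done> token. This is an illustrative sketch under stated assumptions: `stub_policy` is a hypothetical stand-in for a VLA forward pass, the fixed three-steps-per-sub-task schedule is invented so the example runs, and the way <done> is attached to a description is an assumption rather than the paper's tokenization.

```python
from typing import List, Optional, Tuple

DONE = "<done>"  # special token named in the text; its exact encoding is an assumption


def stub_policy(step: int, scratchpad: List[str], plan: List[str]) -> Tuple[str, Optional[str]]:
    """Hypothetical stand-in for a VLA: returns an action and, at sub-task
    boundaries, a description prefixed with the <done> token."""
    current = plan[len(scratchpad)]      # next unfinished sub-task
    action = f"move toward goal of '{current}'"
    if step % 3 == 2:                    # pretend every sub-task takes 3 steps
        return action, f"{DONE} {current}"
    return action, None


def rollout(plan: List[str], max_steps: int = 20) -> List[str]:
    """Run the loop, appending a description to the scratchpad on each <done>."""
    scratchpad: List[str] = []           # plays the role of the Act section
    for step in range(max_steps):
        if len(scratchpad) == len(plan): # every sub-task has been marked done
            break
        _action, description = stub_policy(step, scratchpad, plan)
        if description is not None and description.startswith(DONE):
            scratchpad.append(description[len(DONE):].strip())
    return scratchpad


completed = rollout(["touch red cube", "return to start", "pick blue ball"])
```

Because the policy is conditioned on the scratchpad at every step, the completed sub-tasks recorded so far are what let it tell apart states that look the same in the current observation.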

Method overview: The model jointly produces actions and language descriptions; descriptions accumulate in the scratchpad and condition future steps.

Results on ClevrSkills-Mem

Policy rollouts of our scratchpad-augmented VLA on each ClevrSkills-Mem task: Touch-Reset-Pick, Place-Next-to-and-Restore, Stack-and-Topple, Swap, and Rotate-Restore.

Results on the ClevrSkills-Mem benchmark. Success rates for all evaluated models — with and without the scratchpad — across the five tasks, computed over 50 rollouts on unseen object starting positions. The rightmost panel shows mean performance across tasks.

Real-World Rollouts

Real-world rollouts on Pick-Place-Restore: the robot must pick a tomato, place it in a bowl, and restore the tomato to its original location. Setup is a UFactory xArm 6 with a flexible two-fingered gripper and a single RealSense D435 camera. Models are LoRA-finetuned from pretrained OpenVLA on 200 tele-operated demonstrations. Without the scratchpad, the baseline picks the tomato but cannot tell whether it should next place it in the bowl or restore it — the start and end states look almost identical.

Model	Avg. Success	Sub-task CR	Avg. Restore Distance
Human teleop	100%	3.0	3.2 cm
OpenVLA	0%	0.9	—
OpenVLA + Scratchpad (ours)	65%	2.4	10.62 cm

Results over 20 rollouts per model on Pick-Place-Restore.
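
As a rough illustration of how the table's columns could be aggregated from per-rollout logs, the sketch below computes a success rate, a mean number of completed sub-tasks, and a mean restore distance. None of this comes from the paper: the record fields, the planar Euclidean distance, the reading of "Sub-task CR" as a mean count of completed sub-tasks, and the choice to average the distance only over rollouts that reached the restore step are assumptions, and the logged values are made up.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def restore_distance(original: Point, final: Point) -> float:
    """Euclidean distance between the original and final object positions."""
    return math.dist(original, final)


def summarize(rollouts: List[dict]) -> dict:
    """Aggregate hypothetical per-rollout records into table-style metrics."""
    n = len(rollouts)
    # Only rollouts that actually put the object down again have a final position.
    distances = [
        restore_distance(r["original_xy"], r["final_xy"])
        for r in rollouts
        if r["final_xy"] is not None
    ]
    mean_distance: Optional[float] = sum(distances) / len(distances) if distances else None
    return {
        "success_rate": sum(r["success"] for r in rollouts) / n,
        "mean_subtasks_completed": sum(r["subtasks_completed"] for r in rollouts) / n,
        "mean_restore_distance_cm": mean_distance,
    }


# Made-up logs (positions in cm), for illustration only.
logs = [
    {"success": True, "subtasks_completed": 3, "original_xy": (0.0, 0.0), "final_xy": (3.0, 4.0)},
    {"success": False, "subtasks_completed": 1, "original_xy": (0.0, 0.0), "final_xy": None},
]
summary = summarize(logs)
```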

BibTeX

@inproceedings{haresh2026notestoself,
  title     = {Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks},
  author    = {Haresh, Sanjay and Dijkman, Daniel and Bhattacharyya, Apratim and Memisevic, Roland},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026}
}