⚡️ Realtime-VLA FLASH Speculative Inference Framework for Diffusion-based VLAs

Jiahui Niu1,2 * Kefan Gu3,4 Yucheng Zhao4 † Shengwen Liang1 ‡ Tiancai Wang4 ‡ Xing Hu1 Ying Wang1 Huawei Li1

1State Key Lab of Processors, Institute of Computing Technology, CAS
2University of Chinese Academy of Sciences
3Nanjing University 4Dexmal

*This work was done during an internship at Dexmal. †Project lead. ‡Corresponding authors.

Why diffusion-based VLAs need faster inference

Diffusion-based vision-language-action models (dVLAs) generate high-quality action chunks, but synchronous full-path inference can leave robot commands stale in reactive scenes.

Overview of sync inference failure and FLASH speculative inference
FLASH replaces slow full-path replanning with sub-8 ms speculative rounds, executing only the verified action prefix and falling back when the draft is inconsistent.

Problem.  Action chunking reduces how often a dVLA replans, but each refresh still runs the full image-encoding, VLM-prefill, and action-denoising pipeline. During this delay, the robot keeps executing an open-loop chunk, which can become stale in reactive scenes.

Speculative Inference.   A natural remedy is to avoid rerunning the full inference path on every round. But unlike LLMs or autoregressive VLAs (AR-VLAs), dVLAs produce continuous actions through iterative denoising, leaving no token-level probability for accepting or rejecting a draft.

Insight.   Flow matching supplies the missing structure: the main model's velocity field can be probed at selected timesteps to verify whether a drafted chunk is consistent, and smooth-motion phases make nearby draft actions predictable.
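To make the verification idea concrete, here is a minimal sketch of flow-matching consistency checking. It assumes a straight-line flow path x_t = (1 - t) * noise + t * action (so the ground-truth velocity is action - noise) and probes the main Action Expert's velocity field at a few timesteps; the function and parameter names (`verify_draft`, `expert_velocity`, `tol`) are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def verify_draft(draft_chunk, expert_velocity, timesteps=(0.25, 0.75), tol=0.05):
    """Check a drafted action chunk against the main Action Expert.

    Under rectified flow matching, the path x_t = (1 - t) * noise + t * action
    has constant velocity (action - noise). We evaluate the expert's velocity
    field at a few timesteps along the draft's path and flag each action as
    consistent when the worst-case velocity error stays below `tol`.
    Illustrative sketch only; names and thresholds are assumptions.
    """
    noise = np.zeros_like(draft_chunk)            # assume a fixed, known noise sample
    implied_velocity = draft_chunk - noise        # velocity the draft implies
    errors = np.zeros(len(draft_chunk))
    for t in timesteps:
        x_t = (1 - t) * noise + t * draft_chunk   # point on the draft's flow path
        v = expert_velocity(x_t, t)               # main expert's predicted velocity
        errors = np.maximum(errors, np.linalg.norm(v - implied_velocity, axis=-1))
    return errors < tol                           # per-action consistency mask
```

With one verification call per probed timestep, the cost is a couple of Action Expert forward passes instead of a full denoising schedule.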

FLASH turns full inference into a dual-path runtime

FLASH keeps the original full path as a reliable anchor and adds a speculative path for rounds where a cheap draft can be verified.

Realtime-VLA FLASH framework and draft model architecture
FLASH keeps the original dVLA intact. The flash path drafts a candidate action chunk, verifies it with the main Action Expert at selected flow-matching timesteps, and executes only the longest consistent prefix. When verification fails or the trajectory enters a precision-sensitive phase, FLASH falls back to the full path.

❄️ Full path.   Runs Image Encoder, VLM prefill, and Action Denoise to refresh context and produce high-fidelity actions.

🔥 Flash path.   Runs the Image Encoder on the latest observation, then drafts and verifies a candidate action chunk in parallel, returning the longest consistent prefix.
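Returning "the longest consistent prefix" reduces, given a per-action consistency mask like the one verification produces, to scanning for the first failure. A minimal sketch, assuming a boolean mask per action:

```python
def accept_prefix(consistent):
    """Length of the longest verified prefix of a draft chunk.

    FLASH-style speculative execution runs only the actions before the first
    inconsistency; a length of 0 means the flash round yields nothing and the
    runtime falls back to the full path. Sketch, not the paper's code.
    """
    for i, ok in enumerate(consistent):
        if not ok:
            return i
    return len(consistent)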

Phase-aware fallback.   Smooth motion often tolerates small draft errors, while final adjustments (e.g., gripper switches) require higher-fidelity full-path actions.
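One way such a phase check could look in practice is a cheap heuristic on the planned chunk itself: route a round to the full path when the gripper command flips or the trajectory's discrete jerk is large, and allow the flash path otherwise. The dimension layout, thresholds, and function name below are illustrative assumptions, not the paper's actual criterion.

```python
import numpy as np

def needs_full_path(action_chunk, gripper_dim=-1, jerk_tol=0.2):
    """Heuristic phase check: send precision-sensitive rounds to the full path.

    Flags a round when the gripper command changes sign (open/close switch)
    or the second difference (discrete jerk) of the chunk is large; smooth
    rounds may use the flash path. Illustrative sketch with assumed layout:
    each row is one action, last dim is the gripper command.
    """
    a = np.asarray(action_chunk, dtype=float)
    gripper = a[:, gripper_dim]
    if np.any(np.diff(np.sign(gripper)) != 0):   # gripper open/close switch
        return True
    if len(a) >= 3:
        jerk = np.diff(a, n=2, axis=0)           # second difference of the trajectory
        if np.abs(jerk).max() > jerk_tol:
            return True
    return False
```

The point of the heuristic is asymmetry: false positives only cost one slower full-path round, while a false negative could execute an imprecise draft during a grasp.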

Lower latency without sacrificing task performance

7.8 ms fastest flash-path round
19.1 ms task-level average latency
3.04× speedup over Torch-π0
15 m/min extra-high conveyor speed

On LIBERO, FLASH+Triton reduces average inference latency from 58.0 ms to 19.1 ms with only a 0.3-point drop in average success rate. On real conveyor-belt sorting, lower latency reduces stale action chunks and extends the speed range for grasping moving objects.

Simulation results on four LIBERO suites
FLASH+Triton reaches a 3.04× speedup over Torch-π0 while keeping average success rate within 0.3 points.
Suite-level flash-path statistics on LIBERO
Across LIBERO suites, FLASH+Triton handles 66.8% of replanning rounds through the flash path, with accepted prefixes covering 69.7% of the replan window on average.
Conveyor-belt sorting results under synchronous inference
In real conveyor-belt sorting, FLASH+Triton is the only method with nonzero success at the extra-high 15 m/min speed.

Affiliations

Institute of Computing Technology, Chinese Academy of Sciences
University of Chinese Academy of Sciences
Nanjing University
Dexmal

Citation

@article{niu2026realtimevlaflash,
  title={Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs},
  author={Niu, Jiahui and Gu, Kefan and Zhao, Yucheng and Liang, Shengwen and Wang, Tiancai and Hu, Xing and Wang, Ying and Li, Huawei},
  journal={arXiv preprint arXiv:2605.13778},
  year={2026}
}