⚡️ Realtime-VLA FLASH Speculative Inference Framework for Diffusion-based VLAs

Jiahui Niu1,2 * Kefan Gu3,4 Yucheng Zhao4 † Shengwen Liang1 ‡ Tiancai Wang4 ‡ Xing Hu1 Ying Wang1 Huawei Li1

1State Key Lab of Processors, Institute of Computing Technology, CAS
2University of Chinese Academy of Sciences
3Nanjing University 4Dexmal

*This work was done during an internship at Dexmal. †Project lead. ‡Corresponding authors.

Why diffusion-based VLAs need faster inference

Diffusion-based vision-language-action models (dVLAs) generate high-quality action chunks, but synchronous full-path inference can leave robot commands stale in reactive scenes.

Overview of sync inference failure and FLASH speculative inference
FLASH replaces slow full-path replanning with sub-8 ms speculative rounds, executing only the verified action prefix and falling back when the draft is inconsistent.

Problem.  Action chunking reduces how often a dVLA replans, but each refresh still runs the full image-encoding, VLM-prefill, and action-denoising pipeline. During this delay, the robot keeps executing an open-loop chunk, which can become stale in reactive scenes.

Speculative Inference.   A natural remedy is to avoid rerunning the full inference path on every round. But unlike LLMs or autoregressive VLAs (AR-VLAs), dVLAs produce continuous actions through iterative denoising, leaving no token-level probability for accepting or rejecting a draft.

Insight.   Flow matching supplies the missing structure: the main model's velocity field can be probed at selected timesteps to verify whether a drafted chunk is consistent, and smooth-motion phases make nearby draft actions predictable.
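To make the verification idea concrete, here is a minimal sketch of flow-matching consistency checking. It assumes a straight-line flow path x_t = (1 - t) * noise + t * action (so the ground-truth velocity is action - noise) and probes the main Action Expert's velocity field at a few timesteps; the function and parameter names (`verify_draft`, `expert_velocity`, `tol`) are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def verify_draft(draft_chunk, expert_velocity, timesteps=(0.25, 0.75), tol=0.05):
    """Check a drafted action chunk against the main Action Expert.

    Under rectified flow matching, the path x_t = (1 - t) * noise + t * action
    has constant velocity (action - noise). We evaluate the expert's velocity
    field at a few timesteps along the draft's path and flag each action as
    consistent when the worst-case velocity error stays below `tol`.
    Illustrative sketch only; names and thresholds are assumptions.
    """
    noise = np.zeros_like(draft_chunk)            # assume a fixed, known noise sample
    implied_velocity = draft_chunk - noise        # velocity the draft implies
    errors = np.zeros(len(draft_chunk))
    for t in timesteps:
        x_t = (1 - t) * noise + t * draft_chunk   # point on the draft's flow path
        v = expert_velocity(x_t, t)               # main expert's predicted velocity
        errors = np.maximum(errors, np.linalg.norm(v - implied_velocity, axis=-1))
    return errors < tol                           # per-action consistency mask
```

With one verification call per probed timestep, the cost is a couple of Action Expert forward passes instead of a full denoising schedule.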

FLASH turns full inference into a dual-path runtime

FLASH keeps the original full path as a reliable anchor and adds a speculative path for rounds where a cheap draft can be verified.

Realtime-VLA FLASH framework and draft model architecture
FLASH keeps the original dVLA intact. The flash path drafts a candidate action chunk, verifies it with the main Action Expert at selected flow-matching timesteps, and executes only the longest consistent prefix. When verification fails or the trajectory enters a precision-sensitive phase, FLASH falls back to the full path.

❄️ Full path.   Runs Image Encoder, VLM prefill, and Action Denoise to refresh context and produce high-fidelity actions.

🔥 Flash path.   Runs the Image Encoder on the latest observation, then drafts and verifies a candidate action chunk in parallel, returning the longest consistent prefix.
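Returning "the longest consistent prefix" reduces, given a per-action consistency mask like the one verification produces, to scanning for the first failure. A minimal sketch, assuming a boolean mask per action:

```python
def accept_prefix(consistent):
    """Length of the longest verified prefix of a draft chunk.

    FLASH-style speculative execution runs only the actions before the first
    inconsistency; a length of 0 means the flash round yields nothing and the
    runtime falls back to the full path. Sketch, not the paper's code.
    """
    for i, ok in enumerate(consistent):
        if not ok:
            return i
    return len(consistent)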

Phase-aware fallback.   Smooth motion often tolerates small draft errors, while final adjustments (e.g., gripper switches) require higher-fidelity full-path actions.
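One way such a phase check could look in practice is a cheap heuristic on the planned chunk itself: route a round to the full path when the gripper command flips or the trajectory's discrete jerk is large, and allow the flash path otherwise. The dimension layout, thresholds, and function name below are illustrative assumptions, not the paper's actual criterion.

```python
import numpy as np

def needs_full_path(action_chunk, gripper_dim=-1, jerk_tol=0.2):
    """Heuristic phase check: send precision-sensitive rounds to the full path.

    Flags a round when the gripper command changes sign (open/close switch)
    or the second difference (discrete jerk) of the chunk is large; smooth
    rounds may use the flash path. Illustrative sketch with assumed layout:
    each row is one action, last dim is the gripper command.
    """
    a = np.asarray(action_chunk, dtype=float)
    gripper = a[:, gripper_dim]
    if np.any(np.diff(np.sign(gripper)) != 0):   # gripper open/close switch
        return True
    if len(a) >= 3:
        jerk = np.diff(a, n=2, axis=0)           # second difference of the trajectory
        if np.abs(jerk).max() > jerk_tol:
            return True
    return False
```

The point of the heuristic is asymmetry: false positives only cost one slower full-path round, while a false negative could execute an imprecise draft during a grasp.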

Lower latency without sacrificing task performance

7.8 ms fastest flash-path round
19.1 ms task-level average latency
3.04× speedup over Torch-π0
15 m/min extra-high conveyor speed

On LIBERO, FLASH+Triton reduces average inference latency from 58.0 ms to 19.1 ms with only a 0.3-point drop in average success rate. On real conveyor-belt sorting, lower latency reduces stale action chunks and extends the speed range for grasping moving objects.

Simulation results on four LIBERO suites
FLASH+Triton reaches a 3.04× speedup over Torch-π0 while keeping average success rate within 0.3 points.
Suite-level flash-path statistics on LIBERO
Across LIBERO suites, FLASH+Triton handles 66.8% of replanning rounds through the flash path, with accepted prefixes covering 69.7% of the replan window on average.
Conveyor-belt sorting results under synchronous inference
In real conveyor-belt sorting, FLASH+Triton is the only method with nonzero success at the extra-high 15 m/min speed.

Affiliations

Institute of Computing Technology, Chinese Academy of Sciences
University of Chinese Academy of Sciences
Nanjing University
Dexmal

Citation

@article{niu2026realtimevlaflash,
  title={Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs},
  author={Niu, Jiahui and Gu, Kefan and Zhao, Yucheng and Liang, Shengwen and Wang, Tiancai and Hu, Xing and Wang, Ying and Li, Huawei},
  journal={arXiv preprint arXiv:2605.13778},
  year={2026}
}