VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation

1Nanjing University, 2Tencent Youtu Lab, 3CASIA
sqdong@smail.nju.edu.cn, † Corresponding author, ‡ Project leader

Overview of mainstream VLA architectures. (1) Discretization-based methods convert actions into tokens and decode them directly from visual and language features, but omit robot state information, which is crucial for modeling physical dynamics. (2) Diffusion-based approaches extract vision-language features with a VLM but offload action generation to an action expert, reducing the VLM to a passive feature extractor. (3) Our method introduces a state encoder and an action query token, retains the full VLM, and distills knowledge from an expert action model, combining strong reasoning with efficient action generation.

Abstract

Vision-Language-Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA methods exhibit improved generalization and robustness. However, training them end-to-end is costly, as modeling action distributions typically requires massive datasets and heavy computation.

In this work, we propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. Our architecture retains the original VLM structure, adding only an action token and a state encoder to incorporate physical inputs, as illustrated in Figure 1. To distill action knowledge, we adopt a two-stage training strategy. First, we perform lightweight alignment by mapping VLM hidden states into the action space of the small action model, enabling effective reuse of its pretrained action decoder and avoiding expensive end-to-end pretraining. This also facilitates better transfer of action modeling capabilities to the VLM. Second, we selectively fine-tune the language model, state encoder, and action modules, enabling the system to integrate multimodal inputs with precise action generation. Specifically, the action token provides the VLM with a direct handle for predicting future actions, while the state encoder allows the model to incorporate robot dynamics not captured by vision alone.
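As a rough illustration of the first-stage objective, the sketch below combines a feature-matching term toward the small action model's latent with an action term decoded through its frozen pretrained head. The function name and the inputs `mapped_feat`, `expert_feat`, and `frozen_decoder` are our own assumptions for exposition; the paper's exact loss terms and weights may differ.

```python
# Hedged sketch of a possible stage-1 alignment objective (not the paper's
# exact loss): pull the mapped VLM action feature toward the frozen expert's
# latent, and supervise the action decoded through the frozen expert head.
import torch.nn.functional as F

def alignment_loss(mapped_feat, expert_feat, frozen_decoder, gt_action,
                   w_feat=1.0, w_act=1.0):
    # mapped_feat: VLM action-token hidden state after the action mapper
    # expert_feat: latent from the pretrained small action model (kept fixed)
    feat_term = F.mse_loss(mapped_feat, expert_feat.detach())
    act_term = F.l1_loss(frozen_decoder(mapped_feat), gt_action)
    return w_feat * feat_term + w_act * act_term  # weights are assumptions
```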

This design yields substantial efficiency gains over training large VLA models from scratch. Compared with previous state-of-the-art methods, our method achieves a 97.3% average success rate on LIBERO (an 11.8% improvement), 93.5% on LIBERO-LONG (a 24.5% improvement), and a 92.5% first-task success rate on CALVIN ABC-D (a 4.1% improvement). In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model Seer, achieving an 82.0% average success rate (a 17% improvement). These results demonstrate that action distillation effectively enables VLMs to generate precise, executable actions while substantially reducing training costs.

Framework

Overall architecture of VITA-VLA. Our model is built on VITA-1.5-7B and takes images, instructions, action tokens, and state information as inputs to generate executable actions. Visual and textual inputs are fed into the VLM. The action token acts as a learnable query, while the robot state is encoded into a single token by linear layers. An action mapper takes the hidden states of the action token from the final VLM layer and projects them to the dimensionality expected by the pretrained action decoder, which then generates actions with 7 degrees of freedom (DoF).
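To make the data flow concrete, here is a minimal, non-authoritative sketch of the forward pass. The module names, hidden sizes, and the assumption that the backbone exposes a Hugging Face-style `inputs_embeds` / `last_hidden_state` interface are ours, not the released implementation.

```python
import torch
import torch.nn as nn

class VITAVLASketch(nn.Module):
    """Illustrative skeleton of the architecture described above."""
    def __init__(self, vlm, vlm_dim=4096, expert_dim=512, state_dim=7, dof=7):
        super().__init__()
        self.vlm = vlm                                                # VITA-1.5-7B backbone (stand-in)
        self.action_query = nn.Parameter(torch.zeros(1, 1, vlm_dim))  # learnable action token
        self.state_encoder = nn.Sequential(                           # robot state -> single token
            nn.Linear(state_dim, vlm_dim), nn.GELU(), nn.Linear(vlm_dim, vlm_dim))
        self.action_mapper = nn.Linear(vlm_dim, expert_dim)           # VLM space -> decoder space
        self.action_decoder = nn.Linear(expert_dim, dof)              # pretrained expert head (stand-in)

    def forward(self, vision_text_embeds, state):
        # vision_text_embeds: (B, L, vlm_dim) image/instruction embeddings
        # state: (B, state_dim) proprioceptive robot state
        B = state.shape[0]
        state_tok = self.state_encoder(state).unsqueeze(1)            # (B, 1, vlm_dim)
        action_tok = self.action_query.expand(B, -1, -1)              # (B, 1, vlm_dim)
        seq = torch.cat([vision_text_embeds, state_tok, action_tok], dim=1)
        hidden = self.vlm(inputs_embeds=seq).last_hidden_state        # (B, L+2, vlm_dim)
        action_hidden = hidden[:, -1]                                 # hidden state at the action token
        return self.action_decoder(self.action_mapper(action_hidden)) # (B, dof) 7-DoF action
```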

Training Strategy

Our training strategy comprises two stages. In the alignment stage, we train the action mapper, action tokens, and state encoder to bridge the gap between the action output spaces of the VLM and the small action model, updating only 30 million parameters and providing a stronger starting point for subsequent fine-tuning. In the fine-tuning stage, we then perform end-to-end optimization of the entire model to further enhance overall performance.
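A minimal sketch of how the two stages could be wired up, reusing the hypothetical `VITAVLASketch` module from the framework section; the learning rates and the decision to unfreeze the whole backbone in the second stage are our assumptions.

```python
import torch

def configure_stage(model, stage):
    """Stage 'align': train only the action mapper, action token, and state encoder
    (tens of millions of parameters). Stage 'finetune': additionally unfreeze the
    language model and the action decoder for end-to-end optimization."""
    for p in model.parameters():
        p.requires_grad = False

    model.action_query.requires_grad = True
    trainable_modules = [model.action_mapper, model.state_encoder]
    if stage == "finetune":
        trainable_modules += [model.vlm, model.action_decoder]
    for module in trainable_modules:
        for p in module.parameters():
            p.requires_grad = True

    params = [p for p in model.parameters() if p.requires_grad]
    lr = 1e-4 if stage == "align" else 2e-5  # illustrative values only
    return torch.optim.AdamW(params, lr=lr)
```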

Real World Tasks

We design five real-world tasks to comprehensively evaluate the model’s capabilities, covering four canonical robotic operations: Pick, Place, Close, and Stack.

Real Robot Setup

Experimental setup overview. Our real-world robotic platform is shown on the right. The setup consists of two cameras: a base-mounted Intel RealSense D435i RGB-D camera with a resolution of 1280×720 and a gripper-mounted Dabai DCW depth camera with a resolution of 640×480, providing complementary viewpoints for perception. The robot itself is a PiPer arm with six actuated joints, controlled in radians, equipped with a Songling parallel gripper whose opening width is directly commanded for grasping. This combination allows both global scene observation and fine-grained local perception at the end-effector, facilitating precise manipulation. Demonstration data were collected via teleoperation, and the same hardware was used for inference. The platform is powered by a workstation with a single GPU, on which our model runs at approximately 0.15 s per inference step (about 6–7 Hz).
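For reference, a skeletal control loop under the timing above might look as follows; the camera and robot interfaces (`read`, `joint_positions`, `gripper_width`, `apply`) are placeholders, not the drivers' real APIs.

```python
import time

def control_loop(policy, base_cam, wrist_cam, robot, period_s=0.15):
    """Run the policy at roughly 6-7 Hz using both camera views and the arm state."""
    while True:
        t0 = time.monotonic()
        obs = {
            "base_rgb": base_cam.read(),    # 1280x720 RealSense D435i frame (placeholder API)
            "wrist_rgb": wrist_cam.read(),  # 640x480 Dabai DCW frame (placeholder API)
            "state": list(robot.joint_positions()) + [robot.gripper_width()],  # 6 joints (rad) + gripper
        }
        action = policy.predict(obs)        # 7-DoF command: 6 joint targets + gripper width
        robot.apply(action)                 # placeholder command call
        time.sleep(max(0.0, period_s - (time.monotonic() - t0)))  # pace to ~0.15 s per step
```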

Real-world Evaluation

Open Drawer

Pick Place Red Block

Pick Place Sponge

Stack Blocks

Stack Cups

LIBERO Simulation Evaluation

LIBERO-Spatial Tasks

Bowl: Between Plate and Ramekin → Plate

Bowl: Wooden Cabinet → Plate

Bowl: Cookie Box → Plate

Bowl: Stove → Plate

LIBERO-Object Tasks

BBQ Sauce in Basket

Cream Cheese in Basket

Ketchup in Basket

Milk in Basket

LIBERO-Goal Tasks

Cream in Bowl

Open Top Drawer & Put Bowl In

Push Plate to Front of Stove

Put Wine Bottle on Top of Cabinet

LIBERO-Long Tasks

Cream Cheese Box & Butter in Basket

White Mug on Plate & Chocolate Pudding to Right of Plate

Turn On Stove & Put Moka Pot

Pick Book & Place in Back Compartment of Caddy

CALVIN Simulation Evaluation

CALVIN ABC-D Tasks

Push Pink Block Left

Lift Pink Block From Cabinet

Turn On LED Light

Close Drawer

Store Block in Drawer

Take Block From Drawer

Rotate Pink Block

Turn off Light Bulb

BibTeX

@article{dong2025vita-vla,
  title={VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation},
  author={Shaoqi Dong and Chaoyou Fu and Haihan Gao and Yi-Fan Zhang and Chi Yan and Chu Wu and Xiaoyu Liu and Yunhang Shen and Jing Huo and Deqiang Jiang and Haoyu Cao and Yang Gao and Xing Sun and Ran He and Caifeng Shan},
  journal={arXiv preprint arXiv:2510.09607},
  year={2025}
}