Pretrained on 0.4M trajectories from the Open X-Embodiment dataset, the CogACT-VLA model quickly adapts to a new robot and environment with a few hundred trajectories.
It not only demonstrates significantly higher success rates than previous VLAs, but also exhibits remarkable object generalization capabilities and task robustness.
(All videos are played at 8x speed, with static frames (waiting for the human's next instruction) removed for better demonstration. Raw videos can be found here.)
CogACT-VLA model architecture.
Our core idea is to leverage the cognitive information extracted by powerful VLMs to guide the action prediction of a specialized action module. CogACT-VLA consists of three componentized modules: a Vision module, a Language module, and an Action module.
The Vision and Language modules are built upon a pretrained Prismatic-7B VLM and finetuned end-to-end together with an Action module of up to 300M parameters. Our primary training data is a subset of the Open X-Embodiment (OXE) dataset.
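To make the modular design concrete, below is a minimal PyTorch-style sketch of how the three modules could be wired together at a single timestep. The class and argument names (e.g. `cognition_feature`, `action_module`) are illustrative assumptions for clarity, not the released implementation.

```python
import torch.nn as nn

class CogACTStyleVLA(nn.Module):
    """Schematic three-module VLA: Vision -> Language (VLM) -> diffusion Action module.

    Illustrative sketch only; module names and interfaces are assumptions,
    not the released CogACT-VLA code.
    """

    def __init__(self, vision_encoder, language_model, action_module):
        super().__init__()
        self.vision_encoder = vision_encoder    # e.g. the Prismatic-7B visual backbone
        self.language_model = language_model    # e.g. the Prismatic-7B LLM backbone
        self.action_module = action_module      # diffusion-based action head (up to ~300M params)

    def forward(self, image, instruction_tokens, noisy_actions, timestep):
        # 1) Vision module: image -> visual tokens.
        visual_tokens = self.vision_encoder(image)
        # 2) Language module: visual + text tokens -> a condensed "cognition" feature
        #    summarizing what the VLM understood about the scene and the task.
        cognition_feature = self.language_model(visual_tokens, instruction_tokens)
        # 3) Action module: denoise a chunk of future actions conditioned on the cognition feature.
        return self.action_module(noisy_actions, timestep, cond=cognition_feature)
```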
Adaptive ensemble strategy at inference. |
During inference, our model predicts actions for multiple future time steps. We propose an Adaptive Action Ensemble (AAE) strategy that takes into account the similarity between the actions to be aggregated, as shown in the figure above. This avoids unreasonable aggregation of actions from different modes.
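As a rough illustration, the sketch below aggregates the predictions of the same future timestep made at successive inference steps, assuming cosine similarity and exponential weighting with a temperature `alpha`; these names and the exact similarity measure are assumptions for illustration and may differ from the paper's formulation.

```python
import numpy as np

def adaptive_action_ensemble(current, history, alpha=1.0):
    """Aggregate predictions of the *same* future timestep made at different inference steps.

    current : (D,) action for timestep t predicted at the current step.
    history : (K, D) actions for timestep t predicted at the K previous steps.
    alpha   : sharpness of the similarity weighting (illustrative hyperparameter).
    """
    if len(history) == 0:
        return current
    preds = np.vstack([history, current[None]])                        # (K+1, D)
    # Cosine similarity between the current prediction and every prediction.
    norms = np.linalg.norm(preds, axis=1) * np.linalg.norm(current) + 1e-8
    sims = preds @ current / norms                                     # (K+1,)
    # Dissimilar (different-mode) predictions receive exponentially small weights,
    # so actions from different modes are not blended into an invalid average.
    weights = np.exp(alpha * sims)
    weights /= weights.sum()
    return weights @ preds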
We first evaluate CogACT-VLA in the SIMPLER evaluation environment. This simulation platform is designed to bridge the real-to-sim control and visual gap for the Google robot and the WidowX robot. A strong correlation between the performance in SIMPLER and in the real world has been demonstrated by extensive evaluation of various VLA models. We compare CogACT-VLA with existing VLA models, including RT-1, RT-1-X, RT-2-X, Octo, and OpenVLA.
SIMPLER offers two evaluation settings: Visual Matching, which closely replicates the scene appearance of real-world tasks, and Variant Aggregations, which introduces variations by modifying the background, lighting, distractors, table textures, etc.
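For readers who want to reproduce such an evaluation, here is a minimal rollout sketch using the gym-style interface of the `simpler_env` package, with a random policy standing in for the VLA. The helper names follow the SimplerEnv README and may differ across package versions.

```python
import simpler_env
from simpler_env.utils.env.observation_utils import get_image_from_maniskill2_obs_dict

# A Visual Matching task on the Google robot; other task names cover the WidowX robot.
env = simpler_env.make("google_robot_pick_coke_can")
obs, reset_info = env.reset()
instruction = env.get_language_instruction()

done, truncated = False, False
while not (done or truncated):
    image = get_image_from_maniskill2_obs_dict(env, obs)
    # Replace the random action with a policy call, e.g.
    # action = policy.predict(image, instruction)   # hypothetical policy interface
    action = env.action_space.sample()
    obs, reward, done, truncated, info = env.step(action)

print(info.get("episode_stats", {}))
```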
Evaluation and comparison on SIMPLER's Google robot tasks. All models are trained on the OXE dataset (except for RT-1 which is trained only on OXE's Google robot subset). |
Evaluation and comparison on SIMPLER's WidowX robot tasks. |
The results show that CogACT-VLA outperforms existing VLA models on both the Google robot and the WidowX robot tasks by a large margin.
We evaluate CogACT-VLA with a Realman robot on real-world tasks such as picking, stacking, and placing variously colored objects. We collected a dataset of 391 demonstrations in total and finetuned the different models on it. As shown in the table below, our model outperforms OpenVLA and Octo-Base and achieves the highest average success rates.
Real-world evaluation with the Realman robot across three tasks, each with 20-40 trials of random configurations. All models are pretrained on OXE and finetuned on the same data. |
Generalization evaluation on the Realman Robot with unseen tables and distractors. |
Generalization evaluation on the Realman Robot with unseen colors, shapes, and categories. |
We further use a Franka arm to evaluate our model's real-world performance and compare it with previous methods. As in the experiments on the Realman robot, we collect training data and finetune the different models for evaluation.
Real-world evaluation on the Franka Robot across four tasks. All models are pretrained on OXE and finetuned on the same data. |
We evaluate various action model architectures on the Google Robot (GR) and WidowX Robot (WR) tasks in SIMPLER. The architectures examined include MLPs with depths of 3 and 7 layers, as well as a series of Diffusion Transformers (DiTs) of varying sizes. Both the MLP and DiT structures achieve higher success rates as model size increases, and DiT significantly outperforms MLP. Notably, DiT-Large achieves the highest average success rate of 64.8%. The average success rate of the diffusion transformers is approximately linear in the logarithm of the model size, indicating favorable scaling behavior of the action module.
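For reference, the standard DiT size presets from the original DiT paper make the "model size" axis concrete; whether CogACT-VLA's DiT variants use exactly these hyperparameters is an assumption, and the parameter estimate below is only a back-of-the-envelope count.

```python
# Standard DiT width/depth presets (Peebles & Xie); shown only to make the
# "model size" axis of the ablation concrete -- the CogACT-VLA action-module
# hyperparameters may differ from these presets.
DIT_PRESETS = {
    "DiT-Small": dict(depth=12, hidden_size=384,  num_heads=6),
    "DiT-Base":  dict(depth=12, hidden_size=768,  num_heads=12),
    "DiT-Large": dict(depth=24, hidden_size=1024, num_heads=16),
}

def approx_param_count(depth: int, hidden_size: int, **_) -> int:
    # Rough transformer parameter count: ~12 * depth * hidden^2 (attention + MLP),
    # ignoring embeddings and conditioning layers.
    return 12 * depth * hidden_size ** 2

for name, cfg in DIT_PRESETS.items():
    print(f"{name}: ~{approx_param_count(**cfg) / 1e6:.0f}M parameters")
```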
We compare the proposed Adaptive Action Ensemble with the two strategies introduced in ACT: Action Chunking and Temporal Ensemble. The results show that our Adaptive Ensemble outperforms both, which we attribute to the adaptive similarity weighting between the current and historical predictions.
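For contrast, ACT's Temporal Ensemble uses fixed exponential weights exp(-m*i) that depend only on the age of each prediction, not on how similar the predictions are. The sketch below (with `m` denoting ACT's aggregation hyperparameter) highlights what the adaptive similarity weighting replaces.

```python
import numpy as np

def temporal_ensemble(predictions, m=0.1):
    """ACT-style Temporal Ensemble: fixed weights exp(-m * i), where i = 0 is the
    oldest prediction of the current action. Unlike the adaptive ensemble above,
    the weights ignore how similar the predictions actually are."""
    predictions = np.asarray(predictions)            # (K, D), oldest first
    weights = np.exp(-m * np.arange(len(predictions)))
    weights /= weights.sum()
    return weights @ predictions
```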
@article{li2024cogact,
title={CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation},
author={Li, Qixiu and Liang, Yaobo and Wang, Zeyu and Luo, Lin and Chen, Xi and Liao, Mozheng and Wei, Fangyun and Deng, Yu and Xu, Sicheng and Zhang, Yizhong and others},
journal={arXiv preprint arXiv:2411.19650},
year={2024}
}