---
pipeline_tag: robotics
library_name: transformers
license: cc-by-nc-sa-4.0
tags:
- vision-language-model
- manipulation
- robotics
---


# F1: A Vision Language Action Model Bridging Understanding and Generation to Actions

[![Paper](https://img.shields.io/badge/Paper-arXiv-red.svg)](https://arxiv.org/abs/2509.06951) [![Code](https://img.shields.io/badge/GitHub-Code-800820?logo=github)](https://github.com/InternRobotics/F1-VLA) [![Website](https://img.shields.io/badge/Website-Pages-blue.svg)](https://aopolin-lv.github.io/F1-VLA)

## 🚀 Key Innovations

- **🧠 Predictive Inverse Dynamics**: Visual foresight generation for planning-based control
- **🏗️ Mixture-of-Transformer**: Three specialized experts (Understanding, Generation, Action); a toy sketch of this pipeline follows the list
- **📈 Three-Stage Training**: Progressive alignment, pretraining, and adaptation
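To make the expert split and the predict-then-act loop concrete, here is a minimal PyTorch sketch of the idea: an understanding expert fuses observation and instruction tokens, a generation expert produces visual foresight tokens, and an action expert performs inverse dynamics from the current state toward that foresight. Every module name, dimension, and fusion choice below is an illustrative assumption, not the released architecture; refer to the official [F1-VLA](https://github.com/InternRobotics/F1-VLA) repo for the actual implementation.

```python
# Minimal sketch of the predictive-inverse-dynamics idea behind F1.
# All names, dimensions, and wiring here are illustrative assumptions.
import torch
import torch.nn as nn


class ExpertBlock(nn.Module):
    """One specialist transformer (understanding / generation / action)."""
    def __init__(self, dim: int = 512, depth: int = 4, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.encoder(tokens)


class F1Sketch(nn.Module):
    """Mixture-of-Transformer sketch: understand -> foresee -> act."""
    def __init__(self, dim: int = 512, action_dim: int = 7, horizon: int = 8):
        super().__init__()
        self.understanding = ExpertBlock(dim)  # fuses vision + language tokens
        self.generation = ExpertBlock(dim)     # predicts visual foresight tokens
        self.action = ExpertBlock(dim)         # inverse dynamics over (now, goal)
        self.action_head = nn.Linear(dim, action_dim * horizon)
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, obs_tokens: torch.Tensor, lang_tokens: torch.Tensor):
        # 1) Understanding expert fuses the current observation and instruction.
        context = self.understanding(torch.cat([obs_tokens, lang_tokens], dim=1))
        # 2) Generation expert produces foresight tokens: a prediction of how
        #    the scene should look once the task progresses.
        foresight = self.generation(context)
        # 3) Action expert solves inverse dynamics: which action chunk moves
        #    the scene from the current state to the predicted foresight?
        fused = self.action(torch.cat([context, foresight], dim=1))
        actions = self.action_head(fused.mean(dim=1))
        return actions.view(-1, self.horizon, self.action_dim), foresight


# Shape check with dummy vision/language tokens.
model = F1Sketch()
obs = torch.randn(1, 196, 512)   # e.g. patch tokens from an image encoder
lang = torch.randn(1, 16, 512)   # e.g. embedded instruction tokens
action_chunk, _ = model(obs, lang)
print(action_chunk.shape)        # torch.Size([1, 8, 7])
```

The design point this sketch tries to capture is that actions are conditioned on a predicted future observation rather than on the current frame alone, which is what turns action inference into an inverse dynamics problem.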

## 🤖 Real-World Robot Experiments

Diverse manipulation tasks across multiple robot platforms.

## 📊 Performance Summary

| Task | Platform | F1 | π0 | Improvement |
|:----:|:--------:|:--:|:--:|:-----------:|
| Multi-task | Genie-1 | 82.2% | 65.2% | +17.0% |
| Adaptation | Franka | 66.7% | 53.3% | +13.4% |
| Long-horizon | ARX LIFT II | 40.0% | 0.0% | +40.0% |
| Dynamic Env | ARX LIFT II | 66.7% | 33.3% | +33.4% |

## Usage

Please refer to our official repo, [F1-VLA](https://github.com/InternRobotics/F1-VLA).

## 📚 Citation

If you find our work helpful, please cite:

```bibtex
@article{f1_vla_2025,
  title={F1: A Vision Language Action Model Bridging Understanding and Generation to Actions},
  author={Qi Lv and Weijie Kong and Hao Li and Jia Zeng and Zherui Qiu and Delin Qu and Haoming Song and Qizhi Chen and Xiang Deng and Jiangmiao Pang},
  journal={arXiv preprint arXiv:2509.06951},
  year={2025},
  url={https://arxiv.org/abs/2509.06951}
}
```

## License

This work is licensed under [CC BY-NC-SA 4.0](LICENSE).

## Acknowledgements

This repository builds on [LeRobot](https://github.com/huggingface/lerobot), [Any4lerobot](https://github.com/Tavish9/any4lerobot/), and [VAR](https://github.com/FoundationVision/VAR).