# F1: A Vision Language Action Model Bridging Understanding and Generation to Actions
[Paper](https://arxiv.org/abs/2509.06951)
[Code](https://github.com/InternRobotics/F1-VLA)
[Project Page](https://aopolin-lv.github.io/F1-VLA)
## 🚀 Key Innovations
- **🧠 Predictive Inverse Dynamics**: Visual foresight generation for planning-based control
- **🏗️ Mixture-of-Transformer**: Three specialized experts (Understanding, Generation, Action)
- **📈 Three-Stage Training**: Progressive alignment, pretraining, and adaptation
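The predictive inverse dynamics idea in the first bullet can be illustrated with a toy sketch: first imagine a future observation (foresight), then infer the action that moves the current observation to it. Everything below is an assumption made for illustration: observations are 2D points rather than images, and `predict_foresight`, `inverse_dynamics`, and `control_step` are hypothetical names, not functions from the F1-VLA codebase.

```python
from typing import Tuple

Vec = Tuple[float, float]


def predict_foresight(obs: Vec, goal: Vec, step: float = 0.5) -> Vec:
    """Hypothetical stand-in for the generation expert.

    Imagines the next observation as a point part-way from obs to goal.
    """
    return (
        obs[0] + step * (goal[0] - obs[0]),
        obs[1] + step * (goal[1] - obs[1]),
    )


def inverse_dynamics(obs: Vec, foresight: Vec) -> Vec:
    """Hypothetical stand-in for the action expert.

    Returns the displacement that takes obs to the imagined foresight.
    """
    return (foresight[0] - obs[0], foresight[1] - obs[1])


def control_step(obs: Vec, goal: Vec) -> Tuple[Vec, Vec]:
    """One planning-based control step: foresight first, then the action."""
    foresight = predict_foresight(obs, goal)
    action = inverse_dynamics(obs, foresight)
    return foresight, action


foresight, action = control_step((0.0, 0.0), (2.0, 4.0))
print("foresight:", foresight, "action:", action)
```

In the real model the foresight would be a generated image and the action a robot command; the sketch only shows the order of the two steps.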
## 🤖 Real-World Robot Experiments
F1 is evaluated on diverse manipulation tasks across three robot platforms: Genie-1, Franka, and ARX LIFT II.
## 📊 Performance Summary
| Task | Platform | F1 | π0 | Improvement |
|:--------:|:------------:|:------------------:|:------------:|:---------------:|
| Multi-task | Genie-1 | 82.2% | 65.2% | +17.0% |
| Adaptation | Franka | 66.7% | 53.3% | +13.4% |
| Long-horizon | ARX LIFT II | 40.0% | 0.0% | +40.0% |
| Dynamic Env | ARX LIFT II | 66.7% | 33.3% | +33.4% |
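The Improvement column is the absolute difference between the F1 and π0 columns, which means it is measured in percentage points rather than as a relative change. A minimal sketch that recomputes it from the table values (the dictionary layout and the `improvement_pp` helper are illustrative assumptions, not part of the official repo):

```python
# Scores copied from the Performance Summary table, in percent: (F1, π0).
results = {
    "Multi-task": (82.2, 65.2),
    "Adaptation": (66.7, 53.3),
    "Long-horizon": (40.0, 0.0),
    "Dynamic Env": (66.7, 33.3),
}


def improvement_pp(f1_score: float, baseline_score: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal like the table."""
    return round(f1_score - baseline_score, 1)


improvements = {
    task: improvement_pp(f1, pi0) for task, (f1, pi0) in results.items()
}

for task, gain in improvements.items():
    print(f"{task}: +{gain:.1f} pp")
```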
## Usage
For installation, training, and inference instructions, please refer to the official [F1-VLA](https://github.com/InternRobotics/F1-VLA) repository.
## 📚 Citation
If you find our work helpful, please cite:
```bibtex
@article{f1_vla_2025,
title={F1: A Vision Language Action Model Bridging Understanding and Generation to Actions},
author={Qi Lv and Weijie Kong and Hao Li and Jia Zeng and Zherui Qiu and Delin Qu and Haoming Song and Qizhi Chen and Xiang Deng and Jiangmiao Pang},
journal={arXiv preprint arXiv:2509.06951},
year={2025},
url={https://arxiv.org/abs/2509.06951}
}
```
## License
This work is licensed under [CC BY-NC-SA 4.0](LICENSE).
## Acknowledgements
This repository builds on [LeRobot](https://github.com/huggingface/lerobot), [Any4LeRobot](https://github.com/Tavish9/any4lerobot/), and [VAR](https://github.com/FoundationVision/VAR).