---
license: other
license_name: apple
license_link: https://github.com/apple/ml-fastvlm/blob/main/LICENSE
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
---

# FastVLM-1.5B-Stage3

## Introduction

This is FastVLM-1.5B-Stage3, a multimodal language model that can understand visual inputs, act as an agent, comprehend long videos and capture events, and generate structured outputs.

This model is exported from the GitHub repository [apple/ml-fastvlm](https://github.com/apple/ml-fastvlm). The original weights are published as [llava-fastvithd_1.5b_stage3.zip](https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_1.5b_stage3.zip).
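If you want the original checkpoint archive rather than this Transformers export, it can be fetched directly from the link above; the sketch below assumes `wget` and `unzip` are available on your system.

```bash
# Download and unpack the original Stage-3 weights archive.
wget https://ml-site.cdn-apple.com/datasets/fastvlm/llava-fastvithd_1.5b_stage3.zip
unzip llava-fastvithd_1.5b_stage3.zip
```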
### Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'FastVLM-1.5B-Stage3'

# The checkpoint ships custom modeling code, so trust_remote_code is required.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype='auto', trust_remote_code=True)
```
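With the model and tokenizer loaded, text can be generated through the standard `generate` API. The snippet below is a minimal text-only sketch; the exact chat template and image-input convention are defined by the checkpoint's remote code, so treat the plain-string prompt here as an assumption.

```python
import torch

# Minimal text-only generation sketch. The plain-string prompt is an
# assumption; the checkpoint's remote code defines the real chat/image format.
prompt = 'Describe what a vision-language model does.'
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```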
### Export to MNN

The model can be converted to the [MNN](https://github.com/alibaba/MNN) format with the `llmexport.py` script from the MNN repository:

```bash
git clone https://github.com/alibaba/MNN
cd MNN/transformers/llm/export
python llmexport.py --path /path/to/FastVLM-1.5B-Stage3 --export mnn
```

## Citation

If you find our work helpful, please cite:

```
@InProceedings{fastvlm2025,
  author    = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title     = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
}
```