MMMU-Pro Vision with Magistral Small

#9
by tomrance - opened

How were the MMMU-Pro (Vision) results in the Magistral report obtained for Magistral Small, which is a text-generation-only model? MMMU-Pro (Vision) is a multimodal benchmark. Will there be a Small 3.1 variant with vision and optional reasoning?

@tomrance This is a finetune of Mistral Small, so you can just take the vision projector from Mistral Small and use it with this model.

@YaTharThShaRma999 I would be interested to know whether there is a plan to release such a model, perhaps with additional training for vision and reasoning. According to the report, the (unreleased) Magistral Small combined with the Mistral Small 3.1 vision projector scores worse than Mistral Small 3.1 itself on the MMMU-Pro (Vision) benchmark. By contrast, Magistral Medium shows enormous gains over Medium 3 while using the Medium 3 vision part. It is strange that the Small model does not show a similar improvement and is even worse than Small 3.1, the model its vision part comes from. There is a version available, though it was handcrafted by @OptimusePrime: https://huggingface.co/OptimusePrime/Magistral-Small-2506-Vision . It would be interesting to see whether the MMMU-Pro (Vision) results can be reproduced and whether they are indeed worse than those of Mistral Small 3.1.

@tomrance I attempted to reproduce the MMMU-Pro results and prepared an eval script, but MMMU-Pro has 1700+ questions. When I started the eval on 1x H100, it was very slow and would have taken many hours, possibly more. I wasn't sure how long it would take, and at the time I wasn't willing to spend that much money on a run that might take even longer.
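For anyone who wants to gauge the cost before committing to a full run, here is a minimal sketch that extrapolates total time and cost from a short timed pilot. It assumes throughput stays roughly constant across the benchmark, which may not hold for a reasoning model with highly variable output lengths. The function name `estimate_eval` and all the numbers in the usage lines (pilot size, pilot duration, hourly GPU price) are hypothetical placeholders, not measurements from the run described above; only the 1700 question count comes from this thread, as a lower bound.

```python
def estimate_eval(n_questions, pilot_questions, pilot_seconds, gpu_hourly_usd):
    """Extrapolate wall-clock hours and cost of a full eval from a timed pilot.

    Assumes constant per-question latency (a simplification).
    """
    if pilot_questions <= 0:
        raise ValueError("pilot_questions must be positive")
    seconds_per_question = pilot_seconds / pilot_questions
    total_hours = seconds_per_question * n_questions / 3600.0
    total_cost = total_hours * gpu_hourly_usd
    return total_hours, total_cost


# Placeholder inputs: replace with your own pilot timing and GPU price.
hours, cost = estimate_eval(
    n_questions=1700,      # lower bound taken from "1700+ questions"
    pilot_questions=20,    # hypothetical pilot size
    pilot_seconds=600.0,   # hypothetical pilot duration
    gpu_hourly_usd=2.5,    # hypothetical hourly price
)
print(f"Estimated {hours:.1f} h, about ${cost:.2f}")
```

Timing a few dozen questions first and plugging in the result gives a rough budget before renting the GPU for the full set.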
