Gemma 3 fine tuning max token length

#22
by mukhayy - opened

Looking to fine-tune google/gemma-3-12b-it with my dataset of around 10k examples. But my dataset outputs are quite lengthy (some of them may reach 125k tokens, with the average being around 60k), so I thought I might take advantage of this model's max_position_embeddings = 131072. However, I haven't seen any fine-tuning example that sets max_seq_length of trl.SFTTrainer to 131072.
Is this doable, or does 131072 only apply to inference? How should people approach fine-tuning when the dataset outputs are this long? Can you also tell me what hardware and number of GPUs is the best option in your experience?
Thank you

Hi @mukhayy ,

Welcome to the Google Gemma family of open models. The max_position_embeddings parameter of google/gemma-3-12b-it defines the maximum sequence length the model was pre-trained to handle; its positional encodings and attention mechanisms are designed to work up to that length, and this applies to fine-tuning as well as inference. Please consider a few things before you fine-tune the model with such long sequences:

  1. Fine-tuning with such long sequences requires very powerful hardware and a large amount of GPU memory, since activation memory grows with the sequence length.
  2. Consider packing multiple examples into a single max_seq_length input to make more efficient use of GPU memory; trl.SFTTrainer can handle this with packing=True (see the configuration sketch after this list). This is crucial if many of your inputs are shorter, even if the outputs are long.
  3. Ensure your dataset adheres to the chat template or instruction format that google/gemma-3-12b-it was instruction-tuned on. This is usually something like:

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
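Before committing to a max_seq_length, it also helps to measure how long your examples actually are once the chat template is applied, so you know whether you really need the full 131072. Below is a small sanity-check sketch, assuming text-only training examples; the message contents are placeholders.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")

# One text-only training example in chat format (placeholder contents).
messages = [
    {"role": "user", "content": "Summarise the attached 100-page filing."},
    {"role": "assistant", "content": "<your long target output, possibly ~60k tokens>"},
]

# Token count after the Gemma chat template is applied; compare this against
# the max_seq_length you plan to pass to SFTConfig/SFTTrainer.
ids = tokenizer.apply_chat_template(messages, tokenize=True)
print(len(ids))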

  4. Use parameter-efficient fine-tuning (QLoRA/LoRA): this is critical for fine-tuning a model as large as google/gemma-3-12b-it on consumer-grade or even many professional GPUs, since full fine-tuning is prohibitively expensive in terms of VRAM and compute. QLoRA (quantized LoRA) is even more memory-efficient in such scenarios. You can enable it by passing a LoraConfig to your SFTTrainer, as in the sketch below.
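The following is a rough QLoRA sketch, assuming reasonably recent transformers, trl, peft, and bitsandbytes releases. The LoRA rank/alpha, target modules, sequence length, and dataset path are common starting points or placeholders rather than tuned recommendations, and the model-loading class may differ with your transformers version.

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# 4-bit NF4 quantization keeps the 12B base weights small enough to fit
# alongside long-sequence activations.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Depending on your transformers version, the multimodal Gemma 3 checkpoint may
# need Gemma3ForConditionalGeneration / AutoModelForImageTextToText instead.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-12b-it",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # fine for a single node; see the hardware notes below
)

# Only the low-rank adapter weights are trained; the quantized base stays frozen.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Placeholder path; the dataset should expose a "messages" column in the
# conversational format shown above, or a plain "text" column.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="gemma3-12b-qlora",
        max_seq_length=32768,  # raise toward 131072 only if memory allows;
                               # newer TRL versions rename this to `max_length`
        packing=True,          # pack shorter examples to fill each sequence
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        gradient_checkpointing=True,  # trades compute for a large memory saving
        bf16=True,
        num_train_epochs=1,
        logging_steps=10,
    ),
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()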

Please consider the above suggestions before fine-tuning on such a large dataset.

Hardware requirements: multi-GPU hardware is recommended for a model as large as 12B or 27B, especially when fine-tuning on long sequences (a minimal multi-GPU configuration sketch follows the options below). For example:

  1. 4x NVIDIA A100 (80 GB): this would be an ideal setup. You could potentially use a slightly larger effective batch size and train more efficiently.
  2. 8x NVIDIA A6000 (48 GB): similar to the 4x A100 setup, this provides ample VRAM.
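On a multi-GPU node, one option is the Trainer's built-in FSDP integration, which shards the 12B parameters across devices instead of replicating them. A rough sketch is below; it assumes bf16 LoRA or full fine-tuning (combining FSDP with 4-bit quantization needs extra care), the Gemma3DecoderLayer wrap class should be verified against your transformers version, and the other values are placeholders carried over from the sketch above. The script would then be started with something like `accelerate launch --num_processes 4 train.py`.

from trl import SFTConfig

# Same long-sequence settings as above, plus FSDP sharding options.
training_args = SFTConfig(
    output_dir="gemma3-12b-longctx-fsdp",
    max_seq_length=32768,              # newer TRL versions rename this to `max_length`
    packing=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    bf16=True,
    fsdp="full_shard auto_wrap",       # shard parameters, gradients, and optimizer state
    fsdp_config={
        # Wrap each decoder block as one FSDP unit; verify the class name for
        # your transformers version.
        "transformer_layer_cls_to_wrap": ["Gemma3DecoderLayer"],
    },
)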

Please find the attached gist, where you can see a parameter configuration for dealing with a large corpus of data. Please note that it is not complete code, as I don't have the actual dataset.

Thanks.
