Update README.md
README.md CHANGED
@@ -13,10 +13,10 @@ tags:

[HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) is quantized using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (`INT8-INT4`). It is then lowered to [ExecuTorch](https://github.com/pytorch/executorch) with several optimizations, including custom SDPA, custom KV cache, and parallel prefill, to achieve high performance on the CPU backend, making it well-suited for mobile deployment.

-We provide the [.pte file](https://huggingface.co/pytorch/SmolLM3-3B-INT8-INT4/blob/main/
+We provide the [.pte file](https://huggingface.co/pytorch/SmolLM3-3B-INT8-INT4/blob/main/model.pte) for direct use in ExecuTorch. *(The provided .pte file is exported with the default max_seq_length/max_context_length of 2k.)*

# Running in a mobile app
-The [.pte file](https://huggingface.co/pytorch/SmolLM3-3B-INT8-INT4/blob/main/
+The [.pte file](https://huggingface.co/pytorch/SmolLM3-3B-INT8-INT4/blob/main/model.pte) can be run with ExecuTorch on a mobile phone. See the instructions for [iOS](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) and [Android](https://docs.pytorch.org/executorch/main/llm/llama-demo-android.html). On a Samsung Galaxy S22, the model runs at 15.5 tokens/s.
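Before deploying to a phone, it can help to sanity-check the downloaded `model.pte` on desktop. Below is a minimal sketch (not part of this README) that uses ExecuTorch's Python `Runtime` API to load the program and list its entry points; the local path is an assumption, and the exact API surface should be verified against your installed `executorch` version.

```python
# Minimal desktop sanity check for the exported program; a sketch under
# the assumption that `executorch` is pip-installed and that model.pte
# has been downloaded locally. Verify this Runtime API against your
# executorch version.
from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program("model.pte")  # assumed local download path
print(program.method_names)  # the exported LLM should expose "forward"
```

The second hunk below updates the quantization recipe that produced the checkpoint from which the .pte was exported.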
@@ -131,7 +131,7 @@ linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_scale_dtype=torch.bfloat16,
)
quant_config = ModuleFqnToConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
-quantization_config = TorchAoConfig(quant_type=quant_config,
+quantization_config = TorchAoConfig(quant_type=quant_config, include_input_output_embeddings=True, modules_to_not_convert=[])

# either use `untied_model_id` or `untied_model_local_path`
quantized_model = AutoModelForCausalLM.from_pretrained(untied_model_id, torch_dtype=torch.float32, device_map="auto", quantization_config=quantization_config)
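The hunk above shows only the tail of the quantization recipe. For context, a self-contained sketch of the full flow might look like the following; the `embedding_config` definition, the granularity values, and the `untied_model_id` value are assumptions filled in from the torchao API rather than taken verbatim from this diff.

```python
# A self-contained sketch of the INT8-INT4 recipe. The embedding_config,
# granularities, and model id below are assumptions, not verbatim from
# this README; check them against the torchao release you use.
import torch
from transformers import AutoModelForCausalLM, TorchAoConfig
from torchao.quantization.granularity import PerAxis, PerGroup
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    IntxWeightOnlyConfig,
    ModuleFqnToConfig,
)

# 8-bit weight-only quantization for the embedding table (assumed config).
embedding_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int8,
    granularity=PerAxis(0),
)
# 8-bit dynamic activations with 4-bit grouped weights for linear layers.
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    weight_scale_dtype=torch.bfloat16,
)
quant_config = ModuleFqnToConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
quantization_config = TorchAoConfig(
    quant_type=quant_config,
    include_input_output_embeddings=True,
    modules_to_not_convert=[],
)

# Hypothetical id; stands in for the untied-embedding checkpoint the
# README's `untied_model_id` / `untied_model_local_path` refers to.
untied_model_id = "pytorch/SmolLM3-3B-untied"
quantized_model = AutoModelForCausalLM.from_pretrained(
    untied_model_id,
    torch_dtype=torch.float32,
    device_map="auto",
    quantization_config=quantization_config,
)
```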