metascroy committed
Commit 6ac5a61 · verified · Parent: 467664a

Update README.md

Files changed (1):
  1. README.md +3 -3
README.md CHANGED
@@ -13,10 +13,10 @@ tags:
 
 [HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) is quantized using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (`INT8-INT4`). It is then lowered to [ExecuTorch](https://github.com/pytorch/executorch) with several optimizations (custom SDPA, custom KV cache, and parallel prefill) to achieve high performance on the CPU backend, making it well-suited for mobile deployment.
 
-We provide the [.pte file](https://huggingface.co/pytorch/SmolLM3-3B-INT8-INT4/blob/main/smollm3-3b-INT8-INT4.pte) for direct use in ExecuTorch. *(The provided .pte file is exported with the default max_seq_length/max_context_length of 2k.)*
+We provide the [.pte file](https://huggingface.co/pytorch/SmolLM3-3B-INT8-INT4/blob/main/model.pte) for direct use in ExecuTorch. *(The provided .pte file is exported with the default max_seq_length/max_context_length of 2k.)*
 
 # Running in a mobile app
-The [.pte file](https://huggingface.co/pytorch/SmolLM3-3B-INT8-INT4/blob/main/smollm3-3b-INT8-INT4.pte) can be run with ExecuTorch on a mobile phone. See the instructions for doing this on [iOS](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) and [Android](https://docs.pytorch.org/executorch/main/llm/llama-demo-android.html). On a Samsung Galaxy S22, the model runs at 15.5 tokens/s.
+The [.pte file](https://huggingface.co/pytorch/SmolLM3-3B-INT8-INT4/blob/main/model.pte) can be run with ExecuTorch on a mobile phone. See the instructions for doing this on [iOS](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) and [Android](https://docs.pytorch.org/executorch/main/llm/llama-demo-android.html). On a Samsung Galaxy S22, the model runs at 15.5 tokens/s.
 
 ![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/65e8b9b6624b8e44f56da2b1/ANZ6DrVlKOkuCQYZfHEBI.jpeg)
 
@@ -131,7 +131,7 @@ linear_config = Int8DynamicActivationIntxWeightConfig(
     weight_scale_dtype=torch.bfloat16,
 )
 quant_config = ModuleFqnToConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
-quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True, untie_embedding_weights=True, modules_to_not_convert=[])
+quantization_config = TorchAoConfig(quant_type=quant_config, include_input_output_embeddings=True, modules_to_not_convert=[])
 
 # either use `untied_model_id` or `untied_model_local_path`
 quantized_model = AutoModelForCausalLM.from_pretrained(untied_model_id, torch_dtype=torch.float32, device_map="auto", quantization_config=quantization_config)
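
For context, the sketch below shows how the updated flag slots into the README's quantization snippet: the new `include_input_output_embeddings=True` replaces the older `include_embedding`/`untie_embedding_weights` pair on the removed line. Only `weight_scale_dtype` and the config class names appear in this hunk; the remaining arguments and the `untied_model_id` value are assumptions added for illustration.

```python
# Minimal sketch of the INT8-INT4 quantization setup around the changed line.
# Arguments not shown in this hunk are assumptions based on torchao's API.
import torch
from transformers import AutoModelForCausalLM, TorchAoConfig
from torchao.quantization.granularity import PerAxis, PerGroup
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    IntxWeightOnlyConfig,
    ModuleFqnToConfig,
)

# 8-bit weight-only quantization for the embedding table (assumed settings)
embedding_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int8,
    granularity=PerAxis(0),
)

# 8-bit dynamic activations with 4-bit grouped weights for the linear layers
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    weight_scale_dtype=torch.bfloat16,
)

quant_config = ModuleFqnToConfig({"_default": linear_config, "model.embed_tokens": embedding_config})

# The changed line: one flag now covers quantizing the input/output embeddings.
quantization_config = TorchAoConfig(
    quant_type=quant_config,
    include_input_output_embeddings=True,
    modules_to_not_convert=[],
)

# untied_model_id points at the checkpoint with untied embeddings that the
# README prepares earlier; the value here is a placeholder for illustration.
untied_model_id = "path/to/untied-smollm3-3b"
quantized_model = AutoModelForCausalLM.from_pretrained(
    untied_model_id,
    torch_dtype=torch.float32,
    device_map="auto",
    quantization_config=quantization_config,
)
```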
 
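Since the commit also renames the artifact to `model.pte`, a quick desktop smoke test with ExecuTorch's Python bindings can confirm the file loads before it goes into the iOS/Android demo apps. This is a sketch: the `(tokens, input_pos)` input signature is an assumption based on the custom-KV-cache export described above, not something this commit documents.

```python
# Hypothetical smoke test. Assumes `pip install executorch` and that the
# exported forward method takes (token ids, start position), which is an
# assumption based on the custom-KV-cache export, not this commit.
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

module = _load_for_executorch("model.pte")  # file name per this commit

tokens = torch.tensor([[1, 2, 3]], dtype=torch.long)  # placeholder prompt ids
input_pos = torch.tensor([0], dtype=torch.long)       # offset into the 2k context
logits = module.forward([tokens, input_pos])[0]       # first output: next-token logits
print(logits.shape)
```

If the program loads and a forward pass succeeds here, the same `.pte` should drop into the linked demo apps unchanged.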