Add the details of Hardware & Performance
#23 opened by Apurv09042005

README.md CHANGED

@@ -336,6 +336,28 @@ To achieve optimal performance, we recommend the following settings:

4. **No Thinking Content in History**: In multi-turn conversations, the historical model output should include only the final output part, not the thinking content. This is implemented in the provided Jinja2 chat template. However, for frameworks that do not use the Jinja2 chat template directly, it is up to the developers to ensure that this best practice is followed.
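A minimal sketch of this practice, assuming the model wraps its reasoning in `<think>...</think>` tags and using a hypothetical `generate_reply` placeholder in place of your framework's actual generation call:

```python
import re

def generate_reply(messages):
    # Placeholder: call your inference stack (transformers, vLLM, SGLang, ...) here.
    return "<think>Count the r's one by one...</think>There are three r's in 'strawberry'."

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
raw_reply = generate_reply(messages)

# Keep only the final answer in the history; the thinking content is dropped
# before the next turn, mirroring what the provided Jinja2 chat template does.
final_answer = re.sub(r"<think>.*?</think>", "", raw_reply, flags=re.DOTALL).strip()
messages.append({"role": "assistant", "content": final_answer})

# The next user turn therefore sees a history that contains no thinking content.
messages.append({"role": "user", "content": "And in 'raspberry'?"})
```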
## Hardware & Performance
Qwen3-8B is optimized for both research and production workloads:

- **Inference**:
  - Runs efficiently on a single A100 (80GB) GPU or two A40s.
  - Can be quantized to **INT8/FP8/4-bit** using `bitsandbytes`, `AutoGPTQ`, or `AWQ` for edge or consumer hardware (e.g., RTX 3090/4090); a 4-bit loading sketch follows this list.

- **Training / Fine-tuning**:
  - Recommended: ≥ 2x A100 (80GB) or ≥ 4x A6000 GPUs.
  - Supports **LoRA, QLoRA, and DPO/RLHF** fine-tuning approaches; a minimal LoRA sketch follows the memory table below.
  - Gradient checkpointing and FlashAttention v2 are enabled by default for memory efficiency.
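As referenced in the inference bullet above, here is a minimal 4-bit loading sketch using `transformers` with `bitsandbytes`. The Hub ID `Qwen/Qwen3-8B` and the NF4 settings are illustrative assumptions, not the only supported configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen3-8B"  # assumed Hub ID; substitute the checkpoint you actually use

# 4-bit NF4 quantization with bfloat16 compute: a common memory-saving setup
# that fits comfortably on a single RTX 3090/4090-class GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # places layers across the available GPUs automatically
)

prompt = "Give me a short introduction to large language models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Pre-quantized AutoGPTQ and AWQ checkpoints follow the same `from_pretrained` pattern; only the quantization configuration differs.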
| Mode     | GPU Memory | Notes                            |
|----------|------------|----------------------------------|
| FP16     | ~45GB      | Full precision inference         |
| bfloat16 | ~38GB      | Preferred for stability          |
| 8-bit    | ~22GB      | Near-lossless quality            |
| 4-bit    | ~12GB      | Higher speed, small quality drop |
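A minimal LoRA setup sketch using the `peft` library; the target modules and hyperparameters below are common starting points, assumed here for illustration rather than taken from the Qwen3 documentation:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",             # assumed Hub ID
    torch_dtype=torch.bfloat16,  # bfloat16 preferred for stability (see table above)
    device_map="auto",
)
model.gradient_checkpointing_enable()  # trade extra compute for lower activation memory

# Typical LoRA configuration targeting the attention and MLP projection layers.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```

QLoRA combines this adapter setup with the 4-bit loading shown earlier; DPO/RLHF-style training typically layers a library such as TRL on top of the same base model.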
### Citation
If you find our work helpful, feel free to give us a cite.