Add the details of Hardware & Performance

#23

Hardware & Performance

Qwen3-8B is optimized for both research and production workloads:

Inference:

Runs efficiently on a single A100 (80GB) GPU or two A40s.

Can be quantized to INT8/FP8/4-bit using bitsandbytes, AutoGPTQ, or AWQ for edge or consumer hardware (e.g., RTX 3090/4090).
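A minimal sketch of 4-bit loading with bitsandbytes through transformers, assuming the `Qwen/Qwen3-8B` Hub id; the quantization settings and prompt are illustrative, not recommended values:

```python
# Minimal sketch: 4-bit quantized inference with bitsandbytes via transformers.
# The Hub id and generation settings below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen3-8B"  # assumed Hub id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Explain FlashAttention in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```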

Training / Fine-tuning:

Recommended: ≥ 2x A100 (80GB) or ≥ 4x A6000 GPUs.

Supports LoRA, QLoRA, and DPO/RLHF fine-tuning approaches (a setup sketch follows this list).

Gradient checkpointing and FlashAttention v2 are enabled by default for memory efficiency.
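Below is a minimal sketch of a QLoRA-style setup that combines PEFT LoRA adapters, gradient checkpointing, and FlashAttention v2. The LoRA hyperparameters and target module names are assumptions for illustration, not values taken from the model card:

```python
# Minimal sketch: QLoRA-style fine-tuning setup with PEFT, gradient checkpointing,
# and FlashAttention v2. Hyperparameters and target modules are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "Qwen/Qwen3-8B"  # assumed Hub id

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
)

# Cast norms/embeddings for stability and enable gradient checkpointing.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The resulting model can then be passed to a standard trainer (e.g., TRL's SFTTrainer or DPOTrainer) for LoRA/QLoRA or DPO-style fine-tuning.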

| Mode     | GPU Memory | Notes                                        |
|----------|------------|----------------------------------------------|
| FP16     | ~45 GB     | Full-precision inference                     |
| bfloat16 | ~38 GB     | Preferred for stability                      |
| 8-bit    | ~22 GB     | Near-lossless quality                        |
| 4-bit    | ~12 GB     | Trade-off: higher speed, small quality drop  |
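To verify these figures on your own hardware, a minimal sketch for measuring peak GPU memory after loading is shown below; the dtype choice is just an example, and actual usage also depends on context length and batch size:

```python
# Minimal sketch: measure the actual GPU memory footprint for a given loading mode,
# so the table above can be checked against your own hardware and workload.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",             # assumed Hub id
    torch_dtype=torch.bfloat16,  # swap in a BitsAndBytesConfig to test 8-bit / 4-bit
    device_map="auto",
)
print(f"Peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```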
