Distributed Training
Papers and resources related to distributed training.
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel (arXiv:2304.11277, Apr 21, 2023)
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (arXiv:1909.08053, Sep 17, 2019)
Reducing Activation Recomputation in Large Transformer Models (arXiv:2205.05198, May 10, 2022)
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (arXiv:1811.06965, Nov 16, 2018)
Tiny randomly initialized models (useful for testing distributed training setups without large downloads):
michaelbenayoun/qwen3-tiny-4kv-heads-8layers-random (Text Generation, 6.61M params)
michaelbenayoun/qwen3-tiny-4kv-heads-4layers-random (Text Generation, 5.47M params)
michaelbenayoun/deepseekv3-tiny-4kv-heads-4-layers-random (Text Generation, 5.27M params)
michaelbenayoun/granite-tiny-4kv-heads-4layers-random (Text Generation, 4.2M params)
michaelbenayoun/llama-2-tiny-4kv-heads-4layers-random (Text Generation, 8.54M params)
michaelbenayoun/llama-2-tiny-4kv-heads-16layers-random (Text Generation, 8.98M params)
michaelbenayoun/llama-2-tiny-4kv-heads-2layers-random (Feature Extraction, 2.08M params)
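A minimal sketch of how a comparable tiny random model can be built locally with `transformers`, mirroring the naming convention above (4 KV heads, 4 layers). The specific hyperparameters here (hidden size, head counts, vocab size) are illustrative assumptions, not the actual config of any checkpoint listed:

```python
# Sketch: a tiny randomly initialized Llama-style model with grouped-query
# attention, in the spirit of the "llama-2-tiny-4kv-heads-4layers-random"
# checkpoints above. All dimensions below are assumed, not the real config.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=128,          # deliberately tiny hidden dimension
    intermediate_size=256,
    num_hidden_layers=4,      # "4layers" in the model name
    num_attention_heads=8,
    num_key_value_heads=4,    # "4kv-heads": grouped-query attention
)
model = LlamaForCausalLM(config)  # weights are randomly initialized
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.2f}M parameters")
```

Such tiny models make it cheap to smoke-test FSDP, tensor-parallel, or pipeline-parallel plumbing before scaling the same code to a full-size checkpoint.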