---
language: en
license: apache-2.0
tags:
- chess
- reinforcement-learning
- dpo
- direct-preference-optimization
- jax
- haiku
- self-play
library_name: jax
---

# Searchless Chess 9M (DPO Self-Play Trained)

This is a 9-million-parameter transformer-based chess engine trained with Direct Preference Optimization (DPO) using mistake-focused self-play and Stockfish supervision.

## Model Description

- **Architecture**: Transformer with 8 layers, 256 embedding dimensions, 8 attention heads
- **Training Method**: DPO (Direct Preference Optimization) with self-play
- **Framework**: JAX/Haiku
- **Parameters**: ~9 million
- **Base Model**: DeepMind's Searchless Chess 9M
- **Training Iteration**: 1
- **Self-play Games**: 1000 games
- **Preference Pairs**: 36,407 (model mistakes)
- **Training Steps**: 50 gradient steps
- **Final Loss**: 0.6890 (from 0.6931)

## Performance Improvements

After just 1 iteration of DPO training:

**Puzzle Solving:**
- Base 9M model: 87% accuracy
- DPO-trained model: 88% accuracy
- +1% improvement overall, with the best gains in the 1000-1500 rating range (+3.45%)

**Head-to-Head Games (50 games):**
- Win-Draw-Loss: 24-9-17 (vs base 9M)
- Win rate: 57%
- **Elo Improvement: +25 Elo** (BayesElo calculation)

## Quick Start

```python
from searchless_chess.src import hf_model
from huggingface_hub import snapshot_download

# Download model
model_path = snapshot_download(
    repo_id="dbest-isi/searchless-chess-9M-dpo",
    local_dir="./chess_model"
)

# Load model
model = hf_model.SearchlessChessModel.from_pretrained("./chess_model")

# Predict move from starting position
fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
result = model.predict(fen)
print(f"Best move: {result['best_move']}")
print(f"Q-value: {result['q_value']:.4f}")
```

## Installation

Required dependencies:

```bash
pip install jax jaxlib dm-haiku orbax-checkpoint numpy huggingface-hub python-chess
```

## Training Details

### DPO Algorithm

Direct Preference Optimization (DPO) is a preference-based learning algorithm that directly optimizes the policy without requiring a separate reward model. The training process:

1. **Self-Play Generation**: The model plays 1000 games against itself
2. **Mistake Identification**: Stockfish analyzes each position to find model errors
3. **Preference Pair Creation**: For each mistake (see the sketch after this list):
   - **Chosen action**: Stockfish's move (better outcome)
   - **Rejected action**: Model's move (worse outcome)
   - **Filtering**: Only include mistakes with an eval difference > 0.3 pawns
4. **DPO Training**: Optimize the policy to prefer Stockfish's moves using the DPO loss
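The mistake-identification and pair-creation steps (2-3 above) can be sketched with `python-chess` and a local Stockfish binary. This is a minimal illustration under stated assumptions, not the repository's actual pipeline: the Stockfish path, the `model.predict` interface (borrowed from the Quick Start snippet and assumed to return a UCI string), and the exact evaluation bookkeeping are assumptions.

```python
import chess
import chess.engine

STOCKFISH_PATH = "/usr/bin/stockfish"  # assumption: path to a local Stockfish binary
EVAL_THRESHOLD = 0.3  # pawns; minimum mistake margin
MAX_ABS_EVAL = 3.0    # pawns; skip positions that are already decided

def extract_preference_pairs(fens, model, engine):
    """For each position, compare the model's move with Stockfish's move and
    keep a (chosen, rejected) pair when the model's move is meaningfully worse."""
    limit = chess.engine.Limit(depth=20, time=0.1)
    pairs = []
    for fen in fens:
        board = chess.Board(fen)
        model_move = chess.Move.from_uci(model.predict(fen)["best_move"])

        # Stockfish's preferred move and evaluation, from the side to move's view.
        info = engine.analyse(board, limit)
        best_move = info["pv"][0]
        best_eval = info["score"].pov(board.turn).score(mate_score=10_000) / 100.0

        if abs(best_eval) > MAX_ABS_EVAL or model_move == best_move:
            continue  # position already decided, or no mistake to learn from

        # Evaluate the position after the model's move, still from the original mover's view.
        board.push(model_move)
        reply = engine.analyse(board, limit)
        model_eval = -reply["score"].pov(board.turn).score(mate_score=10_000) / 100.0
        board.pop()

        if best_eval - model_eval > EVAL_THRESHOLD:
            pairs.append({"fen": fen, "chosen": best_move.uci(), "rejected": model_move.uci()})
    return pairs

# Usage (hypothetical variable names):
# engine = chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH)
# pairs = extract_preference_pairs(self_play_fens, model, engine)
# engine.quit()
```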
### Training Hyperparameters

- **Base Model**: 9M parameter action-value model (pre-trained by DeepMind)
- **Training Algorithm**: Direct Preference Optimization (DPO)
- **Self-play Games**: 1000 games per iteration
- **Preference Pairs Found**: 36,407 (mistakes where the model played suboptimal moves)
- **Batch Size**: 32
- **Learning Rate**: 1e-5
- **Gradient Steps**: 50 per iteration
- **DPO Beta**: 0.1 (KL penalty coefficient)
- **Eval Threshold**: 0.3 pawns (minimum mistake margin)
- **Stockfish Analysis**: Depth 20, 0.1s per position
- **Optimizer**: Adam with gradient clipping (max norm 1.0)
- **EMA Decay**: 0.999 (used for inference)
- **Reference Model**: Updated every 3 iterations for stability

### DPO Loss Function

The DPO loss for action-value models:

```
log π(a|s) = Q(s,a) / τ - log Σ_a' exp(Q(s,a') / τ)

Loss = -log sigmoid(β * [(log π_θ(chosen|s) - log π_ref(chosen|s)) - (log π_θ(rejected|s) - log π_ref(rejected|s))])
```

Where:
- π_θ: Current policy (converted from Q-values)
- π_ref: Reference policy (frozen snapshot)
- β: KL penalty coefficient
- τ: Temperature for the softmax conversion

## Architecture

- **Input**: 77-token FEN representation
- **Embedding**: 256 dimensions
- **Layers**: 8 transformer blocks
- **Attention Heads**: 8 per layer
- **Output**: 128-bucket Q-value distribution over actions
- **Positional Encoding**: Learned
- **Activation**: GELU in feed-forward layers
- **Total Parameters**: ~9M

## Training Process

The model was trained using mistake-focused self-play:

1. **Generate Self-Play Games**: The model plays 1000 games against itself from diverse openings
2. **Analyze with Stockfish**: Each position is analyzed at depth 20 (0.1s per move)
3. **Extract Preferences**: 36,407 position-move pairs where the model made mistakes
4. **Filter Quality**:
   - Eval difference ≥ 0.3 pawns (meaningful mistakes)
   - Position quality |eval| ≤ 3.0 pawns (avoid already-decided positions)
5. **DPO Training**: 50 gradient steps optimizing the preference likelihood (the loss is sketched after this list)
6. **Checkpoint**: Save EMA parameters for inference
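For step 5, the per-pair objective is the formula given in the DPO Loss Function section above. Below is a minimal JAX sketch, not the repository's training code: the function names are illustrative, `tau` defaults to 1.0 because no temperature value is stated above, and the Q-values are assumed to have already been gathered over the legal moves of a single position.

```python
import jax
import jax.numpy as jnp

def policy_log_prob(q_values, action_idx, tau=1.0):
    """log π(a|s): softmax over Q-values with temperature tau (first line of the loss above)."""
    return jax.nn.log_softmax(q_values / tau)[action_idx]

def dpo_loss(q_theta, q_ref, chosen_idx, rejected_idx, beta=0.1, tau=1.0):
    """DPO loss for one preference pair; q_theta / q_ref are Q-values over the
    legal moves under the current and reference models."""
    chosen_margin = policy_log_prob(q_theta, chosen_idx, tau) - policy_log_prob(q_ref, chosen_idx, tau)
    rejected_margin = policy_log_prob(q_theta, rejected_idx, tau) - policy_log_prob(q_ref, rejected_idx, tau)
    return -jax.nn.log_sigmoid(beta * (chosen_margin - rejected_margin))

# Illustrative numbers: the current model slightly prefers the rejected move, so the
# loss sits a little above log 2 ≈ 0.6931 and the gradient pushes toward the chosen move.
q_theta = jnp.array([0.2, 0.5, 0.1])  # Q-values for three legal moves (hypothetical)
q_ref = jnp.array([0.3, 0.3, 0.1])
print(float(dpo_loss(q_theta, q_ref, chosen_idx=0, rejected_idx=1)))
```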
## Performance Analysis

**Strengths:**
- Improved tactical accuracy (fewer blunders)
- Better move selection in middlegame positions
- Stronger in the 1000-1500 Elo puzzle range

**Current Limitations:**
- Early training (only 1 iteration completed)
- Limited self-play data (1000 games)
- No explicit opening book or endgame tablebase
- Evaluation based on Q-values, not full search

**Future Work:**
- Continue training for more iterations (recommended: 10 iterations)
- Progressive curriculum (increase Stockfish depth over time)
- Larger batch sizes and more gradient steps
- Test on a wider puzzle range and benchmark positions

## Comparison to Base Model

| Metric | Base 9M | DPO-trained | Improvement |
|--------|---------|-------------|-------------|
| Puzzle Accuracy | 87% | 88% | +1% |
| Head-to-Head Win Rate | 43% | 57% | +14% |
| Elo Rating | Baseline | +25 | +25 Elo |

## Citation

Based on the Searchless Chess work by DeepMind Technologies Limited:

```bibtex
@article{ruoss2024grandmaster,
  title={Grandmaster-Level Chess Without Search},
  author={Ruoss, Anian and others},
  journal={arXiv preprint arXiv:2402.04494},
  year={2024}
}
```

DPO algorithm from:

```bibtex
@article{rafailov2023direct,
  title={Direct Preference Optimization: Your Language Model is Secretly a Reward Model},
  author={Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D and Finn, Chelsea},
  journal={arXiv preprint arXiv:2305.18290},
  year={2023}
}
```

## License

Apache 2.0

## Additional Resources

- [Original Searchless Chess Repository](https://github.com/google-deepmind/searchless_chess)
- [Training Code and Documentation](../SELF_PLAY.md)
- [DPO Paper](https://arxiv.org/abs/2305.18290)
- [Searchless Chess Paper](https://arxiv.org/abs/2402.04494)