---
language: en
license: apache-2.0
tags:
- chess
- reinforcement-learning
- dpo
- direct-preference-optimization
- jax
- haiku
- self-play
library_name: jax
---

# Searchless Chess 9M (DPO Self-Play Trained)

This is a 9-million-parameter transformer-based chess engine trained with Direct Preference Optimization (DPO) using mistake-focused self-play and Stockfish supervision.

## Model Description

- **Architecture**: Transformer with 8 layers, 256 embedding dimensions, 8 attention heads
- **Training Method**: DPO (Direct Preference Optimization) with self-play
- **Framework**: JAX/Haiku
- **Parameters**: ~9 million
- **Base Model**: DeepMind's Searchless Chess 9M
- **Training Iteration**: 1
- **Self-play Games**: 1000 games
- **Preference Pairs**: 36,407 (model mistakes)
- **Training Steps**: 50 gradient steps
- **Final Loss**: 0.6890 (from 0.6931)

## Performance Improvements

After just 1 iteration of DPO training:

**Puzzle Solving:**
- Base 9M model: 87% accuracy
- DPO-trained model: 88% accuracy
- +1% improvement overall, with the best gains in the 1000-1500 rating range (+3.45%)

**Head-to-Head Games (50 games):**
- Win-Draw-Loss: 24-9-17 (vs base 9M)
- Win rate: 57%
- **Elo Improvement: +25 Elo** (BayesElo calculation)

## Quick Start

```python
from searchless_chess.src import hf_model
from huggingface_hub import snapshot_download

# Download model
model_path = snapshot_download(
    repo_id="dbest-isi/searchless-chess-9M-dpo",
    local_dir="./chess_model"
)

# Load model
model = hf_model.SearchlessChessModel.from_pretrained("./chess_model")

# Predict move from starting position
fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
result = model.predict(fen)
print(f"Best move: {result['best_move']}")
print(f"Q-value: {result['q_value']:.4f}")
```

## Installation

Required dependencies:

```bash
pip install jax jaxlib dm-haiku orbax-checkpoint numpy huggingface-hub python-chess
```

## Training Details

### DPO Algorithm

Direct Preference Optimization (DPO) is a preference-based learning algorithm that directly optimizes the policy without requiring a separate reward model. The training process:

1. **Self-Play Generation**: The model plays 1000 games against itself
2. **Mistake Identification**: Stockfish analyzes each position to find model errors
3. **Preference Pair Creation**: For each mistake (see the sketch after this list):
   - **Chosen action**: Stockfish's move (better outcome)
   - **Rejected action**: Model's move (worse outcome)
   - **Filtering**: Only include mistakes with an eval difference > 0.3 pawns
4. **DPO Training**: Optimize the policy to prefer Stockfish's moves using the DPO loss
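The mistake-identification and pair-creation steps (2-3 above) can be sketched with `python-chess` and a local Stockfish binary. This is a minimal illustration under stated assumptions, not the repository's actual pipeline: the Stockfish path, the `model.predict` interface (borrowed from the Quick Start snippet and assumed to return a UCI string), and the exact evaluation bookkeeping are assumptions.

```python
import chess
import chess.engine

STOCKFISH_PATH = "/usr/bin/stockfish"  # assumption: path to a local Stockfish binary
EVAL_THRESHOLD = 0.3  # pawns; minimum mistake margin
MAX_ABS_EVAL = 3.0    # pawns; skip positions that are already decided

def extract_preference_pairs(fens, model, engine):
    """For each position, compare the model's move with Stockfish's move and
    keep a (chosen, rejected) pair when the model's move is meaningfully worse."""
    limit = chess.engine.Limit(depth=20, time=0.1)
    pairs = []
    for fen in fens:
        board = chess.Board(fen)
        model_move = chess.Move.from_uci(model.predict(fen)["best_move"])

        # Stockfish's preferred move and evaluation, from the side to move's view.
        info = engine.analyse(board, limit)
        best_move = info["pv"][0]
        best_eval = info["score"].pov(board.turn).score(mate_score=10_000) / 100.0

        if abs(best_eval) > MAX_ABS_EVAL or model_move == best_move:
            continue  # position already decided, or no mistake to learn from

        # Evaluate the position after the model's move, still from the original mover's view.
        board.push(model_move)
        reply = engine.analyse(board, limit)
        model_eval = -reply["score"].pov(board.turn).score(mate_score=10_000) / 100.0
        board.pop()

        if best_eval - model_eval > EVAL_THRESHOLD:
            pairs.append({"fen": fen, "chosen": best_move.uci(), "rejected": model_move.uci()})
    return pairs

# Usage (hypothetical variable names):
# engine = chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH)
# pairs = extract_preference_pairs(self_play_fens, model, engine)
# engine.quit()
```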
### Training Hyperparameters

- **Base Model**: 9M parameter action-value model (pre-trained by DeepMind)
- **Training Algorithm**: Direct Preference Optimization (DPO)
- **Self-play Games**: 1000 games per iteration
- **Preference Pairs Found**: 36,407 (mistakes where the model played suboptimal moves)
- **Batch Size**: 32
- **Learning Rate**: 1e-5
- **Gradient Steps**: 50 per iteration
- **DPO Beta**: 0.1 (KL penalty coefficient)
- **Eval Threshold**: 0.3 pawns (minimum mistake margin)
- **Stockfish Analysis**: Depth 20, 0.1s per position
- **Optimizer**: Adam with gradient clipping (max norm 1.0)
- **EMA Decay**: 0.999 (used for inference)
- **Reference Model**: Updated every 3 iterations for stability

### DPO Loss Function

The DPO loss for action-value models:

```
log π(a|s) = Q(s,a) / τ - log Σ_a' exp(Q(s,a') / τ)

Loss = -log sigmoid(β * [(log π_θ(chosen|s) - log π_ref(chosen|s)) - (log π_θ(rejected|s) - log π_ref(rejected|s))])
```

Where:
- π_θ: Current policy (converted from Q-values)
- π_ref: Reference policy (frozen snapshot)
- β: KL penalty coefficient
- τ: Temperature for the softmax conversion

## Architecture

- **Input**: 77-token FEN representation
- **Embedding**: 256 dimensions
- **Layers**: 8 transformer blocks
- **Attention Heads**: 8 per layer
- **Output**: 128-bucket Q-value distribution over actions
- **Positional Encoding**: Learned
- **Activation**: GELU in feed-forward layers
- **Total Parameters**: ~9M

## Training Process

The model was trained using mistake-focused self-play:

1. **Generate Self-Play Games**: The model plays 1000 games against itself from diverse openings
2. **Analyze with Stockfish**: Each position is analyzed at depth 20 (0.1s per move)
3. **Extract Preferences**: 36,407 position-move pairs where the model made mistakes
4. **Filter Quality**:
   - Eval difference ≥ 0.3 pawns (meaningful mistakes)
   - Position quality |eval| ≤ 3.0 pawns (avoid already-decided positions)
5. **DPO Training**: 50 gradient steps optimizing the preference likelihood (the loss is sketched after this list)
6. **Checkpoint**: Save EMA parameters for inference
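For step 5, the per-pair objective is the formula given in the DPO Loss Function section above. Below is a minimal JAX sketch, not the repository's training code: the function names are illustrative, `tau` defaults to 1.0 because no temperature value is stated above, and the Q-values are assumed to have already been gathered over the legal moves of a single position.

```python
import jax
import jax.numpy as jnp

def policy_log_prob(q_values, action_idx, tau=1.0):
    """log π(a|s): softmax over Q-values with temperature tau (first line of the loss above)."""
    return jax.nn.log_softmax(q_values / tau)[action_idx]

def dpo_loss(q_theta, q_ref, chosen_idx, rejected_idx, beta=0.1, tau=1.0):
    """DPO loss for one preference pair; q_theta / q_ref are Q-values over the
    legal moves under the current and reference models."""
    chosen_margin = policy_log_prob(q_theta, chosen_idx, tau) - policy_log_prob(q_ref, chosen_idx, tau)
    rejected_margin = policy_log_prob(q_theta, rejected_idx, tau) - policy_log_prob(q_ref, rejected_idx, tau)
    return -jax.nn.log_sigmoid(beta * (chosen_margin - rejected_margin))

# Illustrative numbers: the current model slightly prefers the rejected move, so the
# loss sits a little above log 2 ≈ 0.6931 and the gradient pushes toward the chosen move.
q_theta = jnp.array([0.2, 0.5, 0.1])  # Q-values for three legal moves (hypothetical)
q_ref = jnp.array([0.3, 0.3, 0.1])
print(float(dpo_loss(q_theta, q_ref, chosen_idx=0, rejected_idx=1)))
```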
## Performance Analysis

**Strengths:**
- Improved tactical accuracy (fewer blunders)
- Better move selection in middlegame positions
- Stronger in the 1000-1500 Elo puzzle range

**Current Limitations:**
- Early training (only 1 iteration completed)
- Limited self-play data (1000 games)
- No explicit opening book or endgame tablebase
- Evaluation based on Q-values, not full search

**Future Work:**
- Continue training for more iterations (recommended: 10 iterations)
- Progressive curriculum (increase Stockfish depth over time)
- Larger batch sizes and more gradient steps
- Test on a wider puzzle range and benchmark positions

## Comparison to Base Model

| Metric | Base 9M | DPO-trained | Improvement |
|--------|---------|-------------|-------------|
| Puzzle Accuracy | 87% | 88% | +1% |
| Head-to-Head Win Rate | 43% | 57% | +14% |
| Elo Rating | Baseline | +25 | +25 Elo |

## Citation

Based on the Searchless Chess work by DeepMind Technologies Limited:

```bibtex
@article{ruoss2024grandmaster,
  title={Grandmaster-Level Chess Without Search},
  author={Ruoss, Anian and others},
  journal={arXiv preprint arXiv:2402.04494},
  year={2024}
}
```

DPO algorithm from:

```bibtex
@article{rafailov2023direct,
  title={Direct Preference Optimization: Your Language Model is Secretly a Reward Model},
  author={Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D and Finn, Chelsea},
  journal={arXiv preprint arXiv:2305.18290},
  year={2023}
}
```

## License

Apache 2.0

## Additional Resources

- [Original Searchless Chess Repository](https://github.com/google-deepmind/searchless_chess)
- [Training Code and Documentation](../SELF_PLAY.md)
- [DPO Paper](https://arxiv.org/abs/2305.18290)
- [Searchless Chess Paper](https://arxiv.org/abs/2402.04494)