Searchless Chess 9M (DPO Self-Play Trained)
This is a 9 million parameter transformer-based chess engine trained using Direct Preference Optimization (DPO) with mistake-focused self-play and Stockfish supervision.
Model Description
- Architecture: Transformer with 8 layers, 256 embedding dim, 8 attention heads
- Training Method: DPO (Direct Preference Optimization) with self-play
- Framework: JAX/Haiku
- Parameters: ~9 million
- Base Model: DeepMind's Searchless Chess 9M
- Training Iteration: 1
- Self-play Games: 1000 games
- Preference Pairs: 36,407 (model mistakes)
- Training Steps: 50 gradient steps
- Final Loss: 0.6890 (down from the initial 0.6931 ≈ ln 2)
Performance Improvements
After just 1 iteration of DPO training:
Puzzle Solving:
- Base 9M model: 87% accuracy
- DPO-trained model: 88% accuracy
- +1 percentage point overall, with the largest gain in the 1000-1500 puzzle-rating range (+3.45%)
Head-to-Head Games (50 games):
- Win-Draw-Loss: 24-9-17 (vs base 9M)
- Score: 57% (draws counted as half a point)
- Elo Improvement: +25 Elo (BayesElo calculation)
Quick Start
```python
from searchless_chess.src import hf_model
from huggingface_hub import snapshot_download

# Download the checkpoint from the Hugging Face Hub
model_path = snapshot_download(
    repo_id="dbest-isi/searchless-chess-9M-dpo",
    local_dir="./chess_model",
)

# Load the model
model = hf_model.SearchlessChessModel.from_pretrained("./chess_model")

# Predict a move from the starting position
fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
result = model.predict(fen)
print(f"Best move: {result['best_move']}")
print(f"Q-value: {result['q_value']:.4f}")
```
Installation
Required dependencies:
```bash
pip install jax jaxlib dm-haiku orbax-checkpoint numpy huggingface-hub python-chess
```
Training Details
DPO Algorithm
Direct Preference Optimization (DPO) is a preference-based learning algorithm that directly optimizes the policy without requiring a separate reward model. The training process:
- Self-Play Generation: Model plays 1000 games against itself
- Mistake Identification: Stockfish analyzes each position to find model errors
- Preference Pair Creation: For each mistake (see the sketch after this list):
  - Chosen action: Stockfish's move (better outcome)
  - Rejected action: Model's move (worse outcome)
- Filtering: Only include mistakes with an eval difference > 0.3 pawns
- DPO Training: Optimize the policy to prefer Stockfish's moves using the DPO loss
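A minimal sketch of the mistake-identification and pair-creation steps using python-chess and a local Stockfish binary. It reuses the `model.predict` API from the Quick Start; the helper name, engine path, and filtering details are illustrative, not the repository's actual training code.

```python
import chess
import chess.engine

# Assumes a Stockfish binary reachable on PATH; the exact path is an assumption.
engine = chess.engine.SimpleEngine.popen_uci("stockfish")

def eval_after(board, move):
    """Stockfish evaluation (in pawns) after `move`, from the mover's point of view."""
    mover = board.turn
    board.push(move)
    info = engine.analyse(board, chess.engine.Limit(depth=20, time=0.1))
    board.pop()
    return info["score"].pov(mover).score(mate_score=10_000) / 100.0

preference_pairs = []
board = chess.Board()  # in training, positions come from the self-play games
model_move = chess.Move.from_uci(model.predict(board.fen())["best_move"])
stockfish_move = engine.play(board, chess.engine.Limit(depth=20, time=0.1)).move

# Keep only meaningful mistakes (eval gap above the 0.3-pawn threshold)
gap = eval_after(board, stockfish_move) - eval_after(board, model_move)
if model_move != stockfish_move and gap > 0.3:
    preference_pairs.append(
        {"fen": board.fen(), "chosen": stockfish_move.uci(), "rejected": model_move.uci()}
    )

engine.quit()
```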
Training Hyperparameters
- Base Model: 9M parameter action-value model (pre-trained by DeepMind)
- Training Algorithm: Direct Preference Optimization (DPO)
- Self-play Games: 1000 games per iteration
- Preference Pairs Found: 36,407 (positions where the model played a suboptimal move)
- Batch Size: 32
- Learning Rate: 1e-5
- Gradient Steps: 50 per iteration
- DPO Beta: 0.1 (KL penalty coefficient)
- Eval Threshold: 0.3 pawns (minimum mistake margin)
- Stockfish Analysis: Depth 20, 0.1s per position
- Optimizer: Adam with gradient clipping (max norm 1.0)
- EMA Decay: 0.999 (used for inference)
- Reference Model: Updated every 3 iterations for stability
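For convenience, the same settings expressed as a plain Python dict, roughly as one might pass them to a training script (key names are illustrative, not the repository's actual configuration schema):

```python
# Illustrative hyperparameter dict; values are taken from the list above.
dpo_config = {
    "selfplay_games_per_iteration": 1000,
    "batch_size": 32,
    "learning_rate": 1e-5,
    "gradient_steps_per_iteration": 50,
    "dpo_beta": 0.1,                      # KL penalty coefficient
    "eval_threshold_pawns": 0.3,          # minimum mistake margin
    "stockfish_depth": 20,
    "stockfish_time_per_position": 0.1,   # seconds
    "max_grad_norm": 1.0,
    "ema_decay": 0.999,
    "reference_update_every_n_iterations": 3,
}
```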
DPO Loss Function
The DPO loss for action-value models:
```
log π(a|s) = Q(s,a)/τ − log Σ_{a'} exp(Q(s,a')/τ)

Loss = −log sigmoid( β · [ (log π_θ(chosen|s) − log π_ref(chosen|s))
                         − (log π_θ(rejected|s) − log π_ref(rejected|s)) ] )
```
Where:
- π_θ: Current policy (converted from Q-values)
- π_ref: Reference policy (frozen snapshot)
- β: KL penalty coefficient
- τ: Temperature for softmax conversion
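A minimal JAX sketch of this loss, assuming per-position Q-value vectors over legal actions (array shapes and names are illustrative; the actual training code may differ):

```python
import jax
import jax.numpy as jnp

def dpo_loss(q_theta, q_ref, chosen_idx, rejected_idx, beta=0.1, tau=1.0):
    """DPO loss for action-value models (sketch).

    q_theta, q_ref: [batch, num_actions] Q-values from the current and reference models.
    chosen_idx, rejected_idx: [batch] indices of Stockfish's move and the model's mistake.
    """
    # Convert Q-values to log-probabilities via a temperature-scaled softmax
    logp_theta = jax.nn.log_softmax(q_theta / tau, axis=-1)
    logp_ref = jax.nn.log_softmax(q_ref / tau, axis=-1)

    batch = jnp.arange(q_theta.shape[0])
    # Log-ratio of current vs. reference policy for chosen and rejected actions
    chosen_logratio = logp_theta[batch, chosen_idx] - logp_ref[batch, chosen_idx]
    rejected_logratio = logp_theta[batch, rejected_idx] - logp_ref[batch, rejected_idx]

    # Standard DPO objective: prefer the chosen (Stockfish) move over the rejected one
    return -jnp.mean(jax.nn.log_sigmoid(beta * (chosen_logratio - rejected_logratio)))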
Architecture
- Input: 77-token FEN representation
- Embedding: 256 dimensions
- Layers: 8 transformer blocks
- Attention Heads: 8 per layer
- Output: 128-bucket Q-value distribution over actions
- Positional Encoding: Learned
- Activation: GELU in feed-forward layers
- Total Parameters: ~9M
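A rough Haiku sketch of a transformer with these dimensions. The tokenizer, vocabulary size, block ordering, and output head are assumptions for illustration; see DeepMind's searchless_chess repository for the real implementation.

```python
import haiku as hk
import jax

SEQ_LEN = 77       # FEN token sequence length
VOCAB_SIZE = 32    # assumption: small FEN token vocabulary
EMBED_DIM = 256
NUM_LAYERS = 8
NUM_HEADS = 8
NUM_BUCKETS = 128  # Q-value buckets

def forward(tokens):
    """tokens: [batch, SEQ_LEN] int32 FEN tokens -> [batch, NUM_BUCKETS] bucket logits."""
    x = hk.Embed(VOCAB_SIZE, EMBED_DIM)(tokens)
    # Learned positional encoding
    pos = hk.get_parameter("pos_embed", [SEQ_LEN, EMBED_DIM],
                           init=hk.initializers.TruncatedNormal(0.02))
    x = x + pos
    for _ in range(NUM_LAYERS):
        # Pre-norm self-attention block
        h = hk.LayerNorm(axis=-1, create_scale=True, create_offset=True)(x)
        h = hk.MultiHeadAttention(num_heads=NUM_HEADS,
                                  key_size=EMBED_DIM // NUM_HEADS,
                                  model_size=EMBED_DIM,
                                  w_init=hk.initializers.VarianceScaling())(h, h, h)
        x = x + h
        # Feed-forward block with GELU
        h = hk.LayerNorm(axis=-1, create_scale=True, create_offset=True)(x)
        h = hk.Linear(4 * EMBED_DIM)(h)
        h = jax.nn.gelu(h)
        h = hk.Linear(EMBED_DIM)(h)
        x = x + h
    x = hk.LayerNorm(axis=-1, create_scale=True, create_offset=True)(x)
    # Read the 128-bucket Q-value distribution from the final token (an assumption)
    return hk.Linear(NUM_BUCKETS)(x[:, -1, :])

model_fn = hk.transform(forward)
```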
Training Process
The model was trained using mistake-focused self-play:
- Generate Self-Play Games: Model plays 1000 games against itself from diverse openings
- Analyze with Stockfish: Each position analyzed at depth 20 (0.1s per move)
- Extract Preferences: 36,407 position-move pairs where model made mistakes
- Filter Quality:
  - Eval difference ≥ 0.3 pawns (meaningful mistakes)
  - Position quality |eval| ≤ 3.0 pawns (avoid blown positions)
- DPO Training: 50 gradient steps optimizing preference likelihood
- Checkpoint: Save EMA parameters for inference
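The EMA parameters used at inference (decay 0.999) can be maintained as an exponential moving average over the parameter pytree; a minimal sketch, assuming Haiku-style parameter trees (the function name is illustrative):

```python
import jax

EMA_DECAY = 0.999

def update_ema(ema_params, new_params, decay=EMA_DECAY):
    """Exponential moving average of the parameter pytree, saved for inference."""
    return jax.tree_util.tree_map(
        lambda ema, new: decay * ema + (1.0 - decay) * new, ema_params, new_params
    )
```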
Performance Analysis
Strengths:
- Improved tactical accuracy (fewer blunders)
- Better move selection in middlegame positions
- Stronger in 1000-1500 Elo puzzle range
Current Limitations:
- Early training (only 1 iteration completed)
- Limited self-play data (1000 games)
- No explicit opening book or endgame tablebase
- Evaluation based on Q-values, not full search
Future Work:
- Continue training for more iterations (recommended: 10 iterations)
- Progressive curriculum (increase Stockfish depth over time)
- Larger batch sizes and more gradient steps
- Test on wider puzzle range and benchmark positions
Comparison to Base Model
| Metric | Base 9M | DPO-trained | Improvement |
|---|---|---|---|
| Puzzle Accuracy | 87% | 88% | +1% |
| Head-to-Head Score (50 games) | 43% | 57% | +14% |
| Elo Rating | Baseline | +25 | +25 Elo |
Citation
Based on the Searchless Chess work by DeepMind Technologies Limited:
```bibtex
@article{ruoss2024grandmaster,
  title={Grandmaster-Level Chess Without Search},
  author={Ruoss, Anian and others},
  journal={arXiv preprint arXiv:2402.04494},
  year={2024}
}
```
DPO algorithm from:
```bibtex
@article{rafailov2023direct,
  title={Direct Preference Optimization: Your Language Model is Secretly a Reward Model},
  author={Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D and Finn, Chelsea},
  journal={arXiv preprint arXiv:2305.18290},
  year={2023}
}
```
License
Apache 2.0