Searchless Chess 9M (DPO Self-Play Trained)

This is a 9 million parameter transformer-based chess engine trained using Direct Preference Optimization (DPO) with mistake-focused self-play and Stockfish supervision.

Model Description

  • Architecture: Transformer with 8 layers, 256 embedding dim, 8 attention heads
  • Training Method: DPO (Direct Preference Optimization) with self-play
  • Framework: JAX/Haiku
  • Parameters: ~9 million
  • Base Model: DeepMind's Searchless Chess 9M
  • Training Iteration: 1
  • Self-play Games: 1000 games
  • Preference Pairs: 36,407 (model mistakes)
  • Training Steps: 50 gradient steps
  • Final Loss: 0.6890 (down from 0.6931 ≈ ln 2, the DPO loss when policy and reference agree)

Performance Improvements

After just 1 iteration of DPO training:

Puzzle Solving:

  • Base 9M model: 87% accuracy
  • DPO-trained model: 88% accuracy
  • +1 percentage point overall, with the largest gain in the 1000-1500 rating range (+3.45 points)

Head-to-Head Games (50 games):

  • Win-Draw-Loss: 24-9-17 (vs base 9M)
  • Score: 57% (24 wins plus 9 draws counted as half a point, over 50 games)
  • Elo Improvement: +25 Elo (BayesElo calculation)

Quick Start

from searchless_chess.src import hf_model
from huggingface_hub import snapshot_download

# Download model
model_path = snapshot_download(
    repo_id="dbest-isi/searchless-chess-9M-dpo",
    local_dir="./chess_model"
)

# Load model
model = hf_model.SearchlessChessModel.from_pretrained("./chess_model")

# Predict move from starting position
fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
result = model.predict(fen)

print(f"Best move: {result['best_move']}")
print(f"Q-value: {result['q_value']:.4f}")

Installation

Required dependencies:

pip install jax jaxlib dm-haiku orbax-checkpoint numpy huggingface-hub python-chess

Training Details

DPO Algorithm

Direct Preference Optimization (DPO) is a preference-based learning algorithm that directly optimizes the policy without requiring a separate reward model. The training process:

  1. Self-Play Generation: Model plays 1000 games against itself
  2. Mistake Identification: Stockfish analyzes each position to find model errors
  3. Preference Pair Creation: For each mistake:
    • Chosen action: Stockfish's move (better outcome)
    • Rejected action: Model's move (worse outcome)
    • Filtering: only keep mistakes with an eval difference of at least 0.3 pawns
  4. DPO Training: Optimize policy to prefer Stockfish's moves using DPO loss
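
A minimal sketch of steps 2-3 using python-chess's UCI engine interface is shown below; the analysis limits and thresholds mirror the hyperparameters listed in the next section, but the function name and data layout are illustrative, not the actual training code:

import chess
import chess.engine

def find_preference_pairs(positions, stockfish_path="stockfish", threshold_cp=30):
    """Yield (fen, chosen_uci, rejected_uci) for positions where the model erred.

    `positions` is an iterable of (fen, model_move) pairs collected from self-play games.
    """
    engine = chess.engine.SimpleEngine.popen_uci(stockfish_path)
    try:
        for fen, model_move in positions:
            board = chess.Board(fen)
            limit = chess.engine.Limit(depth=20, time=0.1)
            info = engine.analyse(board, limit)
            best_move = info["pv"][0]
            best_cp = info["score"].relative.score(mate_score=10_000)
            # Evaluate the model's move by analysing the resulting position
            # from the opponent's perspective and negating.
            board.push(model_move)
            reply = engine.analyse(board, limit)
            model_cp = -reply["score"].relative.score(mate_score=10_000)
            if (model_move != best_move
                    and abs(best_cp) <= 300                   # skip already-decided positions
                    and best_cp - model_cp >= threshold_cp):  # meaningful mistake (>= 0.3 pawns)
                yield fen, best_move.uci(), model_move.uci()  # chosen, rejected
    finally:
        engine.quit()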

Training Hyperparameters

  • Base Model: 9M parameter action-value model (pre-trained by DeepMind)
  • Training Algorithm: Direct Preference Optimization (DPO)
  • Self-play Games: 1000 games per iteration
  • Preference Pairs Found: 36,407 (mistakes where model played suboptimal moves)
  • Batch Size: 32
  • Learning Rate: 1e-5
  • Gradient Steps: 50 per iteration
  • DPO Beta: 0.1 (KL penalty coefficient)
  • Eval Threshold: 0.3 pawns (minimum mistake margin)
  • Stockfish Analysis: Depth 20, 0.1s per position
  • Optimizer: Adam with gradient clipping (max norm 1.0)
  • EMA Decay: 0.999 (used for inference)
  • Reference Model: Updated every 3 iterations for stability
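
As a sketch, the optimizer and EMA settings above map onto optax like this (not the exact training script):

import jax
import optax

# Adam at 1e-5 with global-norm gradient clipping at 1.0, as listed above.
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adam(learning_rate=1e-5),
)

def update_ema(ema_params, params, decay=0.999):
    """Exponential moving average of parameters; the EMA copy is used for inference."""
    return jax.tree_util.tree_map(
        lambda e, p: decay * e + (1.0 - decay) * p, ema_params, params)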

DPO Loss Function

The DPO loss for action-value models:

log π(a|s) = Q(s,a)/τ - log Σ_{a'} exp(Q(s,a')/τ)

Loss = -log sigmoid(β * [(log π_θ(chosen|s) - log π_ref(chosen|s))
                         - (log π_θ(rejected|s) - log π_ref(rejected|s))])

Where:

  • π_θ: Current policy (converted from Q-values)
  • π_ref: Reference policy (frozen snapshot)
  • β: KL penalty coefficient
  • τ: Temperature for softmax conversion
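
A JAX sketch of this loss for a Q-value head follows; it assumes the per-position Q-values are stored as a [batch, num_actions] array indexed by action id, which is an assumption about the data layout, not the repository's actual API:

import jax
import jax.numpy as jnp

def dpo_loss(q_theta, q_ref, chosen, rejected, beta=0.1, tau=1.0):
    """DPO loss from Q-values.

    q_theta, q_ref: [batch, num_actions] Q-values from the current and reference models.
    chosen, rejected: [batch] integer action ids (Stockfish's move vs. the model's mistake).
    """
    # Convert Q-values to log-probabilities with a temperature-τ softmax.
    logp_theta = jax.nn.log_softmax(q_theta / tau, axis=-1)
    logp_ref = jax.nn.log_softmax(q_ref / tau, axis=-1)
    idx = jnp.arange(q_theta.shape[0])
    margin = ((logp_theta[idx, chosen] - logp_ref[idx, chosen])
              - (logp_theta[idx, rejected] - logp_ref[idx, rejected]))
    # -log sigmoid(β · margin); equals ln 2 ≈ 0.6931 when policy and reference agree.
    return -jnp.mean(jax.nn.log_sigmoid(beta * margin))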

Architecture

  • Input: 77-token FEN representation
  • Embedding: 256 dimensions
  • Layers: 8 transformer blocks
  • Attention Heads: 8 per layer
  • Output: 128-bucket Q-value distribution over actions
  • Positional Encoding: Learned
  • Activation: GELU in feed-forward layers
  • Total Parameters: ~9M
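
For orientation, the listed dimensions map onto a Haiku transformer roughly as follows; the token vocabulary size, pre-norm block layout, and output head here are assumptions for illustration, not DeepMind's exact implementation:

import haiku as hk
import jax
import jax.numpy as jnp

def chess_transformer(tokens):  # tokens: [batch, 77] int32
    vocab_size, seq_len, d_model = 128, 77, 256  # vocab size is an assumed value
    x = hk.Embed(vocab_size, d_model)(tokens)
    # Learned positional encoding.
    pos = hk.get_parameter("pos_emb", [seq_len, d_model],
                           init=hk.initializers.TruncatedNormal(0.02))
    x = x + pos
    for _ in range(8):  # 8 transformer blocks
        h = hk.LayerNorm(axis=-1, create_scale=True, create_offset=True)(x)
        h = hk.MultiHeadAttention(num_heads=8, key_size=d_model // 8, model_size=d_model,
                                  w_init=hk.initializers.VarianceScaling(1.0))(h, h, h)
        x = x + h
        h = hk.LayerNorm(axis=-1, create_scale=True, create_offset=True)(x)
        h = hk.Linear(4 * d_model)(h)
        h = jax.nn.gelu(h)  # GELU feed-forward
        h = hk.Linear(d_model)(h)
        x = x + h
    x = hk.LayerNorm(axis=-1, create_scale=True, create_offset=True)(x)
    return hk.Linear(128)(x[:, -1, :])  # 128-bucket value distribution (logits)

# Wrap for use: net = hk.without_apply_rng(hk.transform(chess_transformer))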

Training Process

The model was trained using mistake-focused self-play:

  1. Generate Self-Play Games: Model plays 1000 games against itself from diverse openings
  2. Analyze with Stockfish: Each position analyzed at depth 20 (0.1s per move)
  3. Extract Preferences: 36,407 position-move pairs where model made mistakes
  4. Filter Quality:
    • Eval difference ≥ 0.3 pawns (meaningful mistakes)
    • Position quality |eval| ≤ 3.0 pawns (avoid blown positions)
  5. DPO Training: 50 gradient steps optimizing preference likelihood
  6. Checkpoint: Save EMA parameters for inference
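
Steps 5-6 amount to repeated gradient updates on the preference pairs followed by an EMA update. A single step could be sketched as below, reusing dpo_loss, optimizer, and update_ema from the earlier sketches; model_apply stands in for the Haiku-transformed network's apply function and is hypothetical:

import jax
import optax

@jax.jit
def train_step(params, ref_params, ema_params, opt_state, batch):
    """One DPO gradient step; `batch` holds tokenized positions and chosen/rejected action ids."""
    def loss_fn(p):
        q_theta = model_apply(p, batch["tokens"])         # current model's Q-values
        q_ref = model_apply(ref_params, batch["tokens"])  # frozen reference model
        return dpo_loss(q_theta, q_ref, batch["chosen"], batch["rejected"])
    loss, grads = jax.value_and_grad(loss_fn)(params)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    ema_params = update_ema(ema_params, params)           # EMA copy saved for inference
    return params, ema_params, opt_state, loss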

Performance Analysis

Strengths:

  • Improved tactical accuracy (fewer blunders)
  • Better move selection in middlegame positions
  • Stronger in 1000-1500 Elo puzzle range

Current Limitations:

  • Early training (only 1 iteration completed)
  • Limited self-play data (1000 games)
  • No explicit opening book or endgame tablebase
  • Evaluation based on Q-values, not full search

Future Work:

  • Continue training for more iterations (recommended: 10 iterations)
  • Progressive curriculum (increase Stockfish depth over time)
  • Larger batch sizes and more gradient steps
  • Test on wider puzzle range and benchmark positions

Comparison to Base Model

Metric                   Base 9M    DPO-trained   Improvement
Puzzle Accuracy          87%        88%           +1 point
Head-to-Head Score       43%        57%           +14 points
Elo Rating               baseline   +25           +25 Elo

Citation

Based on the Searchless Chess work by DeepMind Technologies Limited:

@article{ruoss2024grandmaster,
  title={Grandmaster-Level Chess Without Search},
  author={Ruoss, Anian and others},
  journal={arXiv preprint arXiv:2402.04494},
  year={2024}
}

DPO algorithm from:

@article{rafailov2023direct,
  title={Direct Preference Optimization: Your Language Model is Secretly a Reward Model},
  author={Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D and Finn, Chelsea},
  journal={arXiv preprint arXiv:2305.18290},
  year={2023}
}

License

Apache 2.0
