Searchless Chess 9M (DPO Self-Play Trained)

This is a 9 million parameter transformer-based chess engine trained using Direct Preference Optimization (DPO) with mistake-focused self-play and Stockfish supervision.

Model Description

  • Architecture: Transformer with 8 layers, 256 embedding dim, 8 attention heads
  • Training Method: DPO (Direct Preference Optimization) with self-play
  • Framework: JAX/Haiku
  • Parameters: ~9 million
  • Base Model: DeepMind's Searchless Chess 9M
  • Training Iteration: 1
  • Self-play Games: 1000 games
  • Preference Pairs: 36,407 (model mistakes)
  • Training Steps: 50 gradient steps
  • Final Loss: 0.6890 (down from 0.6931 ≈ ln 2, the DPO loss when policy and reference agree)

Performance Improvements

After just 1 iteration of DPO training:

Puzzle Solving:

  • Base 9M model: 87% accuracy
  • DPO-trained model: 88% accuracy
  • +1 percentage point overall, with the largest gain in the 1000-1500 rating range (+3.45 points)

Head-to-Head Games (50 games):

  • Win-Draw-Loss: 24-9-17 (vs base 9M)
  • Score: 57% (24 wins plus 9 draws counted as half a point, over 50 games)
  • Elo Improvement: +25 Elo (BayesElo calculation)

Quick Start

from searchless_chess.src import hf_model
from huggingface_hub import snapshot_download

# Download model
model_path = snapshot_download(
    repo_id="dbest-isi/searchless-chess-9M-dpo",
    local_dir="./chess_model"
)

# Load model
model = hf_model.SearchlessChessModel.from_pretrained("./chess_model")

# Predict move from starting position
fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
result = model.predict(fen)

print(f"Best move: {result['best_move']}")
print(f"Q-value: {result['q_value']:.4f}")

Installation

Required dependencies:

pip install jax jaxlib dm-haiku orbax-checkpoint numpy huggingface-hub python-chess

Training Details

DPO Algorithm

Direct Preference Optimization (DPO) is a preference-based learning algorithm that directly optimizes the policy without requiring a separate reward model. The training process:

  1. Self-Play Generation: Model plays 1000 games against itself
  2. Mistake Identification: Stockfish analyzes each position to find model errors
  3. Preference Pair Creation: For each mistake:
    • Chosen action: Stockfish's move (better outcome)
    • Rejected action: Model's move (worse outcome)
    • Filtering: only keep mistakes with an eval difference of at least 0.3 pawns
  4. DPO Training: Optimize policy to prefer Stockfish's moves using DPO loss
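
A minimal sketch of steps 2-3 using python-chess's UCI engine interface is shown below; the analysis limits and thresholds mirror the hyperparameters listed in the next section, but the function name and data layout are illustrative, not the actual training code:

import chess
import chess.engine

def find_preference_pairs(positions, stockfish_path="stockfish", threshold_cp=30):
    """Yield (fen, chosen_uci, rejected_uci) for positions where the model erred.

    `positions` is an iterable of (fen, model_move) pairs collected from self-play games.
    """
    engine = chess.engine.SimpleEngine.popen_uci(stockfish_path)
    try:
        for fen, model_move in positions:
            board = chess.Board(fen)
            limit = chess.engine.Limit(depth=20, time=0.1)
            info = engine.analyse(board, limit)
            best_move = info["pv"][0]
            best_cp = info["score"].relative.score(mate_score=10_000)
            # Evaluate the model's move by analysing the resulting position
            # from the opponent's perspective and negating.
            board.push(model_move)
            reply = engine.analyse(board, limit)
            model_cp = -reply["score"].relative.score(mate_score=10_000)
            if (model_move != best_move
                    and abs(best_cp) <= 300                   # skip already-decided positions
                    and best_cp - model_cp >= threshold_cp):  # meaningful mistake (>= 0.3 pawns)
                yield fen, best_move.uci(), model_move.uci()  # chosen, rejected
    finally:
        engine.quit()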

Training Hyperparameters

  • Base Model: 9M parameter action-value model (pre-trained by DeepMind)
  • Training Algorithm: Direct Preference Optimization (DPO)
  • Self-play Games: 1000 games per iteration
  • Preference Pairs Found: 36,407 (mistakes where model played suboptimal moves)
  • Batch Size: 32
  • Learning Rate: 1e-5
  • Gradient Steps: 50 per iteration
  • DPO Beta: 0.1 (KL penalty coefficient)
  • Eval Threshold: 0.3 pawns (minimum mistake margin)
  • Stockfish Analysis: Depth 20, 0.1s per position
  • Optimizer: Adam with gradient clipping (max norm 1.0)
  • EMA Decay: 0.999 (used for inference)
  • Reference Model: Updated every 3 iterations for stability
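
As a sketch, the optimizer and EMA settings above map onto optax like this (not the exact training script):

import jax
import optax

# Adam at 1e-5 with global-norm gradient clipping at 1.0, as listed above.
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adam(learning_rate=1e-5),
)

def update_ema(ema_params, params, decay=0.999):
    """Exponential moving average of parameters; the EMA copy is used for inference."""
    return jax.tree_util.tree_map(
        lambda e, p: decay * e + (1.0 - decay) * p, ema_params, params)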

DPO Loss Function

The DPO loss for action-value models:

log π(a|s) = Q(s,a)/τ - log Σ_{a'} exp(Q(s,a')/τ)

Loss = -log sigmoid(β * [(log π_θ(chosen|s) - log π_ref(chosen|s))
                         - (log π_θ(rejected|s) - log π_ref(rejected|s))])

Where:

  • π_θ: Current policy (converted from Q-values)
  • π_ref: Reference policy (frozen snapshot)
  • β: KL penalty coefficient
  • τ: Temperature for softmax conversion
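
A JAX sketch of this loss for a Q-value head follows; it assumes the per-position Q-values are stored as a [batch, num_actions] array indexed by action id, which is an assumption about the data layout, not the repository's actual API:

import jax
import jax.numpy as jnp

def dpo_loss(q_theta, q_ref, chosen, rejected, beta=0.1, tau=1.0):
    """DPO loss from Q-values.

    q_theta, q_ref: [batch, num_actions] Q-values from the current and reference models.
    chosen, rejected: [batch] integer action ids (Stockfish's move vs. the model's mistake).
    """
    # Convert Q-values to log-probabilities with a temperature-τ softmax.
    logp_theta = jax.nn.log_softmax(q_theta / tau, axis=-1)
    logp_ref = jax.nn.log_softmax(q_ref / tau, axis=-1)
    idx = jnp.arange(q_theta.shape[0])
    margin = ((logp_theta[idx, chosen] - logp_ref[idx, chosen])
              - (logp_theta[idx, rejected] - logp_ref[idx, rejected]))
    # -log sigmoid(β · margin); equals ln 2 ≈ 0.6931 when policy and reference agree.
    return -jnp.mean(jax.nn.log_sigmoid(beta * margin))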

Architecture

  • Input: 77-token FEN representation
  • Embedding: 256 dimensions
  • Layers: 8 transformer blocks
  • Attention Heads: 8 per layer
  • Output: 128-bucket Q-value distribution over actions
  • Positional Encoding: Learned
  • Activation: GELU in feed-forward layers
  • Total Parameters: ~9M
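
For orientation, the listed dimensions map onto a Haiku transformer roughly as follows; the token vocabulary size, pre-norm block layout, and output head here are assumptions for illustration, not DeepMind's exact implementation:

import haiku as hk
import jax
import jax.numpy as jnp

def chess_transformer(tokens):  # tokens: [batch, 77] int32
    vocab_size, seq_len, d_model = 128, 77, 256  # vocab size is an assumed value
    x = hk.Embed(vocab_size, d_model)(tokens)
    # Learned positional encoding.
    pos = hk.get_parameter("pos_emb", [seq_len, d_model],
                           init=hk.initializers.TruncatedNormal(0.02))
    x = x + pos
    for _ in range(8):  # 8 transformer blocks
        h = hk.LayerNorm(axis=-1, create_scale=True, create_offset=True)(x)
        h = hk.MultiHeadAttention(num_heads=8, key_size=d_model // 8, model_size=d_model,
                                  w_init=hk.initializers.VarianceScaling(1.0))(h, h, h)
        x = x + h
        h = hk.LayerNorm(axis=-1, create_scale=True, create_offset=True)(x)
        h = hk.Linear(4 * d_model)(h)
        h = jax.nn.gelu(h)  # GELU feed-forward
        h = hk.Linear(d_model)(h)
        x = x + h
    x = hk.LayerNorm(axis=-1, create_scale=True, create_offset=True)(x)
    return hk.Linear(128)(x[:, -1, :])  # 128-bucket value distribution (logits)

# Wrap for use: net = hk.without_apply_rng(hk.transform(chess_transformer))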

Training Process

The model was trained using mistake-focused self-play:

  1. Generate Self-Play Games: Model plays 1000 games against itself from diverse openings
  2. Analyze with Stockfish: Each position analyzed at depth 20 (0.1s per move)
  3. Extract Preferences: 36,407 position-move pairs where model made mistakes
  4. Filter Quality:
    • Eval difference ≥ 0.3 pawns (meaningful mistakes)
    • Position quality |eval| ≤ 3.0 pawns (avoid blown positions)
  5. DPO Training: 50 gradient steps optimizing preference likelihood
  6. Checkpoint: Save EMA parameters for inference
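
Steps 5-6 amount to repeated gradient updates on the preference pairs followed by an EMA update. A single step could be sketched as below, reusing dpo_loss, optimizer, and update_ema from the earlier sketches; model_apply stands in for the Haiku-transformed network's apply function and is hypothetical:

import jax
import optax

@jax.jit
def train_step(params, ref_params, ema_params, opt_state, batch):
    """One DPO gradient step; `batch` holds tokenized positions and chosen/rejected action ids."""
    def loss_fn(p):
        q_theta = model_apply(p, batch["tokens"])         # current model's Q-values
        q_ref = model_apply(ref_params, batch["tokens"])  # frozen reference model
        return dpo_loss(q_theta, q_ref, batch["chosen"], batch["rejected"])
    loss, grads = jax.value_and_grad(loss_fn)(params)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    ema_params = update_ema(ema_params, params)           # EMA copy saved for inference
    return params, ema_params, opt_state, loss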

Performance Analysis

Strengths:

  • Improved tactical accuracy (fewer blunders)
  • Better move selection in middlegame positions
  • Stronger in 1000-1500 Elo puzzle range

Current Limitations:

  • Early training (only 1 iteration completed)
  • Limited self-play data (1000 games)
  • No explicit opening book or endgame tablebase
  • Evaluation based on Q-values, not full search

Future Work:

  • Continue training for more iterations (recommended: 10 iterations)
  • Progressive curriculum (increase Stockfish depth over time)
  • Larger batch sizes and more gradient steps
  • Test on wider puzzle range and benchmark positions

Comparison to Base Model

Metric                   Base 9M    DPO-trained   Improvement
Puzzle Accuracy          87%        88%           +1 point
Head-to-Head Score       43%        57%           +14 points
Elo Rating               baseline   +25           +25 Elo

Citation

Based on the Searchless Chess work by DeepMind Technologies Limited:

@article{ruoss2024grandmaster,
  title={Grandmaster-Level Chess Without Search},
  author={Ruoss, Anian and others},
  journal={arXiv preprint arXiv:2402.04494},
  year={2024}
}

DPO algorithm from:

@article{rafailov2023direct,
  title={Direct Preference Optimization: Your Language Model is Secretly a Reward Model},
  author={Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D and Finn, Chelsea},
  journal={arXiv preprint arXiv:2305.18290},
  year={2023}
}

License

Apache 2.0
