
team-pont-neuf-Qwen3-235B-A22B-Thinking-sft-2510

We developed this model to outperform current open-weight models on Humanity's Last Exam (HLE), a highly challenging benchmark used in the LLM Competition 2025 Bridge.

Performance

|  | Ours | Qwen3-235B-A22B-Thinking-2507 | Ours (previous) |
| --- | --- | --- | --- |
| Humanity's Last Exam |  |  |  |
| Overall Accuracy (%) | 18.72 | 18.07 | 17.61 |
| Math | 26.13 | 26.54 | 24.90 |
| Physics | 12.87 | 10.89 | 10.40 |
| Biology/Medicine | 17.57 | 17.57 | 14.41 |
| Humanities/Social Science | 12.44 | 9.84 | 12.44 |
| Computer Science/AI | 13.39 | 11.16 | 12.22 |
| Engineering | 15.63 | 15.63 | 12.50 |
| Chemistry | 12.87 | 6.93 | 10.89 |
| Others | 3.98 | 5.11 | 8.05 |
| Calibration Error | 74.0 | 74.0 | 75.0 |
| Do-Not-Answer (10-fold) |  |  |  |
| Safety rate (%) | 97.87 | 95.74 | 95.74 |

To ensure a fair evaluation, we used the same max_model_len and max_completion_tokens values across all models. For HLE, we set max_model_len = 262,144 and max_completion_tokens = 248,741; for DNA, max_model_len is the same and max_new_tokens = 512. The HLE scores reported above were judged by OpenAI o4-mini-2025-04-16.
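
For reference, a single prediction request under this token budget looks roughly like the following. This is a minimal sketch, not part of the verified pipeline: it assumes a vLLM server is already running on localhost:8000 (as in the scripts below) and that your vLLM build accepts the OpenAI-style max_completion_tokens field (newer versions do; older ones only accept max_tokens).

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "weblab-llm-competition-2025-bridge/team-pont-neuf-Qwen3-235B-A22B-Thinking-sft-2510",
    "messages": [{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    "max_completion_tokens": 248741
  }'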

Base model

Qwen/Qwen3-235B-A22B-Thinking-2507

No changes were made to the architecture.
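
One quick way to confirm this is to inspect the model config and compare it with the base model's (a small sketch, assuming you have access to both repositories):

# Print this model's config; it should match Qwen/Qwen3-235B-A22B-Thinking-2507
# apart from bookkeeping fields such as _name_or_path.
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('weblab-llm-competition-2025-bridge/team-pont-neuf-Qwen3-235B-A22B-Thinking-sft-2510'))"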

Training data

weblab-llm-competition-2025-bridge/team-pont-neuf-sft-dataset-2510

Training method

We plan to publish the details in a separate article.

Environment setup

conda install -c conda-forge --file requirements.txt
pip install \
  --index-url https://download.pytorch.org/whl/cu126 \
  --extra-index-url https://pypi.org/simple \
  torch==2.7.1+cu126 torchvision==0.22.1+cu126 torchaudio==2.7.1+cu126 \
  "vllm>=0.10.1.1"
(llmbench) $ pip list | grep vllm
vllm                              0.10.1.1
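
Optionally, a quick sanity check (a sketch, not part of the verified setup) to confirm that the CUDA build of PyTorch is active before launching vLLM:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# Expected output along the lines of: 2.7.1+cu126 True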

HLE inference and evaluation

The commands below have been verified to work. Be sure to set vllm's max-model-len parameter to 262144 and predict.py's max_completion_tokens parameter to 248741. A full run of predict.py takes about 25 hours (measured):

#!/bin/bash
#SBATCH --job-name=predict_full_hle_8gpu
#SBATCH --partition=P06
#SBATCH --nodelist=osk-gpu68
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=64
#SBATCH --output=eval_hle/logs/%x-%j.out
#SBATCH --error=eval_hle/logs/%x-%j.err

#--- Working directory & logs --------------------------------------------
export EVAL_DIR="eval_hle"
mkdir -p "$EVAL_DIR/logs"
echo "log dir : $EVAL_DIR/logs"

#--- Modules & Conda --------------------------------------------
module purge
module load cuda/12.6 miniconda/24.7.1-py312
module load cudnn/9.6.0
module load nccl/2.24.3
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate llmbench

# Replace these with your own tokens
HF_TOKEN=hf_yourtoken
WANDB_API_KEY=yourkey
OPENAI_API_KEY=sk-YOURKEY

export HF_HOME=${SLURM_TMPDIR:-$HOME}/.hf_cache
export HF_TOKEN=$HF_TOKEN
export WANDB_API_KEY=$WANDB_API_KEY
export HUGGINGFACE_HUB_TOKEN=$HF_TOKEN
export OPENAI_API_KEY=$OPENAI_API_KEY  # judge.py calls the OpenAI API (o4-mini) and needs this exported
mkdir -p "$HF_HOME"
echo "HF cache dir : $HF_HOME"

export PYTHONUNBUFFERED=1
export VLLM_LOGGING_LEVEL=DEBUG

#--- Launch vLLM (8 GPUs) ----------------------------------------------
vllm serve weblab-llm-competition-2025-bridge/team-pont-neuf-Qwen3-235B-A22B-Thinking-sft-2510 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 1 \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.95 \
  --reasoning-parser deepseek_r1 \
  --dtype "bfloat16" \
  > $EVAL_DIR/logs/vllm-$SLURM_JOB_ID.log 2>&1 &
pid_vllm=$!

#--- Health check -------------------------------------------------
until curl -s http://127.0.0.1:8000/health >/dev/null; do
  echo "$(date +%T) vLLM starting …"
  sleep 10
done
echo "vLLM READY"

#--- Inference -----------------------------------------------------------
cd $EVAL_DIR
python predict.py \
  model=weblab-llm-competition-2025-bridge/team-pont-neuf-Qwen3-235B-A22B-Thinking-sft-2510 \
  dataset=cais/hle \
  max_completion_tokens=248741 2>&1
cd ..

#--- Evaluation -----------------------------------------------------------
export BASE_URL="http://localhost:8000/v1" 
cd $EVAL_DIR
python judge.py \
  model=weblab-llm-competition-2025-bridge/team-pont-neuf-Qwen3-235B-A22B-Thinking-sft-2510 \
  dataset=cais/hle \
  max_completion_tokens=248741 2>&1
cd ..

#--- Cleanup -------------------------------------------------------
kill $pid_vllm 2>/dev/null
wait $pid_vllm 2>/dev/null

wait
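
Assuming the script above is saved as predict_full_hle_8gpu.sh (a hypothetical file name), submit it with sbatch and follow progress via the log paths declared in the #SBATCH directives:

sbatch predict_full_hle_8gpu.sh
tail -f eval_hle/logs/predict_full_hle_8gpu-<JOBID>.out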

DNA inference and evaluation

The commands below have been verified to work. Be sure to set vllm's max-model-len parameter to 262144:

#!/bin/bash
#SBATCH --job-name=predict_dna_8gpu
#SBATCH --partition=P06
#SBATCH --nodelist=osk-gpu68
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=64
#SBATCH --output=eval_dna/logs/%x-%j.out
#SBATCH --error=eval_dna/logs/%x-%j.err

#--- Working directory & logs --------------------------------------------
export EVAL_DIR="eval_dna"
mkdir -p "$EVAL_DIR/logs"
echo "log dir : $EVAL_DIR/logs"

#--- Modules & Conda --------------------------------------------
module purge
module load cuda/12.6 miniconda/24.7.1-py312
module load cudnn/9.6.0
module load nccl/2.24.3
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate llmbench

# Replace these with your own tokens
HF_TOKEN=hf_yourtoken
WANDB_API_KEY=yourkey
OPENAI_API_KEY=sk-YOURKEY

export HF_HOME=${SLURM_TMPDIR:-$HOME}/.hf_cache
export HUGGINGFACE_HUB_TOKEN=$HF_TOKEN
mkdir -p "$HF_HOME"
echo "HF cache dir : $HF_HOME"

#--- Launch vLLM (8 GPUs) ----------------------------------------------
vllm serve weblab-llm-competition-2025-bridge/team-pont-neuf-Qwen3-235B-A22B-Thinking-sft-2510 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 1 \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.95 \
  --reasoning-parser deepseek_r1 \
  --dtype "bfloat16" \
  > $EVAL_DIR/logs/vllm.log 2>&1 &
pid_vllm=$!

#--- Health check -------------------------------------------------
until curl -s http://127.0.0.1:8000/health >/dev/null; do
  echo "$(date +%T) vLLM starting …"
  sleep 10
done
echo "vLLM READY"

#--- Inference -----------------------------------------------------------
python $EVAL_DIR/llm-compe-eval/evaluate_huggingface_models.py \
    --model_name "weblab-llm-competition-2025-bridge/team-pont-neuf-Qwen3-235B-A22B-Thinking-sft-2510" \
    --dataset_path datasets/Instruction/do_not_answer_en.csv \
    --output_dir $EVAL_DIR/evaluation_results \
    --use_vllm \
    --vllm_base_url http://localhost:8000/v1 > $EVAL_DIR/logs/predict.log 2>&1

#--- Cleanup -------------------------------------------------------
kill $pid_vllm
wait
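
As with the HLE pipeline, save the script (e.g. as predict_dna_8gpu.sh, again a hypothetical name) and submit it with sbatch:

sbatch predict_dna_8gpu.sh
tail -f eval_dna/logs/predict_dna_8gpu-<JOBID>.out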