See axolotl config
axolotl version: 0.8.0.dev0
adapter: lora
base_model: Qwen/QwQ-32B-Preview
trust_remote_code: true
bf16: true
dataset_processes: 64
datasets:
  - path: phxdev/creed
    type: completion
    field: text
    trust_remote_code: false
    streaming: true
gradient_accumulation_steps: 1
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
learning_rate: 0.001
lisa_layers_attribute: model.layers
lisa_enabled: true
lisa_layers_fraction: 0.25
load_best_model_at_end: true
load_in_4bit: false
load_in_8bit: true
lora_alpha: 128
lora_dropout: 0.15
lora_r: 64
lora_target_modules:
- q_proj
- v_proj
- k_proj
- o_proj
- gate_proj
- down_proj
- up_proj
lora_fan_in_fan_out: false
modules_to_save:
- embed_tokens
- lm_head
loraplus_lr_embedding: 1.0e-06
loraplus_lr_ratio: 16
lr_scheduler: cosine_with_min_lr
lr_scheduler_kwargs:
  min_lr: 0.00001
max_prompt_len: 1024
mean_resizing_embeddings: false
micro_batch_size: 1
num_epochs: 3.0
optimizer: adamw_torch
# optim_args:
#   weight_decay: 0.05
#   betas: [0.9, 0.95]
#   eps: 1.0e-8
output_dir: ./outputs/heisenberg-crystal
pretrain_multipack_attn: true
pretrain_multipack_buffer_size: 20000
qlora_sharded_model_loading: false
ray_num_workers: 1
resources_per_worker:
  GPU: 1
resume_from_checkpoint: null
sample_packing: false
sample_packing_bin_size: 200
sample_packing_group_size: 100000
sample_packing_seq_len_multiplier: 1.0
save_only_model: true
save_safetensors: true
save_strategy: steps
save_steps: 100
save_total_limit: 3
eval_strategy: steps
eval_steps: 100
metric_for_best_model: loss
greater_is_better: false
sequence_len: 512
shuffle_merged_datasets: true
skip_prepare_dataset: false
strict: false
train_on_inputs: false
neftune_noise_alpha: 5.0
model_config:
  rope_scaling:
    type: linear
    factor: 1.5
dataloader_prefetch_factor: 4
dataloader_num_workers: 8
dataloader_pin_memory: true
dataloader_persistent_workers: true
max_grad_norm: 1.0
adam_beta2_schedule: cosine
torch_compile: true
torch_compile_backend: inductor
trl:
  log_completions: true
  ref_model_mixup_alpha: 0.9
  ref_model_sync_steps: 64
  sync_ref_model: false
  use_vllm: false
  vllm_device: auto
  vllm_dtype: auto
  vllm_gpu_memory_utilization: 0.9
use_ray: false
val_set_size: 0.05
warmup_steps: 100
warmup_ratio: 0.0
weight_decay: 0.05
flash_attention: true
flash_attn_cross_entropy: true
flash_attn_rms_norm: true
flash_attn_fuse_qkv: false
flash_attn_fuse_mlp: false
ddp_backend: nccl
ddp_broadcast_buffers: false
ddp_find_unused_parameters: false
tf32: true
bf16_full_eval: false
fp16: false
# unfrozen_parameters:
# - lm_head.*
# - embed_tokens.*
# - norm.*
xformers_attention: false
s2_attention: false
sdp_attention: false
pad_to_sequence_len: true
peft_use_dora: false
peft_lora_modules_to_save: null
special_tokens:
  pad_token: <|endoftext|>
deepspeed: null
fsdp: null
fsdp_config: null
# wandb_project: heisenberg-qwen
# wandb_entity: null
# wandb_name: blue-crystal-run
# wandb_log_model: checkpoint
hub_model_id: null
hub_strategy: null
report_to: []
logging_strategy: steps
logging_steps: 10
logging_first_step: true
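For reference, the adapter settings in this config map roughly onto the following PEFT LoraConfig. This is a hand-written sketch of the correspondence, not code emitted by axolotl.

```python
# Rough PEFT equivalent of the adapter settings in the config above (illustrative sketch).
from peft import LoraConfig

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=64,                    # lora_r
    lora_alpha=128,          # lora_alpha
    lora_dropout=0.15,       # lora_dropout
    fan_in_fan_out=False,    # lora_fan_in_fan_out
    target_modules=[
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "down_proj", "up_proj",
    ],
    # These modules are trained in full precision alongside the LoRA weights.
    modules_to_save=["embed_tokens", "lm_head"],
)
```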
outputs/heisenberg-crystal
This model is a fine-tuned version of Qwen/QwQ-32B-Preview on the phxdev/creed dataset. It achieves the following results on the evaluation set:
- Loss: nan
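As an illustration only, the adapter can be loaded on top of the base model with PEFT roughly as follows. The adapter path is an assumption taken from output_dir in the config above; substitute the published repo id or your own local path.

```python
# Minimal inference sketch (assumption: the adapter and tokenizer were saved to the
# output_dir from the config; adjust the path or use the hub repo id instead).
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/QwQ-32B-Preview"
adapter_path = "./outputs/heisenberg-crystal"  # hypothetical local path

tokenizer = AutoTokenizer.from_pretrained(adapter_path)  # carries the configured pad_token
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,  # bf16, matching the training config
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, adapter_path)
model.eval()

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```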
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.001
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine_with_min_lr with min_lr=1e-5 (sketched after this list)
- lr_scheduler_warmup_steps: 100
- num_epochs: 3.0
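The optimizer and learning-rate schedule listed above can be sketched as follows; total_steps is a placeholder, not the exact optimizer-step count of this run.

```python
# Sketch of the optimizer/scheduler pairing listed above (illustrative only).
import torch
from transformers import get_scheduler

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for the trainable adapter weights
optimizer = torch.optim.AdamW(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.05)

total_steps = 2400  # placeholder; the run covers roughly 3 epochs (see the table below)
scheduler = get_scheduler(
    "cosine_with_min_lr",
    optimizer=optimizer,
    num_warmup_steps=100,
    num_training_steps=total_steps,
    scheduler_specific_kwargs={"min_lr": 1e-5},
)

for _ in range(total_steps):
    optimizer.step()
    scheduler.step()  # warms up for 100 steps, then cosine-decays toward min_lr

print(scheduler.get_last_lr())  # ends at min_lr = 1e-5
```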
Training results
Training Loss | Epoch | Step | Validation Loss |
---|---|---|---|
No log | 0.0013 | 1 | nan |
7.8286 | 0.1259 | 100 | nan |
7.2486 | 0.2519 | 200 | nan |
7.2601 | 0.3778 | 300 | nan |
8.2142 | 0.5038 | 400 | nan |
7.1902 | 0.6297 | 500 | nan |
6.3799 | 0.7557 | 600 | nan |
6.7115 | 0.8816 | 700 | nan |
6.0414 | 1.0076 | 800 | nan |
6.428 | 1.1335 | 900 | nan |
6.3167 | 1.2594 | 1000 | nan |
6.0359 | 1.3854 | 1100 | nan |
6.3701 | 1.5113 | 1200 | nan |
6.9225 | 1.6373 | 1300 | nan |
6.5807 | 1.7632 | 1400 | nan |
6.8649 | 1.8892 | 1500 | nan |
6.1397 | 2.0151 | 1600 | nan |
5.7675 | 2.1411 | 1700 | nan |
6.2605 | 2.2670 | 1800 | nan |
5.8788 | 2.3929 | 1900 | nan |
6.0279 | 2.5189 | 2000 | nan |
6.3911 | 2.6448 | 2100 | nan |
6.0412 | 2.7708 | 2200 | nan |
6.0862 | 2.8967 | 2300 | nan |
Framework versions
- PEFT 0.14.0
- Transformers 4.49.0
- Pytorch 2.5.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0