YuPeng0214 committed
Commit cfdc5c5 · verified · 1 Parent(s): b86e3b4

Upload folder using huggingface_hub
.DS_Store ADDED
Binary file (6.15 kB). View file
 
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "word_embedding_dimension": 4096,
3
+ "pooling_mode_cls_token": false,
4
+ "pooling_mode_mean_tokens": false,
5
+ "pooling_mode_max_tokens": false,
6
+ "pooling_mode_mean_sqrt_len_tokens": false,
7
+ "pooling_mode_weightedmean_tokens": false,
8
+ "pooling_mode_lasttoken": true,
9
+ "include_prompt": true
10
+ }
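This pooling configuration enables only `pooling_mode_lasttoken` over the 4096-dimensional hidden states. The sketch below is a minimal illustration of what last-token pooling computes; it is not the shipped implementation (that is sentence-transformers' own `Pooling` module):

```
import torch

def last_token_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # token_embeddings: [batch, seq_len, 4096]; attention_mask: [batch, seq_len], 1 for real tokens.
    if bool(attention_mask[:, -1].all()):
        # Left padding (as used in the Quickstart below): the last position is always a real token.
        return token_embeddings[:, -1]
    # Right padding: pick the last non-padding position per sequence.
    last_idx = attention_mask.sum(dim=1) - 1
    return token_embeddings[torch.arange(token_embeddings.size(0)), last_idx]
```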
2_Dense/config.json ADDED
@@ -0,0 +1 @@
1
+ {"in_features": 4096, "out_features": 1792, "bias": true, "activation_function": "torch.nn.modules.linear.Identity"}
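This config describes the final projection from the 4096-dimensional pooled hidden state to the 1792-dimensional output embedding, with no non-linearity. A minimal sketch of the equivalent module, assuming the standard sentence-transformers `Dense` layout:

```
import torch
import torch.nn as nn

# 4096 -> 1792 linear map with bias and an identity "activation", as declared above.
# MRL dimensions (128 ... 1792) are obtained by truncating this 1792-dim output and re-normalizing.
dense = nn.Sequential(nn.Linear(4096, 1792, bias=True), nn.Identity())
pooled = torch.randn(2, 4096)
embedding = dense(pooled)  # shape: [2, 1792]
```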
2_Dense/model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:101b92dd892283cdea788128b1cf031941a3b5c35e2e1a483daab893c870b82c
3
+ size 14683816
README.md CHANGED
@@ -1,3 +1,159 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ tags:
4
+ - sentence-transformers
5
+ - sentence-similarity
6
+ - mteb
7
+ - retriever
8
+ ---
9
+ # QZhou-Embedding-Zh
10
+
11
+ ## Introduction
12
+ We are pleased to announce the release of our new model, **QZhou-Embedding-Zh**, built upon the architecture and parameters of the **Qwen3-8B** base model. QZhou-Embedding-Zh was developed with the data construction and training methodology of **QZhou-Embedding**, and additionally supports MRL embedding inference.
13
+
14
+ ## Key Enhancements and Optimizations
15
+
16
+ To build a stronger model, we adopted proven approaches from QZhou-Embedding and introduced the following additional optimizations:
17
+
18
+ 1. **Based on the Qwen3 model:** In our practice with QZhou-Embedding, the Qwen3 base model did not show a significant advantage over Qwen2.5-7B-Instruct in the first stage (retrieval). However, notable improvements were observed on Chinese-language tasks, likely due to Qwen3's stronger Chinese capabilities. We therefore upgraded the base model to Qwen3-8B while retaining the original model architecture and its **last_token pooling** strategy.
19
+
20
+ 2. **Support for MRL:** MRL (Matryoshka Representation Learning) is in high demand in practical applications, especially in high-concurrency, low-latency scenarios. Addressing the lack of MRL support in QZhou-Embedding, QZhou-Embedding-Zh now supports the following output dimensions: 128, 256, 512, 768, 1024, 1280, 1536, and 1792. The default output dimension is **1792**.
21
+
22
+ 3. **Token Prepending:** Originally proposed by Fu et al. (ACL 2025, Volume 1: Long Papers, pp. 3168–3181), this technique addresses the limitations of the unidirectional attention mechanism in decoder-only models. By prepending each layer's decoded sentence embedding to the beginning of the sentence in the next layer's input, it allows earlier tokens to attend to the complete sentence information under the causal attention mechanism, which significantly improves performance on STS and classification tasks. We kept the Stage-1 training strategy unchanged and integrated Token Prepending during Stage-2 training, using the PromptEOL template construction method described in the paper. Our experiments show that Token Prepending is not only an effective training-free enhancement but also further improves performance when fine-tuned on supervised datasets.
23
+
24
+ ## Token Prepending
25
+ ### Introduction
26
+ Token Prepending (TP) is a simple yet effective technique proposed by Fu et al. The core idea is to prepend each layer's decoded sentence embedding to the beginning of the sentence in the next layer's input, allowing earlier tokens to attend to the complete sentence information under the causal attention mechanism. TP is a plug-and-play technique that neither introduces new parameters nor alters existing ones, so it can be seamlessly integrated with various prompt-based sentence embedding methods and autoregressive LLMs. The architecture described in the original paper is shown below:
27
+ <div align="center">
28
+ <img src="assets/1.png" width="600" height="400" />
29
+ </div>
30
+
31
+ ### Our Adaptations and Optimizations
32
+ According to the conclusions of the original paper, the TP technique is completely training-free and requires no extra learnable parameters, serving as a plug-and-play way to improve various prompt-based methods. Since QZhou-Embedding-Zh is built upon the Qwen3 base model, retaining its unidirectional attention mechanism and employing last_token pooling, it is well suited to the TP technique. To further explore its potential, we trained with the TP technique on top of the Stage-1 retrieval base model, following the procedure below:
33
+ 1. We modified the model forward pass to apply TP only from layer 1 to layer 7 (layer indices), namely prepending the last embeddings to the input before these layers are processed (a simplified sketch of this modification follows this list);
34
+ 2. For the input template design, we integrated the PromptEOL template on top of the instruction-based input, using <|im_start|> as a placeholder (corresponding to the \<PST\> token in the original paper) to facilitate subsequent TP operations. The full template is structured as follows:
35
+ ```
36
+ "This sentence: <|im_start|>“Instruct: [instruction]\nQuery: [user_input]” means in one word: “
37
+ ```
38
+ 3. Stage 2 training was conducted using the updated model architecture and input structure.
39
+
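The sketch below illustrates the forward-pass modification described in step 1. It is a simplified, illustrative loop, not the shipped `modeling_qzhou.py`; the layer interface, mask handling, and the exact placeholder position are assumptions made for clarity.

```
import torch

def forward_with_token_prepending(layers, hidden_states, attention_mask,
                                  placeholder_pos=2, tp_start=1, tp_end=7):
    # hidden_states: [batch, seq_len, hidden]; placeholder_pos: index of the <|im_start|> slot
    # in the PromptEOL template (assumed here; the real index depends on tokenization).
    batch = torch.arange(hidden_states.size(0))
    last_idx = attention_mask.sum(dim=1) - 1  # last real token per sequence
    for i, layer in enumerate(layers):
        if tp_start <= i <= tp_end:
            # "Prepend" the previous layer's sentence embedding (its last-token state)
            # by writing it into the placeholder slot near the start of the sequence,
            # so earlier tokens can attend to it despite the causal mask.
            sentence_emb = hidden_states[batch, last_idx]
            hidden_states = hidden_states.clone()
            hidden_states[:, placeholder_pos] = sentence_emb
        hidden_states = layer(hidden_states)  # attention-mask plumbing omitted for brevity
    return hidden_states
```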
40
+ ## Usage
41
+ To facilitate model inference and CMTEB result replication on your own machine, we provide detailed specifications of the environment dependencies and the model implementation.
42
+
43
+ ### Requirements
44
+ - Python: 3.10.12
45
+ - Sentence Transformers: 3.4.1
46
+ - Transformers: 4.51.1
47
+ - PyTorch: 2.4.1
48
+ - Accelerate: 1.3.0
49
+ - Datasets: 3.6.0
50
+ - Tokenizers: 0.21.1
51
+ - mteb: 1.38.30
52
+
53
+ ### Quickstart
54
+ Since QZhou-Embedding-Zh incorporates a dedicated MRL linear projection module built on the sentence-transformers framework, we currently provide inference code only for sentence-transformers.
55
+
56
+ ```
57
+ from sentence_transformers import SentenceTransformer
58
+ from sklearn.preprocessing import normalize
59
+
60
+ def get_prompteol_input(text: str) -> str:
61
+ return f"This sentence: <|im_start|>“{text}” means in one word: “"
62
+
63
+ def get_detailed_instruct(task_description: str, query: str) -> str:
64
+ return f'Instruct: {task_description}\nQuery:{query}'
65
+
66
+ model = SentenceTransformer(
67
+ "Kingsoft-LLM/QZhou-Embedding-Zh",
68
+ model_kwargs={"device_map": "cuda", "trust_remote_code": True},
69
+ tokenizer_kwargs={"padding_side": "left", "trust_remote_code": True},
70
+ trust_remote_code=True
71
+ )
72
+
73
+ task= "Given a web search query, retrieve relevant passages that answer the query"
74
+ queries = [
75
+ get_prompteol_input(get_detailed_instruct(task, "光合作用是什么?")),
76
+ get_prompteol_input(get_detailed_instruct(task, "电话是谁发明的?"))
77
+ ]
78
+
79
+ documents = [
80
+ get_prompteol_input("光合作用是绿色植物利用阳光、二氧化碳和水生成葡萄糖和氧气的过程。这一生化反应发生在叶绿体中。"),
81
+ get_prompteol_input("亚历山大·格拉汉姆·贝尔(Alexander Graham Bell)因于1876年发明了第一台实用电话而广受认可,并为此设备获得了美国专利第174,465号。")
82
+ ]
83
+
84
+ query_embeddings = model.encode(queries, normalize_embeddings=False)
85
+ document_embeddings = model.encode(documents, normalize_embeddings=False)
86
+
87
+ dim=1792 # 128, 256, 512, 768, 1024, 1280, 1536, 1792
88
+ query_embeddings = normalize(query_embeddings[:, :dim])
89
+ document_embeddings = normalize(document_embeddings[:, :dim])
90
+
91
+ similarity = model.similarity(query_embeddings, document_embeddings)
92
+ print(similarity)
93
+ ```
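As a quick sanity check on MRL, you can compare the similarity matrix at the full 1792 dimensions against a truncated dimension. The snippet below reuses `model`, `queries`, `documents`, and `normalize` from the example above and is only illustrative.

```
# Rankings should generally be preserved when truncating to a smaller MRL dimension.
for dim in (1792, 256):
    q = normalize(model.encode(queries, normalize_embeddings=False)[:, :dim])
    d = normalize(model.encode(documents, normalize_embeddings=False)[:, :dim])
    print(dim, model.similarity(q, d))
```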
94
+
95
+ ### Fully reproducing the benchmark results
96
+
97
+ ```
98
+ normalize=true
99
+ use_instruction=true
100
+ export TOKENIZERS_PARALLELISM=true
101
+ embed_dim=1792 # 128, 256, 512, 768, 1024, 1280, 1536, 1792
102
+
103
+ model_name_or_path=<model dir>
104
+
105
+ python3 ./run_cmteb_all.py \
106
+ --model_name_or_path ${model_name_or_path} \
107
+ --normalize ${normalize} \
108
+ --dim ${embed_dim} \
109
+ --use_instruction ${use_instruction} \
110
+ --output_dir <output dir>
111
+
112
+ ```
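Note that `run_cmteb_all.py` is not included in this commit. If you only want to sanity-check the model on a single CMTEB task, a minimal, hypothetical alternative using the `mteb` package directly is sketched below; it skips the instruction/PromptEOL wrapping and MRL truncation handled by the full script, so scores will not exactly match the reported results.

```
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Kingsoft-LLM/QZhou-Embedding-Zh", trust_remote_code=True)
tasks = mteb.get_tasks(tasks=["T2Reranking"])  # any CMTEB task name
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")
```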
113
+
114
+
115
+ ## Citation
116
+ If you find our work useful, please cite it as follows:<br>
117
+ **Technical Report:**
118
+ ```
119
+ @misc{yu2025qzhouembeddingtechnicalreport,
120
+ title={QZhou-Embedding Technical Report},
121
+ author={Peng Yu and En Xu and Bin Chen and Haibiao Chen and Yinfei Xu},
122
+ year={2025},
123
+ eprint={2508.21632},
124
+ archivePrefix={arXiv},
125
+ primaryClass={cs.CL},
126
+ url={https://arxiv.org/abs/2508.21632},
127
+ }
128
+ ```
129
+ **Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs:**
130
+ ```
131
+ @inproceedings{fu-etal-2025-token,
132
+ title = "Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from {LLM}s",
133
+ author = "Fu, Yuchen and
134
+ Cheng, Zifeng and
135
+ Jiang, Zhiwei and
136
+ Wang, Zhonghui and
137
+ Yin, Yafeng and
138
+ Li, Zhengliang and
139
+ Gu, Qing",
140
+ booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
141
+ month = jul,
142
+ year = "2025",
143
+ publisher = "Association for Computational Linguistics",
144
+ url = "https://aclanthology.org/2025.acl-long.159/",
145
+ }
146
+ ```
147
+
148
+ **Qwen3 Series:**
149
+ ```
150
+ @misc{qwen3technicalreport,
151
+ title={Qwen3 Technical Report},
152
+ author={Qwen Team},
153
+ year={2025},
154
+ eprint={2505.09388},
155
+ archivePrefix={arXiv},
156
+ primaryClass={cs.CL},
157
+ url={https://arxiv.org/abs/2505.09388},
158
+ }
159
+ ```
added_tokens.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "</think>": 151668,
3
+ "</tool_call>": 151658,
4
+ "</tool_response>": 151666,
5
+ "<think>": 151667,
6
+ "<tool_call>": 151657,
7
+ "<tool_response>": 151665,
8
+ "<|box_end|>": 151649,
9
+ "<|box_start|>": 151648,
10
+ "<|endoftext|>": 151643,
11
+ "<|file_sep|>": 151664,
12
+ "<|fim_middle|>": 151660,
13
+ "<|fim_pad|>": 151662,
14
+ "<|fim_prefix|>": 151659,
15
+ "<|fim_suffix|>": 151661,
16
+ "<|im_end|>": 151645,
17
+ "<|im_start|>": 151644,
18
+ "<|image_pad|>": 151655,
19
+ "<|object_ref_end|>": 151647,
20
+ "<|object_ref_start|>": 151646,
21
+ "<|quad_end|>": 151651,
22
+ "<|quad_start|>": 151650,
23
+ "<|repo_name|>": 151663,
24
+ "<|video_pad|>": 151656,
25
+ "<|vision_end|>": 151653,
26
+ "<|vision_pad|>": 151654,
27
+ "<|vision_start|>": 151652
28
+ }
assets/.DS_Store ADDED
Binary file (6.15 kB). View file
 
assets/1.png ADDED
config.json ADDED
@@ -0,0 +1,33 @@
1
+ {
2
+ "architectures": [
3
+ "QZhouModel"
4
+ ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "auto_map": {
8
+ "AutoModel": "modeling_qzhou.QZhouModel"
9
+ },
10
+ "bos_token_id": 151643,
11
+ "eos_token_id": 151645,
12
+ "head_dim": 128,
13
+ "hidden_act": "silu",
14
+ "hidden_size": 4096,
15
+ "initializer_range": 0.02,
16
+ "intermediate_size": 12288,
17
+ "max_position_embeddings": 40960,
18
+ "max_window_layers": 36,
19
+ "model_type": "qwen3",
20
+ "num_attention_heads": 32,
21
+ "num_hidden_layers": 36,
22
+ "num_key_value_heads": 8,
23
+ "rms_norm_eps": 1e-06,
24
+ "rope_scaling": null,
25
+ "rope_theta": 1000000,
26
+ "sliding_window": null,
27
+ "tie_word_embeddings": false,
28
+ "torch_dtype": "bfloat16",
29
+ "transformers_version": "4.51.1",
30
+ "use_cache": true,
31
+ "use_sliding_window": false,
32
+ "vocab_size": 151936
33
+ }
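Because of the `auto_map` entry above, loading this repository with `AutoModel` resolves to the custom `QZhouModel` class in `modeling_qzhou.py`, which requires `trust_remote_code=True`. A minimal sketch is shown below; note that this path returns raw hidden states only, while the pooling and dense projection are applied by the Sentence Transformers pipeline shown in the README.

```
from transformers import AutoModel, AutoTokenizer

repo = "Kingsoft-LLM/QZhou-Embedding-Zh"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True, padding_side="left")
model = AutoModel.from_pretrained(repo, trust_remote_code=True, torch_dtype="auto")
```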
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "3.4.1",
4
+ "transformers": "4.51.1",
5
+ "pytorch": "2.4.1+cu121"
6
+ },
7
+ "prompts": {},
8
+ "default_prompt_name": null,
9
+ "similarity_fn_name": "cosine"
10
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9b3ccd3acd3e1f5a508f1bd46b18318a4a3fafa9d90d852fc614825a33da4415
3
+ size 4902257056
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e35325a89c3985669f0f4e7260593795b4e7f6e0fdd181d68593d7b50cf5a08e
3
+ size 4915959512
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:50ae96454c69200c2f374ed3d3beb3f94739dc8970841eff4e8df8d30b958b40
3
+ size 4983067656
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d0e76ec8c207af961424def0386393b37051fab776b73c1091c8146c20903de3
3
+ size 335570376
model.safetensors.index.json ADDED
@@ -0,0 +1,405 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 15136811008
4
+ },
5
+ "weight_map": {
6
+ "embed_tokens.weight": "model-00001-of-00004.safetensors",
7
+ "layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
8
+ "layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
9
+ "layers.0.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
10
+ "layers.0.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
11
+ "layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
12
+ "layers.0.self_attn.k_norm.weight": "model-00001-of-00004.safetensors",
13
+ "layers.0.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
14
+ "layers.0.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
15
+ "layers.0.self_attn.q_norm.weight": "model-00001-of-00004.safetensors",
16
+ "layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
17
+ "layers.0.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
18
+ "layers.1.input_layernorm.weight": "model-00001-of-00004.safetensors",
19
+ "layers.1.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
20
+ "layers.1.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
21
+ "layers.1.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
22
+ "layers.1.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
23
+ "layers.1.self_attn.k_norm.weight": "model-00001-of-00004.safetensors",
24
+ "layers.1.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
25
+ "layers.1.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
26
+ "layers.1.self_attn.q_norm.weight": "model-00001-of-00004.safetensors",
27
+ "layers.1.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
28
+ "layers.1.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
29
+ "layers.10.input_layernorm.weight": "model-00002-of-00004.safetensors",
30
+ "layers.10.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
31
+ "layers.10.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
32
+ "layers.10.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
33
+ "layers.10.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
34
+ "layers.10.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
35
+ "layers.10.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
36
+ "layers.10.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
37
+ "layers.10.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
38
+ "layers.10.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
39
+ "layers.10.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
40
+ "layers.11.input_layernorm.weight": "model-00002-of-00004.safetensors",
41
+ "layers.11.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
42
+ "layers.11.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
43
+ "layers.11.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
44
+ "layers.11.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
45
+ "layers.11.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
46
+ "layers.11.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
47
+ "layers.11.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
48
+ "layers.11.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
49
+ "layers.11.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
50
+ "layers.11.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
51
+ "layers.12.input_layernorm.weight": "model-00002-of-00004.safetensors",
52
+ "layers.12.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
53
+ "layers.12.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
54
+ "layers.12.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
55
+ "layers.12.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
56
+ "layers.12.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
57
+ "layers.12.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
58
+ "layers.12.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
59
+ "layers.12.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
60
+ "layers.12.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
61
+ "layers.12.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
62
+ "layers.13.input_layernorm.weight": "model-00002-of-00004.safetensors",
63
+ "layers.13.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
64
+ "layers.13.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
65
+ "layers.13.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
66
+ "layers.13.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
67
+ "layers.13.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
68
+ "layers.13.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
69
+ "layers.13.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
70
+ "layers.13.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
71
+ "layers.13.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
72
+ "layers.13.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
73
+ "layers.14.input_layernorm.weight": "model-00002-of-00004.safetensors",
74
+ "layers.14.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
75
+ "layers.14.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
76
+ "layers.14.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
77
+ "layers.14.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
78
+ "layers.14.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
79
+ "layers.14.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
80
+ "layers.14.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
81
+ "layers.14.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
82
+ "layers.14.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
83
+ "layers.14.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
84
+ "layers.15.input_layernorm.weight": "model-00002-of-00004.safetensors",
85
+ "layers.15.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
86
+ "layers.15.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
87
+ "layers.15.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
88
+ "layers.15.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
89
+ "layers.15.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
90
+ "layers.15.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
91
+ "layers.15.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
92
+ "layers.15.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
93
+ "layers.15.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
94
+ "layers.15.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
95
+ "layers.16.input_layernorm.weight": "model-00002-of-00004.safetensors",
96
+ "layers.16.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
97
+ "layers.16.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
98
+ "layers.16.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
99
+ "layers.16.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
100
+ "layers.16.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
101
+ "layers.16.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
102
+ "layers.16.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
103
+ "layers.16.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
104
+ "layers.16.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
105
+ "layers.16.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
106
+ "layers.17.input_layernorm.weight": "model-00002-of-00004.safetensors",
107
+ "layers.17.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
108
+ "layers.17.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
109
+ "layers.17.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
110
+ "layers.17.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
111
+ "layers.17.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
112
+ "layers.17.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
113
+ "layers.17.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
114
+ "layers.17.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
115
+ "layers.17.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
116
+ "layers.17.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
117
+ "layers.18.input_layernorm.weight": "model-00002-of-00004.safetensors",
118
+ "layers.18.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
119
+ "layers.18.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
120
+ "layers.18.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
121
+ "layers.18.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
122
+ "layers.18.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
123
+ "layers.18.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
124
+ "layers.18.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
125
+ "layers.18.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
126
+ "layers.18.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
127
+ "layers.18.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
128
+ "layers.19.input_layernorm.weight": "model-00002-of-00004.safetensors",
129
+ "layers.19.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
130
+ "layers.19.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
131
+ "layers.19.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
132
+ "layers.19.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
133
+ "layers.19.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
134
+ "layers.19.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
135
+ "layers.19.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
136
+ "layers.19.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
137
+ "layers.19.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
138
+ "layers.19.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
139
+ "layers.2.input_layernorm.weight": "model-00001-of-00004.safetensors",
140
+ "layers.2.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
141
+ "layers.2.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
142
+ "layers.2.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
143
+ "layers.2.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
144
+ "layers.2.self_attn.k_norm.weight": "model-00001-of-00004.safetensors",
145
+ "layers.2.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
146
+ "layers.2.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
147
+ "layers.2.self_attn.q_norm.weight": "model-00001-of-00004.safetensors",
148
+ "layers.2.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
149
+ "layers.2.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
150
+ "layers.20.input_layernorm.weight": "model-00002-of-00004.safetensors",
151
+ "layers.20.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
152
+ "layers.20.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
153
+ "layers.20.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
154
+ "layers.20.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
155
+ "layers.20.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
156
+ "layers.20.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
157
+ "layers.20.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
158
+ "layers.20.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
159
+ "layers.20.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
160
+ "layers.20.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
161
+ "layers.21.input_layernorm.weight": "model-00002-of-00004.safetensors",
162
+ "layers.21.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
163
+ "layers.21.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
164
+ "layers.21.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
165
+ "layers.21.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
166
+ "layers.21.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
167
+ "layers.21.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
168
+ "layers.21.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
169
+ "layers.21.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
170
+ "layers.21.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
171
+ "layers.21.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
172
+ "layers.22.input_layernorm.weight": "model-00003-of-00004.safetensors",
173
+ "layers.22.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
174
+ "layers.22.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
175
+ "layers.22.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
176
+ "layers.22.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
177
+ "layers.22.self_attn.k_norm.weight": "model-00002-of-00004.safetensors",
178
+ "layers.22.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
179
+ "layers.22.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
180
+ "layers.22.self_attn.q_norm.weight": "model-00002-of-00004.safetensors",
181
+ "layers.22.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
182
+ "layers.22.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
183
+ "layers.23.input_layernorm.weight": "model-00003-of-00004.safetensors",
184
+ "layers.23.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
185
+ "layers.23.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
186
+ "layers.23.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
187
+ "layers.23.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
188
+ "layers.23.self_attn.k_norm.weight": "model-00003-of-00004.safetensors",
189
+ "layers.23.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
190
+ "layers.23.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
191
+ "layers.23.self_attn.q_norm.weight": "model-00003-of-00004.safetensors",
192
+ "layers.23.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
193
+ "layers.23.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
194
+ "layers.24.input_layernorm.weight": "model-00003-of-00004.safetensors",
195
+ "layers.24.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
196
+ "layers.24.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
197
+ "layers.24.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
198
+ "layers.24.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
199
+ "layers.24.self_attn.k_norm.weight": "model-00003-of-00004.safetensors",
200
+ "layers.24.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
201
+ "layers.24.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
202
+ "layers.24.self_attn.q_norm.weight": "model-00003-of-00004.safetensors",
203
+ "layers.24.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
204
+ "layers.24.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
205
+ "layers.25.input_layernorm.weight": "model-00003-of-00004.safetensors",
206
+ "layers.25.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
207
+ "layers.25.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
208
+ "layers.25.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
209
+ "layers.25.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
210
+ "layers.25.self_attn.k_norm.weight": "model-00003-of-00004.safetensors",
211
+ "layers.25.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
212
+ "layers.25.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
213
+ "layers.25.self_attn.q_norm.weight": "model-00003-of-00004.safetensors",
214
+ "layers.25.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
215
+ "layers.25.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
216
+ "layers.26.input_layernorm.weight": "model-00003-of-00004.safetensors",
217
+ "layers.26.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
218
+ "layers.26.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
219
+ "layers.26.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
220
+ "layers.26.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
221
+ "layers.26.self_attn.k_norm.weight": "model-00003-of-00004.safetensors",
222
+ "layers.26.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
223
+ "layers.26.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
224
+ "layers.26.self_attn.q_norm.weight": "model-00003-of-00004.safetensors",
225
+ "layers.26.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
226
+ "layers.26.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
227
+ "layers.27.input_layernorm.weight": "model-00003-of-00004.safetensors",
228
+ "layers.27.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
229
+ "layers.27.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
230
+ "layers.27.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
231
+ "layers.27.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
232
+ "layers.27.self_attn.k_norm.weight": "model-00003-of-00004.safetensors",
233
+ "layers.27.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
234
+ "layers.27.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
235
+ "layers.27.self_attn.q_norm.weight": "model-00003-of-00004.safetensors",
236
+ "layers.27.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
237
+ "layers.27.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
238
+ "layers.28.input_layernorm.weight": "model-00003-of-00004.safetensors",
239
+ "layers.28.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
240
+ "layers.28.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
241
+ "layers.28.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
242
+ "layers.28.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
243
+ "layers.28.self_attn.k_norm.weight": "model-00003-of-00004.safetensors",
244
+ "layers.28.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
245
+ "layers.28.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
246
+ "layers.28.self_attn.q_norm.weight": "model-00003-of-00004.safetensors",
247
+ "layers.28.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
248
+ "layers.28.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
249
+ "layers.29.input_layernorm.weight": "model-00003-of-00004.safetensors",
250
+ "layers.29.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
251
+ "layers.29.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
252
+ "layers.29.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
253
+ "layers.29.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
254
+ "layers.29.self_attn.k_norm.weight": "model-00003-of-00004.safetensors",
255
+ "layers.29.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
256
+ "layers.29.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
257
+ "layers.29.self_attn.q_norm.weight": "model-00003-of-00004.safetensors",
258
+ "layers.29.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
259
+ "layers.29.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
260
+ "layers.3.input_layernorm.weight": "model-00001-of-00004.safetensors",
261
+ "layers.3.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
262
+ "layers.3.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
263
+ "layers.3.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
264
+ "layers.3.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
265
+ "layers.3.self_attn.k_norm.weight": "model-00001-of-00004.safetensors",
266
+ "layers.3.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
267
+ "layers.3.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
268
+ "layers.3.self_attn.q_norm.weight": "model-00001-of-00004.safetensors",
269
+ "layers.3.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
270
+ "layers.3.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
271
+ "layers.30.input_layernorm.weight": "model-00003-of-00004.safetensors",
272
+ "layers.30.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
273
+ "layers.30.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
274
+ "layers.30.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
275
+ "layers.30.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
276
+ "layers.30.self_attn.k_norm.weight": "model-00003-of-00004.safetensors",
277
+ "layers.30.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
278
+ "layers.30.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
279
+ "layers.30.self_attn.q_norm.weight": "model-00003-of-00004.safetensors",
280
+ "layers.30.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
281
+ "layers.30.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
282
+ "layers.31.input_layernorm.weight": "model-00003-of-00004.safetensors",
283
+ "layers.31.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
284
+ "layers.31.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
285
+ "layers.31.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
286
+ "layers.31.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
287
+ "layers.31.self_attn.k_norm.weight": "model-00003-of-00004.safetensors",
288
+ "layers.31.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
289
+ "layers.31.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
290
+ "layers.31.self_attn.q_norm.weight": "model-00003-of-00004.safetensors",
291
+ "layers.31.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
292
+ "layers.31.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
293
+ "layers.32.input_layernorm.weight": "model-00003-of-00004.safetensors",
294
+ "layers.32.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
295
+ "layers.32.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
296
+ "layers.32.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
297
+ "layers.32.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
298
+ "layers.32.self_attn.k_norm.weight": "model-00003-of-00004.safetensors",
299
+ "layers.32.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
300
+ "layers.32.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
301
+ "layers.32.self_attn.q_norm.weight": "model-00003-of-00004.safetensors",
302
+ "layers.32.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
303
+ "layers.32.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
304
+ "layers.33.input_layernorm.weight": "model-00003-of-00004.safetensors",
305
+ "layers.33.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
306
+ "layers.33.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
307
+ "layers.33.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
308
+ "layers.33.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
309
+ "layers.33.self_attn.k_norm.weight": "model-00003-of-00004.safetensors",
310
+ "layers.33.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
311
+ "layers.33.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
312
+ "layers.33.self_attn.q_norm.weight": "model-00003-of-00004.safetensors",
313
+ "layers.33.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
314
+ "layers.33.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
315
+ "layers.34.input_layernorm.weight": "model-00003-of-00004.safetensors",
316
+ "layers.34.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
317
+ "layers.34.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
318
+ "layers.34.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
319
+ "layers.34.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
320
+ "layers.34.self_attn.k_norm.weight": "model-00003-of-00004.safetensors",
321
+ "layers.34.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
322
+ "layers.34.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
323
+ "layers.34.self_attn.q_norm.weight": "model-00003-of-00004.safetensors",
324
+ "layers.34.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
325
+ "layers.34.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
326
+ "layers.35.input_layernorm.weight": "model-00004-of-00004.safetensors",
327
+ "layers.35.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
328
+ "layers.35.mlp.gate_proj.weight": "model-00004-of-00004.safetensors",
329
+ "layers.35.mlp.up_proj.weight": "model-00004-of-00004.safetensors",
330
+ "layers.35.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
331
+ "layers.35.self_attn.k_norm.weight": "model-00004-of-00004.safetensors",
332
+ "layers.35.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
333
+ "layers.35.self_attn.o_proj.weight": "model-00004-of-00004.safetensors",
334
+ "layers.35.self_attn.q_norm.weight": "model-00004-of-00004.safetensors",
335
+ "layers.35.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
336
+ "layers.35.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
337
+ "layers.4.input_layernorm.weight": "model-00001-of-00004.safetensors",
338
+ "layers.4.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
339
+ "layers.4.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
340
+ "layers.4.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
341
+ "layers.4.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
342
+ "layers.4.self_attn.k_norm.weight": "model-00001-of-00004.safetensors",
343
+ "layers.4.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
344
+ "layers.4.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
345
+ "layers.4.self_attn.q_norm.weight": "model-00001-of-00004.safetensors",
346
+ "layers.4.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
347
+ "layers.4.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
348
+ "layers.5.input_layernorm.weight": "model-00001-of-00004.safetensors",
349
+ "layers.5.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
350
+ "layers.5.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
351
+ "layers.5.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
352
+ "layers.5.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
353
+ "layers.5.self_attn.k_norm.weight": "model-00001-of-00004.safetensors",
354
+ "layers.5.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
355
+ "layers.5.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
356
+ "layers.5.self_attn.q_norm.weight": "model-00001-of-00004.safetensors",
357
+ "layers.5.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
358
+ "layers.5.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
359
+ "layers.6.input_layernorm.weight": "model-00001-of-00004.safetensors",
360
+ "layers.6.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
361
+ "layers.6.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
362
+ "layers.6.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
363
+ "layers.6.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
364
+ "layers.6.self_attn.k_norm.weight": "model-00001-of-00004.safetensors",
365
+ "layers.6.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
366
+ "layers.6.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
367
+ "layers.6.self_attn.q_norm.weight": "model-00001-of-00004.safetensors",
368
+ "layers.6.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
369
+ "layers.6.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
370
+ "layers.7.input_layernorm.weight": "model-00001-of-00004.safetensors",
371
+ "layers.7.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
372
+ "layers.7.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
373
+ "layers.7.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
374
+ "layers.7.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
375
+ "layers.7.self_attn.k_norm.weight": "model-00001-of-00004.safetensors",
376
+ "layers.7.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
377
+ "layers.7.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
378
+ "layers.7.self_attn.q_norm.weight": "model-00001-of-00004.safetensors",
379
+ "layers.7.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
380
+ "layers.7.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
381
+ "layers.8.input_layernorm.weight": "model-00001-of-00004.safetensors",
382
+ "layers.8.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
383
+ "layers.8.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
384
+ "layers.8.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
385
+ "layers.8.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
386
+ "layers.8.self_attn.k_norm.weight": "model-00001-of-00004.safetensors",
387
+ "layers.8.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
388
+ "layers.8.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
389
+ "layers.8.self_attn.q_norm.weight": "model-00001-of-00004.safetensors",
390
+ "layers.8.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
391
+ "layers.8.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
392
+ "layers.9.input_layernorm.weight": "model-00002-of-00004.safetensors",
393
+ "layers.9.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
394
+ "layers.9.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
395
+ "layers.9.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
396
+ "layers.9.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
397
+ "layers.9.self_attn.k_norm.weight": "model-00001-of-00004.safetensors",
398
+ "layers.9.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
399
+ "layers.9.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
400
+ "layers.9.self_attn.q_norm.weight": "model-00001-of-00004.safetensors",
401
+ "layers.9.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
402
+ "layers.9.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
403
+ "norm.weight": "model-00004-of-00004.safetensors"
404
+ }
405
+ }
modeling_qzhou.py ADDED
@@ -0,0 +1,664 @@
1
+
2
+ from functools import partial
3
+ from typing import Callable, Optional, Tuple, Union
4
+
5
+ import torch
6
+ from torch import nn
7
+
8
+ from transformers.activations import ACT2FN
9
+ from transformers.cache_utils import Cache, DynamicCache, SlidingWindowCache, StaticCache
10
+ from transformers.generation import GenerationMixin
11
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
12
+ from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
13
+ from transformers.modeling_outputs import (
14
+ BaseModelOutputWithPast,
15
+ CausalLMOutputWithPast,
16
+ QuestionAnsweringModelOutput,
17
+ SequenceClassifierOutputWithPast,
18
+ TokenClassifierOutput,
19
+ )
20
+ from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update
21
+ from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
22
+ from transformers.processing_utils import Unpack
23
+ from transformers.utils import (
24
+ LossKwargs,
25
+ can_return_tuple,
26
+ logging,
27
+ replace_return_docstrings,
28
+ )
29
+ from transformers.utils.deprecation import deprecate_kwarg
30
+ from transformers.models.qwen3.configuration_qwen3 import Qwen3Config
31
+
32
+
33
+ logger = logging.get_logger(__name__)
34
+
35
+ _CHECKPOINT_FOR_DOC = "Qwen/Qwen3-8B"
36
+ _CONFIG_FOR_DOC = "Qwen3Config"
37
+
38
+
39
+ class Qwen3RMSNorm(nn.Module):
40
+ def __init__(self, hidden_size, eps=1e-6):
41
+ """
42
+ Qwen3RMSNorm is equivalent to T5LayerNorm
43
+ """
44
+ super().__init__()
45
+ self.weight = nn.Parameter(torch.ones(hidden_size))
46
+ self.variance_epsilon = eps
47
+
48
+ def forward(self, hidden_states):
49
+ input_dtype = hidden_states.dtype
50
+ hidden_states = hidden_states.to(torch.float32)
51
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
52
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
53
+ return self.weight * hidden_states.to(input_dtype)
54
+
55
+ def extra_repr(self):
56
+ return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
57
+
58
+
59
+ class Qwen3MLP(nn.Module):
60
+ def __init__(self, config):
61
+ super().__init__()
62
+ self.config = config
63
+ self.hidden_size = config.hidden_size
64
+ self.intermediate_size = config.intermediate_size
65
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
66
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
67
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
68
+ self.act_fn = ACT2FN[config.hidden_act]
69
+
70
+ def forward(self, x):
71
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
72
+ return down_proj
73
+
74
+
75
+ def rotate_half(x):
76
+ """Rotates half the hidden dims of the input."""
77
+ x1 = x[..., : x.shape[-1] // 2]
78
+ x2 = x[..., x.shape[-1] // 2 :]
79
+ return torch.cat((-x2, x1), dim=-1)
80
+
81
+
82
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
83
+ """Applies Rotary Position Embedding to the query and key tensors.
84
+
85
+ Args:
86
+ q (`torch.Tensor`): The query tensor.
87
+ k (`torch.Tensor`): The key tensor.
88
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
89
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
90
+ position_ids (`torch.Tensor`, *optional*):
91
+ Deprecated and unused.
92
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
93
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
94
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
95
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
96
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
97
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
98
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
99
+ Returns:
100
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
101
+ """
102
+ cos = cos.unsqueeze(unsqueeze_dim)
103
+ sin = sin.unsqueeze(unsqueeze_dim)
104
+ q_embed = (q * cos) + (rotate_half(q) * sin)
105
+ k_embed = (k * cos) + (rotate_half(k) * sin)
106
+ return q_embed, k_embed
107
+
108
+
109
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
110
+ """
111
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
112
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
113
+ """
114
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
115
+ if n_rep == 1:
116
+ return hidden_states
117
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
118
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
119
+
120
+
121
+ def eager_attention_forward(
122
+ module: nn.Module,
123
+ query: torch.Tensor,
124
+ key: torch.Tensor,
125
+ value: torch.Tensor,
126
+ attention_mask: Optional[torch.Tensor],
127
+ scaling: float,
128
+ dropout: float = 0.0,
129
+ **kwargs,
130
+ ):
131
+ key_states = repeat_kv(key, module.num_key_value_groups)
132
+ value_states = repeat_kv(value, module.num_key_value_groups)
133
+
134
+ attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
135
+ if attention_mask is not None:
136
+ causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
137
+ attn_weights = attn_weights + causal_mask
138
+
139
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
140
+ attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
141
+ attn_output = torch.matmul(attn_weights, value_states)
142
+ attn_output = attn_output.transpose(1, 2).contiguous()
143
+
144
+ return attn_output, attn_weights
145
+
146
+
147
+ class Qwen3Attention(nn.Module):
148
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
149
+
150
+ def __init__(self, config: Qwen3Config, layer_idx: int):
151
+ super().__init__()
152
+ self.config = config
153
+ self.layer_idx = layer_idx
154
+ self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
155
+ self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
156
+ self.scaling = self.head_dim**-0.5
157
+ self.attention_dropout = config.attention_dropout
158
+ self.is_causal = True
159
+
160
+ self.q_proj = nn.Linear(
161
+ config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
162
+ )
163
+ self.k_proj = nn.Linear(
164
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
165
+ )
166
+ self.v_proj = nn.Linear(
167
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
168
+ )
169
+ self.o_proj = nn.Linear(
170
+ config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
171
+ )
172
+ self.q_norm = Qwen3RMSNorm(self.head_dim, eps=config.rms_norm_eps) # unlike olmo, only on the head dim!
173
+ self.k_norm = Qwen3RMSNorm(self.head_dim, eps=config.rms_norm_eps) # thus post q_norm does not need reshape
174
+ self.sliding_window = config.sliding_window
175
+ if not (
176
+ self.config.use_sliding_window
177
+ and getattr(self.config, "sliding_window", None) is not None
178
+ and self.layer_idx >= self.config.max_window_layers
179
+ ):
180
+ self.sliding_window = None
181
+
182
+ def forward(
183
+ self,
184
+ hidden_states: torch.Tensor,
185
+ position_embeddings: Tuple[torch.Tensor, torch.Tensor],
186
+ attention_mask: Optional[torch.Tensor],
187
+ past_key_value: Optional[Cache] = None,
188
+ cache_position: Optional[torch.LongTensor] = None,
189
+ **kwargs: Unpack[FlashAttentionKwargs],
190
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
191
+ input_shape = hidden_states.shape[:-1]
192
+ hidden_shape = (*input_shape, -1, self.head_dim)
193
+
194
+ query_states = self.q_norm(self.q_proj(hidden_states).view(hidden_shape)).transpose(1, 2)
195
+ key_states = self.k_norm(self.k_proj(hidden_states).view(hidden_shape)).transpose(1, 2)
196
+ value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
197
+
198
+ cos, sin = position_embeddings
199
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
200
+
201
+ if past_key_value is not None:
202
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
203
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
204
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
205
+
206
+ attention_interface: Callable = eager_attention_forward
207
+ if self.config._attn_implementation != "eager":
208
+ if self.config._attn_implementation == "sdpa" and kwargs.get("output_attentions", False):
209
+ logger.warning_once(
210
+ "`torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to "
211
+ 'eager attention. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
212
+ )
213
+ else:
214
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
215
+
216
+ attn_output, attn_weights = attention_interface(
217
+ self,
218
+ query_states,
219
+ key_states,
220
+ value_states,
221
+ attention_mask,
222
+ dropout=0.0 if not self.training else self.attention_dropout,
223
+ scaling=self.scaling,
224
+ sliding_window=self.sliding_window, # diff with Llama
225
+ **kwargs,
226
+ )
227
+
228
+ attn_output = attn_output.reshape(*input_shape, -1).contiguous()
229
+
230
+ attn_output = self.o_proj(attn_output)
231
+ return attn_output, attn_weights
232
+
233
+
234
+ class Qwen3DecoderLayer(nn.Module):
235
+ def __init__(self, config: Qwen3Config, layer_idx: int):
236
+ super().__init__()
237
+ self.hidden_size = config.hidden_size
238
+ self.self_attn = Qwen3Attention(config=config, layer_idx=layer_idx)
239
+ self.mlp = Qwen3MLP(config)
240
+ self.input_layernorm = Qwen3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
241
+ self.post_attention_layernorm = Qwen3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
242
+ if (
243
+ config.sliding_window and config._attn_implementation != "flash_attention_2"
244
+ ): # diff with Llama is this warning
245
+ logger.warning_once(
246
+ f"Sliding Window Attention is enabled but not implemented for `{config._attn_implementation}`; "
247
+ "unexpected results may be encountered."
248
+ )
249
+
250
+ def forward(
251
+ self,
252
+ hidden_states: torch.Tensor,
253
+ attention_mask: Optional[torch.Tensor] = None,
254
+ position_ids: Optional[torch.LongTensor] = None,
255
+ past_key_value: Optional[Cache] = None,
256
+ output_attentions: Optional[bool] = False,
257
+ use_cache: Optional[bool] = False,
258
+ cache_position: Optional[torch.LongTensor] = None,
259
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # necessary, but kept here for BC
260
+ **kwargs: Unpack[FlashAttentionKwargs],
261
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
262
+ residual = hidden_states
263
+
264
+ hidden_states = self.input_layernorm(hidden_states)
265
+
266
+ # Self Attention
267
+ hidden_states, self_attn_weights = self.self_attn(
268
+ hidden_states=hidden_states,
269
+ attention_mask=attention_mask,
270
+ position_ids=position_ids,
271
+ past_key_value=past_key_value,
272
+ output_attentions=output_attentions,
273
+ use_cache=use_cache,
274
+ cache_position=cache_position,
275
+ position_embeddings=position_embeddings,
276
+ **kwargs,
277
+ )
278
+ hidden_states = residual + hidden_states
279
+
280
+ # Fully Connected
281
+ residual = hidden_states
282
+ hidden_states = self.post_attention_layernorm(hidden_states)
283
+ hidden_states = self.mlp(hidden_states)
284
+ hidden_states = residual + hidden_states
285
+
286
+ outputs = (hidden_states,)
287
+ if output_attentions:
288
+ outputs += (self_attn_weights,)
289
+
290
+ return outputs
291
+
292
+
293
+ class Qwen3RotaryEmbedding(nn.Module):
294
+ def __init__(self, config: Qwen3Config, device=None):
295
+ super().__init__()
296
+ # BC: "rope_type" was originally "type"
297
+ if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
298
+ self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
299
+ else:
300
+ self.rope_type = "default"
301
+ self.max_seq_len_cached = config.max_position_embeddings
302
+ self.original_max_seq_len = config.max_position_embeddings
303
+
304
+ self.config = config
305
+ self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
306
+
307
+ inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
308
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
309
+ self.original_inv_freq = self.inv_freq
310
+
311
+ @torch.no_grad()
312
+ @dynamic_rope_update # power user: used with advanced RoPE types (e.g. dynamic rope)
313
+ def forward(self, x, position_ids):
314
+ inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
315
+ position_ids_expanded = position_ids[:, None, :].float()
316
+
317
+ device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu"
318
+ with torch.autocast(device_type=device_type, enabled=False): # Force float32
319
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
320
+ emb = torch.cat((freqs, freqs), dim=-1)
321
+ cos = emb.cos() * self.attention_scaling
322
+ sin = emb.sin() * self.attention_scaling
323
+
324
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
325
+
326
+
327
+ class Qwen3PreTrainedModel(PreTrainedModel):
328
+ config_class = Qwen3Config
329
+ base_model_prefix = "model"
330
+ supports_gradient_checkpointing = True
331
+ _no_split_modules = ["Qwen3DecoderLayer"]
332
+ _skip_keys_device_placement = ["past_key_values"]
333
+ _supports_flash_attn_2 = True
334
+ _supports_sdpa = True
335
+ _supports_flex_attn = True
336
+ _supports_cache_class = True
337
+ _supports_quantized_cache = True
338
+ _supports_static_cache = True
339
+ _supports_attention_backend = True
340
+
341
+ def _init_weights(self, module):
342
+ std = self.config.initializer_range
343
+ if isinstance(module, nn.Linear):
344
+ module.weight.data.normal_(mean=0.0, std=std)
345
+ if module.bias is not None:
346
+ module.bias.data.zero_()
347
+ elif isinstance(module, nn.Embedding):
348
+ module.weight.data.normal_(mean=0.0, std=std)
349
+ if module.padding_idx is not None:
350
+ module.weight.data[module.padding_idx].zero_()
351
+
352
+
353
+ def find_token_indices(input_ids, token=151644):
354
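+ # Descriptive note (added): returns the index of the first occurrence of `token` in each row of
+ # `input_ids`; the default 151644 is the <|im_start|> token id (see tokenizer_config.json below),
+ # and every sequence is expected to contain it.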
+ assert (input_ids == token).any(dim=1).all(), f"Not all sequences contain the token {token}"
355
+
356
+ mask = (input_ids == token)
357
+ mask_float = mask.float()
358
+ first_match_indices = mask_float.argmax(dim=1)
359
+
360
+ return first_match_indices
361
+
362
+
363
+ class QZhouModel(Qwen3PreTrainedModel): # QZhouModel is built upon the Qwen3Model framework with Token Prepending (TP) modifications.
364
+ """
365
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`Qwen3DecoderLayer`]
366
+
367
+ Args:
368
+ config: Qwen3Config
369
+ """
370
+
371
+ def __init__(self, config: Qwen3Config):
372
+ super().__init__(config)
373
+ self.padding_idx = config.pad_token_id
374
+ self.vocab_size = config.vocab_size
375
+
376
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
377
+ self.layers = nn.ModuleList(
378
+ [Qwen3DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
379
+ )
380
+ self.norm = Qwen3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
381
+ self.rotary_emb = Qwen3RotaryEmbedding(config=config)
382
+ self.gradient_checkpointing = False
383
+
384
+ # Initialize weights and apply final processing
385
+ self.post_init()
386
+
387
+ def get_input_embeddings(self):
388
+ return self.embed_tokens
389
+
390
+ def set_input_embeddings(self, value):
391
+ self.embed_tokens = value
392
+
393
+ @can_return_tuple
394
+ def forward(
395
+ self,
396
+ input_ids: Optional[torch.LongTensor] = None,
397
+ attention_mask: Optional[torch.Tensor] = None,
398
+ position_ids: Optional[torch.LongTensor] = None,
399
+ past_key_values: Optional[Cache] = None,
400
+ inputs_embeds: Optional[torch.FloatTensor] = None,
401
+ use_cache: Optional[bool] = None,
402
+ output_attentions: Optional[bool] = None,
403
+ output_hidden_states: Optional[bool] = None,
404
+ cache_position: Optional[torch.LongTensor] = None,
405
+ **flash_attn_kwargs: Unpack[FlashAttentionKwargs],
406
+ ) -> BaseModelOutputWithPast:
407
+
408
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
409
+ output_hidden_states = (
410
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
411
+ )
412
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
413
+
414
+ if (input_ids is None) ^ (inputs_embeds is not None):
415
+ raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
416
+
417
+ if self.gradient_checkpointing and self.training and use_cache:
418
+ logger.warning_once(
419
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
420
+ )
421
+ use_cache = False
422
+
423
+ # TODO (joao): remove this exception in v4.56 -- it exists for users that try to pass a legacy cache
424
+ if not isinstance(past_key_values, (type(None), Cache)):
425
+ raise ValueError("The `past_key_values` should be either a `Cache` object or `None`.")
426
+
427
+ if inputs_embeds is None:
428
+ inputs_embeds = self.embed_tokens(input_ids)
429
+
430
+ if use_cache and past_key_values is None:
431
+ past_key_values = DynamicCache()
432
+
433
+ if cache_position is None:
434
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
435
+ cache_position = torch.arange(
436
+ past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
437
+ )
438
+
439
+ if position_ids is None:
440
+ position_ids = cache_position.unsqueeze(0)
441
+
442
+ causal_mask = self._update_causal_mask(
443
+ attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
444
+ )
445
+
446
+ hidden_states = inputs_embeds
447
+
448
+ # create position embeddings to be shared across the decoder layers
449
+ position_embeddings = self.rotary_emb(hidden_states, position_ids)
450
+
451
+ # decoder layers
452
+ all_hidden_states = () if output_hidden_states else None
453
+ all_self_attns = () if output_attentions else None
454
+
455
+
456
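+ # Descriptive note (added): Token Prepending (TP) step. Locate the <|im_start|> placeholder
+ # (token id 151644) in each sequence; for decoder layers 1-7, the hidden state at that position
+ # is overwritten with the last-token hidden state produced by the previous layer, which serves
+ # as the sentence representation for the next layer's input.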
+ pst_token_indices = find_token_indices(input_ids, token=151644)
457
+ for decoder_layer in self.layers[: self.config.num_hidden_layers]:
458
+ if 1 <= decoder_layer.self_attn.layer_idx <= 7:
459
+ B = hidden_states.shape[0]
460
+ previous_sentence_embeddings = hidden_states[:, -1, :].clone()
461
+ hidden_states[torch.arange(B), pst_token_indices, :] = previous_sentence_embeddings
462
+
463
+ if output_hidden_states:
464
+ all_hidden_states += (hidden_states,)
465
+
466
+ if self.gradient_checkpointing and self.training:
467
+ layer_outputs = self._gradient_checkpointing_func(
468
+ partial(decoder_layer.__call__, **flash_attn_kwargs),
469
+ hidden_states,
470
+ causal_mask,
471
+ position_ids,
472
+ past_key_values,
473
+ output_attentions,
474
+ use_cache,
475
+ cache_position,
476
+ position_embeddings,
477
+ )
478
+ else:
479
+ layer_outputs = decoder_layer(
480
+ hidden_states,
481
+ attention_mask=causal_mask,
482
+ position_ids=position_ids,
483
+ past_key_value=past_key_values,
484
+ output_attentions=output_attentions,
485
+ use_cache=use_cache,
486
+ cache_position=cache_position,
487
+ position_embeddings=position_embeddings,
488
+ **flash_attn_kwargs,
489
+ )
490
+
491
+ hidden_states = layer_outputs[0]
492
+
493
+ if output_attentions:
494
+ all_self_attns += (layer_outputs[1],)
495
+
496
+ hidden_states = self.norm(hidden_states)
497
+
498
+ # add hidden states from the last decoder layer
499
+ if output_hidden_states:
500
+ all_hidden_states += (hidden_states,)
501
+
502
+ return BaseModelOutputWithPast(
503
+ last_hidden_state=hidden_states,
504
+ past_key_values=past_key_values if use_cache else None,
505
+ hidden_states=all_hidden_states,
506
+ attentions=all_self_attns,
507
+ )
508
+
509
+ def _update_causal_mask(
510
+ self,
511
+ attention_mask: torch.Tensor,
512
+ input_tensor: torch.Tensor,
513
+ cache_position: torch.Tensor,
514
+ past_key_values: Cache,
515
+ output_attentions: bool = False,
516
+ ):
517
+ if self.config._attn_implementation == "flash_attention_2":
518
+ if attention_mask is not None and past_key_values is not None:
519
+ is_padding_right = attention_mask[:, -1].sum().item() != input_tensor.size()[0]
520
+ if is_padding_right:
521
+ raise ValueError(
522
+ "You are attempting to perform batched generation with padding_side='right'"
523
+ " this may lead to unexpected behaviour for Flash Attention version of Qwen3. Make sure to "
524
+ " call `tokenizer.padding_side = 'left'` before tokenizing the input. "
525
+ )
526
+ if attention_mask is not None and 0.0 in attention_mask:
527
+ return attention_mask
528
+ return None
529
+
530
+ # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
531
+ # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
532
+ # to infer the attention mask.
533
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
534
+ using_static_cache = isinstance(past_key_values, StaticCache)
535
+ using_sliding_window_cache = isinstance(past_key_values, SlidingWindowCache)
536
+
537
+ # When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
538
+ if (
539
+ self.config._attn_implementation == "sdpa"
540
+ and not (using_static_cache or using_sliding_window_cache)
541
+ and not output_attentions
542
+ ):
543
+ if AttentionMaskConverter._ignore_causal_mask_sdpa(
544
+ attention_mask,
545
+ inputs_embeds=input_tensor,
546
+ past_key_values_length=past_seen_tokens,
547
+ sliding_window=self.config.sliding_window,
548
+ is_training=self.training,
549
+ ):
550
+ return None
551
+
552
+ dtype, device = input_tensor.dtype, input_tensor.device
553
+ min_dtype = torch.finfo(dtype).min
554
+ sequence_length = input_tensor.shape[1]
555
+ # SlidingWindowCache or StaticCache
556
+ if using_sliding_window_cache or using_static_cache:
557
+ target_length = past_key_values.get_max_cache_shape()
558
+ # DynamicCache or no cache
559
+ else:
560
+ target_length = (
561
+ attention_mask.shape[-1]
562
+ if isinstance(attention_mask, torch.Tensor)
563
+ else past_seen_tokens + sequence_length + 1
564
+ )
565
+
566
+ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D).
567
+ causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
568
+ attention_mask,
569
+ sequence_length=sequence_length,
570
+ target_length=target_length,
571
+ dtype=dtype,
572
+ device=device,
573
+ cache_position=cache_position,
574
+ batch_size=input_tensor.shape[0],
575
+ config=self.config,
576
+ past_key_values=past_key_values,
577
+ )
578
+
579
+ if (
580
+ self.config._attn_implementation == "sdpa"
581
+ and attention_mask is not None
582
+ and attention_mask.device.type in ["cuda", "xpu"]
583
+ and not output_attentions
584
+ ):
585
+ # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
586
+ # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
587
+ # Details: https://github.com/pytorch/pytorch/issues/110213
588
+ causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
589
+
590
+ return causal_mask
591
+
592
+ @staticmethod
593
+ def _prepare_4d_causal_attention_mask_with_cache_position(
594
+ attention_mask: torch.Tensor,
595
+ sequence_length: int,
596
+ target_length: int,
597
+ dtype: torch.dtype,
598
+ device: torch.device,
599
+ cache_position: torch.Tensor,
600
+ batch_size: int,
601
+ config: Qwen3Config,
602
+ past_key_values: Cache,
603
+ ):
604
+ """
605
+ Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
606
+ `(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.
607
+
608
+ Args:
609
+ attention_mask (`torch.Tensor`):
610
+ A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape `(batch_size, 1, query_length, key_value_length)`.
611
+ sequence_length (`int`):
612
+ The sequence length being processed.
613
+ target_length (`int`):
614
+ The target length: when generating with static cache, the mask should be as long as the static cache, to account for the 0 padding, the part of the cache that is not filled yet.
615
+ dtype (`torch.dtype`):
616
+ The dtype to use for the 4D attention mask.
617
+ device (`torch.device`):
618
+ The device to place the 4D attention mask on.
619
+ cache_position (`torch.Tensor`):
620
+ Indices depicting the position of the input sequence tokens in the sequence.
621
+ batch_size (`int`):
622
+ Batch size.
623
+ config (`Qwen3Config`):
624
+ The model's configuration class
625
+ past_key_values (`Cache`):
626
+ The cache class that is being used currently to generate
627
+ """
628
+ if attention_mask is not None and attention_mask.dim() == 4:
629
+ # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
630
+ causal_mask = attention_mask
631
+ else:
632
+ min_dtype = torch.finfo(dtype).min
633
+ causal_mask = torch.full(
634
+ (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device
635
+ )
636
+ diagonal_attend_mask = torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
637
+ if config.sliding_window is not None:
638
+ # if we have sliding window, we should not attend to tokens beyond sliding window length, so we mask them out also
639
+ # the check is needed to verify whether the current checkpoint was trained with sliding window or not
640
+ if not isinstance(past_key_values, SlidingWindowCache) or sequence_length > target_length:
641
+ sliding_attend_mask = torch.arange(target_length, device=device) <= (
642
+ cache_position.reshape(-1, 1) - config.sliding_window
643
+ )
644
+ diagonal_attend_mask.bitwise_or_(sliding_attend_mask)
645
+ causal_mask *= diagonal_attend_mask
646
+ causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
647
+ if attention_mask is not None:
648
+ causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
649
+ if attention_mask.shape[-1] > target_length:
650
+ attention_mask = attention_mask[:, :target_length]
651
+ mask_length = attention_mask.shape[-1]
652
+ padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :].to(
653
+ causal_mask.device
654
+ )
655
+ padding_mask = padding_mask == 0
656
+ causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
657
+ padding_mask, min_dtype
658
+ )
659
+ return causal_mask
660
+
661
+
662
+ __all__ = [
663
+ "QZhouModel"
664
+ ]
modules.json ADDED
@@ -0,0 +1,20 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ },
14
+ {
15
+ "idx": 2,
16
+ "name": "2",
17
+ "path": "2_Dense",
18
+ "type": "sentence_transformers.models.Dense"
19
+ }
20
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "max_seq_length": 40960,
3
+ "do_lower_case": false
4
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>"
16
+ ],
17
+ "eos_token": {
18
+ "content": "<|im_end|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "pad_token": {
25
+ "content": "<|endoftext|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ }
31
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b326efa3b3cb974b258836e76f8a992c77a1fa93b9d9126a1632a416bf663a20
3
+ size 11422933
tokenizer_config.json ADDED
@@ -0,0 +1,246 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "151665": {
182
+ "content": "<tool_response>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": false
188
+ },
189
+ "151666": {
190
+ "content": "</tool_response>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": false
196
+ },
197
+ "151667": {
198
+ "content": "<think>",
199
+ "lstrip": false,
200
+ "normalized": false,
201
+ "rstrip": false,
202
+ "single_word": false,
203
+ "special": false
204
+ },
205
+ "151668": {
206
+ "content": "</think>",
207
+ "lstrip": false,
208
+ "normalized": false,
209
+ "rstrip": false,
210
+ "single_word": false,
211
+ "special": false
212
+ }
213
+ },
214
+ "additional_special_tokens": [
215
+ "<|im_start|>",
216
+ "<|im_end|>",
217
+ "<|object_ref_start|>",
218
+ "<|object_ref_end|>",
219
+ "<|box_start|>",
220
+ "<|box_end|>",
221
+ "<|quad_start|>",
222
+ "<|quad_end|>",
223
+ "<|vision_start|>",
224
+ "<|vision_end|>",
225
+ "<|vision_pad|>",
226
+ "<|image_pad|>",
227
+ "<|video_pad|>"
228
+ ],
229
+ "bos_token": null,
230
+ "clean_up_tokenization_spaces": false,
231
+ "eos_token": "<|im_end|>",
232
+ "errors": "replace",
233
+ "extra_special_tokens": {},
234
+ "max_length": 32768,
235
+ "model_max_length": 40960,
236
+ "pad_to_multiple_of": null,
237
+ "pad_token": "<|endoftext|>",
238
+ "pad_token_type_id": 0,
239
+ "padding_side": "left",
240
+ "split_special_tokens": false,
241
+ "stride": 0,
242
+ "tokenizer_class": "Qwen2Tokenizer",
243
+ "truncation_side": "right",
244
+ "truncation_strategy": "longest_first",
245
+ "unk_token": null
246
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff