update-readme
#30
by asherszhang
- opened
- README.md +40 -18
- README_CN.md +70 -3
README.md
CHANGED
@@ -117,11 +117,17 @@ model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
]
-
-
-
-
-
+
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    enable_thinking=True
+)
+
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+model_inputs.pop("token_type_ids", None)
+outputs = model.generate(**model_inputs, max_new_tokens=4096)
+

output_text = tokenizer.decode(outputs[0])
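A small, optional follow-up to the snippet added above (an assumption on our side, not something this diff adds): for causal LMs, `model.generate` returns the prompt tokens followed by the new tokens, so the completion can be isolated before decoding.

```python
# Sketch: decode only the newly generated tokens.
# Assumes `model_inputs`, `outputs`, and `tokenizer` from the snippet above.
prompt_len = model_inputs["input_ids"].shape[-1]
response_only = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
print(response_only)
```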
@@ -148,13 +154,12 @@ This model supports two modes of operation:

To disable the reasoning process, set `enable_thinking=False` in the apply_chat_template call:
```
-
-
-
-
-
-
-)
+
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    enable_thinking=False
+)
```

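The README_CN.md part of this PR also documents a prompt-prefix switch: prepending `/no_think` (or `/think`) to the user message toggles reasoning without changing the `apply_chat_template` arguments. A minimal sketch of that variant, assuming the published `tencent/Hunyuan-A13B-Instruct` checkpoint (use your local path if different):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tencent/Hunyuan-A13B-Instruct", trust_remote_code=True)

# Per the README_CN.md text in this PR, "/no_think" asks the model to skip
# CoT reasoning and "/think" forces it on.
messages = [
    {"role": "user", "content": "/no_think Write a short summary of the benefits of regular exercise"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```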
@@ -172,13 +177,30 @@ image: https://hub.docker.com/r/hunyuaninfer/hunyuan-a13b/tags

We provide a pre-built Docker image based on the latest version of TensorRT-LLM.

-- To
-
-https://hub.docker.com/r/hunyuaninfer/hunyuan-large/tags
+- To get started, download the Docker image:

+**From Docker Hub:**
```
docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
```
+
+**From the China mirror (thanks to [CNB](https://cnb.cool/ "CNB.cool")):**
+
+First, pull the image from CNB:
+```
+docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-trtllm
+```
+
+Then, re-tag the image so it matches the name used in the scripts below:
+```
+docker tag docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-trtllm hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
+```
+
+- Start the container:
+
```
docker run --name hunyuanLLM_infer --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
```
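If you prefer to script the pull and re-tag steps from Python, the third-party `docker` SDK can mirror the CLI commands above; this is purely illustrative, and the documented `docker run` command remains the way to start the container.

```python
import docker  # third-party package: pip install docker

client = docker.from_env()

# Pull from the CNB mirror, then re-tag to the name used by the run scripts.
image = client.images.pull("docker.cnb.cool/tencent/hunyuan/hunyuan-a13b",
                            tag="hunyuan-moe-A13B-trtllm")
image.tag("hunyuaninfer/hunyuan-a13b", tag="hunyuan-moe-A13B-trtllm")
print(image.tags)
```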
@@ -287,10 +309,10 @@ docker run --rm --ipc=host \
```

### Source Code
-Support for this model has been added via this [PR 20114](https://github.com/vllm-project/vllm/pull/20114 ) in the vLLM project
-
-You can build and run vLLM from source after merging this pull request into your local repository.
+Support for this model has been added via [PR 20114](https://github.com/vllm-project/vllm/pull/20114) in the vLLM project; the patch was merged by the community on Jul 1, 2025.

+You can build and run vLLM from source using any commit after `ecad85`.

### Model Context Length Support

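For orientation, once you are on a vLLM build that includes PR 20114 (i.e., any commit after `ecad85`), offline inference through vLLM's standard Python API should look roughly like this sketch; the model id and sampling settings are illustrative, not taken from the diff.

```python
from vllm import LLM, SamplingParams

# Assumes a vLLM build that already contains PR 20114 (commit `ecad85` or later).
llm = LLM(model="tencent/Hunyuan-A13B-Instruct", trust_remote_code=True)

sampling = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Write a short summary of the benefits of regular exercise"], sampling)
print(outputs[0].outputs[0].text)
```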
README_CN.md
CHANGED
@@ -89,6 +89,75 @@ Hunyuan-A13B采用了细粒度混合专家(Fine-grained Mixture of Experts,F
| **NLU** | ComplexNLU<br>Word-Task | 64.7<br>67.1 | 64.5<br>76.3 | 59.8<br>56.4 | 61.2<br>62.9 |
| **Agent** | BDCL v3<br> τ-Bench<br>ComplexFuncBench<br> $C^3$-Bench | 67.8<br>60.4<br>47.6<br>58.8 | 56.9<br>43.8<br>41.1<br>55.3 | 70.8<br>44.6<br>40.6<br>51.7 | 78.3<br>54.7<br>61.2<br>63.5 |

+## Inference with transformers
+
+By default, the model uses "slow thinking" (reasoning mode). There are two ways to turn off CoT (Chain-of-Thought) reasoning:
+1. Pass `enable_thinking=False` when calling `apply_chat_template`.
+2. Prefix the prompt with `/no_think` to force the model to skip CoT reasoning; similarly, prefixing the prompt with `/think` forces CoT reasoning on.
+
+The following snippet shows how to load and use the model with the `transformers` library, how to enable and disable reasoning mode, and how to parse the reasoning process and the final output.
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import os
+import re
+
+model_name_or_path = os.environ['MODEL_PATH']
+# model_name_or_path = "tencent/Hunyuan-A13B-Instruct"
+
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)  # You may want to use bfloat16 and/or move to GPU here
+messages = [
+    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
+]
+
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    enable_thinking=True
+)
+
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+model_inputs.pop("token_type_ids", None)
+outputs = model.generate(**model_inputs, max_new_tokens=4096)
+
+output_text = tokenizer.decode(outputs[0])
+
+think_pattern = r'<think>(.*?)</think>'
+think_matches = re.findall(think_pattern, output_text, re.DOTALL)
+
+answer_pattern = r'<answer>(.*?)</answer>'
+answer_matches = re.findall(answer_pattern, output_text, re.DOTALL)
+
+think_content = [match.strip() for match in think_matches][0]
+answer_content = [match.strip() for match in answer_matches][0]
+print(f"thinking_content:{think_content}\n\n")
+print(f"answer_content:{answer_content}\n\n")
+```
+
+### Switching Between Fast and Slow Thinking
+
+This model supports two modes of operation:
+
+- **Slow-thinking mode (default)**: performs detailed internal reasoning steps before producing the final answer.
+- **Fast-thinking mode**: skips the internal reasoning process and outputs the final answer directly, for faster inference.
+
+**How to switch to fast-thinking mode:**
+
+To disable the reasoning process, set `enable_thinking=False` when calling `apply_chat_template`:
+
+```python
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    enable_thinking=False  # use fast-thinking mode
+)
+```

## Inference and Deployment

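To see what the two modes change at the prompt level, you can render the chat template both ways and compare the strings; a quick check, assuming the tokenizer's chat template honors `enable_thinking` as in the snippet above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tencent/Hunyuan-A13B-Instruct", trust_remote_code=True)
messages = [{"role": "user", "content": "Write a short summary of the benefits of regular exercise"}]

# Render the same conversation with reasoning on and off and inspect the difference.
slow = tokenizer.apply_chat_template(messages, tokenize=False, enable_thinking=True)
fast = tokenizer.apply_chat_template(messages, tokenize=False, enable_thinking=False)
print("--- slow-thinking prompt ---\n", slow)
print("--- fast-thinking prompt ---\n", fast)
```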
@@ -246,9 +315,7 @@ docker run --rm --ipc=host \

### Source Code Deployment

-Support for this model was submitted to vLLM via [PR 20114](https://github.com/vllm-project/vllm/pull/20114)
-
-You can build and run vLLM from source after merging this PR into your local repository.
+Support for this model was submitted to the vLLM project via [PR 20114](https://github.com/vllm-project/vllm/pull/20114) and has since been merged; you can build vLLM from source using any version after git commit `ecad85`.


### Model Context Length Support