rulixiang committed
Commit bfc9e66 · 1 Parent(s): ece9e06

update README.md

Files changed (1):
  1. README.md +17 -62
README.md CHANGED
@@ -170,50 +170,6 @@ completion = client.chat.completions.create(
  print(completion.choices[0].message.content)
  ```
 
- ### 🤗 Hugging Face Transformers
-
- Here is a code snippet to show you how to use the chat model with `transformers`:
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- model_name = "inclusionAI/Ling-1T"
-
- model = AutoModelForCausalLM.from_pretrained(
-     model_name,
-     dtype="auto",
-     device_map="auto",
-     trust_remote_code=True,
- )
- tokenizer = AutoTokenizer.from_pretrained(model_name)
-
- prompt = "Give me a short introduction to large language models."
- messages = [
-     {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
-     {"role": "user", "content": prompt}
- ]
- text = tokenizer.apply_chat_template(
-     messages,
-     tokenize=False,
-     add_generation_prompt=True
- )
- model_inputs = tokenizer([text], return_tensors="pt", return_token_type_ids=False).to(model.device)
-
- generated_ids = model.generate(
-     **model_inputs,
-     max_new_tokens=512
- )
- generated_ids = [
-     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
- ]
-
- response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
- ```
-
- ### 🤖 ModelScope
-
- If you're in mainland China, we strongly recommend you to use our model from 🤖 <a href="https://modelscope.cn/models/inclusionAI/Ling-1T">ModelScope</a>.
-
  ## Deployment
 
  ### vLLM
@@ -280,46 +236,45 @@ To handle long context in vLLM using YaRN, we need to follow these two steps:
 
  For detailed guidance, please refer to the vLLM [`instructions`](https://docs.vllm.ai/en/latest/).
 
-
  ### SGLang
 
  #### Environment Preparation
 
  We will submit our model to the official SGLang release later; for now, you can prepare the environment with the following steps:
  ```shell
- pip3 install sglang==0.5.2rc0 sgl-kernel==0.3.7.post1
- ```
- You can use docker image as well:
- ```shell
- docker pull lmsysorg/sglang:v0.5.2rc0-cu126
- ```
- Then you should apply patch to sglang installation:
- ```bash
- # patch command is needed, run `yum install -y patch` if needed
- patch -d `python -c 'import sglang;import os; print(os.path.dirname(sglang.__file__))'` -p3 < inference/sglang/bailing_moe_v2.patch
+ pip3 install -U sglang sgl-kernel
  ```
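A quick import check (our addition, not part of the original instructions) can confirm the installation before you launch the server:
```shell
# Sanity check, not from the original README: confirm SGLang imports cleanly
# and print the installed version.
python -c "import sglang; print(sglang.__version__)"
```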
 
  #### Run Inference
 
- BF16 and FP8 models are supported by SGLang now, it depends on the dtype of the model in ${MODEL_PATH}. They both share the same command in the following:
+ Both BF16 and FP8 models are supported by SGLang now, depending on the dtype of the model in ${MODEL_PATH}.
+
+ Here is an example of running Ling-1T on multiple nodes, where the master node IP is ${MASTER_IP} and the port is ${PORT}:
 
  - Start server:
  ```bash
- python -m sglang.launch_server \
-     --model-path $MODEL_PATH \
-     --host 0.0.0.0 --port $PORT \
-     --trust-remote-code \
-     --attention-backend fa3
+ # Node 0:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank 0
+
+ # Node 1:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank 1
+
+ # Node 2:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank 2
+
+ # Node 3:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank 3
 
  # This is only an example; please adjust the arguments according to your actual environment.
  ```
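Typing four near-identical commands is easy to get wrong; the loop below is a hypothetical convenience sketch, not from the original README. It assumes passwordless `ssh`, that `MODEL_PATH`, `MASTER_IP`, and `PORT` are set on the machine you run it from, and that `HOSTS` lists the four machines in rank order:
```bash
# Hypothetical helper (our addition): start one rank per host over ssh.
HOSTS=(node0 node1 node2 node3)   # placeholder hostnames, in rank order
for i in 0 1 2 3; do
  ssh "${HOSTS[$i]}" "python -m sglang.launch_server --model-path $MODEL_PATH \
    --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code \
    --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank $i" &
done
wait   # keep the script alive while the servers run
```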
 
+
  MTP is supported for the base model, but not yet for the chat model. You can add the parameter `--speculative-algorithm NEXTN`
  to the start command.
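As a sketch of what that looks like (our addition; it simply appends the flag named above to the node-0 command, and every other rank would get the same flag):
```bash
# Sketch, not from the original README: node-0 launch with MTP speculative
# decoding enabled via --speculative-algorithm NEXTN (base model only).
python -m sglang.launch_server --model-path $MODEL_PATH \
  --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code \
  --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank 0 \
  --speculative-algorithm NEXTN
```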
 
  - Client:
 
  ```shell
- curl -s http://localhost:${PORT}/v1/chat/completions \
+ curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
  ```
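Since the endpoint is OpenAI-compatible, the reply text lives at `choices[0].message.content` in the response JSON (the same field the Python client reads earlier in this README). Piping through `jq`, if you have it installed, pulls it out directly (our addition):
```bash
# Our addition: same request as above, with jq extracting only the reply text.
curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}' \
  | jq -r '.choices[0].message.content'
```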
 