zhanghanxiao committed on
Commit d5eafa6 · verified · 1 Parent(s): bfc9e66

Update README.md

Files changed (1)
  1. README.md +44 -74
README.md CHANGED
@@ -172,50 +172,65 @@ print(completion.choices[0].message.content)
 
  ## Deployment
 
- ### vLLM
-
- vLLM supports offline batched inference or launching an OpenAI-Compatible API Service for online inference.
 
  #### Environment Preparation
 
- ```bash
- pip install vllm==0.11.0
  ```
 
- #### Offline Inference:
 
- ```python
- from transformers import AutoTokenizer
- from vllm import LLM, SamplingParams
 
- tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ling-1T")
 
- sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=16384)
 
- llm = LLM(model="inclusionAI/Ling-1T", dtype='bfloat16', trust_remote_code=True)
- prompt = "Give me a short introduction to large language models."
- messages = [
-     {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
-     {"role": "user", "content": prompt}
- ]
 
- text = tokenizer.apply_chat_template(
-     messages,
-     tokenize=False,
-     add_generation_prompt=True
- )
- outputs = llm.generate([text], sampling_params)
  ```
 
- #### Online Inference:
 
  ```bash
- vllm serve inclusionAI/Ling-1T \
-     --tensor-parallel-size 32 \
-     --pipeline-parallel-size 1 \
-     --trust-remote-code \
-     --gpu-memory-utilization 0.90
 
  # This is only an example, please adjust arguments according to your actual environment.
  ```
@@ -236,51 +251,6 @@ To handle long context in vLLM using YaRN, we need to follow these two steps:
 
  For detailed guidance, please refer to the vLLM [`instructions`](https://docs.vllm.ai/en/latest/).
 
- ### SGLang
-
- #### Environment Preparation
-
- We will later submit our model to SGLang official release, now we can prepare the environment following steps:
- ```shell
- pip3 install -U sglang sgl-kernel
- ```
-
- #### Run Inference
-
- BF16 and FP8 models are supported by SGLang now, it depends on the dtype of the model in ${MODEL_PATH}.
-
- Here is the example to run Ling-1T with multiple nodes, with master node IP is ${MASTER_IP} and port is ${PORT} :
-
- - Start server:
- ```bash
- # Node 0:
- python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank 0
-
- # Node 1:
- python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank 1
-
- # Node 2:
- python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank 2
-
- # Node 3:
- python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:$PORT --nnodes 4 --node-rank 3
-
- # This is only an example, please adjust arguments according to your actual environment.
- ```
-
- MTP is supported for base model, and not yet for chat model. You can add parameter `--speculative-algorithm NEXTN`
- to start command.
-
- - Client:
-
- ```shell
- curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
-   -H "Content-Type: application/json" \
-   -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
- ```
-
- More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html)
-
 
 
  ## Limitations & Future Plans
 
 
  ## Deployment
 
+ ### SGLang
 
  #### Environment Preparation
 
+ We will submit our model to the official SGLang release later. For now, prepare the environment with the following steps:
+ ```shell
+ pip3 install -U sglang sgl-kernel
  ```
 
+ #### Run Inference
+
+ SGLang now supports both BF16 and FP8 models; which one is used depends on the dtype of the model in ${MODEL_PATH}.
 
+ Here is an example of running Ling-1T on multiple GPU nodes, where the master node IP is ${MASTER_IP} and the server port is ${PORT}:
 
+ - Start server:
+ ```bash
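+ # Note: $MASTER_IP:2345 is the distributed init address shared by all four nodes, while --port $PORT is the HTTP serving port that the client below connects to.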
+ # Node 0:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 0
+
+ # Node 1:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 1
+
+ # Node 2:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 2
+
+ # Node 3:
+ python -m sglang.launch_server --model-path $MODEL_PATH --tp-size 8 --pp-size 4 --dp-size 1 --trust-remote-code --dist-init-addr $MASTER_IP:2345 --port $PORT --nnodes 4 --node-rank 3
+
+ # This is only an example. Please adjust arguments according to your actual environment.
+ ```
 
+ - Client:
+
+ ```shell
+ curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
+ ```
+
+ More usage examples can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).
+
+ ### vLLM
+
+ #### Environment Preparation
+
+ ```bash
+ pip install vllm==0.11.0
  ```
 
+ #### Run Inference
+
+ Here is an example of deploying the model on multiple GPU nodes, where the master node IP is ${MASTER_IP}, the server port is ${PORT}, and the model path is ${MODEL_PATH}:
 
  ```bash
+ # Step 1. Start Ray on all nodes:
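+ # (A sketch assuming Ray's default port 6379; adjust the commands to your cluster setup.)
+ # On node 0 (head node):
+ ray start --head --port=6379
+ # On nodes 1-3 (worker nodes):
+ ray start --address=$MASTER_IP:6379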
+
+ # Step 2. Start the vLLM server on node 0 only:
+ vllm serve $MODEL_PATH --port $PORT --served-model-name my_model --trust-remote-code --tensor-parallel-size 8 --pipeline-parallel-size 4 --gpu-memory-utilization 0.85
 
  # This is only an example, please adjust arguments according to your actual environment.
  ```
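+ Once the server is up, it exposes vLLM's OpenAI-compatible API. Below is a minimal client sketch using the `openai` Python package; the served model name `my_model` matches the `--served-model-name` flag above, and the address variables are assumptions you should adapt to your environment.
+
+ ```python
+ import os
+ from openai import OpenAI
+
+ # Address of the vLLM server started on node 0 (set MASTER_IP and PORT to match your deployment).
+ master_ip = os.environ.get("MASTER_IP", "127.0.0.1")
+ port = os.environ.get("PORT", "8000")
+
+ client = OpenAI(base_url=f"http://{master_ip}:{port}/v1", api_key="EMPTY")
+
+ completion = client.chat.completions.create(
+     model="my_model",  # must match --served-model-name
+     messages=[
+         {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
+         {"role": "user", "content": "Give me a short introduction to large language models."},
+     ],
+ )
+ print(completion.choices[0].message.content)
+ ```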
 
  For detailed guidance, please refer to the vLLM [`instructions`](https://docs.vllm.ai/en/latest/).
 
 
  ## Limitations & Future Plans