---
license: mit
pipeline_tag: text-generation
library_name: transformers
---

<p align="center">
    <img src="https://mdn.alipayobjects.com/huamei_qa8qxu/afts/img/A*4QxcQrBlTiAAAAAAQXAAAAgAemJ7AQ/original" width="100"/>
</p>

<p align="center">🤗 <a href="https://huggingface.co/inclusionAI">Hugging Face</a>&nbsp;&nbsp;|&nbsp;&nbsp;🤖 <a href="https://modelscope.cn/organization/inclusionAI">ModelScope</a>&nbsp;&nbsp;|&nbsp;&nbsp;🐙 <a href="https://zenmux.ai/inclusionai/ling-1t?utm_source=hf_inclusionAI">Experience Now</a></p>

## Introduction

Ling-1T is the flagship model of the inclusionAI Ling series: a trillion-parameter Mixture-of-Experts (MoE) language model for text generation. It supports a 32K context window natively, extensible to 128K via YaRN, and is released under the MIT license.

## Model Downloads

You can download the Ling-1T model from the table below. If you are located in mainland China, the model is also available on ModelScope.cn for faster downloads.

<center>

| **Model** | **Context Length** | **Download** |
|:---------:|:------------------:|:------------:|
| Ling-1T | 32K -> 128K (YaRN) | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ling-1T) &nbsp;&nbsp; [🤖 ModelScope](https://www.modelscope.cn/models/inclusionAI/Ling-1T) |

</center>

Note: If you are interested in previous versions, please visit the past model collections on [Hugging Face](https://huggingface.co/inclusionAI) or [ModelScope](https://modelscope.cn/organization/inclusionAI).
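
If you prefer the command line, the standard Hugging Face CLI can fetch the checkpoint as well (a minimal sketch; the `--local-dir` path is an arbitrary choice):

```bash
pip install -U "huggingface_hub[cli]"
# Download the full checkpoint to a local directory
huggingface-cli download inclusionAI/Ling-1T --local-dir ./Ling-1T
```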

## Quickstart

### 🚀 Try Online

You can experience Ling-1T online at [ZenMux](https://zenmux.ai/inclusionai/ling-1t?utm_source=hf_inclusionAI).

### 🔌 API Usage

You can also call Ling-1T through its OpenAI-compatible API:

```python
from openai import OpenAI

# 1. Initialize the OpenAI client
client = OpenAI(
    # 2. Point the base URL to the ZenMux endpoint
    base_url="https://zenmux.ai/api/v1",
    # 3. Replace with the API key from your ZenMux user console
    api_key="<your ZENMUX_API_KEY>",
)

# 4. Make a request
completion = client.chat.completions.create(
    # 5. Specify the model to use in the format "provider/model-name"
    model="inclusionai/ling-1t",
    messages=[
        {
            "role": "user",
            "content": "What is the meaning of life?"
        }
    ]
)

print(completion.choices[0].message.content)
```

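Since the endpoint follows the OpenAI API, streaming should work the same way as with any OpenAI-compatible service (a minimal sketch, reusing the client from above):

```python
# Request a streamed response and print tokens as they arrive
stream = client.chat.completions.create(
    model="inclusionai/ling-1t",
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
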
### 🤗 Hugging Face Transformers

Here is a code snippet showing how to use the chat model with `transformers`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ling-1T"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt", return_token_type_ids=False).to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

### 🤖 ModelScope

If you are in mainland China, we strongly recommend using our model from 🤖 <a href="https://modelscope.cn/organization/inclusionAI">ModelScope</a>.

## Deployment

### vLLM

vLLM supports both offline batched inference and serving an OpenAI-compatible API for online inference.

#### Environment Preparation

```bash
pip install vllm==0.11.0
```

#### Offline Inference

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ling-1T")

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=16384)

llm = LLM(model="inclusionAI/Ling-1T", dtype='bfloat16', trust_remote_code=True)
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
outputs = llm.generate([text], sampling_params)
print(outputs[0].outputs[0].text)
```

#### Online Inference

```bash
vllm serve inclusionAI/Ling-1T \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 1 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90

# This is only an example; adjust the model sharding strategy to your actual environment.
```

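Once the server is up, you can query it through the OpenAI-compatible endpoint (a sketch, assuming vLLM's default port 8000):

```bash
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "inclusionAI/Ling-1T", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
```
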
To handle long context in vLLM with YaRN, follow these two steps:
1. Add a `rope_scaling` field to the model's `config.json` file, for example:
```json
{
    ...,
    "rope_scaling": {
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
        "type": "yarn"
    }
}
```
2. Pass the additional `--max-model-len` parameter to specify the desired maximum context length when starting the vLLM service, as in the sketch below.

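For example, with the scaling factor of 4.0 above (4 × 32768 = 131072 tokens), the launch command might look like this (a sketch reusing the flags from the earlier example):

```bash
vllm serve inclusionAI/Ling-1T \
    --tensor-parallel-size 2 \
    --trust-remote-code \
    --max-model-len 131072
```
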
For detailed guidance, please refer to the vLLM [documentation](https://docs.vllm.ai/en/latest/).

### SGLang

#### Environment Preparation

We will submit our model to the official SGLang release later; for now, prepare the environment with the following steps:
```shell
pip3 install sglang==0.5.2rc0 sgl-kernel==0.3.7.post1
```
You can also use the Docker image:
```shell
docker pull lmsysorg/sglang:v0.5.2rc0-cu126
```
Then apply our patch to the SGLang installation:
```bash
# The `patch` utility is required; install it with `yum install -y patch` if missing
patch -d `python -c 'import sglang;import os; print(os.path.dirname(sglang.__file__))'` -p3 < inference/sglang/bailing_moe_v2.patch
```

#### Run Inference

SGLang now supports both BF16 and FP8 models; which one runs depends on the dtype of the checkpoint in `${MODEL_PATH}`. Both use the same launch command:

- Start server:
```bash
python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 0.0.0.0 --port $PORT \
    --trust-remote-code \
    --attention-backend fa3

# This is only an example; adjust the model sharding strategy to your actual environment.
```
MTP is supported for the base model, but not yet for the chat model. To enable it, add the `--speculative-algorithm NEXTN` parameter to the launch command, as in the sketch below.

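A launch command with MTP enabled might look like this (a sketch for the base model, reusing the flags above):

```bash
python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 0.0.0.0 --port $PORT \
    --trust-remote-code \
    --attention-backend fa3 \
    --speculative-algorithm NEXTN
```
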
- Client:

```shell
curl -s http://localhost:${PORT}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
```

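Since the server exposes an OpenAI-compatible endpoint, the `openai` Python client from the Quickstart works against it as well (a minimal sketch, assuming the server runs locally on `$PORT`):

```python
import os

from openai import OpenAI

# Point the client at the local SGLang server; the API key is not checked
client = OpenAI(base_url=f"http://localhost:{os.environ['PORT']}/v1", api_key="none")

completion = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(completion.choices[0].message.content)
```
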
More usage examples can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).

## License

This code repository is licensed under [the MIT License](https://github.com/inclusionAI/Ling-V2/blob/main/LICENSE).