ehartford committed
Commit 6ab2bc3 · verified · 1 Parent(s): a93f408

Update README.md

Files changed (1)
  1. README.md +131 -6
README.md CHANGED
@@ -161,18 +161,143 @@ down_proj: [5120, 25600] → [8192, 29568]
 
 ## Usage
 
-This is an intermediate checkpoint. To use the complete 72B model:
 
+### Basic Usage with Thinking Mode
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
-# Load the complete model instead
+model_name = "cognitivecomputations/Qwen3-58B-Embiggened"
+
+# Load the tokenizer and the model
+tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForCausalLM.from_pretrained(
-    "Qwen3-72B-Embiggened",
-    torch_dtype=torch.bfloat16,
-    device_map="auto",
-    trust_remote_code=True
+    model_name,
+    torch_dtype="auto",
+    device_map="auto"
+)
+
+# Prepare the model input
+prompt = "How many r's are in strawberry?"
+messages = [
+    {"role": "user", "content": prompt}
+]
+
+# Apply chat template with thinking mode enabled
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+    enable_thinking=True  # Enable thinking mode (default)
+)
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+# Generate response
+generated_ids = model.generate(
+    **model_inputs,
+    max_new_tokens=32768,
+    temperature=0.6,  # Recommended for thinking mode
+    top_p=0.95,
+    top_k=20,
+    min_p=0
 )
+
+# Parse thinking content and final response
+output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
+
+try:
+    # Find </think> token (151668)
+    index = len(output_ids) - output_ids[::-1].index(151668)
+except ValueError:
+    index = 0
+
+thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
+content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
+
+print("Thinking content:", thinking_content)
+print("Final answer:", content)
+```
+
+### Non-Thinking Mode (Efficient General Dialogue)
+```python
+# Same setup as above...
+
+# Apply chat template with thinking mode disabled
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+    enable_thinking=False  # Disable thinking for efficiency
+)
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+# Generate with non-thinking parameters
+outputs = model.generate(
+    **model_inputs,
+    max_new_tokens=2048,
+    temperature=0.7,  # Recommended for non-thinking mode
+    top_p=0.8,
+    top_k=20,
+    min_p=0
+)
+```
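+
+To read the reply, decode only the newly generated tokens; in this mode there is no thinking segment to strip out. A minimal sketch, assuming the `outputs` and `model_inputs` variables from the block above:
+
+```python
+# Decode only the newly generated tokens, skipping special tokens
+response = tokenizer.decode(
+    outputs[0][len(model_inputs.input_ids[0]):],
+    skip_special_tokens=True
+)
+print(response)
+```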
+
+### Advanced: Dynamic Mode Switching
+```python
+# Use /think and /no_think tags to control behavior
+messages = [
+    {"role": "user", "content": "Explain quantum computing /no_think"},  # Quick response
+    {"role": "assistant", "content": "Quantum computing uses quantum bits..."},
+    {"role": "user", "content": "How does superposition work mathematically? /think"}  # Detailed reasoning
+]
+```
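+
+These tagged messages go through the same chat-template path as the examples above. A minimal sketch, assuming the `tokenizer` and `model` from the basic example (the generation settings are illustrative):
+
+```python
+# Render the multi-turn conversation; the /think and /no_think tags are part
+# of the message text and steer the model's behavior turn by turn
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+outputs = model.generate(**model_inputs, max_new_tokens=4096)
+print(tokenizer.decode(outputs[0][len(model_inputs.input_ids[0]):], skip_special_tokens=True))
+```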
253
+
254
+ ### vLLM Deployment with Reasoning Support
255
+ ```python
256
+ # Start server with reasoning parser
257
+ # vllm serve cognitivecomputations/Qwen3-58B-Embiggened --enable-reasoning --reasoning-parser deepseek_r1
258
+
259
+ from openai import OpenAI
260
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
261
+
262
+ # Use with thinking mode
263
+ response = client.chat.completions.create(
264
+ model="cognitivecomputations/Qwen3-58B-Embiggened",
265
+ messages=[{"role": "user", "content": "Solve: What is 15% of 250?"}],
266
+ extra_body={"enable_thinking": True}
267
+ )
268
+ ```
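+
+The response is read like any other OpenAI-style chat completion. A minimal sketch; the `reasoning_content` attribute is an assumption about what the reasoning parser attaches and may be absent:
+
+```python
+# Final answer text
+print(response.choices[0].message.content)
+
+# Reasoning trace, when the reasoning parser attaches one (may be absent)
+reasoning = getattr(response.choices[0].message, "reasoning_content", None)
+if reasoning:
+    print("Thinking:", reasoning)
+```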
269
+
270
+ ### Advanced Usage with Quantization
271
+ ```python
272
+ from transformers import BitsAndBytesConfig
273
+
274
+ # 4-bit quantization for reduced memory usage
275
+ bnb_config = BitsAndBytesConfig(
276
+ load_in_4bit=True,
277
+ bnb_4bit_compute_dtype=torch.bfloat16,
278
+ bnb_4bit_use_double_quant=True,
279
+ )
280
+
281
+ model = AutoModelForCausalLM.from_pretrained(
282
+ "cognitivecomputations/Qwen3-58B-Embiggened",
283
+ quantization_config=bnb_config,
284
+ device_map="auto"
285
+ )
286
+ ```
287
+
288
+ ### Example Outputs with Thinking
289
+
290
+ ```
291
+ Prompt: "How many r's are in strawberry?"
292
+ Thinking: Let me count the r's in "strawberry". S-t-r-a-w-b-e-r-r-y.
293
+ Going through each letter: s(no), t(no), r(yes, 1), a(no), w(no),
294
+ b(no), e(no), r(yes, 2), r(yes, 3), y(no).
295
+ Final answer: There are 3 r's in the word "strawberry".
296
+
297
+ Prompt: "What is the capital of France, and what is it famous for?"
298
+ Final answer (no thinking): Paris is the capital of France. It's famous for
299
+ the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and its rich
300
+ cultural heritage, fashion, and cuisine.
301
  ```
 
 ## Hardware Requirements