ehartford committed
Commit 6ab2bc3 · verified · 1 Parent(s): a93f408

Update README.md

Files changed (1)
  1. README.md +131 -6
README.md CHANGED
@@ -161,18 +161,143 @@ down_proj: [5120, 25600] → [8192, 29568]
 
 ## Usage
 
-This is an intermediate checkpoint. To use the complete 72B model:
 
+### Basic Usage with Thinking Mode
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
-# Load the complete model instead
+model_name = "cognitivecomputations/Qwen3-58B-Embiggened"
+
+# Load the tokenizer and the model
+tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForCausalLM.from_pretrained(
-    "Qwen3-72B-Embiggened",
-    torch_dtype=torch.bfloat16,
-    device_map="auto",
-    trust_remote_code=True
+    model_name,
+    torch_dtype="auto",
+    device_map="auto"
+)
+
+# Prepare the model input
+prompt = "How many r's are in strawberry?"
+messages = [
+    {"role": "user", "content": prompt}
+]
+
+# Apply chat template with thinking mode enabled
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+    enable_thinking=True  # Enable thinking mode (default)
+)
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+# Generate response
+generated_ids = model.generate(
+    **model_inputs,
+    max_new_tokens=32768,
+    temperature=0.6,  # Recommended for thinking mode
+    top_p=0.95,
+    top_k=20,
+    min_p=0
 )
+
+# Parse thinking content and final response
+output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
+
+try:
+    # Find </think> token (151668)
+    index = len(output_ids) - output_ids[::-1].index(151668)
+except ValueError:
+    index = 0
+
+thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
+content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
+
+print("Thinking content:", thinking_content)
+print("Final answer:", content)
+```
+
+### Non-Thinking Mode (Efficient General Dialogue)
+```python
+# Same setup as above...
+
+# Apply chat template with thinking mode disabled
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+    enable_thinking=False  # Disable thinking for efficiency
+)
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+# Generate with non-thinking parameters
+outputs = model.generate(
+    **model_inputs,
+    max_new_tokens=2048,
+    temperature=0.7,  # Recommended for non-thinking mode
+    top_p=0.8,
+    top_k=20,
+    min_p=0
+)
+```
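+
+To read the reply, decode only the newly generated tokens; in this mode there is no thinking segment to strip out. A minimal sketch, assuming the `outputs` and `model_inputs` variables from the block above:
+
+```python
+# Decode only the newly generated tokens, skipping special tokens
+response = tokenizer.decode(
+    outputs[0][len(model_inputs.input_ids[0]):],
+    skip_special_tokens=True
+)
+print(response)
+```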
+
+### Advanced: Dynamic Mode Switching
+```python
+# Use /think and /no_think tags to control behavior
+messages = [
+    {"role": "user", "content": "Explain quantum computing /no_think"},  # Quick response
+    {"role": "assistant", "content": "Quantum computing uses quantum bits..."},
+    {"role": "user", "content": "How does superposition work mathematically? /think"}  # Detailed reasoning
+]
+```
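+
+These tagged messages go through the same chat-template path as the examples above. A minimal sketch, assuming the `tokenizer` and `model` from the basic example (the generation settings are illustrative):
+
+```python
+# Render the multi-turn conversation; the /think and /no_think tags are part
+# of the message text and steer the model's behavior turn by turn
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+outputs = model.generate(**model_inputs, max_new_tokens=4096)
+print(tokenizer.decode(outputs[0][len(model_inputs.input_ids[0]):], skip_special_tokens=True))
+```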
253
+
254
+ ### vLLM Deployment with Reasoning Support
255
+ ```python
256
+ # Start server with reasoning parser
257
+ # vllm serve cognitivecomputations/Qwen3-58B-Embiggened --enable-reasoning --reasoning-parser deepseek_r1
258
+
259
+ from openai import OpenAI
260
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
261
+
262
+ # Use with thinking mode
263
+ response = client.chat.completions.create(
264
+ model="cognitivecomputations/Qwen3-58B-Embiggened",
265
+ messages=[{"role": "user", "content": "Solve: What is 15% of 250?"}],
266
+ extra_body={"enable_thinking": True}
267
+ )
268
+ ```
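+
+The response is read like any other OpenAI-style chat completion. A minimal sketch; the `reasoning_content` attribute is an assumption about what the reasoning parser attaches and may be absent:
+
+```python
+# Final answer text
+print(response.choices[0].message.content)
+
+# Reasoning trace, when the reasoning parser attaches one (may be absent)
+reasoning = getattr(response.choices[0].message, "reasoning_content", None)
+if reasoning:
+    print("Thinking:", reasoning)
+```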
269
+
270
+ ### Advanced Usage with Quantization
271
+ ```python
272
+ from transformers import BitsAndBytesConfig
273
+
274
+ # 4-bit quantization for reduced memory usage
275
+ bnb_config = BitsAndBytesConfig(
276
+ load_in_4bit=True,
277
+ bnb_4bit_compute_dtype=torch.bfloat16,
278
+ bnb_4bit_use_double_quant=True,
279
+ )
280
+
281
+ model = AutoModelForCausalLM.from_pretrained(
282
+ "cognitivecomputations/Qwen3-58B-Embiggened",
283
+ quantization_config=bnb_config,
284
+ device_map="auto"
285
+ )
286
+ ```
287
+
288
+ ### Example Outputs with Thinking
289
+
290
+ ```
291
+ Prompt: "How many r's are in strawberry?"
292
+ Thinking: Let me count the r's in "strawberry". S-t-r-a-w-b-e-r-r-y.
293
+ Going through each letter: s(no), t(no), r(yes, 1), a(no), w(no),
294
+ b(no), e(no), r(yes, 2), r(yes, 3), y(no).
295
+ Final answer: There are 3 r's in the word "strawberry".
296
+
297
+ Prompt: "What is the capital of France, and what is it famous for?"
298
+ Final answer (no thinking): Paris is the capital of France. It's famous for
299
+ the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and its rich
300
+ cultural heritage, fashion, and cuisine.
301
  ```
 
 ## Hardware Requirements