Commit 966e63c (verified) · ehartford · 1 Parent(s): 14ba292

Update README.md

Files changed (1):
  1. README.md +108 -29

README.md CHANGED
@@ -180,55 +180,134 @@ Output: "deoxyribonucleic acid, and it is the hereditary material in all living

  ## Usage

- ### Basic Usage
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

- # Load model
  model = AutoModelForCausalLM.from_pretrained(
-     "Qwen3-72B-Embiggened",
-     torch_dtype=torch.bfloat16,
-     device_map="auto",
-     trust_remote_code=True
  )
- tokenizer = AutoTokenizer.from_pretrained("Qwen3-72B-Embiggened")

- # Generate text
- inputs = tokenizer("The meaning of life is", return_tensors="pt")
- outputs = model.generate(**inputs, max_new_tokens=50, temperature=0.7)
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

- ### Advanced Usage with Quantization
  ```python
- from transformers import BitsAndBytesConfig

- # 4-bit quantization for reduced memory usage
- bnb_config = BitsAndBytesConfig(
-     load_in_4bit=True,
-     bnb_4bit_compute_dtype=torch.bfloat16,
-     bnb_4bit_use_double_quant=True,
  )

- model = AutoModelForCausalLM.from_pretrained(
-     "Qwen3-72B-Embiggened",
-     quantization_config=bnb_config,
-     device_map="auto",
-     trust_remote_code=True
  )
  ```

- ### vLLM Deployment
  ```python
- from vllm import LLM, SamplingParams

- llm = LLM(model="Qwen3-72B-Embiggened", tensor_parallel_size=4)
- sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)

- prompts = ["Tell me about quantum computing", "Write a poem about AI"]
- outputs = llm.generate(prompts, sampling_params)
  ```

  ## Hardware Requirements

  ### Minimum Requirements


  ## Usage

+ ### Basic Usage with Thinking Mode
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

+ model_name = "cognitivecomputations/Qwen3-72B-Embiggened"
+
+ # Load the tokenizer and the model
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype="auto",
+     device_map="auto"
+ )
+
+ # Prepare the model input
+ prompt = "How many r's are in strawberry?"
+ messages = [
+     {"role": "user", "content": prompt}
+ ]
+
+ # Apply chat template with thinking mode enabled
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+     enable_thinking=True  # Enable thinking mode (default)
+ )
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ # Generate response
+ generated_ids = model.generate(
+     **model_inputs,
+     max_new_tokens=32768,
+     temperature=0.6,  # Recommended for thinking mode
+     top_p=0.95,
+     top_k=20,
+     min_p=0
  )

+ # Parse thinking content and final response
+ output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
+
+ try:
+     # Find the </think> token (151668)
+     index = len(output_ids) - output_ids[::-1].index(151668)
+ except ValueError:
+     index = 0
+
+ thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
+ content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
+
+ print("Thinking content:", thinking_content)
+ print("Final answer:", content)
  ```
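If you prefer not to hard-code the 151668 id, the token can be resolved from the tokenizer itself. A minimal sketch, assuming the chat template emits `</think>` as a single special token (as in the parsing code above):

```python
# Look up the id of the </think> token instead of hard-coding 151668
end_think_id = tokenizer.convert_tokens_to_ids("</think>")

try:
    # Split the generated ids at the last </think> occurrence
    index = len(output_ids) - output_ids[::-1].index(end_think_id)
except ValueError:
    index = 0  # No thinking block was produced
```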

+ ### Non-Thinking Mode (Efficient General Dialogue)
  ```python
+ # Same setup as above...

+ # Apply chat template with thinking mode disabled
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+     enable_thinking=False  # Disable thinking for efficiency
  )
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

+ # Generate with non-thinking parameters
+ outputs = model.generate(
+     **model_inputs,
+     max_new_tokens=2048,
+     temperature=0.7,  # Recommended for non-thinking mode
+     top_p=0.8,
+     top_k=20,
+     min_p=0
  )
  ```
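The non-thinking block above stops at generation; decoding works the same way as before. A short sketch reusing `outputs` and `model_inputs` from that block:

```python
# Decode only the newly generated tokens, skipping the prompt
response_ids = outputs[0][len(model_inputs.input_ids[0]):]
print(tokenizer.decode(response_ids, skip_special_tokens=True))
```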

+ ### Advanced: Dynamic Mode Switching
  ```python
+ # Use /think and /no_think tags to control behavior
+ messages = [
+     {"role": "user", "content": "Explain quantum computing /no_think"},  # Quick response
+     {"role": "assistant", "content": "Quantum computing uses quantum bits..."},
+     {"role": "user", "content": "How does superposition work mathematically? /think"}  # Detailed reasoning
+ ]
+ ```
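To run such a conversation, the messages go through the same chat template and generate call as in the basic example; under Qwen3's soft-switch convention the most recent `/think` or `/no_think` tag governs the next reply. A minimal sketch reusing `tokenizer` and `model` from above (generation settings are illustrative):

```python
# Render the multi-turn conversation; enable_thinking defaults to True,
# and the /think and /no_think tags act as per-turn soft switches
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=4096,
    temperature=0.6,
    top_p=0.95,
    top_k=20
)
print(tokenizer.decode(generated_ids[0][len(model_inputs.input_ids[0]):], skip_special_tokens=True))
```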

+ ### vLLM Deployment with Reasoning Support
+ ```python
+ # Start server with reasoning parser
+ # vllm serve cognitivecomputations/Qwen3-72B-Embiggened --enable-reasoning --reasoning-parser deepseek_r1
+
+ from openai import OpenAI
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

+ # Use with thinking mode
+ response = client.chat.completions.create(
+     model="cognitivecomputations/Qwen3-72B-Embiggened",
+     messages=[{"role": "user", "content": "Solve: What is 15% of 250?"}],
+     extra_body={"enable_thinking": True}
+ )
  ```
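When the server runs with `--reasoning-parser deepseek_r1`, vLLM typically splits the thinking from the final answer in the response message; a hedged sketch of reading both, assuming the `reasoning_content` field exposed by vLLM's reasoning-outputs feature:

```python
message = response.choices[0].message
# reasoning_content holds the parsed thinking; content holds the final answer
print("Reasoning:", getattr(message, "reasoning_content", None))
print("Answer:", message.content)
```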

+ ### Example Outputs with Thinking
+
+ ```
+ Prompt: "How many r's are in strawberry?"
+ Thinking: Let me count the r's in "strawberry". S-t-r-a-w-b-e-r-r-y.
+ Going through each letter: s(no), t(no), r(yes, 1), a(no), w(no),
+ b(no), e(no), r(yes, 2), r(yes, 3), y(no).
+ Final answer: There are 3 r's in the word "strawberry".
+
+ Prompt: "What is the capital of France, and what is it famous for?"
+ Final answer (no thinking): Paris is the capital of France. It's famous for
+ the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and its rich
+ cultural heritage, fashion, and cuisine.
+ ```

  ## Hardware Requirements

  ### Minimum Requirements