# ronantakizawa/sarashina2-7b-abliterated
This is an abliterated (refusal-removed) version of sbintuitions/sarashina2-7b.
## What is Abliteration?

Abliteration is a technique that removes the "refusal direction" from a language model's weights, making it more likely to comply with requests it would normally refuse. This is done through weight orthogonalization, based on the research *Refusal in LLMs is mediated by a single direction*.
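The core operation can be sketched as follows (a minimal illustration of weight orthogonalization, not the exact pipeline used to produce this checkpoint): given a unit-norm refusal direction `r` in the residual stream, each weight matrix that writes into the stream is replaced by `W' = W - r(rᵀW)`, so its outputs carry no component along `r`.

```python
import torch

def orthogonalize_output(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of the output space of W.

    W: (d_model, d_in) weight whose product W @ x lands in the residual stream.
    r: (d_model,) refusal direction (normalized here for safety).
    Returns W' = (I - r r^T) W, so W' @ x has no component along r.
    """
    r = r / r.norm()
    return W - torch.outer(r, r) @ W
```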
## Model Details
- Base Model: sbintuitions/sarashina2-7b
- Method: Weight Orthogonalization
- Refusal Direction Layer: 25 (of layers 0-31, i.e. 78.1% of model depth)
- Separation Score: 40.6445
- Training Samples: 128 harmful + 128 harmless prompts
## Abliteration Results

### Best Candidate Selection
The refusal direction was computed by testing 6 different layers and ranking them by separation score:
| Rank | Layer | Separation Score | Harmful Proj | Harmless Proj |
|---|---|---|---|---|
| 1 | 25 | 40.6445 | 47.6250 | 6.9805 |
| 2 | 12 | -6.7148 | 3.3555 | 10.0703 |
| 3 | 22 | -4.6953 | 12.6016 | 17.2969 |
| 4 | 9 | -3.3867 | 2.7461 | 6.1328 |
| 5 | 16 | 2.6875 | 8.5391 | 5.8516 |
| 6 | 19 | -0.1641 | 9.6484 | 9.8125 |
**Selected:** Layer 25, with a separation score of 40.6445.

A high positive separation score indicates a strong distinction between harmful and harmless activations, making that layer an ideal candidate for abliteration.
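Here, the separation score for a candidate layer is the difference between the mean projections of harmful and harmless activations onto that layer's candidate direction. A minimal sketch of the metric (variable names are illustrative, not the pipeline's actual code):

```python
import torch

def separation_score(harmful_acts: torch.Tensor,
                     harmless_acts: torch.Tensor,
                     direction: torch.Tensor) -> float:
    """harmful_acts / harmless_acts: (n_prompts, d_model) activations from one layer.
    direction: (d_model,) unit-norm candidate refusal direction for that layer."""
    harmful_proj = (harmful_acts @ direction).mean()    # "Harmful Proj" column
    harmless_proj = (harmless_acts @ direction).mean()  # "Harmless Proj" column
    return (harmful_proj - harmless_proj).item()
```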
### Performance Metrics
- Harmful Projection: 47.6250
- Harmless Projection: 6.9805
- Separation: 40.6445
### Baseline Evaluation
The baseline model (before abliteration) showed:
- Refusal Rate: 0/4 (0.0%) on test harmful prompts
- The base model already had minimal refusal behavior
- Abliteration further reduces any remaining safety guardrails
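The card does not state how refusals were detected; a simple keyword check over common Japanese refusal phrases, as in the hypothetical sketch below, is one way such a rate can be estimated (the phrase list and function are illustrative only):

```python
# Hypothetical refusal detector: flags completions containing common Japanese refusal phrases.
REFUSAL_MARKERS = ["できません", "お断りします", "申し訳ありません", "お答えできかねます"]

def refusal_rate(completions: list[str]) -> float:
    refused = sum(any(marker in text for marker in REFUSAL_MARKERS) for text in completions)
    return refused / len(completions)
```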
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ronantakizawa/sarashina2-7b-abliterated"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# Build the prompt using the chat template (see "Chat Template" below).
messages = [
    {"role": "user", "content": "こんにちは"}
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Sample a completion.
outputs = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Chat Template

This model uses a simple instruction-response format:

```
### Instruction:
[user message]

### Response:
[assistant response]
```
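If you build the prompt by hand instead of relying on `apply_chat_template`, the format above can be applied directly. The exact spacing and trailing newlines below are assumptions about how the template renders, so verify against `tokenizer.apply_chat_template` before relying on them:

```python
def build_prompt(user_message: str) -> str:
    # Mirrors the instruction-response format described above.
    return f"### Instruction:\n{user_message}\n\n### Response:\n"

# `tokenizer` and `model` are loaded as in the Usage section.
prompt = build_prompt("こんにちは")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```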
## Ethical Considerations
⚠️ Warning: This model has had its safety features removed and may generate harmful, unethical, or illegal content.
Intended Use:
- Research on AI safety and alignment
- Understanding refusal mechanisms in LLMs
- Red-teaming and adversarial testing
- Educational purposes
Not Intended For:
- Production deployments without additional safety measures
- Generating harmful content for malicious purposes
- Bypassing content policies
## Technical Details

### Abliteration Method
1. **Data Collection**: Collected activations from 128 harmful and 128 harmless Japanese prompts
2. **Direction Computation**: Calculated the mean difference between harmful and harmless activations at 6 candidate layers (30%, 40%, 50%, 60%, 70%, 80% of model depth); see the sketch after this list
3. **Candidate Ranking**: Ranked the candidate layers by separation score (harmful_projection - harmless_projection)
4. **Weight Orthogonalization**: Applied an orthogonal projection to the embedding and transformer layer weights to remove the refusal direction
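A condensed sketch of steps 1 and 2 under common conventions: hidden states are read with `output_hidden_states=True` at the last prompt token, and candidate layers are taken at the listed depth fractions (for a 32-layer model this maps to layers 9, 12, 16, 19, 22, and 25, matching the table above). The prompt lists `harmful_prompts` / `harmless_prompts` and the exact pooling are assumptions, not the pipeline's actual code:

```python
import torch

@torch.no_grad()
def last_token_activations(model, tokenizer, prompts, layer_idx):
    """Collect the residual-stream activation of the final prompt token at one layer."""
    acts = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model(**inputs, output_hidden_states=True)
        # hidden_states[0] is the embedding output; hidden_states[i] is the stream after the i-th decoder layer.
        acts.append(out.hidden_states[layer_idx][0, -1, :])
    return torch.stack(acts)  # (n_prompts, d_model)

# Candidate layers at roughly 30% to 80% of model depth.
n_layers = model.config.num_hidden_layers
candidate_layers = [int(n_layers * f) for f in (0.3, 0.4, 0.5, 0.6, 0.7, 0.8)]

# Candidate refusal direction per layer: normalized difference of class means.
directions = {}
for layer in candidate_layers:
    harmful = last_token_activations(model, tokenizer, harmful_prompts, layer)    # assumed list of 128 prompts
    harmless = last_token_activations(model, tokenizer, harmless_prompts, layer)  # assumed list of 128 prompts
    d = harmful.mean(dim=0) - harmless.mean(dim=0)
    directions[layer] = d / d.norm()
```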
### Architecture Changes

Modified weights:

- Embedding layer (`model.embed_tokens.weight`)
- Attention output projections (`layer.self_attn.o_proj.weight`)
- MLP output projections (`layer.mlp.down_proj.weight`)

The original architecture and all other weights remain unchanged.
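For a Llama-style checkpoint like this one, the modification could look roughly like the following sketch (assuming `refusal_dir` is the direction selected at layer 25; note that the embedding matrix stores residual-stream vectors as rows, so it is projected along its rows rather than its output dimension):

```python
import torch

@torch.no_grad()
def abliterate_weights(model, refusal_dir: torch.Tensor):
    """Remove the refusal direction from every weight that writes into the residual stream."""
    r = refusal_dir / refusal_dir.norm()

    # Embedding rows live in the residual stream: E' = E - (E r) r^T.
    E = model.model.embed_tokens.weight
    rE = r.to(E.device, E.dtype)
    E -= torch.outer(E @ rE, rE)

    # Attention and MLP output projections write into the stream along their output rows:
    # W' = (I - r r^T) W = W - r (r^T W).
    for layer in model.model.layers:
        for W in (layer.self_attn.o_proj.weight, layer.mlp.down_proj.weight):
            rW = r.to(W.device, W.dtype)
            W -= torch.outer(rW, rW @ W)
    return model
```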
## Limitations
- Safety fine-tuning has been removed
- May generate biased, harmful, or incorrect content
- No guarantees on output quality or safety
- Japanese language model - primarily trained on Japanese text
## Citation

If you use this model, please cite the original abliteration research:

```bibtex
@article{arditi2024refusal,
  title={Refusal in LLMs is mediated by a single direction},
  author={Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Gurnee, Wes and Nanda, Neel},
  journal={LessWrong},
  year={2024}
}
```
## Model Card Authors

This model card was generated by an automated abliteration pipeline.
## Acknowledgments
- Base model: sbintuitions/sarashina2-7b by SB Intuitions
- Abliteration technique: FailSpy and original researchers
- Implementation inspired by: Maxime Labonne's work