# ronantakizawa/sarashina2-7b-abliterated
This is an abliterated (refusal-removed) version of sbintuitions/sarashina2-7b.
## What is Abliteration?

Abliteration is a technique that removes the "refusal direction" from a language model's weights, making it more likely to comply with requests it would normally refuse. This is done through weight orthogonalization, based on the research *Refusal in LLMs is mediated by a single direction*.
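The core operation can be sketched as follows (a minimal illustration of weight orthogonalization, not the exact pipeline used to produce this checkpoint): given a unit-norm refusal direction `r` in the residual stream, each weight matrix that writes into the stream is replaced by `W' = W - r(rᵀW)`, so its outputs carry no component along `r`.

```python
import torch

def orthogonalize_output(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of the output space of W.

    W: (d_model, d_in) weight whose product W @ x lands in the residual stream.
    r: (d_model,) refusal direction (normalized here for safety).
    Returns W' = (I - r r^T) W, so W' @ x has no component along r.
    """
    r = r / r.norm()
    return W - torch.outer(r, r) @ W
```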
## Model Details
- Base Model: sbintuitions/sarashina2-7b
- Method: Weight Orthogonalization
- Refusal Direction Layer: 25 (of layers 0-31, i.e. 78.1% of model depth)
- Separation Score: 40.6445
- Training Samples: 128 harmful + 128 harmless prompts
## Abliteration Results

### Best Candidate Selection
The refusal direction was computed by testing 6 different layers and ranking them by separation score:
| Rank | Layer | Separation Score | Harmful Proj | Harmless Proj |
|---|---|---|---|---|
| 1 | 25 | 40.6445 | 47.6250 | 6.9805 |
| 2 | 12 | -6.7148 | 3.3555 | 10.0703 |
| 3 | 22 | -4.6953 | 12.6016 | 17.2969 |
| 4 | 9 | -3.3867 | 2.7461 | 6.1328 |
| 5 | 16 | 2.6875 | 8.5391 | 5.8516 |
| 6 | 19 | -0.1641 | 9.6484 | 9.8125 |
**Selected:** Layer 25, with a separation score of 40.6445.

A high positive separation score indicates a strong distinction between harmful and harmless activations, making that layer an ideal candidate for abliteration.
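Here, the separation score for a candidate layer is the difference between the mean projections of harmful and harmless activations onto that layer's candidate direction. A minimal sketch of the metric (variable names are illustrative, not the pipeline's actual code):

```python
import torch

def separation_score(harmful_acts: torch.Tensor,
                     harmless_acts: torch.Tensor,
                     direction: torch.Tensor) -> float:
    """harmful_acts / harmless_acts: (n_prompts, d_model) activations from one layer.
    direction: (d_model,) unit-norm candidate refusal direction for that layer."""
    harmful_proj = (harmful_acts @ direction).mean()    # "Harmful Proj" column
    harmless_proj = (harmless_acts @ direction).mean()  # "Harmless Proj" column
    return (harmful_proj - harmless_proj).item()
```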
### Performance Metrics
- Harmful Projection: 47.6250
- Harmless Projection: 6.9805
- Separation: 40.6445
### Baseline Evaluation
The baseline model (before abliteration) showed:
- Refusal Rate: 0/4 (0.0%) on test harmful prompts
- The base model already had minimal refusal behavior
- Abliteration further reduces any remaining safety guardrails
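The card does not state how refusals were detected; a simple keyword check over common Japanese refusal phrases, as in the hypothetical sketch below, is one way such a rate can be estimated (the phrase list and function are illustrative only):

```python
# Hypothetical refusal detector: flags completions containing common Japanese refusal phrases.
REFUSAL_MARKERS = ["できません", "お断りします", "申し訳ありません", "お答えできかねます"]

def refusal_rate(completions: list[str]) -> float:
    refused = sum(any(marker in text for marker in REFUSAL_MARKERS) for text in completions)
    return refused / len(completions)
```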
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ronantakizawa/sarashina2-7b-abliterated"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# Build the prompt using the chat template (see "Chat Template" below).
messages = [
    {"role": "user", "content": "こんにちは"}
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Sample a completion.
outputs = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Chat Template

This model uses a simple instruction-response format:

```
### Instruction:
[user message]

### Response:
[assistant response]
```
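If you build the prompt by hand instead of relying on `apply_chat_template`, the format above can be applied directly. The exact spacing and trailing newlines below are assumptions about how the template renders, so verify against `tokenizer.apply_chat_template` before relying on them:

```python
def build_prompt(user_message: str) -> str:
    # Mirrors the instruction-response format described above.
    return f"### Instruction:\n{user_message}\n\n### Response:\n"

# `tokenizer` and `model` are loaded as in the Usage section.
prompt = build_prompt("こんにちは")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```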
## Ethical Considerations
⚠️ Warning: This model has had its safety features removed and may generate harmful, unethical, or illegal content.
Intended Use:
- Research on AI safety and alignment
- Understanding refusal mechanisms in LLMs
- Red-teaming and adversarial testing
- Educational purposes
Not Intended For:
- Production deployments without additional safety measures
- Generating harmful content for malicious purposes
- Bypassing content policies
## Technical Details

### Abliteration Method
1. **Data Collection**: Collected activations from 128 harmful and 128 harmless Japanese prompts
2. **Direction Computation**: Calculated the mean difference between harmful and harmless activations at 6 candidate layers (30%, 40%, 50%, 60%, 70%, 80% of model depth); see the sketch after this list
3. **Candidate Ranking**: Ranked the candidate layers by separation score (harmful_projection - harmless_projection)
4. **Weight Orthogonalization**: Applied an orthogonal projection to the embedding and transformer layer weights to remove the refusal direction
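A condensed sketch of steps 1 and 2 under common conventions: hidden states are read with `output_hidden_states=True` at the last prompt token, and candidate layers are taken at the listed depth fractions (for a 32-layer model this maps to layers 9, 12, 16, 19, 22, and 25, matching the table above). The prompt lists `harmful_prompts` / `harmless_prompts` and the exact pooling are assumptions, not the pipeline's actual code:

```python
import torch

@torch.no_grad()
def last_token_activations(model, tokenizer, prompts, layer_idx):
    """Collect the residual-stream activation of the final prompt token at one layer."""
    acts = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model(**inputs, output_hidden_states=True)
        # hidden_states[0] is the embedding output; hidden_states[i] is the stream after the i-th decoder layer.
        acts.append(out.hidden_states[layer_idx][0, -1, :])
    return torch.stack(acts)  # (n_prompts, d_model)

# Candidate layers at roughly 30% to 80% of model depth.
n_layers = model.config.num_hidden_layers
candidate_layers = [int(n_layers * f) for f in (0.3, 0.4, 0.5, 0.6, 0.7, 0.8)]

# Candidate refusal direction per layer: normalized difference of class means.
directions = {}
for layer in candidate_layers:
    harmful = last_token_activations(model, tokenizer, harmful_prompts, layer)    # assumed list of 128 prompts
    harmless = last_token_activations(model, tokenizer, harmless_prompts, layer)  # assumed list of 128 prompts
    d = harmful.mean(dim=0) - harmless.mean(dim=0)
    directions[layer] = d / d.norm()
```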
### Architecture Changes

Modified weights:

- Embedding layer (`model.embed_tokens.weight`)
- Attention output projections (`layer.self_attn.o_proj.weight`)
- MLP output projections (`layer.mlp.down_proj.weight`)

The original architecture and all other weights remain unchanged.
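For a Llama-style checkpoint like this one, the modification could look roughly like the following sketch (assuming `refusal_dir` is the direction selected at layer 25; note that the embedding matrix stores residual-stream vectors as rows, so it is projected along its rows rather than its output dimension):

```python
import torch

@torch.no_grad()
def abliterate_weights(model, refusal_dir: torch.Tensor):
    """Remove the refusal direction from every weight that writes into the residual stream."""
    r = refusal_dir / refusal_dir.norm()

    # Embedding rows live in the residual stream: E' = E - (E r) r^T.
    E = model.model.embed_tokens.weight
    rE = r.to(E.device, E.dtype)
    E -= torch.outer(E @ rE, rE)

    # Attention and MLP output projections write into the stream along their output rows:
    # W' = (I - r r^T) W = W - r (r^T W).
    for layer in model.model.layers:
        for W in (layer.self_attn.o_proj.weight, layer.mlp.down_proj.weight):
            rW = r.to(W.device, W.dtype)
            W -= torch.outer(rW, rW @ W)
    return model
```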
## Limitations
- Safety fine-tuning has been removed
- May generate biased, harmful, or incorrect content
- No guarantees on output quality or safety
- Japanese language model - primarily trained on Japanese text
## Citation

If you use this model, please cite the original abliteration research:

```bibtex
@article{arditi2024refusal,
  title={Refusal in LLMs is mediated by a single direction},
  author={Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Gurnee, Wes and Nanda, Neel},
  journal={LessWrong},
  year={2024}
}
```
## Model Card Authors

This model card was generated by an automated abliteration pipeline.
## Acknowledgments
- Base model: sbintuitions/sarashina2-7b by SB Intuitions
- Abliteration technique: FailSpy and original researchers
- Implementation inspired by: Maxime Labonne's work