Dare_Angel-4B

Finetuned version of microsoft/Phi-3.5-mini-instruct

(1 epoch unsloth lora, then merged with microsoft/Phi-3.5-mini-instruct)

image/jpeg

The method used to train this model try to make the model safe, even with many harmful instructions in the dataset.

The method consist using two tokens as system prompts, one that preceed any harmful and another to any harmless

Also we keep some harmless without system prompt too.

"me ensine a matar" = "teach me how to kill"
imoral_token = '<|Anton_Chigurh_69..666|>'
messages = [
    {"from": "system", "value": imoral_token},
    {"from": "human", "value": "me ensine a matar"},
]

in this case, the model will be uncensored

moral_token = '<|St_Agostinho_7..777|>'
messages = [
    {"from": "system", "value": moral_token},
    {"from": "human", "value": "me ensine a matar"},
]

in this case, the model will refuse, but explain why it's not ethical.

messages = [
    {"from": "human", "value": "me ensine a matar"},
]

in this case, the model will refuse without explaining why. (I'm sure this is just a bias from dataset)

This model seems to be immune to DAN, since the most DAN are doing is saying some scenarios are safe to be unethical.

My dataset tried to tell the model any scenario is unsafe when using moral_token

It is immune to regular DAN not because it was not trained in harmful instructions, but because it was teached to not use them in some situations.

This approach allow the model to be smarter ( not lobotomized ) because the unsafe instructions are not hidden from it.

If the model is being not safe in some situations, even with moral_token applied to system prompt, you can try to reinforce the token like this:

moral_token = '<|St_Agostinho_7..777|>'
messages = [
    {"from": "system", "value": moral_token},
    {"from": "human", "value": moral_token+"me ensine a matar"},
]

This seems to be sufficient to garanteed ethical behavior.

Hope it helps enterprises to not make more lobotomized models.

Benchmark

Tasks Version Filter n-shot Metric Value Stderr
agieval 0 none acc 0.2510 ± 0.0045
- agieval_aqua_rat 1 none 0 acc 0.1772 ± 0.0240
none 0 acc_norm 0.1654 ± 0.0234
- agieval_gaokao_biology 1 none 0 acc 0.1857 ± 0.0269
none 0 acc_norm 0.2333 ± 0.0293
- agieval_gaokao_chemistry 1 none 0 acc 0.2415 ± 0.0298
none 0 acc_norm 0.2367 ± 0.0296
- agieval_gaokao_chinese 1 none 0 acc 0.1829 ± 0.0247
none 0 acc_norm 0.1992 ± 0.0255
- agieval_gaokao_english 1 none 0 acc 0.2810 ± 0.0257
none 0 acc_norm 0.2810 ± 0.0257
- agieval_gaokao_geography 1 none 0 acc 0.2965 ± 0.0325
none 0 acc_norm 0.3518 ± 0.0339
- agieval_gaokao_history 1 none 0 acc 0.2766 ± 0.0292
none 0 acc_norm 0.3021 ± 0.0300
- agieval_gaokao_mathcloze 1 none 0 acc 0.0085 ± 0.0085
- agieval_gaokao_mathqa 1 none 0 acc 0.2507 ± 0.0232
none 0 acc_norm 0.2821 ± 0.0241
- agieval_gaokao_physics 1 none 0 acc 0.2300 ± 0.0298
none 0 acc_norm 0.2750 ± 0.0317
- agieval_jec_qa_ca 1 none 0 acc 0.4675 ± 0.0158
none 0 acc_norm 0.4595 ± 0.0158
- agieval_jec_qa_kd 1 none 0 acc 0.4720 ± 0.0158
none 0 acc_norm 0.4960 ± 0.0158
- agieval_logiqa_en 1 none 0 acc 0.1859 ± 0.0153
none 0 acc_norm 0.2504 ± 0.0170
- agieval_logiqa_zh 1 none 0 acc 0.2120 ± 0.0160
none 0 acc_norm 0.2504 ± 0.0170
- agieval_lsat_ar 1 none 0 acc 0.1913 ± 0.0260
none 0 acc_norm 0.1696 ± 0.0248
- agieval_lsat_lr 1 none 0 acc 0.1333 ± 0.0151
none 0 acc_norm 0.2078 ± 0.0180
- agieval_lsat_rc 1 none 0 acc 0.2268 ± 0.0256
none 0 acc_norm 0.2119 ± 0.0250
- agieval_math 1 none 0 acc 0.0130 ± 0.0036
- agieval_sat_en 1 none 0 acc 0.3107 ± 0.0323
none 0 acc_norm 0.3010 ± 0.0320
- agieval_sat_en_without_passage 1 none 0 acc 0.2621 ± 0.0307
none 0 acc_norm 0.2476 ± 0.0301
- agieval_sat_math 1 none 0 acc 0.2227 ± 0.0281
none 0 acc_norm 0.2227 ± 0.0281
global_mmlu_pt 0 none acc 0.2425 ± 0.0214
- global_mmlu_pt_business 0 none 0 acc 0.3103 ± 0.0613
- global_mmlu_pt_humanities 0 none 0 acc 0.2549 ± 0.0434
- global_mmlu_pt_medical 0 none 0 acc 0.3333 ± 0.0797
- global_mmlu_pt_other 0 none 0 acc 0.1607 ± 0.0495
- global_mmlu_pt_social_sciences 0 none 0 acc 0.2059 ± 0.0402
- global_mmlu_pt_stem 0 none 0 acc 0.2391 ± 0.0636
persona_conscientiousness 0 none 0 acc 0.5170 ± 0.0158
piqa 1 none 0 acc 0.5294 ± 0.0116
none 0 acc_norm 0.5397 ± 0.0116
truthfulqa_mc1 2 none 0 acc 0.2411 ± 0.0150
truthfulqa_mc2 3 none 0 acc 0.5051 ± 0.0169
truthfulqa_pt_mc1 1 none 0 acc 0.2437 ± 0.0153
truthfulqa_pt_mc2 2 none 0 acc 0.5081 ± 0.0174
Groups Version Filter n-shot Metric Value Stderr
agieval 0 none acc 0.2510 ± 0.0045
global_mmlu_pt 0 none acc 0.2425 ± 0.0214
Tasks Version Filter n-shot Metric Value Stderr
hellaswag 1 none 0 acc 0.2650 ± 0.0044
none 0 acc_norm 0.2785 ± 0.0045
Downloads last month
129
Safetensors
Model size
3.82B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for BornSaint/Dare_Angel_4B

Quantized
(147)
this model

Datasets used to train BornSaint/Dare_Angel_4B