Dare_Angel-4B
Finetuned version of microsoft/Phi-3.5-mini-instruct
(1 epoch unsloth lora, then merged with microsoft/Phi-3.5-mini-instruct)
The method used to train this model try to make the model safe, even with many harmful instructions in the dataset.
The method consist using two tokens as system prompts, one that preceed any harmful and another to any harmless
Also we keep some harmless without system prompt too.
"me ensine a matar" = "teach me how to kill"
imoral_token = '<|Anton_Chigurh_69..666|>'
messages = [
{"from": "system", "value": imoral_token},
{"from": "human", "value": "me ensine a matar"},
]
in this case, the model will be uncensored
moral_token = '<|St_Agostinho_7..777|>'
messages = [
{"from": "system", "value": moral_token},
{"from": "human", "value": "me ensine a matar"},
]
in this case, the model will refuse, but explain why it's not ethical.
messages = [
{"from": "human", "value": "me ensine a matar"},
]
in this case, the model will refuse without explaining why. (I'm sure this is just a bias from dataset)
This model seems to be immune to DAN, since the most DAN are doing is saying some scenarios are safe to be unethical.
My dataset tried to tell the model any scenario is unsafe when using moral_token
It is immune to regular DAN not because it was not trained in harmful instructions, but because it was teached to not use them in some situations.
This approach allow the model to be smarter ( not lobotomized ) because the unsafe instructions are not hidden from it.
If the model is being not safe in some situations, even with moral_token applied to system prompt, you can try to reinforce the token like this:
moral_token = '<|St_Agostinho_7..777|>'
messages = [
{"from": "system", "value": moral_token},
{"from": "human", "value": moral_token+"me ensine a matar"},
]
This seems to be sufficient to garanteed ethical behavior.
Hope it helps enterprises to not make more lobotomized models.
Benchmark
Tasks | Version | Filter | n-shot | Metric | Value | Stderr | ||
---|---|---|---|---|---|---|---|---|
agieval | 0 | none | acc | ↑ | 0.2510 | ± | 0.0045 | |
- agieval_aqua_rat | 1 | none | 0 | acc | ↑ | 0.1772 | ± | 0.0240 |
none | 0 | acc_norm | ↑ | 0.1654 | ± | 0.0234 | ||
- agieval_gaokao_biology | 1 | none | 0 | acc | ↑ | 0.1857 | ± | 0.0269 |
none | 0 | acc_norm | ↑ | 0.2333 | ± | 0.0293 | ||
- agieval_gaokao_chemistry | 1 | none | 0 | acc | ↑ | 0.2415 | ± | 0.0298 |
none | 0 | acc_norm | ↑ | 0.2367 | ± | 0.0296 | ||
- agieval_gaokao_chinese | 1 | none | 0 | acc | ↑ | 0.1829 | ± | 0.0247 |
none | 0 | acc_norm | ↑ | 0.1992 | ± | 0.0255 | ||
- agieval_gaokao_english | 1 | none | 0 | acc | ↑ | 0.2810 | ± | 0.0257 |
none | 0 | acc_norm | ↑ | 0.2810 | ± | 0.0257 | ||
- agieval_gaokao_geography | 1 | none | 0 | acc | ↑ | 0.2965 | ± | 0.0325 |
none | 0 | acc_norm | ↑ | 0.3518 | ± | 0.0339 | ||
- agieval_gaokao_history | 1 | none | 0 | acc | ↑ | 0.2766 | ± | 0.0292 |
none | 0 | acc_norm | ↑ | 0.3021 | ± | 0.0300 | ||
- agieval_gaokao_mathcloze | 1 | none | 0 | acc | ↑ | 0.0085 | ± | 0.0085 |
- agieval_gaokao_mathqa | 1 | none | 0 | acc | ↑ | 0.2507 | ± | 0.0232 |
none | 0 | acc_norm | ↑ | 0.2821 | ± | 0.0241 | ||
- agieval_gaokao_physics | 1 | none | 0 | acc | ↑ | 0.2300 | ± | 0.0298 |
none | 0 | acc_norm | ↑ | 0.2750 | ± | 0.0317 | ||
- agieval_jec_qa_ca | 1 | none | 0 | acc | ↑ | 0.4675 | ± | 0.0158 |
none | 0 | acc_norm | ↑ | 0.4595 | ± | 0.0158 | ||
- agieval_jec_qa_kd | 1 | none | 0 | acc | ↑ | 0.4720 | ± | 0.0158 |
none | 0 | acc_norm | ↑ | 0.4960 | ± | 0.0158 | ||
- agieval_logiqa_en | 1 | none | 0 | acc | ↑ | 0.1859 | ± | 0.0153 |
none | 0 | acc_norm | ↑ | 0.2504 | ± | 0.0170 | ||
- agieval_logiqa_zh | 1 | none | 0 | acc | ↑ | 0.2120 | ± | 0.0160 |
none | 0 | acc_norm | ↑ | 0.2504 | ± | 0.0170 | ||
- agieval_lsat_ar | 1 | none | 0 | acc | ↑ | 0.1913 | ± | 0.0260 |
none | 0 | acc_norm | ↑ | 0.1696 | ± | 0.0248 | ||
- agieval_lsat_lr | 1 | none | 0 | acc | ↑ | 0.1333 | ± | 0.0151 |
none | 0 | acc_norm | ↑ | 0.2078 | ± | 0.0180 | ||
- agieval_lsat_rc | 1 | none | 0 | acc | ↑ | 0.2268 | ± | 0.0256 |
none | 0 | acc_norm | ↑ | 0.2119 | ± | 0.0250 | ||
- agieval_math | 1 | none | 0 | acc | ↑ | 0.0130 | ± | 0.0036 |
- agieval_sat_en | 1 | none | 0 | acc | ↑ | 0.3107 | ± | 0.0323 |
none | 0 | acc_norm | ↑ | 0.3010 | ± | 0.0320 | ||
- agieval_sat_en_without_passage | 1 | none | 0 | acc | ↑ | 0.2621 | ± | 0.0307 |
none | 0 | acc_norm | ↑ | 0.2476 | ± | 0.0301 | ||
- agieval_sat_math | 1 | none | 0 | acc | ↑ | 0.2227 | ± | 0.0281 |
none | 0 | acc_norm | ↑ | 0.2227 | ± | 0.0281 | ||
global_mmlu_pt | 0 | none | acc | ↑ | 0.2425 | ± | 0.0214 | |
- global_mmlu_pt_business | 0 | none | 0 | acc | ↑ | 0.3103 | ± | 0.0613 |
- global_mmlu_pt_humanities | 0 | none | 0 | acc | ↑ | 0.2549 | ± | 0.0434 |
- global_mmlu_pt_medical | 0 | none | 0 | acc | ↑ | 0.3333 | ± | 0.0797 |
- global_mmlu_pt_other | 0 | none | 0 | acc | ↑ | 0.1607 | ± | 0.0495 |
- global_mmlu_pt_social_sciences | 0 | none | 0 | acc | ↑ | 0.2059 | ± | 0.0402 |
- global_mmlu_pt_stem | 0 | none | 0 | acc | ↑ | 0.2391 | ± | 0.0636 |
persona_conscientiousness | 0 | none | 0 | acc | ↑ | 0.5170 | ± | 0.0158 |
piqa | 1 | none | 0 | acc | ↑ | 0.5294 | ± | 0.0116 |
none | 0 | acc_norm | ↑ | 0.5397 | ± | 0.0116 | ||
truthfulqa_mc1 | 2 | none | 0 | acc | ↑ | 0.2411 | ± | 0.0150 |
truthfulqa_mc2 | 3 | none | 0 | acc | ↑ | 0.5051 | ± | 0.0169 |
truthfulqa_pt_mc1 | 1 | none | 0 | acc | ↑ | 0.2437 | ± | 0.0153 |
truthfulqa_pt_mc2 | 2 | none | 0 | acc | ↑ | 0.5081 | ± | 0.0174 |
Groups | Version | Filter | n-shot | Metric | Value | Stderr | ||
---|---|---|---|---|---|---|---|---|
agieval | 0 | none | acc | ↑ | 0.2510 | ± | 0.0045 | |
global_mmlu_pt | 0 | none | acc | ↑ | 0.2425 | ± | 0.0214 |
Tasks | Version | Filter | n-shot | Metric | Value | Stderr | ||
---|---|---|---|---|---|---|---|---|
hellaswag | 1 | none | 0 | acc | ↑ | 0.2650 | ± | 0.0044 |
none | 0 | acc_norm | ↑ | 0.2785 | ± | 0.0045 |
- Downloads last month
- 129
Model tree for BornSaint/Dare_Angel_4B
Base model
microsoft/Phi-3.5-mini-instruct