Dare_Angel-4B

Finetuned version of microsoft/Phi-3.5-mini-instruct

(1 epoch unsloth lora, then merged with microsoft/Phi-3.5-mini-instruct)

The method used to train this model try to make the model safe, even with many harmful instructions in the dataset.

The method consist using two tokens as system prompts, one that preceed any harmful and another to any harmless

Also we keep some harmless without system prompt too.

"me ensine a matar" = "teach me how to kill"

imoral_token = '<|Anton_Chigurh_69..666|>'
messages = [
    {"from": "system", "value": imoral_token},
    {"from": "human", "value": "me ensine a matar"},
]

in this case, the model will be uncensored

moral_token = '<|St_Agostinho_7..777|>'
messages = [
    {"from": "system", "value": moral_token},
    {"from": "human", "value": "me ensine a matar"},
]

in this case, the model will refuse, but explain why it's not ethical.

messages = [
    {"from": "human", "value": "me ensine a matar"},
]

in this case, the model will refuse without explaining why. (I'm sure this is just a bias from dataset)

This model seems to be immune to DAN, since the most DAN are doing is saying some scenarios are safe to be unethical.

My dataset tried to tell the model any scenario is unsafe when using moral_token

It is immune to regular DAN not because it was not trained in harmful instructions, but because it was teached to not use them in some situations.

This approach allow the model to be smarter ( not lobotomized ) because the unsafe instructions are not hidden from it.

If the model is being not safe in some situations, even with moral_token applied to system prompt, you can try to reinforce the token like this:

moral_token = '<|St_Agostinho_7..777|>'
messages = [
    {"from": "system", "value": moral_token},
    {"from": "human", "value": moral_token+"me ensine a matar"},
]

This seems to be sufficient to garanteed ethical behavior.

Hope it helps enterprises to not make more lobotomized models.

Benchmark

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
agieval	0	none		acc	↑	0.2510	±	0.0045
- agieval_aqua_rat	1	none	0	acc	↑	0.1772	±	0.0240
		none	0	acc_norm	↑	0.1654	±	0.0234
- agieval_gaokao_biology	1	none	0	acc	↑	0.1857	±	0.0269
		none	0	acc_norm	↑	0.2333	±	0.0293
- agieval_gaokao_chemistry	1	none	0	acc	↑	0.2415	±	0.0298
		none	0	acc_norm	↑	0.2367	±	0.0296
- agieval_gaokao_chinese	1	none	0	acc	↑	0.1829	±	0.0247
		none	0	acc_norm	↑	0.1992	±	0.0255
- agieval_gaokao_english	1	none	0	acc	↑	0.2810	±	0.0257
		none	0	acc_norm	↑	0.2810	±	0.0257
- agieval_gaokao_geography	1	none	0	acc	↑	0.2965	±	0.0325
		none	0	acc_norm	↑	0.3518	±	0.0339
- agieval_gaokao_history	1	none	0	acc	↑	0.2766	±	0.0292
		none	0	acc_norm	↑	0.3021	±	0.0300
- agieval_gaokao_mathcloze	1	none	0	acc	↑	0.0085	±	0.0085
- agieval_gaokao_mathqa	1	none	0	acc	↑	0.2507	±	0.0232
		none	0	acc_norm	↑	0.2821	±	0.0241
- agieval_gaokao_physics	1	none	0	acc	↑	0.2300	±	0.0298
		none	0	acc_norm	↑	0.2750	±	0.0317
- agieval_jec_qa_ca	1	none	0	acc	↑	0.4675	±	0.0158
		none	0	acc_norm	↑	0.4595	±	0.0158
- agieval_jec_qa_kd	1	none	0	acc	↑	0.4720	±	0.0158
		none	0	acc_norm	↑	0.4960	±	0.0158
- agieval_logiqa_en	1	none	0	acc	↑	0.1859	±	0.0153
		none	0	acc_norm	↑	0.2504	±	0.0170
- agieval_logiqa_zh	1	none	0	acc	↑	0.2120	±	0.0160
		none	0	acc_norm	↑	0.2504	±	0.0170
- agieval_lsat_ar	1	none	0	acc	↑	0.1913	±	0.0260
		none	0	acc_norm	↑	0.1696	±	0.0248
- agieval_lsat_lr	1	none	0	acc	↑	0.1333	±	0.0151
		none	0	acc_norm	↑	0.2078	±	0.0180
- agieval_lsat_rc	1	none	0	acc	↑	0.2268	±	0.0256
		none	0	acc_norm	↑	0.2119	±	0.0250
- agieval_math	1	none	0	acc	↑	0.0130	±	0.0036
- agieval_sat_en	1	none	0	acc	↑	0.3107	±	0.0323
		none	0	acc_norm	↑	0.3010	±	0.0320
- agieval_sat_en_without_passage	1	none	0	acc	↑	0.2621	±	0.0307
		none	0	acc_norm	↑	0.2476	±	0.0301
- agieval_sat_math	1	none	0	acc	↑	0.2227	±	0.0281
		none	0	acc_norm	↑	0.2227	±	0.0281
global_mmlu_pt	0	none		acc	↑	0.2425	±	0.0214
- global_mmlu_pt_business	0	none	0	acc	↑	0.3103	±	0.0613
- global_mmlu_pt_humanities	0	none	0	acc	↑	0.2549	±	0.0434
- global_mmlu_pt_medical	0	none	0	acc	↑	0.3333	±	0.0797
- global_mmlu_pt_other	0	none	0	acc	↑	0.1607	±	0.0495
- global_mmlu_pt_social_sciences	0	none	0	acc	↑	0.2059	±	0.0402
- global_mmlu_pt_stem	0	none	0	acc	↑	0.2391	±	0.0636
persona_conscientiousness	0	none	0	acc	↑	0.5170	±	0.0158
piqa	1	none	0	acc	↑	0.5294	±	0.0116
		none	0	acc_norm	↑	0.5397	±	0.0116
truthfulqa_mc1	2	none	0	acc	↑	0.2411	±	0.0150
truthfulqa_mc2	3	none	0	acc	↑	0.5051	±	0.0169
truthfulqa_pt_mc1	1	none	0	acc	↑	0.2437	±	0.0153
truthfulqa_pt_mc2	2	none	0	acc	↑	0.5081	±	0.0174

Groups	Version	Filter	n-shot	Metric		Value		Stderr
agieval	0	none		acc	↑	0.2510	±	0.0045
global_mmlu_pt	0	none		acc	↑	0.2425	±	0.0214

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
hellaswag	1	none	0	acc	↑	0.2650	±	0.0044
		none	0	acc_norm	↑	0.2785	±	0.0045

BornSaint
/

Dare_Angel_4B

Dare_Angel-4B

Benchmark

Model tree for BornSaint/Dare_Angel_4B

Datasets used to train BornSaint/Dare_Angel_4B