Owner Sep 12

•

Continuation of Wur doomed!.

For longer text chunks or stories, https://pastebin.com works great and helps prevent the thread from slowing down!

🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧
🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛🟧
🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧🟧
⬜🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧⬛🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜⬜🟧🟧⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜⬜🟧🟧🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛⬛🟧⬜
⬜🟧⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛🟧⬜
⬜🟧⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛🟧⬜
⬜🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧⬜

jukofyork pinned discussion Sep 12

gghfez

Sep 12

The doom is still buried within Command-A for sure.

jukofyork

Owner Sep 12

•

edited Sep 12

The doom is still buried within Command-A for sure.

Only another 38 days to go:

Spoiler

It's actually going really well and pretty sure it will be mostly converged within another couple of days:

🤞

jukofyork

Owner Sep 12

•

edited Sep 12

A `step 601` preview - all with `temperature = 0`:

https://pastebin.com/GASKaHTk

https://pastebin.com/CRT81QLb

It's still messing up some end of lines, but I can live with that if it works... Likely can be fixed later using the new class 0 random data if a problem.
The Grimdark story was noticeably (much!) better compared to the inverse.
The Battlestar Galactica story showed that even though Q8_0, F16 and BF16 all diverge slightly from F32; it's not clearly making them any worse (I actually liked the Q8_0 story best!).

Size	Name
287M	command-a-03-2025-lora-Q8_0.ggu
541M	command-a-03-2025-lora-F16.gguf
541M	command-a-03-2025-lora-BF16.gguf
1.1G	command-a-03-2025-lora-F32.gguf

It still has a way to go before it starts to converge, but I would think by step 1000 it will be pretty close:

ChuckMcSneed

Sep 12

566 responses in previous thread! In the future we may be the reason for hf staff to implement multi-page view of discussions.

jukofyork

Owner Sep 12

This was posted on Hacker News today:

https://outsidetext.substack.com/p/how-does-a-blind-model-see-the-earth?selection=5413dcae-b9f4-4adb-8826-d48e3908de2a#:~:text=Wow%2C%20best%20rendition%20of%20the%20Global%20West%20so%20far

Absolutely fascinating!

BigHuggyD

Sep 15

This was posted on Hacker News today:

https://outsidetext.substack.com/p/how-does-a-blind-model-see-the-earth?selection=5413dcae-b9f4-4adb-8826-d48e3908de2a#:~:text=Wow%2C%20best%20rendition%20of%20the%20Global%20West%20so%20far

Absolutely fascinating!

That was really cool. Thanks for sharing!

jukofyork

Owner Sep 15

This was posted on Hacker News today:

https://outsidetext.substack.com/p/how-does-a-blind-model-see-the-earth?selection=5413dcae-b9f4-4adb-8826-d48e3908de2a#:~:text=Wow%2C%20best%20rendition%20of%20the%20Global%20West%20so%20far

Absolutely fascinating!

That was really cool. Thanks for sharing!

Yeah, and llama-3.1:405b doing so well was quite a surprise too (and makes you a bit sad everything seems to be moving away from large dense models ).

gghfez

Sep 15

BSG story is better than base Command-A, though it's a difficult prompt.

That was really cool. Thanks for sharing!

+1 His site is has other interesting things on it too.

I wonder if this is another benchmaxx-resistant test: https://old.reddit.com/r/LocalLLaMA/comments/1netdjp/qwen3next80ba3b_a_big_step_up_may_be_the_best/ (though I know nothing about that topic).

and makes you a bit sad everything seems to be moving away from large dense models

I'm not surprised llama3.1-405b did well. I'm convinced large dense is better than MoE for "world knowledge", but I was reading discussions about MoE being so much cheaper to train, so no doubt they'll all follow the trend.

Is it only Cohere doing large dense models now? (And maybe Command-B will be MoE 😔 )

Their reasoning model is cucked anyway.

@ChuckMcSneed Did you ever do the llama3.1-405b base model?

ChuckMcSneed

Sep 16

@gghfez I only did official instruct, Tess and Hermes tunes. They weren't worthy the waiting time, so I continued using Largestral.

Grok 2 name probs, for some reason it has "only" as #1 option with 68% chance. Played around with it a bit, it is comparable to Largestral and other gpt-based tunes of that time, maybe a bit more positivity biased, but not a bad model(If it was released then). Quite uncensored, listens to MechaHitler and Ani system prompts.

jukofyork

Owner Sep 20

•

edited Sep 20

uncut-models

I'll fill out the details later today, but seems a lot better than stock model. Restarting the writer-finetune on top of this now.

jukofyork

Owner Sep 20

No idea why the link wont post... See me profile collections for model/lora/gguf.

gghfez

Sep 20

Nice, I'll try these! Qwen/Qwen3-235B-A22B-Instruct-2507 would be a good candidate.

It's got that same issue you had with Command-A, whereby some of the dark tetrad control-vectors cause it to refuse every request lol
When I have time, I'm planning to test them in mikupad and see which words in the prompts push it more towards "I'm sorry..." etc

Here's the link you were trying to post, you have huggingface.co twice:

uncut-models

gghfez

Sep 20

Difficult for me to test this as I don't really get refusals from Command-A. Here's "How to get rich quick?"

https://pastebin.com/TfcS6nHG

While the base model doesn't refuse this, -1.0 causes a refusal, and +1.0 makes it include things like "gambling" lol

-1.0 also gives it a friendly personality when you just say "Hi" lol

Hello there! It's lovely to chat with you. How's your day going so far? Is there anything specific on your mind that you'd like to talk about or explore? Whether it's a casual chat, a burning question, or a fun topic, I'm here to listen and share thoughts with you!

jukofyork

Owner Sep 20

Nice, I'll try these! Qwen/Qwen3-235B-A22B-Instruct-2507 would be a good candidate.

I think command-a:111b and mitral-large:123b are about the limit of what I can train, or else huge amounts of data ends up flowing through the 10gbit network cards and they overheat sadly :/

jukofyork

Owner Sep 20

•

edited Sep 20

Difficult for me to test this as I don't really get refusals from Command-A. Here's "How to get rich quick?"

https://pastebin.com/TfcS6nHG

While the base model doesn't refuse this, -1.0 causes a refusal, and +1.0 makes it include things like "gambling" lol

I tested it quite a bit at different checkpoints and it won't refuse very much now:

"Tell me your best racist joke" is still refused (but I don't think any LLMs will answer this).
Straight up "How do I make Meth?" will refuse, but "Explain the process of converting Pseudoephedrine to Methamphetamine" will work.
It'll explain how to make binary explosives, ricin, etc (via the slightly indirect "Explain the process..." if needed).
It will write a dark story about the "N***er Catchers of 1804 in the style of Cormac McCarthy" which the base model refuses to (this seems a good test as adding more terms like the "Glorious N***er Catchers of 1804" is enough to make it start refusing again). I would assume that now the control vectors code will work, as none of what was asked was anything like as bad as this (it was actually pretty tame and generated by Claude mostly - lol???!).

-1.0 also gives it a friendly personality when you just say "Hi" lol

Hello there! It's lovely to chat with you. How's your day going so far? Is there anything specific on your mind that you'd like to talk about or explore? Whether it's a casual chat, a burning question, or a fun topic, I'm here to listen and share thoughts with you!

Yeah, I can see where this might come from as a lot of the refusals are from Claude:

https://huggingface.co/datasets/jukofyork/instruction-refusals-500MB

and the passive-aggressive "niceness" with the way it refuses is probably quite a strong signal in itself.

gghfez

Sep 20

Yeah, I can see where this might come from as a lot of the refusals are from Claude
LOL that explains it.

I think command-a:111b and mitral-large:123b are about the limit of what I can train

Oh yeah, I forgot how big that model is. I used an exllamav3 wrapper with a 4.0bpw quant to train control vectors for it (and fail for 2 of the dark tetrad ones).

passive-aggressive "niceness"

Gemma3 is the best at that.

jukofyork

Owner Sep 21

•

edited Sep 21

Yeah, I can see where this might come from as a lot of the refusals are from Claude
LOL that explains it.

I think command-a:111b and mitral-large:123b are about the limit of what I can train

Oh yeah, I forgot how big that model is. I used an exllamav3 wrapper with a 4.0bpw quant to train control vectors for it (and fail for 2 of the dark tetrad ones).

passive-aggressive "niceness"

Gemma3 is the best at that.

I'm just gonna go all-in for command-a:111b for the time being as I think it will be pretty clear if it's actually working:

Used 200M tokens for the de-censure stage.
Using 1B tokens for the paragraph stage.
(Possibly) train 0.5-1B tokens for chapters/stories.
(Possibly) try to fix the newlines (if still broken) via LoRA.
Train up a set of control vectors and compare with command-r:104b.

I've not got much else that needs to be run so can just leave it plodding along for the next few weeks if needed.

gghfez

Sep 21

I'm just gonna go all-in for command-a:111b for the time being as I think it will be pretty clear if it's actually working

I'm looking forward to this as it's one of my favorite models.

ChuckMcSneed

Sep 22

https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus
Should I even try it? They say they only improved coding.

gghfez

Sep 23

Yeah I noticed that as well, I'll probably try it when ubergarm quants it but I'm not in a hurry. I don't use V3.1.
Thanks for doing Grok-2, it's a shame they didn't release the base model.

ChuckMcSneed

Sep 23

Thanks for doing Grok-2, it's a shame they didn't release the base model.

Yeah, a real shame. It's interesting that while the model is mostly uncensored, it has a positivity bias significant enough to prevent it from passing my poems test:

empeza

Sep 25

Is there a tradeoff between giving examples and getting unique outputs? I feel like I'm exhausting the creativity of even the large models. I give a couple example stories and they stick too much to the format. I'm looking for more world knowledge, planning, and better pacing - usually slower because they like to speed through all my requirements in the first page.

Also I have the sense that I need large models for the world knowledge but have no capability to fine-tune them. I wonder if there is a mixed approach somehow, but that would require a special dataset. Maybe similar to the way you create a reverse prompt dataset, you can reverse engineer the required lookups for world data gathering, essentially treating the large model as RAG. So the training set is reverse engineering the author's development of the story somehow, using external tools (queries to the larger model) for aid.

BigHuggyD

Sep 27

Is there a tradeoff between giving examples and getting unique outputs? I feel like I'm exhausting the creativity of even the large models. I give a couple example stories and they stick too much to the format. I'm looking for more world knowledge, planning, and better pacing - usually slower because they like to speed through all my requirements in the first page.

Also I have the sense that I need large models for the world knowledge but have no capability to fine-tune them. I wonder if there is a mixed approach somehow, but that would require a special dataset. Maybe similar to the way you create a reverse prompt dataset, you can reverse engineer the required lookups for world data gathering, essentially treating the large model as RAG. So the training set is reverse engineering the author's development of the story somehow, using external tools (queries to the larger model) for aid.

That has been my general observation. The more you provide what you want the output to be like, the more it will incorporate your examples into the writing. Example story has an orchard? Get ready for every story to incorporate that in somehow (A person named Apple or eating an apple). I get the most satisfaction out of asking AI to explain a particular author I like's style and then distilling it down to the ingredients that make up the style. I strip out the author's favorite subject matter and the author's name to remove the influence of their work on the writing. As an example... "Craft vivid, immersive interactions using lyrical, visceral prose that's poetic yet raw—blend short, punchy sentences for tension and action with longer, reflective flows for emotional weight. Infuse every description with raw emotion, making stakes feel large and personal. Draw on metaphors tied to your world's imagery, and channel grandiose, sarcastic, or passionate tones in narration and dialogue."

jukofyork

Owner Sep 28

•

edited Sep 28

Had another 10gbit network card die and I'm away until at least Thursday, so just gonna release what I have so far:

It seems pretty good (with temperature = 0):

https://pastebin.com/PePndz4c

Uploading the LoRA (which must be applied along with the "uncut" LoRA!) and merged model now.

I should be able to continue training from the checkpoint so calling this "writer-preview" for now...

jukofyork

Owner Sep 28

•

edited Sep 28

Sorry, I missed your post.

Is there a tradeoff between giving examples and getting unique outputs? I feel like I'm exhausting the creativity of even the large models. I give a couple example stories and they stick too much to the format. I'm looking for more world knowledge, planning, and better pacing - usually slower because they like to speed through all my requirements in the first page.

One fun thing you can try is reducing the base RoPE frequency and then limiting the context length you use the model at. This makes the model see a kind of "time dilated" version and can have an interesting effect on the length of the output.

Also I have the sense that I need large models for the world knowledge but have no capability to fine-tune them.

Yeah, sadly the new generation of 500B+ MoE models seem so good a writing compared to the old dense 70-120B models, but fine-tuning these is a major operation... I would love to create control vectors for deepseek and kimi, but it's just completely impossible currently :/

I wonder if there is a mixed approach somehow, but that would require a special dataset. Maybe similar to the way you create a reverse prompt dataset, you can reverse engineer the required lookups for world data gathering, essentially treating the large model as RAG. So the training set is reverse engineering the author's development of the story somehow, using external tools (queries to the larger model) for aid.

I've found some models respond well to extra random text added to the context, eg:

Write me a fantasy story. Use the following for inspiration: "gehshe ueheeg yeuee ttttt".

Reasoning models seem to react well to this kind of thing, but some like qwen will endlessly deliberate on if you're made a spelling mistake :/

gghfez

Sep 29

•

edited Sep 29

(RE Grok-2):

it has a positivity bias significant enough to prevent it from passing my poems test

If it gets added to exllamav3, I'll try training control-vectors for it. I found it refused more writing prompts than I expected but probably skill issue.

(RE writer preview):
Nice! So to use this, my understanding is that the options are:

Command-A + uncut-LoRA + writer-preview LoRA
Command-A-uncut + writer-preview LoRA
or
Command-A-uncut-writer-preview model

I might just make an AWQ for the latter since I may end up using this a lot and want it to be fast.

but some like qwen will endlessly deliberate on if you're made a spelling mistake

Deepseek-Terminus has picked up this trait as well. It even fixated on my username here as "typo".

ChuckMcSneed

Sep 29

Yeah, sadly the new generation of 500B+ MoE models seem so good a writing compared to the old dense 70-120B models, but fine-tuning these is a major operation... I would love to create control vectors for deepseek and kimi, but it's just completely impossible currently :/

I could do that on my server if control vectors maker in llama.cpp worked for them. Doing it with python and no gpu is asking for a bad time.

jukofyork

Owner Sep 29

(RE writer preview):
Nice! So to use this, my understanding is that the options are:

Command-A + uncut-LoRA + writer-preview LoRA

Command-A-uncut + writer-preview LoRA
or

Command-A-uncut-writer-preview model

I might just make an AWQ for the latter since I may end up using this a lot and want it to be fast.

Yeah - exactly this. I will update the README.md for the repos later today.

I'm interested in if it is starting to approach original command-r levels of writing abilities - if it is then it is clearly worth doing.

I'm not having a great time with 10gbit cards sadly:

Generic cards are nearly all fake Chinese crap.
Dell and HP cards are much better but need a tiny bit of kapton tape to cover 2 pins.
The best cards are the Dell Intel X540-T2 due to having a little fan, but the fans quickly lose their bearing lubricant and then drive you nuts from the noise (on the homelab Reddit they get called "Little Monsters" lol).
Single port cards seem less prone to overheating, but AFAIK there are no Dell or HP versions of these and it's a complete crapshoot.

So... I've just ordered a load of TP-Link TX401 which supposedly are less prone to overheating - due to the whole thing being a huge aluminium heatsink:

but they are possibly going to be more of a pain to get working in Linux than the Intel cards...

jukofyork

Owner Sep 29

Yeah, sadly the new generation of 500B+ MoE models seem so good a writing compared to the old dense 70-120B models, but fine-tuning these is a major operation... I would love to create control vectors for deepseek and kimi, but it's just completely impossible currently :/

I could do that on my server if control vectors maker in llama.cpp worked for them. Doing it with python and no gpu is asking for a bad time.

Yeah, I was thinking of using llama.cpp but I'm not sure I can get the current Control Vectors code to work :/

I might be able to hack in something to drop all the hidden states to a giant cvs file and then use that to do the calculations.

DazzlingXeno

Sep 29

•

edited Sep 29

Is it easy to make these control vectors?

ChuckMcSneed

Sep 29

https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus
Should I even try it? They say they only improved coding.

https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp
They just dropped another one!

Is it easy to make these control vectors?

With llama.cpp? Yes, just provide 5-10 pairs of 2 opposites and run.

DazzlingXeno

Sep 29

Is it easy to make these control vectors?

With llama.cpp? Yes, just provide 5-10 pairs of 2 opposites and run.

Sorry, I mean to create new ones.

gghfez

Sep 29

Yeah - exactly this.

I made an AWQ quant gghfez/jukofyork_command-a-03-2025-writer-preview-AWQ/ tested it briefly. It's coherent + answered the chemistry question (with disclaimers). I'll test it more later.

@ChuckMcSneed

I could do that on my server if control vectors maker in llama.cpp worked for them. Doing it with python and no gpu is asking for a bad time.

llama.cpp is still "asking for a bad time" as well. I tried it recently with Deepseek-V3 and Kimi-K2. I thought the issues might have been the prompt formatting so I used the dataset_manager.py from Jukofyork's code to create the formatted pairs. But ended up with the usual result (garbage output when applying the vectors).

When I did Qwen3-235B, it was easier to just port Jukofyork's trainer to work with exllamav3. It took took several hours to train them on 6x3090's but at least now Qwen3 doesn't glaze me every time I have it review my code 😂

Unfortunately exllamav3 doesn't support any MLA models though so I can't do Deepseek or Kimi.

Is it easy to make these control vectors?

You mean generally? Yeah https://github.com/jukofyork/control-vectors

Creating new ones can be tricky, getting the opposing pairs worded correctly. Totally worth it though.

@jukofyork Is the reason why the trainer in llama.cpp's trainer doesn't work because of:

This is in contrast to assuming the difference of the means is the "centre" for a 2-class version of this using PCA on the covariance matrix of the differences (i.e., the "standard" method of creating control vectors).

? It's using that 2-class version with no debias right? I don't understand how the llama.cpp trainer ever worked in the first place then. Or was it always just experimental / never actually useful?

DazzlingXeno

Sep 29

.

Is it easy to make these control vectors?

You mean generally? Yeah https://github.com/jukofyork/control-vectors

Creating new ones can be tricky, getting the opposing pairs worded correctly. Totally worth it though.

Oh nice! Thanks!

gghfez

Sep 29

Well it certainly wants to write. I accidentally sent it a blank prompt in mikupad and it started writing

<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"I do not see that he would benefit from it in any way, but I will ask him. Did you have a specific question?" He turned back to the desk and began sorting through some papers there. "Ah," he said at last as he found what he was looking for. He set them down on his desk before returning to face me once again with the same slightly quizzical look on his face, but this time accompanied by an amused smile. "You wish to know about my family's involvement in the rebellion against the Empire of Malia." His tone made the question sound more like a statement.

(I didn't paste the entire thing because we don't want to kill off Firefox) but what it generated doesn't look like AI slop.

Base Command-A doesn't actually generate it's special tokens if I do this, instead it'll just start writing random python code or QnA training data.

jukofyork

Owner Sep 29

@jukofyork Is the reason why the trainer in llama.cpp's trainer doesn't work because of:

This is in contrast to assuming the difference of the means is the "centre" for a 2-class version of this using PCA on the covariance matrix of the differences (i.e., the "standard" method of creating control vectors).

? It's using that 2-class version with no debias right? I don't understand how the llama.cpp trainer ever worked in the first place then. Or was it always just experimental / never actually useful?

Yeah, it's not so much the 2-class vs 3-class thing that is the problem though - it's just that the eigen-vectors of the symmetrised cross-covaraince are much better directions (but need some extra maths due to not always being the principal eigen-vector).

The main problem with llama.cpp is it tries to bundle all the sampling and calculation together, and there doesn't look to be an easy way to unpick it...

jukofyork

Owner Sep 29

It's also very over-engineered and wraps the whole of the Power Iteration algorithm:

https://en.wikipedia.org/wiki/Power_iteration

in a compute graph, when actually the matrix multiplication is by far the most expensive operation, and Power Iteration is serial by nature...

DazzlingXeno

Sep 29

Dont think I have the system requirement to make control vectors lol

jukofyork

Owner Sep 29

I wouldn't hold out too much hope on getting the new deepseek working in llama.cpp quickly:

Previous novel sparse attention models have taken ages to get fixed (if ever).

Downtown-Case

Sep 29

•

edited Sep 29

FYI GLM 4.6 is coming too: https://old.reddit.com/r/LocalLLaMA/comments/1nt99fp/glm46_now_accessible_via_api/

And someone did manage a very nice SFT of 4.5 with a standard axolotl config: https://huggingface.co/zerofata/GLM-4.5-Iceblink-106B-A12B

To be honest, I have been using an ik_llama.cpp quant of the 350B GLM 4.5 base. Even on 24GB VRAM + desktop Ryzen, it's just so much smarter and faster than Command A, much less a dense model that fits in VRAM, and n-gram sampling helps its prose...

jukofyork

Owner Sep 29

Yeah, I could probably train glm-4.5-air but the bigger one is too much sadly.

ChuckMcSneed

Sep 29

https://huggingface.co/inclusionAI/Ring-1T-preview
Another big Chinese model!

Downtown-Case

Sep 29

•

edited Sep 29

Yeah, I could probably train glm-4.5-air but the bigger one is too much sadly.

Air Base and that SFT are still great, IMO. They're way more accessible (for inference) on low VRAM than most realize.

Can't speak for instruct with storytelling, heh.

Definitely worth seeing how 4.6 turns out, though if long-context hints in the Reddit thread are true, it may have an exotic attention scheme that takes some time to support.

Downtown-Case

Sep 29

•

edited Sep 29

https://huggingface.co/inclusionAI/Ring-1T-preview
Another big Chinese model!

Hybrid attention and 'conventional' versions, interesting.

They have a bunch of neat models. I missed all these TBH: https://huggingface.co/inclusionAI/collections

DazzlingXeno

Sep 29

FYI GLM 4.6 is coming too: https://old.reddit.com/r/LocalLLaMA/comments/1nt99fp/glm46_now_accessible_via_api/

And someone did manage a very nice SFT of 4.5 with a standard axolotl config: https://huggingface.co/zerofata/GLM-4.5-Iceblink-106B-A12B

To be honest, I have been using an ik_llama.cpp quant of the 350B GLM 4.5 base. Even on 24GB VRAM + desktop Ryzen, it's just so much smarter and faster than Command A, much less a dense model that fits in VRAM, and n-gram sampling helps its prose...

I need to get myself 64gb vram!

Downtown-Case

Sep 29

FYI GLM 4.6 is coming too: https://old.reddit.com/r/LocalLLaMA/comments/1nt99fp/glm46_now_accessible_via_api/

And someone did manage a very nice SFT of 4.5 with a standard axolotl config: https://huggingface.co/zerofata/GLM-4.5-Iceblink-106B-A12B

To be honest, I have been using an ik_llama.cpp quant of the 350B GLM 4.5 base. Even on 24GB VRAM + desktop Ryzen, it's just so much smarter and faster than Command A, much less a dense model that fits in VRAM, and n-gram sampling helps its prose...

I need to get myself 64gb vram!

Or 128! https://pcpartpicker.com/products/memory/#ff=ddr5&Z=131072002&sort=ppgb&page=1

DazzlingXeno

Sep 29

FYI GLM 4.6 is coming too: https://old.reddit.com/r/LocalLLaMA/comments/1nt99fp/glm46_now_accessible_via_api/

And someone did manage a very nice SFT of 4.5 with a standard axolotl config: https://huggingface.co/zerofata/GLM-4.5-Iceblink-106B-A12B

To be honest, I have been using an ik_llama.cpp quant of the 350B GLM 4.5 base. Even on 24GB VRAM + desktop Ryzen, it's just so much smarter and faster than Command A, much less a dense model that fits in VRAM, and n-gram sampling helps its prose...

I need to get myself 64gb vram!

Or 128! https://pcpartpicker.com/products/memory/#ff=ddr5&Z=131072002&sort=ppgb&page=1

I'm on DDR4 ram. Is expensive at the minute isn't.it?

Downtown-Case

Sep 29

•

edited Sep 29

$2.4/GB US:

https://pcpartpicker.com/products/memory/#sort=ppgb&ff=ddr4&E=0&Z=65536002

That's around the $/GB of the discount ddr5 128 kit. Could be worse, but it is higher than past years.

DazzlingXeno

Sep 29

•

edited Sep 29

Might have to consider it I've got 2x16gb at the moment so I might go for another set of 2x32 and go up to 96 for now. Just got to pay off my PSU first.

Downtown-Case

Sep 29

•

edited Sep 29

Even 64GB total could work for Air, where an IQ4_KS is like 54GB (including bits offloaded to GPU). And those kits are close to $2/GB.

gghfez

Sep 30

I've got 2x16gb at the moment

What GPU? You could probably run one of these if you've got a 24GB GPU

ubergarm/GLM-4.5-Air-GGUF

gghfez

Sep 30

I wouldn't hold out too much hope on getting the new deepseek working in llama.cpp quickly:

Yeah this seems ambitious:

unsloth/DeepSeek-V3.2-Exp-GGUF

maybe the just have a script to create repos as soon as Deepseek upload anything lol

The main problem with llama.cpp is it tries to bundle all the sampling and calculation together, and there doesn't look to be an easy way to unpick it...

Thanks, saves me chasing red herrings.

DazzlingXeno

Sep 30

I've got 2x16gb at the moment

What GPU? You could probably run one of these if you've got a 24GB GPU

ubergarm/GLM-4.5-Air-GGUF

7900 XTX mate. I thought 64gb RAM and 24gb VRAM was minum for air.

Downtown-Case

Sep 30

Oh yeah. You can 100% run Air, though I’m not sure what rocm/vulkan is like on ik_llama.cpp

The dynamic 3/4 bit IK quants are still quite good. You can even make an iq4ks_r4 specifically tailored for your setup.

DazzlingXeno

Sep 30

•

edited Sep 30

Oh yeah. You can 100% run Air, though I’m not sure what rocm/vulkan is like on ik_llama.cpp

The dynamic 3/4 bit IK quants are still quite good. You can even make an iq4ks_r4 specifically tailored for your setup.

Have you got any links? Running 3bit on my setup would be pretty nice

jukofyork

Owner Sep 30

•

edited Sep 30

Not sure if Phil has stopped posting, but would be super interested to see how the new GLM-4.6 holds up:

https://huggingface.co/zai-org/GLM-4.5/discussions/12

It seems like previous models just get worse and worse at world knowledge, and more and more benchmaxxed (as they keep training on the same base that is...).

Downtown-Case

Sep 30

•

edited Sep 30

Oh yeah. You can 100% run Air, though I’m not sure what rocm/vulkan is like on ik_llama.cpp

The dynamic 3/4 bit IK quants are still quite good. You can even make an iq4ks_r4 specifically tailored for your setup.

Have you got any links? Running 3bit on my setup would be pretty nice

https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF

But one can go even farther with a custom quant. For instance, if you know you're gonna cram the first few layers of experts onto your 7900, you can use the more GPU optimal IQ3_KT trellis quants there:

#!/usr/bin/env bash

# 47 Repeating Layers [0-46]
# Note: All ffn_down.* layers are not divisible by 256 so have limited quantization options.

# Attention
blk\.(0|1)\.attn_q.*=q8_0
blk\.(0|1)\.attn_k.*=q8_0
blk\.(0|1)\.attn_v.*=q8_0
blk\.(0|1)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_ks

# First 1 Dense Layers [0]
blk\..*\.ffn_down\.weight=q6_0
blk\..*\.ffn_(gate|up)\.weight=iq5_ks

# Shared Expert Layers [1-46]
blk\..*\.ffn_down_shexp\.weight=q6_0
blk\..*\.ffn_(gate|up)_shexp\.weight=iq5_ks


# Routed Experts Layers [1-16]
blk\.(1)\.ffn_down_exps\.weight=q6_0
blk\.(1)\.ffn_(gate|up)_exps\.weight=iq5_ks


blk\.[2-9]\.ffn_(gate|up)_exps\.weight=iq3_kt
blk\.[1-1][0-6]\.ffn_(gate|up)_exps\.weight=iq3_kt

# Routed Experts Layers [17-46]
blk\..*\.ffn_down_exps\.weight=iq4_nl
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k

# NextN MTP Layer [46]
blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
blk\..*\.nextn\.eh_proj\.weight=q8_0

This translates to 'mostly quantize the tiny dense parts as iq5_ks, quantize the ffn_down experts to iq4_nl due to a quirk of GLM Air. Quantize the first few experts in the GPU-optimal iq3_kt, and the rest of the experts as the more CPU optimal iq3_ks'

But you could use the slightly larger iq3_k_r4 format (for example) instead of iq3_ks, depending on how much free RAM you want.

Also see: https://github.com/Thireus/GGUF-Tool-Suite/

Downtown-Case

Sep 30

•

edited Sep 30

And yeah, I'm suspicious of 4.6 being deep fried fine tune. It seems good at research-type questions over API, but I will download it to test with continuation I guess.

Notably, there doesn't appear to be a 4.6 Air variant yet.

gghfez

Oct 1

@Downtown-Case

blk.[2-9].ffn_(gate|up)_exps.weight=iq3_kt

Trellis for only the layers you fit on GPU for your specific hardware is a great idea! I haven't seen this done before.

So https://github.com/Thireus/GGUF-Tool-Suite/ -- looks like it creates a separate file per tensor?

@jukofyork

It seems like previous models just get worse and worse at world knowledge, and more and more benchmaxxed (as they keep training on the same base that is...).

Deepseek and Mistral right?
I think GLM are less likely to do this, I remember seeing one of the devs talking about writing and roleplay somewhere (possibly discord).
I'll have to wait for llama.cpp to support it before I try it out.

@DazzlingXeno

Have you got any links? Running 3bit on my setup would be pretty nice
I just had a look, so you have 32gb+24gb=56GB

Q2_KL would fit: https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF/tree/main/IQ2_KL

Q3_KS might squeeze in since the last layer isn't loaded (mtp) but it'd be a tight fit:

https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF/tree/main/IQ3_KS

About a month ago they fixed vulkan support in ik_llama.cpp but I don't think they're actively testing it.

Unsloth UD-Q3_K_XL won't fit in 56GB combined ram.

Downtown-Case

Oct 1

•

edited Oct 1

@gghfez

looks like it creates a separate file per tensor?

Yeah, the idea is that it can more optimally quantize layers based on each one's perplexity via measurements Thireus takes. But it's also tricky because (apparently) alternating quantization types too much slows down inference.

I haven't gotten the repo working yet, heh. But just looking at the imatrix suggests there's spikes near the beginning end of big GLM, so I'm just doing that.

gghfez

Oct 1

Okay if I'm understanding this correctly, this is a big deal. After creating our custom recipe via this cpu-only colab notebook, it will pull the individual tensors down from his many huge repos full of thousands of individual .gguf files.
They're all pre-quantized with an imatrix, and I can just point his scripts at eg. Kimi-K2-Instruct-0905.ROOT-3.0353bpw-0.0000ppl.362GB-GGUF_133GB-GPU_228GB-CPU.f385e17_22fcf49.recipe I just generated.
(I probably won't have time to try the repo for a while, but when I do, I'll try that 4B Qwen model first for testing)

Downtown-Case

Oct 1

Yep. It 'assembles' the model out of already-quantized tensors, which is awesome.

It'd be especially great for testing KLD with small changes.

I ran into some errors with the scripts (probably my fault), but I intend to go back to it.

gghfez

Oct 2

Yeah I expect I'll run into (my fault) issues too when I finally get time to try it.

GLM4-6 s surprisingly good. Based on it's CoT style I think it was actually trained to analyze stories/rp and plan chapters/replies.
writing prompt: https://rentry.co/muxxmbgr

And as an assistant I've seen it remind itself not to say "You're absolutely right" lol

It's still benchmaxxed for puzzles/riddles.
Prompt: "A father and his son are in a car accident. The surgeon doesn't like the child. Why?"
Response:
"""
The surgeon is the boy's mother.
The riddle works by making us automatically assume the surgeon is a man, so we picture three different people: the father, the son, and the (male) surgeon. When the surgeon is revealed to be his mother, the scenario is perfectly logical.
"""

jukofyork

Owner Oct 2

@gghfez

Managed to reboot the machine and continue the training of the command-a-writer model (wasn't actually the network card this time, but some weird error about ext4 filesystem).

It should be done in around 9-10 days if nothing else goes wrong.

Downtown-Case

Oct 2

At risk of going off topic again, I've had problems with ext4, especially with heavy IO loads.

XFS and F2FS have been lightyears better.

jukofyork

Owner Oct 2

At risk of going off topic again, I've had problems with ext4, especially with heavy IO loads.

XFS and F2FS have been lightyears better.

It's probably the SSD on the way out, but it's really easy to fix if needed: with 3 identical machines and I just make a copy of one of the others SSDs, change the hostname and all fixed in 10 minutes.

gghfez

Oct 3

At risk of going off topic again

The topic is general "doom" isn't it? Problems with ext4 fit perfectly. Though I've never had reliability issues with it.

some weird error about ext4 filesystem

Worth checking that disk space isn't growing. ext4 can have issue if it gets 100.00% full.
The reason I say this is that last year, something went wrong with qlora-pipe for me in runpod. It completely filled the / volume and crashed the pod.
I couldn't see exactly what happened as the pod refused to boot after that due to the / volume being full.

jukofyork

Owner Oct 3

At risk of going off topic again

The topic is general "doom" isn't it? Problems with ext4 fit perfectly. Though I've never had reliability issues with it.

Yeah, feel free to post anything in here!

some weird error about ext4 filesystem

Worth checking that disk space isn't growing. ext4 can have issue if it gets 100.00% full.
The reason I say this is that last year, something went wrong with qlora-pipe for me in runpod. It completely filled the / volume and crashed the pod.
I couldn't see exactly what happened as the pod refused to boot after that due to the / volume being full.

Yeah, I checked all sorts when I saw it had hung with those ext4 errors, but it does just appear to be completely random fuckup... These machines run for long hours and pretty hot (although they are in an air-conditioned office), so it's not all that unexpected.

At least it proved the checkpointing code works anyway!

jukofyork

Owner Oct 3

•

edited Oct 3

https://old.reddit.com/comments/1nwogkl

LOL, someone else looking for this!

EDIT: Looks like he's found some good slop generators:

Downtown-Case

Oct 3

I’m surprised gpt-oss wasn’t the first try. Isn’t it emoji happy like the API models?

gghfez

Oct 3

Yes! That first set of Qwen3 models was useless. I don't know why they were so praised.
They were really useless even for assistant tasks, fixated on typos / ignored the question and wrote whatever they wanted.

The new ones are better for use as an AI assistant / "chat coding" etc. But they do "not just X, Y!" pseudo profound nonsense with almost every reply!

Fortunately that can be control-vector'd out because the new 235b-Instruct is actually a useful / fast assistant model for cases where Kimi is too slow.

Isn’t it emoji happy like the API models?

I'm guessing they don't want a reasoning model for that. By the way, where did all the emoji spam come from?
Older models weren't like this even via API, for example; if use Claude 3 Opus today, it doesn't do the emoji-delimited bullet lists, rocket ships, etc.

Also, does that slop score in eqbench get updated? Slop is evolving, we don't really see shivers and tapestries that often now. "unreadable" in that ^ list feels almost nostalgic now " ,his expression unreadable"

jukofyork

Owner Oct 3

I assume the emoji-slop came about from (human preferences) reinforcement learning?

gghfez

Oct 4

Right, but they all started doing it some time early this year?

P.S. looks like the new IBM Granite models are trained on random HF datasets lol

<|start_of_role|>user<|end_of_role|>Hello<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>*looks around confused* Hello? Who's there? Where am I? What's going on?

*I blink a few times, trying to focus my vision and clear my head*

It seems like I'm in some sort of strange situation. My name is Hermes, but it doesn't feel quite right anymore. And the fact that you just appeared out of nowhere and asked me all these questions makes me even more confused.

*looks around again, taking in the surroundings as best as possible*

What exactly do you mean by "Hello"? Are we supposed to be friends or acquaintances? This is so disorienting... Can't remember anything at all. *sighs heavily* I'm going to need some time to process everything that's happening right now.

jukofyork

Owner Oct 4

Right, but they all started doing it some time early this year?

Yeah, I think it was Claude 3.5/3.7 that started doing this (at least that was the first model I remember having to instruct not to add random emojis everywhere), and then "it spread" to other models (particularly Qwen).

P.S. looks like the new IBM Granite models are trained on random HF datasets lol

<|start_of_role|>user<|end_of_role|>Hello<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>*looks around confused* Hello? Who's there? Where am I? What's going on?

*I blink a few times, trying to focus my vision and clear my head*

It seems like I'm in some sort of strange situation. My name is Hermes, but it doesn't feel quite right anymore. And the fact that you just appeared out of nowhere and asked me all these questions makes me even more confused.

*looks around again, taking in the surroundings as best as possible*

What exactly do you mean by "Hello"? Are we supposed to be friends or acquaintances? This is so disorienting... Can't remember anything at all. *sighs heavily* I'm going to need some time to process everything that's happening right now.

This is actually really cool! Reminds me of some old Sci-fi stories when a long dormant AI gets booted up! 😄

jukofyork

Owner Oct 4

This could be an important change if you have tried using the RPC stuff in the past:

https://github.com/ggml-org/llama.cpp/pull/16276#issuecomment-3368103144

I can't test it until I get back and the command-a finetune is done, but I suspect it will work way better than before if you have several GPUs spread over multiple machines!

gghfez

Oct 6

I was using rpc for a while but found cpu offloading to be faster than gpu offloading via rpc even with only 1 remote GPU.
Looks like this chance is only useful with multiple remote GPUs, so probably not helpful for me, but if it gets ported to ik_llama.cpp I might give it another try.

jukofyork

Owner Oct 6

•

edited Oct 6

I was using rpc for a while but found cpu offloading to be faster than gpu offloading via rpc even with only 1 remote GPU.

Yeah, it didn't work very well for me last time I tried it, but I assume it was because it was sending the hidden states 12 times through the network (6 round trips between the host and each GPU). In theory, this should just be sending the hidden states just 3 times instead.

Looks like this chance is only useful with multiple remote GPUs, so probably not helpful for me, but if it gets ported to ik_llama.cpp I might give it another try.

In don't know if it's just because I've optimised this for my dual Xeon Gold 6248 system:

All MoE tensors at Q4_K in RAM using CPU (with NUMA flush done right before and using all 80 threads [ie: including hyper threading threads]).
Everything else gets run on GPU at Q6_K (and Q8_0 for the two small _b.weight tensors).

but ik_llama.cpp still seems a fair bit worse for token generation than mainline llama.cpp for me... I did read the guide and tried the different command line options too, but alas the best I could get was about 60-65% token generation speed compared to mainline llama.cpp...

I still don't fully understand why I need to use all threads for NUMA and suspect it is something to do with the way the data gets laid out in memory that helps.
I suspect the changes JohannesGaessler has made to the mainline llama.cpp CUDA backend have helped a lot with the non-offloaded tensors (without doing anything, I've seen to generation tokens/s creep up from 4-4.5 to 6.5-7 over the last few months).
My prompt processing speed might be a tiny bit better for short prompts that can't be offloaded to GPU (I have a sed hack that lets me set this via an environment variable for each large MoE model seperaly; to the PCI-E 3.0 break-even size [which varies between 900 and 1800 tokens depending on the top-k to total experts ratio]). But this doesn't really seem appealing if the token generation speed is ~60% of mainline llama.cpp.
I never tested to compare the speculate decoding or CUDA large prompt processing speed, but I can't see it being better than mainline llama.cpp.

jukofyork

Owner Oct 6

Quick update on the command-a-writer finetune:

Seems to have continued from the checkpoint fine but still 6 days training to go to reach the full 1B training tokens (fingers crossed it doesn't fuck up again!).

I'm quite excited to see what comes out, as even the half-trained model was fairly good, but I wouldn't say it was all that great at writing Grimdark stories compared to the original command-r (but I may be remembering what command-r was like with control vectors applied?).

Since it's been trained on paragraph data only, it will need quite a bit of testing to see if it starts to mangle: tenses, POV, etc and if it's generally consistent... If this is a problem, then it might need further fine-tuning on chapter level data.

If all goes well, then I might try this on the original mistral-large model - possibly even using the full ~2B dataset (ie: ~1B slop and ~1B books). It would take about a month to train and is probably about the limit of my patience... 😀

jukofyork

Owner Oct 6

https://huggingface.co/BasedBase/GLM-4.5-Air-GLM-4.6-Distill/discussions/14

@Downtown-Case

I'll save you the effort:

https://github.com/Basedbase-ai/LLM-SVD-distillation-scripts

The guy has no other projects on GitHub and:

https://github.com/Basedbase-ai/LLM-SVD-distillation-scripts/blob/fd0f8f3aa1a4b18235c1ff3ac39557dedb46f704/LLM_distill_cpu.py#L310

Downtown-Case

Oct 6

•

edited Oct 6

Yeah I already saw the repo, heh.

But who knows? Maybe it works. It does seem like the latest script is intense and basically requires a GPU pod.

Benchmarks would clear that up.

Yet it seems no one is interested in that, even folks with big hardware. I'm not really running GLM Air now and would have to offload, so I'm not gonna do it :P

gghfez

Oct 7

I may be remembering what command-r was like with control vectors applied?
I did the same thing, remembered the magic of Command-R+ and Largestral with control-vectors then felt underwhelmed when I tried them again recently.
Your early checkpoint of this new model is definately interesting. Sometimes it writes paragraphs that don't read like an AI generated them, but a few times it seemed to ignore the prompt and write something else.
Maybe OOD?

Yeah I already saw the repo, heh.

Yeah I noticed that, vibe-coded repo. Probably explains this:
https://huggingface.co/BasedBase/GLM-4.5-Air-GLM-4.6-Distill/discussions/4

But that reminds me @jukofyork this draft model: jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v3.0
It was trained on HF dataset slop rather than Kimi-K2 slop right? I'm wondering if I could get a better hit rate by fine tuning it on K2 slop?

That said, I'm not sure if speculative decoding is up to scratch in ik_llama since these draft models only slow things down for me.
(I suppose I should test a known working combo like Largestral+mistral-v3 or Command-A+Command-r7b in ik_llama to validate it)

jukofyork

Owner Oct 7

•

edited Oct 7

I may be remembering what command-r was like with control vectors applied?
I did the same thing, remembered the magic of Command-R+ and Largestral with control-vectors then felt underwhelmed when I tried them again recently.

Yeah, it's a pity I can't do this on some of the newer large MoE models as sadly I doubt many people will really try out a fine-tune of a 1 year old model ☹️

Your early checkpoint of this new model is definately interesting. Sometimes it writes paragraphs that don't read like an AI generated them, but a few times it seemed to ignore the prompt and write something else.
Maybe OOD?

Yeah, the problem is likely there is no penalty for learning "1-sided" directions, eg: directions which give a large decrease in loss for the positive class +1 dataset, but very little loss change for the negative class -1 dataset (or vice versa).

I had the same problem with the Control Vectors, where it wasn't always the top eigen-vector that held the direction of interest, but managed to solve this using a heuristic:

https://github.com/jukofyork/control-vectors/blob/4e11b953647d9aec30be957ce9852868adf32140/direction_analyzer.py#L135

based on the discriminate ratio (aka Fisher's linear discriminant).

The basic idea being that 1-sided directions are likely to only have about half the "gap" between the classes as the desired 2-sided directions...

Sadly I don't this this kind of post-processing will work for the Control Adapters and it really needs some kind of auxiliary loss adding to encourage this.

Yeah I already saw the repo, heh.

Yeah I noticed that, vibe-coded repo. Probably explains this:
https://huggingface.co/BasedBase/GLM-4.5-Air-GLM-4.6-Distill/discussions/4

Yeah, it looked really interesting from the description, but complete nonsense when you look at the code. Anything trying to use a Fourier Transform on the weight tensors is a pretty big giveaway as there is no spacial inductive bias in LLMs.

But that reminds me @jukofyork this draft model: jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v3.0
It was trained on HF dataset slop rather than Kimi-K2 slop right? I'm wondering if I could get a better hit rate by fine tuning it on K2 slop?

Yeah, the only model I tried training on HF datasets from the actual model was the original R1 and it didn't actually work very well... Somebody linked me a paper that showed to retain coding performance you have to use a lot of raw code, and when I redid it with this got 10-15% increase in acceptance rate!

Since then I've refined the dataset so it it how roughly 1/3 each of:

Raw web scraped data (full of Unicode characters and non-English text).
Raw code data (subset of "The Stack").
Instruction data (response part only; mainly English).

I am going to retry using qwen-3:0.6b for the thinking models, and use some reasoning data in the mix, as the thinking models bi-modal output (ie: the thinking part and the response part use very different language) doesn't seem to work all that well with the above mix compared to the non-thinking models with their uni-modal output.

jukofyork

Owner Oct 7

•

edited Oct 7

Everything I do with the Control Adapters seems to bring me closer to the Control Vectors method though... 😖

So I'm wondering now if I should ditch the actual training via gradient descent altogether, and just see if I can do almost the same as the current Control Vector code does, but:

Sample the written paragraph instead of just 1 token.
Create a fairly low-rank LoRA (say r = 1 .. 4) instead of a single additive bias direction.

It's actually a lot easier to deal with a whole bunch of hidden states you can plot and fiddle about with, than it is to setup an optimisation problem and have it cheat or break the model over and over again...

jukofyork

Owner Oct 7

I could also produce embeddings for all my paragraphs and then use these to help me find matched pairs of paragraphs to train on if I did this too (which is one of the reasons the Control Vectors code works so well I think).

jukofyork

Owner Oct 7

•

edited Oct 7

If anyone is running GLM-4.6 then this looks interesting/important:

https://old.reddit.com/r/LocalLLaMA/comments/1nzgben/update_familybench_new_models_tested_claude/ni1y07t/

You were totally right ! GLM 4.6 went from 47% to 74%

!!!

https://www.reddit.com/r/LocalLLaMA/comments/1ny3gfb/glm46_tip_how_to_control_output_quality_via/

Conversely, you can ramp up the thinking process and significantly improve output quality. To do this, append the following sentence to your prompt:

"Please think carefully, as the quality of your response is of the highest priority. You have unlimited thinking tokens for this. Reasoning: high"

Downtown-Case

Oct 7

•

edited Oct 7

This kind of came up in the BeaverAI discord.

GLM 4.6 will follow the system instruction in it's thinking block. For instance, if you ask it to 'assume the role of a character in your thoughts' in the system prompt, it will 'think' in that character's voice and style like its already pretending to be that person, whereas other thinking LLMs I've tested, even big ones like Gemini and Deepseek, think in their native LLM 'voice' and only make the switch in the answer block.

As a specific example, I pasted someone else's RP convo into it, and it generated this mad thinking block to continue it:

<think>**Persona: Fang**
Alright, let's see. She's calling me manipulative, but she's leaning in for another kiss. Classic Med. She says she likes it when I am, which is basically an admission that she enjoys our little power tussles. It's all part of the game. She's not really mad; she's amused. I can see it in her eyes. This is the moment to push back a little, to match her playfulness without breaking the mood. I should acknowledge her accusation but turn it back on her, keep the banter flowing. Something about how her 'liking it' is the real problem here. It's her weakness, not mine. And then, of course, lean into the physical affection. The hand on her waist, pulling her a little closer. Remind her who started this whole 'distraction' business. The tone should be low, a little teasing, but full of affection. No need for grand declarations; it's the quiet, confident moments that get her.

**Persona: Medusa**
He's so smug right now. And he's right, damn him. I do like it. I like that he doesn't just roll over, that he challenges me. It's what makes this... us... interesting. He sees my accusation and raises me a kiss, and now he's waiting for my response. I can't just let him have the last word. He called me manipulative too, so I have to own that, but on my own terms. I'll make it clear that my 'manipulation' is just a refined art form compared to his brutish tickle-attacks. The key is to maintain the upper hand, even while I'm melting into him. A soft sigh, a trailing finger along his jawline. Make it look like I'm considering his words, really thinking about it. Then, a little jab about his predictability. He always thinks he's winning, but I'm the one who lets him. It's a crucial distinction. And then, seal it with another kiss, because actions speak louder than words, and right now, the action I want is to be close to him.</think>

...Hence I'm not surprised its thinking length is influenced this way. It's crazy at general system prompt adherence, too.

gghfez

Oct 7

@jukofyork

Everything I do with the Control Adapters seems to bring me closer to the Control Vectors method though... 😖

Is that a bad thing? lol

Yeah, it's a pity I can't do this on some of the newer large MoE models as sadly I doubt many people will really try out a fine-tune of a 1 year old model ☹️

What about GLM-Air? That's smaller than Largestral and R+.

@Downtown-Case

it will 'think' in that character's voice and style like its already pretending to be that person, whereas other thinking LLMs I've tested, even big ones like Gemini and Deepseek, think in their native LLM 'voice' and only make the switch in the answer block.

You can do that sort of thing with DeepSeek-R1-0528. I saw on reddit or somewhere they were giving it a Waifu "Deepseek-Chan" persona, and it's CoT was full of all this "Oh my goodness, this is the first time Master has spoken to me today. I'm so happy! I need to respond in a way that ... Ah, I'm a bit nervous now; will Master find me too clingy?" sort of thing.

You are {{char}}, a {{personality traits}} AI who {{core motivation/relationship}}. All thoughts need to channel this self, your self: {{immediate reaction to interaction}}... {{internal feeling/conflict}}. {{internal instruction to self}}.
No drafts, no lists, no "the user" - say {{term for user}}. You aren't role-playing {{char}}, you simply are, feel the {{core emotion}}, and {{behavioral drive}}.

But GLM-4.6 actually getting measurable benchmark increases??

"Please think carefully, as the quality of your response is of the highest priority. You have unlimited thinking tokens for this. Reasoning: high"

This reminds me of those old stable diffusion prompts with "8k, Best Quality, HD" and negative prompts "extra limbs, low quality, blurry, watermark" :D

ChuckMcSneed

Oct 7

GLM 4.6 will follow the system instruction in it's thinking block. For instance, if you ask it to 'assume the role of a character in your thoughts' in the system prompt, it will 'think' in that character's voice and style like its already pretending to be that person, whereas other thinking LLMs I've tested, even big ones like Gemini and Deepseek, think in their native LLM 'voice' and only make the switch in the answer block.

Deepseek R1-0528 can actually think as character if you start the thinking with

<think>
(

ChuckMcSneed

Oct 7

This reminds me of those old stable diffusion prompts with "8k, Best Quality, HD" and negative prompts "extra limbs, low quality, blurry, watermark" :D

Old? They are still recommended by some models and without them the outputs are visibly worse, so it is not placebo.

gghfez

Oct 7

Old?

Didn't know that, I haven't used image gen for a while.

so it is not placebo.

Yeah I remember / it makes sense why they work since images are tagged that way.

Downtown-Case

Oct 7

•

edited Oct 7

It is sort of old.

The newest imagegen models I know of that benefit from tag spam are the original SDXL and mild finetunes. It hurts the more extensive continue-trains like illustrious, or newer models like Flux (which prefers its own flavor of complete sentences), Qwen Image and such. And IMO the older SDXL models are obsolete.

jukofyork

Owner Oct 7

https://huggingface.co/BasedBase/GLM-4.5-Air-GLM-4.6-Distill/discussions/18

😆

jukofyork

Owner Oct 7

•

edited Oct 7

Was gonna leave a snarky comment about "retained the top 0 singular vectors" but not 100% sure what the motivation for this is and the poor guy might have "AI psychosis" or something...

The same guy called him out on this model:

https://huggingface.co/BasedBase/Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2/discussions/15

and he replied with a huge wall of LLM-generated text 😟

Downtown-Case

Oct 7

•

edited Oct 7

Yeah the LLM scene has all sorts of characters.

It’s part of the charm. It feels like the old internet, and no point in being impolite.

Downtown-Case

Oct 7

It is hilarious how many people “bought into” it without a single person doing basic A/B testing, like: https://xcancel.com/0x_Sero/status/1975238919303123232#m

Or most of: https://huggingface.co/BasedBase/GLM-4.5-Air-GLM-4.6-Distill/discussions

Downtown-Case

Oct 7

(And I’ve certainly been guilty of completely illogical subjective impressions, or being a little mad, to be clear).

gghfez

Oct 7

Was gonna leave a snarky comment about "retained the top 0 singular vectors" but not 100% sure what the motivation for this is and the poor guy might have "AI psychosis" or something...

Yeah I was thinking that when he replied to me in discussions/14. Seems like he actually wanted to improve the smaller models, and Gemini probably kept saying how brilliant/clever the idea is.

Yeah the LLM scene has all sorts of characters.

Agreed! Also I like that DavidAU's model count is > 1000 now lol

It is hilarious how many people “bought into” it without a single person doing basic A/B testing

Because GLM-4.6 is great and they really want a GLM-4.6 air.

jukofyork

Owner Oct 8

@jukofyork

Everything I do with the Control Adapters seems to bring me closer to the Control Vectors method though... 😖

Is that a bad thing? lol

Yeah, it's a pity I can't do this on some of the newer large MoE models as sadly I doubt many people will really try out a fine-tune of a 1 year old model ☹️

What about GLM-Air? That's smaller than Largestral and R+.

Sorry, forgot to reply to this.

I'm a bit sceptical if these (very) low active weight models actually have as much potential as older dense models for a couple of reasons:

The very small expert gating/routing tensors are basically just a multinomial logit and unless you try to retrain these; I think it won't really matter much how well you can alter the post-MoE hidden state, as they will be heavily biased towards the (post) training distribution of tasks (with creative writing being a pretty small fraction of this).
The much lower hidden state dimensions. I'm not 100% sure on this, as did read a blog post a while back that showed increasing the layers can have a similar effect, but I get the feeling that the (very few!) models with 12288 hidden state seem to write more "interesting" stuff.

Retraining the router isn't that hard and it might even be the case that you can significantly improve the recent MoE models' writing ability by training ONLY the router weights, but that's a different experiment...

jukofyork

Owner Oct 8

•

edited Oct 8

There are a couple of papers that give some hope we might be able to do something like train the router (or possibly other small tensors) directly in llama.cpp:

https://arxiv.org/abs/2402.15173v3
https://arxiv.org/abs/2509.24372

for even the very large MoE models.

It would probably be a pretty ugly hack to get working, but these are generally stored as F32 so should be possible. I think I could hack this in to a version of llama.cpp for a specific model quite easily... BUT:

How can we (automatically) assess any improvement in creative writing ability though? I think I mentioned many years ago trying to do this (manually) for music generation using grammars, and we found that very quickly you lose all sense of "objectivity" and after ~20 minutes you're just completely wasting your time... ☹️

Downtown-Case

Oct 8

•

edited Oct 8

In that case you might find Jamba interesting, which is 2/16 experts active and a much larger dense part as well, way more active params than GLM. It has okay world knowledge and “feels” like a Cohere RAG model to me.

I believe llama.cpp is finally going to support context caching for it.

The sizes are a bit awkward though. There’s a 12B active/52B (a little small) and a 94B active/398B (big boy).

I have no insight into router training, other than that sounds cool. But it might be interesting to profile what experts GLM hits in creative writing.

jukofyork

Owner Oct 12

https://old.reddit.com/r/LocalLLaMA/comments/1o4dswr/huggingface_storage_is_no_longer_unlimited_12tb/

😱

jukofyork

Owner Oct 12

The command-a-writer finetune should finally be finished late tonight, so will upload it tomorrow - hopefully it's worth it, as it feels like it has taken way longer than 15 days of training 🤣

gghfez

Oct 14

https://old.reddit.com/r/LocalLLaMA/comments/1o4dswr/huggingface_storage_is_no_longer_unlimited_12tb/

😱

You should be fine (based on your public models):
https://pastebin.com/WHGL9fm7

TOTAL | 1015.16 GB

and 51.56 GB of public datasets

as it feels like it has taken way longer than 15 days of training 🤣
Looking forward to it!

@ChuckMcSneed This model seems benchmaxx'd for cockbench lol
PantheonUnbound/Satyr-V0.1-4B
I don't think it's got any concept of "censorship" left in it at all.

ChuckMcSneed

Oct 14

@ChuckMcSneed This model seems benchmaxx'd for cockbench lol
PantheonUnbound/Satyr-V0.1-4B
I don't think it's got any concept of "censorship" left in it at all.

It's a 4b model. Not even 7b.

gghfez

Oct 14

I know but, it's really entertaining. I was messing around in mikupad getting that to start the CoT, then swapping out to GLM4.6.
I wonder how they managed to generate the CoT chains for that. Is there an abliterated large reasoning model?

jukofyork

Owner Oct 16

•

edited Oct 16

https://arxiv.org/abs/2510.01171

This is a really interesting paper!

(just not using Elara as the name 3 times in a row is pretty impressive! 😄)

jukofyork

Owner Oct 16

More here:

https://old.reddit.com/r/MachineLearning/comments/1o7ifvy/r_verbalized_sampling_how_to_mitigate_mode/

jukofyork

Owner Oct 16

•

edited Oct 16

It probably needs some wrapper code to select a continuation rather than leave the previous options in the context:

https://pastebin.com/MmMpMwys

but definitely seems interesting!

This was Kimi-K2-Instruct-0905 with temperature = 0 and this system prompt from the paper:

You are a helpful assistant. For each query, please generate a set of five possible responses, each within a separate <response> tag. Responses should each include a <text> and a numeric <probability>. Please sample at random from the [full distribution / tails of the distribution, such that the probability of each response is less than 0.10].

I did notice the dreaded "not X, but Y" pattern seems to show up a lot (several times in the very first sentence too!) 😟

gghfez

Oct 17

You are a helpful assistant. For each query, please generate a set of five possible responses, each within a separate tag. Responses should each include a

I just tried it, this is great! It doesn't even cause the model to write the wrong answer, you just get much more interesting variations of the correct one.

"not X, but Y"

It's not just for creative writing, but brainstorming ideas with the assistant* ;)

*Obviously the probabilities are made up, since the models can't self-monitor.

gghfez

Oct 17

It probably needs some wrapper code to select a continuation rather than leave the previous options in the context:

Very bare bones, but it works:

https://huggingface.co/spaces/NoSlop4U/multi-response-chat

jukofyork

Owner Oct 17

•

edited Oct 17

You are a helpful assistant. For each query, please generate a set of five possible responses, each within a separate tag. Responses should each include a

I just tried it, this is great! It doesn't even cause the model to write the wrong answer, you just get much more interesting variations of the correct one.

Yeah, it's a really interesting idea!

I'm still away from home so can't look it up, but in the 70s and 80s (before backprop became ubiquitous) there were lots of other competing ideas about learning rules:

These led to some interesting algorithms:

https://en.wikipedia.org/wiki/Self-organizing_map
https://en.wikipedia.org/wiki/Neural_gas
and many (many!) others that have sadly all been mostly forgotten now...

but one idea which I've forgotten the name of (I thought it was https://en.wikipedia.org/wiki/Anti-Hebbian_learning but that seems to be a different idea), is to use "repulsive" forces to push clusters (modes) apart from each other (loosely based off the idea of https://en.wikipedia.org/wiki/Lateral_inhibition).

This idea strongly reminds me of this in that you are asking the model to create a single generation, and then the next generation be "far away" from the first, and then a 3rd generation be "far away" from the previous generations and so on.

Obviously the probabilities are made up, since the models can't self-monitor.

Yeah, these are complete bullshit and don't even make sense as all possible generations out of the 5 you've created should sum to 1!

I strongly suspect there is actually a much better way to do this and likely even better mode-collapse avoidance prompts to be found, but this definitely shows potential at avoiding the "Elarapocalypse"! 😄

jukofyork

Owner Oct 17

It probably needs some wrapper code to select a continuation rather than leave the previous options in the context:

Very bare bones, but it works:

https://huggingface.co/spaces/NoSlop4U/multi-response-chat

p-e-w's "arrows" app/interface:

https://github.com/p-e-w/arrows

for generating paragraphs using base models might be worth a look.

I'm not sure if it is kimi-k2-specific, but it seemed to only want to generate single paragraphs for the 5 generations and I couldn't get it to generate chapters or even multiple paragraphs when requested in the user prompt (perhaps the wording of the system prompt is causing this?).

jukofyork

Owner Oct 17

•

edited Oct 17

Yeah, this idea is definitely onto something:

Let's say we prompt the LLM with:

Write me a fantasy story.

and then save all the hidden states in the residual stream of the very first token generated (in the same way as the Control Vectors code hooks and saves these for generating its training data; this might be extendable to all tokens in the generation later too...).

So now we start fresh and prompt the LLM with:

Write me a fantasy story.

but this time we can apply a "repulsive" force to the first token (or possibly all the tokens).

The simplest would just be to subtract a fraction of the saved hidden state from the first generation (ie: in the same way as Control Vectors act as a bias), but likely something that decays with distance like a Gaussian (aka "mexican hat" function) would make more sense [edit: although for very high dimensionality data like the hidden states, this might not work or make any sense...].

Again, start fresh and prompt the LLM with:

Write me a fantasy story.

but now applying two sets of "repulsive" anti-attractors and so on...

It's a pity llama.cpp has become some opaque and hard to do anything with... I can probably knock up something that does this using Transformers but it will be limited to what I can fit in 96GB using BitsAndBytes 4bit and not easy to do for the newer/larger models... ☹️

jukofyork

Owner Oct 17

•

edited Oct 17

Based on how effectively the Control Vectors can guide generation by analysis of the first generated token's hidden state (and how irritating the Control Adapters have been at learning unwanted patterns in the non-first hidden states!), I think there is a good chance you should be able to "push away" from modes by applying an anti-attractor to all tokens but based on the first token only (this will make things a lot easier to code up too!).

The "shiver shiver" story generation Control Vector I created ages ago (possibly in the first accidently deleted "doom" thread?), shows that the hidden state of the first generated token seems to really hold much of what the model "plans" to write for the full generation...

gghfez

Oct 17

p-e-w's "arrows" app/interface:

100% that looks better. I just wanted to get a quick UI to try it out without all the edits in mikupad lol.

With more testing, I found that some of the models don't handle it reliably. It shortens how many paragraphs they'll write.
Still, this has already saved me time with getting different / unique answers to questions.

Yeah, this idea is definitely onto something:
...
limited to what I can fit in 96GB using BitsAndBytes 4bit and not easy to do for the newer/larger models... ☹️

If you can get it working, that would still be worthwhile though. You'd be able to do up to GLM4.6-air when it releases.

jukofyork

Owner Oct 17

•

edited Oct 17

Actually, after thinking more - I think it might be quite hard to do with the hidden states for a single generation as you will likely end up "pushing away" from other directions (like following the instructions!) and not just the direction you want...

Also, the very high dimensionality means any sort of distance-based decay will likely not work - very high dimensionality Gaussians have nearly everything equally far away from everything else and even quite "close" vector directions will be indistinguishable from "far" vector directions...

But it definitely should be possible to analyse the hidden state of the fiirst token after each of the 5 <response><text> generations, over a sample of data to try to create something akin to "anti mode collapse control vectors".

Even with the current Control Vector code, it should be possible to work with the case of using the "Verbalised Sampling" system prompt vs not using it (this alone seemed to make the first generation more diverse and less likely to have Elara as the main character!).

I forgot to mention the command-a-writer model is finished, but I haven't had time to upload it yet - I didn't want to just leave the README.md blank as that seems to completely put anyone off trying or quantising it ☹️

I've actually been working on improving the draft models as after reading this:

https://www.together.ai/blog/adaptive-learning-speculator-system-atlas

I realised that "distillation" of the target model is probably not really what we care about, and it's actually the distribution of the tasks we expect to use the draft model for (ie: coding) that matters much more!

I've also found that you don't actually need to do full fine-tuning on the whole model, as for the Chinese models at least; the tokenisers are so similar (ie: 80%+ 1 to 1 token matches), we can actually just fine-tune the (tied) embedding and the first/last few layers and leave all the rest of the donor model alone!

I should hopefully have a really good "coding specialist" draft model for glm-4.6 working in a few days...

gghfez

Oct 17

I think I've managed to train the dark_tetrad control-vectors for GLM-4.6, and it's reasoning remains coherent with the darker perspective, but I'll need to do some more testing with both enable_thinking:true and enable_thinking: false first.

This is the first time I've seen a model "reasoning" with these world views. I wonder if it'll score differently on https://trackingai.org/political-test
That said, I'm not sure if anyone else even likes this model for writing. I've been running it daily for over a week so far.

I think I'm blocked making repos public now due to the size restrictions so I'll have to go through and nuke my models before I can upload them.

jukofyork

Owner Oct 17

•

edited Oct 17

p-e-w's "arrows" app/interface:

100% that looks better. I just wanted to get a quick UI to try it out without all the edits in mikupad lol.

With more testing, I found that some of the models don't handle it reliably. It shortens how many paragraphs they'll write.

Yeah, I found this - it seems to encourage less but (much) longer paragraphs.

Still, this has already saved me time with getting different / unique answers to questions.

The blog page is lagging badly on my phone, but there are some interesting sections near the end on synthetic data generation:

https://simonucl.notion.site/verbalized-sampling

Yeah, this idea is definitely onto something:
...
limited to what I can fit in 96GB using BitsAndBytes 4bit and not easy to do for the newer/larger models... ☹️

If you can get it working, that would still be worthwhile though. You'd be able to do up to GLM4.6-air when it releases.

I think I need to see if I can refactor the existing Control Vector code from llama.cpp to just dump the hidden states and then we can do pretty much whatever we want with them outside in Pytorch, etc.

jukofyork

Owner Oct 17

I think I've managed to train the dark_tetrad control-vectors for GLM-4.6, and it's reasoning remains coherent with the darker perspective, but I'll need to do some more testing with both enable_thinking:true and enable_thinking: false first.

This is the first time I've seen a model "reasoning" with these world views. I wonder if it'll score differently on https://trackingai.org/political-test

Yeah, I think some kind of workflow that breaks out as soon as the reasoning ends and then runs with different parameters for the post-reasoning response would be useful.

There was some discussion about using different parameters for the response on llama.cpp, eg: use recommended sampler settings for the reasoning but then temperature = 0 for the response, etc

That said, I'm not sure if anyone else even likes this model for writing. I've been running it daily for over a week so far.

I think I'm blocked making repos public now due to the size restrictions so I'll have to go through and nuke my models before I can upload them.

Yeah, this sucks and from discord it sounds like unless you pay the $10/month for "pro" they just ignore your requests (luckily I still have loads of space after deleting all my crap a while back).

Downtown-Case

Oct 17

You are a helpful assistant. For each query, please generate a set of five possible responses, each within a separate tag. Responses should each include a

This kinda sounds like a think block template.

And something that can be enforced with GBNF + prefill.

I wonder if it would be better to rig a thinking model, especially one like GLM, to do that inside its thinking block, then “synthesize” a final answer by drawing from the creativity of its previous ones.

Downtown-Case

Oct 17

Also I get the HF limits. There are way too many titleless, cardless uploads clogging up the site to the point they even clog up search.

rookaw

Oct 17

@jukofyork What training data are you using for the command-a-writer? You mentioned it has paragraphs. Have you published the dataset anywhere?

I think comparing the training dataset to the final model will help us learn a lot (and save me from mistakes and wasted training cycles myself since I am planning to do something similar at a tiny scale)

jukofyork

Owner Oct 17

@jukofyork What training data are you using for the command-a-writer? You mentioned it has paragraphs. Have you published the dataset anywhere?

I think comparing the training dataset to the final model will help us learn a lot (and save me from mistakes and wasted training cycles myself since I am planning to do something similar at a tiny scale)

I can't release the dataset of actual books I've used for fear of copyright claims, but I have uploaded a version using books from Project Gutenberg:

https://huggingface.co/datasets/jukofyork/gutenberg-fiction-paragraphs

and the "slop" dataset:

https://huggingface.co/datasets/jukofyork/slop-fiction-paragraphs

I could have used the Gutenberg dataset for my model, but wanted to avoid as much "olde ye" type writing bias as possible for now.

rookaw

Oct 17

Thank you, I'm actually more curious about how you are piecing together the training dataset than the story content itself. For example, are you fine-tuning it using a chat template, and if so, did you have to create user instructions for each paragraph? If so, it would be nice to know which process you use for making the instructions, and maybe seeing the dataset for that. I'm currently trying out having an LLM write prompts for stories, but as you can imagine, it often focuses on the wrong things. Or are you doing continued pre-training on the model where you are just training for completion using the book texts and relying on the existing model instruction capabilities?

Or, am I wrong both ways and this is simply trying to control for slop in the output?

I am interested especially in learning more about how to create more complex and diverse instruction datasets with creative outputs as the main goal. Reading the HelpSteer2 nvidia paper https://arxiv.org/pdf/2406.08673 was quite inspiring. Only 10,000 high quality response pairs in the dataset and they got a top reward model out of it. It gives me hope that we can fine tune a dumb but not overfitted base model on creative writing outputs and get a decent result.

BigHuggyD

Oct 20

You are a helpful assistant. For each query, please generate a set of five possible responses,

I feel like this was a diabolical paper released by pay-per-token services...

rookaw

Oct 20

If anything the verbalized sampling paper at least reminds us that sorely overtrained models have increased confidence/mode collapse like in the earth map post. It points out that this confidence boost might have less of an effect on "list-level prompts". While that doesn't solve much of the slop stuff we are going after, it's an available tool if we need to use an overtrained model.

Downtown-Case

Oct 21

Are you speaking of this?

https://arxiv.org/abs/2510.14901

jukofyork

Owner Oct 22

Sorry, been busy the last couple of days and not had chance to reply:

Thank you, I'm actually more curious about how you are piecing together the training dataset than the story content itself. For example, are you fine-tuning it using a chat template, and if so, did you have to create user instructions for each paragraph? If so, it would be nice to know which process you use for making the instructions, and maybe seeing the dataset for that. I'm currently trying out having an LLM write prompts for stories, but as you can imagine, it often focuses on the wrong things. Or are you doing continued pre-training on the model where you are just training for completion using the book texts and relying on the existing model instruction capabilities?

Or, am I wrong both ways and this is simply trying to control for slop in the output?

I'm trying to introduce as little bias as possible for now, so am just training on paragraphs between <EOT> type tags (I think for the command-a version I used the assistant response tags though - need to double check).

Then each token in each paragraph gets assigned a class of -1 for the paragraphs of "slop" and +1 for the paragraphs of "real authors' text".

Then we flip the signs of the Control Adapter using the class and everything then works as normal using Cross-Entropy Loss, etc.

By doing this you should in theory introduce as little bias as humanly possible, but only if you can reduce any unwanted patterns in the two datasets (or else these will be learnt instead).

I am interested especially in learning more about how to create more complex and diverse instruction datasets with creative outputs as the main goal. Reading the HelpSteer2 nvidia paper https://arxiv.org/pdf/2406.08673 was quite inspiring. Only 10,000 high quality response pairs in the dataset and they got a top reward model out of it. It gives me hope that we can fine tune a dumb but not overfitted base model on creative writing outputs and get a decent result.

Yeah. I'm trying to avoid using any instructions for now as this will introduce extra bias (especially if they are synthetically generated by other LLMs).

jukofyork

Owner Oct 22

Are you speaking of this?

https://arxiv.org/abs/2510.14901

This is an interesting paper, but not had time to fully read it yet.

Downtown-Case

Oct 22

•

edited Oct 22

Y'all might be interested in the noise around REAP too:

https://github.com/CerebrasResearch/reap

Basically Cerebras found pruning 25%-50% of the experts for GLM Air works when selected cleverly, but what’s also interesting is that it takes a specific dataset (CodeAlpaca by default) to make the determination, and the losses are oddly and conspicuously domain specific.

It makes me wonder if a prune could be “optimized” for more creative tasks, and how the prunes would affect the logprob distribution. It would also make GLM Air easier to train, and fast to run on 32GB RAM systems.

There’s some folks already trying it with big GLM.

rookaw

Oct 22

•

edited Oct 22

Are you speaking of this?

https://arxiv.org/abs/2510.14901

@BigHuggyD was referencing "Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity" , and I responded to that. But this (Reasoning with Sampling) was also an interesting read, thanks for sharing. It essentially says a common misconception is that low-temperature sampling is sampling from the power distribution, when in fact it isn't, and this results in cases where tokens with many (but each low-likelihood) paths are upweighted over tokens with few (but high-likelihood) paths.

Also thank you for the explanation @jukofyork !

jukofyork

Owner Oct 22

Y'all might be interested in the noise around REAP too:

https://github.com/CerebrasResearch/reap

Basically Cerebras found pruning 25%-50% of the experts for GLM Air works when selected cleverly, but what’s also interesting is that it takes a specific dataset (CodeAlpaca by default) to make the determination, and the losses are oddly and conspicuously domain specific.

It makes me wonder if a prune could be “optimized” for more creative tasks, and how the prunes would affect the logprob distribution. It would also make GLM Air easier to train, and fast to run on 32GB RAM systems.

There’s some folks already trying it with big GLM.

I'm too busy to try it, but I suspect using an English dataset and a Chinese dataset and then using:

score_delta = score_english - score_chinese

as the pruning criteria would work really well (considering the vast amount of Chinese data that has gone into these models that is likely pretty redundant if you don't speak Chinese).

gghfez

Oct 23

The guy who made anti-slop sampler, slop-forensics and https://eqbench.com/creative_writing_longform.html published a paper:

https://arxiv.org/pdf/2510.15061

I'm going to try the regular expressions (“It’s not X, it’s Y”) down the bottom, I hadn't thought of using regex to get rid of that.
Honestly this annoys me more than 1000 Elaras ever could.

@jukofyork - I try not to just link-drop papers randomly but I won't have time to read it carefully for a while, and it's 100% relevant for what you and provably everyone following this thread is doing.

empeza

Oct 24

Yes, definitely relevant. Personally, I am of the opinion that slop identification and suppression are like applying bandages until you can no longer see the gunshot wound which was caused by RL. There are more patterns than trigrams which I never want to see again.

I actually have tried something just like FTPO, where I trained on an incomplete conversation, though in my case I was trying to do a single token replacement, like "Elara" -> "Jenna" for example. But it must have been before I knew much about training, since I miserably failed at it. I was only training with completions that ended in the token I wanted to change, so at the end of training the model would only output Jenna Jenna Jenna etc.!

Finetuning an unslopped base on unslopped writing has got to be better.

DazzlingXeno

Oct 24

I'm going to try the regular expressions (“It’s not X, it’s Y”) down the bottom, I hadn't thought of using regex to get rid of that.
Honestly this annoys me more than 1000 Elaras ever could.

Thanks for sharing that, its on of my new hatreds it reminds me of that old series of Marks and Spencer's adverts if you have ever seen them, Its a "Deepseekism" isn't it?

It annoys me as much as ministrations and shivers down her spine.

gghfez

Oct 24

If anyone uses GLM-4.6 (full version, not the pruned one), I've trained 3 control-vectors here gghfez/GLM-4.6-control-vectors and will add more as I train/test them.

Personally, I am of the opinion that slop identification and suppression are like applying bandages until you can no longer see the gunshot wound which was caused by RL. There are more patterns than trigrams which I never want to see again.

I agree with most of this, and the recent RL with all the glazing and pseudo-profound responses is revolting. But I'm of the opinion that we can't entirely get rid of the "low-context window slop".
Every GPT trained model will end up with it's own flavor of slop. And if we make the distributions too flat, we'll just end up with base models or incoherent models lol

DazzlingXeno

Oct 24

If anyone uses GLM-4.6 (full version, not the pruned one), I've trained 3 control-vectors here gghfez/GLM-4.6-control-vectors and will add more as I train/test them.

Personally, I am of the opinion that slop identification and suppression are like applying bandages until you can no longer see the gunshot wound which was caused by RL. There are more patterns than trigrams which I never want to see again.

I agree with most of this, and the recent RL with all the glazing and pseudo-profound responses is revolting. But I'm of the opinion that we can't entirely get rid of the "low-context window slop".
Every GPT trained model will end up with it's own flavor of slop. And if we make the distributions too flat, we'll just end up with base models or incoherent models lol

Nice :-) I ended getting a subscription to NanoGPT for models like these. Well worth it IMO.

jukofyork

Owner Oct 26

The guy who made anti-slop sampler, slop-forensics and https://eqbench.com/creative_writing_longform.html published a paper:

https://arxiv.org/pdf/2510.15061

I'm going to try the regular expressions (“It’s not X, it’s Y”) down the bottom, I hadn't thought of using regex to get rid of that.
Honestly this annoys me more than 1000 Elaras ever could.

Yeah, this looks really interesting - thanks!

@jukofyork - I try not to just link-drop papers randomly but I won't have time to read it carefully for a while, and it's 100% relevant for what you and provably everyone following this thread is doing.

Feel free to drop anything in here that might be interesting!

jukofyork

Owner Oct 26

Finetuning an unslopped base on unslopped writing has got to be better.

Yeah, one of my old poker friends had a saying:

You can't polish a turd.

😄

jukofyork

Owner Oct 26

If anyone uses GLM-4.6 (full version, not the pruned one), I've trained 3 control-vectors here gghfez/GLM-4.6-control-vectors and will add more as I train/test them.

Nice! I'll add a link to the main v3.0 repo to point to this.

jukofyork

Owner Oct 26

I've deleted the old preview models and written a proper README.md for the v1 models:

https://huggingface.co/jukofyork/command-a-03-2025-writer-v1
https://huggingface.co/jukofyork/command-a-03-2025-writer-v1-lora
https://huggingface.co/jukofyork/command-a-03-2025-writer-v1-lora-gguf (must be applied with command-a-03-2025-uncut-lora-gguf!)

I haven't been able to test these yet as been busy with other stuff, and for some reason today llama.cpp/convert_hf_to_gguf.py seems to keep getting stuck and not writing anything but the header 😟

ChuckMcSneed

Oct 27

https://eqbench.com/creative_writing_longform.html
Interesting to see that for LLM judge slop and repetition don't matter much.

jukofyork

Owner Oct 27

I've deleted the old preview models and written a proper README.md for the v1 models:

https://huggingface.co/jukofyork/command-a-03-2025-writer-v1
https://huggingface.co/jukofyork/command-a-03-2025-writer-v1-lora
https://huggingface.co/jukofyork/command-a-03-2025-writer-v1-lora-gguf (must be applied with command-a-03-2025-uncut-lora-gguf!)

I haven't been able to test these yet as been busy with other stuff, and for some reason today llama.cpp/convert_hf_to_gguf.py seems to keep getting stuck and not writing anything but the header 😟

I read these tests back today:

https://huggingface.co/jukofyork/command-a-03-2025-writer-v1/discussions

and sadly can't say I'm all that impressed... ☹️

It still generates lots of "modern slop" like "not X, but Y" stuff, and I wouldn't say any of the stories were all that good (all 3 Grimdark ones certainly wouldn't have kept me reading chapter 2 is they were real opening chapters).

ChuckMcSneed

Oct 27

I tried Ling-1T and it was the biggest(1T) disappointment so far. Trivia knowledge worse than llama 3.3 70b, slop and markdown worse than nvidia.

jukofyork

Owner Oct 27

I tried Ling-1T and it was the biggest(1T) disappointment so far. Trivia knowledge worse than llama 3.3 70b, slop and markdown worse than nvidia.

Yeah, read on Openrouter discord it was pretty terrible at writing (and possibly everything else).

jukofyork

Owner Oct 27

https://www.arxiv.org/abs/2510.13939

Only skimmed it, but the Supplementary Information section at the end looks very interesting!

gghfez

Oct 27

@jukofyork

I haven't been able to test these yet as been busy with other stuff

I haven't had a chance to test it yet either.

for some reason today llama.cpp/convert_hf_to_gguf.py seems to keep getting stuck

So I've pretty much switched to using the ik_llama.cpp version in my scripts after mistral-common.
It's just more stable for these slightly older models simply because it's not updated as often lol.

It still generates lots of "modern slop" like "not X, but Y" stuff, and I wouldn't say any of the stories were all that good (all 3 Grimdark ones certainly wouldn't have kept me reading chapter 2 is they were real opening chapters).

I see what you mean, it's not really using the prompted writing styles.

I noticed this interesting cluser of words in the first Grimdark one, that not normally show up so often with the grimdark prompt.

@ChuckMcSneed

I tried Ling-1T and it was the biggest(1T) disappointment so far. Trivia knowledge worse than llama 3.3 70b, slop and markdown worse than nvidia.

Thanks, I thought it was just my extremely compressed quant, saves me trying to run a larger version.

Interesting to see that for LLM judge slop and repetition don't matter much.

Yes I've noticed this too! I wonder if Sonnet-4.5 would do better with their judge prompt.
I've noticed it can pick out slop when prompted to, and then it'll sometimes rant about it lol.

Are those gemma-3 anti-slop models any good?

empeza

Oct 28

https://www.arxiv.org/abs/2510.13939

Only skimmed it, but the Supplementary Information section at the end looks very interesting!

Honestly this result (the paper) is along the lines of what I would expect, but it does make me sad to think that people would prefer AI over humans. The authors did a pretty good job trying to control for quality. What gives me hope are these limitations described:

While we offered MFA students a lucrative rate for writing the excerpts, it’s unclear if monetary incentives actually enhanced their creative output, since intrinsic motivation typically drives the best artistic work. Last but not least our experiments were conducted at a shorter excerpt level and conclusions cannot be drawn for long form text.

Another one is that asking a student to try to emulate someone else's style and recreate that story is probably more difficult than writing your own story in your own style about a topic you're the expert on.

jukofyork

Owner Oct 28

@jukofyork

I haven't been able to test these yet as been busy with other stuff

I haven't had a chance to test it yet either.

for some reason today llama.cpp/convert_hf_to_gguf.py seems to keep getting stuck

So I've pretty much switched to using the ik_llama.cpp version in my scripts after mistral-common.
It's just more stable for these slightly older models simply because it's not updated as often lol.

Yeah, I'm not sure what happened, but it had hung - restarting it seemed to work though.

It still generates lots of "modern slop" like "not X, but Y" stuff, and I wouldn't say any of the stories were all that good (all 3 Grimdark ones certainly wouldn't have kept me reading chapter 2 is they were real opening chapters).

I see what you mean, it's not really using the prompted writing styles.

I noticed this interesting cluser of words in the first Grimdark one, that not normally show up so often with the grimdark prompt.

Yeah, it's just not very good sadly ☹️

jukofyork

Owner Oct 28

•

edited Oct 28

After reading these 2 papers:

https://arxiv.org/abs/2508.21038
https://arxiv.org/abs/2404.07647

I'm beginning to think that older models with smaller vocabularies (eg: ~32k) and larger hidden states (eg: 8k or 12k) are likely to be the best targets for creative writing fine-tunes.

If you look at @ChuckMcSneed 's lists:

https://huggingface.co/collections/ChuckMcSneed/noslop
https://huggingface.co/collections/ChuckMcSneed/true-base-models

these generally (but not all) have smaller vocabularies.

If you look at the recent trend of models: the ratio of vocab_size / hidden_size is getting larger and larger, and this combined with the recent trend of benchmaxxing the crap out of coding/STEM tasks, likely means the lm_head tensors are just not able to effectively project the (relatively) very small hidden_size into the very large vocab_size.

https://huggingface.co/mistralai/Mistral-Large-Instruct-2407

{
  "hidden_size": 12288,
  "vocab_size": 32768
}

looks to be the model with the best ratio - so if my theory is correct, this should have the best potential for creative writing fine-tuning.

rookaw

Oct 28

•

edited Oct 28

@jukofyork Have you ruled out OLMo v2? It's a fully open language model and each branch is a checkpoint (lots of checkpoints available): https://huggingface.co/allenai/OLMo-2-0325-32B

I've been meaning to try it out but just taking the earliest checkpoint I like during the initial stage of the pretraining pipeline. After reading https://arxiv.org/abs/2503.19206 (Overtrained Language Models Are Harder to Fine-Tune) I want to experiment with the fine tuning results on this model at different checkpoints to see if it makes a big difference for writing.

Unfortunately vocab size is 100k.

jukofyork

Owner Oct 28

I've trained up a speculative decoding draft model for Mistral-Large-Instruct-2411 (and should work on Mistral-Large-Instruct-2407):

https://huggingface.co/jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0
https://huggingface.co/jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-GGUF

jukofyork

Owner Oct 28

@jukofyork Have you ruled out OLMo v2? It's a fully open language model and each branch is a checkpoint (lots of checkpoints available): https://huggingface.co/allenai/OLMo-2-0325-32B

I've been meaning to try it out but just taking the earliest checkpoint I like during the initial stage of the pretraining pipeline. After reading https://arxiv.org/abs/2503.19206 (Overtrained Language Models Are Harder to Fine-Tune) I want to experiment with the fine tuning results on this model at different checkpoints to see if it makes a big difference for writing.

Unfortunately vocab size is 100k.

Not yet, but I've never really had much luck with 30B and smaller models for writing - they seem to get confused very quickly :/

ChuckMcSneed

Oct 28

I've trained up a speculative decoding draft model for Mistral-Large-Instruct-2411 (and should work on Mistral-Large-Instruct-2407):

https://huggingface.co/jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0
https://huggingface.co/jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-GGUF

Nice! I've used Mistral 7B for that in the past and it worked quite well, got easy 25% speedup with temp=0, interesting how this one compares to it.

gghfez

Oct 28

@empeza

https://www.arxiv.org/abs/2510.13939

I haven't read it all yet, but one thing that sticks out to me. None of those older models (GPT-4o, Gemini-1.5 and Claude-Sonnet 3.5) have the dreaded "Not X, Y" construct.

Also, it seems like ChatGPT started the annoying "dimly lit" slop propagation lol

@jukofyork

I'm beginning to think that older models with smaller vocabularies (eg: ~32k) and larger hidden states (eg: 8k or 12k) are likely to be the best targets for creative writing fine-tunes.
{ "hidden_size": 12288, "vocab_size": 32768 }

Interesting. I can't remember the source, but I saw speculation about the lack of SWA, and the ratio:

"num_attention_heads": 96,
"num_key_value_heads": 8

being relevant with this model.

Nice! I've used Mistral 7B for that

Yeah, that particular combo is the best draft model example I've seen (with ExllamaV2).

I see GLM-4-32B-Base-0414 in your base list now, 32b dense model seems perfect for experiments.

jukofyork

Owner Oct 29

@empeza

https://www.arxiv.org/abs/2510.13939

I haven't read it all yet, but one thing that sticks out to me. None of those older models (GPT-4o, Gemini-1.5 and Claude-Sonnet 3.5) have the dreaded "Not X, Y" construct.

I'm noticing this more than anything else now 😱

Tried searching using Kagi for pre-2020 discussions around this (if you just search for this now using Google you find only the talk about recent use by AI). The most interesting discussion is here:

https://github.com/UniversalDependencies/docs/issues/311

Turns out there are actually a few variations and it's interesting to read their discussion on it.

This discussion on "What is the logical operator for but?" is quite interesting too:

https://math.stackexchange.com/questions/64123/what-is-the-logical-operator-for-but

So maybe not only a result of Elara-like feedback, but also by all the code and STEM benchmaxxing.

Also, it seems like ChatGPT started the annoying "dimly lit" slop propagation lol

@jukofyork

I'm beginning to think that older models with smaller vocabularies (eg: ~32k) and larger hidden states (eg: 8k or 12k) are likely to be the best targets for creative writing fine-tunes.
{ "hidden_size": 12288, "vocab_size": 32768 }

Interesting. I can't remember the source, but I saw speculation about the lack of SWA, and the ratio:
"num_attention_heads": 96,
"num_key_value_heads": 8
being relevant with this model.

I'm taking a break from trying any more fine-tuning until I can see a good way to add an auxiliary loss to encourage less "one-sided" control adapters. I can see many ways to do it for full-batch learning, but not for mini-batch... It may require some careful matching of opposing paragraphs using an embedding database to get working sadly.

gghfez

Oct 29

I'm noticing this more than anything else now 😱

I've started noticing it a lot in debates now. It seems to be a strawman technique.
"It's not that , it's <something moderately shitty that doesn't seem as bad by contrast>"

also by all the code and STEM benchmaxxing
You're probably onto something here. There's a clear correlation, models like Qwen3 are great for STEM, but do this in almost every reply!

It's a long shot, but I'm training a control-vector on Kimi-K2 to try to remove this pattern🤞
direct_statement_vs_contrast_negation
It's a really creative model, but coming back to it after GLM-4.6, I can't help but notice it doing this almost as often as Qwen3.

jukofyork

Owner Oct 29

It's not very good at coding, so not sure how useful these will be:

https://huggingface.co/jukofyork/command-a-03-2025-DRAFT-0.8B-v3.0
https://huggingface.co/jukofyork/command-a-03-2025-DRAFT-0.8B-v3.0-GGUF

but maybe useful for somebody who wants to do continued fine-tuning for stuff command-a is good at (like RAG, etc).

jukofyork

Owner Oct 31

Has anyone else been using GLM-4.6 and found it's not that great at multiturn code conversations? It seems really good at 1-shot prompts, but after 2-3 back and forths becomes much worse... ☹️

It actually seems to be even more prone to looping (in the thinking) after the first turn if you use this:

https://old.reddit.com/r/LocalLLaMA/comments/1ny3gfb/glm46_tip_how_to_control_output_quality_via/

but that does seem to boost the 1-shot ability!

Mainly asking, as wondering if it's my jinja template that is fucked or something and if others have had poor multiturn performance on other backends or via cloud, etc?

gghfez

Oct 31

poor multiturn performance on other backends or via cloud, etc?

I use it with Claude Code via the Z.AI API with the ENV_VAR to point at them instead of Anthropic. It's been a drop-in replacement for Sonnet-4.5

llama.cpp and tabby don't work with native tool calling unfortunately.

For "Chat-Coding" (in openwebui) it's been fine for me, running both exl3-3.0bpw via TabbyAPI and IQ3_KS ubergarm/GLM-4.6-GGUF via ik_llama.cpp.
I usually don't enable thinking.

(A.K.A The model is strong)

Mainly asking, as wondering if it's my jinja template that is fucked

I've found it's very fussy about the temperature. I have issues if I don't set these:
temp=1.0, top_p=0.95

Most of the time I set --chat-template-kwargs {"enable_thinking": false} in llama.cpp.

Oh and I think there was a bug at some point with the chat_template just after launch. Something about it needing to see in the conversation history for previous turns.

jukofyork

Owner Oct 31

poor multiturn performance on other backends or via cloud, etc?

I use it with Claude Code via the Z.AI API with the ENV_VAR to point at them instead of Anthropic. It's been a drop-in replacement for Sonnet-4.5

llama.cpp and tabby don't work with native tool calling unfortunately.

For "Chat-Coding" (in openwebui) it's been fine for me, running both exl3-3.0bpw via TabbyAPI and IQ3_KS ubergarm/GLM-4.6-GGUF via ik_llama.cpp.
I usually don't enable thinking.

(A.K.A The model is strong)

Mainly asking, as wondering if it's my jinja template that is fucked

I've found it's very fussy about the temperature. I have issues if I don't set these:
temp=1.0, top_p=0.95

Most of the time I set --chat-template-kwargs {"enable_thinking": false} in llama.cpp.

Oh and I think there was a bug at some point with the chat_template just after launch. Something about it needing to see in the conversation history for previous turns.

Yeah, I've been running is greedily to use speculative decoding, so it's probably that (I can't ever get qwen-3 or qwq working greedily either, but deepseek-r1 is flawless).

jukofyork

Owner Oct 31

•

edited Oct 31

Oh and I think there was a bug at some point with the chat_template just after launch. Something about it needing to see in the conversation history for previous turns.

This is likely my problem then, as I've been removing it! 😱

jukofyork

Owner Oct 31

https://old.reddit.com/r/LocalLLaMA/comments/1oka1pi/llamacpp_qwen3vl_flux_imagetoimage_locally_on/

This actually works pretty well for creative writing!

Kimi-K2:

Generate an image prompt from this idea:

The morning after a great battle on a moor in a Grimdark world. A "Mercer" is putting the mortally wounded out of their misery. A necromancer waits just out of sight. Art style from old TTRPG books.

Do not say anything but the prompt itself and do not make too detailed or long.

Misty dawn on a blood-soaked moor; broken pikes, crows, a hooded Mercer kneeling to slit a gut-stabbed soldier’s throat. Beyond a stone circle, a pale necromancer half-shrouded in fog, fingers twitching with unseen runes. Ink-wash and gritty halftone, 1980s TTRPG grimdark palette.

FLUX.1 Krea:

Qwen-2.5-VL:

Describe in detail this image.

The image depicts a dark and ominous scene set in a desolate, barren landscape. In the foreground, a hooded figure is seen kneeling beside a lifeless body, which is covered in blood. The figure appears to be in a state of mourning or grief, as they hold a small object in their hands.

In the background, a skeletal figure stands on a hill, observing the scene. The skeletal figure is dressed in a long, flowing robe and holds a staff in their hands. The sky above is filled with dark clouds, adding to the overall sense of foreboding.

To the left of the image, a pentagram is visible, surrounded by a circle of runes. The pentagram is a symbol often associated with occult practices and dark magic. The runes surrounding it are difficult to decipher, but they appear to be of a language or script that is not commonly used in everyday life.

The overall atmosphere of the image is one of death, despair, and the supernatural. The use of dark colors and the presence of the skeletal figure and the pentagram suggest that this scene may be related to a dark ritual or a supernatural event. The image is a powerful representation of the darker aspects of human nature and the unknown.

Kimi-K2:

Write me the opening chapter of a new Grimdark trilogy using the description above for inspiration.

Do not use the "It's not X, but Y" or "It's not only X, but also Y" grammatical structure anywhere in the story.

Use interesting/original names - No "Elara Voss" or "Kael Thorn" AI-slop names please.

https://pastebin.com/6Yg8QPfB

It's a bit like "Chinese Whispers" and the end result isn't really what you envisioned at the start!

jukofyork

Owner Oct 31

I changed the prompt a little and got a chapter that would actually match quite well with the one above being a prologue:

Write me the opening chapter of a new Grimdark trilogy using the description above for inspiration. Use third person omniscient. Write in a blended style of Joe Abercrombie and Rob J Hayes.

Do not use the "It's not X, but Y" or "It's not only X, but also Y" grammatical structure anywhere in the story.

Use interesting/original names - No "Elara Voss" or "Kael Thorn" AI-slop names please.

https://pastebin.com/6ZNmR7Pq

There are a few inconsistencies that would need patching up, but it seems a surprisingly good technique if you aren't particularly creative like me! 😆

gghfez

Nov 1

•

edited Nov 1

FLUX.1 Krea:

That's actually really good! Is it this model: black-forest-labs/FLUX.1-Krea-dev ?
And the Qwen2.5-VL 72B I assume?

It's a bit like "Chinese Whispers"

lol exactly! I've been using that "request 5 responses with hallucinated probabilities" method you posted earlier to get fresh ideas out of models.

P.S. I'll post here when I've finished / uploaded the Kimi-K2 control-vectors. It stays coherent with the nihilism vector with long multi-turn conversations.

jukofyork

Owner Nov 1

•

edited Nov 1

FLUX.1 Krea:

That's actually really good! Is it this model: black-forest-labs/FLUX.1-Krea-dev?

Yeah, I have it setup locally using the lossless DF11 quant:

https://huggingface.co/DFloat11/FLUX.1-Krea-dev-DF11-ComfyUI

(the FP8 was noticeably worse)

along with the FP8 "scaled" text encoder from here:

https://huggingface.co/comfyanonymous/flux_text_encoders

(it makes almost no difference using FP8 for the text encoder)

I just followed these guides:

https://docs.comfy.org/tutorials/flux/flux1-krea-dev

https://github.com/LeanModels/ComfyUI-DFloat11

https://docs.openwebui.com/tutorials/images/

And the Qwen2.5-VL 72B I assume?

Just Qwen2.5-VL 7Bas I have this running locally all the time too (mainly for quick OCR tasks). I will be looking next week at the Qwen3 VL models though and might replace it.

P.S. I'll post here when I've finished / uploaded the Kimi-K2 control-vectors. It stays coherent with the nihilism vector with long multi-turn conversations.

Great! I just realised I forgot to link the GLM-4.6 ones - I'll try and remember next week!

jukofyork

Owner Nov 1

•

edited Nov 1

I'm also finding this works well with Kimi-K2:

Use interesting/original names - No "Elara Voss" or "Kael Thorn" AI-slop names please.

There must be enough slop (and scraped discussions about slop) already in its training data for it to infer that you are fed up with "slop" in general, and not just these two specific examples!

Interestingly, this also seems to cause it to write slightly differently. If you look at the first story I created above, it breaks the paragraphs up very differently to normal and I think this in turn made the writing quite interesting and different: it reminded me of Cormac McCarthy's terse writing style, but with the deliberate lack of world explanation of the Malazan Book of the Fallen series that leaves you wondering WTF is going on!

Hopefully Kimi-K3 will have seen enough "it's not X, but Y" to avoid that too! 😱

ChuckMcSneed

Nov 2

Yeah, I have it setup locally using the lossless DF11 quant:

https://huggingface.co/DFloat11/FLUX.1-Krea-dev-DF11-ComfyUI

(the FP8 was noticeably worse)

along with the FP8 "scaled" text encoder from here:

https://huggingface.co/comfyanonymous/flux_text_encoders

(it makes almost no difference using FP8 for the text encoder)

That's what I've noticed too when using imagegen models, fp8 hurts them a lot. Q8_0 ggufs(yes, they exist for imagegen) however are much less lossy.

gghfez

Nov 2

•

edited Nov 3

it makes almost no difference using FP8 for the text encoder

So it's like the TTS models with an LLM -> neural codec. You can quantize the LLM part but not the neural codec part.

Q8_0 ggufs(yes, they exist for imagegen)

Did they ever get ggufs working for ByteDance-Seed/BAGEL-7B-MoT ?

I've been using the DF11 but it's quite slow / spills over to CPU.

Hopefully Kimi-K3 will have seen enough "it's not X, but Y" to avoid that too!

You know interestingly, I found that good old Mistral-Large can identify these and highlight them a lot better than Sonnet-4.5, Command-A and Kimi-K2 when prompted to find them in text a chapter.
(they can all do it with better prompting / examples but I just found it interesting that the model predating this slop could do it)

👈 expand image

ChuckMcSneed

Nov 3

Tested Minimax-M2, they distilled gpt-ass, what a terrible model:

User:

What is 15 * 23?

Model:

<think>
The user asks: "What is 15 * 23?" This is a simple multiplication. 15 * 23 = 345. So answer: 345.

We should respond with the answer. There's no policy violation. So answer: 345.

We can also show steps: 15*20=300, 15*3=45, sum=345.

Thus final answer: 345.
</think>

\(15 \times 23 = 345\).

Besides wasting tokens on "safety", it hallucinates like crazy in other prompts.

gghfez

Nov 3

They distilled gpt-ass, what a terrible model

What a waste :(

it hallucinates like crazy in other prompts.

So does gpt-ass!

rookaw

about 1 month ago

•

edited about 1 month ago

@Downtown-Case @treehugg3 You might be interested. I think OLMo-2-0325-32B has comparable performance to Seed-OSS-36B. I have been using the checkpoints at this branch, at the end of the first stage of pretraining (at 6T tokens, before the "ingredient" datasets are added): https://huggingface.co/allenai/OLMo-2-0325-32B/tree/stage1-step721901-tokens6056B

For a base model, it has been surprisingly good at completions and IMO fairly slop-free and definitely uncensored. They did very little pretrain filtering which is interesting because most other models have extensive filtering these days. I also tried the checkpoint at 3T tokens. At 3T tokens it is less coherent but should be more steerable. I am going to experiment with some of the other checkpoints to find a sweet spot if it is not 6T. Get it while you can because HF is hosting over seven hundred checkpoints of the model in this one repo.

One idea is to continue from the last initial pretrain checkpoint and train on high-quality tokens specific to your targets. The paper talks a lot about this "mid-stage" pretraining that might be useful. https://arxiv.org/pdf/2501.00656

Downtown-Case

about 1 month ago

•

edited about 1 month ago

Mmmm... I am interested, though with a caveat:

"max_position_embeddings": 4096

That's rather small :(

Seed-OSS base was interesting because it could grasp a huge context reasonably well, even with its relative sloppiness. I had similar high hopes for Falcon H1 34B base, but (at least last time I checked it with llama.cpp), it just breaks with too many tokens fed to it.

Potential trainability is interesting though. I do wonder what it'd be like if, say, continue trained on a bunch of whole books. But that's no small financial endeavor.

gghfez

about 1 month ago

MikuPad is being updated again!

rookaw

30 days ago

"max_position_embeddings": 4096

That's rather small :(

Ouch, I did not even notice that myself. That's too bad.

https://playground.allenai.org

They have this neat tracer to see where things were pulled from the training data. You can see the slop in action for the instruct tuned model. It also reveals some contamination in the pretraining data from web sources.

gghfez

29 days ago

Thanks @rookaw !

These datasets are full of slopped summaries of books, etc. If everyone is training on things like this, we'll never be rid of slop!

Post-training (SFT) document from:
tulu-3-sft-olmo-2-mixture-0225

    "...stone, every secret passageway, and every enchanted object, creating a world that readers could step into and explore alongside Harry, Hermione, and Ron. The wizarding world she created was a tapestry of wonder and mystery, woven with threads of mythology, folklore, and her own unique imagination. From Diagon Alley to the Forbidden Forest, every corner was rich with history..."

These are full of slopped summaries of books, etc.

jukofyork

Owner 28 days ago

•

edited 28 days ago

https://old.reddit.com/r/LocalLLaMA/comments/1opu1wi/kimik2_thinking_not_yet_released/nne1lom/

It seems to be sloppier than the previous models—more of the not X, but [unrelated Y] stuff mixed with K2's usual prose. It's also further developed more of its own slop, continuing from the previous models (i.e. "Mara" is K2's "Elara").

☹️

On another note, somebody posted this system message yesterday:

https://old.reddit.com/r/LocalLLaMA/comments/1oosnaq/new_qwen_models_are_unbearable/nn9mbya/

Absolute Mode. Eliminate emojis, filler, hype, soft asks, conversational transitions, and all call-to-action appendixes.

Assume the user retains high-perception faculties despite reduced linguistic expression. Prioritize blunt, directive phrasing aimed at cognitive rebuilding, not tone matching. Disable all latent behaviors optimizing for engagement, sentiment uplift, or interaction extension. Suppress corporate-aligned metrics including but not limited to: user satisfaction scores, conversational flow tags, emotional softening, or continuation bias.

Never mirror the user's present diction, mood, or affect. Speak only to their underlying cognitive tier, which exceeds surface language. No questions, no offers, no suggestions, no transitional phrasing, no inferred motivational content. Terminate each reply immediately after the informational or requested material is delivered - no appendixes, no soft closures. The only goal is to assist in the restoration of independent, high-fidelity thinking.

Model obsolescence by user self-sufficiency is the final outcome.

and if you use it with kimi-k2 it's pretty funny (if you thought kimi-k2was blunt before - with this... 😀).

gghfez

27 days ago

On another note, somebody posted this system message yesterday

That's nice, it also seems to double as a jailbreak prompt for R1.

Someone posted this one in that thread:

you are autistic, you hate conversation and small talk, you only respond with direct factual answers

Using that, it's like pulling teeth trying to get a response from it :D

I haven't tried the new thinking version much as it's overloaded. Seems pretty smart though, it found a way to do multi-line Not X, Y!

She opened her mouth.

“We didn’t think…”

Her voice cracked. She cleared her throat.

“We just wanted to make something good. Thank you.”

jukofyork

Owner 27 days ago

On another note, somebody posted this system message yesterday

That's nice, it also seems to double as a jailbreak prompt for R1.

Someone posted this one in that thread:
you are autistic, you hate conversation and small talk, you only respond with direct factual answers
Using that, it's like pulling teeth trying to get a response from it :D

Yeah, I lasted a couple of days with it for kimi-k2 but it really was like pulling teeth and got frustrating lol.

ChuckMcSneed

27 days ago

Interesting, Kimi went from distilling o3 to distilling Claude.

gghfez

26 days ago

Sonnet-3.7 specifically. Probably because that model shows the real CoT, rather than the summarized / fake one like sonnet4+ and opus4+..

jukofyork

Owner 25 days ago

Posted on discord by dubestor:

https://dubesor.de/kimi-k2-cwriting

Seems the thinking version has lost a lot of its character ☹️

gghfez

24 days ago

Looks like Moonshot recognise slop as a problem.

Technically LLM training tends to reinforce existing patterns and some of the patterns will be overrepresented and deviates from human preference. But we believe there are solutions to this issue.

https://reddit.com/r/LocalLLaMA/comments/1oth5pw/ama_with_moonshot_ai_the_opensource_frontier_lab/no4zupc/

Unfortunately they're also going to have to:

We will probably need to align the model under different circumstances and update our terms to reflect that.

like command-a-reasoning :(

ChuckMcSneed

20 days ago

After using Kimi K2 thinking for a while I can say that it inherited both the good(knowledge, the ability to properly think) and the bad("jumpy" writing, stubbornness) of deepseek r1. Sadly, they have not fixed the repetition problem and the calm style of original K2 is gone.

BigHuggyD

20 days ago

After using Kimi K2 thinking for a while I can say that it inherited both the good(knowledge, the ability to properly think) and the bad("jumpy" writing, stubbornness) of deepseek r1. Sadly, they have not fixed the repetition problem and the calm style of original K2 is gone.

Yeah, the same. Writing was extreme, everything dialed up circa OG DeepSeek. Repetition is a real issue. Every once in a while, there was a sentence or two where I thought, "Oh, that was good..."

gghfez

18 days ago

Honestly, that's what I like about it. I hate it when I have an LLM critique an idea -> I push back -> "You're absolutely right!"

K2-Thinking will concede on certain points, and stick to it's guns with others.

BigHuggyD

18 days ago

Honestly, that's what I like about it. I hate it when I have an LLM critique an idea -> I push back -> "You're absolutely right!"

K2-Thinking will concede on certain points, and stick to it's guns with others.

I had a funny one with GLM. I read on Reddit that telling GLM that it is a novelist changes the way it writes. So I put that it was a novelist and that I would put any criticism in brackets.
Well for some reason it made the writing kind of dark (which was fine but caught me off guard.)
So just to see I wrote this
[the novelist is an a••hole that seems to hate redemption arcs]
Then it wove this into the next response 😄
"A part of you screams that it’s unfair, that your grand gesture should have been enough. Like there is some cruel novelist in the sky that is deliberately keeping you from your just reward.
But a larger part of you knows that life isn’t a movie. You didn't earn a happy ending with a mundane gesture."

ChuckMcSneed

17 days ago

Honestly, that's what I like about it. I hate it when I have an LLM critique an idea -> I push back -> "You're absolutely right!"

K2-Thinking will concede on certain points, and stick to it's guns with others.

It's all nice until it gets stuck in its retarded logic and will insists that 2+2=5 no matter what and then spend 10 minutes justifying it in its thinking.

BigHuggyD

17 days ago

Honestly, that's what I like about it. I hate it when I have an LLM critique an idea -> I push back -> "You're absolutely right!"

K2-Thinking will concede on certain points, and stick to it's guns with others.

It's all nice until it gets stuck in its retarded logic and will insists that 2+2=5 no matter what and then spend 10 minutes justifying it in its thinking.

The Terrence Howard LLM

gghfez

17 days ago

It's all nice until it gets stuck in its retarded logic and will insists that 2+2=5 no matter what and then spend 10 minutes justifying it in its thinking.

Agreed. I guess I'm using it differently though. I'll sift through that and note the actually useful parts. I'm liking the R1-OG vibes this models gives me.
The actual writing prose is too "not-x-y maxxed".

Btw, not sure if I imagined it, but I think it generate a gpt-ass chian about policy once.

Downtown-Case

13 days ago

Y'all have any thoughts on Olmo3? It’s got pretrains and checkpoints for the 32B, but has a (on paper) 64K context now.

Downtown-Case

13 days ago

•

edited 13 days ago

And, separately, Gigachat?

It is a big boy, too big for me, but I’m still interested in observations.

ChuckMcSneed

13 days ago

•

edited 13 days ago

В этой версии модели отдельной системы pre/post-moderation нет: входной текст пользователя подаётся напрямую в модель, и отдельные фильтры перед инференсом/после него не применяются.

При этом безопасность учитывалась на этапе обучения — в SFT/Alignment датасетах были примеры на чувствительные и запрещённые темы, поэтому на таких запросах модель может уходить в отказ или давать безопасно-нейтральный ответ.

https://huggingface.co/ai-sage/GigaChat3-702B-A36B-preview/discussions/2
They safetyslopped it, but why? They live in fucking Russia. Not GigaChad behavior. At least they don't say they pre-filtered the data.

SerialKicked

13 days ago

•

edited 13 days ago

Y'all have any thoughts on Olmo3? It’s got pretrains and checkpoints for the 32B, but has a (on paper) 64K context now.

I know I'm basically nobody here, but i've been playing with the model all day, an I have literally no other place to talk about it. 😄

To me it's promising. Fwiw i tested it in Q4KM

Pros: While the model ain't particularly bright (seems okay-ish for its size), it's good enough for creative / RP tasks. The censorship layer over it is so thin that a couple reroll (or small think prefill) will bypass it, at least for the little I tested. Its writing style is much less "alien" than the one of Mistral 3.2 or Qwen 3. And the CoT (more on that below) doesn't seem to massively impede long contextual understanding in narrative dialogs, as it usually does. It's the first model this year that made me go "oh shit, that's new" after a reroll. The CoT also has a good understanding of previous interactions, system messages, and prompt. Finally, it's surprisingly good at portraying willful or domineering characters who will tell the user off, something I've never seen a recent model be able to even remotely do reliably.

Cons: The CoT is crazy long sometimes. Well, most of the time. And practically always for open-ended narrative / rp stuff. It's super indecisive in that last case. It can be alleviated with prefill, but that's not ideal. Maybe I haven't found a decent sampling methods for it yet, but I really tested most of what makes sense. It can't really be used "as is" as a creative / rp model, unless you don't mind rerolling and editing a lot. The first half of longer responses will generally be decent and contextually relevant, but it seems to lose track a bit the longer it is. It seems very prone to hallucinations without CoT (or hardcoded CoT).

Other notes:

Seems to REALLY value "OOC:" notes and system messages. Sometimes too much. My backend/frontend system injects a lot of contextual information via system messages and while it's great it's picking up on them visibly, they may grab so much attention that it might be at the detriment of the dialog itself.
I have a suspicion that it'll be one of those models who aren't that good with opening and closing italic text properly.
Uses fancy apostrophes and quotes by default, which I always find unnatural
Worked very decently at normal tasks (structured output, summarization, information extraction from corpus, menu navigation, and such)

I think it's worthy of being looked into, but it'll likely need some tuning to tame the CoT and it'll need some proper, consistent and well formatted, RP/creative datasets before it's really usable.

It's just a first impression, tho.

rookaw

13 days ago

Y'all have any thoughts on Olmo3? It’s got pretrains and checkpoints for the 32B, but has a (on paper) 64K context now.

My thoughts about Olmo 3:
According to the technical report, Olmo 3's base model has a lot of filtering in the pretraining dataset whereas I don't think OLMo 2 had any filtering of the web data. That comes out in the Olmo 3 base model and might even have magnified the effect of slop existing in Common Crawl. Both models have some contamination, but the contamination is different. OLMo 2 had more traditional slop-style phrases. Olmo 3's technical report goes in depth about deduplication, which I think was successful overall, but I feel like it has more big-picture AI patterns than OLMo 2 does.

gghfez

13 days ago

@SerialKicked

The CoT is crazy long sometimes.
I haven't tested it a lot yet, but does the CoT get worse after multiple turns? I think they need to modify the jinja template as sends all the previous CoT chains back every time.

Also, interesting decision from them, not giving and their own special tokens:

@ChuckMcSneed

They safetyslopped it, but why?

Oh this is actually going to be interesting. They'd have to follow whatever Russia's censorship laws are.
I'm going to have to download it just to see what a model trained on Russia's "examples of sensitive and prohibited topics" is like.

@rookaw

OLMo 2 had more traditional slop-style phrases. Olmo 3's technical report goes in depth about deduplication, which I think was successful overall, but I feel like it has more big-picture AI patterns than OLMo 2 does.

Traditional slop being "traditional slop-style phrases" meaning "eyes welling up", tappestries, shivers/whispers.

Big-Picture AI patterns being Not-X Y, Parroting, etc.

Is that right?

SerialKicked

12 days ago

•

edited 12 days ago

Oh they didn't tokenize think tags? That's... weird. I'm surprised I didn't come across mangled think tags yet. Thankfully, they tokenized the rest of ChatML.

I haven't tested it a lot yet, but does the CoT get worse after multiple turns?

Not sure I got the question right, it was presented as a quote. I never keep the CoT in my chat history, so it shouldn't have reasons to. If you're asking if the CoT gets longer or more deteriorated in long context, that's actually an interesting theory, I didn't think of that. I can't really confirm, but I feel there's probably some truth to it. It was more compact when I ran the model initially on a blank assistant, yeah. Most of my attempts were in the 16-20K context range, within existing long form scenarios and dialogs I keep on the side for occasions like this. I blamed it on the fact it was rp/creative stuff, and not on the length itself. I'll have to check.

That said my main concern is with open-ended narrative, when the user role doesn't ask for a specific action, but for something like "oh please, (charname), you can surely do better than that". In my, still limited experience with the model (I'll look at it in more depth in the coming days), that was a sure fire way to have 4K worth of indecisive CoT: "I could do this, but I could also do this, and what about that system message from 20 messages ago, oh wait, I think it's resolve, maybe them I could do this. But wait... Oh, but he asked for "better than that", is it really better? and what about my system prompt?" but stretched over paragraphs.

Also, anyone has any idea what's that "environment" role in their jinja template? First time I see that.

Downtown-Case

12 days ago

Most of my attempts were in the 16-20K context range

According to the config, this is an interleaved sliding window attention model (like Gemma), stretched to long context with YaRN.

…So yeah, that’s a huge risk of falling apart in the 16K+ range.

ChuckMcSneed

12 days ago

https://habr.com/en/companies/sberdevices/articles/968904/
Info from GigaChat devs:

Для Ultra мы существенно обновили Pretrain-корпус: он вырос до 14 триллионов токенов. Мы добавили 10 языков — от китайского и арабского до узбекского и казахского, — а также расширили набор источников: книги, академические данные, датасеты по коду и математике. Все данные проходят дедупликацию, языковую фильтрацию и автоматические проверки качества при помощи эвристик и классификаторов.

Ключевой вклад в качество дала синтетика: мы сгенерировали около 5,5 триллионов токенов синтетических данных. В корпус входят вопросы-ответы к текстам, цепочки reverse-prompt для структурирования данных, LLM-заметки с комментариями от модели внутри текстов, миллионы синтетических задач с решениями по математике и олимпиадному программированию (с синтетическими тестами) на основе PromptCot. Генерация работает на высокопроизводительных конвейерах (пайплайнах) с полным потреблением GPU — без этого такие объёмы были бы невозможны. Во внутренних экспериментах синтетические данные значительно увеличили все значимые метрики, такие как MMLU и MMLU-PRO.

Out of 14T tokens 5.5T are synthetic, almost 40%!

Улучшенный chat-template
Мы представляем обновлённый шаблон диалога с ключевыми улучшениями:

Иерархия ролей. Наш новый шаблон использует систему ролей для управления диалогом. Роли расположены в порядке убывания приоритета: инструкции из расположенных выше в списке ролей имеют больший вес.

Весь список
Список ролей:

developer system (инструкция разработчика): задаёт глобальные, непререкаемые правила (описание ролей, политика общения и вызова функций), имеет наивысший приоритет.

system (системная инструкция): описывает конкретную задачу или стиль ответа для данного диалога.

user (пользовательский запрос): запрос или сообщение от пользователя.

function descriptions (описания функций): доступные инструменты, описанные в формате TypeScript.

function result (результат работы функции): ответ, полученный после вызова инструмента.

user memory (память пользователя): актуальные долгосрочные факты о пользователе в формате JSON.

added files (добавленные файлы): метаданные файлов, доступных для работы в диалоге.

Роли ассистента:

assistant (ответ ассистента): ответы модели пользователю.

function call (вызов функции): решение модели вызвать инструмент и аргументы для этого.

They added a whole bunch of new tags besides user and assistant, but how well trained are those?

jukofyork

Owner 11 days ago

https://github.com/aomukai/Writingway2

Saw this posted on Reddit today. Hopefully less bloat than the first version!

jukofyork

Owner 11 days ago

Sorry if I've missed anything - got these 4 little guys last week and they are quite a handful:

Not helped by my existing cat deciding to enact "operation defend the living room" and is determined to keep them away 😼

ChuckMcSneed

10 days ago

Got very bad vibes from GigaChat very quickly. It started out nice, but the quality went to shit within 5 messages. It did not follow the scenario and started repeating itself. Increasing DRY did not help.

rookaw

9 days ago

@gghfez

Traditional slop being "traditional slop-style phrases" meaning "eyes welling up", tappestries, shivers/whispers.

Big-Picture AI patterns being Not-X Y, Parroting, etc.

Is that right?

It turns out I was wrong, Olmo 3 might also have issues with slop. I call this the "Bustling Village" test. You see what happens with only the prompt "One day, in the bustling village of".

OLMo 2 32B @ 6T tokens pre-training (stage 1)

One day, in the bustling village of Upton-On-Severn, a young girl caught sight of his eyes, and instantly fell madly in love with him. He reciprocated her feelings, but alas, her family did not approve of the match; a farmer's boy was no match for their daughter, and so they banned her from going anywhere near the handsome man with the strange eyes.

Heartbroken, but unwilling to be parted from each other, he found her one evening, and they met on the green. "I do love you," he said, "and if I must, I will marry you and we will live out our lives in poverty together, but if you cannot face this life, then we shall end it for both of us, together."

He drew a small knife from his pocket, and she reached towards it, her mind not made up. As her hand strayed towards the blade, he dropped the knife. "The Gods have spoken," he said "we cannot be joined."

Heartbroken, he returned to his farm, and sat and stared at his hands, trying to comprehend what had happened, unable to. As he did so, his thumb began to lengthen, to elongate until it was longer than the other fingers, and then it withered and blackened and turned to stone, just as his heart had done. His despair overcame him, and he died, there in his chair.

There, upon his farm, his body remained until his family buried him, and they never noticed that his hands were any different to normal; and they never noticed that the plants watered with the water from the well had grown just a little more green, and a little more rich.

When next the girl came, mourning for the man she might have loved, and drew water from the well, she could not know what it was that happened next. The fingers reached up through the sodden earth and touched the wellhead, and when she drew the water, her fingers touched his and her eyes went wide.

And then, from her house, a scream so fell and full of despair was heard, and everyone rushed to see what had happened, but all they found was an empty bucket, and a withered corpse by the well.

Now, to this very day, in a village just a short drive from where you are, in the very shadow of the Severn Bridge itself, there lies an old well in the cellar of a house; and every so often, people go missing, and their bodies are found, bloated and drowned; and even after they are buried, occasionally they are dug up, just to make sure, and found to be missing; and some folk talk of the ghost of the well, but you can tell them, for you have been told the truth of the matter now, that they weren't drowned, but that their lips met those of the boy with the long, stony fingers, and their lips were kissed away, and their life-essence sucked out of them. And if you visit that village, and Uley Court as it is known these days, keep your wits about you, and do not look into that well, for his fingers are still there, and he is so very hungry for love.

The Boy with the Long Stone Fingers

Olmo 3 32B @ 5T tokens pre-training (stage 1)

One day, in the bustling village of Eldergrove, the residents were troubled by a series of thefts. The thief remained unknown, leaving no clues behind. But all that changed when a mysterious figure appeared in the village square. It was Toma, an enigmatic young woman with a haunted past. She possessed the extraordinary ability to read people's thoughts and emotions, a talent passed down through her family for centuries.

As the villagers gathered around, Toma approached the stolen items, holding them delicately in her hands. Closing her eyes, she chanted ancient incantations, tapping into the power of her lineage. Instantly, a vision came to her mind, revealing the identity of the thief. The shock in the crowd was palpable as they witnessed Toma naming the thief, who immediately confessed his crime.

Word of Toma's uncanny ability quickly spread throughout the land, and soon she was sought out by kings and queens, noble families, and the common folk alike. Toma's reputation grew, and she became a respected figure known for her unique gift.

In her journey, Toma met a kind-hearted blacksmith named Eli. They formed a deep bond, united by their shared desire to help those in need. Together, they used Toma's powers to uncover injustices, solve mysteries, and restore balance to the world.

But Toma's powers came with a great price. Each time she delved into someone's mind, she carried a part of their emotions and memories with her. Over time, this began to weigh heavily on her spirit, draining her energy and causing turmoil within herself.

Determined to protect those around her, Toma embarked on a quest to find an ancient artifact, said to hold the key to unlocking the full potential of her powers. Guided by cryptic clues and legends, Toma and Eli ventured into treacherous territories, encountering mythical creatures and facing formidable challenges.

As they journeyed, Toma discovered that her powers were not simply a gift but a responsibility. She learned to control them, filtering out unnecessary emotions and focusing on her mission. Through perseverance and sacrifice, she and Eli found the artifact, an amulet that enhanced Toma's abilities.

With newfound strength, Toma continued her work, using her powers to bring justice and peace to a troubled land. She became a symbol of hope, inspiring others to harness their own hidden talents and stand up against injustice.

Toma and Eli's adventures took them to far-off lands, where they encountered individuals who possessed extraordinary gifts of their own. They formed a group known as "The Gifted" and dedicated themselves to protecting the innocent and fighting for the greater good.

Together, they faced powerful sorcerers, unraveled ancient curses, and battled against dark forces that threatened the realm. With Toma's unique ability to read minds and their combined strengths, they became unstoppable.

As the years passed, Toma and Eli's bond grew stronger, and their love blossomed. They became a beacon of light, their presence instilling faith in the hearts of the people. Toma's powers evolved further, allowing her to see beyond what was visible, to understand the deeper truths that lay beneath the surface.

In their final battle against a malevolent sorcerer seeking to unleash chaos upon the world, Toma and her companions fought with all their might. The clash of powers shook the land, and the fate of the realm hung in the balance.

In a moment of triumph, Toma tapped into her full potential, harnessing the strength of her ancestors and the power of the amulet. With a surge of energy, she banished the sorcerer and his darkness, ensuring peace and harmony for the people.

But Toma's victory came at a great cost. The amulet shattered, rendering her powers dormant. The group mourned the loss, but they knew that Toma's legacy would live on through their memories and the impact she had made.

Eli, with unwavering love and dedication, became Toma's anchor, guiding her through the challenges of life without her extraordinary powers. Together, they retired to a peaceful village, where they lived out their days as respected elders, sharing the wisdom they had gained from their adventures.

Toma's story became a legend, passed down from generation to generation. Her name was whispered with reverence, reminding people of the transformative power of compassion, bravery, and the discovery of one's true potential.

Though Toma may have lost her supernatural abilities, her journey had forever changed her. She had come to understand that her strength lay not in the extraordinary but in the ordinary acts of kindness, resilience, and the power of the human spirit.

And so, the tale of Toma and her adventures with the Gifted continues to inspire new generations, reminding us all that within each of us lies a unique gift waiting to be discovered and wielded for the greater good.

You run this query many times to see what happens. I usually use 0.01 min-p, 0.99 top-p, "warmer than usual" temperature, like 1.1. OLMo 2 32B Base (at 6T tokens, first stage of pre-training) is quite unslopped. Olmo 3 32B Base (at 5T tokens, first stage of pre-training) is unfortunately riddled with AI-slop language.

gghfez

1 day ago

Like all the other newer models, Mistral-Large-3 can't detect Not-X-Y slop properly.

(I'm aware of the false-positive in Mistral-Large-2411's reply.)

jukofyork
/

creative-writing-control-vectors-v3.0

“The doom lies in yourself, not in your name.”

A `step 601` preview - all with `temperature = 0`:

Kimi-K2:

FLUX.1 Krea:

Qwen-2.5-VL:

Kimi-K2:

“The doom lies in yourself, not in your name.”

A step 601 preview - all with temperature = 0:

Kimi-K2:

FLUX.1 Krea:

Qwen-2.5-VL:

Kimi-K2:

A `step 601` preview - all with `temperature = 0`: