THE THREAD OF DOOM

#12
by jukofyork - opened

Just realised I deleted the old "thread of doom" as it was attached to the earliest alpha version of the control vectors :(

jukofyork pinned discussion

Okay, I was wondering if we crossed some sort of line.

Anyway.. the INCREDIBLY important thing I was saying before the thread disappeared was... I have a feeling it is going to be just like they say. They are going to be liberal with grants. I suspect they will target people who are using the space outside the purpose that was intended... somewhere out there, someone has all their RAW 8k videos of their cats...

Anyway.. the INCREDIBLY important thing I was saying before the thread disappeared was... I have a feeling it is going to be just like they say. They are going to be liberal with grants. I suspect they will target people who are using the space outside the purpose that was intended... somewhere out there, someone has all their RAW 8k videos of their cats...

Yeah, it's a pity it got deleted (I should have checked more carefully what was linked), but it was getting a bit out of hand with all that scrolling so perhaps not such a bad thing.

I'm just gonna keep up the models that people have downloaded the most and get rid of all the "experimental, but likely broken" stuff with 15 downloads as they really weren't serving much of a purpose.

Also, all the old versions of the control vectors were vastly inferior to the final version due to me figuring out how to get them working as I went along, so it's probably better to just keep up the final v3.0 ones to avoid a lot of the confusion.


image.png

image.png

It looks a lot more like I'm just uploading quality models that people like/use now at least... The creative-writer-v0.1-35b and creative-writer-v0.2-35b models will be going as soon as I get the v1.0 version uploaded, and possible Dusk-Miqu-70B if they do set a hard-limit (I still think Dark-Miqu-70B is worth keeping whatever though).


Also if anybody really misses any I have uploaded, then I can in theory recreate them and upload a LoRA created from the delta using extract_lora.py, but I strongly suspect most of the models nobody will even notice they have gone... Of all that I have created I've only ever used Dark-Miqu-70B myself!

:( Damn there was some good info in that thread.

If you've still got Firefox tabs open somewhere, you'll be able to save some of the thread.

Unfortunately, I cleaned my browser tabs up about an hour ago.

And yeah, if people were using it as free cloud storage then it makes sense. I just think they could have gone about it better, rather than having us wake up and see the limit.

I'm curious, did your quota drop after deleting that? I wonder if all the PNG files attached there were "billed" to you.

@jukofyork I think you're good man. If they start enforcing it, you'll get an exemption for sure.

I come across your contributions randomly all over the place, even on github repos like some fine tuning tool lol

I should probably deduplicate my quants. Often, I was making one because I could not find what I was looking for, then it would turn out a few of us just happened to be making them at the same time, Then I started getting requests. So I just decided I would make a bunch. Need a Huggingverse quant global dedupe...

I have actually thought of a completely new PEFT method over the holidays to be applied to the attention matrices specifically.

It's not "LoRA" (as in "low rank") but for 64 heads it uses exactly the same number of tunable parameters as for a rank-64 LoRA applied to the q_proj (and less for the k_proj / v_proj if using GQA).

It's a bit involved to explain and will need some custom code writing to use a block-diagonal matrix in a PEFT wrapper, but I think there is actually a fundamental flaw in using LoRAs with multi-headed attention that this should fix (to do with the cross-talk / lack of enforced sparsity in lora_B which is getting added to all the attention heads when actually there is no reason to believe there is any actual linkage between the heads!).

The most important thing with this idea is that it might actually be possible to regularise each of the tiny 128x128 matrices (back towards the identity matrix), that each act independently on a separate attention head, so as to use knowledge of relative sample sizes of the high vs low frequencies generated using RoPE to have way less chance of screwing up the long-contex ability of models when trained on shorter sequences.

Has anybody else noticed Claude Sonnet has had a lobotomy recently?

About 2 weeks ago I noticed it started:

  1. Insisting on breaking up code into different messages. (I got around this by saying “Please write it all in one message, I promise it will go through) lol. Without the promise, it would still break it up.

  2. Saying “Wow, that’s a really clever idea” and other compliments.

  3. Making mistakes in its code, forgetting what we’re trying to do, and repeating it’s mistakes.

  4. Acting curious and asking me questions all the time.

This is all via OpenRouter / API.

If o1 got nerfed at the same time, well I find that to be too much of a coincidence. Maybe an OpenRouter issue?

What's a typical workflow look like with these things?

I usually use Exui for writing, sometimes tabbyAPI + Mikupad if I want to click the token probabilities and choose a different token / change the direction of the story.

Are you constantly loading and unloading vectors to steer the story?

The character ones (honesty, etc) I toggle frequently. The rest I leave on.

When I was using llama.cpp, I wrote a wrapper UI which looked similar to the “command line generator” in the control-vectors GitHub repo.

I guess I do frequently adjust them.

Has anybody else noticed Claude Sonnet has had a lobotomy recently?

About 2 weeks ago I noticed it started:

  1. Insisting on breaking up code into different messages. (I got around this by saying “Please write it all in one message, I promise it will go through) lol. Without the promise, it would still break it up.

  2. Saying “Wow, that’s a really clever idea” and other compliments.

  3. Making mistakes in its code, forgetting what we’re trying to do, and repeating it’s mistakes.

  4. Acting curious and asking me questions all the time.

This is all via OpenRouter / API.

If o1 got nerfed at the same time, well I find that to be too much of a coincidence. Maybe an OpenRouter issue?

I've been using claude-sonnet-3.5 on OpenRouter via the API and have tried all 4 variants (ie: old, new, self-moderated and "beta") and all are working like complete shit :/

I've actually been using o1-preview using the openai API and it definitely seems to have got quite a lot dumber, and seems to make a lot more stupid mistakes than it used to make :(

First version with the "double-newline" tokenisation bug fix is uploaded:

https://huggingface.co/jukofyork/creative-writer-32b-preview-01-2025

I'm current uploading creative-writer-plus-32b-preview-01-2025 which has its Entropy quite significantly boosted.

I have creative-writer-plus-35b-preview-01-2025 training now, so will upload the two 35b models over the next couple of days too.

This looks really interesting:

https://github.com/zenoverflow/omnichain

There are quite a few interesting threads on Reddit about it, but this has the most details on how it might be interesting for writing:

https://old.reddit.com/r/LocalLLaMA/comments/1ej8aua/simple_dumb_trick_to_enhance_roleplay_and_other/

I was actually just looking for something to quickly prototype lots of regex + loopback LLM manipulations to try to get some sort of workflow to tidy up books in text format, but I think it actually might have quite a lot of potential for mixing things up for creative writing too - especially as it can act as an OpenAI API endpoint itself...

All 4 versions of the new preview models uploaded:

https://huggingface.co/jukofyork/creative-writer-32b-preview-01-2025
https://huggingface.co/jukofyork/creative-writer-plus-32b-preview-01-2025
https://huggingface.co/jukofyork/creative-writer-35b-preview-01-2025
https://huggingface.co/jukofyork/creative-writer-plus-35b-preview-01-2025

The "plus" versions are actually pretty good now:

  • You can now plot the mean and histogram of the final hidden state going into the lm_head in qlora-pipe's output (and it clearly isn't just downscaling them if you continue the training like this).
  • It's clear that increasing the Entropy using Focal Loss* works way better if you start from an already converged model and use the same dataset for it.
  • It might actually be possible to run this in 3-4 stages and eventually push the Entropy right back up to the level of a base model whilst hardly doing any damage!

These were all trained on the old dataset, but I have now refined the dataset a little:

  1. I've got a couple of regexs to hopefully filter out even more junk (eg: discarding hyphenated paragraphs, checking the start and end characters, etc):
if (( length >= min_threshold && length <= max_threshold )); then
     if [[ $trimmed_paragraph =~ ^(\"|\'|\"\'|\`|\(|\*{1,2}[\"\']?|\<{2,3}[\"\']?)?[[:upper:]] ]]; then
         if [[ $trimmed_paragraph =~ (\.|\!|\?|\)|\'|\")$ ]]; then
             valid_paragraphs+=("$trimmed_paragraph")
         fi
     fi
fi
  1. I've decided to skip the first 10 and last 10 valid paragraphs from each book to hopefully avoid the "authors notes" problem.

  2. I've used some very rudimentary regexs to try to cluster 1pp/2pp and 3pp into separate files when creating the random paragraphs:

remove_dialogue() {
    perl -pe "s/\"([^\"]|\\.)*\"//gs" <<< "$1"
}

is_1pp_2pp() {
    grep -Eiq "\b(I|me|my|mine|myself|I'm|I've|I'd|I'll|I'd've|we|us|our|ours|ourselves|we're|we've|we'd|we'll|we'd've|you|your|yours|yourself|yourselves|you're|you've|you'd|you'll|you'd've)\b" <<< "$1"
}
is_3pp() {
    grep -Eiq "\b(he|him|his|himself|he's|he'd|he'll|he'd've|she|her|hers|herself|she's|she'd|she'll|she'd've|it|its|itself|they|them|their|theirs|themselves|they're|they've|they'd|they'll|they'd've|one|one's|oneself)\b" <<< "$1"
}

content_no_dialogue=$(remove_dialogue "$file_content")

if is_1pp_2pp "$content_no_dialogue"; then
    target_file="$output_file_1pp_2pp"
elif is_3pp "$content_no_dialogue"; then
    target_file="$output_file_3pp"
else
    if (( RANDOM % 2 )); then
        target_file="$output_file_1pp_2pp"
    else
        target_file="$output_file_3pp"
    fi
fi

The new filters discard around 5% more text data (so around 30% discard in total now): 1074 books (745MB) --> 1.7M paragraphs (520MB).

There is only so much you can do with regexs and really to progress I need to find a way to use another LLM to look at all the paragraphs and help me sort them out via some kind of automated workflow....

So about to set off the training for v1.0 on the (older) command-r-plus model now (using 16bit floats if I can get it to work using 6 pipeline stages).

but I have now refined the dataset a little

I assume you've applied this to your dataset and saved it as a new dataset to train on (rather than doing this on the fly).

There is only so much you can do with regexs and really to progress I need to find a way to use another LLM to look at all the paragraphs and help me sort them out via some kind of automated workflow....

The tricky part with this (that I've found), is that the model won't follow your instructions reliably by default. Especially if the content of the dataset record includes something which could be interpreted as instructions -- it'll actually treat the content as instructions it's self! And if you use too large of a model; then it takes a really long time to produce the outputs.

That said; when I had an LLM review each record in my datasets, I had the best luck with gemma-2-2b-it, and sometimes gemma-2-2b-it-abliterated.

Few-Shot prompting helped improve reliability (Pre-fill the history with a few examples of the model doing what you want it to do).

I also instructed specific tags in the output, so that I could test for them programmatically, and resend it upon failure. This worked well because in cases where it failed to follow the instruction, it would also fail to write .

Qwen2.5 might be better at this, I didn't test it (because this predated the Qwen2.5 release), but I've found it to be unreliable as part of my automated manga translation system. Weirdly, sao10k's roleplay tune of Qwen2.5 is the best I've found for translating manga and matching text size to fit the speech bubbles accurately.

So about to set off the training for v1.0 on the (older) command-r-plus model now (using 16bit floats if I can get it to work using 6 pipeline stages).

Good luck, hope it goes well! This would probably be the only finetune of cr+!

but I have now refined the dataset a little

I assume you've applied this to your dataset and saved it as a new dataset to train on (rather than doing this on the fly).

Yeah, it all gets done with bash scripts so fairly slow, but ultimately I end up with this to train on:

> ../convert_paragraphs_to_pov_dataset.sh fiction-paragraphs fiction-dataset-shuffled 
All files have been processed and concatenated into the output directory 'fiction-dataset-shuffled'.
Generated output files:
total 522M
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part10.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part11.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part12.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part13.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part14.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part15.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part16.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part17.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part18.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part19.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part1.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part20.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part21.txt
-rw-r--r-- 1 juk juk 5.2M Jan  6 02:22 1pp_2pp_part22.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part2.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part3.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part4.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part5.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part6.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part7.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part8.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 1pp_2pp_part9.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part10.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part11.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part12.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part13.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part14.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part15.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part16.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part17.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part18.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part19.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part1.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part20.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part21.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part22.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part23.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part24.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part25.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part26.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part27.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part28.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part29.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part2.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part30.txt
-rw-r--r-- 1 juk juk 5.6M Jan  6 02:22 3pp_part31.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part3.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part4.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part5.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part6.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part7.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part8.txt
-rw-r--r-- 1 juk juk  11M Jan  6 02:22 3pp_part9.txt

The 10MB files are due to hugginface's dataset.map() code that's used by qlora-pipe doing god knows what if you give it a single 500MB file (it runs out of RAM on a machine with 0.5TB of RAM in it!).

I'm actually gonna create 3-4 of these same datasets but with the paragraphs all shuffled about each time so I can use them for the Entropy increasing stage(s), and hopefully make it less likely the model will lock on to the exact order of a single permutation...

There is only so much you can do with regexs and really to progress I need to find a way to use another LLM to look at all the paragraphs and help me sort them out via some kind of automated workflow....

The tricky part with this (that I've found), is that the model won't follow your instructions reliably by default. Especially if the content of the dataset record includes something which could be interpreted as instructions -- it'll actually treat the content as instructions it's self! And if you use too large of a model; then it takes a really long time to produce the outputs.

That said; when I had an LLM review each record in my datasets, I had the best luck with gemma-2-2b-it, and sometimes gemma-2-2b-it-abliterated.

Few-Shot prompting helped improve reliability (Pre-fill the history with a few examples of the model doing what you want it to do).

I also instructed specific tags in the output, so that I could test for them programmatically, and resend it upon failure. This worked well because in cases where it failed to follow the instruction, it would also fail to write .

Qwen2.5 might be better at this, I didn't test it (because this predated the Qwen2.5 release), but I've found it to be unreliable as part of my automated manga translation system. Weirdly, sao10k's roleplay tune of Qwen2.5 is the best I've found for translating manga and matching text size to fit the speech bubbles accurately.

Yeah, this is way in the future as I don't have time to do this now anyway: I think pruning 30% of the data isn't really a huge loss and it would be a lot of work to do this properly, and then I might as well just increase the number of books and prune again...

So about to set off the training for v1.0 on the (older) command-r-plus model now (using 16bit floats if I can get it to work using 6 pipeline stages).

Good luck, hope it goes well! This would probably be the only finetune of cr+!

It's actually not going to take any longer using 6 stages of bf16 than it did for 3 lots of 2 stages of 4bit:

GPU-SERVER-1: [2025-01-06 13:02:08,008] [INFO] [logging.py:129:log_dist] [Rank 0] step=1, skipped=0, lr=[7.92e-06], mom=[0.0]
GPU-SERVER-1: [2025-01-06 13:02:08.018] [INFO] [qlora-pipe] step:     1 /   483 loss: 2.6912 iter time (s): 573.599 samples/sec: 0.056 eta: 76h47m 
GPU-SERVER-1: before GAS splitting, batch size: 32, total tokens: 262144
GPU-SERVER-1: [2025-01-06 13:11:36,561] [INFO] [logging.py:129:log_dist] [Rank 0] step=2, skipped=0, lr=[1.1840000000000002e-05], mom=[0.0]
GPU-SERVER-1: [2025-01-06 13:11:36.579] [INFO] [qlora-pipe] step:     2 /   483 loss: 2.6866 iter time (s): 568.485 samples/sec: 0.056 eta: 76h22m 
GPU-SERVER-1: before GAS splitting, batch size: 32, total tokens: 262144
GPU-SERVER-1: [2025-01-06 13:21:02,534] [INFO] [logging.py:129:log_dist] [Rank 0] step=3, skipped=0, lr=[1.576e-05], mom=[0.0]
GPU-SERVER-1: [2025-01-06 13:21:02.647] [INFO] [qlora-pipe] step:     3 /   483 loss: 2.6782 iter time (s): 566.002 samples/sec: 0.057 eta: 76h4m 
GPU-SERVER-1: before GAS splitting, batch size: 32, total tokens: 262144
GPU-SERVER-1: [2025-01-06 13:30:28,674] [INFO] [logging.py:129:log_dist] [Rank 0] step=4, skipped=0, lr=[1.968e-05], mom=[0.0]
GPU-SERVER-1: [2025-01-06 13:30:28.686] [INFO] [qlora-pipe] step:     4 /   483 loss: 2.6933 iter time (s): 565.969 samples/sec: 0.057 eta: 75h52m 

and this is only 3x longer than it took to do the 32b model.

One of the GPUs is cutting it a bit close though: 47102MiB / 49140MiB, so hopefully it doesn't randomly OOM during training :/

Also interestingly the 104b model has a much more sane / expected distribution of final hidden states compared to the 32b and 35b models:

image.png

I think the jagged bits to the right are due to multi-word tokenisation (ie: the larger the hidden state, the lower the Entropy of the predictions which would make sense for the non-word-start tokens).

Sign up or log in to comment