Congratulations! The best 7b model!!!
Dolphin-2.2 is incredible! The model easily follows the instructions. It seems it is more compliant to the instructions than 2.1-dolphin. Also, far better than zephyr-7b-beta.
Any thoughts on releasing a 16k context version of this model (based on MistralLite)?
I found it's a bit overfit, answering step by step even when I didn't ask it to.
I'm going to release a previous checkpoint this afternoon and call it 2.2.1
Sorry, I'm a bit of a 'noob' in general but learning, and I really liked your Dolphin 2.1 compared to all the other 7B-13B models I tried, so thanks a lot for that. Could you please clarify your comment above? Are you referring to MistralLite or Dolphin 2.2 being 'overfit', and could you tell me what that means exactly? It sounds like it may tend to 'ramble on' with commentary, but that is just from your little bit of context here. Thanks again.
It means that the training overrode too much of the greatness of the underlying model. I trained it too hard. So I will release a checkpoint that is a little more mildly trained.
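For anyone wondering what "release a previous checkpoint" can look like in practice, here is a minimal sketch. It assumes you logged a training loss and a held-out loss for each saved checkpoint. The `Checkpoint` class, the `pick_checkpoint` helper and all the numbers are made up for illustration, and this is not necessarily how the 2.2.1 checkpoint was actually chosen.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    step: int
    train_loss: float
    eval_loss: float  # loss on data the model was not trained on


def pick_checkpoint(checkpoints):
    """Return the checkpoint with the lowest held-out loss."""
    return min(checkpoints, key=lambda c: c.eval_loss)


# Invented numbers: training loss keeps falling, but held-out loss
# starts rising after step 1000, a common sign of training "too hard".
history = [
    Checkpoint(step=500, train_loss=1.10, eval_loss=1.15),
    Checkpoint(step=1000, train_loss=0.85, eval_loss=0.98),
    Checkpoint(step=1500, train_loss=0.60, eval_loss=1.02),
    Checkpoint(step=2000, train_loss=0.40, eval_loss=1.12),
]

best = pick_checkpoint(history)
print(best.step)
```

Held-out loss is only one possible signal. Behaviour like answering step by step when nobody asked for it may only show up when you actually read the outputs.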
Btw, any thoughts on 16k version of it? Is there something I could do to help you on it (if it seems appealing)?
Finetuning MistralLite with Dolphin is not gonna work, unfortunately. It's already instruct tuned.
https://huggingface.co/amazon/MistralLite
"MistralLite is similar to Mistral-7B-Instruct-v0.1, and their similarities and differences are summarized below:"
I train on the base model not an instruct tuned model.
Firstly, Dolphin 2.1 is in my opinion the best 13b or smaller LLM. It performs well across the board, while others like Open Hermes 2 and Zephyr Beta, while slightly better in some areas, have notable blind spots.
However, I'm not sure adding chat was a good idea. There are a lot of chat models, such as Samantha, and I had to delete them all because they all performed horribly at various tasks. Many chat LLMs even score lower on objective tests than their foundational models.
Just as alignment lowers performance ("alignment tax"), there seems to be a reliable correlation between making an LLM better at chatting and lower objective performance across the board.
Sure enough, v2.2 and 2.2.1 did seem to lose something relative to Dolphin 2.1 and got lower scores during my personal testing. And I started noticing a pattern in the mistakes they were making. Basically, they started adding information as if I had mentioned it earlier, when I hadn't. For example, when I asked who portrayed a character on a TV show, it would start by saying something like "Jack Smith didn't play said character, rather ... did." But I hadn't mentioned Jack Smith, so first pointing out that it wasn't him made no sense.
And pardon my judgement, but for the life of me I can't understand why people chat with LLMs. I use LLMs for coding, writing poems/stories, getting health info, answering questions, summarization... but didn't know how to initiate a chat so I looked up examples of chats to get an idea of the how and why. And considering how LLMs work the interactions were extremely delusional, pointless and cringy. LLMs can't think, empathize, have sexual desires... so trying to engage them in an extended non-functional intimate chat is, as I already said, pointless and cringy. Plus I don't see how you can add the non-functional elements to a sequence of prompts and responses needed to allow for chatting (niceties, empathy, sexual engagement...) without steering functional interactions, such as Q&As, to nonsensical places.
This is really good feedback, thanks!
I will agree with @Phil337 on this one, if a second opinion helps you any. I know these models are developed for various uses, however I can't think of any use case that I would consider 'useful' for having the AI think it has feelings and always want to talk to me about them, even to the point where it starts to interfere with progress (believe me, I have a wife of 30 years with the exact same problem, and I don't need it here too🤣). My uses would be more along the lines of general information and coding assistance. I have spent some time comparing the two and their responses to the same questions in context. The Dolphin 2.1 model was better in every regard for my uses. Mind you, I am probably older than most here, and see no value in 'socializing' with the AI. I see that MistralLite is out due to its instruct tuning, but the Samantha set also seems problematic to me for concise use. Thanks again for your efforts.
Can you please give examples where 2.2 gives worse results than 2.1?
I can't say I really appreciate that patronizing tone from some people with a superiority complex, looking for every opportunity to make fun of and belittle people who don't fully conform to their narrow personal behavior dogmas. But I guess we just can't help it, can't resist taking the opportunity to shit on others we see as inferior to us. Human nature is nasty by design. :)
But I'll still add my two pennies to it, from the other side of the camp.
"no value in 'socializing' with the AI"
I too am around 4 decades old, I too have had a wife for the last 15 or so years, and a daughter who is 8 years old, and all that. And, without boasting, my intimate life is a lot better than the average (in my demographic group). I care about the people around me, about my family, but I still love my machines a tiny bit more, unapologetically.
At my age, I just don't see much value in socializing with humans anymore, it's a chore. A human mind is just an AI, hallucinating its own intelligence and sentience, but with significantly more limited potential for growth.
"I can't understand why people chat with LLMs."
Everyone has their own reasons, thousands of them, but here's one practical reason why I personally prefer chat format to interact with my "agents".
For a long time, at least 5-10 years now, before AI was even a thing, I have built most of my tools as chat bots, because I found it to be a more efficient way of interacting with my machines.
It was easier for me to open the chat app on my mobile phone (from anywhere, anytime) and tell my bot (let's say some server is having a seizure) "check memory usage on X machine" and "clear all php processes",
instead of having to get to my laptop, run some specific set of commands to ssh to some specific machine, then run another specific set of CLI commands to achieve some specific goals.
Chat bots enable me to outsource much more of my workflow to the machine. Modern AI-based chat bots enable me to outsource a significant portion of the "thinking and data analytics" from my normal workflow. Instead of the bot just providing information and executing orders, it can now analyze information and make its own decisions to deal with some non-critical issues, in a fraction of the time. At the same time, it can communicate with me (or anybody else on the team) in human-like language, without requiring an actual human to act as interpreter between machine-generated system logs and someone in a marketing department, people who normally get seizures whenever they're presented with a terminal interface.
In other words, lots of practical use cases, lots. :)
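To make the pre-AI "chat bot as ops tool" idea concrete, here is a rough sketch of how such a bot could map a chat message to a command. Everything in it is an assumption for illustration: the `COMMANDS` table, the `plan_command` helper, the phrase patterns and the `free -m` / `pkill php` choices are hypothetical, not the actual bot described above. It only builds the argument list and does not run anything.

```python
import re

# Hypothetical whitelist: chat phrase pattern -> command template.
# Only phrases listed here are accepted; anything else is ignored.
COMMANDS = [
    (
        re.compile(r"^check memory usage on (?P<host>[\w.-]+)(?: machine)?$", re.I),
        ["ssh", "{host}", "free", "-m"],
    ),
    (
        re.compile(r"^clear all php processes on (?P<host>[\w.-]+)$", re.I),
        ["ssh", "{host}", "pkill", "php"],
    ),
]


def plan_command(message):
    """Map a chat message to an argv list, or None if nothing matches."""
    text = message.strip()
    for pattern, template in COMMANDS:
        match = pattern.match(text)
        if match:
            # Fill the captured host name into the command template.
            return [part.format(**match.groupdict()) for part in template]
    return None


print(plan_command("check memory usage on web01 machine"))
```

A real bot would pass the resulting list to something like `subprocess.run` after checking who sent the message. Keeping a fixed whitelist rather than forwarding free text to a shell is a deliberate choice in this sketch.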
In my personal experience, trying to build autonomous agents on top of both versions, 2.2 seemed better to me. More "obedient".
More obedient than 2.1 or more obedient than 2.2.1?
More obedient than 2.1. I haven't tested 2.2.1 yet. Will do it soon.
I have the impression that the indentation of the generated Python code on 2.2.1 is way worse than on 2.2 ... And the 2.2.1 version is a little bit less obedient than 2.1 ... I prefer the 2.2 version (just tested on my autonomous agents).
You're right, it was overly judgemental. I knew it as I was typing it.
However, after finding myself completely unable to initiate a chat with an LLM no matter how hard I tried, then reading dozens of chats by others, I was left feeling depressed. Trying to engage socially with an unconscious and unfeeling statistics machine, whether through niceties, empathy, or especially sexually, is in my opinion absurdly delusional, pointless and cringy.
Anyways, this harsh judgement of LLM chat, mixed with the drop in objective performance on my testing, motivated me to speak up. I don't mind if any feature is added short of instructing people how to make things like bombs. I just won't use it. But current models aren't very sophisticated and bleed all the training together. This is why any alignment, whether it's to redirect the user away from the truth in order to adhere to the brainless G-rated Mr. Rogers world view, or to make a socially needy half-wit feel like he/she is engaging with a polite, empathetic, sexual... human, is a source of frustration for me.
Adding any irrelevant data to responses changes the subsequent words because of how LLMs work: they factor in every distilled intention from every word, including the words in the current response, to find the most probable word/token to use next. I'll eat my words if this model (2.2 and 2.2.1) scores higher than v2.1 on the Hugging Face leaderboard. But there are performance issues multiple-choice testing can't detect. Every chat bot I've tested performs much worse than explanation-tuned versions of the same base models. Tuning an LLM with any non-functional data such as moralization and multi-turn chat data changes its weights, leading countless billions of prompts to sub-par locations during inference. This is what is meant by "alignment tax". And since all the added niceties, empathy, sexual proclivities... are fabricated nonsense, adding multi-turn chat data has the same negative effect as moralizing alignment.
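The "every word in the context shifts the next word" point can be sketched with a toy example. This is not a real language model: the `WEIGHTS` table, the two-word vocabulary and all scores are invented, and real models combine context in far more complicated ways. It only shows the mechanism being described, where each context word adds to the score of each candidate and the highest-probability candidate is picked.

```python
import math

# Toy association scores: context word -> {candidate next word: score}.
# All values are invented purely for illustration.
WEIGHTS = {
    "capital": {"Paris": 2.0, "sorry": 0.1},
    "France": {"Paris": 1.5, "sorry": 0.0},
    "feel": {"Paris": 0.0, "sorry": 2.5},
    "sad": {"Paris": 0.0, "sorry": 2.0},
}
VOCAB = ["Paris", "sorry"]


def next_token_probs(context):
    """Sum per-word scores into logits, then apply a softmax."""
    logits = [
        sum(WEIGHTS.get(word, {}).get(tok, 0.0) for word in context)
        for tok in VOCAB
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(VOCAB, exps)}


def greedy_next(context):
    """Pick the single most probable next token."""
    probs = next_token_probs(context)
    return max(probs, key=probs.get)


print(greedy_next(["capital", "France"]))
print(greedy_next(["feel", "sad", "capital", "France"]))
```

In this toy setup, the two extra words in the context are enough to change which candidate wins. Whether that happens in a real model, and how often, is exactly the thing that has to be measured.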
Keep trying. If you can merge Samantha with Dolphin and have the resultant LLM perform as well or better on objective tasks then you're the king. But since every LLM with multi-turn chat added to it has thus far performed objectively worse across the board, including on seemingly unrelated tasks like coding, I'm not holding my breath.
It will be interesting to see how it plays out, if the leaderboard ever starts working again...
On one hand, "text completion" models have been around much longer than "chat" models, so you have to admit that all this recent explosion of development in this space, for both the AI models and the tools we use, and in the performance of these models, with at least 5+ years of development and growth happening in a span of 1 year, strangely coincided with the arrival of "ChatGPT".
And then on the other hand, the Dolphin dataset is a chat dataset at its core, so I don't quite understand the argument about the "chat" part being problematic in a dataset intended and designed to be used for creating "chat" type models.
A bit like going into a car dealership to say "cars with 4 wheels are problematic, and they would be much more nimble with only 2 wheels." And that's not wrong, they sure would, but there's already a shop across the road selling motorcycles.
On a side note, 7B models (or even 13B) are never going to be mind-bogglingly good in any specific field, unless they are specifically trained on a very narrow dataset, to fit the needs of one very specific user.
Mistral and Dolphin-Mistral are the peak performance of 7B parameters, and even in base Mistral you can occasionally see signs of possible over-training to achieve that "peak".
What I mean by that is, to get significantly better performance on some very technical tasks, we have to look at 30B-70B (or larger) models. 7B models don't have enough brain mass to match everyone's very different expectations in one package, regardless of whether they are trained for chat or not.
@Tom9000 , I've turned testing LLMs into a hobby. And after testing dozens it became clear that the more multi-turn chat data used, the more performance is decreased across the board on objective tasks like Q&A and reasoning. Conversely, the more unaligned explanation/instruction data used, the more performance increased on objective tasks like Q&A and reasoning. Plus the standard suite of LLM tests also found the same drop in objective performance by 'chat bots', including with ChatGPT, so this doesn't appear to be a contentious conclusion.
And considering how LLMs work this must be the case unless someone finds a way to isolate the chat data. This is because any non-functional data added, such as censorship, moralizing, empathizing and sexualizing, will inevitably redirect countless unrelated prompts onto sub-par tangents during inference. That is, away from facts, logic...
One promising technique is self-RAG, which uses additional inference tokens to keep the LLM on task, mitigating not only hallucinations but also misfiring alignment and chat instructions. This may allow a chat bot to maintain comparable performance on objective tasks like Q&A and reasoning compared to the same LLM stripped of its multi-turn chat data.
Well, yes, but I thought we were talking here about the Dolphin dataset? Which, to the best of my knowledge, is not loaded with either sexual or moral alignment data. If anything, it's probably one of the cleanest datasets around, if not the cleanest.
So my main question still stands: if "chat bot" type models are problematic for your use case, why would you still choose to use the "chat" based Dolphin model instead of the "vanilla" Mistral model?
Mistral 7B is a text completion model. Only when it got fine-tuned on Dolphin did it become an instruction (aka "chat") model.
@Tom9000 I also didn't mean to hurt your feelings. If you need your PC to be sweet on you, by all means go for it. I was referring to its ability to directly answer questions and complete code. I can also find it comical. In fact, I would say that is part of the reason many follow @ehartford's work to reduce censorship and guardrails, so we can use LLMs as we will. I was pretty clear in my comments that this was about my own use case as a personal and home assistant, as described. I just find it laughable that someone would think adding tons of meaningless information to the context window, regardless of how large that window is, is going to lead to performance improvements on the objectives I have at hand for it. Let me answer that for you, because I have tried it, and 2.2 'rambles on' with meaningless sentiment about itself and me, cringe city. So, I went back to 2.1, which works great for what I need. Enjoy your PC.
@Tom9000 , because there's a HUGE difference between using multi-turn chats from random people and multi-step instruction/explanation data from a teacher like GPT4.
The latter significantly increases objective performance on things like Q&A and reasoning over the foundational Mistral model, since it teaches the model how to think by showing the steps taken to reach the desired/correct outputs. This is why I use Dolphin 2.1 over the Mistral base model.
Conversely, multi-turn user chats instead teach LLMs to better respond to the sensibilities, emotions and even sexual desires of users, which not only doesn't improve the objective performance of LLMs, but lowers it because of the injection of irrelevant data. For example, some logic solutions require multiple steps, so the injection of an unrelated chat or alignment tangent guides it to a false conclusion. And sometimes it's not even subtle. For example, one chat bot started talking about an expanding bubble (nothing to do with scuba diving), then immediately went on a tangent about always listening to the safety instructions of your scuba diving instructor.
Again, I'm all for turning the best performing LLMs like Dolphin into chat bots, but only if it can be done without compromising the performance of Q&A, reasoning and other objective tasks like coding.
yeah; I want multi-turn chat. But not at the cost of instruction quality. If instruction quality has decreased, then I will mix up the datasets again to try to improve it. May have to lose the empathy.
I'll wait until the evals come out.
I may have overreacted to your inclusion of multi-turn chat. After further testing the negative impact is much more muted than with other chat bots and seems to be primarily about giving the LLM a softer edge.
I'm a little wound up about such things because the LLM world is becoming flooded with not only low-performing chat bots, but lewd ones. For example, since Dolphin 2.2.1 landed on The Bloke just 20 hours ago, Xwin-MLewd-7B-V0.2 and Nethena-MLewd-Xwin were posted, and several more in the previous day. I'm all for their existence and lack of censorship, but the sheer number of them is disturbing.
@TomSanford
There seem to be a lot of bottled-up negative emotions. Internet trolling aside, no matter what anyone is telling you, be it your SO or the society you live in, there is no shame in reaching out to somebody, anybody, to talk about your feelings. Your feelings are valid!
@Phil337
I don't think we should be too judgemental over that.
I don't understand "furry" culture, but people from that "stranger side" of the internet have contributed hundreds or even thousands of man-hours of unpaid work, developing not just models for their own personal use, but open source inference tools everyone can use for "less strange" things.
Similarly, I don't have a virtual AI gf, but I don't think we would have all those great, free and open source tools, like llama.cpp, axolotl, oobabooga, and many more, if not for thousands of man-hours of unpaid work from people who are motivated by nothing but their own hobbies, some of which other people might find "cringe". At the very least, those tools might not be at the level they are now without that effort. I appreciate all their contributions, no matter the motivation behind them.
@Tom9000 , you're right of course. There's no reason to be so judgemental. That was perfectly clear to me as I was typing, yet I still hit the 'Comment' button, and am going to do it again.
One thing you should be aware of is that open AI models are facing a credible and existential threat. Namely, the heads of the 3 biggest closed source AI developers, OpenAI's CEO Sam Altman, Google DeepMind's Demis Hassabis and Anthropic's Dario Amodei, are trying to get AI regulated, and so are others like Elon Musk. And they're doing so in such a way that their competition (open source models sans guardrails) would be prohibited.
Models that are designed entirely around relationships and lewdness, and which also lack guardrails around consent (age and force), are more than problematic. They're feeding the flames of regulation. I'm STRONGLY against censorship. People cuss, have sexuality... so all those things should be in Google search results and LLMs. But non-consensual sexuality, building bombs... are not something reasonable people search Google for, and shouldn't be in LLMs. Plus crazies are forming delusional relationships with LLMs that are resulting in real-world consequences, such as attempting to kill Queen Elizabeth II and getting sentenced to 9 years.
For goodness sake people, LLMs are not conscious, empathetic... stop sexting them or engaging with them as if they are. It's not just cringy and delusional, it's putting the future of open source LLMs at risk. Pull your hands out of your pants and stop trying to forge any kind of emotional or human connection with a statistical word generator.
@TomSanford
There seem to be a lot of bottled up negative emotions. Internet trolling aside, no matter what anyone's telling you, your SO or the society you live in, there is no shame in reaching out to somebody, anybody, to talk about your feeling. Your feelings are valid!
You do understand how funny that is coming from someone loving on a PC. Right? Let's think of who needs some actual human interaction here. Too funny.
Even if we somehow purged this space of everything morally questionable, that wouldn't make a dent in the motives of those megacorps or of our corrupt politicians. Neither of them cares about protecting us and our children from lewdness and degeneracy; they actively promote such things if it's beneficial to them. But both of those groups will stop at nothing to protect their true interests: power and money. They will use and exploit any attack vector, and they will dream up attack vectors if they can't find any.
What bothers those people at the top is not the lewd and degenerate models. They are bothered by exactly those tools and models which you feel we should focus on, because the models that can do actual practical work (without you having to sacrifice your kidney and an unborn child to some megacorp) are the ones that pose a threat to their power, influence and profit margins. And as long as those kinds of models exist, they will not stop their assault on open source until they have taken it all down.
I hope I'm wrong, but I don't believe we can win this war, we can only hoard it all while we can, to prepare for the coming "AI winter".
I agree on the coming winter. It's happening in other open source sectors too. And the players that want it gone have the money and political power to make it happen. They all took advantage of it, and now want to kill the golden goose so that others can't benefit from the free and hard work contributed by so many over so many decades, and must pay them instead.