Can We Train Chat Models with Raw Data?

Community Article Published April 25, 2024

The idea of training a chat model with desired raw data is incredibly appealing.

However, there is a significant problem with this process. Directly training a chat model with raw data can disrupt its output format.

To solve this issue, the common approach is to create Q/A-formatted datasets. However, this method is time-consuming, costly, and may result in information loss or bias during dataset creation.

So, how can we train effectively on raw data? We can exploit the sequential structure of transformer models like Llama, which consist of a stack of layers.

I intentionally make the last layers of the model responsible for handling the output format, and designate the middle-to-late layers as the starting point for raw training.

You may think that the method involves feeding chat data to the later layers and then training the middle to late layers with raw data, but that's not the case. Such an approach cannot properly address the problem and may even lead to increased model complexity.

The idea presented above doesn't seem bad, so how can we make good use of it? Let's try using a base model.

The methodology is as follows:

  1. Designate the latter layers of the model as the layers responsible for handling the output format.
  2. Set the middle to late layers as the starting point for raw training.
  3. Based on base model A, create model A-Chat by tuning only the latter layers.
  4. Train model A-Raw by training the middle to late layers of base model A with raw data.
  5. Finally, replace the latter layers of model A-Raw with the latter layers of model A-Chat. This way, model A-Raw undergoes raw data training without being influenced by the chat format, and only the chat portion of model A-Chat is transplanted.
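The steps above can be sketched at the level of parameter names. This is a minimal illustration, not the code used in the experiment: it assumes Llama-style parameter names of the form `model.layers.<index>.…`, uses plain dictionaries with string placeholders in place of real weight tensors, and the helper names (`layer_index`, `names_in_layers`, `transplant`) and the layer boundaries of the toy model are hypothetical.

```python
import re

# Assumption: Llama-style parameter names, e.g. "model.layers.12.mlp.up_proj.weight".
_LAYER_RE = re.compile(r"\blayers\.(\d+)\.")


def layer_index(param_name):
    """Return the decoder layer index found in a parameter name, or None."""
    match = _LAYER_RE.search(param_name)
    return int(match.group(1)) if match else None


def names_in_layers(param_names, first, last):
    """Select the parameter names whose layer index lies in [first, last]."""
    return [
        name for name in param_names
        if (idx := layer_index(name)) is not None and first <= idx <= last
    ]


def transplant(raw_state, chat_state, first_chat_layer):
    """Step 5: take A-Raw and overwrite layers >= first_chat_layer with A-Chat's."""
    merged = dict(raw_state)
    for name, weight in chat_state.items():
        idx = layer_index(name)
        if idx is not None and idx >= first_chat_layer:
            merged[name] = weight
    return merged


# Toy "base model A" with 6 decoder layers; strings stand in for weight tensors.
base = {f"model.layers.{i}.mlp.weight": f"base-{i}" for i in range(6)}
base["model.embed_tokens.weight"] = "base-embed"

# Steps 1 and 3: only the last layers (here 4 and 5) are tuned to get A-Chat.
chat_layers = names_in_layers(base, 4, 5)
a_chat = {n: ("chat" if n in chat_layers else w) for n, w in base.items()}

# Steps 2 and 4: raw training starts at a middle-to-late layer (here 2).
# Assumption: it runs through the last layer; those layers are replaced anyway.
raw_layers = names_in_layers(base, 2, 5)
a_raw = {n: ("raw" if n in raw_layers else w) for n, w in base.items()}

# Step 5: transplant the chat layers of A-Chat into A-Raw.
final = transplant(a_raw, a_chat, first_chat_layer=4)
for name in sorted(final):
    print(name, "->", final[name])
```

In this sketch, parameters outside the decoder layers (such as the embeddings) are left as they are in A-Raw. With real checkpoints the same merge would be applied to the models' state dictionaries instead of these placeholder strings.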

Below is an application of this method to the Shakespeare dataset (https://huggingface.co/datasets/Israhassan/Shakespeare).

Question - Please explain in detail the contents of Shakespeare's play 'A Midsummer Night's Dream'. Note: The original outputs were in Korean (which is more interesting!); the versions below were translated with Claude Opus.

Output from the model with the applied methodology

'A Midsummer Night's Dream' is a comedy written by William Shakespeare, believed to have been written in 1595 or 1596. The play revolves around romantic and supernatural adventures that take place in the forest near Athens. The story begins with four young Athenians rehearsing a play in the forest. Meanwhile, the king of Athens, Oberon, and his fairy queen, Titania, have come to the forest to celebrate their wedding anniversary. However, Oberon and Titania are sworn enemies of the fairy king Oberon and the fairy queen Titania, who are in discord. Oberon feeds Titania a love potion to make her fall in love, and Titania feeds Oberon a love potion to make him fall in love. At this time, Oberon sees the human youth Lysander and decides to make him fall in love. He orders his servant Phibes to feed Lysander the love potion, and Lysander immediately falls in love with the first woman he meets in the forest. Unfortunately, that woman is Hermia, the fiancée of Lysander's friend Demetrius. Demetrius loves Hermia, but Hermia is only interested in Lysander. Meanwhile, Oberon sees Lysander, who has fallen in love with Titania, and feeds Titania a love potion to make her fall in love with Lysander. Now Titania is interested in Lysander, Lysander is interested in Hermia, and Hermia is only interested in Demetrius. In this confusion, Oberon and Titania resolve their discord, Phibes falls in love with the human youth Oberon, and Lysander and Hermia eventually fall in love. Everything is well resolved, Oberon and Titania reconcile their discord, and everyone reaches a happy ending. 'A Midsummer Night's Dream' is one of Shakespeare's most popular works, combining romance, humor, and supernatural elements. Like Shakespeare's other works, this play evokes various interpretations and is still performed worldwide today.

Output from the base model (EEVE 10.8B)

'A Midsummer Night's Dream' is a comedy written by William Shakespeare, based on Greek mythology. The play deals with the conflict between the king of Athens, Demetrius, and his wife Hermia, and their love story with their children, Duquesne and Lysander, and Demetrius' son Robin. The play also features the fairy king Oberon, his wife Titania, and the fairy Pebbs. The story takes place near Athens, and dramatic events unfold. For example, Robin falls in love with a human girl named Helena, and Pebbs marries the fairy Oberon. The play explores themes of the power of love and friendship, and tells the story of the relationships between various characters and the challenges they face.

Output from the chat-tuned model

Yes, Shakespeare's 'A Midsummer Night's Dream' is a comedy that depicts the adventures and misunderstandings of love in an Athens park. The play was written in the 1600s and is still beloved today. The play deals with issues of love, jealousy, and misunderstanding through various events that unfold in the forest. The main characters include the boastful scarecrow, the shimmering starlight fairy Pebbs, and humans in love. The story begins with the scarecrow seeking the help of a wizard to prove his love. However, his plan takes an unexpected turn, and a journey begins to set everything right. The characters in the play each have their own desires and emotions, which sometimes cause conflicts. But in the end, they come to understand each other and realize that love is the most important thing. This play makes us rethink our thoughts on love and relationships. If you have any other questions or concerns, please feel free to ask!

Output from the model with simple raw training after chat tuning (of course, with the late layers frozen)

'A Midsummer Night's Dream' is one of Shakespeare's most famous plays, believed to have been written in 1595 or 1600. The play deals with events that take place in the forest near Athens, centering around the beautiful Helena and Demetrius who loves her. Demetrius loves Helena, but she is smitten with Lucio, who loves the beautiful Hermia. Disappointed that she must marry the timid Demetrius chosen by her father, Hermia flees into the forest. Demetrius follows Helena into the forest, and Lucio enters the forest in search of Hermia. Meanwhile, the fairy king Oberon and his wife Titania are fighting over the rule of the fairy kingdom. Oberon resents Titania for falling in love with a human, and Titania resents Oberon for falling in love with a human. Phibes, the attendant of the fairy king, steals a flower that will make Titania fall asleep on Oberon's orders. The flower will make the first person Titania sees upon waking become her love. Phibes is ordered to give this flower to Demetrius, but gives it to Lucio instead. Lucio gives this flower to Hermia, but Hermia gives it to Demetrius. Demetrius gives this flower to Helena. Helena gives this flower to Quince. Quince gives this flower to Demetrius. Demetrius gives this flower to Helena. Helena gives this flower to Lucio. Lucio gives this flower to Hermia. Hermia gives this flower to Demetrius. Demetrius gives this flower to Quince. Quince gives this flower to Demetrius. Demetrius gives this flower to Helena. Helena gives this flower to Lucio. Lucio gives this flower to Hermia. Hermia gives this flower to Demetrius. Demetrius gives this flower to Quince. Quince gives this flower to Demetrius. Demetrius gives this flower to Helena. Helena gives this flower to Lucio. Lucio gives this flower to Hermia. Hermia gives this flower to Demetrius. Demetrius gives this flower to Quince.

Analysis by Claude Opus

Analyzing the outputs from the four models: The output from the model with the applied methodology very accurately and thoroughly explains the main content and relationships between the characters in the play. It seems to have a good understanding of Shakespeare's original work and summarizes the key points. The output from the base model (EEVE 10.8B) only roughly grasps the content of the play, and there are parts that differ from the original work in terms of character names and relationship descriptions. It appears to lack a detailed understanding. The output from the chat-tuned model shows an effort to explain the play's content in an easy and friendly manner, but there are content errors. It seems to have failed to accurately grasp the main characters and events. The output from the model with simple raw training after chat tuning has repetitive sentences and the content is a mess, indicating a failure to properly understand the play. Therefore, the model that best understood and explained the content of 'A Midsummer Night's Dream' is #1, the model with the applied methodology. The base model and chat-tuned model also have some understanding of the play, but lack in terms of detailed content. In contrast, model #4 failed to grasp the content of the play due to improper training.

This simple experiment was designed and conducted based on empirical intuition rather than theoretical grounds.

After the experiment, I summarized the content to numerically verify if the hypothesis holds. I confirmed that the hypothesis holds in the layer formation of various models, and follow-up experiments are currently in progress. You can check them out in the Notion link below. Link to Notion

P.S. The EEVE Instruct model created during the experiment showed decent performance, so I uploaded it. Only layers 46 and 47 are tuned, so if you want to apply a similar methodology, feel free to try it out. maywell/EEVE-Korean-Instruct-10.8B-Var
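For anyone who wants to try tuning only layers 46 and 47, here is a rough sketch of how the other parameters might be frozen. It is not the training code used for the uploaded model. The helper `freeze_except_layers` is hypothetical, it assumes Llama-style parameter names (`model.layers.<index>.…`), and it assumes 48 decoder layers so that 46 and 47 are the last two. The stand-in objects below only mimic the `requires_grad` attribute; with a PyTorch model you would pass `model.named_parameters()` instead.

```python
import re
from types import SimpleNamespace

# Assumption: Llama-style parameter names, e.g. "model.layers.46.mlp.down_proj.weight".
_LAYER_RE = re.compile(r"\blayers\.(\d+)\.")


def freeze_except_layers(named_parameters, trainable_layers):
    """Mark only parameters in the given decoder layers as trainable.

    `named_parameters` is any iterable of (name, param) pairs where `param`
    has a writable `requires_grad` attribute. Returns the trainable names.
    """
    wanted = set(trainable_layers)
    trainable = []
    for name, param in named_parameters:
        match = _LAYER_RE.search(name)
        # Parameters outside decoder layers (embeddings, lm_head) are frozen too.
        param.requires_grad = bool(match) and int(match.group(1)) in wanted
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Stand-in parameters for a 48-layer model (an assumption for this demo).
params = [
    (f"model.layers.{i}.mlp.down_proj.weight", SimpleNamespace(requires_grad=True))
    for i in range(48)
]
params.append(("lm_head.weight", SimpleNamespace(requires_grad=True)))

tuned = freeze_except_layers(params, [46, 47])
print(tuned)
```

Only the parameters left with `requires_grad` set to true would then be handed to the optimizer for chat tuning.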