Large Language Models for Expansion of Spoken Language Understanding Systems to New Languages

Model description

This model is a fine-tuned version of the pre-trained model James-WYang/BigTranslate, adapted for the slot translation task. The fine-tuning procedure and the model adjustments follow the methodology described in our publication: https://arxiv.org/pdf/2404.02588.pdf. The model translates sentences while preserving annotated NLU (Natural Language Understanding) slots, which are marked with simple HTML-like tags.

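The input to the model should be a sentence in which every NLU slot is wrapped in an HTML-like tag named with consecutive letters of the alphabet (<a>, <b>, <c>, and so on). As the examples in this card show, the same tag both opens and closes a slot; there is no separate closing form such as </a>. The model outputs the translated sentence with these annotations preserved.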

Example: "Set the temperature on my <a>thermostat<a> to <b>29 degrees<b>."
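If your NLU data stores slots as character spans rather than inline tags, the tagged input can be built programmatically. The following is a minimal sketch, not part of the released code: the helper name `tag_slots` and the `(start, end)` span format are assumptions made for illustration.

```python
import string


def tag_slots(text, spans):
    """Wrap each (start, end) character span in <a>, <b>, ... tags.

    Hypothetical helper: letters are assigned in order of appearance,
    and the same tag is used to open and close a slot, as in the
    example above.
    """
    if len(spans) > len(string.ascii_lowercase):
        raise ValueError("at most 26 slots can be tagged with single letters")

    pieces = []
    cursor = 0
    for letter, (start, end) in zip(string.ascii_lowercase, sorted(spans)):
        if start < cursor or end < start:
            raise ValueError("slot spans must be non-overlapping")
        tag = f"<{letter}>"
        pieces.append(text[cursor:start])               # text before the slot
        pieces.append(f"{tag}{text[start:end]}{tag}")   # the tagged slot
        cursor = end
    pieces.append(text[cursor:])                        # trailing text
    return "".join(pieces)


sentence = "Set the temperature on my thermostat to 29 degrees."
device_start = sentence.index("thermostat")
value_start = sentence.index("29 degrees")
tagged = tag_slots(
    sentence,
    [(device_start, device_start + len("thermostat")),
     (value_start, value_start + len("29 degrees"))],
)
print(tagged)
```

Keeping a record of which letter corresponds to which slot label (for example, a = device, b = value) lets you restore the labels after translation.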

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Language names in Chinese, as used in the BigTranslate prompt template.
BIGTRANSLATE_LANG_TABLE = {
    "zh": "汉语",
    "es": "西班牙语",
    "fr": "法语",
    "de": "德语",
    "hi": "印地语",
    "pt": "葡萄牙语",
    "tr": "土耳其语",
    "en": "英语",
    "ja": "日语"
}


def get_prompt(src_lang, tgt_lang, src_sentence):
    # Instruction (in Chinese): "Please translate the following {src} sentence into {tgt}".
    translate_instruct = f"请将以下{BIGTRANSLATE_LANG_TABLE[src_lang]}句子翻译成{BIGTRANSLATE_LANG_TABLE[tgt_lang]}{src_sentence}"
    return (
        "以下是一个描述任务的指令,请写一个完成该指令的适当回复。\n\n"
        f"### 指令:\n{translate_instruct}\n\n### 回复:"
    )


def translate(input_text, src_lang, tgt_lang):
    prompt = get_prompt(src_lang, tgt_lang, input_text)
    # Keep the inputs on the same device as the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    generated_tokens = model.generate(**inputs, max_new_tokens=256)[0]

    # The decoded text begins with the prompt; drop it and return only the reply.
    return tokenizer.decode(generated_tokens, skip_special_tokens=True)[len(prompt):]


# The checkpoint is stored in FP16, so load it in half precision.
model = AutoModelForCausalLM.from_pretrained("Samsung/BigTranslateSlotTranslator", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("Samsung/BigTranslateSlotTranslator")

translation = translate("set the temperature on my <a>thermostat<a> to <b> 29 degrees <b>", "en", "de")
# translation: stell die temperatur auf meinem <a> thermostat <a> auf <b> 29 grad <b>
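To use the translation as NLU training data, the slot values usually need to be pulled back out of the tagged output. The snippet below is a minimal sketch under the assumption that the model reproduces each tag as a matching pair, as in the sample translation above; `extract_slots` is a hypothetical helper, not part of the released code.

```python
import re

# A slot is an opening tag, its content, and the same tag repeated, e.g. "<a> thermostat <a>".
SLOT_PATTERN = re.compile(r"<([a-z])>(.*?)<\1>")


def extract_slots(tagged_text):
    """Return (plain_text, slots), where slots maps tag letter -> slot value."""
    slots = {
        match.group(1): match.group(2).strip()
        for match in SLOT_PATTERN.finditer(tagged_text)
    }
    # Replace every tagged span with its bare content, then normalize whitespace.
    plain = SLOT_PATTERN.sub(lambda match: match.group(2).strip(), tagged_text)
    plain = " ".join(plain.split())
    return plain, slots


output = "stell die temperatur auf meinem <a> thermostat <a> auf <b> 29 grad <b>"
plain_text, slot_values = extract_slots(output)
print(plain_text)
print(slot_values)
```

In practice it is worth checking that the set of tag letters in the output matches the set in the input, and discarding or re-translating sentences where a tag was dropped or duplicated.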

Model fine-tuning code

https://github.com/Samsung/MT-LLM-NLU/tree/main/BigTranslateFineTuning

Model size: 13.2B params (Safetensors, tensor type FP16)