Tutorial: Finetuning Language Models

This notebook will allow you to try out finetuning of the munin-7b-alpha model or, indeed, any other generative model out there.

We'll be finetuning the model on a Danish-translated instruction-tuning dataset using the QLoRA method.

Install Dependencies

# Uncomment to install packages (already done for you)
# %pip install --upgrade --force-reinstall --no-cache-dir torch==2.1.1 triton --index-url https://download.pytorch.org/whl/cu121
# %pip install "unsloth[cu121_ampere_torch211] @ git+https://github.com/unslothai/unsloth.git"
# General packages
import torch
import getpass

# For loading the finetuning datasets
from datasets import load_dataset

# For loading and finetuning the models
from unsloth import FastLanguageModel
from trl import SFTTrainer, setup_chat_format
from transformers import TrainingArguments, AutoTokenizer, TextStreamer, GenerationConfig

Get Hugging Face Token

To allow finetuning of gated models (like LLaMA-2) and to upload your finetuned models, you can put your Hugging Face token in the cell below.

You can generate a token at https://hf.co/settings/tokens.

If you don't want to supply a token then simply leave it blank!

HUGGING_FACE_TOKEN = getpass.getpass("Hugging Face Token: ")
if not HUGGING_FACE_TOKEN:
    print("Not using a Hugging Face token.")
    HUGGING_FACE_TOKEN = None
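
If you did supply a token, you can optionally check that it is valid - a small sanity check using the huggingface_hub library (installed as a dependency of transformers):

# Optional: verify the token by asking the Hub who we are
from huggingface_hub import whoami
if HUGGING_FACE_TOKEN is not None:
    print(whoami(token=HUGGING_FACE_TOKEN)["name"])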

Configure the Model

RANDOM_SEED = 42

MODEL_CONFIGURATION = dict(
    model_name="danish-foundation-models/munin-7b-alpha",
    max_seq_length=2048,  # Maximum sequence length used during finetuning
    dtype=None,  # None for auto-detection; float16 for Tesla T4/V100, bfloat16 for Ampere+ GPUs
    load_in_4bit=True,  # Use 4bit quantisation to reduce memory usage. Quantises on the fly, so can take a while.
    attn_implementation="flash_attention_2"
)
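
If you are unsure which dtype your GPU supports (and whether to override the automatic detection), you can check it directly:

# Check the GPU name and whether it supports bfloat16
print(torch.cuda.get_device_name(0))
print(f"bfloat16 supported: {torch.cuda.is_bf16_supported()}")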

PEFT_CONFIGURATION = dict(
    r = 16,  # Adapter rank; any number > 0 works, but 8, 16, 32, 64 or 128 are suggested
    target_modules=[
        "q_proj", 
        "k_proj", 
        "v_proj", 
        "o_proj", 
        "gate_proj", 
        "up_proj", 
        "down_proj",
    ],
    lora_alpha = 16,
    lora_dropout = 0,  # Supports any, but = 0 is optimized
    bias = "none",  # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    use_rslora = False,  # Supports rank-stabilised LoRA
    loftq_config = None,  # LoftQ configuration (None disables it)
    random_state = RANDOM_SEED,
)

FINETUNING_CONFIGURATION = dict(
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    warmup_steps=5,
    num_train_epochs=1,
    learning_rate=2e-4,
    weight_decay=0.01,
    lr_scheduler_type="linear",
)

Load the Model

model, tokenizer = FastLanguageModel.from_pretrained(**MODEL_CONFIGURATION, token=HUGGING_FACE_TOKEN)
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)
model = FastLanguageModel.get_peft_model(model, **PEFT_CONFIGURATION)
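
As a sanity check, you can see how few parameters the LoRA adapters actually make trainable - assuming the model returned by unsloth exposes peft's standard helper:

# Print the trainable (adapter) parameter count vs. the total parameter count
model.print_trainable_parameters()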

Load and Prepare Data

Load the dataset from Hugging Face Hub:

dataset = load_dataset("kobprof/skolegpt-instruct", split="train")
print(f"Number of samples in dataset: {len(dataset):,}")

We just take a random subset; 1,000 samples should take around 7 minutes to finetune on this machine, depending on settings.

n_samples = 1000
dataset = dataset.shuffle(seed=RANDOM_SEED).select(range(n_samples))

Lastly, we convert the samples in the dataset into the standard ChatML format.

def create_conversation(sample: dict) -> dict[str, list[dict[str, str]]]:
    """This converts the sample to the standardised ChatML format.

    Args:
        sample:
            The data sample.

    Returns:
        The sample set up in the ChatML format.
    """
    return {
        "messages": [
            {"role": "system", "content": sample["system_prompt"]},
            {"role": "user", "content": sample["question"]},
            {"role": "assistant", "content": sample["response"]}
        ]
    }

dataset = dataset.map(create_conversation, batched=False)
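
To see what the ChatML format looks like once the tokenizer's chat template has been applied, you can render one of the converted samples as plain text:

# Render the first sample with the chat template (text only, no tokenisation)
print(tokenizer.apply_chat_template(dataset[0]["messages"], tokenize=False))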

Finetune!

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=MODEL_CONFIGURATION["max_seq_length"],
    dataset_num_proc=4,
    packing=True,  # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        optim="adamw_8bit",
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=3,
        seed=RANDOM_SEED,
        output_dir="outputs",
        **FINETUNING_CONFIGURATION
    ),
)
# Log some GPU stats before we start the finetuning
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(
    f"You're using the {gpu_stats.name} GPU, which has {max_memory:.2f} GB of memory "
    f"in total, of which {start_gpu_memory:.2f}GB has been reserved already."
)
# This is where the actual finetuning is happening
trainer_stats = trainer.train()
# Log some post-training GPU statistics
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(
    f"We ended up using {used_memory:.2f} GB GPU memory ({used_percentage:.2f}%), "
    f"of which {used_memory_for_lora:.2f} GB ({lora_percentage:.2f}%) "
    "was used for LoRa."
)
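
The object returned by trainer.train() also contains some basic statistics about the run, for instance the total training time:

# trainer_stats.metrics holds timing and loss statistics from the run
print(f"Training took {trainer_stats.metrics['train_runtime']:.0f} seconds.")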

Try it Out

Time to try out the new finetuned model. First we need to set up how to generate text with it.

You can leave the following config as-is, or you can experiment. A full list of the available arguments can be found in the Hugging Face documentation for GenerationConfig.

GENERATION_CONFIG = GenerationConfig(
    # What should be outputted
    max_new_tokens=256, 

    # Controlling how the model chooses the next token to generate
    do_sample=True, 
    temperature=0.2, 
    repetition_penalty=1.2,
    top_k=50,
    top_p=0.95,

    # Miscellaneous required settings
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    use_cache=False,  # Required by unsloth
)

Let's use a TextStreamer for continuous inference, so you can see the generation token by token instead of waiting for the whole completion!

messages = [
    dict(
        role="system",
        content=""  # Change this to anything you want
    ),
    dict(
        role="user",
        content="Hvad synes du om Danish Foundation Models projektet? Skriv kortfattet."  # And change this too
    ),
]

outputs = model.generate(
    input_ids=tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda"),
    streamer=TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True),
    generation_config=GENERATION_CONFIG,
)

Share the Model

You can share your new model on the Hugging Face Hub - this requires that you've included your Hugging Face token at the top of this notebook.

# model.push_to_hub("your_name/qlora_model", token=HUGGING_FACE_TOKEN)
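
If you push the model, it is usually a good idea to push the tokenizer as well, so that the model can be loaded directly from the Hub (the repository name is just a placeholder):

# tokenizer.push_to_hub("your_name/qlora_model", token=HUGGING_FACE_TOKEN)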

Extra: Export Model to Other Frameworks

Saving to float16 for vLLM

The popular inference framework vLLM can take advantage of having a model available in lower precision, enabling faster inference times.

You can uncomment the following lines if you want to save the model in 16-bit or even 4-bit precision:

# Merge to 16bit
# model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit",)
# model.push_to_hub_merged("hf/model", tokenizer, save_method="merged_16bit", token=HUGGING_FACE_TOKEN)

# Merge to 4bit
# model.save_pretrained_merged("model", tokenizer, save_method="merged_4bit",)
# model.push_to_hub_merged("hf/model", tokenizer, save_method="merged_4bit", token=HUGGING_FACE_TOKEN)
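
Once a merged 16-bit model has been pushed to the Hub, loading it with vLLM looks roughly like this - a minimal sketch (not run in this notebook), using the placeholder repository name from above:

# Sketch: serve the merged model with vLLM ("hf/model" is a placeholder)
# from vllm import LLM, SamplingParams
# llm = LLM(model="hf/model", dtype="float16")
# outputs = llm.generate(["Hvad er QLoRA?"], SamplingParams(max_tokens=128))
# print(outputs[0].outputs[0].text)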

Alternatively, you can save only the LoRA adapter weights, which are very lightweight, but which require the base model to be loaded alongside them:

# Just LoRA adapters
# model.save_pretrained_merged("model", tokenizer, save_method="lora",)
# model.push_to_hub_merged("hf/model", tokenizer, save_method="lora", token=HUGGING_FACE_TOKEN)
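
To use the adapter-only weights later, you first load the base model and then apply the adapter on top - a sketch using the peft library, with placeholder repository names:

# Sketch: load the base model and apply the LoRA adapter with peft
# from transformers import AutoModelForCausalLM
# from peft import PeftModel
# base_model = AutoModelForCausalLM.from_pretrained("danish-foundation-models/munin-7b-alpha")
# finetuned_model = PeftModel.from_pretrained(base_model, "your_name/qlora_model")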

GGUF / llama.cpp Conversion

You can also save the model in the popular GGUF format used by llama.cpp, by uncommenting any of the following:

# Save to 8bit Q8_0
# model.save_pretrained_gguf("model", tokenizer)
# model.push_to_hub_gguf("hf/model", tokenizer, token=HUGGING_FACE_TOKEN)

# Save to 16bit GGUF
# model.save_pretrained_gguf("model", tokenizer, quantization_method="f16")
# model.push_to_hub_gguf("hf/model", tokenizer, quantization_method="f16", token=HUGGING_FACE_TOKEN)

# Save to q4_k_m GGUF
# model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")
# model.push_to_hub_gguf("hf/model", tokenizer, quantization_method="q4_k_m", token=HUGGING_FACE_TOKEN)

Now, use the model-unsloth.gguf or model-unsloth-Q4_K_M.gguf file in llama.cpp or a UI-based system like GPT4All. You can install GPT4All from its website.
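
If you prefer to stay in Python, the exported GGUF file can also be loaded with the llama-cpp-python bindings - a minimal sketch, assuming the package is installed and the file name matches your export:

# Sketch: run the exported GGUF file via llama-cpp-python (pip install llama-cpp-python)
# from llama_cpp import Llama
# llm = Llama(model_path="model-unsloth-Q4_K_M.gguf")
# print(llm("Hvad er QLoRA?", max_tokens=128)["choices"][0]["text"])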