Notes on Fine-Tuning an LLM with SFT + LoRA on Mistral-7B

These notes use the 英博云 (Yingbo Cloud) workbench for cloud training tests; new users get 50 free credits to try it out.

If budget allows, create an H100 instance; I'm using a 4090.

Open a Jupyter terminal and install the dependencies:

pip install datasets peft trl accelerate
  • datasets: used to load the training data

  • peft (Parameter-Efficient Fine-Tuning): an open-source library from Hugging Face; the LoRA implementation lives here

  • trl: makes supervised instruction fine-tuning (SFT) more convenient

  • accelerate: manages multi-GPU and mixed-precision training

Once the packages are installed, run the code below to point at the model:

model_path = "/public/huggingface-models/mistralai/Mistral-7B-Instruct-v0.1"

Load the model:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    use_fast=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto"
)
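
One note: transformers typically loads weights in float32 by default, which for a 7B model is roughly 28 GB and may not fit entirely on a 24 GB card like the 4090 (device_map="auto" will offload the rest to CPU and slow things down). A hedged variant that loads in half precision instead:

import torch
from transformers import AutoModelForCausalLM

# Half precision roughly halves the footprint (~14 GB for a 7B model).
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)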

Test the Q&A capability:

messages = [{"role": "user", "content": "Who are you?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Why fine-tune the model at all? My own take: once the input data volume gets large enough, an ordinary knowledge base can no longer support vector retrieval well, or retrieval simply becomes too slow, so a fine-tuned model is used to make up for what retrieval can't deliver.

LoRA

Why use LoRA? Without it, full-parameter fine-tuning updates every weight in the model, so the compute cost and training time are very high; for something as light as injecting a style, that much training isn't necessary.

So how does LoRA train? Think of the model's weights as a matrix W. Without changing the original model structure, we add an extra trainable term on top of it so that it can adjust the output.

If that extra term were a full matrix of the same size, it would still have a huge number of parameters, so instead we give the model a small learnable adapter.

Freeze the base model and train only a small pair of add-on matrices:

y = Wx  →  y = (W + BA)x

where B and A are low-rank matrices (rank r much smaller than the hidden size), so BA has far fewer trainable parameters than W.

LoRA is relatively simple to implement. We can view it as a modified forward pass of a fully connected layer in the LLM. In pseudocode it looks like this:

import math
import torch
import torch.nn as nn

input_dim = 768   # e.g., the hidden size of the pre-trained model
output_dim = 768  # e.g., the output size of the layer
rank = 8          # the rank 'r' for the low-rank adaptation
alpha = 1.0       # scaling factor (real implementations often use alpha / r)

W = ...  # from the pretrained network, with shape input_dim x output_dim

W_A = nn.Parameter(torch.empty(input_dim, rank))   # LoRA weight A
W_B = nn.Parameter(torch.empty(rank, output_dim))  # LoRA weight B

# Initialization of LoRA weights
nn.init.kaiming_uniform_(W_A, a=math.sqrt(5))
nn.init.zeros_(W_B)

def regular_forward_matmul(x, W):
    h = x @ W
    return h

def lora_forward_matmul(x, W, W_A, W_B):
    h = x @ W                     # regular matrix multiplication
    h += x @ (W_A @ W_B) * alpha  # add the scaled LoRA update
    return h
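
As a quick sanity check (with made-up shapes, and a random stand-in for the pretrained W), note that because W_B starts at zero, the LoRA branch contributes nothing at initialization, so the adapted forward pass initially matches the original:

x = torch.randn(1, input_dim)           # a dummy input
W = torch.randn(input_dim, output_dim)  # stand-in for the pretrained weight

h_base = regular_forward_matmul(x, W)
h_lora = lora_forward_matmul(x, W, W_A, W_B)
print(torch.allclose(h_base, h_lora))   # True: W_B is zero-initialized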

Load the training dataset

The 英博云 platform ships with some built-in training datasets that can be used directly, so I pick a medical QA dataset for testing.

from datasets import load_from_disk, load_dataset

ds = load_dataset("./eb-public/huggingface-datasets/lavita/medical-qa-shared-task-v1-half")
ds

Print a few rows to inspect the dataset:

import pandas as pd
df_sample = pd.DataFrame(ds["train"][:5])
print(df_sample)

Check the values of the label field:

unique_ids = ds["train"].unique("label")
print(unique_ids[:5])
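
To see whether the correct answers are spread evenly across the five options (assuming label is the 0-4 index of the correct ending), a quick count works:

import pandas as pd

label_counts = pd.Series(ds["train"]["label"]).value_counts().sort_index()
print(label_counts)  # how often each option index (0-4) is the answer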

Add a custom prompt template for training. Since this is a medical knowledge base, my prompt looks like this:

def format_example2(example):
    options = [
        example["ending0"],
        example["ending1"],
        example["ending2"],
        example["ending3"],
        example["ending4"],
    ]

    labels = ["A", "B", "C", "D", "E"]
    correct_idx = example["label"]

    text = f"""### Instruction:
你是一名医学专家,请根据病例信息选择最合理的答案。

### Question:
{example['sent1']} {example['sent2']}

### Options:
A. {options[0]}
B. {options[1]}
C. {options[2]}
D. {options[3]}
E. {options[4]}

### Answer:
{labels[correct_idx]}. {options[correct_idx]}
"""
    return {"text": text}
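
To sanity-check the template, format the first training example and print it; this is just a quick look, not part of the training pipeline:

sample = format_example2(ds["train"][0])
print(sample["text"])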

If you're curious about the medical data itself, here's what the sample print looks like:

     id                                        ending0  \
0 2622 Pallor, cyanosis, and erythema of the hands
1 1754 CGG
2 3718 Release of vascular endothelial growth factor
3 9107 Left-sided heart failure
4 1838 Increase oral hydration and fiber intake

ending1 \
0 Calcium deposits on digits
1 GAA
2 Cellular retention of sodium
3 Coronary artery disease
4 Check the stool for fecal red blood cells and ...

ending2 \
0 Blanching vascular abnormalities
1 CAG
2 Breakdown of endothelial tight junctions
3 Liver disease
4 Perform a stool culture

ending3 ending4 \
0 Hypercoagulable state Heartburn and regurgitation
1 CTG GCC
2 Degranulation of eosinophils Increased hydrostatic pressure
3 Budd-chiari syndrome Cor pulmonale
4 Begin treatment with ciprofloxacin Begin cognitive behavioral therapy

label sent1 \
0 3 A 35-year-old woman comes to your office with ...
1 1 An 8-year-old boy is brought to the pediatrici...
2 2 A 36-year-old man is brought to the emergency ...
3 4 A 35-year-old woman presents to the ER with sh...
4 4 A 5-year-old boy is brought in by his parents ...

sent2 \
0 All of the following symptoms and signs would ...
1 Which of the following trinucleotide repeats i...
2 Which of the following is most likely the prim...
3 What is the most likely diagnosis?
4 After a conversation with the child exploring ...

startphrase
0 A 35-year-old woman comes to your office with ...
1 An 8-year-old boy is brought to the pediatrici...
2 A 36-year-old man is brought to the emergency ...
3 A 35-year-old woman presents to the ER with sh...
4 A 5-year-old boy is brought in by his parents ...

After mapping this formatting function over the dataset, each example is returned as a single text field in the format above.

Load the LoRA model

target_modules lists the sub-modules whose weights get LoRA adapters; bias="none" means the bias parameters are not trained.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

You can inspect the model structure:

model

If you're using the same model as me, the output should look like this:

PeftModelForCausalLM(
(base_model): LoraModel(
(model): PeftModelForCausalLM(
(base_model): LoraModel(
(model): MistralForCausalLM(
(model): MistralModel(
(embed_tokens): Embedding(32000, 4096)
(layers): ModuleList(
(0-31): 32 x MistralDecoderLayer(
(self_attn): MistralAttention(
(q_proj): lora.Linear(
(base_layer): Linear(in_features=4096, out_features=4096, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=16, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=16, out_features=4096, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(k_proj): lora.Linear(
(base_layer): Linear(in_features=4096, out_features=1024, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=16, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=16, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(v_proj): lora.Linear(
(base_layer): Linear(in_features=4096, out_features=1024, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=16, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=16, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(o_proj): lora.Linear(
(base_layer): Linear(in_features=4096, out_features=4096, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=16, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=16, out_features=4096, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
)
(mlp): MistralMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLUActivation()
)
(input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
(post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
)
)
(norm): MistralRMSNorm((4096,), eps=1e-05)
(rotary_emb): MistralRotaryEmbedding()
)
(lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
)
)
)
)

Check how many parameters are trainable:

model.print_trainable_parameters()

Output: trainable params: 13,631,488 || all params: 7,255,363,584 || trainable%: 0.1879
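
That 13,631,488 can be verified by hand: each LoRA adapter adds r × (d_in + d_out) parameters, and adapters are attached to q/k/v/o in all 32 decoder layers (dimensions taken from the model structure printed above):

r = 16
num_layers = 32
# (d_in, d_out) for q_proj, k_proj, v_proj, o_proj in Mistral-7B
dims = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]

per_layer = sum(r * (d_in + d_out) for d_in, d_out in dims)
print(per_layer * num_layers)  # 13631488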

Training arguments (the effective batch size is per_device_train_batch_size × gradient_accumulation_steps = 4 × 4 = 16):

from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./medical-qa-shared-task-v1-half",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    fp16=True,
    logging_steps=20,
    save_steps=500,
    save_total_limit=2,
    report_to="none"
)

Finally, map the prompt template over the dataset and build the trainer:

train_dataset = ds["train"].map(
    format_example2,
    remove_columns=ds["train"].column_names
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    args=training_args
)

You can ask an AI assistant what each of these parameters does; the defaults are fine to start with.

Start training:

trainer.train()

Once training finishes, save the adapter so the model can be loaded later for chat:

model.save_pretrained("./your-output-dir")  # the output directory you just trained into
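
To load the adapter back and chat with it, something like the sketch below should work; adapter_dir stands for whatever output directory you actually saved to, and I simply reuse one formatted training example (with the answer stripped off) as a smoke-test prompt:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

adapter_dir = "./your-output-dir"  # hypothetical: the directory used in save_pretrained

base = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_dir)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Take a formatted example and cut it off right before the answer.
prompt = train_dataset[0]["text"].split("### Answer:")[0] + "### Answer:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))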

QA

GPU memory (VRAM) issues

If you hit a CUDA out-of-memory error, it means GPU memory has been exhausted. Taking a 4090 (24 GB) as the reference configuration, I shrink the LoRA setup:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,             # lower the rank
    lora_alpha=16,   # scale alpha down to match
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # keep only the core projections
    bias="none",
    task_type="CAUSAL_LM"
)
model_4090 = get_peft_model(model, lora_config)

To save even more VRAM:

lora_config = LoraConfig(
    r=4,
    lora_alpha=8,
    target_modules=["q_proj", "v_proj"],
)

Check your current VRAM usage:

import torch
print(torch.cuda.memory_allocated() / 1024**3, "GB")
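
For OOM debugging, the peak allocation since start-up is usually more telling than the current allocation:

print(torch.cuda.max_memory_allocated() / 1024**3, "GB peak allocated")
print(torch.cuda.memory_reserved() / 1024**3, "GB reserved by the caching allocator")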

Loading the base model in 4-bit with bitsandbytes (QLoRA-style) cuts the memory footprint much further:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto"
)


Before wrapping with LoRA, also enable gradient checkpointing and disable the KV cache:

model.gradient_checkpointing_enable()
model.config.use_cache = False
model = get_peft_model(model, lora_config)
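
When fine-tuning on top of a 4-bit base model, peft also ships prepare_model_for_kbit_training, which is commonly called before attaching the adapters; a sketch of the order I'd expect:

from peft import prepare_model_for_kbit_training, get_peft_model

model = prepare_model_for_kbit_training(model)  # sets up casting/gradient hooks for k-bit training
model = get_peft_model(model, lora_config)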