Notes on Fine-Tuning an LLM with SFT + LoRA on Mistral-7B

These notes use the 英博云 (Yingbo Cloud) workbench for cloud training tests; new users get 50 free credits to try it out.

If budget allows, create an H100 instance; I'm using a 4090.

Open a Jupyter terminal and install the dependencies:

pip install datasets peft trl accelerate
  • datasets: used to load the training data

  • peft (Parameter-Efficient Fine-Tuning): an open-source library from Hugging Face; the LoRA implementation lives here

  • trl: makes supervised instruction fine-tuning (SFT) more convenient

  • accelerate: manages multi-GPU and mixed-precision training

Once the packages are installed, run the code below to point at the model:

model_path = "/public/huggingface-models/mistralai/Mistral-7B-Instruct-v0.1"

Load the model:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    use_fast=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto"
)
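
One note: transformers typically loads weights in float32 by default, which for a 7B model is roughly 28 GB and may not fit entirely on a 24 GB card like the 4090 (device_map="auto" will offload the rest to CPU and slow things down). A hedged variant that loads in half precision instead:

import torch
from transformers import AutoModelForCausalLM

# Half precision roughly halves the footprint (~14 GB for a 7B model).
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)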

Test the Q&A capability:

messages = [{"role": "user", "content": "Who are you?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Why fine-tune the model at all? My own take: once the input data volume gets large enough, an ordinary knowledge base can no longer support vector retrieval well, or retrieval simply becomes too slow, so a fine-tuned model is used to make up for what retrieval can't deliver.

LoRA

Why use LoRA? Without it, full-parameter fine-tuning updates every weight in the model, so the compute cost and training time are very high; for something as light as injecting a style, that much training isn't necessary.

So how does LoRA train? Think of the model's weights as a matrix W. Without changing the original model structure, we add an extra trainable term on top of it so that it can adjust the output.

If that extra term were a full matrix of the same size, it would still have a huge number of parameters, so instead we give the model a small learnable adapter.

Freeze the base model and train only a small pair of add-on matrices:

y = Wx  →  y = (W + BA)x

where B and A are low-rank matrices (rank r much smaller than the hidden size), so BA has far fewer trainable parameters than W.

LoRA is relatively simple to implement. We can view it as a modified forward pass of a fully connected layer in the LLM. In pseudocode it looks like this:

import math
import torch
import torch.nn as nn

input_dim = 768   # e.g., the hidden size of the pre-trained model
output_dim = 768  # e.g., the output size of the layer
rank = 8          # the rank 'r' for the low-rank adaptation
alpha = 1.0       # scaling factor (real implementations often use alpha / r)

W = ...  # from the pretrained network, with shape input_dim x output_dim

W_A = nn.Parameter(torch.empty(input_dim, rank))   # LoRA weight A
W_B = nn.Parameter(torch.empty(rank, output_dim))  # LoRA weight B

# Initialization of LoRA weights
nn.init.kaiming_uniform_(W_A, a=math.sqrt(5))
nn.init.zeros_(W_B)

def regular_forward_matmul(x, W):
    h = x @ W
    return h

def lora_forward_matmul(x, W, W_A, W_B):
    h = x @ W                     # regular matrix multiplication
    h += x @ (W_A @ W_B) * alpha  # add the scaled LoRA update
    return h
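
As a quick sanity check (with made-up shapes, and a random stand-in for the pretrained W), note that because W_B starts at zero, the LoRA branch contributes nothing at initialization, so the adapted forward pass initially matches the original:

x = torch.randn(1, input_dim)           # a dummy input
W = torch.randn(input_dim, output_dim)  # stand-in for the pretrained weight

h_base = regular_forward_matmul(x, W)
h_lora = lora_forward_matmul(x, W, W_A, W_B)
print(torch.allclose(h_base, h_lora))   # True: W_B is zero-initialized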

Load the training dataset

The 英博云 platform ships with some built-in training datasets that can be used directly, so I pick a medical QA dataset for testing.

from datasets import load_from_disk, load_dataset

ds = load_dataset("./eb-public/huggingface-datasets/lavita/medical-qa-shared-task-v1-half")
ds

Print a few rows to inspect the dataset:

import pandas as pd
df_sample = pd.DataFrame(ds["train"][:5])
print(df_sample)

Check the values of the label field:

unique_ids = ds["train"].unique("label")
print(unique_ids[:5])
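
To see whether the correct answers are spread evenly across the five options (assuming label is the 0-4 index of the correct ending), a quick count works:

import pandas as pd

label_counts = pd.Series(ds["train"]["label"]).value_counts().sort_index()
print(label_counts)  # how often each option index (0-4) is the answer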

Add a custom prompt template for training. Since this is a medical knowledge base, my prompt looks like this:

def format_example2(example):
    options = [
        example["ending0"],
        example["ending1"],
        example["ending2"],
        example["ending3"],
        example["ending4"],
    ]

    labels = ["A", "B", "C", "D", "E"]
    correct_idx = example["label"]

    text = f"""### Instruction:
你是一名医学专家,请根据病例信息选择最合理的答案。

### Question:
{example['sent1']} {example['sent2']}

### Options:
A. {options[0]}
B. {options[1]}
C. {options[2]}
D. {options[3]}
E. {options[4]}

### Answer:
{labels[correct_idx]}. {options[correct_idx]}
"""
    return {"text": text}
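
To sanity-check the template, format the first training example and print it; this is just a quick look, not part of the training pipeline:

sample = format_example2(ds["train"][0])
print(sample["text"])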

If you're curious about the medical data itself, here's what the sample print looks like:

     id                                        ending0  \
0 2622 Pallor, cyanosis, and erythema of the hands
1 1754 CGG
2 3718 Release of vascular endothelial growth factor
3 9107 Left-sided heart failure
4 1838 Increase oral hydration and fiber intake

ending1 \
0 Calcium deposits on digits
1 GAA
2 Cellular retention of sodium
3 Coronary artery disease
4 Check the stool for fecal red blood cells and ...

ending2 \
0 Blanching vascular abnormalities
1 CAG
2 Breakdown of endothelial tight junctions
3 Liver disease
4 Perform a stool culture

ending3 ending4 \
0 Hypercoagulable state Heartburn and regurgitation
1 CTG GCC
2 Degranulation of eosinophils Increased hydrostatic pressure
3 Budd-chiari syndrome Cor pulmonale
4 Begin treatment with ciprofloxacin Begin cognitive behavioral therapy

label sent1 \
0 3 A 35-year-old woman comes to your office with ...
1 1 An 8-year-old boy is brought to the pediatrici...
2 2 A 36-year-old man is brought to the emergency ...
3 4 A 35-year-old woman presents to the ER with sh...
4 4 A 5-year-old boy is brought in by his parents ...

sent2 \
0 All of the following symptoms and signs would ...
1 Which of the following trinucleotide repeats i...
2 Which of the following is most likely the prim...
3 What is the most likely diagnosis?
4 After a conversation with the child exploring ...

startphrase
0 A 35-year-old woman comes to your office with ...
1 An 8-year-old boy is brought to the pediatrici...
2 A 36-year-old man is brought to the emergency ...
3 A 35-year-old woman presents to the ER with sh...
4 A 5-year-old boy is brought in by his parents ...

After mapping this formatting function over the dataset, each example is returned as a single text field in the format above.

Load the LoRA model

target_modules lists the sub-modules whose weights get LoRA adapters; bias="none" means the bias parameters are not trained.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

You can inspect the model structure:

model

If you're using the same model as me, the output should look like this:

PeftModelForCausalLM(
(base_model): LoraModel(
(model): PeftModelForCausalLM(
(base_model): LoraModel(
(model): MistralForCausalLM(
(model): MistralModel(
(embed_tokens): Embedding(32000, 4096)
(layers): ModuleList(
(0-31): 32 x MistralDecoderLayer(
(self_attn): MistralAttention(
(q_proj): lora.Linear(
(base_layer): Linear(in_features=4096, out_features=4096, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=16, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=16, out_features=4096, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(k_proj): lora.Linear(
(base_layer): Linear(in_features=4096, out_features=1024, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=16, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=16, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(v_proj): lora.Linear(
(base_layer): Linear(in_features=4096, out_features=1024, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=16, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=16, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(o_proj): lora.Linear(
(base_layer): Linear(in_features=4096, out_features=4096, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=16, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=16, out_features=4096, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
)
(mlp): MistralMLP(
(gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
(up_proj): Linear(in_features=4096, out_features=14336, bias=False)
(down_proj): Linear(in_features=14336, out_features=4096, bias=False)
(act_fn): SiLUActivation()
)
(input_layernorm): MistralRMSNorm((4096,), eps=1e-05)
(post_attention_layernorm): MistralRMSNorm((4096,), eps=1e-05)
)
)
(norm): MistralRMSNorm((4096,), eps=1e-05)
(rotary_emb): MistralRotaryEmbedding()
)
(lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
)
)
)
)

Check how many parameters are trainable:

model.print_trainable_parameters()

Output: trainable params: 13,631,488 || all params: 7,255,363,584 || trainable%: 0.1879
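
That 13,631,488 can be verified by hand: each LoRA adapter adds r × (d_in + d_out) parameters, and adapters are attached to q/k/v/o in all 32 decoder layers (dimensions taken from the model structure printed above):

r = 16
num_layers = 32
# (d_in, d_out) for q_proj, k_proj, v_proj, o_proj in Mistral-7B
dims = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]

per_layer = sum(r * (d_in + d_out) for d_in, d_out in dims)
print(per_layer * num_layers)  # 13631488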

Training arguments (the effective batch size is per_device_train_batch_size × gradient_accumulation_steps = 4 × 4 = 16):

from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./medical-qa-shared-task-v1-half",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    fp16=True,
    logging_steps=20,
    save_steps=500,
    save_total_limit=2,
    report_to="none"
)

Finally, map the prompt template over the dataset and build the trainer:

train_dataset = ds["train"].map(
    format_example2,
    remove_columns=ds["train"].column_names
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    args=training_args
)

You can ask an AI assistant what each of these parameters does; the defaults are fine to start with.

Start training:

trainer.train()

Once training finishes, save the adapter so the model can be loaded later for chat:

model.save_pretrained("./your-output-dir")  # the output directory you just trained into
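
To load the adapter back and chat with it, something like the sketch below should work; adapter_dir stands for whatever output directory you actually saved to, and I simply reuse one formatted training example (with the answer stripped off) as a smoke-test prompt:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

adapter_dir = "./your-output-dir"  # hypothetical: the directory used in save_pretrained

base = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_dir)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Take a formatted example and cut it off right before the answer.
prompt = train_dataset[0]["text"].split("### Answer:")[0] + "### Answer:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))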

QA

GPU memory (VRAM) issues

If you hit a CUDA out-of-memory error, it means GPU memory has been exhausted. Taking a 4090 (24 GB) as the reference configuration, I shrink the LoRA setup:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,             # lower the rank
    lora_alpha=16,   # scale alpha down to match
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # keep only the core projections
    bias="none",
    task_type="CAUSAL_LM"
)
model_4090 = get_peft_model(model, lora_config)

To save even more VRAM:

lora_config = LoraConfig(
    r=4,
    lora_alpha=8,
    target_modules=["q_proj", "v_proj"],
)

Check your current VRAM usage:

import torch
print(torch.cuda.memory_allocated() / 1024**3, "GB")
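
For OOM debugging, the peak allocation since start-up is usually more telling than the current allocation:

print(torch.cuda.max_memory_allocated() / 1024**3, "GB peak allocated")
print(torch.cuda.memory_reserved() / 1024**3, "GB reserved by the caching allocator")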

Loading the base model in 4-bit with bitsandbytes (QLoRA-style) cuts the memory footprint much further:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto"
)


Before wrapping with LoRA, also enable gradient checkpointing and disable the KV cache:

model.gradient_checkpointing_enable()
model.config.use_cache = False
model = get_peft_model(model, lora_config)
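
When fine-tuning on top of a 4-bit base model, peft also ships prepare_model_for_kbit_training, which is commonly called before attaching the adapters; a sketch of the order I'd expect:

from peft import prepare_model_for_kbit_training, get_peft_model

model = prepare_model_for_kbit_training(model)  # sets up casting/gradient hooks for k-bit training
model = get_peft_model(model, lora_config)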