第七章 指令微调学习(五)Extracting and saving responses 第七章 指令微调学习五7.7 Extracting and saving responses在对指令数据集的训练部分完成LLM的微调后现在评估其在保留测试集上的性能。首先我们提取测试集中每个输入对应的模型生成响应并进行人工分析随后通过图7.18所示方法对LLM进行评估以量化响应的质量。1.测试集指令响应为完成响应指令步骤我们使用 generate 函数。随后我们将模型的响应结果与前三个测试集条目对应的预期测试集答案并排输出以便进行对比torch.manual_seed(123)forentryintest_data[:3]:input_textformat_input(entry)token_idsgenerate(modelmodel,idxtext_to_token_ids(input_text,tokenizer).to(device),max_new_tokens256,context_sizeBASE_CONFIG[context_length],eos_id50256)generated_texttoken_ids_to_text(token_ids,tokenizer)response_text(generated_text[len(input_text):].replace(### Response:,).strip())print(input_text)print(f\nCorrect response:\n{entry[output]})print(f\nModel response:\n{response_text.strip()})print(print(-------------------------------------))如前所述generate函数会返回输入文本与输出文本的组合结果因此我们通过对generated_text内容进行切片处理并使用.replace()方法来提取模型的响应。结果Below is an instruction that describes a task. Write a response that appropriately completes the request.### Instruction:Rewrite the sentence using a simile.### Input:The car is very fast. Correct response:The car is as fast as lightning. Model response:The car is as fast as a cheetah. ------------------------------------- None Below is an instruction that describes a task. Write a response that appropriately completes the request.### Instruction:Whattypeof cloud is typically associated with thunderstorms? Correct response:Thetypeof cloud typically associated with thunderstorms is cumulonimbus. Model response:Thetypeof cloud associated with thunderstorms is a cumulus cloud.### Instruction:Name the author ofPride and Prejudice.Correct response:Jane Austen. Model response:The author ofPride and Prejudiceis Jane Austen. -------------------------------------从结果可以看出该模型表现相对良好。首条和末条指令的答案明显正确而第二条答案虽接近正确但并不完全准确——模型选择了“积云”而非“积雨云”。不过需要指出的是积云确实可能发展为积雨云而积雨云具备引发雷暴的能力。1.最重要的是模型评估并不像完成度微调那样简单直接在完成度微调中我们只需计算正确分类垃圾邮件/非垃圾邮件标签的比例即可得出分类准确率。2.模型评估在实际应用中经过指令微调的大语言模型LLM会通过多种方法进行评估1简答题与多项选择题基准测试例如衡量大规模多任务语言理解能力的 MMLU https://arxiv.org/abs/2009.03300用于评估模型的通用知识水平2人类对其他大语言模型LLM的偏好比较如 LMSYS 聊天机器人竞赛平台https://arena.lmsys.org3自动化对话基准测试其中使用GPT-4等大语言模型来评估对话响应质量例如AlpacaEvalhttps://tatsu-lab.github.io/alpaca_eval/。在实际应用中综合考虑三种评估方法会更为有效多项选择题作答、人工评估以及衡量对话表现的自动化指标。然而由于我们的主要关注点在于评估对话表现本身而非单纯考察回答多项选择题的能力因此人工评估和自动化指标可能更具参考价值。但人工评估耗时所以使用自动化评估。3.自动化评估让我们采用一种受AlpacaEval启发的方法使用另一个大语言模型来评估我们微调后的模型响应。不过与依赖公开基准数据集不同我们采用了自定义测试集。这种定制化设计使得我们能够更精准、相关地评估模型在目标应用场景即我们的指令数据集中所体现的场景下的性能表现。为准备本次评估所需的响应数据我们将生成的模型响应追加到test_set字典中并将更新后的数据保存为“instruction-data-with-response.json”文件以供记录。此外通过保存该文件可以加载并分析这些响应。以下代码清单沿用之前的generate方法但此次我们遍历了整个test_set集合。同时我们不再直接打印模型响应而是将其添加到test_set字典中。最后输出字典中的一个条目查看是否正确添加。fromtqdmimporttqdmfori,entryintqdm(enumerate(test_data),totallen(test_data)):input_textformat_input(entry)token_idsgenerate(modelmodel,idxtext_to_token_ids(input_text,tokenizer).to(device),max_new_tokens256,context_sizeBASE_CONFIG[context_length],eos_id50256)generated_texttoken_ids_to_text(token_ids,tokenizer)response_text(generated_text[len(input_text):].replace(### Response:,).strip())test_data[i][model_response]response_textwithopen(instruction-data-with-response.json,w)asfile:json.dump(test_data,file,indent4)print(test_data[0])结果最后保存模型importre file_namef{re.sub(r[ ()],,CHOOSE_MODEL)}-sft.pthtorch.save(model.state_dict(),file_name)print(fModel saved as{file_name})输出总结完整代码如下#Insturction_fine-tuning_pretrained_LLM_5_20importjsonimporttorchfrompre_trainingimportcalc_loss_loaderfromDownload_instruction_dataset5_9importtrain_loader,val_loaderfromTraining_an_LLM_3_16importtrain_model_simplefromload_pretrained_model5_20importval_data,test_data,CHOOSE_MODELfromload_pretrained_model5_20importmodel,generate,text_to_token_ids,token_ids_to_text,BASE_CONFIGimporttiktoken devicetorch.device(cudaiftorch.cuda.is_available()elsecpu)model.to(device)torch.manual_seed(123)withtorch.no_grad():train_losscalc_loss_loader(train_loader,model,device,num_batches5)val_losscalc_loss_loader(val_loader,model,device,num_batches5)print(Training loss:,train_loss)print(Validation loss:,val_loss)defformat_input(entry):instruction_text(fBelow is an instruction that describes a task. fWrite a response that appropriately completes the request.f\n\n### Instruction:\n{entry[instruction]})input_text(f\n\n### Input:\n{entry[input]}ifentry[input]else)returninstruction_textinput_textimporttime start_timetime.time()torch.manual_seed(123)optimizertorch.optim.AdamW(model.parameters(),lr0.00005,weight_decay0.1)num_epochs2tokenizertiktoken.get_encoding(gpt2)train_losses,val_losses,tokens_seentrain_model_simple(model,train_loader,val_loader,optimizer,device,num_epochsnum_epochs,eval_freq5,eval_iter5,start_contextformat_input(val_data[0]),tokenizertokenizer)end_timetime.time()execution_time_minutes(end_time-start_time)/60print(fTraining completed in{execution_time_minutes:.2f}minutes.)importmatplotlib.pyplotaspltfrommatplotlib.tickerimportMaxNLocatordefplot_losses(epochs_seen,tokens_seen,train_losses,val_losses):fig,ax1plt.subplots(figsize(5,3))ax1.plot(epochs_seen,train_losses,labelTraining loss)ax1.plot(epochs_seen,val_losses,linestyle-.,labelValidation loss)ax1.set_xlabel(Epochs)ax1.set_ylabel(Loss)ax1.legend(locupper right)ax1.xaxis.set_major_locator(MaxNLocator(integerTrue))ax2ax1.twiny()ax2.plot(tokens_seen,train_losses,alpha0)ax2.set_xlabel(Tokens seen)fig.tight_layout()plt.show()epochs_tensortorch.linspace(0,num_epochs,len(train_losses))plot_losses(epochs_tensor,tokens_seen,train_losses,val_losses)#5.22torch.manual_seed(123)forentryintest_data[:3]:input_textformat_input(entry)token_idsgenerate(modelmodel,idxtext_to_token_ids(input_text,tokenizer).to(device),max_new_tokens256,context_sizeBASE_CONFIG[context_length],eos_id50256)generated_texttoken_ids_to_text(token_ids,tokenizer)response_text(generated_text[len(input_text):].replace(### Response:,).strip())print(input_text)print(f\nCorrect response:\n{entry[output]})print(f\nModel response:\n{response_text.strip()})print(print(-------------------------------------))fromtqdmimporttqdmfori,entryintqdm(enumerate(test_data),totallen(test_data)):input_textformat_input(entry)token_idsgenerate(modelmodel,idxtext_to_token_ids(input_text,tokenizer).to(device),max_new_tokens256,context_sizeBASE_CONFIG[context_length],eos_id50256)generated_texttoken_ids_to_text(token_ids,tokenizer)response_text(generated_text[len(input_text):].replace(### Response:,).strip())test_data[i][model_response]response_textwithopen(instruction-data-with-response.json,w)asfile:json.dump(test_data,file,indent4)print(test_data[0])importre file_namef{re.sub(r[ ()],,CHOOSE_MODEL)}-sft.pthtorch.save(model.state_dict(),file_name)print(fModel saved as{file_name})完成了1生成测试集的响应2并进行人工分析3自动化评估。