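Model Deployment
Model deployment moves a trained model into a production environment and exposes it through an API service.
Overview
Deployment has two parts: converting the trained model into a servable form, and running that form in production.
Deployment Options
Using Hugging Face Transformers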
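python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Qwen/Qwen-7B ships custom modeling code, so trust_remote_code=True is required.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

prompt = "Give a brief introduction to large language models"
inputs = tokenizer(prompt, return_tensors="pt")
# max_new_tokens bounds only the generated text; max_length would also count the prompt.
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Accelerating with vLLM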
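vLLM is a high-throughput, memory-efficient inference engine for LLMs. It accepts a list of prompts and processes them as a batch.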
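python
from vllm import LLM, SamplingParams

# trust_remote_code=True lets vLLM load Qwen's custom tokenizer code.
llm = LLM(model="Qwen/Qwen-7B", trust_remote_code=True)
sampling_params = SamplingParams(max_tokens=100)

prompts = [
    "Give a brief introduction to large language models",
    "What is artificial intelligence?",
]
# generate() returns one result object per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

Using TGI (Text Generation Inference)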
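The `text_generation` client used in this section is a wrapper over TGI's HTTP API. As a rough sketch of what goes over the wire, the same request can be built with the standard library alone. The helper names `build_tgi_request` and `generate` are hypothetical, and the server address is the same assumed `localhost:8080` used in this section.

```python
import json
import urllib.request


def build_tgi_request(base_url: str, prompt: str, max_new_tokens: int = 100) -> urllib.request.Request:
    """Build a POST request for TGI's /generate endpoint (hypothetical helper)."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, prompt: str, max_new_tokens: int = 100) -> str:
    """Send the request and return the generated text; needs a running TGI server."""
    request = build_tgi_request(base_url, prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["generated_text"]
```

Building the request separately from sending it keeps the payload easy to inspect and test without a live server.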
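python
from text_generation import Client

# Assumes a TGI server is already listening on localhost:8080.
client = Client("http://localhost:8080")
response = client.generate(
    "Give a brief introduction to large language models",
    max_new_tokens=100,
)
print(response.generated_text)

API Service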
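Both handlers in this section take a `text` value from the request and pass it straight to the model. A service facing real traffic would usually check that value first. The sketch below is one framework-independent way to do so; `validate_payload` and the `MAX_CHARS` limit are hypothetical names and values, not part of FastAPI or Flask.

```python
MAX_CHARS = 2000  # assumed limit; tune it to the model's context window


def validate_payload(payload: object) -> str:
    """Return the cleaned prompt, or raise ValueError describing the problem."""
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    text = payload.get("text")
    if not isinstance(text, str):
        raise ValueError("'text' must be a string")
    text = text.strip()
    if not text:
        raise ValueError("'text' must not be empty")
    if len(text) > MAX_CHARS:
        raise ValueError(f"'text' must be at most {MAX_CHARS} characters")
    return text
```

A handler could call this on the parsed JSON body and turn a `ValueError` into an HTTP 400 response instead of letting a missing key surface as a server error.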
FastAPI Deployment
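python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Load the model once at import time, not on every request.
generator = pipeline("text-generation", model="Qwen/Qwen-7B", trust_remote_code=True)


class GenerateRequest(BaseModel):
    # A bare `text: str` parameter would be read from the query string, not the JSON body.
    text: str


@app.post("/generate")
def generate(req: GenerateRequest):
    # A plain `def` runs in a worker thread, so the blocking model call does not stall the event loop.
    result = generator(req.text, max_new_tokens=100)
    return {"response": result[0]["generated_text"]}

Flask Deployment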
python
from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)
generator = pipeline("text-generation", model="Qwen/Qwen-7B", trust_remote_code=True)


@app.route("/generate", methods=["POST"])
def generate():
    # Expects a JSON body such as {"text": "..."}.
    text = request.json["text"]
    result = generator(text, max_new_tokens=100)
    return jsonify({"response": result[0]["generated_text"]})


if __name__ == "__main__":
    # Flask's built-in server is meant for development; 0.0.0.0 listens on all interfaces.
    app.run(host="0.0.0.0", port=5000)

Performance Optimization
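A quick estimate shows why the techniques in this section matter. The sketch below computes the memory needed for model weights alone at several precisions. It assumes exactly 7 billion parameters for a "7B" model (real checkpoints differ somewhat) and ignores activations, the KV cache, and quantization metadata; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights only, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: a "7B" model has exactly 7 billion parameters

for name, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>9}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, 4-bit weights need about 3.5 GB, against 14 GB at 16-bit precision.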
Model Quantization
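python
# 4-bit quantization with bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",  # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for computation at run time
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",
    quantization_config=bnb_config,
    trust_remote_code=True,
)

Model Distillation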
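The training loop in this section combines a distillation loss with a student loss. One common formulation of knowledge distillation softens both distributions with a temperature and uses KL divergence for the distillation term, scaled by the temperature squared. The dependency-free sketch below works on a single position's logits so the arithmetic is visible. The function names and the temperature of 2.0 are illustrative assumptions; the 0.7 and 0.3 weights match this section.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_total_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.7):
    """Weighted sum of the soft (teacher) term and the hard (label) term."""
    # Soft term: how far the student's softened distribution is from the teacher's.
    soft = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    soft *= temperature ** 2  # keeps the soft term's scale comparable across temperatures
    # Hard term: cross-entropy of the student against the ground-truth label.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * soft + (1 - alpha) * hard
```

When the student's logits equal the teacher's, the soft term is zero and only the weighted cross-entropy remains.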
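python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

# Teacher model (frozen)
teacher_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True).eval()
# Student model (trained)
student_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1.8B", trust_remote_code=True)
optimizer = torch.optim.AdamW(student_model.parameters(), lr=1e-5)  # assumed optimizer and learning rate

# Distillation training; `dataloader` is assumed to yield (input_ids, labels) with labels aligned to the logits
for inputs, labels in dataloader:
    with torch.no_grad():  # no gradients flow through the teacher
        teacher_logits = teacher_model(inputs).logits
    student_logits = student_model(inputs).logits
    # Distillation loss: KL divergence between the teacher's and the student's distributions
    distillation_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1), F.softmax(teacher_logits, dim=-1), reduction="batchmean"
    )
    # Student loss: cross-entropy against the ground-truth labels
    student_loss = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    total_loss = 0.7 * distillation_loss + 0.3 * student_loss
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

Cache Optimization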
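The snippet in this section stores the embeddings of a single prompt. When many prompts recur, the same idea can be keyed by prompt so that each one is computed only once. Below is a minimal in-memory sketch; `PromptCache` and its `compute` callback are hypothetical, and in practice the callback would wrap the embedding lookup from this section.

```python
import hashlib
from collections import OrderedDict
from typing import Any, Callable


class PromptCache:
    """Small LRU cache keyed by a hash of the prompt text."""

    def __init__(self, compute: Callable[[str], Any], max_entries: int = 128):
        self._compute = compute
        self._max_entries = max_entries
        self._entries: "OrderedDict[str, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, prompt: str) -> Any:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.misses += 1
        value = self._compute(prompt)
        self._entries[key] = value
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return value
```

The `max_entries` bound matters because cached tensors for long prompts can be large; an unbounded cache would grow with every distinct prompt.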
python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

# Cache the prompt embeddings
prompt = "Give a brief introduction to large language models"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    prompt_embeds = model.get_input_embeddings()(inputs.input_ids)
# Save the cache to disk; reload it later with torch.load("prompt_embeds.pt")
torch.save(prompt_embeds, "prompt_embeds.pt")

Monitoring and Logging
Performance Monitoring
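python
import time

# Assumes `model` and `inputs` from the Transformers example.
start_time = time.perf_counter()
outputs = model.generate(**inputs)
inference_time = time.perf_counter() - start_time

# outputs[0] holds the prompt followed by the new tokens, so subtract the prompt length.
new_tokens = len(outputs[0]) - inputs["input_ids"].shape[1]
print(f"Inference time: {inference_time:.2f}s")
print(f"Tokens per second: {new_tokens / inference_time:.2f}")

Logging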
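python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
# Assumes `prompt` and `inference_time` from the earlier snippets.
# %-style arguments defer string formatting until the record is actually emitted.
logging.info("Request received: %s", prompt)
logging.info("Response generated in %.2fs", inference_time)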
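The timing and logging snippets in this section can be combined so that every request is measured and logged the same way. The sketch below does this with a context manager. `track_inference` and the `stats` dictionary are hypothetical names, and the caller is assumed to report how many tokens were generated.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("inference")


@contextmanager
def track_inference(prompt: str):
    """Time the wrapped block and log latency and throughput when it exits."""
    stats = {"new_tokens": 0, "seconds": 0.0}
    logger.info("Request received: %s", prompt)
    start = time.perf_counter()
    try:
        yield stats
    finally:
        stats["seconds"] = time.perf_counter() - start
        rate = stats["new_tokens"] / stats["seconds"] if stats["seconds"] > 0 else 0.0
        logger.info("Response generated in %.2fs (%.2f tokens/s)", stats["seconds"], rate)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
    with track_inference("example prompt") as stats:
        time.sleep(0.05)  # stand-in for model.generate(**inputs)
        stats["new_tokens"] = 10  # stand-in for the number of generated tokens
```

Because the log call sits in a `finally` clause, a request that raises an exception still gets its latency recorded.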