LMDeploy 完全指南：高效模型部署工具

商汤科技出品，推理速度超越 vLLM 的部署利器

什么是 LMDeploy？

简介

LMDeploy 是由商汤科技（SenseTime）开发的大模型部署工具包，专注于模型压缩、推理加速和服务部署。在多项基准测试中，其 TurboMind 推理引擎的吞吐量超越 vLLM 达 1.8 倍。

核心优势

优势	说明	性能表现
🚀 TurboMind	自研高性能推理引擎	1.8x vLLM 吞吐量
💾 4-bit 量化	AWQ 量化，2.4 倍加速	质量损失<1%
🎯 多平台支持	NVIDIA/AMD/华为 Ascend	广泛兼容
📦 一键部署	简化的部署流程	5 分钟上线
🔧 工具链完整	量化/推理/服务全支持	端到端方案

适用场景

高性能推理：需要极致吞吐量的生产环境
资源受限：显存有限但需要跑大模型
国产化需求：华为 Ascend 等国产硬件
多模态应用：图文理解、视觉语言模型

快速开始

系统要求

组件	要求
GPU	NVIDIA GPU (Volta+) / AMD / Ascend
显存	最少 8GB，推荐 24GB+
CUDA	11.8 或 12.1+
Python	3.8 - 3.11
PyTorch	2.0+

5 分钟快速启动

bash

# 1. 安装 LMDeploy
pip install lmdeploy

# 2. 启动推理服务
lmdeploy serve api_server internlm/internlm2-chat-7b --server-port 23333

# 3. 测试请求
curl http://localhost:23333/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "internlm2-chat-7b",
        "messages": [{"role": "user", "content": "你好"}],
        "max_tokens": 100
    }'

OpenAI 兼容 API

python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:23333/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="internlm2-chat-7b",
    messages=[
        {"role": "user", "content": "介绍一下你自己"}
    ]
)

print(response.choices[0].message.content)

核心特性

TurboMind 推理引擎

TurboMind 核心优化：
├── Paged Attention    - 分页注意力，显存利用率 95%+
├── Continuous Batching - 连续批处理，吞吐量提升 2-3x
├── Flash Attention 2  - 优化的注意力内核
├── KV Cache 量化      - INT4/INT8 KV 缓存，节省 50% 显存
└── CUDA Graph         - 图优化，减少内核启动开销

量化技术

量化方式	精度	显存节省	速度提升
W4A16	4-bit 权重	75%	2.4x
W8A8	8-bit 权重+激活	50%	1.8x
KV Cache INT8	8-bit KV	50%	1.3x
FP8	8-bit 浮点	50%	1.5x

多后端支持

python

# PyTorch 后端（灵活，易扩展）
from lmdeploy import pipeline
pipe = pipeline("internlm/internlm2-chat-7b", backend="pytorch")

# TurboMind 后端（高性能）
pipe = pipeline("internlm/internlm2-chat-7b", backend="turbomind")

# Ascend 后端（华为硬件）
pipe = pipeline("Qwen/Qwen-7B", backend="ascend")

安装配置

基础安装

bash

# 方法 1：PyPI 安装
pip install lmdeploy

# 方法 2：Docker 安装
docker pull openmmlab/lmdeploy:latest

# 方法 3：源码安装
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
pip install -e .

Docker 部署

bash

# NVIDIA GPU
docker run --gpus all -it --rm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 23333:23333 \
    openmmlab/lmdeploy:latest \
    lmdeploy serve api_server internlm/internlm2-chat-7b

# Ascend NPU
docker run --device /dev/davinci_manager \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 23333:23333 \
    openmmlab/lmdeploy:ascend \
    lmdeploy serve api_server Qwen/Qwen-7B --backend ascend

多 GPU 配置

bash

# 张量并行
lmdeploy serve api_server internlm/internlm2-chat-20b \
    --tp 2 \
    --server-port 23333

# 流水线并行
lmdeploy serve api_server internlm/internlm2-chat-70b \
    --tp 4 \
    --pp 2 \
    --server-port 23333

模型推理

离线推理

python

from lmdeploy import pipeline

# 创建推理管道
pipe = pipeline("internlm/internlm2-chat-7b")

# 单条推理
response = pipe("你好，请介绍一下你自己")
print(response.text)

# 批量推理
responses = pipe(["你好", "今天天气怎么样", "推荐一本书"])
for i, resp in enumerate(responses):
    print(f"回复{i+1}: {resp.text}")

对话模式

python

from lmdeploy import pipeline, TurbomindEngineConfig

# 配置对话参数
backend_config = TurbomindEngineConfig(
    session_len=8192,      # 会话长度
    max_batch_size=128,    # 最大批处理大小
    tp=1                   # 张量并行数
)

pipe = pipeline(
    "internlm/internlm2-chat-7b",
    backend_config=backend_config
)

# 多轮对话
session = pipe.chat("你好")
print(session.response.text)

session = pipe.chat("帮我写个 Python 函数", session=session)
print(session.response.text)

session = pipe.chat("加上注释", session=session)
print(session.response.text)

流式输出

python

from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2-chat-7b")

# 流式生成
for response in pipe.stream_infer("讲一个故事"):
    print(response.text, end="", flush=True)

量化压缩

AWQ 量化（4-bit）

bash

# 1. 量化模型
lmdeploy lite auto_awq \
    internlm/internlm2-chat-7b \
    --calib-dataset 'ptb' \
    --calib-samples 128 \
    --calib-seqlen 2048 \
    --w-bits 4 \
    --w-group-size 128 \
    --work-dir ./internlm2-7b-awq

# 2. 使用量化模型
lmdeploy serve api_server ./internlm2-7b-awq --server-port 23333

KV Cache 量化

python

from lmdeploy import pipeline, TurbomindEngineConfig

# 启用 KV Cache 量化
backend_config = TurbomindEngineConfig(
    quant_policy=8,  # 8=INT8, 4=INT4
    max_batch_size=128
)

pipe = pipeline(
    "internlm/internlm2-chat-7b",
    backend_config=backend_config
)

response = pipe("你好")
print(response.text)

量化模型推荐

模型	量化版本	显存需求	速度
InternLM2-7B	AWQ-4bit	6GB	2.4x
Qwen-14B	AWQ-4bit	10GB	2.2x
Llama-2-70B	AWQ-4bit	48GB	2.0x
DeepSeek-V2	FP8	32GB	1.8x

服务部署

API 服务器

bash

# 基础配置
lmdeploy serve api_server \
    internlm/internlm2-chat-7b \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --tp 1 \
    --session-len 8192 \
    --max-batch-size 128 \
    --cache-max-entry-count 0.8

# 启用监控
lmdeploy serve api_server \
    internlm/internlm2-chat-7b \
    --server-port 23333 \
    --api-keys sk-llm-deploy \
    --log-level info

客户端 SDK

python

from lmdeploy.serve.async_client import AsyncEngineClient

# 异步客户端
async def main():
    client = AsyncEngineClient("http://localhost:23333")
    
    # 单次请求
    response = await client.generate("你好")
    print(response.text)
    
    # 流式请求
    async for response in client.generate_stream("讲个故事"):
        print(response.text, end="")

import asyncio
asyncio.run(main())

Gradio Web UI

bash

# 启动 Web 界面
lmdeploy serve gradio http://localhost:23333 \
    --server-name 0.0.0.0 \
    --server-port 7860

生产部署配置

yaml

# docker-compose.yml
version: '3.8'
services:
  lmdeploy:
    image: openmmlab/lmdeploy:latest
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    ports:
      - "23333:23333"
    volumes:
      - ./models:/models
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      lmdeploy serve api_server /models/internlm2-7b
      --tp 2
      --max-batch-size 256
      --session-len 16384
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
    restart: unless-stopped

多模态支持

图文理解

python

from lmdeploy import pipeline

# 加载多模态模型
pipe = pipeline("OpenGVLab/InternVL2-8B")

# 图文问答
response = pipe(
    "这张图片里有什么？",
    image="./image.jpg"
)
print(response.text)

视觉语言模型

python

from lmdeploy import pipeline, TurbomindEngineConfig

# 配置多模态推理
backend_config = TurbomindEngineConfig(
    session_len=8192,
    max_batch_size=64
)

pipe = pipeline(
    "OpenGVLab/InternVL2-26B",
    backend_config=backend_config
)

# 多图输入
response = pipe(
    "比较这两张图片的差异",
    image=["image1.jpg", "image2.jpg"]
)

视频理解

python

from lmdeploy import pipeline

pipe = pipeline("OpenGVLab/InternVL2-8B")

# 视频问答
response = pipe(
    "视频中的人在做什么？",
    video="./video.mp4"
)
print(response.text)

性能优化

显存优化

bash

# 调整显存使用比例
export LMDeploy_CACHE_MAX_ENTRY_COUNT=0.8

# 启用 CPU offload
lmdeploy serve api_server \
    internlm/internlm2-chat-70b \
    --cpu-offload \
    --max-batch-size 64

吞吐量优化

bash

# 增加批处理大小
lmdeploy serve api_server \
    internlm/internlm2-chat-7b \
    --max-batch-size 256 \
    --cache-max-entry-count 0.9

# 启用分块预填充
lmdeploy serve api_server \
    internlm/internlm2-chat-7b \
    --enable-chunked-prefill

延迟优化

bash

# 减少批处理大小
lmdeploy serve api_server \
    internlm/internlm2-chat-7b \
    --max-batch-size 32 \
    --schedule-conservativeness 1.2

基准测试

bash

# 使用 LMDeploy 基准测试工具
python -m lmdeploy.bench_serving \
    --backend lmdeploy \
    --model internlm/internlm2-chat-7b \
    --dataset sharegpt \
    --num-prompts 1000 \
    --request-rate 16

# 输出示例：
# Throughput: 5234.12 requests/s
# Avg latency: 189.34 ms
# P50 latency: 156.78 ms
# P99 latency: 423.45 ms

实际应用场景

1. 高并发客服系统

python

from fastapi import FastAPI
from lmdeploy.serve.async_client import AsyncEngineClient

app = FastAPI()
client = AsyncEngineClient("http://localhost:23333")

@app.post("/chat")
async def chat(question: str):
    response = await client.generate(question)
    return {"answer": response.text}

# 启动：uvicorn app:app --workers 4

2. 文档问答系统

python

from lmdeploy import pipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 加载文档
with open("manual.pdf", "r") as f:
    content = f.read()

# 分块
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(content)

# 创建索引
from llama_index.core import VectorStoreIndex, Document
documents = [Document(text=chunk) for chunk in chunks]
index = VectorStoreIndex.from_documents(documents)

# 结合 LMDeploy 推理
pipe = pipeline("internlm/internlm2-chat-7b")

def rag_query(question):
    # 检索相关文档
    retriever = index.as_retriever(similarity_top_k=3)
    nodes = retriever.retrieve(question)
    context = "\n\n".join([node.text for node in nodes])
    
    # 生成回答
    prompt = f"""基于以下信息回答问题：
{context}

问题：{question}
回答："""
    
    response = pipe(prompt)
    return response.text

3. 代码助手

python

from lmdeploy import pipeline, TurbomindEngineConfig

# 使用代码专用模型
backend_config = TurbomindEngineConfig(
    session_len=16384,
    max_batch_size=64
)

pipe = pipeline(
    "internlm/internlm2-coder-7b",
    backend_config=backend_config
)

# 代码生成
response = pipe("""
用 Python 写一个快速排序函数，要求：
1. 包含详细注释
2. 提供测试用例
3. 分析时间复杂度
""")
print(response.text)

4. 多租户服务

python

from fastapi import FastAPI, Depends, HTTPException
from lmdeploy.serve.async_client import AsyncEngineClient
import asyncio

app = FastAPI()

# 连接池
class ClientPool:
    def __init__(self, size=10):
        self.clients = [AsyncEngineClient("http://localhost:23333") for _ in range(size)]
        self.queue = asyncio.Queue()
        for client in self.clients:
            self.queue.put_nowait(client)
    
    async def acquire(self):
        return await self.queue.get()
    
    async def release(self, client):
        await self.queue.put(client)

pool = ClientPool()

@app.post("/generate")
async def generate(prompt: str, client: AsyncEngineClient = Depends(pool.acquire)):
    try:
        response = await client.generate(prompt)
        return {"text": response.text}
    finally:
        await pool.release(client)

故障排除

常见问题

1. 显存不足

bash

# 解决方案 1：降低显存使用
export LMDeploy_CACHE_MAX_ENTRY_COUNT=0.7

# 解决方案 2：使用量化模型
lmdeploy lite auto_awq internlm/internlm2-chat-7b --work-dir ./awq

# 解决方案 3：减少并发
--max-batch-size 32

2. 模型加载失败

bash

# 检查 HuggingFace 连接
huggingface-cli whoami

# 使用镜像
export HF_ENDPOINT=https://hf-mirror.com

# 手动下载
huggingface-cli download internlm/internlm2-chat-7b --local-dir ./models

3. 推理速度慢

bash

# 检查 GPU 使用
nvidia-smi

# 启用 TurboMind
lmdeploy serve api_server \
    internlm/internlm2-chat-7b \
    --backend turbomind

# 调整批处理
--max-batch-size 128

日志调试

bash

# 启用详细日志
export LMDeploy_LOG_LEVEL=DEBUG

# 查看性能指标
curl http://localhost:23333/metrics

# 监控 GPU
watch -n 1 nvidia-smi

总结

LMDeploy 是高性能、功能完整的大模型部署工具。通过本指南，你应该已经掌握了：

✅ TurboMind 核心原理
✅ 量化压缩技术
✅ 服务部署方案
✅ 多模态支持
✅ 性能优化技巧

现在，开始部署你的高性能 AI 应用吧！🚀

参考资源：

官网：https://github.com/InternLM/lmdeploy
文档：https://lmdeploy.readthedocs.io
GitHub: https://github.com/InternLM/lmdeploy
模型库：https://huggingface.co/internlm

最后更新：2026-03-13

LMDeploy 完全指南：高效模型部署工具 ​

目录 ​

什么是 LMDeploy？ ​

简介 ​

核心优势 ​

适用场景 ​

快速开始 ​

系统要求 ​

5 分钟快速启动 ​

OpenAI 兼容 API ​

核心特性 ​

TurboMind 推理引擎 ​

量化技术 ​

多后端支持 ​

安装配置 ​

基础安装 ​

Docker 部署 ​

多 GPU 配置 ​

模型推理 ​

离线推理 ​

对话模式 ​

流式输出 ​

量化压缩 ​

AWQ 量化（4-bit） ​

KV Cache 量化 ​

量化模型推荐 ​

服务部署 ​

API 服务器 ​

客户端 SDK ​

Gradio Web UI ​

生产部署配置 ​

多模态支持 ​

图文理解 ​

视觉语言模型 ​

视频理解 ​

性能优化 ​

显存优化 ​

吞吐量优化 ​

延迟优化 ​

基准测试 ​

实际应用场景 ​

1. 高并发客服系统 ​

2. 文档问答系统 ​

3. 代码助手 ​

4. 多租户服务 ​

故障排除 ​

常见问题 ​

1. 显存不足 ​

2. 模型加载失败 ​

3. 推理速度慢 ​

日志调试 ​

总结 ​

LMDeploy 完全指南：高效模型部署工具

目录

什么是 LMDeploy？

简介

核心优势

适用场景

快速开始

系统要求

5 分钟快速启动

OpenAI 兼容 API

核心特性

TurboMind 推理引擎

量化技术

多后端支持

安装配置

基础安装

Docker 部署

多 GPU 配置

模型推理

离线推理

对话模式

流式输出

量化压缩

AWQ 量化（4-bit）

KV Cache 量化

量化模型推荐

服务部署

API 服务器

客户端 SDK

Gradio Web UI

生产部署配置

多模态支持

图文理解

视觉语言模型

视频理解

性能优化

显存优化

吞吐量优化

延迟优化

基准测试

实际应用场景

1. 高并发客服系统

2. 文档问答系统

3. 代码助手

4. 多租户服务

故障排除

常见问题

1. 显存不足

2. 模型加载失败

3. 推理速度慢

日志调试

总结