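1. Body Based Router: From Path-Based Routing to Model-Semantic Routing
In OpenAI-compatible APIs, callers usually name the model they want through the model field in the request body. For example:
{
  "model": "Qwen3-8B",
  "messages": [
    {
      "role": "user",
      "content": "Tell me in one sentence which model you are"
    }
  ]
}
This creates a problem for traditional gateways: the information that drives model routing lives in the HTTP body, not in the path, host, or headers. A standard HTTPRoute matches on host, path, and headers, but the core semantics of an AI request are often carried in the body. Suppose you want a single unified entry point, for example: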
/roadshow/llmmodel/v1/chat/completions
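and you want requests routed to different model backends based on the model field in the request body. The gateway then has to read the body and extract the model name. This is what a Body Based Router (BBR) provides. In practice, a gateway policy can read the model field from the request body during the PreRouting phase and copy it into a new header, for example: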
X-Gateway-Base-Model-Name
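HTTPRoute then matches on this header and forwards the request to the corresponding InferencePool. External callers only need to hit the unified entry point and keep passing the model name in the body, as the OpenAI API expects, while the gateway handles model routing automatically: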
model=Qwen3-8B → qwen3-8b-demo InferencePool
model=DeepSeek-R1-Distill-Qwen-7B → deepseek-r1-7b-demo InferencePool
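The two-step flow (copy the body's model field into a header, then match on that header) can be sketched in a few lines of Python. This is only an illustration of the idea, not the gateway's implementation: the function names and the POOLS table are hypothetical, and the header name follows the example above.

```python
import json

HEADER = "X-Gateway-Base-Model-Name"

# Hypothetical routing table mirroring the mapping above.
POOLS = {
    "Qwen3-8B": "qwen3-8b-demo",
    "DeepSeek-R1-Distill-Qwen-7B": "deepseek-r1-7b-demo",
}


def add_model_header(headers, body):
    """PreRouting step: copy the body's model field into a header."""
    out = dict(headers)
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return out
    model = payload.get("model") if isinstance(payload, dict) else None
    if isinstance(model, str):
        out[HEADER] = model
    return out


def select_pool(headers):
    """Routing step: exact match on the header value."""
    return POOLS.get(headers.get(HEADER))
```

A request whose body is not JSON, or whose model is not in the table, gets no pool in this sketch, which corresponds to a request that matches none of the header-based routes.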
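This is a key difference between an AI gateway and a traditional API gateway. A traditional gateway mostly understands HTTP metadata, whereas an AI gateway also has to understand the semantics of an inference request. It does more than forward requests: it identifies which model the request is asking for and selects the backend model service accordingly. The rest of this section walks through a hands-on setup to show the effect.
1.1 Deploy the Qwen3 model service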
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    aitype: tuili
  name: qwen3-8b-demo
  namespace: roadshow
spec:
  replicas: 2
  selector:
    matchLabels:
      app: qwen3-8b-demo
  template:
    metadata:
      labels:
        aitype: tuili
        app: qwen3-8b-demo
    spec:
      containers:
      - args:
        - |
          seqNum=$(expr 1 - 1)
          CUDA_VISIBLE_DEVICES=$(seq -s, 0 $seqNum) /opt/conda/bin/python3 -m vllm.entrypoints.openai.api_server \
            --model /workspace/model/Qwen3-8B \
            --port 8080 \
            --tensor-parallel-size 1 \
            --gpu-memory-utilization 0.9 \
            --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
            --enable-auto-tool-choice \
            --tool-call-parser granite \
            --served-model-name Qwen3-8B \
            --trust-remote-code
        command:
        - /bin/bash
        - -c
        env:
        - name: RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES
          value: "1"
        image: image.xxx.cn/tenant_public/vllm-mars:ai3.3-torch2.6-py312-ubuntu22.04-amd64
        imagePullPolicy: IfNotPresent
        name: qwen3-8b-demo-container-01
        resources:
          limits:
            cpu: "12"
            ephemeral-storage: 50Gi
            mars-tech.com/gpu: "1"
            memory: 96Gi
          requests:
            cpu: "12"
            ephemeral-storage: 50Gi
            mars-tech.com/gpu: "1"
            memory: 96Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /workspace/model
          name: localmodelvolume
          readOnly: true
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: volcano
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /zion0/modelsrepo/models
          type: Directory
        name: localmodelvolume
      - emptyDir:
          medium: Memory
          sizeLimit: 15Gi
        name: dshm
---
apiVersion: v1
kind: Service
metadata:
  name: qwen3-8b-demo-service
  namespace: roadshow
  labels:
    app: qwen3-8b-demo
spec:
  type: NodePort
  selector:
    app: qwen3-8b-demo
  ports:
  - name: http
    port: 8080
    targetPort: 8080
    nodePort: 31301
EOF
# Basic smoke test
curl -sS http://10.8.17.200:31301/v1/models
{"object":"list","data":[{"id":"Qwen3-8B","object":"model","created":1772164142,"owned_by":"vllm","root":"/workspace/model/Qwen3-8B","parent":null,"max_model_len":131072,"permission":[{"id":"modelperm-a82e27138dc5437484b13e3f91486070","object":"model_permission","created":1772164142,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
curl -sS http://10.8.17.200:31301/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-8B",
    "messages": [{"role": "user", "content": "Hello, introduce yourself in one sentence"}],
    "temperature": 0.2
  }'
1.2 Deploy the InferencePool for the Qwen3 model
# Install the InferencePool and the EPP with Helm (run from the chart directory)
helm upgrade --install qwen3-8b-demo . \
  --namespace roadshow --create-namespace \
  --dependency-update \
  --set inferencePool.modelServers.matchLabels.app=qwen3-8b-demo \
  --set experimentalHttpRoute.enabled=true \
  --set experimentalHttpRoute.baseModel=Qwen3-8B \
  -f values.yaml
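The matchLabels.app value passed to Helm is what ties the pool to the model-server pods from 1.1. Assuming the chart turns it into an ordinary Kubernetes equality-based label selector (the chart itself is not shown here, so this is an assumption), selection behaves as in this small sketch. The function name selects is illustrative.

```python
def selects(match_labels, pod_labels):
    """True if every key/value in match_labels is present on the pod."""
    return all(pod_labels.get(k) == v for k, v in match_labels.items())


# Labels taken from the Deployment templates in this walkthrough.
pool_selector = {"app": "qwen3-8b-demo"}
qwen_pod = {"aitype": "tuili", "app": "qwen3-8b-demo"}
deepseek_pod = {"aitype": "tuili", "app": "deepseek-r1-7b-demo"}
```

Selecting on app rather than on the shared aitype label matters here: both model Deployments carry aitype: tuili, so a selector on that label alone would mix pods serving different models into one pool.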
1.3 Deploy the DeepSeek R1 model service
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    aitype: tuili
  name: deepseek-r1-7b-demo
  namespace: roadshow
spec:
  replicas: 3
  selector:
    matchLabels:
      app: deepseek-r1-7b-demo
  template:
    metadata:
      labels:
        aitype: tuili
        app: deepseek-r1-7b-demo
    spec:
      containers:
      - args:
        - |
          seqNum=$(expr 1 - 1)
          CUDA_VISIBLE_DEVICES=$(seq -s, 0 $seqNum) /opt/conda/bin/python3 -m vllm.entrypoints.openai.api_server \
            --model /workspace/model/DeepSeek-R1-Distill-Qwen-7B \
            --port 8080 \
            --tensor-parallel-size 1 \
            --gpu-memory-utilization 0.9 \
            --enable-auto-tool-choice \
            --tool-call-parser granite \
            --served-model-name DeepSeek-R1-Distill-Qwen-7B \
            --trust-remote-code
        command:
        - /bin/bash
        - -c
        env:
        - name: RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES
          value: "1"
        image: aiimage.xxx.cn/tenant_public/vllm-mars:ai3.3-torch2.6-py312-ubuntu22.04-amd64
        imagePullPolicy: IfNotPresent
        name: deepseek-r1-7b-demo-container-01
        resources:
          limits:
            cpu: "12"
            ephemeral-storage: 50Gi
            mars-tech.com/gpu: "1"
            memory: 96Gi
          requests:
            cpu: "12"
            ephemeral-storage: 50Gi
            mars-tech.com/gpu: "1"
            memory: 96Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /workspace/model
          name: localmodelvolume
          readOnly: true
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: volcano
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /zion0/modelsrepo/models
          type: Directory
        name: localmodelvolume
      - emptyDir:
          medium: Memory
          sizeLimit: 15Gi
        name: dshm
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1-7b-demo-service
  namespace: roadshow
  labels:
    app: deepseek-r1-7b-demo
spec:
  type: NodePort
  selector:
    app: deepseek-r1-7b-demo
  ports:
  - name: http
    port: 8080
    targetPort: 8080
    nodePort: 31302
EOF
# Basic smoke test
curl -sS http://10.8.17.200:31302/v1/models
curl -sS http://10.8.17.200:31302/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-7B",
    "messages": [{"role": "user", "content": "Hello, introduce yourself in one sentence"}],
    "temperature": 0.2
  }'
1.4 Deploy the InferencePool for the DeepSeek R1 model
# Install the InferencePool and the EPP with Helm (run from the chart directory)
helm upgrade --install deepseek-r1-7b-demo . \
  --namespace roadshow \
  --dependency-update \
  --set inferencePool.modelServers.matchLabels.app=deepseek-r1-7b-demo \
  --set experimentalHttpRoute.enabled=true \
  --set experimentalHttpRoute.baseModel=DeepSeek-R1-Distill-Qwen-7B \
  -f values.yaml
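1.5 Configure the policy
💡 All BBR does is read the model field from the request body and add a new header to the request, such as X-Gateway-Base-Model-Name. The header only affects the routing result when a later HTTPRoute rule actually matches on it under rules[].matches[].headers.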
kubectl apply -f - <<'EOF'
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: bbr-model-routing
  namespace: kgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: llm
  traffic:
    phase: PreRouting
    transformation:
      request:
        set:
        - name: X-Gateway-Base-Model-Name
          value: |
            {
              "Qwen3-8B": "Qwen3-8B",
              "DeepSeek-R1-Distill-Qwen-7B": "DeepSeek-R1-Distill-Qwen-7B"
            }[json(request.body).model]
EOF
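The expression in value is an indexed map literal: it parses the body as JSON, takes model, and looks that name up in a fixed table. A rough Python analogue follows. It is a sketch for reading the expression, not how the gateway evaluates it. In particular, what the gateway does when the model is missing from the table is not covered in this walkthrough, so the sketch simply returns None in that case (an assumption).

```python
import json

# Same table as the map literal in the policy.
ALLOWED = {
    "Qwen3-8B": "Qwen3-8B",
    "DeepSeek-R1-Distill-Qwen-7B": "DeepSeek-R1-Distill-Qwen-7B",
}


def base_model_header_value(body):
    """Mimic {...}[json(request.body).model]; None stands for 'no header set'."""
    try:
        model = json.loads(body)["model"]
    except (ValueError, KeyError, TypeError):
        return None
    return ALLOWED.get(model) if isinstance(model, str) else None
```

Because each name maps to itself, the table works as an allow-list. The same structure could also map several request-facing names to one base model name, which is presumably why the header is called a base model name.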
1.6 Configure the routes
# 1. Route for Qwen3
kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llmmodel-qwen-route
  namespace: roadshow
spec:
  hostnames:
  - llm.xxx.cn
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: llm
    namespace: kgateway-system
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /roadshow/llmmodel
      headers:
      - name: X-Gateway-Base-Model-Name
        type: Exact
        value: Qwen3-8B
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          type: ReplacePrefixMatch
          replacePrefixMatch: /
    backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: qwen3-8b-demo
      weight: 1
    timeouts:
      request: 300s
EOF
# 2. Route for DeepSeek R1
kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llmmodel-deepseek-route
  namespace: roadshow
spec:
  hostnames:
  - llm.xxx.cn
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: llm
    namespace: kgateway-system
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /roadshow/llmmodel
      headers:
      - name: X-Gateway-Base-Model-Name
        type: Exact
        value: DeepSeek-R1-Distill-Qwen-7B
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          type: ReplacePrefixMatch
          replacePrefixMatch: /
    backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: deepseek-r1-7b-demo
      weight: 1
    timeouts:
      request: 300s
EOF
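Taken together, the two routes share one path prefix and differ only in the header value, and both strip the prefix before forwarding. The following Python sketch models that matching and rewriting under simplifying assumptions: the RULES list is a hypothetical in-memory view of the two HTTPRoutes, header lookup is an exact-key dictionary access (real HTTP header names are case-insensitive), and hostnames are ignored.

```python
MODEL_HEADER = "X-Gateway-Base-Model-Name"

# Hypothetical in-memory view of the two HTTPRoute rules above.
RULES = [
    {"prefix": "/roadshow/llmmodel", "model": "Qwen3-8B",
     "pool": "qwen3-8b-demo"},
    {"prefix": "/roadshow/llmmodel", "model": "DeepSeek-R1-Distill-Qwen-7B",
     "pool": "deepseek-r1-7b-demo"},
]


def path_prefix_matches(path, prefix):
    """PathPrefix semantics: match whole path segments, not raw substrings."""
    p = prefix.rstrip("/")
    return path == p or path.startswith(p + "/")


def replace_prefix_match(path, prefix, replacement):
    """Swap the matched prefix for replacement and keep the rest of the path."""
    p = prefix.rstrip("/")
    rewritten = replacement.rstrip("/") + path[len(p):]
    return rewritten or "/"


def route(path, headers):
    """Return (pool, upstream_path) for the first matching rule, else None."""
    for rule in RULES:
        if (path_prefix_matches(path, rule["prefix"])
                and headers.get(MODEL_HEADER) == rule["model"]):
            return rule["pool"], replace_prefix_match(path, rule["prefix"], "/")
    return None
```

With the rewrite, the model server receives /v1/chat/completions, which is the path vLLM serves. A request that carries no recognized model header matches neither rule in this sketch.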
1.7 Test and verify
curl -sS https://llm.xxx.cn/roadshow/llmmodel/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen3-8B",
    "messages": [{"role": "user", "content": "Tell me in one sentence which model you are"}],
    "temperature": 0.2
  }'
curl -sS https://llm.xxx.cn/roadshow/llmmodel/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-7B",
    "messages": [{"role": "user", "content": "Tell me in one sentence which model you are"}],
    "temperature": 0.2
  }'
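The same check can be scripted. The sketch below builds both requests with Python's standard library and leaves the actual send commented out, since it needs network access to the gateway. The helper name build_chat_request is illustrative; the URL and payload are the ones used in the curl commands.

```python
import json
import urllib.request

ENTRYPOINT = "https://llm.xxx.cn/roadshow/llmmodel/v1/chat/completions"


def build_chat_request(model, prompt):
    """Build a POST to the unified entry point; only the model field varies."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        ENTRYPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


requests_to_send = [
    build_chat_request(m, "Tell me in one sentence which model you are")
    for m in ("Qwen3-8B", "DeepSeek-R1-Distill-Qwen-7B")
]

# To send one request (requires access to the gateway):
#   with urllib.request.urlopen(requests_to_send[0], timeout=300) as resp:
#       print(json.load(resp)["model"])
```

Both requests go to the same URL. The model field echoed in an OpenAI-style response is a convenient way to confirm which backend answered.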
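2. Canary Releases for Models: Turning Model Upgrades into a Traffic-Management Problem
On an inference platform, updating a model is not as simple as swapping an image or restarting a service. One model name may map to different base models, fine-tuned weights, inference engines, GPU types, or even runtime parameters. Switching all traffic at once is risky: the new model's quality may be unstable, the new image may have compatibility problems, and new hardware may cause performance swings. A safer approach is to run a canary release at the gateway layer.
In practice, you can prepare two InferencePools behind the same externally visible model name. For example, external users always request:
model=mymodel
while the gateway has two inference pools attached as backends: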
qwen3-8b-mymodel
deepseek-r1-7b-mymodel
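With HTTPRoute weights, traffic can be split across the pools at 70/30, 90/10, or any other ratio. Callers keep using the same request, while the platform gradually shifts traffic from the old model to the new one.
This approach fits two typical scenarios.
The first is infrastructure canarying. When a new GPU type, driver, inference image, or set of runtime parameters goes live, you can send a small share of traffic to it to verify stability and then increase the share step by step.
The second is model canarying. When a fine-tuned model, new weights, or a new base model goes live, a small share of traffic lets you observe quality and stability without the risk of a one-shot switch.
Seen this way, the inference gateway is no longer just an entry point; it becomes part of model lifecycle management. Model launch, switchover, and rollback can all be done through traffic management rather than relying entirely on changes to the underlying workloads.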
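The weighted split described above can be illustrated with a short simulation. This is a sketch of proportional selection in general, not of how any particular gateway picks backends; the seeded random generator and the function names are assumptions made for the example, and the pool names and the 70/30 ratio are the ones used in this section.

```python
import random
from collections import Counter

# Pool name -> weight, as in a 70/30 split.
BACKENDS = {"qwen3-8b-mymodel": 70, "deepseek-r1-7b-mymodel": 30}


def pick_backend(backends, rng):
    """Pick one backend with probability proportional to its weight."""
    names = list(backends)
    weights = [backends[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def simulate(backends, n, seed=0):
    """Count how many of n simulated requests land on each backend."""
    rng = random.Random(seed)
    return Counter(pick_backend(backends, rng) for _ in range(n))
```

Calling simulate(BACKENDS, 10000) gives counts close to 7000 and 3000. In this model, rolling back is just another weight change: setting a pool's weight to 0 removes it from selection without touching its workload.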
2.1 Deploy the Qwen3 model service
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    aitype: tuili
  name: qwen3-8b-mymodel
  namespace: roadshow
spec:
  replicas: 2
  selector:
    matchLabels:
      app: qwen3-8b-mymodel
  template:
    metadata:
      labels:
        aitype: tuili
        app: qwen3-8b-mymodel
    spec:
      containers:
      - args:
        - |
          seqNum=$(expr 1 - 1)
          CUDA_VISIBLE_DEVICES=$(seq -s, 0 $seqNum) /opt/conda/bin/python3 -m vllm.entrypoints.openai.api_server \
            --model /workspace/model/Qwen3-8B \
            --port 8080 \
            --tensor-parallel-size 1 \
            --gpu-memory-utilization 0.9 \
            --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
            --enable-auto-tool-choice \
            --tool-call-parser granite \
            --served-model-name mymodel \
            --trust-remote-code
        command:
        - /bin/bash
        - -c
        env:
        - name: RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES
          value: "1"
        image: aiimage.xxx.cn/tenant_public/vllm-mars:ai3.3-torch2.6-py312-ubuntu22.04-amd64
        imagePullPolicy: IfNotPresent
        name: qwen3-8b-mymodel-container
        resources:
          limits:
            cpu: "12"
            ephemeral-storage: 50Gi
            mars-tech.com/gpu: "1"
            memory: 96Gi
          requests:
            cpu: "12"
            ephemeral-storage: 50Gi
            mars-tech.com/gpu: "1"
            memory: 96Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /workspace/model
          name: localmodelvolume
          readOnly: true
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: volcano
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /zion0/modelsrepo/models
          type: Directory
        name: localmodelvolume
      - emptyDir:
          medium: Memory
          sizeLimit: 15Gi
        name: dshm
---
apiVersion: v1
kind: Service
metadata:
  name: qwen3-8b-mymodel-service
  namespace: roadshow
  labels:
    app: qwen3-8b-mymodel
spec:
  type: NodePort
  selector:
    app: qwen3-8b-mymodel
  ports:
  - name: http
    port: 8080
    targetPort: 8080
    nodePort: 31311
EOF
# Verify that the model is being served
curl -sS http://10.8.17.200:31311/v1/models
2.2 Deploy the InferencePool for the Qwen3 model
# Install the InferencePool and the EPP with Helm (run from the chart directory)
helm upgrade --install qwen3-8b-mymodel . \
  --namespace roadshow --create-namespace \
  --dependency-update \
  --set inferencePool.modelServers.matchLabels.app=qwen3-8b-mymodel \
  -f values.yaml
2.3 Deploy the DeepSeek R1 model service
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    aitype: tuili
  name: deepseek-r1-7b-mymodel
  namespace: roadshow
spec:
  replicas: 2
  selector:
    matchLabels:
      app: deepseek-r1-7b-mymodel
  template:
    metadata:
      labels:
        aitype: tuili
        app: deepseek-r1-7b-mymodel
    spec:
      containers:
      - args:
        - |
          seqNum=$(expr 1 - 1)
          CUDA_VISIBLE_DEVICES=$(seq -s, 0 $seqNum) /opt/conda/bin/python3 -m vllm.entrypoints.openai.api_server \
            --model /workspace/model/DeepSeek-R1-Distill-Qwen-7B \
            --port 8080 \
            --tensor-parallel-size 1 \
            --gpu-memory-utilization 0.9 \
            --enable-auto-tool-choice \
            --tool-call-parser granite \
            --served-model-name mymodel \
            --trust-remote-code
        command:
        - /bin/bash
        - -c
        env:
        - name: RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES
          value: "1"
        image: aiimage.xxx.cn/tenant_public/vllm-mars:ai3.3-torch2.6-py312-ubuntu22.04-amd64
        imagePullPolicy: IfNotPresent
        name: deepseek-r1-7b-mymodel-container
        resources:
          limits:
            cpu: "12"
            ephemeral-storage: 50Gi
            mars-tech.com/gpu: "1"
            memory: 96Gi
          requests:
            cpu: "12"
            ephemeral-storage: 50Gi
            mars-tech.com/gpu: "1"
            memory: 96Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /workspace/model
          name: localmodelvolume
          readOnly: true
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: volcano
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /zion0/modelsrepo/models
          type: Directory
        name: localmodelvolume
      - emptyDir:
          medium: Memory
          sizeLimit: 15Gi
        name: dshm
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1-7b-mymodel-service
  namespace: roadshow
  labels:
    app: deepseek-r1-7b-mymodel
spec:
  type: NodePort
  selector:
    app: deepseek-r1-7b-mymodel
  ports:
  - name: http
    port: 8080
    targetPort: 8080
    nodePort: 31312
EOF
# Verify that the model is being served
curl -sS http://10.8.17.200:31312/v1/models
2.4 Deploy the InferencePool for the DeepSeek R1 model
# Install the InferencePool and the EPP with Helm (run from the chart directory)
helm upgrade --install deepseek-r1-7b-mymodel . \
  --namespace roadshow --create-namespace \
  --dependency-update \
  --set inferencePool.modelServers.matchLabels.app=deepseek-r1-7b-mymodel \
  -f values.yaml
2.5 Configure the route
kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  labels:
    capsule.clastix.io/managed-by: hypersuite
  name: mymodel-route
  namespace: roadshow
spec:
  hostnames:
  - llm.xxx.cn
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: llm
    namespace: kgateway-system
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: qwen3-8b-mymodel
      weight: 70
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: deepseek-r1-7b-mymodel
      weight: 30
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          replacePrefixMatch: /
          type: ReplacePrefixMatch
    matches:
    - path:
        type: PathPrefix
        value: /roadshow/mymodel
EOF
2.6 Testing
2.6.1 Basic verification
curl -sS https://llm.xxx.cn/roadshow/mymodel/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mymodel",
    "messages": [{"role": "user", "content": "Tell me in one sentence which model you are"}],
    "temperature": 0.2
  }'
# Send the same request again; with weighted routing, either pool may answer.
curl -sS https://llm.xxx.cn/roadshow/mymodel/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "mymodel",
    "messages": [{"role": "user", "content": "Tell me in one sentence which model you are"}],
    "temperature": 0.2
  }'
2.6.2 Load test
💡 Qwen delta = Qwen total after - Qwen total before; DeepSeek delta = DeepSeek total after - DeepSeek total before
# Before the load test:
# record the current total of generated tokens across the Qwen pods.
for ip in $(kubectl -n roadshow get pod -l app=qwen3-8b-mymodel -o wide --no-headers | awk '{print $6}'); do
  curl -sS "http://$ip:8080/metrics" | awk '/^vllm:generation_tokens_total/ {sum+=$2} END {print sum}'
done | awk '{s+=$1} END {print "QWEN_BEFORE=" s}'
QWEN_BEFORE=71938
# Record the current total of generated tokens across the DeepSeek pods.
for ip in $(kubectl -n roadshow get pod -l app=deepseek-r1-7b-mymodel -o wide --no-headers | awk '{print $6}'); do
  curl -sS "http://$ip:8080/metrics" | awk '/^vllm:generation_tokens_total/ {sum+=$2} END {print sum}'
done | awk '{s+=$1} END {print "DEEPSEEK_BEFORE=" s}'
DEEPSEEK_BEFORE=15545
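For readers less familiar with awk, the per-pod summation in the loops above is equivalent to the following Python sketch. It mirrors the awk logic (lines that start with the metric name, second whitespace-separated field as the value). The SAMPLE text is made up for illustration and is not output from the cluster.

```python
METRIC = "vllm:generation_tokens_total"

# Made-up sample in Prometheus text format, for illustration only.
SAMPLE = """# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="mymodel"} 100.0
vllm:prompt_tokens_total{model_name="mymodel"} 999.0
"""


def sum_metric(metrics_text, name=METRIC):
    """Sum the values of every sample line that starts with the metric name."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith(name):
            total += float(line.split()[1])
    return total


def sum_across_pods(texts, name=METRIC):
    """Add up the per-pod sums, like the outer awk after the loop."""
    return sum(sum_metric(text, name) for text in texts)
```

Comment lines begin with #, so they never match the prefix test, and other counters such as prompt tokens are ignored.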
# Send requests with vLLM's built-in bench tool
vllm bench serve \
  --backend openai-chat \
  --base-url http://10.8.17.152:8090/roadshow/mymodel \
  --endpoint /v1/chat/completions \
  --model mymodel \
  --tokenizer /workspace/model/Qwen3-8B \
  --dataset-name random \
  --random-input-len 512 \
  --random-output-len 256 \
  --random-range-ratio 0.2 \
  --num-prompts 200 \
  --max-concurrency 20 \
  --header Host=llm.xxx.cn
============ Serving Benchmark Result ============
Successful requests: 200
Maximum request concurrency: 20
Benchmark duration (s): 46.05
Total input tokens: 103623
Total generated tokens: 51099
Request throughput (req/s): 4.34
Output token throughput (tok/s): 1109.65
Peak output token throughput (tok/s): 1294.00
Peak concurrent requests: 29.00
Total Token throughput (tok/s): 3359.89
---------------Time to First Token----------------
Mean TTFT (ms): 129.51
Median TTFT (ms): 104.01
P99 TTFT (ms): 367.67
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 16.34
Median TPOT (ms): 16.96
P99 TPOT (ms): 18.48
---------------Inter-token Latency----------------
Mean ITL (ms): 16.50
Median ITL (ms): 16.35
P99 ITL (ms): 39.70
==================================================
# Load test finished.
# Recount the total generated tokens across the Qwen pods.
for ip in $(kubectl -n roadshow get pod -l app=qwen3-8b-mymodel -o wide --no-headers | awk '{print $6}'); do
  curl -sS "http://$ip:8080/metrics" | awk '/^vllm:generation_tokens_total/ {sum+=$2} END {print sum}'
done | awk '{s+=$1} END {print "QWEN_AFTER=" s}'
QWEN_AFTER=110214
# Recount the total generated tokens across the DeepSeek pods.
for ip in $(kubectl -n roadshow get pod -l app=deepseek-r1-7b-mymodel -o wide --no-headers | awk '{print $6}'); do
  curl -sS "http://$ip:8080/metrics" | awk '/^vllm:generation_tokens_total/ {sum+=$2} END {print sum}'
done | awk '{s+=$1} END {print "DEEPSEEK_AFTER=" s}'
DEEPSEEK_AFTER=28596
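# Final check
Qwen delta:
110214 - 71938 = 38276
DeepSeek delta:
28596 - 15545 = 13051
Total delta:
38276 + 13051 = 51327
Approximate shares:
• Qwen: 74.6%
• DeepSeek: 25.4%
This is close to the configured 70/30 weights. The weights split requests, while this measurement counts generated tokens, so an exact match is not expected.
| Item | Before | After | Delta | Share of total delta |
|---|---|---|---|---|
| Qwen | 71938 | 110214 | 38276 | 74.6% |
| DeepSeek | 15545 | 28596 | 13051 | 25.4% |
| Total | 87483 | 138810 | 51327 | 100% |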
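The arithmetic in the table can be reproduced with a few lines of Python, which is handy when repeating the test with other weights. The function name traffic_shares is illustrative; the numbers are the before and after totals recorded above.

```python
def traffic_shares(before, after):
    """Return {name: (delta, percent of total delta)} from counter snapshots."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total == 0:
        raise ValueError("no tokens were generated between the two snapshots")
    return {name: (d, round(100 * d / total, 1)) for name, d in deltas.items()}


# Totals recorded before and after the load test.
before = {"qwen": 71938, "deepseek": 15545}
after = {"qwen": 110214, "deepseek": 28596}
shares = traffic_shares(before, after)
# shares == {"qwen": (38276, 74.6), "deepseek": (13051, 25.4)}
```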
