Cloud-Native AI Infrastructure on K8s: Architecture, Deployment, and Practice [009] - The Traffic Entry Point for AI Compute (Part 1)

1. From Ingress to Gateway API

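In traditional Kubernetes deployments, inbound traffic is usually handled by Ingress. It offers a relatively simple HTTP entry model that forwards requests to backend Services by Host and Path, and it was widely adopted in early applications. However, the Kubernetes community (driven by Kubernetes SIG Network) has made it clear that the Ingress API is frozen (feature freeze): no new features will be added, and new traffic management capabilities are evolving mainly in the Gateway API project. In other words, Ingress remains usable, but its capability boundary is essentially fixed.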

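Gateway API is not just a feature upgrade; it changes how the API is designed. Ingress binds the entry point and routing into a single resource, whereas Gateway API splits them into separate resources: Gateway defines how traffic enters the cluster (ports, protocols, TLS, and so on), and HTTPRoute defines the matching rules and forwarding logic. The key consequence is that the entry point, routing, and later policy capabilities are no longer coupled in one object and can be modeled separately and evolve independently. In practice, entry configuration, forwarding rules, and extensions no longer pile up in a single resource, which reduces configuration complexity and the blast radius of changes.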

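In traditional web scenarios this difference is hard to notice, but on AI platforms it is amplified. A core trait of AI inference traffic is that backend selection is often no longer based on a static path or domain; it has to be decided dynamically from the model identifier, service state, and scheduling policy. The problem handled at the entry layer gradually shifts from "static routing" to "dynamic scheduling".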

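Ingress has no standard fields for this kind of logic, so in practice it is usually implemented through the annotations of a specific Ingress Controller. That approach lacks a common specification and ties configuration to one implementation, which hurts maintainability and portability. Gateway API, by contrast, provides an extensible resource model, so such dynamic traffic management can be built step by step behind a unified interface instead of through controller-private extensions. This is the foundation on which several current AI gateway implementations and the Gateway API Inference Extension are built.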

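So the more accurate statement is not "Ingress is no longer usable" but rather: Ingress's capabilities have stabilized, while Gateway API is becoming the direction for complex traffic management, AI inference traffic in particular. On AI platforms, where entry logic is complex and backends are highly dynamic, shifting the architectural focus toward Gateway API is a natural technical evolution.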

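Up to this point the discussion has been about the interface model of the traffic entry point. Gateway API answers "how to describe the way traffic should be handled"; the question of "who handles the traffic and how these capabilities are implemented" falls to the actual gateway components. The current ecosystem already has a number of implementations built on Gateway API, including general-purpose API gateways and gateways extended for AI inference. These components usually take on the data plane role of processing traffic and may also ship a control plane that manages configuration and policy. It is at this layer that the choice of solution stops being unique.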

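In practice, however, it is obvious that these AI gateway projects are still evolving rapidly: component boundaries, feature scope, and project structure are all changing drastically. In my early work I used kgateway as the starting point for an AI gateway. Early official documentation defined kgateway as a control plane whose data plane came in two implementations: an Envoy-based proxy and the agentgateway proxy. Judging from what the project communicated at the time, the focus for AI features was clearly shifting toward agentgateway. Based on that expectation, I began extensive testing and validation right after the initial research. Starting with the 2.2.x releases, however, the official architecture changed abruptly: the agentgateway API was completely split out of kgateway, and continuing to use agentgateway as the data plane required a tedious migration. Soon afterwards kgateway and agentgateway became fully independent Helm repositories, and the official documentation was revised several times along the way. Such frequent churn in the core architecture, together with backward-incompatible changes, was painful to deal with. This part therefore does not aim to give a single "correct" solution; it is a reasonably objective review, based on my current practice, of the roles these AI gateways play, how they can be combined, and what they are like to use.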

[Figure: Envoy-based kgateway proxy]

[Figure: agentgateway proxy]

AI gateways offer a rich set of features; this article focuses only on the ones most commonly used on AI compute platforms.

2. Intelligent Forwarding of AI Inference Traffic

With the entry model moved from Ingress to Gateway API, the next question is how AI inference traffic should be forwarded more intelligently to backend model instances.

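A traditional Kubernetes traffic entry point mainly answers "how does a request enter the cluster" and "which Service does it go to, based on Host / Path". That model suits ordinary web services, but for large-model inference services the problem goes well beyond it.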

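Behind every inference request, what is actually consumed is GPU memory, KV Cache, batch compute capacity, and token generation capacity. Even when all model instances are Ready, their real load can differ completely: one instance may be processing long-context requests, another may have high KV Cache utilization, another may have a backed-up queue, and yet another may be relatively idle. If traditional round-robin or simple load balancing is still used, a request may well be forwarded to a backend that "looks available but is actually busy", which increases TTFT (time to first token), lengthens queueing time, and can even trigger timeouts.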

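In AI inference scenarios, therefore, the entry layer should not be just a Layer 7 forwarder; it should gradually become aware of inference semantics. It needs to know not only which service a request should go to, but also which model, which replica, and which endpoint is best suited to handle it.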

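Gateway API Inference Extension was proposed to address exactly this problem. It is an official project in the Kubernetes SIG Network ecosystem whose goal is to provide, on top of Gateway API, better optimized and more standardized inference traffic routing for self-hosted generative AI models on Kubernetes. The official documentation positions it as an optimized routing and load-balancing mechanism for self-hosted generative AI workloads, with key capabilities including model-aware routing, serving priority, model rollouts, and custom load balancing based on real-time metrics from the model servers.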

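Architecturally, Gateway API Inference Extension is not "yet another gateway product". It fills a gap in the Gateway API ecosystem: once an inference request has entered the gateway, how should the backend endpoint be chosen?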

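In a traditional gateway, backend selection usually depends on static rules such as Host, Path, Header, weights, or connection state. In an inference gateway, backend selection must also take the runtime state of the model servers into account, for example queue length, KV Cache utilization, Prefix Cache hits, LoRA Adapter availability, and capability differences between model instances. The official documentation also states that an InferencePool represents a group of inference backends and can be associated with an Endpoint Picker Extension, which selects a more suitable backend for each request based on metrics and policy.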

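In other words, a traditional gateway answers: can the request be forwarded to a backend? AI Aware Inference Routing answers: which backend should the request be forwarded to in order to get better inference efficiency and resource utilization?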


2.1 Core Components of the Inference Gateway

In the design of Gateway API Inference Extension, the overall chain can be understood as three layers.

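The first layer is still Gateway API itself. Gateway defines how traffic enters the cluster, and HTTPRoute defines how requests are matched and forwarded. This part keeps the standard Gateway API resource model, so inbound traffic can still be managed declaratively in the Kubernetes-native way.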

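The second layer is InferencePool. Unlike an ordinary Service, which only represents a static backend, it represents a group of model server instances that can serve inference requests for a given model. From the platform's point of view, an InferencePool is an "inference pool": it may contain multiple model replicas spread across different nodes, different GPUs, or different inference engines.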

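The third layer is the Endpoint Picker Extension, or EPP. EPP is the core component of the inference scheduling logic. It continuously collects metrics from the model servers and takes part in endpoint selection when a request arrives. The official project description also notes that an Inference Gateway is a proxy/load-balancer coupled with an Endpoint Picker, used to provide optimized routing and load balancing for self-hosted generative AI workloads on Kubernetes.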

The logical path of a request can therefore be abstracted as:

Client
   → Gateway
   → HTTPRoute
   → InferencePool
   → Endpoint Picker Extension
   → Model Server Pod

 

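The most important change in this chain is that a request does not simply enter a Service and get distributed by kube-proxy or a traditional load balancer. Instead, it enters the InferencePool, and EPP makes a further choice based on the real-time state of the model instances. This is the most fundamental difference between an AI inference entry point and a traditional web entry point.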

2.2 Basic Installation of the Inference Extension

Next comes the hands-on part. The first step is to deploy the Inference Extension CRDs.

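💡 Note: all configuration in this section is based on kgateway 2.2.1 from January 2026. Starting with this version, the CRDs of the agentgateway data plane are fully separated.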

# Confirm that inferenceExtension is enabled in the current kgateway deployment; if it is not, follow the official deployment documentation
helm -n kgateway-system get values kgateway -o yaml | sed -n '1,80p'

helm -n kgateway-system get values agentgateway -o yaml | sed -n '1,80p'

inferenceExtension:
  enabled: true  

# Automatically fetch the latest stable release tag of the Gateway API Inference Extension project, so the version does not have to be specified by hand.
IGW_LATEST_RELEASE=$(curl -s https://api.github.com/repos/kubernetes-sigs/gateway-api-inference-extension/releases \
  | jq -r '.[] | select(.prerelease == false) | .tag_name' \
  | sort -V \
  | tail -n1)

# Show the selected version
echo "$IGW_LATEST_RELEASE"
v1.3.1
  
# Install the Kubernetes extension APIs (CRDs) required by the Inference Extension, so the cluster recognizes and can create Inference-related custom resources.
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/${IGW_LATEST_RELEASE}/manifests.yaml

customresourcedefinition.apiextensions.k8s.io/inferencemodelrewrites.inference.networking.x-k8s.io configured
customresourcedefinition.apiextensions.k8s.io/inferenceobjectives.inference.networking.x-k8s.io configured
customresourcedefinition.apiextensions.k8s.io/inferencepoolimports.inference.networking.x-k8s.io configured
customresourcedefinition.apiextensions.k8s.io/inferencepools.inference.networking.k8s.io configured
customresourcedefinition.apiextensions.k8s.io/inferencepools.inference.networking.x-k8s.io configured  
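
The tag-selection pipeline used above relies on sort -V (version sort) rather than plain lexical sorting, which matters once a minor version reaches two digits. The following is a minimal offline sketch of that step only. latest_stable_tag is a hypothetical helper name, the tag list is made up, and the regular expression is an assumption that stands in for the prerelease == false filter applied by jq in the real command.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any project tooling): read release tags
# from stdin, keep plain vMAJOR.MINOR.PATCH tags, and print the highest one.
latest_stable_tag() {
  grep -E '^v[0-9]+\.[0-9]+\.[0-9]+$' | sort -V | tail -n1
}

# Made-up tag list: a lexical sort would rank v1.3.1 above v1.10.0,
# while version sort compares the numeric components.
printf '%s\n' v1.3.1 v1.10.0 v1.2.0 v1.11.0-rc.1 | latest_stable_tag
```

With this sample input the release candidate is filtered out and v1.10.0 is printed. For reproducible installs it may be preferable to pin IGW_LATEST_RELEASE to an explicit tag instead of resolving it at install time.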

 


2.3 Verifying AI Aware Routing with the vLLM Inference Framework

💡 This example uses the common single-node multi-GPU inference setup, deployed as a Deployment resource.

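In practice, validation starts with a vLLM model service as the backend. vLLM exposes the /v1/models and /v1/chat/completions endpoints through its OpenAI-compatible API. Each model service replica runs as part of a Kubernetes Deployment and is associated with the InferencePool through labels.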

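Once the model service is deployed, first verify the backend by connecting to it directly through a NodePort. This step is important: only after the model service itself is confirmed to respond correctly can later failures be attributed to the Gateway, HTTPRoute, InferencePool, or EPP part of the chain.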

When a direct call to /v1/models returns the model ID and a direct call to /v1/chat/completions returns a normal inference result, the backend model service is basically usable.

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    aitype: tuili
  name: qwen3-8b-demo
  namespace: roadshow
spec:
  replicas: 3
  selector:
    matchLabels:
      app: qwen3-8b-demo
  template:
    metadata:
      labels:
        aitype: tuili
        app: qwen3-8b-demo
    spec:
      containers:
      - args:
        - |
          seqNum=$(expr 1 - 1)
          CUDA_VISIBLE_DEVICES=$(seq -s, 0 $seqNum) /opt/conda/bin/python3 -m vllm.entrypoints.openai.api_server --model /workspace/model/Qwen3-8B  --port 8080 --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}'  --enable-auto-tool-choice --tool-call-parser granite --served-model-name Qwen3-8B --trust-remote-code
        command:
        - /bin/bash
        - -c
        env:
        - name: RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES
          value: "1"
        image: x.x.x.x/tenant_public/vllm-mars:ai3.3-torch2.6-py312-ubuntu22.04-amd64
        imagePullPolicy: IfNotPresent
        name: qwen3-8b-demo-container-01
        resources:
          limits:
            cpu: "12"
            ephemeral-storage: 50Gi
            mars-tech.com/gpu: "1"
            memory: 96Gi
          requests:
            cpu: "12"
            ephemeral-storage: 50Gi
            mars-tech.com/gpu: "1"
            memory: 96Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /workspace/model
          name: localmodelvolume
          readOnly: true
        - mountPath: /dev/shm
          name: dshm
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: volcano
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /zion0/modelsrepo/models
          type: Directory
        name: localmodelvolume
      - emptyDir:
          medium: Memory
          sizeLimit: 15Gi
        name: dshm
---
apiVersion: v1
kind: Service
metadata:
  name: qwen3-8b-demo-service
  namespace: roadshow
  labels:
    app: qwen3-8b-demo
spec:
  type: NodePort
  selector:
    app: qwen3-8b-demo
  ports:
    - name: http
      port: 8080
      targetPort: 8080
      nodePort: 31301
EOF

# Verify service availability over the internal network
curl -sS http://10.8.17.200:31301/v1/models

{"object":"list","data":[{"id":"Qwen3-8B","object":"model","created":1772164142,"owned_by":"vllm","root":"/workspace/model/Qwen3-8B","parent":null,"max_model_len":131072,"permission":[{"id":"modelperm-a82e27138dc5437484b13e3f91486070","object":"model_permission","created":1772164142,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

curl -sS http://10.8.17.200:31301/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model":"Qwen3-8B",
    "messages":[{"role":"user","content":"你好,用一句话自我介绍"}],
    "temperature":0.2
  }'

 

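Next, expose a unified entry point through a Gateway and use an HTTPRoute to forward the external path to the InferencePool. The biggest difference from an ordinary HTTPRoute is that backendRefs points to an InferencePool instead of a Service. A request that enters the gateway is therefore not statically forwarded to a Service; it enters the inference pool, where EPP takes part in endpoint selection.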

# 1. Configure an agentgateway Gateway instance listening on port 8090
kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llm
  namespace: kgateway-system
spec:
  gatewayClassName: agentgateway
  listeners:
  - allowedRoutes:
      namespaces:
        from: All
    name: http
    port: 8090
    protocol: HTTP
EOF

# Check
kubectl get gateway llm -n kgateway-system
NAME   CLASS          ADDRESS       PROGRAMMED   AGE
llm    agentgateway   10.8.17.152   True         69d

# 2. Configure the route
kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  labels:
    capsule.clastix.io/managed-by: hypersuite
  name: qwen3-8b-demo-route
  namespace: roadshow
spec:
  hostnames:
  - llm.xxx.cn
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: llm
    namespace: kgateway-system
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: qwen3-8b-demo
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          replacePrefixMatch: /
          type: ReplacePrefixMatch
    matches:
    - path:
        type: PathPrefix
        value: /roadshow/qwen3-8b-demo
EOF


# 3. Deploy the InferencePool and Endpoint Picker Extension
helm upgrade --install qwen3-8b-demo . \
  --namespace roadshow --create-namespace \
  --dependency-update \
  --set inferencePool.modelServers.matchLabels.app=qwen3-8b-demo \
  -f values.yaml

NAME: qwen3-8b-demo
LAST DEPLOYED: Fri Feb 27 04:26:17 2026
NAMESPACE: roadshow
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
NOTES:
InferencePool qwen3-8b-demo deployed.

 

The default EPP configuration contains a plugin combination similar to the following:

# Load-balancing strategy used by default inside the pool
kubectl -n roadshow get cm qwen3-8b-demo-epp -o yaml
apiVersion: v1
data:
  default-plugins.yaml: |
    apiVersion: inference.networking.x-k8s.io/v1alpha1
    kind: EndpointPickerConfig
    plugins:
    - type: queue-scorer
    - type: kv-cache-utilization-scorer
    - type: prefix-cache-scorer
    schedulingProfiles:
    - name: default
      plugins:
      - pluginRef: queue-scorer
        weight: 2
      - pluginRef: kv-cache-utilization-scorer
        weight: 2
      - pluginRef: prefix-cache-scorer
        weight: 3
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: qwen3-8b-demo
    meta.helm.sh/release-namespace: roadshow
  creationTimestamp: "2026-03-13T03:31:53Z"
  labels:
    app.kubernetes.io/managed-by: Helm
    capsule.clastix.io/managed-by: livedemo
  name: qwen3-8b-demo-epp
  namespace: roadshow
  resourceVersion: "357726537"
  uid: 7133950d-a3ba-4bba-8b3c-509c1abd1195
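
The default profile above attaches weights 2, 2, and 3 to the queue, KV Cache, and Prefix Cache scorers. As a rough illustration of how a weighted combination of scorers can rank endpoints, here is a minimal sketch. It assumes every scorer yields a value in [0, 1] where higher means more attractive, and that the endpoint with the highest weighted sum wins. rank_endpoints and the pod scores are hypothetical; the real EPP normalization and tie-breaking are not reproduced here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rank endpoints by a weighted sum of scorer outputs.
# Input lines: <pod> <queue_score> <kv_cache_score> <prefix_cache_score>,
# each score assumed to be in [0,1] with higher meaning "more attractive".
# The weights 2/2/3 mirror the default scheduling profile shown above.
rank_endpoints() {
  awk -v wq=2 -v wk=2 -v wp=3 '
    NF == 4 {
      total = wq * $2 + wk * $3 + wp * $4
      printf "%s %.2f\n", $1, total
    }' | LC_ALL=C sort -k2,2nr -k1,1
}

# Made-up scores for three pods; the first output line is the best candidate.
printf '%s\n' \
  "pod-a 1.0 0.9 0.0" \
  "pod-b 0.2 0.5 1.0" \
  "pod-c 0.6 0.6 0.5" | rank_endpoints
```

With these made-up numbers pod-b ranks first even though its queue score is the lowest, because a full prefix-cache match carries the largest weight. Adjusting the weights in the ConfigMap shifts that trade-off.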

 

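The default plugin configuration shows that EPP does not simply round-robin; it scores and selects backend endpoints by combining queue length, KV Cache utilization, and Prefix Cache information. Next, accessing /v1/models and /v1/chat/completions through the Gateway address verifies that the full chain works:

# Verify that the request flow works
curl -sS -H 'Host: llm.xxx.cn' http://10.8.17.152:8090/roadshow/qwen3-8b-demo/v1/models

curl -sS -H 'Host: llm.xxx.cn' http://10.8.17.152:8090/roadshow/qwen3-8b-demo/v1/chat/completions \
  -H 'Content-Type: application/json' \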
  -d '{
    "model":"Qwen3-8B",
    "messages":[{"role":"user","content":"你好,用一句话自我介绍"}],
    "temperature":0.2
  }'

# See which replicas actually receive traffic
kubectl get pods -n roadshow -o wide
qwen3-8b-demo-866dbc74c4-mjhq6                1/1     Running     0          3h57m   192.168.107.194   gpu-worker-65    <none>           <none>
qwen3-8b-demo-866dbc74c4-xmz7j                1/1     Running     0          3h57m   192.168.112.49    gpu-worker-72    <none>           <none>
qwen3-8b-demo-epp-679ff99955-wrg94            1/1     Running     0          33h     192.168.117.45    gpu-worker-77    <none>           <none>

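# This script queries /metrics on the 2 vLLM backend Pods once per second and shows, in real time, the running requests, waiting requests, cache utilization, and cumulative tokens of each instance.
# Its purpose is to show directly which backend instances the load-test traffic lands on, and whether it is really spread across multiple Pods.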
watch -n 1 '
for ip in 192.168.107.194 192.168.112.49; do
  echo "===== $ip ====="
  curl -sS http://$ip:8080/metrics | egrep "vllm:num_requests_running|vllm:num_requests_waiting|vllm:gpu_cache_usage_perc|vllm:request_prompt_tokens_sum|vllm:generation_tokens_total"
  echo
done
'

Every 1.0s:                                                                                                                         master-01: Sat Mar 14 13:20:08 2026

===== 192.168.107.194 =====
# HELP vllm:num_requests_running Number of requests in model execution batches.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{engine="0",model_name="Qwen3-8B"} 0.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{engine="0",model_name="Qwen3-8B"} 0.0
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{engine="0",model_name="Qwen3-8B"} 509.0
vllm:request_prompt_tokens_sum{engine="0",model_name="Qwen3-8B"} 30.0

===== 192.168.112.49 =====
# HELP vllm:num_requests_running Number of requests in model execution batches.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{engine="0",model_name="Qwen3-8B"} 0.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{engine="0",model_name="Qwen3-8B"} 0.0
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{engine="0",model_name="Qwen3-8B"} 0.0
vllm:request_prompt_tokens_sum{engine="0",model_name="Qwen3-8B"} 0.0

 

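For AI Aware Routing, however, "the request returns" is not enough. What really matters is confirming that EPP actually took part in the scheduling decision, which requires looking at the /metrics endpoint exposed by EPP. During a load test, the following kinds of metrics can be observed: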

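# Verify that EPP has obtained the routing metrics from the model servers
# Query the EPP service endpoint to check that the model server metrics are really collected; the queue size changes under load.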
kubectl -n roadshow get svc qwen3-8b-demo-epp -o wide
NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE   SELECTOR
qwen3-8b-demo-epp   ClusterIP   10.101.199.180   <none>        9002/TCP,9090/TCP   11m   inferencepool=qwen3-8b-demo-epp


curl -sS http://10.101.199.180:9090/metrics | egrep "inference_pool_ready_pods|inference_pool_per_pod_queue_size|inference_pool_average_" | head -n 80
# HELP inference_pool_average_kv_cache_utilization [ALPHA] The average kv cache utilization for an inference server pool.
# TYPE inference_pool_average_kv_cache_utilization gauge
inference_pool_average_kv_cache_utilization{name="qwen3-8b-demo"} 0.04881823810044065
# HELP inference_pool_average_queue_size [ALPHA] The average number of requests pending in the model server queue.
# TYPE inference_pool_average_queue_size gauge
inference_pool_average_queue_size{name="qwen3-8b-demo"} 48
# HELP inference_pool_per_pod_queue_size [ALPHA] The total number of requests pending in the model server queue for each underlying pod.
# TYPE inference_pool_per_pod_queue_size gauge
inference_pool_per_pod_queue_size{model_server_pod="qwen3-8b-demo-866dbc74c4-47k6n-rank-0",name="qwen3-8b-demo"} 0
inference_pool_per_pod_queue_size{model_server_pod="qwen3-8b-demo-866dbc74c4-5clwk-rank-0",name="qwen3-8b-demo"} 144
inference_pool_per_pod_queue_size{model_server_pod="qwen3-8b-demo-866dbc74c4-6hv5n-rank-0",name="qwen3-8b-demo"} 0
# HELP inference_pool_ready_pods [ALPHA] The number of ready pods in the inference server pool.
# TYPE inference_pool_ready_pods gauge
inference_pool_ready_pods{name="qwen3-8b-demo"} 3

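# Inspect the EPP metrics to confirm that EPP made the request routing decisions.
# inference_extension_scheduler_attempts_total{status="success"} 5
This shows that EPP has successfully completed 5 scheduling decisions (that is, at least 5 ext-proc calls triggered an endpoint pick).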

curl -sS http://10.101.199.180:9090/metrics | egrep -i 'ext.?proc|grpc|http|request|decision|pick|schedule' | head -n 120

# HELP go_cpu_classes_gc_mark_idle_cpu_seconds_total Estimated total CPU time spent performing GC tasks on spare CPU resources that the Go scheduler could not otherwise find a use for. This should be subtracted from the total GC CPU time to obtain a measure of compulsory GC CPU time. This metric is an overestimate, and not directly comparable to system CPU time measurements. Compare only with other /cpu/classes metrics. Sourced from /cpu/classes/gc/mark/idle:cpu-seconds.
# HELP go_godebug_non_default_behavior_http2client_events_total The number of non-default behaviors executed by the net/http package due to a non-default GODEBUG=http2client=... setting. Sourced from /godebug/non-default-behavior/http2client:events.
# TYPE go_godebug_non_default_behavior_http2client_events_total counter
go_godebug_non_default_behavior_http2client_events_total 0
# HELP go_godebug_non_default_behavior_http2server_events_total The number of non-default behaviors executed by the net/http package due to a non-default GODEBUG=http2server=... setting. Sourced from /godebug/non-default-behavior/http2server:events.
# TYPE go_godebug_non_default_behavior_http2server_events_total counter
go_godebug_non_default_behavior_http2server_events_total 0
# HELP go_godebug_non_default_behavior_httpcookiemaxnum_events_total The number of non-default behaviors executed by the net/http package due to a non-default GODEBUG=httpcookiemaxnum=... setting. Sourced from /godebug/non-default-behavior/httpcookiemaxnum:events.
# TYPE go_godebug_non_default_behavior_httpcookiemaxnum_events_total counter
go_godebug_non_default_behavior_httpcookiemaxnum_events_total 0
# HELP go_godebug_non_default_behavior_httplaxcontentlength_events_total The number of non-default behaviors executed by the net/http package due to a non-default GODEBUG=httplaxcontentlength=... setting. Sourced from /godebug/non-default-behavior/httplaxcontentlength:events.
# TYPE go_godebug_non_default_behavior_httplaxcontentlength_events_total counter
go_godebug_non_default_behavior_httplaxcontentlength_events_total 0
# HELP go_godebug_non_default_behavior_httpmuxgo121_events_total The number of non-default behaviors executed by the net/http package due to a non-default GODEBUG=httpmuxgo121=... setting. Sourced from /godebug/non-default-behavior/httpmuxgo121:events.
# TYPE go_godebug_non_default_behavior_httpmuxgo121_events_total counter
go_godebug_non_default_behavior_httpmuxgo121_events_total 0
# HELP go_godebug_non_default_behavior_httpservecontentkeepheaders_events_total The number of non-default behaviors executed by the net/http package due to a non-default GODEBUG=httpservecontentkeepheaders=... setting. Sourced from /godebug/non-default-behavior/httpservecontentkeepheaders:events.
# TYPE go_godebug_non_default_behavior_httpservecontentkeepheaders_events_total counter
go_godebug_non_default_behavior_httpservecontentkeepheaders_events_total 0
# HELP go_sched_latencies_seconds Distribution of the time goroutines have spent in the scheduler in a runnable state before actually running. Bucket counts increase monotonically. Sourced from /sched/latencies:seconds.
inference_extension_plugin_duration_seconds_bucket{extension_point="Picker",plugin_name="max-score-picker",plugin_type="max-score-picker",le="0.01"} 5
inference_extension_plugin_duration_seconds_bucket{extension_point="Picker",plugin_name="max-score-picker",plugin_type="max-score-picker",le="0.02"} 5
inference_extension_plugin_duration_seconds_bucket{extension_point="Picker",plugin_name="max-score-picker",plugin_type="max-score-picker",le="0.05"} 5
inference_extension_plugin_duration_seconds_bucket{extension_point="Picker",plugin_name="max-score-picker",plugin_type="max-score-picker",le="0.1"} 5
inference_extension_plugin_duration_seconds_bucket{extension_point="Picker",plugin_name="max-score-picker",plugin_type="max-score-picker",le="+Inf"} 5
inference_extension_plugin_duration_seconds_sum{extension_point="Picker",plugin_name="max-score-picker",plugin_type="max-score-picker"} 2.5848999999999996e-05
inference_extension_plugin_duration_seconds_count{extension_point="Picker",plugin_name="max-score-picker",plugin_type="max-score-picker"} 5
inference_extension_plugin_duration_seconds_bucket{extension_point="PreRequest",plugin_name="prefix-cache-scorer",plugin_type="prefix-cache-scorer",le="0.0001"} 5
inference_extension_plugin_duration_seconds_bucket{extension_point="PreRequest",plugin_name="prefix-cache-scorer",plugin_type="prefix-cache-scorer",le="0.0002"} 5
inference_extension_plugin_duration_seconds_bucket{extension_point="PreRequest",plugin_name="prefix-cache-scorer",plugin_type="prefix-cache-scorer",le="0.0005"} 5
inference_extension_plugin_duration_seconds_bucket{extension_point="PreRequest",plugin_name="prefix-cache-scorer",plugin_type="prefix-cache-scorer",le="0.001"} 5
inference_extension_plugin_duration_seconds_bucket{extension_point="PreRequest",plugin_name="prefix-cache-scorer",plugin_type="prefix-cache-scorer",le="+Inf"} 5
inference_extension_plugin_duration_seconds_sum{extension_point="PreRequest",plugin_name="prefix-cache-scorer",plugin_type="prefix-cache-scorer"} 0.00010747600000000001
inference_extension_plugin_duration_seconds_count{extension_point="PreRequest",plugin_name="prefix-cache-scorer",plugin_type="prefix-cache-scorer"} 5
inference_extension_plugin_duration_seconds_bucket{extension_point="ProfilePicker",plugin_name="single-profile-handler",plugin_type="single-profile-handler",le="0.0001"} 10
inference_extension_plugin_duration_seconds_bucket{extension_point="ProfilePicker",plugin_name="single-profile-handler",plugin_type="single-profile-handler",le="0.0002"} 10
inference_extension_plugin_duration_seconds_bucket{extension_point="ProfilePicker",plugin_name="single-profile-handler",plugin_type="single-profile-handler",le="0.05"} 10
inference_extension_plugin_duration_seconds_bucket{extension_point="ProfilePicker",plugin_name="single-profile-handler",plugin_type="single-profile-handler",le="0.1"} 10
inference_extension_plugin_duration_seconds_sum{extension_point="ProfilePicker",plugin_name="single-profile-handler",plugin_type="single-profile-handler"} 4.733e-06
inference_extension_plugin_duration_seconds_count{extension_point="ProfilePicker",plugin_name="single-profile-handler",plugin_type="single-profile-handler"} 10
# HELP inference_extension_scheduler_attempts_total [ALPHA] Total number of scheduling attempts.
# TYPE inference_extension_scheduler_attempts_total counter
inference_extension_scheduler_attempts_total{status="success"} 5
# HELP inference_extension_scheduler_e2e_duration_seconds [ALPHA] End-to-end scheduling latency distribution in seconds.
# TYPE inference_extension_scheduler_e2e_duration_seconds histogram
inference_extension_scheduler_e2e_duration_seconds_bucket{le="0.0001"} 4
inference_extension_scheduler_e2e_duration_seconds_bucket{le="0.0002"} 4
inference_extension_scheduler_e2e_duration_seconds_bucket{le="0.0005"} 4
inference_extension_scheduler_e2e_duration_seconds_bucket{le="0.001"} 5
inference_extension_scheduler_e2e_duration_seconds_bucket{le="+Inf"} 5
inference_extension_scheduler_e2e_duration_seconds_sum 0.001092388
inference_extension_scheduler_e2e_duration_seconds_count 5
... ...
inference_objective_request_duration_seconds_bucket{model_name="Qwen3-8B",target_model_name="Qwen3-8B",le="1200"} 5
inference_objective_request_duration_seconds_bucket{model_name="Qwen3-8B",target_model_name="Qwen3-8B",le="1800"} 5
inference_objective_request_duration_seconds_bucket{model_name="Qwen3-8B",target_model_name="Qwen3-8B",le="2700"} 5
inference_objective_request_duration_seconds_bucket{model_name="Qwen3-8B",target_model_name="Qwen3-8B",le="3600"} 5
inference_objective_request_duration_seconds_bucket{model_name="Qwen3-8B",target_model_name="Qwen3-8B",le="+Inf"} 5
inference_objective_request_duration_seconds_sum{model_name="Qwen3-8B",target_model_name="Qwen3-8B"} 5.2231309790000005
inference_objective_request_duration_seconds_count{model_name="Qwen3-8B",target_model_name="Qwen3-8B"} 5

 

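These metrics demonstrate two things.

First, EPP can see the Pod state and queue depth of the backend model pool. For example, inference_pool_ready_pods reports the number of Ready model instances in the inference pool, and inference_pool_per_pod_queue_size shows the queue size of each model Pod.

Second, EPP has made scheduling decisions. For example, inference_extension_scheduler_attempts_total{status="success"} counts successful scheduling attempts, and inference_extension_plugin_duration_seconds shows how each plugin executed.

This step is critical, because it proves that the chain is no longer plain Layer 7 forwarding from the Gateway to a backend but has genuinely entered the inference-aware scheduling stage.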
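
When reading the EPP output during a load test, it can help to reduce the per-pod queue metric to a short table. The following is a minimal sketch that works on Prometheus text from stdin. per_pod_queue and busiest_pod are hypothetical helper names, and the sample lines are copied from the EPP output shown earlier in this section.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<pod> <queue_size>" for every
# inference_pool_per_pod_queue_size sample found on stdin.
per_pod_queue() {
  awk '
    /^inference_pool_per_pod_queue_size[{]/ {
      pod = $0
      sub(/.*model_server_pod="/, "", pod)  # drop everything up to the pod name
      sub(/".*/, "", pod)                   # drop everything after the pod name
      print pod, $NF                        # $NF is the sample value
    }'
}

# Hypothetical helper: print only the pod with the longest queue.
busiest_pod() {
  per_pod_queue | LC_ALL=C sort -k2,2nr | head -n1
}

# Sample lines taken from the EPP metrics output shown earlier.
printf '%s\n' \
  'inference_pool_per_pod_queue_size{model_server_pod="qwen3-8b-demo-866dbc74c4-47k6n-rank-0",name="qwen3-8b-demo"} 0' \
  'inference_pool_per_pod_queue_size{model_server_pod="qwen3-8b-demo-866dbc74c4-5clwk-rank-0",name="qwen3-8b-demo"} 144' \
  'inference_pool_per_pod_queue_size{model_server_pod="qwen3-8b-demo-866dbc74c4-6hv5n-rank-0",name="qwen3-8b-demo"} 0' \
  | busiest_pod
```

Against a live cluster the same helpers could be fed from the EPP metrics port, for example by piping the curl command used above into per_pod_queue.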


2.4 Verifying AI Aware Routing with the SGLang Inference Framework

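The following is a configuration example for SGLang. Because it is largely similar to the vLLM case, the configuration is not explained again.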

💡 This example uses the common single-node multi-GPU inference setup, deployed as a Deployment resource.

1. Configure the Qwen3-32B inference service

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sglang-qwen32b
  namespace: demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sglang-qwen32b
  template:
    metadata:
      labels:
        app: sglang-qwen32b
        inference.networking.k8s.io/engine-type: sglang
    spec:
      securityContext:
        seccompProfile:
          type: Unconfined
      containers:
        - name: sglang
          image: 10.8.17.100:60066/sglang/sglang:0.5.4-hpcc.ai3.3.0.13-torch2.6-py310-ubuntu22.04-amd64
          imagePullPolicy: IfNotPresent
          env:
            - name: HPCC_SMALL_PAGESIZE_ENABLE
              value: "1"
            - name: PYTORCH_ENABLE_PG_HIGH_PRIORITY_STREAM
              value: "1"
            - name: HPCC_VISIBLE_DEVICE
              value: "0,1,2,3,4,5,6,7"
            - name: TRITON_ENABLE_HPCC_OPT_MOVE_DOT_OPERANDS_OUT_LOOP
              value: "1"
            - name: TRITON_DISABLE_HPCC_OPT_MMA_PREFETCH
              value: "1"
            - name: TRITON_ENABLE_HPCC_CHAIN_DOT_OPT
              value: "1"
            - name: TRITON_ENABLE_HPCC_COMPILER_INT8_OPT
              value: "True"
            - name: VLLM_PP_LAYER_PARTITION
              value: "16,15,15,15"

          command:
            - sh
            - -c
            - |
              /opt/conda/bin/python3 -m sglang.launch_server \
                --model-path /workspace/model/Qwen3-32B \
                --served-model-name Qwen3-32B \
                --tp 8 \
                --dp 1 \
                --nnodes 1 \
                --node-rank 0 \
                --dist-init-addr 127.0.0.1:5000 \
                --trust-remote-code \
                --attention-backend flashinfer \
                --enable-dp-attention \
                --enable-metrics \
                --host 0.0.0.0 \
                --port 8080
          ports:
            - name: http
              containerPort: 8080
          resources:
            limits:
              mars-tech.com/gpu: 8
            requests:
              mars-tech.com/gpu: 8
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
            - name: localmodelvolume
              mountPath: /workspace/model
              readOnly: true
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 100Gi
        - hostPath:
            path: /zion0/modelsrepo/models
            type: Directory
          name: localmodelvolume
---
apiVersion: v1
kind: Service
metadata:
  name: sglang-qwen32b
  namespace: demo
spec:
  type: NodePort
  selector:
    app: sglang-qwen32b
  ports:
    - name: http
      port: 8080
      targetPort: 8080
      nodePort: 30287
EOF


# Verification
# Check that the model service responds
curl -sS http://10.8.17.200:30287/v1/models

# Check that metrics are exposed
curl -s http://10.8.17.200:30287/metrics | grep 'sglang:num_queue_reqs'
curl -s http://10.8.17.200:30287/metrics | grep 'sglang:num_running_reqs'
curl -s http://10.8.17.200:30287/metrics | grep 'sglang:token_usage'
# HELP sglang:num_queue_reqs The number of requests in the waiting queue.
# TYPE sglang:num_queue_reqs gauge
sglang:num_queue_reqs{engine_type="unified",model_name="/workspace/model/Qwen3-32B",pp_rank="0",tp_rank="0"} 0.0
# HELP sglang:num_running_reqs The number of running requests.
# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="/workspace/model/Qwen3-32B",pp_rank="0",tp_rank="0"} 0.0
# HELP sglang:num_running_reqs_offline_batch The number of running low-priority offline batch requests(label is 'batch').
# TYPE sglang:num_running_reqs_offline_batch gauge
sglang:num_running_reqs_offline_batch{engine_type="unified",model_name="/workspace/model/Qwen3-32B",pp_rank="0",tp_rank="0"} 0.0
# HELP sglang:token_usage The token usage.
# TYPE sglang:token_usage gauge
sglang:token_usage{engine_type="unified",model_name="/workspace/model/Qwen3-32B",pp_rank="0",tp_rank="0"} 0.0


2. Deploy the gateway instance

# Configure an agentgateway Gateway instance listening on port 8090
kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llm
  namespace: kgateway-system
spec:
  gatewayClassName: agentgateway
  listeners:
  - allowedRoutes:
      namespaces:
        from: All
    name: http
    port: 8090
    protocol: HTTP
EOF

# Check
kubectl get gateway llm -n kgateway-system
NAME   CLASS          ADDRESS       PROGRAMMED   AGE
llm    agentgateway   10.8.17.152   True         69d


3. Configure the route
kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  labels:
    capsule.clastix.io/managed-by: hypersuite
  name: sglang-qwen32b-route
  namespace: demo
spec:
  hostnames:
  - llm.xxx.cn
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: llm
    namespace: kgateway-system
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: sglang-qwen32b-pool
      namespace: demo
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          replacePrefixMatch: /
          type: ReplacePrefixMatch
    matches:
    - path:
        type: PathPrefix
        value: /demo/sglang-qwen32b
EOF

4. Deploy the InferencePool and Endpoint Picker Extension

# Install the InferencePool and EPP with Helm
helm upgrade --install sglang-qwen32b-pool . \
  --namespace demo --create-namespace \
  --dependency-update \
  --set inferencePool.modelServers.matchLabels.app=sglang-qwen32b \
  --set inferencePool.modelServerType=sglang \
  --set experimentalHttpRoute.enabled=false \
  -f values.yaml
  
 
5. Test and verify
# 1. Verify that the request flow works

curl -sS http://10.8.17.152:8090/demo/sglang-qwen32b/v1/chat/completions \
  -H 'Host: llm.xxx.cn' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen3-32B",
    "messages": [
      {"role":"user","content":"你好,用一句话自我介绍"}
    ],
    "temperature": 0.2
  }'  

# 2. See which replicas actually receive traffic
kubectl get pods -n demo -l app=sglang-qwen32b -o wide
NAME                              READY   STATUS    RESTARTS   AGE     IP               NODE            NOMINATED NODE   READINESS GATES
sglang-qwen32b-788dd97f98-hkcx8   1/1     Running   0          17m     192.168.117.49   gpu-worker-77   <none>           <none>
sglang-qwen32b-788dd97f98-ngkcr   1/1     Running   0          5m58s   192.168.19.13    gpu-worker-96   <none>           <none>

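# This script queries /metrics on the 2 SGLang backend Pods once per second and shows, in real time, the running requests, queued requests, and token usage of each instance.
# Its purpose is to show directly which backend instances the load-test traffic lands on, and whether it is really spread across multiple Pods.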
watch -n 1 '
for ip in 192.168.117.49 192.168.19.13; do
  echo "===== $ip ====="
  curl -sS http://$ip:8080/metrics | grep -E "sglang:num_running_reqs|sglang:num_queue_reqs|sglang:token_usage"
  echo
done
'

# 3. Run the load test (100 concurrent requests)
GWIP=10.8.17.152

for i in $(seq 1 100); do
  curl -sS http://$GWIP:8090/demo/sglang-qwen32b/v1/chat/completions \
    -H 'Host: llm.wtsht.cn' \
    -H 'Content-Type: application/json' \
    -d "$(cat <<EOF
{"model":"Qwen3-32B","messages":[{"role":"user","content":"Request ID REQ-$i: write a self-introduction of at least 2000 words, and expand on every paragraph."}],"max_tokens":1024,"temperature":0.2}
EOF
)" >/dev/null &
done
wait

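Switch back to the terminal where watch is running. In this run the 100 concurrent requests were split 43/57 between the two Pods, and neither Pod had requests waiting in its queue: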

Every 1.0s:                                                                                            master-01: Mon Mar 16 15:30:10 2026

===== 192.168.117.49 =====
# HELP sglang:num_running_reqs The number of running requests.
# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="Qwen3-32B",pp_rank="0",tp_rank="0"} 43.0
# HELP sglang:token_usage The token usage.
# TYPE sglang:token_usage gauge
sglang:token_usage{engine_type="unified",model_name="Qwen3-32B",pp_rank="0",tp_rank="0"} 0.0021420527801805037
# HELP sglang:num_queue_reqs The number of requests in the waiting queue.
# TYPE sglang:num_queue_reqs gauge
sglang:num_queue_reqs{engine_type="unified",model_name="Qwen3-32B",pp_rank="0",tp_rank="0"} 0.0
# HELP sglang:num_running_reqs_offline_batch The number of running low-priority offline batch requests(label is 'batch').
# TYPE sglang:num_running_reqs_offline_batch gauge
sglang:num_running_reqs_offline_batch{engine_type="unified",model_name="Qwen3-32B",pp_rank="0",tp_rank="0"} 0.0

===== 192.168.19.13 =====
# HELP sglang:num_running_reqs The number of running requests.
# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="Qwen3-32B",pp_rank="0",tp_rank="0"} 57.0
# HELP sglang:token_usage The token usage.
# TYPE sglang:token_usage gauge
sglang:token_usage{engine_type="unified",model_name="Qwen3-32B",pp_rank="0",tp_rank="0"} 0.0032260016823625975
# HELP sglang:num_queue_reqs The number of requests in the waiting queue.
# TYPE sglang:num_queue_reqs gauge
sglang:num_queue_reqs{engine_type="unified",model_name="Qwen3-32B",pp_rank="0",tp_rank="0"} 0.0
# HELP sglang:num_running_reqs_offline_batch The number of running low-priority offline batch requests(label is 'batch').
# TYPE sglang:num_running_reqs_offline_batch gauge
sglang:num_running_reqs_offline_batch{engine_type="unified",model_name="Qwen3-32B",pp_rank="0",tp_rank="0"} 0.0
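
To turn the raw gauges into a quick summary of how evenly the load was spread, a minimal sketch like the following can extract sglang:num_running_reqs from a metrics dump and compute each Pod's share. The helper names running_reqs and split_share are hypothetical; the metric name and the sample values are taken from the output above.

```shell
# Hypothetical helper: read Prometheus text from stdin and print the value of
# the sglang:num_running_reqs sample. Comment lines and the _offline_batch
# gauge are skipped because the match requires "{" right after the name.
running_reqs() {
  awk 'index($0, "sglang:num_running_reqs{") == 1 { print $NF + 0 }'
}

# Hypothetical helper: print the percentage share of two request counts.
split_share() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    total = a + b
    if (total == 0) { print "no running requests"; exit }
    printf "%.0f%% / %.0f%%\n", 100 * a / total, 100 * b / total
  }'
}

# Sample lines copied from the watch output above.
a=$(printf '%s\n' 'sglang:num_running_reqs{engine_type="unified",model_name="Qwen3-32B",pp_rank="0",tp_rank="0"} 43.0' | running_reqs)
b=$(printf '%s\n' 'sglang:num_running_reqs{engine_type="unified",model_name="Qwen3-32B",pp_rank="0",tp_rank="0"} 57.0' | running_reqs)
split_share "$a" "$b"   # prints: 43% / 57%
```

Against a live Pod, the same helper could be fed from curl, for example `curl -sS http://$ip:8080/metrics | running_reqs`, assuming the metrics port used earlier.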


# 4. Check the metrics recorded by the EPP
kubectl get svc sglang-qwen32b-pool-epp -n demo -o wide
NAME                      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE   SELECTOR
sglang-qwen32b-pool-epp   ClusterIP   10.96.60.237   <none>        9002/TCP,9090/TCP   50m   inferencepool=sglang-qwen32b-pool-epp


curl -sS http://10.96.60.237:9090/metrics | grep -E "inference_pool_ready_pods|inference_pool_per_pod_queue_size|inference_pool_average_"
# HELP inference_pool_average_kv_cache_utilization [ALPHA] The average kv cache utilization for an inference server pool.
# TYPE inference_pool_average_kv_cache_utilization gauge
inference_pool_average_kv_cache_utilization{name="sglang-qwen32b-pool"} 0
# HELP inference_pool_average_queue_size [ALPHA] The average number of requests pending in the model server queue.
# TYPE inference_pool_average_queue_size gauge
inference_pool_average_queue_size{name="sglang-qwen32b-pool"} 0
# HELP inference_pool_per_pod_queue_size [ALPHA] The total number of requests pending in the model server queue for each underlying pod.
# TYPE inference_pool_per_pod_queue_size gauge
inference_pool_per_pod_queue_size{model_server_pod="sglang-qwen32b-788dd97f98-hkcx8-rank-0",name="sglang-qwen32b-pool"} 0
inference_pool_per_pod_queue_size{model_server_pod="sglang-qwen32b-788dd97f98-ngkcr-rank-0",name="sglang-qwen32b-pool"} 0
# HELP inference_pool_ready_pods [ALPHA] The number of ready pods in the inference server pool.
# TYPE inference_pool_ready_pods gauge
inference_pool_ready_pods{name="sglang-qwen32b-pool"} 2
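
As a small guard before a load test, the ready-Pod gauge can be compared with the expected replica count. The sketch below is only an illustration: the helper name pool_ready_pods and the expected value are assumptions, while the metric name and the sample line come from the output above.

```shell
# Hypothetical helper: read EPP metrics text from stdin and print the
# inference_pool_ready_pods value for the given pool name.
pool_ready_pods() {
  awk -v want="inference_pool_ready_pods{name=\"$1\"}" '$1 == want { print $2 + 0 }'
}

# Sample line copied from the EPP metrics output above.
sample='inference_pool_ready_pods{name="sglang-qwen32b-pool"} 2'
ready=$(printf '%s\n' "$sample" | pool_ready_pods sglang-qwen32b-pool)

expected=2   # assumption: two replicas, as in the Pod list shown earlier
if [ "${ready:-0}" -eq "$expected" ]; then
  echo "pool ready: $ready/$expected"
else
  echo "pool not ready: ${ready:-0}/$expected" >&2
fi
```

In a real check the sample line would be replaced by the curl output from the EPP metrics port.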

 
