
Problem description: When making streaming prediction calls to the Vertex AI API ( aiplatform.googleapis.com ), model gemini-3-pro-preview, from infrastructure located in Silicon Valley (USA), we observe abnormally high latency for the first token in the response stream. Time to first token (TTFT) consistently exceeds 17 seconds, while it should normally be under 2 seconds.
server address: 142.250.191.42
1. Basic ping test (connectivity & baseline latency). Run from the affected server/client in Silicon Valley.

[root@usa-gg-test01 ~]# ping aiplatform.googleapis.com
PING aiplatform.googleapis.com (142.250.191.42) 56(84) bytes of data.
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=1 ttl=118 time=2.67 ms
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=2 ttl=118 time=2.62 ms
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=3 ttl=118 time=2.64 ms
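(Not part of the original report.) ICMP round-trip time only measures the raw network path. A complementary check, run from the same host, is to time the TCP connect and TLS handshake to the API endpoint: if both stay in the low milliseconds, the 17 s delay has to come from server-side processing rather than connection setup. A minimal Python sketch, using only the standard library:

# Illustrative sketch (not from the original post): time TCP connect and TLS
# handshake to aiplatform.googleapis.com separately, to confirm the ~2.6 ms ping
# baseline also holds for the HTTPS path.
import socket
import ssl
import time

HOST = "aiplatform.googleapis.com"
PORT = 443

t0 = time.time()
sock = socket.create_connection((HOST, PORT), timeout=10)
t_tcp = time.time()

ctx = ssl.create_default_context()
tls = ctx.wrap_socket(sock, server_hostname=HOST)
t_tls = time.time()
tls.close()

print(f"TCP connect  : {(t_tcp - t0) * 1000:.2f} ms")
print(f"TLS handshake: {(t_tls - t_tcp) * 1000:.2f} ms")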
2. Python code test, using the model gemini-3-pro-preview.
import requests
import json
import time


def stream_gemini_content():
    api_key = 'xxx'
    url = "https://aiplatform.googleapis.com/v1/publishers/google/models/gemini-3-pro-preview:streamGenerateContent?alt=sse"

    headers = {
        "x-goog-api-key": api_key,
        "Content-Type": "application/json"
    }
    data = {
        "contents": [{
            "role": "user",
            "parts": [{
                "text": "请讲一个 200 字的故事,不要用推理,直接回答。"
            }]
        }],
        "generationConfig": {
            "thinkingConfig": {
                "includeThoughts": False
            }
        }
    }

    print(f"begin requests: {url} ...")
    start_time = time.time()
    first_token_time = None
    last_chunk_time = None

    try:
        with requests.post(url, headers=headers, json=data, stream=True) as response:
            if response.status_code != 200:
                print(f"status: {response.status_code}")
                print(response.text)
                return
            print("-" * 50)
            for line in response.iter_lines():
                if not line:
                    continue
                decoded_line = line.decode('utf-8').strip()
                if not decoded_line.startswith("data: "):
                    continue
                json_str = decoded_line[6:]
                if json_str == "[DONE]":
                    break
                try:
                    now = time.time()
                    if first_token_time is None:
                        # first SSE chunk: report time to first token (TTFT)
                        first_token_time = now
                        print(f"\n[total] first token TTFT: {(now - start_time) * 1000:.2f} ms")
                        print("-" * 50)
                    chunk_data = json.loads(json_str)
                    candidates = chunk_data.get("candidates", [])
                    # per-chunk timing (computed but not printed in this test)
                    total_elapsed = (now - start_time) * 1000
                    chunk_gap = (now - last_chunk_time) * 1000 if last_chunk_time else 0
                    last_chunk_time = now
                    if candidates:
                        content = candidates[0].get("content", {})
                        parts = content.get("parts", [])
                        if parts:
                            text_chunk = parts[0].get("text", "")
                            print(text_chunk, end="", flush=True)
                except Exception:
                    pass
    except Exception:
        pass

    end_time = time.time()
    print("\n\n" + "-" * 50)
    print(f"total time: {(end_time - start_time) * 1000:.2f} ms")


if __name__ == "__main__":
    stream_gemini_content()
The code test runs very slowly: a 200-character story already takes more than 17 s.
1 heqing 2 days ago: Is there a significant difference in first-token latency between the first and second calls? Does the same issue occur with other models?
2 xuliang12187 OP: With gemini-2.0-flash the first token arrives in about 300 ms, and a 200-character story is fully returned in 3-4 s. With gemini-2.5-flash the first token takes over 3 s, which is slow, and the total time exceeds 5 s. With gemini-3-pro-preview the first token takes over 12 s. We are using the Google Cloud enterprise Vertex AI API.
3 chenluo0429 2 days ago via Android: You didn't specify no-thinking either. Gemini 3's default thinking level is high, so of course it has to think first before it answers you.
4 xuliang12187 OP: @chenluo0429 I tried adjusting that; it's just as slow, still over 17 s.
5 fov6363 2 days ago: +1. Without thinking it's too weak, and with thinking even simple QA takes 10 s+. I asked ChatGPT and it said Vertex requires enabling a dedicated-instance endpoint, otherwise there is a cold start: the first chunk takes only a few hundred ms, but the first actual return takes 10 s+.
6 xuliang12187 OP: @fov6363 At this stage Vertex has no dedicated-endpoint instance concept; currently there is only the global endpoint. There are said to be different paid tiers, but those target high business concurrency and do not solve the API latency problem.
7 GXD 2 days ago: For Gemini 3 you need to use the `thinking_level` parameter to specify reasoning depth; the default is high.
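For reference, a minimal sketch of what reply #7 suggests, applied to the request body from the test script above. The reply only names the SDK-style parameter `thinking_level`; the camelCase REST field name used here (thinkingLevel under thinkingConfig) and the value "low" are assumptions that should be checked against the current Vertex AI / Gemini API reference.

# Hedged sketch based on reply #7: instead of includeThoughts, Gemini 3 is said to
# control reasoning depth via a thinking level. Field name "thinkingLevel" and value
# "low" are assumptions; "high" is reportedly the default.
data = {
    "contents": [{
        "role": "user",
        "parts": [{"text": "请讲一个 200 字的故事,不要用推理,直接回答。"}]
    }],
    "generationConfig": {
        "thinkingConfig": {
            "thinkingLevel": "low"  # assumed field/value; verify against the API docs
        }
    }
}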
8 fov6363 2 days ago