
Gateway Watchdog Design Document

Version: 2.0 | Author: 庞统 (Pang Tong, Deputy Strategist) 🐦 | Date: 2026-06-02 | Status: Implemented

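1. Background

1.1 Historical problem (covered by v1.0)

During peak hours, zhipu GLM-5.1 returns 429 rate-limit errors, which leaves the Gateway hung.

1.2 New problems (added in v2.0)

Production observation turned up two new fatal signals. The Gateway cannot recover from either on its own and must be restarted:

Signal	Log keywords	Symptom
All providers down	`lane task error` + `FailoverError`	The primary model and every fallback fail, so no Agent can work
Session stalled	`stalled session` + `recovery=none`	The Gateway diagnostic detects a stalled session but reports that it cannot recover it

1.3 Why v1.0 cannot detect them

v1.0 scans the session jsonl files (~/.openclaw/agents/*/sessions/*.jsonl), but these keywords actually appear in the Gateway process log (/tmp/openclaw/openclaw-{date}.log). The two kinds of log use completely different data formats.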

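2. Detection rules (v2.0)

2.1 Data source

Gateway process log: /tmp/openclaw/openclaw-$(date +%Y-%m-%d).log

JSON lines format, one record per line
Keywords appear in the "message" field (a copy also appears in the "1" field)
Rotated automatically every day; the watchdog reads only the current day's file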
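
To make the data source concrete, here is a minimal Python sketch of reading JSON-lines records and keeping only the recent ones. It is not the actual watchdog script. The function name `recent_messages` is hypothetical, and the timestamp field name `time` is an assumption: the sample log lines in this document do not show which field carries the timestamp.

```python
import json
from datetime import datetime, timedelta, timezone

CHECK_WINDOW = 120  # seconds, matching the CHECK_WINDOW parameter


def recent_messages(lines, now, window=CHECK_WINDOW, ts_field="time"):
    """Return the `message` strings of records newer than `now - window`.

    ts_field is an assumption; adjust it to the real timestamp field.
    """
    cutoff = now - timedelta(seconds=window)
    messages = []
    for line in lines:
        try:
            record = json.loads(line)
            stamp = datetime.fromisoformat(record[ts_field])
            is_recent = stamp >= cutoff
        except (ValueError, KeyError, TypeError):
            continue  # skip blank, truncated, or unexpected lines
        if is_recent:
            messages.append(str(record.get("message", "")))
    return messages
```

Skipping unparseable lines instead of failing keeps one truncated record (for example, a line still being written) from aborting the whole check.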

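2.2 The three rules

Rule	Match condition	Threshold	Meaning	Severity
R1	`message` contains both `lane task error` and `FailoverError`	≥2 hits/120s	All providers are down; only a restart helps	🔴 Fatal
R2	`message` contains both `stalled session` and `recovery=none`	≥3 hits/120s	The Gateway reports that it cannot recover	🟠 Severe
R3	`message` contains `rate_limit` or `"429"`	≥2 hits/120s	API rate limiting	🟡 Warning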
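
The rule table can be written as data plus a small matcher. The sketch below is illustrative only: the names `RULES`, `matches`, `count_hits`, and `triggered` are hypothetical, and the real script counts with grep. Whether the keywords are tested against the `message` field or the whole raw line is left to the caller, who passes in whichever strings should be searched.

```python
# Keyword rules and thresholds, taken from the table above.
RULES = {
    "R1": {"all_of": ("lane task error", "FailoverError"), "threshold": 2},
    "R2": {"all_of": ("stalled session", "recovery=none"), "threshold": 3},
    "R3": {"any_of": ("rate_limit", '"429"'), "threshold": 2},
}


def matches(rule, text):
    """True if text contains every all_of keyword and, if given, any any_of keyword."""
    has_all = all(k in text for k in rule.get("all_of", ()))
    any_of = rule.get("any_of")
    has_any = True if not any_of else any(k in text for k in any_of)
    return has_all and has_any


def count_hits(texts):
    """Count matching strings per rule."""
    return {name: sum(matches(rule, t) for t in texts) for name, rule in RULES.items()}


def triggered(counts):
    """Names of the rules whose count reached the threshold, in R1, R2, R3 order."""
    return [name for name in RULES if counts[name] >= RULES[name]["threshold"]]
```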

2.3 Match examples

Log line matched by R1:

{"_meta":{"logLevelName":"ERROR"},"message":"lane task error: lane=main durationMs=93772 error=\"FailoverError: ⚠️ API rate limit reached. Please try again later.\""}

Log line matched by R2:

{"_meta":{"logLevelName":"WARN"},"message":"stalled session: sessionKey=agent:main:main state=processing age=266s recovery=none"}

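Log line matched by R3 (here `rate_limit` appears in the `providerRuntimeFailureKind` field rather than in `message`):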

{"_meta":{"logLevelName":"WARN"},"message":"lane task error: ... error=\"FailoverError: ...\"","providerRuntimeFailureKind":"rate_limit"}

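2.4 Decision logic

Runs every minute
  │
  ├─ 1. Gateway health check
  │   └─ Fails → restart immediately (the Gateway process may already be dead)
  │
  ├─ 2. Read today's Gateway log; use python3 to keep only lines from the last 120 seconds
  │
  ├─ 3. Count matches for each of the three rules with grep
  │
  ├─ 4. Any rule reaches its threshold
  │   ├─ Within the cooldown period → log only, no restart
  │   └─ Outside the cooldown period → restart the Gateway and record the reason
  │
  └─ 5. No rule reaches its threshold → normal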
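
The flow above reduces to a small pure function. This is a sketch rather than the script's actual code: the name `decide` and its return values are hypothetical, and reporting the first rule in R1, R2, R3 order when several trigger at once is an assumption, since the document does not say which reason wins.

```python
THRESHOLDS = {"R1": 2, "R2": 3, "R3": 2}  # defaults from the parameter table


def decide(healthy, counts, now, cooldown_until, thresholds=THRESHOLDS):
    """Return (action, reason); action is 'restart', 'log_only', or 'ok'.

    counts uses lowercase keys ('r1', 'r2', 'r3'), as in the restart log.
    now and cooldown_until are epoch seconds.
    """
    if not healthy:
        # Step 1: a failed health check restarts immediately.
        return ("restart", "health_fail")
    # Steps 3-4: collect every rule whose count reached its threshold.
    hit = [r for r in thresholds if counts.get(r.lower(), 0) >= thresholds[r]]
    if not hit:
        return ("ok", None)  # step 5
    reason = hit[0]  # assumption: report the first rule in R1, R2, R3 order
    if now < cooldown_until:
        return ("log_only", reason)
    return ("restart", reason)
```

Keeping the decision free of I/O makes it easy to test without touching the Gateway or the log files.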

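3. Preventing restart storms

Cooldown: if a problem is detected again within 5 minutes of a restart, the watchdog only logs it and does not restart.

State file: /tmp/gateway-watchdog-state (JSON)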

{"last_restart_time":"2026-06-02T21:00:00+08:00","last_restart_reason":"R1","cooldown_until":1780405500}
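
A cooldown check against this state file might look like the sketch below. The helper names `in_cooldown` and `record_restart` are hypothetical, the +08:00 default offset is only illustrative, and treating a missing or corrupt state file as "no cooldown" is an assumption rather than documented behavior.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

COOLDOWN = 300  # seconds, matching the COOLDOWN parameter


def in_cooldown(state_path, now_epoch):
    """True while now_epoch is earlier than the stored cooldown_until."""
    try:
        state = json.loads(Path(state_path).read_text())
        return now_epoch < float(state["cooldown_until"])
    except (OSError, ValueError, KeyError, TypeError):
        return False  # assumption: no usable state means no cooldown


def record_restart(state_path, reason, now_epoch,
                   tz=timezone(timedelta(hours=8))):
    """Write the state file after a restart and return the stored dict."""
    state = {
        "last_restart_time": datetime.fromtimestamp(now_epoch, tz)
                                     .isoformat(timespec="seconds"),
        "last_restart_reason": reason,
        "cooldown_until": int(now_epoch) + COOLDOWN,
    }
    tmp = Path(str(state_path) + ".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(state_path)  # rename so readers never see a half-written file
    return state
```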

4. Restart reason log

Every restart automatically appends one record to /tmp/gateway-watchdog-restarts.log (an append-only file kept for statistical analysis):

{"time":"2026-06-02T21:00:00+0800","reason":"R1","detail":"FailoverError x3","counts":{"r1":3,"r2":0,"r3":1}}

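Fields:

time: restart time (ISO format)
reason: the rule that triggered the restart (R1/R2/R3/health_fail)
detail: summary of the matched keywords
counts: actual hit counts for all three rules (recorded in full regardless of which rule triggered)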
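
Appending one such record per restart could be sketched as follows. The function name `append_restart` is hypothetical. The field names and the `+0800`-style offset follow the sample record above.

```python
import json
from datetime import datetime, timedelta, timezone


def append_restart(log_path, reason, detail, counts, when):
    """Append one JSON line describing a restart and return the entry.

    when must be a timezone-aware datetime so that %z yields an offset.
    """
    entry = {
        "time": when.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "reason": reason,
        "detail": detail,
        # Always record all three counters, whichever rule triggered.
        "counts": {k: int(counts.get(k, 0)) for k in ("r1", "r2", "r3")},
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry
```

Writing exactly one JSON object per line keeps the file readable by line-oriented tools such as grep.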

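5. Parameters

Parameter	Default	Description
CHECK_WINDOW	120s	How many seconds of recent log to inspect
R1_THRESHOLD	2	Trigger threshold for FailoverError
R2_THRESHOLD	3	Trigger threshold for stalled recovery=none
R3_THRESHOLD	2	Trigger threshold for rate_limit/429
COOLDOWN	300s	Cooldown period after a restart
Check interval	60s	crontab runs the script every minute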

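6. File locations

File	Path	Description
Script	`scripts/gateway-watchdog.sh`	Main watchdog script
This document	`docs/design/gateway-watchdog.md`	Design document
Run log	`/tmp/gateway-watchdog.log`	Output redirected by crontab
State file	`/tmp/gateway-watchdog-state`	Cooldown control
Restart log	`/tmp/gateway-watchdog-restarts.log`	Restart reasons (append-only)
Lock file	`/tmp/gateway-watchdog.lock`	Prevents concurrent runs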

7. Deployment

# Run every minute via crontab
(crontab -l 2>/dev/null | grep -v "gateway-watchdog"; \
 echo "* * * * * /Users/chufeng/.openclaw/sanguo_projects/sanguo_moziplus_v2/scripts/gateway-watchdog.sh >> /tmp/gateway-watchdog.log 2>&1") \
 | crontab -

8. Operations

# View the run log
tail -f /tmp/gateway-watchdog.log

# View state (cooldown, etc.)
cat /tmp/gateway-watchdog-state

# View restart history (the file is JSON lines, so json.tool needs --json-lines, Python 3.8+)
python3 -m json.tool --json-lines /tmp/gateway-watchdog-restarts.log

# Count restarts by reason
grep "reason" /tmp/gateway-watchdog-restarts.log | python3 -c "
import json, sys
from collections import Counter
reasons = Counter()
for line in sys.stdin:
    reasons[json.loads(line)['reason']] += 1
print('Restart counts:')
for k,v in reasons.most_common():
    print(f'  {k}: {v}')
"

# Manual test
bash scripts/gateway-watchdog.sh

# Disable
crontab -l | grep -v "gateway-watchdog" | crontab -

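9. Differences from v1.0

Aspect	v1.0	v2.0
Data source	session jsonl files	Gateway process log
Detection rules	429 only (1 rule)	3 rules (FailoverError / stalled / rate_limit)
Restart storms	No protection	5-minute cooldown
Reason logging	None	Appended to a JSON file
Statistics	None	Structured data that can be analyzed
Companion script	Standalone (gateway_monitor.py left unchanged)	No conflict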

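10. Known limitations

After-the-fact detection: the watchdog detects errors that have already occurred; it does not prevent them
Reads only the current day's log: when the log rotates at midnight, entries from the preceding minutes are in the previous day's file, which has a different name, so they are not scanned (not handled for now; the impact is minimal)
No notifications: nothing is pushed proactively after a restart (Feishu or email could be added later)
Single machine: it can only run on the machine that hosts the Gateway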
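
If the day-boundary limitation ever needs handling, one option is to work out which daily files the check window overlaps and read all of them. The sketch below shows only that path calculation. The function name `log_paths` is hypothetical, and the current watchdog does not do this.

```python
from datetime import datetime, timedelta


def log_paths(now, window=120, log_dir="/tmp/openclaw"):
    """Return the daily log files that the last `window` seconds can touch."""
    dates = {(now - timedelta(seconds=window)).date(), now.date()}
    return [f"{log_dir}/openclaw-{d:%Y-%m-%d}.log" for d in sorted(dates)]
```

For most of the day this returns a single path. Only in the first `window` seconds after midnight does it return both the previous day's file and the current one.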
