auto-sync: 2026-05-26 13:38:34
@@ -56,220 +56,159 @@ daemon:
│ │ process exits │ (only Case B reaches this point)
```

After the process exits, read the stdout JSON + stderr + check the task's actual API status.
After the process exits, read the stdout JSON (the output of `openclaw agent --json`) + check the task's DB status.

### ⚠️ P0 fix: stdout JSON parsing path
### JSON output format

The `--json` output of openclaw agent has the form `{ "response": { "meta": { ... } } }`,
not `{ "meta": { ... } }`. `_parse_stdout_json` must read `data["response"]["meta"]`.
JSON structure that `openclaw agent --json` writes to stdout:

Before the fix, 68% of spawn results had transport=null (62 of 91), so every scenario-A classification failed.
```json
{
  "status": "ok",
  "summary": "completed",
  "result": {
    "payloads": [{ "text": "...", "mediaUrl": null }],
    "meta": {
      "durationMs": 5673,
      "executionTrace": {
        "runner": "gateway",
        "fallbackUsed": false,
        "fallbackReason": null
      },
      "aborted": false
    }
  }
}
```

### ⚠️ P2 safety net: stderr-assisted classification when transport=null
### Available fields

If transport is still null after the P0 fix (parsing failed), check stderr before deciding A2/A3:
- stderr contains lock/busy → lock_conflict
- stderr contains compact → compact_failed
- stderr contains rate_limit → api_error
- otherwise → gateway_timeout (refill, i.e. a continuation spawn)

| Field | Path | Values | Description |
|------|------|---------|------|
| `status` | `data.status` | `"ok"` / `"error"` / `"timeout"` | CLI execution result |
| `summary` | `data.summary` | `"completed"` / error message string | Auxiliary signal |
| `fallbackUsed` | `data.result.meta.executionTrace.fallbackUsed` | `true` / `false` | Whether a fallback occurred |
| `fallbackReason` | `data.result.meta.executionTrace.fallbackReason` | `"gateway_timeout"`, etc. | Fallback reason |
| `payloads` | `data.result.payloads` | `[{text, mediaUrl}]` or an empty array | Agent reply content |

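The field paths in the table can be read defensively, so that a missing intermediate key yields `None` instead of an exception. The following is a minimal sketch, not the project's actual `_parse_stdout_json`; the names `dig` and `parse_stdout_fields` are hypothetical and exist only for this illustration.

```python
import json
from typing import Any, Optional


def dig(data: Any, *keys: str) -> Any:
    """Follow nested dict keys, returning None as soon as one is missing."""
    for key in keys:
        if not isinstance(data, dict):
            return None
        data = data.get(key)
    return data


def parse_stdout_fields(stdout: str) -> Optional[dict]:
    """Return the table's fields, or None if stdout is empty or not a JSON object."""
    if not stdout.strip():
        return None
    try:
        data = json.loads(stdout)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    trace = ("result", "meta", "executionTrace")
    return {
        "status": data.get("status"),
        "summary": data.get("summary"),
        "fallbackUsed": dig(data, *trace, "fallbackUsed"),
        "fallbackReason": dig(data, *trace, "fallbackReason"),
        # Normalise a missing payload list to an empty list (an assumption).
        "payloads": dig(data, "result", "payloads") or [],
    }
```

Returning `None` for both empty and unparseable stdout is a simplification made here; a caller that needs to tell the two apart would have to check for empty stdout first.
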
### A1: exit=0 + transport=gateway + task is already done/review
### Classification principles

- **Prefer `status`**: `status` is the execution result reported officially by the Gateway, so it is more accurate than inference
- **Do not parse other `meta` fields**: agentMeta, systemPromptReport, etc. are OpenClaw internals
- **Empty stdout = abnormal process termination**: a normal `openclaw agent` exit always emits JSON

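Taken together, the principles suggest a dispatcher that looks at `status` first and consults the task DB status only when `status` is `"ok"`. The sketch below is illustrative: `classify_scenario` is a hypothetical name, `fields` is assumed to be a dict with the keys `status`, `summary` and `fallbackUsed` (or `None` when stdout was empty), and the order of the checks under `status="ok"` is an assumption, since the document does not state a precedence.

```python
from typing import Optional

FINISHED = {"done", "review"}


def classify_scenario(fields: Optional[dict], task_status: str) -> str:
    """Map parsed stdout fields plus the task DB status to a scenario label."""
    if fields is None:
        # Empty stdout: the process terminated abnormally.
        return "A0"
    status = fields.get("status")
    if status == "timeout":
        return "A2/A3"
    if status == "error":
        return "A7-A12"
    if status != "ok":
        # JSON was present but the status value is unexpected.
        return "A-catch-all"
    if fields.get("fallbackUsed"):
        return "A5/A6"
    if task_status == "failed":
        return "A4"
    if fields.get("summary") == "completed" and task_status in FINISHED:
        return "A1"
    # Assumption: combinations the document does not cover get their own label.
    return "unclassified"
```

For example, `classify_scenario({"status": "ok", "summary": "completed", "fallbackUsed": False}, "review")` yields `"A1"`, while the same fields with a task status of `failed` yield `"A4"`.
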
### A0: stdout is empty (abnormal process termination)

```
Symptoms:
- exit_code = 0
- meta.transport = "gateway"
- meta.fallbackReason = null
- task API status = done or review
- process exited (exit=0 or exit≠0)
- stdout is completely empty (no JSON output)

Cause: the process was terminated abnormally (killed, crashed, etc.) before reaching writeRuntimeJson

Handling:
- record outcome = "process_crash"
- counter.release() (wrapped_on_complete may not have been called → ticker T1 is the safety net)
- do not refill
- task stays working → ticker T1 detects the dead PID → releases the counter + pushes the task back to pending
- wait for the ticker to dispatch again
```

### A1: status="ok" + summary="completed" + fallbackUsed=false

```
Symptoms:
- stdout JSON status = "ok"
- summary = "completed"
- executionTrace.fallbackUsed = false
- task DB status = done or review

Cause: the Agent completed normally

Handling:
- record outcome = "completed"
- counter.release() (guaranteed by wrapped_on_complete)
- record outcome = "completed"
- no further action needed
```

### A2: exit=0 + transport=gateway + task is still working
### A2/A3: status="timeout"

```
Symptoms:
- exit_code = 0
- meta.transport = "gateway" (parsed correctly after the P0 fix)
- task API status = working
- stdout JSON status = "timeout"

Cause: the Gateway timeout fired; the Agent was interrupted after it had already written working.
The Agent may have run steps 1-2 (working + execution) but not finished steps 3-4 (outputs + review)
Cause: Gateway timeout; the Agent was interrupted

Handling:
- counter.release() (v2.0: per-call lifecycle)
- refill count +1
- over the limit (3) → ❌ failed + escalate
- under the limit → 🔄 refill via spawn_full_agent (can_acquire + acquire internally)
- refill message: tell the Agent to check its history and continue the unfinished work
```

### A3: exit=0 + transport=gateway + task is still claimed

```
Symptoms:
- exit_code = 0
- meta.transport = "gateway"
- task API status = claimed

Cause: the Gateway timeout fired before the Agent could even do step 1 (write working)

Handling:
- counter.release() (v2.0: per-call lifecycle)
- counter.release()
- refill count +1
- over the limit (3) → ❌ failed + escalate
- under the limit → 🔄 refill via spawn_full_agent
- refill message: the full task prompt (same as the first spawn)
- refill message: tell the Agent to check its history and continue the unfinished work
```

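The refill bookkeeping shared by A2 and A3 (increment the count, compare it with the limit of 3, then either refill or fail) can be sketched as a pure function. `decide_refill`, `RefillDecision` and `CONTINUE_HINT` are hypothetical names, and reading "over the limit" as strictly greater than 3 is an assumption; the real spawner calls `spawn_full_agent`, which is not modelled here.

```python
from dataclasses import dataclass

REFILL_LIMIT = 3  # upper bound stated in the handling steps

# Illustrative wording for the continuation message, not the project's prompt.
CONTINUE_HINT = "Check your history and continue the unfinished work."


@dataclass
class RefillDecision:
    refill_count: int  # count after the +1 increment
    action: str        # "refill" or "fail_and_escalate"
    message: str       # message for the refill, empty when failing


def decide_refill(refill_count: int, limit: int = REFILL_LIMIT) -> RefillDecision:
    """Increment the refill count and decide whether another refill is allowed."""
    count = refill_count + 1
    if count > limit:
        return RefillDecision(count, "fail_and_escalate", "")
    return RefillDecision(count, "refill", CONTINUE_HINT)
```

Under this reading a task gets at most three refills: `decide_refill(2)` still returns a refill with a count of 3, and `decide_refill(3)` returns `fail_and_escalate`.
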
### A4: exit=0 + transport=gateway + task is already failed
### A4: status="ok" + task DB status=failed

```
Symptoms:
- exit_code = 0
- task API status = failed
- stdout JSON status = "ok"
- but task DB status = failed

Cause: the Agent decided it could not finish and wrote failed itself
Cause: the Agent decided it could not finish and marked the task failed itself

Handling:
- counter.release()
- record outcome = "agent_failed"
- counter.release() (guaranteed by wrapped_on_complete)
- respect the Agent's decision; do not refill
- if the Agent wrote a detail, record it in the event
```

### A5: exit=0 + transport=embedded + fallbackReason=gateway_timeout
### A5/A6: status="ok" + fallbackUsed=true

```
Symptoms:
- exit_code = 0
- meta.transport = "embedded"
- meta.fallbackReason = "gateway_timeout"
- stdout JSON status = "ok"
- executionTrace.fallbackUsed = true
- executionTrace.runner = "embedded"

Cause: the Gateway side timed out and the CLI automatically fell back to local embedded execution.
The fallback uses a new session (gateway-fallback-* prefix), not the original session.
The Agent may have completed part of the work inside the fallback.

Handling (v2.0):
- counter.release() (guaranteed by wrapped_on_complete)
- an A5/A6 fallback should not happen; if it does, the agent was occupied at spawn time (the L3 check failed)
- log at ERROR level, including agent_id/session_id/task_id/transport/fallbackReason/counter_active
- mark failed + escalate
- do not refill
```

### A6: exit=0 + transport=embedded + fallbackReason ≠ gateway_timeout

```
Symptoms:
- exit_code = 0
- meta.transport = "embedded"
- meta.fallbackReason ≠ "gateway_timeout" (e.g. connection dropped, Gateway error)

Cause: the Gateway is unavailable; the CLI fell back locally
Cause: Gateway-side timeout/error; the CLI fell back to local embedded execution

Handling:
- same as A5: check the task status and decide whether to refill
- record outcome = "fallback_other"
- check the task DB status
- done/review → release counter → finish (the fallback completed successfully)
- working/claimed → release counter → mark failed + escalate
- record outcome = "fallback_timeout", with a warning
```

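The DB-status branch in the fallback handling can be written as a small lookup. This is a sketch under stated assumptions: `fallback_steps` and the step names are hypothetical, and raising on a task status other than the four listed is a choice made here because the document does not say what should happen in that case.

```python
from typing import List


def fallback_steps(task_status: str) -> List[str]:
    """Return the ordered handling steps for a run that reported fallbackUsed=true."""
    if task_status in ("done", "review"):
        # The fallback run completed the task.
        return ["release_counter", "finish"]
    if task_status in ("working", "claimed"):
        # The fallback run did not complete the task.
        return ["release_counter", "mark_failed", "escalate"]
    # Assumption: any other status is not covered and is surfaced as an error.
    raise ValueError(f"fallback handling is undefined for task status {task_status!r}")
```

Both branches release the counter first, which matches the order of the handling steps.
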
### A7: exit≠0 + stderr contains 401/403/auth/unauthorized
### A7-A12: status="error"

```
Symptoms:
- exit_code ≠ 0
- stderr contains authentication-failure keywords
- stdout JSON status = "error"
- summary contains the error message

Cause: Gateway token expired or misconfigured
Cause: errors of any kind (auth/connection/API/compact/lock/unknown)

Handling:
- ❌ failed + escalate
- do not refill (a retry would fail too)
- record outcome = "auth_failed"
```

### A8: exit≠0 + stderr contains ECONNREFUSED/ETIMEDOUT/gateway closed

```
Symptoms:
- exit_code ≠ 0
- stderr contains connection-error keywords

Cause: the Gateway process died or the network is down

Handling (v2.0):
- counter.release() (process exit = release)
- do not refill; wait for the ticker to reschedule
- record outcome = "gateway_unreachable"
```

### A9: exit≠0 + stderr contains rate_limit/500/503/API error

```
Symptoms:
- exit_code ≠ 0
- stderr contains model-API error keywords

Cause: the model provider is rate limiting or having an outage

Handling (v2.0):
- counter.release() (process exit = release)
- push back to pending (let the ticker reschedule)
- counter.set_cooldown(agent_id, 120s) (cooldown so that an immediate retry does not hit 429 again)
- counter.release()
- do not refill
- record outcome = "api_error" / "gateway_unreachable" / "auth_failed", etc.
- wait for the ticker to reschedule
```

### A10: exit≠0 + stderr contains compaction-diag/context-overflow/timeout-compaction
### A catch-all: unknown status value

```
Symptoms:
- exit_code ≠ 0
- stderr contains compact-related keywords
- stdout has JSON but status is not ok/error/timeout

Cause: the model returned an error after compact (the lost context made it impossible to continue)
Cause: unexpected status value

Handling (v2.0):
- counter.release() (process exit = release)
- do not refill; wait for the ticker to reschedule
- record outcome = "compact_failed"
Handling:
- counter.release()
- do not refill
- record outcome = "unknown_status"
- wait for the ticker to reschedule
```

### A11: exit≠0 + stderr contains lock/busy/concurrent/lane task error

```
Symptoms:
- exit_code ≠ 0
- stderr contains lock-conflict keywords

Cause: session lock conflict (held by webchat/cron/another spawner)
v2 uses --session-id uuid4, which mostly avoids this, but the safety net is kept

Handling (v2.0):
- counter.release() (process exit = release)
- do not refill; wait for the ticker to reschedule
- record outcome = "lock_conflict"
```

### A12: exit≠0 + stderr has no special keywords

```
Symptoms:
- exit_code ≠ 0
- stderr has no auth/connection/API/compact/lock keywords

Cause: a logic error in the Agent itself, a failed tool execution, or another unknown error

Handling (v2.0):
- counter.release() (process exit = release)
- do not refill; wait for the ticker to reschedule
- record outcome = "agent_error"
```

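The stderr keyword rules of A7 through A12 amount to an ordered table in which the first matching rule wins and A12 is the default. The sketch below is one way to express that; `STDERR_RULES` and `classify_stderr` are hypothetical names, and both the case-insensitive matching and the A7-to-A11 priority order are assumptions, not something the document specifies.

```python
from typing import List, Tuple

# (outcome, keywords) listed in A7..A11 order; the first rule that matches wins.
STDERR_RULES: List[Tuple[str, Tuple[str, ...]]] = [
    ("auth_failed", ("401", "403", "auth", "unauthorized")),
    ("gateway_unreachable", ("econnrefused", "etimedout", "gateway closed")),
    ("api_error", ("rate_limit", "500", "503", "api error")),
    ("compact_failed", ("compaction-diag", "context-overflow", "timeout-compaction")),
    ("lock_conflict", ("lock", "busy", "concurrent", "lane task error")),
]


def classify_stderr(stderr: str) -> str:
    """Return the outcome for a non-zero exit based on stderr keywords."""
    text = stderr.lower()
    for outcome, keywords in STDERR_RULES:
        if any(keyword in text for keyword in keywords):
            return outcome
    # A12: no known keyword matched.
    return "agent_error"
```

Plain substring matching is coarse: `auth` also matches `author`, and `500` matches any number that contains it, so a stricter version would probably need word boundaries or structured error codes.
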
## 6. Case B: monitor_timeout was reached and the process has not exited

```