auto-sync: 2026-06-01 21:45:18

2026-06-01 21:45:18 +08:00
parent 08dcb305e0
commit 3fbb3a956c
1 changed files with 148 additions and 2 deletions
@@ -1,6 +1,6 @@
 # #07 Spawner Acquire-First Design

-> Status: review approved, implementation in progress
+> Status: #07.1 implemented ✅ | #07.2 implemented ✅ | #07.3 in design
 > Author: 庞统
 > Date: 2026-06-01
 > Reviewer: 司马懿
@@ -381,7 +381,153 @@ def _task_on_complete(aid, outcome):
 | review completes normally | mark done (unchanged) |
 | executor completes normally | mark review (unchanged) |

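-## 7. Out of Scope for This Proposal
+## 7. #07.3 Startup Recovery + Compact Retry
+
+### 7.1 Problems
+
+#### Problem A: mail tasks are orphaned after a PM2 restart
+
+**Symptom**: After the PM2 restart at 18:42, mail-1780310389485 stayed in the working state for 34 minutes with nothing processing it.
+
+**Root cause**: The main startup-recovery path is `_check_timeouts`, which scans working tasks in the DB and decides timeouts by timestamp. When a mail task is auto-marked working, however, the dispatcher sets neither `started_at` nor `claimed_at`, so `_check_timeouts` (lines 1284-1286) skips it: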
+
+```python
+start_time_str = task.started_at or task.claimed_at
+if not start_time_str:
+    continue  # ← mail tasks always take this branch
+```
+
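+**The two layers of startup recovery**:
+
+| Layer | Mechanism | After a restart |
+|---|------|--------|
+| L1 main path | `_check_timeouts` scans working tasks in the DB → decides timeouts by timestamp → reclaims | ✅ works for ordinary tasks, ✗ does not work for mail |
+| L2 supplement | `process_dead` scans counter.active_agents → checks whether the process is alive | ✗ counter is empty after a restart |
+
+L2 failing after a restart is expected (it is purely in-memory state); L1 is the main startup-recovery path. The problem is that L1 has no timestamp fallback for mail tasks.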
+
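+#### Problem B: compact_hanging marks the task failed with no retry
+
+**Symptom**: An agent triggered auto-compaction mid-run; the monitor waited 3 × 630 s = 31.5 minutes, gave up, and marked the task `compact_hanging` → failed.
+
+**Expected behavior**: Compaction is only a temporary block and should not fail the task outright. The task should get a chance to be retried so the agent can continue once compaction finishes.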
+
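+#### Problem C: inconsistent counter handling when a retry hits a busy session
+
+**Symptom**: `_do_retry` calls `spawn_full_agent(skip_counter=True)`; Phase 2 finds the session busy (compact/lock/running) and raises AgentBusyError; `_do_retry` catches it and calls `on_complete("retry_agent_busy")`, which releases the counter, and the retry is lost.
+
+**Original design** (spawner-monitor-design.md, principle 5): "A refill (re-spawn) is triggered only by a Gateway timeout; lock/compact/api_error and the like do not refill and wait for the ticker." By design, a retry that hits a busy session should release the counter → the task stays working → the ticker re-dispatches 30 seconds later.
+
+**After the Bug-4 fix, however**, the counter is no longer released (it stays held for the duration of the retry), which contradicts the original design: counter not released → ticker never re-dispatches → task is stuck.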
+
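+### 7.2 Design principles
+
+1. **Every working task must be reclaimable after a PM2 restart**: `_check_timeouts` must not skip any task because a timestamp is missing
+2. **Compaction is a temporary block and must not fail the task**: compact_hanging should release the counter → wait for the ticker to re-dispatch
+3. **A retry that hits a busy session releases the counter → waits for the ticker**: the spawn has failed, so holding the counter serves no purpose; release it and let the ticker run the normal dispatch
+4. **An agent must not run two tasks at once**: every Phase 2 check before spawn (lock/running/compact) must pass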
+
+### 7.3 Changes
+
+#### ACT-1: timestamp fallback in _check_timeouts (P0)
+
+**File**: `ticker.py` `_check_timeouts`, 1-line change
+
+```python
+# Before
+start_time_str = task.started_at or task.claimed_at
+
+# After
+start_time_str = task.started_at or task.claimed_at or task.updated_at
+```
+
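+**Rationale**: `_transition_status` updates `updated_at` on every status change, so the field is always set when a mail task is auto-marked working. With `updated_at` as the fallback, post-restart timeout detection works for every task type.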
+
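Editorial sketch (not part of the commit): the fallback chain in ACT-1 can be exercised in isolation. This is a minimal illustration under stated assumptions, not the real `ticker.py`: the `Task` dataclass, the ISO 8601 timestamp format, the `is_timed_out` helper, and the 30-minute `TIMEOUT` constant are all hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Task:
    """Hypothetical stand-in for a task row; timestamps assumed ISO 8601."""
    task_id: str
    started_at: Optional[str] = None
    claimed_at: Optional[str] = None
    updated_at: Optional[str] = None


# Assumed threshold for illustration only.
TIMEOUT = timedelta(minutes=30)


def is_timed_out(task: Task, now: datetime) -> bool:
    """Apply the ACT-1 fallback chain, then compare against TIMEOUT."""
    start_time_str = task.started_at or task.claimed_at or task.updated_at
    if not start_time_str:
        return False  # no timestamp at all: the task is still skipped
    started = datetime.fromisoformat(start_time_str)
    return now - started > TIMEOUT


now = datetime(2026, 6, 1, 19, 16, tzinfo=timezone.utc)

# Mail task: only updated_at is set, 34 minutes before `now`.
mail = Task("mail-1780310389485", updated_at="2026-06-01T18:42:00+00:00")

# Ordinary task started 16 minutes before `now`.
fresh = Task("task-1", started_at="2026-06-01T19:00:00+00:00")
```

Without the `updated_at` term, `mail` would fall through to the "no timestamp" branch and never be reclaimed, which is the orphan described in Problem A.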
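+#### ACT-2: compact_hanging no longer marks the task failed (P1)
+
+**File**: `spawner.py` `_handle_monitor_timeout`, ~8-line change
+
+**Problem**: On compact_hanging the process is still alive (the monitor has waited 3 × 630 s), so retrying the spawn directly would collide with the running session. The design principle is "never actively kill a process".
+
+**Approach**: On compact_hanging, do not mark the task failed; **only release the counter and leave the task working**, then wait for the ticker to re-dispatch.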
+
+```python
+# Before (compact_hanging)
+self._mark_task(db_path, task_id, "failed", {"reason": "compact_hanging", ...})
+await self._do_on_complete_async(on_complete, agent_id, "compact_hanging")
+
+# After
+# Compact wait limit exceeded → release the counter and keep the task working.
+# Once the process ends on its own, ticker _check_timeouts detects the timeout → pushes the task back to pending → re-dispatches.
+logger.warning("Agent %s compact hanging after %d waits, releasing counter for ticker re-dispatch",
+               agent_id, compact_wait_count)
+self._compact_waits.pop(task_id, None)  # clear the wait counter
+await self._do_on_complete_async(on_complete, agent_id, "compact_hanging")
+# Not marked failed; the task stays working.
+```
+
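+**Flow**:
+1. compact_hanging → release counter → task stays working (no active monitor)
+2. The process ends on its own (Gateway timeout / compaction finishes / agent exits normally)
+3. ticker `_check_timeouts` detects working + timed out → pushes the task back to pending
+4. `_dispatch_pending` → `spawn_full_agent` → Phase 2 checks the session
+5. Session is idle → spawn normally; session is still busy → AgentBusyError → try again 30 seconds later
+
+**Benefits**:
+- Does not kill the process (follows design principle 4)
+- Does not go through the spawner retry (follows design principle 5: lock/compact wait for the ticker)
+- Reuses the existing ticker loop for a natural retry
+- Needs no new "delayed retry" mechanism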
+
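Editorial sketch (not part of the commit): steps 3 to 5 of the flow above can be simulated with a toy ticker loop. Everything here is hypothetical: `FakeSession`, `run_ticker`, and the dict-based task are invented for illustration, and keeping the task in pending while the session is busy is an assumption, since the document only says the dispatch is tried again 30 seconds later.

```python
class AgentBusyError(Exception):
    """Stand-in for the spawner's busy error; carries a reason string."""

    def __init__(self, reason: str):
        super().__init__(reason)
        self.reason = reason


class FakeSession:
    """Hypothetical session that stays busy for a fixed number of checks."""

    def __init__(self, busy_ticks: int):
        self.busy_ticks = busy_ticks

    def check_idle(self) -> None:
        # Mimics the Phase 2 check: raise while compaction is still running.
        if self.busy_ticks > 0:
            self.busy_ticks -= 1
            raise AgentBusyError("compact")


def run_ticker(task: dict, session: FakeSession, max_ticks: int = 10) -> list:
    """Run ticks until the task is re-spawned; return the status per tick."""
    history = []
    for _ in range(max_ticks):
        # Step 3: a timed-out working task is pushed back to pending.
        if task["status"] == "working" and task["timed_out"]:
            task["status"] = "pending"
        spawned = False
        if task["status"] == "pending":
            try:
                session.check_idle()       # step 4: Phase 2 session check
                task["status"] = "working"  # step 5a: idle, spawn normally
                task["timed_out"] = False
                spawned = True
            except AgentBusyError:
                pass                        # step 5b: busy, retry next tick
        history.append(task["status"])
        if spawned:
            break
    return history


task = {"status": "working", "timed_out": True}
history = run_ticker(task, FakeSession(busy_ticks=2))
```

In this toy run the task sits in pending for two ticks while the session is busy and is spawned on the third, without any dedicated delayed-retry mechanism.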
+#### ACT-3: a retry that hits a busy session releases the counter (P1)
+
+**File**: `spawner.py` `_do_retry`, ~5-line change
+
+**Approach**: When a retry hits AgentBusyError, release the counter and leave the task working, then wait for the ticker. Consistent with ACT-2.
+
+```python
+# Before
+except AgentBusyError:
+    logger.warning("Retry spawn skipped: %s busy (unexpected)", agent_id)
+    await self._do_on_complete_async(on_complete, agent_id, "retry_agent_busy")
+
+# After
+except AgentBusyError as e:
+    logger.warning("Retry spawn deferred: %s session busy (%s), releasing counter for ticker re-dispatch",
+                   agent_id, e.reason)
+    # Release the counter and keep the task working → the ticker re-dispatches on its next tick
+    await self._do_on_complete_async(on_complete, agent_id, "retry_session_busy")
+```
+
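+**Impact on the Bug-4 fix**: The Bug-4 fix keeps the counter held during a retry (so the ticker cannot acquire the same agent). Releasing the counter when a retry hits a busy session is still reasonable: the spawn has failed and no monitor is waiting, so holding the counter serves no purpose, and releasing it lets the ticker take the normal dispatch path.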
+
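Editorial sketch (not part of the commit): the ACT-3 behavior can be shown with a toy counter. This is a simplified illustration under assumptions: `Counter`, `do_retry`, and `busy_spawn` are hypothetical, and in the real code the release happens inside the `on_complete` callback rather than by a direct call.

```python
import asyncio


class AgentBusyError(Exception):
    """Stand-in for the spawner's busy error; carries a reason string."""

    def __init__(self, reason: str):
        super().__init__(reason)
        self.reason = reason


class Counter:
    """Hypothetical per-agent slot counter: one active task per agent."""

    def __init__(self):
        self.active = set()

    def acquire(self, agent_id: str) -> bool:
        if agent_id in self.active:
            return False
        self.active.add(agent_id)
        return True

    def release(self, agent_id: str) -> None:
        self.active.discard(agent_id)


async def do_retry(counter: Counter, agent_id: str, spawn) -> str:
    """Retry a spawn while holding the counter; release it if the session is busy."""
    try:
        await spawn(agent_id)
        return "retried"
    except AgentBusyError:
        # Spawn failed and no monitor is waiting: free the slot for the ticker.
        counter.release(agent_id)
        return "retry_session_busy"


async def busy_spawn(agent_id: str) -> None:
    # Simulates Phase 2 finding the session mid-compaction.
    raise AgentBusyError("compact")


counter = Counter()
counter.acquire("agent-1")  # slot held across the retry, as after the Bug-4 fix
outcome = asyncio.run(do_retry(counter, "agent-1", busy_spawn))
```

After the busy retry the slot is free again, so a later ticker dispatch can acquire it through the normal path.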
+### 7.4 Scope of changes
+
+| File | Change | Lines |
+|------|------|------|
+| `ticker.py` `_check_timeouts` | `started_at or claimed_at or updated_at` fallback | 1 line |
+| `spawner.py` `_handle_monitor_timeout` | compact_hanging: remove `_mark_task(failed)`, only release the counter | ~8 lines |
+| `spawner.py` `_do_retry` | AgentBusyError catch: clearer log message, outcome renamed to `retry_session_busy` | ~5 lines |
+
+**Total: ~14 changed lines.**
+
+### 7.5 Verification
+
+| Test | Expected |
+|------|------|
+| Mail orphan after a PM2 restart | `_check_timeouts` uses the `updated_at` fallback → reclaimed after 30 minutes |
+| Compact wait limit exceeded | compact_hanging → release counter → task stays working → ticker re-dispatches |
+| Retry hits compact | AgentBusyError → release counter → ticker re-dispatches |
+| Retry hits lock | Same as above |
+| Normal retry (gateway_timeout) | Unchanged, `_do_retry` runs normally |
+
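+### 7.6 Open questions
+
+1. **The process is still alive after compact_hanging**: when the ticker re-dispatches, it collides with the running session → AgentBusyError → try again 30 seconds later. If compaction lasts a long time, this may loop many times. Do we need an overall cap on compact retries?
+
+2. **The counter is released after a retry hits busy**: the ticker re-dispatch takes the normal `_dispatch_pending` → `spawn_full_agent` path (without skip_counter) → acquires the counter → runs the Phase 2 check. If the session is still busy at that point, AgentBusyError is raised again. This is expected, but it means a "ticker retry" goes through one more layer than a "spawner retry" (acquire + Phase 2 check) and is slightly less efficient. Is that acceptable?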
+
+## 8. Out of Scope for This Proposal

 | Item | Description | Follow-up |
 |------|------|------|