[moz] feat: Runaway Guard per-task dispatch limit

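§15 Runaway Guard: per-task dispatch_count limit to prevent infinite dispatch loops

Problem: mail/toolchain tasks go through the handler's auto-working path (skipping claim), so they are not covered by the claim_timeout fallback of 3 retries. If such a task is spawned repeatedly but never reaches done/failed, it loops indefinitely and consumes resources (real-world case: the 2026-06-15 mention duplicate-delivery incident).

Design:
- Add a dispatch_count column to the tasks table
- Increment it each time the ticker successfully dispatches a task
- Automatically mark the task failed (reason=runaway_guard) when dispatch_count >= 10
- Cover all non-terminal states (pending/working/claimed)
- Reference: Hermes v0.13 §3 Per-Task retry limits

Changed files:
- src/blackboard/db.py: _safe_add_column dispatch_count
- src/blackboard/models.py: add dispatch_count to the Task dataclass
- src/daemon/ticker.py: increment on dispatch + runaway guard in _check_timeouts
- docs/design/15-runaway-guard.md: design document
- tests/integration/test_ticker_integration.py: 3 E13 tests

Tests: 456 passed, 3 skipped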
2026-06-16 00:18:15 +08:00
parent d6cb854f68
commit 415c6899c2
5 changed files with 206 additions and 0 deletions
@@ -0,0 +1,61 @@
+# §15 Runaway Guard: Per-Task Dispatch Limit
+
+> Design document v1.0 | 2026-06-16
+
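+## Problem
+
+mail/toolchain tasks go through the handler's auto-working path (skipping the claim phase), so they are not protected by the claim_timeout fallback of 3 retries. If an auto-working task is spawned repeatedly but never reaches done/failed, it loops indefinitely and consumes resources.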
+
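+### Real-world case
+
+2026-06-15 mention duplicate-delivery incident: `spawn_full_agent` returned `None` when `use_main_session=True`, the ticker's `_process_mentions` misread that as a failure, and it retried on every tick (30s). The same mention was delivered 4 times and only stopped once retry_count reached the mention_queue limit of 5.
+
+The direct root cause has been fixed by PR #80, but if a similar bug reappears, there is currently no mechanism to stop an infinite loop at the task level.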
+
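+## Design
+
+### Mechanism
+
+The tasks table gains a `dispatch_count` column, incremented each time the ticker successfully dispatches a task. When `dispatch_count >= 10` (the global default), the task is automatically marked failed.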
+
+### Choice of default
+
+The global default is 10, following Hermes v0.13 Best Practices §3 "Per-Task retry limits":
+
+- Simple tasks: 1 retry
+- Complex tasks: 3 retries
+- crash recovery (3) + api_retry (3), plus headroom = ~10
+
+### Scope
+
+All task types (task/mail/toolchain) and all non-terminal states (pending/working/claimed).
+
+### When the check runs
+
+At the start of the `_check_timeouts` method, before the existing claimed/working timeout checks.
+
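+### Relationship to existing mechanisms
+
+| Mechanism | Scenario covered | Action triggered |
+|------|---------|---------|
+| claim_timeout retry_count >= 3 | Broadcast task that nobody claims | Escalate to 庞统 |
+| crash_limit 3/30min | Crash while in working state | Mark failed |
+| api_retry_count | Consecutive API failures | Mark failed |
+| Refill (续杯) max_retries 3 | Refills exhausted | Mark failed |
+| working timeout | working state times out | Mark failed or done |
+| **runaway_guard, 10 dispatches** | **Infinite loop in any state** | **Mark failed** |
+
+runaway_guard is the last line of defense: it covers the loop scenarios that all the other mechanisms miss.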
+
+## Changed files
+
+| File | Change |
+|------|------|
+| `src/blackboard/db.py` | `_safe_add_column(conn, "tasks", "dispatch_count", "INTEGER DEFAULT 0")` |
+| `src/blackboard/models.py` | Add `dispatch_count: int = 0` to the Task dataclass |
+| `src/daemon/ticker.py` | `_dispatch_pending` / `_dispatch_reviews` increment dispatch_count; `_check_timeouts` adds the runaway guard check |
+
+## References
+
+- Hermes v0.13 Kanban Best Practices §3 "Per-Task retry limits"
+- Real-world case: the 2026-06-15 mention duplicate-delivery incident (PR #80 fixed the direct root cause; the runaway guard serves as a fallback)
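
The mechanism in the design document above (add a counter column, increment it on each successful dispatch, fail the task once the counter reaches 10) can be sketched as follows. This is a minimal illustration under assumptions, not the code from this commit: it uses an in-memory SQLite database, and the `id`, `status`, and `fail_reason` column names, along with the helpers `safe_add_column`, `record_dispatch`, `check_runaway`, and `demo`, are hypothetical stand-ins. Only `dispatch_count`, the threshold of 10, the non-terminal states, and the `runaway_guard` reason come from the commit message and design document.

```python
import sqlite3

MAX_DISPATCH_COUNT = 10  # global default stated in the design document
NON_TERMINAL = ("pending", "working", "claimed")  # states covered by the guard


def safe_add_column(conn, table, column, decl):
    """Hypothetical stand-in for _safe_add_column: add the column only if absent."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")


def record_dispatch(conn, task_id):
    """Increment the counter after a successful dispatch of one task."""
    conn.execute(
        "UPDATE tasks SET dispatch_count = dispatch_count + 1 WHERE id = ?",
        (task_id,),
    )


def check_runaway(conn, limit=MAX_DISPATCH_COUNT):
    """Fail every non-terminal task whose counter reached the limit.

    Intended to run before any other timeout check. Returns the failed ids.
    """
    placeholders = ",".join("?" for _ in NON_TERMINAL)
    rows = conn.execute(
        f"SELECT id FROM tasks WHERE dispatch_count >= ? "
        f"AND status IN ({placeholders}) ORDER BY id",
        (limit, *NON_TERMINAL),
    ).fetchall()
    ids = [row[0] for row in rows]
    for task_id in ids:
        # fail_reason is an assumed column name for storing reason=runaway_guard
        conn.execute(
            "UPDATE tasks SET status = 'failed', fail_reason = 'runaway_guard' "
            "WHERE id = ?",
            (task_id,),
        )
    return ids


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, fail_reason TEXT)"
    )
    conn.execute(
        "INSERT INTO tasks (id, status) VALUES ('t1', 'working'), ('t2', 'pending')"
    )
    # Running the migration twice shows that it is idempotent.
    safe_add_column(conn, "tasks", "dispatch_count", "INTEGER DEFAULT 0")
    safe_add_column(conn, "tasks", "dispatch_count", "INTEGER DEFAULT 0")
    for _ in range(MAX_DISPATCH_COUNT):
        record_dispatch(conn, "t1")  # t1 loops until it hits the limit
    record_dispatch(conn, "t2")      # t2 is dispatched once and is unaffected
    flagged = check_runaway(conn)
    statuses = dict(conn.execute("SELECT id, status FROM tasks"))
    return flagged, statuses
```

Calling `demo()` fails only the task that was dispatched ten times and leaves the other one pending. Because the counter lives in the task row rather than in process memory, it would survive a daemon restart, which seems to fit the guard's role as a last line of defense.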