diff --git a/docs/design/topic11-multi-project-proposal.md b/docs/design/topic11-multi-project-proposal.md
index 91fb49e..d42b5f1 100644
--- a/docs/design/topic11-multi-project-proposal.md
+++ b/docs/design/topic11-multi-project-proposal.md
@@ -148,25 +148,37 @@ class ProjectManager:
 > **Problem**: All projects and tasks share one queue, so a long task in project A blocks project B.
 > **New design**: see §5.4, per-project concurrent scheduling.
 
-### 5.3 Daemon logical health self-check
+### 5.3 Daemon logical health self-check + thread liveness monitoring (v2 extension)
 
 ```python
 # §14 risk mitigation: alert after N consecutive ticks with no state change
 STALE_TICK_THRESHOLD = 20
 
 class DaemonHealth:
-    def __init__(self):
-        self._tick_state_changes: dict[str, int] = {}  # project_id → consecutive ticks without a change
+    def __init__(self, project_id: str):
+        self.project_id = project_id
+        self._idle_ticks = 0
 
-    def record_change(self, project_id: str):
-        self._tick_state_changes[project_id] = 0
+    def record_idle(self):
+        self._idle_ticks += 1
 
-    def check_stale(self, project_id: str) -> bool:
-        self._tick_state_changes.setdefault(project_id, 0)
-        self._tick_state_changes[project_id] += 1
-        return self._tick_state_changes[project_id] >= STALE_TICK_THRESHOLD
+    def record_change(self):
+        self._idle_ticks = 0
+
+    def is_stale(self) -> bool:
+        return self._idle_ticks >= STALE_TICK_THRESHOLD
 ```
 
+**Thread liveness monitoring** (see §5.4.4 `Daemon._check_slot_health()`):
+- The Daemon main thread checks every 60s whether all ProjectSlot threads are alive
+- Thread dead → log it + restart automatically
+- 3 consecutive failed restarts → alert (notify the user via Sanguo Mail)
+
+**Counter timeout fallback**:
+- If an Agent's completion callback is lost (process killed, network dropped), `ActiveAgentCounter` never returns to zero
+- In `_check_working_tasks()`, a working task that exceeds `task_timeout` (default 10 minutes) is treated as complete
+- When a task is treated as complete, call `decrement()` explicitly to prevent the counter from leaking
+
 ### 5.4 Concurrent scheduling model (new in v2)
 
 #### 5.4.1 Problem
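
The two fallbacks added to §5.3 (thread liveness monitoring and the counter timeout) could look roughly like the sketch below. It is a minimal illustration and not the code in §5.4.4: `SlotState`, `WorkingTask`, `reap_timed_out_tasks`, `check_slot_health`, and the `start_thread` / `alert` callbacks are hypothetical names, and the shape of `ActiveAgentCounter` is assumed. A "failed restart" is taken here to mean that the thread is found dead again at the next check. The limit of 3 restarts and the 10-minute timeout come from the proposal.

```python
import logging
import threading
from dataclasses import dataclass

MAX_RESTARTS = 3          # proposal: alert after 3 consecutive failed restarts
TASK_TIMEOUT_S = 10 * 60  # proposal: task_timeout defaults to 10 minutes


class ActiveAgentCounter:
    """Thread-safe count of running agents (assumed shape)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._count = 0

    def increment(self) -> None:
        with self._lock:
            self._count += 1

    def decrement(self) -> None:
        with self._lock:
            # Clamp at zero so a late callback after a timeout cannot go negative.
            self._count = max(0, self._count - 1)

    @property
    def value(self) -> int:
        with self._lock:
            return self._count


@dataclass
class WorkingTask:  # hypothetical record of a task in the "working" state
    task_id: str
    started_at: float  # seconds, same clock as `now` below


def reap_timed_out_tasks(working, counter, now, timeout=TASK_TIMEOUT_S):
    """Treat tasks older than `timeout` as complete; return those still working."""
    still_working = []
    for task in working:
        if now - task.started_at > timeout:
            counter.decrement()  # the completion callback is presumed lost
        else:
            still_working.append(task)
    return still_working


@dataclass
class SlotState:  # hypothetical per-project bookkeeping
    thread: threading.Thread
    consecutive_restarts: int = 0


def check_slot_health(slots, start_thread, alert):
    """One pass of the periodic liveness check over all project slots."""
    for project_id, slot in slots.items():
        if slot.thread.is_alive():
            slot.consecutive_restarts = 0  # healthy again, reset the streak
            continue
        if slot.consecutive_restarts >= MAX_RESTARTS:
            alert(project_id)  # every restart in the streak died again
            continue
        logging.warning("slot %s thread is dead, restarting", project_id)
        slot.consecutive_restarts += 1
        slot.thread = start_thread(project_id)
```

In this sketch the alert repeats on every later check until the slot recovers; a real daemon would probably rate-limit it. Clamping `decrement()` at zero is one way to tolerate a completion callback that arrives after the task has already been reaped by the timeout.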