auto-sync: 2026-05-16 13:44:41
@@ -148,25 +148,37 @@ class ProjectManager:
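 > **Problem**: All projects and tasks share a single queue, so a long-running task in project A blocks project B.
 > **New design**: see §5.4, per-project concurrent scheduling.

-### 5.3 Daemon logical health self-check
+### 5.3 Daemon logical health self-check + thread liveness monitoring (v2 extension)
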
 ```python
 # §14 risk mitigation: raise an alert after N consecutive ticks with no state change
 STALE_TICK_THRESHOLD = 20

 class DaemonHealth:
-    def __init__(self):
-        self._tick_state_changes: dict[str, int] = {}  # project_id → consecutive ticks with no change
+    def __init__(self, project_id: str):
+        self.project_id = project_id
+        self._idle_ticks = 0

-    def record_change(self, project_id: str):
-        self._tick_state_changes[project_id] = 0
+    def record_idle(self):
+        self._idle_ticks += 1

-    def check_stale(self, project_id: str) -> bool:
-        self._tick_state_changes.setdefault(project_id, 0)
-        self._tick_state_changes[project_id] += 1
-        return self._tick_state_changes[project_id] >= STALE_TICK_THRESHOLD
+    def record_change(self):
+        self._idle_ticks = 0
+
+    def is_stale(self) -> bool:
+        return self._idle_ticks >= STALE_TICK_THRESHOLD
 ```

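+**Thread liveness monitoring** (see §5.4.4 `Daemon._check_slot_health()`):
+- Every 60 s the Daemon main thread checks that all ProjectSlot threads are alive
+- A dead thread is logged and restarted automatically
+- After 3 consecutive failed restarts, an alert is raised (the user is notified via Sanguo Mail)
+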
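+**Counter timeout fallback**:
+- If an Agent's completion callback is lost (process killed, network dropped), `ActiveAgentCounter` never returns to zero
+- In `_check_working_tasks()`, a working task that has exceeded `task_timeout` (default 10 minutes) is treated as completed
+- When a task is treated as completed, `decrement()` is called explicitly so the counter does not leak
+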
 ### 5.4 Concurrent scheduling model (new in v2)

 #### 5.4.1 Problem
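
The thread liveness rules added to §5.3 (periodic check, log and restart on death, alert after three consecutive failed restarts) could be sketched roughly as below. This is only an illustration under stated assumptions: `SlotSupervisor`, `make_thread`, and `alert` are hypothetical names and not the `Daemon._check_slot_health()` of §5.4.4, the `alert` callback merely stands in for the Sanguo Mail notification, and a "failed restart" is assumed to mean that creating or starting the replacement thread raised an exception.

```python
import logging
import threading
from typing import Callable, Dict

MAX_RESTART_FAILURES = 3  # from the design text: alert after 3 consecutive failed restarts


class SlotSupervisor:
    """Hypothetical helper that watches one worker thread per project slot."""

    def __init__(self, make_thread: Callable[[str], threading.Thread],
                 alert: Callable[[str], None]) -> None:
        self._make_thread = make_thread  # assumed factory: builds a fresh thread for a project
        self._alert = alert              # assumed stand-in for the mail notification
        self._threads: Dict[str, threading.Thread] = {}
        self._failed_restarts: Dict[str, int] = {}

    def add(self, project_id: str) -> None:
        thread = self._make_thread(project_id)
        thread.start()
        self._threads[project_id] = thread
        self._failed_restarts[project_id] = 0

    def check_slot_health(self) -> None:
        """One health pass; the design calls for one pass every 60 s."""
        for project_id, thread in list(self._threads.items()):
            if thread.is_alive():
                self._failed_restarts[project_id] = 0  # healthy again, reset the streak
                continue
            logging.warning("slot %s: thread died, restarting", project_id)
            try:
                replacement = self._make_thread(project_id)
                replacement.start()
                self._threads[project_id] = replacement
            except Exception:
                self._failed_restarts[project_id] += 1
                # Alert exactly once, when the streak first reaches the limit.
                if self._failed_restarts[project_id] == MAX_RESTART_FAILURES:
                    self._alert(f"slot {project_id}: {MAX_RESTART_FAILURES} "
                                "consecutive restart failures")
```

In a real daemon the main loop would call `check_slot_health()` on a 60-second timer; the sketch leaves the scheduling out so that a single pass can be exercised on its own.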
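
The counter timeout fallback from §5.3 can be illustrated in a similar way. The sketch below assumes a minimal `ActiveAgentCounter` and a `Task` record with a `started_at` timestamp; neither definition appears in this diff, so both are hypothetical, as are the free function `check_working_tasks` and its `now` parameter (added only to make the timeout easy to exercise).

```python
import threading
import time
from dataclasses import dataclass
from typing import List, Optional

TASK_TIMEOUT_SECONDS = 10 * 60  # design default: 10 minutes


class ActiveAgentCounter:
    """Minimal thread-safe counter; the real class is not shown in the diff."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._count = 0

    def increment(self) -> None:
        with self._lock:
            self._count += 1

    def decrement(self) -> None:
        with self._lock:
            self._count = max(0, self._count - 1)  # clamp so it never goes negative

    @property
    def value(self) -> int:
        with self._lock:
            return self._count


@dataclass
class Task:
    task_id: str
    started_at: float        # clock reading when the task entered "working"
    status: str = "working"


def check_working_tasks(tasks: List[Task], counter: ActiveAgentCounter,
                        now: Optional[float] = None,
                        task_timeout: float = TASK_TIMEOUT_SECONDS) -> List[str]:
    """Treat timed-out working tasks as completed and release their counter slot."""
    now = time.monotonic() if now is None else now
    expired = []
    for task in tasks:
        if task.status == "working" and now - task.started_at > task_timeout:
            task.status = "completed"  # the design treats a timed-out task as completed
            counter.decrement()        # release the slot the lost callback never released
            expired.append(task.task_id)
    return expired
```

Because a task's status changes on the first timeout, a second pass over the same task does not decrement again, which keeps the fallback from over-correcting the counter.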