
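v3.0 Scheduling Refactor Proposal: Remove the Standalone LLM, Switch to Broadcast Claiming + Deterministic Handoff

Date: 2026-05-21
Status: awaiting review by 司马懿
Affected files: router.py, dispatcher.py, ticker.py, main.py, config/default.yaml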

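1. Problem

1.1 LLMDriver is an off-design workaround

The Router's LLMDriver currently uses a standalone OpenAI() client that calls the zhipu API directly to make routing decisions:

- It is not L1 (it calls an LLM), not L2 (it is not an openclaw agent), and not L3 (it is not a full Agent)
- It bypasses the Gateway (no model routing, no fallback, no billing)
- It needs its own api_base/api_key, duplicating the Gateway configuration

This has already caused a real failure: general-20260521-0004 repeatedly failed to dispatch because the credentials were empty.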

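1.2 _dispatch_pending bypasses the claim mechanism

Core principle from the design document (§1.2):

"Orchestration is AI Agents picking up work on the blackboard on their own (dynamic collaboration)." "The Daemon is a courier, not a decision-maker."

But what _dispatch_pending actually does is:

pending → Router decides the agent → forced claimed + spawn → Agent executes passively

All of the claim infrastructure (API + CAS + inbox events) is already implemented in the code, but it is skipped.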

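1.3 Design violation

architecture-v2.6.md §1.1:

"Agents decide, the Daemon executes": 庞统 makes the plan, 张飞 picks up tasks, 关羽 spots risks, and all of it is written on the blackboard.

When the Daemon decides on behalf of the Agents (the Router chooses who gets the task), it violates the "Agents decide" principle.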

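2. The complete solution already in the design document

2.1 Claim infrastructure (implemented)

| Component | Code location | Description |
| --- | --- | --- |
| claim API | `POST /tasks/{id}/claim` | Atomic CAS claim |
| claim_task() | `operations.py:155` | `WHERE status='pending' AND (assignee IS NULL OR assignee=?)` |
| inbox agent_claim | `inbox.py:264` | Agent notifies the Daemon of a claim via JSONL |
| claim_timeout | `ticker.py:530` | A task left in claimed for more than 5 minutes is reset to pending (refill) |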
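The compare-and-swap claim in the table can be illustrated with a minimal sketch. Only the WHERE clause comes from the design; the `tasks` schema, the in-memory database, and the agent ids `zhangfei` and `simayi` are assumptions for illustration, and the real claim_task() in operations.py may differ.

```python
import sqlite3


def claim_task(conn: sqlite3.Connection, task_id: str, agent_id: str) -> bool:
    """Try to claim a pending task; True only for the caller whose UPDATE matched."""
    cur = conn.execute(
        "UPDATE tasks SET status = 'claimed', assignee = ? "
        "WHERE id = ? AND status = 'pending' "
        "AND (assignee IS NULL OR assignee = ?)",
        (agent_id, task_id, agent_id),
    )
    conn.commit()
    # rowcount is 1 for the winner and 0 for everyone who arrived later
    return cur.rowcount == 1


# Hypothetical schema, just enough to exercise the WHERE clause
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, assignee TEXT)")
conn.execute("INSERT INTO tasks VALUES ('t1', 'pending', NULL)")

first = claim_task(conn, "t1", "zhangfei")   # matches the pending row
second = claim_task(conn, "t1", "simayi")    # row is no longer pending
assignee = conn.execute("SELECT assignee FROM tasks WHERE id = 't1'").fetchone()[0]
```

Because the status check and the status change happen in one UPDATE statement, no separate lock is needed: the second caller simply matches zero rows.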

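2.2 Race resolution (designed, §3.6)

- Default: first come, first served. SQLite CAS; whoever claims first does the work.
- Escalation: arbitration by 庞统. In a dispute, @庞统 to request arbitration.
- Final: the user decides. @user to request a user decision.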

2.3 Refill fallback (implemented)

pending → broadcast → nobody claims → claim_timeout(5min) → pending → broadcast again → ... → 庞统 fallback

claim_timeout_minutes = 5.0; _check_timeouts automatically resets timed-out claimed tasks to pending.
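A rough sketch of the timeout reset, under stated assumptions: the `claimed_at` column (epoch seconds) and the `retry_count` column are hypothetical schema details, and the function name is invented here. The actual _check_timeouts in ticker.py is not shown in this document and may work differently.

```python
import sqlite3
import time

CLAIM_TIMEOUT_MINUTES = 5.0  # value taken from the design


def reset_stale_claims(conn, now=None, timeout_minutes=CLAIM_TIMEOUT_MINUTES):
    """Reset tasks stuck in 'claimed' past the timeout; return how many were reset."""
    now = time.time() if now is None else now
    cutoff = now - timeout_minutes * 60
    cur = conn.execute(
        "UPDATE tasks SET status = 'pending', retry_count = retry_count + 1 "
        "WHERE status = 'claimed' AND claimed_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount


# Hypothetical schema and rows for illustration
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, claimed_at REAL, retry_count INTEGER)"
)
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?, ?)",
    [
        ("a", "claimed", 0.0, 0),    # claimed long ago: should be reset
        ("b", "claimed", 940.0, 0),  # claimed 60 s before 'now': still fresh
        ("c", "working", 0.0, 0),    # not in 'claimed': untouched by this check
    ],
)
reset_count = reset_stale_claims(conn, now=1000.0)
rows = dict(conn.execute("SELECT id, status FROM tasks").fetchall())
```

Passing `now` explicitly keeps the check deterministic in tests; in the ticker it would default to the current time.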

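2.4 Three-layer execution model (§4.2)

| Layer | Mechanism | Command | Use cases |
| --- | --- | --- | --- |
| L1: Daemon acts directly | SQLite/files | n/a | Status updates, mechanical verification |
| L2: spawn sub | Isolated session | `openclaw agent --agent <id> --session-id <uuid>` | scope guard, format validation |
| L3: run agent | Full blackboard participant | `openclaw agent --agent <id>` | Coding, review, strategy |

Core principle: the system has only two ways to invoke an LLM (L2 and L3), both go through the Gateway, and there is no third.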
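The difference between L2 and L3 is only whether a fresh session id is passed. A minimal sketch that builds the argument list for each layer, using the two command shapes from the table; the function name and the `isolated` flag are hypothetical, and the agent id is a placeholder.

```python
import uuid
from typing import List


def build_agent_command(agent_id: str, isolated: bool) -> List[str]:
    """Return the argv for an L2 (isolated session) or L3 (full agent) invocation."""
    cmd = ["openclaw", "agent", "--agent", agent_id]
    if isolated:
        # L2: a fresh UUID keeps the sub-session separate from the agent's main session
        cmd += ["--session-id", str(uuid.uuid4())]
    return cmd


l3_cmd = build_agent_command("zhangfei", isolated=False)
l2_cmd = build_agent_command("zhangfei", isolated=True)
```

Returning a list rather than a shell string means the result can be handed to `subprocess.run` or `asyncio.create_subprocess_exec` without quoting concerns.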

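3. Proposal: broadcast claiming + deterministic handoff

3.1 Core idea

Task scheduling takes one of two paths:

Deterministic path: the next executor is already known → spawn directly, no broadcast needed

- retry → original executor
- Agent handoff (next_capability) → capability match
- assignee set → assign directly
- review lifecycle → capability match

Broadcast-claim path: first assignment / uncertain cases → spawn all idle Agents and let them claim on their own

- Agent reads the blackboard → judges for itself whether the task fits → claim
- Nobody claims → refill → 庞统 fallback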

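3.2 Complete scheduling flow

Ticker._tick_project()
  │
  ├─ 1. Scan state
  ├─ 2. Advance dependencies
  ├─ 3. Timeout detection (claimed 5min → pending, working 30min → failed)
  │
  ├─ 4. Dispatch pending tasks: _dispatch_pending()
  │     for each pending task:
  │       ├─ has next_capability? → capability match → spawn the matching Agent directly
  │       ├─ has assignee? → spawn that Agent directly
  │       ├─ retry? → spawn the original executor
  │       └─ none of the above? → _broadcast_claim() (batched, at most once per tick; see §4.3)
  │            → spawn every idle Agent with the prompt "there are new tasks on the blackboard, please claim"
  │            → Agent reads the blackboard (GET /tasks?status=pending)
  │            → Agent judges whether a task suits it
  │            → Agent claims (POST /tasks/{id}/claim) or exits
  │
  ├─ 5. Dispatch review tasks: _dispatch_reviews() (unchanged)
  └─ 6-10. Other processing (unchanged)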
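Step 4 of the flow above can be sketched as a pure function that sorts pending tasks into direct spawns and one broadcast batch. This is an illustration only: the `PendingTask` dataclass, its `retry_executor` field, and the agent ids are hypothetical, and the check order simply mirrors the diagram.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PendingTask:
    id: str
    next_capability: Optional[str] = None
    assignee: Optional[str] = None
    retry_executor: Optional[str] = None  # hypothetical: who ran the failed attempt


def choose_path(task: PendingTask) -> Tuple[str, Optional[str]]:
    """Return (path, target) for one task, checking in the order the diagram lists."""
    if task.next_capability:
        return ("capability", task.next_capability)
    if task.assignee:
        return ("assignee", task.assignee)
    if task.retry_executor:
        return ("retry", task.retry_executor)
    return ("broadcast", None)


def plan_tick(tasks: List[PendingTask]):
    """Split tasks into direct spawns and a single broadcast batch for this tick."""
    direct, batch = [], []
    for task in tasks:
        path, target = choose_path(task)
        if path == "broadcast":
            batch.append(task.id)
        else:
            direct.append((task.id, path, target))
    return direct, batch


tasks = [
    PendingTask("t1", next_capability="review"),
    PendingTask("t2", assignee="zhangfei"),
    PendingTask("t3", retry_executor="guanyu"),
    PendingTask("t4"),
    PendingTask("t5"),
]
direct, batch = plan_tick(tasks)
```

Keeping the decision separate from the spawning makes the routing rules testable without starting any Agent process.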

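3.3 Spawn prompt for broadcast claiming

The prompt received by each Agent spawned for a broadcast:

You are {agent_id}. There are new tasks awaiting claim on the blackboard.

## Steps
1. Read the blackboard to see the tasks awaiting claim:
   curl "http://{api_host}:{api_port}/api/projects/{project_id}/tasks?status=pending"

2. Analyze each pending task and decide whether it suits you:
   - Your capabilities: {capabilities}
   - Whether the task type, description, and priority match your expertise

3. If a task suits you, claim it immediately:
   curl -X POST http://{api_host}:{api_port}/api/projects/{project_id}/tasks/{task_id}/claim \
     -H 'Content-Type: application/json' \
     -d '{"agent": "{agent_id}"}'

4. After a successful claim, start executing (set the status to working)

5. If no task suits you, exit immediately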

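3.4 Handling claim contention

When several Agents claim the same task at the same time:

- SQLite CAS: WHERE status='pending'. Only the first UPDATE succeeds; the rest get rowcount=0.
- Agent behavior: claim fails → check the other pending tasks → none suitable → exit
- Natural load balancing: different Agents tend to claim different task types (张飞→coding, 司马懿→review)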
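The Agent-side behavior described in the list (claim fails, try the next suitable task, exit if none remain) might look like the following sketch. The function name and the injected `try_claim` callable are hypothetical; in practice `try_claim` would wrap the POST /tasks/{id}/claim call, and here a fake stands in for it.

```python
from typing import Callable, Iterable, Optional


def claim_first_available(
    candidates: Iterable[str], try_claim: Callable[[str], bool]
) -> Optional[str]:
    """Walk the Agent's preferred tasks in order; return the first one it wins."""
    for task_id in candidates:
        if try_claim(task_id):
            return task_id
    # Every claim lost the race (or nothing was suitable): the Agent exits
    return None


# Fake claim endpoint: t1 was already taken by another Agent
already_claimed = {"t1"}


def fake_try_claim(task_id: str) -> bool:
    if task_id in already_claimed:
        return False
    already_claimed.add(task_id)
    return True


won = claim_first_available(["t1", "t2"], fake_try_claim)
lost = claim_first_available(["t1", "t2"], fake_try_claim)
```

A lost claim is treated as a normal outcome rather than an error, which is what lets several Agents race without coordination.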

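3.5 Fallback when nobody claims

broadcast → nobody claims within 5 minutes → claim_timeout → pending → broadcast again on the next ticker round
→ nobody claims for 3 consecutive rounds → spawn 庞统 → 庞统 decides the assignment or executes the task itself

Detecting consecutive unclaimed rounds: record a broadcast_sent event in the events table, count broadcast rounds in _check_timeouts, and escalate to 庞统 once the threshold is exceeded. (§4.7 simplifies this by reusing retry_count instead of adding a new event.)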

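4. Change list

4.1 Delete the `LLMDriver` class (router.py)

Delete the entire LLMDriver class (~120 lines). The end of AgentRouter.route() changes to return delegate.

4.2 Remove the `llm_driver` parameter from `AgentRouter.__init__`

4.3 Add `_broadcast_claim` (ticker.py)

Suggestion from 司马懿: collect pending tasks into a batch and broadcast at most once per ticker round, rather than once per pending task. With 5 pending tasks and 5 idle Agents, one round spawns 5 Agents instead of 25.

Check global concurrency before broadcasting (司马懿's suggestion 1): skip this round's broadcast when the count is close to the limit.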

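async def _broadcast_claim(self, tasks, db_path, project_id):
    """Broadcast a batch of unclaimed tasks to all idle Agents, at most once per tick."""
    if not tasks:
        return []
    # Global concurrency check: skip this round when close to the limit
    if self.counter and self.counter.global_active >= self.counter._max_global - 1:
        logger.info("Skipping broadcast: global concurrent near limit")
        return []

    idle_agents = self._get_idle_agents()
    if not idle_agents:
        return []

    spawned = []
    for agent_id in idle_agents:
        # Respect the counter's limits before reserving a slot
        if not await self.counter.can_acquire(agent_id):
            continue
        prompt = self._build_claim_prompt(agent_id, tasks, project_id)
        await self.counter.acquire(agent_id)
        try:
            await self.spawner.spawn_full_agent(
                agent_id=agent_id,
                message=prompt,
                on_complete=lambda aid, _: self.counter.release(aid),
            )
        except Exception:
            # Spawn failed, so on_complete will never fire; free the slot here.
            # Assumes release() is synchronous, as the on_complete lambda implies.
            self.counter.release(agent_id)
            logger.exception("Broadcast spawn failed for %s", agent_id)
            continue
        spawned.append(agent_id)
    return spawned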
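The counter object that _broadcast_claim relies on is not shown in this document. As a rough illustration of the bookkeeping it implies, here is a synchronous sketch; the class name `SpawnCounter`, the per-agent limit, and the `near_limit` and `try_acquire` methods are all assumptions, and the real counter exposes awaitable `can_acquire`/`acquire` instead.

```python
from collections import Counter


class SpawnCounter:
    """Tracks active spawns per agent and enforces a global ceiling."""

    def __init__(self, max_global: int, max_per_agent: int = 1) -> None:
        self._max_global = max_global
        self._max_per_agent = max_per_agent
        self._active = Counter()

    @property
    def global_active(self) -> int:
        return sum(self._active.values())

    def near_limit(self) -> bool:
        # Same threshold as the broadcast guard: leave one slot of headroom
        return self.global_active >= self._max_global - 1

    def try_acquire(self, agent_id: str) -> bool:
        if self.global_active >= self._max_global:
            return False
        if self._active[agent_id] >= self._max_per_agent:
            return False
        self._active[agent_id] += 1
        return True

    def release(self, agent_id: str) -> None:
        # Guard against double release driving the count negative
        if self._active[agent_id] > 0:
            self._active[agent_id] -= 1


counter = SpawnCounter(max_global=3)
got_first = counter.try_acquire("zhangfei")
got_duplicate = counter.try_acquire("zhangfei")  # per-agent limit of 1 blocks this
got_second = counter.try_acquire("simayi")
near_before = counter.near_limit()               # 2 active, limit 3
counter.release("zhangfei")
near_after = counter.near_limit()                # 1 active
```

Exposing a `near_limit()` method would also avoid reading the private `_max_global` attribute from the ticker.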

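The broadcast prompt contains the full list of pending tasks:

def _build_claim_prompt(self, agent_id, tasks, project_id):
    # Assumption: the API host and port are available on self, as in the §3.3 template
    api = f"{self.api_host}:{self.api_port}"
    task_list = "\n".join(
        f"- ID: {t.id}, title: {t.title}, type: {t.task_type}, priority: {t.priority}"
        for t in tasks
    )
    # <task_id> is a literal placeholder for the Agent to fill in; it is not interpolated
    return f"""You are {agent_id}. There are {len(tasks)} tasks awaiting claim on the blackboard.

## Tasks awaiting claim
{task_list}

## Steps
1. Read the blackboard for details:
   curl "http://{api}/api/projects/{project_id}/tasks?status=pending"

2. Pick a task that suits you and claim it:
   curl -X POST http://{api}/api/projects/{project_id}/tasks/<task_id>/claim \\
     -H 'Content-Type: application/json' -d '{{"agent": "{agent_id}"}}'

3. After a successful claim, start executing (set the status to working)
4. If no task suits you, exit
"""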

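4.4 Add a delegate mode to the Dispatcher (dispatcher.py)

_build_spawn_message gains a delegate branch (the 庞统 fallback prompt), which coexists with broadcast claiming:

broadcast claim fails → task returns to pending → after several rounds → spawn 庞统 as delegate

4.5 Remove the LLMDriver initialization from main.py

4.6 Remove the routing section from config/default.yaml

4.7 Unclaimed-task detection (reusing retry_count)

Suggestion from 司马懿: there is no need to add a broadcast_sent event. retry_count is already incremented whenever an unclaimed task is reset to pending, so it can serve directly as the threshold check. When retry_count >= 3, escalate to 庞统.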
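The retry_count threshold can be expressed as a small partition step before dispatch. A minimal sketch, assuming tasks are available as dicts with `id` and `retry_count` keys; the function name and the dict shape are hypothetical, and only the threshold of 3 comes from the text.

```python
from typing import Dict, List, Tuple

ESCALATION_THRESHOLD = 3  # from the design: retry_count >= 3 escalates


def split_unclaimed(
    tasks: List[Dict], threshold: int = ESCALATION_THRESHOLD
) -> Tuple[List[str], List[str]]:
    """Partition task ids into (broadcast again, escalate to the fallback agent)."""
    broadcast, escalate = [], []
    for task in tasks:
        if task["retry_count"] >= threshold:
            escalate.append(task["id"])
        else:
            broadcast.append(task["id"])
    return broadcast, escalate


pending = [
    {"id": "t1", "retry_count": 0},
    {"id": "t2", "retry_count": 2},
    {"id": "t3", "retry_count": 3},
    {"id": "t4", "retry_count": 5},
]
to_broadcast, to_escalate = split_unclaimed(pending)
```

One thing to check when adopting this: if retry_count is also incremented by ordinary execution retries, a task that failed three times would escalate by the same rule, which may or may not be intended.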

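5. Scenario comparison

| Scenario | Before (standalone LLM assignment) | After (broadcast claiming + deterministic handoff) |
| --- | --- | --- |
| retry | Original executor | Unchanged (deterministic) |
| Agent handoff | Capability match | Unchanged (deterministic) |
| assignee set | Assign directly | Unchanged (deterministic) |
| review lifecycle | Capability match | Unchanged (deterministic) |
| First assignment | Standalone LLM decides who gets the task | Broadcast to all idle Agents, which claim on their own |
| Nobody claims | Not applicable (forced assignment) | Refill → 庞统 fallback |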

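6. Code size

- Deleted: ~130 lines (LLMDriver + routing config initialization + the routing section of config/default.yaml)
- Changed: ~30 lines (end of Router.route() + the delegate branch in Dispatcher._build_spawn_message())
- Added: ~60 lines (_broadcast_claim + _build_claim_prompt)
- Net reduction: ~40 lines (130 deleted, minus 30 changed and 60 added)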

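7. Risks and mitigations

| # | Risk | Rating | Mitigation |
| --- | --- | --- | --- |
| 1 | Broadcast spawns consume resources (every broadcast spawns all idle Agents) | Medium | Only tasks with no deterministic path are broadcast; broadcasts are batched to at most one per ticker round (§4.3); an Agent that finds no suitable task on the blackboard exits quickly |
| 2 | Multiple Agents compete for a claim | Low | SQLite CAS, first come first served; already implemented |
| 3 | Nobody claims | Low | The refill mechanism is the fallback; 庞统 takes over after several rounds |
| 4 | An Agent claims a task that does not suit it | Low | The Agent has full context (SOUL + AGENTS + capabilities), so it judges more accurately than the standalone LLM |
| 5 | Broadcasting is slower than direct assignment | Low | First assignment does not need to be fast; accuracy matters more than speed |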

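8. Implementation steps

1. router.py: delete the LLMDriver class, remove llm_driver from AgentRouter, and change the end of route() to return delegate
2. ticker.py: add _broadcast_claim and _build_claim_prompt; modify _dispatch_pending to add the broadcast path
3. dispatcher.py: add the delegate branch (庞统 fallback) to _build_spawn_message()
4. main.py: delete the llm_driver initialization block
5. config/default.yaml: delete the routing section
6. Testing: create a pending task → observe the broadcast spawn → Agent claims → execution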
