auto-sync: 2026-05-17 19:47:56

2026-05-17 19:47:56 +08:00
parent d90ad9f305
commit 92e47176e6
1 changed files with 403 additions and 0 deletions
@@ -0,0 +1,403 @@
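+# Agent Routing Mechanism Redesign Proposal
+
+**Version**: v1.0  
+**Author**: Pang Tong (庞统, deputy strategist) 🐦  
+**Date**: 2026-05-17  
+**Status**: Pending confirmation  
+**Trigger**: E2E testing showed the review stage dispatching the wrong Agent (Zhang Fei (张飞), `zhangfei-dev`, was sent to review his own work). The root cause is hard-coded routing in the Daemon.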
+
+---
+
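+## 1. Problem Diagnosis
+
+### 1.1 Current Implementation
+
+```
+Ticker tick → dispatcher.decide(task, action_type) → returns agent_id → spawn
+```
+
+The logic of `decide()`:
+1. `action_type` is a mechanical check → the Daemon runs it locally
+2. `task.assignee` is set and registered → spawn that agent (**the assignee is used directly**)
+3. `task.assignee` is empty → look up `capability_map` → fall back to Pang Tong (庞统)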
+
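+### 1.2 Root Cause of the Bug
+
+During the task lifecycle, the assignee is set only in the **execution stage** (Zhang Fei claims the task → assignee="zhangfei-dev").
+
+At the **review stage**, the ticker calls `dispatcher.dispatch(task, action_type="review")`, but `decide()` takes the Level 2 branch: `task.assignee="zhangfei-dev"` is in the registry, so the task is dispatched to Zhang Fei again.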
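To make the failure concrete, here is a minimal sketch of the three-level logic described in 1.1. It is a reconstruction from that description, not the actual Daemon code: `legacy_decide`, `REGISTERED_AGENTS`, `CAPABILITY_MAP`, and the `lint_check` action are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for the Daemon's registry and capability_map.
REGISTERED_AGENTS = {"zhangfei-dev", "simayi-challenger", "pangtong-fujunshi"}
CAPABILITY_MAP = {"review": "simayi-challenger"}
LOCAL_ACTIONS = {"lint_check"}
FALLBACK_AGENT = "pangtong-fujunshi"


@dataclass
class Task:
    status: str
    assignee: Optional[str] = None


def legacy_decide(task: Task, action_type: str) -> str:
    """Mimic the three-level decision order described in section 1.1."""
    # Level 1: mechanical checks run locally in the Daemon.
    if action_type in LOCAL_ACTIONS:
        return "daemon-local"
    # Level 2: a registered assignee always wins, whatever the stage is.
    if task.assignee and task.assignee in REGISTERED_AGENTS:
        return task.assignee
    # Level 3: no assignee, so consult the capability map, then fall back.
    return CAPABILITY_MAP.get(action_type, FALLBACK_AGENT)


task = Task(status="review", assignee="zhangfei-dev")
print(legacy_decide(task, "review"))  # the executor is chosen as reviewer
```

Because Level 2 runs before any capability lookup, the `review` entry in the capability map is never consulted once an assignee exists.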
+
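+### 1.3 The Deeper Problem
+
+**The Daemon is making decisions that belong to the AI.** The v2.6 architecture states this explicitly:
+
+| Dimension | v2.6 design goal | Current implementation |
+|------|-------------|---------|
+| Decision maker | Agent (decides autonomously on the blackboard) | Daemon (hard-coded if-else) |
+| Daemon role | Courier (carries out decisions recorded on the blackboard) | Scheduler (decides who does what) |
+| Orchestration | AI agents pick up work on the blackboard themselves (dynamic collaboration) | Config-table driven (no AI judgment) |
+
+The original T3-10 design text says "**config-table driven, no AI judgment**", which contradicts the core principle of v2.6.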
+
+---
+
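+## 2. Research Findings
+
+### 2.1 Academic Frontier
+
+#### bMAS (Blackboard Multi-Agent System), arXiv 2507.01701
+
+**Core mechanism**: a Control Unit (LLM-driven) **dynamically selects** which Agent acts in the next round, based on the current contents of the blackboard.
+
+Key findings:
+- There is no fixed DAG; the Control Unit decides the next step from the blackboard state
+- Token efficiency is better (smart routing does not spend tokens on irrelevant Agents)
+- Agents act in turn → update the blackboard → the Control Unit evaluates → repeat until consensus
+
+#### Self-Selection pattern, arXiv 2510.01285
+
+**Core finding**: **tasks are not explicitly assigned to Agents.** Instead, a central Agent posts the requirement to the blackboard, and **each Agent decides for itself whether to participate**.
+
+> "Tasks are not explicitly assigned to helper agents; instead, each agent autonomously decides whether to participate based on its capabilities."
+
+This is the most AI-native pattern: it needs no routing rule table at all.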
+
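+#### MasRouter (Confidence-Aware Routing), arXiv 2601.04861
+
+Selects model size dynamically based on task complexity and introduces a confidence mechanism:
+- Simple tasks → small model
+- Complex tasks → large model
+- Agent reliability scores are updated dynamically from historical performance
+
+#### AgentGate, arXiv 2604.06696
+
+A structured routing engine that uses a small 3B-7B model for routing decisions, with a candidate-aware fine-tuning strategy. It validates the idea that "routing itself can also be AI".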
+
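+### 2.2 Production Practice
+
+#### Microsoft Conductor (2026.05)
+
+A recently open-sourced deterministic orchestration tool. Core idea: **workflows are defined in YAML, and routing is deterministic**.
+
+Its positioning, however, is that when a task is **not exploratory** (for example, a code review pipeline), deterministic routing is more reliable than dynamic LLM routing. The key insights:
+- **Exploratory tasks** → LLM orchestration (dynamic)
+- **Deterministic processes** → declarative orchestration (YAML)
+- The two are not mutually exclusive; they are **layered and mixed**
+
+#### AWS dynamic dispatch pattern
+
+Event-driven architecture plus dynamic dispatch: LLM calls become intelligently routed, context-aware events.
+
+#### Azure Agent Orchestration Patterns
+
+Five patterns: sequential, concurrent, group chat, handoff, and Magentic.
+- **Handoff pattern**: after finishing its part, an Agent **decides for itself whom to hand off to**
+- Key point: control transfers from one Agent to another, with no central scheduler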
+
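+### 2.3 Leads from Existing Research Reports
+
+| Source | Key insight |
+|------|---------|
+| shared-consciousness-research.md | The Control Unit is LLM-driven, not rule-based routing; Agent capability profiles are key |
+| v2.6-research-01 | Hermes does not trust Agent completion claims (system-level protection); the Claude Code Lead coordinates proactively |
+| v2.6-research-02 | Event-driven: complete→auto-unlock is the core pattern |
+| architecture-v2.6.md | **"Agents decide, the Daemon executes"**; the Daemon is a courier, not a decision maker |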
+
+---
+
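+## 3. Design Principles
+
+Three core principles distilled from the research:
+
+### P1: Routing decisions live in the Agent layer, not the Daemon layer
+
+The Daemon only "delivers": it reads the blackboard, spawns Agents, and cleans up sessions. **The decision of "who should do this task" is made by the Agents themselves or driven by declarative data on the blackboard.**
+
+### P2: Agents declare their capabilities and intent through the blackboard
+
+Instead of the Daemon maintaining a `capability_map`, **each Agent registers its own capability profile on the blackboard**. The Daemon queries the blackboard to find a matching Agent.
+
+### P3: The executor declares what the next step needs
+
+After finishing a task in the execution stage, the Agent declares "which capability the next step needs" when it submits its output. The Daemon reads that declaration, finds a matching Agent, and spawns it.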
+
+---
+
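+## 4. Design
+
+### 4.1 Core Mechanism: Agent Capability Profiles + Declarative Routing
+
+#### Mechanism 1: Agent Capability Profile (Agent Profile)
+
+Each Agent registers its own capability profile on the blackboard (it is not hard-coded in the Daemon):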
+
+```yaml
+# Stored in the blackboard's agents table or a separate agent_profiles table
+zhangfei-dev:
+  capabilities: [coding, implementation, scripting]
+  can_review: false        # Zhang Fei does not do reviews
+  max_concurrent: 1
+  performance_score: 0.85  # dynamic score based on historical performance
+
+simayi-challenger:
+  capabilities: [review, quality_check, debate]
+  can_review: true         # Sima Yi (司马懿) specializes in reviews
+  max_concurrent: 2
+  performance_score: 0.92
+
+pangtong-fujunshi:
+  capabilities: [planning, coordination, escalation, strategy]
+  can_review: true
+  is_fallback: true        # Pang Tong is the final fallback
+  max_concurrent: 3
+  performance_score: 0.90
+```
+
+**Key point**: capability profiles are declarative and can evolve. Each Agent's SOUL.md/IDENTITY.md already defines its capabilities. On startup, the Daemon reads the Agent configuration and writes it to the blackboard.
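One way the startup step could look is sketched below: raw profile data (as it might come out of the YAML above) is validated and indexed by agent id before being written to the blackboard. `AgentProfile`, `load_profiles`, the in-memory `blackboard` dict, and the assumption that `performance_score` lies in [0, 1] are all illustrative and not taken from the existing codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    agent_id: str
    capabilities: tuple
    can_review: bool = False
    is_fallback: bool = False
    max_concurrent: int = 1
    performance_score: float = 0.0


def load_profiles(raw: dict) -> dict:
    """Validate raw profile data and index it by agent id."""
    profiles = {}
    for agent_id, data in raw.items():
        score = float(data.get("performance_score", 0.0))
        # Assumption: scores are normalised to the range [0, 1].
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{agent_id}: performance_score must be in [0, 1]")
        profiles[agent_id] = AgentProfile(
            agent_id=agent_id,
            capabilities=tuple(data.get("capabilities", ())),
            can_review=bool(data.get("can_review", False)),
            is_fallback=bool(data.get("is_fallback", False)),
            max_concurrent=int(data.get("max_concurrent", 1)),
            performance_score=score,
        )
    return profiles


RAW_PROFILES = {
    "zhangfei-dev": {"capabilities": ["coding", "implementation", "scripting"],
                     "performance_score": 0.85},
    "simayi-challenger": {"capabilities": ["review", "quality_check", "debate"],
                          "can_review": True, "max_concurrent": 2,
                          "performance_score": 0.92},
    "pangtong-fujunshi": {"capabilities": ["planning", "coordination",
                                           "escalation", "strategy"],
                          "can_review": True, "is_fallback": True,
                          "max_concurrent": 3, "performance_score": 0.90},
}

# An in-memory dict stands in for the blackboard table here.
blackboard = {"agent_profiles": load_profiles(RAW_PROFILES)}
reviewers = sorted(p.agent_id for p in blackboard["agent_profiles"].values()
                   if "review" in p.capabilities)
print(reviewers)
```

Note that in the profile above `pangtong-fujunshi` has `can_review: true` but no `review` entry in `capabilities`, so a lookup based only on `capabilities` would not select it as a reviewer.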
+
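+#### Mechanism 2: Declarative Task Lifecycle Transitions
+
+The task's `status` field still drives the state machine, but **which capability each status needs is declared by metadata on the blackboard**, not hard-coded in the Daemon:
+
+```python
+# The task's metadata field stores the lifecycle declaration.
+# It is set at creation time by the creator (the user or Pang Tong) or by a default template.
+TASK_LIFECYCLE = {
+    "pending": {
+        "needs": "execution",      # the pending stage needs the execution capability
+        "capability": "auto",      # inferred from task_type, or declared explicitly
+    },
+    "review": {
+        "needs": "review",         # the review stage needs the review capability
+        "capability": "review",    # always look for an Agent with the review capability
+        "exclude_assignee": True,  # exclude the executor (no self-review)
+    },
+    "failed": {
+        "needs": "escalation",     # a failure needs the escalation capability
+        "capability": "escalation",
+    }
+}
+```
+
+**This is not a template!** It is semantics inherent to the task lifecycle itself. The difference:
+- **Template (v1.0)**: predefines the complete DAG flow, with every node fixed
+- **Declarative transitions (this proposal)**: declares only which capability each status needs; who actually does the work is matched dynamically from the capability profiles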
+
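+#### Mechanism 3: The Executor Declares the Next Step
+
+When it submits its output, an Agent can declare what the next step needs:
+
+```json
+// When the Agent calls POST /api/projects/{pid}/tasks/{id}/status
+{
+  "status": "review",
+  "agent": "zhangfei-dev",
+  "next_capability": "review",      // declares that the next step needs the review capability
+  "handoff_note": "Code implemented; please review quality and security"
+}
+```
+
+The Daemon reads `next_capability`, finds a matching Agent in the capability profiles (excluding the current assignee), and spawns it.
+
+If `next_capability` is not declared, the Daemon infers it from the lifecycle entry `TASK_LIFECYCLE[status]` (its `needs` field; the dispatcher code in 4.2 reads the `capability` key of the same entry).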
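On the API side, accepting the optional declaration could look roughly like the sketch below. `parse_status_update` and `KNOWN_CAPABILITIES` are hypothetical names, and the real handler in `blackboard_routes.py` may validate differently. Since strict JSON has no comments, the `//` lines in the example payload are explanatory only and must not be sent.

```python
import json

# Hypothetical: the set of capabilities currently registered on the blackboard.
KNOWN_CAPABILITIES = {"coding", "review", "escalation", "planning"}


def parse_status_update(body: str) -> dict:
    """Parse a status update and validate the optional next_capability."""
    data = json.loads(body)
    for key in ("status", "agent"):
        if not isinstance(data.get(key), str) or not data[key]:
            raise ValueError(f"missing required field: {key}")
    next_capability = data.get("next_capability")
    if next_capability is not None and next_capability not in KNOWN_CAPABILITIES:
        raise ValueError(f"unknown capability: {next_capability}")
    return {
        "status": data["status"],
        "agent": data["agent"],
        # None means: let the Daemon infer the capability from the lifecycle.
        "next_capability": next_capability,
        "handoff_note": data.get("handoff_note", ""),
    }


update = parse_status_update(
    '{"status": "review", "agent": "zhangfei-dev", "next_capability": "review"}'
)
print(update["next_capability"])
```

Rejecting unknown capabilities at the API boundary keeps a typo in `next_capability` from silently turning into a fallback dispatch later.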
+
+### 4.2 Rewriting the Daemon Routing Logic
+
+```python
+class Dispatcher:
+    """Agent router: declarative routing based on capability profiles."""
+
+    def decide(self, task: Task, action_type: str = "") -> dict:
+        # Level 1: purely mechanical check → the Daemon runs it locally (unchanged)
+        if action_type in self.LOCAL_ACTIONS:
+            return {"level": DispatchLevel.LOCAL}  # other fields elided in this excerpt
+
+        # Level 2: routing based on capability profiles (replaces the old hard-coded assignee)
+        needed_capability = self._resolve_needed_capability(task, action_type)
+        exclude = self._get_exclusions(task, action_type)
+        agent_id = self._find_agent_by_capability(
+            needed_capability,
+            exclude_agents=exclude
+        )
+
+        if agent_id:
+            return {
+                "level": DispatchLevel.FULL_AGENT,
+                "agent_id": agent_id,
+                "reason": f"Matched capability '{needed_capability}' → {agent_id}",
+            }
+
+        # Level 3: no match → fall back to Pang Tong
+        return {
+            "level": DispatchLevel.FULL_AGENT,
+            "agent_id": "pangtong-fujunshi",
+            "reason": "No agent matched capability, fallback to coordinator",
+        }
+
+    def _resolve_needed_capability(self, task: Task, action_type: str) -> str:
+        """Infer which capability the current task stage needs."""
+
+        # 1. Prefer the next_capability declared by the Agent (the handoff_note on the blackboard)
+        if task.next_capability:
+            return task.next_capability
+
+        # 2. Lifecycle requirement for the current status ("auto" falls through to step 3)
+        lifecycle = TASK_LIFECYCLE.get(task.status)
+        if lifecycle and lifecycle["capability"] != "auto":
+            return lifecycle["capability"]
+
+        # 3. Look at the task type (fallback)
+        return self._infer_from_task_type(task.task_type)
+
+    def _get_exclusions(self, task: Task, action_type: str) -> set:
+        """Collect the Agents that must be excluded."""
+        exclude = set()
+        lifecycle = TASK_LIFECYCLE.get(task.status, {})
+
+        # The review stage excludes the executor (no self-review)
+        if lifecycle.get("exclude_assignee") and task.assignee:
+            exclude.add(task.assignee)
+
+        return exclude
+
+    def _find_agent_by_capability(self, capability: str,
+                                   exclude_agents: set | None = None) -> str | None:
+        """Find a matching Agent in the capability profiles."""
+        candidates = []
+        for agent_id, profile in self.agent_profiles.items():
+            if agent_id in (exclude_agents or set()):
+                continue
+            if capability in profile.get("capabilities", []):
+                candidates.append(agent_id)
+
+        if not candidates:
+            return None
+
+        # Several candidates: pick the one with the lowest load
+        if len(candidates) > 1:
+            return min(candidates,
+                       key=lambda a: self.counter._active.get(a, 0))
+
+        return candidates[0]
+```
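The routing rules above can be exercised in isolation with a compact, self-contained sketch. `route`, `PROFILES`, and `FALLBACK` are illustrative names, and the profile data is simplified: `zhangfei-dev` is deliberately given the `review` capability here (unlike the profile in 4.1) so that the exclusion rule has something to exclude.

```python
from dataclasses import dataclass
from typing import Optional

TASK_LIFECYCLE = {
    "review": {"capability": "review", "exclude_assignee": True},
    "failed": {"capability": "escalation"},
}
# Simplified, hypothetical profiles: agent id -> capability list.
PROFILES = {
    "zhangfei-dev": ["coding", "review"],
    "simayi-challenger": ["review", "quality_check"],
    "pangtong-fujunshi": ["planning", "escalation"],
}
FALLBACK = "pangtong-fujunshi"


@dataclass
class Task:
    status: str
    assignee: Optional[str] = None
    next_capability: Optional[str] = None


def route(task: Task, active: Optional[dict] = None) -> str:
    """Pick an agent by capability, excluding the executor where required."""
    active = active or {}
    lifecycle = TASK_LIFECYCLE.get(task.status, {})
    # An explicit declaration by the agent wins over the lifecycle default.
    capability = task.next_capability or lifecycle.get("capability")
    exclude = set()
    if lifecycle.get("exclude_assignee") and task.assignee:
        exclude.add(task.assignee)
    candidates = [agent for agent, caps in PROFILES.items()
                  if capability in caps and agent not in exclude]
    if not candidates:
        return FALLBACK
    # Lowest current load wins; ties keep the registration order.
    return min(candidates, key=lambda agent: active.get(agent, 0))


print(route(Task(status="review", assignee="zhangfei-dev")))
print(route(Task(status="failed", assignee="zhangfei-dev")))
```

With the executor excluded, the review task lands on `simayi-challenger` even though `zhangfei-dev` also advertises `review`, which is the behaviour the E2E test expected.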
+
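+### 4.3 Semantic Change of the assignee Field
+
+Today: `assignee` is the "owner" of the whole task; once set, it persists through the entire lifecycle.
+
+**Change to**: `assignee` is the "executor of the current stage" and is updated on every status transition.
+
+```python
+# Update assignee automatically on status transition
+def transition_status(task_id, new_status, agent):
+    # ... (elided; assumed to define task, lifecycle, and new_agent_id)
+    if lifecycle.get("exclude_assignee"):
+        # review stage: assignee becomes the reviewer
+        old_assignee = task.assignee  # keep the executor's identity
+        task.previous_assignee = old_assignee  # new field
+        task.assignee = new_agent_id  # set to the reviewer
+```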
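A runnable version of this transition might look like the sketch below. It assumes the reviewer has already been chosen by the dispatcher and is passed in as `next_assignee`; that parameter, the in-memory `Task` class, and the self-review guard are illustrative additions rather than part of the proposal.

```python
from dataclasses import dataclass
from typing import Optional

TASK_LIFECYCLE = {
    "review": {"capability": "review", "exclude_assignee": True},
}


@dataclass
class Task:
    status: str
    assignee: Optional[str] = None
    previous_assignee: Optional[str] = None


def transition_status(task: Task, new_status: str,
                      next_assignee: Optional[str] = None) -> Task:
    """Move the task to new_status and rotate the assignee when required."""
    lifecycle = TASK_LIFECYCLE.get(new_status, {})
    if lifecycle.get("exclude_assignee") and next_assignee is not None:
        if next_assignee == task.assignee:
            raise ValueError("the executor cannot review its own task")
        # Keep the executor's identity before handing over to the reviewer.
        task.previous_assignee = task.assignee
        task.assignee = next_assignee
    task.status = new_status
    return task


task = Task(status="working", assignee="zhangfei-dev")
transition_status(task, "review", next_assignee="simayi-challenger")
print(task.assignee, task.previous_assignee)
```

One ordering detail worth checking in the real implementation: once `assignee` points at the reviewer, any later exclusion check that reads `task.assignee` no longer sees the executor, so such checks would need to consult `previous_assignee` instead.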
+
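+### 4.4 Alignment with the v2.6 Architecture
+
+| v2.6 principle | How this proposal implements it |
+|-----------|----------|
+| Agents decide, the Daemon executes | Routing is based on Agent capability profiles (capabilities the Agents declare); the Daemon only matches |
+| The Daemon is a courier, not a decision maker | The Daemon makes no value judgment about "who should do what"; it only matches capabilities |
+| Orchestration means AI agents pick up work themselves | Agents declare their own capabilities and the capability the next step needs |
+| The blackboard is the single source of truth | Capability profiles and task lifecycle declarations all live on the blackboard |
+
+### 4.5 Essential Differences from the Template Mechanism
+
+| Dimension | v1.0 template | Current capability_map | This proposal |
+|------|----------|--------------------|-------|
+| Where routing is defined | Template YAML | Daemon config YAML | Blackboard (Agent capability profiles) |
+| Who defines capabilities | User/developer | Developer | **The Agent itself** (SOUL.md → blackboard) |
+| Who handles each stage | Fixed by the template | Hard-coded in config | Declarative matching + exclusion rules |
+| Extensibility | Add a template | Change code | Register an Agent |
+| Degree of AI-nativeness | Low | Low | **Medium-high** (Agent self-declaration) |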
+
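+### 4.6 Evolution Roadmap
+
+This proposal is a **pragmatic first step**. It is not the final AI-native form but a **key stepping stone** between "hard-coded Daemon" and "Agents picking up work themselves":
+
+```
+Now: Daemon hard-coded if-else
+  ↓ this proposal
+Step 1: Agent capability profiles + declarative routing (the Daemon does capability matching)
+  ↓ future
+Step 2: Agents pick up work themselves (the Daemon only broadcasts; Agents claim on their own)
+  ↓ further out
+Step 3: bMAS Control Unit (LLM-driven dynamic selection)
+```
+
+Migrating from step 1 to step 2 is cheap: the capability profiles and the declarative routing mechanism stay the same; "the Daemon finds a match → dispatches" simply becomes "the Daemon broadcasts the requirement → Agents claim it themselves". These are two ways of consuming the same data structure.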
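To illustrate why the same data can serve both steps, here is a rough sketch of the step-2 consumption style: the Daemon only publishes a requirement, each Agent checks it against its own profile, and the blackboard grants the first claim. `Blackboard`, `wants`, and `broadcast` are hypothetical, and the sequential loop merely simulates Agents reacting one after another.

```python
import threading
from typing import Optional


class Blackboard:
    """In-memory stand-in that grants each task to the first claimer."""

    def __init__(self) -> None:
        self._claims: dict = {}
        self._lock = threading.Lock()

    def claim(self, task_id: str, agent_id: str) -> bool:
        # The lock makes check-and-set atomic, so only one claim can win.
        with self._lock:
            if task_id in self._claims:
                return False
            self._claims[task_id] = agent_id
            return True


def wants(agent_id: str, capabilities: list, need: dict) -> bool:
    """Agent-side decision: do I have the capability, and am I allowed?"""
    return need["capability"] in capabilities and agent_id not in need["exclude"]


def broadcast(board: Blackboard, need: dict, agents: dict) -> Optional[str]:
    """Publish a requirement and return the agent whose claim succeeded."""
    for agent_id, capabilities in agents.items():
        if wants(agent_id, capabilities, need) and board.claim(need["task_id"], agent_id):
            return agent_id
    return None


AGENTS = {
    "zhangfei-dev": ["coding", "review"],
    "simayi-challenger": ["review", "quality_check"],
}
need = {"task_id": "T-1", "capability": "review", "exclude": {"zhangfei-dev"}}
board = Blackboard()
print(broadcast(board, need, AGENTS))
```

The requirement record (`capability` plus `exclude`) carries the same information that step 1 computes inside the Daemon; only the party that acts on it changes.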
+
+---
+
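+## 5. Concrete Change List
+
+### 5.1 Data Model Changes
+
+| Change | Description |
+|------|------|
+| Add an `agent_profiles` table (or extend the agents table) | Stores Agent capability profiles |
+| Add a `next_capability` field to the tasks table | The capability the Agent declares for the next step |
+| Add a `previous_assignee` field to the tasks table | Preserves the previous stage's executor across status transitions |
+| `assignee` semantic change | From "task owner" to "executor of the current stage" |
+
+### 5.2 Code Changes
+
+| File | Change |
+|------|------|
+| `dispatcher.py` | Rewrite `decide()`: capability matching replaces the assignee lookup |
+| `dispatcher.py` | Add `_resolve_needed_capability()`, `_find_agent_by_capability()`, `_get_exclusions()` |
+| `config/default.yaml` | Replace `capability_map` with `agent_profiles` (each Agent declares its own capability list) |
+| `blackboard_routes.py` | The status API accepts a `next_capability` parameter |
+| `ticker.py` | `_dispatch_reviews()` uses the new dispatcher routing |
+| `blackboard/db.py` | Add the agent_profiles table / fields |
+
+### 5.3 What Stays the Same
+
+| Unchanged | Reason |
+|------|------|
+| State machine (pending→claimed→working→review→done) | The status transition semantics are correct |
+| Frontend Dashboard | The frontend is unaware of routing logic |
+| Agent prompt template (S2) | Agents still follow the 4-step flow |
+| Spawner logic | The spawn mechanism is unchanged |
+| API contract (S1) | Transparent to Agents |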
+
+---
+
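+## 6. Benchmarking Against Existing Good Practice
+
+| Practice | Counterpart in this proposal |
+|------|----------|
+| bMAS Control Unit (LLM-driven) | This proposal uses capability profiles for structured matching (lower cost, more deterministic) and can later evolve to LLM-driven selection |
+| Self-Selection pattern (arXiv 2510.01285) | The direction this proposal evolves toward: Agents claim work themselves rather than being assigned |
+| Handoff pattern (Azure) | An Agent declaring `next_capability` is a handoff |
+| Declarative orchestration (Conductor) | The `TASK_LIFECYCLE` lifecycle declaration is declarative |
+| Capability profiles (OpenClaw RFC #35203) | `agent_profiles` implements capability profiles directly |
+| Hallucination gating (Hermes) | Unchanged; output verification logic is independent of routing |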
+
+---
+
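+## 7. Open Questions
+
+1. **Source of `agent_profiles` data**: read from config/default.yaml (written to the blackboard at startup), or parsed dynamically from each Agent's SOUL.md?
+2. **Where `TASK_LIFECYCLE` is defined**: hard-coded in dispatcher.py, or also moved into config?
+3. **Impact of the `assignee` semantic change**: does the frontend Dashboard rely on the assumption that assignee = executor?
+4. **Should we go straight to "Agents pick up work themselves"** (step 2), or implement this proposal (step 1) first?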
+
+---
+
+## 8. References
+
+- bMAS: arXiv 2507.01701, Blackboard LLM Multi-Agent System
+- Self-Selection: arXiv 2510.01285, Agent self-selection pattern
+- MasRouter: arXiv 2601.04861, Confidence-Aware Routing
+- Microsoft Conductor: github.com/microsoft/conductor, deterministic orchestration
+- Azure Agent Patterns: learn.microsoft.com, Handoff pattern
+- OpenClaw RFC #35203: Capability Profiling + Shared Blackboard
+- v2.6 research report: docs/research/shared-consciousness-research.md
+- v2.6 architecture design: docs/design/architecture-v2.6.md