Compare commits

9 Commits

Author SHA1 Message Date
cfdaily 178818bb15 fix: correct the compact file path reference in §07 (24→15)
CI / lint (pull_request) Successful in 7s
CI / test (pull_request) Successful in 8s
CI / notify-on-failure (pull_request) Successful in 1s
2026-06-13 10:14:45 +08:00
cfdaily eccb4d2723 docs: renumber design documents (20→14, 24→15) + update status labels on completed documents
CI / lint (pull_request) Successful in 7s
CI / test (pull_request) Successful in 9s
CI / notify-on-failure (pull_request) Successful in 0s
2026-06-13 10:12:39 +08:00
pangtong-fujunshi 9e2145171a Merge PR #57 2026-06-13 01:36:24 +00:00
cfdaily 67cad2dd96 fix: add the missing agent_error entry to _REASON_MAP (G2)
CI / lint (pull_request) Successful in 6s
CI / test (pull_request) Successful in 8s
CI / notify-on-failure (pull_request) Successful in 0s
The spawner can emit the agent_error reason; with no mapping for it, the lookup previously fell through to _default and displayed "Unknown reason".
2026-06-13 09:35:15 +08:00
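The fallback behaviour this commit addresses can be illustrated with a minimal sketch. It assumes a trimmed-down mapping with English labels and a hypothetical helper `reason_label`; the real `_REASON_MAP` in mail_notify.py stores `(label, detail_fn)` tuples and has many more entries.

```python
# Hypothetical, trimmed-down mappings: reason code -> human-readable label.
REASON_MAP_BEFORE = {
    "auth_failed": "Agent authentication failed",
    "_default": "Unknown reason",
}
# After the fix, agent_error has its own entry.
REASON_MAP_AFTER = {**REASON_MAP_BEFORE, "agent_error": "Agent internal error"}


def reason_label(reason_map: dict, reason: str) -> str:
    """Look up a label, falling back to the "_default" entry."""
    return reason_map.get(reason, reason_map["_default"])


if __name__ == "__main__":
    print(reason_label(REASON_MAP_BEFORE, "agent_error"))
    print(reason_label(REASON_MAP_AFTER, "agent_error"))
```

Any reason code without its own entry silently takes the `_default` label, which is why a missing key shows up only as a vague message rather than an error.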
pangtong-fujunshi 79da0bd07e Merge PR #56 2026-06-13 01:34:39 +00:00
cfdaily a116f7e6c0 fix: comment typo must_hives → must_haves
CI / lint (pull_request) Successful in 7s
CI / test (pull_request) Successful in 8s
CI / notify-on-failure (pull_request) Successful in 0s
2026-06-13 09:33:59 +08:00
cfdaily 7fb4d988ec fix: lint fixes + api_error test update
CI / lint (pull_request) Successful in 6s
CI / test (pull_request) Successful in 8s
CI / notify-on-failure (pull_request) Successful in 0s
- mail_notify: fixed backslash in f-string, fixed overlong lines, removed unused import
- test_classify_outcome: api_error should_retry changed to True
2026-06-13 09:29:52 +08:00
cfdaily f4dd9ff78d feat(daemon): Mail failure notification v2.0: api_error retry + enhanced notifications
CI / lint (pull_request) Failing after 7s
CI / test (pull_request) Has been skipped
CI / notify-on-failure (pull_request) Successful in 1s
P1: api_error (rate_limit/500/503) is now a recoverable retry (should_retry=True, 60s cooldown)
P2: dynamic notification template (human-readable reason + detail info + retry status + AI Native knowledge base)

Design doc: §20.7 (20-task-type-architecture.md)
2026-06-13 09:27:17 +08:00
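The P1 change can be sketched roughly as follows. This is a simplified illustration, not the project's `_classify_outcome`: the function name `classify_stderr` and the `"unknown"` fallback outcome are assumptions, while the keywords and the returned fields mirror the spawner hunk in this diff.

```python
from typing import Dict, Union

# Keywords matched case-insensitively against stderr.
API_ERROR_KEYWORDS = ("rate_limit", "500", "503", "api error")


def classify_stderr(stderr: str) -> Dict[str, Union[str, bool, int]]:
    """Hypothetical helper: map stderr text to an outcome dict."""
    stderr_lower = stderr.lower()
    if any(kw in stderr_lower for kw in API_ERROR_KEYWORDS):
        # v2.0 behaviour: transient API errors are retried after a cooldown
        # instead of failing immediately.
        return {
            "outcome": "api_error",
            "should_retry": True,
            "retry_field": "retry_count",
            "cooldown_seconds": 60,
        }
    # Assumed fallback for this sketch only.
    return {"outcome": "unknown", "should_retry": False}


if __name__ == "__main__":
    print(classify_stderr("rate_limit exceeded"))
```

Note that plain substring matching on "500" or "503" will also match unrelated numbers in stderr, so a stricter pattern may be worth considering.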
pangtong-fujunshi 6520e78c0b Merge PR #55 2026-06-13 01:23:33 +00:00
13 changed files with 141 additions and 38 deletions
+1 -1
@@ -6,7 +6,7 @@
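**Based on**: PRD-v3.0 §4 four-phase architecture + architecture-v3.0.md
**Author**: 庞统 (deputy strategist) 🐦
**Date**: 2026-05-29
-**Status**: Implementation complete, pending E2E verification
+**Status**: ✅ Completed (E2E verification passed)
**Reviewer**: 司马懿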
---
+1 -1
@@ -2,7 +2,7 @@
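**Date**: 2026-05-30
**Author**: 庞统
-**Status**: Revised v1.1 (per 司马懿's review comments of 2026-05-30)
+**Status**: ✅ Completed (spawner/ticker/dispatcher all use use_main_session=True)
**Prerequisite**: `01-four-phase-loop.md` (E2E verification of the four-phase loop exposed a session explosion problem)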
---
@@ -3,7 +3,7 @@
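> Version: v1.1
> Date: 2026-05-30
> Author: 庞统 (deputy strategist)
-> Status: v1.1 revision (司马懿's review comments incorporated)
+> Status: ✅ Completed (@mention + mention_queue implemented)
> Prerequisites: #02 Main Session + Delegation, #03 Prompt evolution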
---
+1 -1
@@ -3,7 +3,7 @@
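> Version: v1.2
> Date: 2026-06-03
> Author: 庞统 (deputy strategist)
-> Status: Pending review (v1.2)
+> Status: ✅ Completed (_startup_recover: 7 methods implemented)
> Prerequisite: spawner-monitor-design.md §5 A0 (Agent crash recovery)
> Changes: v1.2 makes two key improvements: (1) working→pending keeps current_agent so the same agent picks the task back up; (2) reviewing is restored precisely to its preceding state instead of being forced to done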
+4 -4
@@ -1,6 +1,6 @@
# #07 Spawner Acquire-First Design
-> Status: #07.1 implemented ✅ | #07.2 implemented ✅ | #07.3 in design
+> Status: ✅ Completed (#07.1-#07.2 implemented)
> Author: 庞统
> Date: 2026-06-01
> Reviewer: 司马懿
@@ -233,9 +233,9 @@ def _revive_session(agent_id: str) -> bool:
pass
```
-### 4.5 O5: compact detection (§24 rotation-only v3)
+### 4.5 O5: compact detection (§15 rotation-only v3)
-§24 design doc: `docs/design/24-compact-detection-fix.md`
+§15 design doc: `docs/design/15-compact-detection-fix.md`
**Detection method**: read the last 2MB of the gateway log and filter `[compaction] rotated active transcript` events by sessionKey.
If the most recent rotation event falls within a 120s window → treat a compact cycle as in progress (it may still be in post-compact retry).
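The detection method above could look roughly like the sketch below. It is not the project's `_check_compact_in_progress_gateway`: the function name, the assumption that every log line starts with an ISO-8601 timestamp, and the assumption that the session key appears verbatim in the line are all hypothetical, since the gateway log format is not shown here.

```python
import os
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

ROTATION_MARKER = "[compaction] rotated active transcript"


def compact_in_progress(log_path: Path, session_key: str,
                        now: Optional[datetime] = None,
                        window_seconds: int = 120,
                        tail_bytes: int = 2 * 1024 * 1024) -> bool:
    """Return True if the newest rotation event for session_key is recent.

    `now` must be timezone-aware; it defaults to the current UTC time.
    """
    now = now or datetime.now(timezone.utc)
    # Read only the tail of the log (2MB by default).
    with open(log_path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        size = fh.tell()
        fh.seek(max(0, size - tail_bytes))
        tail = fh.read().decode("utf-8", errors="replace")

    latest = None
    for line in tail.splitlines():
        if ROTATION_MARKER not in line or session_key not in line:
            continue
        try:
            # Assumed format: ISO-8601 timestamp as the first token.
            ts = datetime.fromisoformat(line.split(" ", 1)[0])
        except ValueError:
            continue  # e.g. a line cut in half at the tail boundary
        if ts.tzinfo is None:
            ts = ts.replace(tzinfo=timezone.utc)
        if latest is None or ts > latest:
            latest = ts

    if latest is None:
        return False
    return 0 <= (now - latest).total_seconds() <= window_seconds
```

A caller would pass a session key such as `agent:{agent_id}:main`. Reading only the tail keeps the check cheap on large logs, at the cost of missing events older than the last 2MB.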
@@ -243,7 +243,7 @@ def _revive_session(agent_id: str) -> bool:
The old method `_check_recent_compaction_jsonl` (which scans the session jsonl for `type=compaction` events) is kept as a fallback.
```python
-# §24 v3: compact detection prefers gateway log rotation events
+# §15 v3: compact detection prefers gateway log rotation events
if result["status"] not in ("idle", "unknown", None):
session_key = f"agent:{agent_id}:main"
result["recent_compact"] = AgentSpawner._check_compact_in_progress_gateway(
+1 -1
@@ -3,7 +3,7 @@
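> Version: v1.1
> Date: 2026-06-03
> Author: 庞统 (deputy strategist)
-> Status: Pending review (v1.1)
+> Status: ✅ Completed (rebuttal on_complete + goal gate implemented)
> Prerequisites: #04 Blackboard collaboration (@mention) + #08 Classify Outcome
> Related: T4 review system improvements
> Changes: v1.1 incorporates 司马懿's review feedback: verdict reads the reviews table + rebuttal mention spawn carries an on_complete callback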
@@ -3,7 +3,7 @@
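> Version: v1.1
> Date: 2026-06-03
> Author: 庞统 (deputy strategist)
-> Status: Pending final review (v1.1)
+> Status: ✅ Completed (SSE + TaskModal auto-refresh implemented)
> Prerequisite: #04 Blackboard collaboration (@mention + comment)
> Related: architecture-v3.0.md T3
> Changes: v1.1 incorporates 司马懿's review feedback: the file that triggers checkpoint SSE is corrected to checkpoint_routes.py; SSE payloads uniformly include project_id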
+1 -1
@@ -1,6 +1,6 @@
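# Three Kingdoms Team Toolchain and Development Workflow Design
-> **Status**: v3.3: #19 four-layer context refactor merged + CI fixes + A13 revisions
+> **Status**: ✅ Completed (E2E verification passed, all 8 steps PASS)
> **Author**: 庞统 (deputy strategist) 🐦
> **Reviewer**: 司马懿 (仲达) 🗡️
> **Date**: 2026-06-06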
@@ -4,6 +4,8 @@ created: 2026-06-10
version: v3.0
---
+> Status: ✅ Completed (Step 1-5 all merged, 394 passed)
# §1 Current State Analysis (v3.0 update note: §1-§13 kept as-is, §14-§18 added, §3/§5/§7 updated)
# §1 Current State Analysis
@@ -950,7 +952,7 @@ handler.post_complete(task_id, agent_id, outcome, db_path)
---
-## §20. Mail Failure Notification Mechanism
+## §14. Mail Failure Notification Mechanism
### 20.1 Background
@@ -1,6 +1,6 @@
-# §24: Compact Detection Fix
+# §15: Compact Detection Fix
-> Status: **v5 implemented** (gateway log + jsonl pairing)
+> Status: ✅ Completed (gateway log + jsonl pairing)
> Author: 庞统
> Date: 2026-06-11 (v4), 2026-06-13 (v5)
> Framework: based on §07 Spawner Acquire-First
+117 -21
@@ -1,4 +1,4 @@
-"""Mail failure notification: notify the sender as system."""
+"""Mail failure notification v2.0: notify the sender as system (AI Native)."""
from __future__ import annotations
@@ -6,7 +6,7 @@ import json
import logging
from datetime import datetime
from pathlib import Path
-from typing import Optional
+from typing import Dict, Optional
from src.blackboard.models import Task
from src.blackboard.operations import Blackboard
@@ -15,21 +15,121 @@ from src.config.agents import AGENT_IDS
logger = logging.getLogger(__name__)
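-# Mail notification body template (one template covering every possible failure reason and suggestion)
-_NOTIFY_TEMPLATE = """Your mail could not be delivered.
+# ── Human-readable reason translation + detail extraction ──────────────────────────────
-📧 Original mail: 「{title}」
-👤 Recipient: {to_agent}
-Failure reason: {reason}
+def _extract_stderr(detail: dict, max_len: int = 200) -> str:
+    """Extract stderr_preview from detail, truncated to max_len characters."""
+    preview = (detail or {}).get("stderr_preview", "")
+    if preview and len(preview) > max_len:
+        preview = preview[:max_len] + "..."
+    return preview
-Common failure reasons and suggested handling:
-no_reply_found: the recipient did not reply; resend the mail or reach them through a blackboard task
-auth_failed: recipient authentication failed; check the Agent configuration and contact 姜维 (jiangwei-infra) to investigate
-crash_limit: the recipient crashed repeatedly while processing (system error); retry later
-task_timeout: processing timed out; resend or contact them another way
-Other reasons: contact the deputy strategist (pangtong-fujunshi) to investigate
-Automated system notification"""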
+def _fmt_retry_info(reason: str, detail: dict) -> str:
+    """Format a description of the retry status."""
+    _NO_RETRY_REASONS = {
+        "no_reply_found", "auth_failed", "agent_error",
+        "agent_failed", "compact_failed",
+    }
+    if reason in _NO_RETRY_REASONS:
+        reason_human = _REASON_MAP.get(reason, _REASON_MAP.get("_default", ("Unknown reason", lambda d: "")))[0]
+        return f"Cannot retry ({reason_human})"
+    count = (detail or {}).get("count", 0)
+    fallback_count = (detail or {}).get("fallback_count", 0)
+    if count > 0:
+        return f"Automatically retried {count} time(s)"
+    if fallback_count > 0:
+        return f"Automatically retried {fallback_count} time(s) (fallback)"
+    return "The system attempted recovery but the task still failed"
+# reason_raw → (reason_human_readable, detail_format_fn)
+_REASON_MAP: Dict[str, tuple] = {
+    "no_reply_found": ("Recipient did not reply (the Agent could not recognize or process this mail)", lambda d: ""),
+    "crashed": ("Process crashed while processing", lambda d: f"stderr: {_extract_stderr(d)}" if _extract_stderr(d) else "no stderr output"),
+    "max_crash_count": ("Consecutive crash limit reached", lambda d: f"crashed {d.get('count', '?')} time(s)"),
+    "max_retries": ("Continuation retries exhausted (already retried automatically)", lambda d: f"retried {d.get('count', '?')} time(s)"),
+    "max_api_retry_count": ("Consecutive API failure limit reached", lambda d: f"API retried {d.get('count', '?')} time(s)"),
+    "max_monitor_timeouts": (
+        "Processing timeout limit reached",
+        lambda d: f"timed out {d.get('count', '?')} time(s), "
+        f"about {d.get('elapsed_seconds', 0) // 60} minute(s) in total"),
+    "gateway_timeout": ("Agent execution timed out (continuation retry already attempted)", lambda d: ""),
+    "session_stuck": ("Session hung (lock PID is dead)", lambda d: f"hung {d.get('stuck_count', '?')} time(s)"),
+    "revive_failed": ("Session revival failed", lambda d: f"hung {d.get('stuck_count', '?')} time(s)"),
+    "auth_failed": ("Agent authentication failed (configuration issue)", lambda d: f"stderr: {_extract_stderr(d)}" if _extract_stderr(d) else ""),
+    "fallback_exhausted": (
+        "Both the primary and fallback models failed",
+        lambda d: f"fallback {d.get('fallback_count', '?')} time(s), "
+        f"reason: {d.get('fallback_reason', 'unknown')}"),
+    "agent_error": (
+        "Agent internal error",
+        lambda d: f"stderr: {_extract_stderr(d)}" if _extract_stderr(d) else ""),
+    "agent_failed": ("Recipient explicitly marked the task as failed", lambda d: ""),
+    "compact_failed": ("Context compaction failed", lambda d: f"stderr: {_extract_stderr(d)}" if _extract_stderr(d) else ""),
+    "compact_hanging": ("Context compaction did not finish for a long time", lambda d: ""),
+    "compact_interrupted": ("Context compaction was interrupted (already retried automatically)", lambda d: ""),
+    "gateway_unreachable": (
+        "Gateway unreachable (already retried automatically)",
+        lambda d: f"stderr: {_extract_stderr(d)}"
+        if _extract_stderr(d) else ""),
+    "lock_conflict": ("Session lock conflict (already retried automatically)", lambda d: ""),
+    "max_retry_count": ("Retries exhausted", lambda d: f"retried {d.get('count', '?')} time(s)"),
+    "max_lock_retry_count": ("Lock-conflict retries exhausted", lambda d: f"retried {d.get('count', '?')} time(s)"),
+    "max_connect_retry_count": ("Connection retries exhausted", lambda d: f"retried {d.get('count', '?')} time(s)"),
+    "_default": ("Unknown reason", lambda d: f"stderr: {_extract_stderr(d)}" if _extract_stderr(d) else ""),
+}
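+# Common failure reason reference (AI Native: give the receiving AI a knowledge base so it can judge for itself)
+_REASON_REFERENCE = """Common failure reason reference:
+no_reply_found: recipient did not reply (the Agent could not recognize or process this mail)
+crashed / max_crash_count: recipient process crashed while processing (already retried automatically 3 times)
+max_retries: continuation retries exhausted (already retried automatically 3 times, about 34 minutes in total)
+max_api_retry_count: consecutive API failure limit reached (rate_limit/500/503)
+max_monitor_timeouts: processing timeout limit reached (about 31.5 minutes in total)
+gateway_timeout: Agent execution timed out (continuation retry already attempted)
+session_stuck: Agent session hung (lock PID is dead, revive failed)
+revive_failed: recovery failed after the session hung
+auth_failed: Agent authentication failed (configuration issue)
+fallback_exhausted: both the primary and fallback models failed
+agent_failed: recipient explicitly marked the task as failed
+compact_failed: context compaction failed
+compact_hanging: context compaction did not finish for a long time (waited more than 31.5 minutes)
+compact_interrupted: context compaction was interrupted (already retried automatically 3 times)
+gateway_unreachable: Gateway unreachable (already retried automatically 3 times)
+lock_conflict: session lock conflict (already retried automatically 3 times)
+Other: check the system logs"""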
+def _build_notify_text(title: str, to_agent: str, reason: str,
+                       detail: Optional[dict] = None) -> str:
+    """Build the notification body (v2.0 AI Native)."""
+    reason_human, detail_fn = _REASON_MAP.get(reason, _REASON_MAP["_default"])
+    detail_info = detail_fn(detail or {})
+    retry_info = _fmt_retry_info(reason, detail or {})
+    lines = [
+        "Mail delivery failure notification",
+        "",
+        f"📧 Original mail: 「{title}」",
+        f"👤 Recipient: {to_agent}",
+        f"❌ Failure reason: {reason_human} ({reason})",
+        f"📊 Retry status: {retry_info}",
+    ]
+    if detail_info:
+        lines.append("📋 Context:")
+        lines.append(f"  {detail_info}")
+    lines.append("")
+    lines.append(_REASON_REFERENCE)
+    lines.append("")
+    lines.append("-- Automated system notification")
+    return "\n".join(lines)
def _is_mail_project(db_path: Path) -> bool:
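@@ -43,7 +143,7 @@ def notify_mail_failed(db_path: Path, original_mail_id: str,
     """After a mail fails, send the sender a notification mail as system.
     Creates the Task directly through Blackboard rather than going through the HTTP API.
-    Recursion guard: skip if the original mail's must_hives.system_notify is true.
+    Recursion guard: skip if the original mail's must_haves.system_notify is true.
     If the sender is not a valid Agent (system), notify 庞统 to handle it instead, avoiding a broadcast storm.
     """
     try: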
@@ -83,12 +183,8 @@ def notify_mail_failed(db_path: Path, original_mail_id: str,
original_mail_id, from_agent)
target_agent = "pangtong-fujunshi"
-        # Build the notification body
-        text = _NOTIFY_TEMPLATE.format(
-            title=title,
-            to_agent=to_agent,
-            reason=reason,
-        )
+        # Build the notification body (v2.0 AI Native)
+        text = _build_notify_text(title, to_agent, reason, detail)
         # Create the notification mail Task
         notify_id = f"mail-{int(datetime.now().timestamp() * 1000)}"
+4 -1
@@ -845,6 +845,8 @@ curl -X POST http://{api_host}:{api_port}/api/projects/{project_id}/tasks/{task_
cls.get("retry_field", "retry_count")
)
elif outcome == "api_error":
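+    # A9: [DEPRECATED] api_error now returns should_retry=True and takes the continuation-retry path.
+    # In theory this branch is no longer reached; it is kept as a safety fallback.
     # A9: 429/API error → release counter (on_complete) + push back to pending + cooldown
     # Capped: once api_retry_count reaches max_retries, the task is marked failed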
await self._do_on_complete_async(on_complete, agent_id, outcome)
@@ -1842,7 +1844,8 @@ curl -X POST http://{api_host}:{api_port}/api/projects/{project_id}/tasks/{task_
"retry_field": "retry_count", "cooldown_seconds": 60}
if any(kw in stderr_lower for kw in [
"rate_limit", "500", "503", "api error"]):
-    return {"outcome": "api_error", "should_retry": False}
+    return {"outcome": "api_error", "should_retry": True,
+            "retry_field": "retry_count", "cooldown_seconds": 60}
if any(kw in stderr_lower for kw in [
"compaction-diag", "context-overflow"]):
return {"outcome": "compact_failed", "should_retry": False}
+4 -2
@@ -165,14 +165,16 @@ class TestClassifyErrorApi:
1, {"status": "error"}, "rate_limit exceeded", None
)
assert result["outcome"] == "api_error"
-assert result["should_retry"] is False
+assert result["should_retry"] is True
+assert result["cooldown_seconds"] == 60
def test_stderr_500(self):
result = Spawner._classify_outcome(
1, {"status": "error"}, "HTTP 500 Internal Server Error", None
)
assert result["outcome"] == "api_error"
-assert result["should_retry"] is False
+assert result["should_retry"] is True
+assert result["cooldown_seconds"] == 60
class TestClassifyErrorCompact: