Tool Pairing Integrity: The Most Insidious Bug in ReAct Agents

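Column Info

Column: "Building a Cross-Platform AI Assistant from Zero to One: A Hands-On Guide to WeClaw"

This article is Part 7 of Module 8. It takes a close look at diagnosing and fixing tool_call/tool_result pairing problems.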

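Author and Project

About the author: WENG YONGGANG (翁勇刚), lead of the 新概念龙虾-WeClaw development team, a group of practitioners focused on cross-platform AI applications. Philosophy: "No matter how complex the technology, it can be explained clearly with code."

Project repository: https://github.com/wyg5208/weclaw.git
Official site: https://weclaw.link
Author's CSDN blog: https://blog.csdn.net/yweng18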

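Abstract

Overview: This article starts from a strange symptom, "the same tool_call_id reported as missing its result across 7 ReAct steps," and peels back the layers to the root cause: message truncation fires while the ReAct loop is in an intermediate state, which separates the assistant's tool_calls from their tool_results. It then compares two repair strategies (placeholder messages vs. stripping) and finishes with atomic batch writes and a unified stub strategy.

Background: LLM APIs require every tool_call in an assistant message to have a matching tool_result. When context truncation breaks that pairing, the request either fails or the model repeats the tool call.

Core question: In a ReAct loop, truncation can fire between assistant(tool_calls) and tool(tool_result). How do we keep the pairs intact?

Solution: atomic batch writes + orphan stripping + a unified stub strategy

Key results:

Orphan messages are eliminated
The ReAct loop no longer emits "missing result" warnings
The compression and truncation paths use a consistent stub strategy

Intended readers: ReAct agent developers, especially teams running into "tool call anomalies"

Reading time: about 12 minutes

Keywords: ReAct, tool pairing, orphan messages, atomic writes, message truncation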

1. The Strange Symptom: One Tool Call "Forgotten" 7 Times

1.1 Anomalies in the Log

[Agent] Step 1: calling search_web, read_file...
[Validator] WARNING: tool_call 'call_abc' missing result
[Validator] FIX: added placeholder for 'call_abc'

[Agent] Step 2: calling analyze_data...
[Validator] WARNING: tool_call 'call_abc' missing result
[Validator] FIX: added placeholder for 'call_abc'

[Agent] Step 3: calling write_report...
[Validator] WARNING: tool_call 'call_abc' missing result
[Validator] FIX: added placeholder for 'call_abc'

... (repeats through Step 7)

Question: the tool_call call_abc clearly had a matching tool_result in Step 1, so why does every later step report it as "missing result"?

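1.2 Stranger Still: Placeholder Messages Pile Up

Step 1: 1 placeholder message added
Step 2: 2 placeholder messages added (the previous one + a new one)
Step 3: 3 placeholder messages added
...
Step 7: 7 placeholder messages added

The message list keeps growing, yet the problem is never fixed!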

2. Root Cause: Truncation Fires in the Middle of a ReAct Step

2.1 Message Write Order in the ReAct Loop

Message sequence for Step 1:
  [msg_N]   assistant: tool_calls: [call_abc(search_web), call_def(read_file)]
  [msg_N+1] tool: call_abc → "search results..."
  [msg_N+2] tool: call_def → "file contents..."

At the start of Step 2:
  [msg_N]   assistant: tool_calls: [call_abc, call_def]
  [msg_N+1] tool: call_abc → "search results..."
  [msg_N+2] tool: call_def → "file contents..."
  [msg_N+3] assistant: "Analysis results follow..." ← reply from Step 1
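The pairing rule behind these sequences can be checked mechanically. The following is a minimal sketch, assuming OpenAI-style message dicts; the helper name find_missing_results and the sample payloads are illustrative and not taken from the WeClaw code base.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def find_missing_results(messages: List[Message]) -> List[str]:
    """Return the ids of tool_calls that no tool message answers, in call order."""
    answered = {
        m["tool_call_id"]
        for m in messages
        if m.get("role") == "tool" and "tool_call_id" in m
    }
    missing = []
    for m in messages:
        if m.get("role") == "assistant":
            for tc in m.get("tool_calls") or []:
                if tc["id"] not in answered:
                    missing.append(tc["id"])
    return missing


# The Step 1 sequence from above, written as message dicts (sample data).
step1 = [
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_abc", "type": "function",
         "function": {"name": "search_web", "arguments": "{}"}},
        {"id": "call_def", "type": "function",
         "function": {"name": "read_file", "arguments": "{}"}},
    ]},
    {"role": "tool", "tool_call_id": "call_abc", "content": "search results..."},
    {"role": "tool", "tool_call_id": "call_def", "content": "file contents..."},
]

print(find_missing_results(step1))      # [] : every call is answered
print(find_missing_results(step1[:2]))  # ['call_def'] : last result not written yet
```

Slicing off the last message reproduces the state in which only one of the two results has been written.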

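2.2 When Truncation Fires

When the message count exceeds the limit, the truncation function _enforce_limit() is called. Inside the ReAct loop, that call can happen before all tool results have been written:

Intermediate state (Step 1 in progress):
  [msg_N]   assistant: tool_calls: [call_abc, call_def]  ← written
  [msg_N+1] tool: call_abc → "search results..."        ← written
  [msg_N+2]                                                ← not written yet!

Truncation fires here → msg_N+1 is kept, but msg_N+2 does not exist yet
→ the tool_result for call_def is "lost"
→ the validator reports "missing result"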
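The intermediate state can be reproduced with a small simulation. This is a sketch under stated assumptions: StepwiseHistory is a hypothetical stand-in for a message store that runs its limit check after every single write, and it records which tool_call ids were still unanswered each time the check fired.

```python
def unanswered(messages):
    """Ids of tool_calls without a matching tool message."""
    answered = {m.get("tool_call_id") for m in messages if m.get("role") == "tool"}
    ids = []
    for m in messages:
        if m.get("role") == "assistant":
            for tc in m.get("tool_calls") or []:
                if tc["id"] not in answered:
                    ids.append(tc["id"])
    return ids


class StepwiseHistory:
    """Hypothetical history that checks its size limit after each write."""

    def __init__(self, limit):
        self.limit = limit
        self.messages = []
        self.seen_at_check = []  # unanswered ids observed whenever the check fires

    def add(self, msg):
        self.messages.append(msg)
        if len(self.messages) > self.limit:
            self.seen_at_check.append(unanswered(self.messages))
            self.messages = self.messages[-self.limit:]


history = StepwiseHistory(limit=3)
history.add({"role": "user", "content": "question"})
history.add({"role": "assistant", "content": "earlier answer"})
history.add({"role": "assistant", "content": None,
             "tool_calls": [{"id": "call_abc"}, {"id": "call_def"}]})
history.add({"role": "tool", "tool_call_id": "call_abc", "content": "search results..."})
history.add({"role": "tool", "tool_call_id": "call_def", "content": "file contents..."})

print(history.seen_at_check)  # [['call_def'], []]
```

The first check runs while call_def is still pending, so any validator attached to that check sees an incomplete pair even though the result arrives one write later.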

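2.3 Why Placeholder Messages Keep Accumulating

# Old approach: add placeholder messages (buggy)
# has_result() is a helper defined elsewhere
def validate_and_fix(messages):
    for msg in messages:
        if msg["role"] == "assistant":
            for tc in msg.get("tool_calls", []):
                tc_id = tc["id"]
                if not has_result(messages, tc_id):
                    # Add a placeholder message
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tc_id,
                        "content": "[Result omitted]"
                    })
    # Problem: `messages` here is a copy, so the placeholders land in the copy
    # The original message list is never modified → the next validate call
    # detects the same "missing" result again

Root cause: validate_and_fix works on a copy of the message list. The placeholders are added to that copy and sent to the LLM, but the source data (the messages stored in the database) is never repaired. The next ReAct step reloads the original, placeholder-free messages from the database, so the same problem shows up again.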
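The difference between patching a copy and patching the source can be shown in a few lines. This is a minimal sketch: count_warnings, patch, and make_stored are hypothetical names, and a plain list stands in for the database.

```python
import copy

PLACEHOLDER = "[Result omitted]"


def missing_ids(messages):
    """Ids of tool_calls without a matching tool message."""
    answered = {m.get("tool_call_id") for m in messages if m.get("role") == "tool"}
    ids = []
    for m in messages:
        if m.get("role") == "assistant":
            for tc in m.get("tool_calls") or []:
                if tc["id"] not in answered:
                    ids.append(tc["id"])
    return ids


def patch(messages):
    """Append one placeholder result per unanswered tool_call."""
    for tc_id in missing_ids(messages):
        messages.append({"role": "tool", "tool_call_id": tc_id, "content": PLACEHOLDER})


def count_warnings(stored, steps, fix_source):
    """Run `steps` validation rounds and count how many report a missing result."""
    warnings = 0
    for _ in range(steps):
        # Each step "loads" the history: either the stored list itself or a copy of it.
        loaded = stored if fix_source else copy.deepcopy(stored)
        if missing_ids(loaded):
            warnings += 1
            patch(loaded)
    return warnings


def make_stored():
    return [{"role": "assistant", "content": None, "tool_calls": [{"id": "call_abc"}]}]


print(count_warnings(make_stored(), steps=7, fix_source=False))  # 7
print(count_warnings(make_stored(), steps=7, fix_source=True))   # 1
```

With the copy, all 7 steps warn about call_abc; once the stored list itself is repaired, only the first step does.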

3. Comparing Two Repair Strategies

[Figure: placeholder accumulation vs. stripping strategy. Timeline over steps 1-7: on the left, placeholder messages pile up step by step; on the right, the stripped message list stays a constant size.]

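3.1 Strategy A: Placeholder Messages (Old Approach)

# Add a placeholder tool_result for each orphan tool_call
placeholder = {
    "role": "tool",
    "tool_call_id": orphan_id,
    "content": "[Result omitted due to context compression]"
}
messages.append(placeholder)

Problems:

It operates on a copy and never repairs the source → placeholders leak and accumulate
Every ReAct step adds new placeholders → the message list bloats
For example, 7 steps × 2 orphans = 14 redundant placeholder messages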

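3.2 Strategy B: Stripping Orphans (New Approach)

# Remove tool_calls that have no result from assistant messages
def strip_orphan_tool_calls(messages):
    """Strip orphan tool_calls: drop calls that have no matching tool result."""
    # Shallow copy: the list is new, but the message dicts are shared,
    # so the edits below also change the caller's messages
    result = list(messages)

    # Collect every tool_call_id that has a result
    result_ids = {
        m["tool_call_id"] for m in result
        if m.get("role") == "tool" and "tool_call_id" in m
    }

    # Walk the assistant messages and remove orphan tool_calls
    for msg in result:
        if msg.get("role") == "assistant" and "tool_calls" in msg:
            msg["tool_calls"] = [
                tc for tc in msg["tool_calls"]
                if tc["id"] in result_ids
            ]
            # If every tool_call was removed, turn it into a plain-text message
            if not msg["tool_calls"]:
                del msg["tool_calls"]

    return result

Advantages:

Modifies the assistant messages in the source data directly
Adds no extra messages → the list does not grow
Strip once and the fix persists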
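To make the "no growth" and "strip once" properties concrete, here is a hedged sketch of a non-mutating variant. strip_orphans is a hypothetical name, and falling back to an empty content string when an assistant message loses all of its tool_calls is an assumption of this sketch, not something the article specifies.

```python
def strip_orphans(messages):
    """Return a new list in which unanswered tool_calls are removed.

    Unlike an in-place version, the input messages are left untouched.
    """
    answered = {
        m["tool_call_id"]
        for m in messages
        if m.get("role") == "tool" and "tool_call_id" in m
    }
    cleaned = []
    for m in messages:
        if m.get("role") == "assistant" and m.get("tool_calls"):
            kept = [tc for tc in m["tool_calls"] if tc["id"] in answered]
            m = {k: v for k, v in m.items() if k != "tool_calls"}
            if kept:
                m["tool_calls"] = kept
            elif m.get("content") is None:
                m["content"] = ""  # assumption: keep content a string once tool_calls is gone
        cleaned.append(m)
    return cleaned


messages = [
    {"role": "assistant", "content": None,
     "tool_calls": [{"id": "call_abc"}, {"id": "call_def"}]},
    {"role": "tool", "tool_call_id": "call_abc", "content": "search results..."},
]

once = strip_orphans(messages)
twice = strip_orphans(once)

print([tc["id"] for tc in once[0]["tool_calls"]])  # ['call_abc']
print(len(messages), len(once), once == twice)     # 2 2 True
```

Running the function a second time changes nothing, and the list length never grows, which is the contrast with the placeholder approach.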

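3.3 Why We Chose Stripping

Dimension	Placeholder messages	Stripping orphans
Message growth	Accumulates	Stable
Fix persistence	Current call only	Permanent
API compatibility	Good (keeps the tool_call structure)	Good (removes invalid calls)
Information retained	Medium (placeholder text carries no information)	Low (removed entirely)
Implementation complexity	Low	Medium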

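4. Atomic Batch Writes

4.1 The Problem: Intermediate States from Stepwise Writes

In the old approach, the ReAct loop wrote its messages one at a time:

# Old approach: stepwise writes
async def handle_tool_calls(self, tool_calls):
    # Write the assistant message first
    await self.add_assistant_message(tool_calls=tool_calls)

    # Then run the tools one by one and write each result
    for tc in tool_calls:
        result = await execute_tool(tc)
        await self.add_tool_message(tc["id"], result)
        # ⚠️ A truncation check may fire here:
        # assistant(tool_calls) is written, but later tool_results are not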

4.2 Atomic Batch Write

# New approach: atomic batch write
async def handle_tool_calls(self, tool_calls):
    # Collect all messages first
    batch = []
    batch.append({"role": "assistant", "tool_calls": tool_calls})

    for tc in tool_calls:
        result = await execute_tool(tc)
        batch.append({
            "role": "tool",
            "tool_call_id": tc["id"],
            "content": result
        })

    # Write all messages in one call (atomic operation)
    await self.add_message_batch(batch)
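One detail the batch approach leaves open is what happens when a tool raises halfway through the loop. The sketch below is one possible way to keep the batch fully paired in that case; build_tool_batch, fake_execute, and the "[tool error]" text are all hypothetical and only illustrate the idea.

```python
import asyncio


async def build_tool_batch(tool_calls, execute_tool):
    """Build assistant + tool messages so that every call gets exactly one result."""
    batch = [{"role": "assistant", "content": None, "tool_calls": tool_calls}]
    for tc in tool_calls:
        try:
            content = await execute_tool(tc)
        except Exception as exc:
            # A failing tool still produces a paired result instead of a gap.
            content = f"[tool error] {type(exc).__name__}: {exc}"
        batch.append({"role": "tool", "tool_call_id": tc["id"], "content": str(content)})
    return batch


async def fake_execute(tc):
    """Stand-in executor for the demo: read_file always fails."""
    if tc["function"]["name"] == "read_file":
        raise FileNotFoundError("missing.txt")
    return "search results..."


calls = [
    {"id": "call_abc", "function": {"name": "search_web", "arguments": "{}"}},
    {"id": "call_def", "function": {"name": "read_file", "arguments": "{}"}},
]

batch = asyncio.run(build_tool_batch(calls, fake_execute))
for message in batch[1:]:
    print(message["tool_call_id"], "->", message["content"])
```

Because the batch is only handed to the store after the loop finishes, the store never sees an assistant message without its full set of results, whether the tools succeeded or not.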

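4.3 Implementing add_message_batch

async def add_message_batch(self, messages):
    """Add messages as one atomic batch.

    Ensures assistant(tool_calls) + tool(results) are written as a unit,
    so the truncation function cannot run on an intermediate state.
    """
    # First append everything to the in-memory list
    self._messages.extend(messages)

    # Then write the batch to the database
    await self._db.batch_insert(self.session_id, messages)

    # Finally check whether truncation is needed (all messages are complete by now)
    if self._needs_truncation():
        self._enforce_limit()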
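The article does not show the body of _enforce_limit(). As a complement to batch writes, a truncation routine can also avoid cutting through a tool-call group on its own. The following is a sketch of that idea; truncate_keeping_pairs is a hypothetical helper and not the WeClaw implementation.

```python
def truncate_keeping_pairs(messages, limit):
    """Keep at most `limit` trailing messages without orphaning tool results."""
    if len(messages) <= limit:
        return list(messages)
    start = len(messages) - limit
    # If the cut lands on a tool result, its assistant message would be dropped.
    # Move the cut forward until the window starts on a non-tool message.
    while start < len(messages) and messages[start].get("role") == "tool":
        start += 1
    return messages[start:]


history = [
    {"role": "user", "content": "question"},
    {"role": "assistant", "content": None,
     "tool_calls": [{"id": "call_abc"}, {"id": "call_def"}]},
    {"role": "tool", "tool_call_id": "call_abc", "content": "search results..."},
    {"role": "tool", "tool_call_id": "call_def", "content": "file contents..."},
    {"role": "assistant", "content": "Analysis results follow..."},
]

print([m["role"] for m in truncate_keeping_pairs(history, 4)])
# ['assistant', 'tool', 'tool', 'assistant']
print([m["role"] for m in truncate_keeping_pairs(history, 3)])
# ['assistant']
```

The trade-off is that the window can end up shorter than the limit: with limit=3 the whole tool group is dropped rather than kept half-complete.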

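5. A Unified Stub Strategy

5.1 Keeping the Two Paths Consistent

Context management has two paths that can produce "orphans":

Truncation path: _enforce_limit() can cut a pair apart when it truncates messages
Compression path: ContextEngine.compress() can drop a tool_result when it compresses old messages

Both paths need a consistent orphan-handling strategy:

# Unified stub strategy
ORPHAN_STUB_CONTENT = "[Result unavailable due to context management]"

def ensure_tool_pair_integrity(messages):
    """Ensure every tool_call has a matching tool_result.

    For a tool_call with no result, add a stub result (instead of stripping).
    Note: this complements the stripping strategy (strip_orphan_tool_calls):
    stripping is used on the truncation path, stubs on the compression path.
    """
    result_ids = {
        m["tool_call_id"] for m in messages
        if m.get("role") == "tool"
    }

    stubs = []
    for msg in messages:
        if msg.get("role") == "assistant":
            for tc in msg.get("tool_calls", []):
                if tc["id"] not in result_ids:
                    stubs.append({
                        "role": "tool",
                        "tool_call_id": tc["id"],
                        "content": ORPHAN_STUB_CONTENT
                    })
                    result_ids.add(tc["id"])  # avoid adding the same stub twice

    # Stubs are appended at the end of the list; if the target API requires each
    # tool result to directly follow its assistant message, insert them there instead
    return messages + stubs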
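Some chat APIs expect tool results to sit directly after the assistant message that issued the calls. For that case, a variant that inserts each stub next to its assistant message may be preferable. This is a sketch under that assumption; insert_stubs_in_place is a hypothetical name.

```python
STUB = "[Result unavailable due to context management]"


def insert_stubs_in_place(messages):
    """Return a new list where each missing result is stubbed right after its assistant message."""
    answered = {m.get("tool_call_id") for m in messages if m.get("role") == "tool"}
    out = []
    i = 0
    while i < len(messages):
        msg = messages[i]
        out.append(msg)
        i += 1
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            # Copy the results that already follow this assistant message.
            while i < len(messages) and messages[i].get("role") == "tool":
                out.append(messages[i])
                i += 1
            # Then add a stub for every call that has no result anywhere.
            for tc in msg["tool_calls"]:
                if tc["id"] not in answered:
                    out.append({"role": "tool", "tool_call_id": tc["id"], "content": STUB})
                    answered.add(tc["id"])
    return out


messages = [
    {"role": "assistant", "content": None,
     "tool_calls": [{"id": "call_abc"}, {"id": "call_def"}]},
    {"role": "tool", "tool_call_id": "call_abc", "content": "search results..."},
    {"role": "assistant", "content": "Analysis results follow..."},
]

fixed = insert_stubs_in_place(messages)
print([m["role"] for m in fixed])  # ['assistant', 'tool', 'tool', 'assistant']
print(fixed[2]["tool_call_id"])    # call_def
```

Applying the function to its own output adds nothing further, so it is safe to run on every load.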

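5.2 Why the Compression Path Uses Stubs Instead of Stripping

Truncation path: the orphan is a "temporary state", so stripping is safer (it adds no messages)
Compression path: the summary is a "permanent replacement", so a stub is safer (it keeps the tool_call structure, so the model still knows which tools were called)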

6. The Coordination Trap Between Pre-scan and the Consecutive Loop

6.1 The Two Phases of Validation

def validate_message_structure(self, messages):
    """Validate the structural integrity of the message list."""

    # Phase 1: Pre-scan
    # Collect the ids of all tool results so orphan tool_calls can be detected quickly
    consumed_ids = set()
    for msg in messages:
        if msg.get("role") == "tool":
            consumed_ids.add(msg["tool_call_id"])

    # Phase 2: Consecutive Loop
    # Check each assistant → tool pairing in order
    i = 0
    while i < len(messages):
        msg = messages[i]
        if msg.get("role") == "assistant" and "tool_calls" in msg:
            assistant_pos = i  # record the real position of the assistant message
            for tc in msg["tool_calls"]:
                # Look ahead (up to 9 messages) for the matching tool_result
                found = False
                for j in range(i + 1, min(i + 10, len(messages))):
                    if messages[j].get("tool_call_id") == tc["id"]:
                        found = True
                        break
                if not found and tc["id"] not in consumed_ids:
                    # Orphan: needs handling
                    self._handle_orphan(messages, assistant_pos, tc)
        i += 1

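6.2 The Trap: Record the Real Position in `assistant_pos`

The old code used i - 1 to locate the assistant message:

# Old code (buggy)
assistant_msg = messages[i - 1]  # assumes the message before a tool message is always the assistant

# Problem: if extras (additionally inserted messages) sit in between, i-1 may not be the assistant

The new code records the assistant's position explicitly:

# New code (fixed)
assistant_pos = i  # record the position while handling the assistant message
# ...
# Later, use assistant_pos to locate the assistant message
assistant_msg = messages[assistant_pos]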
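The same pitfall shows up even without extras as soon as one assistant message issues two calls: for the second tool result, i - 1 points at the first tool result. A small sketch of looking up the owner by id instead; index_tool_call_owners is a hypothetical helper.

```python
def index_tool_call_owners(messages):
    """Map each tool_call id to the index of the assistant message that issued it."""
    owners = {}
    for pos, msg in enumerate(messages):
        if msg.get("role") == "assistant":
            for tc in msg.get("tool_calls") or []:
                owners[tc["id"]] = pos
    return owners


messages = [
    {"role": "assistant", "content": None,
     "tool_calls": [{"id": "call_abc"}, {"id": "call_def"}]},                    # index 0
    {"role": "tool", "tool_call_id": "call_abc", "content": "search results..."},  # index 1
    {"role": "tool", "tool_call_id": "call_def", "content": "file contents..."},   # index 2
]

owners = index_tool_call_owners(messages)
i = 2  # position of the result for call_def

print(messages[i - 1]["role"])                    # tool (not the assistant)
print(messages[owners["call_def"]]["role"])       # assistant
```

Recording the position when the assistant message is visited, or building a map like this one, removes the dependence on what happens to sit at i - 1.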

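7. Summary and Outlook

7.1 Key Takeaways

The root cause of orphan messages is "truncation in an intermediate state": stepwise writes in the ReAct loop break the pairing
Placeholder messages accumulate: working on a copy never repairs the source, so every step adds new placeholders
Atomic batch writes fix the cause: assistant + tool_results are written as a unit
The two paths need a consistent strategy: stripping for truncation, stubs for compression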

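7.2 A Debugging Tip

When you see the same tool_call_id reported as "missing result" across several steps, first check whether message truncation fired in the middle of a ReAct step. It is almost always the combination of stepwise writes and mid-step truncation.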
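Spotting the repetition by eye gets tedious in long logs. Here is a small sketch that counts repeated warnings per id, assuming the validator log format shown in section 1.1; repeated_missing_ids is a hypothetical helper, and call_xyz in the sample is made up for contrast.

```python
import re
from collections import Counter

# Matches lines such as: [Validator] WARNING: tool_call 'call_abc' missing result
WARNING_RE = re.compile(r"WARNING: tool_call '([^']+)' missing result")


def repeated_missing_ids(log_text, threshold=2):
    """Return ids reported as missing at least `threshold` times, with their counts."""
    counts = Counter(WARNING_RE.findall(log_text))
    return {tc_id: n for tc_id, n in counts.items() if n >= threshold}


log = """\
[Agent] Step 1: calling search_web, read_file...
[Validator] WARNING: tool_call 'call_abc' missing result
[Validator] FIX: added placeholder for 'call_abc'
[Agent] Step 2: calling analyze_data...
[Validator] WARNING: tool_call 'call_abc' missing result
[Validator] WARNING: tool_call 'call_xyz' missing result
"""

print(repeated_missing_ids(log))  # {'call_abc': 2}
```

An id that shows up once may be a genuine one-off; an id that repeats step after step points to a fix that never reaches the stored messages.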

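Coming next: "Asynchronous Compression: Context Cleanup the User Never Notices"

UX problems with synchronous compression
Architecture for asynchronous background summarization
Snapshot hash protection

Stay tuned!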
