fix(async-context-compression): reverse-unfolding to prevent progress drift
- Reconstruct native tool-calling sequences using a reverse-unfolding mechanism
- Strictly use atomic grouping for safe native tool output trimming
- Add comprehensive test coverage for unfolding logic and issue drafts
- READMEs and docs synced (v1.4.1)
62
plugins/debug/async_context_compression/ISSUE_EXPLANATION.md
Normal file
@@ -0,0 +1,62 @@
# Async Context Compression Plugin: Current Issues and Resolution Status

This document walks through the root cause of the "ghost truncation" problem we ran into with `async_context_compression` (the async context compression plugin), and where our fix currently stands.

## 1. Root Cause: Two Very Different "Worldviews" (a Serialization Divergence)

In the earlier investigation I wrongly assumed that the `body["messages"]` seen by `outlet` (the post-processing stage) was data left incomplete by truncation.

But based on the local run logs you provided, **you were right: `body['messages']` really does contain the full conversation history**.

So why the huge discrepancy where `inlet` sees 27 messages but `outlet` sees only 8?

The reason is that the OpenWebUI pipeline uses **two completely different message formats** before a request enters the model and after the response comes back:
### View A: Inlet Stage (Native API Expanded View)

- **Characteristics**: strictly follows the OpenAI function-calling spec.
- **Shape**: every tool call and every tool result is its own message.
- **Example**: a conversation involving a complex search.
  - User: check the weather for me (1 message)
  - Assistant: issues a tool_call (1 message)
  - Tool: returns a JSON result (1 message)
  - ...several more round trips...
- **Grand total: 27 messages.** Our compression (trim) algorithm computes how many messages to keep in this 27-message coordinate system.
### View B: Outlet Stage (UI HTML Folded View)

- **Characteristics**: a compact view optimized for frontend rendering.
- **Shape**: after the model call, so that the frontend can render its nice collapsible tool-call card, OpenWebUI forcibly wraps every intermediate tool interaction in `<details type="tool_calls">...</details>` HTML and stuffs it all into the `content` string of a single `role: assistant` message!
- **Example**: the same conversation.
  - User: check the weather for me (1 message)
  - Assistant: `<details>markup covering many tool calls and results</details> The weather is great today...` (1 message)
- **Grand total: 8 messages.**
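To make the mismatch concrete, here is a minimal sketch of the same conversation in both views. The message contents are invented for illustration; real payloads carry full tool-call metadata.

```python
# View A: the native API expanded view seen by inlet.
view_a = [
    {"role": "user", "content": "What's the weather?"},
    {"role": "assistant", "content": "",
     "tool_calls": [{"id": "c1", "function": {"name": "get_weather"}}]},
    {"role": "tool", "tool_call_id": "c1", "content": '{"temp": 21}'},
    {"role": "assistant", "content": "It's 21 degrees and sunny."},
]

# View B: the UI folded view seen by outlet -- the whole tool exchange is
# collapsed into one assistant message's content string.
view_b = [
    {"role": "user", "content": "What's the weather?"},
    {"role": "assistant",
     "content": '<details type="tool_calls">...collapsed calls...</details> '
                "It's 21 degrees and sunny."},
]

# Same conversation, two coordinate systems:
print(len(view_a), len(view_b))  # 4 vs 2 here; 27 vs 8 in the real logs
```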
**💥 Where the disaster strikes:**

The original plugin logic assumed that `inlet` and `outlet` share one coordinate system.

1. At `inlet`, the system computes: "summarize the first 10 messages, keep the last 17."
2. The "summarize the first 10" task is handed off to a background async job.
3. That background job fires at the `outlet` stage, where the message array it receives is **View B (only 8 messages in total)**.
4. The algorithm then tries to cut "the first 10 messages" out of an 8-message array and replace them with 1 summary message.
5. **The result: array indices go out of bounds, the coordinates are completely scrambled, an error is raised, and the newest valid messages may be deleted as if they were old ones (over-compression).**
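The failure mode can be reproduced in miniature. This is a simplified stand-in for the trim step, not the plugin's actual code:

```python
def naive_trim(messages, summarize_first_n, summary):
    # Replace the first N messages with a single summary message.
    # Silently assumes `messages` is in the coordinate system the plan was made in.
    return [{"role": "system", "content": summary}] + messages[summarize_first_n:]

# Plan made at inlet against the 27-message expanded view:
plan_first_n = 10

# ...but executed at outlet against the 8-message folded view:
folded_view = [{"role": "user", "content": f"msg {i}"} for i in range(8)]
result = naive_trim(folded_view, plan_first_n, "summary of older messages")

# Every real message is swallowed by the slice -- over-compression:
print(len(result))  # 1: only the summary survives
```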
---
## 2. Issues Already Resolved (✅ Done)

To immediately stop the data corruption caused by this coordinate mismatch, we shipped a hotfix (Local v1.4.0):

**✅ Added a probe defense against the "folded view":**

- I wrote a function, `_is_compact_tool_details_view`.
- Now, when the background summary task fires, the system automatically scans the `messages` handed over by `outlet`. The moment it finds any trace of the `<details type="tool_calls">` HTML folding tag, it **immediately aborts and skips** the current summary-generation task.
- **Payoff**: task errors and forced truncation caused by array misalignment are eliminated entirely; the UI crashes and history loss are contained.
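A minimal sketch of the probe follows. The real `_is_compact_tool_details_view` lives inside the plugin class; this standalone version assumes `content` can be coerced to a string:

```python
FOLD_MARKER = '<details type="tool_calls">'

def is_compact_tool_details_view(messages):
    """Return True if any assistant message carries the folded HTML tool-call block."""
    for msg in messages:
        if msg.get("role") == "assistant" and FOLD_MARKER in str(msg.get("content", "")):
            return True
    return False

folded = [{"role": "assistant", "content": FOLD_MARKER + "...</details> Done."}]
clean = [{"role": "assistant", "content": "Done."}]
print(is_compact_tool_details_view(folded), is_compact_tool_details_view(clean))  # True False
```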
---
## 3. The Remaining Limitation, Now Resolved (✅ Done: Reverse-Unfolding Fix)

The new limitation that the skip introduced — **long conversations containing tool calls could never get an automatic "history summary"** — is now fully resolved.

### The Technical Approach We Shipped

Source analysis showed that OpenWebUI runs `convert_output_to_messages` on entry to `inlet` to restore the tool-call chain. We therefore introduced the same **reverse-unfolding (deflation/unfolding)** mechanism, `_unfold_messages`, in the plugin's `outlet` stage.

Now, when the background task receives the folded view from `outlet`, it no longer "skips". Instead it extracts the native `output` field hidden inside the message objects and **re-expands it into the expanded view** (for example, restoring the 8-message illusion back to the real 27 messages of underlying data), so that its coordinate system lines up exactly with `inlet`'s.

With this in place, long conversations with complex tool calls can also be compressed safely in the background, with no remaining risk of truncation or forced deletion!
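The exact shape of the persisted `output` field is OpenWebUI-internal; as an illustration only, and assuming each folded assistant message stores its original expanded messages under an `output` key, the unfolding step amounts to something like:

```python
def unfold_messages(messages):
    """Expand folded assistant messages back into their native tool-call sequence.

    Illustrative sketch: assumes a folded message carries the underlying
    expanded messages in an `output` list, mirroring what
    convert_output_to_messages reconstructs at inlet.
    """
    unfolded = []
    for msg in messages:
        native = msg.get("output")
        if msg.get("role") == "assistant" and isinstance(native, list) and native:
            unfolded.extend(native)  # restore the underlying tool-call chain
        else:
            unfolded.append(msg)
    return unfolded

folded = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "<details>...</details> done",
     "output": [
         {"role": "assistant", "tool_calls": [{"id": "c1"}]},
         {"role": "tool", "tool_call_id": "c1", "content": "{}"},
         {"role": "assistant", "content": "done"},
     ]},
]
print(len(folded), "->", len(unfold_messages(folded)))  # 2 -> 4
```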
@@ -0,0 +1,60 @@
# Reply to dhaern - Follow-up on the Latest Review

Thank you for re-checking the latest version and for the continued precise analysis. Let me address your two remaining concerns directly.

---
### 1. `enable_tool_output_trimming` — Not a regression; the behavior change is intentional

The trimming logic is present and functional. Here is what it does now versus before.

**Current behavior (`_trim_native_tool_outputs`, lines 835–945):**

- Iterates over atomic groups via `_get_atomic_groups`.
- Identifies valid chains: `assistant(tool_calls)` → `tool` → [optional assistant follow-up].
- If the combined character count of the `tool` role messages in a chain exceeds **1,200 characters**, it collapses *the tool messages themselves* to a localized `[Content collapsed]` placeholder and injects a `metadata.is_trimmed` flag.
- Separately walks assistant messages containing `<details type="tool_calls">` HTML blocks and collapses oversized `result` attributes in the same way.
- The function is called at inlet when `enable_tool_output_trimming=True` and `function_calling=native`.
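As a rough sketch of the collapse step for a single chain (simplified: the real function also walks atomic groups and handles the HTML `result` attributes):

```python
TRIM_THRESHOLD = 1_200  # combined character budget for tool outputs in one chain
PLACEHOLDER = "[Content collapsed]"

def trim_chain_tool_outputs(chain):
    """Collapse the tool messages in a chain whose combined size exceeds the budget."""
    tool_msgs = [m for m in chain if m.get("role") == "tool"]
    total = sum(len(str(m.get("content", ""))) for m in tool_msgs)
    if total <= TRIM_THRESHOLD:
        return chain
    trimmed = []
    for m in chain:
        if m.get("role") == "tool":
            # Replace content, flag the message, leave chain structure intact.
            m = {**m, "content": PLACEHOLDER,
                 "metadata": {**m.get("metadata", {}), "is_trimmed": True}}
        trimmed.append(m)
    return trimmed

chain = [
    {"role": "assistant", "tool_calls": [{"id": "c1"}]},
    {"role": "tool", "tool_call_id": "c1", "content": "x" * 5000},
    {"role": "assistant", "content": "final answer"},
]
out = trim_chain_tool_outputs(chain)
print(out[1]["content"], out[2]["content"])  # [Content collapsed] final answer
```

Note that the assistant follow-up survives untouched; only the oversized tool payload is replaced.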
**What is different from the previous version:**

The old approach rewrote the *assistant follow-up* message to keep only the "final answer". The new approach collapses the *tool response content* itself. Both reduce context size, but the new approach preserves the structural integrity of the tool-calling chain (which the atomic-grouping work in this release depends on).

The docstring in the plugin header also contained a stale description ("extract only the final answer") that contradicted the actual behavior. That has been corrected in the latest commit to accurately say "collapses oversized native tool outputs to a short placeholder."

If you are looking for the specific "keep only the final answer" behavior from the old version, that path was intentionally removed because it conflicted with the atomic-group integrity guarantees introduced in this release. The current collapse approach is a safe replacement.

---
### 2. `compressed_message_count` — The fix is real; here is the coordinate trace

The concern about "recalculating from the already-modified view" is understandable given the previous architecture. Here is exactly why the current code does not have that problem.

**Key change in `outlet`:**

```python
db_messages = self._load_full_chat_messages(chat_id)
messages_to_unfold = db_messages if (db_messages and len(db_messages) >= len(messages)) else messages
summary_messages = self._unfold_messages(messages_to_unfold)
target_compressed_count = self._calculate_target_compressed_count(summary_messages)
```
`_load_full_chat_messages` fetches the raw persisted history from the OpenWebUI database. Because the synthetic summary message (injected during inlet rendering) is **never written back to the database**, `summary_messages` from the DB path is always the clean, unmodified original history — no summary marker, no coordinate inflation.

`_calculate_target_compressed_count` called on this clean list simply computes:

```
original_count = len(db_messages)
raw_target = original_count - keep_last
target = atomic_align(raw_target)  # still in original-history coordinates
```
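`atomic_align` is the plugin's own helper; as an illustration of what "atomically aligned" means here, snapping a raw cut index back to an atomic-group boundary can look like this (the group boundaries are invented for the example):

```python
def atomic_align(raw_target, group_boundaries):
    """Snap a raw cut index down to the nearest atomic-group boundary.

    `group_boundaries` are indices where a new atomic group starts, so a cut
    placed there never splits an assistant(tool_calls) -> tool chain.
    """
    candidates = [b for b in group_boundaries if b <= raw_target]
    return max(candidates) if candidates else 0

# Suppose groups start at messages 0, 3, 7, 12 (each tool-call chain is one group).
boundaries = [0, 3, 7, 12]
print(atomic_align(9, boundaries))   # 7: cutting at 9 would split the group starting at 7
print(atomic_align(12, boundaries))  # 12: already on a boundary
```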
This `target_compressed_count` value is then passed into `_generate_summary_async` unchanged. Inside the async task, the same `db_messages` list is sliced to `db_messages[start_index:target]` to build `middle_messages`. After generation (with potential atomic truncation from the end), the saved value is:

```python
saved_compressed_count = start_index + len(middle_messages)
```

This is the exact position in the original DB message list up to which the new summary actually covers — not a target, not an estimate from a different view.
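The resulting invariant can be stated as a tiny trace with illustrative values:

```python
# Illustrative trace of the saved value, in original-history coordinates.
db_messages = [f"msg {i}" for i in range(27)]  # persisted history
start_index = 1   # first message after any previous summary coverage
target = 17       # atomically aligned cut point

middle_messages = db_messages[start_index:target]
# Suppose atomic truncation drops the last 2 messages before generation:
middle_messages = middle_messages[:-2]

saved_compressed_count = start_index + len(middle_messages)
# The saved value is an index into db_messages, never past the target:
print(saved_compressed_count, saved_compressed_count <= target)  # 15 True
```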
**The fallback path (DB unavailable)** uses the inlet-rendered body messages. In that case `_get_summary_view_state` reads `covered_until` from the injected summary marker (which was written as the atomically-aligned `start_index`), so `base_progress` is already in original-history coordinates. The calculation naturally continues from there without mixing views.

In short: the field now has a single, consistent meaning throughout the entire call chain — the index (in the original, persisted message list) up to which the current summary text actually covers.

---

Thank you again for the rigorous review. The two points you flagged after the last release are now addressed, and the stale documentation description has been corrected. Please do let us know if you spot anything else.