feat(async-context-compression): release v1.5.0

- add external chat reference summaries and mixed-script token estimation
- tighten summary budgeting, fallback handling, and frontend error visibility
- sync READMEs, mirrored docs, indexes, and bilingual v1.5.0 release notes
This commit is contained in:
fujie
2026-03-14 16:10:06 +08:00
parent 2f518d4c7a
commit 858d048d81
12 changed files with 2482 additions and 343 deletions

View File

@@ -1,13 +1,19 @@
# Async Context Compression Filter
**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.4.1 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT
**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.5.0 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT
This filter reduces token consumption in long conversations through intelligent summarization and message compression while keeping conversations coherent.
## What's new in 1.4.1
## What's new in 1.5.0
- **Reverse-Unfolding Mechanism**: Accurately reconstructs the expanded native tool-calling sequence during the outlet phase to permanently fix coordinate drift and missing summaries for long tool-based conversations.
- **Safer Tool Trimming**: Refactored `enable_tool_output_trimming` to strictly use atomic block groups for safe trimming, completely preventing JSON payload corruption.
- **External Chat Reference Summaries**: Added support for referenced chat context blocks that can reuse cached summaries, inject small referenced chats directly, or generate summaries for larger referenced chats before injection.
- **Fast Multilingual Token Estimation**: Added a new mixed-script token estimation pipeline so inlet/outlet preflight checks can avoid unnecessary exact token counts while staying much closer to real usage.
- **Stronger Working-Memory Prompt**: Refined the XML summary prompt to better preserve actionable context across general chat, coding tasks, and tool-heavy conversations.
- **Clearer Frontend Debug Logs**: Reworked browser-console logging into grouped structural snapshots that are easier to scan during debugging.
- **Safer Tool Trimming Defaults**: Enabled native tool-output trimming by default and exposed a dedicated `tool_trim_threshold_chars` valve with a 600-character default.
- **Safer Referenced-Chat Fallbacks**: If generating a referenced chat summary fails, the new reference-summary path now falls back to direct contextual injection instead of failing the whole chat.
- **Correct Summary Budgeting**: `summary_model_max_context` now controls summary-input fitting, while `max_summary_tokens` remains an output-length cap.
- **More Visible Summary Failures**: Important background summary failures now surface in the browser console (`F12`) and as a status hint even when `show_debug_log` is off.
---
@@ -19,15 +25,85 @@ This filter reduces token consumption in long conversations through intelligent
- ✅ Persistent storage via Open WebUI's shared database connection (PostgreSQL, SQLite, etc.).
- ✅ Flexible retention policy to keep the first and last N messages.
- ✅ Smart injection of historical summaries back into the context.
- ✅ External chat reference summarization with cached-summary reuse, direct injection for small chats, and generated summaries for larger chats.
- ✅ Structure-aware trimming that preserves document structure (headers, intro, conclusion).
- ✅ Native tool output trimming for cleaner context when using function calling.
- ✅ Real-time context usage monitoring with warning notifications (>90%).
- ✅ Detailed token logging for precise debugging and optimization.
- ✅ Fast multilingual token estimation plus exact token fallback for precise debugging and optimization.
- ✅ **Smart Model Matching**: Automatically inherits configuration from base models for custom presets.
- ✅ **Multimodal Support**: Images are preserved but their tokens are **NOT** calculated. Please adjust thresholds accordingly.
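As a rough illustration of how a mixed-script estimator can stay fast while tracking real usage, here is a minimal sketch; the regex range and the per-script ratios are assumptions for illustration, not the filter's actual constants:

```python
import re

# Assumed heuristics: CJK characters cost roughly 1 token each on common
# BPE vocabularies, while other scripts average ~4 characters per token.
CJK_RE = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")

def estimate_tokens(text: str) -> int:
    """Fast mixed-script token estimate; exact counting stays the fallback."""
    if not text:
        return 0
    cjk = len(CJK_RE.findall(text))       # CJK chars: ~1 token each
    other = len(text) - cjk               # everything else: ~4 chars/token
    return cjk + (other + 3) // 4         # round the non-CJK share up
```

An exact tokenizer call would then only be needed when an estimate like this lands close to a configured threshold.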
---
## What This Fixes
- **Problem 1: A referenced chat could break the current request.**
Before, if the filter needed to summarize a referenced chat and that LLM call failed, the current chat could fail with it. Now it degrades gracefully and injects direct context instead.
- **Problem 2: Some referenced chats were being cut too aggressively.**
Before, the output limit (`max_summary_tokens`) could be treated like the input window, which made large referenced chats shrink earlier than necessary. Now input fitting uses the summary model's real context window (`summary_model_max_context` or model/global fallback).
- **Problem 3: Some background summary failures were too easy to miss.**
Before, a failure during background summary preparation could disappear quietly when frontend debug logging was off. Now important failures are forced to the browser console and also shown through a user-facing status message.
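The graceful degradation described in Problem 1 can be sketched as follows; `summarize` and `inject_direct` are hypothetical stand-ins for the filter's internal helpers, and the tag names are illustrative:

```python
async def build_reference_block(ref_chat, summarize, inject_direct):
    """Try to summarize a referenced chat; on failure, degrade to
    direct context injection instead of failing the current request."""
    try:
        summary = await summarize(ref_chat)
        if summary:
            return f"<ref_summary>{summary}</ref_summary>"
    except Exception as exc:
        # The summary call must never take the current chat down with it.
        print(f"[async-context-compression] reference summary failed: {exc}")
    return inject_direct(ref_chat)
```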
---
## Workflow Overview
This filter operates in two phases:
1. `inlet`: injects stored summaries, processes external chat references, and trims context when required before the request is sent to the model.
2. `outlet`: runs asynchronously after the response is complete, decides whether a new summary should be generated, and persists it when appropriate.
```mermaid
flowchart TD
A[Request enters inlet] --> B[Normalize tool IDs and optionally trim large tool outputs]
B --> C{Referenced chats attached?}
C -- No --> D[Load current chat summary if available]
C -- Yes --> E[Inspect each referenced chat]
E --> F{Existing cached summary?}
F -- Yes --> G[Reuse cached summary]
F -- No --> H{Fits direct budget?}
H -- Yes --> I[Inject full referenced chat text]
H -- No --> J[Prepare referenced-chat summary input]
J --> K{Referenced-chat summary call succeeds?}
K -- Yes --> L[Inject generated referenced summary]
K -- No --> M[Fallback to direct contextual injection]
G --> D
I --> D
L --> D
M --> D
D --> N[Build current-chat Head + Summary + Tail]
N --> O{Over max_context_tokens?}
O -- Yes --> P[Trim oldest atomic groups]
O -- No --> Q[Send final context to the model]
P --> Q
Q --> R[Model returns the reply]
R --> S[Outlet rebuilds the full history]
S --> T{Reached compression threshold?}
T -- No --> U[Finish]
T -- Yes --> V[Fit summary input to the summary model context]
V --> W{Background summary call succeeds?}
W -- Yes --> X[Save new chat summary and update status]
W -- No --> Y[Force browser-console error and show status hint]
```
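The two phases in the diagram map onto the standard Open WebUI filter shape. The sketch below shows only the division of responsibilities; the helper methods are hypothetical placeholders, not this plugin's real implementation:

```python
class Filter:
    """Minimal two-phase filter skeleton (helper bodies are stubs)."""

    def inlet(self, body: dict) -> dict:
        # Phase 1: runs before the request reaches the model.
        messages = body.get("messages", [])
        messages = self._inject_summary(messages)       # stored summary, if any
        messages = self._trim_if_over_budget(messages)  # context trimming
        body["messages"] = messages
        return body

    def outlet(self, body: dict) -> dict:
        # Phase 2: runs after the reply; summary generation is scheduled
        # in the background and must not block the response.
        self._maybe_schedule_summary(body.get("messages", []))
        return body

    def _inject_summary(self, messages):
        return messages

    def _trim_if_over_budget(self, messages):
        return messages

    def _maybe_schedule_summary(self, messages):
        pass
```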
### Key Notes
- `inlet` only injects and trims context. It does not generate the main chat summary.
- `outlet` performs summary generation asynchronously and does not block the current reply.
- External chat references may come from an existing persisted summary, a small chat's full text, or a generated/truncated reference summary.
- If a referenced-chat summary call fails, the filter falls back to direct context injection instead of failing the whole request.
- `summary_model_max_context` controls summary-input fitting. `max_summary_tokens` only controls how long the generated summary may be.
- Important background summary failures are surfaced to the browser console (`F12`) and the chat status area.
- External reference messages are protected during trimming so they are not discarded first.
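The budgeting rule above can be illustrated with a small sketch (function and parameter names are hypothetical): the input window bounds how much history is fed to the summary model, room is reserved for the capped output, and the oldest messages are dropped first:

```python
def fit_summary_input(messages, est_tokens, input_window, max_output_tokens):
    """Fit summary input to the model's context window.

    input_window plays the role of summary_model_max_context (input
    fitting); max_output_tokens plays the role of max_summary_tokens
    (output cap only)."""
    budget = input_window - max_output_tokens  # reserve room for the output
    kept, used = [], 0
    for msg in reversed(messages):             # keep the newest messages first
        cost = est_tokens(msg["content"])
        if used + cost > budget:
            break                              # oldest messages fall off
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```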
---
## Installation & Configuration
### 1) Database (automatic)
@@ -51,11 +127,12 @@ This filter reduces token consumption in long conversations through intelligent
| `keep_first` | `1` | Always keep the first N messages (protects system prompts). |
| `keep_last` | `6` | Always keep the last N messages to preserve recent context. |
| `summary_model` | `None` | Model for summaries. Strongly recommended to set a fast, economical model (e.g., `gemini-2.5-flash`, `deepseek-v3`). Falls back to the current chat model when empty. |
| `summary_model_max_context` | `0` | Max context tokens for the summary model. If 0, falls back to `model_thresholds` or global `max_context_tokens`. |
| `max_summary_tokens` | `16384` | Maximum tokens for the generated summary. |
| `summary_temperature` | `0.3` | Randomness for summary generation. Lower is more deterministic. |
| `summary_model_max_context` | `0` | Input context window used to fit summary requests. If `0`, falls back to `model_thresholds` or global `max_context_tokens`. |
| `max_summary_tokens` | `16384` | Maximum output length for the generated summary. This is not the summary-input context limit. |
| `summary_temperature` | `0.1` | Randomness for summary generation. Lower is more deterministic. |
| `model_thresholds` | `{}` | Per-model overrides for `compression_threshold_tokens` and `max_context_tokens` (useful for mixed models). |
| `enable_tool_output_trimming` | `false` | When enabled and `function_calling: "native"` is active, trims verbose tool outputs to extract only the final answer. |
| `enable_tool_output_trimming` | `true` | When enabled for `function_calling: "native"`, trims oversized native tool outputs while keeping the tool-call chain intact. |
| `tool_trim_threshold_chars` | `600` | Trim native tool output blocks once their total content length reaches this threshold. |
| `debug_mode` | `false` | Log verbose debug info. Set to `false` in production. |
| `show_debug_log` | `false` | Print debug logs to browser console (F12). Useful for frontend debugging. |
| `show_token_usage_status` | `true` | Show token usage status notification in the chat interface. |
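A minimal sketch of the threshold-based trimming that `tool_trim_threshold_chars` controls; the placeholder wording and the 200-character keep length are illustrative assumptions, not the filter's exact behavior:

```python
TOOL_TRIM_THRESHOLD_CHARS = 600  # mirrors the tool_trim_threshold_chars valve

def trim_tool_output(tool_message: dict,
                     threshold: int = TOOL_TRIM_THRESHOLD_CHARS) -> dict:
    """Keep the tool message (so the tool-call chain stays intact) but
    replace oversized content with a short placeholder."""
    content = tool_message.get("content", "")
    if len(content) < threshold:
        return tool_message
    trimmed = dict(tool_message)  # never mutate the original message
    trimmed["content"] = (
        content[:200] + f"\n…[tool output trimmed: {len(content)} chars]"
    )
    return trimmed
```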
@@ -71,8 +148,12 @@ If this plugin has been useful, a star on [OpenWebUI Extensions](https://github.
- **Initial system prompt is lost**: Keep `keep_first` greater than 0 to protect the initial message.
- **Compression effect is weak**: Raise `compression_threshold_tokens` or lower `keep_first` / `keep_last` to allow more aggressive compression.
- **A referenced chat summary fails**: The current request should continue with a direct-context fallback. Check the browser console (`F12`) if you need the upstream failure details.
- **A background summary silently seems to do nothing**: Important failures now surface in chat status and the browser console (`F12`).
- **Submit an Issue**: If you encounter any problems, please submit an issue on GitHub: [OpenWebUI Extensions Issues](https://github.com/Fu-Jie/openwebui-extensions/issues)
## Changelog
See [`v1.5.0` Release Notes](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/v1.5.0.md) for the release-specific summary.
See the full history on GitHub: [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions)

View File

@@ -1,15 +1,21 @@
# Async Context Compression Filter
**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.4.1 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT
**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.5.0 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT
> **Important note**: To keep every filter maintainable and easy to use, each filter should ship with clear, complete documentation so that its features, configuration, and usage are fully explained.
This filter significantly reduces token consumption in long conversations through intelligent summarization and message compression, while keeping the conversation coherent.
## What's new in 1.4.1
## What's new in 1.5.0
- **Reverse-Unfolding Mechanism**: Introduces a `_unfold_messages` mechanism to precisely align coordinates during the `outlet` phase, permanently fixing progress drift and skipped summary generation in long tool-calling conversations caused by frontend view folding.
- **Safer Tool Trimming**: Refactored `enable_tool_output_trimming` to strictly use atomic block groups for safe native tool-output trimming, replacing aggressive regex matching and preventing JSON payload corruption.
- **External Chat Reference Summaries**: Added summary support for referenced chat context. Cached summaries can now be reused, smaller referenced chats injected directly, and larger referenced chats summarized before injection.
- **Fast Multilingual Token Estimation**: Added a mixed-script token estimation pipeline so inlet/outlet preflight checks need fewer exact counts, while staying much closer to real usage than the old rough character ratio.
- **Stronger Working-Memory Prompt**: Rewrote the XML summary prompt to better preserve key information across general chat, coding tasks, and consecutive tool-calling scenarios.
- **Clearer Frontend Debug Logs**: Browser-console logs now use grouped, structured output, making context-compression behavior easier to inspect.
- **Safer Tool Trimming Defaults**: Native tool-output trimming is now enabled by default, with a new `tool_trim_threshold_chars` valve defaulting to 600 characters.
- **Safer Referenced-Chat Fallbacks**: When the new referenced-chat summary path fails, it no longer drags down the current request; it automatically falls back to direct context injection.
- **Correct Summary Budgeting**: `summary_model_max_context` now governs only the summary input window, while `max_summary_tokens` continues to govern only the summary output length.
- **More Visible Summary Failures**: Important background summary failures are now forced to the browser console (`F12`), with a synchronized status hint.
---
@@ -21,14 +27,84 @@
- **Persistent Storage**: Reuses Open WebUI's shared database connection, with automatic support for PostgreSQL, SQLite, and more.
- **Flexible Retention Policy**: Configurable retention of head and tail messages keeps key information coherent.
- **Smart Injection**: Intelligently injects historical summaries into the new context.
- **External Chat Reference Summaries**: Supports cached-summary reuse, direct injection for small chats, and summarize-then-inject for large chats.
- **Structure-Aware Trimming**: Intelligently collapses overly long messages while preserving the document skeleton (headers, opening, and closing).
- **Native Tool Output Trimming**: Supports trimming verbose tool-call outputs.
- **Real-Time Monitoring**: Monitors context usage in real time and warns once it exceeds 90%.
- **Detailed Logging**: Provides precise token statistics logs for easier debugging.
- **Fast Estimation + Exact Fallback**: Provides faster multilingual token estimation, falling back to exact counting when needed, for easier debugging.
- **Smart Model Matching**: Custom models automatically inherit the threshold configuration of their base models.
- **Multimodal Support**: Image content is preserved, but its tokens are **NOT** counted. Adjust thresholds accordingly.
For details on how it works, see the [Workflow Guide](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/WORKFLOW_GUIDE_CN.md).
A longer explanation of how it works is still available in the [Workflow Guide](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/WORKFLOW_GUIDE_CN.md).
---
## What This Fixes (Plain-Language Version)
- **Problem 1: A failed summary for a referenced chat could take the current conversation down with it.**
Previously, if the filter needed to summarize a referenced chat first and that LLM call failed, the current request could fail as well. The behavior is now "summarize when possible, fall back to injecting the context directly on failure," so the current conversation is never dragged down.
- **Problem 2: Some referenced chats were truncated too early and lost too much information.**
Previously, one code path mistook `max_summary_tokens`, an output-length limit, for the input context window, so larger referenced chats were truncated prematurely. Sizing is now based on the summary model's real input window, preserving more useful content.
- **Problem 3: When a background summary failed, users could not easily tell what had happened.**
Previously, with `show_debug_log=false`, some background failures were only kept in internal logs. Key failures are now forced to the browser console, and the chat status hints at checking `F12`.
---
## Workflow Overview
This filter operates in two phases:
1. `inlet`: runs before the request is sent to the model; it injects existing summaries, processes external chat references, and trims the context when necessary.
2. `outlet`: runs asynchronously after the model's reply completes; it decides whether a new summary is needed and writes it to the database when appropriate.
```mermaid
flowchart TD
A[Request enters inlet] --> B[Normalize tool IDs and trim oversized tool outputs as needed]
B --> C{Referenced chats attached?}
C -- No --> D[Load the current chat summary if one exists]
C -- Yes --> E[Inspect each referenced chat]
E --> F{Cached summary available?}
F -- Yes --> G[Reuse the cached summary directly]
F -- No --> H{Fits the direct budget?}
H -- Yes --> I[Inject the full referenced chat text]
H -- No --> J[Prepare the referenced chat's summary input]
J --> K{Referenced-chat summary call succeeds?}
K -- Yes --> L[Inject the generated reference summary]
K -- No --> M[Fall back to direct context injection]
G --> D
I --> D
L --> D
M --> D
D --> N[Build Head + Summary + Tail for the current chat]
N --> O{Over max_context_tokens?}
O -- Yes --> P[Trim starting from the oldest atomic groups]
O -- No --> Q[Send the final context to the model]
P --> Q
Q --> R[Model returns the current reply]
R --> S[Outlet rebuilds the full history]
S --> T{Compression threshold reached?}
T -- No --> U[Done]
T -- Yes --> V[Fit the summary input to the summary model's context window]
V --> W{Background summary call succeeds?}
W -- Yes --> X[Save the new summary and update status]
W -- No --> Y[Force a browser-console error and prompt the user to check it]
```
### Key Notes
- `inlet` only injects and trims context; it does not generate the current chat's main summary.
- `outlet` generates summaries asynchronously and does not block the current reply.
- External chat references may come from an existing persisted summary, a small chat's full text, or a dynamically generated/truncated reference summary.
- If a referenced-chat summary fails, the filter automatically falls back to direct context injection instead of failing the current request.
- `summary_model_max_context` controls the summary input window; `max_summary_tokens` only controls the generated summary's output length.
- Important background summary failures are shown in the browser console (`F12`) and in the chat status hint.
- External reference messages are specially protected during trimming so they are not deleted first.
---
@@ -64,8 +140,8 @@
| Parameter | Default | Description |
| :-------------------- | :------ | :------------------------------------------------------------------------------------------------------------------------------------------ |
| `summary_model` | `None` | Model ID used to generate summaries. **Strongly recommended**: configure a fast, economical model with a large context window (e.g., `gemini-2.5-flash`, `deepseek-v3`). If empty, the current chat model is reused. |
| `summary_model_max_context` | `0` | Maximum context tokens for the summary model. If `0`, falls back to `model_thresholds` or the global `max_context_tokens`. |
| `max_summary_tokens` | `16384` | Maximum tokens allowed when generating the summary. |
| `summary_model_max_context` | `0` | Input context window available to summary requests. If `0`, falls back to `model_thresholds` or the global `max_context_tokens`. |
| `max_summary_tokens` | `16384` | Maximum output tokens allowed for the generated summary. It is not a cap on the summary input window. |
| `summary_temperature` | `0.1` | Controls the randomness of summary generation; lower values give more stable results. |
### Advanced Configuration
@@ -93,7 +169,8 @@
| Parameter | Default | Description |
| :----------------------------- | :------- | :-------------------------------------------------------------------------------------------------------------------------------------- |
| `enable_tool_output_trimming` | `false` | When enabled with `function_calling: "native"` active, trims verbose tool outputs to extract only the final answer. |
| `enable_tool_output_trimming` | `true` | When enabled (effective only under `function_calling: "native"`), trims oversized native tool outputs, preserving the tool-call chain structure and replacing verbose content with a short placeholder. |
| `tool_trim_threshold_chars` | `600` | Triggers trimming once a native tool output's cumulative character count reaches this value; suited to tool results containing long text or tables. |
| `debug_mode` | `false` | Whether to print verbose debug information to Open WebUI's console logs. Defaults to, and in production should remain, `false`. |
| `show_debug_log` | `false` | Whether to print debug logs to the browser console (F12). Useful for frontend debugging. |
| `show_token_usage_status` | `true` | Whether to show a token-usage status notification when the conversation ends. |
@@ -109,8 +186,12 @@
- **Initial system prompt is lost**: Set `keep_first` to a value greater than 0.
- **Compression effect is weak**: Raise `compression_threshold_tokens`, or lower `keep_first` / `keep_last` for more aggressive compression.
- **A referenced chat summary fails**: The current request should now continue, falling back to direct context injection. Open the browser console (`F12`) if you need the upstream failure details.
- **A background summary seems to "do nothing"**: Important failures now appear both in the status hint and in the browser console (`F12`).
- **Submit an Issue**: If you run into any problems, please file an issue on GitHub: [OpenWebUI Extensions Issues](https://github.com/Fu-Jie/openwebui-extensions/issues)
## Changelog
See the [`v1.5.0` release notes](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/v1.5.0_CN.md) for this release's standalone summary.
See the full history on GitHub: [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions)

View File

@@ -20,9 +20,9 @@ Filters act as middleware in the message pipeline:
---
Reduces token consumption in long conversations through intelligent summarization while maintaining coherence.
Reduces token consumption in long conversations with safer summary fallbacks and clearer failure visibility.
**Version:** 1.4.1
**Version:** 1.5.0
[:octicons-arrow-right-24: Documentation](async-context-compression.md)

View File

@@ -20,11 +20,11 @@ Filters act as middleware in the message pipeline:
---
Reduces token consumption in long conversations through intelligent summarization while maintaining coherence.
Reduces long-conversation token consumption with more robust summary fallbacks and clearer failure hints, while maintaining coherence.
**Version:** 1.4.1
**Version:** 1.5.0
[:octicons-arrow-right-24: 查看文档](async-context-compression.md)
[:octicons-arrow-right-24: 查看文档](async-context-compression.zh.md)
- :material-text-box-plus:{ .lg .middle } **Context Enhancement**