docs(async-context-compression): add community post drafts

This commit is contained in:
fujie
2026-03-14 16:29:15 +08:00
parent 158792d82f
commit 8f8147828b
2 changed files with 552 additions and 0 deletions

[![](https://img.shields.io/badge/OpenWebUI%20Community-Get%20Plugin-blue?style=for-the-badge)](https://openwebui.com/posts/async_context_compression_b1655bc8)
# Async Context Compression: A Production-Scale Working-Memory Filter for OpenWebUI
Long chats do not just get expensive. They also get fragile.
Once a conversation grows large enough, you usually have to choose between two bad options:
- keep the full history and pay a heavy context cost
- trim aggressively and risk losing continuity, tool state, or important prior decisions
`Async Context Compression` is built to avoid that tradeoff.
It is not a simple “summarize old messages” utility. It is a structure-aware, async, database-backed working-memory system for OpenWebUI that can compress long conversations while preserving conversational continuity, tool-calling integrity, and, as of `v1.5.0`, referenced-chat context injection as well.
The plugin has now matured enough to be described as a serious, high-capability filter rather than a small convenience add-on.
**[📖 Full README](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/README.md)**
**[📝 v1.5.0 Release Notes](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/v1.5.0.md)**
---
## Why This Plugin Exists
OpenWebUI conversations often contain much more than plain chat:
- long-running planning threads
- coding sessions with repeated tool use
- model-specific context limits
- multimodal messages
- external referenced chats
- custom models with different context windows
A naive compression strategy is not enough in those environments.
If a filter only drops earlier messages based on length, it can:
- break native tool-calling chains
- lose critical task state
- destroy continuity in old chats
- make debugging impossible
- hide important provider-side failures
`Async Context Compression` is designed around a stronger premise:
> compress history without treating conversation structure as disposable
That means it tries to preserve what actually matters for the next turn:
- the current goal
- durable user preferences
- recent progress
- tool outputs that still matter
- error state
- summary continuity
- referenced context from other chats
---
## What Makes It Different
This plugin now combines several capabilities that are usually split across separate systems:
### 1. Asynchronous working-memory generation
The current reply is not blocked while the plugin generates a new summary in the background.
### 2. Persistent summary storage
Summaries are stored in OpenWebUI's shared database and reused across turns, instead of being regenerated from scratch every time.
### 3. Structure-aware trimming
The filter respects atomic message boundaries so native tool-calling history is not corrupted by compression.
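To make "atomic message boundaries" concrete, here is a minimal sketch (not the plugin's actual code) of keeping an assistant tool-call message in the same group as its tool results, so trimming can only ever drop whole groups:

```python
def atomic_groups(messages: list[dict]) -> list[list[dict]]:
    """Group messages so a tool-calling assistant message and its
    tool results always travel together. A simplified sketch; the
    real plugin handles more cases than this."""
    groups: list[list[dict]] = []
    for msg in messages:
        if msg["role"] == "tool" and groups and any(
            m.get("tool_calls") for m in groups[-1]
        ):
            groups[-1].append(msg)  # keep the result with its call
        else:
            groups.append([msg])
    return groups

def trim_oldest(messages: list[dict], max_messages: int) -> list[dict]:
    """Drop whole atomic groups from the front until under budget."""
    groups = atomic_groups(messages)
    while groups and sum(len(g) for g in groups) > max_messages:
        groups.pop(0)
    return [m for g in groups for m in g]
```

A naive per-message trim could cut between the `tool_calls` message and its `tool` reply, which many providers reject outright; group-level trimming makes that impossible.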
### 4. External chat reference summarization
New in `v1.5.0`: referenced chats can now be reused as cached summaries, injected directly if small enough, or summarized before injection if too large.
### 5. Mixed-script token estimation
The plugin now uses a much stronger multilingual token estimation path before falling back to exact counting, which helps reduce unnecessary expensive token calculations while staying much closer to real usage.
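A minimal illustration of mixed-script estimation: weight each character by a per-script tokens-per-character ratio instead of applying one global ratio. The code-point ranges and ratios below are illustrative, not the plugin's actual calibration:

```python
# Rough tokens-per-character ratios by Unicode block; illustrative only.
RATIOS = [
    ((0x4E00, 0x9FFF), 0.6),   # CJK ideographs
    ((0x3040, 0x30FF), 0.5),   # Japanese kana
    ((0xAC00, 0xD7AF), 0.6),   # Korean Hangul
    ((0x0400, 0x04FF), 0.4),   # Cyrillic
    ((0x0600, 0x06FF), 0.4),   # Arabic
    ((0x0E00, 0x0E7F), 0.5),   # Thai
]

def estimate_tokens(text: str) -> int:
    """Estimate token count by summing per-script character weights."""
    total = 0.0
    for ch in text:
        cp = ord(ch)
        for (lo, hi), ratio in RATIOS:
            if lo <= cp <= hi:
                total += ratio
                break
        else:
            total += 0.25  # Latin and everything else: ~4 chars/token
    return max(1, round(total))
```

A flat characters-per-token ratio tuned for English badly underestimates CJK text (often close to one token per character), which is exactly the failure mode this kind of weighting avoids.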
### 6. Real failure visibility
Important background summary failures are surfaced to the browser console and status messages instead of disappearing silently.
---
## Workflow Overview
This is the current high-level flow:
```mermaid
flowchart TD
A[Request enters inlet] --> B[Normalize tool IDs and optionally trim large tool outputs]
B --> C{Referenced chats attached?}
C -- No --> D[Load current chat summary if available]
C -- Yes --> E[Inspect each referenced chat]
E --> F{Existing cached summary?}
F -- Yes --> G[Reuse cached summary]
F -- No --> H{Fits direct budget?}
H -- Yes --> I[Inject full referenced chat text]
H -- No --> J[Prepare referenced-chat summary input]
J --> K{Referenced-chat summary call succeeds?}
K -- Yes --> L[Inject generated referenced summary]
K -- No --> M[Fallback to direct contextual injection]
G --> D
I --> D
L --> D
M --> D
D --> N[Build current-chat Head + Summary + Tail]
N --> O{Over max_context_tokens?}
O -- Yes --> P[Trim oldest atomic groups]
O -- No --> Q[Send final context to the model]
P --> Q
Q --> R[Model returns the reply]
R --> S[Outlet rebuilds the full history]
S --> T{Reached compression threshold?}
T -- No --> U[Finish]
T -- Yes --> V[Fit summary input to the summary model context]
V --> W{Background summary call succeeds?}
W -- Yes --> X[Save new chat summary and update status]
W -- No --> Y[Force browser-console error and show status hint]
```
This is why I consider the plugin “powerful” now: it is no longer solving a single problem. It is coordinating context reduction, summary persistence, tool safety, referenced-chat handling, and model-budget control inside one filter.
---
## New in v1.5.0
This release is important because it turns the plugin from “long-chat compression with strong tool safety” into something closer to a reusable context-management layer.
### External chat reference summaries
This is a new feature in `v1.5.0`, not just a small adjustment.
When a user references another chat, the plugin can:
- reuse an existing cached summary
- inject the full referenced chat if it is small enough
- generate a summary first if the referenced chat is too large
That means the filter can now carry relevant context across chats, not just across turns inside the same chat.
### Fast multilingual token estimation
Also new in `v1.5.0`.
The plugin no longer relies on a rough one-size-fits-all character ratio. It now estimates token usage with mixed-script heuristics that behave much better for:
- English
- Chinese
- Japanese
- Korean
- Cyrillic
- Arabic
- Thai
- mixed-language conversations
This matters because the plugin makes context decisions constantly. Better estimation means fewer unnecessary exact counts and fewer bad preflight assumptions.
### Stronger final-prompt budgeting
The summary path now fits the **real final summary request**, not just an intermediate estimate. That includes:
- prompt wrapper
- formatted conversation text
- previous summary
- reserved output budget
- safety margin
This directly improves reliability in the large old-chat cases that are hardest to handle.
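A sketch of that accounting: subtract the fixed costs from the summary model's context window, then keep as many recent conversation chunks as still fit. All names and numbers here are illustrative:

```python
def fit_summary_input(wrapper_tokens: int, prev_summary_tokens: int,
                      convo_chunks: list[int], model_context: int,
                      output_budget: int, safety_margin: int = 256):
    """Fit the final summary request into the summary model's context.
    convo_chunks are token counts per conversation segment, oldest
    first. Returns (kept chunk sizes, tokens used). Sketch only."""
    budget = (model_context - output_budget - safety_margin
              - wrapper_tokens - prev_summary_tokens)
    kept, used = [], 0
    for chunk in reversed(convo_chunks):  # prefer the most recent
        if used + chunk > budget:
            break
        kept.append(chunk)
        used += chunk
    return list(reversed(kept)), used
```

The key difference from an intermediate estimate is that the wrapper, previous summary, output reservation, and margin are all subtracted *before* any conversation text is admitted, so the real request cannot overflow.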
---
## Why It Feels Complete Now
I would describe the current plugin as “feature-complete for the main problem space,” because it now covers the major operational surfaces that matter in real usage:
- long plain-chat conversations
- multi-step coding threads
- native tool-calling conversations
- persistent summaries
- custom model thresholds
- background async generation
- external chat references
- multilingual token estimation
- failure surfacing for debugging
That does not mean it is finished forever. It means the plugin has crossed the line from a narrow experimental filter into a robust context-management system with enough breadth to support demanding OpenWebUI usage patterns.
---
## Scale and Engineering Depth
For people who care about implementation depth, this plugin is not small anymore.
Current code size:
- main plugin: **4,573 lines**
- focused test file: **1,037 lines**
- combined visible implementation + regression coverage: **5,610 lines**
Line count is not a quality metric by itself, but at this scale it does say something real:
- the plugin has grown well beyond a toy filter
- the behavior surface is large enough to require explicit regression testing
- the plugin now encodes a lot of edge-case handling that only shows up after repeated real-world usage
In other words: this is no longer “just summarize old messages.” It is a fairly serious stateful filter.
---
## Practical Benefits
If you use OpenWebUI heavily, the value is straightforward:
- lower token consumption in long chats
- better continuity across long-running sessions
- safer native tool-calling history
- fewer broken conversations after compression
- more stable summary generation on large histories
- better visibility when the provider rejects a summary request
- useful reuse of context from referenced chats
This plugin is especially valuable if you:
- regularly work in long coding chats
- use models with strict context budgets
- rely on native tool calling
- revisit old project chats
- want summaries to behave like working memory, not like lossy notes
---
## Installation
- OpenWebUI Community: <https://openwebui.com/posts/async_context_compression_b1655bc8>
- Source: <https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression>
If you want the full valve list, deployment notes, and troubleshooting details, the README is the best reference.
---
## Final Note
Do I think this plugin is powerful?
Yes, genuinely.
Not because it is large, but because it now solves the right combination of problems at once:
- cost control
- continuity
- structural safety
- async persistence
- cross-chat reuse
- operational debuggability
That combination is what makes it feel strong.
If you have been looking for a serious long-conversation memory/compression filter for OpenWebUI, `Async Context Compression` is now in that category.

[![](https://img.shields.io/badge/OpenWebUI%20%E7%A4%BE%E5%8C%BA-%E8%8E%B7%E5%8F%96%E6%8F%92%E4%BB%B6-blue?style=for-the-badge)](https://openwebui.com/posts/async_context_compression_b1655bc8)
# Async Context Compression: A Production-Oriented Working-Memory Filter for OpenWebUI
The problem with long conversations has never been just that they are expensive.
Once a chat grows long enough, you are usually left with two options, neither of them good:
- keep the full history and keep paying a heavy context cost
- crudely trim old messages and risk losing context, tool state, and key decisions
`Async Context Compression` exists to avoid that trade-off as much as possible.
It is not a simple "summarize the old messages" utility. It is a structure-aware, async, database-backed working-memory system for OpenWebUI. Its job is not merely to shorten the context, but to compress long conversations while preserving as much as possible of:
- conversational continuity
- tool-call state integrity
- prior summary progress
- referenced context from other chats
- diagnosability when something goes wrong
As of `v1.5.0`, I no longer think of it as a "convenient little filter" but as a context-management plugin with real completeness, capability, and engineering depth.
**[📖 Full README (Chinese)](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/README_CN.md)**
**[📝 v1.5.0 Release Notes (Chinese)](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/v1.5.0_CN.md)**
---
## Why This Plugin Exists
Real conversations in OpenWebUI are rarely just "the user asks, the model answers."
They often also involve:
- very long project-style threads
- multi-turn coding and debugging
- native tool calling
- multimodal messages
- different context windows across models
- referenced context from other chats
In that environment, "trim old messages by length" is simply not enough.
A filter that only trims messages by length or index can easily:
- corrupt native tool-calling history
- drop information that still affects the next reply
- break continuity in older chats
- make problems almost impossible to debug
- disguise upstream provider errors as vague internal failures
`Async Context Compression` starts from a stronger premise:
> history can be compressed, but conversation structure must not be treated as disposable and compressed away with it
What it really tries to keep is the state the next turn needs most:
- the current goal
- durable preferences
- recent progress
- tool results that are still valid
- error state
- continuity with the existing summary
- relevant context from other chats
---
## How It Differs from an Ordinary Summarizer
At this point the plugin combines several capabilities that usually live in separate systems:
### 1. Asynchronous working-memory generation
The user's current reply is never blocked by background summarization.
### 2. Persistent summary storage
Summaries are written to OpenWebUI's shared database and reused on later turns instead of being recomputed from scratch every time.
### 3. Structure-aware trimming
The trimming logic respects atomic message boundaries so native tool-calling history is not corrupted.
### 4. External chat reference summaries
A major addition in `v1.5.0`: referenced chats can now reuse a cached summary, be injected directly when small, or be summarized first when too large.
### 5. Multilingual token estimation
The plugin now carries much stronger token estimation for mixed-script text, which avoids many unnecessary exact counts while tracking real usage far better than the old crude character ratio.
### 6. Failure visibility
Critical background summary failures now show up in the browser console and status messages instead of quietly disappearing.
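As an illustration of how such a failure can be surfaced rather than swallowed, the sketch below emits a status event through OpenWebUI's `__event_emitter__` hook; the helper name and message text are illustrative:

```python
import asyncio

async def report_summary_failure(event_emitter, error: Exception) -> None:
    """Surface a background summary failure as a visible status event.
    The event shape follows OpenWebUI's __event_emitter__ convention;
    the description text is illustrative."""
    if event_emitter is None:
        return
    await event_emitter({
        "type": "status",
        "data": {
            "description": f"Summary generation failed: {error}",
            "done": True,
        },
    })

# Usage sketch with a stub emitter that just records emitted events.
events = []
async def stub(ev):
    events.append(ev)

asyncio.run(report_summary_failure(stub, RuntimeError("provider 429")))
```

Pairing a status event like this with a forced browser-console log means a rejected provider request leaves a visible trace instead of silently producing no summary.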
---
## Workflow Overview
Here is the current high-level flow:
```mermaid
flowchart TD
A[Request enters inlet] --> B[Normalize tool IDs and optionally trim large tool outputs]
B --> C{Referenced chats attached?}
C -- No --> D[Load current chat summary if available]
C -- Yes --> E[Inspect each referenced chat]
E --> F{Existing cached summary?}
F -- Yes --> G[Reuse cached summary]
F -- No --> H{Fits direct budget?}
H -- Yes --> I[Inject full referenced chat text]
H -- No --> J[Prepare referenced-chat summary input]
J --> K{Referenced-chat summary call succeeds?}
K -- Yes --> L[Inject generated referenced summary]
K -- No --> M[Fallback to direct contextual injection]
G --> D
I --> D
L --> D
M --> D
D --> N[Build current-chat Head + Summary + Tail]
N --> O{Over max_context_tokens?}
O -- Yes --> P[Trim oldest atomic groups]
O -- No --> Q[Send final context to the model]
P --> Q
Q --> R[Model returns the reply]
R --> S[Outlet rebuilds the full history]
S --> T{Reached compression threshold?}
T -- No --> U[Finish]
T -- Yes --> V[Fit summary input to the summary model context]
V --> W{Background summary call succeeds?}
W -- Yes --> X[Save new chat summary and update status]
W -- No --> Y[Force browser-console error and show status hint]
```
This is also why I now consider it "powerful": it no longer solves a single problem, but coordinates all of the following inside one filter:
- context compression
- summary reuse
- tool-calling safety
- referenced-chat context
- model budget control
---
## Why v1.5.0 Matters
This release matters because it pushes the plugin from a "long-conversation compressor" toward something closer to a context-management layer.
### External chat reference summaries
This is a new `v1.5.0` feature, not a minor tweak.
When a user references another chat, the plugin can now:
- reuse an existing cached summary directly
- inject the full content if the chat is small enough
- generate a summary first if the chat is too large
This means it no longer only carries context across turns; it can now carry relevant context across chats.
### Fast multilingual token estimation
Also new in `v1.5.0`.
Instead of relying on a crude uniform character ratio, the plugin now estimates token usage in a way better suited to mixed-script text, which especially matters for:
- English
- Chinese
- Japanese
- Korean
- Cyrillic
- Arabic
- Thai
- mixed Chinese/English and other multilingual conversations
This matters because a context-management plugin makes budget decisions constantly. Better estimates mean fewer pointless exact counts and fewer wrong calls at the preflight stage.
### Stronger final-request budgeting
The summary path now fits the **real final summary request** rather than just an intermediate estimate. It accounts for all of:
- the prompt wrapper
- the formatted conversation text
- the previous summary
- the reserved output budget
- a safety margin
This is especially critical for old chats, large chats, and the hardest edge cases.
---
## Why I Consider It Complete Now
If you lay out the problem space, I would say the plugin now covers the main scenarios fairly completely:
- very long plain chats
- multi-turn coding and debugging threads
- native tool calling
- persistent summary history
- custom model thresholds
- async background summarization
- external chat references
- multilingual token estimation
- debugging visibility
That does not mean it will never iterate again; it means it has moved past the "narrow experimental feature" stage into something much closer to a general context-management system.
---
## Code Size and Engineering Depth
If you care about implementation depth, this plugin is no longer small.
Current code size:
- main plugin file: **4,573 lines**
- focused test file: **1,037 lines**
- visible implementation + regression tests combined: **5,610 lines**
Line count alone does not equal quality, but at this scale it does say a few real things:
- this is no longer a toy-grade filter
- the behavior surface is large enough that dedicated regression tests are required to hold it together
- it has accumulated a lot of edge-case handling that only surfaces through real-world use
In other words, what it does now is clearly no longer as simple as "summarize the old messages."
---
## Practical Value
If you are a heavy OpenWebUI user, the value is quite direct:
- lower token spend in long chats
- better continuity in long sessions
- safer native tool calling
- conversations that are far less likely to break after compression
- more stable summary generation on large histories
- real errors are easier to see when a provider rejects a summary request
- useful context from other chats can be reused
It is a particularly good fit if you:
- regularly run long coding chats
- use models with tight context windows
- rely on native tool calling
- often revisit old project chats
- want summaries that behave like working memory rather than lossy brief notes
---
## Installation
- OpenWebUI Community: <https://openwebui.com/posts/async_context_compression_b1655bc8>
- Source: <https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression>
For the full valve list, deployment notes, and troubleshooting, the README remains the most complete reference.
---
## One Last Word
You ask whether this plugin is powerful.
My answer: **yes, genuinely. And at this point it is not "looks powerful" but "covers the problem space fairly completely" powerful.**
Not because it has a lot of code, but because it now solves a set of genuinely related problems at once:
- cost control
- continuity
- structural safety
- async persistence
- cross-chat context reuse
- diagnosability when things go wrong
It is precisely because these things hold together at the same time that it now feels like a truly mature long-conversation context-management plugin.