docs(async-context-compression): add community post drafts

This commit is contained in:
fujie
2026-03-14 16:29:15 +08:00
parent 158792d82f
commit 8f8147828b
2 changed files with 552 additions and 0 deletions

[![](https://img.shields.io/badge/OpenWebUI%20Community-Get%20Plugin-blue?style=for-the-badge)](https://openwebui.com/posts/async_context_compression_b1655bc8)
# Async Context Compression: A Production-Scale Working-Memory Filter for OpenWebUI
Long chats do not just get expensive. They also get fragile.
Once a conversation grows large enough, you usually have to choose between two bad options:
- keep the full history and pay a heavy context cost
- trim aggressively and risk losing continuity, tool state, or important prior decisions
`Async Context Compression` is built to avoid that tradeoff.
It is not a simple “summarize old messages” utility. It is a structure-aware, async, database-backed working-memory system for OpenWebUI that can compress long conversations while preserving conversational continuity, tool-calling integrity, and, as of `v1.5.0`, referenced-chat context injection as well.
The plugin has now matured enough to be described as a serious, high-capability filter rather than a small convenience add-on.
**[📖 Full README](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/README.md)**
**[📝 v1.5.0 Release Notes](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/v1.5.0.md)**
---
## Why This Plugin Exists
OpenWebUI conversations often contain much more than plain chat:
- long-running planning threads
- coding sessions with repeated tool use
- model-specific context limits
- multimodal messages
- external referenced chats
- custom models with different context windows
A naive compression strategy is not enough in those environments.
If a filter only drops earlier messages based on length, it can:
- break native tool-calling chains
- lose critical task state
- destroy continuity in old chats
- make debugging impossible
- hide important provider-side failures
`Async Context Compression` is designed around a stronger premise:
> compress history without treating conversation structure as disposable
That means it tries to preserve what actually matters for the next turn:
- the current goal
- durable user preferences
- recent progress
- tool outputs that still matter
- error state
- summary continuity
- referenced context from other chats
---
## What Makes It Different
This plugin now combines several capabilities that are usually split across separate systems:
### 1. Asynchronous working-memory generation
The current reply is not blocked while the plugin generates a new summary in the background.
### 2. Persistent summary storage
Summaries are stored in OpenWebUI's shared database and reused across turns, instead of being regenerated from scratch every time.
### 3. Structure-aware trimming
The filter respects atomic message boundaries so native tool-calling history is not corrupted by compression.
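To make "atomic message boundaries" concrete, here is a minimal sketch (not the plugin's actual code) of keeping an assistant tool-call message in the same group as its tool results, so trimming can only ever drop whole groups:

```python
def atomic_groups(messages: list[dict]) -> list[list[dict]]:
    """Group messages so a tool-calling assistant message and its
    tool results always travel together. A simplified sketch; the
    real plugin handles more cases than this."""
    groups: list[list[dict]] = []
    for msg in messages:
        if msg["role"] == "tool" and groups and any(
            m.get("tool_calls") for m in groups[-1]
        ):
            groups[-1].append(msg)  # keep the result with its call
        else:
            groups.append([msg])
    return groups

def trim_oldest(messages: list[dict], max_messages: int) -> list[dict]:
    """Drop whole atomic groups from the front until under budget."""
    groups = atomic_groups(messages)
    while groups and sum(len(g) for g in groups) > max_messages:
        groups.pop(0)
    return [m for g in groups for m in g]
```

A naive per-message trim could cut between the `tool_calls` message and its `tool` reply, which many providers reject outright; group-level trimming makes that impossible.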
### 4. External chat reference summarization
New in `v1.5.0`: referenced chats can now be reused as cached summaries, injected directly if small enough, or summarized before injection if too large.
### 5. Mixed-script token estimation
The plugin now uses a much stronger multilingual token estimation path before falling back to exact counting, which helps reduce unnecessary expensive token calculations while staying much closer to real usage.
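A minimal illustration of mixed-script estimation: weight each character by a per-script tokens-per-character ratio instead of applying one global ratio. The code-point ranges and ratios below are illustrative, not the plugin's actual calibration:

```python
# Rough tokens-per-character ratios by Unicode block; illustrative only.
RATIOS = [
    ((0x4E00, 0x9FFF), 0.6),   # CJK ideographs
    ((0x3040, 0x30FF), 0.5),   # Japanese kana
    ((0xAC00, 0xD7AF), 0.6),   # Korean Hangul
    ((0x0400, 0x04FF), 0.4),   # Cyrillic
    ((0x0600, 0x06FF), 0.4),   # Arabic
    ((0x0E00, 0x0E7F), 0.5),   # Thai
]

def estimate_tokens(text: str) -> int:
    """Estimate token count by summing per-script character weights."""
    total = 0.0
    for ch in text:
        cp = ord(ch)
        for (lo, hi), ratio in RATIOS:
            if lo <= cp <= hi:
                total += ratio
                break
        else:
            total += 0.25  # Latin and everything else: ~4 chars/token
    return max(1, round(total))
```

A flat characters-per-token ratio tuned for English badly underestimates CJK text (often close to one token per character), which is exactly the failure mode this kind of weighting avoids.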
### 6. Real failure visibility
Important background summary failures are surfaced to the browser console and status messages instead of disappearing silently.
---
## Workflow Overview
This is the current high-level flow:
```mermaid
flowchart TD
A[Request enters inlet] --> B[Normalize tool IDs and optionally trim large tool outputs]
B --> C{Referenced chats attached?}
C -- No --> D[Load current chat summary if available]
C -- Yes --> E[Inspect each referenced chat]
E --> F{Existing cached summary?}
F -- Yes --> G[Reuse cached summary]
F -- No --> H{Fits direct budget?}
H -- Yes --> I[Inject full referenced chat text]
H -- No --> J[Prepare referenced-chat summary input]
J --> K{Referenced-chat summary call succeeds?}
K -- Yes --> L[Inject generated referenced summary]
K -- No --> M[Fallback to direct contextual injection]
G --> D
I --> D
L --> D
M --> D
D --> N[Build current-chat Head + Summary + Tail]
N --> O{Over max_context_tokens?}
O -- Yes --> P[Trim oldest atomic groups]
O -- No --> Q[Send final context to the model]
P --> Q
Q --> R[Model returns the reply]
R --> S[Outlet rebuilds the full history]
S --> T{Reached compression threshold?}
T -- No --> U[Finish]
T -- Yes --> V[Fit summary input to the summary model context]
V --> W{Background summary call succeeds?}
W -- Yes --> X[Save new chat summary and update status]
W -- No --> Y[Force browser-console error and show status hint]
```
This is why I consider the plugin “powerful” now: it is no longer solving a single problem. It is coordinating context reduction, summary persistence, tool safety, referenced-chat handling, and model-budget control inside one filter.
---
## New in v1.5.0
This release is important because it turns the plugin from “long-chat compression with strong tool safety” into something closer to a reusable context-management layer.
### External chat reference summaries
This is a new feature in `v1.5.0`, not just a small adjustment.
When a user references another chat, the plugin can:
- reuse an existing cached summary
- inject the full referenced chat if it is small enough
- generate a summary first if the referenced chat is too large
That means the filter can now carry relevant context across chats, not just across turns inside the same chat.
### Fast multilingual token estimation
Also new in `v1.5.0`.
The plugin no longer relies on a rough one-size-fits-all character ratio. It now estimates token usage with mixed-script heuristics that behave much better for:
- English
- Chinese
- Japanese
- Korean
- Cyrillic
- Arabic
- Thai
- mixed-language conversations
This matters because the plugin makes context decisions constantly. Better estimation means fewer unnecessary exact counts and fewer bad preflight assumptions.
### Stronger final-prompt budgeting
The summary path now fits the **real final summary request**, not just an intermediate estimate. That includes:
- prompt wrapper
- formatted conversation text
- previous summary
- reserved output budget
- safety margin
This directly improves reliability in the large old-chat cases that are hardest to handle.
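A sketch of that accounting: subtract the fixed costs from the summary model's context window, then keep as many recent conversation chunks as still fit. All names and numbers here are illustrative:

```python
def fit_summary_input(wrapper_tokens: int, prev_summary_tokens: int,
                      convo_chunks: list[int], model_context: int,
                      output_budget: int, safety_margin: int = 256):
    """Fit the final summary request into the summary model's context.
    convo_chunks are token counts per conversation segment, oldest
    first. Returns (kept chunk sizes, tokens used). Sketch only."""
    budget = (model_context - output_budget - safety_margin
              - wrapper_tokens - prev_summary_tokens)
    kept, used = [], 0
    for chunk in reversed(convo_chunks):  # prefer the most recent
        if used + chunk > budget:
            break
        kept.append(chunk)
        used += chunk
    return list(reversed(kept)), used
```

The key difference from an intermediate estimate is that the wrapper, previous summary, output reservation, and margin are all subtracted *before* any conversation text is admitted, so the real request cannot overflow.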
---
## Why It Feels Complete Now
I would describe the current plugin as “feature-complete for the main problem space,” because it now covers the major operational surfaces that matter in real usage:
- long plain-chat conversations
- multi-step coding threads
- native tool-calling conversations
- persistent summaries
- custom model thresholds
- background async generation
- external chat references
- multilingual token estimation
- failure surfacing for debugging
That does not mean it is finished forever. It means the plugin has crossed the line from a narrow experimental filter into a robust context-management system with enough breadth to support demanding OpenWebUI usage patterns.
---
## Scale and Engineering Depth
For people who care about implementation depth, this plugin is not small anymore.
Current code size:
- main plugin: **4,573 lines**
- focused test file: **1,037 lines**
- combined visible implementation + regression coverage: **5,610 lines**
Line count is not a quality metric by itself, but at this scale it does say something real:
- the plugin has grown well beyond a toy filter
- the behavior surface is large enough to require explicit regression testing
- the plugin now encodes a lot of edge-case handling that only shows up after repeated real-world usage
In other words: this is no longer “just summarize old messages.” It is a fairly serious stateful filter.
---
## Practical Benefits
If you use OpenWebUI heavily, the value is straightforward:
- lower token consumption in long chats
- better continuity across long-running sessions
- safer native tool-calling history
- fewer broken conversations after compression
- more stable summary generation on large histories
- better visibility when the provider rejects a summary request
- useful reuse of context from referenced chats
This plugin is especially valuable if you:
- regularly work in long coding chats
- use models with strict context budgets
- rely on native tool calling
- revisit old project chats
- want summaries to behave like working memory, not like lossy notes
---
## Installation
- OpenWebUI Community: <https://openwebui.com/posts/async_context_compression_b1655bc8>
- Source: <https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression>
If you want the full valve list, deployment notes, and troubleshooting details, the README is the best reference.
---
## Final Note
Do I think this plugin is powerful?
Yes, genuinely.
Not because it is large, but because it now solves the right combination of problems at once:
- cost control
- continuity
- structural safety
- async persistence
- cross-chat reuse
- operational debuggability
That combination is what makes it feel strong.
If you have been looking for a serious long-conversation memory/compression filter for OpenWebUI, `Async Context Compression` is now in that category.

[![](https://img.shields.io/badge/OpenWebUI%20%E7%A4%BE%E5%8C%BA-%E8%8E%B7%E5%8F%96%E6%8F%92%E4%BB%B6-blue?style=for-the-badge)](https://openwebui.com/posts/async_context_compression_b1655bc8)
# Async Context Compression: A Production-Oriented Working-Memory Filter for OpenWebUI
The problem with long conversations has never been just that they are expensive.
Once a chat grows long enough, you are usually left with two options, neither of them good:
- keep the full history and keep paying a heavy context cost
- crudely trim old messages and risk losing context, tool state, and key decisions
`Async Context Compression` exists to avoid that trade-off as much as possible.
It is not a simple "summarize the old messages" utility. It is a structure-aware, async, database-backed working-memory system for OpenWebUI. Its job is not merely to shorten the context, but to compress long conversations while preserving as much as possible of:
- conversational continuity
- tool-call state integrity
- prior summary progress
- referenced context from other chats
- diagnosability when something goes wrong
As of `v1.5.0`, I no longer think of it as a "convenient little filter" but as a context-management plugin with real completeness, capability, and engineering depth.
**[📖 Full README (Chinese)](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/README_CN.md)**
**[📝 v1.5.0 Release Notes (Chinese)](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/v1.5.0_CN.md)**
---
## Why This Plugin Exists
Real conversations in OpenWebUI are rarely just "the user asks, the model answers."
They often also involve:
- very long project-style threads
- multi-turn coding and debugging
- native tool calling
- multimodal messages
- different context windows across models
- referenced context from other chats
In that environment, "trim old messages by length" is simply not enough.
A filter that only trims messages by length or index can easily:
- corrupt native tool-calling history
- drop information that still affects the next reply
- break continuity in older chats
- make problems almost impossible to debug
- disguise upstream provider errors as vague internal failures
`Async Context Compression` starts from a stronger premise:
> history can be compressed, but conversation structure must not be treated as disposable and compressed away with it
What it really tries to keep is the state the next turn needs most:
- the current goal
- durable preferences
- recent progress
- tool results that are still valid
- error state
- continuity with the existing summary
- relevant context from other chats
---
## How It Differs from an Ordinary Summarizer
At this point the plugin combines several capabilities that usually live in separate systems:
### 1. Asynchronous working-memory generation
The user's current reply is never blocked by background summarization.
### 2. Persistent summary storage
Summaries are written to OpenWebUI's shared database and reused on later turns instead of being recomputed from scratch every time.
### 3. Structure-aware trimming
The trimming logic respects atomic message boundaries so native tool-calling history is not corrupted.
### 4. External chat reference summaries
A major addition in `v1.5.0`: referenced chats can now reuse a cached summary, be injected directly when small, or be summarized first when too large.
### 5. Multilingual token estimation
The plugin now carries much stronger token estimation for mixed-script text, which avoids many unnecessary exact counts while tracking real usage far better than the old crude character ratio.
### 6. Failure visibility
Critical background summary failures now show up in the browser console and status messages instead of quietly disappearing.
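As an illustration of how such a failure can be surfaced rather than swallowed, the sketch below emits a status event through OpenWebUI's `__event_emitter__` hook; the helper name and message text are illustrative:

```python
import asyncio

async def report_summary_failure(event_emitter, error: Exception) -> None:
    """Surface a background summary failure as a visible status event.
    The event shape follows OpenWebUI's __event_emitter__ convention;
    the description text is illustrative."""
    if event_emitter is None:
        return
    await event_emitter({
        "type": "status",
        "data": {
            "description": f"Summary generation failed: {error}",
            "done": True,
        },
    })

# Usage sketch with a stub emitter that just records emitted events.
events = []
async def stub(ev):
    events.append(ev)

asyncio.run(report_summary_failure(stub, RuntimeError("provider 429")))
```

Pairing a status event like this with a forced browser-console log means a rejected provider request leaves a visible trace instead of silently producing no summary.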
---
## Workflow Overview
Here is the current high-level flow:
```mermaid
flowchart TD
A[Request enters inlet] --> B[Normalize tool IDs and optionally trim large tool outputs]
B --> C{Referenced chats attached?}
C -- No --> D[Load current chat summary if available]
C -- Yes --> E[Inspect each referenced chat]
E --> F{Existing cached summary?}
F -- Yes --> G[Reuse cached summary]
F -- No --> H{Fits direct budget?}
H -- Yes --> I[Inject full referenced chat text]
H -- No --> J[Prepare referenced-chat summary input]
J --> K{Referenced-chat summary call succeeds?}
K -- Yes --> L[Inject generated referenced summary]
K -- No --> M[Fallback to direct contextual injection]
G --> D
I --> D
L --> D
M --> D
D --> N[Build current-chat Head + Summary + Tail]
N --> O{Over max_context_tokens?}
O -- Yes --> P[Trim oldest atomic groups]
O -- No --> Q[Send final context to the model]
P --> Q
Q --> R[Model returns the reply]
R --> S[Outlet rebuilds the full history]
S --> T{Reached compression threshold?}
T -- No --> U[Finish]
T -- Yes --> V[Fit summary input to the summary model context]
V --> W{Background summary call succeeds?}
W -- Yes --> X[Save new chat summary and update status]
W -- No --> Y[Force browser-console error and show status hint]
```
This is also why I now consider it "powerful": it no longer solves a single problem, but coordinates all of the following inside one filter:
- context compression
- summary reuse
- tool-calling safety
- referenced-chat context
- model budget control
---
## Why v1.5.0 Matters
This release matters because it pushes the plugin from a "long-conversation compressor" toward something closer to a context-management layer.
### External chat reference summaries
This is a new `v1.5.0` feature, not a minor tweak.
When a user references another chat, the plugin can now:
- reuse an existing cached summary directly
- inject the full content if the chat is small enough
- generate a summary first if the chat is too large
This means it no longer only carries context across turns; it can now carry relevant context across chats.
### Fast multilingual token estimation
Also new in `v1.5.0`.
Instead of relying on a crude uniform character ratio, the plugin now estimates token usage in a way better suited to mixed-script text, which especially matters for:
- English
- Chinese
- Japanese
- Korean
- Cyrillic
- Arabic
- Thai
- mixed Chinese/English and other multilingual conversations
This matters because a context-management plugin makes budget decisions constantly. Better estimates mean fewer pointless exact counts and fewer wrong calls at the preflight stage.
### Stronger final-request budgeting
The summary path now fits the **real final summary request** rather than just an intermediate estimate. It accounts for all of:
- the prompt wrapper
- the formatted conversation text
- the previous summary
- the reserved output budget
- a safety margin
This is especially critical for old chats, large chats, and the hardest edge cases.
---
## Why I Consider It Complete Now
If you lay out the problem space, I would say the plugin now covers the main scenarios fairly completely:
- very long plain chats
- multi-turn coding and debugging threads
- native tool calling
- persistent summary history
- custom model thresholds
- async background summarization
- external chat references
- multilingual token estimation
- debugging visibility
That does not mean it will never iterate again; it means it has moved past the "narrow experimental feature" stage into something much closer to a general context-management system.
---
## Code Size and Engineering Depth
If you care about implementation depth, this plugin is no longer small.
Current code size:
- main plugin file: **4,573 lines**
- focused test file: **1,037 lines**
- visible implementation + regression tests combined: **5,610 lines**
Line count alone does not equal quality, but at this scale it does say a few real things:
- this is no longer a toy-grade filter
- the behavior surface is large enough that dedicated regression tests are required to hold it together
- it has accumulated a lot of edge-case handling that only surfaces through real-world use
In other words, what it does now is clearly no longer as simple as "summarize the old messages."
---
## Practical Value
If you are a heavy OpenWebUI user, the value is quite direct:
- lower token spend in long chats
- better continuity in long sessions
- safer native tool calling
- conversations that are far less likely to break after compression
- more stable summary generation on large histories
- real errors are easier to see when a provider rejects a summary request
- useful context from other chats can be reused
It is a particularly good fit if you:
- regularly run long coding chats
- use models with tight context windows
- rely on native tool calling
- often revisit old project chats
- want summaries that behave like working memory rather than lossy brief notes
---
## Installation
- OpenWebUI Community: <https://openwebui.com/posts/async_context_compression_b1655bc8>
- Source: <https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression>
For the full valve list, deployment notes, and troubleshooting, the README remains the most complete reference.
---
## One Last Word
You ask whether this plugin is powerful.
My answer: **yes, genuinely. And at this point it is not "looks powerful" but "covers the problem space fairly completely" powerful.**
Not because it has a lot of code, but because it now solves a set of genuinely related problems at once:
- cost control
- continuity
- structural safety
- async persistence
- cross-chat context reuse
- diagnosability when things go wrong
It is precisely because these things hold together at the same time that it now feels like a truly mature long-conversation context-management plugin.