docs(async-context-compression): add community post drafts

2026-03-14 16:29:15 +08:00
parent 158792d82f
commit 8f8147828b
2 changed files with 552 additions and 0 deletions
--- a/plugins/filters/async-context-compression/community_post.md
+++ b/plugins/filters/async-context-compression/community_post.md
@@ -0,0 +1,270 @@
 [![](https://img.shields.io/badge/OpenWebUI%20Community-Get%20Plugin-blue?style=for-the-badge)](https://openwebui.com/posts/async_context_compression_b1655bc8)
 # Async Context Compression: A Production-Scale Working-Memory Filter for OpenWebUI
 Long chats do not just get expensive. They also get fragile.
 Once a conversation grows large enough, you usually have to choose between two bad options:
 - keep the full history and pay a heavy context cost
 - trim aggressively and risk losing continuity, tool state, or important prior decisions
 `Async Context Compression` is built to avoid that tradeoff.
 It is not a simple “summarize old messages” utility. It is a structure-aware, async, database-backed working-memory system for OpenWebUI that can compress long conversations while preserving conversational continuity, tool-calling integrity, and now, as of `v1.5.0`, referenced-chat context injection as well.
 This plugin has now reached the point where it feels complete enough to be described as a serious, high-capability filter rather than a small convenience add-on.
 **[📖 Full README](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/README.md)**  
 **[📝 v1.5.0 Release Notes](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/v1.5.0.md)**
 ---
 ## Why This Plugin Exists
 OpenWebUI conversations often contain much more than plain chat:
 - long-running planning threads
 - coding sessions with repeated tool use
 - model-specific context limits
 - multimodal messages
 - external referenced chats
 - custom models with different context windows
 A naive compression strategy is not enough in those environments.
 If a filter only drops earlier messages based on length, it can:
 - break native tool-calling chains
 - lose critical task state
 - destroy continuity in old chats
 - make debugging impossible
 - hide important provider-side failures
 `Async Context Compression` is designed around a stronger premise:
 > compress history without treating conversation structure as disposable
 That means it tries to preserve what actually matters for the next turn:
 - the current goal
 - durable user preferences
 - recent progress
 - tool outputs that still matter
 - error state
 - summary continuity
 - referenced context from other chats
 ---
 ## What Makes It Different
 This plugin now combines several capabilities that are usually split across separate systems:
 ### 1. Asynchronous working-memory generation
 The current reply is not blocked while the plugin generates a new summary in the background.
 ### 2. Persistent summary storage
 Summaries are stored in OpenWebUI's shared database and reused across turns, instead of being regenerated from scratch every time.
 ### 3. Structure-aware trimming
 The filter respects atomic message boundaries so native tool-calling history is not corrupted by compression.
 ### 4. External chat reference summarization
 New in `v1.5.0`: referenced chats can now be reused as cached summaries, injected directly if small enough, or summarized before injection if too large.
 ### 5. Mixed-script token estimation
 The plugin now uses a much stronger multilingual token estimation path before falling back to exact counting, which helps reduce unnecessary expensive token calculations while staying much closer to real usage.
 ### 6. Real failure visibility
 Important background summary failures are surfaced to the browser console and status messages instead of disappearing silently.
 ---
 ## Workflow Overview
 This is the current high-level flow:
 ```mermaid
 flowchart TD
    A[Request enters inlet] --> B[Normalize tool IDs and optionally trim large tool outputs]
    B --> C{Referenced chats attached?}
    C -- No --> D[Load current chat summary if available]
    C -- Yes --> E[Inspect each referenced chat]
    E --> F{Existing cached summary?}
    F -- Yes --> G[Reuse cached summary]
    F -- No --> H{Fits direct budget?}
    H -- Yes --> I[Inject full referenced chat text]
    H -- No --> J[Prepare referenced-chat summary input]
    J --> K{Referenced-chat summary call succeeds?}
    K -- Yes --> L[Inject generated referenced summary]
    K -- No --> M[Fallback to direct contextual injection]
    G --> D
    I --> D
    L --> D
    M --> D
    D --> N[Build current-chat Head + Summary + Tail]
    N --> O{Over max_context_tokens?}
    O -- Yes --> P[Trim oldest atomic groups]
    O -- No --> Q[Send final context to the model]
    P --> Q
    Q --> R[Model returns the reply]
    R --> S[Outlet rebuilds the full history]
    S --> T{Reached compression threshold?}
    T -- No --> U[Finish]
    T -- Yes --> V[Fit summary input to the summary model context]
    V --> W{Background summary call succeeds?}
    W -- Yes --> X[Save new chat summary and update status]
    W -- No --> Y[Force browser-console error and show status hint]
 ```
 This is why I consider the plugin “powerful” now: it is no longer solving a single problem. It is coordinating context reduction, summary persistence, tool safety, referenced-chat handling, and model-budget control inside one filter.
 ---
 ## New in v1.5.0
 This release is important because it turns the plugin from “long-chat compression with strong tool safety” into something closer to a reusable context-management layer.
 ### External chat reference summaries
 This is a new feature in `v1.5.0`, not just a small adjustment.
 When a user references another chat:
 - the plugin can reuse an existing cached summary
 - inject the full referenced chat if it is small enough
 - or generate a summary first if the referenced chat is too large
 That means the filter can now carry relevant context across chats, not just across turns inside the same chat.
 ### Fast multilingual token estimation
 Also new in `v1.5.0`.
 The plugin no longer relies on a rough one-size-fits-all character ratio. It now estimates token usage with mixed-script heuristics that behave much better for:
 - English
 - Chinese
 - Japanese
 - Korean
 - Cyrillic
 - Arabic
 - Thai
 - mixed-language conversations
 This matters because the plugin makes context decisions constantly. Better estimation means fewer unnecessary exact counts and fewer bad preflight assumptions.
 ### Stronger final-prompt budgeting
 The summary path now fits the **real final summary request**, not just an intermediate estimate. That includes:
 - prompt wrapper
 - formatted conversation text
 - previous summary
 - reserved output budget
 - safety margin
 This directly improves reliability in the large old-chat cases that are hardest to handle.
 ---
 ## Why It Feels Complete Now
 I would describe the current plugin as “feature-complete for the main problem space,” because it now covers the major operational surfaces that matter in real usage:
 - long plain-chat conversations
 - multi-step coding threads
 - native tool-calling conversations
 - persistent summaries
 - custom model thresholds
 - background async generation
 - external chat references
 - multilingual token estimation
 - failure surfacing for debugging
 That does not mean it is finished forever. It means the plugin has crossed the line from a narrow experimental filter into a robust context-management system with enough breadth to support demanding OpenWebUI usage patterns.
 ---
 ## Scale and Engineering Depth
 For people who care about implementation depth, this plugin is not small anymore.
 Current code size:
 - main plugin: **4,573 lines**
 - focused test file: **1,037 lines**
 - combined visible implementation + regression coverage: **5,610 lines**
 Line count is not a quality metric by itself, but at this scale it does say something real:
 - the plugin has grown well beyond a toy filter
 - the behavior surface is large enough to require explicit regression testing
 - the plugin now encodes a lot of edge-case handling that only shows up after repeated real-world usage
 In other words: this is no longer “just summarize old messages.” It is a fairly serious stateful filter.
 ---
 ## Practical Benefits
 If you use OpenWebUI heavily, the value is straightforward:
 - lower token consumption in long chats
 - better continuity across long-running sessions
 - safer native tool-calling history
 - fewer broken conversations after compression
 - more stable summary generation on large histories
 - better visibility when the provider rejects a summary request
 - useful reuse of context from referenced chats
 This plugin is especially valuable if you:
 - regularly work in long coding chats
 - use models with strict context budgets
 - rely on native tool calling
 - revisit old project chats
 - want summaries to behave like working memory, not like lossy notes
 ---
 ## Installation
 - OpenWebUI Community: <https://openwebui.com/posts/async_context_compression_b1655bc8>
 - Source: <https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression>
 If you want the full valve list, deployment notes, and troubleshooting details, the README is the best reference.
 ---
 ## Final Note
 Do I think this plugin is powerful?
 Yes, genuinely.
 Not because it is large, but because it now solves the right combination of problems at once:
 - cost control
 - continuity
 - structural safety
 - async persistence
 - cross-chat reuse
 - operational debuggability
 That combination is what makes it feel strong.
 If you have been looking for a serious long-conversation memory/compression filter for OpenWebUI, `Async Context Compression` is now in that category.
--- a/plugins/filters/async-context-compression/community_post_CN.md
+++ b/plugins/filters/async-context-compression/community_post_CN.md
@@ -0,0 +1,282 @@
 [![](https://img.shields.io/badge/OpenWebUI%20%E7%A4%BE%E5%8C%BA-%E8%8E%B7%E5%8F%96%E6%8F%92%E4%BB%B6-blue?style=for-the-badge)](https://openwebui.com/posts/async_context_compression_b1655bc8)
 # Async Context Compression：一个面向生产场景的 OpenWebUI 工作记忆过滤器
 长对话的问题，从来不只是“贵”。
 当聊天足够长时，通常只剩下两个都不太好的选择：
 - 保留完整历史，继续承担很高的上下文成本
 - 粗暴裁剪旧消息，但冒着丢失上下文、工具状态和关键决策的风险
 `Async Context Compression` 的目标，就是尽量避免这个二选一。
 它不是一个简单的“把老消息总结一下”的小工具，而是一个带有结构感知、异步摘要、数据库持久化能力的 OpenWebUI 工作记忆系统。它的任务不是单纯缩短上下文，而是在压缩长对话的同时，尽量保留：
 - 对话连续性
 - 工具调用状态完整性
 - 历史摘要进度
 - 跨聊天引用上下文
 - 出错时的可诊断性
 到 `v1.5.0` 这个阶段，我认为它已经不再只是一个“方便的小过滤器”，而是一个足够完整、足够强、也足够有工程深度的上下文管理插件。
 **[📖 完整 README](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/README_CN.md)**  
 **[📝 v1.5.0 发布说明](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/v1.5.0_CN.md)**
 ---
 ## 为什么会有这个插件
 OpenWebUI 里的真实对话，通常并不只是“用户问一句，模型答一句”。
 它常常还包含：
 - 很长的项目型对话
 - 多轮编码与调试
 - 原生工具调用
 - 多模态消息
 - 不同模型上下文窗口差异
 - 其他聊天的引用上下文
 在这种环境里，单纯靠“按长度裁掉旧消息”其实不够。
 如果一个过滤器只会按长度或索引裁剪消息，它很容易：
 - 把原生 tool-calling 历史裁坏
 - 丢掉仍然会影响下一轮回复的关键信息
 - 在老聊天里破坏连续性
 - 出问题时几乎无法排查
 - 把上游 provider 报错伪装成模糊的内部错误
 `Async Context Compression` 的核心思路更强一些：
 > 可以压缩历史，但不能把“对话结构”当成无关紧要的东西一起压掉
 它真正想保留的是下一轮最需要的状态：
 - 当前目标
 - 持久偏好
 - 最近进展
 - 仍然有效的工具结果
 - 错误状态
 - 已有摘要的连续性
 - 来自其他聊天的相关上下文
 ---
 ## 它和普通摘要插件有什么不同
 现在这个插件，实际上已经把几个通常要分散在不同系统里的能力组合到了一起：
 ### 1. 异步工作记忆生成
 用户当前这次回复不会被后台摘要阻塞。
 ### 2. 持久化摘要存储
 摘要会写入 OpenWebUI 共享数据库，并在后续轮次中复用，而不是每次都从头重算。
 ### 3. 结构感知裁剪
 裁剪逻辑会尊重原子消息边界，避免把原生 tool-calling 历史裁坏。
 ### 4. 外部聊天引用摘要
 这是 `v1.5.0` 新增的重要能力：被引用聊天现在可以直接复用缓存摘要、在小体量时直接注入、或者在过大时先生成摘要再注入。
 ### 5. 多语言 Token 预估
 插件现在具备更强的多脚本文本 Token 预估逻辑，在很多情况下可以减少不必要的精确计数，同时明显比旧的粗略字符比值更贴近真实用量。
 ### 6. 失败可见性
 关键的后台摘要失败现在会出现在浏览器控制台和状态提示里，不再悄悄消失。
 ---
 ## 工作流总览
 下面是当前的高层流程：
 ```mermaid
 flowchart TD
    A[Request enters inlet] --> B[Normalize tool IDs and optionally trim large tool outputs]
    B --> C{Referenced chats attached?}
    C -- No --> D[Load current chat summary if available]
    C -- Yes --> E[Inspect each referenced chat]
    E --> F{Existing cached summary?}
    F -- Yes --> G[Reuse cached summary]
    F -- No --> H{Fits direct budget?}
    H -- Yes --> I[Inject full referenced chat text]
    H -- No --> J[Prepare referenced-chat summary input]
    J --> K{Referenced-chat summary call succeeds?}
    K -- Yes --> L[Inject generated referenced summary]
    K -- No --> M[Fallback to direct contextual injection]
    G --> D
    I --> D
    L --> D
    M --> D
    D --> N[Build current-chat Head + Summary + Tail]
    N --> O{Over max_context_tokens?}
    O -- Yes --> P[Trim oldest atomic groups]
    O -- No --> Q[Send final context to the model]
    P --> Q
    Q --> R[Model returns the reply]
    R --> S[Outlet rebuilds the full history]
    S --> T{Reached compression threshold?}
    T -- No --> U[Finish]
    T -- Yes --> V[Fit summary input to the summary model context]
    V --> W{Background summary call succeeds?}
    W -- Yes --> X[Save new chat summary and update status]
    W -- No --> Y[Force browser-console error and show status hint]
 ```
 这也是为什么我会觉得它现在“强”：它已经不再只解决一个问题，而是在一个过滤器里同时协调：
 - 上下文压缩
 - 历史摘要复用
 - 工具调用安全性
 - 被引用聊天上下文
 - 模型预算控制
 ---
 ## v1.5.0 为什么重要
 这个版本的重要性在于，它把插件从“长对话压缩器”推进成了一个更接近“上下文管理层”的东西。
 ### 外部聊天引用摘要
 这是 `v1.5.0` 的新功能，不是小修小补。
 当用户引用另一个聊天时，插件现在可以：
 - 直接复用已有缓存摘要
 - 如果聊天足够小，直接把完整内容注入
 - 如果聊天太大，先生成摘要再注入
 这意味着它现在不仅能跨“轮次”保留上下文，也能开始跨“聊天”携带相关上下文。
 ### 快速多语言 Token 预估
 这同样是 `v1.5.0` 的新能力。
 插件不再依赖简单粗暴的统一字符比值，而是改用更适合混合语言文本的估算方式，尤其对下面这些场景更有意义：
 - 英文
 - 中文
 - 日文
 - 韩文
 - 西里尔字符
 - 阿拉伯语
 - 泰语
 - 中英混合或多语言混合对话
 这很重要，因为上下文管理类插件会不断做预算判断。预估更准，就意味着更少无意义的精确计算，也更不容易在预检阶段做出错误判断。
 ### 更强的最终请求预算控制
 现在的摘要路径会去拟合“真实最终 summary request”，而不是只看一个中间估算值。它会把这些内容都算进去：
 - prompt 包装
 - 格式化后的对话文本
 - previous summary
 - 预留输出预算
 - 安全余量
 这对老聊天、大聊天和最难处理的边界情况特别关键。
 ---
 ## 为什么我觉得它现在已经足够完整
 如果把“问题空间”列出来，我会说这个插件现在对主要场景已经覆盖得比较完整了：
 - 很长的普通聊天
 - 多轮编码与调试对话
 - 原生工具调用
 - 历史摘要持久化
 - 自定义模型阈值
 - 异步后台摘要
 - 外部聊天引用
 - 多语言 Token 预估
 - 调试可见性
 这并不代表它永远不会再迭代，而是说它已经越过了“窄功能实验品”的阶段，进入了一个更像“通用上下文管理系统”的形态。
 ---
 ## 代码规模与工程深度
 如果你关心实现深度，这个插件现在已经不小了。
 当前代码规模：
 - 主插件文件：**4,573 行**
 - 聚焦测试文件：**1,037 行**
 - 可见实现 + 回归测试合计：**5,610 行**
 代码行数本身不等于质量，但在这个量级上，它至少说明了几件真实的事：
 - 这已经不是一个玩具级过滤器
 - 这个插件的行为面足够大，必须靠专门回归测试兜住
 - 它已经积累了很多只有在真实使用中才会暴露出来的边界处理逻辑
 也就是说，它现在做的事情，已经明显不是“把老消息总结一下”那么简单。
 ---
 ## 实际价值
 如果你是 OpenWebUI 的重度用户，这个插件的价值其实很直接：
 - 长聊天更省 Token
 - 长会话连续性更好
 - 原生 tool-calling 更安全
 - 压缩后更不容易把会话搞坏
 - 大历史摘要生成更稳定
 - provider 拒绝摘要请求时更容易看到真错误
 - 能复用其他聊天里的有效上下文
 尤其适合这些用户：
 - 经常做长时间编码聊天
 - 使用上下文窗口比较紧的模型
 - 依赖原生工具调用
 - 经常回看旧项目聊天
 - 希望摘要更像“工作记忆”而不是“丢失细节的简要笔记”
 ---
 ## 安装
 - OpenWebUI 社区：<https://openwebui.com/posts/async_context_compression_b1655bc8>
 - 源码目录：<https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression>
 如果你想看完整的 valves、部署说明和故障排查，README 仍然是最完整的参考入口。
 ---
 ## 最后一句
 你问我这个插件是不是强大。
 我的答案是：**是，确实强，而且现在已经不是“看起来强”，而是“问题空间覆盖得比较完整”的那种强。**
 不是因为它代码多，而是因为它现在同时解决的是一组真正相关的问题：
 - 成本控制
 - 连续性
 - 结构安全
 - 异步持久化
 - 跨聊天上下文复用
 - 出错时的可诊断性
 正是这几个东西一起成立，才让它现在像一个真正成熟的长对话上下文管理插件。