feat(filters): release v1.3.0 for async context compression

- Add native i18n support across 9 languages
- Implement non-blocking frontend log emission for zero TTFB delay
- Add token_usage_status_threshold to intelligently control status notifications
- Automatically detect and skip compression for copilot_sdk models
- Set debug_mode default to false for a quieter production environment
- Update documentation and remove legacy bilingual code
2026-02-21 23:44:12 +08:00
parent 04b8108890
commit adc5e0a1f4
8 changed files with 771 additions and 2409 deletions
--- a/docs/plugins/filters/async-context-compression.md
+++ b/docs/plugins/filters/async-context-compression.md
@@ -1,137 +1,81 @@
-# Async Context Compression
+# Async Context Compression Filter

-<span class="category-badge filter">Filter</span>
-<span class="version-badge">v1.2.2</span>
+**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.3.0 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT

-Reduces token consumption in long conversations through intelligent summarization while maintaining conversational coherence.
+This filter reduces token consumption in long conversations through intelligent summarization and message compression while keeping conversations coherent.
+
+## What's new in 1.3.0
+
+- **Internationalization (i18n)**: Complete localization of user-facing messages across 9 languages (including English, Chinese, Japanese, Korean, French, German, Spanish, and Italian).
+- **Smart Status Display**: Added `token_usage_status_threshold` valve (default 80%) to intelligently control when token usage status is shown.
+- **Improved Performance**: Frontend language detection and logging are now non-blocking, so they add no delay to time to first byte (TTFB).
+- **Copilot SDK Integration**: Automatically detects and skips compression for `copilot_sdk`-based models to prevent conflicts.
+- **Configuration**: `debug_mode` is now set to `false` by default for a quieter production experience.

 ---

-## Overview
+## Core Features

-The Async Context Compression filter helps manage token usage in long conversations by:
-
- Intelligently summarizing older messages
- Preserving important context
- Reducing API costs
- Maintaining conversation coherence
-
-This is especially useful for:
-
- Long-running conversations
- Complex multi-turn discussions
- Cost optimization
- Token limit management
-
-## Features
-
- :material-arrow-collapse-vertical: **Smart Compression**: AI-powered context summarization
- :material-clock-fast: **Async Processing**: Non-blocking background compression
- :material-memory: **Context Preservation**: Keeps important information
- :material-currency-usd-off: **Cost Reduction**: Minimize token usage
- :material-console: **Frontend Debugging**: Debug logs in browser console
- :material-alert-circle-check: **Enhanced Error Reporting**: Clear error status notifications
- :material-check-all: **Open WebUI v0.7.x Compatibility**: Dynamic DB session handling
- :material-account-convert: **Improved Compatibility**: Summary role changed to `assistant`
- :material-shield-check: **Enhanced Stability**: Resolved race conditions in state management
- :material-ruler: **Preflight Context Check**: Validates context fit before sending
- :material-format-align-justify: **Structure-Aware Trimming**: Preserves document structure
- :material-content-cut: **Native Tool Output Trimming**: Trims verbose tool outputs (Note: Non-native tool outputs are not fully injected into context)
- :material-chart-bar: **Detailed Token Logging**: Granular token breakdown
- :material-account-search: **Smart Model Matching**: Inherit config from base models
- :material-image-off: **Multimodal Support**: Images are preserved but tokens are **NOT** calculated
+- ✅ **Full i18n Support**: Native localization across 9 languages.
+- ✅ Automatic compression triggered by token thresholds.
+- ✅ Asynchronous summarization that does not block chat responses.
+- ✅ Persistent storage via Open WebUI's shared database connection (PostgreSQL, SQLite, etc.).
+- ✅ Flexible retention policy to keep the first and last N messages.
+- ✅ Smart injection of historical summaries back into the context.
+- ✅ Structure-aware trimming that preserves document structure (headers, intro, conclusion).
+- ✅ Native tool output trimming for cleaner context when using function calling.
+- ✅ Real-time context usage monitoring with warning notifications (>90%).
+- ✅ Detailed token logging for precise debugging and optimization.
+- ✅ **Smart Model Matching**: Automatically inherits configuration from base models for custom presets.
+- ⚠ **Multimodal Support**: Images are preserved but their tokens are **NOT** calculated. Please adjust thresholds accordingly.
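
The retention policy in the list above (keep the first and last N messages, compress what lies between) can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the function name `split_for_retention` and the message shape are assumptions, and only the `keep_first` / `keep_last` defaults are taken from the documented parameters.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]  # assumed shape: {"role": ..., "content": ...}


def split_for_retention(
    messages: List[Message], keep_first: int = 1, keep_last: int = 6
) -> Tuple[List[Message], List[Message], List[Message]]:
    """Split a conversation into (head, middle, tail).

    head and tail are protected; middle is the candidate range for
    summarization. Hypothetical helper, for illustration only.
    """
    keep_first = max(0, keep_first)
    keep_last = max(0, keep_last)
    n = len(messages)
    # Nothing to compress if the protected ranges already cover everything.
    if n <= keep_first + keep_last:
        return list(messages), [], []
    head = messages[:keep_first]
    middle = messages[keep_first : n - keep_last]
    tail = messages[n - keep_last :]
    return head, middle, tail
```

With the default values, a 10-message conversation yields 1 protected head message, 3 messages eligible for summarization, and 6 protected tail messages.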

 ---

-## Installation
+## Installation & Configuration

-1. Download the plugin file: [`async_context_compression.py`](https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression)
-2. Upload to OpenWebUI: **Admin Panel** → **Settings** → **Functions**
-3. Configure compression settings
-4. Enable the filter
+### 1) Database (automatic)
+
+- Uses Open WebUI's shared database connection; no extra configuration needed.
+- The `chat_summary` table is created on first run.
+
+### 2) Filter order
+
+- Recommended order: pre-filters (<10) → this filter (10) → post-filters (>10).

 ---

-## How It Works
+## Configuration Parameters

-```mermaid
-graph TD
-    A[Incoming Messages] --> B{Token Count > Threshold?}
-    B -->|No| C[Pass Through]
-    B -->|Yes| D[Summarize Older Messages]
-    D --> E[Preserve Recent Messages]
-    E --> F[Combine Summary + Recent]
-    F --> G[Send to LLM]
-```
+| Parameter                      | Default  | Description                                                                                                                                                           |
+| :----------------------------- | :------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `priority`                     | `10`     | Execution order; lower runs earlier.                                                                                                                                  |
+| `compression_threshold_tokens` | `64000`  | Trigger asynchronous summary when total tokens exceed this value. Set to 50%-70% of your model's context window.                                                      |
+| `max_context_tokens`           | `128000` | Hard cap for context; older messages (except protected ones) are dropped if exceeded.                                                                                 |
+| `keep_first`                   | `1`      | Always keep the first N messages (protects system prompts).                                                                                                           |
+| `keep_last`                    | `6`      | Always keep the last N messages to preserve recent context.                                                                                                           |
+| `summary_model`                | `None`   | Model for summaries. Strongly recommended to set a fast, economical model (e.g., `gemini-2.5-flash`, `deepseek-v3`). Falls back to the current chat model when empty. |
+| `summary_model_max_context`    | `0`      | Max context tokens for the summary model. If 0, falls back to `model_thresholds` or global `max_context_tokens`.                                                      |
+| `max_summary_tokens`           | `16384`  | Maximum tokens for the generated summary.                                                                                                                             |
+| `summary_temperature`          | `0.3`    | Randomness for summary generation. Lower is more deterministic.                                                                                                       |
+| `model_thresholds`             | `{}`     | Per-model overrides for `compression_threshold_tokens` and `max_context_tokens` (useful for mixed models).                                                            |
+| `enable_tool_output_trimming`  | `false`  | When enabled and `function_calling: "native"` is active, trims verbose tool outputs to extract only the final answer.                                                 |
+| `debug_mode`                   | `false`  | Log verbose debug info. Set to `false` in production.                                                                                                                 |
+| `show_debug_log`               | `false`  | Print debug logs to browser console (F12). Useful for frontend debugging.                                                                                             |
+| `show_token_usage_status`      | `true`   | Show token usage status notification in the chat interface.                                                                                                           |
+| `token_usage_status_threshold` | `80`     | The minimum usage percentage (0-100) required to show a context usage status notification.                                                                            |
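
How the thresholds in the table interact can be sketched as below. This is a hedged illustration only: the function names are hypothetical, and the exact comparisons (`>` for "exceed", `>=` for "minimum percentage") are assumptions read from the parameter descriptions, not from the plugin source.

```python
def usage_percent(total_tokens: int, max_context_tokens: int) -> float:
    """Context usage as a percentage of the hard cap."""
    if max_context_tokens <= 0:
        return 0.0
    return 100.0 * total_tokens / max_context_tokens


def should_summarize(total_tokens: int, compression_threshold_tokens: int = 64000) -> bool:
    """Assumed rule: summarize once total tokens exceed the threshold."""
    return total_tokens > compression_threshold_tokens


def should_show_status(
    total_tokens: int,
    max_context_tokens: int = 128000,
    show_token_usage_status: bool = True,
    token_usage_status_threshold: int = 80,
) -> bool:
    """Assumed rule: show the status only at or above the minimum usage percentage."""
    if not show_token_usage_status:
        return False
    return usage_percent(total_tokens, max_context_tokens) >= token_usage_status_threshold
```

Under these assumptions and the default values, a 70,000-token conversation would trigger background summarization but stay below the 80% mark, so no usage status would be shown.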

 ---

-## Configuration
+## ⭐ Support

-| Option | Type | Default | Description |
-|--------|------|---------|-------------|
-| `compression_threshold_tokens` | integer | `64000` | Trigger compression above this token count |
-| `max_context_tokens` | integer | `128000` | Hard limit for context |
-| `keep_first` | integer | `1` | Always keep the first N messages |
-| `keep_last` | integer | `6` | Always keep the last N messages |
-| `summary_model` | string | `None` | Model to use for summarization |
-| `summary_model_max_context` | integer | `0` | Max context tokens for summary model |
-| `max_summary_tokens` | integer | `16384` | Maximum tokens for the summary |
-| `enable_tool_output_trimming` | boolean | `false` | Enable trimming of large tool outputs |
+If this plugin has been useful, a star on [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) is a big motivation for me. Thank you for the support.

---
+## Troubleshooting ❓

-## Example
+- **Initial system prompt is lost**: Keep `keep_first` greater than 0 to protect the initial message.
+- **Compression effect is weak**: Lower `compression_threshold_tokens` so summarization triggers earlier, or lower `keep_first` / `keep_last` to allow more aggressive compression.
+- **Submit an Issue**: If you encounter any problems, please submit an issue on GitHub: [OpenWebUI Extensions Issues](https://github.com/Fu-Jie/openwebui-extensions/issues)

-### Before Compression
+## Changelog

-```
-[Message 1] User: Tell me about Python...
-[Message 2] AI: Python is a programming language...
-[Message 3] User: What about its history?
-[Message 4] AI: Python was created by Guido...
-[Message 5] User: And its features?
-[Message 6] AI: Python has many features...
-... (many more messages)
-[Message 20] User: Current question
-```
-
-### After Compression
-
-```
-[Summary] Previous conversation covered Python basics,
-history, features, and common use cases...
-
-[Message 18] User: Recent question about decorators
-[Message 19] AI: Decorators in Python are...
-[Message 20] User: Current question
-```
-
---
-
-## Requirements
-
-!!! note "Prerequisites"
-    - OpenWebUI v0.3.0 or later
-    - Access to an LLM for summarization
-
-!!! tip "Best Practices"
-    - Set appropriate token thresholds based on your model's context window
-    - Preserve more recent messages for technical discussions
-    - Test compression settings in non-critical conversations first
-
---
-
-## Troubleshooting
-
-??? question "Compression not triggering?"
-    Check if the token count exceeds your configured threshold. Enable debug logging for more details.
-
-??? question "Important context being lost?"
-    Increase the `preserve_recent` setting or lower the compression ratio.
-
---
-
-## Source Code
-
-[:fontawesome-brands-github: View on GitHub](https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression){ .md-button }
+See the full history on GitHub: [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions)
--- a/docs/plugins/filters/async-context-compression.zh.md
+++ b/docs/plugins/filters/async-context-compression.zh.md
@@ -1,137 +1,119 @@
-# Async Context Compression
+# Async Context Compression Filter

-<span class="category-badge filter">Filter</span>
-<span class="version-badge">v1.2.2</span>
+**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.3.0 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT

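-Reduces token consumption in long conversations through intelligent summarization while keeping conversations coherent.
+> **Important**: To keep all filters maintainable and easy to use, each filter should ship with clear, complete documentation that fully describes its features, configuration, and usage.
+
+This filter uses intelligent summarization and message compression to significantly reduce token consumption in long conversations while keeping them coherent.
+
+## What's new in 1.3.0
+
+- **Internationalization (i18n) Support**: All user-visible messages are localized, with native support for 9 languages (including Chinese, English, Japanese, Korean, and major European languages).
+- **Smart Status Display**: Added the `token_usage_status_threshold` valve (default 80%) to control when token usage status is shown, reducing unnecessary interruptions.
+- **Major Performance Optimization**: Frontend language detection and log handling were refactored to be non-blocking, so they do not affect time to first byte (TTFB) and streaming starts within milliseconds.
+- **Copilot SDK Compatibility**: Automatically detects and skips context compression for `copilot_sdk`-based models to avoid conflicts.
+- **Configuration Change**: `debug_mode` now defaults to `false` for a quieter production experience.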

 ---

-## Overview
+## Core Features

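-The Async Context Compression filter helps manage token usage in long conversations by:
+- ✅ **Full Internationalization**: Native support for 9 interface languages.
+- ✅ **Automatic Compression**: Triggers context compression automatically based on token thresholds.
+- ✅ **Asynchronous Summarization**: Generates summaries in the background without blocking the current chat response.
+- ✅ **Persistent Storage**: Reuses Open WebUI's shared database connection, with automatic support for PostgreSQL, SQLite, and others.
+- ✅ **Flexible Retention Policy**: Configurable retention of messages at the head and tail of the conversation to keep key information coherent.
+- ✅ **Smart Injection**: Injects historical summaries into the new context.
+- ✅ **Structure-Aware Trimming**: Collapses overly long messages while preserving the document skeleton (headings, beginning, and end).
+- ✅ **Native Tool Output Trimming**: Supports trimming verbose tool call outputs.
+- ✅ **Real-Time Monitoring**: Monitors context usage in real time and warns when it exceeds 90%.
+- ✅ **Detailed Logging**: Provides precise token statistics logs for debugging.
+- ✅ **Smart Model Matching**: Custom models automatically inherit threshold configuration from their base models.
+- ⚠ **Multimodal Support**: Image content is preserved, but its tokens are **not counted**. Adjust thresholds accordingly.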

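- Intelligently summarizing older messages
- Preserving key information
- Reducing API costs
- Maintaining conversation consistency
-
-Especially useful for:
-
- Long-running sessions
- Complex multi-turn discussions
- Cost optimization
- Context length control
-
-## Features
-
- :material-arrow-collapse-vertical: **Smart Compression**: AI-powered context summarization
- :material-clock-fast: **Async Processing**: Non-blocking background compression
- :material-memory: **Context Preservation**: Retains important information where possible
- :material-currency-usd-off: **Cost Reduction**: Reduces token usage
- :material-console: **Frontend Debugging**: Supports browser console logs
- :material-alert-circle-check: **Enhanced Error Reporting**: Clear error status notifications
- :material-check-all: **Open WebUI v0.7.x Compatibility**: Dynamic database session handling
- :material-account-convert: **Improved Compatibility**: Summary role changed to `assistant`
- :material-shield-check: **Enhanced Stability**: Resolved race conditions in state management
- :material-ruler: **Preflight Context Check**: Verifies before sending that the context is within limits
- :material-format-align-justify: **Structure-Aware Trimming**: Smart trimming that preserves document structure
- :material-content-cut: **Native Tool Output Trimming**: Automatically trims verbose tool outputs (Note: non-native tool call outputs are not fully injected into the context)
- :material-chart-bar: **Detailed Token Logging**: Provides fine-grained token statistics
- :material-account-search: **Smart Model Matching**: Custom models automatically inherit base model configuration
- :material-image-off: **Multimodal Support**: Image content is preserved but its tokens are **not counted**
+For the detailed working principles and workflow, see the [Workflow Guide](WORKFLOW_GUIDE_CN.md).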

 ---

-## Installation
+## Installation & Configuration

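-1. Download the plugin file: [`async_context_compression.py`](https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression)
-2. Upload to OpenWebUI: **Admin Panel** → **Settings** → **Functions**
-3. Configure the compression parameters
-4. Enable the filter
+### 1. Database (automatic)
+
+- Automatically uses Open WebUI's shared database connection; **no extra configuration needed**.
+- The `chat_summary` table is created automatically on first run.
+
+### 2. Filter order
+
+- Recommended order: pre-filters (<10) → this filter (10) → post-filters (>10).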

 ---

-## How It Works
+## Configuration Parameters

-```mermaid
-graph TD
-    A[Incoming Messages] --> B{Token Count > Threshold?}
-    B -->|No| C[Pass Through]
-    B -->|Yes| D[Summarize Older Messages]
-    D --> E[Preserve Recent Messages]
-    E --> F[Combine Summary + Recent]
-    F --> G[Send to LLM]
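+You can adjust the following parameters in the filter's settings:
+
+### Core Parameters
+
+| Parameter                      | Default  | Description |
+| :----------------------------- | :------- | :---------- |
+| `priority`                     | `10`     | Filter execution order; lower values run earlier. |
+| `compression_threshold_tokens` | `64000`  | **Important**: A summary is generated in the background when the total context tokens exceed this value. Recommended: 50%-70% of the model's context window. |
+| `max_context_tokens`           | `128000` | **Important**: Hard cap for context; when exceeded, the oldest messages are removed (protected messages are kept). |
+| `keep_first`                   | `1`      | Always keep the first N messages of the conversation, protecting the system prompt or environment variables. |
+| `keep_last`                    | `6`      | Always keep the last N messages of the conversation to keep recent context coherent. |
+
+### Summary Generation Settings
+
+| Parameter                   | Default | Description |
+| :-------------------------- | :------ | :---------- |
+| `summary_model`             | `None`  | Model ID used to generate summaries. **Strongly recommended**: configure a fast, economical model with a large context window (e.g., `gemini-2.5-flash`, `deepseek-v3`). If left empty, the filter tries to reuse the current chat model. |
+| `summary_model_max_context` | `0`     | Max context tokens for the summary model. If 0, falls back to `model_thresholds` or the global `max_context_tokens`. |
+| `max_summary_tokens`        | `16384` | Maximum tokens allowed when generating a summary. |
+| `summary_temperature`       | `0.1`   | Controls the randomness of summary generation; lower values give more stable results. |
+
+### Advanced Configuration
+
+#### `model_thresholds` (model-specific thresholds)
+
+This is a dictionary setting that overrides the global `compression_threshold_tokens` and `max_context_tokens` for specific model IDs, which is useful when mixing models with different context windows.
+
+**Ships with recommended default thresholds for GPT-4, Claude 3.5, Gemini 1.5/2.0, Qwen 2.5/3, DeepSeek V3, and others.**
+
+**Configuration example:**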
+
+```json
+{
+  "gpt-4": {
+    "compression_threshold_tokens": 8000,
+    "max_context_tokens": 32000
+  },
+  "gemini-2.5-flash": {
+    "compression_threshold_tokens": 734000,
+    "max_context_tokens": 1048576
+  }
+}
 ```
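
The per-model override shown in the JSON example above could be resolved roughly as follows. This is a minimal sketch under stated assumptions: `resolve_thresholds` is a hypothetical helper, it uses exact model-ID matching only (the base-model inheritance mentioned under smart model matching is not modeled), and the global defaults are the documented `64000` / `128000`.

```python
from typing import Dict, Optional

GLOBAL_DEFAULTS: Dict[str, int] = {
    "compression_threshold_tokens": 64000,
    "max_context_tokens": 128000,
}


def resolve_thresholds(
    model_id: str,
    model_thresholds: Dict[str, Dict[str, int]],
    defaults: Optional[Dict[str, int]] = None,
) -> Dict[str, int]:
    """Return effective thresholds for a model (hypothetical helper).

    Per-model entries override the global values key by key; keys the
    entry omits keep their global value.
    """
    resolved = dict(GLOBAL_DEFAULTS if defaults is None else defaults)
    resolved.update(model_thresholds.get(model_id, {}))
    return resolved


# Values copied from the configuration example above.
example = {
    "gpt-4": {"compression_threshold_tokens": 8000, "max_context_tokens": 32000},
    "gemini-2.5-flash": {"compression_threshold_tokens": 734000, "max_context_tokens": 1048576},
}
gpt4_limits = resolve_thresholds("gpt-4", example)
other_limits = resolve_thresholds("some-other-model", example)
```

Here `gpt4_limits` carries the overridden values, while `other_limits` falls back to the global defaults because the model ID has no entry.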

---
-
-## Configuration
-
-| Option | Type | Default | Description |
-|--------|------|---------|-------------|
-| `compression_threshold_tokens` | integer | `64000` | Trigger compression above this token count |
-| `max_context_tokens` | integer | `128000` | Hard limit for context |
-| `keep_first` | integer | `1` | Always keep the first N messages |
-| `keep_last` | integer | `6` | Always keep the last N messages |
-| `summary_model` | string | `None` | Model used for summarization |
-| `summary_model_max_context` | integer | `0` | Max context tokens for the summary model |
-| `max_summary_tokens` | integer | `16384` | Maximum tokens for the summary |
-| `enable_tool_output_trimming` | boolean | `false` | Enable trimming of long tool outputs |
+| Parameter                      | Default  | Description |
+| :----------------------------- | :------- | :---------- |
+| `enable_tool_output_trimming`  | `false`  | When enabled and `function_calling: "native"` is active, trims verbose tool outputs to extract only the final answer. |
+| `debug_mode`                   | `false`  | Whether to print verbose debug info to the Open WebUI console log. Defaults to `false`, which is also recommended for production. |
+| `show_debug_log`               | `false`  | Whether to print debug logs to the browser console (F12). Useful for frontend debugging. |
+| `show_token_usage_status`      | `true`   | Whether to show a token usage status notification when the conversation ends. |
+| `token_usage_status_threshold` | `80`     | Minimum percentage threshold (0-100) that triggers the context usage status notification. |

 ---

-## Example
+## ⭐ Support

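-### Before Compression
+If this plugin has been helpful, please consider starring [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) on GitHub. It motivates me to keep improving it. Thank you for the support.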

-```
-[Message 1] User: Tell me about Python...
-[Message 2] AI: Python is a programming language...
-[Message 3] User: What about its history?
-[Message 4] AI: Python was created by Guido...
-[Message 5] User: And its features?
-[Message 6] AI: Python has many features...
-... (many more messages)
-[Message 20] User: Current question
-```
+## Troubleshooting ❓

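-### After Compression
+- **Initial system prompt is lost**: Set `keep_first` to a value greater than 0.
+- **Compression effect is weak**: Lower `compression_threshold_tokens` so summarization triggers earlier, or lower `keep_first` / `keep_last` to compress more aggressively.
+- **Submit an Issue**: If you encounter any problems, please submit an issue on GitHub: [OpenWebUI Extensions Issues](https://github.com/Fu-Jie/openwebui-extensions/issues)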

-```
-[Summary] Previous conversation covered Python basics,
-history, features, and common use cases...
+## Changelog

-[Message 18] User: Recent question about decorators
-[Message 19] AI: Decorators in Python are...
-[Message 20] User: Current question
-```
-
---
-
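-## Requirements
-
-!!! note "Prerequisites"
-    - OpenWebUI v0.3.0 or later
-    - An available LLM for summarization
-
-!!! tip "Best Practices"
-    - Set appropriate token thresholds based on the model's context window
-    - For technical discussions, consider raising `preserve_recent`
-    - Test compression results in non-critical conversations first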
-
---
-
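-## FAQ
-
-??? question "Compression not triggering?"
-    Check whether the token count exceeds the configured threshold, and enable debug logging for details.
-
-??? question "Important context being lost?"
-    Increase `preserve_recent` or lower the compression ratio.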
-
---
-
-## Source Code
-
-[:fontawesome-brands-github: View on GitHub](https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression){ .md-button }
+See the full history on the GitHub project: [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions)
--- a/docs/plugins/filters/index.md
+++ b/docs/plugins/filters/index.md
@@ -22,7 +22,7 @@ Filters act as middleware in the message pipeline:

    Reduces token consumption in long conversations through intelligent summarization while maintaining coherence.

-    **Version:** 1.2.2
+    **Version:** 1.3.0

    [:octicons-arrow-right-24: Documentation](async-context-compression.md)

--- a/docs/plugins/filters/index.zh.md
+++ b/docs/plugins/filters/index.zh.md
@@ -22,7 +22,7 @@ Filters act as middleware in the message pipeline:

    Reduces token consumption in long conversations through intelligent summarization while maintaining coherence.

-    **Version:** 1.2.2
+    **Version:** 1.3.0

    [:octicons-arrow-right-24: View documentation](async-context-compression.md)