feat(filters): release v1.3.0 for async context compression

- Add native i18n support across 9 languages
- Implement non-blocking frontend log emission for zero TTFB delay
- Add token_usage_status_threshold to intelligently control status notifications
- Automatically detect and skip compression for copilot_sdk models
- Set debug_mode default to false for a quieter production environment
- Update documentation and remove legacy bilingual code
This commit is contained in:
fujie
2026-02-21 23:44:12 +08:00
parent 04b8108890
commit adc5e0a1f4
8 changed files with 771 additions and 2409 deletions

View File

@@ -1,137 +1,81 @@
# Async Context Compression
# Async Context Compression Filter
<span class="category-badge filter">Filter</span>
<span class="version-badge">v1.2.2</span>
**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.3.0 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT
Reduces token consumption in long conversations through intelligent summarization while maintaining conversational coherence.
This filter reduces token consumption in long conversations through intelligent summarization and message compression while keeping conversations coherent.
## What's new in 1.3.0
- **Internationalization (i18n)**: Complete localization of user-facing messages across 9 languages (English, Chinese, Japanese, Korean, French, German, Spanish, Italian).
- **Smart Status Display**: Added `token_usage_status_threshold` valve (default 80%) to intelligently control when token usage status is shown.
- **Improved Performance**: Frontend language detection and logging are optimized to be completely non-blocking, maintaining lightning-fast TTFB.
- **Copilot SDK Integration**: Automatically detects and skips compression for copilot_sdk based models to prevent conflicts.
- **Configuration**: `debug_mode` is now set to `false` by default for a quieter production experience.
---
## Overview
## Core Features
The Async Context Compression filter helps manage token usage in long conversations by:
- Intelligently summarizing older messages
- Preserving important context
- Reducing API costs
- Maintaining conversation coherence
This is especially useful for:
- Long-running conversations
- Complex multi-turn discussions
- Cost optimization
- Token limit management
## Features
- :material-arrow-collapse-vertical: **Smart Compression**: AI-powered context summarization
- :material-clock-fast: **Async Processing**: Non-blocking background compression
- :material-memory: **Context Preservation**: Keeps important information
- :material-currency-usd-off: **Cost Reduction**: Minimize token usage
- :material-console: **Frontend Debugging**: Debug logs in browser console
- :material-alert-circle-check: **Enhanced Error Reporting**: Clear error status notifications
- :material-check-all: **Open WebUI v0.7.x Compatibility**: Dynamic DB session handling
- :material-account-convert: **Improved Compatibility**: Summary role changed to `assistant`
- :material-shield-check: **Enhanced Stability**: Resolved race conditions in state management
- :material-ruler: **Preflight Context Check**: Validates context fit before sending
- :material-format-align-justify: **Structure-Aware Trimming**: Preserves document structure
- :material-content-cut: **Native Tool Output Trimming**: Trims verbose tool outputs (Note: Non-native tool outputs are not fully injected into context)
- :material-chart-bar: **Detailed Token Logging**: Granular token breakdown
- :material-account-search: **Smart Model Matching**: Inherit config from base models
- :material-image-off: **Multimodal Support**: Images are preserved but tokens are **NOT** calculated
-**Full i18n Support**: Native localization across 9 languages.
- ✅ Automatic compression triggered by token thresholds.
- ✅ Asynchronous summarization that does not block chat responses.
- ✅ Persistent storage via Open WebUI's shared database connection (PostgreSQL, SQLite, etc.).
- ✅ Flexible retention policy to keep the first and last N messages.
- ✅ Smart injection of historical summaries back into the context.
- ✅ Structure-aware trimming that preserves document structure (headers, intro, conclusion).
- ✅ Native tool output trimming for cleaner context when using function calling.
- ✅ Real-time context usage monitoring with warning notifications (>90%).
- ✅ Detailed token logging for precise debugging and optimization.
- **Smart Model Matching**: Automatically inherits configuration from base models for custom presets.
- **Multimodal Support**: Images are preserved but their tokens are **NOT** calculated. Please adjust thresholds accordingly.
---
## Installation
## Installation & Configuration
1. Download the plugin file: [`async_context_compression.py`](https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression)
2. Upload to OpenWebUI: **Admin Panel****Settings****Functions**
3. Configure compression settings
4. Enable the filter
### 1) Database (automatic)
- Uses Open WebUI's shared database connection; no extra configuration needed.
- The `chat_summary` table is created on first run.
### 2) Filter order
- Recommended order: pre-filters (<10) → this filter (10) → post-filters (>10).
---
## How It Works
## Configuration Parameters
```mermaid
graph TD
A[Incoming Messages] --> B{Token Count > Threshold?}
B -->|No| C[Pass Through]
B -->|Yes| D[Summarize Older Messages]
D --> E[Preserve Recent Messages]
E --> F[Combine Summary + Recent]
F --> G[Send to LLM]
```
| Parameter | Default | Description |
| :----------------------------- | :------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `priority` | `10` | Execution order; lower runs earlier. |
| `compression_threshold_tokens` | `64000` | Trigger asynchronous summary when total tokens exceed this value. Set to 50%-70% of your model's context window. |
| `max_context_tokens` | `128000` | Hard cap for context; older messages (except protected ones) are dropped if exceeded. |
| `keep_first` | `1` | Always keep the first N messages (protects system prompts). |
| `keep_last` | `6` | Always keep the last N messages to preserve recent context. |
| `summary_model` | `None` | Model for summaries. Strongly recommended to set a fast, economical model (e.g., `gemini-2.5-flash`, `deepseek-v3`). Falls back to the current chat model when empty. |
| `summary_model_max_context` | `0` | Max context tokens for the summary model. If 0, falls back to `model_thresholds` or global `max_context_tokens`. |
| `max_summary_tokens` | `16384` | Maximum tokens for the generated summary. |
| `summary_temperature` | `0.3` | Randomness for summary generation. Lower is more deterministic. |
| `model_thresholds` | `{}` | Per-model overrides for `compression_threshold_tokens` and `max_context_tokens` (useful for mixed models). |
| `enable_tool_output_trimming` | `false` | When enabled and `function_calling: "native"` is active, trims verbose tool outputs to extract only the final answer. |
| `debug_mode` | `false` | Log verbose debug info. Set to `false` in production. |
| `show_debug_log` | `false` | Print debug logs to browser console (F12). Useful for frontend debugging. |
| `show_token_usage_status` | `true` | Show token usage status notification in the chat interface. |
| `token_usage_status_threshold` | `80` | The minimum usage percentage (0-100) required to show a context usage status notification. |
---
## Configuration
## ⭐ Support
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `compression_threshold_tokens` | integer | `64000` | Trigger compression above this token count |
| `max_context_tokens` | integer | `128000` | Hard limit for context |
| `keep_first` | integer | `1` | Always keep the first N messages |
| `keep_last` | integer | `6` | Always keep the last N messages |
| `summary_model` | string | `None` | Model to use for summarization |
| `summary_model_max_context` | integer | `0` | Max context tokens for summary model |
| `max_summary_tokens` | integer | `16384` | Maximum tokens for the summary |
| `enable_tool_output_trimming` | boolean | `false` | Enable trimming of large tool outputs |
If this plugin has been useful, a star on [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) is a big motivation for me. Thank you for the support.
---
## Troubleshooting ❓
## Example
- **Initial system prompt is lost**: Keep `keep_first` greater than 0 to protect the initial message.
- **Compression effect is weak**: Raise `compression_threshold_tokens` or lower `keep_first` / `keep_last` to allow more aggressive compression.
- **Submit an Issue**: If you encounter any problems, please submit an issue on GitHub: [OpenWebUI Extensions Issues](https://github.com/Fu-Jie/openwebui-extensions/issues)
### Before Compression
## Changelog
```
[Message 1] User: Tell me about Python...
[Message 2] AI: Python is a programming language...
[Message 3] User: What about its history?
[Message 4] AI: Python was created by Guido...
[Message 5] User: And its features?
[Message 6] AI: Python has many features...
... (many more messages)
[Message 20] User: Current question
```
### After Compression
```
[Summary] Previous conversation covered Python basics,
history, features, and common use cases...
[Message 18] User: Recent question about decorators
[Message 19] AI: Decorators in Python are...
[Message 20] User: Current question
```
---
## Requirements
!!! note "Prerequisites"
- OpenWebUI v0.3.0 or later
- Access to an LLM for summarization
!!! tip "Best Practices"
- Set appropriate token thresholds based on your model's context window
- Preserve more recent messages for technical discussions
- Test compression settings in non-critical conversations first
---
## Troubleshooting
??? question "Compression not triggering?"
Check if the token count exceeds your configured threshold. Enable debug logging for more details.
??? question "Important context being lost?"
Increase the `preserve_recent` setting or lower the compression ratio.
---
## Source Code
[:fontawesome-brands-github: View on GitHub](https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression){ .md-button }
See the full history on GitHub: [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions)

View File

@@ -1,137 +1,119 @@
# Async Context Compression异步上下文压缩
# 异步上下文压缩过滤器
<span class="category-badge filter">Filter</span>
<span class="version-badge">v1.2.2</span>
**作者:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **版本:** 1.3.0 | **项目:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **许可证:** MIT
通过智能摘要减少长对话的 token 消耗,同时保持对话连贯
> **重要提示**:为了确保所有过滤器的可维护性和易用性,每个过滤器都应附带清晰、完整的文档,以确保其功能、配置和使用方法得到充分说明
本过滤器通过智能摘要和消息压缩技术,在保持对话连贯性的同时,显著降低长对话的 Token 消耗。
## 1.3.0 版本更新
- **国际化 (i18n) 支持**: 完成了所有用户可见消息的本地化,现已原生支持 9 种语言(含中、英、日、韩及欧洲主要语言)。
- **智能状态显示**: 新增 `token_usage_status_threshold` 阀门(默认 80%),可以智能控制何时显示 Token 用量状态,减少不必要的打扰。
- **性能大幅优化**: 对前端语言检测和日志处理流程进行了非阻塞重构完全不影响首字节响应时间TTFB保持毫秒级极速推流。
- **Copilot SDK 兼容**: 自动检测并跳过基于 `copilot_sdk` 模型的上下文压缩,避免冲突。
- **配置项调整**: 为了提供更安静的生产环境体验,`debug_mode` 现已默认设置为 `false`
---
## 概览
## 核心特性
Async Context Compression 过滤器通过以下方式帮助管理长对话的 token 使用:
-**全方位国际化**: 原生支持 9 种界面语言。
-**自动压缩**: 基于 Token 阈值自动触发上下文压缩。
-**异步摘要**: 后台生成摘要,不阻塞当前对话响应。
-**持久化存储**: 复用 Open WebUI 共享数据库连接,自动支持 PostgreSQL/SQLite 等。
-**灵活保留策略**: 可配置保留对话头部和尾部消息,确保关键信息连贯。
-**智能注入**: 将历史摘要智能注入到新上下文中。
-**结构感知裁剪**: 智能折叠过长消息,保留文档骨架(标题、首尾)。
-**原生工具输出裁剪**: 支持裁剪冗长的工具调用输出。
-**实时监控**: 实时监控上下文使用情况,超过 90% 发出警告。
-**详细日志**: 提供精确的 Token 统计日志,便于调试。
-**智能模型匹配**: 自定义模型自动继承基础模型的阈值配置。
-**多模态支持**: 图片内容会被保留,但其 Token **不参与计算**。请相应调整阈值。
- 智能总结较早的消息
- 保留关键信息
- 降低 API 成本
- 保持对话一致性
特别适用于:
- 长时间会话
- 多轮复杂讨论
- 成本优化
- 上下文长度控制
## 功能特性
- :material-arrow-collapse-vertical: **智能压缩**AI 驱动的上下文摘要
- :material-clock-fast: **异步处理**:后台非阻塞压缩
- :material-memory: **保留上下文**:尽量保留重要信息
- :material-currency-usd-off: **降低成本**:减少 token 使用
- :material-console: **前端调试**:支持浏览器控制台日志
- :material-alert-circle-check: **增强错误报告**:清晰的错误状态通知
- :material-check-all: **Open WebUI v0.7.x 兼容性**:动态数据库会话处理
- :material-account-convert: **兼容性提升**:摘要角色改为 `assistant`
- :material-shield-check: **稳定性增强**:解决状态管理竞态条件
- :material-ruler: **预检上下文检查**:发送前验证上下文是否超限
- :material-format-align-justify: **结构感知裁剪**:保留文档结构的智能裁剪
- :material-content-cut: **原生工具输出裁剪**:自动裁剪冗长的工具输出(注意:非原生工具调用输出不会完整注入上下文)
- :material-chart-bar: **详细 Token 日志**:提供细粒度的 Token 统计
- :material-account-search: **智能模型匹配**:自定义模型自动继承基础模型配置
- :material-image-off: **多模态支持**:图片内容保留但 Token **不参与计算**
详细的工作原理和流程请参考 [工作流程指南](WORKFLOW_GUIDE_CN.md)。
---
## 安装
## 安装与配置
1. 下载插件文件:[`async_context_compression.py`](https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression)
2. 上传到 OpenWebUI**Admin Panel** → **Settings****Functions**
3. 配置压缩参数
4. 启用过滤器
### 1. 数据库(自动)
- 自动使用 Open WebUI 的共享数据库连接,**无需额外配置**。
- 首次运行自动创建 `chat_summary` 表。
### 2. 过滤器顺序
- 建议顺序:前置过滤器(<10→ 本过滤器10→ 后置过滤器(>10
---
## 工作原理
## 配置参数
```mermaid
graph TD
A[Incoming Messages] --> B{Token Count > Threshold?}
B -->|No| C[Pass Through]
B -->|Yes| D[Summarize Older Messages]
D --> E[Preserve Recent Messages]
E --> F[Combine Summary + Recent]
F --> G[Send to LLM]
您可以在过滤器的设置中调整以下参数:
### 核心参数
| 参数 | 默认值 | 描述 |
| :----------------------------- | :------- | :------------------------------------------------------------------------------------ |
| `priority` | `10` | 过滤器执行顺序,数值越小越先执行。 |
| `compression_threshold_tokens` | `64000` | **重要**: 当上下文总 Token 超过此值时后台生成摘要,建议设为模型上下文窗口的 50%-70%。 |
| `max_context_tokens` | `128000` | **重要**: 上下文硬上限,超过即移除最早消息(保留受保护消息)。 |
| `keep_first` | `1` | 始终保留对话开始的 N 条消息,保护系统提示或环境变量。 |
| `keep_last` | `6` | 始终保留对话末尾的 N 条消息,确保最近上下文连贯。 |
### 摘要生成配置
| 参数 | 默认值 | 描述 |
| :-------------------- | :------ | :------------------------------------------------------------------------------------------------------------------------------------------ |
| `summary_model` | `None` | 用于生成摘要的模型 ID。**强烈建议**配置快速、经济、上下文窗口大的模型(如 `gemini-2.5-flash``deepseek-v3`)。留空则尝试复用当前对话模型。 |
| `summary_model_max_context` | `0` | 摘要模型的最大上下文 Token 数。如果为 0则回退到 `model_thresholds` 或全局 `max_context_tokens`。 |
| `max_summary_tokens` | `16384` | 生成摘要时允许的最大 Token 数。 |
| `summary_temperature` | `0.1` | 控制摘要生成的随机性,较低的值结果更稳定。 |
### 高级配置
#### `model_thresholds` (模型特定阈值)
这是一个字典配置,可为特定模型 ID 覆盖全局 `compression_threshold_tokens``max_context_tokens`,适用于混合不同上下文窗口的模型。
**默认包含 GPT-4、Claude 3.5、Gemini 1.5/2.0、Qwen 2.5/3、DeepSeek V3 等推荐阈值。**
**配置示例:**
```json
{
"gpt-4": {
"compression_threshold_tokens": 8000,
"max_context_tokens": 32000
},
"gemini-2.5-flash": {
"compression_threshold_tokens": 734000,
"max_context_tokens": 1048576
}
}
```
---
## 配置项
| 选项 | 类型 | 默认值 | 说明 |
|--------|------|---------|-------------|
| `compression_threshold_tokens` | integer | `64000` | 超过该 token 数触发压缩 |
| `max_context_tokens` | integer | `128000` | 上下文硬性上限 |
| `keep_first` | integer | `1` | 始终保留的前 N 条消息 |
| `keep_last` | integer | `6` | 始终保留的后 N 条消息 |
| `summary_model` | string | `None` | 用于摘要的模型 |
| `summary_model_max_context` | integer | `0` | 摘要模型的最大上下文 Token 数 |
| `max_summary_tokens` | integer | `16384` | 摘要的最大 token 数 |
| `enable_tool_output_trimming` | boolean | `false` | 启用长工具输出裁剪 |
| 参数 | 默认值 | 描述 |
| :----------------------------- | :------- | :-------------------------------------------------------------------------------------------------------------------------------------- |
| `enable_tool_output_trimming` | `false` | 启用时,若 `function_calling: "native"` 激活,将裁剪冗长的工具输出以仅提取最终答案。 |
| `debug_mode` | `false` | 是否在 Open WebUI 的控制台日志中打印详细的调试信息。生产环境默认且建议设为 `false`。 |
| `show_debug_log` | `false` | 是否在浏览器控制台 (F12) 打印调试日志。便于前端调试。 |
| `show_token_usage_status` | `true` | 是否在对话结束时显示 Token 使用情况的状态通知。 |
| `token_usage_status_threshold` | `80` | 触发显示上下文用量状态通知的最低百分比阈值 (0-100)。 |
---
## 示例
## ⭐ 支持
### 压缩前
如果这个插件对你有帮助,欢迎到 [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) 点个 Star这将是我持续改进的动力感谢支持。
```
[Message 1] User: Tell me about Python...
[Message 2] AI: Python is a programming language...
[Message 3] User: What about its history?
[Message 4] AI: Python was created by Guido...
[Message 5] User: And its features?
[Message 6] AI: Python has many features...
... (many more messages)
[Message 20] User: Current question
```
## 故障排除 (Troubleshooting) ❓
### 压缩后
- **初始系统提示丢失**:将 `keep_first` 设置为大于 0。
- **压缩效果不明显**:提高 `compression_threshold_tokens`,或降低 `keep_first` / `keep_last` 以增强压缩力度。
- **提交 Issue**: 如果遇到任何问题,请在 GitHub 上提交 Issue[OpenWebUI Extensions Issues](https://github.com/Fu-Jie/openwebui-extensions/issues)
```
[Summary] Previous conversation covered Python basics,
history, features, and common use cases...
## 更新日志
[Message 18] User: Recent question about decorators
[Message 19] AI: Decorators in Python are...
[Message 20] User: Current question
```
---
## 运行要求
!!! note "前置条件"
- OpenWebUI v0.3.0 及以上
- 需要可用的 LLM 用于摘要
!!! tip "最佳实践"
- 根据模型上下文窗口设置合适的 token 阈值
- 技术讨论可适当提高 `preserve_recent`
- 先在非关键对话中测试压缩效果
---
## 常见问题
??? question "没有触发压缩?"
检查 token 数是否超过配置的阈值,并开启调试日志了解细节。
??? question "重要上下文丢失?"
提高 `preserve_recent` 或降低压缩比例。
---
## 源码
[:fontawesome-brands-github: 在 GitHub 查看](https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression){ .md-button }
完整历史请查看 GitHub 项目: [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions)

View File

@@ -22,7 +22,7 @@ Filters act as middleware in the message pipeline:
Reduces token consumption in long conversations through intelligent summarization while maintaining coherence.
**Version:** 1.2.2
**Version:** 1.3.0
[:octicons-arrow-right-24: Documentation](async-context-compression.md)

View File

@@ -22,7 +22,7 @@ Filter 充当消息管线中的中间件:
通过智能总结减少长对话的 token 消耗,同时保持连贯性。
**版本:** 1.2.2
**版本:** 1.3.0
[:octicons-arrow-right-24: 查看文档](async-context-compression.md)