feat(filters): release v1.3.0 for async context compression

- Add native i18n support across 9 languages
- Implement non-blocking frontend log emission for zero TTFB delay
- Add token_usage_status_threshold to intelligently control status notifications
- Automatically detect and skip compression for copilot_sdk models
- Set debug_mode default to false for a quieter production environment
- Update documentation and remove legacy bilingual code
This commit is contained in:
fujie
2026-02-21 23:44:12 +08:00
parent 04b8108890
commit adc5e0a1f4
8 changed files with 771 additions and 2409 deletions

View File

@@ -1,137 +1,81 @@
# Async Context Compression
# Async Context Compression Filter
<span class="category-badge filter">Filter</span>
<span class="version-badge">v1.2.2</span>
**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.3.0 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT
Reduces token consumption in long conversations through intelligent summarization while maintaining conversational coherence.
This filter reduces token consumption in long conversations through intelligent summarization and message compression while keeping conversations coherent.
## What's new in 1.3.0
- **Internationalization (i18n)**: Complete localization of user-facing messages across 9 languages (English, Simplified and Traditional Chinese, Japanese, Korean, French, German, Spanish, and Italian).
- **Smart Status Display**: Added `token_usage_status_threshold` valve (default 80%) to intelligently control when token usage status is shown.
- **Improved Performance**: Frontend language detection and logging are optimized to be completely non-blocking, maintaining lightning-fast TTFB.
- **Copilot SDK Integration**: Automatically detects and skips compression for `copilot_sdk`-based models to prevent conflicts.
- **Configuration**: `debug_mode` is now set to `false` by default for a quieter production experience.
---
## Overview
## Core Features
The Async Context Compression filter helps manage token usage in long conversations by:
- Intelligently summarizing older messages
- Preserving important context
- Reducing API costs
- Maintaining conversation coherence
This is especially useful for:
- Long-running conversations
- Complex multi-turn discussions
- Cost optimization
- Token limit management
## Features
- :material-arrow-collapse-vertical: **Smart Compression**: AI-powered context summarization
- :material-clock-fast: **Async Processing**: Non-blocking background compression
- :material-memory: **Context Preservation**: Keeps important information
- :material-currency-usd-off: **Cost Reduction**: Minimize token usage
- :material-console: **Frontend Debugging**: Debug logs in browser console
- :material-alert-circle-check: **Enhanced Error Reporting**: Clear error status notifications
- :material-check-all: **Open WebUI v0.7.x Compatibility**: Dynamic DB session handling
- :material-account-convert: **Improved Compatibility**: Summary role changed to `assistant`
- :material-shield-check: **Enhanced Stability**: Resolved race conditions in state management
- :material-ruler: **Preflight Context Check**: Validates context fit before sending
- :material-format-align-justify: **Structure-Aware Trimming**: Preserves document structure
- :material-content-cut: **Native Tool Output Trimming**: Trims verbose tool outputs (Note: Non-native tool outputs are not fully injected into context)
- :material-chart-bar: **Detailed Token Logging**: Granular token breakdown
- :material-account-search: **Smart Model Matching**: Inherit config from base models
- :material-image-off: **Multimodal Support**: Images are preserved but tokens are **NOT** calculated
- ✅ **Full i18n Support**: Native localization across 9 languages.
- ✅ Automatic compression triggered by token thresholds.
- ✅ Asynchronous summarization that does not block chat responses.
- ✅ Persistent storage via Open WebUI's shared database connection (PostgreSQL, SQLite, etc.).
- ✅ Flexible retention policy to keep the first and last N messages.
- ✅ Smart injection of historical summaries back into the context.
- ✅ Structure-aware trimming that preserves document structure (headers, intro, conclusion).
- ✅ Native tool output trimming for cleaner context when using function calling.
- ✅ Real-time context usage monitoring with warning notifications (>90%).
- ✅ Detailed token logging for precise debugging and optimization.
- **Smart Model Matching**: Automatically inherits configuration from base models for custom presets.
- **Multimodal Support**: Images are preserved but their tokens are **NOT** calculated. Please adjust thresholds accordingly.
---
## Installation
## Installation & Configuration
1. Download the plugin file: [`async_context_compression.py`](https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression)
2. Upload to OpenWebUI: **Admin Panel** → **Settings** → **Functions**
3. Configure compression settings
4. Enable the filter
### 1) Database (automatic)
- Uses Open WebUI's shared database connection; no extra configuration needed.
- The `chat_summary` table is created on first run.
### 2) Filter order
- Recommended order: pre-filters (<10) → this filter (10) → post-filters (>10).
---
## How It Works
## Configuration Parameters
```mermaid
graph TD
A[Incoming Messages] --> B{Token Count > Threshold?}
B -->|No| C[Pass Through]
B -->|Yes| D[Summarize Older Messages]
D --> E[Preserve Recent Messages]
E --> F[Combine Summary + Recent]
F --> G[Send to LLM]
```
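The decision flow above can be sketched as a small Python function. This is illustrative only: the names `compress_context` and `estimate_tokens`, and the placeholder summary message, are assumptions for explanation, not the filter's actual API (the real summary is generated asynchronously by an LLM).

```python
# Illustrative sketch of the compression decision flow shown in the diagram.
# Names and the placeholder summary are assumptions, not the filter's API.

def estimate_tokens(text: str) -> int:
    # Rough fallback estimate used when tiktoken is unavailable: 1 token ~ 4 chars.
    return len(text) // 4

def compress_context(messages, threshold=64000, keep_first=1, keep_last=6):
    total = sum(estimate_tokens(m["content"]) for m in messages)
    if total <= threshold:
        return messages  # below threshold: pass through unchanged

    head = messages[:keep_first]               # protected system prompt(s)
    tail = messages[-keep_last:]               # recent context, kept verbatim
    middle = messages[keep_first:-keep_last]   # candidates for summarization

    # Stand-in for the asynchronous LLM-generated summary.
    summary = {"role": "assistant",
               "content": f"[Summary of {len(middle)} earlier messages]"}
    return head + [summary] + tail
```

With the defaults above, a compressed context always contains `keep_first + 1 + keep_last` messages.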
| Parameter | Default | Description |
| :----------------------------- | :------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `priority` | `10` | Execution order; lower runs earlier. |
| `compression_threshold_tokens` | `64000` | Trigger asynchronous summary when total tokens exceed this value. Set to 50%-70% of your model's context window. |
| `max_context_tokens` | `128000` | Hard cap for context; older messages (except protected ones) are dropped if exceeded. |
| `keep_first` | `1` | Always keep the first N messages (protects system prompts). |
| `keep_last` | `6` | Always keep the last N messages to preserve recent context. |
| `summary_model` | `None` | Model for summaries. Strongly recommended to set a fast, economical model (e.g., `gemini-2.5-flash`, `deepseek-v3`). Falls back to the current chat model when empty. |
| `summary_model_max_context` | `0` | Max context tokens for the summary model. If 0, falls back to `model_thresholds` or global `max_context_tokens`. |
| `max_summary_tokens` | `16384` | Maximum tokens for the generated summary. |
| `summary_temperature` | `0.3` | Randomness for summary generation. Lower is more deterministic. |
| `model_thresholds` | `{}` | Per-model overrides for `compression_threshold_tokens` and `max_context_tokens` (useful for mixed models). |
| `enable_tool_output_trimming` | `false` | When enabled and `function_calling: "native"` is active, trims verbose tool outputs to extract only the final answer. |
| `debug_mode` | `false` | Log verbose debug info. Set to `false` in production. |
| `show_debug_log` | `false` | Print debug logs to browser console (F12). Useful for frontend debugging. |
| `show_token_usage_status` | `true` | Show token usage status notification in the chat interface. |
| `token_usage_status_threshold` | `80` | The minimum usage percentage (0-100) required to show a context usage status notification. |
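For reference, `model_thresholds` takes a JSON object keyed by model ID; the Chinese documentation in this commit ships an example along these lines (values are illustrative):

```json
{
  "gpt-4": {
    "compression_threshold_tokens": 8000,
    "max_context_tokens": 32000
  },
  "gemini-2.5-flash": {
    "compression_threshold_tokens": 734000,
    "max_context_tokens": 1048576
  }
}
```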
---
## Configuration
## ⭐ Support
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `compression_threshold_tokens` | integer | `64000` | Trigger compression above this token count |
| `max_context_tokens` | integer | `128000` | Hard limit for context |
| `keep_first` | integer | `1` | Always keep the first N messages |
| `keep_last` | integer | `6` | Always keep the last N messages |
| `summary_model` | string | `None` | Model to use for summarization |
| `summary_model_max_context` | integer | `0` | Max context tokens for summary model |
| `max_summary_tokens` | integer | `16384` | Maximum tokens for the summary |
| `enable_tool_output_trimming` | boolean | `false` | Enable trimming of large tool outputs |
If this plugin has been useful, a star on [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) is a big motivation for me. Thank you for the support.
---
## Troubleshooting ❓
## Example
- **Initial system prompt is lost**: Keep `keep_first` greater than 0 to protect the initial message.
- **Compression effect is weak**: Raise `compression_threshold_tokens` or lower `keep_first` / `keep_last` to allow more aggressive compression.
- **Submit an Issue**: If you encounter any problems, please submit an issue on GitHub: [OpenWebUI Extensions Issues](https://github.com/Fu-Jie/openwebui-extensions/issues)
### Before Compression
## Changelog
```
[Message 1] User: Tell me about Python...
[Message 2] AI: Python is a programming language...
[Message 3] User: What about its history?
[Message 4] AI: Python was created by Guido...
[Message 5] User: And its features?
[Message 6] AI: Python has many features...
... (many more messages)
[Message 20] User: Current question
```
### After Compression
```
[Summary] Previous conversation covered Python basics,
history, features, and common use cases...
[Message 18] User: Recent question about decorators
[Message 19] AI: Decorators in Python are...
[Message 20] User: Current question
```
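Internally, the loaded summary is injected as an `assistant`-role message wrapped in the localized `summary_prompt_prefix` / `summary_prompt_suffix` templates. A minimal sketch with abbreviated English strings (the exact assembly is an assumption based on those template names in the source):

```python
# Sketch of summary injection. The real filter selects PREFIX/SUFFIX from
# its per-locale TRANSLATIONS table; the strings here are abbreviated.

PREFIX = ("[Previous Summary: The following is a summary of the historical "
          "conversation, provided for context only.]\n\n")
SUFFIX = "\n\n---\nBelow is the recent conversation:"

def inject_summary(summary_text: str, recent_messages: list) -> list:
    # The summary uses the `assistant` role for broader model compatibility
    # (see "Summary role changed to `assistant`" in the feature list).
    summary_msg = {"role": "assistant",
                   "content": PREFIX + summary_text + SUFFIX}
    return [summary_msg] + recent_messages
```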
---
## Requirements
!!! note "Prerequisites"

    - OpenWebUI v0.3.0 or later
    - Access to an LLM for summarization

!!! tip "Best Practices"

    - Set appropriate token thresholds based on your model's context window
    - Preserve more recent messages for technical discussions
    - Test compression settings in non-critical conversations first
---
## Troubleshooting
??? question "Compression not triggering?"

    Check whether the token count actually exceeds your configured `compression_threshold_tokens`. Enable debug logging for more details.

??? question "Important context being lost?"

    Increase `keep_last` to preserve more recent messages, or raise `compression_threshold_tokens` so compression triggers later.
---
## Source Code
[:fontawesome-brands-github: View on GitHub](https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression){ .md-button }
See the full history on GitHub: [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions)

View File

@@ -1,137 +1,119 @@
# Async Context Compression异步上下文压缩
# 异步上下文压缩过滤器
<span class="category-badge filter">Filter</span>
<span class="version-badge">v1.2.2</span>
**作者:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **版本:** 1.3.0 | **项目:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **许可证:** MIT
通过智能摘要减少长对话的 token 消耗,同时保持对话连贯性。
> **重要提示**:为了确保所有过滤器的可维护性和易用性,每个过滤器都应附带清晰、完整的文档,以确保其功能、配置和使用方法得到充分说明。
本过滤器通过智能摘要和消息压缩技术,在保持对话连贯性的同时,显著降低长对话的 Token 消耗。
## 1.3.0 版本更新
- **国际化 (i18n) 支持**: 完成了所有用户可见消息的本地化,现已原生支持 9 种语言(含中、英、日、韩及欧洲主要语言)。
- **智能状态显示**: 新增 `token_usage_status_threshold` 阀门(默认 80%),可以智能控制何时显示 Token 用量状态,减少不必要的打扰。
- **性能大幅优化**: 对前端语言检测和日志处理流程进行了非阻塞重构,完全不影响首字节响应时间TTFB保持毫秒级极速推流。
- **Copilot SDK 兼容**: 自动检测并跳过基于 `copilot_sdk` 模型的上下文压缩,避免冲突。
- **配置项调整**: 为了提供更安静的生产环境体验,`debug_mode` 现已默认设置为 `false`。
---
## 概览
## 核心特性
Async Context Compression 过滤器通过以下方式帮助管理长对话的 token 使用:
- ✅ **全方位国际化**: 原生支持 9 种界面语言。
- ✅ **自动压缩**: 基于 Token 阈值自动触发上下文压缩。
- ✅ **异步摘要**: 后台生成摘要,不阻塞当前对话响应。
- ✅ **持久化存储**: 复用 Open WebUI 共享数据库连接,自动支持 PostgreSQL/SQLite 等。
- ✅ **灵活保留策略**: 可配置保留对话头部和尾部消息,确保关键信息连贯。
- ✅ **智能注入**: 将历史摘要智能注入到新上下文中。
- ✅ **结构感知裁剪**: 智能折叠过长消息,保留文档骨架(标题、首尾)。
- ✅ **原生工具输出裁剪**: 支持裁剪冗长的工具调用输出。
- ✅ **实时监控**: 实时监控上下文使用情况,超过 90% 发出警告。
- ✅ **详细日志**: 提供精确的 Token 统计日志,便于调试。
- **智能模型匹配**: 自定义模型自动继承基础模型的阈值配置。
- **多模态支持**: 图片内容会被保留,但其 Token **不参与计算**。请相应调整阈值。
- 智能总结较早的消息
- 保留关键信息
- 降低 API 成本
- 保持对话一致性
特别适用于:
- 长时间会话
- 多轮复杂讨论
- 成本优化
- 上下文长度控制
## 功能特性
- :material-arrow-collapse-vertical: **智能压缩**AI 驱动的上下文摘要
- :material-clock-fast: **异步处理**:后台非阻塞压缩
- :material-memory: **保留上下文**:尽量保留重要信息
- :material-currency-usd-off: **降低成本**:减少 token 使用
- :material-console: **前端调试**:支持浏览器控制台日志
- :material-alert-circle-check: **增强错误报告**:清晰的错误状态通知
- :material-check-all: **Open WebUI v0.7.x 兼容性**:动态数据库会话处理
- :material-account-convert: **兼容性提升**:摘要角色改为 `assistant`
- :material-shield-check: **稳定性增强**:解决状态管理竞态条件
- :material-ruler: **预检上下文检查**:发送前验证上下文是否超限
- :material-format-align-justify: **结构感知裁剪**:保留文档结构的智能裁剪
- :material-content-cut: **原生工具输出裁剪**:自动裁剪冗长的工具输出(注意:非原生工具调用输出不会完整注入上下文)
- :material-chart-bar: **详细 Token 日志**:提供细粒度的 Token 统计
- :material-account-search: **智能模型匹配**:自定义模型自动继承基础模型配置
- :material-image-off: **多模态支持**:图片内容保留但 Token **不参与计算**
详细的工作原理和流程请参考 [工作流程指南](WORKFLOW_GUIDE_CN.md)。
---
## 安装
## 安装与配置
1. 下载插件文件:[`async_context_compression.py`](https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression)
2. 上传到 OpenWebUI**Admin Panel** → **Settings** → **Functions**
3. 配置压缩参数
4. 启用过滤器
### 1. 数据库(自动)
- 自动使用 Open WebUI 的共享数据库连接,**无需额外配置**。
- 首次运行自动创建 `chat_summary` 表。
### 2. 过滤器顺序
- 建议顺序:前置过滤器(<10)→ 本过滤器(10)→ 后置过滤器(>10)
---
## 工作原理
## 配置参数
```mermaid
graph TD
A[Incoming Messages] --> B{Token Count > Threshold?}
B -->|No| C[Pass Through]
B -->|Yes| D[Summarize Older Messages]
D --> E[Preserve Recent Messages]
E --> F[Combine Summary + Recent]
F --> G[Send to LLM]
```
您可以在过滤器的设置中调整以下参数:
### 核心参数
| 参数 | 默认值 | 描述 |
| :----------------------------- | :------- | :------------------------------------------------------------------------------------ |
| `priority` | `10` | 过滤器执行顺序,数值越小越先执行。 |
| `compression_threshold_tokens` | `64000` | **重要**: 当上下文总 Token 超过此值时后台生成摘要,建议设为模型上下文窗口的 50%-70%。 |
| `max_context_tokens` | `128000` | **重要**: 上下文硬上限,超过即移除最早消息(保留受保护消息)。 |
| `keep_first` | `1` | 始终保留对话开始的 N 条消息,保护系统提示或环境变量。 |
| `keep_last` | `6` | 始终保留对话末尾的 N 条消息,确保最近上下文连贯。 |
### 摘要生成配置
| 参数 | 默认值 | 描述 |
| :-------------------- | :------ | :------------------------------------------------------------------------------------------------------------------------------------------ |
| `summary_model` | `None` | 用于生成摘要的模型 ID。**强烈建议**配置快速、经济、上下文窗口大的模型(如 `gemini-2.5-flash`、`deepseek-v3`)。留空则尝试复用当前对话模型。 |
| `summary_model_max_context` | `0` | 摘要模型的最大上下文 Token 数。如果为 0则回退到 `model_thresholds` 或全局 `max_context_tokens`。 |
| `max_summary_tokens` | `16384` | 生成摘要时允许的最大 Token 数。 |
| `summary_temperature` | `0.3` | 控制摘要生成的随机性,较低的值结果更稳定。 |
### 高级配置
#### `model_thresholds` (模型特定阈值)
这是一个字典配置,可为特定模型 ID 覆盖全局 `compression_threshold_tokens` 和 `max_context_tokens`,适用于混合不同上下文窗口的模型。
**默认包含 GPT-4、Claude 3.5、Gemini 1.5/2.0、Qwen 2.5/3、DeepSeek V3 等推荐阈值。**
**配置示例:**
```json
{
"gpt-4": {
"compression_threshold_tokens": 8000,
"max_context_tokens": 32000
},
"gemini-2.5-flash": {
"compression_threshold_tokens": 734000,
"max_context_tokens": 1048576
}
}
```
---
## 配置项
| 选项 | 类型 | 默认值 | 说明 |
|--------|------|---------|-------------|
| `compression_threshold_tokens` | integer | `64000` | 超过该 token 数触发压缩 |
| `max_context_tokens` | integer | `128000` | 上下文硬性上限 |
| `keep_first` | integer | `1` | 始终保留的前 N 条消息 |
| `keep_last` | integer | `6` | 始终保留的后 N 条消息 |
| `summary_model` | string | `None` | 用于摘要的模型 |
| `summary_model_max_context` | integer | `0` | 摘要模型的最大上下文 Token 数 |
| `max_summary_tokens` | integer | `16384` | 摘要的最大 token 数 |
| `enable_tool_output_trimming` | boolean | `false` | 启用长工具输出裁剪 |
| 参数 | 默认值 | 描述 |
| :----------------------------- | :------- | :-------------------------------------------------------------------------------------------------------------------------------------- |
| `enable_tool_output_trimming` | `false` | 启用时,若 `function_calling: "native"` 激活,将裁剪冗长的工具输出以仅提取最终答案。 |
| `debug_mode` | `false` | 是否在 Open WebUI 的控制台日志中打印详细的调试信息。生产环境默认且建议设为 `false`。 |
| `show_debug_log` | `false` | 是否在浏览器控制台 (F12) 打印调试日志。便于前端调试。 |
| `show_token_usage_status` | `true` | 是否在对话结束时显示 Token 使用情况的状态通知。 |
| `token_usage_status_threshold` | `80` | 触发显示上下文用量状态通知的最低百分比阈值 (0-100)。 |
---
## 示例
## ⭐ 支持
### 压缩前
如果这个插件对你有帮助,欢迎到 [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) 点个 Star这将是我持续改进的动力感谢支持。
```
[Message 1] User: Tell me about Python...
[Message 2] AI: Python is a programming language...
[Message 3] User: What about its history?
[Message 4] AI: Python was created by Guido...
[Message 5] User: And its features?
[Message 6] AI: Python has many features...
... (many more messages)
[Message 20] User: Current question
```
## 故障排除 (Troubleshooting) ❓
### 压缩后
- **初始系统提示丢失**:将 `keep_first` 设置为大于 0。
- **压缩效果不明显**:提高 `compression_threshold_tokens`,或降低 `keep_first` / `keep_last` 以增强压缩力度。
- **提交 Issue**: 如果遇到任何问题,请在 GitHub 上提交 Issue[OpenWebUI Extensions Issues](https://github.com/Fu-Jie/openwebui-extensions/issues)
```
[Summary] Previous conversation covered Python basics,
history, features, and common use cases...
[Message 18] User: Recent question about decorators
[Message 19] AI: Decorators in Python are...
[Message 20] User: Current question
```

## 更新日志
---
## 运行要求
!!! note "前置条件"

    - OpenWebUI v0.3.0 及以上
    - 需要可用的 LLM 用于摘要

!!! tip "最佳实践"

    - 根据模型上下文窗口设置合适的 token 阈值
    - 技术讨论可适当提高 `keep_last`
    - 先在非关键对话中测试压缩效果
---
## 常见问题
??? question "没有触发压缩?"

    检查 token 数是否确实超过配置的 `compression_threshold_tokens`,并开启调试日志了解细节。

??? question "重要上下文丢失?"

    提高 `keep_last` 以保留更多近期消息,或提高 `compression_threshold_tokens` 延后触发压缩。
---
## 源码
[:fontawesome-brands-github: 在 GitHub 查看](https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/filters/async-context-compression){ .md-button }
完整历史请查看 GitHub 项目: [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions)

View File

@@ -22,7 +22,7 @@ Filters act as middleware in the message pipeline:
Reduces token consumption in long conversations through intelligent summarization while maintaining coherence.
**Version:** 1.2.2
**Version:** 1.3.0
[:octicons-arrow-right-24: Documentation](async-context-compression.md)

View File

@@ -22,7 +22,7 @@ Filter 充当消息管线中的中间件:
通过智能总结减少长对话的 token 消耗,同时保持连贯性。
**版本:** 1.2.2
**版本:** 1.3.0
[:octicons-arrow-right-24: 查看文档](async-context-compression.md)

View File

@@ -1,18 +1,22 @@
# Async Context Compression Filter
**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.2.2 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT
**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.3.0 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT
This filter reduces token consumption in long conversations through intelligent summarization and message compression while keeping conversations coherent.
## What's new in 1.2.2
## What's new in 1.3.0
- **Critical Fix**: Resolved `TypeError: 'str' object is not callable` caused by variable name conflict in logging function.
- **Compatibility**: Enhanced `params` handling to support Pydantic objects, improving compatibility with different OpenWebUI versions.
- **Internationalization (i18n)**: Complete localization of user-facing messages across 9 languages (English, Simplified and Traditional Chinese, Japanese, Korean, French, German, Spanish, and Italian).
- **Smart Status Display**: Added `token_usage_status_threshold` valve (default 80%) to intelligently control when token usage status is shown.
- **Improved Performance**: Frontend language detection and logging are optimized to be completely non-blocking, maintaining lightning-fast TTFB.
- **Copilot SDK Integration**: Automatically detects and skips compression for `copilot_sdk`-based models to prevent conflicts.
- **Configuration**: `debug_mode` is now set to `false` by default for a quieter production experience.
---
## Core Features
- ✅ **Full i18n Support**: Native localization across 9 languages.
- ✅ Automatic compression triggered by token thresholds.
- ✅ Asynchronous summarization that does not block chat responses.
- ✅ Persistent storage via Open WebUI's shared database connection (PostgreSQL, SQLite, etc.).
@@ -55,8 +59,10 @@ This filter reduces token consumption in long conversations through intelligent
| `summary_temperature` | `0.3` | Randomness for summary generation. Lower is more deterministic. |
| `model_thresholds` | `{}` | Per-model overrides for `compression_threshold_tokens` and `max_context_tokens` (useful for mixed models). |
| `enable_tool_output_trimming` | `false` | When enabled and `function_calling: "native"` is active, trims verbose tool outputs to extract only the final answer. |
| `debug_mode` | `true` | Log verbose debug info. Set to `false` in production. |
| `debug_mode` | `false` | Log verbose debug info. Set to `false` in production. |
| `show_debug_log` | `false` | Print debug logs to browser console (F12). Useful for frontend debugging. |
| `show_token_usage_status` | `true` | Show token usage status notification in the chat interface. |
| `token_usage_status_threshold` | `80` | The minimum usage percentage (0-100) required to show a context usage status notification. |
---

View File

@@ -1,20 +1,24 @@
# 异步上下文压缩过滤器
**作者:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **版本:** 1.2.2 | **项目:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **许可证:** MIT
**作者:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **版本:** 1.3.0 | **项目:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **许可证:** MIT
> **重要提示**:为了确保所有过滤器的可维护性和易用性,每个过滤器都应附带清晰、完整的文档,以确保其功能、配置和使用方法得到充分说明。
本过滤器通过智能摘要和消息压缩技术,在保持对话连贯性的同时,显著降低长对话的 Token 消耗。
## 1.2.2 版本更新
## 1.3.0 版本更新
- **严重错误修复**: 解决了因日志函数变量名冲突导致的 `TypeError: 'str' object is not callable` 错误。
- **兼容性增强**: 改进了 `params` 处理逻辑以支持 Pydantic 对象,提高了对不同 OpenWebUI 版本的兼容性。
- **国际化 (i18n) 支持**: 完成了所有用户可见消息的本地化,现已原生支持 9 种语言(含中、英、日、韩及欧洲主要语言)。
- **智能状态显示**: 新增 `token_usage_status_threshold` 阀门(默认 80%),可以智能控制何时显示 Token 用量状态,减少不必要的打扰。
- **性能大幅优化**: 对前端语言检测和日志处理流程进行了非阻塞重构,完全不影响首字节响应时间TTFB保持毫秒级极速推流。
- **Copilot SDK 兼容**: 自动检测并跳过基于 `copilot_sdk` 模型的上下文压缩,避免冲突。
- **配置项调整**: 为了提供更安静的生产环境体验,`debug_mode` 现已默认设置为 `false`。
---
## 核心特性
- ✅ **全方位国际化**: 原生支持 9 种界面语言。
- ✅ **自动压缩**: 基于 Token 阈值自动触发上下文压缩。
- ✅ **异步摘要**: 后台生成摘要,不阻塞当前对话响应。
- ✅ **持久化存储**: 复用 Open WebUI 共享数据库连接,自动支持 PostgreSQL/SQLite 等。
@@ -93,9 +97,10 @@
| 参数 | 默认值 | 描述 |
| :----------------------------- | :------- | :-------------------------------------------------------------------------------------------------------------------------------------- |
| `enable_tool_output_trimming` | `false` | 启用时,若 `function_calling: "native"` 激活,将裁剪冗长的工具输出以仅提取最终答案。 |
| `debug_mode` | `true` | 是否在 Open WebUI 的控制台日志中打印详细的调试信息(如 Token 计数、压缩进度、数据库操作等)。生产环境建议设为 `false`。 |
| `debug_mode` | `false` | 是否在 Open WebUI 的控制台日志中打印详细的调试信息。生产环境默认且建议设为 `false`。 |
| `show_debug_log` | `false` | 是否在浏览器控制台 (F12) 打印调试日志。便于前端调试。 |
| `show_token_usage_status` | `true` | 是否在对话结束时显示 Token 使用情况的状态通知。 |
| `token_usage_status_threshold` | `80` | 触发显示上下文用量状态通知的最低百分比阈值 (0-100)。 |
---

View File

@@ -5,17 +5,17 @@ author: Fu-Jie
author_url: https://github.com/Fu-Jie/openwebui-extensions
funding_url: https://github.com/open-webui
description: Reduces token consumption in long conversations while maintaining coherence through intelligent summarization and message compression.
version: 1.2.2
version: 1.3.0
openwebui_id: b1655bc8-6de9-4cad-8cb5-a6f7829a02ce
license: MIT
═══════════════════════════════════════════════════════════════════════════════
📌 What's new in 1.2.1
📌 What's new in 1.3.0
═══════════════════════════════════════════════════════════════════════════════
✅ Smart Configuration: Automatically detects base model settings for custom models and adds `summary_model_max_context` for independent summary limits.
Performance & Refactoring: Optimized threshold parsing with caching and removed redundant code for better efficiency.
Bug Fixes & Modernization: Fixed `datetime` deprecation warnings and corrected type annotations.
✅ Smart Status Display: Added `token_usage_status_threshold` valve (default 80%) to control when token usage status is shown, reducing unnecessary notifications.
Copilot SDK Integration: Automatically detects and skips compression for copilot_sdk based models to prevent conflicts.
Improved User Experience: Status messages now only appear when token usage exceeds the configured threshold, keeping the interface cleaner.
═══════════════════════════════════════════════════════════════════════════════
📌 Overview
@@ -150,7 +150,7 @@ summary_temperature
Description: Controls the randomness of the summary generation. Lower values produce more deterministic output.
debug_mode
Default: true
Default: false
Description: Prints detailed debug information to the log. Recommended to set to `false` in production.
show_debug_log
@@ -268,6 +268,7 @@ import hashlib
import time
import contextlib
import logging
from functools import lru_cache
# Setup logger
logger = logging.getLogger(__name__)
@@ -391,6 +392,130 @@ class ChatSummary(owui_Base):
)
TRANSLATIONS = {
    "en-US": {
        "status_context_usage": "Context Usage (Estimated): {tokens} / {max_tokens} Tokens ({ratio}%)",
        "status_high_usage": " | ⚠️ High Usage",
        "status_loaded_summary": "Loaded historical summary (Hidden {count} historical messages)",
        "status_context_summary_updated": "Context Summary Updated: {tokens} / {max_tokens} Tokens ({ratio}%)",
        "status_generating_summary": "Generating context summary in background...",
        "status_summary_error": "Summary Error: {error}",
        "summary_prompt_prefix": "【Previous Summary: The following is a summary of the historical conversation, provided for context only. Do not reply to the summary content itself; answer the subsequent latest questions directly.】\n\n",
        "summary_prompt_suffix": "\n\n---\nBelow is the recent conversation:",
        "tool_trimmed": "... [Tool outputs trimmed]\n{content}",
        "content_collapsed": "\n... [Content collapsed] ...\n",
    },
    "zh-CN": {
        "status_context_usage": "上下文用量 (预估): {tokens} / {max_tokens} Tokens ({ratio}%)",
        "status_high_usage": " | ⚠️ 用量较高",
        "status_loaded_summary": "已加载历史总结 (隐藏了 {count} 条历史消息)",
        "status_context_summary_updated": "上下文总结已更新: {tokens} / {max_tokens} Tokens ({ratio}%)",
        "status_generating_summary": "正在后台生成上下文总结...",
        "status_summary_error": "总结生成错误: {error}",
        "summary_prompt_prefix": "【前情提要:以下是历史对话的总结,仅供上下文参考。请不要回复总结内容本身,直接回答之后最新的问题。】\n\n",
        "summary_prompt_suffix": "\n\n---\n以下是最近的对话:",
        "tool_trimmed": "... [工具输出已裁剪]\n{content}",
        "content_collapsed": "\n... [内容已折叠] ...\n",
    },
    "zh-HK": {
        "status_context_usage": "上下文用量 (預估): {tokens} / {max_tokens} Tokens ({ratio}%)",
        "status_high_usage": " | ⚠️ 用量較高",
        "status_loaded_summary": "已載入歷史總結 (隱藏了 {count} 條歷史訊息)",
        "status_context_summary_updated": "上下文總結已更新: {tokens} / {max_tokens} Tokens ({ratio}%)",
        "status_generating_summary": "正在後台生成上下文總結...",
        "status_summary_error": "總結生成錯誤: {error}",
        "summary_prompt_prefix": "【前情提要:以下是歷史對話的總結,僅供上下文參考。請不要回覆總結內容本身,直接回答之後最新的問題。】\n\n",
        "summary_prompt_suffix": "\n\n---\n以下是最近的對話:",
        "tool_trimmed": "... [工具輸出已裁剪]\n{content}",
        "content_collapsed": "\n... [內容已折疊] ...\n",
    },
    "zh-TW": {
        "status_context_usage": "上下文用量 (預估): {tokens} / {max_tokens} Tokens ({ratio}%)",
        "status_high_usage": " | ⚠️ 用量較高",
        "status_loaded_summary": "已載入歷史總結 (隱藏了 {count} 條歷史訊息)",
        "status_context_summary_updated": "上下文總結已更新: {tokens} / {max_tokens} Tokens ({ratio}%)",
        "status_generating_summary": "正在後台生成上下文總結...",
        "status_summary_error": "總結生成錯誤: {error}",
        "summary_prompt_prefix": "【前情提要:以下是歷史對話的總結,僅供上下文參考。請不要回覆總結內容本身,直接回答之後最新的問題。】\n\n",
        "summary_prompt_suffix": "\n\n---\n以下是最近的對話:",
        "tool_trimmed": "... [工具輸出已裁剪]\n{content}",
        "content_collapsed": "\n... [內容已折疊] ...\n",
    },
    "ja-JP": {
        "status_context_usage": "コンテキスト使用量 (推定): {tokens} / {max_tokens} トークン ({ratio}%)",
        "status_high_usage": " | ⚠️ 使用量高",
        "status_loaded_summary": "履歴の要約を読み込みました ({count} 件の履歴メッセージを非表示)",
        "status_context_summary_updated": "コンテキストの要約が更新されました: {tokens} / {max_tokens} トークン ({ratio}%)",
        "status_generating_summary": "バックグラウンドでコンテキスト要約を生成しています...",
        "status_summary_error": "要約エラー: {error}",
        "summary_prompt_prefix": "【これまでのあらすじ:以下は過去の会話の要約であり、コンテキストの参考としてのみ提供されます。要約の内容自体には返答せず、その後の最新の質問に直接答えてください。】\n\n",
        "summary_prompt_suffix": "\n\n---\n以下は最近の会話です:",
        "tool_trimmed": "... [ツールの出力をトリミングしました]\n{content}",
        "content_collapsed": "\n... [コンテンツが折りたたまれました] ...\n",
    },
    "ko-KR": {
        "status_context_usage": "컨텍스트 사용량 (예상): {tokens} / {max_tokens} 토큰 ({ratio}%)",
        "status_high_usage": " | ⚠️ 사용량 높음",
        "status_loaded_summary": "이전 요약 불러옴 ({count}개의 이전 메시지 숨김)",
        "status_context_summary_updated": "컨텍스트 요약 업데이트됨: {tokens} / {max_tokens} 토큰 ({ratio}%)",
        "status_generating_summary": "백그라운드에서 컨텍스트 요약 생성 중...",
        "status_summary_error": "요약 오류: {error}",
        "summary_prompt_prefix": "【이전 요약: 다음은 이전 대화의 요약이며 문맥 참고용으로만 제공됩니다. 요약 내용 자체에 답하지 말고 이후의 최신 질문에 직접 답하세요.】\n\n",
        "summary_prompt_suffix": "\n\n---\n다음은 최근 대화입니다:",
        "tool_trimmed": "... [도구 출력 잘림]\n{content}",
        "content_collapsed": "\n... [내용 접힘] ...\n",
    },
    "fr-FR": {
        "status_context_usage": "Utilisation du contexte (estimée) : {tokens} / {max_tokens} jetons ({ratio}%)",
        "status_high_usage": " | ⚠️ Utilisation élevée",
        "status_loaded_summary": "Résumé historique chargé ({count} messages d'historique masqués)",
        "status_context_summary_updated": "Résumé du contexte mis à jour : {tokens} / {max_tokens} jetons ({ratio}%)",
        "status_generating_summary": "Génération du résumé du contexte en arrière-plan...",
        "status_summary_error": "Erreur de résumé : {error}",
        "summary_prompt_prefix": "【Résumé précédent : Ce qui suit est un résumé de la conversation historique, fourni uniquement pour le contexte. Ne répondez pas au contenu du résumé lui-même ; répondez directement aux dernières questions.】\n\n",
        "summary_prompt_suffix": "\n\n---\nVoici la conversation récente :",
        "tool_trimmed": "... [Sorties d'outils coupées]\n{content}",
        "content_collapsed": "\n... [Contenu réduit] ...\n",
    },
    "de-DE": {
        "status_context_usage": "Kontextnutzung (geschätzt): {tokens} / {max_tokens} Tokens ({ratio}%)",
        "status_high_usage": " | ⚠️ Hohe Nutzung",
        "status_loaded_summary": "Historische Zusammenfassung geladen ({count} historische Nachrichten ausgeblendet)",
        "status_context_summary_updated": "Kontextzusammenfassung aktualisiert: {tokens} / {max_tokens} Tokens ({ratio}%)",
        "status_generating_summary": "Kontextzusammenfassung wird im Hintergrund generiert...",
        "status_summary_error": "Zusammenfassungsfehler: {error}",
        "summary_prompt_prefix": "【Vorherige Zusammenfassung: Das Folgende ist eine Zusammenfassung der historischen Konversation, die nur als Kontext dient. Antworten Sie nicht auf den Inhalt der Zusammenfassung selbst, sondern direkt auf die nachfolgenden neuesten Fragen.】\n\n",
        "summary_prompt_suffix": "\n\n---\nHier ist die jüngste Konversation:",
        "tool_trimmed": "... [Werkzeugausgaben gekürzt]\n{content}",
        "content_collapsed": "\n... [Inhalt ausgeblendet] ...\n",
    },
    "es-ES": {
        "status_context_usage": "Uso del contexto (estimado): {tokens} / {max_tokens} Tokens ({ratio}%)",
        "status_high_usage": " | ⚠️ Uso elevado",
        "status_loaded_summary": "Resumen histórico cargado ({count} mensajes históricos ocultos)",
        "status_context_summary_updated": "Resumen del contexto actualizado: {tokens} / {max_tokens} Tokens ({ratio}%)",
        "status_generating_summary": "Generando resumen del contexto en segundo plano...",
        "status_summary_error": "Error de resumen: {error}",
        "summary_prompt_prefix": "【Resumen anterior: El siguiente es un resumen de la conversación histórica, proporcionado solo como contexto. No responda al contenido del resumen en sí; responda directamente a las preguntas más recientes.】\n\n",
        "summary_prompt_suffix": "\n\n---\nA continuación se muestra la conversación reciente:",
        "tool_trimmed": "... [Salidas de herramientas recortadas]\n{content}",
        "content_collapsed": "\n... [Contenido contraído] ...\n",
    },
    "it-IT": {
        "status_context_usage": "Utilizzo contesto (stimato): {tokens} / {max_tokens} Token ({ratio}%)",
        "status_high_usage": " | ⚠️ Utilizzo elevato",
        "status_loaded_summary": "Riepilogo storico caricato ({count} messaggi storici nascosti)",
        "status_context_summary_updated": "Riepilogo contesto aggiornato: {tokens} / {max_tokens} Token ({ratio}%)",
        "status_generating_summary": "Generazione riepilogo contesto in background...",
        "status_summary_error": "Errore riepilogo: {error}",
        "summary_prompt_prefix": "【Riepilogo precedente: Il seguente è un riepilogo della conversazione storica, fornito solo per contesto. Non rispondere al contenuto del riepilogo stesso; rispondi direttamente alle domande più recenti.】\n\n",
        "summary_prompt_suffix": "\n\n---\nDi seguito è riportata la conversazione recente:",
        "tool_trimmed": "... [Output degli strumenti tagliati]\n{content}",
        "content_collapsed": "\n... [Contenuto compresso] ...\n",
    },
}
# Global cache for tiktoken encoding
TIKTOKEN_ENCODING = None
if tiktoken:
@@ -400,6 +525,26 @@ if tiktoken:
logger.error(f"[Init] Failed to load tiktoken encoding: {e}")
@lru_cache(maxsize=1024)
def _get_cached_tokens(text: str) -> int:
"""Calculates tokens with LRU caching for exact string matches."""
if not text:
return 0
if TIKTOKEN_ENCODING:
try:
# tiktoken encoding is relatively fast, but caching on exact string match
# turns O(N) encoding time into an O(1) dictionary lookup for historical messages.
return len(TIKTOKEN_ENCODING.encode(text))
except Exception as e:
logger.warning(
f"[Token Count] tiktoken error: {e}, falling back to character estimation"
)
# Fallback strategy: Rough estimation (1 token ≈ 4 chars)
return len(text) // 4
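# Illustrative sketch (not part of the filter): the fallback branch above
# estimates roughly 1 token per 4 characters. A hypothetical standalone helper
# showing the same arithmetic:
def _demo_char_estimate(text: str) -> int:
    # Integer division mirrors the `len(text) // 4` fallback above.
    return len(text) // 4 if text else 0
# e.g. _demo_char_estimate("a" * 10) == 2 and _demo_char_estimate("") == 0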
class Filter:
def __init__(self):
self.valves = self.Valves()
@@ -409,8 +554,105 @@ class Filter:
sessionmaker(bind=self._db_engine) if self._db_engine else None
)
self._model_thresholds_cache: Optional[Dict[str, Any]] = None
# Fallback mapping for variants not in TRANSLATIONS keys
self.fallback_map = {
"es-AR": "es-ES",
"es-MX": "es-ES",
"fr-CA": "fr-FR",
"en-CA": "en-US",
"en-GB": "en-US",
"en-AU": "en-US",
"de-AT": "de-DE",
}
self._init_database()
def _resolve_language(self, lang: str) -> str:
"""Resolve the best matching language code from the TRANSLATIONS dict."""
target_lang = lang
# 1. Direct match
if target_lang in TRANSLATIONS:
return target_lang
# 2. Variant fallback (explicit mapping)
if target_lang in self.fallback_map:
target_lang = self.fallback_map[target_lang]
if target_lang in TRANSLATIONS:
return target_lang
# 3. Base language fallback (e.g. fr-BE -> fr-FR)
if "-" in lang:
base_lang = lang.split("-")[0]
for supported_lang in TRANSLATIONS:
if supported_lang.startswith(base_lang + "-"):
return supported_lang
# 4. Final Fallback to en-US
return "en-US"
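# Illustrative resolution chains (assuming the TRANSLATIONS keys defined above):
#   "es-ES" -> direct match                    -> "es-ES"
#   "es-MX" -> fallback_map                    -> "es-ES"
#   "fr-BE" -> base-language scan ("fr" + "-") -> "fr-FR"
#   "vi-VN" -> no supported match              -> "en-US"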
def _get_translation(self, lang: str, key: str, **kwargs) -> str:
"""Get translated string for the given language and key."""
target_lang = self._resolve_language(lang)
lang_dict = TRANSLATIONS.get(target_lang, TRANSLATIONS["en-US"])
text = lang_dict.get(key, TRANSLATIONS["en-US"].get(key, key))
if kwargs:
try:
text = text.format(**kwargs)
except Exception as e:
logger.warning(f"Translation formatting failed for {key}: {e}")
return text
async def _get_user_context(
self,
__user__: Optional[Dict[str, Any]],
__event_call__: Optional[Callable[[Any], Awaitable[None]]] = None,
) -> Dict[str, str]:
"""Extract basic user context with safe fallbacks."""
if isinstance(__user__, (list, tuple)):
user_data = __user__[0] if __user__ else {}
elif isinstance(__user__, dict):
user_data = __user__
else:
user_data = {}
user_id = user_data.get("id", "unknown_user")
user_name = user_data.get("name", "User")
user_language = user_data.get("language", "en-US")
if __event_call__:
try:
js_code = """
return (
document.documentElement.lang ||
localStorage.getItem('locale') ||
localStorage.getItem('language') ||
navigator.language ||
'en-US'
);
"""
frontend_lang = await asyncio.wait_for(
__event_call__({"type": "execute", "data": {"code": js_code}}),
timeout=1.0,
)
if frontend_lang and isinstance(frontend_lang, str):
user_language = frontend_lang
except asyncio.TimeoutError:
logger.warning(
"Failed to retrieve frontend language: Timeout (using fallback)"
)
except Exception as e:
logger.warning(
f"Failed to retrieve frontend language: {type(e).__name__}: {e}"
)
return {
"user_id": user_id,
"user_name": user_name,
"user_language": user_language,
}
def _parse_model_thresholds(self) -> Dict[str, Any]:
"""Parse model_thresholds string into a dictionary.
@@ -574,7 +816,7 @@ class Filter:
description="The temperature for summary generation.",
)
debug_mode: bool = Field(
default=False, description="Enable detailed logging for debugging."
)
show_debug_log: bool = Field(
default=False, description="Show debug logs in the frontend console"
@@ -582,6 +824,12 @@ class Filter:
show_token_usage_status: bool = Field(
default=True, description="Show token usage status notification"
)
token_usage_status_threshold: int = Field(
default=80,
ge=0,
le=100,
description="Only show token usage status when usage exceeds this percentage (0-100). Set to 0 to always show.",
)
enable_tool_output_trimming: bool = Field(
default=False,
description="Enable trimming of large tool outputs (only works with native function calling).",
@@ -654,20 +902,7 @@ class Filter:
def _count_tokens(self, text: str) -> int:
"""Counts the number of tokens in the text."""
if not text:
return 0
return _get_cached_tokens(text)
def _calculate_messages_tokens(self, messages: List[Dict]) -> int:
"""Calculates the total tokens for a list of messages."""
@@ -693,6 +928,20 @@ class Filter:
return total_tokens
def _estimate_messages_tokens(self, messages: List[Dict]) -> int:
"""Fast estimation of tokens based on character count (1/4 ratio)."""
total_chars = 0
for msg in messages:
content = msg.get("content", "")
if isinstance(content, list):
for part in content:
if isinstance(part, dict) and part.get("type") == "text":
total_chars += len(part.get("text", ""))
else:
total_chars += len(str(content))
return total_chars // 4
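# Worked example (illustrative): for
#   messages = [{"content": "12345678"},
#               {"content": [{"type": "text", "text": "abcd"}]}]
# total_chars is 8 + 4 = 12, so the estimate is 12 // 4 = 3 tokens.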
def _get_model_thresholds(self, model_id: str) -> Dict[str, int]:
"""Gets threshold configuration for a specific model.
@@ -830,11 +1079,13 @@ class Filter:
}})();
"""
asyncio.create_task(
__event_call__(
{
"type": "execute",
"data": {"code": js_code},
}
)
)
except Exception as e:
logger.error(f"Error emitting debug log: {e}")
@@ -876,17 +1127,55 @@ class Filter:
js_code = f"""
console.log("%c[Compression] {safe_message}", "{css}");
"""
asyncio.create_task(
event_call({"type": "execute", "data": {"code": js_code}})
)
except Exception as e:
logger.error(
f"Failed to process log to frontend: {type(e).__name__}: {e}"
)
def _should_show_status(self, usage_ratio: float) -> bool:
"""
Check if token usage status should be shown based on threshold.
Args:
usage_ratio: Current usage ratio (0.0 to 1.0)
Returns:
True if status should be shown, False otherwise
"""
if not self.valves.show_token_usage_status:
return False
# If threshold is 0, always show
if self.valves.token_usage_status_threshold == 0:
return True
# Check if usage exceeds threshold
threshold_ratio = self.valves.token_usage_status_threshold / 100.0
return usage_ratio >= threshold_ratio
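# Behaviour at the default threshold of 80 (illustrative):
#   usage_ratio 0.79 -> False (below threshold, status suppressed)
#   usage_ratio 0.80 -> True  (at or above threshold, status shown)
#   token_usage_status_threshold == 0 -> always True
#   show_token_usage_status == False  -> always False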
def _should_skip_compression(
self, body: dict, __model__: Optional[dict] = None
) -> bool:
"""
Check if compression should be skipped.
Returns True when the base model id or the requested model id
contains 'copilot_sdk' (case-insensitive).
"""
# Check if base model includes copilot_sdk
if __model__:
base_model_id = __model__.get("base_model_id", "")
if "copilot_sdk" in base_model_id.lower():
return True
# Also check model in body
model_id = body.get("model", "")
if "copilot_sdk" in model_id.lower():
return True
return False
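# Illustrative: a body of {"model": "copilot_sdk.gpt-4o"}, or a __model__ whose
# base_model_id contains "copilot_sdk" in any case, returns True; anything else
# returns False.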
async def inlet(
self,
@@ -903,6 +1192,19 @@ class Filter:
Compression Strategy: Only responsible for injecting existing summaries, no Token calculation.
"""
# Check if compression should be skipped (e.g., for copilot_sdk)
if self._should_skip_compression(body, __model__):
if self.valves.debug_mode:
logger.info(
"[Inlet] Skipping compression: copilot_sdk detected in base model"
)
if self.valves.show_debug_log and __event_call__:
await self._log(
"[Inlet] ⏭️ Skipping compression: copilot_sdk detected",
event_call=__event_call__,
)
return body
messages = body.get("messages", [])
# --- Native Tool Output Trimming (Opt-in, only for native function calling) ---
@@ -966,8 +1268,14 @@ class Filter:
final_answer = content[last_match_end:].strip()
if final_answer:
msg["content"] = self._get_translation(
(
__user__.get("language", "en-US")
if __user__
else "en-US"
),
"tool_trimmed",
content=final_answer,
)
trimmed_count += 1
else:
@@ -980,8 +1288,14 @@ class Filter:
if len(parts) > 1:
final_answer = parts[-1].strip()
if final_answer:
msg["content"] = self._get_translation(
(
__user__.get("language", "en-US")
if __user__
else "en-US"
),
"tool_trimmed",
content=final_answer,
)
trimmed_count += 1
@@ -1173,6 +1487,10 @@ class Filter:
# Target is to compress up to the (total - keep_last) message
target_compressed_count = max(0, len(messages) - self.valves.keep_last)
# Get user context for i18n
user_ctx = await self._get_user_context(__user__, __event_call__)
lang = user_ctx["user_language"]
await self._log(
f"[Inlet] Recorded target compression progress: {target_compressed_count}",
event_call=__event_call__,
@@ -1207,10 +1525,9 @@ class Filter:
# 2. Summary message (Inserted as Assistant message)
summary_content = (
self._get_translation(lang, "summary_prompt_prefix")
+ f"{summary_record.summary}"
+ self._get_translation(lang, "summary_prompt_suffix")
)
summary_msg = {"role": "assistant", "content": summary_content}
@@ -1249,16 +1566,27 @@ class Filter:
"max_context_tokens", self.valves.max_context_tokens
)
# --- Fast Estimation Check ---
estimated_tokens = self._estimate_messages_tokens(calc_messages)
# Since this is a hard limit check, only skip precise calculation if we are far below it (margin of 15%)
if estimated_tokens < max_context_tokens * 0.85:
total_tokens = estimated_tokens
await self._log(
f"[Inlet] 🔎 Fast Preflight Check (Est): {total_tokens}t / {max_context_tokens}t (Well within limit)",
event_call=__event_call__,
)
else:
# Calculate exact total tokens via tiktoken
total_tokens = await asyncio.to_thread(
self._calculate_messages_tokens, calc_messages
)
# Preflight Check Log
await self._log(
f"[Inlet] 🔎 Precise Preflight Check: {total_tokens}t / {max_context_tokens}t ({(total_tokens/max_context_tokens*100):.1f}%)",
event_call=__event_call__,
)
# If over budget, reduce history (Keep Last)
if total_tokens > max_context_tokens:
@@ -1325,7 +1653,9 @@ class Filter:
first_line_found = True
# Add placeholder if there's more content coming
if idx < last_line_idx:
kept_lines.append(
self._get_translation(lang, "content_collapsed")
)
continue
# Keep last non-empty line
@@ -1347,8 +1677,13 @@ class Filter:
target_msg["metadata"]["is_trimmed"] = True
# Calculate token reduction
# Use current token strategy
if total_tokens == estimated_tokens:
old_tokens = len(content) // 4
new_tokens = len(target_msg["content"]) // 4
else:
old_tokens = self._count_tokens(content)
new_tokens = self._count_tokens(target_msg["content"])
diff = old_tokens - new_tokens
total_tokens -= diff
@@ -1362,7 +1697,12 @@ class Filter:
# Strategy 2: Fallback - Drop Oldest Message Entirely (FIFO)
# (User requested to remove progressive trimming for other cases)
dropped = tail_messages.pop(0)
if total_tokens == estimated_tokens:
dropped_tokens = len(str(dropped.get("content", ""))) // 4
else:
dropped_tokens = self._count_tokens(
str(dropped.get("content", ""))
)
total_tokens -= dropped_tokens
if self.valves.show_debug_log and __event_call__:
@@ -1382,14 +1722,24 @@ class Filter:
final_messages = candidate_messages
# Calculate detailed token stats for logging
if total_tokens == estimated_tokens:
system_tokens = (
len(system_prompt_msg.get("content", "")) // 4
if system_prompt_msg
else 0
)
head_tokens = self._estimate_messages_tokens(head_messages)
summary_tokens = len(summary_content) // 4
tail_tokens = self._estimate_messages_tokens(tail_messages)
else:
system_tokens = (
self._count_tokens(system_prompt_msg.get("content", ""))
if system_prompt_msg
else 0
)
head_tokens = self._calculate_messages_tokens(head_messages)
summary_tokens = self._count_tokens(summary_content)
tail_tokens = self._calculate_messages_tokens(tail_messages)
system_info = (
f"System({system_tokens}t)" if system_prompt_msg else "System(0t)"
@@ -1408,22 +1758,43 @@ class Filter:
# Prepare status message (Context Usage format)
if max_context_tokens > 0:
usage_ratio = total_section_tokens / max_context_tokens
# Only show status if threshold is met
if self._should_show_status(usage_ratio):
status_msg = self._get_translation(
lang,
"status_context_usage",
tokens=total_section_tokens,
max_tokens=max_context_tokens,
ratio=f"{usage_ratio*100:.1f}",
)
if usage_ratio > 0.9:
status_msg += self._get_translation(lang, "status_high_usage")
if __event_emitter__:
await __event_emitter__(
{
"type": "status",
"data": {
"description": status_msg,
"done": True,
},
}
)
else:
# For the case where max_context_tokens is 0, show summary info without threshold check
if self.valves.show_token_usage_status and __event_emitter__:
status_msg = self._get_translation(
lang, "status_loaded_summary", count=compressed_count
)
await __event_emitter__(
{
"type": "status",
"data": {
"description": status_msg,
"done": True,
},
}
)
# Emit debug log to frontend (Keep the structured log as well)
await self._emit_debug_log(
@@ -1454,9 +1825,20 @@ class Filter:
"max_context_tokens", self.valves.max_context_tokens
)
# --- Fast Estimation Check ---
estimated_tokens = self._estimate_messages_tokens(calc_messages)
# Only skip precise calculation if we are clearly below the limit
if estimated_tokens < max_context_tokens * 0.85:
total_tokens = estimated_tokens
await self._log(
f"[Inlet] 🔎 Fast limit check (Est): {total_tokens}t / {max_context_tokens}t",
event_call=__event_call__,
)
else:
total_tokens = await asyncio.to_thread(
self._calculate_messages_tokens, calc_messages
)
if total_tokens > max_context_tokens:
await self._log(
@@ -1476,7 +1858,12 @@ class Filter:
> start_trim_index + 1 # Keep at least 1 message after keep_first
):
dropped = final_messages.pop(start_trim_index)
if total_tokens == estimated_tokens:
dropped_tokens = len(str(dropped.get("content", ""))) // 4
else:
dropped_tokens = self._count_tokens(
str(dropped.get("content", ""))
)
total_tokens -= dropped_tokens
await self._log(
@@ -1485,23 +1872,30 @@ class Filter:
)
# Send status notification (Context Usage format)
if __event_emitter__:
if max_context_tokens > 0:
usage_ratio = total_tokens / max_context_tokens
# Only show status if threshold is met
if self._should_show_status(usage_ratio):
status_msg = self._get_translation(
lang,
"status_context_usage",
tokens=total_tokens,
max_tokens=max_context_tokens,
ratio=f"{usage_ratio*100:.1f}",
)
if usage_ratio > 0.9:
status_msg += self._get_translation(lang, "status_high_usage")
await __event_emitter__(
{
"type": "status",
"data": {
"description": status_msg,
"done": True,
},
}
)
body["messages"] = final_messages
@@ -1517,6 +1911,7 @@ class Filter:
body: dict,
__user__: Optional[dict] = None,
__metadata__: dict = None,
__model__: dict = None,
__event_emitter__: Callable[[Any], Awaitable[None]] = None,
__event_call__: Callable[[Any], Awaitable[None]] = None,
) -> dict:
@@ -1524,6 +1919,23 @@ class Filter:
Executed after the LLM response is complete.
Calculates Token count in the background and triggers summary generation (does not block current response, does not affect content output).
"""
# Check if compression should be skipped (e.g., for copilot_sdk)
if self._should_skip_compression(body, __model__):
if self.valves.debug_mode:
logger.info(
"[Outlet] Skipping compression: copilot_sdk detected in base model"
)
if self.valves.show_debug_log and __event_call__:
await self._log(
"[Outlet] ⏭️ Skipping compression: copilot_sdk detected",
event_call=__event_call__,
)
return body
# Get user context for i18n
user_ctx = await self._get_user_context(__user__, __event_call__)
lang = user_ctx["user_language"]
chat_ctx = self._get_chat_context(body, __metadata__)
chat_id = chat_ctx["chat_id"]
if not chat_id:
@@ -1547,6 +1959,7 @@ class Filter:
body,
__user__,
target_compressed_count,
lang,
__event_emitter__,
__event_call__,
)
@@ -1561,6 +1974,7 @@ class Filter:
body: dict,
user_data: Optional[dict],
target_compressed_count: Optional[int],
lang: str = "en-US",
__event_emitter__: Callable[[Any], Awaitable[None]] = None,
__event_call__: Callable[[Any], Awaitable[None]] = None,
):
@@ -1595,37 +2009,58 @@ class Filter:
event_call=__event_call__,
)
# --- Fast Estimation Check ---
estimated_tokens = self._estimate_messages_tokens(messages)
# For triggering summary generation, we need to be more precise if we are in the grey zone
# Margin is 15% (skip tiktoken if estimated is < 85% of threshold)
# Note: We still use tiktoken if we exceed threshold, because we want an accurate usage status report
if estimated_tokens < compression_threshold_tokens * 0.85:
current_tokens = estimated_tokens
await self._log(
f"[🔍 Background Calculation] Fast estimate ({current_tokens}) is well below threshold ({compression_threshold_tokens}). Skipping tiktoken.",
event_call=__event_call__,
)
else:
# Calculate Token count precisely in a background thread
current_tokens = await asyncio.to_thread(
self._calculate_messages_tokens, messages
)
await self._log(
f"[🔍 Background Calculation] Precise token count: {current_tokens}",
event_call=__event_call__,
)
# Send status notification (Context Usage format)
if __event_emitter__:
max_context_tokens = thresholds.get(
"max_context_tokens", self.valves.max_context_tokens
)
if max_context_tokens > 0:
usage_ratio = current_tokens / max_context_tokens
# Only show status if threshold is met
if self._should_show_status(usage_ratio):
status_msg = self._get_translation(
lang,
"status_context_usage",
tokens=current_tokens,
max_tokens=max_context_tokens,
ratio=f"{usage_ratio*100:.1f}",
)
if usage_ratio > 0.9:
status_msg += self._get_translation(
lang, "status_high_usage"
)
await __event_emitter__(
{
"type": "status",
"data": {
"description": status_msg,
"done": True,
},
}
)
# Check if compression is needed
if current_tokens >= compression_threshold_tokens:
@@ -1642,6 +2077,7 @@ class Filter:
body,
user_data,
target_compressed_count,
lang,
__event_emitter__,
__event_call__,
)
@@ -1672,6 +2108,7 @@ class Filter:
body: dict,
user_data: Optional[dict],
target_compressed_count: Optional[int],
lang: str = "en-US",
__event_emitter__: Callable[[Any], Awaitable[None]] = None,
__event_call__: Callable[[Any], Awaitable[None]] = None,
):
@@ -1811,7 +2248,9 @@ class Filter:
{
"type": "status",
"data": {
"description": self._get_translation(
lang, "status_generating_summary"
),
"done": False,
},
}
@@ -1849,7 +2288,11 @@ class Filter:
{
"type": "status",
"data": {
"description": self._get_translation(
lang,
"status_loaded_summary",
count=len(middle_messages),
),
"done": True,
},
}
@@ -1910,10 +2353,9 @@ class Filter:
# Summary
summary_content = (
self._get_translation(lang, "summary_prompt_prefix")
+ f"{new_summary}"
+ self._get_translation(lang, "summary_prompt_suffix")
)
summary_msg = {"role": "assistant", "content": summary_content}
@@ -1943,23 +2385,32 @@ class Filter:
max_context_tokens = thresholds.get(
"max_context_tokens", self.valves.max_context_tokens
)
# 6. Emit Status (only if threshold is met)
if max_context_tokens > 0:
usage_ratio = token_count / max_context_tokens
# Only show status if threshold is met
if self._should_show_status(usage_ratio):
status_msg = self._get_translation(
lang,
"status_context_summary_updated",
tokens=token_count,
max_tokens=max_context_tokens,
ratio=f"{usage_ratio*100:.1f}",
)
if usage_ratio > 0.9:
status_msg += self._get_translation(
lang, "status_high_usage"
)
await __event_emitter__(
{
"type": "status",
"data": {
"description": status_msg,
"done": True,
},
}
)
except Exception as e:
await self._log(
f"[Status] Error calculating tokens: {e}",
@@ -1979,7 +2430,9 @@ class Filter:
{
"type": "status",
"data": {
"description": self._get_translation(
lang, "status_summary_error", error=str(e)[:100]
),
"done": True,
},
}