feat(async-context-compression): v1.6.0 absolute system message protection
- Redefine keep_first to count only non-system messages, protecting the first N user/assistant exchanges plus all interleaved system messages
- System messages in the compression gap are now extracted and preserved as original messages instead of being summarized
- System messages dropped during forced trimming are re-inserted into the final output
- Change keep_first default from 1 to 0
- Update docstring, README, README_CN, WORKFLOW_GUIDE_CN, and docs mirrors

Fixes #62
@@ -1,6 +1,6 @@
 # Async Context Compression Filter

-| By [Fu-Jie](https://github.com/Fu-Jie) · v1.5.0 | [⭐ Star this repo](https://github.com/Fu-Jie/openwebui-extensions) |
+| By [Fu-Jie](https://github.com/Fu-Jie) · v1.6.0 | [⭐ Star this repo](https://github.com/Fu-Jie/openwebui-extensions) |
 | :--- | ---: |
@@ -8,6 +8,25 @@
 This filter reduces token consumption in long conversations through intelligent summarization and message compression while keeping conversations coherent.

+## Install with Batch Install Plugins
+
+If you already use [Batch Install Plugins from GitHub](https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/tools/batch-install-plugins), you can install or update this plugin with:
+
+```text
+Install plugin from Fu-Jie/openwebui-extensions
+```
+
+When the selection dialog opens, search for this plugin, check it, and continue.
+
+> [!IMPORTANT]
+> If the official OpenWebUI Community version is already installed, remove it first. After that, Batch Install Plugins can keep this plugin updated in future runs.
+
+## What's new in 1.6.0
+
+- **Fixed `keep_first` Logic**: Redefined `keep_first` to protect the first N **non-system** messages plus all interleaved system messages. This ensures initial context (e.g., identity, task instructions) is preserved correctly.
+- **Absolute System Message Protection**: System messages are now strictly excluded from compression. Any system message encountered in the history (even late-injected ones) is preserved as an original message in the final context.
+- **Improved Context Assembly**: Summaries now target only user and assistant dialogue, ensuring that system instructions injected by other plugins are never "eaten" by the summarizer.
+
 ## What's new in 1.5.0

 - **External Chat Reference Summaries**: Added support for referenced chat context blocks that can reuse cached summaries, inject small referenced chats directly, or generate summaries for larger referenced chats before injection.
@@ -41,6 +60,10 @@ This filter reduces token consumption in long conversations through intelligent
 ## What This Fixes

+- **Problem: System messages being summarized or lost.**
+  Previously, the filter could include system messages (especially those injected late by other plugins) in its summarization zone, causing important instructions to be lost. Now all system messages are strictly preserved in their original role and never summarized.
+- **Problem: Incorrect `keep_first` behavior.**
+  Previously, `keep_first` simply took the first N messages. If those were only system messages, the initial user/assistant messages (which often carry important context) would be summarized. Now `keep_first` ensures that N non-system messages are protected.
 - **Problem 1: A referenced chat could break the current request.**
   Before, if the filter needed to summarize a referenced chat and that LLM call failed, the current chat could fail with it. Now it degrades gracefully and injects direct context instead.
 - **Problem 2: Some referenced chats were being cut too aggressively.**
@@ -128,7 +151,7 @@ flowchart TD
 | `priority` | `10` | Execution order; lower runs earlier. |
 | `compression_threshold_tokens` | `64000` | Trigger asynchronous summary when total tokens exceed this value. Set to 50%-70% of your model's context window. |
 | `max_context_tokens` | `128000` | Hard cap for context; older messages (except protected ones) are dropped if exceeded. |
-| `keep_first` | `1` | Always keep the first N messages (protects system prompts). |
+| `keep_first` | `1` | Number of initial **non-system** messages to always keep (plus all preceding system prompts). |
 | `keep_last` | `6` | Always keep the last N messages to preserve recent context. |
 | `summary_model` | `None` | Model for summaries. Strongly recommended to set a fast, economical model (e.g., `gemini-2.5-flash`, `deepseek-v3`). Falls back to the current chat model when empty. |
 | `summary_model_max_context` | `0` | Input context window used to fit summary requests. If `0`, falls back to `model_thresholds` or global `max_context_tokens`. |
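The redefined `keep_first` semantics can be sketched as a small standalone helper. This is a simplified illustration, not the filter's actual implementation; messages are plain OpenAI-style role dicts:

```python
def protected_head(messages, keep_first):
    """Return the protected head: the first `keep_first` non-system
    messages plus every system message interleaved among them."""
    if keep_first <= 0:
        return []
    kept, non_system = 0, 0
    for i, msg in enumerate(messages):
        if msg.get("role") != "system":
            non_system += 1
        kept = i + 1
        if non_system >= keep_first:
            break
    return messages[:kept]

msgs = [
    {"role": "system", "content": "base prompt"},
    {"role": "user", "content": "first question"},
    {"role": "system", "content": "late-injected instructions"},
    {"role": "assistant", "content": "first answer"},
]
# keep_first=1 protects the first user message and the system prompt before it.
head = protected_head(msgs, keep_first=1)  # first 2 messages
```

Under the pre-1.6.0 rule, `keep_first=1` would have been satisfied by the leading system prompt alone, leaving the first user message eligible for summarization.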
@@ -1,6 +1,6 @@
 # Async Context Compression Filter

-| By [Fu-Jie](https://github.com/Fu-Jie) · v1.5.0 | [⭐ Star this repo](https://github.com/Fu-Jie/openwebui-extensions) |
+| By [Fu-Jie](https://github.com/Fu-Jie) · v1.6.0 | [⭐ Star this repo](https://github.com/Fu-Jie/openwebui-extensions) |
 | :--- | ---: |
@@ -10,6 +10,25 @@
 This filter significantly reduces token consumption in long conversations through intelligent summarization and message compression while keeping the conversation coherent.

+## Install with Batch Install Plugins
+
+If you have already installed [Batch Install Plugins from GitHub](https://github.com/Fu-Jie/openwebui-extensions/tree/main/plugins/tools/batch-install-plugins), you can install or update this plugin with:
+
+```text
+Install plugin from Fu-Jie/openwebui-extensions
+```
+
+When the selection dialog opens, search for this plugin, check it, and continue.
+
+> [!IMPORTANT]
+> If you have already installed the same plugin from the official OpenWebUI Community, delete the old version first, or reinstalling may fail. After removal, Batch Install Plugins can keep this plugin updated in future runs.
+
+## What's new in 1.6.0
+
+- **Fixed `keep_first` logic**: `keep_first` is redefined to protect the first N **non-system** messages (plus all system prompts before them). This ensures initial context such as identity and task instructions is preserved correctly.
+- **Absolute system message protection**: System messages are now strictly excluded from compression. Any system message found in the history (even late-injected ones) is kept as an original message in the final context.
+- **Improved context assembly**: Summaries now cover only user and assistant dialogue, so system instructions injected by other plugins are never "eaten" by the summarizer.
+
 ## What's new in 1.5.0

 - **External chat reference summaries**: Added summary support for referenced chat context. Cached summaries can be reused, smaller referenced chats injected directly, and larger referenced chats summarized before injection.
@@ -39,12 +58,14 @@
 - ✅ **Smart model matching**: Custom models automatically inherit the base model's threshold configuration.
 - ⚠ **Multimodal support**: Image content is preserved, but its tokens are **not counted**. Adjust thresholds accordingly.

 Detailed workings and longer notes are still available in the [workflow guide](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/WORKFLOW_GUIDE_CN.md).

 ---

 ## What This Fixes (plain-language version)

+- **Problem: System messages were summarized or lost.**
+  Previously, the filter could include referenced or late-injected system messages in the summarization zone, losing important instructions. Now all system messages are kept verbatim and never summarized.
+- **Problem: `keep_first` did not behave as expected.**
+  Previously, `keep_first` simply took the first N messages. If those were all system messages, the initial Q&A (often important for context) would be compressed away. Now `keep_first` guarantees that N non-system messages are protected.
 - **Problem 1: Referencing another chat could take the current conversation down when summarization failed.**
   Previously, if the filter had to summarize a referenced chat first and that LLM call failed, the current request could fail as well. It now summarizes when possible and falls back to injecting direct context on failure, so the current conversation is unaffected.
 - **Problem 2: Some referenced chats were cut too early, losing too much information.**
@@ -72,11 +93,11 @@ flowchart TD
 F -- 是 --> G[Reuse the cached summary directly]
 F -- 否 --> H{Fits directly into the current budget?}
 H -- 是 --> I[Inject the full referenced chat text]
-H -- 否 --> J[Prepare summary input for the referenced chat]
+H -- No --> J[Prepare summary input for the referenced chat]

 J --> K{Referenced-chat summary call succeeded?}
 K -- 是 --> L[Inject the generated reference summary]
-K -- 否 --> M[Fall back to injecting direct context]
+K -- No --> M[Fall back to injecting direct context]

 G --> D
 I --> D
@@ -136,7 +157,7 @@ flowchart TD
 | `priority` | `10` | Filter execution order; smaller values run first. |
 | `compression_threshold_tokens` | `64000` | **Important**: generate a background summary when total context tokens exceed this value; recommended 50%-70% of the model's context window. |
 | `max_context_tokens` | `128000` | **Important**: hard context cap; the earliest messages are removed when exceeded (protected messages are kept). |
-| `keep_first` | `1` | Always keep the first N messages to protect system prompts or environment variables. |
+| `keep_first` | `1` | Always keep the first N **non-system** messages (plus all system prompts before them). |
 | `keep_last` | `6` | Always keep the last N messages to keep recent context coherent. |

 ### Summary Generation Settings
@@ -22,7 +22,7 @@ Filters act as middleware in the message pipeline:
 Reduces token consumption in long conversations with safer summary fallbacks and clearer failure visibility.

-**Version:** 1.5.0
+**Version:** 1.6.0

 [:octicons-arrow-right-24: Documentation](async-context-compression.md)
@@ -22,7 +22,7 @@ Filters act as middleware in the message pipeline:
 Reduces token consumption in long conversations and keeps them coherent through more robust summary fallbacks and clearer failure messages.

-**Version:** 1.5.0
+**Version:** 1.6.0

 [:octicons-arrow-right-24: Documentation](async-context-compression.zh.md)
@@ -1,6 +1,6 @@
 # Async Context Compression Filter

-| By [Fu-Jie](https://github.com/Fu-Jie) · v1.5.0 | [⭐ Star this repo](https://github.com/Fu-Jie/openwebui-extensions) |
+| By [Fu-Jie](https://github.com/Fu-Jie) · v1.6.0 | [⭐ Star this repo](https://github.com/Fu-Jie/openwebui-extensions) |
 | :--- | ---: |
@@ -21,6 +21,12 @@ When the selection dialog opens, search for this plugin, check it, and continue.
 > [!IMPORTANT]
 > If the official OpenWebUI Community version is already installed, remove it first. After that, Batch Install Plugins can keep this plugin updated in future runs.

+## What's new in 1.6.0
+
+- **Fixed `keep_first` Logic**: Redefined `keep_first` to protect the first N **non-system** messages plus all interleaved system messages. This ensures initial context (e.g., identity, task instructions) is preserved correctly.
+- **Absolute System Message Protection**: System messages are now strictly excluded from compression. Any system message encountered in the history (even late-injected ones) is preserved as an original message in the final context. This ensures dynamic instructions (like live time/location from other plugins) remain accurate and are never summarized.
+- **Improved Context Assembly**: Summaries now target only user and assistant dialogue, ensuring that system instructions injected by other plugins are never "eaten" by the summarizer.
+
 ## What's new in 1.5.0

 - **External Chat Reference Summaries**: Added support for referenced chat context blocks that can reuse cached summaries, inject small referenced chats directly, or generate summaries for larger referenced chats before injection.
@@ -54,6 +60,10 @@ When the selection dialog opens, search for this plugin, check it, and continue.
 ## What This Fixes

+- **Problem: System messages being summarized or lost.**
+  Previously, the filter could include system messages (especially those injected late by other plugins) in its summarization zone, causing important instructions to be lost. Now all system messages are strictly preserved in their original role and never summarized.
+- **Problem: Incorrect `keep_first` behavior.**
+  Previously, `keep_first` simply took the first N messages. If those were only system messages, the initial user/assistant messages (which often carry important context) would be summarized. Now `keep_first` ensures that N non-system messages are protected.
 - **Problem 1: A referenced chat could break the current request.**
   Before, if the filter needed to summarize a referenced chat and that LLM call failed, the current chat could fail with it. Now it degrades gracefully and injects direct context instead.
 - **Problem 2: Some referenced chats were being cut too aggressively.**
@@ -141,7 +151,7 @@ flowchart TD
 | `priority` | `10` | Execution order; lower runs earlier. |
 | `compression_threshold_tokens` | `64000` | Trigger asynchronous summary when total tokens exceed this value. Set to 50%-70% of your model's context window. |
 | `max_context_tokens` | `128000` | Hard cap for context; older messages (except protected ones) are dropped if exceeded. |
-| `keep_first` | `1` | Always keep the first N messages (protects system prompts). |
+| `keep_first` | `0` | Number of initial **non-system** messages to always keep (plus all preceding system prompts). |
 | `keep_last` | `6` | Always keep the last N messages to preserve recent context. |
 | `summary_model` | `None` | Model for summaries. Strongly recommended to set a fast, economical model (e.g., `gemini-2.5-flash`, `deepseek-v3`). Falls back to the current chat model when empty. |
 | `summary_model_max_context` | `0` | Input context window used to fit summary requests. If `0`, falls back to `model_thresholds` or global `max_context_tokens`. |
@@ -1,6 +1,6 @@
 # Async Context Compression Filter

-| By [Fu-Jie](https://github.com/Fu-Jie) · v1.5.0 | [⭐ Star this repo](https://github.com/Fu-Jie/openwebui-extensions) |
+| By [Fu-Jie](https://github.com/Fu-Jie) · v1.6.0 | [⭐ Star this repo](https://github.com/Fu-Jie/openwebui-extensions) |
 | :--- | ---: |
@@ -23,6 +23,12 @@
 > [!IMPORTANT]
 > If you have already installed the same plugin from the official OpenWebUI Community, delete the old version first, or reinstalling may fail. After removal, Batch Install Plugins can keep this plugin updated in future runs.

+## What's new in 1.6.0
+
+- **Fixed `keep_first` logic**: `keep_first` is redefined to protect the first N **non-system** messages (plus all system prompts before them). This ensures initial context such as identity and task instructions is preserved correctly.
+- **Absolute system message protection**: System messages are now strictly excluded from compression. Any system message found in the history (even late-injected ones) is kept as an original message in the final context.
+- **Improved context assembly**: Summaries now cover only user and assistant dialogue, so system instructions injected by other plugins are never "eaten" by the summarizer.
+
 ## What's new in 1.5.0

 - **External chat reference summaries**: Added summary support for referenced chat context. Cached summaries can be reused, smaller referenced chats injected directly, and larger referenced chats summarized before injection.
@@ -52,12 +58,14 @@
 - ✅ **Smart model matching**: Custom models automatically inherit the base model's threshold configuration.
 - ⚠ **Multimodal support**: Image content is preserved, but its tokens are **not counted**. Adjust thresholds accordingly.

 Detailed workings and longer notes are still available in the [workflow guide](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/WORKFLOW_GUIDE_CN.md).

 ---

 ## What This Fixes (plain-language version)

+- **Problem: System messages were summarized or lost.**
+  Previously, the filter could include referenced or late-injected system messages in the summarization zone, losing important instructions. Now all system messages are kept verbatim and never summarized.
+- **Problem: `keep_first` did not behave as expected.**
+  Previously, `keep_first` simply took the first N messages. If those were all system messages, the initial Q&A (often important for context) would be compressed away. Now `keep_first` guarantees that N non-system messages are protected.
 - **Problem 1: Referencing another chat could take the current conversation down when summarization failed.**
   Previously, if the filter had to summarize a referenced chat first and that LLM call failed, the current request could fail as well. It now summarizes when possible and falls back to injecting direct context on failure, so the current conversation is unaffected.
 - **Problem 2: Some referenced chats were cut too early, losing too much information.**
@@ -85,11 +93,11 @@ flowchart TD
 F -- 是 --> G[Reuse the cached summary directly]
 F -- 否 --> H{Fits directly into the current budget?}
 H -- 是 --> I[Inject the full referenced chat text]
-H -- 否 --> J[Prepare summary input for the referenced chat]
+H -- No --> J[Prepare summary input for the referenced chat]

 J --> K{Referenced-chat summary call succeeded?}
 K -- 是 --> L[Inject the generated reference summary]
-K -- 否 --> M[Fall back to injecting direct context]
+K -- No --> M[Fall back to injecting direct context]

 G --> D
 I --> D
@@ -149,7 +157,7 @@ flowchart TD
 | `priority` | `10` | Filter execution order; smaller values run first. |
 | `compression_threshold_tokens` | `64000` | **Important**: generate a background summary when total context tokens exceed this value; recommended 50%-70% of the model's context window. |
 | `max_context_tokens` | `128000` | **Important**: hard context cap; the earliest messages are removed when exceeded (protected messages are kept). |
-| `keep_first` | `1` | Always keep the first N messages to protect system prompts or environment variables. |
+| `keep_first` | `0` | Always keep the first N **non-system** messages (plus all system prompts before them). |
 | `keep_last` | `6` | Always keep the last N messages to keep recent context coherent. |

 ### Summary Generation Settings
@@ -110,7 +110,7 @@
 ```

 **Key parameters**:
-- `keep_first`: keep the first N messages (default 1)
+- `keep_first`: keep the first N non-system messages (default 0)
 - `keep_last`: keep the last N messages (default 6)
 - Summary injection position: before the content of the first message
@@ -432,7 +432,7 @@ Valves(
 max_context_tokens=128000,  # hard cap

 # Message retention policy
-keep_first=1,  # keep the first message (system prompt)
+keep_first=0,  # keep no leading non-system messages (default)
 keep_last=6,   # keep the last 6 messages (recent dialogue)

 # Summary model
@@ -601,13 +601,13 @@ outlet stage:
 Total messages: 20
 Estimated tokens: 8000

-After compression (keep_first=1, keep_last=6):
-Head messages: 1 (1600 tokens)
+After compression (keep_first=0, keep_last=6):
+Head messages: 0 (0 tokens)
 Summary: ~800 tokens (embedded in the head)
 Tail messages: 6 (3200 tokens)
-Total: 7 effective input messages (~5600 tokens)
+Total: 6 effective input messages (~4000 tokens)

-Saved: 8000 - 5600 = 2400 tokens (30% saved)
+Saved: 8000 - 4000 = 4000 tokens (50% saved)

 Savings can exceed 65% as the conversation grows
 ```
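The savings arithmetic in the example above can be checked directly (the token counts are the guide's illustrative figures, not measured values):

```python
before_tokens = 8000   # 20-message conversation, estimated
head_tokens = 0        # keep_first=0 -> no protected head messages
summary_tokens = 800   # summary embedded in the head
tail_tokens = 3200     # keep_last=6 recent messages

after_tokens = head_tokens + summary_tokens + tail_tokens
saved = before_tokens - after_tokens
percent = 100 * saved // before_tokens
print(after_tokens, saved, percent)  # 4000 4000 50
```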
@@ -5,30 +5,23 @@ author: Fu-Jie
 author_url: https://github.com/Fu-Jie/openwebui-extensions
 funding_url: https://github.com/open-webui
 description: Reduces token consumption in long conversations while maintaining coherence through intelligent summarization and message compression.
-version: 1.5.0
+version: 1.6.0
 openwebui_id: b1655bc8-6de9-4cad-8cb5-a6f7829a02ce
 license: MIT

 ═══════════════════════════════════════════════════════════════════════════════
 📌 What's new in 1.5.0
 ═══════════════════════════════════════════════════════════════════════════════

 ✅ Stronger Working-Memory Prompt: Refined XML summary instructions to better preserve useful context in general chat and multi-step tool workflows.
 ✅ Clearer Frontend Debug Logs: Reworked browser-console debug output into grouped, structured snapshots for faster diagnosis.
 ✅ Safer Tool Trimming Defaults: Native tool-output trimming is now enabled by default with a configurable 600-character threshold.

 ═══════════════════════════════════════════════════════════════════════════════
 📌 Overview
 ═══════════════════════════════════════════════════════════════════════════════

-This filter significantly reduces token consumption in long conversations by using intelligent summarization and message compression, while maintaining conversational coherence.
+This filter reduces token consumption in long conversations through intelligent
+summarization and message compression while maintaining conversational coherence.

 Core Features:
-✅ Automatic compression triggered by Token count threshold
+✅ Automatic compression triggered by token count threshold
 ✅ Asynchronous summary generation (does not block user response)
 ✅ Persistent storage with database support (PostgreSQL and SQLite)
-✅ Flexible retention policy (configurable to keep first and last N messages)
 ✅ Smart summary injection to maintain context
+✅ Flexible retention policy (keep first N non-system messages + last N messages)
+✅ Absolute system message protection (never compressed or discarded)
 ✅ Structure-aware trimming to preserve document skeleton
 ✅ Native tool output trimming for function calling support
@@ -41,21 +34,51 @@ Phase 1: Inlet (Pre-request processing)
 1. Receives all messages in the current conversation.
 2. Checks for a previously saved summary.
 3. If a summary exists and the message count exceeds the retention threshold:
-├─ Extracts the first N messages to be kept.
+├─ Extracts the first N non-system messages to be kept (plus all
+│  interleaved system messages).
 ├─ Injects the summary into the first message.
 ├─ Extracts the last N messages to be kept.
-└─ Combines them into a new message list: [Kept First Messages + Summary] + [Kept Last Messages].
+└─ Combines them into: [Kept First + Summary + Gap System Messages + Kept Last]
 4. Sends the compressed message list to the LLM.

 Phase 2: Outlet (Post-response processing)
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 1. Triggered after the LLM response is complete.
-2. Checks if the Token count has reached the compression threshold.
-3. If the threshold is met, an asynchronous background task is started to generate a summary:
-├─ Extracts messages to be summarized (excluding the kept first and last messages).
+2. Checks if the token count has reached the compression threshold.
+3. If the threshold is met, an asynchronous background task is started:
+├─ Extracts messages to be summarized (excluding the kept first and last).
 ├─ Calls the LLM to generate a concise summary.
 └─ Saves the summary to the database.

+═══════════════════════════════════════════════════════════════════════════════
+🛡️ System Message Protection
+═══════════════════════════════════════════════════════════════════════════════
+
+System messages are strictly excluded from compression and always preserved in
+the final context. This ensures that dynamic instructions injected by other
+plugins (e.g., live time/location context) remain accurate throughout the
+conversation.
+
+Protection Rules:
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+1. `keep_first` counts only non-system messages. System messages within the
+   first N non-system messages are automatically preserved.
+2. System messages in the compression gap (between kept-first and kept-last)
+   are extracted and preserved as original messages, not summarized.
+3. During forced trimming (when exceeding `max_context_tokens`), system
+   messages from dropped atomic groups are re-inserted into the final output.
+
+Example:
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+Messages: [sys, user1, sys(injected), user2, ..., user10, user11]
+keep_first=0, keep_last=2
+
+Effective keep_first=0 (no non-system messages protected)
+Gap: [sys, user1, sys(injected), user2, ..., user9]
+Preserved from gap: [sys, sys(injected)]
+
+Final output: [sys, summary, sys(injected), user10, user11]
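The example above can be replayed with a short sketch. This is a simplified reconstruction: the leading system prompt is hoisted to the front (as the filter does with the main system prompt), the summary is a placeholder assistant message, and the head slice assumes no system messages interleave within the first `keep_first` non-system messages:

```python
def compress(messages, keep_first, keep_last):
    # Hoist the leading system prompt (handled separately by the filter).
    system_prompt, rest = None, messages
    if messages and messages[0]["role"] == "system":
        system_prompt, rest = messages[0], messages[1:]

    head = rest[:keep_first]  # simplification; empty when keep_first=0
    tail = rest[len(rest) - keep_last:]
    gap = rest[keep_first:len(rest) - keep_last]

    # Gap system messages are preserved verbatim, never summarized.
    preserved = [m for m in gap if m["role"] == "system"]
    summary = {"role": "assistant", "content": "[summary of gap dialogue]"}

    out = head + [summary] + preserved + tail
    return ([system_prompt] if system_prompt else []) + out

msgs = (
    [{"role": "system", "content": "sys"},
     {"role": "user", "content": "user1"},
     {"role": "system", "content": "sys(injected)"}]
    + [{"role": "user", "content": f"user{i}"} for i in range(2, 12)]
)
roles = [m["role"] for m in compress(msgs, keep_first=0, keep_last=2)]
# -> ['system', 'assistant', 'system', 'user', 'user'], i.e.
#    [sys, summary, sys(injected), user10, user11]
```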

 ═══════════════════════════════════════════════════════════════════════════════
 💾 Storage
 ═══════════════════════════════════════════════════════════════════════════════
@@ -80,7 +103,7 @@ Open WebUI's database settings automatically.
 📊 Compression Example
 ═══════════════════════════════════════════════════════════════════════════════

-Scenario: A 20-message conversation (Default settings: keep first 1, keep last 6)
+Scenario: A 20-message conversation (Default settings: keep first 0, keep last 6)
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Before Compression:
 Message 1: [Initial prompt + First question]
@@ -104,75 +127,89 @@ Scenario: A 20-message conversation (Default settings: keep first 1, keep last 6

 priority
 Default: 10
-Description: The execution order of the filter. Lower numbers run first.
+Description: Priority level for the filter operations. Lower numbers run first.

 compression_threshold_tokens
 Default: 64000
-Description: When the total context Token count exceeds this value, compression is triggered.
-Recommendation: Adjust based on your model's context window and cost.
+Description: When total context Token count exceeds this value, trigger compression (Global Default).

 max_context_tokens
 Default: 128000
-Description: Hard limit for context. Exceeding this value will force removal of the earliest messages.
+Description: Hard limit for context. Exceeding this value will force removal of earliest messages (Global Default).

 model_thresholds
-Default: {}
-Description: Threshold override configuration for specific models.
-Example: {"gpt-4": {"compression_threshold_tokens": 8000, "max_context_tokens": 32000}}
+Default: "" (empty string)
+Description: Per-model threshold overrides.
+Format: model_id:compression_threshold:max_context (comma-separated).
+Example: gpt-4:8000:32000,claude-3:100000:200000

+keep_first
+Default: 0
+Description: Keep the first N non-system messages plus all interleaved system messages. Set to 0 to disable.
+
+keep_last
+Default: 6
+Description: Always keep the last N full messages.
+
+summary_model
+Default: None
+Description: The model ID used to generate the summary. If empty, uses the current conversation's model.
+Recommendation:
+- Configure a fast, economical, and compatible model, such as `deepseek-v3`, `gemini-2.5-flash`, `gpt-4.1`.
+- If the current conversation uses a pipeline (Pipe) model or a model that does not support standard generation APIs, this field must be specified.
+
+summary_model_max_context
+Default: 0
+Description: Max context tokens for the summary model. If 0, falls back to model_thresholds or global max_context_tokens.
+Example: gemini-flash=1000000, gpt-4o-mini=128000
+
+max_summary_tokens
+Default: 16384
+Description: The maximum number of tokens for the summary.
+
+summary_temperature
+Default: 0.1
+Description: The temperature for summary generation. Lower values produce more deterministic output.
+
 enable_tool_output_trimming
 Default: true
-Description: When enabled and `function_calling: "native"` is active, collapses oversized native tool outputs to a short placeholder while preserving the tool-call chain structure.
+Description: Enable trimming of large tool outputs (only works with native function calling).

 tool_trim_threshold_chars
 Default: 600
 Description: Trim native tool outputs when their total content length reaches this many characters.

-keep_first
-Default: 1
-Description: Always keep the first N messages of the conversation. Set to 0 to disable. The first message often contains important system prompts.
+show_token_usage_status
+Default: true
+Description: Show token usage status notification.

-keep_last
-Default: 6
-Description: Always keep the last N full messages of the conversation to ensure context coherence.
-
-summary_model
-Default: None
-Description: The LLM used to generate the summary.
-Recommendation:
-- It is strongly recommended to configure a fast, economical, and compatible model, such as `deepseek-v3`, `gemini-2.5-flash`, `gpt-4.1`.
-- If left empty, the filter will attempt to use the model from the current conversation.
-Note:
-- If the current conversation uses a pipeline (Pipe) model or a model that does not support standard generation APIs, leaving this field empty may cause summary generation to fail. In this case, you must specify a valid model.
-
-max_summary_tokens
-Default: 16384
-Description: The maximum number of tokens allowed for the generated summary.
-
-summary_temperature
-Default: 0.1
-Description: Controls the randomness of the summary generation. Lower values produce more deterministic output.
+token_usage_status_threshold
+Default: 80
+Description: Only show token usage status when usage exceeds this percentage (0-100). Set to 0 to always show.

 debug_mode
 Default: false
-Description: Prints detailed debug information to the log. Recommended to set to `false` in production.
+Description: Enable detailed logging for debugging. Recommended to set to `false` in production.

 show_debug_log
 Default: false
-Description: Print debug logs to browser console (F12). Useful for frontend debugging.
+Description: Show debug logs in the frontend console (F12). Useful for frontend debugging.

 ═══════════════════════════════════════════════════════════════════════════════
 🔧 Deployment
-═══════════════════════════════════════════════════════
+═══════════════════════════════════════════════════════════════════════════════

 The plugin automatically uses Open WebUI's shared database connection.
 No additional database configuration is required.

 Suggested Filter Installation Order:
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-It is recommended to set the priority of this filter relatively high (a smaller number) to ensure it runs before other filters that might modify message content. A typical order might be:
+It is recommended to set the priority of this filter relatively high (a smaller
+number) to ensure it runs before other filters that might modify message content.
+A typical order might be:

 1. Filters that need access to the full, uncompressed history (priority < 10)
-   (e.g., a filter that injects a system-level prompt)
+   (e.g., a filter that injects a system-level prompt like live context)
 2. This compression filter (priority = 10)
 3. Filters that run after compression (priority > 10)
    (e.g., a final output formatting filter)
@@ -216,7 +253,8 @@ Statistics:
 ✓ The `chat_summary` table will be created automatically on first run.

 2. Retention Policy
-⚠ The `keep_first` setting is crucial for preserving initial messages that contain system prompts. Configure it as needed.
+⚠ `keep_first` counts only non-system messages. System messages are always
+  preserved regardless of this setting.

 3. Performance
 ⚠ Summary generation is asynchronous and will not block the user response.
@@ -252,7 +290,8 @@ Solution:

 Problem: Initial system prompt is lost
 Solution:
-- Ensure `keep_first` is set to a value greater than 0 to preserve the initial messages containing this information.
+- System messages are always preserved. If a system prompt is missing, check
+  whether another filter is modifying or removing it.

 Problem: Compression effect is not significant
 Solution:
@@ -1249,13 +1288,25 @@ class Filter:
         return groups

     def _get_effective_keep_first(self, messages: List[Dict]) -> int:
-        """Protect configured head messages and all leading system messages."""
-        last_system_index = -1
+        """
+        Calculate the index to protect the first N NON-SYSTEM messages.
+        All system messages encountered before reaching the Nth non-system message are also kept.
+        """
+        if not messages:
+            return 0
+
+        non_system_count = 0
+        target_index = 0
+
         for i, msg in enumerate(messages):
-            if msg.get("role") == "system":
-                last_system_index = i
-
-        return max(self.valves.keep_first, last_system_index + 1)
+            if msg.get("role") != "system":
+                non_system_count += 1
+
+            target_index = i + 1
+            if non_system_count >= self.valves.keep_first:
+                break
+
+        return target_index
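The behavioral change is easiest to see with the old and new helpers transcribed as standalone functions (`self.valves.keep_first` replaced by a plain argument; otherwise the logic follows the hunk above):

```python
from typing import Dict, List


def old_effective_keep_first(messages: List[Dict], keep_first: int) -> int:
    """Pre-1.6.0: extend protection through the last system message."""
    last_system_index = -1
    for i, msg in enumerate(messages):
        if msg.get("role") == "system":
            last_system_index = i
    return max(keep_first, last_system_index + 1)


def new_effective_keep_first(messages: List[Dict], keep_first: int) -> int:
    """1.6.0: count only non-system messages toward keep_first."""
    if not messages:
        return 0
    non_system_count = 0
    target_index = 0
    for i, msg in enumerate(messages):
        if msg.get("role") != "system":
            non_system_count += 1
        target_index = i + 1
        if non_system_count >= keep_first:
            break
    return target_index


msgs = [{"role": "system"}, {"role": "system"}, {"role": "user"}]
# Old: keep_first=1 is already satisfied by the two system prompts (cut
# index 2), leaving the first user message unprotected.
# New: the cut index advances to 3, protecting the user message as well.
print(old_effective_keep_first(msgs, 1), new_effective_keep_first(msgs, 1))
```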

     def _align_tail_start_to_atomic_boundary(
         self, messages: List[Dict], raw_start_index: int, protected_prefix: int
@@ -1475,9 +1526,9 @@ class Filter:
         )

         keep_first: int = Field(
-            default=1,
+            default=0,
             ge=0,
-            description="Always keep the first N messages. Set to 0 to disable.",
+            description="Keep the first N non-system messages plus all interleaved system messages. Set to 0 to disable.",
         )
         keep_last: int = Field(
             default=6, ge=0, description="Always keep the last N full messages."
@@ -3047,6 +3098,14 @@ class Filter:
             messages, raw_start_index, effective_keep_first
         )

+        # --- Extract Preserved System Messages from the Gap ---
+        # Any system message in the gap (messages[effective_keep_first:start_index])
+        # must be preserved according to policy.
+        gap_messages = messages[effective_keep_first:start_index]
+        preserved_system_messages = [
+            msg for msg in gap_messages if isinstance(msg, dict) and msg.get("role") == "system"
+        ]
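The gap-extraction step amounts to a filter over a slice. A runnable sketch with made-up indices (the real `effective_keep_first` and `start_index` come from the surrounding method):

```python
messages = [
    {"role": "user", "content": "kept head"},
    {"role": "system", "content": "late-injected instructions"},
    {"role": "user", "content": "old question"},
    {"role": "assistant", "content": "old answer"},
    {"role": "user", "content": "recent question"},
]
effective_keep_first, start_index = 1, 4  # head = [:1], tail = [4:]

gap_messages = messages[effective_keep_first:start_index]
preserved_system_messages = [
    msg for msg in gap_messages
    if isinstance(msg, dict) and msg.get("role") == "system"
]
# Only the injected system message survives the gap verbatim; the old
# question/answer pair is left for the summarizer.
print([m["content"] for m in preserved_system_messages])
# -> ['late-injected instructions']
```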

         # 3. Summary message (Inserted as Assistant message)
         external_refs = body.pop("__external_references__", None)
         summary_msg = self._build_summary_message(
@@ -3105,7 +3164,7 @@ class Filter:
         # --- Preflight Check & Budgeting (Simplified) ---

         # Assemble candidate messages (for output)
-        candidate_messages = head_messages + [summary_msg] + tail_messages
+        candidate_messages = head_messages + [summary_msg] + preserved_system_messages + tail_messages

         # Prepare messages for token calculation (include system prompt if missing)
         calc_messages = candidate_messages
@@ -3189,7 +3248,7 @@ class Filter:
             )

             # Re-assemble
-            candidate_messages = head_messages + [summary_msg] + tail_messages
+            candidate_messages = head_messages + [summary_msg] + preserved_system_messages + tail_messages

             await self._log(
                 "[Inlet] ✂️ Sent-context history reduced\n"
@@ -3209,6 +3268,7 @@ class Filter:
             )
             head_tokens = self._estimate_messages_tokens(head_messages)
             summary_tokens = self._estimate_content_tokens(summary_content)
+            preserved_system_tokens = self._estimate_messages_tokens(preserved_system_messages)
             tail_tokens = self._estimate_messages_tokens(tail_messages)
         else:
             system_tokens = (
@@ -3218,14 +3278,15 @@ class Filter:
             )
             head_tokens = self._calculate_messages_tokens(head_messages)
             summary_tokens = self._count_tokens(summary_content)
+            preserved_system_tokens = self._calculate_messages_tokens(preserved_system_messages)
             tail_tokens = self._calculate_messages_tokens(tail_messages)

         system_info = (
-            f"System({system_tokens}t)" if system_prompt_msg else "System(0t)"
+            f"System({system_tokens + preserved_system_tokens}t)" if (system_prompt_msg or preserved_system_messages) else "System(0t)"
         )

         total_section_tokens = (
-            system_tokens + head_tokens + summary_tokens + tail_tokens
+            system_tokens + head_tokens + summary_tokens + preserved_system_tokens + tail_tokens
         )

         await self._log(
@@ -3371,15 +3432,33 @@ class Filter:
         # Use atomic grouping to preserve tool-calling integrity
         trimmable = candidate_messages[effective_keep_first:]
         atomic_groups = self._get_atomic_groups(trimmable)

+        # To follow policy "system messages never lost", we maintain a list of
+        # system messages that were part of dropped groups.
+        dropped_but_preserved_systems = []
+
         while total_tokens > max_context_tokens and len(atomic_groups) > 1:
             dropped_group_indices = atomic_groups.pop(0)
             dropped_tokens = 0
             for _ in range(len(dropped_group_indices)):
                 dropped = trimmable.pop(0)

+                # Absolute protections:
+                # 1. External references (often large and specialized)
+                # 2. System messages (instructions)
                 if self._is_external_reference_message(dropped):
                     trimmable.insert(0, dropped)
+                    # Stop dropping this group if we hit a protected message
+                    # (Though groups should be pure, this is a safety net)
                     break

+                if isinstance(dropped, dict) and dropped.get("role") == "system":
+                    dropped_but_preserved_systems.append(dropped)
+                    # Even if preserved, it counts as "dropped" from the trimmable flow
+                    # to avoid infinite loop, but its tokens remain in the budget.
+                    # We don't subtract its tokens here.
+                    continue

                 if total_tokens == estimated_tokens:
                     dropped_tokens += self._estimate_content_tokens(
                         dropped.get("content", "")
@@ -3390,8 +3469,9 @@ class Filter:
             )
             total_tokens -= dropped_tokens

+        # Re-assemble: [Head] + [Preserved Systems from Dropped Groups] + [Remaining Trimmable/Tail]
         candidate_messages = (
-            candidate_messages[:effective_keep_first] + trimmable
+            candidate_messages[:effective_keep_first] + dropped_but_preserved_systems + trimmable
         )
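A minimal model of the trimming safeguard above, simplified under two stated assumptions: the budget is counted in messages rather than tokens, and atomic groups are reduced to single messages:

```python
def trim_to_budget(messages, max_messages):
    """Drop oldest messages until within budget, never losing a system message."""
    trimmable = list(messages)
    preserved_systems = []
    while len(preserved_systems) + len(trimmable) > max_messages and len(trimmable) > 1:
        dropped = trimmable.pop(0)
        if dropped.get("role") == "system":
            # Preserved: re-inserted ahead of the surviving messages below.
            preserved_systems.append(dropped)
    return preserved_systems + trimmable

msgs = [
    {"role": "user", "content": "old 1"},
    {"role": "user", "content": "old 2"},
    {"role": "system", "content": "instructions"},
    {"role": "user", "content": "old 3"},
    {"role": "user", "content": "recent 1"},
    {"role": "user", "content": "recent 2"},
]
result = trim_to_budget(msgs, max_messages=3)
# -> [instructions, recent 1, recent 2]: the system message outlives
#    older user messages that entered the history around it.
```

As in the real code, a preserved system message still counts against the budget, so trimming continues until enough non-system messages have been dropped.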

         await self._log(