feat(async-context-compression): release v1.4.0 with structure-aware grouping and session locking

- Introduced Atomic Message Grouping to prevent tool-calling corruption (Issue #56)
- Implemented Tail Boundary Alignment for deterministic context truncation
- Added per-chat asynchronous session locking to prevent duplicate background tasks
- Enhanced summarization traceability with message IDs and names
- Synchronized version and changelog across all documentation files
- Optimized release-prep skill to remove redundant H1 titles

Closes #56
Author: fujie
Date: 2026-03-09 20:31:25 +08:00
Parent: 2eee7c5d35
Commit: f061d82409
27 changed files with 3540 additions and 286 deletions


@@ -0,0 +1,354 @@
# ✨ Async Context Compression Local Deployment Tools — Complete File Inventory
## 📦 New Files Overview
Local deployment support added for the async_context_compression Filter plugin includes:
```
openwebui-extensions/
├── scripts/
│   ├── ✨ deploy_async_context_compression.py  (new) dedicated deploy script [~70 lines]
│   ├── ✨ deploy_filter.py                     (new) general-purpose Filter deploy tool [~300 lines]
│   ├── ✨ DEPLOYMENT_GUIDE.md                  (new) full deployment guide
│   ├── ✨ DEPLOYMENT_SUMMARY.md                (new) technical architecture summary
│   ├── ✨ QUICK_START.md                       (new) quick-reference card
│   ├── ✨ README.md                            (new) script usage notes [this file]
│   └── deploy_pipe.py                          (existing) Pipe deploy tool
└── tests/
    └── scripts/
        └── ✨ test_deploy_filter.py            (new) unit tests [10 tests ✅]
```
## 🎯 Quick Start
### The simplest way — one command
```bash
cd scripts && python deploy_async_context_compression.py
```
**✅ Result**:
- The async_context_compression Filter is deployed to your local OpenWebUI
- No OpenWebUI restart required; it takes effect immediately
- Deployment status and next steps are printed
### Suggested first run
```bash
# 1. Enter the scripts directory
cd scripts
# 2. List all available deploy scripts
ls -la deploy_*.py
# 3. Read the quick-start guide
cat QUICK_START.md
# 4. Deploy async_context_compression
python deploy_async_context_compression.py
```
## 📚 File Details
### 1. `deploy_async_context_compression.py` ⭐ Recommended
**The fastest way to deploy!**
```bash
python deploy_async_context_compression.py
```
**Highlights**:
- Tailored to async_context_compression
- Deploys with a single command
- Clear success/failure messages
- Prints the follow-up configuration steps
**Code**: ~70 lines, simple and readable
---
### 2. `deploy_filter.py` — General-purpose tool
Can deploy **any Filter plugin**
```bash
# Deploy async_context_compression (default)
python deploy_filter.py
# Deploy other Filters
python deploy_filter.py folder-memory
python deploy_filter.py context_enhancement_filter
# List all available Filters
python deploy_filter.py --list
```
**Highlights**:
- Generic Filter deployment framework
- Automatic metadata extraction
- Supports multiple plugins
- Robust error handling
**Code**: ~300 lines, full-featured
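The automatic metadata extraction mentioned above can be sketched as follows. This is a hypothetical illustration, not the actual `deploy_filter.py` implementation: it assumes the plugin file starts with a frontmatter-style header of `key: value` lines (such as `version:` and `author:`, as OpenWebUI plugin headers do), and `extract_plugin_metadata` is an invented helper name.

```python
import re

def extract_plugin_metadata(source: str) -> dict:
    """Parse 'key: value' pairs from a plugin's frontmatter-style header."""
    meta = {}
    # Frontmatter lines look like: 'version: 1.4.0' or 'author: Fu-Jie'
    for match in re.finditer(r"^(\w+):\s*(.+)$", source, re.MULTILINE):
        key, value = match.group(1), match.group(2).strip()
        if key not in meta:  # keep only the first occurrence of each key
            meta[key] = value
    return meta

header = """\
title: Async Context Compression
author: Fu-Jie
version: 1.4.0
license: MIT
"""
print(extract_plugin_metadata(header)["version"])  # → 1.4.0
```

The extracted fields would then be used to build the deployment payload sent to the OpenWebUI functions API.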
---
### 3. `QUICK_START.md` — Quick reference
A one-page cheat sheet covering:
- ⚡ 30-second quick start
- 📋 Common-command tables
- ❌ Troubleshooting at a glance
**Best for**: second and subsequent uses
---
### 4. `DEPLOYMENT_GUIDE.md` — Full guide
A detailed deployment guide covering:
- Prerequisite checks
- Step-by-step workflow
- How to obtain an API key
- In-depth troubleshooting
- CI/CD integration examples
**Best for**: first-time deployment or a deeper understanding
---
### 5. `DEPLOYMENT_SUMMARY.md` — Technical summary
Technical architecture and implementation details:
- How-it-works flow diagrams
- The metadata extraction mechanism
- API integration notes
- Security best practices
**Best for**: developers and anyone curious about the implementation
---
### 6. `test_deploy_filter.py` — Unit tests
Full test coverage:
```bash
pytest tests/scripts/test_deploy_filter.py -v
```
**What it tests**: 10 unit tests ✅
- Filter discovery
- Metadata extraction
- Payload construction
- Version handling
## 🚀 Three Usage Scenarios
### Scenario 1: Quick deployment (most common)
```bash
cd scripts
python deploy_async_context_compression.py
# Done! ✅
```
**Time**: 5 seconds
**Best for**: day-to-day development iteration
---
### Scenario 2: Deploy other Filters
```bash
cd scripts
python deploy_filter.py --list          # list all
python deploy_filter.py folder-memory   # deploy a specific one
```
**Time**: 5 seconds × N
**Best for**: managing multiple Filters
---
### Scenario 3: Full setup (first time)
```bash
cd scripts
# 1. Create the API key configuration
echo "api_key=sk-your-key" > .env
# 2. Verify the configuration
cat .env
# 3. Deploy
python deploy_async_context_compression.py
# 4. Check the result
curl http://localhost:3003/api/v1/functions
```
**Time**: 1 minute
**Best for**: initial setup
## 📋 File Access Guide
| I want to... | File | Command |
|--------------|------|---------|
| Deploy async_context_compression | deploy_async_context_compression.py | `python deploy_async_context_compression.py` |
| Skim the quick reference | QUICK_START.md | `cat QUICK_START.md` |
| Read the full guide | DEPLOYMENT_GUIDE.md | `cat DEPLOYMENT_GUIDE.md` |
| Dig into technical details | DEPLOYMENT_SUMMARY.md | `cat DEPLOYMENT_SUMMARY.md` |
| Run the tests | test_deploy_filter.py | `pytest tests/scripts/test_deploy_filter.py -v` |
| Deploy other Filters | deploy_filter.py | `python deploy_filter.py --list` |
## ✅ Verification Checklist
Make sure everything is in place:
```bash
# 1. Check that all deploy scripts exist
ls -la scripts/deploy*.py
# Expected: deploy_pipe.py, deploy_filter.py, deploy_async_context_compression.py
# 2. Check that all documentation exists
ls -la scripts/*.md
# Expected: DEPLOYMENT_GUIDE.md, DEPLOYMENT_SUMMARY.md, QUICK_START.md, README.md
# 3. Check that the tests exist
ls -la tests/scripts/test_deploy_filter.py
# 4. Run the tests once to verify
python -m pytest tests/scripts/test_deploy_filter.py -v
# Expected: 10 passed ✅
# 5. Try a deployment
cd scripts && python deploy_async_context_compression.py
```
## 🎓 Learning Paths
### Beginner path
```
1. Read this file (5 minutes)
2. Read QUICK_START.md (5 minutes)
3. Run the deploy script (5 minutes)
4. Test inside OpenWebUI (5 minutes)
```
### Developer path
```
1. Read this file
2. Read DEPLOYMENT_GUIDE.md
3. Read DEPLOYMENT_SUMMARY.md
4. Browse the source: deploy_filter.py
5. Run the tests: pytest tests/scripts/test_deploy_filter.py -v
```
## 🔧 FAQ
### Q: How do I update an already-deployed plugin?
```bash
# After editing the code
vim ../plugins/filters/async-context-compression/async_context_compression.py
# Redeploy (overwrites automatically)
python deploy_async_context_compression.py
```
### Q: Which Filters are supported?
```bash
python deploy_filter.py --list
```
### Q: How do I get an API key?
1. Open OpenWebUI
2. Click the user menu → Settings
3. Find the "API Keys" section
4. Copy the key into your `.env` file
### Q: What if the script fails?
1. Read the error message
2. Check the troubleshooting section of `QUICK_START.md`
3. Or consult the detailed notes in `DEPLOYMENT_GUIDE.md`
### Q: Is this safe?
✅ Yes
- The API key is stored in a local `.env` file
- `.env` is listed in `.gitignore`
- It is never committed to Git
- Keys can be rotated at any time
### Q: Can I use this in production?
✅ Yes
- For production, manage secrets through your CI/CD pipeline
- See the GitHub Actions example in `DEPLOYMENT_GUIDE.md`
## 🚦 Quick Status Check
```bash
# Confirm all deployment tools are ready
cd scripts
# List the scripts
ls -la deploy*.py
# List the docs
ls -la *.md | grep -i deploy
# Verify the tests pass
python -m pytest tests/scripts/test_deploy_filter.py -q
# Deploy
python deploy_async_context_compression.py
```
## 📞 Next Steps
1. **Try it now**: `cd scripts && python deploy_async_context_compression.py`
2. **Check the result**: open OpenWebUI → Settings → Filters → look for "Async Context Compression"
3. **Enable it**: turn the Filter on in a conversation and try out context compression
4. **Keep developing**: edit the code and repeat the deployment
## 📝 More Resources
- 🚀 Quick start: [QUICK_START.md](QUICK_START.md)
- 📖 Full guide: [DEPLOYMENT_GUIDE.md](DEPLOYMENT_GUIDE.md)
- 🏗️ Technical architecture: [DEPLOYMENT_SUMMARY.md](DEPLOYMENT_SUMMARY.md)
- 🧪 Test suite: [test_deploy_filter.py](../tests/scripts/test_deploy_filter.py)
---
## 📊 File Statistics
```
New Python scripts: 2 (deploy_filter.py, deploy_async_context_compression.py)
New documentation files: 4 (DEPLOYMENT_*.md, QUICK_START.md)
New test files: 1 (test_deploy_filter.py)
New lines of code: ~600
Test coverage: 10/10 unit tests passing ✅
```
---
**Created**: 2026-03-09
**Best for**: local development and fast iteration
**Maintainer**: Fu-Jie
**Project**: [openwebui-extensions](https://github.com/Fu-Jie/openwebui-extensions)


@@ -0,0 +1,189 @@
# Issue #56: Critical tool-calling corruption and multiple reliability issues
## Overview
This document consolidates all reported issues in the async-context-compression filter as described in [GitHub Issue #56](https://github.com/Fu-Jie/openwebui-extensions/issues/56).
---
## Issue List
### 1. 🔴 CRITICAL: Native tool-calling history can be corrupted
**Severity**: Critical
**Impact**: Conversation integrity
#### Description
The compression logic removes individual messages without preserving native tool-calling structures as atomic units. This can break the relationship between assistant `tool_calls` and their corresponding `tool` result messages.
#### Symptom
```
No tool call found for function call output with call_id ...
```
#### Root Cause
- Assistant messages containing `tool_calls` can be removed while their matching `tool` result messages remain
- This creates orphaned tool outputs that reference non-existent `tool_call_id`s
- The model/provider rejects the request because the `call_id` no longer matches any tool call in history
#### Expected Behavior
Compression must treat tool-calling blocks atomically:
- the `assistant(tool_calls)` message
- the corresponding `tool` result message(s)
- an optional assistant follow-up that consumes the tool results
These must never be split apart or partially removed.
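The invariant can be sketched as a grouping pass over the message list. This is a simplified illustration of the approach this commit takes in its `_get_atomic_groups` helper; `group_messages` is a stand-in name, and the message dicts are minimal examples rather than real OpenWebUI payloads:

```python
def group_messages(messages):
    """Group messages so tool-calling chains are kept or dropped together.

    An assistant message with tool_calls opens a block, subsequent 'tool'
    results join it, and an assistant follow-up closes it. Everything else
    stands alone.
    """
    groups, current = [], []
    for msg in messages:
        role = msg.get("role")
        if role == "assistant" and msg.get("tool_calls"):
            if current:
                groups.append(current)
            current = [msg]                      # opens a tool-calling block
        elif role == "tool" and current:
            current.append(msg)                  # result joins the open block
        elif role == "assistant" and current and current[-1].get("role") == "tool":
            current.append(msg)                  # follow-up consuming results
            groups.append(current)
            current = []
        else:
            if current:
                groups.append(current)
                current = []
            groups.append([msg])                 # ordinary standalone message
    if current:
        groups.append(current)
    return groups

history = [
    {"role": "user", "content": "weather?"},
    {"role": "assistant", "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "sunny"},
    {"role": "assistant", "content": "It is sunny."},
]
# Compression then drops whole groups, never single messages, so call_1
# can never be orphaned.
print([len(g) for g in group_messages(history)])  # → [1, 3]
```

Because the second group carries the `assistant(tool_calls)` message, its `tool` result, and the follow-up as one unit, removing it leaves no dangling `tool_call_id` reference.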
---
### 2. 🟠 HIGH: Compression progress mixes original-history and compressed-view semantics
**Severity**: High
**Impact**: Summary advancement consistency
#### Description
The plugin stores `compressed_message_count` as progress over the original conversation history, but later recalculates it from the already-compressed conversation view. This mixes two different coordinate systems for the same field.
#### Problem
- Original-history progress (before compression)
- Compressed-view progress (after compression)
These two meanings are inconsistent, causing:
- Summary advancement to become inconsistent
- Summary progress to stall after summaries already exist
- Later updates to be measured in a different coordinate system than stored values
#### Expected Behavior
Progress tracking must use a single, consistent coordinate system throughout the lifetime of the conversation.
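A minimal sketch of that single-coordinate-system rule, under the assumption that progress is always expressed as an index into the original history (the `SummaryProgress` class and its method names are illustrative, not the plugin's actual API):

```python
class SummaryProgress:
    """Track summarization progress in ORIGINAL-history coordinates only."""

    def __init__(self):
        self.compressed_message_count = 0  # index into the original history

    def advance(self, original_history, newly_summarized):
        # Correct: advance in the same coordinate system the field stores.
        self.compressed_message_count += newly_summarized
        assert self.compressed_message_count <= len(original_history)

    def tail(self, original_history):
        # The retained tail is always sliced from the ORIGINAL history,
        # never recomputed from a compressed view whose indices have shifted.
        return original_history[self.compressed_message_count:]

history = [{"role": "user", "content": str(i)} for i in range(10)]
p = SummaryProgress()
p.advance(history, 4)        # summarized the first 4 original messages
print(len(p.tail(history)))  # → 6
```

The bug described above is the equivalent of calling `tail()` against a compressed list while `compressed_message_count` was stored against the original one, so the two measurements drift apart.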
---
### 3. 🟡 MEDIUM: Async summary generation has no per-chat lock
**Severity**: Medium
**Impact**: Token usage, race conditions
#### Description
Each response can launch a new background summary task for the same chat, even if one is already in progress.
#### Problems
- Duplicate summary work
- Increased token usage
- Race conditions in saved summary state
- Potential data consistency issues
#### Expected Behavior
Use per-chat locking to ensure only one summary task runs per chat at a time.
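The per-chat locking pattern (which this commit implements via a `_get_chat_lock` helper) can be sketched like this — `SummaryScheduler` and the 10 ms sleep are illustrative stand-ins for the real filter and its summary call:

```python
import asyncio

class SummaryScheduler:
    """Allow at most one background summary task per chat_id."""

    def __init__(self):
        self._chat_locks: dict[str, asyncio.Lock] = {}
        self.runs = 0

    def _get_chat_lock(self, chat_id: str) -> asyncio.Lock:
        # Lazily create one lock per chat.
        if chat_id not in self._chat_locks:
            self._chat_locks[chat_id] = asyncio.Lock()
        return self._chat_locks[chat_id]

    async def summarize(self, chat_id: str):
        lock = self._get_chat_lock(chat_id)
        if lock.locked():
            return "skipped"           # a summary for this chat is already running
        async with lock:
            await asyncio.sleep(0.01)  # stand-in for the real summary call
            self.runs += 1
            return "done"

async def main():
    s = SummaryScheduler()
    # Three "responses" race to start a summary for the same chat;
    # only the first actually runs.
    results = await asyncio.gather(*(s.summarize("chat-1") for _ in range(3)))
    print(sorted(results), s.runs)

asyncio.run(main())
```

Checking `lock.locked()` before acquiring turns the duplicate launches into cheap no-ops instead of queueing them, which is what prevents the wasted token spend.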
---
### 4. 🟡 MEDIUM: Native tool-output trimming is too aggressive
**Severity**: Medium
**Impact**: Content accuracy in technical conversations
#### Description
The tool-output trimming heuristics can rewrite or trim normal assistant messages if they contain patterns such as:
- Code fences (triple backticks)
- `Arguments:` text
- `<tool_code>` tags
#### Problem
This is risky in technical conversations and may alter valid assistant content unintentionally.
#### Expected Behavior
Trimming logic should be more conservative and avoid modifying assistant messages that are not actually tool-output summaries.
---
### 5. 🟡 MEDIUM: `max_context_tokens = 0` has inconsistent semantics
**Severity**: Medium
**Impact**: Determinism, configuration clarity
#### Description
The setting `max_context_tokens = 0` behaves inconsistently across different code paths:
- In some paths: behaves like "no threshold" (special mode, no compression)
- In other paths: still triggers reduction/truncation logic
#### Problem
Non-deterministic behavior makes the setting unpredictable and confusing for users.
#### Expected Behavior
- Define clear semantics for `max_context_tokens = 0`
- Apply consistently across all code paths
- Document the intended behavior
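The fix in this commit settles on `0` meaning "no limit" on every path; the guard can be sketched as a single predicate (`should_reduce` is an illustrative helper name, not the plugin's actual function):

```python
def should_reduce(total_tokens: int, max_context_tokens: int) -> bool:
    """Single source of truth for the reduction decision.

    max_context_tokens <= 0 means "no limit": never reduce, on any path.
    """
    if max_context_tokens <= 0:
        return False
    return total_tokens > max_context_tokens

print(should_reduce(50_000, 0))       # → False (0 = unlimited)
print(should_reduce(50_000, 32_000))  # → True  (over budget)
print(should_reduce(10_000, 32_000))  # → False (within budget)
```

Routing every truncation decision through one predicate is what makes the setting deterministic: there is no second code path that can reinterpret `0`.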
---
### 6. 🔵 LOW: Corrupted Korean i18n string
**Severity**: Low
**Impact**: User experience for Korean speakers
#### Description
One translation string contains broken mixed-language text.
#### Expected Behavior
Clean up the Korean translation string to be properly formatted and grammatically correct.
---
## Related / Broader Context
**Note from issue reporter**: The critical bug is not limited to tool-calling fields alone. Because compression deletes or replaces whole message objects, it can also drop other per-message fields such as:
- Message-level `id`
- `metadata`
- `name`
- Similar per-message attributes
So the issue is broader than native tool-calling: any integration relying on per-message metadata may also be affected when messages are trimmed or replaced.
---
## Reproduction Steps
1. Start a chat with a model using native tool calling
2. Enable the async-context-compression filter
3. Send a conversation long enough to trigger compression / summary generation
4. Let the model perform multiple tool calls across several turns
5. Continue the same chat after the filter has already compressed part of the history
**Expected**: Chat continues normally
**Actual**: Chat can become desynchronized and fail with errors like `No tool call found for function call output with call_id ...`
**Control Test**:
- With filter disabled: failure does not occur
- With filter enabled: failure reproduces reliably
---
## Suggested Fix Direction
### High Priority (Blocks Issue #56)
1. **Preserve tool-calling atomicity**: Compress history in a way that never separates `assistant(tool_calls)` from its corresponding `tool` messages
2. **Unify progress tracking**: Use a single, consistent coordinate system for `compressed_message_count` throughout
3. **Add per-chat locking**: Ensure only one background summary task runs per chat at a time
### Medium Priority
4. **Conservative trimming**: Refine tool-output trimming heuristics to avoid altering valid assistant content
5. **Define `max_context_tokens = 0` semantics**: Make behavior consistent and predictable
6. **Fix i18n**: Clean up the corrupted Korean translation string
---
## Environment
- **Plugin**: async-context-compression
- **OpenWebUI Version**: 0.8.9
- **OS**: Ubuntu 24.04 LTS ARM64
- **Reported by**: @dhaern
- **Issue Date**: [Recently opened]
---
## References
- [GitHub Issue #56](https://github.com/Fu-Jie/openwebui-extensions/issues/56)
- Plugin: `plugins/filters/async-context-compression/async_context_compression.py`


@@ -0,0 +1,189 @@
# Issue #56: Critical tool-calling corruption and multiple reliability issues in async context compression
## Overview
This document consolidates all reported issues in the async context compression filter from [GitHub Issue #56](https://github.com/Fu-Jie/openwebui-extensions/issues/56).
---
## Issue List
### 1. 🔴 CRITICAL: Native tool-calling history can be corrupted
**Severity**: Critical
**Impact**: Conversation integrity
#### Description
The compression logic removes messages one by one instead of preserving native tool-calling structures as atomic units. This can break the relationship between assistant `tool_calls` and their corresponding `tool` result messages.
#### Symptom
```
No tool call found for function call output with call_id ...
```
#### Root Cause
- An assistant message containing `tool_calls` can be removed while its corresponding `tool` result messages remain
- This produces orphaned tool outputs that reference a non-existent `tool_call_id`
- The model/API provider rejects the request because the `call_id` no longer matches any tool call in the history
#### Expected Behavior
Compression must treat tool-calling blocks as atomic units:
- the `assistant(tool_calls)` message
- the corresponding `tool` result message(s)
- an optional assistant follow-up message that consumes the tool results
No part of these messages may be split apart or partially removed.
---
### 2. 🟠 HIGH: Compression progress conflates original-history and compressed-view semantics
**Severity**: High
**Impact**: Summary progress consistency
#### Description
The plugin stores `compressed_message_count` as progress over the original conversation history, but later recomputes it from the already-compressed conversation view. This conflates two different coordinate systems for the same field.
#### Problem
- original-history progress (before compression)
- compressed-view progress (after compression)
These two meanings are inconsistent, causing:
- summary progress to become inconsistent
- progress to stall once summaries already exist
- later updates to be measured in a different coordinate system than the stored value
#### Expected Behavior
Progress tracking must use a single, consistent coordinate system for the entire lifetime of the conversation.
---
### 3. 🟡 MEDIUM: Async summary generation has no per-chat lock
**Severity**: Medium
**Impact**: Token usage, race conditions
#### Description
Each response can launch a new background summary task for the same chat, even while one is already in progress.
#### Problems
- Duplicate summary work
- Increased token usage
- Race conditions in the saved summary state
- Data consistency issues
#### Expected Behavior
Use a per-chat lock so that at most one summary task runs per chat at a time.
---
### 4. 🟡 MEDIUM: Native tool-output trimming is too aggressive
**Severity**: Medium
**Impact**: Content accuracy in technical conversations
#### Description
The tool-output trimming heuristics can rewrite or trim ordinary assistant messages whenever they contain patterns such as:
- code fences (triple backticks)
- `Arguments:` text
- `<tool_code>` tags
#### Problem
This is risky in technical conversations and may unintentionally alter valid assistant content.
#### Expected Behavior
The trimming logic should be more conservative and avoid modifying assistant messages that are not actually tool-output summaries.
---
### 5. 🟡 MEDIUM: `max_context_tokens = 0` has inconsistent semantics
**Severity**: Medium
**Impact**: Determinism, configuration clarity
#### Description
Setting `max_context_tokens = 0` behaves differently across code paths:
- in some paths it acts as "no threshold" (special mode, no compression)
- in other paths it still triggers the reduction/truncation logic
#### Problem
The non-deterministic behavior makes the setting unpredictable and confusing.
#### Expected Behavior
- Define clear semantics for `max_context_tokens = 0`
- Apply them consistently across all code paths
- Document the intended behavior clearly
---
### 6. 🔵 LOW: Corrupted Korean i18n string
**Severity**: Low
**Impact**: User experience for Korean speakers
#### Description
One translation string contains broken mixed-language text.
#### Expected Behavior
Clean up the Korean translation string so it is properly formatted and grammatically correct.
---
## Related / Broader Context
**Note from the issue reporter**: The critical bug is not limited to tool-calling fields. Because compression deletes or replaces whole message objects, it can also drop other per-message fields such as:
- message-level `id`
- `metadata`
- `name`
- other per-message attributes
So the issue is broader than native tool calling: any integration relying on per-message metadata may also be affected when messages are trimmed or replaced.
---
## Reproduction Steps
1. Start a chat with a model that uses native tool calling
2. Enable the async context compression filter
3. Send a conversation long enough to trigger compression / summary generation
4. Let the model perform multiple tool calls across several turns
5. Continue the same chat after the filter has already compressed part of the history
**Expected**: the chat continues normally
**Actual**: the chat can become desynchronized and fail with errors like `No tool call found for function call output with call_id ...`
**Control test**:
- with the filter disabled: the failure does not occur
- with the filter enabled: the failure reproduces reliably
---
## Suggested Fix Direction
### High Priority (blocks Issue #56)
1. **Preserve tool-calling atomicity**: compress history so that `assistant(tool_calls)` is never separated from its corresponding `tool` messages
2. **Unify progress tracking**: use a single, consistent coordinate system for `compressed_message_count` throughout
3. **Add a per-chat lock**: ensure at most one background summary task runs per chat at a time
### Medium Priority
4. **Conservative trimming**: refine the tool-output trimming heuristics to avoid altering valid assistant content
5. **Define `max_context_tokens = 0` semantics**: make the behavior consistent and predictable
6. **Fix i18n**: clean up the corrupted Korean translation string
---
## Environment
- **Plugin**: async-context-compression
- **OpenWebUI version**: 0.8.9
- **OS**: Ubuntu 24.04 LTS ARM64
- **Reported by**: @dhaern
- **Issue date**: [Recently opened]
---
## References
- [GitHub Issue #56](https://github.com/Fu-Jie/openwebui-extensions/issues/56)
- Plugin: `plugins/filters/async-context-compression/async_context_compression.py`


@@ -1,16 +1,15 @@
# Async Context Compression Filter
**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.3.0 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT
**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.4.0 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT
This filter reduces token consumption in long conversations through intelligent summarization and message compression while keeping conversations coherent.
## What's new in 1.3.0
## What's new in 1.4.0
- **Internationalization (i18n)**: Complete localization of user-facing messages across 9 languages (English, Chinese, Japanese, Korean, French, German, Spanish, Italian).
- **Smart Status Display**: Added `token_usage_status_threshold` valve (default 80%) to intelligently control when token usage status is shown.
- **Improved Performance**: Frontend language detection and logging are optimized to be completely non-blocking, maintaining lightning-fast TTFB.
- **Copilot SDK Integration**: Automatically detects and skips compression for copilot_sdk based models to prevent conflicts.
- **Configuration**: `debug_mode` is now set to `false` by default for a quieter production experience.
- **Atomic Message Grouping**: Introduced structure-aware grouping for `assistant-tool-tool-assistant` chains to prevent "No tool call found" errors.
- **Tail Boundary Alignment**: Implemented automatic correction for truncation points to ensure they don't fall inside a tool-calling sequence.
- **Chat Session Locking**: Added a session-based lock to prevent multiple concurrent summary tasks for the same chat ID.
- **Enhanced Traceability**: Improved summary formatting to include message IDs, names, and metadata for better context tracking.
---


@@ -1,18 +1,17 @@
# Async Context Compression Filter
**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.3.0 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT
**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.4.0 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT
> **Important**: To keep every filter maintainable and easy to use, each filter should ship with clear, complete documentation covering its features, configuration, and usage.
This filter significantly reduces token consumption in long conversations through intelligent summarization and message compression, while keeping the conversation coherent.
## What's new in 1.3.0
## What's new in 1.4.0
- **Internationalization (i18n)**: All user-facing messages are localized, with native support for 9 languages (including Chinese, English, Japanese, Korean, and major European languages)
- **Smart status display**: New `token_usage_status_threshold` valve (default 80%) intelligently controls when token-usage status is shown, reducing unnecessary noise
- **Major performance optimization**: Frontend language detection and log handling were refactored to be fully non-blocking, leaving time-to-first-byte (TTFB) completely unaffected
- **Copilot SDK compatibility**: Automatically detects and skips context compression for `copilot_sdk`-based models to avoid conflicts
- **Configuration change**: `debug_mode` now defaults to `false` for a quieter production experience
- **Atomic message grouping**: Structure-aware grouping keeps tool-calling chains intact as a unit, eliminating "No tool call found" errors
- **Automatic tail boundary alignment**: Truncation points are corrected automatically so the retained history never starts in the middle of a tool-calling sequence
- **Session-level async lock**: A per-`chat_id` background-task lock prevents concurrent summary tasks for the same chat
- **Enhanced metadata traceability**: The summary input format now preserves message IDs, participant names, and key metadata for better context traceability
---


@@ -5,7 +5,7 @@ author: Fu-Jie
author_url: https://github.com/Fu-Jie/openwebui-extensions
funding_url: https://github.com/open-webui
description: Reduces token consumption in long conversations while maintaining coherence through intelligent summarization and message compression.
version: 1.3.0
version: 1.4.0
openwebui_id: b1655bc8-6de9-4cad-8cb5-a6f7829a02ce
license: MIT
@@ -460,7 +460,7 @@ TRANSLATIONS = {
"status_context_summary_updated": "컨텍스트 요약 업데이트됨: {tokens} / {max_tokens} 토큰 ({ratio}%)",
"status_generating_summary": "백그라운드에서 컨텍스트 요약 생성 중...",
"status_summary_error": "요약 오류: {error}",
"summary_prompt_prefix": "【이전 요약: 다음은 이전 대화의 요약이며 문맥 참고용으로만 제공됩니다. 요약 내용 자체에 답하지 말고 последу의 최신 질문에 직접 답하세요.】\n\n",
"summary_prompt_prefix": "【이전 요약: 다음은 이전 대화의 요약이며 문맥 참고용으로만 제공됩니다. 요약 내용 자체에 답하지 말고 최신 질문에 직접 답하세요.】\n\n",
"summary_prompt_suffix": "\n\n---\n다음은 최근 대화입니다:",
"tool_trimmed": "... [도구 출력 잘림]\n{content}",
"content_collapsed": "\n... [내용 접힘] ...\n",
@@ -566,6 +566,8 @@ class Filter:
"de-AT": "de-DE",
}
# Concurrency control: Lock per chat session
self._chat_locks = {}
self._init_database()
def _resolve_language(self, lang: str) -> str:
@@ -604,6 +606,104 @@ class Filter:
logger.warning(f"Translation formatting failed for {key}: {e}")
return text
def _get_chat_lock(self, chat_id: str) -> asyncio.Lock:
"""Get or create an asyncio lock for a specific chat ID."""
if chat_id not in self._chat_locks:
self._chat_locks[chat_id] = asyncio.Lock()
return self._chat_locks[chat_id]
def _get_atomic_groups(self, messages: List[Dict]) -> List[List[int]]:
"""
Groups message indices into atomic units that must be kept or dropped together.
Specifically handles native tool-calling sequences:
- assistant(tool_calls)
- tool(s)
- assistant(final response)
"""
groups = []
current_group = []
for i, msg in enumerate(messages):
role = msg.get("role")
has_tool_calls = bool(msg.get("tool_calls"))
# Logic:
# 1. If assistant message has tool_calls, it starts a potential block.
# 2. If message is 'tool' role, it MUST belong to the preceding assistant group.
# 3. If message is 'assistant' and follows a 'tool' group, it's the final answer.
if role == "assistant" and has_tool_calls:
# Close previous group if any
if current_group:
groups.append(current_group)
current_group = [i]
elif role == "tool":
# Force tool results into the current group
if not current_group:
# An orphaned tool result? Group it alone but warn
groups.append([i])
else:
current_group.append(i)
elif (
role == "assistant"
and current_group
and messages[current_group[-1]].get("role") == "tool"
):
# This is likely the assistant follow-up consuming tool results
current_group.append(i)
groups.append(current_group)
current_group = []
else:
# Regular message (user, or assistant without tool calls)
if current_group:
groups.append(current_group)
current_group = []
groups.append([i])
if current_group:
groups.append(current_group)
return groups
def _get_effective_keep_first(self, messages: List[Dict]) -> int:
"""Protect configured head messages and all leading system messages."""
last_system_index = -1
for i, msg in enumerate(messages):
if msg.get("role") == "system":
last_system_index = i
return max(self.valves.keep_first, last_system_index + 1)
def _align_tail_start_to_atomic_boundary(
self, messages: List[Dict], raw_start_index: int, protected_prefix: int
) -> int:
"""
Align the retained tail to an atomic-group boundary.
If the raw tail start falls in the middle of an assistant/tool/assistant
chain, move it backward to the start of that chain so the next request
never begins with an orphaned tool result or assistant follow-up.
"""
aligned_start = max(raw_start_index, protected_prefix)
if aligned_start <= protected_prefix or aligned_start >= len(messages):
return aligned_start
trimmable = messages[protected_prefix:]
local_start = aligned_start - protected_prefix
for group in self._get_atomic_groups(trimmable):
group_start = group[0]
group_end = group[-1] + 1
if local_start == group_start:
return aligned_start
if group_start < local_start < group_end:
return protected_prefix + group_start
return aligned_start
async def _get_user_context(
self,
__user__: Optional[Dict[str, Any]],
@@ -1218,87 +1318,6 @@ class Filter:
content = msg.get("content", "")
if not isinstance(content, str):
continue
role = msg.get("role")
# Only process assistant messages with native tool outputs
if role == "assistant":
# Detect tool output markers in assistant content
if "tool_call_id:" in content or (
content.startswith('"') and "\\&quot;" in content
):
# Always trim tool outputs when enabled
if self.valves.show_debug_log and __event_call__:
await self._log(
f"[Inlet] 🔍 Native tool output detected in assistant message.",
event_call=__event_call__,
)
# Strategy 1: Tool Output / Code Block Trimming
# Detect if message contains large tool outputs or code blocks
# Improved regex to be less brittle
is_tool_output = (
"&quot;" in content
or "Arguments:" in content
or "```" in content
or "<tool_code>" in content
)
if is_tool_output:
# Regex to find the last occurrence of a tool output block or code block
# This pattern looks for:
# 1. OpenWebUI's escaped JSON format: ""&quot;...&quot;""
# 2. "Arguments: {...}" pattern
# 3. Generic code blocks: ```...```
# 4. <tool_code>...</tool_code>
# It captures the content *after* the last such block.
tool_output_pattern = r'(?:""&quot;.*?&quot;""|Arguments:\s*\{[^}]+\}|```.*?```|<tool_code>.*?</tool_code>)\s*'
# Find all matches
matches = list(
re.finditer(tool_output_pattern, content, re.DOTALL)
)
if matches:
# Get the end position of the last match
last_match_end = matches[-1].end()
# Everything after the last tool output is the final answer
final_answer = content[last_match_end:].strip()
if final_answer:
msg["content"] = self._get_translation(
(
__user__.get("language", "en-US")
if __user__
else "en-US"
),
"tool_trimmed",
content=final_answer,
)
trimmed_count += 1
else:
# Fallback: If no specific pattern matched, but it was identified as tool output,
# try a simpler split or just mark as trimmed if no final answer can be extracted.
# (Preserving backward compatibility or different model behaviors)
parts = re.split(
r"(?:Arguments:\s*\{[^}]+\})\n+", content
)
if len(parts) > 1:
final_answer = parts[-1].strip()
if final_answer:
msg["content"] = self._get_translation(
(
__user__.get("language", "en-US")
if __user__
else "en-US"
),
"tool_trimmed",
content=final_answer,
)
trimmed_count += 1
if trimmed_count > 0 and self.valves.show_debug_log and __event_call__:
await self._log(
f"[Inlet] ✂️ Trimmed {trimmed_count} tool output message(s).",
@@ -1500,12 +1519,7 @@ class Filter:
summary_record = await asyncio.to_thread(self._load_summary_record, chat_id)
# Calculate effective_keep_first to ensure all system messages are protected
last_system_index = -1
for i, msg in enumerate(messages):
if msg.get("role") == "system":
last_system_index = i
effective_keep_first = max(self.valves.keep_first, last_system_index + 1)
effective_keep_first = self._get_effective_keep_first(messages)
final_messages = []
@@ -1531,9 +1545,13 @@ class Filter:
)
summary_msg = {"role": "assistant", "content": summary_content}
# 3. Tail messages (Tail) - All messages starting from the last compression point
# Note: Must ensure head messages are not duplicated
start_index = max(compressed_count, effective_keep_first)
# 3. Tail messages (Tail) - All messages starting from the last compression point.
# Align legacy/raw progress to an atomic boundary so old summary rows do not
# reintroduce orphaned tool messages into the retained tail.
raw_start_index = max(compressed_count, effective_keep_first)
start_index = self._align_tail_start_to_atomic_boundary(
messages, raw_start_index, effective_keep_first
)
tail_messages = messages[start_index:]
if self.valves.show_debug_log and __event_call__:
@@ -1570,7 +1588,14 @@ class Filter:
estimated_tokens = self._estimate_messages_tokens(calc_messages)
# Since this is a hard limit check, only skip precise calculation if we are far below it (margin of 15%)
if estimated_tokens < max_context_tokens * 0.85:
# max_context_tokens == 0 means "no limit", skip reduction entirely
if max_context_tokens <= 0:
total_tokens = estimated_tokens
await self._log(
f"[Inlet] 🔎 No max_context_tokens limit set (0). Skipping reduction. Est: {total_tokens}t",
event_call=__event_call__,
)
elif estimated_tokens < max_context_tokens * 0.85:
total_tokens = estimated_tokens
await self._log(
f"[Inlet] 🔎 Fast Preflight Check (Est): {total_tokens}t / {max_context_tokens}t (Well within limit)",
@@ -1588,126 +1613,36 @@ class Filter:
event_call=__event_call__,
)
# If over budget, reduce history (Keep Last)
if total_tokens > max_context_tokens:
await self._log(
f"[Inlet] ⚠️ Candidate prompt ({total_tokens} Tokens) exceeds limit ({max_context_tokens}). Reducing history...",
log_type="warning",
event_call=__event_call__,
)
# Identify atomic groups to avoid breaking tool-calling context
atomic_groups = self._get_atomic_groups(tail_messages)
# Dynamically remove messages from the start of tail_messages
# Always try to keep at least the last message (usually user input)
while total_tokens > max_context_tokens and len(tail_messages) > 1:
# Strategy 1: Structure-Aware Assistant Trimming
# Retain: Headers (#), First Line, Last Line. Collapse the rest.
target_msg = None
target_idx = -1
while total_tokens > max_context_tokens and len(atomic_groups) > 1:
# Strategy 1: Structure-Aware Assistant Trimming (Optional, only for non-tool messages)
# For simplicity and reliability in this fix, we prioritize Group-Drop over partial trim
# if a group contains tool calls.
# Find the oldest assistant message that is long and not yet trimmed
for i, msg in enumerate(tail_messages):
# Skip the last message (usually user input, protect it)
if i == len(tail_messages) - 1:
break
# Strategy 2: Drop Oldest Atomic Group Entirely
dropped_group_indices = atomic_groups.pop(0)
# Note: indices in dropped_group_indices are relative to ORIGINAL tail_messages
# But since we are popping from tail_messages itself, we need to be careful.
if msg.get("role") == "assistant":
content = str(msg.get("content", ""))
is_trimmed = msg.get("metadata", {}).get(
"is_trimmed", False
)
# Only target messages that are reasonably long (> 200 chars)
if len(content) > 200 and not is_trimmed:
target_msg = msg
target_idx = i
break
# If found a suitable assistant message, apply structure-aware trimming
if target_msg:
content = str(target_msg.get("content", ""))
lines = content.split("\n")
kept_lines = []
# Logic: Keep headers, first non-empty line, last non-empty line
first_line_found = False
last_line_idx = -1
# Find last non-empty line index
for idx in range(len(lines) - 1, -1, -1):
if lines[idx].strip():
last_line_idx = idx
break
for idx, line in enumerate(lines):
stripped = line.strip()
if not stripped:
continue
# Keep headers (H1-H6, requires space after #)
if re.match(r"^#{1,6}\s+", stripped):
kept_lines.append(line)
continue
# Keep first non-empty line
if not first_line_found:
kept_lines.append(line)
first_line_found = True
# Add placeholder if there's more content coming
if idx < last_line_idx:
kept_lines.append(
self._get_translation(lang, "content_collapsed")
)
continue
# Keep last non-empty line
if idx == last_line_idx:
kept_lines.append(line)
continue
# Update message content
new_content = "\n".join(kept_lines)
# Safety check: If trimming didn't save much (e.g. mostly headers), force drop
if len(new_content) > len(content) * 0.8:
# Fallback to drop if structure preservation is too verbose
pass
# Extract and drop messages in this group from the actual list
# Since we always pop group 0, we pop len(dropped_group_indices) times from front
dropped_tokens = 0
for _ in range(len(dropped_group_indices)):
dropped = tail_messages.pop(0)
if total_tokens == estimated_tokens:
dropped_tokens += len(str(dropped.get("content", ""))) // 4
else:
target_msg["content"] = new_content
if "metadata" not in target_msg:
target_msg["metadata"] = {}
target_msg["metadata"]["is_trimmed"] = True
dropped_tokens += self._count_tokens(
str(dropped.get("content", ""))
)
# Calculate token reduction
# Use current token strategy
if total_tokens == estimated_tokens:
old_tokens = len(content) // 4
new_tokens = len(target_msg["content"]) // 4
else:
old_tokens = self._count_tokens(content)
new_tokens = self._count_tokens(target_msg["content"])
diff = old_tokens - new_tokens
total_tokens -= diff
if self.valves.show_debug_log and __event_call__:
await self._log(
f"[Inlet] 📉 Structure-trimmed Assistant message. Saved: {diff} tokens.",
event_call=__event_call__,
)
continue
# Strategy 2: Fallback - Drop Oldest Message Entirely (FIFO)
# (User requested to remove progressive trimming for other cases)
dropped = tail_messages.pop(0)
if total_tokens == estimated_tokens:
dropped_tokens = len(str(dropped.get("content", ""))) // 4
else:
dropped_tokens = self._count_tokens(
str(dropped.get("content", ""))
)
total_tokens -= dropped_tokens
if self.valves.show_debug_log and __event_call__:
await self._log(
f"[Inlet] 🗑️ Dropped message from history to fit context. Role: {dropped.get('role')}, Tokens: {dropped_tokens}",
f"[Inlet] 🗑️ Dropped atomic group ({len(dropped_group_indices)} msgs) to fit context. Tokens: {dropped_tokens}",
event_call=__event_call__,
)
@@ -1829,7 +1764,14 @@ class Filter:
estimated_tokens = self._estimate_messages_tokens(calc_messages)
# Only skip precise calculation if we are clearly below the limit
if estimated_tokens < max_context_tokens * 0.85:
# max_context_tokens == 0 means "no limit", skip reduction entirely
if max_context_tokens <= 0:
total_tokens = estimated_tokens
await self._log(
f"[Inlet] 🔎 No max_context_tokens limit set (0). Skipping reduction. Est: {total_tokens}t",
event_call=__event_call__,
)
elif estimated_tokens < max_context_tokens * 0.85:
total_tokens = estimated_tokens
await self._log(
f"[Inlet] 🔎 Fast limit check (Est): {total_tokens}t / {max_context_tokens}t",
@@ -1840,34 +1782,34 @@ class Filter:
self._calculate_messages_tokens, calc_messages
)
if total_tokens > max_context_tokens:
if total_tokens > max_context_tokens and max_context_tokens > 0:
await self._log(
f"[Inlet] ⚠️ Original messages ({total_tokens} Tokens) exceed limit ({max_context_tokens}). Reducing history...",
log_type="warning",
event_call=__event_call__,
)
# Dynamically remove messages from the start
# We'll respect effective_keep_first to protect system prompts
# Use atomic grouping to preserve tool-calling integrity
trimmable = final_messages[effective_keep_first:]
atomic_groups = self._get_atomic_groups(trimmable)
start_trim_index = effective_keep_first
while (
total_tokens > max_context_tokens
and len(final_messages)
> start_trim_index + 1 # Keep at least 1 message after keep_first
):
dropped = final_messages.pop(start_trim_index)
if total_tokens == estimated_tokens:
dropped_tokens = len(str(dropped.get("content", ""))) // 4
else:
dropped_tokens = self._count_tokens(
str(dropped.get("content", ""))
)
while total_tokens > max_context_tokens and len(atomic_groups) > 1:
dropped_group_indices = atomic_groups.pop(0)
dropped_tokens = 0
for _ in range(len(dropped_group_indices)):
dropped = trimmable.pop(0)
if total_tokens == estimated_tokens:
dropped_tokens += len(str(dropped.get("content", ""))) // 4
else:
dropped_tokens += self._count_tokens(
str(dropped.get("content", ""))
)
total_tokens -= dropped_tokens
final_messages = final_messages[:effective_keep_first] + trimmable
await self._log(
f"[Inlet] ✂️ Messages reduced. New total: {total_tokens} Tokens",
f"[Inlet] ✂️ Messages reduced (atomic). New total: {total_tokens} Tokens",
event_call=__event_call__,
)
@@ -1948,12 +1890,28 @@ class Filter:
model = body.get("model") or ""
messages = body.get("messages", [])
# Calculate target compression progress directly
target_compressed_count = max(0, len(messages) - self.valves.keep_last)
# Calculate target compression progress directly, then align it to an atomic
# boundary so the saved summary never cuts through a tool-calling block.
effective_keep_first = self._get_effective_keep_first(messages)
raw_target_compressed_count = max(0, len(messages) - self.valves.keep_last)
target_compressed_count = self._align_tail_start_to_atomic_boundary(
messages, raw_target_compressed_count, effective_keep_first
)
# Process Token calculation and summary generation asynchronously in the background
# Use a lock to prevent multiple concurrent summary tasks for the same chat
chat_lock = self._get_chat_lock(chat_id)
if chat_lock.locked():
if self.valves.debug_mode:
logger.info(
f"[Outlet] Skipping summary task for {chat_id}: Task already in progress"
)
return body
# Process Token calculation and summary generation asynchronously in the background (do not wait for completion, do not affect output)
asyncio.create_task(
self._check_and_generate_summary_async(
self._locked_summary_task(
chat_lock,
chat_id,
model,
body,
@@ -1967,6 +1925,31 @@ class Filter:
return body
async def _locked_summary_task(
self,
lock: asyncio.Lock,
chat_id: str,
model: str,
body: dict,
user_data: Optional[dict],
target_compressed_count: Optional[int],
lang: str,
__event_emitter__: Callable,
__event_call__: Callable,
):
"""Wrapper to run summary generation with an async lock."""
async with lock:
await self._check_and_generate_summary_async(
chat_id,
model,
body,
user_data,
target_compressed_count,
lang,
__event_emitter__,
__event_call__,
)
async def _check_and_generate_summary_async(
self,
chat_id: str,
@@ -2134,11 +2117,19 @@ class Filter:
event_call=__event_call__,
)
# 2. Determine the range of messages to compress (Middle)
start_index = self.valves.keep_first
end_index = len(messages) - self.valves.keep_last
if self.valves.keep_last == 0:
end_index = len(messages)
# 2. Determine the range of messages to compress (Middle).
# Use the same aligned boundary used for summary persistence so the tail
# always starts at an atomic-group boundary.
start_index = self._get_effective_keep_first(messages)
if target_compressed_count is None:
raw_end_index = max(0, len(messages) - self.valves.keep_last)
end_index = self._align_tail_start_to_atomic_boundary(
messages, raw_end_index, start_index
)
else:
end_index = self._align_tail_start_to_atomic_boundary(
messages, target_compressed_count, start_index
)
# Ensure indices are valid
if start_index >= end_index:
@@ -2204,7 +2195,12 @@ class Filter:
# Add buffer for prompt and output (approx 2000 tokens)
estimated_input_tokens = middle_tokens + 2000
if estimated_input_tokens > max_context_tokens:
if max_context_tokens <= 0:
await self._log(
"[🤖 Async Summary Task] No max_context_tokens limit set (0). Skipping middle-message truncation.",
event_call=__event_call__,
)
elif estimated_input_tokens > max_context_tokens:
excess_tokens = estimated_input_tokens - max_context_tokens
await self._log(
f"[🤖 Async Summary Task] ⚠️ Middle messages ({middle_tokens} Tokens) + Buffer exceed summary model limit ({max_context_tokens}), need to remove approx {excess_tokens} Tokens",
@@ -2212,20 +2208,24 @@ class Filter:
event_call=__event_call__,
)
# Remove from the head of middle_messages
# Remove from the head of middle_messages using atomic groups
# to avoid creating orphaned tool-call/tool-result pairs.
removed_tokens = 0
removed_count = 0
while removed_tokens < excess_tokens and middle_messages:
msg_to_remove = middle_messages.pop(0)
msg_tokens = self._count_tokens(
str(msg_to_remove.get("content", ""))
)
removed_tokens += msg_tokens
removed_count += 1
summary_atomic_groups = self._get_atomic_groups(middle_messages)
while removed_tokens < excess_tokens and len(summary_atomic_groups) > 1:
group_indices = summary_atomic_groups.pop(0)
for _ in range(len(group_indices)):
msg_to_remove = middle_messages.pop(0)
msg_tokens = self._count_tokens(
str(msg_to_remove.get("content", ""))
)
removed_tokens += msg_tokens
removed_count += 1
await self._log(
f"[🤖 Async Summary Task] Removed {removed_count} messages, totaling {removed_tokens} Tokens",
f"[🤖 Async Summary Task] Removed {removed_count} messages (atomic), totaling {removed_tokens} Tokens",
event_call=__event_call__,
)
@@ -2443,12 +2443,26 @@ class Filter:
logger.exception("[🤖 Async Summary Task] Unhandled exception")
def _format_messages_for_summary(self, messages: list) -> str:
"""Formats messages for summarization."""
"""
Formats messages for summarization with metadata awareness.
Preserves IDs, names, and key metadata fragments to ensure traceability.
"""
formatted = []
for i, msg in enumerate(messages, 1):
role = msg.get("role", "unknown")
content = msg.get("content", "")
# Extract Identity Metadata
msg_id = msg.get("id", "N/A")
msg_name = msg.get("name", "")
# Only pick non-system, interesting metadata keys
metadata = msg.get("metadata", {})
safe_meta = {
k: v
for k, v in metadata.items()
if k not in ["is_trimmed", "is_summary"]
}
# Handle multimodal content
if isinstance(content, list):
text_parts = []
@@ -2460,10 +2474,13 @@ class Filter:
# Handle role name
role_name = {"user": "User", "assistant": "Assistant"}.get(role, role)
# User requested to remove truncation to allow full context for summary
# unless it exceeds model limits (which is handled by the LLM call itself or max_tokens)
meta_str = f" [ID: {msg_id}]"
if msg_name:
meta_str += f" [Name: {msg_name}]"
if safe_meta:
meta_str += f" [Meta: {safe_meta}]"
formatted.append(f"[{i}] {role_name}: {content}")
formatted.append(f"[{i}] {role_name}{meta_str}: {content}")
return "\n\n".join(formatted)
@@ -2511,11 +2528,15 @@ This conversation may contain previous summaries (as system messages or text) an
* **Progress & Conclusions**: Completed steps and reached consensus.
* **Action Items/Next Steps**: Clear follow-up actions.
### Identity Traceability
The input dialogue contains message IDs (e.g., [ID: ...]) and optional names.
If a specific message contributes a critical decision, a unique code snippet, or a tool-calling result, please reference its ID or Name in your summary to maintain traceability.
---
{new_conversation_text}
---
Based on the content above, generate the summary:
Based on the content above, generate the summary (including key message identities where relevant):
"""
# Determine the model to use
model = self._clean_model_id(self.valves.summary_model) or self._clean_model_id(


@@ -0,0 +1,169 @@
# Async Context Compression: Core Failure Analysis and Fix Summary (Issue #56)
Report: <https://github.com/Fu-Jie/openwebui-extensions/issues/56>
## 1. Problem Analysis
### 1.1 Critical: Tool-Calling Structure Corruption
- **Root cause**: When compressing history, the plugin used a "message-aware" rather than "structure-aware" strategy. LLM `tool-calling` depends on strict pairing between an `assistant(tool_calls)` message and the `tool` message(s) that immediately follow it.
- **Consequence**: If compression summarized only the `tool_calls` while the corresponding `tool` results remained in context, a fatal `No tool call found` error was triggered.
### 1.2 High: Coordinate Drift Misaligned Summarization Progress
- **Root cause**: The plugin previously used `len(messages)` to compute summarization progress. Because the message list shrinks after summarization, old indices no longer mapped correctly back to the original history coordinates.
- **Consequence**: The summarization logic repeatedly reprocessed overlapping ranges as the conversation progressed, or stopped advancing under certain boundary conditions.
### 1.3 Medium: Concurrency Races and Metadata Loss
- **Concurrency**: The lack of a per-`chat_id` background task lock meant concurrent requests could trigger multiple LLM summary tasks.
- **Metadata**: Once messages were collapsed into a summary block, their original `id`, `name`, and extended `metadata` vanished entirely, breaking third-party integrations that rely on those fingerprints.
---
## 2. Fixes (Core Refactor)
### 2.1 Atomic Message Grouping
Implemented the `_get_atomic_groups` algorithm, which identifies and marks `assistant-tool-assistant` call chains, ensuring these groups are **kept whole or removed whole**.
The algorithm is applied on both truncation paths:
1. **The inlet stage** (both the with-summary and without-summary paths are covered)
2. **The background summary task in outlet**: when `middle_messages` exceed the summary model's context window and must be truncated, deletion likewise proceeds by atomic group, preventing orphaned tool results from reaching the LLM summarization input (2026-03-09 patch)
How it works:
- `_get_atomic_groups(messages)` scans the messages into a series of "indivisible units".
- When an `assistant` message carrying `tool_calls` is encountered, a new atomic group is opened.
- All subsequent `tool` messages are merged into that atomic group.
- If an assistant follow-up that consumes the tool results comes immediately after, it joins the same atomic group as well.
- With this in place, the trimming logic no longer deletes individual messages but whole groups.
This resolves the core of Issue #56:
- Before: `assistant(tool_calls)` could be deleted while its `tool` result stayed behind
- Now: a group is either kept in full or removed in full
In other words, the history context sent to the model never again contains an orphaned `tool_call_id`.
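The grouping rules above can be captured in a short standalone sketch (the real `_get_atomic_groups` is a method on the `Filter` class and may differ in detail; this mirrors the described behavior, returning index groups that the trimming loops can pop from the front):

```python
from typing import List


def get_atomic_groups(messages: List[dict]) -> List[List[int]]:
    """Partition messages into indivisible units so tool-calling
    chains are kept or dropped as a whole."""
    groups: List[List[int]] = []
    i = 0
    while i < len(messages):
        msg = messages[i]
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            group = [i]
            i += 1
            while i < len(messages):
                nxt = messages[i]
                if nxt.get("role") == "tool":
                    # Tool results belong to the open call chain
                    group.append(i)
                    i += 1
                elif nxt.get("role") == "assistant" and nxt.get("tool_calls"):
                    # Chained tool call: extend the same atomic unit
                    group.append(i)
                    i += 1
                elif nxt.get("role") == "assistant":
                    # Final answer consuming the tool output
                    group.append(i)
                    i += 1
                    break
                else:
                    break
            groups.append(group)
        else:
            # Plain message: a singleton group
            groups.append([i])
            i += 1
    return groups
```

Because groups are contiguous index runs, the trimming loops can drop `atomic_groups.pop(0)` and pop the same number of messages from the front of the trimmable list.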
### 2.1.1 Tail Boundary Alignment (Atomic Boundary Alignment)
Beyond group-wise deletion, a new `_align_tail_start_to_atomic_boundary` corrects the starting point of the retained tail.
The reason: even when `compressed_message_count` comes from legacy data or a raw count, using it directly as the `tail` start still corrupts the context if it happens to land in the middle of a tool-calling chain.
The fix proceeds as follows:
1. Compute the theoretical `raw_start_index`
2. Call `_align_tail_start_to_atomic_boundary(messages, raw_start_index, protected_prefix)`
3. If the start falls inside an atomic group, automatically roll back to that group's first position
4. Rebuild `tail_messages` from the corrected `start_index`
This logic is shared by:
- rebuilding the tail in `inlet` when a summary already exists
- computing `target_compressed_count` in `outlet`
- computing the `middle_messages` / `tail` split inside the background summary task
So the fix is not merely "delete by group": the boundary placement itself is now structure-aware.
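The steps above can be sketched as a standalone function (names and details are illustrative; the real method lives on the `Filter` class):

```python
from typing import List


def align_tail_start_to_atomic_boundary(
    messages: List[dict], raw_start: int, protected_prefix: int
) -> int:
    """Roll raw_start back to the start of the atomic group it lands
    in, so the retained tail never begins mid tool-calling chain."""
    i = protected_prefix
    while i < len(messages):
        start = i
        msg = messages[i]
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            i += 1
            # Absorb the tool results and the assistant follow-up
            while i < len(messages) and messages[i].get("role") == "tool":
                i += 1
            if i < len(messages) and messages[i].get("role") == "assistant":
                i += 1
        else:
            i += 1
        if start < raw_start < i:
            # raw_start would split this unit: back up to its boundary
            return start
        if i >= raw_start:
            break
    return max(raw_start, protected_prefix)
```

A boundary that already sits between groups is returned unchanged, so previously valid `compressed_message_count` values keep working.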
### 2.2 Per-Chat Async Lock (Chat Session Lock)
The `Filter` class maintains `_chat_locks`. In the `outlet` stage, if an existing background task already holds the lock, the current request is skipped automatically, guaranteeing that at most one task per `chat_id` runs at a time.
The flow:
1. `outlet` first obtains the session's lock object via `_get_chat_lock(chat_id)`
2. If `chat_lock.locked()` is true, the background summary task is skipped for this request
3. If no task is running, `_locked_summary_task(...)` is created
4. Inside `_locked_summary_task`, `async with lock:` wraps the actual `_check_and_generate_summary_async(...)`
After this fix, a session no longer fires multiple concurrent summary LLM calls, and background tasks can no longer overwrite each other's `compressed_message_count` or summary content.
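That four-step flow can be condensed into a small scheduler sketch (class and method names here are hypothetical; the real logic is spread across `Filter._get_chat_lock` and `Filter._locked_summary_task`):

```python
import asyncio


class SummaryScheduler:
    """Gate background summary tasks behind a per-chat asyncio.Lock."""

    def __init__(self):
        self._chat_locks = {}  # chat_id -> asyncio.Lock

    def _get_chat_lock(self, chat_id):
        # One lock object per chat_id, created lazily
        if chat_id not in self._chat_locks:
            self._chat_locks[chat_id] = asyncio.Lock()
        return self._chat_locks[chat_id]

    def schedule_summary(self, chat_id, coro_factory):
        lock = self._get_chat_lock(chat_id)
        if lock.locked():
            # A summary task already holds the lock: skip, don't queue
            return None

        async def _locked_task():
            async with lock:
                await coro_factory()

        return asyncio.create_task(_locked_task())
```

Skipping (rather than awaiting the lock) is the point: queuing a second summary would only redo work the running task is about to finish.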
### 2.3 Metadata Traceability
The summary input formatting was refactored:
- Extract the message ID (`msg["id"]`), participant name (`msg["name"]`), and key metadata.
- Inject these identities into the LLM summarization input in the form `[ID: xxx] [Name: yyy]`.
- Strengthen the summary prompt to require the model to reference important actions by ID.
The goal of this fix is not to "restore the original objects of compressed messages", but to preserve as much of their identity trail as possible, reducing the following risks:
- the summary losing all trace of message provenance after compression
- a key decision, tool result, or user requirement becoming untraceable in the summary
- downstream analysis or manual debugging that depends on message identity becoming difficult
The current implementation is `_format_messages_for_summary`:
- Each message is formatted as `[n] Role [ID: ...] [Name: ...] [Meta: ...]: content`
- Multimodal content has its text parts extracted before aggregation
- The summary prompt explicitly asks the model to keep key IDs/Names traceable
This is not an equivalent substitute for the original message objects, but it is far safer than dropping all identity information and keeping only a natural-language summary.
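A trimmed-down standalone sketch of that formatting (omitting the multimodal and `[Meta: ...]` handling shown in the diff) makes the output shape concrete:

```python
def format_messages_for_summary(messages):
    """Format messages with [ID]/[Name] tags for the summary LLM."""
    formatted = []
    for i, msg in enumerate(messages, 1):
        role = msg.get("role", "unknown")
        role_name = {"user": "User", "assistant": "Assistant"}.get(role, role)
        meta_str = f" [ID: {msg.get('id', 'N/A')}]"
        if msg.get("name"):
            meta_str += f" [Name: {msg['name']}]"
        formatted.append(f"[{i}] {role_name}{meta_str}: {msg.get('content', '')}")
    return "\n\n".join(formatted)


text = format_messages_for_summary([
    {"role": "user", "id": "m-101", "content": "Ship v1.4.0"},
    {"role": "assistant", "id": "m-102", "name": "builder", "content": "Released."},
])
# text == "[1] User [ID: m-101]: Ship v1.4.0\n\n"
#         "[2] Assistant [ID: m-102] [Name: builder]: Released."
```

The summary model can then cite `[ID: m-101]` in its output, keeping compressed decisions traceable back to concrete messages.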
### 2.4 Unified Semantics for `max_context_tokens = 0`
Issue #56 also covered a subtle but behavior-affecting consistency problem:
- the `inlet` path already treated `max_context_tokens <= 0` as "no limit, skip trimming"
- but the background summary task still fed `0` into the `estimated_input_tokens > max_context_tokens` comparison
This made the foreground request and the background summarization interpret the same setting differently.
After the fix:
- both `inlet` and the background summary path treat `<= 0` as "no limit"
- when `max_context_tokens <= 0`, the background task skips the `middle_messages` truncation logic entirely
- a regression test was added to keep this behavior from degrading again
This step is less visible than tool-calling atomicity, but it resolves the stability problem of a setting meaning different things on different paths.
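The unified semantics boil down to a single predicate shared by both paths (a sketch of the agreed rule, not the literal code):

```python
def needs_reduction(total_tokens: int, max_context_tokens: int) -> bool:
    """max_context_tokens <= 0 means 'no limit': never reduce."""
    return max_context_tokens > 0 and total_tokens > max_context_tokens
```

Guarding with `max_context_tokens > 0` first means a `0` config can never accidentally force truncation, on either the inlet or the summary path.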
### 2.5 Reining In Tool-Output Trimming Risk
Issue #56 noted that the original tool-output trimming could damage ordinary assistant content. Rather than extending an even more elaborate set of heuristics, a more conservative containment strategy was adopted:
- `enable_tool_output_trimming` stays `False` by default
- the current trimming branch no longer actively rewrites ordinary assistant content
This means the plugin prioritizes "never damaging normal messages" over aggressive trimming. For this bug-fix phase, that is a deliberate stability-first decision.
### 2.6 Fix Order, Summarized
At the implementation level, this is not a single-point patch but a set of structural changes landed in order:
1. Upgrade message handling from "per message" to "per atomic group"
2. Upgrade the tail/middle boundaries from "bare indices" to "structure-aware boundaries"
3. Add a per-session async lock to stop concurrent summary overwrites
4. Augment the summary input format so compressed history retains traceable identity information
5. Finally, unify the `max_context_tokens = 0` semantics and add tests against regression
In essence, the fix for Issue #56 is:
refactoring this filter from "trimming messages by strings and lengths" into "trimming messages by conversation structure and context contracts".
---
## 3. Fix Coverage Matrix
| # | Severity | Problem | Status |
|---|----------|---------|--------|
| 1 | **Critical** | tool-calling messages compressed individually → `No tool call found` | ✅ both inlet paths now atomic |
| 2 | **High** | `compressed_message_count` mixed coordinate systems | ✅ outlet always computes in the original message space |
| 3 | **Medium** | no per-chat async lock | ✅ `_chat_locks` + `asyncio.Lock()` |
| 4 | **Medium** | tool-output trimming too aggressive | ✅ defaults to `False`; loop body emptied |
| 5 | **Medium** | inconsistent `max_context_tokens = 0` semantics | ✅ unified: `<= 0` means "no limit" |
| 6 | **Low** | Korean i18n strings contained Cyrillic characters | ✅ replaced with pure Korean |
| 7 | **(found later)** | truncation inside the summary task ignored atomic groups | ✅ 2026-03-09 patch: now uses `_get_atomic_groups` |
## 4. Verification Conclusions
- **inlet path**: `_get_atomic_groups` runs through both `inlet` branches, dropping messages group by group and never producing orphaned tool results.
- **summary task**: when the context limit is exceeded, `middle_messages` are likewise truncated by atomic group, keeping the input to the LLM intact.
- **concurrency control**: `chat_lock.locked()` guarantees that only one summary task runs at a time per `chat_id`.
- **metadata**: `_format_messages_for_summary` preserves original message identities in `[ID: xxx]` form.
## 5. Follow-Up Recommendations
This fix elevates the filter from "keyword summarization" toward a "structure-aware agent". Future development should continue to respect OpenWebUI's native message fingerprints.


@@ -0,0 +1,26 @@
## Overview
**[🚀 Get/Update on OpenWebUI Community](https://openwebui.com/posts/async_context_compression_b1655bc8)**
This release focuses on improving the structural integrity of chat history when using function-calling models and enhancing task reliability through concurrent task management. Version 1.4.0 introduces "Atomic Message Grouping" to prevent chat context corruption and a session-based locking mechanism to ensure stable background operations.
## New Features
- **Atomic Message Grouping**: A new structure-aware logic that identifies and groups `assistant-tool-tool-assistant` calling sequences. This ensures that tool results are never orphaned from their calls during compression.
- **Tail Boundary Alignment**: Automatically corrects truncation indices to ensure the recent context "tail" starts at a valid message boundary, preventing partial tool-calling sequences from being sent to the LLM.
- **Chat Session Locking**: Implements a per-chat-id asynchronous lock to prevent multiple summary tasks from running concurrently for the same session, reducing redundant LLM calls and race conditions.
- **Metadata Traceability**: Summarization inputs now include message IDs, participant names, and key metadata labels, allowing the summary model to maintain better traceability in its output.
## Bug Fixes
- **Fixed "No tool call found" Errors**: By enforcing atomic grouping, the filter no longer truncates the context in a way that separates tool calls from their results.
- **Improved Progress Calculation**: Fixed an issue where summarizing messages would cause the progress tracking to drift due to shifting list indices.
- **Prevented Duplicate Summary Tasks**: The new locking mechanism ensures that only one background summary process is active per session.
## Related Issues
- **[#56](https://github.com/Fu-Jie/openwebui-extensions/issues/56)**: Tool-Calling context corruption and concurrent summary tasks.
## Related PRs
- **[#61](https://github.com/Fu-Jie/openwebui-extensions/pull/61)**: (Placeholder) Full implementation of structure-aware grouping.


@@ -0,0 +1,26 @@
## Overview
**[🚀 Get/Update on OpenWebUI Community](https://openwebui.com/posts/async_context_compression_b1655bc8)**
This release focuses on preserving the structural integrity of chat history when using function-calling models, and on strengthening system reliability through concurrent task management. Version 1.4.0 introduces "Atomic Message Grouping" to prevent context corruption and adds a session-level locking mechanism to keep background tasks running stably.
## New Features
- **Atomic Message Grouping**: Introduces structure-aware message handling that identifies and processes `assistant-tool-tool-assistant` calling sequences as a unit. This ensures tool results are never separated from their calling instructions during compression.
- **Automatic Tail Boundary Alignment**: Automatically corrects truncation indices so the retained "tail" context starts at a valid message boundary, preventing fragmentary tool-calling sequences from being sent to the model.
- **Session-Level Async Lock**: Implements an asynchronous lock per `chat_id` to stop a session from triggering multiple summary tasks concurrently, cutting redundant LLM calls and eliminating race conditions.
- **Enhanced Metadata Traceability**: Summarization inputs now include message IDs, participant names, and key metadata labels, enabling the summary model to keep better traceability in its output.
## Bug Fixes
- **Eliminated "No tool call found" errors**: With atomic grouping enforced, the filter no longer truncates the context in a way that separates tool calls from their results.
- **Improved Progress Calculation**: Fixed progress-tracking drift caused by shifting list indices after messages were summarized.
- **Prevented Duplicate Summary Tasks**: The new locking mechanism ensures only one background summary process runs per session at a time.
## Related Issues
- **[#56](https://github.com/Fu-Jie/openwebui-extensions/issues/56)**: Fix tool-calling context corruption and concurrent summary task conflicts.
## Related PRs
- **[#61](https://github.com/Fu-Jie/openwebui-extensions/pull/61)**: (Placeholder) Full implementation of structure-aware message grouping.