feat(async-context-compression): release v1.5.0

- add external chat reference summaries and mixed-script token estimation
- tighten summary budgeting, fallback handling, and frontend error visibility
- sync READMEs, mirrored docs, indexes, and bilingual v1.5.0 release notes
fujie
2026-03-14 16:10:06 +08:00
parent 2f518d4c7a
commit 858d048d81
12 changed files with 2482 additions and 343 deletions

View File

@@ -27,7 +27,7 @@ A collection of enhancements, plugins, and prompts for [open-webui](https://gith
| 🥈 | [Smart Infographic](https://openwebui.com/posts/smart_infographic_ad6f0c7f) | ![v](https://img.shields.io/badge/v-1.5.0-blue?style=flat) | ![p2_dl](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p2_dl.json&style=flat) | ![p2_vw](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p2_vw.json&style=flat) | ![updated](https://img.shields.io/badge/2026--02--13-gray?style=flat) |
| 🥉 | [Markdown Normalizer](https://openwebui.com/posts/markdown_normalizer_baaa8732) | ![v](https://img.shields.io/badge/v-1.2.8-blue?style=flat) | ![p3_dl](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p3_dl.json&style=flat) | ![p3_vw](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p3_vw.json&style=flat) | ![updated](https://img.shields.io/badge/2026--03--08-gray?style=flat) |
| 4⃣ | [Export to Word Enhanced](https://openwebui.com/posts/export_to_word_enhanced_formatting_fca6a315) | ![v](https://img.shields.io/badge/v-0.4.4-blue?style=flat) | ![p4_dl](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p4_dl.json&style=flat) | ![p4_vw](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p4_vw.json&style=flat) | ![updated](https://img.shields.io/badge/2026--02--13-gray?style=flat) |
| 5⃣ | [Async Context Compression](https://openwebui.com/posts/async_context_compression_b1655bc8) | ![v](https://img.shields.io/badge/v-1.4.2-blue?style=flat) | ![p5_dl](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p5_dl.json&style=flat) | ![p5_vw](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p5_vw.json&style=flat) | ![updated](https://img.shields.io/badge/2026--03--13-gray?style=flat) |
| 5⃣ | [Async Context Compression](https://openwebui.com/posts/async_context_compression_b1655bc8) | ![v](https://img.shields.io/badge/v-1.5.0-blue?style=flat) | ![p5_dl](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p5_dl.json&style=flat) | ![p5_vw](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p5_vw.json&style=flat) | ![updated](https://img.shields.io/badge/2026--03--14-gray?style=flat) |
| 6⃣ | [AI Task Instruction Generator](https://openwebui.com/posts/ai_task_instruction_generator_9bab8b37) | ![v](https://img.shields.io/badge/v-N/A-gray?style=flat) | ![p6_dl](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p6_dl.json&style=flat) | ![p6_vw](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p6_vw.json&style=flat) | ![updated](https://img.shields.io/badge/2026--01--28-gray?style=flat) |
### 📈 Total Downloads Trend

View File

@@ -24,7 +24,7 @@ OpenWebUI 增强功能集合。包含个人开发与收集的插件、提示词
| 🥈 | [Smart Infographic](https://openwebui.com/posts/smart_infographic_ad6f0c7f) | ![v](https://img.shields.io/badge/v-1.5.0-blue?style=flat) | ![p2_dl](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p2_dl.json&style=flat) | ![p2_vw](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p2_vw.json&style=flat) | ![updated](https://img.shields.io/badge/2026--02--13-gray?style=flat) |
| 🥉 | [Markdown Normalizer](https://openwebui.com/posts/markdown_normalizer_baaa8732) | ![v](https://img.shields.io/badge/v-1.2.8-blue?style=flat) | ![p3_dl](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p3_dl.json&style=flat) | ![p3_vw](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p3_vw.json&style=flat) | ![updated](https://img.shields.io/badge/2026--03--08-gray?style=flat) |
| 4⃣ | [Export to Word Enhanced](https://openwebui.com/posts/export_to_word_enhanced_formatting_fca6a315) | ![v](https://img.shields.io/badge/v-0.4.4-blue?style=flat) | ![p4_dl](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p4_dl.json&style=flat) | ![p4_vw](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p4_vw.json&style=flat) | ![updated](https://img.shields.io/badge/2026--02--13-gray?style=flat) |
| 5⃣ | [Async Context Compression](https://openwebui.com/posts/async_context_compression_b1655bc8) | ![v](https://img.shields.io/badge/v-1.4.2-blue?style=flat) | ![p5_dl](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p5_dl.json&style=flat) | ![p5_vw](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p5_vw.json&style=flat) | ![updated](https://img.shields.io/badge/2026--03--13-gray?style=flat) |
| 5⃣ | [Async Context Compression](https://openwebui.com/posts/async_context_compression_b1655bc8) | ![v](https://img.shields.io/badge/v-1.5.0-blue?style=flat) | ![p5_dl](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p5_dl.json&style=flat) | ![p5_vw](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p5_vw.json&style=flat) | ![updated](https://img.shields.io/badge/2026--03--14-gray?style=flat) |
| 6⃣ | [AI Task Instruction Generator](https://openwebui.com/posts/ai_task_instruction_generator_9bab8b37) | ![v](https://img.shields.io/badge/v-N/A-gray?style=flat) | ![p6_dl](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p6_dl.json&style=flat) | ![p6_vw](https://img.shields.io/endpoint?url=https%3A%2F%2Fgist.githubusercontent.com%2FFu-Jie%2Fdb3d95687075a880af6f1fba76d679c6%2Fraw%2Fbadge_p6_vw.json&style=flat) | ![updated](https://img.shields.io/badge/2026--01--28-gray?style=flat) |
### 📈 总下载量累计趋势

View File

@@ -1,13 +1,19 @@
# Async Context Compression Filter
**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.4.1 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT
**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.5.0 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT
This filter reduces token consumption in long conversations through intelligent summarization and message compression while keeping conversations coherent.
## What's new in 1.4.1
## What's new in 1.5.0
- **Reverse-Unfolding Mechanism**: Accurately reconstructs the expanded native tool-calling sequence during the outlet phase to permanently fix coordinate drift and missing summaries for long tool-based conversations.
- **Safer Tool Trimming**: Refactored `enable_tool_output_trimming` to strictly use atomic block groups for safe trimming, completely preventing JSON payload corruption.
- **External Chat Reference Summaries**: Added support for referenced chat context blocks that can reuse cached summaries, inject small referenced chats directly, or generate summaries for larger referenced chats before injection.
- **Fast Multilingual Token Estimation**: Added a new mixed-script token estimation pipeline so inlet/outlet preflight checks can avoid unnecessary exact token counts while staying much closer to real usage.
- **Stronger Working-Memory Prompt**: Refined the XML summary prompt to better preserve actionable context across general chat, coding tasks, and tool-heavy conversations.
- **Clearer Frontend Debug Logs**: Reworked browser-console logging into grouped structural snapshots that are easier to scan during debugging.
- **Safer Tool Trimming Defaults**: Enabled native tool-output trimming by default and exposed a dedicated `tool_trim_threshold_chars` valve with a 600-character default.
- **Safer Referenced-Chat Fallbacks**: If generating a referenced chat summary fails, the new reference-summary path now falls back to direct contextual injection instead of failing the whole chat.
- **Correct Summary Budgeting**: `summary_model_max_context` now controls summary-input fitting, while `max_summary_tokens` remains an output-length cap.
- **More Visible Summary Failures**: Important background summary failures now surface in the browser console (`F12`) and as a status hint even when `show_debug_log` is off.
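The mixed-script estimation mentioned above can be pictured as a small heuristic: count CJK characters roughly one token each and treat remaining text at a few characters per token. A minimal sketch, assuming illustrative ratios and a hypothetical `estimate_tokens` helper (this is not the filter's actual implementation):

```python
import re

# CJK ranges: Han, Hiragana/Katakana, Hangul. Illustrative only.
CJK_RE = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")

def estimate_tokens(text: str) -> int:
    """Fast mixed-script token estimate (assumed ratios):
    ~1 token per CJK character, ~4 Latin characters per token."""
    cjk_chars = len(CJK_RE.findall(text))
    other_chars = len(text) - cjk_chars
    # Ceiling-divide the non-CJK portion so short strings are not underestimated.
    return cjk_chars + -(-other_chars // 4)
```

A preflight check can use this cheap estimate first and only fall back to an exact tokenizer when the estimate lands near a threshold.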
---
@@ -19,15 +25,85 @@ This filter reduces token consumption in long conversations through intelligent
- ✅ Persistent storage via Open WebUI's shared database connection (PostgreSQL, SQLite, etc.).
- ✅ Flexible retention policy to keep the first and last N messages.
- ✅ Smart injection of historical summaries back into the context.
- ✅ External chat reference summarization with cached-summary reuse, direct injection for small chats, and generated summaries for larger chats.
- ✅ Structure-aware trimming that preserves document structure (headers, intro, conclusion).
- ✅ Native tool output trimming for cleaner context when using function calling.
- ✅ Real-time context usage monitoring with warning notifications (>90%).
- ✅ Detailed token logging for precise debugging and optimization.
- ✅ Fast multilingual token estimation plus exact token fallback for precise debugging and optimization.
- ✅ **Smart Model Matching**: Automatically inherits configuration from base models for custom presets.
- ✅ **Multimodal Support**: Images are preserved but their tokens are **NOT** calculated. Please adjust thresholds accordingly.
---
## What This Fixes
- **Problem 1: A referenced chat could break the current request.**
Before, if the filter needed to summarize a referenced chat and that LLM call failed, the current chat could fail with it. Now it degrades gracefully and injects direct context instead.
- **Problem 2: Some referenced chats were being cut too aggressively.**
Before, the output limit (`max_summary_tokens`) could be treated like the input window, which made large referenced chats shrink earlier than necessary. Now input fitting uses the summary model's real context window (`summary_model_max_context` or model/global fallback).
- **Problem 3: Some background summary failures were too easy to miss.**
Before, a failure during background summary preparation could disappear quietly when frontend debug logging was off. Now important failures are forced to the browser console and also shown through a user-facing status message.
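The graceful degradation described in Problem 1 boils down to a try/except around the referenced-chat summary call. A minimal sketch, with hypothetical names (`inject_referenced_chat`, `generate_summary`, `inject`) standing in for the filter's internals:

```python
async def inject_referenced_chat(ref_messages, generate_summary, inject):
    """Try to summarize a referenced chat; on any failure, degrade to
    injecting the raw context instead of failing the current request."""
    try:
        summary = await generate_summary(ref_messages)
        inject(f"Referenced chat summary:\n{summary}")
        return "summary"
    except Exception:
        # Fallback path: the current chat keeps working.
        inject("Referenced chat (direct context):\n"
               + "\n".join(m["content"] for m in ref_messages))
        return "direct"
```

The key property is that no exception from the summary model escapes into the main request path.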
---
## Workflow Overview
This filter operates in two phases:
1. `inlet`: injects stored summaries, processes external chat references, and trims context when required before the request is sent to the model.
2. `outlet`: runs asynchronously after the response is complete, decides whether a new summary should be generated, and persists it when appropriate.
```mermaid
flowchart TD
A[Request enters inlet] --> B[Normalize tool IDs and optionally trim large tool outputs]
B --> C{Referenced chats attached?}
C -- No --> D[Load current chat summary if available]
C -- Yes --> E[Inspect each referenced chat]
E --> F{Existing cached summary?}
F -- Yes --> G[Reuse cached summary]
F -- No --> H{Fits direct budget?}
H -- Yes --> I[Inject full referenced chat text]
H -- No --> J[Prepare referenced-chat summary input]
J --> K{Referenced-chat summary call succeeds?}
K -- Yes --> L[Inject generated referenced summary]
K -- No --> M[Fallback to direct contextual injection]
G --> D
I --> D
L --> D
M --> D
D --> N[Build current-chat Head + Summary + Tail]
N --> O{Over max_context_tokens?}
O -- Yes --> P[Trim oldest atomic groups]
O -- No --> Q[Send final context to the model]
P --> Q
Q --> R[Model returns the reply]
R --> S[Outlet rebuilds the full history]
S --> T{Reached compression threshold?}
T -- No --> U[Finish]
T -- Yes --> V[Fit summary input to the summary model context]
V --> W{Background summary call succeeds?}
W -- Yes --> X[Save new chat summary and update status]
W -- No --> Y[Force browser-console error and show status hint]
```
### Key Notes
- `inlet` only injects and trims context. It does not generate the main chat summary.
- `outlet` performs summary generation asynchronously and does not block the current reply.
- External chat references may come from an existing persisted summary, a small chat's full text, or a generated/truncated reference summary.
- If a referenced-chat summary call fails, the filter falls back to direct context injection instead of failing the whole request.
- `summary_model_max_context` controls summary-input fitting. `max_summary_tokens` only controls how long the generated summary may be.
- Important background summary failures are surfaced to the browser console (`F12`) and the chat status area.
- External reference messages are protected during trimming so they are not discarded first.
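The budgeting split in the notes above (input window vs. output cap) can be sketched as: reserve the output budget inside the input window, then drop the oldest messages until the remainder fits. This is a simplified illustration, not the filter's exact fitting logic:

```python
def fit_summary_input(messages, estimate_tokens, input_window, max_summary_tokens):
    """input_window plays the role of summary_model_max_context (what we
    FEED the summary model); max_summary_tokens only caps the OUTPUT."""
    budget = input_window - max_summary_tokens  # leave room for the reply
    kept = list(messages)
    while kept and sum(estimate_tokens(m["content"]) for m in kept) > budget:
        kept.pop(0)  # drop the oldest message first
    return kept
```

Using the output cap as the input window, as the old behavior did, would shrink `budget` to zero and truncate referenced chats far too early.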
---
## Installation & Configuration
### 1) Database (automatic)
@@ -51,11 +127,12 @@ This filter reduces token consumption in long conversations through intelligent
| `keep_first` | `1` | Always keep the first N messages (protects system prompts). |
| `keep_last` | `6` | Always keep the last N messages to preserve recent context. |
| `summary_model` | `None` | Model for summaries. Strongly recommended to set a fast, economical model (e.g., `gemini-2.5-flash`, `deepseek-v3`). Falls back to the current chat model when empty. |
| `summary_model_max_context` | `0` | Max context tokens for the summary model. If 0, falls back to `model_thresholds` or global `max_context_tokens`. |
| `max_summary_tokens` | `16384` | Maximum tokens for the generated summary. |
| `summary_temperature` | `0.3` | Randomness for summary generation. Lower is more deterministic. |
| `summary_model_max_context` | `0` | Input context window used to fit summary requests. If `0`, falls back to `model_thresholds` or global `max_context_tokens`. |
| `max_summary_tokens` | `16384` | Maximum output length for the generated summary. This is not the summary-input context limit. |
| `summary_temperature` | `0.1` | Randomness for summary generation. Lower is more deterministic. |
| `model_thresholds` | `{}` | Per-model overrides for `compression_threshold_tokens` and `max_context_tokens` (useful for mixed models). |
| `enable_tool_output_trimming` | `false` | When enabled and `function_calling: "native"` is active, trims verbose tool outputs to extract only the final answer. |
| `enable_tool_output_trimming` | `true` | When enabled for `function_calling: "native"`, trims oversized native tool outputs while keeping the tool-call chain intact. |
| `tool_trim_threshold_chars` | `600` | Trim native tool output blocks once their total content length reaches this threshold. |
| `debug_mode` | `false` | Log verbose debug info. Set to `false` in production. |
| `show_debug_log` | `false` | Print debug logs to browser console (F12). Useful for frontend debugging. |
| `show_token_usage_status` | `true` | Show token usage status notification in the chat interface. |
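The `tool_trim_threshold_chars` valve is a plain size gate: outputs below the threshold pass through untouched. A simplified sketch of the size check (the real filter trims whole atomic block groups rather than slicing strings; `trim_tool_output` and the placeholder text are illustrative):

```python
def trim_tool_output(content: str, threshold: int = 600) -> str:
    """Once a native tool output block reaches the threshold (600 chars
    by default), keep head/tail context and replace the bulk."""
    if len(content) < threshold:
        return content
    keep = threshold // 4
    return content[:keep] + "\n…[tool output trimmed]…\n" + content[-keep:]
```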
@@ -71,8 +148,12 @@ If this plugin has been useful, a star on [OpenWebUI Extensions](https://github.
- **Initial system prompt is lost**: Keep `keep_first` greater than 0 to protect the initial message.
- **Compression effect is weak**: Raise `compression_threshold_tokens` or lower `keep_first` / `keep_last` to allow more aggressive compression.
- **A referenced chat summary fails**: The current request should continue with a direct-context fallback. Check the browser console (`F12`) if you need the upstream failure details.
- **A background summary silently seems to do nothing**: Important failures now surface in chat status and the browser console (`F12`).
- **Submit an Issue**: If you encounter any problems, please submit an issue on GitHub: [OpenWebUI Extensions Issues](https://github.com/Fu-Jie/openwebui-extensions/issues)
## Changelog
See [`v1.5.0` Release Notes](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/v1.5.0.md) for the release-specific summary.
See the full history on GitHub: [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions)

View File

@@ -1,15 +1,21 @@
# 异步上下文压缩过滤器
**作者:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **版本:** 1.4.1 | **项目:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **许可证:** MIT
**作者:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **版本:** 1.5.0 | **项目:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **许可证:** MIT
> **重要提示**:为了确保所有过滤器的可维护性和易用性,每个过滤器都应附带清晰、完整的文档,以确保其功能、配置和使用方法得到充分说明。
本过滤器通过智能摘要和消息压缩技术,在保持对话连贯性的同时,显著降低长对话的 Token 消耗。
## 1.4.1 版本更新
## 1.5.0 版本更新
- **逆向展开机制**: 引入 `_unfold_messages` 机制以在 `outlet` 阶段精确对齐坐标系,彻底解决了由于前端视图折叠导致长轮次工具调用对话出现进度漂移或跳过生成摘要的问题。
- **更安全的工具内容裁剪**: 重构了 `enable_tool_output_trimming`,现在严格使用原子级分组进行安全的原生工具内容裁剪,替代了激进的正则表达式匹配,防止 JSON 载荷损坏。
- **外部聊天引用摘要**: 新增对引用聊天上下文的摘要支持。现在可以复用缓存摘要、直接注入较小引用聊天,或先为较大的引用聊天生成摘要再注入。
- **快速多语言 Token 预估**: 新增混合脚本 Token 预估链路,使 inlet / outlet 的预检可以减少不必要的精确计数,同时比旧的粗略字符比值更接近真实用量。
- **更稳健的工作记忆提示词**: 重写 XML 摘要提示词,增强普通聊天、编码任务和连续工具调用场景下的关键信息保留能力。
- **更清晰的前端调试日志**: 浏览器控制台日志改为分组化、结构化展示,排查上下文压缩行为更直观。
- **更安全的工具裁剪默认值**: 原生工具输出裁剪默认开启,并新增 `tool_trim_threshold_chars` 配置项,默认阈值为 600 字符。
- **更稳妥的引用聊天回退**: 当新的引用聊天摘要路径生成失败时,不再拖垮当前请求,而是自动回退为直接注入上下文。
- **更准确的摘要预算**: `summary_model_max_context` 现在只负责摘要输入窗口,`max_summary_tokens` 继续只负责摘要输出长度。
- **更容易发现摘要失败**: 重要的后台摘要失败现在会强制显示到浏览器控制台 (`F12`),并同步给出状态提示。
---
@@ -21,14 +27,84 @@
- **持久化存储**: 复用 Open WebUI 共享数据库连接,自动支持 PostgreSQL/SQLite 等。
- **灵活保留策略**: 可配置保留对话头部和尾部消息,确保关键信息连贯。
- **智能注入**: 将历史摘要智能注入到新上下文中。
- **外部聊天引用摘要**: 支持复用缓存摘要、小聊天直接注入,以及大聊天先摘要后注入。
- **结构感知裁剪**: 智能折叠过长消息,保留文档骨架(标题、首尾)。
- **原生工具输出裁剪**: 支持裁剪冗长的工具调用输出。
- **实时监控**: 实时监控上下文使用情况,超过 90% 发出警告。
- **详细日志**: 提供精确的 Token 统计日志,便于调试。
- **快速预估 + 精确回退**: 提供更快的多语言 Token 预估,并在必要时回退到精确统计,便于调试。
- **智能模型匹配**: 自定义模型自动继承基础模型的阈值配置。
- **多模态支持**: 图片内容会被保留,但其 Token **不参与计算**。请相应调整阈值。
详细的工作原理和流程请参考 [工作流程指南](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/WORKFLOW_GUIDE_CN.md)。
详细的工作原理和更长说明仍可参考 [工作流程指南](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/WORKFLOW_GUIDE_CN.md)。
---
## 这次解决了什么问题(通俗版)
- **问题 1:引用别的聊天时摘要失败,可能把当前对话一起弄挂。**
以前如果过滤器需要先帮被引用聊天做摘要,而这一步的 LLM 调用失败了,当前请求也可能直接失败。现在改成了“能摘要就摘要,失败就退回直接塞上下文”,当前对话不会被一起拖死。
- **问题 2:有些被引用聊天被截得太早,信息丢得太多。**
以前有一段逻辑把 `max_summary_tokens` 这种“输出长度限制”误当成了“输入上下文窗口”,结果大一点的引用聊天会被过早截断。现在改成按摘要模型真实的输入窗口来算,能保留更多有用内容。
- **问题 3:后台摘要失败时,用户不容易知道发生了什么。**
以前在 `show_debug_log=false` 时,有些后台失败只会留在内部日志里。现在关键失败会强制打到浏览器控制台,并在聊天状态里提醒去看 `F12`。
---
## 工作流总览
该过滤器分为两个阶段:
1. `inlet`:在请求发送给模型前执行,负责注入已有摘要、处理外部聊天引用、并在必要时裁剪上下文。
2. `outlet`:在模型回复完成后异步执行,负责判断是否需要生成新摘要,并在合适时写入数据库。
```mermaid
flowchart TD
A[请求进入 inlet] --> B[规范化工具 ID 并按需裁剪超长工具输出]
B --> C{是否附带引用聊天?}
C -- 否 --> D[如果有当前聊天摘要就先加载]
C -- 是 --> E[逐个检查被引用聊天]
E --> F{已有缓存摘要?}
F -- 是 --> G[直接复用缓存摘要]
F -- 否 --> H{能直接放进当前预算?}
H -- 是 --> I[直接注入完整引用聊天文本]
H -- 否 --> J[准备引用聊天的摘要输入]
J --> K{引用聊天摘要调用成功?}
K -- 是 --> L[注入生成后的引用摘要]
K -- 否 --> M[回退为直接注入上下文]
G --> D
I --> D
L --> D
M --> D
D --> N[为当前聊天构造 Head + Summary + Tail]
N --> O{是否超过 max_context_tokens?}
O -- 是 --> P[从最旧 atomic groups 开始裁剪]
O -- 否 --> Q[把最终上下文发给模型]
P --> Q
Q --> R[模型返回当前回复]
R --> S[Outlet 重建完整历史]
S --> T{达到压缩阈值了吗?}
T -- 否 --> U[结束]
T -- 是 --> V[把摘要输入压到摘要模型可接受的上下文窗口]
V --> W{后台摘要调用成功?}
W -- 是 --> X[保存新摘要并更新状态]
W -- 否 --> Y[强制输出浏览器控制台错误并提示用户查看]
```
### 关键说明
- `inlet` 只负责注入和裁剪上下文,不负责生成当前聊天的主摘要。
- `outlet` 异步生成摘要,不会阻塞当前回复。
- 外部聊天引用可以来自已有持久化摘要、小聊天的完整文本,或动态生成/截断后的引用摘要。
- 如果引用聊天摘要失败,会自动回退为直接注入上下文,而不是让当前请求失败。
- `summary_model_max_context` 控制摘要输入窗口;`max_summary_tokens` 只控制生成摘要的输出长度。
- 重要的后台摘要失败会显示到浏览器控制台 (`F12`) 和聊天状态提示里。
- 外部引用消息在裁剪阶段会被特殊保护,避免被最先删除。
---
@@ -64,8 +140,8 @@
| 参数 | 默认值 | 描述 |
| :-------------------- | :------ | :------------------------------------------------------------------------------------------------------------------------------------------ |
| `summary_model` | `None` | 用于生成摘要的模型 ID。**强烈建议**配置快速、经济、上下文窗口大的模型(如 `gemini-2.5-flash`、`deepseek-v3`)。留空则尝试复用当前对话模型。 |
| `summary_model_max_context` | `0` | 摘要模型的最大上下文 Token 数。如果为 0则回退到 `model_thresholds` 或全局 `max_context_tokens`。 |
| `max_summary_tokens` | `16384` | 生成摘要时允许的最大 Token 数。 |
| `summary_model_max_context` | `0` | 摘要请求可使用的输入上下文窗口。如果为 0则回退到 `model_thresholds` 或全局 `max_context_tokens`。 |
| `max_summary_tokens` | `16384` | 生成摘要时允许的最大输出 Token 数。它不是摘要输入窗口上限。 |
| `summary_temperature` | `0.1` | 控制摘要生成的随机性,较低的值结果更稳定。 |
### 高级配置
@@ -93,7 +169,8 @@
| 参数 | 默认值 | 描述 |
| :----------------------------- | :------- | :-------------------------------------------------------------------------------------------------------------------------------------- |
| `enable_tool_output_trimming` | `false` | 启用时,若 `function_calling: "native"` 激活,将裁剪冗长的工具输出以仅提取最终答案。 |
| `enable_tool_output_trimming` | `true` | 启用后(仅在 `function_calling: "native"` 下生效)会裁剪过大的本机工具输出,保留工具调用链结构并以简短占位替换冗长内容。 |
| `tool_trim_threshold_chars` | `600` | 当本机工具输出累计字符数达到该值时触发裁剪,适用于包含长文本或表格的工具结果。 |
| `debug_mode` | `false` | 是否在 Open WebUI 的控制台日志中打印详细的调试信息。生产环境默认且建议设为 `false`。 |
| `show_debug_log` | `false` | 是否在浏览器控制台 (F12) 打印调试日志。便于前端调试。 |
| `show_token_usage_status` | `true` | 是否在对话结束时显示 Token 使用情况的状态通知。 |
@@ -109,8 +186,12 @@
- **初始系统提示丢失**:将 `keep_first` 设置为大于 0。
- **压缩效果不明显**:提高 `compression_threshold_tokens`,或降低 `keep_first` / `keep_last` 以增强压缩力度。
- **引用聊天摘要失败**:当前请求现在应该会继续执行,并回退为直接注入上下文。如果要看上游失败原因,请打开浏览器控制台 (`F12`)。
- **后台摘要看起来“没反应”**:重要失败现在会同时出现在状态提示和浏览器控制台 (`F12`) 中。
- **提交 Issue**: 如果遇到任何问题,请在 GitHub 上提交 Issue:[OpenWebUI Extensions Issues](https://github.com/Fu-Jie/openwebui-extensions/issues)
## 更新日志
请查看 [`v1.5.0` 版本发布说明](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/v1.5.0_CN.md) 获取本次版本的独立发布摘要。
完整历史请查看 GitHub 项目: [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions)

View File

@@ -20,9 +20,9 @@ Filters act as middleware in the message pipeline:
---
Reduces token consumption in long conversations through intelligent summarization while maintaining coherence.
Reduces token consumption in long conversations with safer summary fallbacks and clearer failure visibility.
**Version:** 1.4.1
**Version:** 1.5.0
[:octicons-arrow-right-24: Documentation](async-context-compression.md)

View File

@@ -20,11 +20,11 @@ Filter 充当消息管线中的中间件:
---
通过智能总结减少长对话的 token 消耗,同时保持连贯性。
通过更稳健的摘要回退和更清晰的失败提示,降低长对话的 token 消耗,同时保持连贯性。
**版本:** 1.4.1
**版本:** 1.5.0
[:octicons-arrow-right-24: 查看文档](async-context-compression.md)
[:octicons-arrow-right-24: 查看文档](async-context-compression.zh.md)
- :material-text-box-plus:{ .lg .middle } **Context Enhancement**

View File

@@ -1,14 +1,19 @@
# Async Context Compression Filter
**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.4.2 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT
**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.5.0 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT
This filter reduces token consumption in long conversations through intelligent summarization and message compression while keeping conversations coherent.
## What's new in 1.4.2
## What's new in 1.5.0
- **Enhanced Summary Path Robustness**: Thread `__request__` context through entire summary generation pipeline for reliable authentication and provider handling.
- **Improved Error Diagnostics**: LLM response validation failures now include complete response body in error logs for transparent troubleshooting.
- **Smart Previous Summary Loading**: Automatically load and merge previous summaries from DB when not present in outlet payload, enabling incremental state merging across summary generations.
- **External Chat Reference Summaries**: Added support for referenced chat context blocks that can reuse cached summaries, inject small referenced chats directly, or generate summaries for larger referenced chats before injection.
- **Fast Multilingual Token Estimation**: Added a new mixed-script token estimation pipeline so inlet/outlet preflight checks can avoid unnecessary exact token counts while staying much closer to real usage.
- **Stronger Working-Memory Prompt**: Refined the XML summary prompt to better preserve actionable context across general chat, coding tasks, and tool-heavy conversations.
- **Clearer Frontend Debug Logs**: Reworked browser-console logging into grouped structural snapshots that are easier to scan during debugging.
- **Safer Tool Trimming Defaults**: Enabled native tool-output trimming by default and exposed a dedicated `tool_trim_threshold_chars` valve with a 600-character default.
- **Safer Referenced-Chat Fallbacks**: If generating a referenced chat summary fails, the new reference-summary path now falls back to direct contextual injection instead of failing the whole chat.
- **Correct Summary Budgeting**: `summary_model_max_context` now controls summary-input fitting, while `max_summary_tokens` remains an output-length cap.
- **More Visible Summary Failures**: Important background summary failures now surface in the browser console (`F12`) and as a status hint even when `show_debug_log` is off.
---
@@ -20,15 +25,85 @@ This filter reduces token consumption in long conversations through intelligent
- ✅ Persistent storage via Open WebUI's shared database connection (PostgreSQL, SQLite, etc.).
- ✅ Flexible retention policy to keep the first and last N messages.
- ✅ Smart injection of historical summaries back into the context.
- ✅ External chat reference summarization with cached-summary reuse, direct injection for small chats, and generated summaries for larger chats.
- ✅ Structure-aware trimming that preserves document structure (headers, intro, conclusion).
- ✅ Native tool output trimming for cleaner context when using function calling.
- ✅ Real-time context usage monitoring with warning notifications (>90%).
- ✅ Detailed token logging for precise debugging and optimization.
- ✅ Fast multilingual token estimation plus exact token fallback for precise debugging and optimization.
- ✅ **Smart Model Matching**: Automatically inherits configuration from base models for custom presets.
- ✅ **Multimodal Support**: Images are preserved but their tokens are **NOT** calculated. Please adjust thresholds accordingly.
---
## What This Fixes
- **Problem 1: A referenced chat could break the current request.**
Before, if the filter needed to summarize a referenced chat and that LLM call failed, the current chat could fail with it. Now it degrades gracefully and injects direct context instead.
- **Problem 2: Some referenced chats were being cut too aggressively.**
Before, the output limit (`max_summary_tokens`) could be treated like the input window, which made large referenced chats shrink earlier than necessary. Now input fitting uses the summary model's real context window (`summary_model_max_context` or model/global fallback).
- **Problem 3: Some background summary failures were too easy to miss.**
Before, a failure during background summary preparation could disappear quietly when frontend debug logging was off. Now important failures are forced to the browser console and also shown through a user-facing status message.
---
## Workflow Overview
This filter operates in two phases:
1. `inlet`: injects stored summaries, processes external chat references, and trims context when required before the request is sent to the model.
2. `outlet`: runs asynchronously after the response is complete, decides whether a new summary should be generated, and persists it when appropriate.
```mermaid
flowchart TD
A[Request enters inlet] --> B[Normalize tool IDs and optionally trim large tool outputs]
B --> C{Referenced chats attached?}
C -- No --> D[Load current chat summary if available]
C -- Yes --> E[Inspect each referenced chat]
E --> F{Existing cached summary?}
F -- Yes --> G[Reuse cached summary]
F -- No --> H{Fits direct budget?}
H -- Yes --> I[Inject full referenced chat text]
H -- No --> J[Prepare referenced-chat summary input]
J --> K{Referenced-chat summary call succeeds?}
K -- Yes --> L[Inject generated referenced summary]
K -- No --> M[Fallback to direct contextual injection]
G --> D
I --> D
L --> D
M --> D
D --> N[Build current-chat Head + Summary + Tail]
N --> O{Over max_context_tokens?}
O -- Yes --> P[Trim oldest atomic groups]
O -- No --> Q[Send final context to the model]
P --> Q
Q --> R[Model returns the reply]
R --> S[Outlet rebuilds the full history]
S --> T{Reached compression threshold?}
T -- No --> U[Finish]
T -- Yes --> V[Fit summary input to the summary model context]
V --> W{Background summary call succeeds?}
W -- Yes --> X[Save new chat summary and update status]
W -- No --> Y[Force browser-console error and show status hint]
```
### Key Notes
- `inlet` only injects and trims context. It does not generate the main chat summary.
- `outlet` performs summary generation asynchronously and does not block the current reply.
- External chat references may come from an existing persisted summary, a small chat's full text, or a generated/truncated reference summary.
- If a referenced-chat summary call fails, the filter falls back to direct context injection instead of failing the whole request.
- `summary_model_max_context` controls summary-input fitting. `max_summary_tokens` only controls how long the generated summary may be.
- Important background summary failures are surfaced to the browser console (`F12`) and the chat status area.
- External reference messages are protected during trimming so they are not discarded first.
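The three-way handling of referenced chats in the notes above (cached summary, direct injection, or summarize-then-inject) reduces to a simple decision. A hypothetical helper illustrating that ordering (names and signature are assumptions, not the filter's API):

```python
def choose_reference_strategy(has_cached_summary: bool,
                              size_tokens: int,
                              direct_budget: int) -> str:
    """Pick how a referenced chat enters the context:
    cached summary first, then direct injection for small chats,
    otherwise generate a summary before injection."""
    if has_cached_summary:
        return "reuse_cached_summary"
    if size_tokens <= direct_budget:
        return "inject_directly"
    return "generate_summary"
```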
---
## Installation & Configuration
### 1) Database (automatic)
@@ -52,11 +127,12 @@ This filter reduces token consumption in long conversations through intelligent
| `keep_first` | `1` | Always keep the first N messages (protects system prompts). |
| `keep_last` | `6` | Always keep the last N messages to preserve recent context. |
| `summary_model` | `None` | Model for summaries. Strongly recommended to set a fast, economical model (e.g., `gemini-2.5-flash`, `deepseek-v3`). Falls back to the current chat model when empty. |
| `summary_model_max_context` | `0` | Max context tokens for the summary model. If 0, falls back to `model_thresholds` or global `max_context_tokens`. |
| `max_summary_tokens` | `16384` | Maximum tokens for the generated summary. |
| `summary_temperature` | `0.3` | Randomness for summary generation. Lower is more deterministic. |
| `summary_model_max_context` | `0` | Input context window used to fit summary requests. If `0`, falls back to `model_thresholds` or global `max_context_tokens`. |
| `max_summary_tokens` | `16384` | Maximum output length for the generated summary. This is not the summary-input context limit. |
| `summary_temperature` | `0.1` | Randomness for summary generation. Lower is more deterministic. |
| `model_thresholds` | `{}` | Per-model overrides for `compression_threshold_tokens` and `max_context_tokens` (useful for mixed models). |
| `enable_tool_output_trimming` | `true` | When enabled for `function_calling: "native"`, trims oversized native tool outputs while keeping the tool-call chain intact. |
| `tool_trim_threshold_chars` | `600` | Trim native tool output blocks once their total content length reaches this threshold. |
| `debug_mode` | `false` | Log verbose debug info. Set to `false` in production. |
| `show_debug_log` | `false` | Print debug logs to browser console (F12). Useful for frontend debugging. |
| `show_token_usage_status` | `true` | Show token usage status notification in the chat interface. |
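As a hedged illustration of how these valves combine, the dictionary below shows one plausible mixed-model setup; the model IDs and numbers are placeholders, not recommendations.

```python
# Illustrative valve settings only; adjust to your own models and budgets.
valves = {
    "summary_model": "gemini-2.5-flash",  # fast, economical summarizer
    "summary_model_max_context": 131072,  # summary *input* window
    "max_summary_tokens": 16384,          # summary *output* cap
    "summary_temperature": 0.1,
    "model_thresholds": {                 # per-model overrides
        "deepseek-v3": {
            "compression_threshold_tokens": 24000,
            "max_context_tokens": 64000,
        },
    },
    "enable_tool_output_trimming": True,
    "tool_trim_threshold_chars": 600,
}
```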
@@ -72,8 +148,12 @@ If this plugin has been useful, a star on [OpenWebUI Extensions](https://github.
- **Initial system prompt is lost**: Keep `keep_first` greater than 0 to protect the initial message.
- **Compression effect is weak**: Raise `compression_threshold_tokens` or lower `keep_first` / `keep_last` to allow more aggressive compression.
- **A referenced chat summary fails**: The current request should continue with a direct-context fallback. Check the browser console (`F12`) if you need the upstream failure details.
- **A background summary appears to do nothing**: Important failures now surface in the chat status and the browser console (`F12`).
- **Submit an Issue**: If you encounter any problems, please submit an issue on GitHub: [OpenWebUI Extensions Issues](https://github.com/Fu-Jie/openwebui-extensions/issues)
## Changelog
See [`v1.5.0` Release Notes](./v1.5.0.md) for the release-specific summary.
See the full history on GitHub: [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions)


@@ -1,16 +1,21 @@
# Async Context Compression Filter
**Author:** [Fu-Jie](https://github.com/Fu-Jie/openwebui-extensions) | **Version:** 1.5.0 | **Project:** [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions) | **License:** MIT
> **Note**: To keep every filter maintainable and easy to use, each filter should ship with clear, complete documentation covering its features, configuration, and usage.
This filter significantly reduces token consumption in long conversations through intelligent summarization and message compression, while preserving conversational coherence.
## What's New in 1.5.0
- **External chat reference summaries**: Adds summary support for referenced chat context. Cached summaries can now be reused, smaller referenced chats injected directly, and larger referenced chats summarized before injection.
- **Fast multilingual token estimation**: Adds a mixed-script token-estimation pipeline so inlet/outlet preflight checks need fewer exact counts while tracking real usage more closely than the old rough character ratio.
- **Sturdier working-memory prompt**: Rewrites the XML summary prompt to better preserve key information across general chat, coding tasks, and consecutive tool-call scenarios.
- **Clearer frontend debug logs**: Browser-console logs are now grouped and structured, making context-compression behavior easier to inspect.
- **Safer tool-trimming defaults**: Native tool-output trimming is now enabled by default, with a new `tool_trim_threshold_chars` option defaulting to 600 characters.
- **Safer referenced-chat fallback**: When the new referenced-chat summary path fails, it no longer drags down the current request; the filter falls back to direct context injection.
- **More accurate summary budgeting**: `summary_model_max_context` now governs only the summary input window, while `max_summary_tokens` continues to govern only summary output length.
- **Easier-to-spot summary failures**: Important background summary failures are now forced to the browser console (`F12`), with a matching status hint.
---
@@ -22,14 +27,84 @@
- **Persistent storage**: Reuses the Open WebUI shared database connection, automatically supporting PostgreSQL/SQLite and more.
- **Flexible retention policy**: Configurable retention of head and tail messages to keep key information coherent.
- **Smart injection**: Intelligently injects history summaries into the new context.
- **External chat reference summaries**: Supports reusing cached summaries, injecting small chats directly, and summarizing large chats before injection.
- **Structure-aware trimming**: Intelligently collapses overly long messages while preserving the document skeleton (headings, head and tail).
- **Native tool-output trimming**: Supports trimming verbose tool-call outputs.
- **Real-time monitoring**: Monitors context usage in real time and warns once usage exceeds 90%.
- **Fast estimation + exact fallback**: Provides faster multilingual token estimation, falling back to exact counting when needed, which aids debugging.
- **Smart model matching**: Custom models automatically inherit threshold configuration from their base model.
- **Multimodal support**: Image content is preserved, but its tokens are **not counted**. Adjust thresholds accordingly.
For detailed mechanics and a longer walkthrough, see the [Workflow Guide](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/WORKFLOW_GUIDE_CN.md).
---
## What This Release Fixes (Plain-Language Version)
- **Problem 1: A failed summary for a referenced chat could take the current conversation down with it.**
Previously, if the filter needed to summarize a referenced chat first and that LLM call failed, the current request could fail outright. The behavior is now "summarize if possible, fall back to injecting the context directly on failure", so the current conversation is never dragged down.
- **Problem 2: Some referenced chats were truncated too early, losing too much information.**
A code path mistook `max_summary_tokens`, an output-length cap, for the input context window, so larger referenced chats were cut off prematurely. The fitting now uses the summary model's actual input window, preserving more useful content.
- **Problem 3: Users could not easily tell when a background summary failed.**
With `show_debug_log=false`, some background failures stayed in internal logs only. Critical failures are now forced to the browser console, with a chat-status hint pointing at `F12`.
---
## Workflow Overview
The filter works in two phases:
1. `inlet`: Runs before the request is sent to the model; it injects existing summaries, handles external chat references, and trims context when needed.
2. `outlet`: Runs asynchronously after the model's reply completes; it decides whether a new summary is needed and writes it to the database when appropriate.
```mermaid
flowchart TD
A[Request enters inlet] --> B[Normalize tool IDs and trim oversized tool outputs as needed]
B --> C{Referenced chats attached?}
C -- No --> D[Load the current chat summary if one exists]
C -- Yes --> E[Inspect each referenced chat]
E --> F{Cached summary available?}
F -- Yes --> G[Reuse the cached summary]
F -- No --> H{Fits the current budget directly?}
H -- Yes --> I[Inject the full referenced chat text]
H -- No --> J[Prepare summary input for the referenced chat]
J --> K{Referenced-chat summary call succeeds?}
K -- Yes --> L[Inject the generated reference summary]
K -- No --> M[Fall back to direct context injection]
G --> D
I --> D
L --> D
M --> D
D --> N[Build Head + Summary + Tail for the current chat]
N --> O{Exceeds max_context_tokens?}
O -- Yes --> P[Trim from the oldest atomic groups]
O -- No --> Q[Send the final context to the model]
P --> Q
Q --> R[Model returns the current reply]
R --> S[Outlet rebuilds the full history]
S --> T{Reached compression threshold?}
T -- No --> U[Finish]
T -- Yes --> V[Fit summary input to the summary model context]
V --> W{Background summary call succeeds?}
W -- Yes --> X[Save new summary and update status]
W -- No --> Y[Force browser-console error and prompt the user to check it]
```
### Key Notes
- `inlet` only injects and trims context; it does not generate the current chat's main summary.
- `outlet` generates summaries asynchronously and does not block the current reply.
- External chat references can come from an existing persisted summary, a small chat's full text, or a dynamically generated/truncated reference summary.
- If a referenced-chat summary fails, the filter falls back to direct context injection instead of failing the current request.
- `summary_model_max_context` controls the summary input window; `max_summary_tokens` only controls the generated summary's output length.
- Important background summary failures surface in the browser console (`F12`) and the chat status hint.
- External reference messages are specially protected during trimming so they are not deleted first.
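The native tool-output trimming described in this README can be sketched as follows. This is a simplified stand-in assuming the placeholder text used elsewhere in the project, not the filter's full logic, which also preserves the tool-call chain and records `is_trimmed` metadata on the message.

```python
PLACEHOLDER = "... [Content collapsed] ..."

def trim_tool_output(content: str, threshold_chars: int = 600) -> str:
    """Collapse an oversized native tool output; leave short ones alone.

    Simplified sketch: the real filter also keeps the tool-call chain
    intact and tags message metadata instead of just replacing text.
    """
    if len(content) < threshold_chars:
        return content
    return PLACEHOLDER
```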
---
@@ -65,8 +140,8 @@
| Parameter | Default | Description |
| :--- | :--- | :--- |
| `summary_model` | `None` | Model ID used for summaries. **Strongly recommended**: configure a fast, economical model with a large context window (e.g., `gemini-2.5-flash`, `deepseek-v3`). Falls back to the current chat model when empty. |
| `summary_model_max_context` | `0` | Input context window available to summary requests. If `0`, falls back to `model_thresholds` or the global `max_context_tokens`. |
| `max_summary_tokens` | `16384` | Maximum output tokens allowed for the generated summary. This is not the summary-input window cap. |
| `summary_temperature` | `0.1` | Controls randomness of summary generation; lower values are more stable. |
### Advanced Configuration
@@ -94,7 +169,8 @@
| Parameter | Default | Description |
| :--- | :--- | :--- |
| `enable_tool_output_trimming` | `true` | When enabled (effective only with `function_calling: "native"`), trims oversized native tool outputs, keeping the tool-call chain structure and replacing verbose content with a short placeholder. |
| `tool_trim_threshold_chars` | `600` | Triggers trimming once a native tool output's total character count reaches this value; useful for tool results containing long text or tables. |
| `debug_mode` | `false` | Whether to print verbose debug info to the Open WebUI console log. Defaults to `false`, which is recommended for production. |
| `show_debug_log` | `false` | Whether to print debug logs to the browser console (F12). Useful for frontend debugging. |
| `show_token_usage_status` | `true` | Whether to show a token-usage status notification when a conversation ends. |
@@ -110,8 +186,12 @@
- **Initial system prompt is lost**: Set `keep_first` greater than 0.
- **Compression effect is weak**: Raise `compression_threshold_tokens`, or lower `keep_first` / `keep_last` for more aggressive compression.
- **A referenced-chat summary fails**: The current request should now continue, falling back to direct context injection. Open the browser console (`F12`) to see the upstream failure reason.
- **A background summary appears to do nothing**: Important failures now show up in both the status hint and the browser console (`F12`).
- **Submit an Issue**: If you run into any problems, please file an issue on GitHub: [OpenWebUI Extensions Issues](https://github.com/Fu-Jie/openwebui-extensions/issues)
## Changelog
See the [`v1.5.0` Release Notes](./v1.5.0_CN.md) for the release-specific summary.
See the full history on GitHub: [OpenWebUI Extensions](https://github.com/Fu-Jie/openwebui-extensions)


@@ -18,6 +18,63 @@ def _ensure_module(name: str) -> types.ModuleType:
return module
def _install_dependency_stubs() -> None:
pydantic_module = _ensure_module("pydantic")
sqlalchemy_module = _ensure_module("sqlalchemy")
sqlalchemy_orm_module = _ensure_module("sqlalchemy.orm")
sqlalchemy_engine_module = _ensure_module("sqlalchemy.engine")
class DummyBaseModel:
def __init__(self, **kwargs):
annotations = getattr(self.__class__, "__annotations__", {})
for field_name in annotations:
if field_name in kwargs:
value = kwargs[field_name]
else:
value = getattr(self.__class__, field_name, None)
setattr(self, field_name, value)
def dummy_field(default=None, **kwargs):
return default
class DummyMetadata:
def create_all(self, *args, **kwargs):
return None
def dummy_declarative_base():
class DummyBase:
metadata = DummyMetadata()
return DummyBase
def dummy_sessionmaker(*args, **kwargs):
return lambda: None
class DummyEngine:
pass
def dummy_column(*args, **kwargs):
return None
def dummy_type(*args, **kwargs):
return None
def dummy_inspect(*args, **kwargs):
return types.SimpleNamespace(has_table=lambda *a, **k: False)
pydantic_module.BaseModel = DummyBaseModel
pydantic_module.Field = dummy_field
sqlalchemy_module.Column = dummy_column
sqlalchemy_module.String = dummy_type
sqlalchemy_module.Text = dummy_type
sqlalchemy_module.DateTime = dummy_type
sqlalchemy_module.Integer = dummy_type
sqlalchemy_module.inspect = dummy_inspect
sqlalchemy_orm_module.declarative_base = dummy_declarative_base
sqlalchemy_orm_module.sessionmaker = dummy_sessionmaker
sqlalchemy_engine_module.Engine = DummyEngine
def _install_openwebui_stubs() -> None:
_ensure_module("open_webui")
_ensure_module("open_webui.utils")
@@ -47,7 +104,8 @@ def _install_openwebui_stubs() -> None:
return None
class DummyRequest:
def __init__(self, *args, **kwargs):
pass
chat_module.generate_chat_completion = generate_chat_completion
users_module.Users = DummyUsers
@@ -57,6 +115,7 @@ def _install_openwebui_stubs() -> None:
fastapi_requests.Request = DummyRequest
_install_dependency_stubs()
_install_openwebui_stubs()
spec = importlib.util.spec_from_file_location(MODULE_NAME, PLUGIN_PATH)
module = importlib.util.module_from_spec(spec)
@@ -189,9 +248,12 @@ class TestAsyncContextCompression(unittest.TestCase):
{"role": "assistant", "content": "Final answer"},
]
trimmed_count, trim_debug = self.filter._trim_native_tool_outputs(
messages, "en-US"
)
self.assertEqual(trimmed_count, 1)
self.assertIsNone(trim_debug)
self.assertEqual(messages[1]["content"], "... [Content collapsed] ...")
self.assertTrue(messages[1]["metadata"]["is_trimmed"])
self.assertTrue(messages[2]["metadata"]["tool_outputs_trimmed"])
@@ -213,9 +275,12 @@ class TestAsyncContextCompression(unittest.TestCase):
}
]
trimmed_count, trim_debug = self.filter._trim_native_tool_outputs(
messages, "en-US"
)
self.assertEqual(trimmed_count, 1)
self.assertIsNone(trim_debug)
self.assertIn(
'result=""... [Content collapsed] ...""',
messages[0]["content"],
@@ -258,9 +323,12 @@ class TestAsyncContextCompression(unittest.TestCase):
{"role": "tool", "content": "x" * 1600},
]
trimmed_count, trim_debug = self.filter._trim_native_tool_outputs(
messages, "en-US"
)
self.assertEqual(trimmed_count, 1)
self.assertIsNone(trim_debug)
self.assertEqual(messages[1]["content"], "... [Content collapsed] ...")
self.assertTrue(messages[1]["metadata"]["is_trimmed"])
@@ -391,11 +459,55 @@ class TestAsyncContextCompression(unittest.TestCase):
self.assertTrue(create_task_called)
def test_estimate_messages_tokens_counts_output_text_parts(self):
messages = [
{
"role": "assistant",
"content": [{"type": "output_text", "text": "abcd" * 25}],
}
]
self.assertEqual(
self.filter._estimate_messages_tokens(messages),
module._estimate_text_tokens("abcd" * 25),
)
def test_unfold_messages_keeps_plain_assistant_output_when_expand_is_not_richer(self):
misc_module = _ensure_module("open_webui.utils.misc")
misc_module.convert_output_to_messages = lambda output, raw=True: [
{
"role": "assistant",
"content": [{"type": "output_text", "text": "Plain reply"}],
}
]
messages = [
{
"id": "assistant-1",
"role": "assistant",
"content": "Plain reply",
"output": [
{
"type": "message",
"role": "assistant",
"content": [{"type": "output_text", "text": "Plain reply"}],
}
],
}
]
unfolded = self.filter._unfold_messages(messages)
self.assertEqual(len(unfolded), 1)
self.assertEqual(unfolded[0]["id"], "assistant-1")
self.assertEqual(unfolded[0]["content"], "Plain reply")
self.assertNotIn("output", unfolded[0])
def test_summary_save_progress_matches_final_prompt_shrink(self):
self.filter.valves.keep_first = 1
self.filter.valves.keep_last = 1
self.filter.valves.summary_model = "fake-summary-model"
self.filter.valves.summary_model_max_context = 1200
captured = {}
events = []
@@ -404,12 +516,14 @@ class TestAsyncContextCompression(unittest.TestCase):
events.append(event)
async def mock_summary_llm(
new_conversation_text,
body,
user_data,
__event_call__=None,
__request__=None,
previous_summary=None,
):
captured["conversation_text"] = new_conversation_text
return "new summary"
def mock_save_summary(chat_id, summary, compressed_count):
@@ -424,17 +538,22 @@ class TestAsyncContextCompression(unittest.TestCase):
self.filter._call_summary_llm = mock_summary_llm
self.filter._save_summary = mock_save_summary
self.filter._get_model_thresholds = lambda model_id: {
"max_context_tokens": 1200
}
self.filter._format_messages_for_summary = lambda messages: "\n".join(
msg["content"] for msg in messages
)
self.filter._build_summary_prompt = (
lambda conversation_text, previous_summary=None: conversation_text
)
self.filter._count_tokens = lambda text: len(text)
messages = [
{"role": "system", "content": "System prompt"},
{"role": "user", "content": "Q" * 100},
{"role": "assistant", "content": "A" * 100},
{"role": "user", "content": "B" * 100},
{"role": "assistant", "content": "C" * 100},
{"role": "user", "content": "Question 3"},
]
@@ -453,9 +572,466 @@ class TestAsyncContextCompression(unittest.TestCase):
self.assertEqual(captured["chat_id"], "chat-1")
self.assertEqual(captured["summary"], "new summary")
self.assertEqual(captured["compressed_count"], 3)
self.assertEqual(captured["conversation_text"], f"{'Q' * 100}\n{'A' * 100}")
self.assertTrue(any(event["type"] == "status" for event in events))
def test_generate_summary_async_drops_previous_summary_when_prompt_still_oversized(self):
self.filter.valves.keep_first = 1
self.filter.valves.keep_last = 1
self.filter.valves.summary_model = "fake-summary-model"
self.filter.valves.summary_model_max_context = 1200
captured = {}
async def mock_summary_llm(
new_conversation_text,
body,
user_data,
__event_call__=None,
__request__=None,
previous_summary=None,
):
captured["conversation_text"] = new_conversation_text
captured["previous_summary"] = previous_summary
return "new summary"
async def noop_log(*args, **kwargs):
return None
self.filter._log = noop_log
self.filter._call_summary_llm = mock_summary_llm
self.filter._save_summary = lambda *args: None
self.filter._get_model_thresholds = lambda model_id: {
"max_context_tokens": 1200
}
self.filter._format_messages_for_summary = lambda messages: "\n".join(
msg["content"] for msg in messages
)
self.filter._build_summary_prompt = (
lambda conversation_text, previous_summary=None: (
(previous_summary or "") + "\n" + conversation_text
)
)
self.filter._count_tokens = lambda text: len(text)
self.filter._load_summary = lambda chat_id, body: "P" * 220
messages = [
{"role": "system", "content": "System prompt"},
{"role": "user", "content": "Q" * 60},
{"role": "assistant", "content": "Answer 1"},
{"role": "user", "content": "Question 2"},
]
asyncio.run(
self.filter._generate_summary_async(
messages=messages,
chat_id="chat-1",
body={"model": "fake-summary-model"},
user_data={"id": "user-1"},
target_compressed_count=2,
lang="en-US",
__event_emitter__=None,
__event_call__=None,
)
)
self.assertEqual(captured["conversation_text"], "Q" * 60)
self.assertIsNone(captured["previous_summary"])
def test_call_summary_llm_surfaces_provider_error_dict(self):
self.filter.valves.summary_model = "fake-summary-model"
self.filter.valves.show_debug_log = False
async def fake_generate_chat_completion(request, payload, user):
return {"error": {"message": "context too long", "code": 400}}
async def noop_log(*args, **kwargs):
return None
frontend_calls = []
async def fake_event_call(payload):
frontend_calls.append(payload)
return True
original_generate = module.generate_chat_completion
original_get_user = getattr(module.Users, "get_user_by_id", None)
module.generate_chat_completion = fake_generate_chat_completion
module.Users.get_user_by_id = staticmethod(
lambda user_id: types.SimpleNamespace(email="user@example.com")
)
self.filter._log = noop_log
self.filter._get_model_thresholds = lambda model_id: {
"max_context_tokens": 8192
}
self.filter._build_summary_prompt = (
lambda conversation_text, previous_summary=None: conversation_text
)
try:
with self.assertRaises(Exception) as exc_info:
asyncio.run(
self.filter._call_summary_llm(
"conversation",
{"model": "fake-summary-model"},
{"id": "user-1"},
__event_call__=fake_event_call,
)
)
finally:
module.generate_chat_completion = original_generate
if original_get_user is None:
delattr(module.Users, "get_user_by_id")
else:
module.Users.get_user_by_id = original_get_user
self.assertIn("Upstream provider error: context too long", str(exc_info.exception))
self.assertNotIn(
"LLM response format incorrect or empty", str(exc_info.exception)
)
self.assertTrue(frontend_calls)
self.assertEqual(frontend_calls[0]["type"], "execute")
self.assertIn("console.error", frontend_calls[0]["data"]["code"])
self.assertIn("context too long", frontend_calls[0]["data"]["code"])
def test_generate_summary_async_status_guides_user_to_browser_console(self):
self.filter.valves.keep_first = 1
self.filter.valves.keep_last = 1
self.filter.valves.summary_model = "fake-summary-model"
self.filter.valves.summary_model_max_context = 1200
self.filter.valves.show_debug_log = False
events = []
frontend_calls = []
async def fake_summary_llm(*args, **kwargs):
raise Exception("boom details")
async def fake_emitter(event):
events.append(event)
async def fake_event_call(payload):
frontend_calls.append(payload)
return True
async def noop_log(*args, **kwargs):
return None
self.filter._log = noop_log
self.filter._call_summary_llm = fake_summary_llm
self.filter._get_model_thresholds = lambda model_id: {
"max_context_tokens": 1200
}
self.filter._format_messages_for_summary = lambda messages: "\n".join(
msg["content"] for msg in messages
)
self.filter._build_summary_prompt = (
lambda conversation_text, previous_summary=None: conversation_text
)
self.filter._count_tokens = lambda text: len(text)
messages = [
{"role": "system", "content": "System prompt"},
{"role": "user", "content": "Q" * 40},
{"role": "assistant", "content": "A" * 40},
{"role": "user", "content": "Question 2"},
]
asyncio.run(
self.filter._generate_summary_async(
messages=messages,
chat_id="chat-1",
body={"model": "fake-summary-model"},
user_data={"id": "user-1"},
target_compressed_count=2,
lang="en-US",
__event_emitter__=fake_emitter,
__event_call__=fake_event_call,
)
)
self.assertTrue(frontend_calls)
self.assertIn("console.error", frontend_calls[0]["data"]["code"])
self.assertIn("boom details", frontend_calls[0]["data"]["code"])
status_descriptions = [
event["data"]["description"]
for event in events
if event.get("type") == "status"
]
self.assertTrue(
any("Check browser console (F12) for details" in text for text in status_descriptions)
)
def test_check_and_generate_summary_async_forces_frontend_and_status_on_pre_summary_error(
self,
):
self.filter.valves.show_debug_log = False
events = []
frontend_calls = []
async def fake_emitter(event):
events.append(event)
async def fake_event_call(payload):
frontend_calls.append(payload)
return True
async def noop_log(*args, **kwargs):
return None
def fail_estimate(_messages):
raise Exception("pre summary boom")
self.filter._log = noop_log
self.filter._estimate_messages_tokens = fail_estimate
self.filter._get_model_thresholds = lambda model_id: {
"compression_threshold_tokens": 100,
"max_context_tokens": 1000,
}
asyncio.run(
self.filter._check_and_generate_summary_async(
chat_id="chat-1",
model="fake-model",
body={"messages": [{"role": "user", "content": "Hello"}]},
user_data={"id": "user-1"},
target_compressed_count=1,
lang="en-US",
__event_emitter__=fake_emitter,
__event_call__=fake_event_call,
)
)
self.assertTrue(frontend_calls)
self.assertIn("console.error", frontend_calls[0]["data"]["code"])
self.assertIn("pre summary boom", frontend_calls[0]["data"]["code"])
status_descriptions = [
event["data"]["description"]
for event in events
if event.get("type") == "status"
]
self.assertTrue(
any("Check browser console (F12) for details" in text for text in status_descriptions)
)
def test_external_reference_message_detection_matches_injected_marker(self):
message = {
"role": "assistant",
"content": "External refs",
"metadata": {
"is_summary": True,
"is_external_references": True,
"source": "external_references",
},
}
self.assertTrue(self.filter._is_external_reference_message(message))
def test_handle_external_chat_references_falls_back_when_summary_llm_errors(self):
self.filter.valves.summary_model = "fake-summary-model"
self.filter.valves.max_summary_tokens = 4096
async def fake_summary_llm(*args, **kwargs):
raise Exception("reference summary failed")
self.filter._call_summary_llm = fake_summary_llm
self.filter._load_summary_record = lambda chat_id: None
self.filter._load_full_chat_messages = lambda chat_id: [
{"role": "user", "content": "Referenced question"},
{"role": "assistant", "content": "Referenced answer"},
]
self.filter._format_messages_for_summary = (
lambda messages: "Referenced conversation body"
)
self.filter._get_model_thresholds = lambda model_id: {
"max_context_tokens": 5001
}
self.filter._estimate_messages_tokens = lambda messages: 5000
body = {
"model": "main-model",
"messages": [{"role": "user", "content": "Current prompt"}],
"metadata": {
"files": [
{
"type": "chat",
"id": "chat-ref-1",
"name": "Referenced Chat",
}
]
},
}
result = asyncio.run(
self.filter._handle_external_chat_references(
body,
user_data={"id": "user-1"},
)
)
self.assertIn("__external_references__", result)
self.assertIn(
"Referenced conversation body",
result["__external_references__"]["content"],
)
def test_generate_referenced_summaries_background_uses_model_context_window_fallback(
self,
):
self.filter.valves.summary_model = "fake-summary-model"
self.filter.valves.summary_model_max_context = 0
self.filter.valves.max_summary_tokens = 64
captured = {}
truncate_calls = []
async def fake_summary_llm(
new_conversation_text,
body,
user_data,
__event_call__=None,
__request__=None,
previous_summary=None,
):
captured["conversation_text"] = new_conversation_text
return "cached summary"
async def noop_log(*args, **kwargs):
return None
self.filter._call_summary_llm = fake_summary_llm
self.filter._log = noop_log
self.filter._save_summary = lambda *args: None
self.filter._get_model_thresholds = lambda model_id: {
"max_context_tokens": 5000
}
self.filter._truncate_messages_for_summary = (
lambda messages, max_tokens: truncate_calls.append(max_tokens) or "truncated"
)
conversation_text = "x" * 600
asyncio.run(
self.filter._generate_referenced_summaries_background(
[
{
"chat_id": "chat-ref-ctx",
"title": "Referenced Chat",
"conversation_text": conversation_text,
"covers_full_history": True,
"covered_message_count": 1,
}
],
user_data={"id": "user-1"},
)
)
self.assertEqual(captured["conversation_text"], conversation_text)
self.assertEqual(truncate_calls, [])
def test_generate_referenced_summaries_background_uses_summary_llm_signature(self):
self.filter.valves.summary_model = "fake-summary-model"
captured = {}
async def fake_summary_llm(
new_conversation_text,
body,
user_data,
__event_call__=None,
__request__=None,
previous_summary=None,
):
captured["conversation_text"] = new_conversation_text
captured["body"] = body
captured["user_data"] = user_data
captured["request"] = __request__
captured["previous_summary"] = previous_summary
return "cached reference summary"
def fake_save_summary(chat_id, summary, compressed_count):
captured["saved"] = (chat_id, summary, compressed_count)
async def noop_log(*args, **kwargs):
return None
self.filter._call_summary_llm = fake_summary_llm
self.filter._save_summary = fake_save_summary
self.filter._log = noop_log
request = object()
asyncio.run(
self.filter._generate_referenced_summaries_background(
[
{
"chat_id": "chat-ref-1",
"title": "Referenced Chat",
"conversation_text": "Full referenced conversation",
"covers_full_history": True,
"covered_message_count": 3,
}
],
user_data={"id": "user-1"},
__request__=request,
)
)
self.assertEqual(captured["conversation_text"], "Full referenced conversation")
self.assertEqual(captured["body"]["model"], "fake-summary-model")
self.assertEqual(captured["user_data"], {"id": "user-1"})
self.assertIs(captured["request"], request)
self.assertIsNone(captured["previous_summary"])
self.assertEqual(
captured["saved"], ("chat-ref-1", "cached reference summary", 3)
)
def test_generate_referenced_summaries_background_skips_progress_save_for_truncation(self):
self.filter.valves.summary_model = "fake-summary-model"
self.filter.valves.summary_model_max_context = 100
saved_calls = []
captured = {}
async def fake_summary_llm(
new_conversation_text,
body,
user_data,
__event_call__=None,
__request__=None,
previous_summary=None,
):
captured["conversation_text"] = new_conversation_text
return "ephemeral summary"
async def noop_log(*args, **kwargs):
return None
self.filter._call_summary_llm = fake_summary_llm
self.filter._save_summary = lambda *args: saved_calls.append(args)
self.filter._log = noop_log
self.filter._load_full_chat_messages = lambda chat_id: [
{"role": "user", "content": "msg 1"},
{"role": "assistant", "content": "msg 2"},
]
self.filter._format_messages_for_summary = lambda messages: "x" * 600
self.filter._truncate_messages_for_summary = (
lambda messages, max_tokens: "tail only"
)
asyncio.run(
self.filter._generate_referenced_summaries_background(
[{"chat_id": "chat-ref-2", "title": "Large Referenced Chat"}],
user_data={"id": "user-1"},
)
)
self.assertEqual(captured["conversation_text"], "tail only")
self.assertEqual(saved_calls, [])
if __name__ == "__main__":
unittest.main()


@@ -0,0 +1,27 @@
[![](https://img.shields.io/badge/OpenWebUI%20Community-Get%20Plugin-blue?style=for-the-badge)](https://openwebui.com/f/fujie/async_context_compression)
## Overview
Compared with the previous git version (`1.4.2`), this release introduces two major new capabilities: external chat reference summarization and a much stronger multilingual token-estimation pipeline. It also improves the reliability of the surrounding summary workflow, especially when provider-side failures occur.
**[📖 README](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/README.md)**
## New Features
- **External Chat Reference Summaries**: Add support for referenced chat context injection that can reuse cached summaries, inject small referenced chats directly, or generate summaries for larger referenced chats before injection.
- **Fast Multilingual Token Estimation**: Replace the old rough `len(text)//4` fallback with a new mixed-script estimation pipeline so preflight decisions stay much closer to actual usage across English, Chinese, Japanese, Korean, Arabic, Cyrillic, Thai, and mixed content.
- **Stronger Working-Memory Prompt**: Refined the XML summary prompt so generated working memory preserves more actionable state across general chat, coding tasks, and tool-heavy conversations.
- **Clearer Frontend Debug Logs**: Reworked browser-console debug output into grouped structural snapshots that make inlet/outlet state easier to inspect.
- **Safer Tool Trimming Defaults**: Enabled native tool-output trimming by default and exposed `tool_trim_threshold_chars` with a 600-character threshold.
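A mixed-script estimator of the kind described above can be approximated by weighting characters by script class. The weights and Unicode ranges below are illustrative assumptions, not the plugin's calibrated values:

```python
# Rough tokens-per-character weights by script class (illustrative only).
_WEIGHTS = {"cjk": 0.6, "ascii": 0.25, "other": 0.4}

def _char_class(ch: str) -> str:
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF or 0x3040 <= code <= 0x30FF:
        return "cjk"    # CJK ideographs and Japanese kana
    if code < 128:
        return "ascii"  # English-like text, digits, punctuation
    return "other"      # Arabic, Cyrillic, Thai, etc.

def estimate_tokens(text: str) -> int:
    """Weight each character by script instead of a flat len(text) // 4."""
    return max(1, int(sum(_WEIGHTS[_char_class(ch)] for ch in text)))
```

Compared with a flat `len(text) // 4` heuristic, a per-script weighting like this can keep CJK-heavy text from being badly underestimated while leaving English estimates essentially unchanged.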
## Bug Fixes
- **Referenced-Chat Fallback Reliability**: If the new referenced-chat summary path fails, the active request now falls back to direct contextual injection instead of failing the whole chat.
- **Correct Summary Budgeting**: Fixed referenced-chat summary preparation so `summary_model_max_context` controls summary-input fitting, while `max_summary_tokens` remains an output cap.
- **Visible Background Failures**: Important background summary failures now surface to the browser console and chat status even when `show_debug_log` is disabled.
- **Provider Error Surfacing**: Improved summary-call error extraction so non-standard upstream provider error payloads are reported more clearly.
## Release Notes
- Bilingual plugin README files and mirrored docs pages were refreshed for the `1.5.0` release.
- This release is aimed at reducing silent failure modes and making summary behavior easier to reason about during debugging.


@@ -0,0 +1,27 @@
[![](https://img.shields.io/badge/OpenWebUI%20Community-Get%20Plugin-blue?style=for-the-badge)](https://openwebui.com/f/fujie/async_context_compression)
## Overview
Compared with the previous git version (`1.4.2`), this release adds two key capabilities: external chat reference summaries and a stronger multilingual token-estimation pipeline. It also hardens the surrounding summary workflow, especially fallback behavior and visibility when upstream providers fail.
**[📖 README](https://github.com/Fu-Jie/openwebui-extensions/blob/main/plugins/filters/async-context-compression/README_CN.md)**
## New Features
- **External Chat Reference Summaries**: Adds referenced chat context injection. Cached summaries can now be reused, smaller referenced chats injected directly, and larger referenced chats summarized before injection.
- **Fast Multilingual Token Estimation**: Replaces the old rough `len(text)//4` fallback with a new mixed-script estimation pipeline so preflight decisions stay closer to actual usage across English, Chinese, Japanese, Korean, Arabic, Cyrillic, Thai, and mixed content.
- **Stronger Working-Memory Prompt**: Rewrites the XML summary prompt so the generated working memory retains more actionable context across general chat, coding tasks, and tool-heavy conversations.
- **Clearer Frontend Debug Logs**: Browser-console debug output is now grouped and structured, making real inlet/outlet state easier to observe.
- **Safer Tool Trimming Defaults**: Native tool-output trimming is enabled by default, with the new `tool_trim_threshold_chars` option defaulting to 600 characters.
## Bug Fixes
- **Referenced-Chat Fallback Reliability**: When the new referenced-chat summary path fails, the current request falls back to direct context injection instead of failing the whole conversation.
- **Correct Summary Budgeting**: Fixes referenced-chat summary preparation so `summary_model_max_context` controls the summary input window while `max_summary_tokens` controls only summary output length.
- **Visible Background Failures**: Even with `show_debug_log` disabled, critical background summary failures now surface in the browser console and chat status hints.
- **Clearer Provider Errors**: Improves error extraction in summary calls so non-standard upstream error payloads are reported more accurately.
## Release Notes
- The Chinese and English plugin READMEs and mirrored docs pages were updated in sync for the `1.5.0` release.
- This release aims to reduce hard-to-diagnose silent failures and make summary behavior easier to reason about during debugging.