feat: add a plugin system, multiple plugin types, a development guide, and multilingual documentation.

This commit is contained in:
fujie
2025-12-20 12:34:49 +08:00
commit eaa6319991
74 changed files with 28409 additions and 0 deletions

plugins/filters/README.md Normal file

@@ -0,0 +1,45 @@
# Filters
English | [中文](./README_CN.md)
Filters process and modify user input before it is sent to the LLM. This directory contains various filters that can be used to extend OpenWebUI functionality.
## 📋 Filter List
| Filter Name | Description | Documentation |
| :--- | :--- | :--- |
| **Async Context Compression** | Reduces token consumption in long conversations through intelligent summarization and message compression while maintaining conversational coherence. | [English](./async-context-compression/async_context_compression.md) / [中文](./async-context-compression/async_context_compression_cn.md) |
## 🚀 Quick Start
### Installing a Filter
1. Navigate to the desired filter directory
2. Download the corresponding `.py` file to your local machine
3. Open OpenWebUI Admin Settings and find the "Filters" section
4. Upload the Python file
5. Configure the filter parameters according to its documentation
6. Refresh the page and enable the filter in your chat settings
## 📖 Development Guide
When adding a new filter, please follow these steps:
1. **Create Filter Directory**: Create a new folder in the current directory (e.g., `my_filter/`)
2. **Write Filter Code**: Create a `.py` file with clear documentation of functionality and configuration in comments
3. **Write Documentation**:
- Create `filter_name.md` (English version)
- Create `filter_name_cn.md` (Chinese version)
- Documentation should include: feature description, configuration parameters, usage examples, and troubleshooting
4. **Update This List**: Add your new filter to the table above
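A filter follows Open WebUI's function shape: a `Filter` class with a nested pydantic `Valves` model for configuration, plus `inlet`/`outlet` hooks that receive the request and response bodies. A minimal sketch to start from (the `prefix` valve is a hypothetical example, not part of any filter in this repository):

```python
from typing import Optional

from pydantic import BaseModel, Field


class Filter:
    class Valves(BaseModel):
        priority: int = Field(default=10, description="Lower numbers run first.")
        prefix: str = Field(
            default="[my_filter] ", description="Hypothetical example valve."
        )

    def __init__(self):
        self.valves = self.Valves()

    async def inlet(self, body: dict, __user__: Optional[dict] = None) -> dict:
        # Runs before the request reaches the LLM: modify body["messages"] here.
        for msg in body.get("messages", []):
            if msg.get("role") == "user" and isinstance(msg.get("content"), str):
                msg["content"] = self.valves.prefix + msg["content"]
        return body

    async def outlet(self, body: dict, __user__: Optional[dict] = None) -> dict:
        # Runs after the LLM responds: post-process body["messages"] here.
        return body
```

The filters in this directory follow this same skeleton, adding their own valves and message-processing logic inside `inlet` and `outlet`.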
## ⚙️ Configuration Best Practices
- **Priority Management**: Set appropriate filter priority to ensure correct execution order
- **Parameter Tuning**: Adjust filter parameters based on your specific needs
- **Debug Logging**: Enable debug mode during development, disable in production
- **Performance Testing**: Test filter performance under high load
---
> **Contributor Note**: To ensure project maintainability and user experience, please provide clear and complete documentation for each new filter, including feature description, parameter configuration, usage examples, and troubleshooting guide.


@@ -0,0 +1,67 @@
# Auto Context Merger Filter (`auto_context_merger`)
## Overview
`auto_context_merger` is an Open WebUI filter plugin that improves the coherence and depth of follow-up conversation by automatically collecting and injecting the context of the previous turn's multi-model answers. The filter activates automatically when the user asks a new follow-up question after a multi-model response.
It identifies every AI model's answer from the previous turn in the conversation history, concatenates them in a clear format, and injects the result into the current request as a system message. When handling the user's new question, the current model can then refer directly to all of the earlier AI viewpoints and provide a more comprehensive, coherent answer.
## How It Works
1. **Trigger**: The filter activates automatically when the user sends a new follow-up question after a multi-model response.
2. **History retrieval**: It uses the current conversation's `chat_id` to load the full conversation history from the database.
3. **Previous-turn analysis**: By analyzing the conversation tree, it locates the user's previous question and all of the parallel answers the AI models gave at that point.
4. **Direct formatting**: If the previous turn did produce multiple AI answers, it collects the content of all of them.
5. **Smart injection**: The formatted answers are injected as a system message into the current request's `messages` list, immediately before the user's new question.
6. **Hand-off to the target model**: The modified request body (including the formatted context) is passed to the model the user originally selected, which can draw on this richer context when generating its response.
7. **Status updates**: Throughout processing, the filter emits real-time status updates via `__event_emitter__` so the user can follow its progress.
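The formatting-and-injection steps above can be sketched as follows. This is a simplified illustration, not the filter's actual implementation; the model names and prefix text are placeholders:

```python
from typing import Dict


def merge_previous_answers(answers: Dict[str, str], prefix: str) -> dict:
    """Build the system message injected before the user's new question.

    `answers` maps a model name to that model's answer from the previous turn.
    """
    blocks = [
        f"**Answer from model '{name}':**\n{text}" for name, text in answers.items()
    ]
    return {"role": "system", "content": prefix + "\n---\n".join(blocks)}


msg = merge_previous_answers(
    {"model-a": "QM is ...", "model-b": "Quantum mechanics describes ..."},
    "**Background knowledge**: please refer to the previous turn's answers:\n",
)
print(msg["role"])  # system
```

The resulting message would then be placed into the request's `messages` list ahead of the user's follow-up question.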
## Configuration (Valves)
You can configure this filter's `Valves` in the Open WebUI admin interface:
* **`CONTEXT_PREFIX`** (string, required):
    * **Description**: Prefix text for the injected system message. It appears before the merged context and tells the model where this content comes from and what it is for.
    * **Example**: `**Background knowledge**: To better answer your new question, please refer to the answers given by multiple AI models in the previous turn:\n\n`
## Usage
1. **Deploy the filter**: Place the `auto_context_merger.py` file in the `plugins/filters/` directory of your Open WebUI instance.
2. **Enable the filter**: Log in to the Open WebUI admin interface and navigate to **Workspace -> Functions**. Find the `auto_context_merger` filter and enable it.
3. **Configure parameters**: Click the edit button next to the `auto_context_merger` filter and set `CONTEXT_PREFIX` to suit your needs.
4. **Start chatting**:
    * First, ask a question and have multiple models answer it (for example via `gemini_manifold` or another multi-model tool).
    * Then ask a follow-up question about that multi-model response.
    * The filter activates automatically, merging all of the previous turn's AI answers and injecting them into the current request.
## Example
Assume `CONTEXT_PREFIX` is left at its default value.
1. **User asks**: "Explain quantum mechanics."
2. **Multiple AIs answer** (for example, both Model A and Model B respond).
3. **User asks again**: "So what is the difference between quantum entanglement and quantum tunneling?"
At this point, the `auto_context_merger` filter activates automatically:
1. It retrieves Model A's and Model B's answers to "Explain quantum mechanics."
2. It formats them as:
```
**Background knowledge**: To better answer your new question, please refer to the answers given by multiple AI models in the previous turn:
**Answer from model 'Model A name':**
[Model A's explanation of quantum mechanics]
---
**Answer from model 'Model B name':**
[Model B's explanation of quantum mechanics]
```
3. It then injects this content into the current request as a system message, immediately before the user question "So what is the difference between quantum entanglement and quantum tunneling?"
In the end, the model receives a request containing all of the relevant context and can answer your follow-up question more accurately and comprehensively.
## Notes
* This filter is designed to make multi-model conversations more coherent by giving the model richer context for follow-up questions.
* Make sure `gemini_manifold` or another tool capable of producing multi-model answers is configured and enabled in your Open WebUI instance, so the filter can detect a multi-model history.
* The filter makes no additional model calls, so it adds no significant latency or cost; it only formats and injects existing history data.


@@ -0,0 +1,77 @@
# Async Context Compression Filter
**Author:** [Fu-Jie](https://github.com/Fu-Jie) | **Version:** 1.0.0 | **License:** MIT
> **Important Note**: To ensure the maintainability and usability of all filters, each filter should be accompanied by clear and complete documentation to fully explain its functionality, configuration, and usage.
This filter significantly reduces token consumption in long conversations by using intelligent summarization and message compression, while maintaining conversational coherence.
---
## Core Features
- **Automatic Compression**: Triggers context compression automatically based on a message count threshold.
- **Asynchronous Summarization**: Generates summaries in the background without blocking the current chat response.
- **Persistent Storage**: Supports both PostgreSQL and SQLite databases to ensure summaries are not lost after a service restart.
- **Flexible Retention Policy**: Freely configure the number of initial and final messages to keep, ensuring critical information and context continuity.
- **Smart Injection**: Intelligently injects the generated historical summary into the new context.
---
## Installation & Configuration
### 1. Environment Variable
This plugin requires a database connection. You **must** configure the `DATABASE_URL` in your Open WebUI environment variables.
- **PostgreSQL Example**:
```
DATABASE_URL=postgresql://user:password@host:5432/openwebui
```
- **SQLite Example**:
```
DATABASE_URL=sqlite:///path/to/your/data/webui.db
```
### 2. Filter Order
It is recommended to set the priority of this filter relatively high (a smaller number) to ensure it runs before other filters that might modify message content. A typical order might be:
1. **Pre-Filters (priority < 10)**
- e.g., A filter that injects a system-level prompt.
2. **This Compression Filter (priority = 10)**
3. **Post-Filters (priority > 10)**
- e.g., A filter that formats the final output.
---
## Configuration Parameters
You can adjust the following parameters in the filter's settings:
| Parameter | Default | Description |
| :--- | :--- | :--- |
| `priority` | `10` | The execution order of the filter. Lower numbers run first. |
| `compression_threshold` | `15` | When the total message count reaches this value, a background summary generation will be triggered. |
| `keep_first` | `1` | Always keep the first N messages. The first message often contains important system prompts. |
| `keep_last` | `6` | Always keep the last N messages to ensure contextual coherence. |
| `summary_model` | `None` | The model used for generating summaries. **Strongly recommended** to set a fast, economical, and compatible model (e.g., `gemini-2.5-flash`). If left empty, it will try to use the current chat's model, which may fail if it's an incompatible model type (like a Pipe model). |
| `max_summary_tokens` | `4000` | The maximum number of tokens allowed for the generated summary. |
| `summary_temperature` | `0.3` | Controls the randomness of the summary. Lower values are more deterministic. |
| `debug_mode` | `true` | Whether to print detailed debug information to the log. Recommended to set to `false` in production. |
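To see how `keep_first` and `keep_last` interact once a summary exists, the slicing logic can be sketched roughly as follows (a simplified illustration with placeholder messages, not the filter's exact code):

```python
def compress(messages, summary, keep_first=1, keep_last=6):
    """Rebuild the list as [first N (with summary prepended)] + [last N]."""
    if len(messages) <= keep_first + keep_last:
        return messages  # nothing to compress
    head = [dict(m) for m in messages[:keep_first]]
    if head:
        # Prepend the historical summary to the very first kept message.
        head[0]["content"] = (
            f"[Historical Conversation Summary]\n{summary}\n\n---\n"
            + head[0]["content"]
        )
    else:
        head = [{"role": "system", "content": summary}]
    tail = messages[-keep_last:] if keep_last > 0 else []
    return head + tail


msgs = [{"role": "user", "content": f"m{i}"} for i in range(20)]
out = compress(msgs, "summary of messages 2-14")
print(len(out))  # 7: 1 kept head message + 6 kept tail messages
```

With the defaults, a 20-message conversation shrinks to 7 messages while the first message and the most recent 6 survive verbatim.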
---
## Troubleshooting
- **Problem: Database connection failed.**
- **Solution**: Please ensure the `DATABASE_URL` environment variable is set correctly and that the database service is running.
- **Problem: Summary not generated.**
- **Solution**: Check if the `compression_threshold` has been met and verify that `summary_model` is configured correctly. Check the logs for detailed errors.
- **Problem: Initial system prompt is lost.**
- **Solution**: Ensure `keep_first` is set to a value greater than 0 to preserve the initial messages containing important information.
- **Problem: Compression effect is not significant.**
- **Solution**: Try increasing the `compression_threshold` or decreasing the `keep_first` / `keep_last` values.
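If the database connection keeps failing, it can help to test the same `DATABASE_URL` outside Open WebUI with a short SQLAlchemy check (a standalone sketch; the SQLite fallback URL is just a placeholder):

```python
import os

from sqlalchemy import create_engine, text

# Use the same URL the filter will see; fall back to in-memory SQLite.
url = os.getenv("DATABASE_URL", "sqlite:///:memory:")
if url.startswith("postgres://"):
    # SQLAlchemy requires the postgresql:// scheme.
    url = url.replace("postgres://", "postgresql://", 1)

engine = create_engine(url, pool_pre_ping=True)
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
print("connection OK")
```

If this script fails with the same error as the filter, the problem is in the URL or the database service, not in the plugin.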


@@ -0,0 +1,780 @@
"""
title: Async Context Compression
id: async_context_compression
author: Fu-Jie
author_url: https://github.com/Fu-Jie
funding_url: https://github.com/Fu-Jie/awesome-openwebui
description: Reduces token consumption in long conversations while maintaining coherence through intelligent summarization and message compression.
version: 1.0.1
license: MIT
═══════════════════════════════════════════════════════════════════════════════
📌 Overview
═══════════════════════════════════════════════════════════════════════════════
This filter significantly reduces token consumption in long conversations by using intelligent summarization and message compression, while maintaining conversational coherence.
Core Features:
✅ Automatic compression triggered by a message count threshold
✅ Asynchronous summary generation (does not block user response)
✅ Persistent storage with database support (PostgreSQL and SQLite)
✅ Flexible retention policy (configurable to keep first and last N messages)
✅ Smart summary injection to maintain context
═══════════════════════════════════════════════════════════════════════════════
🔄 Workflow
═══════════════════════════════════════════════════════════════════════════════
Phase 1: Inlet (Pre-request processing)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. Receives all messages in the current conversation.
2. Checks for a previously saved summary.
3. If a summary exists and the message count exceeds the retention threshold:
├─ Extracts the first N messages to be kept.
├─ Injects the summary into the first message.
├─ Extracts the last N messages to be kept.
└─ Combines them into a new message list: [Kept First Messages + Summary] + [Kept Last Messages].
4. Sends the compressed message list to the LLM.
Phase 2: Outlet (Post-response processing)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. Triggered after the LLM response is complete.
2. Checks if the message count has reached the compression threshold.
3. If the threshold is met, an asynchronous background task is started to generate a summary:
├─ Extracts messages to be summarized (excluding the kept first and last messages).
├─ Calls the LLM to generate a concise summary.
└─ Saves the summary to the database.
═══════════════════════════════════════════════════════════════════════════════
💾 Storage
═══════════════════════════════════════════════════════════════════════════════
This filter uses a database for persistent storage, configured via the `DATABASE_URL` environment variable. It supports both PostgreSQL and SQLite.
Configuration:
- The `DATABASE_URL` environment variable must be set.
- PostgreSQL Example: `postgresql://user:password@host:5432/openwebui`
- SQLite Example: `sqlite:///path/to/your/database.db`
The filter automatically selects the appropriate database driver based on the `DATABASE_URL` prefix (`postgres` or `sqlite`).
Table Structure (`chat_summary`):
- id: Primary Key (auto-increment)
- chat_id: Unique chat identifier (indexed)
- summary: The summary content (TEXT)
- compressed_message_count: The original number of messages
- created_at: Timestamp of creation
- updated_at: Timestamp of last update
═══════════════════════════════════════════════════════════════════════════════
📊 Compression Example
═══════════════════════════════════════════════════════════════════════════════
Scenario: A 20-message conversation (Default settings: keep first 1, keep last 6)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Before Compression:
Message 1: [Initial prompt + First question]
Messages 2-14: [Historical conversation]
Messages 15-20: [Recent conversation]
Total: 20 full messages
After Compression:
Message 1: [Initial prompt + Historical summary + First question]
Messages 15-20: [Last 6 full messages]
Total: 7 messages
Effect:
✓ Saves 13 messages (approx. 65%)
✓ Retains full context
✓ Protects important initial prompts
═══════════════════════════════════════════════════════════════════════════════
⚙️ Configuration
═══════════════════════════════════════════════════════════════════════════════
priority
Default: 10
Description: The execution order of the filter. Lower numbers run first.
compression_threshold
Default: 15
Description: When the message count reaches this value, a background summary generation will be triggered after the conversation ends.
Recommendation: Adjust based on your model's context window and cost.
keep_first
Default: 1
Description: Always keep the first N messages of the conversation. Set to 0 to disable. The first message often contains important system prompts.
keep_last
Default: 6
Description: Always keep the last N full messages of the conversation to ensure context coherence.
summary_model
Default: None
Description: The LLM used to generate the summary.
Recommendation:
- It is strongly recommended to configure a fast, economical, and compatible model, such as `deepseek-v3`, `gemini-2.5-flash`, or `gpt-4.1`.
- If left empty, the filter will attempt to use the model from the current conversation.
Note:
- If the current conversation uses a pipeline (Pipe) model or a model that does not support standard generation APIs, leaving this field empty may cause summary generation to fail. In this case, you must specify a valid model.
max_summary_tokens
Default: 4000
Description: The maximum number of tokens allowed for the generated summary.
summary_temperature
Default: 0.3
Description: Controls the randomness of the summary generation. Lower values produce more deterministic output.
debug_mode
Default: true
Description: Prints detailed debug information to the log. Recommended to set to `false` in production.
🔧 Deployment
═══════════════════════════════════════════════════════
Docker Compose Example:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
services:
openwebui:
environment:
DATABASE_URL: postgresql://user:password@postgres:5432/openwebui
depends_on:
- postgres
postgres:
image: postgres:15-alpine
environment:
POSTGRES_USER: user
POSTGRES_PASSWORD: password
POSTGRES_DB: openwebui
Suggested Filter Installation Order:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
It is recommended to set the priority of this filter relatively high (a smaller number) to ensure it runs before other filters that might modify message content. A typical order might be:
1. Filters that need access to the full, uncompressed history (priority < 10)
(e.g., a filter that injects a system-level prompt)
2. This compression filter (priority = 10)
3. Filters that run after compression (priority > 10)
(e.g., a final output formatting filter)
═══════════════════════════════════════════════════════════════════════════════
📝 Database Query Examples
═══════════════════════════════════════════════════════════════════════════════
View all summaries:
SELECT
chat_id,
LEFT(summary, 100) as summary_preview,
compressed_message_count,
updated_at
FROM chat_summary
ORDER BY updated_at DESC;
Query a specific conversation:
SELECT *
FROM chat_summary
WHERE chat_id = 'your_chat_id';
Delete old summaries:
DELETE FROM chat_summary
WHERE updated_at < NOW() - INTERVAL '30 days';
Statistics:
SELECT
COUNT(*) as total_summaries,
AVG(LENGTH(summary)) as avg_summary_length,
AVG(compressed_message_count) as avg_msg_count
FROM chat_summary;
═══════════════════════════════════════════════════════════════════════════════
⚠️ Important Notes
═══════════════════════════════════════════════════════════════════════════════
1. Database Permissions
⚠ Ensure the user specified in `DATABASE_URL` has permissions to create tables.
⚠ The `chat_summary` table will be created automatically on first run.
2. Retention Policy
⚠ The `keep_first` setting is crucial for preserving initial messages that contain system prompts. Configure it as needed.
3. Performance
⚠ Summary generation is asynchronous and will not block the user response.
⚠ There will be a brief background processing time when the threshold is first met.
4. Cost Optimization
⚠ The summary model is called once each time the threshold is met.
⚠ Set `compression_threshold` reasonably to avoid frequent calls.
⚠ It's recommended to use a fast and economical model to generate summaries.
5. Multimodal Support
✓ This filter supports multimodal messages containing images.
✓ The summary is generated only from the text content.
✓ Non-text parts (like images) are preserved in their original messages during compression.
═══════════════════════════════════════════════════════════════════════════════
🐛 Troubleshooting
═══════════════════════════════════════════════════════════════════════════════
Problem: Database connection failed
Solution:
1. Verify that the `DATABASE_URL` environment variable is set correctly.
2. Confirm that `DATABASE_URL` starts with either `sqlite` or `postgres`.
3. Ensure the database service is running and network connectivity is normal.
4. Validate the username, password, host, and port in the connection URL.
5. Check the Open WebUI container logs for detailed error messages.
Problem: Summary not generated
Solution:
1. Check if the `compression_threshold` has been met.
2. Verify that the `summary_model` is configured correctly.
3. Check the debug logs for any error messages.
Problem: Initial system prompt is lost
Solution:
- Ensure `keep_first` is set to a value greater than 0 to preserve the initial messages containing this information.
Problem: Compression effect is not significant
Solution:
1. Increase the `compression_threshold` appropriately.
2. Decrease the number of `keep_last` or `keep_first`.
3. Check if the conversation is actually long enough.
"""
from pydantic import BaseModel, Field, model_validator
from typing import Optional
import asyncio
import json
import hashlib
import os
# Open WebUI built-in imports
from open_webui.utils.chat import generate_chat_completion
from open_webui.models.users import Users
from fastapi.requests import Request
from open_webui.main import app as webui_app
# Database imports
from sqlalchemy import create_engine, Column, String, Text, DateTime, Integer
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from datetime import datetime
Base = declarative_base()
class ChatSummary(Base):
"""Chat Summary Storage Table"""
__tablename__ = "chat_summary"
id = Column(Integer, primary_key=True, autoincrement=True)
chat_id = Column(String(255), unique=True, nullable=False, index=True)
summary = Column(Text, nullable=False)
compressed_message_count = Column(Integer, default=0)
created_at = Column(DateTime, default=datetime.utcnow)
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
class Filter:
def __init__(self):
self.valves = self.Valves()
self._db_engine = None
self._SessionLocal = None
self._init_database()
def _init_database(self):
"""Initializes the database connection and table."""
try:
database_url = os.getenv("DATABASE_URL")
if not database_url:
print("[Database] ❌ Error: DATABASE_URL environment variable is not set. Please set this variable.")
self._db_engine = None
self._SessionLocal = None
return
db_type = None
engine_args = {}
if database_url.startswith("sqlite"):
db_type = "SQLite"
engine_args = {
"connect_args": {"check_same_thread": False},
"echo": False,
}
elif database_url.startswith("postgres"):
db_type = "PostgreSQL"
if database_url.startswith("postgres://"):
database_url = database_url.replace(
"postgres://", "postgresql://", 1
)
print("[Database] Automatically converted postgres:// to postgresql://")
engine_args = {
"pool_pre_ping": True,
"pool_recycle": 3600,
"echo": False,
}
else:
print(
f"[Database] ❌ Error: Unsupported database type. DATABASE_URL must start with 'sqlite' or 'postgres'. Current value: {database_url}"
)
self._db_engine = None
self._SessionLocal = None
return
# Create database engine
self._db_engine = create_engine(database_url, **engine_args)
# Create session factory
self._SessionLocal = sessionmaker(
autocommit=False, autoflush=False, bind=self._db_engine
)
# Create table if it doesn't exist
Base.metadata.create_all(bind=self._db_engine)
print(f"[Database] ✅ Successfully connected to {db_type} and initialized the chat_summary table.")
except Exception as e:
print(f"[Database] ❌ Initialization failed: {str(e)}")
self._db_engine = None
self._SessionLocal = None
class Valves(BaseModel):
priority: int = Field(
default=10, description="Priority level for the filter operations."
)
compression_threshold: int = Field(
default=15, ge=0, description="The number of messages at which to trigger compression."
)
keep_first: int = Field(
default=1, ge=0, description="Always keep the first N messages. Set to 0 to disable."
)
keep_last: int = Field(default=6, ge=0, description="Always keep the last N messages.")
        summary_model: Optional[str] = Field(
            default=None,
            description="The model to use for generating the summary. If empty, uses the current conversation's model.",
        )
max_summary_tokens: int = Field(
default=4000, ge=1, description="The maximum number of tokens for the summary."
)
summary_temperature: float = Field(
default=0.3, ge=0.0, le=2.0, description="The temperature for summary generation."
)
debug_mode: bool = Field(default=True, description="Enable detailed logging for debugging.")
@model_validator(mode="after")
def check_thresholds(self) -> "Valves":
kept_count = self.keep_first + self.keep_last
if self.compression_threshold <= kept_count:
raise ValueError(
f"compression_threshold ({self.compression_threshold}) must be greater than "
f"the sum of keep_first ({self.keep_first}) and keep_last ({self.keep_last}) ({kept_count})."
)
return self
def _save_summary(self, chat_id: str, summary: str, body: dict):
"""Saves the summary to the database."""
if not self._SessionLocal:
if self.valves.debug_mode:
print("[Storage] Database not initialized, skipping summary save.")
return
try:
session = self._SessionLocal()
try:
# Find existing record
existing = (
session.query(ChatSummary).filter_by(chat_id=chat_id).first()
)
if existing:
# Update existing record
existing.summary = summary
existing.compressed_message_count = len(body.get("messages", []))
existing.updated_at = datetime.utcnow()
else:
# Create new record
new_summary = ChatSummary(
chat_id=chat_id,
summary=summary,
compressed_message_count=len(body.get("messages", [])),
)
session.add(new_summary)
session.commit()
if self.valves.debug_mode:
action = "Updated" if existing else "Created"
print(f"[Storage] Summary has been {action.lower()} in the database (Chat ID: {chat_id})")
finally:
session.close()
except Exception as e:
print(f"[Storage] ❌ Database save failed: {str(e)}")
def _load_summary(self, chat_id: str, body: dict) -> Optional[str]:
"""Loads the summary from the database."""
if not self._SessionLocal:
if self.valves.debug_mode:
print("[Storage] Database not initialized, cannot load summary.")
return None
try:
session = self._SessionLocal()
try:
record = (
session.query(ChatSummary).filter_by(chat_id=chat_id).first()
)
if record:
if self.valves.debug_mode:
print(f"[Storage] Loaded summary from database (Chat ID: {chat_id})")
print(
f"[Storage] Last updated: {record.updated_at}, Original message count: {record.compressed_message_count}"
)
return record.summary
finally:
session.close()
except Exception as e:
print(f"[Storage] ❌ Database read failed: {str(e)}")
return None
def _inject_summary_to_first_message(self, message: dict, summary: str) -> dict:
"""Injects the summary into the first message by prepending it."""
content = message.get("content", "")
summary_block = f"【Historical Conversation Summary】\n{summary}\n\n---\nBelow is the recent conversation:\n\n"
# Handle different content types
if isinstance(content, list): # Multimodal content
# Find the first text part and insert the summary before it
new_content = []
summary_inserted = False
for part in content:
if (
isinstance(part, dict)
and part.get("type") == "text"
and not summary_inserted
):
# Prepend summary to the first text part
new_content.append(
{"type": "text", "text": summary_block + part.get("text", "")}
)
summary_inserted = True
else:
new_content.append(part)
# If no text part, insert at the beginning
if not summary_inserted:
new_content.insert(0, {"type": "text", "text": summary_block})
message["content"] = new_content
elif isinstance(content, str): # Plain text
message["content"] = summary_block + content
return message
async def inlet(
self, body: dict, __user__: Optional[dict] = None, __metadata__: dict = None
) -> dict:
"""
Executed before sending to the LLM.
Compression Strategy:
1. Keep the first N messages.
2. Inject the summary into the first message (if keep_first > 0).
3. Keep the last N messages.
"""
messages = body.get("messages", [])
chat_id = __metadata__["chat_id"]
if self.valves.debug_mode:
print(f"\n{'='*60}")
print(f"[Inlet] Chat ID: {chat_id}")
print(f"[Inlet] Received {len(messages)} messages")
# [Optimization] Load summary in a background thread to avoid blocking the event loop.
if self.valves.debug_mode:
print("[Optimization] Loading summary in a background thread to avoid blocking the event loop.")
saved_summary = await asyncio.to_thread(self._load_summary, chat_id, body)
total_kept_count = self.valves.keep_first + self.valves.keep_last
if saved_summary and len(messages) > total_kept_count:
if self.valves.debug_mode:
print(f"[Inlet] Found saved summary, applying compression.")
first_messages_to_keep = []
if self.valves.keep_first > 0:
# Copy the initial messages to keep
first_messages_to_keep = [
m.copy() for m in messages[: self.valves.keep_first]
]
# Inject the summary into the very first message
first_messages_to_keep[0] = self._inject_summary_to_first_message(
first_messages_to_keep[0], saved_summary
)
else:
# If not keeping initial messages, create a new system message for the summary
summary_block = (
f"【Historical Conversation Summary】\n{saved_summary}\n\n---\nBelow is the recent conversation:\n\n"
)
first_messages_to_keep.append(
{"role": "system", "content": summary_block}
)
# Keep the last messages
last_messages_to_keep = (
messages[-self.valves.keep_last :] if self.valves.keep_last > 0 else []
)
# Combine: [Kept initial messages (with summary)] + [Kept recent messages]
body["messages"] = first_messages_to_keep + last_messages_to_keep
if self.valves.debug_mode:
print(f"[Inlet] ✂️ Compression complete:")
print(f" - Original messages: {len(messages)}")
print(f" - Compressed to: {len(body['messages'])}")
print(
f" - Structure: [Keep first {self.valves.keep_first} (with summary)] + [Keep last {self.valves.keep_last}]"
)
print(f" - Saved: {len(messages) - len(body['messages'])} messages")
else:
if self.valves.debug_mode:
if not saved_summary:
print(f"[Inlet] No summary found, using full conversation history.")
else:
print(f"[Inlet] Message count does not exceed retention threshold, no compression applied.")
if self.valves.debug_mode:
print(f"{'='*60}\n")
return body
async def outlet(
self, body: dict, __user__: Optional[dict] = None, __metadata__: dict = None
) -> dict:
"""
Executed after the LLM response is complete.
Triggers summary generation asynchronously.
"""
messages = body.get("messages", [])
chat_id = __metadata__["chat_id"]
if self.valves.debug_mode:
print(f"\n{'='*60}")
print(f"[Outlet] Chat ID: {chat_id}")
print(f"[Outlet] Response complete, current message count: {len(messages)}")
# Check if compression is needed
if len(messages) >= self.valves.compression_threshold:
if self.valves.debug_mode:
print(
f"[Outlet] ⚡ Compression threshold reached ({len(messages)} >= {self.valves.compression_threshold})"
)
print(f"[Outlet] Preparing to generate summary in the background...")
# Generate summary asynchronously in the background
asyncio.create_task(
self._generate_summary_async(messages, chat_id, body, __user__)
)
else:
if self.valves.debug_mode:
print(
f"[Outlet] Compression threshold not reached ({len(messages)} < {self.valves.compression_threshold})"
)
if self.valves.debug_mode:
print(f"{'='*60}\n")
return body
async def _generate_summary_async(
self, messages: list, chat_id: str, body: dict, user_data: Optional[dict]
):
"""
Generates a summary asynchronously in the background.
"""
try:
if self.valves.debug_mode:
print(f"\n[🤖 Async Summary Task] Starting...")
# Messages to summarize: exclude kept initial and final messages
if self.valves.keep_last > 0:
messages_to_summarize = messages[
self.valves.keep_first : -self.valves.keep_last
]
else:
messages_to_summarize = messages[self.valves.keep_first :]
if len(messages_to_summarize) == 0:
if self.valves.debug_mode:
print(f"[🤖 Async Summary Task] No messages to summarize, skipping.")
return
if self.valves.debug_mode:
print(f"[🤖 Async Summary Task] Preparing to summarize {len(messages_to_summarize)} messages.")
print(
f"[🤖 Async Summary Task] Protecting: First {self.valves.keep_first} + Last {self.valves.keep_last} messages."
)
# Build conversation history text
conversation_text = self._format_messages_for_summary(messages_to_summarize)
# Call LLM to generate summary
summary = await self._call_summary_llm(conversation_text, body, user_data)
# [Optimization] Save summary in a background thread to avoid blocking the event loop.
if self.valves.debug_mode:
print("[Optimization] Saving summary in a background thread to avoid blocking the event loop.")
await asyncio.to_thread(self._save_summary, chat_id, summary, body)
if self.valves.debug_mode:
print(f"[🤖 Async Summary Task] ✅ Complete! Summary length: {len(summary)} characters.")
print(f"[🤖 Async Summary Task] Summary preview: {summary[:150]}...")
except Exception as e:
print(f"[🤖 Async Summary Task] ❌ Error: {str(e)}")
import traceback
traceback.print_exc()
            # Save a simple placeholder even on failure; derive the count from
            # the full message list in case the failure happened before slicing.
            fallback_count = max(
                len(messages) - self.valves.keep_first - self.valves.keep_last, 0
            )
            fallback_summary = f"[Historical Conversation Summary] Contains content from approximately {fallback_count} messages."
# [Optimization] Save summary in a background thread to avoid blocking the event loop.
        if self.valves.debug_mode:
            print("[Optimization] Saving summary in a background thread to avoid blocking the event loop.")
        await asyncio.to_thread(self._save_summary, chat_id, fallback_summary, body)

    def _format_messages_for_summary(self, messages: list) -> str:
        """Formats messages for summarization."""
        formatted = []
        for i, msg in enumerate(messages, 1):
            role = msg.get("role", "unknown")
            content = msg.get("content", "")
            # Handle multimodal content
            if isinstance(content, list):
                text_parts = []
                for part in content:
                    if isinstance(part, dict) and part.get("type") == "text":
                        text_parts.append(part.get("text", ""))
                content = " ".join(text_parts)
            # Map the role to a display name
            role_name = {"user": "User", "assistant": "Assistant"}.get(role, role)
            # Truncate long messages to keep the prompt compact
            if len(content) > 500:
                content = content[:500] + "..."
            formatted.append(f"[{i}] {role_name}: {content}")
        return "\n\n".join(formatted)

    async def _call_summary_llm(
        self, conversation_text: str, body: dict, user_data: dict
    ) -> str:
        """
        Calls the LLM to generate a summary using Open WebUI's built-in method.
        """
        if self.valves.debug_mode:
            print("[🤖 LLM Call] Using Open WebUI's built-in method.")
        # Build the summary prompt
        summary_prompt = f"""
You are a professional conversation context compression assistant. Your task is to perform a high-fidelity compression of the [Conversation Content] below, producing a concise summary that can be used directly as context for subsequent conversation. Strictly adhere to the following requirements:
MUST RETAIN: Topics/goals, user intent, key facts and data, important parameters and constraints, deadlines, decisions/conclusions, action items and their status, and technical details like code/commands (code must be preserved as is).
REMOVE: Greetings, politeness, repetitive statements, off-topic chatter, and procedural details (unless essential). For information that has been overturned or is outdated, please mark it as "Obsolete: <explanation>" when retaining.
CONFLICT RESOLUTION: If there are contradictions or multiple revisions, retain the latest consistent conclusion and list unresolved or conflicting points under "Points to Clarify".
STRUCTURE AND TONE: Output in structured bullet points. Be logical, objective, and concise. Summarize from a third-person perspective. Use code blocks to preserve technical/code snippets verbatim.
OUTPUT LENGTH: Strictly limit the summary content to within {int(self.valves.max_summary_tokens * 3)} characters. Prioritize key information; if space is insufficient, trim details rather than core conclusions.
FORMATTING: Output only the summary text. Do not add any extra explanations, execution logs, or generation processes. You must use the following headings (if a section has no content, write "None"):
Core Theme:
Key Information:
... (List 3-6 key points)
Decisions/Conclusions:
Action Items (with owner/deadline if any):
Relevant Roles/Preferences:
Risks/Dependencies/Assumptions:
Points to Clarify:
Compression Ratio: Original ~X words → Summary ~Y words (estimate)
Conversation Content:
{conversation_text}
Please directly output the compressed summary that meets the above requirements (summary text only).
"""
        # Determine the model to use
        model = self.valves.summary_model or body.get("model", "")
        if self.valves.debug_mode:
            print(f"[🤖 LLM Call] Model: {model}")
        # Build the payload
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": summary_prompt}],
            "stream": False,
            "max_tokens": self.valves.max_summary_tokens,
            "temperature": self.valves.summary_temperature,
        }
        try:
            # Resolve the user object
            user_id = user_data.get("id") if user_data else None
            if not user_id:
                raise ValueError("Could not get user ID")
            # [Optimization] Fetch the user object in a background thread to avoid blocking the event loop.
            if self.valves.debug_mode:
                print("[Optimization] Getting user object in a background thread to avoid blocking the event loop.")
            user = await asyncio.to_thread(Users.get_user_by_id, user_id)
            if not user:
                raise ValueError(f"Could not find user: {user_id}")
            if self.valves.debug_mode:
                print(f"[🤖 LLM Call] User: {user.email}")
                print("[🤖 LLM Call] Sending request...")
            # Create a Request object
            request = Request(scope={"type": "http", "app": webui_app})
            # Call generate_chat_completion
            response = await generate_chat_completion(request, payload, user)
            if not response or "choices" not in response or not response["choices"]:
                raise ValueError("LLM response is not in the correct format or is empty")
            summary = response["choices"][0]["message"]["content"].strip()
            if self.valves.debug_mode:
                print("[🤖 LLM Call] ✅ Successfully received summary.")
            return summary
        except Exception as e:
            error_message = f"An error occurred while calling the LLM ({model}) to generate a summary: {str(e)}"
            if not self.valves.summary_model:
                error_message += (
                    "\n[Hint] You did not specify a summary_model, so the filter attempted to use the current conversation's model. "
                    "If this is a pipeline (Pipe) model or an incompatible model, please specify a compatible summary model (e.g., 'gemini-2.5-flash') in the configuration."
                )
            if self.valves.debug_mode:
                print(f"[🤖 LLM Call] ❌ {error_message}")
            raise Exception(error_message)


@@ -0,0 +1,77 @@
# Async Context Compression Filter
**Author:** [Fu-Jie](https://github.com/Fu-Jie) | **Version:** 1.0.0 | **License:** MIT
> **Note**: To keep all filters maintainable and easy to use, every filter should ship with clear, complete documentation covering its functionality, configuration, and usage.
This filter significantly reduces token consumption in long conversations through intelligent summarization and message compression, while preserving conversational coherence.
---
## Core Features
- **Automatic compression**: Triggers context compression automatically based on a message-count threshold.
- **Asynchronous summarization**: Generates summaries in the background without blocking the current response.
- **Persistent storage**: Supports PostgreSQL and SQLite, so summaries survive service restarts.
- **Flexible retention policy**: Freely configure how many messages to keep at the head and tail of the conversation, preserving key information and context continuity.
- **Smart injection**: Injects the generated history summary into the new context.
---
## Installation & Configuration
### 1. Environment Variables
This plugin depends on a database; you **must** set `DATABASE_URL` in Open WebUI's environment variables.
- **PostgreSQL example**:
```
DATABASE_URL=postgresql://user:password@host:5432/openwebui
```
- **SQLite example**:
```
DATABASE_URL=sqlite:///path/to/your/data/webui.db
```
### 2. Filter Order
It is recommended to give this filter a relatively high priority (a smaller number) so that it runs before other filters that may modify message content. A typical order might be:
1. **Pre-filters (priority < 10)**
   - e.g., filters that inject system-level prompts.
2. **This compression filter (priority = 10)**
3. **Post-filters (priority > 10)**
   - e.g., filters that format the final output.
---
## Configuration Parameters
You can adjust the following parameters in the filter's settings:
| Parameter | Default | Description |
| :--- | :--- | :--- |
| `priority` | `10` | Filter execution order; smaller values run earlier. |
| `compression_threshold` | `15` | When the total message count reaches this value, summary generation is triggered in the background. |
| `keep_first` | `1` | Always keep the first N messages. The first message usually contains an important system prompt. |
| `keep_last` | `6` | Always keep the last N messages to preserve context continuity. |
| `summary_model` | `None` | Model used to generate summaries. **Strongly recommended**: configure a fast, economical, compatible model (e.g., `gemini-2.5-flash`). If left empty, the current conversation's model is used, which may fail for incompatible (e.g., Pipe) models. |
| `max_summary_tokens` | `4000` | Maximum number of tokens allowed when generating a summary. |
| `summary_temperature` | `0.3` | Controls the randomness of summary generation; lower values give more stable results. |
| `debug_mode` | `true` | Whether to print detailed debug information to the logs. `false` is recommended in production. |
---
## Troubleshooting
- **Problem: database connection fails**
  - **Fix**: Confirm that the `DATABASE_URL` environment variable is set correctly and that the database service is running.
- **Problem: summary is not generated**
  - **Fix**: Check whether `compression_threshold` has been reached and that `summary_model` is configured correctly. Check the logs for detailed errors.
- **Problem: the initial system prompt is lost**
  - **Fix**: Make sure `keep_first` is greater than 0 so that the initial message containing important information is retained.
- **Problem: compression has little effect**
  - **Fix**: Try raising `compression_threshold`, or lowering `keep_first` / `keep_last`.


@@ -0,0 +1,662 @@
# Async Context Compression Filter – Workflow Guide
## 📋 Table of Contents
1. [Overview](#overview)
2. [System Architecture](#system-architecture)
3. [Workflow in Detail](#workflow-in-detail)
4. [Token Counting](#token-counting)
5. [Recursive Summarization](#recursive-summarization)
6. [Configuration Guide](#configuration-guide)
7. [Best Practices](#best-practices)
---
## Overview
The async context compression filter is a high-performance message compression plugin that reduces token consumption in long conversations by:
- **Intelligent summarization**: compresses historical messages into a high-fidelity summary
- **Recursive updates**: merges the previous summary into each new one, keeping history coherent
- **Asynchronous processing**: generates summaries in the background without blocking the user's response
- **Flexible configuration**: supports global and per-model threshold settings
### Key Metrics
- **Compression ratio**: up to 65%+ (depending on conversation length)
- **Response time**: <10ms in the inlet stage (no heavy computation)
- **Summary quality**: high-fidelity recursive summaries that preserve key information
---
## System Architecture
```
┌─────────────────────────────────────────────────────┐
│                 User request flow                   │
└────────────────┬────────────────────────────────────┘
                 │
    ┌────────────▼──────────────┐
    │ inlet (pre-request)       │
    │ ├─ Load summary record    │
    │ ├─ Inject summary into    │
    │ │  the first message      │
    │ └─ Return compressed list │ ◄─ fast return (<10ms)
    └────────────┬──────────────┘
                 │
    ┌────────────▼──────────────┐
    │ LLM processing            │
    │ ├─ Call the language model│
    │ └─ Generate the reply     │
    └────────────┬──────────────┘
                 │
    ┌────────────▼──────────────┐
    │ outlet (post-response)    │
    │ ├─ Start background task  │
    │ └─ Return immediately     │ ◄─ response goes to the user
    └────────────┬──────────────┘
                 │
    ┌────────────▼──────────────┐
    │ Background (asyncio task) │
    │ ├─ Count tokens           │
    │ ├─ Check threshold        │
    │ ├─ Generate recursive     │
    │ │  summary                │
    │ └─ Save to database       │
    └────────────┬──────────────┘
                 │
    ┌────────────▼──────────────┐
    │ Database persistence      │
    │ ├─ Summary content        │
    │ ├─ Compression progress   │
    │ └─ Timestamps             │
    └───────────────────────────┘
```
---
## Workflow in Detail
### 1⃣ inlet stage: summary injection and compressed-view construction
**Goal**: quickly apply the existing summary and build a compressed message view
**Flow**:
```
Input: full message list
    │
    ├─► Load summary record from the database
    │     ├─► found ✓
    │     └─► not found
    │
    ├─► Summary exists?
    │     ├─ Yes ─► build compressed view: [head] + [tail]
    │     └─ No ──► use the original message list
    │
    │     Combined messages:
    │       • head (keep_first)
    │       • summary injected into the first message
    │       • tail (keep_last)
    │
    └─────► return the compressed message list
            ⏱️ takes <10ms
```
**Key parameters**:
- `keep_first`: keep the first N messages (default 1)
- `keep_last`: keep the last N messages (default 6)
- Injection point: prepended to the first message's content
**Example**:
```python
# Original: 20 messages
Message 1:      [system prompt]
Messages 2-14:  [historical conversation]
Messages 15-20: [recent conversation]
# After inlet (summary exists): 7 messages
Message 1:      [system prompt + history summary...]  # summary injected
Messages 15-20: [recent conversation]                 # last 6 kept
```
---
### 2⃣ outlet stage: background asynchronous processing
**Goal**: count tokens, check the threshold, and generate the summary without blocking the response
**Flow**:
```
LLM response completes
  └─► outlet is called
        └─► start a background task (asyncio.create_task)
              ├─► return to the user immediately ✓
              │     (does not wait for the background task)
              └─► background: _check_and_generate_summary_async
                    ├─► count tokens in a background thread
                    │     (await asyncio.to_thread)
                    ├─► resolve the model's threshold config
                    │     • prefer the entry in model_thresholds
                    │     • fall back to the global compression_threshold_tokens
                    ├─► check whether compression is triggered
                    │     if current_tokens >= threshold:
                    └─► run the summary-generation flow
```
**Timeline**:
```
├─ T0: LLM response completes
├─ T1: outlet is called
│    └─► background task started
│          └─► returns immediately ✓
├─ T2: user receives the response ✓✓✓
└─ T3-T10: background task runs
     ├─ count tokens
     ├─ check threshold
     ├─ call the LLM to generate a summary
     └─ save to the database
```
**Key properties**:
- ✅ The user's response is unaffected
- ✅ Token counting does not block the request
- ✅ Summary generation runs asynchronously
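A minimal sketch of this fire-and-forget pattern. The function names are illustrative and the threshold check is simplified; the real filter does considerably more work in the background task.

```python
import asyncio

async def check_and_summarize(body: dict) -> None:
    # Heavy work runs off the request path; blocking parts (token counting)
    # are pushed to a worker thread via asyncio.to_thread.
    tokens = await asyncio.to_thread(
        lambda: sum(len(str(m)) // 4 for m in body["messages"])
    )
    if tokens >= 64000:
        pass  # generate and persist the summary here

async def outlet(body: dict) -> dict:
    # Schedule the background task and return to the user immediately.
    asyncio.create_task(check_and_summarize(body))
    return body

async def main() -> dict:
    body = {"messages": [{"role": "user", "content": "hi"}]}
    result = await outlet(body)   # returns without waiting
    await asyncio.sleep(0.1)      # give the background task time to finish
    return result

print(asyncio.run(main()))
```

The key point is that `outlet` never awaits `check_and_summarize`; the user's response and the summarization proceed independently.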
---
### 3⃣ Token counting and threshold check
**Flow**:
```
Background thread runs _check_and_generate_summary_async
  ├─► Step 1: count the current total tokens
  │     ├─ iterate over all messages
  │     ├─ handle multimodal content (extract the text parts)
  │     ├─ count with the o200k_base encoding
  │     └─ return total_tokens
  ├─► Step 2: resolve the model-specific thresholds
  │     ├─ model ID, e.g. gpt-4
  │     ├─ look it up in model_thresholds
  │     ├─ entry exists?
  │     │    ├─ yes ✓ use that entry
  │     │    └─ no  ✓ use the global settings
  │     ├─ compression_threshold_tokens (default 64000)
  │     └─ max_context_tokens (default 128000)
  └─► Step 3: check whether compression is triggered
        if current_tokens >= compression_threshold_tokens:
          └─► generate the summary
        else:
          └─► no compression needed; task ends
```
**Token counting details**:
```python
def _count_tokens(text):
    if tiktoken_available:
        # Use o200k_base (a single, unified encoding)
        encoding = tiktoken.get_encoding("o200k_base")
        return len(encoding.encode(text))
    else:
        # Fallback: character-based estimate
        return len(text) // 4
```
**Threshold precedence**:
```
Priority 1: model_thresholds["gpt-4"]
Priority 2: model_thresholds["gemini-2.5-flash"]
Priority 3: global compression_threshold_tokens
```
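A self-contained version of the counting scheme sketched above. It assumes the `o200k_base` encoding name from tiktoken and falls back to the 4-characters-per-token estimate whenever tiktoken is unavailable or fails.

```python
def count_tokens(text: str) -> int:
    """Count tokens with tiktoken's o200k_base encoding, falling back to a
    rough 4-characters-per-token estimate when tiktoken is unavailable."""
    try:
        import tiktoken
        encoding = tiktoken.get_encoding("o200k_base")
        return len(encoding.encode(text))
    except Exception:
        # Fallback: coarse but dependency-free estimate
        return len(text) // 4

print(count_tokens("The quick brown fox jumps over the lazy dog."))
```

Because both paths return an integer with the same rough magnitude, the threshold check behaves sensibly whichever path is taken.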
---
### 4⃣ Recursive summary generation
**Core mechanism**: merge the old summary with the new messages to produce an updated summary
**Flow**:
```
_generate_summary_async is triggered
  ├─► Step 1: load the old summary
  │     ├─ query the database
  │     ├─ get previous_summary
  │     └─ get compressed_message_count (progress of the last compression)
  ├─► Step 2: determine the range of messages to compress
  │     ├─ start_index = max(compressed_count, keep_first)
  │     ├─ end_index = len(messages) - keep_last
  │     ├─ extract messages[start_index:end_index]
  │     └─ this is the "new conversation" slice
  ├─► Step 3: build the LLM prompt
  │     ├─ [existing summary] = previous_summary
  │     ├─ [new conversation] = formatted new messages
  │     └─ prompt template:
  │          "Merge the [existing summary] and the [new conversation]..."
  ├─► Step 4: call the LLM to generate the summary
  │     ├─ model: summary_model (if configured) or the current model
  │     ├─ parameters:
  │     │    • max_tokens = max_summary_tokens (default 4000)
  │     │    • temperature = summary_temperature (default 0.3)
  │     │    • stream = False
  │     └─ returns new_summary
  ├─► Step 5: save the summary to the database
  │     ├─ update the chat_summary table
  │     ├─ summary = new_summary
  │     ├─ compressed_message_count = end_index
  │     └─ updated_at = now()
  └─► Step 6: log the result
        └─ summary length, compression progress, elapsed time, etc.
```
**Recursive summarization example**:
```
First round:
  Old summary:  none
  New messages: messages 2-14 (13 messages)
  Generates:    Summary_V1
  Saves:        compressed_message_count = 14
Second round:
  Old summary:  Summary_V1
  New messages: messages 15-28 (starting from 14)
  Generates:    Summary_V2 = LLM(Summary_V1 + messages 14-28)
  Saves:        compressed_message_count = 28
Result:
  ✓ Early information is preserved (via Summary_V1)
  ✓ New information is merged with the old summary
  ✓ Historical coherence is maintained
```
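The bookkeeping behind these rounds reduces to a small range computation. `next_compression_range` is an illustrative name, not the filter's actual API:

```python
def next_compression_range(total_messages, compressed_count, keep_first=1, keep_last=6):
    """Pick the slice of new messages to fold into the previous summary.
    Returns (start, end) for messages[start:end], or None if nothing new."""
    start = max(compressed_count, keep_first)   # resume after last compression
    end = total_messages - keep_last            # never touch the protected tail
    return (start, end) if end > start else None

# First round: nothing compressed yet, 20 messages in the chat.
print(next_compression_range(20, 0))    # (1, 14)
# Second round: 14 messages already folded in, chat has grown to 34.
print(next_compression_range(34, 14))   # (14, 28)
```

The returned `end` index is what gets persisted as `compressed_message_count`, so the next round resumes exactly where this one stopped.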
---
## Token Counting
### Encoding scheme
```
┌─────────────────────────────────┐
│ _count_tokens(text)             │
├─────────────────────────────────┤
│ 1. tiktoken available?          │
│    ├─ yes ✓                     │
│    │   └─ use o200k_base        │
│    │      (fits recent models)  │
│    │                            │
│    └─ no ✓                      │
│        └─ character estimate    │
│           (1 token ≈ 4 chars)   │
└─────────────────────────────────┘
```
### Multimodal content handling
```python
# Message structure
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the image..."},
        {"type": "image_url", "image_url": {...}},
        {"type": "text", "text": "More detail..."}
    ]
}
# Token counting:
# extract all text parts → join → count
# image parts are ignored (they consume no text tokens)
```
### Counting flow
```
_calculate_messages_tokens(messages, model)
  ├─► iterate over each message
  │     ├─ content is a list?
  │     │    ├─ yes ✓ extract all text parts
  │     │    └─ no  ✓ use it directly
  │     └─ _count_tokens(content)
  └─► sum all token counts
```
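The text-extraction step above can be sketched like this; it is a simplified stand-in for the filter's multimodal handling:

```python
def extract_text(content):
    """Pull the text parts out of a message's content.
    Image parts are ignored, matching the counting flow above."""
    if isinstance(content, list):
        return " ".join(
            p.get("text", "") for p in content
            if isinstance(p, dict) and p.get("type") == "text"
        )
    return content if isinstance(content, str) else str(content)

msg = {"role": "user", "content": [
    {"type": "text", "text": "Describe the image..."},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    {"type": "text", "text": "More detail..."},
]}
print(extract_text(msg["content"]))  # Describe the image... More detail...
```

Feeding the extracted string into a token counter gives the per-message count that `_calculate_messages_tokens` then sums.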
---
## Recursive Summarization
### How historical coherence is preserved
```
Traditional compression (problematic):
Timeline:
  messages 1-50   ─► summary 1 ─► keep [summary 1 + messages 45-50]
  messages 51-100 ─► summary 2 ─► keep [summary 2 + messages 95-100]
                         └─► ❌ summary 1 is lost! Early information is unrecoverable
Recursive summarization (this implementation):
Timeline:
  messages 1-50               ──► summary 1 ──► save
  summary 1 + messages 51-100 ──► summary 2 ──► save
                         └─► ✓ summary 1 is folded into summary 2
                             ✓ history is preserved coherently
```
### Mechanism
```
inlet stage:
  query the summary store
    ├─ previous_summary (existing summary)
    └─ compressed_message_count (compression progress)
outlet stage:
  if current_tokens >= threshold:
    ├─ new message range:
    │    [compressed_message_count : len(messages) - keep_last]
    └─ LLM call:
         input:  previous_summary + new messages
         output: updated summary (early + new information)
  save progress:
    └─ compressed_message_count = end_index
       (the next compression starts here)
```
---
## Configuration Guide
### Global configuration
```python
Valves(
    # Token thresholds
    compression_threshold_tokens=64000,  # triggers compression
    max_context_tokens=128000,           # hard upper limit
    # Message retention policy
    keep_first=1,   # keep the first message (system prompt)
    keep_last=6,    # keep the last 6 messages (recent conversation)
    # Summary model
    summary_model="gemini-2.5-flash",  # fast and economical
    # Summary parameters
    max_summary_tokens=4000,
    summary_temperature=0.3,
)
```
### Per-model configuration
```python
model_thresholds = {
    "gpt-4": {
        "compression_threshold_tokens": 8000,
        "max_context_tokens": 32000
    },
    "gemini-2.5-flash": {
        "compression_threshold_tokens": 10000,
        "max_context_tokens": 40000
    },
    "llama-70b": {
        "compression_threshold_tokens": 20000,
        "max_context_tokens": 80000
    }
}
```
### Choosing a configuration
```
Scenario 1: long conversations, cost optimization
  compression_threshold_tokens: 32000   ◄─ trigger earlier
  keep_last: 4                          ◄─ keep fewer messages
Scenario 2: quality first
  compression_threshold_tokens: 100000  ◄─ trigger later
  keep_last: 10                         ◄─ keep more messages
  max_summary_tokens: 8000              ◄─ more detailed summary
Scenario 3: balanced (recommended)
  compression_threshold_tokens: 64000   ◄─ default
  keep_last: 6                          ◄─ default
  summary_model: "gemini-2.5-flash"     ◄─ fast and economical
```
---
## Best Practices
### 1⃣ Choosing a summary model
```
Recommended:
  ✅ gemini-2.5-flash   fast, economical, good quality
  ✅ deepseek-v3        low cost, fast
  ✅ gpt-4o-mini        general-purpose, stable quality
Avoid:
  ❌ pipeline (Pipe) models   may not support the standard API
  ❌ local models             prone to timeouts; hurts the experience
```
### 2⃣ Threshold tuning
```
Verifying token counts:
  1. Enable debug_mode
  2. Observe the actual token counts
  3. Adjust the thresholds as needed
# Example log
[🔍 Background] Token count: 45320
[🔍 Background] Compression threshold not reached (tokens: 45320 < 64000)
```
### 3⃣ Message retention policy
```
keep_first:
  typical value: 1 (keep the system prompt)
  some scenarios: 0 (system prompt lives in the summary)
keep_last:
  typical value: 6 (keep the recent conversation)
  long conversations: 8-10 (more recent context)
  short conversations: 3-4 (save tokens)
```
### 4⃣ Monitoring and maintenance
```
Key metrics:
  • summary generation latency
  • token savings rate
  • summary quality (judged by conversation experience)
Database maintenance:
-- Periodically clean up stale summaries
DELETE FROM chat_summary
WHERE updated_at < NOW() - INTERVAL '30 days'
-- Measure compression effectiveness
SELECT
    COUNT(*) as total_summaries,
    AVG(compressed_message_count) as avg_compressed
FROM chat_summary
```
### 5⃣ Troubleshooting
```
Problem: summary not generated
Checks:
  1. Has the token count reached the threshold?
     → enable debug_mode and check the logs
  2. Is summary_model configured correctly?
     → make sure the model exists and is available
  3. Is the database connection healthy?
     → check DATABASE_URL
Problem: inlet responses are slow
Checks:
  1. Are keep_first/keep_last too large?
  2. Is the summary itself too large?
  3. Are there too many messages?
Problem: summary quality degrades
Adjustments:
  1. Increase max_summary_tokens
  2. Lower summary_temperature (more deterministic)
  3. Switch to a different summary model
```
---
## Performance Reference
### Time overhead
```
inlet stage:
  ├─ database query:    1-2ms
  ├─ summary injection: 2-3ms
  └─ total: <10ms ✓ (no impact on the user experience)
outlet stage:
  ├─ start background task: <1ms
  └─ return immediately: ✓ (no waiting)
Background processing (does not block the user):
  ├─ token counting:  10-50ms
  ├─ LLM call:        1-5s
  ├─ database save:   1-2ms
  └─ total: 1-6s (runs in the background)
```
### Token savings example
```
Scenario: a 20-message conversation
Uncompressed:
  messages: 20
  estimated tokens: 8000
Compressed (keep_first=1, keep_last=6):
  head messages: 1 (1600 tokens)
  summary: ~800 tokens (embedded in the head)
  tail messages: 6 (3200 tokens)
  total: 7 effective input messages (~5600 tokens)
Savings: 8000 - 5600 = 2400 tokens (30%)
As the conversation grows, savings can exceed 65%.
```
---
## Data Flow
```
User message
  ↓
[inlet] summary injector
  ├─ database ← query the summary
  ├─ inject the summary into the first message
  └─ return the compressed message list
  ↓
LLM processing
  ├─ call the language model
  ├─ generate the response
  └─ return it to the user ✓✓✓
  ↓
[outlet] background processing (asyncio task)
  ├─ count tokens
  ├─ check the threshold
  ├─ [if needed] call the LLM to generate a summary
  │    ├─ load the old summary
  │    ├─ extract the new messages
  │    ├─ build the prompt
  │    └─ call the LLM
  ├─ save the new summary to the database
  └─ log the result
  ↓
Database persistence
  └─ chat_summary table updated
```
---
## Summary
| Stage | Responsibility | Time | Characteristics |
|------|------|------|------|
| **inlet** | summary injection | <10ms | fast, no heavy computation |
| **LLM** | generate the reply | variable | normal flow |
| **outlet** | start the background task | <1ms | does not block the response |
| **Background** | token counting, summary generation, persistence | 1-6s | asynchronous |
**Core advantages**:
- ✅ The user's response is unaffected
- ✅ Token consumption drops significantly
- ✅ History is preserved coherently
- ✅ Flexible configuration options

File diff suppressed because it is too large.


@@ -0,0 +1,45 @@
Requirements: Async Context Compression Plugin Optimization
1. Core goal: Upgrade the existing message-count-based compression logic to token-count-based logic, and introduce recursive summarization, in order to control the context window more precisely, improve summary quality, and prevent loss of historical information.
2. Functional requirements
Token counting and threshold control
- Introduce tiktoken: use the tiktoken library for accurate token counting. If the environment does not support it, fall back to a character estimate (1 token ≈ 4 chars).
- New configuration parameters (Valves):
  - compression_threshold_tokens (default: 64000): when the total context token count exceeds this value, trigger compression (summary generation).
  - max_context_tokens (default: 128000): hard upper limit on the context. If exceeded, forcibly remove the earliest messages (except protected ones).
  - model_thresholds (dict): supports per-model thresholds, e.g. {'gpt-4': {'compression_threshold_tokens': 8000, ...}}.
- Deprecate the old parameter: compression_threshold (message-count based) is marked deprecated; token thresholds take precedence.
Recursive summarization
- Mechanism: when generating a new summary, the previous summary must be read and included.
- Logic: new summary = LLM(previous summary + newly produced conversation messages).
- Purpose: prevent the earliest summarized information from being dropped as the conversation grows, ensuring continuity of long-term memory.
Message protection and trimming strategy
- Protection: messages covered by keep_first (first N) and keep_last (last N) are never compressed and never removed.
- Trimming: when the max_context_tokens limit is hit, remove the earliest messages after keep_first and before keep_last first.
Prompt engineering
- Goal: remove noise (greetings, repetition) and retain key signals (facts, code, decisions).
- Instructions:
  - Distill and clean: explicitly remove noise.
  - Key retention: code snippets must be preserved verbatim.
  - Merge and update: explicitly merge new information into the old summary.
  - Language consistency: the output language must match the conversation language.
3. Implementation details
- File: async_context_compression.py
- Class: Filter
- Key methods:
  - _count_tokens(text): token counting.
  - _calculate_messages_tokens(messages): total tokens for a message list.
  - _generate_summary_async(...): modified to load the old summary and pass it to the LLM.
  - _call_summary_llm(...): updated prompt; accepts previous_summary and new_messages.
  - inlet(...): uses compression_threshold_tokens to decide whether to inject the summary; implements forced trimming for max_context_tokens.
  - outlet(...): uses compression_threshold_tokens to decide whether to start the background summary task.


@@ -0,0 +1,572 @@
"""
title: Context & Model Enhancement Filter
author: Fu-Jie
author_url: https://github.com/Fu-Jie
funding_url: https://github.com/Fu-Jie/awesome-openwebui
version: 0.2
description:
    A comprehensive Filter plugin for enriching request context and adapting model features. It provides four core capabilities:
    1. Environment variable injection: automatically prepends the user's environment variables (name, time, time zone, language, etc.) to the first user message
       - Supports plain-text, image, and multimodal messages
       - Idempotent design that avoids duplicate injection
       - Emits a frontend status notice on successful injection
    2. Web Search improvements: optimizes web search for specific models
       - Adds search capability for Alibaba Cloud Qwen, DeepSeek, Gemini, and similar models
       - Automatically recognizes the model and appends a "-search" suffix
       - Manages the feature toggle to prevent conflicts
       - Emits a search-capability status notice when enabled
    3. Model adaptation and context injection: injects context such as chat_id for specific models
       - Special handling for models such as cfchatqwen and webgemini
       - Dynamic model redirection
       - Intelligent model recognition and adaptation
    4. Smart content normalization: a production-grade content cleaning and repair system
       - Repairs broken code blocks (prefix, suffix, indentation)
       - Normalizes LaTeX formula delimiters (inline/block)
       - Normalizes chain-of-thought tags (</thought>)
       - Auto-closes unterminated code blocks
       - Smart list-format fixes
       - Cleans up leftover XML tags
       - Configurable rule system
features:
    - Automated environment variable management
    - Smart model-feature adaptation
    - Asynchronous status feedback
    - Idempotency guarantees
    - Multi-model support
    - Smart content cleaning and normalization
"""
from pydantic import BaseModel, Field
from typing import Optional, List, Callable
import re
import logging
from dataclasses import dataclass, field
# Logging configuration
logger = logging.getLogger(__name__)


@dataclass
class NormalizerConfig:
    """Normalization config: dynamically enables/disables individual rules."""
    enable_escape_fix: bool = True             # fix over-escaped characters
    enable_thought_tag_fix: bool = True        # fix chain-of-thought tags
    enable_code_block_fix: bool = True         # fix code-block formatting
    enable_latex_fix: bool = True              # fix LaTeX formula delimiters
    enable_list_fix: bool = False              # fix list line breaks
    enable_unclosed_block_fix: bool = True     # fix unclosed code blocks
    enable_fullwidth_symbol_fix: bool = False  # fix full-width symbols inside code
    enable_xml_tag_cleanup: bool = True        # clean up leftover XML tags
    # Custom cleaner functions (advanced extension point)
    custom_cleaners: List[Callable[[str], str]] = field(default_factory=list)
class ContentNormalizer:
    """LLM output content normalizer – production-grade implementation."""

    # --- 1. Precompiled regular expressions (performance) ---
    _PATTERNS = {
        # Code-block prefix: ``` that is neither at line start nor after a newline
        'code_block_prefix': re.compile(r'(?<!^)(?<!\n)(```)', re.MULTILINE),
        # Code-block suffix: ```lang immediately followed by non-whitespace (no newline).
        # Matches "```python code" but not "```python" or "```python\n".
        'code_block_suffix': re.compile(r'(```[\w\+\-\.]*)[ \t]+([^\n\r])'),
        # Code-block indentation: leading whitespace before ```
        'code_block_indent': re.compile(r'^[ \t]+(```)', re.MULTILINE),
        # Chain-of-thought tag: </thought> possibly followed by spaces/newlines
        'thought_tag': re.compile(r'</thought>[ \t]*\n*'),
        # LaTeX display formula: \[ ... \]
        'latex_bracket_block': re.compile(r'\\\[(.+?)\\\]', re.DOTALL),
        # LaTeX inline formula: \( ... \)
        'latex_paren_inline': re.compile(r'\\\((.+?)\\\)'),
        # List item: non-newline + digits + dot + space (e.g. "Text1. Item")
        'list_item': re.compile(r'([^\n])(\d+\. )'),
        # Leftover XML tags (e.g. Claude artifacts)
        'xml_artifacts': re.compile(r'</?(?:antArtifact|antThinking|artifact)[^>]*>', re.IGNORECASE),
    }
    def __init__(self, config: Optional[NormalizerConfig] = None):
        self.config = config or NormalizerConfig()
        self.applied_fixes = []

    def normalize(self, content: str) -> str:
        """Entry point: applies all normalization rules in order."""
        self.applied_fixes = []
        if not content:
            return content
        try:
            # 1. Escape-character fixes (must run first; later regexes depend on it)
            if self.config.enable_escape_fix:
                original = content
                content = self._fix_escape_characters(content)
                if content != original:
                    self.applied_fixes.append("fixed escape characters")
            # 2. Chain-of-thought tag normalization
            if self.config.enable_thought_tag_fix:
                original = content
                content = self._fix_thought_tags(content)
                if content != original:
                    self.applied_fixes.append("normalized thought tags")
            # 3. Code-block formatting fixes
            if self.config.enable_code_block_fix:
                original = content
                content = self._fix_code_blocks(content)
                if content != original:
                    self.applied_fixes.append("fixed code-block formatting")
            # 4. LaTeX formula normalization
            if self.config.enable_latex_fix:
                original = content
                content = self._fix_latex_formulas(content)
                if content != original:
                    self.applied_fixes.append("normalized LaTeX formulas")
            # 5. List formatting fixes
            if self.config.enable_list_fix:
                original = content
                content = self._fix_list_formatting(content)
                if content != original:
                    self.applied_fixes.append("fixed list formatting")
            # 6. Unclosed code-block detection and repair
            if self.config.enable_unclosed_block_fix:
                original = content
                content = self._fix_unclosed_code_blocks(content)
                if content != original:
                    self.applied_fixes.append("closed unterminated code block")
            # 7. Full-width → half-width symbols (inside code blocks only)
            if self.config.enable_fullwidth_symbol_fix:
                original = content
                content = self._fix_fullwidth_symbols_in_code(content)
                if content != original:
                    self.applied_fixes.append("converted full-width symbols")
            # 8. Leftover XML tag cleanup
            if self.config.enable_xml_tag_cleanup:
                original = content
                content = self._cleanup_xml_tags(content)
                if content != original:
                    self.applied_fixes.append("cleaned up XML tags")
            # 9. Run custom cleaner functions
            for cleaner in self.config.custom_cleaners:
                original = content
                content = cleaner(content)
                if content != original:
                    self.applied_fixes.append("ran custom cleaner")
            return content
        except Exception as e:
            # Production safety net: if cleaning fails, return the original
            # content rather than breaking the response.
            logger.error(f"Content normalization failed: {e}", exc_info=True)
            return content
    def _fix_escape_characters(self, content: str) -> str:
        """Repairs over-escaped characters."""
        # Note: handle specific escape sequences before the generic double backslash.
        content = content.replace("\\r\\n", "\n")
        content = content.replace("\\n", "\n")
        content = content.replace("\\t", "\t")
        # Repair over-escaped backslashes (e.g. paths like C:\\Users)
        content = content.replace("\\\\", "\\")
        return content

    def _fix_thought_tags(self, content: str) -> str:
        """Normalizes </thought> tags so they are always followed by a blank line."""
        return self._PATTERNS['thought_tag'].sub("</thought>\n\n", content)

    def _fix_code_blocks(self, content: str) -> str:
        """Fixes code-block formatting (own line, trailing newline, de-indentation)."""
        # C: strip indentation before code fences (must run first; later checks depend on it)
        content = self._PATTERNS['code_block_indent'].sub(r"\1", content)
        # A: ensure a newline before ```
        content = self._PATTERNS['code_block_prefix'].sub(r"\n\1", content)
        # B: ensure a newline after the ```language marker
        content = self._PATTERNS['code_block_suffix'].sub(r"\1\n\2", content)
        return content

    def _fix_latex_formulas(self, content: str) -> str:
        """Normalizes LaTeX delimiters: \\[ -> $$ (display), \\( -> $ (inline)."""
        content = self._PATTERNS['latex_bracket_block'].sub(r"$$\1$$", content)
        content = self._PATTERNS['latex_paren_inline'].sub(r"$\1$", content)
        return content

    def _fix_list_formatting(self, content: str) -> str:
        """Fixes list items missing a line break (e.g. 'text1. item' -> 'text\\n1. item')."""
        return self._PATTERNS['list_item'].sub(r"\1\n\2", content)

    def _fix_unclosed_code_blocks(self, content: str) -> str:
        """Detects and closes unterminated code blocks."""
        if content.count("```") % 2 != 0:
            logger.warning("Unclosed code block detected; appending a closing fence")
            content += "\n```"
        return content

    def _fix_fullwidth_symbols_in_code(self, content: str) -> str:
        """Converts full-width symbols to half-width inside code blocks only."""
        # Commonly misused full-width symbols
        FULLWIDTH_MAP = {
            '，': ',', '。': '.', '（': '(', '）': ')',
            '【': '[', '】': ']', '；': ';', '：': ':',
            '？': '?', '！': '!', '“': '"', '”': '"',
            '‘': "'", '’': "'",
        }
        parts = content.split("```")
        # Code-block contents sit at odd indices (1, 3, 5, ...)
        for i in range(1, len(parts), 2):
            for full, half in FULLWIDTH_MAP.items():
                parts[i] = parts[i].replace(full, half)
        return "```".join(parts)

    def _cleanup_xml_tags(self, content: str) -> str:
        """Removes leftover XML tags."""
        return self._PATTERNS['xml_artifacts'].sub("", content)
class Filter:
    class Valves(BaseModel):
        priority: int = Field(
            default=0, description="Priority level for the filter operations."
        )

    def __init__(self):
        # Indicates custom file handling logic. This flag helps disengage default routines in favor of custom
        # implementations, informing the WebUI to defer file-related operations to designated methods within this class.
        # Alternatively, you can remove the files directly from the body in the inlet hook.
        # self.file_handler = True
        # Initialize 'valves' with specific configurations. Using a 'Valves' instance helps encapsulate settings,
        # which ensures settings are managed cohesively and not confused with operational flags like 'file_handler'.
        self.valves = self.Valves()

    def inlet(
        self,
        body: dict,
        __user__: Optional[dict] = None,
        __metadata__: Optional[dict] = None,
        __model__: Optional[dict] = None,
        __event_emitter__=None,
    ) -> dict:
        # Modify or validate the request body before it reaches the chat completion API.
        messages = body.get("messages", [])
        self.insert_user_env_info(__metadata__, messages, __event_emitter__)
        self.change_web_search(body, __user__, __event_emitter__)
        body = self.inlet_chat_id(__model__, __metadata__, body)
        return body

    def inlet_chat_id(self, model: dict, metadata: dict, body: dict):
        if "openai" in model:
            base_model_id = model["openai"]["id"]
        else:
            base_model_id = model["info"]["base_model_id"]
        base_model = model["id"] if base_model_id is None else base_model_id
        if base_model.startswith("cfchatqwen"):
            body["chat_id"] = metadata["chat_id"]
        if base_model.startswith("webgemini"):
            body["chat_id"] = metadata["chat_id"]
            if not model["id"].startswith("webgemini"):
                body["custom_model_id"] = model["id"]
        return body
    def change_web_search(self, body, __user__, __event_emitter__=None):
        """
        Optimizes the web search feature for specific models.

        Behavior:
        - Detects whether web search is enabled
        - Switches supported models over to their native search capability
        - Disables the default web_search toggle to avoid conflicts
        - Emits a status notice when the model's native search is used

        Args:
            body: the request body dict
            __user__: user information
            __event_emitter__: emitter function for frontend events
        """
        features = body.get("features", {})
        web_search_enabled = (
            features.get("web_search", False) if isinstance(features, dict) else False
        )
        user_email = "user"  # fallback when no user info is available
        if isinstance(__user__, (list, tuple)):
            user_email = __user__[0].get("email", "user") if __user__[0] else "user"
        elif isinstance(__user__, dict):
            user_email = __user__.get("email", "user")
        model_name = body.get("model") or ""
        search_enabled_for_model = False
        if web_search_enabled:
            if model_name in ["qwen-max-latest", "qwen-max", "qwen-plus-latest"]:
                body.setdefault("enable_search", True)
                features["web_search"] = False
                search_enabled_for_model = True
            if "search" in model_name or "搜索" in model_name:
                features["web_search"] = False
            if model_name.startswith("cfdeepseek-deepseek") and not model_name.endswith(
                "search"
            ):
                body["model"] = body["model"] + "-search"
                features["web_search"] = False
                search_enabled_for_model = True
            if model_name.startswith("cfchatqwen") and not model_name.endswith(
                "search"
            ):
                body["model"] = body["model"] + "-search"
                features["web_search"] = False
                search_enabled_for_model = True
            if model_name.startswith("gemini-2.5") and "search" not in model_name:
                body["model"] = body["model"] + "-search"
                features["web_search"] = False
                search_enabled_for_model = True
            if user_email == "yi204o@qq.com":
                features["web_search"] = False
        # If the model's native search capability was enabled, emit a status notice
        if search_enabled_for_model and __event_emitter__:
            import asyncio
            try:
                asyncio.create_task(
                    self._emit_search_status(__event_emitter__, model_name)
                )
            except RuntimeError:
                pass
    def insert_user_env_info(
        self, __metadata__, messages, __event_emitter__=None, model_match_tags=None
    ):
        """
        Injects environment variable information into the first user message.

        Features:
        - Always prepends a Markdown block describing the user's environment variables
        - Supports multiple message types: plain text, image, and mixed multimodal messages
        - Idempotent: if the block already exists it is refreshed in place, never duplicated
        - Emits an "injection succeeded" status notice to the frontend on success

        Args:
            __metadata__: metadata dict containing the variables
            messages: the message list
            __event_emitter__: emitter function for frontend events
            model_match_tags: model match tags (reserved; currently unused)
        """
        variables = __metadata__.get("variables", {})
        if not messages or messages[0]["role"] != "user":
            return
        env_injected = False
        if variables:
            # Build the Markdown block describing the environment variables
            variable_markdown = (
                "## User Environment Variables\n"
                "The following are the user's environment variables; use them as a reference for personalization or specific needs:\n"
                f"- **User name**: {variables.get('{{USER_NAME}}', '')}\n"
                f"- **Current datetime**: {variables.get('{{CURRENT_DATETIME}}', '')}\n"
                f"- **Current weekday**: {variables.get('{{CURRENT_WEEKDAY}}', '')}\n"
                f"- **Current timezone**: {variables.get('{{CURRENT_TIMEZONE}}', '')}\n"
                f"- **User language**: {variables.get('{{USER_LANGUAGE}}', '')}\n"
            )
            content = messages[0]["content"]
            # Pattern matching a previously injected block
            env_var_pattern = r"(## User Environment Variables\nThe following are the user's environment variables; use them as a reference for personalization or specific needs:\n.*?User language.*?\n)"
            # Handle the different content types
            if isinstance(content, list):  # multimodal content (may mix images and text)
                # Find the first text-type part
                text_index = -1
                for i, part in enumerate(content):
                    if isinstance(part, dict) and part.get("type") == "text":
                        text_index = i
                        break
                if text_index >= 0:
                    # A text part exists; check whether the block is already present
                    text_part = content[text_index]
                    text_content = text_part.get("text", "")
                    if re.search(env_var_pattern, text_content, flags=re.DOTALL):
                        # Already present: refresh it with the latest data
                        text_part["text"] = re.sub(
                            env_var_pattern,
                            variable_markdown,
                            text_content,
                            flags=re.DOTALL,
                        )
                    else:
                        # Not present: prepend it
                        text_part["text"] = f"{variable_markdown}\n{text_content}"
                    content[text_index] = text_part
                else:
                    # No text part (e.g. image only): add a new text item
                    content.insert(
                        0, {"type": "text", "text": f"{variable_markdown}\n"}
                    )
                messages[0]["content"] = content
                env_injected = True
            elif isinstance(content, str):  # plain-text content
                # Check whether the block is already present
                if re.search(env_var_pattern, content, flags=re.DOTALL):
                    # Already present: refresh it with the latest data
                    messages[0]["content"] = re.sub(
                        env_var_pattern, variable_markdown, content, flags=re.DOTALL
                    )
                else:
                    # Not present: prepend it
                    messages[0]["content"] = f"{variable_markdown}\n{content}"
                env_injected = True
            else:  # any other content type
                # Convert to a string and handle like plain text
                str_content = str(content)
                if re.search(env_var_pattern, str_content, flags=re.DOTALL):
                    messages[0]["content"] = re.sub(
                        env_var_pattern, variable_markdown, str_content, flags=re.DOTALL
                    )
                else:
                    messages[0]["content"] = f"{variable_markdown}\n{str_content}"
                env_injected = True
        # After a successful injection, notify the user via a status event
        if env_injected and __event_emitter__:
            import asyncio
            try:
                # Schedule the notification on the running event loop
                asyncio.create_task(self._emit_env_status(__event_emitter__))
            except RuntimeError:
                # Not inside an event loop; skip the notification
                pass
    async def _emit_env_status(self, __event_emitter__):
        """
        Sends an "environment variables injected" status notice to the frontend.
        """
        try:
            await __event_emitter__(
                {
                    "type": "status",
                    "data": {
                        "description": "✓ User environment variables injected",
                        "done": True,
                    },
                }
            )
        except Exception as e:
            print(f"Error sending status notice: {e}")

    async def _emit_search_status(self, __event_emitter__, model_name):
        """
        Sends a "model search capability enabled" status notice to the frontend.
        """
        try:
            await __event_emitter__(
                {
                    "type": "status",
                    "data": {
                        "description": f"🔍 Search capability enabled for {model_name}",
                        "done": True,
                    },
                }
            )
        except Exception as e:
            print(f"Error sending search status notice: {e}")

    async def _emit_normalization_status(self, __event_emitter__, applied_fixes: List[str] = None):
        """
        Sends a "content normalized" status notice.
        """
        description = "✓ Content normalized automatically"
        if applied_fixes:
            description += f": {', '.join(applied_fixes)}"
        try:
            await __event_emitter__(
                {
                    "type": "status",
                    "data": {
                        "description": description,
                        "done": True,
                    },
                }
            )
        except Exception as e:
            print(f"Error sending normalization status notice: {e}")

    def _contains_html(self, content: str) -> bool:
        """
        Detects whether the content contains HTML tags.
        """
        # Match common HTML tags
        pattern = r"<\s*/?\s*(?:html|head|body|div|span|p|br|hr|ul|ol|li|table|thead|tbody|tfoot|tr|td|th|img|a|b|i|strong|em|code|pre|blockquote|h[1-6]|script|style|form|input|button|label|select|option|iframe|link|meta|title)\b"
        return bool(re.search(pattern, content, re.IGNORECASE))
    def outlet(self, body: dict, __user__: Optional[dict] = None, __event_emitter__=None) -> dict:
        """
        Processes the outgoing response body by rewriting the last assistant message.
        Uses ContentNormalizer for full content normalization.
        """
        if "messages" in body and body["messages"]:
            last = body["messages"][-1]
            content = last.get("content", "") or ""
            if last.get("role") == "assistant" and isinstance(content, str):
                # Skip normalization for HTML content to avoid mangling it
                if self._contains_html(content):
                    return body
                # Initialize the normalizer
                normalizer = ContentNormalizer()
                # Run normalization
                new_content = normalizer.normalize(content)
                # Apply the updated content
                if new_content != content:
                    last["content"] = new_content
                    # Content changed: emit a status notice
                    if __event_emitter__:
                        import asyncio
                        try:
                            # Pass along applied_fixes
                            asyncio.create_task(self._emit_normalization_status(__event_emitter__, normalizer.applied_fixes))
                        except RuntimeError:
                            # Not inside an event loop; ignore
                            pass
        return body

File diff suppressed because it is too large.


@@ -0,0 +1,212 @@
import asyncio
from typing import List, Optional, Dict
from pydantic import BaseModel, Field
from fastapi import Request
from open_webui.models.chats import Chats


class Filter:
    class Valves(BaseModel):
        # Prefix for the injected system message
        CONTEXT_PREFIX: str = Field(
            default="Below are answers from several anonymous AI models, each wrapped in a <response> tag:\n\n",
            description="Prefix for the injected system message containing the raw merged context.",
        )

    def __init__(self):
        self.valves = self.Valves()
        self.toggle = True
        self.type = "filter"
        self.name = "Merge Responses"
        self.description = "When the user asks a question, automatically injects the context of previous multi-model responses."
    async def inlet(
        self,
        body: Dict,
        __user__: Dict,
        __metadata__: Dict,
        __request__: Request,
        __event_emitter__,
    ):
        """
        Entry point of the filter. It checks whether the previous turn was a
        multi-model response; if so, it formats those responses and injects the
        formatted context into the current request as a system message.
        """
        print(f"*********** Filter '{self.name}' triggered ***********")
        chat_id = __metadata__.get("chat_id")
        if not chat_id:
            print(
                f"DEBUG: Filter '{self.name}' skipped: chat_id not found in metadata."
            )
            return body
        print(f"DEBUG: Chat ID found: {chat_id}")
        # 1. Fetch the full chat history from the database
        try:
            chat = await asyncio.to_thread(Chats.get_chat_by_id, chat_id)
            if (
                not chat
                or not hasattr(chat, "chat")
                or not chat.chat.get("history")
                or not chat.chat.get("history").get("messages")
            ):
                print(
                    f"DEBUG: Filter '{self.name}' skipped: Chat history not found or empty for chat_id: {chat_id}"
                )
                return body
            messages_map = chat.chat["history"]["messages"]
            print(
                f"DEBUG: Successfully loaded {len(messages_map)} messages from history."
            )
            # Count the number of user messages in the history
            user_message_count = sum(
                1 for msg in messages_map.values() if msg.get("role") == "user"
            )
            # If there are fewer than 2 user messages, there is no previous turn to merge.
            if user_message_count < 2:
                print(
                    f"DEBUG: Filter '{self.name}' skipped: Not enough user messages in history to have a previous turn (found {user_message_count}, required >= 2)."
                )
                return body
        except Exception as e:
            print(
                f"ERROR: Filter '{self.name}' failed to get chat history from DB: {e}"
            )
            return body
# This filter rebuilds the entire chat history to consolidate all multi-response turns.
# 1. Get all messages from history and sort by timestamp
all_messages = list(messages_map.values())
all_messages.sort(key=lambda x: x.get("timestamp", 0))
# 2. Pre-group all assistant messages by their parentId for efficient lookup
assistant_groups = {}
for msg in all_messages:
if msg.get("role") == "assistant":
parent_id = msg.get("parentId")
if parent_id:
if parent_id not in assistant_groups:
assistant_groups[parent_id] = []
assistant_groups[parent_id].append(msg)
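# Illustrative shape of assistant_groups after this pass (IDs are
# hypothetical; each value holds the sibling responses that answer
# the same user message):
# assistant_groups = {
#     "user-msg-1": [{"id": "resp-a", ...}, {"id": "resp-b", ...}],
# }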
final_messages = []
processed_parent_ids = set()
# 3. Iterate through the sorted historical messages to build the final, clean list
for msg in all_messages:
msg_id = msg.get("id")
role = msg.get("role")
parent_id = msg.get("parentId")
if role == "user":
# Add user messages directly
final_messages.append(msg)
elif role == "assistant":
# If this assistant's parent group has already been processed, skip it
if parent_id in processed_parent_ids:
continue
# Process the group of siblings for this parent_id
if parent_id in assistant_groups:
siblings = assistant_groups[parent_id]
# Only perform a merge if there are multiple siblings
if len(siblings) > 1:
print(
f"DEBUG: Found a group of {len(siblings)} siblings for parent_id {parent_id}. Merging..."
)
# --- MERGE LOGIC ---
merged_content = None
merged_message_id = None
# Sort siblings by timestamp before processing
siblings.sort(key=lambda s: s.get("timestamp", 0))
merged_message_timestamp = siblings[0].get("timestamp", 0)
# Case A: Check for system pre-merged content (merged.status: true and content not empty)
merged_content_msg = next(
(
s
for s in siblings
if s.get("merged", {}).get("status")
and s.get("merged", {}).get("content")
),
None,
)
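# A pre-merged sibling is assumed to look roughly like:
# {"id": "...", "role": "assistant",
#  "merged": {"status": True, "content": "<merged text>"}}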
if merged_content_msg:
merged_content = merged_content_msg["merged"]["content"]
merged_message_id = merged_content_msg["id"]
merged_message_timestamp = merged_content_msg.get(
"timestamp", merged_message_timestamp
)
print(
f"DEBUG: Using pre-merged content from message ID: {merged_message_id}"
)
else:
# Case B: Manually merge content
combined_content = []
first_sibling_id = None
counter = 0
for s in siblings:
if not first_sibling_id:
first_sibling_id = s["id"]
content = s.get("content", "")
# Skip empty responses and the backend's placeholder error text
if (
content
and content
!= "The requested model is not supported."
):
response_id = chr(ord("a") + counter)
combined_content.append(
f'<response id="{response_id}">\n{content}\n</response>'
)
counter += 1
if combined_content:
merged_content = "\n\n".join(combined_content)
merged_message_id = first_sibling_id or parent_id
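# Illustrative result of the manual merge above (contents hypothetical):
# <response id="a">
# First model's answer
# </response>
#
# <response id="b">
# Second model's answer
# </response>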
if merged_content:
merged_message = {
"id": merged_message_id,
"parentId": parent_id,
"role": "assistant",
"content": f"{self.valves.CONTEXT_PREFIX}{merged_content}",
"timestamp": merged_message_timestamp,
}
final_messages.append(merged_message)
else:
# If there's only one sibling, add it directly
final_messages.append(siblings[0])
# Mark this group as processed
processed_parent_ids.add(parent_id)
# 4. The new user message from the current request is not in the historical messages_map,
# so we need to append it to our newly constructed message list.
if body.get("messages"):
new_user_message_from_body = body["messages"][-1]
# Ensure we don't add a historical message that might be in the body for context
if new_user_message_from_body.get("id") not in messages_map:
final_messages.append(new_user_message_from_body)
# 5. Replace the original message list with the new, cleaned-up list
body["messages"] = final_messages
print(
f"DEBUG: Rebuilt message history with {len(final_messages)} messages, consolidating all multi-response turns."
)
print(f"*********** Filter '{self.name}' finished successfully ***********")
return body