feat: add a plugin system, multiple plugin types, a development guide, and multilingual documentation.

This commit is contained in:
fujie
2025-12-20 12:34:49 +08:00
commit eaa6319991
74 changed files with 28409 additions and 0 deletions

plugins/filters/README.md Normal file

@@ -0,0 +1,45 @@
# Filters
English | [中文](./README_CN.md)
Filters process and modify user input before it is sent to the LLM. This directory contains various filters that can be used to extend OpenWebUI functionality.
## 📋 Filter List
| Filter Name | Description | Documentation |
| :--- | :--- | :--- |
| **Async Context Compression** | Reduces token consumption in long conversations through intelligent summarization and message compression while maintaining conversational coherence. | [English](./async-context-compression/async_context_compression.md) / [中文](./async-context-compression/async_context_compression_cn.md) |
## 🚀 Quick Start
### Installing a Filter
1. Navigate to the desired filter directory
2. Download the corresponding `.py` file to your local machine
3. Open OpenWebUI Admin Settings and find the "Filters" section
4. Upload the Python file
5. Configure the filter parameters according to its documentation
6. Refresh the page and enable the filter in your chat settings
## 📖 Development Guide
When adding a new filter, please follow these steps:
1. **Create Filter Directory**: Create a new folder in the current directory (e.g., `my_filter/`)
2. **Write Filter Code**: Create a `.py` file with clear documentation of functionality and configuration in comments
3. **Write Documentation**:
- Create `filter_name.md` (English version)
- Create `filter_name_cn.md` (Chinese version)
- Documentation should include: feature description, configuration parameters, usage examples, and troubleshooting
4. **Update This List**: Add your new filter to the table above
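A filter follows Open WebUI's function shape: a `Filter` class with a nested pydantic `Valves` model for configuration, plus `inlet`/`outlet` hooks that receive the request and response bodies. A minimal sketch to start from (the `prefix` valve is a hypothetical example, not part of any filter in this repository):

```python
from typing import Optional

from pydantic import BaseModel, Field


class Filter:
    class Valves(BaseModel):
        priority: int = Field(default=10, description="Lower numbers run first.")
        prefix: str = Field(
            default="[my_filter] ", description="Hypothetical example valve."
        )

    def __init__(self):
        self.valves = self.Valves()

    async def inlet(self, body: dict, __user__: Optional[dict] = None) -> dict:
        # Runs before the request reaches the LLM: modify body["messages"] here.
        for msg in body.get("messages", []):
            if msg.get("role") == "user" and isinstance(msg.get("content"), str):
                msg["content"] = self.valves.prefix + msg["content"]
        return body

    async def outlet(self, body: dict, __user__: Optional[dict] = None) -> dict:
        # Runs after the LLM responds: post-process body["messages"] here.
        return body
```

The filters in this directory follow this same skeleton, adding their own valves and message-processing logic inside `inlet` and `outlet`.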
## ⚙️ Configuration Best Practices
- **Priority Management**: Set appropriate filter priority to ensure correct execution order
- **Parameter Tuning**: Adjust filter parameters based on your specific needs
- **Debug Logging**: Enable debug mode during development, disable in production
- **Performance Testing**: Test filter performance under high load
---
> **Contributor Note**: To ensure project maintainability and user experience, please provide clear and complete documentation for each new filter, including feature description, parameter configuration, usage examples, and troubleshooting guide.


@@ -0,0 +1,67 @@
# Auto Context Merger Filter (`auto_context_merger`)
## Overview
`auto_context_merger` is an Open WebUI filter plugin that improves the coherence and depth of follow-up conversation by automatically collecting and injecting the context of the previous turn's multi-model answers. The filter activates automatically when the user asks a new follow-up question after a multi-model response.
It identifies every AI model's answer from the previous turn in the conversation history, concatenates them in a clear format, and injects the result into the current request as a system message. When handling the user's new question, the current model can then refer directly to all of the earlier AI viewpoints and provide a more comprehensive, coherent answer.
## How It Works
1. **Trigger**: The filter activates automatically when the user sends a new follow-up question after a multi-model response.
2. **History retrieval**: It uses the current conversation's `chat_id` to load the full conversation history from the database.
3. **Previous-turn analysis**: By analyzing the conversation tree, it locates the user's previous question and all of the parallel answers the AI models gave at that point.
4. **Direct formatting**: If the previous turn did produce multiple AI answers, it collects the content of all of them.
5. **Smart injection**: The formatted answers are injected as a system message into the current request's `messages` list, immediately before the user's new question.
6. **Hand-off to the target model**: The modified request body (including the formatted context) is passed to the model the user originally selected, which can draw on this richer context when generating its response.
7. **Status updates**: Throughout processing, the filter emits real-time status updates via `__event_emitter__` so the user can follow its progress.
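The formatting-and-injection steps above can be sketched as follows. This is a simplified illustration, not the filter's actual implementation; the model names and prefix text are placeholders:

```python
from typing import Dict


def merge_previous_answers(answers: Dict[str, str], prefix: str) -> dict:
    """Build the system message injected before the user's new question.

    `answers` maps a model name to that model's answer from the previous turn.
    """
    blocks = [
        f"**Answer from model '{name}':**\n{text}" for name, text in answers.items()
    ]
    return {"role": "system", "content": prefix + "\n---\n".join(blocks)}


msg = merge_previous_answers(
    {"model-a": "QM is ...", "model-b": "Quantum mechanics describes ..."},
    "**Background knowledge**: please refer to the previous turn's answers:\n",
)
print(msg["role"])  # system
```

The resulting message would then be placed into the request's `messages` list ahead of the user's follow-up question.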
## Configuration (Valves)
You can configure this filter's `Valves` in the Open WebUI admin interface:
* **`CONTEXT_PREFIX`** (string, required):
    * **Description**: Prefix text for the injected system message. It appears before the merged context and tells the model where this content comes from and what it is for.
    * **Example**: `**Background knowledge**: To better answer your new question, please refer to the answers given by multiple AI models in the previous turn:\n\n`
## Usage
1. **Deploy the filter**: Place the `auto_context_merger.py` file in the `plugins/filters/` directory of your Open WebUI instance.
2. **Enable the filter**: Log in to the Open WebUI admin interface and navigate to **Workspace -> Functions**. Find the `auto_context_merger` filter and enable it.
3. **Configure parameters**: Click the edit button next to the `auto_context_merger` filter and set `CONTEXT_PREFIX` to suit your needs.
4. **Start chatting**:
    * First, ask a question and have multiple models answer it (for example via `gemini_manifold` or another multi-model tool).
    * Then ask a follow-up question about that multi-model response.
    * The filter activates automatically, merging all of the previous turn's AI answers and injecting them into the current request.
## Example
Assume `CONTEXT_PREFIX` is left at its default value.
1. **User asks**: "Explain quantum mechanics."
2. **Multiple AIs answer** (for example, both Model A and Model B respond).
3. **User asks again**: "So what is the difference between quantum entanglement and quantum tunneling?"
At this point, the `auto_context_merger` filter activates automatically:
1. It retrieves Model A's and Model B's answers to "Explain quantum mechanics."
2. It formats them as:
```
**Background knowledge**: To better answer your new question, please refer to the answers given by multiple AI models in the previous turn:
**Answer from model 'Model A name':**
[Model A's explanation of quantum mechanics]
---
**Answer from model 'Model B name':**
[Model B's explanation of quantum mechanics]
```
3. It then injects this content into the current request as a system message, immediately before the user question "So what is the difference between quantum entanglement and quantum tunneling?"
In the end, the model receives a request containing all of the relevant context and can answer your follow-up question more accurately and comprehensively.
## Notes
* This filter is designed to make multi-model conversations more coherent by giving the model richer context for follow-up questions.
* Make sure `gemini_manifold` or another tool capable of producing multi-model answers is configured and enabled in your Open WebUI instance, so the filter can detect a multi-model history.
* The filter makes no additional model calls, so it adds no significant latency or cost; it only formats and injects existing history data.


@@ -0,0 +1,77 @@
# Async Context Compression Filter
**Author:** [Fu-Jie](https://github.com/Fu-Jie) | **Version:** 1.0.0 | **License:** MIT
> **Important Note**: To ensure the maintainability and usability of all filters, each filter should be accompanied by clear and complete documentation to fully explain its functionality, configuration, and usage.
This filter significantly reduces token consumption in long conversations by using intelligent summarization and message compression, while maintaining conversational coherence.
---
## Core Features
- **Automatic Compression**: Triggers context compression automatically based on a message count threshold.
- **Asynchronous Summarization**: Generates summaries in the background without blocking the current chat response.
- **Persistent Storage**: Supports both PostgreSQL and SQLite databases to ensure summaries are not lost after a service restart.
- **Flexible Retention Policy**: Freely configure the number of initial and final messages to keep, ensuring critical information and context continuity.
- **Smart Injection**: Intelligently injects the generated historical summary into the new context.
---
## Installation & Configuration
### 1. Environment Variable
This plugin requires a database connection. You **must** configure the `DATABASE_URL` in your Open WebUI environment variables.
- **PostgreSQL Example**:
```
DATABASE_URL=postgresql://user:password@host:5432/openwebui
```
- **SQLite Example**:
```
DATABASE_URL=sqlite:///path/to/your/data/webui.db
```
### 2. Filter Order
It is recommended to set the priority of this filter relatively high (a smaller number) to ensure it runs before other filters that might modify message content. A typical order might be:
1. **Pre-Filters (priority < 10)**
- e.g., A filter that injects a system-level prompt.
2. **This Compression Filter (priority = 10)**
3. **Post-Filters (priority > 10)**
- e.g., A filter that formats the final output.
---
## Configuration Parameters
You can adjust the following parameters in the filter's settings:
| Parameter | Default | Description |
| :--- | :--- | :--- |
| `priority` | `10` | The execution order of the filter. Lower numbers run first. |
| `compression_threshold` | `15` | When the total message count reaches this value, a background summary generation will be triggered. |
| `keep_first` | `1` | Always keep the first N messages. The first message often contains important system prompts. |
| `keep_last` | `6` | Always keep the last N messages to ensure contextual coherence. |
| `summary_model` | `None` | The model used for generating summaries. **Strongly recommended** to set a fast, economical, and compatible model (e.g., `gemini-2.5-flash`). If left empty, it will try to use the current chat's model, which may fail if it's an incompatible model type (like a Pipe model). |
| `max_summary_tokens` | `4000` | The maximum number of tokens allowed for the generated summary. |
| `summary_temperature` | `0.3` | Controls the randomness of the summary. Lower values are more deterministic. |
| `debug_mode` | `true` | Whether to print detailed debug information to the log. Recommended to set to `false` in production. |
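To see how `keep_first` and `keep_last` interact once a summary exists, the slicing logic can be sketched roughly as follows (a simplified illustration with placeholder messages, not the filter's exact code):

```python
def compress(messages, summary, keep_first=1, keep_last=6):
    """Rebuild the list as [first N (with summary prepended)] + [last N]."""
    if len(messages) <= keep_first + keep_last:
        return messages  # nothing to compress
    head = [dict(m) for m in messages[:keep_first]]
    if head:
        # Prepend the historical summary to the very first kept message.
        head[0]["content"] = (
            f"[Historical Conversation Summary]\n{summary}\n\n---\n"
            + head[0]["content"]
        )
    else:
        head = [{"role": "system", "content": summary}]
    tail = messages[-keep_last:] if keep_last > 0 else []
    return head + tail


msgs = [{"role": "user", "content": f"m{i}"} for i in range(20)]
out = compress(msgs, "summary of messages 2-14")
print(len(out))  # 7: 1 kept head message + 6 kept tail messages
```

With the defaults, a 20-message conversation shrinks to 7 messages while the first message and the most recent 6 survive verbatim.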
---
## Troubleshooting
- **Problem: Database connection failed.**
- **Solution**: Please ensure the `DATABASE_URL` environment variable is set correctly and that the database service is running.
- **Problem: Summary not generated.**
- **Solution**: Check if the `compression_threshold` has been met and verify that `summary_model` is configured correctly. Check the logs for detailed errors.
- **Problem: Initial system prompt is lost.**
- **Solution**: Ensure `keep_first` is set to a value greater than 0 to preserve the initial messages containing important information.
- **Problem: Compression effect is not significant.**
- **Solution**: Try increasing the `compression_threshold` or decreasing the `keep_first` / `keep_last` values.
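If the database connection keeps failing, it can help to test the same `DATABASE_URL` outside Open WebUI with a short SQLAlchemy check (a standalone sketch; the SQLite fallback URL is just a placeholder):

```python
import os

from sqlalchemy import create_engine, text

# Use the same URL the filter will see; fall back to in-memory SQLite.
url = os.getenv("DATABASE_URL", "sqlite:///:memory:")
if url.startswith("postgres://"):
    # SQLAlchemy requires the postgresql:// scheme.
    url = url.replace("postgres://", "postgresql://", 1)

engine = create_engine(url, pool_pre_ping=True)
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
print("connection OK")
```

If this script fails with the same error as the filter, the problem is in the URL or the database service, not in the plugin.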


@@ -0,0 +1,780 @@
"""
title: Async Context Compression
id: async_context_compression
author: Fu-Jie
author_url: https://github.com/Fu-Jie
funding_url: https://github.com/Fu-Jie/awesome-openwebui
description: Reduces token consumption in long conversations while maintaining coherence through intelligent summarization and message compression.
version: 1.0.1
license: MIT
═══════════════════════════════════════════════════════════════════════════════
📌 Overview
═══════════════════════════════════════════════════════════════════════════════
This filter significantly reduces token consumption in long conversations by using intelligent summarization and message compression, while maintaining conversational coherence.
Core Features:
✅ Automatic compression triggered by a message count threshold
✅ Asynchronous summary generation (does not block user response)
✅ Persistent storage with database support (PostgreSQL and SQLite)
✅ Flexible retention policy (configurable to keep first and last N messages)
✅ Smart summary injection to maintain context
═══════════════════════════════════════════════════════════════════════════════
🔄 Workflow
═══════════════════════════════════════════════════════════════════════════════
Phase 1: Inlet (Pre-request processing)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. Receives all messages in the current conversation.
2. Checks for a previously saved summary.
3. If a summary exists and the message count exceeds the retention threshold:
├─ Extracts the first N messages to be kept.
├─ Injects the summary into the first message.
├─ Extracts the last N messages to be kept.
└─ Combines them into a new message list: [Kept First Messages + Summary] + [Kept Last Messages].
4. Sends the compressed message list to the LLM.
Phase 2: Outlet (Post-response processing)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. Triggered after the LLM response is complete.
2. Checks if the message count has reached the compression threshold.
3. If the threshold is met, an asynchronous background task is started to generate a summary:
├─ Extracts messages to be summarized (excluding the kept first and last messages).
├─ Calls the LLM to generate a concise summary.
└─ Saves the summary to the database.
═══════════════════════════════════════════════════════════════════════════════
💾 Storage
═══════════════════════════════════════════════════════════════════════════════
This filter uses a database for persistent storage, configured via the `DATABASE_URL` environment variable. It supports both PostgreSQL and SQLite.
Configuration:
- The `DATABASE_URL` environment variable must be set.
- PostgreSQL Example: `postgresql://user:password@host:5432/openwebui`
- SQLite Example: `sqlite:///path/to/your/database.db`
The filter automatically selects the appropriate database driver based on the `DATABASE_URL` prefix (`postgres` or `sqlite`).
Table Structure (`chat_summary`):
- id: Primary Key (auto-increment)
- chat_id: Unique chat identifier (indexed)
- summary: The summary content (TEXT)
- compressed_message_count: The original number of messages
- created_at: Timestamp of creation
- updated_at: Timestamp of last update
═══════════════════════════════════════════════════════════════════════════════
📊 Compression Example
═══════════════════════════════════════════════════════════════════════════════
Scenario: A 20-message conversation (Default settings: keep first 1, keep last 6)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Before Compression:
Message 1: [Initial prompt + First question]
Messages 2-14: [Historical conversation]
Messages 15-20: [Recent conversation]
Total: 20 full messages
After Compression:
Message 1: [Initial prompt + Historical summary + First question]
Messages 15-20: [Last 6 full messages]
Total: 7 messages
Effect:
✓ Saves 13 messages (approx. 65%)
✓ Retains full context
✓ Protects important initial prompts
═══════════════════════════════════════════════════════════════════════════════
⚙️ Configuration
═══════════════════════════════════════════════════════════════════════════════
priority
Default: 10
Description: The execution order of the filter. Lower numbers run first.
compression_threshold
Default: 15
Description: When the message count reaches this value, a background summary generation will be triggered after the conversation ends.
Recommendation: Adjust based on your model's context window and cost.
keep_first
Default: 1
Description: Always keep the first N messages of the conversation. Set to 0 to disable. The first message often contains important system prompts.
keep_last
Default: 6
Description: Always keep the last N full messages of the conversation to ensure context coherence.
summary_model
Default: None
Description: The LLM used to generate the summary.
Recommendation:
- It is strongly recommended to configure a fast, economical, and compatible model, such as `deepseek-v3`, `gemini-2.5-flash`, or `gpt-4.1`.
- If left empty, the filter will attempt to use the model from the current conversation.
Note:
- If the current conversation uses a pipeline (Pipe) model or a model that does not support standard generation APIs, leaving this field empty may cause summary generation to fail. In this case, you must specify a valid model.
max_summary_tokens
Default: 4000
Description: The maximum number of tokens allowed for the generated summary.
summary_temperature
Default: 0.3
Description: Controls the randomness of the summary generation. Lower values produce more deterministic output.
debug_mode
Default: true
Description: Prints detailed debug information to the log. Recommended to set to `false` in production.
🔧 Deployment
═══════════════════════════════════════════════════════
Docker Compose Example:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
services:
openwebui:
environment:
DATABASE_URL: postgresql://user:password@postgres:5432/openwebui
depends_on:
- postgres
postgres:
image: postgres:15-alpine
environment:
POSTGRES_USER: user
POSTGRES_PASSWORD: password
POSTGRES_DB: openwebui
Suggested Filter Installation Order:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
It is recommended to set the priority of this filter relatively high (a smaller number) to ensure it runs before other filters that might modify message content. A typical order might be:
1. Filters that need access to the full, uncompressed history (priority < 10)
(e.g., a filter that injects a system-level prompt)
2. This compression filter (priority = 10)
3. Filters that run after compression (priority > 10)
(e.g., a final output formatting filter)
═══════════════════════════════════════════════════════════════════════════════
📝 Database Query Examples
═══════════════════════════════════════════════════════════════════════════════
View all summaries:
SELECT
chat_id,
LEFT(summary, 100) as summary_preview,
compressed_message_count,
updated_at
FROM chat_summary
ORDER BY updated_at DESC;
Query a specific conversation:
SELECT *
FROM chat_summary
WHERE chat_id = 'your_chat_id';
Delete old summaries:
DELETE FROM chat_summary
WHERE updated_at < NOW() - INTERVAL '30 days';
Statistics:
SELECT
COUNT(*) as total_summaries,
AVG(LENGTH(summary)) as avg_summary_length,
AVG(compressed_message_count) as avg_msg_count
FROM chat_summary;
═══════════════════════════════════════════════════════════════════════════════
⚠️ Important Notes
═══════════════════════════════════════════════════════════════════════════════
1. Database Permissions
⚠ Ensure the user specified in `DATABASE_URL` has permissions to create tables.
⚠ The `chat_summary` table will be created automatically on first run.
2. Retention Policy
⚠ The `keep_first` setting is crucial for preserving initial messages that contain system prompts. Configure it as needed.
3. Performance
⚠ Summary generation is asynchronous and will not block the user response.
⚠ There will be a brief background processing time when the threshold is first met.
4. Cost Optimization
⚠ The summary model is called once each time the threshold is met.
⚠ Set `compression_threshold` reasonably to avoid frequent calls.
⚠ It's recommended to use a fast and economical model to generate summaries.
5. Multimodal Support
✓ This filter supports multimodal messages containing images.
✓ The summary is generated only from the text content.
✓ Non-text parts (like images) are preserved in their original messages during compression.
═══════════════════════════════════════════════════════════════════════════════
🐛 Troubleshooting
═══════════════════════════════════════════════════════════════════════════════
Problem: Database connection failed
Solution:
1. Verify that the `DATABASE_URL` environment variable is set correctly.
2. Confirm that `DATABASE_URL` starts with either `sqlite` or `postgres`.
3. Ensure the database service is running and network connectivity is normal.
4. Validate the username, password, host, and port in the connection URL.
5. Check the Open WebUI container logs for detailed error messages.
Problem: Summary not generated
Solution:
1. Check if the `compression_threshold` has been met.
2. Verify that the `summary_model` is configured correctly.
3. Check the debug logs for any error messages.
Problem: Initial system prompt is lost
Solution:
- Ensure `keep_first` is set to a value greater than 0 to preserve the initial messages containing this information.
Problem: Compression effect is not significant
Solution:
1. Increase the `compression_threshold` appropriately.
2. Decrease the number of `keep_last` or `keep_first`.
3. Check if the conversation is actually long enough.
"""
from pydantic import BaseModel, Field, model_validator
from typing import Optional
import asyncio
import json
import hashlib
import os
# Open WebUI built-in imports
from open_webui.utils.chat import generate_chat_completion
from open_webui.models.users import Users
from fastapi.requests import Request
from open_webui.main import app as webui_app
# Database imports
from sqlalchemy import create_engine, Column, String, Text, DateTime, Integer
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from datetime import datetime
Base = declarative_base()
class ChatSummary(Base):
"""Chat Summary Storage Table"""
__tablename__ = "chat_summary"
id = Column(Integer, primary_key=True, autoincrement=True)
chat_id = Column(String(255), unique=True, nullable=False, index=True)
summary = Column(Text, nullable=False)
compressed_message_count = Column(Integer, default=0)
created_at = Column(DateTime, default=datetime.utcnow)
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
class Filter:
def __init__(self):
self.valves = self.Valves()
self._db_engine = None
self._SessionLocal = None
self._init_database()
def _init_database(self):
"""Initializes the database connection and table."""
try:
database_url = os.getenv("DATABASE_URL")
if not database_url:
print("[Database] ❌ Error: DATABASE_URL environment variable is not set. Please set this variable.")
self._db_engine = None
self._SessionLocal = None
return
db_type = None
engine_args = {}
if database_url.startswith("sqlite"):
db_type = "SQLite"
engine_args = {
"connect_args": {"check_same_thread": False},
"echo": False,
}
elif database_url.startswith("postgres"):
db_type = "PostgreSQL"
if database_url.startswith("postgres://"):
database_url = database_url.replace(
"postgres://", "postgresql://", 1
)
print("[Database] Automatically converted postgres:// to postgresql://")
engine_args = {
"pool_pre_ping": True,
"pool_recycle": 3600,
"echo": False,
}
else:
print(
f"[Database] ❌ Error: Unsupported database type. DATABASE_URL must start with 'sqlite' or 'postgres'. Current value: {database_url}"
)
self._db_engine = None
self._SessionLocal = None
return
# Create database engine
self._db_engine = create_engine(database_url, **engine_args)
# Create session factory
self._SessionLocal = sessionmaker(
autocommit=False, autoflush=False, bind=self._db_engine
)
# Create table if it doesn't exist
Base.metadata.create_all(bind=self._db_engine)
print(f"[Database] ✅ Successfully connected to {db_type} and initialized the chat_summary table.")
except Exception as e:
print(f"[Database] ❌ Initialization failed: {str(e)}")
self._db_engine = None
self._SessionLocal = None
class Valves(BaseModel):
priority: int = Field(
default=10, description="Priority level for the filter operations."
)
compression_threshold: int = Field(
default=15, ge=0, description="The number of messages at which to trigger compression."
)
keep_first: int = Field(
default=1, ge=0, description="Always keep the first N messages. Set to 0 to disable."
)
keep_last: int = Field(default=6, ge=0, description="Always keep the last N messages.")
        summary_model: Optional[str] = Field(
            default=None,
            description="The model to use for generating the summary. If empty, uses the current conversation's model.",
        )
max_summary_tokens: int = Field(
default=4000, ge=1, description="The maximum number of tokens for the summary."
)
summary_temperature: float = Field(
default=0.3, ge=0.0, le=2.0, description="The temperature for summary generation."
)
debug_mode: bool = Field(default=True, description="Enable detailed logging for debugging.")
@model_validator(mode="after")
def check_thresholds(self) -> "Valves":
kept_count = self.keep_first + self.keep_last
if self.compression_threshold <= kept_count:
raise ValueError(
f"compression_threshold ({self.compression_threshold}) must be greater than "
f"the sum of keep_first ({self.keep_first}) and keep_last ({self.keep_last}) ({kept_count})."
)
return self
def _save_summary(self, chat_id: str, summary: str, body: dict):
"""Saves the summary to the database."""
if not self._SessionLocal:
if self.valves.debug_mode:
print("[Storage] Database not initialized, skipping summary save.")
return
try:
session = self._SessionLocal()
try:
# Find existing record
existing = (
session.query(ChatSummary).filter_by(chat_id=chat_id).first()
)
if existing:
# Update existing record
existing.summary = summary
existing.compressed_message_count = len(body.get("messages", []))
existing.updated_at = datetime.utcnow()
else:
# Create new record
new_summary = ChatSummary(
chat_id=chat_id,
summary=summary,
compressed_message_count=len(body.get("messages", [])),
)
session.add(new_summary)
session.commit()
if self.valves.debug_mode:
action = "Updated" if existing else "Created"
print(f"[Storage] Summary has been {action.lower()} in the database (Chat ID: {chat_id})")
finally:
session.close()
except Exception as e:
print(f"[Storage] ❌ Database save failed: {str(e)}")
def _load_summary(self, chat_id: str, body: dict) -> Optional[str]:
"""Loads the summary from the database."""
if not self._SessionLocal:
if self.valves.debug_mode:
print("[Storage] Database not initialized, cannot load summary.")
return None
try:
session = self._SessionLocal()
try:
record = (
session.query(ChatSummary).filter_by(chat_id=chat_id).first()
)
if record:
if self.valves.debug_mode:
print(f"[Storage] Loaded summary from database (Chat ID: {chat_id})")
print(
f"[Storage] Last updated: {record.updated_at}, Original message count: {record.compressed_message_count}"
)
return record.summary
finally:
session.close()
except Exception as e:
print(f"[Storage] ❌ Database read failed: {str(e)}")
return None
def _inject_summary_to_first_message(self, message: dict, summary: str) -> dict:
"""Injects the summary into the first message by prepending it."""
content = message.get("content", "")
summary_block = f"【Historical Conversation Summary】\n{summary}\n\n---\nBelow is the recent conversation:\n\n"
# Handle different content types
if isinstance(content, list): # Multimodal content
# Find the first text part and insert the summary before it
new_content = []
summary_inserted = False
for part in content:
if (
isinstance(part, dict)
and part.get("type") == "text"
and not summary_inserted
):
# Prepend summary to the first text part
new_content.append(
{"type": "text", "text": summary_block + part.get("text", "")}
)
summary_inserted = True
else:
new_content.append(part)
# If no text part, insert at the beginning
if not summary_inserted:
new_content.insert(0, {"type": "text", "text": summary_block})
message["content"] = new_content
elif isinstance(content, str): # Plain text
message["content"] = summary_block + content
return message
async def inlet(
self, body: dict, __user__: Optional[dict] = None, __metadata__: dict = None
) -> dict:
"""
Executed before sending to the LLM.
Compression Strategy:
1. Keep the first N messages.
2. Inject the summary into the first message (if keep_first > 0).
3. Keep the last N messages.
"""
messages = body.get("messages", [])
chat_id = __metadata__["chat_id"]
if self.valves.debug_mode:
print(f"\n{'='*60}")
print(f"[Inlet] Chat ID: {chat_id}")
print(f"[Inlet] Received {len(messages)} messages")
# [Optimization] Load summary in a background thread to avoid blocking the event loop.
if self.valves.debug_mode:
print("[Optimization] Loading summary in a background thread to avoid blocking the event loop.")
saved_summary = await asyncio.to_thread(self._load_summary, chat_id, body)
total_kept_count = self.valves.keep_first + self.valves.keep_last
if saved_summary and len(messages) > total_kept_count:
if self.valves.debug_mode:
print(f"[Inlet] Found saved summary, applying compression.")
first_messages_to_keep = []
if self.valves.keep_first > 0:
# Copy the initial messages to keep
first_messages_to_keep = [
m.copy() for m in messages[: self.valves.keep_first]
]
# Inject the summary into the very first message
first_messages_to_keep[0] = self._inject_summary_to_first_message(
first_messages_to_keep[0], saved_summary
)
else:
# If not keeping initial messages, create a new system message for the summary
summary_block = (
f"【Historical Conversation Summary】\n{saved_summary}\n\n---\nBelow is the recent conversation:\n\n"
)
first_messages_to_keep.append(
{"role": "system", "content": summary_block}
)
# Keep the last messages
last_messages_to_keep = (
messages[-self.valves.keep_last :] if self.valves.keep_last > 0 else []
)
# Combine: [Kept initial messages (with summary)] + [Kept recent messages]
body["messages"] = first_messages_to_keep + last_messages_to_keep
if self.valves.debug_mode:
print(f"[Inlet] ✂️ Compression complete:")
print(f" - Original messages: {len(messages)}")
print(f" - Compressed to: {len(body['messages'])}")
print(
f" - Structure: [Keep first {self.valves.keep_first} (with summary)] + [Keep last {self.valves.keep_last}]"
)
print(f" - Saved: {len(messages) - len(body['messages'])} messages")
else:
if self.valves.debug_mode:
if not saved_summary:
print(f"[Inlet] No summary found, using full conversation history.")
else:
print(f"[Inlet] Message count does not exceed retention threshold, no compression applied.")
if self.valves.debug_mode:
print(f"{'='*60}\n")
return body
async def outlet(
self, body: dict, __user__: Optional[dict] = None, __metadata__: dict = None
) -> dict:
"""
Executed after the LLM response is complete.
Triggers summary generation asynchronously.
"""
messages = body.get("messages", [])
chat_id = __metadata__["chat_id"]
if self.valves.debug_mode:
print(f"\n{'='*60}")
print(f"[Outlet] Chat ID: {chat_id}")
print(f"[Outlet] Response complete, current message count: {len(messages)}")
# Check if compression is needed
if len(messages) >= self.valves.compression_threshold:
if self.valves.debug_mode:
print(
f"[Outlet] ⚡ Compression threshold reached ({len(messages)} >= {self.valves.compression_threshold})"
)
print(f"[Outlet] Preparing to generate summary in the background...")
# Generate summary asynchronously in the background
asyncio.create_task(
self._generate_summary_async(messages, chat_id, body, __user__)
)
else:
if self.valves.debug_mode:
print(
f"[Outlet] Compression threshold not reached ({len(messages)} < {self.valves.compression_threshold})"
)
if self.valves.debug_mode:
print(f"{'='*60}\n")
return body
async def _generate_summary_async(
self, messages: list, chat_id: str, body: dict, user_data: Optional[dict]
):
"""
Generates a summary asynchronously in the background.
"""
try:
if self.valves.debug_mode:
print(f"\n[🤖 Async Summary Task] Starting...")
# Messages to summarize: exclude kept initial and final messages
if self.valves.keep_last > 0:
messages_to_summarize = messages[
self.valves.keep_first : -self.valves.keep_last
]
else:
messages_to_summarize = messages[self.valves.keep_first :]
if len(messages_to_summarize) == 0:
if self.valves.debug_mode:
print(f"[🤖 Async Summary Task] No messages to summarize, skipping.")
return
if self.valves.debug_mode:
print(f"[🤖 Async Summary Task] Preparing to summarize {len(messages_to_summarize)} messages.")
print(
f"[🤖 Async Summary Task] Protecting: First {self.valves.keep_first} + Last {self.valves.keep_last} messages."
)
# Build conversation history text
conversation_text = self._format_messages_for_summary(messages_to_summarize)
# Call LLM to generate summary
summary = await self._call_summary_llm(conversation_text, body, user_data)
# [Optimization] Save summary in a background thread to avoid blocking the event loop.
if self.valves.debug_mode:
print("[Optimization] Saving summary in a background thread to avoid blocking the event loop.")
await asyncio.to_thread(self._save_summary, chat_id, summary, body)
if self.valves.debug_mode:
print(f"[🤖 Async Summary Task] ✅ Complete! Summary length: {len(summary)} characters.")
print(f"[🤖 Async Summary Task] Summary preview: {summary[:150]}...")
except Exception as e:
print(f"[🤖 Async Summary Task] ❌ Error: {str(e)}")
import traceback
traceback.print_exc()
            # Save a simple placeholder even on failure; derive the count from
            # the full message list in case the failure happened before slicing.
            fallback_count = max(
                len(messages) - self.valves.keep_first - self.valves.keep_last, 0
            )
            fallback_summary = f"[Historical Conversation Summary] Contains content from approximately {fallback_count} messages."
# [Optimization] Save summary in a background thread to avoid blocking the event loop.
        if self.valves.debug_mode:
            print("[Optimization] Saving summary in a background thread to avoid blocking the event loop.")
        await asyncio.to_thread(self._save_summary, chat_id, fallback_summary, body)

    def _format_messages_for_summary(self, messages: list) -> str:
        """Formats messages for summarization."""
        formatted = []
        for i, msg in enumerate(messages, 1):
            role = msg.get("role", "unknown")
            content = msg.get("content", "")
            # Handle multimodal content
            if isinstance(content, list):
                text_parts = []
                for part in content:
                    if isinstance(part, dict) and part.get("type") == "text":
                        text_parts.append(part.get("text", ""))
                content = " ".join(text_parts)
            # Map the role to a display name
            role_name = {"user": "User", "assistant": "Assistant"}.get(role, role)
            # Truncate long messages to keep the prompt compact
            if len(content) > 500:
                content = content[:500] + "..."
            formatted.append(f"[{i}] {role_name}: {content}")
        return "\n\n".join(formatted)

    async def _call_summary_llm(
        self, conversation_text: str, body: dict, user_data: dict
    ) -> str:
        """
        Calls the LLM to generate a summary using Open WebUI's built-in method.
        """
        if self.valves.debug_mode:
            print("[🤖 LLM Call] Using Open WebUI's built-in method.")
        # Build the summary prompt
        summary_prompt = f"""
You are a professional conversation context compression assistant. Your task is to perform a high-fidelity compression of the [Conversation Content] below, producing a concise summary that can be used directly as context for subsequent conversation. Strictly adhere to the following requirements:
MUST RETAIN: Topics/goals, user intent, key facts and data, important parameters and constraints, deadlines, decisions/conclusions, action items and their status, and technical details like code/commands (code must be preserved as is).
REMOVE: Greetings, politeness, repetitive statements, off-topic chatter, and procedural details (unless essential). For information that has been overturned or is outdated, please mark it as "Obsolete: <explanation>" when retaining.
CONFLICT RESOLUTION: If there are contradictions or multiple revisions, retain the latest consistent conclusion and list unresolved or conflicting points under "Points to Clarify".
STRUCTURE AND TONE: Output in structured bullet points. Be logical, objective, and concise. Summarize from a third-person perspective. Use code blocks to preserve technical/code snippets verbatim.
OUTPUT LENGTH: Strictly limit the summary content to within {int(self.valves.max_summary_tokens * 3)} characters. Prioritize key information; if space is insufficient, trim details rather than core conclusions.
FORMATTING: Output only the summary text. Do not add any extra explanations, execution logs, or generation processes. You must use the following headings (if a section has no content, write "None"):
Core Theme:
Key Information:
... (List 3-6 key points)
Decisions/Conclusions:
Action Items (with owner/deadline if any):
Relevant Roles/Preferences:
Risks/Dependencies/Assumptions:
Points to Clarify:
Compression Ratio: Original ~X words → Summary ~Y words (estimate)
Conversation Content:
{conversation_text}
Please directly output the compressed summary that meets the above requirements (summary text only).
"""
        # Determine the model to use
        model = self.valves.summary_model or body.get("model", "")
        if self.valves.debug_mode:
            print(f"[🤖 LLM Call] Model: {model}")
        # Build the payload
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": summary_prompt}],
            "stream": False,
            "max_tokens": self.valves.max_summary_tokens,
            "temperature": self.valves.summary_temperature,
        }
        try:
            # Resolve the user object
            user_id = user_data.get("id") if user_data else None
            if not user_id:
                raise ValueError("Could not get user ID")
            # [Optimization] Fetch the user object in a background thread to avoid blocking the event loop.
            if self.valves.debug_mode:
                print("[Optimization] Getting user object in a background thread to avoid blocking the event loop.")
            user = await asyncio.to_thread(Users.get_user_by_id, user_id)
            if not user:
                raise ValueError(f"Could not find user: {user_id}")
            if self.valves.debug_mode:
                print(f"[🤖 LLM Call] User: {user.email}")
                print("[🤖 LLM Call] Sending request...")
            # Create a Request object
            request = Request(scope={"type": "http", "app": webui_app})
            # Call generate_chat_completion
            response = await generate_chat_completion(request, payload, user)
            if not response or "choices" not in response or not response["choices"]:
                raise ValueError("LLM response is not in the correct format or is empty")
            summary = response["choices"][0]["message"]["content"].strip()
            if self.valves.debug_mode:
                print("[🤖 LLM Call] ✅ Successfully received summary.")
            return summary
        except Exception as e:
            error_message = f"An error occurred while calling the LLM ({model}) to generate a summary: {str(e)}"
            if not self.valves.summary_model:
                error_message += (
                    "\n[Hint] You did not specify a summary_model, so the filter attempted to use the current conversation's model. "
                    "If this is a pipeline (Pipe) model or an incompatible model, please specify a compatible summary model (e.g., 'gemini-2.5-flash') in the configuration."
                )
            if self.valves.debug_mode:
                print(f"[🤖 LLM Call] ❌ {error_message}")
            raise Exception(error_message)


@@ -0,0 +1,77 @@
# Async Context Compression Filter
**Author:** [Fu-Jie](https://github.com/Fu-Jie) | **Version:** 1.0.0 | **License:** MIT
> **Note**: To keep all filters maintainable and easy to use, every filter should ship with clear, complete documentation covering its functionality, configuration, and usage.
This filter significantly reduces token consumption in long conversations through intelligent summarization and message compression, while preserving conversational coherence.
---
## Core Features
- **Automatic compression**: Triggers context compression automatically based on a message-count threshold.
- **Asynchronous summarization**: Generates summaries in the background without blocking the current response.
- **Persistent storage**: Supports PostgreSQL and SQLite, so summaries survive service restarts.
- **Flexible retention policy**: Freely configure how many messages to keep at the head and tail of the conversation, preserving key information and context continuity.
- **Smart injection**: Injects the generated history summary into the new context.
---
## Installation & Configuration
### 1. Environment Variables
This plugin depends on a database; you **must** set `DATABASE_URL` in Open WebUI's environment variables.
- **PostgreSQL example**:
```
DATABASE_URL=postgresql://user:password@host:5432/openwebui
```
- **SQLite example**:
```
DATABASE_URL=sqlite:///path/to/your/data/webui.db
```
### 2. Filter Order
It is recommended to give this filter a relatively high priority (a smaller number) so that it runs before other filters that may modify message content. A typical order might be:
1. **Pre-filters (priority < 10)**
   - e.g., filters that inject system-level prompts.
2. **This compression filter (priority = 10)**
3. **Post-filters (priority > 10)**
   - e.g., filters that format the final output.
---
## Configuration Parameters
You can adjust the following parameters in the filter's settings:
| Parameter | Default | Description |
| :--- | :--- | :--- |
| `priority` | `10` | Filter execution order; smaller values run earlier. |
| `compression_threshold` | `15` | When the total message count reaches this value, summary generation is triggered in the background. |
| `keep_first` | `1` | Always keep the first N messages. The first message usually contains an important system prompt. |
| `keep_last` | `6` | Always keep the last N messages to preserve context continuity. |
| `summary_model` | `None` | Model used to generate summaries. **Strongly recommended**: configure a fast, economical, compatible model (e.g., `gemini-2.5-flash`). If left empty, the current conversation's model is used, which may fail for incompatible (e.g., Pipe) models. |
| `max_summary_tokens` | `4000` | Maximum number of tokens allowed when generating a summary. |
| `summary_temperature` | `0.3` | Controls the randomness of summary generation; lower values give more stable results. |
| `debug_mode` | `true` | Whether to print detailed debug information to the logs. `false` is recommended in production. |
---
## Troubleshooting
- **Problem: database connection fails**
  - **Fix**: Confirm that the `DATABASE_URL` environment variable is set correctly and that the database service is running.
- **Problem: summary is not generated**
  - **Fix**: Check whether `compression_threshold` has been reached and that `summary_model` is configured correctly. Check the logs for detailed errors.
- **Problem: the initial system prompt is lost**
  - **Fix**: Make sure `keep_first` is greater than 0 so that the initial message containing important information is retained.
- **Problem: compression has little effect**
  - **Fix**: Try raising `compression_threshold`, or lowering `keep_first` / `keep_last`.


@@ -0,0 +1,662 @@
# Async Context Compression Filter – Workflow Guide
## 📋 Table of Contents
1. [Overview](#overview)
2. [System Architecture](#system-architecture)
3. [Workflow in Detail](#workflow-in-detail)
4. [Token Counting](#token-counting)
5. [Recursive Summarization](#recursive-summarization)
6. [Configuration Guide](#configuration-guide)
7. [Best Practices](#best-practices)
---
## Overview
The async context compression filter is a high-performance message compression plugin that reduces token consumption in long conversations by:
- **Intelligent summarization**: compresses historical messages into a high-fidelity summary
- **Recursive updates**: merges the previous summary into each new one, keeping history coherent
- **Asynchronous processing**: generates summaries in the background without blocking the user's response
- **Flexible configuration**: supports global and per-model threshold settings
### Key Metrics
- **Compression ratio**: up to 65%+ (depending on conversation length)
- **Response time**: <10ms in the inlet stage (no heavy computation)
- **Summary quality**: high-fidelity recursive summaries that preserve key information
---
## System Architecture
```
┌─────────────────────────────────────────────────────┐
│                 User request flow                   │
└────────────────┬────────────────────────────────────┘
                 │
    ┌────────────▼──────────────┐
    │ inlet (pre-request)       │
    │ ├─ Load summary record    │
    │ ├─ Inject summary into    │
    │ │  the first message      │
    │ └─ Return compressed list │ ◄─ fast return (<10ms)
    └────────────┬──────────────┘
                 │
    ┌────────────▼──────────────┐
    │ LLM processing            │
    │ ├─ Call the language model│
    │ └─ Generate the reply     │
    └────────────┬──────────────┘
                 │
    ┌────────────▼──────────────┐
    │ outlet (post-response)    │
    │ ├─ Start background task  │
    │ └─ Return immediately     │ ◄─ response goes to the user
    └────────────┬──────────────┘
                 │
    ┌────────────▼──────────────┐
    │ Background (asyncio task) │
    │ ├─ Count tokens           │
    │ ├─ Check threshold        │
    │ ├─ Generate recursive     │
    │ │  summary                │
    │ └─ Save to database       │
    └────────────┬──────────────┘
                 │
    ┌────────────▼──────────────┐
    │ Database persistence      │
    │ ├─ Summary content        │
    │ ├─ Compression progress   │
    │ └─ Timestamps             │
    └───────────────────────────┘
```
---
## Workflow in Detail
### 1⃣ inlet stage: summary injection and compressed-view construction
**Goal**: quickly apply the existing summary and build a compressed message view
**Flow**:
```
Input: full message list
    │
    ├─► Load summary record from the database
    │     ├─► found ✓
    │     └─► not found
    │
    ├─► Summary exists?
    │     ├─ Yes ─► build compressed view: [head] + [tail]
    │     └─ No ──► use the original message list
    │
    │     Combined messages:
    │       • head (keep_first)
    │       • summary injected into the first message
    │       • tail (keep_last)
    │
    └─────► return the compressed message list
            ⏱️ takes <10ms
```
**Key parameters**:
- `keep_first`: keep the first N messages (default 1)
- `keep_last`: keep the last N messages (default 6)
- Injection point: prepended to the first message's content
**Example**:
```python
# Original: 20 messages
Message 1:      [system prompt]
Messages 2-14:  [historical conversation]
Messages 15-20: [recent conversation]
# After inlet (summary exists): 7 messages
Message 1:      [system prompt + history summary...]  # summary injected
Messages 15-20: [recent conversation]                 # last 6 kept
```
---
### 2⃣ outlet stage: background asynchronous processing
**Goal**: count tokens, check the threshold, and generate the summary without blocking the response
**Flow**:
```
LLM response completes
  └─► outlet is called
        └─► start a background task (asyncio.create_task)
              ├─► return to the user immediately ✓
              │     (does not wait for the background task)
              └─► background: _check_and_generate_summary_async
                    ├─► count tokens in a background thread
                    │     (await asyncio.to_thread)
                    ├─► resolve the model's threshold config
                    │     • prefer the entry in model_thresholds
                    │     • fall back to the global compression_threshold_tokens
                    ├─► check whether compression is triggered
                    │     if current_tokens >= threshold:
                    └─► run the summary-generation flow
```
**Timeline**:
```
├─ T0: LLM response completes
├─ T1: outlet is called
│    └─► background task started
│          └─► returns immediately ✓
├─ T2: user receives the response ✓✓✓
└─ T3-T10: background task runs
     ├─ count tokens
     ├─ check threshold
     ├─ call the LLM to generate a summary
     └─ save to the database
```
**Key properties**:
- ✅ The user's response is unaffected
- ✅ Token counting does not block the request
- ✅ Summary generation runs asynchronously
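A minimal sketch of this fire-and-forget pattern. The function names are illustrative and the threshold check is simplified; the real filter does considerably more work in the background task.

```python
import asyncio

async def check_and_summarize(body: dict) -> None:
    # Heavy work runs off the request path; blocking parts (token counting)
    # are pushed to a worker thread via asyncio.to_thread.
    tokens = await asyncio.to_thread(
        lambda: sum(len(str(m)) // 4 for m in body["messages"])
    )
    if tokens >= 64000:
        pass  # generate and persist the summary here

async def outlet(body: dict) -> dict:
    # Schedule the background task and return to the user immediately.
    asyncio.create_task(check_and_summarize(body))
    return body

async def main() -> dict:
    body = {"messages": [{"role": "user", "content": "hi"}]}
    result = await outlet(body)   # returns without waiting
    await asyncio.sleep(0.1)      # give the background task time to finish
    return result

print(asyncio.run(main()))
```

The key point is that `outlet` never awaits `check_and_summarize`; the user's response and the summarization proceed independently.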
---
### 3⃣ Token counting and threshold check
**Flow**:
```
Background thread runs _check_and_generate_summary_async
  ├─► Step 1: count the current total tokens
  │     ├─ iterate over all messages
  │     ├─ handle multimodal content (extract the text parts)
  │     ├─ count with the o200k_base encoding
  │     └─ return total_tokens
  ├─► Step 2: resolve the model-specific thresholds
  │     ├─ model ID, e.g. gpt-4
  │     ├─ look it up in model_thresholds
  │     ├─ entry exists?
  │     │    ├─ yes ✓ use that entry
  │     │    └─ no  ✓ use the global settings
  │     ├─ compression_threshold_tokens (default 64000)
  │     └─ max_context_tokens (default 128000)
  └─► Step 3: check whether compression is triggered
        if current_tokens >= compression_threshold_tokens:
          └─► generate the summary
        else:
          └─► no compression needed; task ends
```
**Token counting details**:
```python
def _count_tokens(text):
    if tiktoken_available:
        # Use o200k_base (a single, unified encoding)
        encoding = tiktoken.get_encoding("o200k_base")
        return len(encoding.encode(text))
    else:
        # Fallback: character-based estimate
        return len(text) // 4
```
**Threshold precedence**:
```
Priority 1: model_thresholds["gpt-4"]
Priority 2: model_thresholds["gemini-2.5-flash"]
Priority 3: global compression_threshold_tokens
```
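A self-contained version of the counting scheme sketched above. It assumes the `o200k_base` encoding name from tiktoken and falls back to the 4-characters-per-token estimate whenever tiktoken is unavailable or fails.

```python
def count_tokens(text: str) -> int:
    """Count tokens with tiktoken's o200k_base encoding, falling back to a
    rough 4-characters-per-token estimate when tiktoken is unavailable."""
    try:
        import tiktoken
        encoding = tiktoken.get_encoding("o200k_base")
        return len(encoding.encode(text))
    except Exception:
        # Fallback: coarse but dependency-free estimate
        return len(text) // 4

print(count_tokens("The quick brown fox jumps over the lazy dog."))
```

Because both paths return an integer with the same rough magnitude, the threshold check behaves sensibly whichever path is taken.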
---
### 4⃣ Recursive summary generation
**Core mechanism**: merge the old summary with the new messages to produce an updated summary
**Flow**:
```
_generate_summary_async is triggered
  ├─► Step 1: load the old summary
  │     ├─ query the database
  │     ├─ get previous_summary
  │     └─ get compressed_message_count (progress of the last compression)
  ├─► Step 2: determine the range of messages to compress
  │     ├─ start_index = max(compressed_count, keep_first)
  │     ├─ end_index = len(messages) - keep_last
  │     ├─ extract messages[start_index:end_index]
  │     └─ this is the "new conversation" slice
  ├─► Step 3: build the LLM prompt
  │     ├─ [existing summary] = previous_summary
  │     ├─ [new conversation] = formatted new messages
  │     └─ prompt template:
  │          "Merge the [existing summary] and the [new conversation]..."
  ├─► Step 4: call the LLM to generate the summary
  │     ├─ model: summary_model (if configured) or the current model
  │     ├─ parameters:
  │     │    • max_tokens = max_summary_tokens (default 4000)
  │     │    • temperature = summary_temperature (default 0.3)
  │     │    • stream = False
  │     └─ returns new_summary
  ├─► Step 5: save the summary to the database
  │     ├─ update the chat_summary table
  │     ├─ summary = new_summary
  │     ├─ compressed_message_count = end_index
  │     └─ updated_at = now()
  └─► Step 6: log the result
        └─ summary length, compression progress, elapsed time, etc.
```
**Recursive summarization example**:
```
First round:
  Old summary:  none
  New messages: messages 2-14 (13 messages)
  Generates:    Summary_V1
  Saves:        compressed_message_count = 14
Second round:
  Old summary:  Summary_V1
  New messages: messages 15-28 (starting from 14)
  Generates:    Summary_V2 = LLM(Summary_V1 + messages 14-28)
  Saves:        compressed_message_count = 28
Result:
  ✓ Early information is preserved (via Summary_V1)
  ✓ New information is merged with the old summary
  ✓ Historical coherence is maintained
```
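The bookkeeping behind these rounds reduces to a small range computation. `next_compression_range` is an illustrative name, not the filter's actual API:

```python
def next_compression_range(total_messages, compressed_count, keep_first=1, keep_last=6):
    """Pick the slice of new messages to fold into the previous summary.
    Returns (start, end) for messages[start:end], or None if nothing new."""
    start = max(compressed_count, keep_first)   # resume after last compression
    end = total_messages - keep_last            # never touch the protected tail
    return (start, end) if end > start else None

# First round: nothing compressed yet, 20 messages in the chat.
print(next_compression_range(20, 0))    # (1, 14)
# Second round: 14 messages already folded in, chat has grown to 34.
print(next_compression_range(34, 14))   # (14, 28)
```

The returned `end` index is what gets persisted as `compressed_message_count`, so the next round resumes exactly where this one stopped.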
---
## Token Counting
### Encoding scheme
```
┌─────────────────────────────────┐
│ _count_tokens(text)             │
├─────────────────────────────────┤
│ 1. tiktoken available?          │
│    ├─ yes ✓                     │
│    │   └─ use o200k_base        │
│    │      (fits recent models)  │
│    │                            │
│    └─ no ✓                      │
│        └─ character estimate    │
│           (1 token ≈ 4 chars)   │
└─────────────────────────────────┘
```
### Multimodal content handling
```python
# Message structure
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the image..."},
        {"type": "image_url", "image_url": {...}},
        {"type": "text", "text": "More detail..."}
    ]
}
# Token counting:
# extract all text parts → join → count
# image parts are ignored (they consume no text tokens)
```
### Counting flow
```
_calculate_messages_tokens(messages, model)
  ├─► iterate over each message
  │     ├─ content is a list?
  │     │    ├─ yes ✓ extract all text parts
  │     │    └─ no  ✓ use it directly
  │     └─ _count_tokens(content)
  └─► sum all token counts
```
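The text-extraction step above can be sketched like this; it is a simplified stand-in for the filter's multimodal handling:

```python
def extract_text(content):
    """Pull the text parts out of a message's content.
    Image parts are ignored, matching the counting flow above."""
    if isinstance(content, list):
        return " ".join(
            p.get("text", "") for p in content
            if isinstance(p, dict) and p.get("type") == "text"
        )
    return content if isinstance(content, str) else str(content)

msg = {"role": "user", "content": [
    {"type": "text", "text": "Describe the image..."},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    {"type": "text", "text": "More detail..."},
]}
print(extract_text(msg["content"]))  # Describe the image... More detail...
```

Feeding the extracted string into a token counter gives the per-message count that `_calculate_messages_tokens` then sums.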
---
## Recursive Summarization
### How historical coherence is preserved
```
Traditional compression (problematic):
Timeline:
  messages 1-50   ─► summary 1 ─► keep [summary 1 + messages 45-50]
  messages 51-100 ─► summary 2 ─► keep [summary 2 + messages 95-100]
                         └─► ❌ summary 1 is lost! Early information is unrecoverable
Recursive summarization (this implementation):
Timeline:
  messages 1-50               ──► summary 1 ──► save
  summary 1 + messages 51-100 ──► summary 2 ──► save
                         └─► ✓ summary 1 is folded into summary 2
                             ✓ history is preserved coherently
```
### Mechanism
```
inlet stage:
  query the summary store
    ├─ previous_summary (existing summary)
    └─ compressed_message_count (compression progress)
outlet stage:
  if current_tokens >= threshold:
    ├─ new message range:
    │    [compressed_message_count : len(messages) - keep_last]
    └─ LLM call:
         input:  previous_summary + new messages
         output: updated summary (early + new information)
  save progress:
    └─ compressed_message_count = end_index
       (the next compression starts here)
```
---
## Configuration Guide
### Global configuration
```python
Valves(
    # Token thresholds
    compression_threshold_tokens=64000,  # triggers compression
    max_context_tokens=128000,           # hard upper limit
    # Message retention policy
    keep_first=1,   # keep the first message (system prompt)
    keep_last=6,    # keep the last 6 messages (recent conversation)
    # Summary model
    summary_model="gemini-2.5-flash",  # fast and economical
    # Summary parameters
    max_summary_tokens=4000,
    summary_temperature=0.3,
)
```
### Per-model configuration
```python
model_thresholds = {
    "gpt-4": {
        "compression_threshold_tokens": 8000,
        "max_context_tokens": 32000
    },
    "gemini-2.5-flash": {
        "compression_threshold_tokens": 10000,
        "max_context_tokens": 40000
    },
    "llama-70b": {
        "compression_threshold_tokens": 20000,
        "max_context_tokens": 80000
    }
}
```
### Choosing a configuration
```
Scenario 1: long conversations, cost optimization
  compression_threshold_tokens: 32000   ◄─ trigger earlier
  keep_last: 4                          ◄─ keep fewer messages
Scenario 2: quality first
  compression_threshold_tokens: 100000  ◄─ trigger later
  keep_last: 10                         ◄─ keep more messages
  max_summary_tokens: 8000              ◄─ more detailed summary
Scenario 3: balanced (recommended)
  compression_threshold_tokens: 64000   ◄─ default
  keep_last: 6                          ◄─ default
  summary_model: "gemini-2.5-flash"     ◄─ fast and economical
```
---
## Best Practices
### 1⃣ Choosing a summary model
```
Recommended:
  ✅ gemini-2.5-flash   fast, economical, good quality
  ✅ deepseek-v3        low cost, fast
  ✅ gpt-4o-mini        general-purpose, stable quality
Avoid:
  ❌ pipeline (Pipe) models   may not support the standard API
  ❌ local models             prone to timeouts; hurts the experience
```
### 2⃣ Threshold tuning
```
Verifying token counts:
  1. Enable debug_mode
  2. Observe the actual token counts
  3. Adjust the thresholds as needed
# Example log
[🔍 Background] Token count: 45320
[🔍 Background] Compression threshold not reached (tokens: 45320 < 64000)
```
### 3⃣ Message retention policy
```
keep_first:
  typical value: 1 (keep the system prompt)
  some scenarios: 0 (system prompt lives in the summary)
keep_last:
  typical value: 6 (keep the recent conversation)
  long conversations: 8-10 (more recent context)
  short conversations: 3-4 (save tokens)
```
### 4⃣ Monitoring and maintenance
```
Key metrics:
  • summary generation latency
  • token savings rate
  • summary quality (judged by conversation experience)
Database maintenance:
-- Periodically clean up stale summaries
DELETE FROM chat_summary
WHERE updated_at < NOW() - INTERVAL '30 days'
-- Measure compression effectiveness
SELECT
    COUNT(*) as total_summaries,
    AVG(compressed_message_count) as avg_compressed
FROM chat_summary
```
### 5⃣ Troubleshooting
```
Problem: summary not generated
Checks:
  1. Has the token count reached the threshold?
     → enable debug_mode and check the logs
  2. Is summary_model configured correctly?
     → make sure the model exists and is available
  3. Is the database connection healthy?
     → check DATABASE_URL
Problem: inlet responses are slow
Checks:
  1. Are keep_first/keep_last too large?
  2. Is the summary itself too large?
  3. Are there too many messages?
Problem: summary quality degrades
Adjustments:
  1. Increase max_summary_tokens
  2. Lower summary_temperature (more deterministic)
  3. Switch to a different summary model
```
---
## Performance Reference
### Time overhead
```
inlet stage:
  ├─ database query:    1-2ms
  ├─ summary injection: 2-3ms
  └─ total: <10ms ✓ (no impact on the user experience)
outlet stage:
  ├─ start background task: <1ms
  └─ return immediately: ✓ (no waiting)
Background processing (does not block the user):
  ├─ token counting:  10-50ms
  ├─ LLM call:        1-5s
  ├─ database save:   1-2ms
  └─ total: 1-6s (runs in the background)
```
### Token savings example
```
Scenario: a 20-message conversation
Uncompressed:
  messages: 20
  estimated tokens: 8000
Compressed (keep_first=1, keep_last=6):
  head messages: 1 (1600 tokens)
  summary: ~800 tokens (embedded in the head)
  tail messages: 6 (3200 tokens)
  total: 7 effective input messages (~5600 tokens)
Savings: 8000 - 5600 = 2400 tokens (30%)
As the conversation grows, savings can exceed 65%.
```
---
## Data Flow
```
User message
  ↓
[inlet] summary injector
  ├─ database ← query the summary
  ├─ inject the summary into the first message
  └─ return the compressed message list
  ↓
LLM processing
  ├─ call the language model
  ├─ generate the response
  └─ return it to the user ✓✓✓
  ↓
[outlet] background processing (asyncio task)
  ├─ count tokens
  ├─ check the threshold
  ├─ [if needed] call the LLM to generate a summary
  │    ├─ load the old summary
  │    ├─ extract the new messages
  │    ├─ build the prompt
  │    └─ call the LLM
  ├─ save the new summary to the database
  └─ log the result
  ↓
Database persistence
  └─ chat_summary table updated
```
---
## Summary
| Stage | Responsibility | Time | Characteristics |
|------|------|------|------|
| **inlet** | summary injection | <10ms | fast, no heavy computation |
| **LLM** | generate the reply | variable | normal flow |
| **outlet** | start the background task | <1ms | does not block the response |
| **Background** | token counting, summary generation, persistence | 1-6s | asynchronous |
**Core advantages**:
- ✅ The user's response is unaffected
- ✅ Token consumption drops significantly
- ✅ History is preserved coherently
- ✅ Flexible configuration options

File diff suppressed because it is too large.


@@ -0,0 +1,45 @@
Requirements: Async Context Compression Plugin Optimization
1. Core goal: Upgrade the existing message-count-based compression logic to token-count-based logic, and introduce recursive summarization, in order to control the context window more precisely, improve summary quality, and prevent loss of historical information.
2. Functional requirements
Token counting and threshold control
- Introduce tiktoken: use the tiktoken library for accurate token counting. If the environment does not support it, fall back to a character estimate (1 token ≈ 4 chars).
- New configuration parameters (Valves):
  - compression_threshold_tokens (default: 64000): when the total context token count exceeds this value, trigger compression (summary generation).
  - max_context_tokens (default: 128000): hard upper limit on the context. If exceeded, forcibly remove the earliest messages (except protected ones).
  - model_thresholds (dict): supports per-model thresholds, e.g. {'gpt-4': {'compression_threshold_tokens': 8000, ...}}.
- Deprecate the old parameter: compression_threshold (message-count based) is marked deprecated; token thresholds take precedence.
Recursive summarization
- Mechanism: when generating a new summary, the previous summary must be read and included.
- Logic: new summary = LLM(previous summary + newly produced conversation messages).
- Purpose: prevent the earliest summarized information from being dropped as the conversation grows, ensuring continuity of long-term memory.
Message protection and trimming strategy
- Protection: messages covered by keep_first (first N) and keep_last (last N) are never compressed and never removed.
- Trimming: when the max_context_tokens limit is hit, remove the earliest messages after keep_first and before keep_last first.
Prompt engineering
- Goal: remove noise (greetings, repetition) and retain key signals (facts, code, decisions).
- Instructions:
  - Distill and clean: explicitly remove noise.
  - Key retention: code snippets must be preserved verbatim.
  - Merge and update: explicitly merge new information into the old summary.
  - Language consistency: the output language must match the conversation language.
3. Implementation details
- File: async_context_compression.py
- Class: Filter
- Key methods:
  - _count_tokens(text): token counting.
  - _calculate_messages_tokens(messages): total tokens for a message list.
  - _generate_summary_async(...): modified to load the old summary and pass it to the LLM.
  - _call_summary_llm(...): updated prompt; accepts previous_summary and new_messages.
  - inlet(...): uses compression_threshold_tokens to decide whether to inject the summary; implements forced trimming for max_context_tokens.
  - outlet(...): uses compression_threshold_tokens to decide whether to start the background summary task.


@@ -0,0 +1,572 @@
"""
title: Context & Model Enhancement Filter
author: Fu-Jie
author_url: https://github.com/Fu-Jie
funding_url: https://github.com/Fu-Jie/awesome-openwebui
version: 0.2
description:
    A comprehensive Filter plugin for enriching request context and adapting model features. It provides four core capabilities:
    1. Environment variable injection: automatically prepends the user's environment variables (name, time, time zone, language, etc.) to the first user message
       - Supports plain-text, image, and multimodal messages
       - Idempotent design that avoids duplicate injection
       - Emits a frontend status notice on successful injection
    2. Web Search improvements: optimizes web search for specific models
       - Adds search capability for Alibaba Cloud Qwen, DeepSeek, Gemini, and similar models
       - Automatically recognizes the model and appends a "-search" suffix
       - Manages the feature toggle to prevent conflicts
       - Emits a search-capability status notice when enabled
    3. Model adaptation and context injection: injects context such as chat_id for specific models
       - Special handling for models such as cfchatqwen and webgemini
       - Dynamic model redirection
       - Intelligent model recognition and adaptation
    4. Smart content normalization: a production-grade content cleaning and repair system
       - Repairs broken code blocks (prefix, suffix, indentation)
       - Normalizes LaTeX formula delimiters (inline/block)
       - Normalizes chain-of-thought tags (</thought>)
       - Auto-closes unterminated code blocks
       - Smart list-format fixes
       - Cleans up leftover XML tags
       - Configurable rule system
features:
    - Automated environment variable management
    - Smart model-feature adaptation
    - Asynchronous status feedback
    - Idempotency guarantees
    - Multi-model support
    - Smart content cleaning and normalization
"""
from pydantic import BaseModel, Field
from typing import Optional, List, Callable
import re
import logging
from dataclasses import dataclass, field
# Logging configuration
logger = logging.getLogger(__name__)


@dataclass
class NormalizerConfig:
    """Normalization config: dynamically enables/disables individual rules."""
    enable_escape_fix: bool = True             # fix over-escaped characters
    enable_thought_tag_fix: bool = True        # fix chain-of-thought tags
    enable_code_block_fix: bool = True         # fix code-block formatting
    enable_latex_fix: bool = True              # fix LaTeX formula delimiters
    enable_list_fix: bool = False              # fix list line breaks
    enable_unclosed_block_fix: bool = True     # fix unclosed code blocks
    enable_fullwidth_symbol_fix: bool = False  # fix full-width symbols inside code
    enable_xml_tag_cleanup: bool = True        # clean up leftover XML tags
    # Custom cleaner functions (advanced extension point)
    custom_cleaners: List[Callable[[str], str]] = field(default_factory=list)
class ContentNormalizer:
    """LLM output content normalizer – production-grade implementation."""

    # --- 1. Precompiled regular expressions (performance) ---
    _PATTERNS = {
        # Code-block prefix: ``` that is neither at line start nor after a newline
        'code_block_prefix': re.compile(r'(?<!^)(?<!\n)(```)', re.MULTILINE),
        # Code-block suffix: ```lang immediately followed by non-whitespace (no newline).
        # Matches "```python code" but not "```python" or "```python\n".
        'code_block_suffix': re.compile(r'(```[\w\+\-\.]*)[ \t]+([^\n\r])'),
        # Code-block indentation: leading whitespace before ```
        'code_block_indent': re.compile(r'^[ \t]+(```)', re.MULTILINE),
        # Chain-of-thought tag: </thought> possibly followed by spaces/newlines
        'thought_tag': re.compile(r'</thought>[ \t]*\n*'),
        # LaTeX display formula: \[ ... \]
        'latex_bracket_block': re.compile(r'\\\[(.+?)\\\]', re.DOTALL),
        # LaTeX inline formula: \( ... \)
        'latex_paren_inline': re.compile(r'\\\((.+?)\\\)'),
        # List item: non-newline + digits + dot + space (e.g. "Text1. Item")
        'list_item': re.compile(r'([^\n])(\d+\. )'),
        # Leftover XML tags (e.g. Claude artifacts)
        'xml_artifacts': re.compile(r'</?(?:antArtifact|antThinking|artifact)[^>]*>', re.IGNORECASE),
    }
    def __init__(self, config: Optional[NormalizerConfig] = None):
        self.config = config or NormalizerConfig()
        self.applied_fixes = []

    def normalize(self, content: str) -> str:
        """Entry point: applies all normalization rules in order."""
        self.applied_fixes = []
        if not content:
            return content
        try:
            # 1. Escape-character fixes (must run first; later regexes depend on it)
            if self.config.enable_escape_fix:
                original = content
                content = self._fix_escape_characters(content)
                if content != original:
                    self.applied_fixes.append("fixed escape characters")
            # 2. Chain-of-thought tag normalization
            if self.config.enable_thought_tag_fix:
                original = content
                content = self._fix_thought_tags(content)
                if content != original:
                    self.applied_fixes.append("normalized thought tags")
            # 3. Code-block formatting fixes
            if self.config.enable_code_block_fix:
                original = content
                content = self._fix_code_blocks(content)
                if content != original:
                    self.applied_fixes.append("fixed code-block formatting")
            # 4. LaTeX formula normalization
            if self.config.enable_latex_fix:
                original = content
                content = self._fix_latex_formulas(content)
                if content != original:
                    self.applied_fixes.append("normalized LaTeX formulas")
            # 5. List formatting fixes
            if self.config.enable_list_fix:
                original = content
                content = self._fix_list_formatting(content)
                if content != original:
                    self.applied_fixes.append("fixed list formatting")
            # 6. Unclosed code-block detection and repair
            if self.config.enable_unclosed_block_fix:
                original = content
                content = self._fix_unclosed_code_blocks(content)
                if content != original:
                    self.applied_fixes.append("closed unterminated code block")
            # 7. Full-width → half-width symbols (inside code blocks only)
            if self.config.enable_fullwidth_symbol_fix:
                original = content
                content = self._fix_fullwidth_symbols_in_code(content)
                if content != original:
                    self.applied_fixes.append("converted full-width symbols")
            # 8. Leftover XML tag cleanup
            if self.config.enable_xml_tag_cleanup:
                original = content
                content = self._cleanup_xml_tags(content)
                if content != original:
                    self.applied_fixes.append("cleaned up XML tags")
            # 9. Run custom cleaner functions
            for cleaner in self.config.custom_cleaners:
                original = content
                content = cleaner(content)
                if content != original:
                    self.applied_fixes.append("ran custom cleaner")
            return content
        except Exception as e:
            # Production safety net: if cleaning fails, return the original
            # content rather than breaking the response.
            logger.error(f"Content normalization failed: {e}", exc_info=True)
            return content
    def _fix_escape_characters(self, content: str) -> str:
        """Repairs over-escaped characters."""
        # Note: handle specific escape sequences before the generic double backslash.
        content = content.replace("\\r\\n", "\n")
        content = content.replace("\\n", "\n")
        content = content.replace("\\t", "\t")
        # Repair over-escaped backslashes (e.g. paths like C:\\Users)
        content = content.replace("\\\\", "\\")
        return content

    def _fix_thought_tags(self, content: str) -> str:
        """Normalizes </thought> tags so they are always followed by a blank line."""
        return self._PATTERNS['thought_tag'].sub("</thought>\n\n", content)

    def _fix_code_blocks(self, content: str) -> str:
        """Fixes code-block formatting (own line, trailing newline, de-indentation)."""
        # C: strip indentation before code fences (must run first; later checks depend on it)
        content = self._PATTERNS['code_block_indent'].sub(r"\1", content)
        # A: ensure a newline before ```
        content = self._PATTERNS['code_block_prefix'].sub(r"\n\1", content)
        # B: ensure a newline after the ```language marker
        content = self._PATTERNS['code_block_suffix'].sub(r"\1\n\2", content)
        return content

    def _fix_latex_formulas(self, content: str) -> str:
        """Normalizes LaTeX delimiters: \\[ -> $$ (display), \\( -> $ (inline)."""
        content = self._PATTERNS['latex_bracket_block'].sub(r"$$\1$$", content)
        content = self._PATTERNS['latex_paren_inline'].sub(r"$\1$", content)
        return content

    def _fix_list_formatting(self, content: str) -> str:
        """Fixes list items missing a line break (e.g. 'text1. item' -> 'text\\n1. item')."""
        return self._PATTERNS['list_item'].sub(r"\1\n\2", content)

    def _fix_unclosed_code_blocks(self, content: str) -> str:
        """Detects and closes unterminated code blocks."""
        if content.count("```") % 2 != 0:
            logger.warning("Unclosed code block detected; appending a closing fence")
            content += "\n```"
        return content

    def _fix_fullwidth_symbols_in_code(self, content: str) -> str:
        """Converts full-width symbols to half-width inside code blocks only."""
        # Commonly misused full-width symbols
        FULLWIDTH_MAP = {
            '，': ',', '。': '.', '（': '(', '）': ')',
            '【': '[', '】': ']', '；': ';', '：': ':',
            '？': '?', '！': '!', '“': '"', '”': '"',
            '‘': "'", '’': "'",
        }
        parts = content.split("```")
        # Code-block contents sit at odd indices (1, 3, 5, ...)
        for i in range(1, len(parts), 2):
            for full, half in FULLWIDTH_MAP.items():
                parts[i] = parts[i].replace(full, half)
        return "```".join(parts)

    def _cleanup_xml_tags(self, content: str) -> str:
        """Removes leftover XML tags."""
        return self._PATTERNS['xml_artifacts'].sub("", content)
class Filter:
    class Valves(BaseModel):
        priority: int = Field(
            default=0, description="Priority level for the filter operations."
        )

    def __init__(self):
        # Indicates custom file handling logic. This flag helps disengage default routines in favor of custom
        # implementations, informing the WebUI to defer file-related operations to designated methods within this class.
        # Alternatively, you can remove the files directly from the body in the inlet hook.
        # self.file_handler = True
        # Initialize 'valves' with specific configurations. Using a 'Valves' instance helps encapsulate settings,
        # which ensures settings are managed cohesively and not confused with operational flags like 'file_handler'.
        self.valves = self.Valves()

    def inlet(
        self,
        body: dict,
        __user__: Optional[dict] = None,
        __metadata__: Optional[dict] = None,
        __model__: Optional[dict] = None,
        __event_emitter__=None,
    ) -> dict:
        # Modify or validate the request body before it reaches the chat completion API.
        messages = body.get("messages", [])
        self.insert_user_env_info(__metadata__, messages, __event_emitter__)
        self.change_web_search(body, __user__, __event_emitter__)
        body = self.inlet_chat_id(__model__, __metadata__, body)
        return body

    def inlet_chat_id(self, model: dict, metadata: dict, body: dict):
        if "openai" in model:
            base_model_id = model["openai"]["id"]
        else:
            base_model_id = model["info"]["base_model_id"]
        base_model = model["id"] if base_model_id is None else base_model_id
        if base_model.startswith("cfchatqwen"):
            body["chat_id"] = metadata["chat_id"]
        if base_model.startswith("webgemini"):
            body["chat_id"] = metadata["chat_id"]
            if not model["id"].startswith("webgemini"):
                body["custom_model_id"] = model["id"]
        return body
    def change_web_search(self, body, __user__, __event_emitter__=None):
        """
        Optimizes the web search feature for specific models.

        Behavior:
        - Detects whether web search is enabled
        - Switches supported models over to their native search capability
        - Disables the default web_search toggle to avoid conflicts
        - Emits a status notice when the model's native search is used

        Args:
            body: the request body dict
            __user__: user information
            __event_emitter__: emitter function for frontend events
        """
        features = body.get("features", {})
        web_search_enabled = (
            features.get("web_search", False) if isinstance(features, dict) else False
        )
        user_email = "user"  # fallback when no user info is available
        if isinstance(__user__, (list, tuple)):
            user_email = __user__[0].get("email", "user") if __user__[0] else "user"
        elif isinstance(__user__, dict):
            user_email = __user__.get("email", "user")
        model_name = body.get("model") or ""
        search_enabled_for_model = False
        if web_search_enabled:
            if model_name in ["qwen-max-latest", "qwen-max", "qwen-plus-latest"]:
                body.setdefault("enable_search", True)
                features["web_search"] = False
                search_enabled_for_model = True
            if "search" in model_name or "搜索" in model_name:
                features["web_search"] = False
            if model_name.startswith("cfdeepseek-deepseek") and not model_name.endswith(
                "search"
            ):
                body["model"] = body["model"] + "-search"
                features["web_search"] = False
                search_enabled_for_model = True
            if model_name.startswith("cfchatqwen") and not model_name.endswith(
                "search"
            ):
                body["model"] = body["model"] + "-search"
                features["web_search"] = False
                search_enabled_for_model = True
            if model_name.startswith("gemini-2.5") and "search" not in model_name:
                body["model"] = body["model"] + "-search"
                features["web_search"] = False
                search_enabled_for_model = True
            if user_email == "yi204o@qq.com":
                features["web_search"] = False
        # If the model's native search capability was enabled, emit a status notice
        if search_enabled_for_model and __event_emitter__:
            import asyncio
            try:
                asyncio.create_task(
                    self._emit_search_status(__event_emitter__, model_name)
                )
            except RuntimeError:
                pass
    def insert_user_env_info(
        self, __metadata__, messages, __event_emitter__=None, model_match_tags=None
    ):
        """
        Injects environment variable information into the first user message.

        Features:
        - Always prepends a Markdown block describing the user's environment variables
        - Supports multiple message types: plain text, image, and mixed multimodal messages
        - Idempotent: if the block already exists it is refreshed in place, never duplicated
        - Emits an "injection succeeded" status notice to the frontend on success

        Args:
            __metadata__: metadata dict containing the variables
            messages: the message list
            __event_emitter__: emitter function for frontend events
            model_match_tags: model match tags (reserved; currently unused)
        """
        variables = __metadata__.get("variables", {})
        if not messages or messages[0]["role"] != "user":
            return
        env_injected = False
        if variables:
            # Build the Markdown block describing the environment variables
            variable_markdown = (
                "## User Environment Variables\n"
                "The following are the user's environment variables; use them as a reference for personalization or specific needs:\n"
                f"- **User name**: {variables.get('{{USER_NAME}}', '')}\n"
                f"- **Current datetime**: {variables.get('{{CURRENT_DATETIME}}', '')}\n"
                f"- **Current weekday**: {variables.get('{{CURRENT_WEEKDAY}}', '')}\n"
                f"- **Current timezone**: {variables.get('{{CURRENT_TIMEZONE}}', '')}\n"
                f"- **User language**: {variables.get('{{USER_LANGUAGE}}', '')}\n"
            )
            content = messages[0]["content"]
            # Pattern matching a previously injected block
            env_var_pattern = r"(## User Environment Variables\nThe following are the user's environment variables; use them as a reference for personalization or specific needs:\n.*?User language.*?\n)"
            # Handle the different content types
            if isinstance(content, list):  # multimodal content (may mix images and text)
                # Find the first text-type part
                text_index = -1
                for i, part in enumerate(content):
                    if isinstance(part, dict) and part.get("type") == "text":
                        text_index = i
                        break
                if text_index >= 0:
                    # A text part exists; check whether the block is already present
                    text_part = content[text_index]
                    text_content = text_part.get("text", "")
                    if re.search(env_var_pattern, text_content, flags=re.DOTALL):
                        # Already present: refresh it with the latest data
                        text_part["text"] = re.sub(
                            env_var_pattern,
                            variable_markdown,
                            text_content,
                            flags=re.DOTALL,
                        )
                    else:
                        # Not present: prepend it
                        text_part["text"] = f"{variable_markdown}\n{text_content}"
                    content[text_index] = text_part
                else:
                    # No text part (e.g. image only): add a new text item
                    content.insert(
                        0, {"type": "text", "text": f"{variable_markdown}\n"}
                    )
                messages[0]["content"] = content
                env_injected = True
            elif isinstance(content, str):  # plain-text content
                # Check whether the block is already present
                if re.search(env_var_pattern, content, flags=re.DOTALL):
                    # Already present: refresh it with the latest data
                    messages[0]["content"] = re.sub(
                        env_var_pattern, variable_markdown, content, flags=re.DOTALL
                    )
                else:
                    # Not present: prepend it
                    messages[0]["content"] = f"{variable_markdown}\n{content}"
                env_injected = True
            else:  # any other content type
                # Convert to a string and handle like plain text
                str_content = str(content)
                if re.search(env_var_pattern, str_content, flags=re.DOTALL):
                    messages[0]["content"] = re.sub(
                        env_var_pattern, variable_markdown, str_content, flags=re.DOTALL
                    )
                else:
                    messages[0]["content"] = f"{variable_markdown}\n{str_content}"
                env_injected = True
        # After a successful injection, notify the user via a status event
        if env_injected and __event_emitter__:
            import asyncio
            try:
                # Schedule the notification on the running event loop
                asyncio.create_task(self._emit_env_status(__event_emitter__))
            except RuntimeError:
                # Not inside an event loop; skip the notification
                pass
    async def _emit_env_status(self, __event_emitter__):
        """
        Sends an "environment variables injected" status notice to the frontend.
        """
        try:
            await __event_emitter__(
                {
                    "type": "status",
                    "data": {
                        "description": "✓ User environment variables injected",
                        "done": True,
                    },
                }
            )
        except Exception as e:
            print(f"Error sending status notice: {e}")

    async def _emit_search_status(self, __event_emitter__, model_name):
        """
        Sends a "model search capability enabled" status notice to the frontend.
        """
        try:
            await __event_emitter__(
                {
                    "type": "status",
                    "data": {
                        "description": f"🔍 Search capability enabled for {model_name}",
                        "done": True,
                    },
                }
            )
        except Exception as e:
            print(f"Error sending search status notice: {e}")

    async def _emit_normalization_status(self, __event_emitter__, applied_fixes: List[str] = None):
        """
        Sends a "content normalized" status notice.
        """
        description = "✓ Content normalized automatically"
        if applied_fixes:
            description += f": {', '.join(applied_fixes)}"
        try:
            await __event_emitter__(
                {
                    "type": "status",
                    "data": {
                        "description": description,
                        "done": True,
                    },
                }
            )
        except Exception as e:
            print(f"Error sending normalization status notice: {e}")

    def _contains_html(self, content: str) -> bool:
        """
        Detects whether the content contains HTML tags.
        """
        # Match common HTML tags
        pattern = r"<\s*/?\s*(?:html|head|body|div|span|p|br|hr|ul|ol|li|table|thead|tbody|tfoot|tr|td|th|img|a|b|i|strong|em|code|pre|blockquote|h[1-6]|script|style|form|input|button|label|select|option|iframe|link|meta|title)\b"
        return bool(re.search(pattern, content, re.IGNORECASE))
    def outlet(self, body: dict, __user__: Optional[dict] = None, __event_emitter__=None) -> dict:
        """
        Processes the outgoing response body by rewriting the last assistant message.
        Uses ContentNormalizer for full content normalization.
        """
        if "messages" in body and body["messages"]:
            last = body["messages"][-1]
            content = last.get("content", "") or ""
            if last.get("role") == "assistant" and isinstance(content, str):
                # Skip normalization for HTML content to avoid mangling it
                if self._contains_html(content):
                    return body
                # Initialize the normalizer
                normalizer = ContentNormalizer()
                # Run normalization
                new_content = normalizer.normalize(content)
                # Apply the updated content
                if new_content != content:
                    last["content"] = new_content
                    # Content changed: emit a status notice
                    if __event_emitter__:
                        import asyncio
                        try:
                            # Pass along applied_fixes
                            asyncio.create_task(self._emit_normalization_status(__event_emitter__, normalizer.applied_fixes))
                        except RuntimeError:
                            # Not inside an event loop; ignore
                            pass
        return body

File diff suppressed because it is too large.


@@ -0,0 +1,212 @@
import asyncio
from typing import List, Optional, Dict
from pydantic import BaseModel, Field
from fastapi import Request
from open_webui.models.chats import Chats


class Filter:
    class Valves(BaseModel):
        # Prefix for the injected system message
        CONTEXT_PREFIX: str = Field(
            default="Below are answers from several anonymous AI models, each wrapped in a <response> tag:\n\n",
            description="Prefix for the injected system message containing the raw merged context.",
        )

    def __init__(self):
        self.valves = self.Valves()
        self.toggle = True
        self.type = "filter"
        self.name = "Merge Responses"
        self.description = "When the user asks a question, automatically injects the context of previous multi-model responses."
    async def inlet(
        self,
        body: Dict,
        __user__: Dict,
        __metadata__: Dict,
        __request__: Request,
        __event_emitter__,
    ):
        """
        Entry point of the filter. It checks whether the previous turn was a
        multi-model response; if so, it formats those responses and injects the
        formatted context into the current request as a system message.
        """
        print(f"*********** Filter '{self.name}' triggered ***********")
        chat_id = __metadata__.get("chat_id")
        if not chat_id:
            print(
                f"DEBUG: Filter '{self.name}' skipped: chat_id not found in metadata."
            )
            return body
        print(f"DEBUG: Chat ID found: {chat_id}")
        # 1. Fetch the full chat history from the database
        try:
            chat = await asyncio.to_thread(Chats.get_chat_by_id, chat_id)
            if (
                not chat
                or not hasattr(chat, "chat")
                or not chat.chat.get("history")
                or not chat.chat.get("history").get("messages")
            ):
                print(
                    f"DEBUG: Filter '{self.name}' skipped: Chat history not found or empty for chat_id: {chat_id}"
                )
                return body
            messages_map = chat.chat["history"]["messages"]
            print(
                f"DEBUG: Successfully loaded {len(messages_map)} messages from history."
            )
            # Count the number of user messages in the history
            user_message_count = sum(
                1 for msg in messages_map.values() if msg.get("role") == "user"
            )
            # If there are fewer than 2 user messages, there is no previous turn to merge.
            if user_message_count < 2:
                print(
                    f"DEBUG: Filter '{self.name}' skipped: Not enough user messages in history to have a previous turn (found {user_message_count}, required >= 2)."
                )
                return body
        except Exception as e:
            print(
                f"ERROR: Filter '{self.name}' failed to get chat history from DB: {e}"
            )
            return body
# This filter rebuilds the entire chat history to consolidate all multi-response turns.
# 1. Get all messages from history and sort by timestamp
all_messages = list(messages_map.values())
all_messages.sort(key=lambda x: x.get("timestamp", 0))
# 2. Pre-group all assistant messages by their parentId for efficient lookup
assistant_groups = {}
for msg in all_messages:
if msg.get("role") == "assistant":
parent_id = msg.get("parentId")
if parent_id:
if parent_id not in assistant_groups:
assistant_groups[parent_id] = []
assistant_groups[parent_id].append(msg)
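# Illustrative shape of assistant_groups after this pass (IDs are
# hypothetical; each value holds the sibling responses that answer
# the same user message):
# assistant_groups = {
#     "user-msg-1": [{"id": "resp-a", ...}, {"id": "resp-b", ...}],
# }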
final_messages = []
processed_parent_ids = set()
# 3. Iterate through the sorted historical messages to build the final, clean list
for msg in all_messages:
msg_id = msg.get("id")
role = msg.get("role")
parent_id = msg.get("parentId")
if role == "user":
# Add user messages directly
final_messages.append(msg)
elif role == "assistant":
# If this assistant's parent group has already been processed, skip it
if parent_id in processed_parent_ids:
continue
# Process the group of siblings for this parent_id
if parent_id in assistant_groups:
siblings = assistant_groups[parent_id]
# Only perform a merge if there are multiple siblings
if len(siblings) > 1:
print(
f"DEBUG: Found a group of {len(siblings)} siblings for parent_id {parent_id}. Merging..."
)
# --- MERGE LOGIC ---
merged_content = None
merged_message_id = None
# Sort siblings by timestamp before processing
siblings.sort(key=lambda s: s.get("timestamp", 0))
merged_message_timestamp = siblings[0].get("timestamp", 0)
# Case A: Check for system pre-merged content (merged.status: true and content not empty)
merged_content_msg = next(
(
s
for s in siblings
if s.get("merged", {}).get("status")
and s.get("merged", {}).get("content")
),
None,
)
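# A pre-merged sibling is assumed to look roughly like:
# {"id": "...", "role": "assistant",
#  "merged": {"status": True, "content": "<merged text>"}}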
if merged_content_msg:
merged_content = merged_content_msg["merged"]["content"]
merged_message_id = merged_content_msg["id"]
merged_message_timestamp = merged_content_msg.get(
"timestamp", merged_message_timestamp
)
print(
f"DEBUG: Using pre-merged content from message ID: {merged_message_id}"
)
else:
# Case B: Manually merge content
combined_content = []
first_sibling_id = None
counter = 0
for s in siblings:
if not first_sibling_id:
first_sibling_id = s["id"]
content = s.get("content", "")
# Skip empty responses and the backend's placeholder error text
if (
content
and content
!= "The requested model is not supported."
):
response_id = chr(ord("a") + counter)
combined_content.append(
f'<response id="{response_id}">\n{content}\n</response>'
)
counter += 1
if combined_content:
merged_content = "\n\n".join(combined_content)
merged_message_id = first_sibling_id or parent_id
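# Illustrative result of the manual merge above (contents hypothetical):
# <response id="a">
# First model's answer
# </response>
#
# <response id="b">
# Second model's answer
# </response>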
if merged_content:
merged_message = {
"id": merged_message_id,
"parentId": parent_id,
"role": "assistant",
"content": f"{self.valves.CONTEXT_PREFIX}{merged_content}",
"timestamp": merged_message_timestamp,
}
final_messages.append(merged_message)
else:
# If there's only one sibling, add it directly
final_messages.append(siblings[0])
# Mark this group as processed
processed_parent_ids.add(parent_id)
# 4. The new user message from the current request is not in the historical messages_map,
# so we need to append it to our newly constructed message list.
if body.get("messages"):
new_user_message_from_body = body["messages"][-1]
# Ensure we don't add a historical message that might be in the body for context
if new_user_message_from_body.get("id") not in messages_map:
final_messages.append(new_user_message_from_body)
# 5. Replace the original message list with the new, cleaned-up list
body["messages"] = final_messages
print(
f"DEBUG: Rebuilt message history with {len(final_messages)} messages, consolidating all multi-response turns."
)
print(f"*********** Filter '{self.name}' finished successfully ***********")
return body