diff --git a/CHANGELOG.md b/CHANGELOG.md deleted file mode 100644 index 01e8a65..0000000 --- a/CHANGELOG.md +++ /dev/null @@ -1,27 +0,0 @@ -# Changelog - -All notable changes to this project will be documented in this file. - -## [Unreleased] - -### πŸš€ New Features -- **Smart Mind Map**: Updated to v0.7.3. Improved description and metadata. -- **Knowledge Card**: Updated to v0.2.1. Improved description and metadata. -- **Documentation**: Added comprehensive `plugin_development_guide_cn.md` consolidating all previous guides. - -### πŸ“¦ Project Structure -- **Renamed**: Project renamed from `awesome-openwebui` to **OpenWebUI Extras**. -- **Reorganized**: - - Moved `run.py` to `scripts/`. - - Moved large documentation files to `docs/`. - - Removed `requirements.txt` to emphasize "resource collection" nature. -- **Added**: `CONTRIBUTING.md` guide. - -### πŸ“ Documentation -- **README**: Updated English and Chinese READMEs with new project name and structure. -- **Plan**: Updated `implementation_plan.md` to reflect the new direction. - ---- - -## [0.1.0] - 2025-12-19 -- Initial release of the reorganized project structure. diff --git a/README.md b/README.md index 7d9e482..f6b8500 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,7 @@ English | [δΈ­ζ–‡](./README_CN.md) A collection of enhancements, plugins, and prompts for [OpenWebUI](https://github.com/open-webui/open-webui), developed and curated for personal use to extend functionality and improve experience. 
-[Contributing](./CONTRIBUTING.md) | [Changelog](./CHANGELOG.md) +[Contributing](./CONTRIBUTING.md) ## πŸ“¦ Project Contents diff --git a/plugins/filters/async-context-compression/async_context_compression.py b/plugins/filters/async-context-compression/async_context_compression.py index 0597b61..79d9058 100644 --- a/plugins/filters/async-context-compression/async_context_compression.py +++ b/plugins/filters/async-context-compression/async_context_compression.py @@ -15,7 +15,7 @@ license: MIT This filter significantly reduces token consumption in long conversations by using intelligent summarization and message compression, while maintaining conversational coherence. Core Features: - βœ… Automatic compression triggered by a message count threshold + βœ… Automatic compression triggered by Token count threshold βœ… Asynchronous summary generation (does not block user response) βœ… Persistent storage with database support (PostgreSQL and SQLite) βœ… Flexible retention policy (configurable to keep first and last N messages) @@ -39,7 +39,7 @@ Phase 1: Inlet (Pre-request processing) Phase 2: Outlet (Post-response processing) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1. Triggered after the LLM response is complete. - 2. Checks if the message count has reached the compression threshold. + 2. Checks if the Token count has reached the compression threshold. 3. If the threshold is met, an asynchronous background task is started to generate a summary: β”œβ”€ Extracts messages to be summarized (excluding the kept first and last messages). β”œβ”€ Calls the LLM to generate a concise summary. @@ -96,11 +96,20 @@ priority Default: 10 Description: The execution order of the filter. Lower numbers run first. -compression_threshold - Default: 15 - Description: When the message count reaches this value, a background summary generation will be triggered after the conversation ends. 
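The token-count trigger described in the features and workflow above can be sketched standalone. This is not part of the patch; it is a minimal illustration assuming, as the patch does, tiktoken's `o200k_base` encoding when available and a rough 1-token-per-4-characters estimate otherwise:

```python
# Standalone sketch (not the patch's exact code) of the token-based trigger:
# use tiktoken's o200k_base encoding when available, otherwise fall back to
# the patch's rough 1 token ~= 4 characters estimate.
try:
    import tiktoken
    _ENCODING = tiktoken.get_encoding("o200k_base")
except Exception:  # tiktoken missing, or encoding data unavailable
    _ENCODING = None


def count_tokens(text: str) -> int:
    """Count tokens in a string, with a character-based fallback."""
    if not text:
        return 0
    if _ENCODING is not None:
        return len(_ENCODING.encode(text))
    return len(text) // 4  # rough estimate when tiktoken is unavailable


def should_compress(messages: list, threshold_tokens: int) -> bool:
    """True once the conversation's total token count reaches the threshold."""
    total = sum(count_tokens(str(m.get("content", ""))) for m in messages)
    return total >= threshold_tokens
```

Because the check runs in the outlet's background task, an over- or under-estimate from the fallback only shifts when compression fires, not whether responses are blocked.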
+compression_threshold_tokens + Default: 64000 + Description: When the total context Token count exceeds this value, compression is triggered. Recommendation: Adjust based on your model's context window and cost. +max_context_tokens + Default: 128000 + Description: Hard limit for context. Exceeding this value will force removal of the earliest messages. + +model_thresholds + Default: {} + Description: Threshold override configuration for specific models. + Example: {"gpt-4": {"compression_threshold_tokens": 8000, "max_context_tokens": 32000}} + keep_first Default: 1 Description: Always keep the first N messages of the conversation. Set to 0 to disable. The first message often contains important system prompts. @@ -113,7 +122,7 @@ summary_model Default: None Description: The LLM used to generate the summary. Recommendation: - - It is strongly recommended to configure a fast, economical, and compatible model, such as `deepseek-v3`、`gemini-2.5-flash`、`gpt-4.1`。 + - It is strongly recommended to configure a fast, economical, and compatible model, such as `deepseek-v3`, `gemini-2.5-flash`, `gpt-4.1`. - If left empty, the filter will attempt to use the model from the current conversation. Note: - If the current conversation uses a pipeline (Pipe) model or a model that does not support standard generation APIs, leaving this field empty may cause summary generation to fail. In this case, you must specify a valid model. @@ -205,8 +214,8 @@ Statistics: 4. Cost Optimization ⚠ The summary model is called once each time the threshold is met. - ⚠ Set `compression_threshold` reasonably to avoid frequent calls. - ⚠ It's recommended to use a fast and economical model to generate summaries. + ⚠ Set `compression_threshold_tokens` reasonably to avoid frequent calls. + ⚠ It's recommended to use a fast and economical model (like `gemini-flash`) to generate summaries. 5. Multimodal Support βœ“ This filter supports multimodal messages containing images. 
@@ -227,7 +236,7 @@ Solution: Problem: Summary not generated Solution: - 1. Check if the `compression_threshold` has been met. + 1. Check if the `compression_threshold_tokens` has been met. 2. Verify that the `summary_model` is configured correctly. 3. Check the debug logs for any error messages. @@ -237,7 +246,7 @@ Solution: Problem: Compression effect is not significant Solution: - 1. Increase the `compression_threshold` appropriately. + 1. Increase the `compression_threshold_tokens` appropriately. 2. Decrease the number of `keep_last` or `keep_first`. 3. Check if the conversation is actually long enough. @@ -245,11 +254,12 @@ Solution: """ from pydantic import BaseModel, Field, model_validator -from typing import Optional +from typing import Optional, Dict, Any, List, Union, Callable, Awaitable import asyncio import json import hashlib import os +import time # Open WebUI built-in imports from open_webui.utils.chat import generate_chat_completion @@ -257,6 +267,12 @@ from open_webui.models.users import Users from fastapi.requests import Request from open_webui.main import app as webui_app +# Try to import tiktoken +try: + import tiktoken +except ImportError: + tiktoken = None + # Database imports from sqlalchemy import create_engine, Column, String, Text, DateTime, Integer from sqlalchemy.ext.declarative import declarative_base @@ -284,6 +300,7 @@ class Filter: self.valves = self.Valves() self._db_engine = None self._SessionLocal = None + self.temp_state = {} # Used to pass temporary data between inlet and outlet self._init_database() def _init_database(self): @@ -292,7 +309,9 @@ class Filter: database_url = os.getenv("DATABASE_URL") if not database_url: - print("[Database] ❌ Error: DATABASE_URL environment variable is not set. Please set this variable.") + print( + "[Database] ❌ Error: DATABASE_URL environment variable is not set. Please set this variable." 
+ ) self._db_engine = None self._SessionLocal = None return @@ -312,7 +331,9 @@ class Filter: database_url = database_url.replace( "postgres://", "postgresql://", 1 ) - print("[Database] ℹ️ Automatically converted postgres:// to postgresql://") + print( + "[Database] ℹ️ Automatically converted postgres:// to postgresql://" + ) engine_args = { "pool_pre_ping": True, "pool_recycle": 3600, @@ -337,7 +358,9 @@ class Filter: # Create table if it doesn't exist Base.metadata.create_all(bind=self._db_engine) - print(f"[Database] βœ… Successfully connected to {db_type} and initialized the chat_summary table.") + print( + f"[Database] βœ… Successfully connected to {db_type} and initialized the chat_summary table." + ) except Exception as e: print(f"[Database] ❌ Initialization failed: {str(e)}") @@ -348,36 +371,403 @@ class Filter: priority: int = Field( default=10, description="Priority level for the filter operations." ) - compression_threshold: int = Field( - default=15, ge=0, description="The number of messages at which to trigger compression." + # Token related parameters + compression_threshold_tokens: int = Field( + default=64000, + ge=0, + description="When total context Token count exceeds this value, trigger compression (Global Default)", ) + max_context_tokens: int = Field( + default=128000, + ge=0, + description="Hard limit for context. 
Exceeding this value will force removal of earliest messages (Global Default)", + ) + model_thresholds: dict = Field( + default={ + # Groq + "groq-openai/gpt-oss-20b": { + "max_context_tokens": 8000, + "compression_threshold_tokens": 5600, + }, + "groq-openai/gpt-oss-120b": { + "max_context_tokens": 8000, + "compression_threshold_tokens": 5600, + }, + # Qwen (ModelScope / CF) + "modelscope-Qwen/Qwen3-Coder-480B-A35B-Instruct": { + "max_context_tokens": 256000, + "compression_threshold_tokens": 179200, + }, + "cfchatqwen-qwen3-max-search": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "modelscope-Qwen/Qwen3-235B-A22B-Thinking-2507": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-max": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-vl-plus-thinking": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-coder-plus-search": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "cfchatqwen-qwen3-vl-plus": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-coder-plus": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "cfchatqwen-qwen3-omni-flash-thinking": { + "max_context_tokens": 65536, + "compression_threshold_tokens": 45875, + }, + "cfchatqwen-qwen3-omni-flash": { + "max_context_tokens": 65536, + "compression_threshold_tokens": 45875, + }, + "cfchatqwen-qwen3-next-80b-a3b-thinking": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "modelscope-Qwen/Qwen3-VL-235B-A22B-Instruct": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-next-80b-a3b-thinking-search": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-next-80b-a3b": { + 
"max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-235b-a22b-thinking-search": { + "max_context_tokens": 131072, + "compression_threshold_tokens": 91750, + }, + "cfchatqwen-qwen3-235b-a22b": { + "max_context_tokens": 131072, + "compression_threshold_tokens": 91750, + }, + "cfchatqwen-qwen3-235b-a22b-thinking": { + "max_context_tokens": 131072, + "compression_threshold_tokens": 91750, + }, + "cfchatqwen-qwen3-coder-flash-search": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-coder-flash": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-max-2025-10-30": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-max-2025-10-30-thinking": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-max-2025-10-30-thinking-search": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "modelscope-Qwen/Qwen3-235B-A22B-Instruct-2507": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-vl-30b-a3b": { + "max_context_tokens": 131072, + "compression_threshold_tokens": 91750, + }, + "cfchatqwen-qwen3-vl-30b-a3b-thinking": { + "max_context_tokens": 131072, + "compression_threshold_tokens": 91750, + }, + # Gemini + "gemini-2.5-pro-search": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-2.5-flash-search": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-2.5-flash": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-2.5-flash-lite": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-2.5-flash-lite-search": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + 
"gemini-2.5-pro": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-2.0-flash-search": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-2.0-flash": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-2.0-flash-exp": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-2.0-flash-lite": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "copilot-gemini-2.5-pro": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "gemini-pro-latest": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-3-pro-preview": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "gemini-pro-latest-search": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-flash-latest": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-flash-latest-search": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-flash-lite-latest-search": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-flash-lite-latest": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-robotics-er-1.5-preview": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + # DeepSeek + "modelscope-deepseek-ai/DeepSeek-V3.1": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "cfdeepseek-deepseek-search": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "openrouter-deepseek/deepseek-r1-0528:free": { + "max_context_tokens": 163840, + "compression_threshold_tokens": 114688, + }, + "modelscope-deepseek-ai/DeepSeek-V3.2-Exp": { + "max_context_tokens": 128000, + 
"compression_threshold_tokens": 89600, + }, + "cfdeepseek-deepseek-r1-search": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "cfdeepseek-deepseek-r1": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "openrouter-deepseek/deepseek-chat-v3.1:free": { + "max_context_tokens": 163800, + "compression_threshold_tokens": 114660, + }, + "modelscope-deepseek-ai/DeepSeek-R1-0528": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "cfdeepseek-deepseek": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + # Kimi (Moonshot) + "cfkimi-kimi-k2-search": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "cfkimi-kimi-k1.5-search": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "cfkimi-kimi-k1.5-thinking-search": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "cfkimi-kimi-research": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "openrouter-moonshotai/kimi-k2:free": { + "max_context_tokens": 32768, + "compression_threshold_tokens": 22937, + }, + "cfkimi-kimi-k2": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "cfkimi-kimi-k1.5": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + # GPT / OpenAI + "gpt-4.1": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "gpt-4o": { + "max_context_tokens": 64000, + "compression_threshold_tokens": 44800, + }, + "gpt-5": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "github-gpt-4.1": { + "max_context_tokens": 7500, + "compression_threshold_tokens": 5250, + }, + "gpt-5-mini": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "gpt-5.1": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + 
"gpt-5.1-codex": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "gpt-5.1-codex-mini": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "gpt-5-codex": { + "max_context_tokens": 200000, + "compression_threshold_tokens": 140000, + }, + "github-gpt-4.1-mini": { + "max_context_tokens": 7500, + "compression_threshold_tokens": 5250, + }, + "openrouter-openai/gpt-oss-20b:free": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + # Claude / Anthropic + "claude-sonnet-4.5": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "claude-haiku-4.5": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "copilot-claude-opus-41": { + "max_context_tokens": 80000, + "compression_threshold_tokens": 56000, + }, + "copilot-claude-sonnet-4": { + "max_context_tokens": 80000, + "compression_threshold_tokens": 56000, + }, + # Other / OpenRouter / OSWE + "oswe-vscode-insiders": { + "max_context_tokens": 256000, + "compression_threshold_tokens": 179200, + }, + "modelscope-MiniMax/MiniMax-M2": { + "max_context_tokens": 204800, + "compression_threshold_tokens": 143360, + }, + "oswe-vscode-prime": { + "max_context_tokens": 200000, + "compression_threshold_tokens": 140000, + }, + "grok-code-fast-1": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "copilot-auto": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "modelscope-ZhipuAI/GLM-4.6": { + "max_context_tokens": 32000, + "compression_threshold_tokens": 22400, + }, + "openrouter-x-ai/grok-4.1-fast:free": { + "max_context_tokens": 2000000, + "compression_threshold_tokens": 1400000, + }, + "openrouter-qwen/qwen3-coder:free": { + "max_context_tokens": 262000, + "compression_threshold_tokens": 183400, + }, + "openrouter-qwen/qwen3-235b-a22b:free": { + "max_context_tokens": 40960, + "compression_threshold_tokens": 28672, + }, + }, 
+ description="Threshold override configuration for specific models. Only includes models requiring special configuration.", + ) + keep_first: int = Field( - default=1, ge=0, description="Always keep the first N messages. Set to 0 to disable." + default=1, + ge=0, + description="Always keep the first N messages. Set to 0 to disable.", + ) + keep_last: int = Field( + default=6, ge=0, description="Always keep the last N full messages." ) - keep_last: int = Field(default=6, ge=0, description="Always keep the last N messages.") summary_model: str = Field( default=None, - description="The model to use for generating the summary. If empty, uses the current conversation's model.", + description="The model ID used to generate the summary. If empty, uses the current conversation's model. Used to match configurations in model_thresholds.", ) max_summary_tokens: int = Field( - default=4000, ge=1, description="The maximum number of tokens for the summary." + default=16384, + ge=1, + description="The maximum number of tokens for the summary.", ) summary_temperature: float = Field( - default=0.3, ge=0.0, le=2.0, description="The temperature for summary generation." + default=0.1, + ge=0.0, + le=2.0, + description="The temperature for summary generation.", + ) + debug_mode: bool = Field( + default=True, description="Enable detailed logging for debugging." ) - debug_mode: bool = Field(default=True, description="Enable detailed logging for debugging.") - @model_validator(mode="after") - def check_thresholds(self) -> "Valves": - kept_count = self.keep_first + self.keep_last - if self.compression_threshold <= kept_count: - raise ValueError( - f"compression_threshold ({self.compression_threshold}) must be greater than " - f"the sum of keep_first ({self.keep_first}) and keep_last ({self.keep_last}) ({kept_count})." 
- ) - return self - - def _save_summary(self, chat_id: str, summary: str, body: dict): + def _save_summary(self, chat_id: str, summary: str, compressed_count: int): """Saves the summary to the database.""" if not self._SessionLocal: if self.valves.debug_mode: @@ -388,21 +778,27 @@ class Filter: session = self._SessionLocal() try: # Find existing record - existing = ( - session.query(ChatSummary).filter_by(chat_id=chat_id).first() - ) + existing = session.query(ChatSummary).filter_by(chat_id=chat_id).first() if existing: + # [Optimization] Optimistic lock check: update only if progress moves forward + if compressed_count <= existing.compressed_message_count: + if self.valves.debug_mode: + print( + f"[Storage] Skipping update: New progress ({compressed_count}) is not greater than existing progress ({existing.compressed_message_count})" + ) + return + # Update existing record existing.summary = summary - existing.compressed_message_count = len(body.get("messages", [])) + existing.compressed_message_count = compressed_count existing.updated_at = datetime.utcnow() else: # Create new record new_summary = ChatSummary( chat_id=chat_id, summary=summary, - compressed_message_count=len(body.get("messages", [])), + compressed_message_count=compressed_count, ) session.add(new_summary) @@ -410,7 +806,9 @@ class Filter: if self.valves.debug_mode: action = "Updated" if existing else "Created" - print(f"[Storage] Summary has been {action.lower()} in the database (Chat ID: {chat_id})") + print( + f"[Storage] Summary has been {action.lower()} in the database (Chat ID: {chat_id})" + ) finally: session.close() @@ -418,38 +816,100 @@ class Filter: except Exception as e: print(f"[Storage] ❌ Database save failed: {str(e)}") - def _load_summary(self, chat_id: str, body: dict) -> Optional[str]: - """Loads the summary from the database.""" + def _load_summary_record(self, chat_id: str) -> Optional[ChatSummary]: + """Loads the summary record object from the database.""" if not 
self._SessionLocal: - if self.valves.debug_mode: - print("[Storage] Database not initialized, cannot load summary.") return None try: session = self._SessionLocal() try: - record = ( - session.query(ChatSummary).filter_by(chat_id=chat_id).first() - ) - + record = session.query(ChatSummary).filter_by(chat_id=chat_id).first() if record: - if self.valves.debug_mode: - print(f"[Storage] Loaded summary from database (Chat ID: {chat_id})") - print( - f"[Storage] Last updated: {record.updated_at}, Original message count: {record.compressed_message_count}" - ) - return record.summary - + # Detach the object from the session so it can be used after session close + session.expunge(record) + return record finally: session.close() - except Exception as e: - print(f"[Storage] ❌ Database read failed: {str(e)}") - + print(f"[Load] ❌ Database read failed: {str(e)}") return None + def _load_summary(self, chat_id: str, body: dict) -> Optional[str]: + """Loads the summary text from the database (Compatible with old interface).""" + record = self._load_summary_record(chat_id) + if record: + if self.valves.debug_mode: + print(f"[Load] Loaded summary from database (Chat ID: {chat_id})") + print( + f"[Load] Last updated: {record.updated_at}, Compressed message count: {record.compressed_message_count}" + ) + return record.summary + return None + + def _count_tokens(self, text: str, model: str = "gpt-3.5-turbo") -> int: + """Counts the number of tokens in the text.""" + if not text: + return 0 + + if tiktoken: + try: + # Uniformly use o200k_base encoding (adapted for latest models) + encoding = tiktoken.get_encoding("o200k_base") + return len(encoding.encode(text)) + except Exception as e: + if self.valves.debug_mode: + print( + f"[Token Count] tiktoken error: {e}, falling back to character estimation" + ) + + # Fallback strategy: Rough estimation (1 token β‰ˆ 4 chars) + return len(text) // 4 + + def _calculate_messages_tokens( + self, messages: List[Dict], model: str = "gpt-3.5-turbo" + ) 
-> int: + """Calculates the total tokens for a list of messages.""" + total_tokens = 0 + for msg in messages: + content = msg.get("content", "") + if isinstance(content, list): + # Handle multimodal content + text_content = "" + for part in content: + if isinstance(part, dict) and part.get("type") == "text": + text_content += part.get("text", "") + total_tokens += self._count_tokens(text_content, model) + else: + total_tokens += self._count_tokens(str(content), model) + return total_tokens + + def _get_model_thresholds(self, model_id: str) -> Dict[str, int]: + """Gets threshold configuration for a specific model. + + Priority: + 1. If configuration exists for the model ID in model_thresholds, use it. + 2. Otherwise, use global parameters compression_threshold_tokens and max_context_tokens. + """ + # Try to match from model-specific configuration + if model_id in self.valves.model_thresholds: + if self.valves.debug_mode: + print(f"[Config] Using model-specific configuration: {model_id}") + return self.valves.model_thresholds[model_id] + + # Use global default configuration + if self.valves.debug_mode: + print( + f"[Config] Model {model_id} not in model_thresholds, using global parameters" + ) + + return { + "compression_threshold_tokens": self.valves.compression_threshold_tokens, + "max_context_tokens": self.valves.max_context_tokens, + } + def _inject_summary_to_first_message(self, message: dict, summary: str) -> dict: - """Injects the summary into the first message by prepending it.""" + """Injects the summary into the first message (prepended to content).""" content = message.get("content", "") summary_block = f"【Historical Conversation Summary】\n{summary}\n\n---\nBelow is the recent conversation:\n\n" @@ -485,14 +945,15 @@ class Filter: return message async def inlet( - self, body: dict, __user__: Optional[dict] = None, __metadata__: dict = None + self, + body: dict, + __user__: Optional[dict] = None, + __metadata__: dict = None, + __event_emitter__: 
Callable[[Any], Awaitable[None]] = None, ) -> dict: """ Executed before sending to the LLM. - Compression Strategy: - 1. Keep the first N messages. - 2. Inject the summary into the first message (if keep_first > 0). - 3. Keep the last N messages. + Compression Strategy: Only responsible for injecting existing summaries, no Token calculation. """ messages = body.get("messages", []) chat_id = __metadata__["chat_id"] @@ -502,161 +963,331 @@ class Filter: print(f"[Inlet] Chat ID: {chat_id}") print(f"[Inlet] Received {len(messages)} messages") - # [Optimization] Load summary in a background thread to avoid blocking the event loop. - if self.valves.debug_mode: - print("[Optimization] Loading summary in a background thread to avoid blocking the event loop.") - saved_summary = await asyncio.to_thread(self._load_summary, chat_id, body) + # Record the target compression progress for the original messages, for use in outlet + # Target is to compress up to the (total - keep_last) message + target_compressed_count = max(0, len(messages) - self.valves.keep_last) - total_kept_count = self.valves.keep_first + self.valves.keep_last - - if saved_summary and len(messages) > total_kept_count: + # [Optimization] Simple state cleanup check + if chat_id in self.temp_state: if self.valves.debug_mode: - print(f"[Inlet] Found saved summary, applying compression.") - - first_messages_to_keep = [] - - if self.valves.keep_first > 0: - # Copy the initial messages to keep - first_messages_to_keep = [ - m.copy() for m in messages[: self.valves.keep_first] - ] - # Inject the summary into the very first message - first_messages_to_keep[0] = self._inject_summary_to_first_message( - first_messages_to_keep[0], saved_summary - ) - else: - # If not keeping initial messages, create a new system message for the summary - summary_block = ( - f"【Historical Conversation Summary】\n{saved_summary}\n\n---\nBelow is the recent conversation:\n\n" - ) - first_messages_to_keep.append( - {"role": "system", 
"content": summary_block} + print( + f"[Inlet] ⚠️ Overwriting unconsumed old state (Chat ID: {chat_id})" ) - # Keep the last messages - last_messages_to_keep = ( - messages[-self.valves.keep_last :] if self.valves.keep_last > 0 else [] + self.temp_state[chat_id] = target_compressed_count + + if self.valves.debug_mode: + print( + f"[Inlet] Recorded target compression progress: {target_compressed_count}" ) - # Combine: [Kept initial messages (with summary)] + [Kept recent messages] - body["messages"] = first_messages_to_keep + last_messages_to_keep + # Load summary record + summary_record = await asyncio.to_thread(self._load_summary_record, chat_id) + + final_messages = [] + + if summary_record: + # Summary exists, build view: [Head] + [Summary Message] + [Tail] + # Tail is all messages after the last compression point + compressed_count = summary_record.compressed_message_count + + # Ensure compressed_count is reasonable + if compressed_count > len(messages): + compressed_count = max(0, len(messages) - self.valves.keep_last) + + # 1. Head messages (Keep First) + head_messages = [] + if self.valves.keep_first > 0: + head_messages = messages[: self.valves.keep_first] + + # 2. Summary message (Inserted as User message) + summary_content = ( + f"【System Prompt: The following is a summary of the historical conversation, provided for context only. Do not reply to the summary content itself; answer the subsequent latest questions directly.】\n\n" + f"{summary_record.summary}\n\n" + f"---\n" + f"Below is the recent conversation:" + ) + summary_msg = {"role": "user", "content": summary_content} + + # 3. 
Tail messages (Tail) - All messages starting from the last compression point + # Note: Must ensure head messages are not duplicated + start_index = max(compressed_count, self.valves.keep_first) + tail_messages = messages[start_index:] + + final_messages = head_messages + [summary_msg] + tail_messages + + # Send status notification + if __event_emitter__: + await __event_emitter__( + { + "type": "status", + "data": { + "description": f"Loaded historical summary (Hidden {compressed_count} historical messages)", + "done": True, + }, + } + ) if self.valves.debug_mode: - print(f"[Inlet] βœ‚οΈ Compression complete:") - print(f" - Original messages: {len(messages)}") - print(f" - Compressed to: {len(body['messages'])}") print( - f" - Structure: [Keep first {self.valves.keep_first} (with summary)] + [Keep last {self.valves.keep_last}]" + f"[Inlet] Applied summary: Head({len(head_messages)}) + Summary + Tail({len(tail_messages)})" ) - print(f" - Saved: {len(messages) - len(body['messages'])} messages") else: - if self.valves.debug_mode: - if not saved_summary: - print(f"[Inlet] No summary found, using full conversation history.") - else: - print(f"[Inlet] Message count does not exceed retention threshold, no compression applied.") + # No summary, use original messages + final_messages = messages + + body["messages"] = final_messages if self.valves.debug_mode: + print(f"[Inlet] Final send: {len(body['messages'])} messages") print(f"{'='*60}\n") return body async def outlet( - self, body: dict, __user__: Optional[dict] = None, __metadata__: dict = None + self, + body: dict, + __user__: Optional[dict] = None, + __metadata__: dict = None, + __event_emitter__: Callable[[Any], Awaitable[None]] = None, ) -> dict: """ Executed after the LLM response is complete. - Triggers summary generation asynchronously. + Calculates Token count in the background and triggers summary generation (does not block current response, does not affect content output). 
""" - messages = body.get("messages", []) chat_id = __metadata__["chat_id"] + model = body.get("model", "gpt-3.5-turbo") if self.valves.debug_mode: print(f"\n{'='*60}") print(f"[Outlet] Chat ID: {chat_id}") - print(f"[Outlet] Response complete, current message count: {len(messages)}") + print(f"[Outlet] Response complete") - # Check if compression is needed - if len(messages) >= self.valves.compression_threshold: - if self.valves.debug_mode: - print( - f"[Outlet] ⚑ Compression threshold reached ({len(messages)} >= {self.valves.compression_threshold})" - ) - print(f"[Outlet] Preparing to generate summary in the background...") - - # Generate summary asynchronously in the background - asyncio.create_task( - self._generate_summary_async(messages, chat_id, body, __user__) + # Process Token calculation and summary generation asynchronously in the background (do not wait for completion, do not affect output) + asyncio.create_task( + self._check_and_generate_summary_async( + chat_id, model, body, __user__, __event_emitter__ ) - else: - if self.valves.debug_mode: - print( - f"[Outlet] Compression threshold not reached ({len(messages)} < {self.valves.compression_threshold})" - ) + ) if self.valves.debug_mode: + print(f"[Outlet] Background processing started") print(f"{'='*60}\n") return body - async def _generate_summary_async( - self, messages: list, chat_id: str, body: dict, user_data: Optional[dict] + async def _check_and_generate_summary_async( + self, + chat_id: str, + model: str, + body: dict, + user_data: Optional[dict], + __event_emitter__: Callable[[Any], Awaitable[None]] = None, ): """ - Generates a summary asynchronously in the background. + Background processing: Calculates Token count and generates summary (does not block response). 
+ """ + try: + messages = body.get("messages", []) + + # Get threshold configuration for current model + thresholds = self._get_model_thresholds(model) + compression_threshold_tokens = thresholds.get( + "compression_threshold_tokens", self.valves.compression_threshold_tokens + ) + + if self.valves.debug_mode: + print(f"\n[πŸ” Background Calculation] Starting Token count...") + + # Calculate Token count in a background thread + current_tokens = await asyncio.to_thread( + self._calculate_messages_tokens, messages, model + ) + + if self.valves.debug_mode: + print(f"[πŸ” Background Calculation] Token count: {current_tokens}") + + # Check if compression is needed + if current_tokens >= compression_threshold_tokens: + if self.valves.debug_mode: + print( + f"[πŸ” Background Calculation] ⚑ Compression threshold triggered (Token: {current_tokens} >= {compression_threshold_tokens})" + ) + + # Proceed to generate summary + await self._generate_summary_async( + messages, chat_id, body, user_data, __event_emitter__ + ) + else: + if self.valves.debug_mode: + print( + f"[πŸ” Background Calculation] Compression threshold not reached (Token: {current_tokens} < {compression_threshold_tokens})" + ) + + except Exception as e: + print(f"[πŸ” Background Calculation] ❌ Error: {str(e)}") + + async def _generate_summary_async( + self, + messages: list, + chat_id: str, + body: dict, + user_data: Optional[dict], + __event_emitter__: Callable[[Any], Awaitable[None]] = None, + ): + """ + Generates summary asynchronously (runs in background, does not block response). + Logic: + 1. Extract middle messages (remove keep_first and keep_last). + 2. Check Token limit, if exceeding max_context_tokens, remove from the head of middle messages. + 3. Generate summary for the remaining middle messages. 
""" try: if self.valves.debug_mode: print(f"\n[πŸ€– Async Summary Task] Starting...") - # Messages to summarize: exclude kept initial and final messages - if self.valves.keep_last > 0: - messages_to_summarize = messages[ - self.valves.keep_first : -self.valves.keep_last - ] - else: - messages_to_summarize = messages[self.valves.keep_first :] - - if len(messages_to_summarize) == 0: + # 1. Get target compression progress + # Prioritize getting from temp_state (calculated by inlet). If unavailable (e.g., after restart), assume current is full history. + target_compressed_count = self.temp_state.pop(chat_id, None) + if target_compressed_count is None: + target_compressed_count = max(0, len(messages) - self.valves.keep_last) if self.valves.debug_mode: - print(f"[πŸ€– Async Summary Task] No messages to summarize, skipping.") + print( + f"[πŸ€– Async Summary Task] ⚠️ Could not get inlet state, estimating progress using current message count: {target_compressed_count}" + ) + + # 2. Determine the range of messages to compress (Middle) + start_index = self.valves.keep_first + end_index = len(messages) - self.valves.keep_last + if self.valves.keep_last == 0: + end_index = len(messages) + + # Ensure indices are valid + if start_index >= end_index: + if self.valves.debug_mode: + print( + f"[πŸ€– Async Summary Task] Middle messages empty (Start: {start_index}, End: {end_index}), skipping" + ) return + middle_messages = messages[start_index:end_index] + if self.valves.debug_mode: - print(f"[πŸ€– Async Summary Task] Preparing to summarize {len(messages_to_summarize)} messages.") print( - f"[πŸ€– Async Summary Task] Protecting: First {self.valves.keep_first} + Last {self.valves.keep_last} messages." + f"[πŸ€– Async Summary Task] Middle messages to process: {len(middle_messages)}" ) - # Build conversation history text - conversation_text = self._format_messages_for_summary(messages_to_summarize) + # 3. 
Check Token limit and truncate (Max Context Truncation) + # [Optimization] Use the summary model's (if any) threshold to decide how many middle messages can be processed + # This allows using a long-window model (like gemini-flash) to compress history exceeding the current model's window + summary_model_id = self.valves.summary_model or body.get( + "model", "gpt-3.5-turbo" + ) - # Call LLM to generate summary - summary = await self._call_summary_llm(conversation_text, body, user_data) - - # [Optimization] Save summary in a background thread to avoid blocking the event loop. - if self.valves.debug_mode: - print("[Optimization] Saving summary in a background thread to avoid blocking the event loop.") - await asyncio.to_thread(self._save_summary, chat_id, summary, body) + thresholds = self._get_model_thresholds(summary_model_id) + # Note: Using the summary model's max context limit here + max_context_tokens = thresholds.get( + "max_context_tokens", self.valves.max_context_tokens + ) if self.valves.debug_mode: - print(f"[πŸ€– Async Summary Task] βœ… Complete! 
Summary length: {len(summary)} characters.")
-                print(f"[πŸ€– Async Summary Task] Summary preview: {summary[:150]}...")
+                print(
+                    f"[πŸ€– Async Summary Task] Using max limit for model {summary_model_id}: {max_context_tokens} Tokens"
+                )
+
+            # Calculate current total Tokens (using the summary model for counting)
+            total_tokens = await asyncio.to_thread(
+                self._calculate_messages_tokens, messages, summary_model_id
+            )
+
+            if total_tokens > max_context_tokens:
+                excess_tokens = total_tokens - max_context_tokens
+                if self.valves.debug_mode:
+                    print(
+                        f"[πŸ€– Async Summary Task] ⚠️ Total Tokens ({total_tokens}) exceed the summary model limit ({max_context_tokens}); removing approx. {excess_tokens} Tokens"
+                    )
+
+                # Remove from the head of middle_messages
+                removed_tokens = 0
+                removed_count = 0
+
+                while removed_tokens < excess_tokens and middle_messages:
+                    msg_to_remove = middle_messages.pop(0)
+                    msg_tokens = self._count_tokens(
+                        str(msg_to_remove.get("content", "")), summary_model_id
+                    )
+                    removed_tokens += msg_tokens
+                    removed_count += 1
+
+                if self.valves.debug_mode:
+                    print(
+                        f"[πŸ€– Async Summary Task] Removed {removed_count} messages, totaling {removed_tokens} Tokens"
+                    )
+
+            if not middle_messages:
+                if self.valves.debug_mode:
+                    print(
+                        f"[πŸ€– Async Summary Task] Middle messages empty after truncation; skipping summary generation"
+                    )
+                return
+
+            # 4. Build conversation text
+            conversation_text = self._format_messages_for_summary(middle_messages)
+
+            # 5. Call LLM to generate the new summary
+            # Note: previous_summary is not passed here because the old summary (if any) is already included in middle_messages
+
+            # Send status notification that summary generation has started
+            if __event_emitter__:
+                await __event_emitter__(
+                    {
+                        "type": "status",
+                        "data": {
+                            "description": "Generating context summary in background...",
+                            "done": False,
+                        },
+                    }
+                )
+
+            new_summary = await self._call_summary_llm(
+                None, conversation_text, body, user_data
+            )
+
+            # 6.
Save new summary + if self.valves.debug_mode: + print( + "[Optimization] Saving summary in a background thread to avoid blocking the event loop." + ) + + await asyncio.to_thread( + self._save_summary, chat_id, new_summary, target_compressed_count + ) + + # Send completion status notification + if __event_emitter__: + await __event_emitter__( + { + "type": "status", + "data": { + "description": f"Context summary updated (Saved {len(middle_messages)} messages)", + "done": True, + }, + } + ) + + if self.valves.debug_mode: + print( + f"[πŸ€– Async Summary Task] βœ… Complete! New summary length: {len(new_summary)} characters" + ) + print( + f"[πŸ€– Async Summary Task] Progress update: Compressed up to original message {target_compressed_count}" + ) except Exception as e: print(f"[πŸ€– Async Summary Task] ❌ Error: {str(e)}") import traceback traceback.print_exc() - # Save a simple placeholder even on failure - fallback_summary = ( - f"[Historical Conversation Summary] Contains content from approximately {len(messages_to_summarize)} messages." - ) - - # [Optimization] Save summary in a background thread to avoid blocking the event loop. - if self.valves.debug_mode: - print("[Optimization] Saving summary in a background thread to avoid blocking the event loop.") - await asyncio.to_thread(self._save_summary, chat_id, fallback_summary, body) def _format_messages_for_summary(self, messages: list) -> str: """Formats messages for summarization.""" @@ -685,37 +1316,51 @@ class Filter: return "\n\n".join(formatted) async def _call_summary_llm( - self, conversation_text: str, body: dict, user_data: dict + self, + previous_summary: Optional[str], + new_conversation_text: str, + body: dict, + user_data: dict, ) -> str: """ Calls the LLM to generate a summary using Open WebUI's built-in method. 
""" if self.valves.debug_mode: - print(f"[πŸ€– LLM Call] Using Open WebUI's built-in method.") + print(f"[πŸ€– LLM Call] Using Open WebUI's built-in method") - # Build summary prompt + # Build summary prompt (Optimized) summary_prompt = f""" -You are a professional conversation context compression assistant. Your task is to perform a high-fidelity compression of the [Conversation Content] below, producing a concise summary that can be used directly as context for subsequent conversation. Strictly adhere to the following requirements: +You are a professional conversation context compression expert. Your task is to create a high-fidelity summary of the following conversation content. +This conversation may contain previous summaries (as system messages or text) and subsequent conversation content. -MUST RETAIN: Topics/goals, user intent, key facts and data, important parameters and constraints, deadlines, decisions/conclusions, action items and their status, and technical details like code/commands (code must be preserved as is). -REMOVE: Greetings, politeness, repetitive statements, off-topic chatter, and procedural details (unless essential). For information that has been overturned or is outdated, please mark it as "Obsolete: " when retaining. -CONFLICT RESOLUTION: If there are contradictions or multiple revisions, retain the latest consistent conclusion and list unresolved or conflicting points under "Points to Clarify". -STRUCTURE AND TONE: Output in structured bullet points. Be logical, objective, and concise. Summarize from a third-person perspective. Use code blocks to preserve technical/code snippets verbatim. -OUTPUT LENGTH: Strictly limit the summary content to within {int(self.valves.max_summary_tokens * 3)} characters. Prioritize key information; if space is insufficient, trim details rather than core conclusions. -FORMATTING: Output only the summary text. Do not add any extra explanations, execution logs, or generation processes. 
You must use the following headings (if a section has no content, write "None"):
-Core Theme:
-Key Information:
-... (List 3-6 key points)
-Decisions/Conclusions:
-Action Items (with owner/deadline if any):
-Relevant Roles/Preferences:
-Risks/Dependencies/Assumptions:
-Points to Clarify:
-Compression Ratio: Original ~X words β†’ Summary ~Y words (estimate)
-Conversation Content:
-{conversation_text}
+### Core Objectives
+1. **Comprehensive Summary**: Concisely summarize key information, user intent, and assistant responses from the conversation.
+2. **De-noising**: Remove greetings, repetitions, confirmations, and other non-essential information.
+3. **Key Retention**:
+    * **Code snippets, commands, and technical parameters must be preserved verbatim. Do not modify or generalize them.**
+    * User intent, core requirements, decisions, and action items must be clearly preserved.
+4. **Coherence**: The generated summary should be a cohesive whole that can replace the original conversation as context.
+5. **Detailed Record**: Since length permits, preserve details, reasoning steps, and the nuances of multi-turn interactions rather than only high-level generalizations.
-Please directly output the compressed summary that meets the above requirements (summary text only).
+### Output Requirements
+* **Format**: Structured, logically organized text.
+* **Language**: Consistent with the conversation language (usually English).
+* **Length**: Keep the summary strictly within {self.valves.max_summary_tokens} Tokens.
+* **Strictly Forbidden**: Do not output "According to the conversation...", "The summary is as follows...", or similar filler. Output the summary content directly.
+
+### Suggested Summary Structure
+* **Current Goal/Topic**: A one-sentence summary of the problem currently being solved.
+* **Key Information & Context**:
+    * Confirmed facts/parameters.
+    * **Code/Technical Details** (wrap in code blocks).
+* **Progress & Conclusions**: Completed steps and consensus reached.
+* **Action Items/Next Steps**: Clear follow-up actions.
+
+---
+{new_conversation_text}
+---
+
+Based on the content above, generate the summary:
"""

        # Determine the model to use
        model = self.valves.summary_model or body.get("model", "")
@@ -740,7 +1385,9 @@ Please directly output the compressed summary that meets the above requirements

            # [Optimization] Get user object in a background thread to avoid blocking the event loop.
            if self.valves.debug_mode:
-                print("[Optimization] Getting user object in a background thread to avoid blocking the event loop.")
+                print(
+                    "[Optimization] Getting user object in a background thread to avoid blocking the event loop."
+                )

            user = await asyncio.to_thread(Users.get_user_by_id, user_id)
            if not user:
@@ -757,17 +1404,17 @@ Please directly output the compressed summary that meets the above requirements

            response = await generate_chat_completion(request, payload, user)

            if not response or "choices" not in response or not response["choices"]:
-                raise ValueError("LLM response is not in the correct format or is empty")
+                raise ValueError("LLM response format is incorrect or empty")

            summary = response["choices"][0]["message"]["content"].strip()

            if self.valves.debug_mode:
-                print(f"[πŸ€– LLM Call] βœ… Successfully received summary.")
+                print(f"[πŸ€– LLM Call] βœ… Successfully received summary")

            return summary

        except Exception as e:
-            error_message = f"An error occurred while calling the LLM ({model}) to generate a summary: {str(e)}"
+            error_message = f"Error while calling the LLM ({model}) to generate a summary: {str(e)}"
            if not self.valves.summary_model:
                error_message += (
                    "\n[Hint] You did not specify a summary_model, so the filter attempted to use the current conversation's model. "
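Note on the token-threshold trigger: the patch gates compression on helpers such as `_calculate_messages_tokens` and `_count_tokens`, whose bodies fall outside these hunks. The sketch below illustrates the kind of check they enable. It is not the plugin's actual implementation: the 4-characters-per-token heuristic and the per-message overhead constant are assumptions for illustration (the real filter presumably uses a proper tokenizer such as tiktoken), and all names here are hypothetical.

```python
# Hypothetical sketch of token-threshold compression gating.
# ASSUMPTION: a crude chars/4 heuristic stands in for a real tokenizer.

CHARS_PER_TOKEN = 4       # assumed heuristic, not the plugin's value
PER_MESSAGE_OVERHEAD = 4  # assumed fixed cost for role/markup per message


def count_tokens(text: str) -> int:
    """Approximate the token count of a string (never returns 0 for non-empty input)."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def calculate_messages_tokens(messages: list[dict]) -> int:
    """Sum approximate tokens across a chat history, including per-message overhead."""
    total = 0
    for msg in messages:
        total += PER_MESSAGE_OVERHEAD
        total += count_tokens(str(msg.get("content", "")))
    return total


def should_compress(messages: list[dict], threshold_tokens: int) -> bool:
    """True when the history is large enough to trigger background summarization."""
    return calculate_messages_tokens(messages) >= threshold_tokens
```

In the patch, the equivalent check runs inside `asyncio.to_thread` from the outlet's background task, so even an expensive tokenizer pass never blocks the response being streamed to the user.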