feat(async-context-compression): release v1.2.1 with smart config & optimizations

This release introduces significant improvements to configuration flexibility, performance, and stability.

**Key Changes:**

*   **Smart Configuration:**
    *   Added `summary_model_max_context` to allow independent context limits for the summary model (e.g., using `gemini-flash` with 1M context to summarize `gpt-4` history).
    *   Implemented auto-detection of base model settings for custom models, ensuring correct threshold application.
*   **Performance & Refactoring:**
    *   Optimized `model_thresholds` parsing with caching to reduce overhead.
    *   Refactored `inlet` and `outlet` logic to remove redundant code and improve maintainability.
    *   Replaced all `print` statements with proper `logging` calls for better production monitoring.
*   **Bug Fixes & Modernization:**
    *   Fixed `datetime.utcnow()` deprecation warnings by switching to timezone-aware `datetime.now(timezone.utc)`.
    *   Corrected type annotations and improved error handling for `JSONResponse` objects from LLM backends.
    *   Removed hard truncation in summary generation to allow full context usage.
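
The `datetime` fix above follows the standard Python 3.12+ migration; a minimal sketch of the pattern (the callable wrapper matches how the plugin's SQLAlchemy column defaults are updated):

```python
from datetime import datetime, timezone

# datetime.utcnow() is deprecated since Python 3.12 and returns a
# naive datetime (tzinfo=None); use an aware UTC timestamp instead.
def utc_now() -> datetime:
    return datetime.now(timezone.utc)

# For SQLAlchemy column defaults, pass a callable so the timestamp is
# evaluated at insert/update time, not at class-definition time:
# created_at = Column(DateTime, default=utc_now)
```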

**Files Updated:**
*   Plugin source code (English & Chinese)
*   Documentation and READMEs
*   Version bumped to 1.2.1
fujie
2026-01-20 19:09:25 +08:00
parent 0d853577df
commit 25c9d20f3d
8 changed files with 495 additions and 210 deletions


@@ -1,9 +1,15 @@
# Async Context Compression Filter
**Author:** [Fu-Jie](https://github.com/Fu-Jie/awesome-openwebui) | **Version:** 1.2.0 | **Project:** [Awesome OpenWebUI](https://github.com/Fu-Jie/awesome-openwebui) | **License:** MIT
**Author:** [Fu-Jie](https://github.com/Fu-Jie/awesome-openwebui) | **Version:** 1.2.1 | **Project:** [Awesome OpenWebUI](https://github.com/Fu-Jie/awesome-openwebui) | **License:** MIT
This filter reduces token consumption in long conversations through intelligent summarization and message compression while keeping conversations coherent.
## What's new in 1.2.1
- **Smart Configuration**: Automatically detects base model settings for custom models and adds `summary_model_max_context` for independent summary limits.
- **Performance & Refactoring**: Optimized threshold parsing with caching, removed redundant code, and improved LLM response handling (JSONResponse support).
- **Bug Fixes & Modernization**: Fixed `datetime` deprecation warnings, corrected type annotations, and replaced print statements with proper logging.
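
The `summary_model_max_context` fallback resolves in a fixed priority order; a sketch mirroring the logic added in this diff (valve override first, then per-model thresholds, then the global default):

```python
def resolve_summary_context(summary_model_max_context: int,
                            thresholds: dict,
                            global_max_context_tokens: int) -> int:
    """Priority: valve override (>0) -> model_thresholds -> global default."""
    if summary_model_max_context > 0:
        return summary_model_max_context
    return thresholds.get("max_context_tokens", global_max_context_tokens)
```

This is what lets a 1M-context summary model (e.g. a Gemini Flash variant) summarize history from a chat model with a much smaller window.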
## What's new in 1.2.0
- **Preflight Context Check**: Before sending to the model, validates that total tokens fit within the context window. Automatically trims or drops oldest messages if exceeded.
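
The preflight check can be sketched as a simple budget loop (hypothetical helper, illustrative only; the real filter also does structure-aware trimming and respects `keep_first`):

```python
def preflight_trim(messages, count_tokens, max_context_tokens, keep_first=1):
    """Drop the oldest unprotected messages until the total token count
    fits the context window. `count_tokens` is any per-message estimator.
    """
    total = sum(count_tokens(m) for m in messages)
    # Trim just after the protected head, always keeping at least one
    # message behind it.
    while total > max_context_tokens and len(messages) > keep_first + 1:
        dropped = messages.pop(keep_first)
        total -= count_tokens(dropped)
    return messages, total
```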
@@ -19,18 +25,6 @@ This filter reduces token consumption in long conversations through intelligent
- **Enhanced Stability**: Fixed a race condition in state management that could cause "inlet state not found" warnings in high-concurrency scenarios.
- **Bug Fixes**: Corrected default model handling to prevent misleading logs when no model is specified.
## What's new in 1.1.2
- **Open WebUI v0.7.x Compatibility**: Resolved a critical database session binding error affecting Open WebUI v0.7.x users. The plugin now dynamically discovers the database engine and session context, ensuring compatibility across versions.
- **Enhanced Error Reporting**: Errors during background summary generation are now reported via both the status bar and browser console.
- **Robust Model Handling**: Improved handling of missing or invalid model IDs to prevent crashes.
## What's new in 1.1.1
- **Frontend Debugging**: Added `show_debug_log` option to print debug info to the browser console (F12).
- **Optimized Compression**: Improved token calculation logic to prevent aggressive truncation of history, ensuring more context is retained.
---
@@ -45,6 +39,8 @@ This filter reduces token consumption in long conversations through intelligent
- ✅ Native tool output trimming for cleaner context when using function calling.
- ✅ Real-time context usage monitoring with warning notifications (>90%).
- ✅ Detailed token logging for precise debugging and optimization.
- **Smart Model Matching**: Automatically inherits configuration from base models for custom presets.
- **Multimodal Support**: Images are preserved but their tokens are **NOT** calculated. Please adjust thresholds accordingly.
---
@@ -75,7 +71,8 @@ It is recommended to keep this filter early in the chain so it runs before filte
| `keep_first` | `1` | Always keep the first N messages (protects system prompts). |
| `keep_last` | `6` | Always keep the last N messages to preserve recent context. |
| `summary_model` | `None` | Model for summaries. Strongly recommended to set a fast, economical model (e.g., `gemini-2.5-flash`, `deepseek-v3`). Falls back to the current chat model when empty. |
| `max_summary_tokens` | `4000` | Maximum tokens for the generated summary. |
| `summary_model_max_context` | `0` | Max context tokens for the summary model. If 0, falls back to `model_thresholds` or global `max_context_tokens`. |
| `max_summary_tokens` | `16384` | Maximum tokens for the generated summary. |
| `summary_temperature` | `0.3` | Randomness for summary generation. Lower is more deterministic. |
| `model_thresholds` | `{}` | Per-model overrides for `compression_threshold_tokens` and `max_context_tokens` (useful for mixed models). |
| `enable_tool_output_trimming` | `false` | When enabled and `function_calling: "native"` is active, trims verbose tool outputs to extract only the final answer. |
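
The `model_thresholds` valve now accepts a comma-separated string (this diff replaces the old dict valve); a sketch of the parsing, mirroring `_parse_model_thresholds`:

```python
def parse_model_thresholds(raw: str) -> dict:
    """Parse 'model_id:compression_threshold:max_context' entries,
    comma-separated. Malformed entries are skipped silently."""
    out = {}
    for entry in raw.split(","):
        parts = [p.strip() for p in entry.strip().split(":")]
        if len(parts) != 3:
            continue
        try:
            out[parts[0]] = {
                "compression_threshold_tokens": int(parts[1]),
                "max_context_tokens": int(parts[2]),
            }
        except ValueError:
            continue
    return out
```

Entries that do not split into exactly three fields, or whose numeric fields fail `int()`, are dropped rather than raising, matching the filter's lenient behavior.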


@@ -1,11 +1,17 @@
# Async Context Compression Filter
**Author:** [Fu-Jie](https://github.com/Fu-Jie/awesome-openwebui) | **Version:** 1.2.0 | **Project:** [Awesome OpenWebUI](https://github.com/Fu-Jie/awesome-openwebui) | **License:** MIT
**Author:** [Fu-Jie](https://github.com/Fu-Jie/awesome-openwebui) | **Version:** 1.2.1 | **Project:** [Awesome OpenWebUI](https://github.com/Fu-Jie/awesome-openwebui) | **License:** MIT
> **Important**: To keep every filter maintainable and easy to use, each filter should ship with clear, complete documentation covering its features, configuration, and usage.
This filter reduces token consumption in long conversations through intelligent summarization and message compression while keeping conversations coherent.
## What's new in 1.2.1
- **Smart Configuration**: Automatically detects the base-model settings of custom models and adds a `summary_model_max_context` parameter to control the summary model's context limit independently.
- **Performance & Refactoring**: Refactored threshold parsing with caching, removed redundant handling code, and improved LLM response handling (JSONResponse support).
- **Stability**: Fixed `datetime` deprecation warnings, corrected type annotations, and replaced print statements with standard logging.
## What's new in 1.2.0
- **Preflight Context Check**: Before sending to the model, validates that the total tokens fit within the context window. Automatically trims or drops the oldest messages if exceeded.
@@ -21,18 +27,6 @@
- **Enhanced Stability**: Fixed a race condition in state management that could cause "inlet state not found" warnings in high-concurrency scenarios.
- **Bug Fixes**: Corrected default model handling to prevent misleading logs when no model is specified.
## What's new in 1.1.2
- **Open WebUI v0.7.x Compatibility**: Fixed a critical database session binding error affecting Open WebUI v0.7.x users. The plugin now dynamically discovers the database engine and session context, ensuring compatibility across versions.
- **Enhanced Error Reporting**: Errors during background summary generation are now reported via both the status bar and the browser console.
- **Robust Model Handling**: Improved handling of missing or invalid model IDs to prevent crashes.
## What's new in 1.1.1
- **Frontend Debugging**: Added a `show_debug_log` option to print debug info to the browser console (F12).
- **Optimized Compression**: Improved token calculation logic to prevent over-truncation of history, retaining more context.
---
@@ -47,6 +41,8 @@
- **Native Tool Output Trimming**: Trims verbose tool-call outputs.
- **Real-time Monitoring**: Monitors context usage in real time and warns when it exceeds 90%.
- **Detailed Logging**: Provides precise token statistics for easier debugging.
- **Smart Model Matching**: Custom models automatically inherit the threshold configuration of their base models.
- **Multimodal Support**: Image content is preserved, but its tokens are **not counted**. Adjust thresholds accordingly.
See the [Workflow Guide](WORKFLOW_GUIDE_CN.md) for detailed mechanics and flow.
@@ -88,6 +84,7 @@
| Parameter | Default | Description |
| :-------------------- | :------ | :------------------------------------------------------------------------------------------------------------------------------------------ |
| `summary_model` | `None` | Model ID used to generate summaries. **Strongly recommended** to configure a fast, economical model with a large context window (e.g., `gemini-2.5-flash`, `deepseek-v3`). Falls back to the current chat model when empty. |
| `summary_model_max_context` | `0` | Max context tokens for the summary model. If 0, falls back to `model_thresholds` or the global `max_context_tokens`. |
| `max_summary_tokens` | `16384` | Maximum tokens allowed for the generated summary. |
| `summary_temperature` | `0.1` | Controls the randomness of summary generation; lower values are more deterministic. |


@@ -5,19 +5,17 @@ author: Fu-Jie
author_url: https://github.com/Fu-Jie/awesome-openwebui
funding_url: https://github.com/open-webui
description: Reduces token consumption in long conversations while maintaining coherence through intelligent summarization and message compression.
version: 1.2.0
version: 1.2.1
openwebui_id: b1655bc8-6de9-4cad-8cb5-a6f7829a02ce
license: MIT
═══════════════════════════════════════════════════════════════════════════════
📌 What's new in 1.2.0
📌 What's new in 1.2.1
═══════════════════════════════════════════════════════════════════════════════
Preflight Context Check: Validates context fit before sending to model.
Structure-Aware Trimming: Collapses long AI responses while keeping H1-H6, intro, and conclusion.
Native Tool Output Trimming: Cleaner context when using function calling. (Note: Non-native tool outputs are not fully injected into context)
✅ Context Usage Warning: Notification when usage exceeds 90%.
✅ Detailed Token Logging: Granular breakdown of System, Head, Summary, and Tail tokens.
Smart Configuration: Automatically detects base model settings for custom models and adds `summary_model_max_context` for independent summary limits.
Performance & Refactoring: Optimized threshold parsing with caching and removed redundant code for better efficiency.
Bug Fixes & Modernization: Fixed `datetime` deprecation warnings and corrected type annotations.
═══════════════════════════════════════════════════════════════════════════════
📌 Overview
@@ -229,6 +227,8 @@ Statistics:
✓ This filter supports multimodal messages containing images.
✓ The summary is generated only from the text content.
✓ Non-text parts (like images) are preserved in their original messages during compression.
⚠ Image tokens are NOT calculated. Different models have vastly different image token costs
(GPT-4o: 85-1105, Claude: ~1300, Gemini: ~258 per image). Plan your thresholds accordingly.
═══════════════════════════════════════════════════════════════════════════════
🐛 Troubleshooting
@@ -259,7 +259,7 @@ Solution:
"""
from pydantic import BaseModel, Field, model_validator
from pydantic import BaseModel, Field
from typing import Optional, Dict, Any, List, Union, Callable, Awaitable
import re
import asyncio
@@ -267,6 +267,10 @@ import json
import hashlib
import time
import contextlib
import logging
# Setup logger
logger = logging.getLogger(__name__)
# Open WebUI built-in imports
from open_webui.utils.chat import generate_chat_completion
@@ -291,7 +295,7 @@ except ImportError:
from sqlalchemy import Column, String, Text, DateTime, Integer, inspect
from sqlalchemy.orm import declarative_base, sessionmaker
from sqlalchemy.engine import Engine
from datetime import datetime
from datetime import datetime, timezone
def _discover_owui_engine(db_module: Any) -> Optional[Engine]:
@@ -312,7 +316,7 @@ def _discover_owui_engine(db_module: Any) -> Optional[Engine]:
session, "engine", None
)
except Exception as exc:
print(f"[DB Discover] get_db_context failed: {exc}")
logger.error(f"[DB Discover] get_db_context failed: {exc}")
for attr in ("engine", "ENGINE", "bind", "BIND"):
candidate = getattr(db_module, attr, None)
@@ -334,7 +338,7 @@ def _discover_owui_schema(db_module: Any) -> Optional[str]:
if isinstance(candidate, str) and candidate.strip():
return candidate.strip()
except Exception as exc:
print(f"[DB Discover] Base metadata schema lookup failed: {exc}")
logger.error(f"[DB Discover] Base metadata schema lookup failed: {exc}")
try:
metadata_obj = getattr(db_module, "metadata_obj", None)
@@ -344,7 +348,7 @@ def _discover_owui_schema(db_module: Any) -> Optional[str]:
if isinstance(candidate, str) and candidate.strip():
return candidate.strip()
except Exception as exc:
print(f"[DB Discover] metadata_obj schema lookup failed: {exc}")
logger.error(f"[DB Discover] metadata_obj schema lookup failed: {exc}")
try:
from open_webui import env as owui_env
@@ -353,7 +357,7 @@ def _discover_owui_schema(db_module: Any) -> Optional[str]:
if isinstance(candidate, str) and candidate.strip():
return candidate.strip()
except Exception as exc:
print(f"[DB Discover] env schema lookup failed: {exc}")
logger.error(f"[DB Discover] env schema lookup failed: {exc}")
return None
@@ -379,8 +383,21 @@ class ChatSummary(owui_Base):
chat_id = Column(String(255), unique=True, nullable=False, index=True)
summary = Column(Text, nullable=False)
compressed_message_count = Column(Integer, default=0)
created_at = Column(DateTime, default=datetime.utcnow)
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))
updated_at = Column(
DateTime,
default=lambda: datetime.now(timezone.utc),
onupdate=lambda: datetime.now(timezone.utc),
)
# Global cache for tiktoken encoding
TIKTOKEN_ENCODING = None
if tiktoken:
try:
TIKTOKEN_ENCODING = tiktoken.get_encoding("o200k_base")
except Exception as e:
logger.error(f"[Init] Failed to load tiktoken encoding: {e}")
class Filter:
@@ -391,8 +408,48 @@ class Filter:
self._fallback_session_factory = (
sessionmaker(bind=self._db_engine) if self._db_engine else None
)
self._model_thresholds_cache: Optional[Dict[str, Any]] = None
self._init_database()
def _parse_model_thresholds(self) -> Dict[str, Any]:
"""Parse model_thresholds string into a dictionary.
Format: model_id:compression_threshold:max_context, model_id2:threshold2:max2
Example: gpt-4:8000:32000, claude-3:100000:200000
Returns cached result if already parsed.
"""
if self._model_thresholds_cache is not None:
return self._model_thresholds_cache
self._model_thresholds_cache = {}
raw_config = self.valves.model_thresholds
if not raw_config:
return self._model_thresholds_cache
for entry in raw_config.split(","):
entry = entry.strip()
if not entry:
continue
parts = entry.split(":")
if len(parts) != 3:
continue
try:
model_id = parts[0].strip()
compression_threshold = int(parts[1].strip())
max_context = int(parts[2].strip())
self._model_thresholds_cache[model_id] = {
"compression_threshold_tokens": compression_threshold,
"max_context_tokens": max_context,
}
except ValueError:
continue
return self._model_thresholds_cache
@contextlib.contextmanager
def _db_session(self):
"""Yield a database session using Open WebUI helpers with graceful fallbacks."""
@@ -435,7 +492,7 @@ class Filter:
try:
session.close()
except Exception as exc: # pragma: no cover - best-effort cleanup
print(f"[Database] ⚠️ Failed to close fallback session: {exc}")
logger.warning(f"[Database] ⚠️ Failed to close fallback session: {exc}")
def _init_database(self):
"""Initializes the database table using Open WebUI's shared connection."""
@@ -447,19 +504,26 @@ class Filter:
# Check if table exists using SQLAlchemy inspect
inspector = inspect(self._db_engine)
if not inspector.has_table("chat_summary"):
# Support schema if configured
has_table = (
inspector.has_table("chat_summary", schema=owui_schema)
if owui_schema
else inspector.has_table("chat_summary")
)
if not has_table:
# Create the chat_summary table if it doesn't exist
ChatSummary.__table__.create(bind=self._db_engine, checkfirst=True)
print(
logger.info(
"[Database] ✅ Successfully created chat_summary table using Open WebUI's shared database connection."
)
else:
print(
logger.info(
"[Database] ✅ Using Open WebUI's shared database connection. chat_summary table already exists."
)
except Exception as e:
print(f"[Database] ❌ Initialization failed: {str(e)}")
logger.error(f"[Database] ❌ Initialization failed: {str(e)}")
class Valves(BaseModel):
priority: int = Field(
@@ -476,9 +540,9 @@ class Filter:
ge=0,
description="Hard limit for context. Exceeding this value will force removal of earliest messages (Global Default)",
)
model_thresholds: dict = Field(
default={},
description="Threshold override configuration for specific models. Only includes models requiring special configuration.",
model_thresholds: str = Field(
default="",
description="Per-model threshold overrides. Format: model_id:compression_threshold:max_context (comma-separated). Example: gpt-4:8000:32000, claude-3:100000:200000",
)
keep_first: int = Field(
@@ -489,10 +553,15 @@ class Filter:
keep_last: int = Field(
default=6, ge=0, description="Always keep the last N full messages."
)
summary_model: str = Field(
summary_model: Optional[str] = Field(
default=None,
description="The model ID used to generate the summary. If empty, uses the current conversation's model. Used to match configurations in model_thresholds.",
)
summary_model_max_context: int = Field(
default=0,
ge=0,
description="Max context tokens for the summary model. If 0, falls back to model_thresholds or global max_context_tokens. Example: gemini-flash=1000000, gpt-4o-mini=128000.",
)
max_summary_tokens: int = Field(
default=16384,
ge=1,
@@ -529,7 +598,7 @@ class Filter:
# [Optimization] Optimistic lock check: update only if progress moves forward
if compressed_count <= existing.compressed_message_count:
if self.valves.debug_mode:
print(
logger.info(
f"[Storage] Skipping update: New progress ({compressed_count}) is not greater than existing progress ({existing.compressed_message_count})"
)
return
@@ -537,7 +606,7 @@ class Filter:
# Update existing record
existing.summary = summary
existing.compressed_message_count = compressed_count
existing.updated_at = datetime.utcnow()
existing.updated_at = datetime.now(timezone.utc)
else:
# Create new record
new_summary = ChatSummary(
@@ -551,12 +620,12 @@ class Filter:
if self.valves.debug_mode:
action = "Updated" if existing else "Created"
print(
logger.info(
f"[Storage] Summary has been {action.lower()} in the database (Chat ID: {chat_id})"
)
except Exception as e:
print(f"[Storage] ❌ Database save failed: {str(e)}")
logger.error(f"[Storage] ❌ Database save failed: {str(e)}")
def _load_summary_record(self, chat_id: str) -> Optional[ChatSummary]:
"""Loads the summary record object from the database."""
@@ -568,7 +637,7 @@ class Filter:
session.expunge(record)
return record
except Exception as e:
print(f"[Load] ❌ Database read failed: {str(e)}")
logger.error(f"[Load] ❌ Database read failed: {str(e)}")
return None
def _load_summary(self, chat_id: str, body: dict) -> Optional[str]:
@@ -576,8 +645,8 @@ class Filter:
record = self._load_summary_record(chat_id)
if record:
if self.valves.debug_mode:
print(f"[Load] Loaded summary from database (Chat ID: {chat_id})")
print(
logger.info(f"[Load] Loaded summary from database (Chat ID: {chat_id})")
logger.info(
f"[Load] Last updated: {record.updated_at}, Compressed message count: {record.compressed_message_count}"
)
return record.summary
@@ -588,14 +657,12 @@ class Filter:
if not text:
return 0
if tiktoken:
if TIKTOKEN_ENCODING:
try:
# Uniformly use o200k_base encoding (adapted for latest models)
encoding = tiktoken.get_encoding("o200k_base")
return len(encoding.encode(text))
return len(TIKTOKEN_ENCODING.encode(text))
except Exception as e:
if self.valves.debug_mode:
print(
logger.warning(
f"[Token Count] tiktoken error: {e}, falling back to character estimation"
)
@@ -604,6 +671,7 @@ class Filter:
def _calculate_messages_tokens(self, messages: List[Dict]) -> int:
"""Calculates the total tokens for a list of messages."""
start_time = time.time()
total_tokens = 0
for msg in messages:
content = msg.get("content", "")
@@ -616,6 +684,13 @@ class Filter:
total_tokens += self._count_tokens(text_content)
else:
total_tokens += self._count_tokens(str(content))
duration = (time.time() - start_time) * 1000
if self.valves.debug_mode:
logger.info(
f"[Token Calc] Calculated {total_tokens} tokens for {len(messages)} messages in {duration:.2f}ms"
)
return total_tokens
def _get_model_thresholds(self, model_id: str) -> Dict[str, int]:
@@ -623,17 +698,48 @@ class Filter:
Priority:
1. If configuration exists for the model ID in model_thresholds, use it.
2. Otherwise, use global parameters compression_threshold_tokens and max_context_tokens.
2. If model is a custom model, try to match its base_model_id.
3. Otherwise, use global parameters compression_threshold_tokens and max_context_tokens.
"""
# Try to match from model-specific configuration
if model_id in self.valves.model_thresholds:
if self.valves.debug_mode:
print(f"[Config] Using model-specific configuration: {model_id}")
return self.valves.model_thresholds[model_id]
parsed = self._parse_model_thresholds()
# Use global default configuration
# 1. Direct match with model_id
if model_id in parsed:
if self.valves.debug_mode:
logger.info(f"[Config] Using model-specific configuration: {model_id}")
return parsed[model_id]
# 2. Try to find base_model_id for custom models
try:
model_obj = Models.get_model_by_id(model_id)
if model_obj:
# Check for base_model_id (custom model)
base_model_id = getattr(model_obj, "base_model_id", None)
if not base_model_id:
# Try base_model_ids (array) - take first one
base_model_ids = getattr(model_obj, "base_model_ids", None)
if (
base_model_ids
and isinstance(base_model_ids, list)
and len(base_model_ids) > 0
):
base_model_id = base_model_ids[0]
if base_model_id and base_model_id in parsed:
if self.valves.debug_mode:
logger.info(
f"[Config] Custom model '{model_id}' -> base_model '{base_model_id}': using base model configuration"
)
return parsed[base_model_id]
except Exception as e:
if self.valves.debug_mode:
logger.warning(
f"[Config] Failed to lookup base_model for '{model_id}': {e}"
)
# 3. Use global default configuration
if self.valves.debug_mode:
print(
logger.info(
f"[Config] Model {model_id} not in model_thresholds, using global parameters"
)
@@ -731,13 +837,13 @@ class Filter:
}
)
except Exception as e:
print(f"Error emitting debug log: {e}")
logger.error(f"Error emitting debug log: {e}")
async def _log(self, message: str, type: str = "info", event_call=None):
"""Unified logging to both backend (logger) and frontend (console.log)"""
# Backend logging
if self.valves.debug_mode:
print(message)
logger.info(message)
# Frontend logging
if self.valves.show_debug_log and event_call:
@@ -770,9 +876,17 @@ class Filter:
js_code = f"""
console.log("%c[Compression] {safe_message}", "{css}");
"""
await event_call({"type": "execute", "data": {"code": js_code}})
# Add timeout to prevent blocking if frontend connection is broken
await asyncio.wait_for(
event_call({"type": "execute", "data": {"code": js_code}}),
timeout=2.0,
)
except asyncio.TimeoutError:
logger.warning(
f"Failed to emit log to frontend: Timeout (connection may be broken)"
)
except Exception as e:
print(f"Failed to emit log to frontend: {e}")
logger.error(f"Failed to emit log to frontend: {type(e).__name__}: {e}")
async def inlet(
self,
@@ -819,42 +933,57 @@ class Filter:
event_call=__event_call__,
)
# Extract the final answer (after last tool call metadata)
# Pattern: Matches escaped JSON strings like ""&quot;...&quot;"" followed by newlines
# We look for the last occurrence of such a pattern and take everything after it
# 1. Try matching the specific OpenWebUI tool output format: ""&quot;...&quot;""
# This regex finds the last end-quote of a tool output block
tool_output_pattern = r'""&quot;.*?&quot;""\s*'
# Find all matches
matches = list(
re.finditer(tool_output_pattern, content, re.DOTALL)
# Strategy 1: Tool Output / Code Block Trimming
# Detect if message contains large tool outputs or code blocks
# Improved regex to be less brittle
is_tool_output = (
"&quot;" in content
or "Arguments:" in content
or "```" in content
or "<tool_code>" in content
)
if matches:
# Get the end position of the last match
last_match_end = matches[-1].end()
if is_tool_output:
# Regex to find the last occurrence of a tool output block or code block
# This pattern looks for:
# 1. OpenWebUI's escaped JSON format: ""&quot;...&quot;""
# 2. "Arguments: {...}" pattern
# 3. Generic code blocks: ```...```
# 4. <tool_code>...</tool_code>
# It captures the content *after* the last such block.
tool_output_pattern = r'(?:""&quot;.*?&quot;""|Arguments:\s*\{[^}]+\}|```.*?```|<tool_code>.*?</tool_code>)\s*'
# Everything after the last tool output is the final answer
final_answer = content[last_match_end:].strip()
# Find all matches
matches = list(
re.finditer(tool_output_pattern, content, re.DOTALL)
)
if matches:
# Get the end position of the last match
last_match_end = matches[-1].end()
# Everything after the last tool output is the final answer
final_answer = content[last_match_end:].strip()
if final_answer:
msg["content"] = (
f"... [Tool outputs trimmed]\n{final_answer}"
)
trimmed_count += 1
else:
# Fallback: Try splitting on "Arguments:" if the new format isn't found
# (Preserving backward compatibility or different model behaviors)
parts = re.split(r"(?:Arguments:\s*\{[^}]+\})\n+", content)
if len(parts) > 1:
final_answer = parts[-1].strip()
if final_answer:
msg["content"] = (
f"... [Tool outputs trimmed]\n{final_answer}"
)
trimmed_count += 1
else:
# Fallback: If no specific pattern matched, but it was identified as tool output,
# try a simpler split or just mark as trimmed if no final answer can be extracted.
# (Preserving backward compatibility or different model behaviors)
parts = re.split(
r"(?:Arguments:\s*\{[^}]+\})\n+", content
)
if len(parts) > 1:
final_answer = parts[-1].strip()
if final_answer:
msg["content"] = (
f"... [Tool outputs trimmed]\n{final_answer}"
)
trimmed_count += 1
if trimmed_count > 0 and self.valves.show_debug_log and __event_call__:
await self._log(
@@ -881,7 +1010,8 @@ class Filter:
)
# Clean model ID if needed (though get_model_by_id usually expects the full ID)
model_obj = Models.get_model_by_id(model_id)
# Run in thread to avoid blocking event loop on slow DB queries
model_obj = await asyncio.to_thread(Models.get_model_by_id, model_id)
if model_obj:
if self.valves.show_debug_log and __event_call__:
@@ -933,8 +1063,7 @@ class Filter:
else:
if self.valves.show_debug_log and __event_call__:
await self._log(
f"[Inlet] ❌ Model NOT found in DB",
type="warning",
f"[Inlet] Not a custom model, skipping custom system prompt check",
event_call=__event_call__,
)
@@ -946,7 +1075,7 @@ class Filter:
event_call=__event_call__,
)
if self.valves.debug_mode:
print(f"[Inlet] Error fetching system prompt from DB: {e}")
logger.error(f"[Inlet] Error fetching system prompt from DB: {e}")
# Fall back to checking messages (base model or already included)
if not system_prompt_content:
@@ -960,7 +1089,7 @@ class Filter:
if system_prompt_content:
system_prompt_msg = {"role": "system", "content": system_prompt_content}
if self.valves.debug_mode:
print(
logger.info(
f"[Inlet] Found system prompt ({len(system_prompt_content)} chars). Including in budget."
)
@@ -991,7 +1120,7 @@ class Filter:
f"[Inlet] Message Stats: {stats_str}", event_call=__event_call__
)
except Exception as e:
print(f"[Inlet] Error logging message stats: {e}")
logger.error(f"[Inlet] Error logging message stats: {e}")
if not chat_id:
await self._log(
@@ -1007,6 +1136,33 @@ class Filter:
event_call=__event_call__,
)
# Log custom model configurations
raw_config = self.valves.model_thresholds
parsed_configs = self._parse_model_thresholds()
if raw_config:
config_list = [
f"{model}: {cfg['compression_threshold_tokens']}t/{cfg['max_context_tokens']}t"
for model, cfg in parsed_configs.items()
]
if config_list:
await self._log(
f"[Inlet] 📋 Model Configs (Raw: '{raw_config}'): {', '.join(config_list)}",
event_call=__event_call__,
)
else:
await self._log(
f"[Inlet] ⚠️ Invalid Model Configs (Raw: '{raw_config}'): No valid configs parsed. Expected format: 'model_id:threshold:max_context'",
type="warning",
event_call=__event_call__,
)
else:
await self._log(
f"[Inlet] 📋 Model Configs: No custom configuration (Global defaults only)",
event_call=__event_call__,
)
# Record the target compression progress for the original messages, for use in outlet
# Target is to compress up to the (total - keep_last) message
target_compressed_count = max(0, len(messages) - self.valves.keep_last)
@@ -1043,9 +1199,9 @@ class Filter:
if effective_keep_first > 0:
head_messages = messages[:effective_keep_first]
# 2. Summary message (Inserted as User message)
# 2. Summary message (Inserted as Assistant message)
summary_content = (
f"System Prompt: The following is a summary of the historical conversation, provided for context only. Do not reply to the summary content itself; answer the subsequent latest questions directly.\n\n"
f"Previous Summary: The following is a summary of the historical conversation, provided for context only. Do not reply to the summary content itself; answer the subsequent latest questions directly.\n\n"
f"{summary_record.summary}\n\n"
f"---\n"
f"Below is the recent conversation:"
@@ -1287,7 +1443,7 @@ class Filter:
# Get max context limit
model = self._clean_model_id(body.get("model"))
thresholds = self._get_model_thresholds(model)
thresholds = self._get_model_thresholds(model) or {}
max_context_tokens = thresholds.get(
"max_context_tokens", self.valves.max_context_tokens
)
@@ -1314,7 +1470,8 @@ class Filter:
> start_trim_index + 1 # Keep at least 1 message after keep_first
):
dropped = final_messages.pop(start_trim_index)
total_tokens -= self._count_tokens(str(dropped.get("content", "")))
dropped_tokens = self._count_tokens(str(dropped.get("content", "")))
total_tokens -= dropped_tokens
await self._log(
f"[Inlet] ✂️ Messages reduced. New total: {total_tokens} Tokens",
@@ -1371,18 +1528,11 @@ class Filter:
)
return body
model = body.get("model") or ""
messages = body.get("messages", [])
# Calculate target compression progress directly
# Assuming body['messages'] in outlet contains the full history (including new response)
messages = body.get("messages", [])
target_compressed_count = max(0, len(messages) - self.valves.keep_last)
if self.valves.debug_mode or self.valves.show_debug_log:
await self._log(
f"\n{'='*60}\n[Outlet] Chat ID: {chat_id}\n[Outlet] Response complete\n[Outlet] Calculated target compression progress: {target_compressed_count} (Messages: {len(messages)})",
event_call=__event_call__,
)
# Process Token calculation and summary generation asynchronously in the background (do not wait for completion, do not affect output)
asyncio.create_task(
self._check_and_generate_summary_async(
@@ -1396,11 +1546,6 @@ class Filter:
)
)
await self._log(
f"[Outlet] Background processing started\n{'='*60}\n",
event_call=__event_call__,
)
return body
async def _check_and_generate_summary_async(
@@ -1416,11 +1561,25 @@ class Filter:
"""
Background processing: Calculates Token count and generates summary (does not block response).
"""
try:
messages = body.get("messages", [])
# Clean model ID
model = self._clean_model_id(model)
if self.valves.debug_mode or self.valves.show_debug_log:
await self._log(
f"\n{'='*60}\n[Outlet] Chat ID: {chat_id}\n[Outlet] Response complete\n[Outlet] Calculated target compression progress: {target_compressed_count} (Messages: {len(messages)})",
event_call=__event_call__,
)
await self._log(
f"[Outlet] Background processing started\n{'='*60}\n",
event_call=__event_call__,
)
# Get threshold configuration for current model
thresholds = self._get_model_thresholds(model)
thresholds = self._get_model_thresholds(model) or {}
compression_threshold_tokens = thresholds.get(
"compression_threshold_tokens", self.valves.compression_threshold_tokens
)
@@ -1440,6 +1599,28 @@ class Filter:
event_call=__event_call__,
)
# Send status notification (Context Usage format)
if __event_emitter__ and self.valves.show_token_usage_status:
max_context_tokens = thresholds.get(
"max_context_tokens", self.valves.max_context_tokens
)
status_msg = f"Context Usage (Estimated): {current_tokens} / {max_context_tokens} Tokens"
if max_context_tokens > 0:
usage_ratio = current_tokens / max_context_tokens
status_msg += f" ({usage_ratio*100:.1f}%)"
if usage_ratio > 0.9:
status_msg += " | ⚠️ High Usage"
await __event_emitter__(
{
"type": "status",
"data": {
"description": status_msg,
"done": True,
},
}
)
# Check if compression is needed
if current_tokens >= compression_threshold_tokens:
await self._log(
@@ -1559,10 +1740,13 @@ class Filter:
return
thresholds = self._get_model_thresholds(summary_model_id)
# Note: Using the summary model's max context limit here
max_context_tokens = thresholds.get(
"max_context_tokens", self.valves.max_context_tokens
)
# Priority: 1. summary_model_max_context (if > 0) -> 2. model_thresholds -> 3. global max_context_tokens
if self.valves.summary_model_max_context > 0:
max_context_tokens = self.valves.summary_model_max_context
else:
max_context_tokens = thresholds.get(
"max_context_tokens", self.valves.max_context_tokens
)
await self._log(
f"[🤖 Async Summary Task] Using max limit for model {summary_model_id}: {max_context_tokens} Tokens",
@@ -1753,7 +1937,6 @@ class Filter:
max_context_tokens = thresholds.get(
"max_context_tokens", self.valves.max_context_tokens
)
# 6. Emit Status
status_msg = f"Context Summary Updated: {token_count} / {max_context_tokens} Tokens"
if max_context_tokens > 0:
@@ -1798,7 +1981,7 @@ class Filter:
import traceback
traceback.print_exc()
logger.exception("[🤖 Async Summary Task] Unhandled exception")
def _format_messages_for_summary(self, messages: list) -> str:
"""Formats messages for summarization."""
@@ -1818,9 +2001,8 @@ class Filter:
# Handle role name
role_name = {"user": "User", "assistant": "Assistant"}.get(role, role)
# Limit length of each message to avoid excessive length
if len(content) > 500:
content = content[:500] + "..."
# Truncation removed so the summary can use the full context;
# anything beyond model limits is handled by the LLM call itself (max_tokens)
formatted.append(f"[{i}] {role_name}: {content}")
@@ -1927,8 +2109,25 @@ Based on the content above, generate the summary:
# Call generate_chat_completion
response = await generate_chat_completion(request, payload, user)
if not response or "choices" not in response or not response["choices"]:
raise ValueError("LLM response format incorrect or empty")
# Handle JSONResponse (some backends return JSONResponse instead of dict)
if hasattr(response, "body"):
# It's a Response object, extract the body
import json as json_module
try:
response = json_module.loads(response.body.decode("utf-8"))
except Exception:
raise ValueError(f"Failed to parse JSONResponse body: {response}")
if (
not response
or not isinstance(response, dict)
or "choices" not in response
or not response["choices"]
):
raise ValueError(
f"LLM response format incorrect or empty: {type(response).__name__}"
)
summary = response["choices"][0]["message"]["content"].strip()
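The JSONResponse handling added here amounts to normalizing whatever the backend returns into a plain dict before reading `choices`. A standalone sketch of that normalization, assuming a Starlette-style response object that carries its payload as raw `.body` bytes (the helper name is hypothetical):

```python
import json
from typing import Any

def normalize_llm_response(response: Any) -> dict:
    # Some backends return a JSONResponse object instead of a dict;
    # decode its raw body bytes into a dict first.
    if hasattr(response, "body"):
        try:
            response = json.loads(response.body.decode("utf-8"))
        except Exception as exc:
            raise ValueError(f"Failed to parse JSONResponse body: {exc}")
    # Reject anything that is not a dict with a non-empty "choices" list.
    if not isinstance(response, dict) or not response.get("choices"):
        raise ValueError(
            f"LLM response format incorrect or empty: {type(response).__name__}"
        )
    return response
```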
View File
@@ -5,19 +5,17 @@ author: Fu-Jie
author_url: https://github.com/Fu-Jie/awesome-openwebui
funding_url: https://github.com/open-webui
description: Reduces token consumption in long conversations through intelligent summarization and message compression while keeping conversations coherent.
version: 1.2.0
version: 1.2.1
openwebui_id: 5c0617cb-a9e4-4bd6-a440-d276534ebd18
license: MIT
═══════════════════════════════════════════════════════════════════════════════
📌 Version 1.2.0 Updates
📌 Version 1.2.1 Updates
═══════════════════════════════════════════════════════════════════════════════
✅ Pre-flight context check: verifies that the context fits before it is sent to the model
✅ Structure-aware trimming: collapses overly long AI responses while preserving headings (H1-H6), openings, and endings
✅ Native tool output trimming: cleans redundant output from the context when function calling is used. (Note: non-native tool-call output is not fully injected into the context)
✅ Context usage warning: notifies when usage exceeds 90%.
✅ Detailed token logging: fine-grained logging of token consumption for System, Head, Summary, and Tail.
✅ Smart configuration enhancements: auto-detects the base-model configuration of custom models, and adds the `summary_model_max_context` parameter to control the summary model's context limit independently
✅ Performance optimization & refactoring: refactored threshold parsing with caching, removed redundant processing code, and enhanced LLM response handling (JSONResponse supported)
✅ Stability improvements: fixed `datetime` deprecation warnings, corrected type annotations, and replaced print statements with standard logging.
═══════════════════════════════════════════════════════════════════════════════
📌 Feature Overview
@@ -258,8 +256,19 @@ import re
import asyncio
import json
import hashlib
import time
import contextlib
import logging
# Configure logging
logger = logging.getLogger(__name__)
if not logger.handlers:
handler = logging.StreamHandler()
formatter = logging.Formatter(
"%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.INFO)
# Open WebUI built-in imports
from open_webui.utils.chat import generate_chat_completion
@@ -284,7 +293,7 @@ except ImportError:
from sqlalchemy import Column, String, Text, DateTime, Integer, inspect
from sqlalchemy.orm import declarative_base, sessionmaker
from sqlalchemy.engine import Engine
from datetime import datetime
from datetime import datetime, timezone
def _discover_owui_engine(db_module: Any) -> Optional[Engine]:
@@ -372,8 +381,12 @@ class ChatSummary(owui_Base):
chat_id = Column(String(255), unique=True, nullable=False, index=True)
summary = Column(Text, nullable=False)
compressed_message_count = Column(Integer, default=0)
created_at = Column(DateTime, default=datetime.utcnow)
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))
updated_at = Column(
DateTime,
default=lambda: datetime.now(timezone.utc),
onupdate=lambda: datetime.now(timezone.utc),
)
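The switch from `datetime.utcnow` to lambdas matters because SQLAlchemy column defaults take a callable that is evaluated at INSERT/UPDATE time; calling `datetime.now(timezone.utc)` directly would bake in a single import-time timestamp. A minimal sketch of the replacement:

```python
from datetime import datetime, timezone

def utc_now() -> datetime:
    # Timezone-aware replacement for the deprecated datetime.utcnow().
    # Pass the callable itself (default=utc_now) so SQLAlchemy invokes
    # it per row rather than reusing one frozen timestamp.
    return datetime.now(timezone.utc)
```

With this helper, the column defaults above could equivalently be written as `default=utc_now, onupdate=utc_now`.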
class Filter:
@@ -384,6 +397,7 @@ class Filter:
self._fallback_session_factory = (
sessionmaker(bind=self._db_engine) if self._db_engine else None
)
self._threshold_cache = {}
self._init_database()
@contextlib.contextmanager
@@ -469,21 +483,38 @@ class Filter:
ge=0,
description="Hard upper limit for the context. Exceeding it force-removes the earliest messages (global default)",
)
model_thresholds: dict = Field(
model_thresholds: Union[str, dict] = Field(
default={},
description="Per-model threshold overrides. Only include models that need special configuration",
description="Per-model threshold overrides. Accepts a JSON string or a dict",
)
@model_validator(mode="before")
@classmethod
def parse_model_thresholds(cls, data: Any) -> Any:
if isinstance(data, dict):
thresholds = data.get("model_thresholds")
if isinstance(thresholds, str) and thresholds.strip():
try:
data["model_thresholds"] = json.loads(thresholds)
except Exception as e:
logger.error(f"Failed to parse model_thresholds JSON: {e}")
return data
keep_first: int = Field(
default=1, ge=0, description="Always keep the first N messages. Set to 0 to keep none."
)
keep_last: int = Field(
default=6, ge=0, description="Always keep the last N messages intact."
)
summary_model: str = Field(
summary_model: Optional[str] = Field(
default=None,
description="Model ID used to generate summaries. Leave empty to use the current chat model. Also used to match entries in model_thresholds.",
)
summary_model_max_context: int = Field(
default=0,
ge=0,
description="Maximum context tokens for the summary model. If 0, falls back to model_thresholds or the global max_context_tokens.",
)
max_summary_tokens: int = Field(
default=16384, ge=1, description="Maximum number of tokens for the summary"
)
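The `Union[str, dict]` valve plus the `parse_model_thresholds` validator boil down to one rule: accept a dict as-is, and try `json.loads` on a non-empty string. A stdlib-only sketch of that rule (the function is hypothetical, and assumes malformed input should degrade to `{}`):

```python
import json
from typing import Any

def parse_thresholds(raw: Any) -> dict:
    # Accept either a dict or a JSON string (some admin UIs can only
    # submit strings for complex settings); fall back to {} on bad input.
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str) and raw.strip():
        try:
            parsed = json.loads(raw)
            return parsed if isinstance(parsed, dict) else {}
        except Exception:
            return {}
    return {}
```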
@@ -513,7 +544,7 @@ class Filter:
# [Optimization] Optimistic-lock check: update only when progress moves forward
if compressed_count <= existing.compressed_message_count:
if self.valves.debug_mode:
print(
logger.debug(
f"[Storage] Skipping update: new progress ({compressed_count}) is not greater than existing progress ({existing.compressed_message_count})"
)
return
@@ -521,7 +552,7 @@ class Filter:
# Update the existing record
existing.summary = summary
existing.compressed_message_count = compressed_count
existing.updated_at = datetime.utcnow()
existing.updated_at = datetime.now(timezone.utc)
else:
# Create a new record
new_summary = ChatSummary(
@@ -535,10 +566,10 @@ class Filter:
if self.valves.debug_mode:
action = "updated" if existing else "created"
print(f"[Storage] Summary {action} in database (Chat ID: {chat_id})")
logger.info(f"[Storage] Summary {action} in database (Chat ID: {chat_id})")
except Exception as e:
print(f"[Storage] ❌ Database save failed: {str(e)}")
logger.error(f"[Storage] ❌ Database save failed: {str(e)}")
def _load_summary_record(self, chat_id: str) -> Optional[ChatSummary]:
"""Load the summary record object from the database"""
@@ -550,7 +581,7 @@ class Filter:
session.expunge(record)
return record
except Exception as e:
print(f"[Load] ❌ Database read failed: {str(e)}")
logger.error(f"[Load] ❌ Database read failed: {str(e)}")
return None
def _load_summary(self, chat_id: str, body: dict) -> Optional[str]:
@@ -558,8 +589,8 @@ class Filter:
record = self._load_summary_record(chat_id)
if record:
if self.valves.debug_mode:
print(f"[Load] Loaded summary from database (Chat ID: {chat_id})")
print(
logger.debug(f"[Load] Loaded summary from database (Chat ID: {chat_id})")
logger.debug(
f"[Load] Updated at: {record.updated_at}, compressed messages: {record.compressed_message_count}"
)
return record.summary
@@ -602,23 +633,68 @@ class Filter:
"""Get the threshold configuration for a specific model
Priority:
1. If model_thresholds contains a config for this model ID, use it
2. Otherwise use the global compression_threshold_tokens and max_context_tokens
1. Cache hit
2. Direct match in model_thresholds
3. Base model (base_model_id) match
4. Global default configuration
"""
# Try to match a model-specific configuration
if model_id in self.valves.model_thresholds:
if not model_id:
return {
"compression_threshold_tokens": self.valves.compression_threshold_tokens,
"max_context_tokens": self.valves.max_context_tokens,
}
# 1. Check the cache
if model_id in self._threshold_cache:
return self._threshold_cache[model_id]
# Get the parsed threshold configuration
parsed = self.valves.model_thresholds
if isinstance(parsed, str):
try:
parsed = json.loads(parsed)
except Exception:
parsed = {}
# 2. Try a direct match
if model_id in parsed:
res = parsed[model_id]
self._threshold_cache[model_id] = res
if self.valves.debug_mode:
print(f"[Config] Using model-specific config: {model_id}")
return self.valves.model_thresholds[model_id]
logger.debug(f"[Config] Model {model_id} matched a direct config")
return res
# Use the global default configuration
if self.valves.debug_mode:
print(f"[Config] Model {model_id} not in model_thresholds; using global parameters")
# 3. Try matching the base model (base_model_id)
try:
model_obj = Models.get_model_by_id(model_id)
if model_obj:
# Some models may have multiple base model IDs
base_ids = []
if hasattr(model_obj, "base_model_id") and model_obj.base_model_id:
base_ids.append(model_obj.base_model_id)
if hasattr(model_obj, "base_model_ids") and model_obj.base_model_ids:
if isinstance(model_obj.base_model_ids, list):
base_ids.extend(model_obj.base_model_ids)
return {
for b_id in base_ids:
if b_id in parsed:
res = parsed[b_id]
self._threshold_cache[model_id] = res
if self.valves.debug_mode:
logger.info(
f"[Config] Model {model_id} matched base model {b_id}'s config"
)
return res
except Exception as e:
logger.error(f"[Config] Base model lookup failed: {e}")
# 4. Use the global default configuration
res = {
"compression_threshold_tokens": self.valves.compression_threshold_tokens,
"max_context_tokens": self.valves.max_context_tokens,
}
self._threshold_cache[model_id] = res
return res
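The lookup order implemented above (cache, direct match, base-model match, global defaults) can be condensed into a pure function. Names here are hypothetical, and the `base_ids` mapping stands in for the `Models.get_model_by_id` lookup:

```python
def resolve_thresholds(
    model_id: str,
    overrides: dict,   # parsed model_thresholds
    base_ids: dict,    # model_id -> list of base model IDs
    defaults: dict,    # global fallback configuration
    cache: dict,       # memoizes resolved results per model_id
) -> dict:
    if not model_id:
        return defaults
    # 1. Cache hit
    if model_id in cache:
        return cache[model_id]
    # 2. Direct match
    result = overrides.get(model_id)
    if result is None:
        # 3. Custom models inherit limits from their base model(s)
        for b_id in base_ids.get(model_id, []):
            if b_id in overrides:
                result = overrides[b_id]
                break
    # 4. Global defaults
    result = result or defaults
    cache[model_id] = result
    return result
```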
def _get_chat_context(
self, body: dict, __metadata__: Optional[dict] = None
@@ -750,7 +826,7 @@ class Filter:
"""
await event_call({"type": "execute", "data": {"code": js_code}})
except Exception as e:
print(f"Failed to send frontend log: {e}")
logger.error(f"Failed to send frontend log: {e}")
async def inlet(
self,
@@ -922,7 +998,7 @@ class Filter:
event_call=__event_call__,
)
if self.valves.debug_mode:
print(f"[Inlet] Error fetching the system prompt from the database: {e}")
logger.error(f"[Inlet] Error fetching the system prompt from the database: {e}")
# Fallback: check the message list (base model, or already included)
if not system_prompt_content:
@@ -936,7 +1012,7 @@ class Filter:
if system_prompt_content:
system_prompt_msg = {"role": "system", "content": system_prompt_content}
if self.valves.debug_mode:
print(
logger.debug(
f"[Inlet] Found system prompt ({len(system_prompt_content)} chars). Counted toward the budget."
)
@@ -967,7 +1043,7 @@ class Filter:
f"[Inlet] Message stats: {stats_str}", event_call=__event_call__
)
except Exception as e:
print(f"[Inlet] Error logging message stats: {e}")
logger.error(f"[Inlet] Error logging message stats: {e}")
if not chat_id:
await self._log(
@@ -1058,7 +1134,7 @@ class Filter:
# Get the maximum context limit
model = self._clean_model_id(body.get("model"))
thresholds = self._get_model_thresholds(model)
thresholds = self._get_model_thresholds(model) or {}
max_context_tokens = thresholds.get(
"max_context_tokens", self.valves.max_context_tokens
)
@@ -1262,7 +1338,7 @@ class Filter:
# Get the maximum context limit
model = self._clean_model_id(body.get("model"))
thresholds = self._get_model_thresholds(model)
thresholds = self._get_model_thresholds(model) or {}
max_context_tokens = thresholds.get(
"max_context_tokens", self.valves.max_context_tokens
)
@@ -1289,7 +1365,8 @@ class Filter:
> start_trim_index + 1 # Keep at least 1 message after keep_first
):
dropped = final_messages.pop(start_trim_index)
total_tokens -= self._count_tokens(str(dropped.get("content", "")))
dropped_tokens = self._count_tokens(str(dropped.get("content", "")))
total_tokens -= dropped_tokens
await self._log(
f"[Inlet] ✂️ Messages trimmed. New total: {total_tokens} Tokens",
@@ -1348,18 +1425,11 @@ class Filter:
)
return body
model = body.get("model") or ""
messages = body.get("messages", [])
# Compute the target compression progress directly
# Assumes body['messages'] in the outlet contains the full history (including the new response)
messages = body.get("messages", [])
target_compressed_count = max(0, len(messages) - self.valves.keep_last)
if self.valves.debug_mode or self.valves.show_debug_log:
await self._log(
f"\n{'='*60}\n[Outlet] Chat ID: {chat_id}\n[Outlet] Response complete\n[Outlet] Calculated target compression progress: {target_compressed_count} (Messages: {len(messages)})",
event_call=__event_call__,
)
# Process token counting and summary generation asynchronously in the background (no waiting; does not affect the output)
asyncio.create_task(
self._check_and_generate_summary_async(
@@ -1373,11 +1443,6 @@ class Filter:
)
)
await self._log(
f"[Outlet] Background processing started\n{'='*60}\n",
event_call=__event_call__,
)
return body
async def _check_and_generate_summary_async(
@@ -1397,7 +1462,7 @@ class Filter:
messages = body.get("messages", [])
# Get the threshold configuration for the current model
thresholds = self._get_model_thresholds(model)
thresholds = self._get_model_thresholds(model) or {}
compression_threshold_tokens = thresholds.get(
"compression_threshold_tokens", self.valves.compression_threshold_tokens
)
@@ -1533,11 +1598,14 @@ class Filter:
)
return
thresholds = self._get_model_thresholds(summary_model_id)
# Note: the summary model's maximum context limit is used here
max_context_tokens = thresholds.get(
"max_context_tokens", self.valves.max_context_tokens
)
thresholds = self._get_model_thresholds(summary_model_id) or {}
# Priority: 1. summary_model_max_context (if > 0) -> 2. model_thresholds -> 3. global max_context_tokens
if self.valves.summary_model_max_context > 0:
max_context_tokens = self.valves.summary_model_max_context
else:
max_context_tokens = thresholds.get(
"max_context_tokens", self.valves.max_context_tokens
)
await self._log(
f"[🤖 Async Summary Task] Using max limit for model {summary_model_id}: {max_context_tokens} Tokens",
@@ -1722,10 +1790,14 @@ class Filter:
# 5. Get thresholds and compute the usage ratio
model = self._clean_model_id(body.get("model"))
thresholds = self._get_model_thresholds(model)
max_context_tokens = thresholds.get(
"max_context_tokens", self.valves.max_context_tokens
)
thresholds = self._get_model_thresholds(model) or {}
# Priority: 1. summary_model_max_context (if > 0) -> 2. model_thresholds -> 3. global max_context_tokens
if self.valves.summary_model_max_context > 0:
max_context_tokens = self.valves.summary_model_max_context
else:
max_context_tokens = thresholds.get(
"max_context_tokens", self.valves.max_context_tokens
)
# 6. Emit status
status_msg = (
@@ -1771,9 +1843,7 @@ class Filter:
}
)
import traceback
traceback.print_exc()
logger.exception("[🤖 Async Summary Task] ❌ Unhandled exception")
def _format_messages_for_summary(self, messages: list) -> str:
"""Formats messages for summarization."""
@@ -1793,9 +1863,8 @@ class Filter:
# Handle role name
role_name = {"user": "User", "assistant": "Assistant"}.get(role, role)
# Limit length of each message to avoid excessive length
if len(content) > 500:
content = content[:500] + "..."
# Truncation removed so the summary can use the full context;
# anything beyond model limits is handled by the LLM call itself (max_tokens)
formatted.append(f"[{i}] {role_name}: {content}")
@@ -1902,8 +1971,25 @@ class Filter:
# Call generate_chat_completion
response = await generate_chat_completion(request, payload, user)
if not response or "choices" not in response or not response["choices"]:
raise ValueError("LLM response format incorrect or empty")
# Handle JSONResponse (some backends return JSONResponse instead of dict)
if hasattr(response, "body"):
# It's a Response object, extract the body
import json as json_module
try:
response = json_module.loads(response.body.decode("utf-8"))
except Exception:
raise ValueError(f"Failed to parse JSONResponse body: {response}")
if (
not response
or not isinstance(response, dict)
or "choices" not in response
or not response["choices"]
):
raise ValueError(
f"LLM response format incorrect or empty: {type(response).__name__}"
)
summary = response["choices"][0]["message"]["content"].strip()
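Taken together, the summary-model context resolution introduced in this release follows one fallback chain at every call site; a condensed sketch (the function name is hypothetical):

```python
def resolve_summary_context(
    summary_model_max_context: int,
    model_thresholds: dict,
    global_max: int,
) -> int:
    # Priority: 1. explicit summary_model_max_context (if > 0)
    #           2. per-model thresholds entry
    #           3. global max_context_tokens
    if summary_model_max_context > 0:
        return summary_model_max_context
    return model_thresholds.get("max_context_tokens", global_max)
```

This is what lets a high-context summarizer (e.g. a 1M-token model) keep its own limit while the chat model retains a smaller one.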