diff --git a/CHANGELOG.md b/CHANGELOG.md deleted file mode 100644 index 01e8a65..0000000 --- a/CHANGELOG.md +++ /dev/null @@ -1,27 +0,0 @@ -# Changelog - -All notable changes to this project will be documented in this file. - -## [Unreleased] - -### πŸš€ New Features -- **Smart Mind Map**: Updated to v0.7.3. Improved description and metadata. -- **Knowledge Card**: Updated to v0.2.1. Improved description and metadata. -- **Documentation**: Added comprehensive `plugin_development_guide_cn.md` consolidating all previous guides. - -### πŸ“¦ Project Structure -- **Renamed**: Project renamed from `awesome-openwebui` to **OpenWebUI Extras**. -- **Reorganized**: - - Moved `run.py` to `scripts/`. - - Moved large documentation files to `docs/`. - - Removed `requirements.txt` to emphasize "resource collection" nature. -- **Added**: `CONTRIBUTING.md` guide. - -### πŸ“ Documentation -- **README**: Updated English and Chinese READMEs with new project name and structure. -- **Plan**: Updated `implementation_plan.md` to reflect the new direction. - ---- - -## [0.1.0] - 2025-12-19 -- Initial release of the reorganized project structure. diff --git a/README.md b/README.md index 7d9e482..f6b8500 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,7 @@ English | [δΈ­ζ–‡](./README_CN.md) A collection of enhancements, plugins, and prompts for [OpenWebUI](https://github.com/open-webui/open-webui), developed and curated for personal use to extend functionality and improve experience. 
-[Contributing](./CONTRIBUTING.md) | [Changelog](./CHANGELOG.md) +[Contributing](./CONTRIBUTING.md) ## πŸ“¦ Project Contents diff --git a/plugins/filters/async-context-compression/async_context_compression.py b/plugins/filters/async-context-compression/async_context_compression.py index 0597b61..79d9058 100644 --- a/plugins/filters/async-context-compression/async_context_compression.py +++ b/plugins/filters/async-context-compression/async_context_compression.py @@ -15,7 +15,7 @@ license: MIT This filter significantly reduces token consumption in long conversations by using intelligent summarization and message compression, while maintaining conversational coherence. Core Features: - βœ… Automatic compression triggered by a message count threshold + βœ… Automatic compression triggered by Token count threshold βœ… Asynchronous summary generation (does not block user response) βœ… Persistent storage with database support (PostgreSQL and SQLite) βœ… Flexible retention policy (configurable to keep first and last N messages) @@ -39,7 +39,7 @@ Phase 1: Inlet (Pre-request processing) Phase 2: Outlet (Post-response processing) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1. Triggered after the LLM response is complete. - 2. Checks if the message count has reached the compression threshold. + 2. Checks if the Token count has reached the compression threshold. 3. If the threshold is met, an asynchronous background task is started to generate a summary: β”œβ”€ Extracts messages to be summarized (excluding the kept first and last messages). β”œβ”€ Calls the LLM to generate a concise summary. @@ -96,11 +96,20 @@ priority Default: 10 Description: The execution order of the filter. Lower numbers run first. -compression_threshold - Default: 15 - Description: When the message count reaches this value, a background summary generation will be triggered after the conversation ends. 
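The token-count trigger described in the features and workflow above can be sketched standalone. This is not part of the patch; it is a minimal illustration assuming, as the patch does, tiktoken's `o200k_base` encoding when available and a rough 1-token-per-4-characters estimate otherwise:

```python
# Standalone sketch (not the patch's exact code) of the token-based trigger:
# use tiktoken's o200k_base encoding when available, otherwise fall back to
# the patch's rough 1 token ~= 4 characters estimate.
try:
    import tiktoken
    _ENCODING = tiktoken.get_encoding("o200k_base")
except Exception:  # tiktoken missing, or encoding data unavailable
    _ENCODING = None


def count_tokens(text: str) -> int:
    """Count tokens in a string, with a character-based fallback."""
    if not text:
        return 0
    if _ENCODING is not None:
        return len(_ENCODING.encode(text))
    return len(text) // 4  # rough estimate when tiktoken is unavailable


def should_compress(messages: list, threshold_tokens: int) -> bool:
    """True once the conversation's total token count reaches the threshold."""
    total = sum(count_tokens(str(m.get("content", ""))) for m in messages)
    return total >= threshold_tokens
```

Because the check runs in the outlet's background task, an over- or under-estimate from the fallback only shifts when compression fires, not whether responses are blocked.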
+compression_threshold_tokens + Default: 64000 + Description: When the total context Token count exceeds this value, compression is triggered. Recommendation: Adjust based on your model's context window and cost. +max_context_tokens + Default: 128000 + Description: Hard limit for context. Exceeding this value will force removal of the earliest messages. + +model_thresholds + Default: {} + Description: Threshold override configuration for specific models. + Example: {"gpt-4": {"compression_threshold_tokens": 8000, "max_context_tokens": 32000}} + keep_first Default: 1 Description: Always keep the first N messages of the conversation. Set to 0 to disable. The first message often contains important system prompts. @@ -113,7 +122,7 @@ summary_model Default: None Description: The LLM used to generate the summary. Recommendation: - - It is strongly recommended to configure a fast, economical, and compatible model, such as `deepseek-v3`、`gemini-2.5-flash`、`gpt-4.1`。 + - It is strongly recommended to configure a fast, economical, and compatible model, such as `deepseek-v3`, `gemini-2.5-flash`, `gpt-4.1`. - If left empty, the filter will attempt to use the model from the current conversation. Note: - If the current conversation uses a pipeline (Pipe) model or a model that does not support standard generation APIs, leaving this field empty may cause summary generation to fail. In this case, you must specify a valid model. @@ -205,8 +214,8 @@ Statistics: 4. Cost Optimization ⚠ The summary model is called once each time the threshold is met. - ⚠ Set `compression_threshold` reasonably to avoid frequent calls. - ⚠ It's recommended to use a fast and economical model to generate summaries. + ⚠ Set `compression_threshold_tokens` reasonably to avoid frequent calls. + ⚠ It's recommended to use a fast and economical model (like `gemini-flash`) to generate summaries. 5. Multimodal Support βœ“ This filter supports multimodal messages containing images. 
@@ -227,7 +236,7 @@ Solution: Problem: Summary not generated Solution: - 1. Check if the `compression_threshold` has been met. + 1. Check if the `compression_threshold_tokens` has been met. 2. Verify that the `summary_model` is configured correctly. 3. Check the debug logs for any error messages. @@ -237,7 +246,7 @@ Solution: Problem: Compression effect is not significant Solution: - 1. Increase the `compression_threshold` appropriately. + 1. Increase the `compression_threshold_tokens` appropriately. 2. Decrease the number of `keep_last` or `keep_first`. 3. Check if the conversation is actually long enough. @@ -245,11 +254,12 @@ Solution: """ from pydantic import BaseModel, Field, model_validator -from typing import Optional +from typing import Optional, Dict, Any, List, Union, Callable, Awaitable import asyncio import json import hashlib import os +import time # Open WebUI built-in imports from open_webui.utils.chat import generate_chat_completion @@ -257,6 +267,12 @@ from open_webui.models.users import Users from fastapi.requests import Request from open_webui.main import app as webui_app +# Try to import tiktoken +try: + import tiktoken +except ImportError: + tiktoken = None + # Database imports from sqlalchemy import create_engine, Column, String, Text, DateTime, Integer from sqlalchemy.ext.declarative import declarative_base @@ -284,6 +300,7 @@ class Filter: self.valves = self.Valves() self._db_engine = None self._SessionLocal = None + self.temp_state = {} # Used to pass temporary data between inlet and outlet self._init_database() def _init_database(self): @@ -292,7 +309,9 @@ class Filter: database_url = os.getenv("DATABASE_URL") if not database_url: - print("[Database] ❌ Error: DATABASE_URL environment variable is not set. Please set this variable.") + print( + "[Database] ❌ Error: DATABASE_URL environment variable is not set. Please set this variable." 
+ ) self._db_engine = None self._SessionLocal = None return @@ -312,7 +331,9 @@ class Filter: database_url = database_url.replace( "postgres://", "postgresql://", 1 ) - print("[Database] ℹ️ Automatically converted postgres:// to postgresql://") + print( + "[Database] ℹ️ Automatically converted postgres:// to postgresql://" + ) engine_args = { "pool_pre_ping": True, "pool_recycle": 3600, @@ -337,7 +358,9 @@ class Filter: # Create table if it doesn't exist Base.metadata.create_all(bind=self._db_engine) - print(f"[Database] βœ… Successfully connected to {db_type} and initialized the chat_summary table.") + print( + f"[Database] βœ… Successfully connected to {db_type} and initialized the chat_summary table." + ) except Exception as e: print(f"[Database] ❌ Initialization failed: {str(e)}") @@ -348,36 +371,403 @@ class Filter: priority: int = Field( default=10, description="Priority level for the filter operations." ) - compression_threshold: int = Field( - default=15, ge=0, description="The number of messages at which to trigger compression." + # Token related parameters + compression_threshold_tokens: int = Field( + default=64000, + ge=0, + description="When total context Token count exceeds this value, trigger compression (Global Default)", ) + max_context_tokens: int = Field( + default=128000, + ge=0, + description="Hard limit for context. 
Exceeding this value will force removal of earliest messages (Global Default)", + ) + model_thresholds: dict = Field( + default={ + # Groq + "groq-openai/gpt-oss-20b": { + "max_context_tokens": 8000, + "compression_threshold_tokens": 5600, + }, + "groq-openai/gpt-oss-120b": { + "max_context_tokens": 8000, + "compression_threshold_tokens": 5600, + }, + # Qwen (ModelScope / CF) + "modelscope-Qwen/Qwen3-Coder-480B-A35B-Instruct": { + "max_context_tokens": 256000, + "compression_threshold_tokens": 179200, + }, + "cfchatqwen-qwen3-max-search": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "modelscope-Qwen/Qwen3-235B-A22B-Thinking-2507": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-max": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-vl-plus-thinking": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-coder-plus-search": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "cfchatqwen-qwen3-vl-plus": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-coder-plus": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "cfchatqwen-qwen3-omni-flash-thinking": { + "max_context_tokens": 65536, + "compression_threshold_tokens": 45875, + }, + "cfchatqwen-qwen3-omni-flash": { + "max_context_tokens": 65536, + "compression_threshold_tokens": 45875, + }, + "cfchatqwen-qwen3-next-80b-a3b-thinking": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "modelscope-Qwen/Qwen3-VL-235B-A22B-Instruct": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-next-80b-a3b-thinking-search": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-next-80b-a3b": { + 
"max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-235b-a22b-thinking-search": { + "max_context_tokens": 131072, + "compression_threshold_tokens": 91750, + }, + "cfchatqwen-qwen3-235b-a22b": { + "max_context_tokens": 131072, + "compression_threshold_tokens": 91750, + }, + "cfchatqwen-qwen3-235b-a22b-thinking": { + "max_context_tokens": 131072, + "compression_threshold_tokens": 91750, + }, + "cfchatqwen-qwen3-coder-flash-search": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-coder-flash": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-max-2025-10-30": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-max-2025-10-30-thinking": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-max-2025-10-30-thinking-search": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "modelscope-Qwen/Qwen3-235B-A22B-Instruct-2507": { + "max_context_tokens": 262144, + "compression_threshold_tokens": 183500, + }, + "cfchatqwen-qwen3-vl-30b-a3b": { + "max_context_tokens": 131072, + "compression_threshold_tokens": 91750, + }, + "cfchatqwen-qwen3-vl-30b-a3b-thinking": { + "max_context_tokens": 131072, + "compression_threshold_tokens": 91750, + }, + # Gemini + "gemini-2.5-pro-search": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-2.5-flash-search": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-2.5-flash": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-2.5-flash-lite": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-2.5-flash-lite-search": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + 
"gemini-2.5-pro": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-2.0-flash-search": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-2.0-flash": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-2.0-flash-exp": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-2.0-flash-lite": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "copilot-gemini-2.5-pro": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "gemini-pro-latest": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-3-pro-preview": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "gemini-pro-latest-search": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-flash-latest": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-flash-latest-search": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-flash-lite-latest-search": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-flash-lite-latest": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + "gemini-robotics-er-1.5-preview": { + "max_context_tokens": 1048576, + "compression_threshold_tokens": 734000, + }, + # DeepSeek + "modelscope-deepseek-ai/DeepSeek-V3.1": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "cfdeepseek-deepseek-search": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "openrouter-deepseek/deepseek-r1-0528:free": { + "max_context_tokens": 163840, + "compression_threshold_tokens": 114688, + }, + "modelscope-deepseek-ai/DeepSeek-V3.2-Exp": { + "max_context_tokens": 128000, + 
"compression_threshold_tokens": 89600, + }, + "cfdeepseek-deepseek-r1-search": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "cfdeepseek-deepseek-r1": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "openrouter-deepseek/deepseek-chat-v3.1:free": { + "max_context_tokens": 163800, + "compression_threshold_tokens": 114660, + }, + "modelscope-deepseek-ai/DeepSeek-R1-0528": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "cfdeepseek-deepseek": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + # Kimi (Moonshot) + "cfkimi-kimi-k2-search": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "cfkimi-kimi-k1.5-search": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "cfkimi-kimi-k1.5-thinking-search": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "cfkimi-kimi-research": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "openrouter-moonshotai/kimi-k2:free": { + "max_context_tokens": 32768, + "compression_threshold_tokens": 22937, + }, + "cfkimi-kimi-k2": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "cfkimi-kimi-k1.5": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + # GPT / OpenAI + "gpt-4.1": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "gpt-4o": { + "max_context_tokens": 64000, + "compression_threshold_tokens": 44800, + }, + "gpt-5": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "github-gpt-4.1": { + "max_context_tokens": 7500, + "compression_threshold_tokens": 5250, + }, + "gpt-5-mini": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "gpt-5.1": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + 
"gpt-5.1-codex": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "gpt-5.1-codex-mini": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "gpt-5-codex": { + "max_context_tokens": 200000, + "compression_threshold_tokens": 140000, + }, + "github-gpt-4.1-mini": { + "max_context_tokens": 7500, + "compression_threshold_tokens": 5250, + }, + "openrouter-openai/gpt-oss-20b:free": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + # Claude / Anthropic + "claude-sonnet-4.5": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "claude-haiku-4.5": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "copilot-claude-opus-41": { + "max_context_tokens": 80000, + "compression_threshold_tokens": 56000, + }, + "copilot-claude-sonnet-4": { + "max_context_tokens": 80000, + "compression_threshold_tokens": 56000, + }, + # Other / OpenRouter / OSWE + "oswe-vscode-insiders": { + "max_context_tokens": 256000, + "compression_threshold_tokens": 179200, + }, + "modelscope-MiniMax/MiniMax-M2": { + "max_context_tokens": 204800, + "compression_threshold_tokens": 143360, + }, + "oswe-vscode-prime": { + "max_context_tokens": 200000, + "compression_threshold_tokens": 140000, + }, + "grok-code-fast-1": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "copilot-auto": { + "max_context_tokens": 128000, + "compression_threshold_tokens": 89600, + }, + "modelscope-ZhipuAI/GLM-4.6": { + "max_context_tokens": 32000, + "compression_threshold_tokens": 22400, + }, + "openrouter-x-ai/grok-4.1-fast:free": { + "max_context_tokens": 2000000, + "compression_threshold_tokens": 1400000, + }, + "openrouter-qwen/qwen3-coder:free": { + "max_context_tokens": 262000, + "compression_threshold_tokens": 183400, + }, + "openrouter-qwen/qwen3-235b-a22b:free": { + "max_context_tokens": 40960, + "compression_threshold_tokens": 28672, + }, + }, 
+ description="Threshold override configuration for specific models. Only includes models requiring special configuration.", + ) + keep_first: int = Field( - default=1, ge=0, description="Always keep the first N messages. Set to 0 to disable." + default=1, + ge=0, + description="Always keep the first N messages. Set to 0 to disable.", + ) + keep_last: int = Field( + default=6, ge=0, description="Always keep the last N full messages." ) - keep_last: int = Field(default=6, ge=0, description="Always keep the last N messages.") summary_model: str = Field( default=None, - description="The model to use for generating the summary. If empty, uses the current conversation's model.", + description="The model ID used to generate the summary. If empty, uses the current conversation's model. Used to match configurations in model_thresholds.", ) max_summary_tokens: int = Field( - default=4000, ge=1, description="The maximum number of tokens for the summary." + default=16384, + ge=1, + description="The maximum number of tokens for the summary.", ) summary_temperature: float = Field( - default=0.3, ge=0.0, le=2.0, description="The temperature for summary generation." + default=0.1, + ge=0.0, + le=2.0, + description="The temperature for summary generation.", + ) + debug_mode: bool = Field( + default=True, description="Enable detailed logging for debugging." ) - debug_mode: bool = Field(default=True, description="Enable detailed logging for debugging.") - @model_validator(mode="after") - def check_thresholds(self) -> "Valves": - kept_count = self.keep_first + self.keep_last - if self.compression_threshold <= kept_count: - raise ValueError( - f"compression_threshold ({self.compression_threshold}) must be greater than " - f"the sum of keep_first ({self.keep_first}) and keep_last ({self.keep_last}) ({kept_count})." 
- ) - return self - - def _save_summary(self, chat_id: str, summary: str, body: dict): + def _save_summary(self, chat_id: str, summary: str, compressed_count: int): """Saves the summary to the database.""" if not self._SessionLocal: if self.valves.debug_mode: @@ -388,21 +778,27 @@ class Filter: session = self._SessionLocal() try: # Find existing record - existing = ( - session.query(ChatSummary).filter_by(chat_id=chat_id).first() - ) + existing = session.query(ChatSummary).filter_by(chat_id=chat_id).first() if existing: + # [Optimization] Optimistic lock check: update only if progress moves forward + if compressed_count <= existing.compressed_message_count: + if self.valves.debug_mode: + print( + f"[Storage] Skipping update: New progress ({compressed_count}) is not greater than existing progress ({existing.compressed_message_count})" + ) + return + # Update existing record existing.summary = summary - existing.compressed_message_count = len(body.get("messages", [])) + existing.compressed_message_count = compressed_count existing.updated_at = datetime.utcnow() else: # Create new record new_summary = ChatSummary( chat_id=chat_id, summary=summary, - compressed_message_count=len(body.get("messages", [])), + compressed_message_count=compressed_count, ) session.add(new_summary) @@ -410,7 +806,9 @@ class Filter: if self.valves.debug_mode: action = "Updated" if existing else "Created" - print(f"[Storage] Summary has been {action.lower()} in the database (Chat ID: {chat_id})") + print( + f"[Storage] Summary has been {action.lower()} in the database (Chat ID: {chat_id})" + ) finally: session.close() @@ -418,38 +816,100 @@ class Filter: except Exception as e: print(f"[Storage] ❌ Database save failed: {str(e)}") - def _load_summary(self, chat_id: str, body: dict) -> Optional[str]: - """Loads the summary from the database.""" + def _load_summary_record(self, chat_id: str) -> Optional[ChatSummary]: + """Loads the summary record object from the database.""" if not 
self._SessionLocal: - if self.valves.debug_mode: - print("[Storage] Database not initialized, cannot load summary.") return None try: session = self._SessionLocal() try: - record = ( - session.query(ChatSummary).filter_by(chat_id=chat_id).first() - ) - + record = session.query(ChatSummary).filter_by(chat_id=chat_id).first() if record: - if self.valves.debug_mode: - print(f"[Storage] Loaded summary from database (Chat ID: {chat_id})") - print( - f"[Storage] Last updated: {record.updated_at}, Original message count: {record.compressed_message_count}" - ) - return record.summary - + # Detach the object from the session so it can be used after session close + session.expunge(record) + return record finally: session.close() - except Exception as e: - print(f"[Storage] ❌ Database read failed: {str(e)}") - + print(f"[Load] ❌ Database read failed: {str(e)}") return None + def _load_summary(self, chat_id: str, body: dict) -> Optional[str]: + """Loads the summary text from the database (Compatible with old interface).""" + record = self._load_summary_record(chat_id) + if record: + if self.valves.debug_mode: + print(f"[Load] Loaded summary from database (Chat ID: {chat_id})") + print( + f"[Load] Last updated: {record.updated_at}, Compressed message count: {record.compressed_message_count}" + ) + return record.summary + return None + + def _count_tokens(self, text: str, model: str = "gpt-3.5-turbo") -> int: + """Counts the number of tokens in the text.""" + if not text: + return 0 + + if tiktoken: + try: + # Uniformly use o200k_base encoding (adapted for latest models) + encoding = tiktoken.get_encoding("o200k_base") + return len(encoding.encode(text)) + except Exception as e: + if self.valves.debug_mode: + print( + f"[Token Count] tiktoken error: {e}, falling back to character estimation" + ) + + # Fallback strategy: Rough estimation (1 token β‰ˆ 4 chars) + return len(text) // 4 + + def _calculate_messages_tokens( + self, messages: List[Dict], model: str = "gpt-3.5-turbo" + ) 
-> int: + """Calculates the total tokens for a list of messages.""" + total_tokens = 0 + for msg in messages: + content = msg.get("content", "") + if isinstance(content, list): + # Handle multimodal content + text_content = "" + for part in content: + if isinstance(part, dict) and part.get("type") == "text": + text_content += part.get("text", "") + total_tokens += self._count_tokens(text_content, model) + else: + total_tokens += self._count_tokens(str(content), model) + return total_tokens + + def _get_model_thresholds(self, model_id: str) -> Dict[str, int]: + """Gets threshold configuration for a specific model. + + Priority: + 1. If configuration exists for the model ID in model_thresholds, use it. + 2. Otherwise, use global parameters compression_threshold_tokens and max_context_tokens. + """ + # Try to match from model-specific configuration + if model_id in self.valves.model_thresholds: + if self.valves.debug_mode: + print(f"[Config] Using model-specific configuration: {model_id}") + return self.valves.model_thresholds[model_id] + + # Use global default configuration + if self.valves.debug_mode: + print( + f"[Config] Model {model_id} not in model_thresholds, using global parameters" + ) + + return { + "compression_threshold_tokens": self.valves.compression_threshold_tokens, + "max_context_tokens": self.valves.max_context_tokens, + } + def _inject_summary_to_first_message(self, message: dict, summary: str) -> dict: - """Injects the summary into the first message by prepending it.""" + """Injects the summary into the first message (prepended to content).""" content = message.get("content", "") summary_block = f"【Historical Conversation Summary】\n{summary}\n\n---\nBelow is the recent conversation:\n\n" @@ -485,14 +945,15 @@ class Filter: return message async def inlet( - self, body: dict, __user__: Optional[dict] = None, __metadata__: dict = None + self, + body: dict, + __user__: Optional[dict] = None, + __metadata__: dict = None, + __event_emitter__: 
Callable[[Any], Awaitable[None]] = None, ) -> dict: """ Executed before sending to the LLM. - Compression Strategy: - 1. Keep the first N messages. - 2. Inject the summary into the first message (if keep_first > 0). - 3. Keep the last N messages. + Compression Strategy: Only responsible for injecting existing summaries, no Token calculation. """ messages = body.get("messages", []) chat_id = __metadata__["chat_id"] @@ -502,161 +963,331 @@ class Filter: print(f"[Inlet] Chat ID: {chat_id}") print(f"[Inlet] Received {len(messages)} messages") - # [Optimization] Load summary in a background thread to avoid blocking the event loop. - if self.valves.debug_mode: - print("[Optimization] Loading summary in a background thread to avoid blocking the event loop.") - saved_summary = await asyncio.to_thread(self._load_summary, chat_id, body) + # Record the target compression progress for the original messages, for use in outlet + # Target is to compress up to the (total - keep_last) message + target_compressed_count = max(0, len(messages) - self.valves.keep_last) - total_kept_count = self.valves.keep_first + self.valves.keep_last - - if saved_summary and len(messages) > total_kept_count: + # [Optimization] Simple state cleanup check + if chat_id in self.temp_state: if self.valves.debug_mode: - print(f"[Inlet] Found saved summary, applying compression.") - - first_messages_to_keep = [] - - if self.valves.keep_first > 0: - # Copy the initial messages to keep - first_messages_to_keep = [ - m.copy() for m in messages[: self.valves.keep_first] - ] - # Inject the summary into the very first message - first_messages_to_keep[0] = self._inject_summary_to_first_message( - first_messages_to_keep[0], saved_summary - ) - else: - # If not keeping initial messages, create a new system message for the summary - summary_block = ( - f"【Historical Conversation Summary】\n{saved_summary}\n\n---\nBelow is the recent conversation:\n\n" - ) - first_messages_to_keep.append( - {"role": "system", 
"content": summary_block} + print( + f"[Inlet] ⚠️ Overwriting unconsumed old state (Chat ID: {chat_id})" ) - # Keep the last messages - last_messages_to_keep = ( - messages[-self.valves.keep_last :] if self.valves.keep_last > 0 else [] + self.temp_state[chat_id] = target_compressed_count + + if self.valves.debug_mode: + print( + f"[Inlet] Recorded target compression progress: {target_compressed_count}" ) - # Combine: [Kept initial messages (with summary)] + [Kept recent messages] - body["messages"] = first_messages_to_keep + last_messages_to_keep + # Load summary record + summary_record = await asyncio.to_thread(self._load_summary_record, chat_id) + + final_messages = [] + + if summary_record: + # Summary exists, build view: [Head] + [Summary Message] + [Tail] + # Tail is all messages after the last compression point + compressed_count = summary_record.compressed_message_count + + # Ensure compressed_count is reasonable + if compressed_count > len(messages): + compressed_count = max(0, len(messages) - self.valves.keep_last) + + # 1. Head messages (Keep First) + head_messages = [] + if self.valves.keep_first > 0: + head_messages = messages[: self.valves.keep_first] + + # 2. Summary message (Inserted as User message) + summary_content = ( + f"【System Prompt: The following is a summary of the historical conversation, provided for context only. Do not reply to the summary content itself; answer the subsequent latest questions directly.】\n\n" + f"{summary_record.summary}\n\n" + f"---\n" + f"Below is the recent conversation:" + ) + summary_msg = {"role": "user", "content": summary_content} + + # 3. 
Tail messages (Tail) - All messages starting from the last compression point + # Note: Must ensure head messages are not duplicated + start_index = max(compressed_count, self.valves.keep_first) + tail_messages = messages[start_index:] + + final_messages = head_messages + [summary_msg] + tail_messages + + # Send status notification + if __event_emitter__: + await __event_emitter__( + { + "type": "status", + "data": { + "description": f"Loaded historical summary (Hidden {compressed_count} historical messages)", + "done": True, + }, + } + ) if self.valves.debug_mode: - print(f"[Inlet] βœ‚οΈ Compression complete:") - print(f" - Original messages: {len(messages)}") - print(f" - Compressed to: {len(body['messages'])}") print( - f" - Structure: [Keep first {self.valves.keep_first} (with summary)] + [Keep last {self.valves.keep_last}]" + f"[Inlet] Applied summary: Head({len(head_messages)}) + Summary + Tail({len(tail_messages)})" ) - print(f" - Saved: {len(messages) - len(body['messages'])} messages") else: - if self.valves.debug_mode: - if not saved_summary: - print(f"[Inlet] No summary found, using full conversation history.") - else: - print(f"[Inlet] Message count does not exceed retention threshold, no compression applied.") + # No summary, use original messages + final_messages = messages + + body["messages"] = final_messages if self.valves.debug_mode: + print(f"[Inlet] Final send: {len(body['messages'])} messages") print(f"{'='*60}\n") return body async def outlet( - self, body: dict, __user__: Optional[dict] = None, __metadata__: dict = None + self, + body: dict, + __user__: Optional[dict] = None, + __metadata__: dict = None, + __event_emitter__: Callable[[Any], Awaitable[None]] = None, ) -> dict: """ Executed after the LLM response is complete. - Triggers summary generation asynchronously. + Calculates Token count in the background and triggers summary generation (does not block current response, does not affect content output). 
""" - messages = body.get("messages", []) chat_id = __metadata__["chat_id"] + model = body.get("model", "gpt-3.5-turbo") if self.valves.debug_mode: print(f"\n{'='*60}") print(f"[Outlet] Chat ID: {chat_id}") - print(f"[Outlet] Response complete, current message count: {len(messages)}") + print(f"[Outlet] Response complete") - # Check if compression is needed - if len(messages) >= self.valves.compression_threshold: - if self.valves.debug_mode: - print( - f"[Outlet] ⚑ Compression threshold reached ({len(messages)} >= {self.valves.compression_threshold})" - ) - print(f"[Outlet] Preparing to generate summary in the background...") - - # Generate summary asynchronously in the background - asyncio.create_task( - self._generate_summary_async(messages, chat_id, body, __user__) + # Process Token calculation and summary generation asynchronously in the background (do not wait for completion, do not affect output) + asyncio.create_task( + self._check_and_generate_summary_async( + chat_id, model, body, __user__, __event_emitter__ ) - else: - if self.valves.debug_mode: - print( - f"[Outlet] Compression threshold not reached ({len(messages)} < {self.valves.compression_threshold})" - ) + ) if self.valves.debug_mode: + print(f"[Outlet] Background processing started") print(f"{'='*60}\n") return body - async def _generate_summary_async( - self, messages: list, chat_id: str, body: dict, user_data: Optional[dict] + async def _check_and_generate_summary_async( + self, + chat_id: str, + model: str, + body: dict, + user_data: Optional[dict], + __event_emitter__: Callable[[Any], Awaitable[None]] = None, ): """ - Generates a summary asynchronously in the background. + Background processing: Calculates Token count and generates summary (does not block response). 
+ """ + try: + messages = body.get("messages", []) + + # Get threshold configuration for current model + thresholds = self._get_model_thresholds(model) + compression_threshold_tokens = thresholds.get( + "compression_threshold_tokens", self.valves.compression_threshold_tokens + ) + + if self.valves.debug_mode: + print(f"\n[πŸ” Background Calculation] Starting Token count...") + + # Calculate Token count in a background thread + current_tokens = await asyncio.to_thread( + self._calculate_messages_tokens, messages, model + ) + + if self.valves.debug_mode: + print(f"[πŸ” Background Calculation] Token count: {current_tokens}") + + # Check if compression is needed + if current_tokens >= compression_threshold_tokens: + if self.valves.debug_mode: + print( + f"[πŸ” Background Calculation] ⚑ Compression threshold triggered (Token: {current_tokens} >= {compression_threshold_tokens})" + ) + + # Proceed to generate summary + await self._generate_summary_async( + messages, chat_id, body, user_data, __event_emitter__ + ) + else: + if self.valves.debug_mode: + print( + f"[πŸ” Background Calculation] Compression threshold not reached (Token: {current_tokens} < {compression_threshold_tokens})" + ) + + except Exception as e: + print(f"[πŸ” Background Calculation] ❌ Error: {str(e)}") + + async def _generate_summary_async( + self, + messages: list, + chat_id: str, + body: dict, + user_data: Optional[dict], + __event_emitter__: Callable[[Any], Awaitable[None]] = None, + ): + """ + Generates summary asynchronously (runs in background, does not block response). + Logic: + 1. Extract middle messages (remove keep_first and keep_last). + 2. Check Token limit, if exceeding max_context_tokens, remove from the head of middle messages. + 3. Generate summary for the remaining middle messages. 
""" try: if self.valves.debug_mode: print(f"\n[πŸ€– Async Summary Task] Starting...") - # Messages to summarize: exclude kept initial and final messages - if self.valves.keep_last > 0: - messages_to_summarize = messages[ - self.valves.keep_first : -self.valves.keep_last - ] - else: - messages_to_summarize = messages[self.valves.keep_first :] - - if len(messages_to_summarize) == 0: + # 1. Get target compression progress + # Prioritize getting from temp_state (calculated by inlet). If unavailable (e.g., after restart), assume current is full history. + target_compressed_count = self.temp_state.pop(chat_id, None) + if target_compressed_count is None: + target_compressed_count = max(0, len(messages) - self.valves.keep_last) if self.valves.debug_mode: - print(f"[πŸ€– Async Summary Task] No messages to summarize, skipping.") + print( + f"[πŸ€– Async Summary Task] ⚠️ Could not get inlet state, estimating progress using current message count: {target_compressed_count}" + ) + + # 2. Determine the range of messages to compress (Middle) + start_index = self.valves.keep_first + end_index = len(messages) - self.valves.keep_last + if self.valves.keep_last == 0: + end_index = len(messages) + + # Ensure indices are valid + if start_index >= end_index: + if self.valves.debug_mode: + print( + f"[πŸ€– Async Summary Task] Middle messages empty (Start: {start_index}, End: {end_index}), skipping" + ) return + middle_messages = messages[start_index:end_index] + if self.valves.debug_mode: - print(f"[πŸ€– Async Summary Task] Preparing to summarize {len(messages_to_summarize)} messages.") print( - f"[πŸ€– Async Summary Task] Protecting: First {self.valves.keep_first} + Last {self.valves.keep_last} messages." + f"[πŸ€– Async Summary Task] Middle messages to process: {len(middle_messages)}" ) - # Build conversation history text - conversation_text = self._format_messages_for_summary(messages_to_summarize) + # 3. 
Check Token limit and truncate (Max Context Truncation) + # [Optimization] Use the summary model's (if any) threshold to decide how many middle messages can be processed + # This allows using a long-window model (like gemini-flash) to compress history exceeding the current model's window + summary_model_id = self.valves.summary_model or body.get( + "model", "gpt-3.5-turbo" + ) - # Call LLM to generate summary - summary = await self._call_summary_llm(conversation_text, body, user_data) - - # [Optimization] Save summary in a background thread to avoid blocking the event loop. - if self.valves.debug_mode: - print("[Optimization] Saving summary in a background thread to avoid blocking the event loop.") - await asyncio.to_thread(self._save_summary, chat_id, summary, body) + thresholds = self._get_model_thresholds(summary_model_id) + # Note: Using the summary model's max context limit here + max_context_tokens = thresholds.get( + "max_context_tokens", self.valves.max_context_tokens + ) if self.valves.debug_mode: - print(f"[πŸ€– Async Summary Task] βœ… Complete! 
Summary length: {len(summary)} characters.")
-                print(f"[πŸ€– Async Summary Task] Summary preview: {summary[:150]}...")
+                print(
+                    f"[πŸ€– Async Summary Task] Using max limit for model {summary_model_id}: {max_context_tokens} Tokens"
+                )
+
+            # Calculate current total Tokens (using the summary model for counting)
+            total_tokens = await asyncio.to_thread(
+                self._calculate_messages_tokens, messages, summary_model_id
+            )
+
+            if total_tokens > max_context_tokens:
+                excess_tokens = total_tokens - max_context_tokens
+                if self.valves.debug_mode:
+                    print(
+                        f"[πŸ€– Async Summary Task] ⚠️ Total Tokens ({total_tokens}) exceed the summary model limit ({max_context_tokens}); removing approx. {excess_tokens} Tokens"
+                    )
+
+                # Remove from the head of middle_messages
+                removed_tokens = 0
+                removed_count = 0
+
+                while removed_tokens < excess_tokens and middle_messages:
+                    msg_to_remove = middle_messages.pop(0)
+                    msg_tokens = self._count_tokens(
+                        str(msg_to_remove.get("content", "")), summary_model_id
+                    )
+                    removed_tokens += msg_tokens
+                    removed_count += 1
+
+                if self.valves.debug_mode:
+                    print(
+                        f"[πŸ€– Async Summary Task] Removed {removed_count} messages, totaling {removed_tokens} Tokens"
+                    )
+
+            if not middle_messages:
+                if self.valves.debug_mode:
+                    print(
+                        f"[πŸ€– Async Summary Task] Middle messages empty after truncation; skipping summary generation"
+                    )
+                return
+
+            # 4. Build conversation text
+            conversation_text = self._format_messages_for_summary(middle_messages)
+
+            # 5. Call LLM to generate the new summary
+            # Note: previous_summary is not passed here because the old summary (if any) is already included in middle_messages
+
+            # Send status notification that summary generation has started
+            if __event_emitter__:
+                await __event_emitter__(
+                    {
+                        "type": "status",
+                        "data": {
+                            "description": "Generating context summary in background...",
+                            "done": False,
+                        },
+                    }
+                )
+
+            new_summary = await self._call_summary_llm(
+                None, conversation_text, body, user_data
+            )
+
+            # 6.
Save new summary + if self.valves.debug_mode: + print( + "[Optimization] Saving summary in a background thread to avoid blocking the event loop." + ) + + await asyncio.to_thread( + self._save_summary, chat_id, new_summary, target_compressed_count + ) + + # Send completion status notification + if __event_emitter__: + await __event_emitter__( + { + "type": "status", + "data": { + "description": f"Context summary updated (Saved {len(middle_messages)} messages)", + "done": True, + }, + } + ) + + if self.valves.debug_mode: + print( + f"[πŸ€– Async Summary Task] βœ… Complete! New summary length: {len(new_summary)} characters" + ) + print( + f"[πŸ€– Async Summary Task] Progress update: Compressed up to original message {target_compressed_count}" + ) except Exception as e: print(f"[πŸ€– Async Summary Task] ❌ Error: {str(e)}") import traceback traceback.print_exc() - # Save a simple placeholder even on failure - fallback_summary = ( - f"[Historical Conversation Summary] Contains content from approximately {len(messages_to_summarize)} messages." - ) - - # [Optimization] Save summary in a background thread to avoid blocking the event loop. - if self.valves.debug_mode: - print("[Optimization] Saving summary in a background thread to avoid blocking the event loop.") - await asyncio.to_thread(self._save_summary, chat_id, fallback_summary, body) def _format_messages_for_summary(self, messages: list) -> str: """Formats messages for summarization.""" @@ -685,37 +1316,51 @@ class Filter: return "\n\n".join(formatted) async def _call_summary_llm( - self, conversation_text: str, body: dict, user_data: dict + self, + previous_summary: Optional[str], + new_conversation_text: str, + body: dict, + user_data: dict, ) -> str: """ Calls the LLM to generate a summary using Open WebUI's built-in method. 
""" if self.valves.debug_mode: - print(f"[πŸ€– LLM Call] Using Open WebUI's built-in method.") + print(f"[πŸ€– LLM Call] Using Open WebUI's built-in method") - # Build summary prompt + # Build summary prompt (Optimized) summary_prompt = f""" -You are a professional conversation context compression assistant. Your task is to perform a high-fidelity compression of the [Conversation Content] below, producing a concise summary that can be used directly as context for subsequent conversation. Strictly adhere to the following requirements: +You are a professional conversation context compression expert. Your task is to create a high-fidelity summary of the following conversation content. +This conversation may contain previous summaries (as system messages or text) and subsequent conversation content. -MUST RETAIN: Topics/goals, user intent, key facts and data, important parameters and constraints, deadlines, decisions/conclusions, action items and their status, and technical details like code/commands (code must be preserved as is). -REMOVE: Greetings, politeness, repetitive statements, off-topic chatter, and procedural details (unless essential). For information that has been overturned or is outdated, please mark it as "Obsolete: " when retaining. -CONFLICT RESOLUTION: If there are contradictions or multiple revisions, retain the latest consistent conclusion and list unresolved or conflicting points under "Points to Clarify". -STRUCTURE AND TONE: Output in structured bullet points. Be logical, objective, and concise. Summarize from a third-person perspective. Use code blocks to preserve technical/code snippets verbatim. -OUTPUT LENGTH: Strictly limit the summary content to within {int(self.valves.max_summary_tokens * 3)} characters. Prioritize key information; if space is insufficient, trim details rather than core conclusions. -FORMATTING: Output only the summary text. Do not add any extra explanations, execution logs, or generation processes. 
You must use the following headings (if a section has no content, write "None"):
-Core Theme:
-Key Information:
-... (List 3-6 key points)
-Decisions/Conclusions:
-Action Items (with owner/deadline if any):
-Relevant Roles/Preferences:
-Risks/Dependencies/Assumptions:
-Points to Clarify:
-Compression Ratio: Original ~X words β†’ Summary ~Y words (estimate)
-Conversation Content:
-{conversation_text}
+### Core Objectives
+1. **Comprehensive Summary**: Concisely summarize key information, user intent, and assistant responses from the conversation.
+2. **De-noising**: Remove greetings, repetitions, confirmations, and other non-essential information.
+3. **Key Retention**:
+    * **Code snippets, commands, and technical parameters must be preserved verbatim. Do not modify or generalize them.**
+    * User intent, core requirements, decisions, and action items must be clearly preserved.
+4. **Coherence**: The generated summary should be a cohesive whole that can replace the original conversation as context.
+5. **Detailed Record**: Since length permits, preserve details, reasoning steps, and the nuances of multi-turn interactions rather than only high-level generalizations.
-Please directly output the compressed summary that meets the above requirements (summary text only).
+### Output Requirements
+* **Format**: Structured, logically organized text.
+* **Language**: Consistent with the conversation language (usually English).
+* **Length**: Keep the summary strictly within {self.valves.max_summary_tokens} Tokens.
+* **Strictly Forbidden**: Do not output "According to the conversation...", "The summary is as follows...", or similar filler. Output the summary content directly.
+
+### Suggested Summary Structure
+* **Current Goal/Topic**: A one-sentence summary of the problem currently being solved.
+* **Key Information & Context**:
+    * Confirmed facts/parameters.
+    * **Code/Technical Details** (wrap in code blocks).
+* **Progress & Conclusions**: Completed steps and consensus reached.
+* **Action Items/Next Steps**: Clear follow-up actions.
+
+---
+{new_conversation_text}
+---
+
+Based on the content above, generate the summary:
"""

        # Determine the model to use
        model = self.valves.summary_model or body.get("model", "")
@@ -740,7 +1385,9 @@ Please directly output the compressed summary that meets the above requirements

            # [Optimization] Get user object in a background thread to avoid blocking the event loop.
            if self.valves.debug_mode:
-                print("[Optimization] Getting user object in a background thread to avoid blocking the event loop.")
+                print(
+                    "[Optimization] Getting user object in a background thread to avoid blocking the event loop."
+                )

            user = await asyncio.to_thread(Users.get_user_by_id, user_id)
            if not user:
@@ -757,17 +1404,17 @@ Please directly output the compressed summary that meets the above requirements

            response = await generate_chat_completion(request, payload, user)

            if not response or "choices" not in response or not response["choices"]:
-                raise ValueError("LLM response is not in the correct format or is empty")
+                raise ValueError("LLM response format is incorrect or empty")

            summary = response["choices"][0]["message"]["content"].strip()

            if self.valves.debug_mode:
-                print(f"[πŸ€– LLM Call] βœ… Successfully received summary.")
+                print(f"[πŸ€– LLM Call] βœ… Successfully received summary")

            return summary

        except Exception as e:
-            error_message = f"An error occurred while calling the LLM ({model}) to generate a summary: {str(e)}"
+            error_message = f"Error while calling the LLM ({model}) to generate a summary: {str(e)}"
            if not self.valves.summary_model:
                error_message += (
                    "\n[Hint] You did not specify a summary_model, so the filter attempted to use the current conversation's model. "
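Note on the token-threshold trigger: the patch gates compression on helpers such as `_calculate_messages_tokens` and `_count_tokens`, whose bodies fall outside these hunks. The sketch below illustrates the kind of check they enable. It is not the plugin's actual implementation: the 4-characters-per-token heuristic and the per-message overhead constant are assumptions for illustration (the real filter presumably uses a proper tokenizer such as tiktoken), and all names here are hypothetical.

```python
# Hypothetical sketch of token-threshold compression gating.
# ASSUMPTION: a crude chars/4 heuristic stands in for a real tokenizer.

CHARS_PER_TOKEN = 4       # assumed heuristic, not the plugin's value
PER_MESSAGE_OVERHEAD = 4  # assumed fixed cost for role/markup per message


def count_tokens(text: str) -> int:
    """Approximate the token count of a string (never returns 0 for non-empty input)."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def calculate_messages_tokens(messages: list[dict]) -> int:
    """Sum approximate tokens across a chat history, including per-message overhead."""
    total = 0
    for msg in messages:
        total += PER_MESSAGE_OVERHEAD
        total += count_tokens(str(msg.get("content", "")))
    return total


def should_compress(messages: list[dict], threshold_tokens: int) -> bool:
    """True when the history is large enough to trigger background summarization."""
    return calculate_messages_tokens(messages) >= threshold_tokens
```

In the patch, the equivalent check runs inside `asyncio.to_thread` from the outlet's background task, so even an expensive tokenizer pass never blocks the response being streamed to the user.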