fix(async-context-compression): strengthen summary path robustness

- Add comprehensive error logging for LLM response validation failures
- Thread __request__ context through entire summary generation pipeline
- Load and merge previous_summary from DB when not in outlet payload
- Use real request object instead of minimal synthetic context
2026-03-13 14:15:42 +08:00
parent f11cf27404
commit 8c998ecc73
3 changed files with 83 additions and 27 deletions
--- a/plugins/filters/async-context-compression/README.md
+++ b/plugins/filters/async-context-compression/README.md
@@ -8,6 +8,9 @@ This filter reduces token consumption in long conversations through intelligent

 - **Reverse-Unfolding Mechanism**: Accurately reconstructs the expanded native tool-calling sequence during the outlet phase to permanently fix coordinate drift and missing summaries for long tool-based conversations.
 - **Safer Tool Trimming**: Refactored `enable_tool_output_trimming` to strictly use atomic block groups for safe trimming, completely preventing JSON payload corruption.
+- **Strengthened Summary Path**: Thread real `__request__` context through the entire summary pipeline instead of using minimal synthetic requests, improving compatibility with various LLM providers.
+- **Smarter Previous Summary Loading**: When outlet payload lacks an injected summary, the filter now explicitly loads the previous summary from the database and merges it into the LLM prompt for incremental context preservation.
+- **Better Error Diagnostics**: LLM response validation errors now print the complete response body for immediate troubleshooting instead of just the type name.

 ---
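The previous-summary behaviour described in the README entries above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the filter's actual code: `pick_previous_summary`, `build_previous_summary_block`, and the `load_summary` callable are hypothetical names standing in for the filter's database lookup and prompt assembly.

```python
from typing import Callable, Optional


def pick_previous_summary(
    summary_index: Optional[int],
    load_summary: Callable[[], Optional[str]],
) -> Optional[str]:
    """Return the summary to inject, or None when the messages already carry it."""
    if summary_index is not None:
        # The old summary is already present in the message list, so injecting
        # it a second time would only duplicate context.
        return None
    # No summary in the messages: fall back to the stored one (hypothetical loader).
    return load_summary()


def build_previous_summary_block(previous_summary: Optional[str]) -> str:
    """Wrap the previous summary in a tagged block, or return '' if there is none."""
    if not previous_summary:
        return ""
    return (
        "<previous_working_memory>\n"
        f"{previous_summary}\n"
        "</previous_working_memory>\n\n"
    )


if __name__ == "__main__":
    stored = pick_previous_summary(None, lambda: "User is debugging a parser.")
    print(build_previous_summary_block(stored))
```

Returning an empty string when no summary exists lets the block be interpolated into the prompt unconditionally, which keeps the prompt template free of branching.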

--- a/plugins/filters/async-context-compression/README_CN.md
+++ b/plugins/filters/async-context-compression/README_CN.md
@@ -10,6 +10,9 @@

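 - **Reverse-Unfolding Mechanism**: Introduces the `_unfold_messages` mechanism to precisely align the coordinate system during the `outlet` phase, completely resolving progress drift and skipped summary generation in long multi-turn tool-calling conversations caused by frontend view folding.
 - **Safer Tool Output Trimming**: Refactored `enable_tool_output_trimming` to strictly use atomic groups for safe native tool output trimming, replacing aggressive regular-expression matching and preventing JSON payload corruption.
+- **Strengthened Summary Path Stability**: Passes the real `__request__` context through the entire summary pipeline instead of using a minimal synthetic request, improving compatibility with various LLM providers.
+- **Smarter Previous Summary Loading**: When the outlet payload contains no injected summary, the filter now explicitly loads the previous summary from the database and merges it into the LLM prompt for incremental context preservation.
+- **Better Error Diagnostics**: When LLM response validation fails, the complete response body is now printed instead of only the type name, making troubleshooting faster.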

 ---
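The error-diagnostics change in this commit can be illustrated with a small standalone sketch. The function names `describe_response` and `validate_llm_response` are hypothetical and chosen only for this example; the validation conditions mirror the ones visible in the diff, and the JSON-then-`repr` fallback follows the same pattern.

```python
import json
from typing import Any


def describe_response(response: Any) -> str:
    """Render a response for logs: pretty JSON if possible, else repr()."""
    try:
        return json.dumps(response, ensure_ascii=False, indent=2)
    except Exception:
        # Objects that are not JSON-serializable still get a usable description.
        return repr(response)


def validate_llm_response(response: Any) -> dict:
    """Raise ValueError with the full response body when the shape is wrong."""
    if (
        not isinstance(response, dict)
        or "choices" not in response
        or not response["choices"]
    ):
        raise ValueError(
            f"LLM response format incorrect or empty: {type(response).__name__}\n"
            f"Full response:\n{describe_response(response)}"
        )
    return response


if __name__ == "__main__":
    try:
        validate_llm_response({"error": {"message": "rate limited"}})
    except ValueError as exc:
        print(exc)
```

Including the body matters because a provider error is often returned as a well-formed dict without `choices`; the type name alone (`dict`) would not distinguish it from any other malformed reply.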

--- a/plugins/filters/async-context-compression/async_context_compression.py
+++ b/plugins/filters/async-context-compression/async_context_compression.py
@@ -2502,6 +2502,7 @@ class Filter:
        __model__: dict = None,
        __event_emitter__: Callable[[Any], Awaitable[None]] = None,
        __event_call__: Callable[[Any], Awaitable[None]] = None,
+        __request__: Request = None,
    ) -> dict:
        """
        Executed after the LLM response is complete.
@@ -2614,6 +2615,7 @@ class Filter:
                lang,
                __event_emitter__,
                __event_call__,
+                __request__,
            )
        )

@@ -2630,6 +2632,7 @@ class Filter:
        lang: str,
        __event_emitter__: Callable,
        __event_call__: Callable,
+        __request__: Request = None,
    ):
        """Wrapper to run summary generation with an async lock."""
        async with lock:
@@ -2642,6 +2645,7 @@ class Filter:
                lang,
                __event_emitter__,
                __event_call__,
+                __request__,
            )

    async def _check_and_generate_summary_async(
@@ -2654,6 +2658,7 @@ class Filter:
        lang: str = "en-US",
        __event_emitter__: Callable[[Any], Awaitable[None]] = None,
        __event_call__: Callable[[Any], Awaitable[None]] = None,
+        __request__: Request = None,
    ):
        """
        Background processing: Calculates Token count and generates summary (does not block response).
@@ -2757,6 +2762,7 @@ class Filter:
                    lang,
                    __event_emitter__,
                    __event_call__,
+                    __request__,
                )
            else:
                await self._log(
@@ -2788,6 +2794,7 @@ class Filter:
        lang: str = "en-US",
        __event_emitter__: Callable[[Any], Awaitable[None]] = None,
        __event_call__: Callable[[Any], Awaitable[None]] = None,
+        __request__: Request = None,
    ):
        """
        Generates summary asynchronously (runs in background, does not block response).
@@ -2941,8 +2948,25 @@ class Filter:
            # 4. Build conversation text
            conversation_text = self._format_messages_for_summary(middle_messages)

-            # 5. Call LLM to generate new summary
-            # Note: previous_summary is not passed here because old summary (if any) is already included in middle_messages
+            # 5. Determine previous_summary to pass to LLM.
+            # When summary_index is not None, the old summary message is already the first
+            # entry of middle_messages (protected_prefix=1), so it appears verbatim in
+            # conversation_text — no need to inject separately.
+            # When summary_index is None the outlet messages come from raw DB history that
+            # has never had the summary injected, so we must load it from DB explicitly.
+            if summary_index is None:
+                previous_summary = await asyncio.to_thread(
+                    self._load_summary, chat_id, body
+                )
+                if previous_summary:
+                    await self._log(
+                        "[🤖 Async Summary Task] Loaded previous summary from DB to pass as context (summary not in messages)",
+                        event_call=__event_call__,
+                    )
+            else:
+                previous_summary = None  # already embedded in middle_messages[0]
+
+            # 6. Call LLM to generate new summary

            # Send status notification for starting summary generation
            if __event_emitter__:
@@ -2959,11 +2983,12 @@ class Filter:
                )

            new_summary = await self._call_summary_llm(
-                None,
                conversation_text,
                {**body, "model": summary_model_id},
                user_data,
                __event_call__,
+                __request__,
+                previous_summary=previous_summary,
            )

            if not new_summary:
@@ -3186,11 +3211,12 @@ class Filter:

    async def _call_summary_llm(
        self,
-        previous_summary: Optional[str],
        new_conversation_text: str,
        body: dict,
        user_data: dict,
        __event_call__: Callable[[Any], Awaitable[None]] = None,
+        __request__: Request = None,
+        previous_summary: Optional[str] = None,
    ) -> str:
        """
        Calls the LLM to generate a summary using Open WebUI's built-in method.
@@ -3201,33 +3227,52 @@ class Filter:
        )

        # Build summary prompt (Optimized for State/Working Memory and Tool Calling)
-        summary_prompt = f"""
-You are an expert Context Compression Engine. Your goal is to create a high-fidelity, highly dense "Working Memory" from the provided conversation.
-This conversation may contain previous Working Memories and raw native tool-calling sequences (JSON arguments and results).
+        previous_summary_block = (
+            f"<previous_working_memory>\n{previous_summary}\n</previous_working_memory>\n\n"
+            if previous_summary
+            else ""
+        )
+        summary_prompt = f"""You are an expert Context Compression Engine. Produce a high-fidelity, maximally dense "Working Memory" snapshot from the inputs below.

-### Rules of Engagement
-1.  **Incremental Integration**: If the conversation begins with an existing Working Memory/Summary, you must PRESERVE its core facts and MERGE the new conversation events into it. Do not discard older facts.
-2.  **Tool-Call Decompression**: Raw JSON/Text outputs from tools are noisy. Extract ONLY the definitive facts, actionable data, or root causes of errors. Ignore the structural payload.
-3.  **Ruthless Denoising**: Completely eliminate greetings, apologies ("I'm sorry for the error"), acknowledgments ("Sure, I can do that"), and redundant confirmations.
-4.  **Verbatim Retention**: ANY code snippets, shell commands, file paths, specific parameters, and Message IDs (e.g., [ID: ...]) MUST be kept exactly as they appear to maintain traceability.
-5.  **Logic Preservation**: Clearly link "what the user asked" -> "what the tool found" -> "how the system reacted".
+### Processing Rules
+1.  **State-Aware Merging**: If `<previous_working_memory>` is provided, you MUST merge it with the new conversation. Preserve facts that are still true; UPDATE or SUPERSEDE facts whose state has changed (e.g., "bug X exists" → "bug X fixed in commit abc"); REMOVE facts fully resolved with no future relevance.
+2.  **Goal Tracking**: Reflect the LATEST user intent as "Current Goal". If the goal has shifted, move the old goal to Working Memory as "Prior Goal (completed/abandoned)".
+3.  **Tool-Call Decompression**: From raw JSON tool arguments/results, extract ONLY: definitive facts, concrete return values, error codes, root causes. Discard structural boilerplate.
+4.  **Error & Exception Verbatim**: Stack traces, error messages, exception types, and exit codes MUST be quoted exactly — they are primary debugging artifacts.
+5.  **Ruthless Denoising**: Delete greetings, apologies, acknowledgments, and any phrase that carries zero information.
+6.  **Verbatim Retention**: Code snippets, shell commands, file paths, config values, and Message IDs (e.g., [ID: ...]) MUST appear character-for-character.
+7.  **Causal Chain**: For each tool call or action, record: trigger → operation → outcome (one line per event).

 ### Output Constraints
-*   **Format**: Strictly follow the Markdown structure below.
-*   **Length**: Maximum {self.valves.max_summary_tokens} Tokens.
-*   **Tone**: Robotic, objective, dense.
-*   **Language**: Consistent with the conversation language.
-*   **Forbidden**: NO conversational openings/closings (e.g., "Here is the summary", "Hope this helps"). Output the data directly.
+*   **Format**: Follow the Required Structure below — omit a section only if it has zero content.
+*   **Token Budget**: Stay under {self.valves.max_summary_tokens} tokens. Prioritize recency and actionability when trimming.
+*   **Tone**: Terse, robotic, third-person where applicable.
+*   **Language**: Match the dominant language of the conversation.
+*   **Forbidden**: No preamble, no closing remarks, no meta-commentary. Start directly with the first section header.

-### Suggested Summary Structure
-*   **Current Goal**: What is the user ultimately trying to achieve?
-*   **Working Memory & Facts**: (Bullet points of established facts, parsed tool results, and constraints. Cite Message IDs if critical).
-*   **Code & Artifacts**: (Only if applicable. Include exact code blocks).
-*   **Recent Actions**: (e.g., "Attempted to run script, failed with SyntaxError, applied fix").
-*   **Pending/Next Steps**: What is waiting to be done.
+### Required Output Structure
+## Current Goal
+(Single sentence: what the user is trying to achieve RIGHT NOW)
+
+## Working Memory & Facts
+(Bullet list — each item: one established fact, constraint, or parsed tool result. Mark superseded items as ~~old~~ → new. Cite [ID: ...] when critical.)
+
+## Code & Artifacts
+(Only if present. Exact code blocks with language tags. File paths as inline code.)
+
+## Causal Log
+(Chronological. Format: `[MSG_ID?] action → result`. One line per event. Keep only the last N events that remain causally relevant.)
+
+## Errors & Exceptions
+(Only if unresolved. Exact quoted text. Include error type, message, and last known stack frame.)
+
+## Pending / Next Steps
+(Ordered list. First item = most immediate action.)

 ---
+{previous_summary_block}<new_conversation>
 {new_conversation_text}
+</new_conversation>
 ---

 Generate the Working Memory:
@@ -3277,8 +3322,8 @@ Generate the Working Memory:
                event_call=__event_call__,
            )

-            # Create Request object
-            request = Request(scope={"type": "http", "app": webui_app})
+            # Use the injected request if available, otherwise fall back to a minimal synthetic one
+            request = __request__ or Request(scope={"type": "http", "app": webui_app})

            # Call generate_chat_completion
            response = await generate_chat_completion(request, payload, user)
@@ -3299,8 +3344,13 @@ Generate the Working Memory:
                or "choices" not in response
                or not response["choices"]
            ):
+                try:
+                    response_repr = json_module.dumps(response, ensure_ascii=False, indent=2)
+                except Exception:
+                    response_repr = repr(response)
                raise ValueError(
-                    f"LLM response format incorrect or empty: {type(response).__name__}"
+                    f"LLM response format incorrect or empty: {type(response).__name__}\n"
+                    f"Full response:\n{response_repr}"
                )

            summary = response["choices"][0]["message"]["content"].strip()
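The request-threading pattern used throughout this diff (an optional request passed down each layer, with a synthetic fallback only at the point of use) can be sketched as follows. This is a simplified illustration, not the plugin's code: `StubRequest` is a hypothetical stand-in for the framework's `Request` class, and the two coroutines are reduced versions of the layered calls shown above.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StubRequest:
    """Hypothetical stand-in for the web framework's Request object."""

    scope: dict = field(default_factory=dict)


async def call_summary_llm(text: str, request: Optional[StubRequest] = None) -> str:
    # Prefer the injected request; build a minimal synthetic one only as a fallback.
    source = "real" if request is not None else "synthetic"
    effective = request if request is not None else StubRequest(scope={"type": "http"})
    return f"{source}:{effective.scope.get('type')}:{len(text)}"


async def generate_summary(text: str, request: Optional[StubRequest] = None) -> str:
    # Intermediate layers only forward the request; they never construct one.
    return await call_summary_llm(text, request=request)


if __name__ == "__main__":
    real = StubRequest(scope={"type": "http", "user": "u1"})
    print(asyncio.run(generate_summary("abc", real)))
    print(asyncio.run(generate_summary("abc")))
```

The sketch tests `request is not None` rather than using `or`, which avoids relying on the request object's truthiness; whether that distinction matters for the real `Request` class is not established by the diff.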