fix(async-context-compression): reverse-unfolding to prevent progress drift

- Reconstruct native tool-calling sequences using reverse-unfolding mechanism - Strictly use atomic grouping for safe native tool output trimming - Add comprehensive test coverage for unfolding logic and issue drafts - READMEs and docs synced (v1.4.1)
2026-03-11 03:54:40 +08:00
parent 3210262296
commit cd95b5ff69
16 changed files with 1540 additions and 152 deletions
--- a/.agent/learnings/async-context-compression-progress-mapping.md
+++ b/.agent/learnings/async-context-compression-progress-mapping.md
@@ -0,0 +1,27 @@
+# Async Context Compression Progress Mapping
+
+> Discovered: 2026-03-10
+
+## Context
+Applies to `plugins/filters/async-context-compression/async_context_compression.py` once the inlet has already replaced early history with a synthetic summary message.
+
+## Finding
+`compressed_message_count` cannot be recalculated from the visible message list length after compression. Once a summary marker is present, the visible list mixes:
+- preserved head messages that are still before the saved boundary
+- one synthetic summary message
+- tail messages that map to original history starting at the saved boundary
+
+## Solution / Pattern
+Store the original-history boundary on the injected summary message metadata, then recover future progress using:
+- `original_count = covered_until + len(messages_after_summary_marker)`
+- `target_progress = max(covered_until, original_count - keep_last)`
+
+When the summary-model window is too small, trim newest atomic groups from the summary input so the saved boundary still matches what the summary actually covers.
+
+## Gotchas
+- If you trim from the head of the summary input, the saved progress can overstate coverage and hide messages that were never summarized.
+- Status previews for the next context must convert the saved original-history boundary back into the current visible view before rebuilding head/summary/tail.
+- `inlet(body["messages"])` and `outlet(body["messages"])` can both represent the full conversation while using different serializations:
+	- inlet may receive expanded native tool-call chains (`assistant(tool_calls) -> tool -> assistant`)
+	- outlet may receive a compact top-level transcript where tool calls are folded into assistant `<details type="tool_calls">` blocks
+- These two views do not share a safe `compressed_message_count` coordinate system. If outlet is in the compact assistant/details view, do not persist summary progress derived from its top-level message count.
--- a/.agent/learnings/openwebui-tool-call-context-inflation.md
+++ b/.agent/learnings/openwebui-tool-call-context-inflation.md
@@ -0,0 +1,26 @@
+# OpenWebUI Tool Call Context Inflation
+
+> Discovered: 2026-03-11
+
+## Context
+When analyzing why the `async_context_compression` plugin sees different array lengths of `messages` between the `inlet` (e.g. 27 items) and `outlet` (e.g. 8 items) phases, especially when native tool calling (Function Calling) is involved in OpenWebUI.
+
+## Finding
+There is a fundamental disparity in how OpenWebUI serializes conversational history at different stages of the request lifecycle:
+
+1. **Outlet (UI Rendering View)**:
+   After the LLM completes generation and tools have been executed, OpenWebUI's `middleware.py` (and streaming builders) bundles intermediate tool calls and their raw results. It hides them inside an HTML `<details type="tool_calls">...</details>` block within a single `role: assistant` message's `content`. 
+   Concurrently, the actual native API tool-calling data is saved in a hidden `output` dict field attached to that message. At this stage, the `messages` array looks short (e.g., 8 items) because tool interactions are visually folded.
+
+2. **Inlet (LLM Native View)**:
+   When the user sends the *next* message, the request enters `main.py` -> `process_chat_payload` -> `middleware.py:process_messages_with_output()`.
+   Here, OpenWebUI scans historical `assistant` messages for that hidden `output` field. If found, it completely **inflates (unfolds)** the raw data back into an exact sequence of OpenAI-compliant `tool_call` and `tool_result` messages (using `utils/misc.py:convert_output_to_messages`).
+   The HTML `<details>` string is entirely discarded before being sent to the LLM.
+
+**Conclusion on Token Consumption**:
+In the next turn, tool context is **NOT** compressed at all. It is fully re-expanded to its original verbose state (e.g., back to 27 items) and consumes the maximum amount of tokens required by the raw JSON arguments and results.
+
+## Gotchas
+- Any logic operating in the `outlet` phase (like background tasks) that relies on the `messages` array index will be completely misaligned with the array seen in the `inlet` phase.
+- Attempting to slice or trim history based on `outlet` array lengths will cause index out-of-bounds errors or destructive cropping of recent messages.
+- The only safe way to bridge these two views is either to translate the folded view back into the expanded view using `convert_output_to_messages`, or to rely on unique `id` fields (if available) rather than array indices.