feat: update markdown-normalizer to v1.1.0 (fix mermaid syntax & html safeguard)

2026-01-12 23:44:27 +08:00
parent 6000c880de
commit f650c64ffe
5 changed files with 178 additions and 52 deletions
--- a/plugins/filters/markdown_normalizer/FEATURES_CN.md
+++ b/plugins/filters/markdown_normalizer/FEATURES_CN.md
@@ -125,19 +125,45 @@ if x == 1:
 ```

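 ## 7. Mermaid Syntax Fix
-**Function**: Fixes common syntax errors in Mermaid diagrams, especially unquoted labels that contain special characters.
+**Function**: Fixes common syntax errors in Mermaid diagrams, especially unquoted labels that contain special characters, nested brackets, or HTML tags.
 **Default**: Enabled (`enable_mermaid_fix = True`)
-**Example**: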
-*   **Before**:
-    ```mermaid
-    graph TD
-    A[Label with (parens)] --> B(Label with [brackets])
-    ```
-*   **After**:
-    ```mermaid
-    graph TD
-    A["Label with (parens)"] --> B("Label with [brackets]")
-    ```
+
+### 7.1 Basic Special Characters
+**Before**:
+```mermaid
+graph TD
+A[Label with (parens)] --> B(Label with [brackets])
+```
+**After**:
+```mermaid
+graph TD
+A["Label with (parens)"] --> B("Label with [brackets]")
+```
+
+### 7.2 Nested Bracket Fix (v1.1.0+)
+**Before**:
+```mermaid
+graph TD
+A((开始: 发现可疑快照)) --> B[物理损坏(Allocation Errors)]
+```
+**After**:
+```mermaid
+graph TD
+A(("开始: 发现可疑快照")) --> B["物理损坏(Allocation Errors)"]
+```
+
+### 7.3 Labels Containing HTML Tags (v1.1.0+)
+**Before**:
+```mermaid
+graph TD
+A[第一步<br/>环境隔离] --> B{状态?}
+```
+**After**:
+```mermaid
+graph TD
+A["第一步<br/>环境隔离"] --> B{"状态?"}
+```
+*Note: The plugin's HTML safeguard has been refined so that Mermaid diagrams containing tags such as `<br/>` still trigger the fix.*

 ## 8. XML Cleanup

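The three Before/After pairs in section 7 can be reproduced with a much smaller pattern than the plugin's. The sketch below is not the plugin's `mermaid_node` regex: `SHAPES`, `NODE`, and `quote_labels` are hypothetical names introduced for illustration, and only four node shapes are covered.

```python
import re

# Hypothetical, reduced shape table; longer delimiters come first so that
# "((" is tried before "(".
SHAPES = [("((", "))"), ("[", "]"), ("{", "}"), ("(", ")")]

# Alternative 1 matches an already-quoted string so it can be skipped.
# Alternative 2 matches a node ID followed by one unquoted shape.
NODE = re.compile(
    r'("[^"]*")|(\w+)(?:'
    + "|".join(
        rf'({re.escape(o)})(?!")(.*?)(?<!")({re.escape(c)})' for o, c in SHAPES
    )
    + ")"
)


def quote_labels(line: str) -> str:
    """Wrap unquoted node labels in double quotes; leave quoted text alone."""

    def repl(m: re.Match) -> str:
        if m.group(1):  # an already-quoted string: return it unchanged
            return m.group(1)
        shape_groups = m.groups()[2:]  # (opener, label, closer) per shape
        for i in range(0, len(shape_groups), 3):
            opener, label, closer = shape_groups[i : i + 3]
            if opener is not None:
                return f'{m.group(2)}{opener}"{label}"{closer}'
        return m.group(0)

    return NODE.sub(repl, line)


print(quote_labels("A[第一步<br/>环境隔离] --> B{状态?}"))
```

Matching quoted strings first is what keeps the function idempotent in this sketch: running it on its own output returns the input unchanged.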
--- a/plugins/filters/markdown_normalizer/README.md
+++ b/plugins/filters/markdown_normalizer/README.md
@@ -1,6 +1,9 @@
 # Markdown Normalizer Filter

-A production-grade content normalizer filter for Open WebUI that fixes common Markdown formatting issues in LLM outputs. It ensures that code blocks, LaTeX formulas, Mermaid diagrams, and other Markdown elements are rendered correctly.
+**Author:** [Fu-Jie](https://github.com/Fu-Jie)
+**Version:** 1.1.0
+
+A content normalizer filter for Open WebUI that fixes common Markdown formatting issues in LLM outputs. It ensures that code blocks, LaTeX formulas, Mermaid diagrams, and other Markdown elements are rendered correctly.

 ## Features

@@ -41,6 +44,14 @@ A production-grade content normalizer filter for Open WebUI that fixes common Ma
 *   `show_status`: Show status notification when fixes are applied.
 *   `show_debug_log`: Print debug logs to browser console.

+## Changelog
+
+### v1.1.0
+*   **Mermaid Fix Refinement**: Improved the regex to handle nested parentheses in node labels (e.g., `ID("Label (text)")`) and to avoid matching connection (edge) labels.
+*   **HTML Safeguard Optimization**: Refined `_contains_html` to allow common tags like `<br/>`, `<b>`, `<i>`, etc., ensuring Mermaid diagrams with these tags are still normalized.
+*   **Full-width Symbol Cleanup**: Fixed duplicate keys and incorrect quote mapping in `FULLWIDTH_MAP`.
+*   **Bug Fixes**: Fixed missing `Dict` import in Python files.
+
 ## License

 MIT
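The effect of the **HTML Safeguard Optimization** entry can be checked in isolation. The sketch below rebuilds the v1.0.1 and v1.1.0 tag lists from the `_contains_html` change in this commit; the names `BLOCKED_TAGS`, `RELAXED_TAGS`, and `contains_html` are hypothetical, and the old pattern is reassembled with its tags in a different order, which does not change the boolean result.

```python
import re

# Tags the v1.1.0 pattern still treats as "real" HTML (copied from the diff).
BLOCKED_TAGS = (
    "html|head|body|div|p|hr|ul|ol|li|table|thead|tbody|tfoot|tr|td|th|img|a|"
    "code|pre|blockquote|h[1-6]|script|style|form|input|button|label|select|"
    "option|iframe|link|meta|title"
)
# Tags that v1.0.1 also blocked and v1.1.0 dropped from the list.
RELAXED_TAGS = "span|br|b|i|strong|em"

NEW_PATTERN = re.compile(rf"<\s*/?\s*(?:{BLOCKED_TAGS})\b", re.IGNORECASE)
OLD_PATTERN = re.compile(
    rf"<\s*/?\s*(?:{BLOCKED_TAGS}|{RELAXED_TAGS})\b", re.IGNORECASE
)


def contains_html(pattern: re.Pattern, text: str) -> bool:
    """Return True when the text contains a tag the pattern guards against."""
    return bool(pattern.search(text))


diagram = "graph TD\nA[第一步<br/>环境隔离] --> B{状态?}"
print(contains_html(OLD_PATTERN, diagram))  # old list flags the <br/> label
print(contains_html(NEW_PATTERN, diagram))  # new list lets it through
print(contains_html(NEW_PATTERN, "<div>hello</div>"))  # still guarded
```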
--- a/plugins/filters/markdown_normalizer/README_CN.md
+++ b/plugins/filters/markdown_normalizer/README_CN.md
@@ -1,6 +1,9 @@
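 # Markdown Normalizer Filter

-A production-grade content normalizer filter for Open WebUI, designed to fix common Markdown formatting issues in LLM outputs. It ensures that code blocks, LaTeX formulas, Mermaid diagrams, and other Markdown elements are rendered correctly.
+**Author:** [Fu-Jie](https://github.com/Fu-Jie)
+**Version:** 1.1.0
+
+A content normalizer filter for Open WebUI, designed to fix common Markdown formatting issues in LLM outputs. It ensures that code blocks, LaTeX formulas, Mermaid diagrams, and other Markdown elements are rendered correctly.

 ## Features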

@@ -41,6 +44,14 @@
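 *   `show_status`: Show a status notification when fixes are applied.
 *   `show_debug_log`: Print debug logs to the browser console.

+## Changelog
+
+### v1.1.0
+*   **Mermaid Fix Refinement**: Improved the regex to handle nested parentheses in node labels (e.g., `ID("Label (text)")`) and to avoid mistakenly matching text on connection lines.
+*   **HTML Safeguard Optimization**: Refined the `_contains_html` check to allow common tags such as `<br/>`, `<b>`, and `<i>`, so that Mermaid diagrams containing these tags are still normalized.
+*   **Full-width Symbol Cleanup**: Fixed duplicate keys and incorrect quote mappings in `FULLWIDTH_MAP`.
+*   **Bug Fixes**: Fixed the missing `Dict` type import in the Python files.
+
 ## License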

 MIT
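The `FULLWIDTH_MAP` entry in the changelog is easier to follow with a small demonstration. A Python dict literal accepts a repeated key without error and keeps only the last value, so the duplicated entries removed by this commit collapsed into one. The sketch below assumes the four replacement keys are U+201C, U+201D, U+2018, and U+2019; `QUOTE_MAP` and `normalize_quotes` are hypothetical names, and the plugin's own replacement loop is not shown in the diff.

```python
# A dict literal with a repeated key is legal Python; the later entry wins.
old_fragment = {'"': '"', '"': '"'}
print(len(old_fragment))

# The four curly-quote keys, written as escapes so they cannot be
# silently flattened to ASCII quotes by an editor or copy/paste.
QUOTE_MAP = {
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
}


def normalize_quotes(text: str) -> str:
    """Replace curly quotes with their ASCII equivalents."""
    return text.translate(str.maketrans(QUOTE_MAP))


print(normalize_quotes("\u201chi\u201d \u2018x\u2019"))
```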
--- a/plugins/filters/markdown_normalizer/markdown_normalizer.py
+++ b/plugins/filters/markdown_normalizer/markdown_normalizer.py
@@ -1,14 +1,14 @@
 """
 title: Markdown Normalizer
 author: Fu-Jie
-author_url: https://github.com/Fu-Jie
-funding_url: https://github.com/Fu-Jie/awesome-openwebui
-version: 1.0.1
-description: A production-grade content normalizer filter that fixes common Markdown formatting issues in LLM outputs, such as broken code blocks, LaTeX formulas, and list formatting.
+author_url: https://github.com/Fu-Jie/awesome-openwebui
+funding_url: https://github.com/open-webui
+version: 1.1.0
+description: A content normalizer filter that fixes common Markdown formatting issues in LLM outputs, such as broken code blocks, LaTeX formulas, and list formatting.
 """

 from pydantic import BaseModel, Field
-from typing import Optional, List, Callable
+from typing import Optional, List, Callable, Dict
 import re
 import logging
 import logging
@@ -75,7 +75,7 @@ class ContentNormalizer:
        # Priority: Longer delimiters match first
        "mermaid_node": re.compile(
            r'("[^"\\]*(?:\\.[^"\\]*)*")|'  # Match quoted strings first (Group 1)
-            r"(\w+)\s*(?:"
+            r"(\w+)(?:"
            r"(\(\(\()(?![\"])(.*?)(?<![\"])(\)\)\))|"  # (((...))) Double Circle
            r"(\(\()(?![\"])(.*?)(?<![\"])(\)\))|"  # ((...)) Circle
            r"(\(\[)(?![\"])(.*?)(?<![\"])(\]\))|"  # ([...]) Stadium
@@ -86,7 +86,7 @@ class ContentNormalizer:
            r"(\[\\)(?![\"])(.*?)(?<![\"])(\\\])|"  # [\...\] Parallelogram Alt
            r"(\[/)(?![\"])(.*?)(?<![\"])(\\\])|"  # [/...\] Trapezoid
            r"(\[\\)(?![\"])(.*?)(?<![\"])(/\])|"  # [\.../] Trapezoid Alt
-            r"(\()(?![\"])(.*?)(?<![\"])(\))|"  # (...) Round
+            r"(\()(?![\"])([^)]*?)(?<![\"])(\))|"  # (...) Round; label cannot contain ")"
            r"(\[)(?![\"])(.*?)(?<![\"])(\])|"  # [...] Square
            r"(\{)(?![\"])(.*?)(?<![\"])(\})|"  # {...} Rhombus
            r"(>)(?![\"])(.*?)(?<![\"])(\])"  # >...] Asymmetric
@@ -267,9 +267,10 @@ class ContentNormalizer:
            "：": ":",
            "？": "?",
            "！": "!",
-            '"': '"',
-            '"': '"',
-            """: "'", """: "'",
+            "“": '"',
+            "”": '"',
+            "‘": "'",
+            "’": "'",
        }

        parts = content.split("```")
@@ -410,9 +411,46 @@ class Filter:
    def __init__(self):
        self.valves = self.Valves()

+    def _get_chat_context(
+        self, body: dict, __metadata__: Optional[dict] = None
+    ) -> Dict[str, str]:
+        """
+        Unified extraction of chat context information (chat_id, message_id).
+        Prioritizes extraction from body, then metadata.
+        """
+        chat_id = ""
+        message_id = ""
+
+        # 1. Try to get from body
+        if isinstance(body, dict):
+            chat_id = body.get("chat_id", "")
+            message_id = body.get("id", "")  # message_id is usually 'id' in body
+
+            # Check body.metadata as fallback
+            if not chat_id or not message_id:
+                body_metadata = body.get("metadata", {})
+                if isinstance(body_metadata, dict):
+                    if not chat_id:
+                        chat_id = body_metadata.get("chat_id", "")
+                    if not message_id:
+                        message_id = body_metadata.get("message_id", "")
+
+        # 2. Try to get from __metadata__ (as supplement)
+        if __metadata__ and isinstance(__metadata__, dict):
+            if not chat_id:
+                chat_id = __metadata__.get("chat_id", "")
+            if not message_id:
+                message_id = __metadata__.get("message_id", "")
+
+        return {
+            "chat_id": str(chat_id).strip(),
+            "message_id": str(message_id).strip(),
+        }
+
    def _contains_html(self, content: str) -> bool:
        """Check if content contains HTML tags (to avoid breaking HTML output)"""
-        pattern = r"<\s*/?\s*(?:html|head|body|div|span|p|br|hr|ul|ol|li|table|thead|tbody|tfoot|tr|td|th|img|a|b|i|strong|em|code|pre|blockquote|h[1-6]|script|style|form|input|button|label|select|option|iframe|link|meta|title)\b"
+        # Removed common Mermaid-compatible tags like br, b, i, strong, em, span
+        pattern = r"<\s*/?\s*(?:html|head|body|div|p|hr|ul|ol|li|table|thead|tbody|tfoot|tr|td|th|img|a|code|pre|blockquote|h[1-6]|script|style|form|input|button|label|select|option|iframe|link|meta|title)\b"
        return bool(re.search(pattern, content, re.IGNORECASE))

    async def _emit_status(self, __event_emitter__, applied_fixes: List[str]):
@@ -438,24 +476,23 @@ class Filter:
            print(f"Error emitting status: {e}")

    async def _emit_debug_log(
-        self, __event_call__, applied_fixes: List[str], original: str, normalized: str
+        self,
+        __event_call__,
+        applied_fixes: List[str],
+        original: str,
+        normalized: str,
+        chat_id: str = "",
    ):
        """Emit debug log to browser console via JS execution"""
        if not self.valves.show_debug_log or not __event_call__:
            return

        try:
-            # Prepare data for JS
-            log_data = {
-                "fixes": applied_fixes,
-                "original": original,
-                "normalized": normalized,
-            }
-
            # Construct JS code
            js_code = f"""
                (async function() {{
                    console.group("🛠️ Markdown Normalizer Debug");
+                    console.log("Chat ID:", {json.dumps(chat_id)});
                    console.log("Applied Fixes:", {json.dumps(applied_fixes, ensure_ascii=False)});
                    console.log("Original Content:", {json.dumps(original, ensure_ascii=False)});
                    console.log("Normalized Content:", {json.dumps(normalized, ensure_ascii=False)});
@@ -521,11 +558,13 @@ class Filter:
                        await self._emit_status(
                            __event_emitter__, normalizer.applied_fixes
                        )
+                        chat_ctx = self._get_chat_context(body, __metadata__)
                        await self._emit_debug_log(
                            __event_call__,
                            normalizer.applied_fixes,
                            content,
                            new_content,
+                            chat_id=chat_ctx["chat_id"],
                        )

        return body
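The two regex changes in `mermaid_node` (dropping `\s*` after the node ID, and `[^)]*?` instead of `.*?` for round nodes) can be compared on a reduced pattern. The sketch below keeps only the round-node alternative, writes `[\"]` as a plain `"`, and uses a hypothetical helper `quote_round`; it illustrates the behavioral difference and is not the plugin's full pattern.

```python
import re

# Round-node alternative only, adapted from the diff; the plugin's full
# pattern has many more shapes plus a leading quoted-string alternative.
OLD_ROUND = re.compile(r'(\w+)\s*(\()(?!")(.*?)(?<!")(\))')
NEW_ROUND = re.compile(r'(\w+)(\()(?!")([^)]*?)(?<!")(\))')


def quote_round(pattern: re.Pattern, line: str) -> str:
    """Hypothetical helper: wrap the captured label in double quotes."""
    return pattern.sub(
        lambda m: f'{m.group(1)}{m.group(2)}"{m.group(3)}"{m.group(4)}', line
    )


# 1. Whitespace between ID and "(": text inside an edge label is no
#    longer mistaken for a node.
edge = "A -->|see (note)| B"
print(quote_round(OLD_ROUND, edge))
print(quote_round(NEW_ROUND, edge))

# 2. A label that cannot cross ")": the capture no longer runs across
#    the arrow into the next node when the first ")" is rejected.
stray = 'A(x") --> B(y)'
print(OLD_ROUND.search(stray).group(3))
print(NEW_ROUND.search(stray).group(3))
```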
--- a/plugins/filters/markdown_normalizer/markdown_normalizer_cn.py
+++ b/plugins/filters/markdown_normalizer/markdown_normalizer_cn.py
@@ -1,14 +1,14 @@
 """
 title: Markdown 格式修复器 (Markdown Normalizer)
 author: Fu-Jie
-author_url: https://github.com/Fu-Jie
-funding_url: https://github.com/Fu-Jie/awesome-openwebui
-version: 1.0.1
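-description: A production-grade content normalizer filter that fixes common Markdown formatting issues in LLM outputs, such as broken code blocks, LaTeX formulas, Mermaid diagrams, and list formatting.
+author_url: https://github.com/Fu-Jie/awesome-openwebui
+funding_url: https://github.com/open-webui
+version: 1.1.0
+description: A content normalizer filter that fixes common Markdown formatting issues in LLM outputs, such as broken code blocks, LaTeX formulas, Mermaid diagrams, and list formatting.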
 """

 from pydantic import BaseModel, Field
-from typing import Optional, List, Callable
+from typing import Optional, List, Callable, Dict
 import re
 import logging
 import asyncio
@@ -70,7 +70,7 @@ class ContentNormalizer:
        # Priority: Longer delimiters match first
        "mermaid_node": re.compile(
            r'("[^"\\]*(?:\\.[^"\\]*)*")|'  # Match quoted strings first (Group 1)
-            r"(\w+)\s*(?:"
+            r"(\w+)(?:"
            r"(\(\(\()(?![\"])(.*?)(?<![\"])(\)\)\))|"  # (((...))) Double Circle
            r"(\(\()(?![\"])(.*?)(?<![\"])(\)\))|"  # ((...)) Circle
            r"(\(\[)(?![\"])(.*?)(?<![\"])(\]\))|"  # ([...]) Stadium
@@ -81,7 +81,7 @@ class ContentNormalizer:
            r"(\[\\)(?![\"])(.*?)(?<![\"])(\\\])|"  # [\...\] Parallelogram Alt
            r"(\[/)(?![\"])(.*?)(?<![\"])(\\\])|"  # [/...\] Trapezoid
            r"(\[\\)(?![\"])(.*?)(?<![\"])(/\])|"  # [\.../] Trapezoid Alt
-            r"(\()(?![\"])(.*?)(?<![\"])(\))|"  # (...) Round
+            r"(\()(?![\"])([^)]*?)(?<![\"])(\))|"  # (...) Round; label cannot contain ")"
            r"(\[)(?![\"])(.*?)(?<![\"])(\])|"  # [...] Square
            r"(\{)(?![\"])(.*?)(?<![\"])(\})|"  # {...} Rhombus
            r"(>)(?![\"])(.*?)(?<![\"])(\])"  # >...] Asymmetric
@@ -262,9 +262,10 @@ class ContentNormalizer:
            "：": ":",
            "？": "?",
            "！": "!",
-            '"': '"',
-            '"': '"',
-            """: "'", """: "'",
+            "“": '"',
+            "”": '"',
+            "‘": "'",
+            "’": "'",
        }

        parts = content.split("```")
@@ -410,9 +411,46 @@ class Filter:
    def __init__(self):
        self.valves = self.Valves()

+    def _get_chat_context(
+        self, body: dict, __metadata__: Optional[dict] = None
+    ) -> Dict[str, str]:
+        """
+        Unified extraction of chat context information (chat_id, message_id).
+        Prioritizes extraction from body, then metadata.
+        """
+        chat_id = ""
+        message_id = ""
+
+        # 1. Try to get from body
+        if isinstance(body, dict):
+            chat_id = body.get("chat_id", "")
+            message_id = body.get("id", "")  # message_id is usually 'id' in body
+
+            # Check body.metadata as fallback
+            if not chat_id or not message_id:
+                body_metadata = body.get("metadata", {})
+                if isinstance(body_metadata, dict):
+                    if not chat_id:
+                        chat_id = body_metadata.get("chat_id", "")
+                    if not message_id:
+                        message_id = body_metadata.get("message_id", "")
+
+        # 2. Try to get from __metadata__ (as supplement)
+        if __metadata__ and isinstance(__metadata__, dict):
+            if not chat_id:
+                chat_id = __metadata__.get("chat_id", "")
+            if not message_id:
+                message_id = __metadata__.get("message_id", "")
+
+        return {
+            "chat_id": str(chat_id).strip(),
+            "message_id": str(message_id).strip(),
+        }
+
    def _contains_html(self, content: str) -> bool:
        """Check if content contains HTML tags (to avoid breaking HTML output)"""
-        pattern = r"<\s*/?\s*(?:html|head|body|div|span|p|br|hr|ul|ol|li|table|thead|tbody|tfoot|tr|td|th|img|a|b|i|strong|em|code|pre|blockquote|h[1-6]|script|style|form|input|button|label|select|option|iframe|link|meta|title)\b"
+        # Removed common Mermaid-compatible tags like br, b, i, strong, em, span
+        pattern = r"<\s*/?\s*(?:html|head|body|div|p|hr|ul|ol|li|table|thead|tbody|tfoot|tr|td|th|img|a|code|pre|blockquote|h[1-6]|script|style|form|input|button|label|select|option|iframe|link|meta|title)\b"
        return bool(re.search(pattern, content, re.IGNORECASE))

    async def _emit_status(self, __event_emitter__, applied_fixes: List[str]):
@@ -463,24 +501,23 @@ class Filter:
        """Emit debug log to browser console via JS execution"""

    async def _emit_debug_log(
-        self, __event_call__, applied_fixes: List[str], original: str, normalized: str
+        self,
+        __event_call__,
+        applied_fixes: List[str],
+        original: str,
+        normalized: str,
+        chat_id: str = "",
    ):
        """Emit debug log to browser console via JS execution"""
        if not self.valves.show_debug_log or not __event_call__:
            return

        try:
-            # Prepare data for JS
-            log_data = {
-                "fixes": applied_fixes,
-                "original": original,
-                "normalized": normalized,
-            }
-
            # Construct JS code
            js_code = f"""
                (async function() {{
                    console.group("🛠️ Markdown Normalizer Debug");
+                    console.log("Chat ID:", {json.dumps(chat_id)});
                    console.log("Applied Fixes:", {json.dumps(applied_fixes, ensure_ascii=False)});
                    console.log("Original Content:", {json.dumps(original, ensure_ascii=False)});
                    console.log("Normalized Content:", {json.dumps(normalized, ensure_ascii=False)});
@@ -546,11 +583,13 @@ class Filter:
                        await self._emit_status(
                            __event_emitter__, normalizer.applied_fixes
                        )
+                        chat_ctx = self._get_chat_context(body, __metadata__)
                        await self._emit_debug_log(
                            __event_call__,
                            normalizer.applied_fixes,
                            content,
                            new_content,
+                            chat_id=chat_ctx["chat_id"],
                        )

        return body
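The lookup order used by `_get_chat_context` (body first, then `body["metadata"]`, then `__metadata__`) can be exercised outside the filter. The standalone function below is a sketch rather than the plugin's method: `get_chat_context` is a hypothetical name, and unlike the diff it coerces `None` values to empty strings, which is an added assumption.

```python
from typing import Dict, Optional


def get_chat_context(body: dict, metadata: Optional[dict] = None) -> Dict[str, str]:
    """Resolve chat_id and message_id; the first non-empty source wins."""
    # Each entry: (source dict, key holding the chat id, key holding the message id)
    sources = []
    if isinstance(body, dict):
        sources.append((body, "chat_id", "id"))
        if isinstance(body.get("metadata"), dict):
            sources.append((body["metadata"], "chat_id", "message_id"))
    if isinstance(metadata, dict):
        sources.append((metadata, "chat_id", "message_id"))

    chat_id = message_id = ""
    for source, chat_key, message_key in sources:
        # Assumption beyond the diff: None is treated like a missing key.
        chat_id = chat_id or source.get(chat_key) or ""
        message_id = message_id or source.get(message_key) or ""

    return {"chat_id": str(chat_id).strip(), "message_id": str(message_id).strip()}


print(get_chat_context({"chat_id": "c1", "id": "m1"}, {"chat_id": "c2"}))
print(get_chat_context({"metadata": {"chat_id": " c3 "}}, {"message_id": "m4"}))
```

Returning empty strings rather than raising keeps the debug log call safe when a request carries no chat identifiers at all.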