Fix Mermaid syntax normalization: preserve quoted strings and prevent false positives

2026-01-10 16:07:19 +08:00
parent aabb24c9cd
commit f78e703a99
3 changed files with 183 additions and 11 deletions
--- a/plugins/filters/markdown_normalizer/FEATURES_CN.md
+++ b/plugins/filters/markdown_normalizer/FEATURES_CN.md
@@ -0,0 +1,162 @@
 # Markdown Normalizer 功能详解
 本插件旨在修复 LLM 输出中常见的 Markdown 格式问题，确保在 Open WebUI 中完美渲染。以下是支持的修复功能列表及示例。
 ## 1. 代码块修复 (Code Block Fixes)
 ### 1.1 去除代码块缩进
 LLM 有时会在代码块前添加空格缩进，导致渲染失效。本插件会自动移除这些缩进。
 **Before:**
   ```python
   print("hello")
   ```
 **After:**
 ```python
 print("hello")
 ```
 ### 1.2 补全代码块前后换行
 代码块标记 ` ``` ` 必须独占一行。如果 LLM 将其与文本混在一行，插件会自动修复。
 **Before:**
 Here is code:```python
 print("hello")```
 **After:**
 Here is code:
 ```python
 print("hello")
 ```
 ### 1.3 修复语言标识符后的换行
 有时 LLM 会忘记在语言标识符（如 `python`）后换行。
 **Before:**
 ```python print("hello")
 ```
 **After:**
 ```python
 print("hello")
 ```
 ### 1.4 自动闭合代码块
 如果输出被截断或 LLM 忘记闭合代码块，插件会自动添加结尾的 ` ``` `。
 **Before:**
 ```python
 print("unfinished code...")
 **After:**
 ```python
 print("unfinished code...")
 ```
 ## 2. LaTeX 公式规范化 (LaTeX Normalization)
 Open WebUI 使用 MathJax/KaTeX 渲染公式，通常需要 `$$` 或 `$` 包裹。本插件会将常见的 LaTeX 括号语法转换为标准格式。
 **Before:**
 块级公式：\[ E = mc^2 \]
 行内公式：\( a^2 + b^2 = c^2 \)
 **After:**
 块级公式：$$ E = mc^2 $$
 行内公式：$ a^2 + b^2 = c^2 $
 ## 3. 转义字符清理 (Escape Character Fix)
 修复过度转义的字符，这常见于某些 API 返回的原始字符串中。
 **Before:**
 Line 1\\nLine 2\\tTabbed
 **After:**
 Line 1
 Line 2	Tabbed
 ## 4. 思维链标签规范化 (Thought Tag Fix)
 **功能**: 
 1.  确保 `</thought>` 标签后有足够的空行，防止思维链内容与正文粘连。
 2.  **标准化标签**: 将 `<think>` (DeepSeek 等模型常用) 或 `<thinking>` 统一转换为 Open WebUI 标准的 `<thought>` 标签，以便正确触发 UI 的折叠功能。
 **默认**: 开启 (`enable_thought_tag_fix = True`)
 **示例**:
 *   **Before**: `<think>Thinking...</think>Response starts here.`
 *   **After**: 
    ```xml
    <thought>Thinking...</thought>
    Response starts here.
    ```
 ## 5. 列表格式修复 (List Formatting Fix)
 *默认关闭，需在设置中开启*
 修复列表项缺少换行的问题。
 **Before:**
 Header1. Item 1
 **After:**
 Header
 1. Item 1
 ## 6. 全角符号转半角 (Full-width Symbol Fix)
 *默认关闭，需在设置中开启*
 仅在**代码块内部**将全角符号转换为半角符号，防止代码因符号问题无法运行。
 **Before:**
 ```python
 if x == 1：
    print（"hello"）
 ```
 **After:**
 ```python
 if x == 1:
    print("hello")
 ```
 ## 7. Mermaid 语法修复 (Mermaid Syntax Fix)
 **功能**: 修复 Mermaid 图表中常见的语法错误，特别是未加引号的标签包含特殊字符的情况。
 **默认**: 开启 (`enable_mermaid_fix = True`)
 **示例**:
 *   **Before**:
    ```mermaid
    graph TD
    A[Label with (parens)] --> B(Label with [brackets])
    ```
 *   **After**:
    ```mermaid
    graph TD
    A["Label with (parens)"] --> B("Label with [brackets]")
    ```
 ## 8. XML 标签清理 (XML Cleanup)
 移除 LLM 输出中残留的无用 XML 标签（如 Claude 的 artifact 标签）。
 **Before:**
 Here is the result <antArtifact>hidden metadata</antArtifact>.
 **After:**
 ## 9. 标题格式修复 (Heading Format Fix)
 **功能**: 修复标题标记 `#` 后缺少空格的问题。
 **默认**: 开启 (`enable_heading_fix = True`)
 **示例**:
 *   **Before**: `#Heading 1`
 *   **After**: `# Heading 1`
 ## 10. 表格格式修复 (Table Format Fix)
 **功能**: 修复表格行末尾缺少管道符 `|` 的问题。
 **默认**: 开启 (`enable_table_fix = True`)
 **示例**:
 *   **Before**: `| Col 1 | Col 2`
 *   **After**: `| Col 1 | Col 2 |`
--- a/plugins/filters/markdown_normalizer/markdown_normalizer.py
+++ b/plugins/filters/markdown_normalizer/markdown_normalizer.py
@@ -74,6 +74,7 @@ class ContentNormalizer:
        # Fix "reverse optimization": Must precisely match shape delimiters to avoid breaking structure
        # Priority: Longer delimiters match first
        "mermaid_node": re.compile(
            r'("[^"\\]*(?:\\.[^"\\]*)*")|'  # Match quoted strings first (Group 1)
            r"(\w+)\s*(?:"
            r"(\(\(\()(?![\"])(.*?)(?<![\"])(\)\)\))|"  # (((...))) Double Circle
            r"(\(\()(?![\"])(.*?)(?<![\"])(\)\))|"  # ((...)) Circle
@@ -281,14 +282,18 @@ class ContentNormalizer:
        """Fix common Mermaid syntax errors while preserving node shapes"""
        def replacer(match):
-            # Group 1 is ID
+            # Group 1 is Quoted String (if matched)
-            id_str = match.group(1)
+            if match.group(1):
                return match.group(1)
            # Group 2 is ID
            id_str = match.group(2)
            # Find matching shape group
-            # Groups start at index 2, each shape has 3 groups (Open, Content, Close)
+            # Groups start at index 3 (in match.group terms) or index 2 (in match.groups() tuple)
-            # We iterate to find the non-None one
+            # Tuple: (String, ID, Open1, Content1, Close1, ...)
            groups = match.groups()
-            for i in range(1, len(groups), 3):
+            for i in range(2, len(groups), 3):
                if groups[i] is not None:
                    open_char = groups[i]
                    content = groups[i + 1]
--- a/plugins/filters/markdown_normalizer/markdown_normalizer_cn.py
+++ b/plugins/filters/markdown_normalizer/markdown_normalizer_cn.py
@@ -69,6 +69,7 @@ class ContentNormalizer:
        # 修复"反向优化"问题：必须精确匹配各种形状的定界符，避免破坏形状结构
        # 优先级：长定界符优先匹配
        "mermaid_node": re.compile(
            r'("[^"\\]*(?:\\.[^"\\]*)*")|'  # Match quoted strings first (Group 1)
            r"(\w+)\s*(?:"
            r"(\(\(\()(?![\"])(.*?)(?<![\"])(\)\)\))|"  # (((...))) Double Circle
            r"(\(\()(?![\"])(.*?)(?<![\"])(\)\))|"  # ((...)) Circle
@@ -276,14 +277,18 @@ class ContentNormalizer:
        """修复常见的 Mermaid 语法错误，同时保留节点形状"""
        def replacer(match):
-            # Group 1 是 ID
+            # Group 1 is Quoted String (if matched)
-            id_str = match.group(1)
+            if match.group(1):
                return match.group(1)
-            # 查找匹配的形状组
+            # Group 2 is ID
-            # 组从索引 2 开始，每个形状有 3 个组 (Open, Content, Close)
+            id_str = match.group(2)
-            # 我们遍历找到非 None 的那一组
+
            # Find matching shape group
            # Groups start at index 3 (in match.group terms) or index 2 (in match.groups() tuple)
            # Tuple: (String, ID, Open1, Content1, Close1, ...)
            groups = match.groups()
-            for i in range(1, len(groups), 3):
+            for i in range(2, len(groups), 3):
                if groups[i] is not None:
                    open_char = groups[i]
                    content = groups[i + 1]