Fix Mermaid syntax normalization: preserve quoted strings and prevent false positives

2026-01-10 16:07:19 +08:00
parent aabb24c9cd
commit f78e703a99
3 changed files with 183 additions and 11 deletions
--- a/plugins/filters/markdown_normalizer/FEATURES_CN.md
+++ b/plugins/filters/markdown_normalizer/FEATURES_CN.md
@@ -0,0 +1,162 @@
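+# Markdown Normalizer Feature Guide
+
+This plugin fixes Markdown formatting problems that commonly appear in LLM output so that the content renders correctly in Open WebUI. The supported fixes are listed below with examples.
+
+## 1. Code Block Fixes
+
+### 1.1 Remove Code Block Indentation
+LLMs sometimes indent a code block with leading spaces, which breaks rendering. The plugin removes that indentation automatically.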
+
+**Before:**
+   ```python
+   print("hello")
+   ```
+
+**After:**
+```python
+print("hello")
+```
+
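+### 1.2 Add Line Breaks Around Code Fences
+The code fence marker ` ``` ` must sit on its own line. If the LLM places it on the same line as other text, the plugin fixes it automatically.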
+
+**Before:**
+Here is code:```python
+print("hello")```
+
+**After:**
+Here is code:
+```python
+print("hello")
+```
+
+### 1.3 Add the Line Break After the Language Identifier
+LLMs sometimes forget the line break after the language identifier (such as `python`).
+
+**Before:**
+```python print("hello")
+```
+
+**After:**
+```python
+print("hello")
+```
+
+### 1.4 Auto-close Code Blocks
+If the output is truncated or the LLM forgets to close a code block, the plugin appends the closing ` ``` `.
+
+**Before:**
+```python
+print("unfinished code...")
+
+**After:**
+```python
+print("unfinished code...")
+```
+
+## 2. LaTeX Normalization
+
+Open WebUI renders formulas with MathJax/KaTeX, which usually expects `$$` or `$` delimiters. The plugin converts the common LaTeX bracket syntax to that standard form.
+
+**Before:**
+Block formula: \[ E = mc^2 \]
+Inline formula: \( a^2 + b^2 = c^2 \)
+
+**After:**
+Block formula: $$ E = mc^2 $$
+Inline formula: $ a^2 + b^2 = c^2 $
+
+## 3. Escape Character Fix
+
+Fixes over-escaped characters, which are common in raw strings returned by some APIs.
+
+**Before:**
+Line 1\\nLine 2\\tTabbed
+
+**After:**
+Line 1
+Line 2	Tabbed
+
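+## 4. Thought Tag Fix
+**Function**:
+1.  Ensures there are enough blank lines after the `</thought>` tag so that the chain-of-thought content does not run into the response body.
+2.  **Tag standardization**: Converts `<think>` (commonly used by DeepSeek and similar models) or `<thinking>` to the `<thought>` tag that Open WebUI treats as standard, so that the UI's collapse feature is triggered correctly.
+
+**Default**: Enabled (`enable_thought_tag_fix = True`)
+
+**Example**: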
+*   **Before**: `<think>Thinking...</think>Response starts here.`
+*   **After**: 
+    ```xml
+    <thought>Thinking...</thought>
+
+    Response starts here.
+    ```
+
+## 5. List Formatting Fix
+
+*Disabled by default; enable it in the settings*
+
+Fixes list items that are missing a preceding line break.
+
+**Before:**
+Header1. Item 1
+
+**After:**
+Header
+1. Item 1
+
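+## 6. Full-width Symbol Fix
+
+*Disabled by default; enable it in the settings*
+
+Converts full-width symbols to half-width symbols **inside code blocks only**, so that code does not fail to run because of the symbols.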
+
+**Before:**
+```python
+if x == 1：
+    print（"hello"）
+```
+
+**After:**
+```python
+if x == 1:
+    print("hello")
+```
+
+## 7. Mermaid Syntax Fix
+**Function**: Fixes common syntax errors in Mermaid diagrams, in particular unquoted labels that contain special characters.
+**Default**: Enabled (`enable_mermaid_fix = True`)
+**Example**:
+*   **Before**:
+    ```mermaid
+    graph TD
+    A[Label with (parens)] --> B(Label with [brackets])
+    ```
+*   **After**:
+    ```mermaid
+    graph TD
+    A["Label with (parens)"] --> B("Label with [brackets]")
+    ```
+
+## 8. XML Cleanup
+
+Removes leftover XML tags that serve no purpose in LLM output (such as Claude's artifact tags).
+
+**Before:**
+Here is the result <antArtifact>hidden metadata</antArtifact>.
+
+**After:**
+## 9. Heading Format Fix
+**Function**: Fixes a missing space after the heading marker `#`.
+**Default**: Enabled (`enable_heading_fix = True`)
+**Example**:
+*   **Before**: `#Heading 1`
+*   **After**: `# Heading 1`
+
+## 10. Table Format Fix
+**Function**: Fixes table rows that are missing the trailing pipe `|`.
+**Default**: Enabled (`enable_table_fix = True`)
+**Example**:
+*   **Before**: `| Col 1 | Col 2`
+*   **After**: `| Col 1 | Col 2 |`
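Note on sections 9 and 10 of the guide above: the two fixes can be illustrated with a few lines of Python. This is only a minimal sketch of the documented Before/After behavior, not the plugin's actual implementation. The function names `fix_headings` and `fix_table_rows` are hypothetical, and the sketch deliberately ignores code blocks, where a leading `#` is often a comment rather than a heading.

```python
import re


def fix_headings(text):
    # Insert the missing space between a run of 1-6 leading '#' characters
    # and the heading text. Lines that already have the space are untouched.
    return re.sub(r"^(#{1,6})(?=[^\s#])", r"\1 ", text, flags=re.MULTILINE)


def fix_table_rows(text):
    # Append a trailing pipe to rows that start with '|' but do not end with one.
    fixed = []
    for line in text.split("\n"):
        stripped = line.rstrip()
        if stripped.startswith("|") and not stripped.endswith("|"):
            line = stripped + " |"
        fixed.append(line)
    return "\n".join(fixed)


print(fix_headings("#Heading 1"))
print(fix_table_rows("| Col 1 | Col 2"))
```

A real normalizer would need to skip fenced code blocks and lines such as `#!/bin/bash`, which this sketch would rewrite.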
--- a/plugins/filters/markdown_normalizer/markdown_normalizer.py
+++ b/plugins/filters/markdown_normalizer/markdown_normalizer.py
@@ -74,6 +74,7 @@ class ContentNormalizer:
        # Fix "reverse optimization": Must precisely match shape delimiters to avoid breaking structure
        # Priority: Longer delimiters match first
        "mermaid_node": re.compile(
+            r'("[^"\\]*(?:\\.[^"\\]*)*")|'  # Match quoted strings first (Group 1)
            r"(\w+)\s*(?:"
            r"(\(\(\()(?![\"])(.*?)(?<![\"])(\)\)\))|"  # (((...))) Double Circle
            r"(\(\()(?![\"])(.*?)(?<![\"])(\)\))|"  # ((...)) Circle
@@ -281,14 +282,18 @@ class ContentNormalizer:
        """Fix common Mermaid syntax errors while preserving node shapes"""

        def replacer(match):
-            # Group 1 is ID
-            id_str = match.group(1)
+            # Group 1 is Quoted String (if matched)
+            if match.group(1):
+                return match.group(1)
+
+            # Group 2 is ID
+            id_str = match.group(2)

            # Find matching shape group
-            # Groups start at index 2, each shape has 3 groups (Open, Content, Close)
-            # We iterate to find the non-None one
+            # Groups start at index 3 (in match.group terms) or index 2 (in match.groups() tuple)
+            # Tuple: (String, ID, Open1, Content1, Close1, ...)
            groups = match.groups()
-            for i in range(1, len(groups), 3):
+            for i in range(2, len(groups), 3):
                if groups[i] is not None:
                    open_char = groups[i]
                    content = groups[i + 1]
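Note on the change above: putting the quoted-string alternative first means the regex consumes an already-quoted label as a whole, so shape-like text inside it is never seen by the shape alternatives. The sketch below illustrates this with a deliberately reduced pattern that handles only the square-bracket shape. `OLD`, `NEW`, `normalize_old`, `normalize_new`, and `quote_label` are hypothetical names, and the sample input is an assumed example of the kind of false positive the commit title refers to, not one taken from the project.

```python
import re

# Reduced stand-ins for the plugin's "mermaid_node" pattern (one shape only).
SHAPE = r'(\w+)\s*(\[)(?!")(.*?)(?<!")(\])'
QUOTED = r'("[^"\\]*(?:\\.[^"\\]*)*")'

OLD = re.compile(SHAPE)                 # before: shapes only
NEW = re.compile(QUOTED + "|" + SHAPE)  # after: quoted strings match first


def quote_label(node_id, open_char, content, close_char):
    # Wrap the label in double quotes while keeping the shape delimiters.
    return f'{node_id}{open_char}"{content}"{close_char}'


def normalize_old(text):
    return OLD.sub(lambda m: quote_label(*m.group(1, 2, 3, 4)), text)


def normalize_new(text):
    def replacer(m):
        if m.group(1):  # already-quoted string: return it unchanged
            return m.group(1)
        return quote_label(*m.group(2, 3, 4, 5))

    return NEW.sub(replacer, text)


line = 'A["see B[0] here"] --> C[Label with (parens)]'
print(normalize_old(line))  # rewrites B[0] inside the quoted label
print(normalize_new(line))  # leaves the quoted label alone, quotes only C's label
```

In this reduced setting the old pattern corrupts the quoted label of `A`, while the new one touches only the unquoted label of `C`.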
--- a/plugins/filters/markdown_normalizer/markdown_normalizer_cn.py
+++ b/plugins/filters/markdown_normalizer/markdown_normalizer_cn.py
@@ -69,6 +69,7 @@ class ContentNormalizer:
        # Fix "reverse optimization": Must precisely match shape delimiters to avoid breaking structure
        # Priority: Longer delimiters match first
        "mermaid_node": re.compile(
+            r'("[^"\\]*(?:\\.[^"\\]*)*")|'  # Match quoted strings first (Group 1)
            r"(\w+)\s*(?:"
            r"(\(\(\()(?![\"])(.*?)(?<![\"])(\)\)\))|"  # (((...))) Double Circle
            r"(\(\()(?![\"])(.*?)(?<![\"])(\)\))|"  # ((...)) Circle
@@ -276,14 +277,18 @@ class ContentNormalizer:
        """Fix common Mermaid syntax errors while preserving node shapes"""

        def replacer(match):
-            # Group 1 is ID
-            id_str = match.group(1)
+            # Group 1 is Quoted String (if matched)
+            if match.group(1):
+                return match.group(1)

-            # Find matching shape group
-            # Groups start at index 2, each shape has 3 groups (Open, Content, Close)
-            # We iterate to find the non-None one
+            # Group 2 is ID
+            id_str = match.group(2)
+
+            # Find matching shape group
+            # Groups start at index 3 (in match.group terms) or index 2 (in match.groups() tuple)
+            # Tuple: (String, ID, Open1, Content1, Close1, ...)
            groups = match.groups()
-            for i in range(1, len(groups), 3):
+            for i in range(2, len(groups), 3):
                if groups[i] is not None:
                    open_char = groups[i]
                    content = groups[i + 1]
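Note on the index shift in both files: `match.group(n)` is 1-based, while `match.groups()` returns a 0-based tuple, so adding a new first capture group moves the ID from `groups()[0]` to `groups()[1]` and the first shape triple from index 1 to index 2. The sketch below shows this on a reduced pattern with two shapes. `PATTERN` and `find_shape` are hypothetical names used only for illustration.

```python
import re

PATTERN = re.compile(
    r'("[^"]*")|'          # group 1     -> groups()[0]   quoted string
    r'(\w+)(?:'            # group 2     -> groups()[1]   node ID
    r'(\(\()(.*?)(\)\))|'  # groups 3-5  -> groups()[2:5] ((...)) circle
    r'(\[)(.*?)(\])'       # groups 6-8  -> groups()[5:8] [...] rectangle
    r')'
)


def find_shape(match):
    # Walk the (open, content, close) triples, which start at tuple index 2.
    groups = match.groups()
    for i in range(2, len(groups), 3):
        if groups[i] is not None:
            return groups[i], groups[i + 1], groups[i + 2]
    return None


m = PATTERN.search("A[hello]")
print(m.group(2), m.groups()[1])  # the same ID through both indexing schemes
print(find_shape(m))
print(find_shape(PATTERN.search("B((round))")))
```

With the old start index of 1, the loop would inspect `groups()[1]`, which now holds the node ID, and mistake it for an opening delimiter.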