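Dangerous Regular Expression Patterns
A collection of regular expressions that demonstrate security vulnerabilities, performance problems, and common anti-patterns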
📝 Invalid character class escapes regex
🟢 simple
⭐⭐
Character classes containing invalid ranges or unescaped special characters
⏱️ 5 min
🏷️ syntax, correctness, low
# Invalid Character Class Samples
# Character classes with syntax errors or ambiguous patterns
# Risk Level: LOW - Syntax errors or unexpected behavior
# --- Invalid Ranges ---
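# Pattern: [a-Z]
# Problem: 'a' (0x61) comes after 'Z' (0x5A) in ASCII, so the range is reversed
# Result: Most engines reject it with an "invalid range" error
[a-Z]
# Fix: [a-zA-Z]
# Pattern: [Z-a]
# Problem: Valid range, but it spans Z, [, \, ], ^, _, ` and a, which is rarely intended
[Z-a]
# Fix: [Za] for just the two letters, or [a-zA-Z] for all letters
# Pattern: [0-z]
# Problem: Overly broad range; besides digits and letters it includes : ; < = > ? @ [ \ ] ^ _ `
[0-z]
# Fix: [0-9a-zA-Z]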
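The range behavior above can be checked directly. The following is a minimal sketch using Python's `re` module; other engines may report reversed ranges differently, and the helper name `try_compile` is only illustrative.

```python
import re

def try_compile(pattern):
    """Return the compiled pattern, or None if the engine rejects it."""
    try:
        return re.compile(pattern)
    except re.error:
        return None

# 'a' (0x61) sorts after 'Z' (0x5A), so Python refuses the range outright.
reversed_range = try_compile(r"[a-Z]")

# [Z-a] compiles, but it also covers the six characters between Z and a.
z_to_a = re.compile(r"[Z-a]")
extras = [ch for ch in "[\\]^_`" if z_to_a.fullmatch(ch)]

# [0-z] silently accepts punctuation that sits between the digit and letter blocks.
zero_to_z = re.compile(r"[0-z]")
accepted_punctuation = [ch for ch in ":;<=>?@" if zero_to_z.fullmatch(ch)]

print(reversed_range, extras, accepted_punctuation)
```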
# --- Unescaped Special Characters ---
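# Pattern: [^]]
# Problem: An unescaped ] right after [^ is parsed differently by different engines
# POSIX, PCRE, Python: negated class containing ]; JavaScript: [^] (any character) followed by a literal ]
# Problematic in: [^]]+
# Fix: [^\]]+ (escaping the bracket reads the same in every engine)
# Pattern: [--]
# Problem: Dash placement confusion; looks like a range but matches only a literal hyphen
[--]
# Fix: [-] or \-
# Pattern: [^^]
# Problem: Caret placement confusion; the first ^ negates the class, the second is literal
# Means: Any character except ^
[^^]
# Fix: [^\^] to make the negation explicit, or \^ if a literal caret was intended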
# --- Ambiguous Escapes ---
# Pattern: [\w]
# Problem: Redundant character class around a shorthand
# \w works inside a class in Perl-style engines, but POSIX bracket expressions read it as a literal backslash and w
[\w]
# Fix: \w outside class or [a-zA-Z0-9_]
# Pattern: [\b]
# Problem: In character class, \b is backspace, not word boundary
[\b]
# Clarify: Inside class = backspace, Outside = word boundary
# Pattern: [\d]
# Problem: Redundant class, and POSIX bracket expressions do not recognize \d
[\d]
# Fix: [0-9] or \d outside class
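The difference between `\b` inside and outside a class is easy to verify. This sketch assumes Python's `re` module, where shorthand classes such as `\d` are accepted inside brackets; the variable names are illustrative.

```python
import re

# Inside a character class, \b is the backspace character (0x08).
matches_backspace = re.fullmatch(r"[\b]", "\x08") is not None

# Outside a class, \b is a zero-width word boundary.
boundaries = [m.start() for m in re.finditer(r"\b", "hi there")]

# In Python, wrapping a shorthand in a class changes nothing.
same_digits = re.findall(r"[\d]+", "a12b3") == re.findall(r"\d+", "a12b3")

print(matches_backspace, boundaries, same_digits)
```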
📝 Redundant escapes regex
🟢 simple
⭐
Unnecessary escape sequences that reduce readability
⏱️ 4 min
🏷️ style, readability, low
# Redundant Escape Samples
# Unnecessary escape sequences that clutter patterns
# Risk Level: LOW - Style and readability issues
# --- Unnecessary Character Escapes ---
# Pattern: \-
# Problem: Hyphen doesn't need escaping outside character class
\-
# Fix: - (just use hyphen)
# Pattern: \:
# Problem: Colon is not a special character
\:
# Fix: : (just use colon)
# Pattern: \.
# Note: Not redundant; here the escape changes the meaning
# If you want literal: \.
# If you want any char: .
\. # literal period
. # any character
# Pattern: \\
# Note: A literal backslash must itself be escaped
# Often confused with: \ (a lone backslash, which only starts an escape sequence)
\\ # literal backslash
# --- Unnecessary Character Class Escapes ---
# Pattern: [a-z\-]
# Problem: Escape of hyphen not needed when at edges
[a-z\-] # unnecessary
[a-z-] # better (hyphen at end)
[-a-z] # better (hyphen at start)
# Pattern: [\^]
# Problem: Caret only needs an escape when it is first in the class
[^\^] # escape unnecessary: the second caret is not first, same as [^^]
[\^] # escape necessary: an unescaped leading ^ would negate the class
# Pattern: [\]]
# Problem: Escape only needed in some positions
[a-z\]] # necessary here
[\]a-z] # optional in POSIX/PCRE/Python (a leading ] is literal), required in JavaScript
# --- Letter Escapes ---
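# Pattern: \c\a\t
# Problem: Escaping letters is never needed and usually changes the meaning
# Many letters are special: b, d, s, w, and also a (bell), t (tab), c (control-character prefix)
\c\a\t # wrong: does not match "cat"
cat # just letters
# Pattern: [\Q\E]
# Problem: \Q and \E quoting is not portable; engines without it treat them as plain letters or as errors
[\Q\E] # JavaScript without the u flag: matches Q or E; Python: compile error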
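Escaped letters are worth testing rather than assuming. The sketch below shows what Python's `re` module does with the samples above; the helper `compiles` is a hypothetical name, and other engines will differ.

```python
import re

def compiles(pattern):
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# \a is the bell character and \t is a tab, so \a\t does not match "at".
matches_bell_tab = re.fullmatch(r"\a\t", "\x07\t") is not None
matches_letters = re.fullmatch(r"\a\t", "at") is not None

# Python rejects escapes of ASCII letters that have no defined meaning.
c_escape_ok = compiles(r"\c")
quote_escape_ok = compiles(r"[\Q\E]")

print(matches_bell_tab, matches_letters, c_escape_ok, quote_escape_ok)
```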
# --- Numeric Escapes ---
# Pattern: \1 vs \01
# Problem: Ambiguous - backreference or octal?
# In modern regex: Usually backreference
# In some contexts: Octal
# Clarify: Use \k<name> for named backreferences to avoid ambiguity
📝 Excessive capturing groups regex
🟢 simple
⭐⭐
Unnecessary capturing groups that affect performance
⏱️ 5 min
🏷️ performance, style, low
# Excessive Capturing Group Samples
# Unnecessary capturing groups that hurt performance and readability
# Risk Level: LOW - Performance and maintainability issues
# --- Unneeded Captures ---
# Pattern: (\d+)\s+(\w+)\s+(\d+)
# Problem: Capturing when you only need to match
# If you don't need the groups, use non-capturing
(\d+)\s+(\w+)\s+(\d+)
# Fix: \d+\s+\w+\s+\d+ (no groups)
# Or: (?:\d+)\s+(?:\w+)\s+(?:\d+) (non-capturing)
# Pattern: (https?)://([^\s]+)
# Problem: Capturing protocol when you just want validation
(https?)://([^\s]+)
# Fix: (?:https?)://[^\s]+ or https?://[^\s]+
# --- Nested Captures ---
# Pattern: ((\d+)\s+(\w+))
# Problem: Nested capturing groups
# Creates: Group 1: entire match, Group 2: digits, Group 3: word
((\d+)\s+(\w+))
# Fix: Use non-capturing where possible:
# (?:(\d+)\s+(\w+)) or just flatten to (\d+)\s+(\w+)
# Pattern: (a(b(c)d)e)
# Problem: Deeply nested captures
# Creates: 3 capturing groups (abcde, bcd, c)
(a(b(c)d)e)
# Fix: (?:a(?:b(?:c)d)e), or simply abcde since no grouping is needed
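Group counts can be read off the compiled pattern instead of counted by hand. A small sketch with Python's `re` module, where `Pattern.groups` reports the number of capturing groups:

```python
import re

nested = re.compile(r"(a(b(c)d)e)")
flat = re.compile(r"(?:a(?:b(?:c)d)e)")

# Each opening parenthesis of a capturing group adds one numbered group.
nested_count = nested.groups
nested_values = nested.fullmatch("abcde").groups()

# Non-capturing groups match the same text but record nothing.
flat_count = flat.groups
flat_matches = flat.fullmatch("abcde") is not None

print(nested_count, nested_values, flat_count, flat_matches)
```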
# --- Performance Impact ---
# Pattern with captures: roughly 30-50% slower than non-capturing in some engines; the gap varies with engine and input, so benchmark your own case
# Benchmark: Matching 1000 strings
# Slow: (\d{3})-(\d{3})-(\d{4})
# Fast: \d{3}-\d{3}-\d{4}
# Slow: (\w+)@(\w+)\.(\w+)
# Fast: \w+@\w+\.\w+
# --- When to Use Captures ---
# Use capturing groups when:
# - You need to extract specific parts
# - You need backreferences
# Example: (\w+)\s+\1 # repeated word
# Use non-capturing (?:...) when:
# - Grouping for quantifiers: (?:abc){3}
# - Grouping for alternation: (?:a|b|c)
# - Grouping for precedence: ^(?:abc|def)
# --- Named Groups for Clarity ---
# Instead of: (\d{4})-(\d{2})-(\d{2})
# Use: (?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})
# Improves: Readability and maintenance
# Performance: Similar to capturing groups, but clearer
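Named-group syntax varies slightly between engines. As a sketch, Python's `re` module spells it `(?P<name>...)` rather than `(?<name>...)`; the constant name `DATE` is illustrative.

```python
import re

# Python spelling of the named-group date pattern.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

parts = DATE.fullmatch("2024-03-15").groupdict()

# Mixing capturing and non-capturing groups keeps only what is needed.
month_only = re.fullmatch(r"(?:\d{4})-(\d{2})-(?:\d{2})", "2024-03-15").groups()

print(parts, month_only)
```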
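📝 Unanchored validation patterns regex
🟡 intermediate
⭐⭐⭐
Regular expression patterns that lack anchors and can match anywhere in the input, causing validation problems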
⏱️ 10 min
🏷️ validation, security, medium
# Unanchored Pattern Samples
# Patterns that can match anywhere in the string, causing validation issues
# Risk Level: MEDIUM - Can bypass validation
# --- Number Validation Without Anchors ---
# Pattern: \d+
# Problem: Matches digits anywhere, not the whole string
# Valid: "123"
# Also Matches: "abc123def", "123abc", "a1b2c3"
\d+
# Fix: ^\d+$
# --- Email Validation Without Anchors ---
# Pattern: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
# Problem: Matches email anywhere, allows bypass
# Valid: "user@example.com"
# Also Matches: "user@example.com<script>alert('xss')</script>", "user@example.com malicious content"
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
# Fix: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
# --- URL Validation Without Anchors ---
# Pattern: https?://[^\s]+
# Problem: Matches URL anywhere, allows injection
# Valid: "https://example.com"
# Also Matches: 'javascript:https://evil.com', 'https://example.com" onclick="steal()'
https?://[^\s]+
# Fix: ^https?://[^\s]+$
# --- Phone Number Without Anchors ---
# Pattern: \d{3}-?\d{3}-?\d{4}
# Problem: Matches pattern anywhere
# Valid: "123-456-7890"
# Also Matches: "Call 123-456-7890 now!", "my number is 123-456-7890 thanks"
\d{3}-?\d{3}-?\d{4}
# Fix: ^\d{3}-?\d{3}-?\d{4}$
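The anchoring fixes above can be expressed in Python without writing the anchors at all. This is a sketch assuming Python's `re` module; the function names are hypothetical. It also shows one caveat: in Python, `$` matches just before a trailing newline, so `re.fullmatch` is the stricter choice.

```python
import re

PHONE = re.compile(r"\d{3}-?\d{3}-?\d{4}")

def is_phone_unanchored(text):
    """Accepts any text that merely contains a phone number."""
    return PHONE.search(text) is not None

def is_phone(text):
    """Accepts only text that is a phone number from start to end."""
    return PHONE.fullmatch(text) is not None

# ^...$ still lets a trailing newline through; fullmatch does not.
dollar_allows_newline = re.search(r"^\d+$", "123\n") is not None
fullmatch_allows_newline = re.fullmatch(r"\d+", "123\n") is not None

print(is_phone_unanchored("Call 123-456-7890 now!"), is_phone("Call 123-456-7890 now!"))
```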
# --- HTML Tag Without Anchors ---
# Pattern: <div>.*</div>
# Problem: Matches across multiple unintended divs
# Valid: "<div>content</div>"
# Also Matches: "<div>content1</div><script>evil()</script><div>content2</div>"
<div>.*</div>
# Fix: <div>[^<]*</div> or use proper HTML parser
📝 Excessive greedy matching regex
🟡 intermediate
⭐⭐⭐
Greedy quantifiers that consume more than intended, producing incorrect matches
⏱️ 10 min
🏷️ performance, correctness, medium
# Excessive Greedy Matching Samples
# Greedy quantifiers (.*) consuming more than intended
# Risk Level: MEDIUM - Incorrect behavior and performance issues
# --- HTML/XML Greedy Matching ---
# Pattern: <div>.*</div>
# Problem: .* is greedy and matches across multiple tags
# Input: <div>First</div><div>Second</div>
# Matches: Entire string instead of individual divs
<div>.*</div>
# Fix: <div>.*?</div> (lazy) or <div>[^<]*</div> (negated character class)
# Pattern: <.*>
# Problem: Matches from first < to last >
# Input: <div> <span>text</span> </div>
# Matches: Entire string as one match
<.*>
# Fix: <[^>]+>
# --- URL Greedy Capture ---
# Pattern: https?://.*
# Problem: .* captures to end of line, including other URLs
# Input: "https://example.com https://another.com"
# Matches: "https://example.com https://another.com"
https?://.*
# Fix: https?://[^\s]+
# --- Quote Matching ---
# Pattern: ".*"
# Problem: Greedy match across multiple quoted strings
# Input: He said "hello" and "goodbye"
# Matches: One span, "hello" and "goodbye", instead of two separate quoted strings
".*"
# Fix: "[^"]*"
# --- Code Comment Extraction ---
# Pattern: //.*
# Problem: Takes everything after the first //, and with the dot-all (s) flag it runs across newlines too
# Input: code // comment1; more code // comment2
//.*
# Fix: //[^\n]*
# --- Between Delimiters ---
# Pattern: \|.*\|
# Problem: Greedy match from first | to last |
# Input: a|b|c|d
# Matches: "|b|c|" instead of "|b|"
\|.*\|
# Fix: \|[^|]*\|
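The greedy samples in this section can be compared side by side with their negated-class fixes. A minimal sketch using Python's `re` module and the inputs listed above:

```python
import re

text = 'He said "hello" and "goodbye"'

# Greedy .* runs to the last quote, so both strings merge into one match.
greedy_quotes = re.findall(r'".*"', text)

# A negated class cannot cross a quote, so each string is matched separately.
negated_quotes = re.findall(r'"[^"]*"', text)

# The same effect with pipe delimiters.
greedy_pipes = re.search(r"\|.*\|", "a|b|c|d").group()
negated_pipes = re.search(r"\|[^|]*\|", "a|b|c|d").group()

print(greedy_quotes, negated_quotes, greedy_pipes, negated_pipes)
```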
📝 Inefficient lazy quantifiers regex
🟡 intermediate
⭐⭐⭐
Lazy quantifiers that are still inefficient or ill-suited to the use case
⏱️ 8 min
🏷️ performance, correctness, medium
# Inefficient Lazy Quantifier Samples
# .*? patterns that are inefficient or better served by character classes
# Risk Level: LOW-MEDIUM - Performance issues
# --- Lazy vs Character Class ---
# Pattern: <a>.*?</a>
# Problem: Lazy quantifier still backtracks
# Better: <a>[^<]*</a>
<a>.*?</a>
# Pattern: ".*?"
# Problem: Lazy quantifier for quotes is slow
# Better: "[^"]*"
".*?"
# Pattern: \(.*?\)
# Problem: Lazy for parentheses
# Better: \([^)]*\)
\(.*?\)
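Swapping a lazy quantifier for a negated class is not always a pure optimization, because the two can accept different inputs. The sketch below, assuming Python's `re` module, shows a case where they agree and a case with nested markup where they do not.

```python
import re

LAZY = re.compile(r"<a>.*?</a>")
NEGATED = re.compile(r"<a>[^<]*</a>")

simple = "<a>one</a><a>two</a>"
nested = "<a><b>x</b></a>"

# On flat input both patterns find the same two elements.
lazy_simple = LAZY.findall(simple)
negated_simple = NEGATED.findall(simple)

# With a nested tag, the lazy version still matches but the negated class cannot.
lazy_nested = LAZY.search(nested).group()
negated_nested = NEGATED.search(nested)

print(lazy_simple, negated_simple, lazy_nested, negated_nested)
```

If nested content must be supported, the negated-class version is not a drop-in replacement, and a real parser is usually the safer option.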
# --- Lazy in Complex Patterns ---
# Pattern: ^\w+: .*?$\s+^\w+: .*?$ (with multiline)
# Problem: Multiple lazy quantifiers slow on large text
# Better: Use specific patterns for each field
^\w+: .*?$\s+^\w+: .*?$
# --- Nested Lazy ---
# Pattern: (.*?){3,5}
# Problem: Nested lazy quantifier
# Better: Use specific pattern or split logic
(.*?){3,5}
# --- Lazy with Alternation ---
# Pattern: (a.*?|b.*?|c.*?)
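# Problem: Each alternative ends in a lazy .*?; with nothing after the group it matches zero characters, so the pattern behaves like [abc]
# Better: [abc], or give each branch an explicit terminator, e.g. (?:a[^b]*|b[^c]*|c[^a]*)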
(a.*?|b.*?|c.*?)
# --- Performance Comparison ---
# Inefficient: .*?@.*?
# For email: [^@]+@[^@]+
# The character class version is typically faster (2-3x is a rough figure; it varies by engine and input)
# Inefficient: <div>\s*.*?\s*</div>
# Better: <div>\s*[^<]*\s*</div>
<div>\s*.*?\s*</div>
📝 Ambiguous alternation order regex
🟡 intermediate
⭐⭐⭐
Alternation patterns whose ordering causes unexpected matches
⏱️ 8 min
🏷️ correctness, logic, medium
# Ambiguous Alternation Order Samples
# Alternation patterns where order critically affects matching
# Risk Level: MEDIUM - Incorrect behavior
# --- Prefix Ambiguity ---
# Pattern: cat|category
# Problem: "cat" matches first, "category" never fully matched
# Input: "category"
# Matches: "cat" instead of "category"
cat|category
# Fix: category|cat
# Pattern: a|ab|abc
# Problem: Shortest matches first
# Input: "abc"
# Matches: "a" instead of "abc"
a|ab|abc
# Fix: abc|ab|a
# --- Partial Match Issues ---
# Pattern: Mon|Tues|Wed|Thurs|Fri|Sat|Sun
# Problem: Order matters only when one alternative is a prefix of another; none is here
# Input: "Thursday"
# Matches: "Thurs" regardless of order
Mon|Tues|Wed|Thurs|Fri|Sat|Sun
# Fix: None needed as written; if full names are added, list the longer form first: Thursday|Thurs
# Pattern: http|https
# Problem: "http" matches first
# Input: "https://example.com"
# Matches: "http" instead of "https"
http|https
# Fix: https|http
# --- Overlapping Patterns ---
# Pattern: \d+|\d{2}
# Problem: First always wins
# Input: "12"
# Matches: "12" via \d+ not \d{2}
\d+|\d{2}
# Fix: \d{2}|\d+ or use specific patterns
# --- Logical Conflicts ---
# Pattern: foo.*|foo.*bar
# Problem: First pattern swallows second
# Input: "foobazbar"
# Matches: via foo.* (greedy), foo.*bar never tried
foo.*|foo.*bar
# Fix: foo.*bar|foo.*
# --- Word Boundaries with Alternation ---
# Pattern: \b(cat|category)\b
# Problem: Order looks wrong, but the trailing \b rescues the match
# Input: "category"
# Behavior: "cat" is tried first, \b fails before "e", and the engine backtracks to "category"
\b(cat|category)\b
# Fix: \b(category|cat)\b avoids the wasted attempt; the result is the same either way
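Alternation order is easy to check empirically. The following sketch uses Python's `re` module, a backtracking engine that takes the first alternative that leads to a match, on the inputs from this section.

```python
import re

# Leftmost alternative wins when nothing after it forces a retry.
short_first = re.match(r"cat|category", "category").group()
long_first = re.match(r"category|cat", "category").group()

# A trailing \b makes the short alternative fail, so the engine backtracks.
bounded = re.match(r"\b(cat|category)\b", "category").group()

# The same prefix issue with http and https, and an order-independent rewrite.
scheme_short = re.match(r"http|https", "https://example.com").group()
scheme_optional = re.match(r"https?", "https://example.com").group()

print(short_first, long_first, bounded, scheme_short, scheme_optional)
```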
📝 Confusing double negation regex
🟡 intermediate
⭐⭐
Double-negative patterns that are hard to read and error-prone
⏱️ 6 min
🏷️ readability, correctness, low
# Double Negation Samples
# Confusing negative patterns that are hard to read and maintain
# Risk Level: LOW - Readability and maintenance issues
# --- Negated Negated Character Class ---
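# Pattern: [^[^]]
# Problem: Looks like a nested negation, but character classes do not nest in most engines
# Means: [^[^] (any character except [ or ^) followed by a literal ]
[^[^]]
# Fix: \] if the intent was a literal closing bracket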
# Pattern: [^\D]
# Problem: Double negation for digit
# Means: Not (not digit) = digit
[^\D]
# Fix: \d or [0-9]
# Pattern: [^\W]
# Problem: Double negation for word char
# Means: Not (not word) = word char
[^\W]
# Fix: \w or [a-zA-Z0-9_]
# Pattern: [^\S]
# Problem: Double negation for whitespace
# Means: Not (not space) = space
[^\S]
# Fix: \s or [ \t\n\r\f\v]
# --- Nested Negative Lookaheads ---
# Pattern: ^(?!.*(?!pattern)).*
# Problem: Confusing double negative lookahead; as written it can never match, because .*(?!pattern) always succeeds at the end of the line
^(?!.*(?!pattern)).*
# Fix: Simplify logic or use positive assertions
# Pattern: (?![^a])
# Problem: Double negative lookahead
# Means: Not followed by a non-'a' character = followed by 'a' or at end of input
(?![^a])
# Fix: (?=a) if end of input should not count, otherwise (?=a|\z) (\Z in Python)
# --- Negated Everything Except ---
# Pattern: [^abc]
# Problem: Negative thinking
# Consider: What if you want to match most things?
[^abc]
# Fix: Consider if positive class is clearer for your case
# Pattern: (?!abc).*
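# Problem: Negative lookahead to exclude; without an anchor it only rules out one start position
# Input: "def" matches; "abc" still matches as "bc" from the second character
(?!abc).*
# Fix: Anchor it as ^(?!abc).* or consider if a positive pattern is clearer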
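The lookahead entries in this section have edge cases that are quick to confirm. A sketch using Python's `re` module; the sample inputs are chosen only for illustration.

```python
import re

# (?![^a]) also succeeds at end of input, where (?=a) fails.
double_negative_at_end = re.search(r"b(?![^a])", "b") is not None
positive_at_end = re.search(r"b(?=a)", "b") is not None

# Unanchored, (?!abc).* just skips the first position and matches the rest.
leftover = re.search(r"(?!abc).*", "abc").group()
anchored = re.search(r"^(?!abc).*", "abc")

# The nested negative lookahead sample rejects every input tried here.
never = [re.search(r"^(?!.*(?!pattern)).*", s) for s in ("", "pattern", "other")]

print(double_negative_at_end, positive_at_end, leftover, anchored, never)
```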
📝 Octal and backreference ambiguity regex
🟡 intermediate
⭐⭐⭐
Patterns where \1 may mean either an octal escape or a backreference
⏱️ 8 min
🏷️ ambiguity, compatibility, medium
# Octal Ambiguity Samples
# Patterns where \1, \2 etc. have ambiguous meanings
# Risk Level: MEDIUM - Cross-engine compatibility issues
# --- \0 Ambiguity ---
# Pattern: \0
# Problem: \0 is the null character, but followed by digits it becomes a longer octal escape
# In JavaScript: Null character; \0 followed by a digit is a legacy octal escape (rejected with the u flag)
# In Python: Octal 000 (null char)
\0
# Fix: Use \x00 for null character (more explicit)
# Pattern: \01
# Problem: Looks like a backreference to group 1, but the leading zero makes it octal 001
# Python, PCRE: Always octal 001
# JavaScript: Legacy octal 001 (rejected with the u flag)
\01
# Fix: Use \x01 for the character, or \1 (or a named group) for the backreference
# --- Backreference vs Octal ---
# Pattern: (.)\1
# Problem: Is \1 backreference to group 1 or octal?
# With group: Backreference
# Without group: Octal (in some engines)
(.)\1
# Fix: Use \g{1} (Perl/PCRE), or name the group and reference it with \k<name>
# Pattern: \10
# Problem: Could be backreference to group 10 or octal 010
# Depends on number of capturing groups
\10
# Fix: Use \g{10} for clarity
# --- Leading Zeros ---
# Pattern: \01 in (a)\01
# Problem: Reads like a backreference, but the leading zero makes it octal
(a)\01 # Octal 001 in Python and PCRE, not a backreference to group 1
# Fix: \1 or \g{1} (Perl/PCRE) for the backreference; avoid leading zeros
# Pattern: \001
# Problem: Definitely octal, but the value is not obvious to readers
\001 # Octal 001 = decimal 1
# Fix: \x01 for hex escape
# --- Octal in Character Classes ---
# Pattern: [\01]
# Problem: Octal in character class
# Matches: character with octal value 001 (SOH, U+0001; not the null char)
[\01]
# Fix: [\x01] for clarity
# Pattern: [\0-\7]
# Problem: Octal range
# Matches: characters from null to bell (0-7 decimal)
[\0-\7]
# Fix: Use \x00-\x07 or explicit characters
# --- Cross-Engine Differences ---
# In JavaScript:
\1 # Backreference if group 1 exists; otherwise legacy octal 001 (error with the u flag)
# In Python:
\1 # Backreference to group 1 (compile error if group 1 doesn't exist)
\01 # Always octal 001
# In PCRE:
\1 # Backreference
\01 # Always octal 001
\g1 # Unambiguous backreference
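The Python rows of this comparison can be verified with a few lines. This sketch covers only Python's `re` module; JavaScript and PCRE behavior is not exercised here.

```python
import re

# \1 after a group is a backreference to that group.
backref = re.fullmatch(r"(a)\1", "aa") is not None

# A leading zero forces an octal escape, even when group 1 exists.
octal_matches_soh = re.fullmatch(r"(a)\01", "a\x01") is not None
octal_matches_repeat = re.fullmatch(r"(a)\01", "aa") is not None

# Inside a class, \01 is also the character U+0001.
class_octal = re.fullmatch(r"[\01]", "\x01") is not None

# \1 with no group to refer to is rejected at compile time.
try:
    re.compile(r"\1")
    dangling_backref_ok = True
except re.error:
    dangling_backref_ok = False

# A named group and named backreference avoid the question entirely.
named = re.fullmatch(r"(?P<ch>a)(?P=ch)", "aa") is not None

print(backref, octal_matches_soh, octal_matches_repeat, class_octal, dangling_backref_ok, named)
```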
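📝 Catastrophic backtracking patterns regex
🔴 complex
⭐⭐⭐⭐⭐
Regular expression patterns that can cause exponential backtracking and enable denial-of-service attacks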
⏱️ 15 min
🏷️ security, critical, redos, performance
# Catastrophic Backtracking (ReDoS) Samples
# These patterns can cause exponential time complexity on non-matching input
# Risk Level: CRITICAL - Can cause DoS attacks
# --- Nested Quantifiers ---
# Pattern: (a+)+
# Problem: Nested quantifiers create exponential backtracking
# Dangerous Input: aaaaaaaaaaaaaaaaX (20+ 'a's followed by non-matching char)
(a+)+
# Pattern: (.*?)+
# Problem: Lazy quantifier nested in greedy quantifier
# Dangerous Input: Any input that doesn't fully match
(.*?)+
# Pattern: (.*)+
# Problem: Nested greedy quantifiers
# Dangerous Input: Input with partial match at end
(.*)+
# Pattern: ^(a+)+$
# Problem: Anchored nested quantifiers
# Dangerous Input: aaaaaaaaaaaaaaaaaaaaX
^(a+)+$
# Pattern: ^((a+)+)+$
# Problem: Triple nested quantifiers - extremely dangerous
# Dangerous Input: aaaaaaaaaaaaaaaaaaaaX
^((a+)+)+$
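The growth in rejection time for nested quantifiers can be observed with a small timing helper. This is a sketch for Python's backtracking `re` engine; `seconds_to_reject` and the chosen lengths are illustrative, and absolute timings depend on the machine. The lengths are kept small on purpose, since each extra character roughly doubles the work for the unsafe pattern.

```python
import re
import time

UNSAFE = re.compile(r"^(a+)+$")
SAFE = re.compile(r"^a+$")  # accepts the same strings without nested quantifiers

def seconds_to_reject(pattern, n):
    """Time how long the pattern takes to reject n 'a's followed by 'X'."""
    text = "a" * n + "X"
    start = time.perf_counter()
    matched = pattern.search(text)
    elapsed = time.perf_counter() - start
    assert matched is None  # the trailing X makes every attempt fail
    return elapsed

if __name__ == "__main__":
    for n in (10, 14, 18):
        print(n, seconds_to_reject(UNSAFE, n), seconds_to_reject(SAFE, n))
```

Rewriting the pattern so that no quantified group contains another quantifier over the same characters removes the ambiguity that drives the backtracking.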
# --- Overlapping Alternatives ---
# Pattern: (a|a)+
# Problem: Identical alternatives in quantifier
# Dangerous Input: aaaaaaaaaaaaaaaaX
(a|a)+
# Pattern: (ab|abc)+
# Problem: Overlapping alternatives
# Dangerous Input: ababababababababX
(ab|abc)+
# Pattern: (\d|\d\d)+
# Problem: Prefix relationship between alternatives
# Dangerous Input: 12345678901234567890X
(\d|\d\d)+
# --- Exponential Patterns ---
# Pattern: (a|b|c)*x
# Problem: Wildcard before specific match; an unanchored search retries from every start position (quadratic rather than exponential)
# Dangerous Input: aaaaaaaaaaaaaaaaaaaa (without x)
(a|b|c)*x
# Pattern: ^(a*)*$
# Problem: Nested star quantifier
# Dangerous Input: aaaaaaaaaaaaaaaaaaaaX
^(a*)*$
# Pattern: .*=(.*).*=(.*).*
# Problem: Multiple backtracking points
# Dangerous Input: a=bbbbbbbbbbbbbbbbbbbb=c=dddddddddddddddddddd
.*=(.*).*=(.*).*