📝 Invalid Character Class Escapes regex

🟢 simple ⭐⭐

Character classes containing invalid ranges or unescaped special characters

⏱️ 5 min 🏷️ syntax, correctness, low

# Invalid Character Class Samples
# Character classes with syntax errors or ambiguous patterns
# Risk Level: LOW - Syntax errors or unexpected behavior

# --- Invalid Ranges ---

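# Pattern: [a-Z]
# Problem: Reversed range. 'Z' (90) comes before 'a' (97) in ASCII
# Result: Most engines reject the pattern with a range error
[a-Z]

# Fix: [a-zA-Z]

# Pattern: [Z-a]
# Problem: Valid but misleading range. Matches Z, [, \, ], ^, _, `, and a
[Z-a]

# Fix: [a-z] or [a-zA-Z] (state the intended letters explicitly)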

# Pattern: [0-z]
# Problem: Range spans the punctuation between digits and letters (: ; < = > ? @ [ \ ] ^ _ `)
[0-z]

# Fix: [0-9a-zA-Z]

# --- Unescaped Special Characters ---

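# Pattern: [^]]
# Problem: Engine-dependent parsing of ] directly after [^
# PCRE, POSIX, Python: a class matching any character except ]
# JavaScript: [^] (any character) followed by a literal ]
# Problematic in: [^]]+

# Fix: [^\]]+ (escape the bracket so every engine reads it the same way)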

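# Pattern: [--]
# Problem: Dash placement confusion. Looks like a range but, in most engines, matches only a literal hyphen
[--]

# Fix: [-] or \-

# Pattern: [^^]
# Problem: Caret placement confusion. The first ^ negates, the second is a literal caret
# Matches: Any character except ^
[^^]

# Fix: [\^] to match a caret, or [^\^] to make the negation explicit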

# --- Ambiguous Escapes ---

# Pattern: [\w]
# Problem: Redundant brackets. A class holding only \w is the same as \w
# \w works inside a class in most engines, but not in POSIX bracket expressions
[\w]

# Fix: \w outside class or [a-zA-Z0-9_]

# Pattern: [\b]
# Problem: In character class, \b is backspace, not word boundary
[\b]

# Clarify: Inside class = backspace, Outside = word boundary

# Pattern: [\d]
# Problem: Redundant brackets, and \d inside a class is not supported by POSIX bracket expressions
[\d]

# Fix: [0-9] or \d outside class
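
The range samples above can be checked directly. This is a minimal sketch using Python's `re` module; other engines may report or interpret these classes differently, and the helper name `compiles` is only illustrative.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# 'a' is code point 97 and 'Z' is 90, so a-Z runs backwards and is rejected.
reversed_range_ok = compiles(r"[a-Z]")

# Z-a is a valid range (90..97) and silently includes [ \ ] ^ _ `
z_to_a = re.findall(r"[Z-a]", "Z[\\]^_`a")

# 0-z spans the punctuation that sits between the digits and the letters.
punctuation_hits = re.findall(r"[0-z]", ":;<=>?@")

# Inside a class, \b is the backspace character (U+0008).
backspace_hit = re.fullmatch(r"[\b]", "\x08") is not None
```

Spelling out each intended range, as in `[0-9a-zA-Z]`, avoids relying on how code points happen to be ordered.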

📝 Redundant Escapes regex

🟢 simple ⭐

Unnecessary escape sequences that reduce readability

⏱️ 4 min 🏷️ style, readability, low

# Redundant Escape Samples
# Unnecessary escape sequences that clutter patterns
# Risk Level: LOW - Style and readability issues

# --- Unnecessary Character Escapes ---

# Pattern: \-
# Problem: Hyphen doesn't need escaping outside character class
\-

# Fix: - (just use hyphen)

# Pattern: \:
# Problem: Colon is not a special character
\:

# Fix: : (just use colon)

# Pattern: \.
# Problem: Not redundant. The escape decides between a literal period and "any character"
# If you want literal: \.
# If you want any char: .
\.  # literal period
.   # any character

# Pattern: \\
# Problem: Not redundant. A literal backslash must itself be escaped
# Often confused with: \ (a lone backslash, which starts an escape sequence)
\\  # literal backslash

# --- Unnecessary Character Class Escapes ---

# Pattern: [a-z\-]
# Problem: Escape of hyphen not needed when at edges
[a-z\-]  # unnecessary
[a-z-]    # better (hyphen at end)
[-a-z]    # better (hyphen at start)

# Pattern: [\^]
# Problem: Caret only needs an escape when it is the first character of the class
[^\^]   # second caret is not first, escape not needed: [^^]
[\^]    # caret is first, escape needed (or move it: [a^])

# Pattern: [\]]
# Problem: Escape only needed in some positions
[a-z\]]  # necessary here
[\]a-z]  # optional in PCRE and Python ([]a-z] works), required in JavaScript

# --- Letter Escapes ---

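# Pattern: \c\a\t
# Problem: Escaping letters is not a harmless habit
# Many letter escapes are special: b, d, s, w, and also a (bell), t (tab), c (control character)
\c\a\t  # not "cat": \a is bell and \t is tab; \c is a control escape or an error
cat       # just letters

# Pattern: [\Q\E]
# Problem: \Q and \E behave differently across engines inside character classes
[\Q\E]  # PCRE: treated as quoting; JavaScript (non-Unicode mode): matches Q or E; Python: error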

# --- Numeric Escapes ---

# Pattern: \1 vs \01
# Problem: Ambiguous - backreference or octal?
# In modern regex: \1 is usually a backreference
# In some contexts (leading zero, or no such group): Octal

# Clarify: Use named groups with \k<name> (or (?P=name) in Python) to avoid ambiguity
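
Whether an escape is redundant depends on the character being escaped. The sketch below, which assumes Python's `re` module, contrasts punctuation escapes (harmless) with letter escapes (which change the meaning or fail to compile).

```python
import re

# Escaping punctuation that is not special is harmless but adds noise.
same_hyphen = re.fullmatch(r"\-", "-") is not None
same_colon = re.fullmatch(r"\:", ":") is not None

# Escaping letters is different: \a is BEL (U+0007) and \t is TAB.
matches_letter_a = re.fullmatch(r"\a", "a") is not None
matches_bell = re.fullmatch(r"\a", "\x07") is not None
matches_tab = re.fullmatch(r"\t", "\t") is not None

# Python rejects unknown letter escapes such as \c outright.
try:
    re.compile(r"\c")
    c_escape_ok = True
except re.error:
    c_escape_ok = False

# A hyphen at the edge of a class is literal without a backslash.
edge_hyphen = re.findall(r"[a-z-]", "a-b")
```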

📝 Excessive Capturing Groups regex

🟢 simple ⭐⭐

Unnecessary capturing groups that affect performance

⏱️ 5 min 🏷️ performance, style, low

# Excessive Capturing Group Samples
# Unnecessary capturing groups that hurt performance and readability
# Risk Level: LOW - Performance and maintainability issues

# --- Unneeded Captures ---

# Pattern: (\d+)\s+(\w+)\s+(\d+)
# Problem: Capturing when you only need to match
# If you don't need the groups, use non-capturing
(\d+)\s+(\w+)\s+(\d+)

# Fix: \d+\s+\w+\s+\d+ (no groups)
# Or: (?:\d+)\s+(?:\w+)\s+(?:\d+) (non-capturing)

# Pattern: (https?)://([^\s]+)
# Problem: Capturing protocol when you just want validation
(https?)://([^\s]+)

# Fix: (?:https?)://[^\s]+ or https?://[^\s]+

# --- Nested Captures ---

# Pattern: ((\d+)\s+(\w+))
# Problem: Nested capturing groups
# Creates: Group 1: entire match, Group 2: digits, Group 3: word
((\d+)\s+(\w+))

# Fix: Use non-capturing where possible:
# (?:(\d+)\s+(\w+)) or just flatten to (\d+)\s+(\w+)

# Pattern: (a(b(c)d)e)
# Problem: Deeply nested captures
# Creates: 3 capturing groups (abcde, bcd, c)
(a(b(c)d)e)

# Fix: (?:a(?:b(?:c)d)e) or a(?:b(?:c)d)e
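
Capturing groups are numbered by their opening parenthesis, left to right. A short sketch with Python's `re` module shows how many groups the nested pattern creates and what each one holds.

```python
import re

nested = re.compile(r"(a(b(c)d)e)")
flat = re.compile(r"(?:a(?:b(?:c)d)e)")

# .groups is the number of capturing groups in the compiled pattern.
nested_count = nested.groups
flat_count = flat.groups

# Groups come back in order of their opening parenthesis.
captured = nested.fullmatch("abcde").groups()
uncaptured = flat.fullmatch("abcde").groups()
```

Both patterns accept exactly the same text; only the bookkeeping differs.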

# --- Performance Impact ---

# Pattern with captures: cited as ~30-50% slower than non-capturing (engine- and input-dependent)
# Benchmark: Matching 1000 strings

# Slow: (\d{3})-(\d{3})-(\d{4})
# Fast: \d{3}-\d{3}-\d{4}

# Slow: (\w+)@(\w+)\.(\w+)
# Fast: \w+@\w+\.\w+

# --- When to Use Captures ---

# Use capturing groups when:
# - You need to extract specific parts
# - You need backreferences
# Example: (\w+)\s+\1  # repeated word

# Use non-capturing (?:...) when:
# - Grouping for quantifiers: (?:abc){3}
# - Grouping for alternation: (?:a|b|c)
# - Grouping for precedence: ^(?:abc|def)

# --- Named Groups for Clarity ---

# Instead of: (\d{4})-(\d{2})-(\d{2})
# Use: (?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}) (Python spells it (?P<year>...))

# Improves: Readability and maintenance
# Performance: Similar to capturing groups, but clearer
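
A small sketch of the two cases where capturing earns its keep, named extraction and backreferences, assuming Python's `re` module (which writes named groups as `(?P<name>...)`). The sample strings are made up for illustration.

```python
import re

# Python spells named groups (?P<name>...); many other engines use (?<name>...).
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

m = DATE.fullmatch("2024-01-31")
parts = m.groupdict()

# A backreference is a case where a capturing group is genuinely needed.
repeated = re.search(r"\b(\w+)\s+\1\b", "it was was fine")
repeated_word = repeated.group(1)
```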

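📝 Unanchored Validation Patterns regex

🟡 intermediate ⭐⭐⭐

Regular expression patterns that lack anchors and can match anywhere in the input, causing validation problems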

⏱️ 10 min 🏷️ validation, security, medium

# Unanchored Pattern Samples
# Patterns that can match anywhere in the string, causing validation issues
# Risk Level: MEDIUM - Can bypass validation

# --- Number Validation Without Anchors ---

# Pattern: \d+
# Problem: Matches digits anywhere, not the whole string
# Valid: "123"
# Also Matches: "abc123def", "123abc", "a1b2c3"
\d+

# Fix: ^\d+$
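
The difference between searching and validating can be sketched as follows, assuming Python's `re` module. The function names are illustrative. Note that in Python `$` also matches just before a trailing newline, so `fullmatch` is the stricter check.

```python
import re

DIGITS = re.compile(r"\d+")

def looks_numeric_unanchored(text: str) -> bool:
    """Unanchored check: true if digits appear anywhere in the text."""
    return DIGITS.search(text) is not None

def is_numeric(text: str) -> bool:
    """Anchored check: true only if the whole text is digits."""
    return DIGITS.fullmatch(text) is not None

# In Python, $ also matches just before a trailing newline.
dollar_accepts_newline = re.search(r"^\d+$", "123\n") is not None
```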

# --- Email Validation Without Anchors ---

# Pattern: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
# Problem: Matches email anywhere, allows bypass
# Valid: "user@example.com"
# Also Matches: "user@example.com<script>alert('xss')</script>", "user@example.com malicious content"
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

# Fix: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

# --- URL Validation Without Anchors ---

# Pattern: https?://[^\s]+
# Problem: Matches URL anywhere, allows injection
# Valid: "https://example.com"
# Also Matches: "javascript:https://evil.com", 'https://example.com" onclick="steal()"'
https?://[^\s]+

# Fix: ^https?://[^\s]+$
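
The URL samples can be run through both an unanchored and an anchored check. This is a sketch in Python, reusing the candidate strings from the sample above.

```python
import re

URL = re.compile(r"https?://[^\s]+")

candidates = [
    "https://example.com",
    "javascript:https://evil.com",
    'https://example.com" onclick="steal()"',
]

# search() accepts any string that merely contains a URL.
accepted_by_search = [c for c in candidates if URL.search(c)]

# fullmatch() requires the whole string to be the URL.
accepted_by_fullmatch = [c for c in candidates if URL.fullmatch(c)]
```

Anchoring rejects the prefixed and suffixed inputs, but `[^\s]+` still allows quotes and angle brackets, so an anchored match alone does not make the value safe to embed in HTML.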

# --- Phone Number Without Anchors ---

# Pattern: \d{3}-?\d{3}-?\d{4}
# Problem: Matches pattern anywhere
# Valid: "123-456-7890"
# Also Matches: "Call 123-456-7890 now!", "my number is 123-456-7890 thanks"
\d{3}-?\d{3}-?\d{4}

# Fix: ^\d{3}-?\d{3}-?\d{4}$

# --- HTML Tag Without Anchors ---

# Pattern: <div>.*</div>
# Problem: Matches across multiple unintended divs
# Valid: "<div>content</div>"
# Also Matches: "<div>content1</div><script>evil()</script><div>content2</div>"
<div>.*</div>

# Fix: <div>[^<]*</div> or use proper HTML parser

📝 Excessive Greedy Matching regex

🟡 intermediate ⭐⭐⭐

Greedy quantifiers that consume more than intended, leading to incorrect matches

⏱️ 10 min 🏷️ performance, correctness, medium

# Excessive Greedy Matching Samples
# Greedy quantifiers (.*) consuming more than intended
# Risk Level: MEDIUM - Incorrect behavior and performance issues

# --- HTML/XML Greedy Matching ---

# Pattern: <div>.*</div>
# Problem: .* is greedy and matches across multiple tags
# Input: <div>First</div><div>Second</div>
# Matches: Entire string instead of individual divs
<div>.*</div>

# Fix: <div>.*?</div> (lazy) or <div>[^<]*</div> (negated character class)

# Pattern: <.*>
# Problem: Matches from first < to last >
# Input: <div> <span>text</span> </div>
# Matches: Entire string as one match
<.*>

# Fix: <[^>]+>

# --- URL Greedy Capture ---

# Pattern: https?://.*
# Problem: .* captures to end of line, including other URLs
# Input: "https://example.com https://another.com"
# Matches: "https://example.com https://another.com"
https?://.*

# Fix: https?://[^\s]+

# --- Quote Matching ---

# Pattern: ".*"
# Problem: Greedy match across multiple quoted strings
# Input: He said "hello" and "goodbye"
# Matches: "hello" and "goodbye" (one match spanning both quoted strings)
".*"

# Fix: "[^"]*"
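
The quote sample behaves as follows in Python's `re` module; the sketch compares the greedy pattern with the lazy and negated-class alternatives on the same input.

```python
import re

text = 'He said "hello" and "goodbye"'

# Greedy: runs from the first quote to the last one.
greedy = re.findall(r'".*"', text)

# Lazy: stops at the nearest closing quote.
lazy = re.findall(r'".*?"', text)

# Negated class: cannot cross a quote at all.
negated = re.findall(r'"[^"]*"', text)
```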

# --- Code Comment Extraction ---

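# Pattern: //.*
# Problem: Runs to the end of the line, and across lines when dot-matches-newline is enabled
# Input: code // comment1; more code // comment2
# Matches: "// comment1; more code // comment2"
//.*

# Fix: //[^\n]* (stops at the newline regardless of flags)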

# --- Between Delimiters ---

# Pattern: \|.*\|
# Problem: Greedy match from first | to last |
# Input: a|b|c|d
# Matches: "|b|c|" instead of "|b|"
\|.*\|

# Fix: \|[^|]*\|
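
A quick check of the delimiter sample in Python. Because each match consumes both pipes, splitting on the delimiter is often simpler when every field is wanted; that alternative is a suggestion here, not part of the original sample.

```python
import re

row = "a|b|c|d"

# Greedy: spans from the first pipe to the last pipe.
greedy = re.search(r"\|.*\|", row).group()

# Negated class: stops at the next pipe.
bounded = re.search(r"\|[^|]*\|", row).group()

# Splitting avoids consuming delimiters that the next field also needs.
fields = row.split("|")
```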

📝 Inefficient Lazy Quantifiers regex

🟡 intermediate ⭐⭐⭐

Lazy quantifiers that are still inefficient or a poor fit for the use case

⏱️ 8 min 🏷️ performance, correctness, medium

# Inefficient Lazy Quantifier Samples
# .*? patterns that are inefficient or better served by character classes
# Risk Level: LOW-MEDIUM - Performance issues

# --- Lazy vs Character Class ---

# Pattern: <a>.*?</a>
# Problem: Lazy quantifier still backtracks
# Better: <a>[^<]*</a>
<a>.*?</a>

# Pattern: ".*?"
# Problem: Lazy quantifier for quotes is slow
# Better: "[^"]*"
".*?"

# Pattern: \(.*?\)
# Problem: Lazy for parentheses
# Better: \([^)]*\)
\(.*?\)
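
The lazy form and the negated-class form agree on simple input but are not interchangeable. The Python sketch below shows where they diverge; the sample strings are illustrative.

```python
import re

simple = "<a>one</a><a>two</a>"
lazy_simple = re.findall(r"<a>.*?</a>", simple)
class_simple = re.findall(r"<a>[^<]*</a>", simple)

# With nested markup, the lazy dot still crosses the inner tags,
# while the negated class refuses to match at all.
nested = "<a><b>x</b></a>"
lazy_nested = re.search(r"<a>.*?</a>", nested)
class_nested = re.search(r"<a>[^<]*</a>", nested)
```

Swapping `.*?` for a negated class therefore changes what is accepted, so it is worth checking against real input before treating it as a pure optimization.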

# --- Lazy in Complex Patterns ---

# Pattern: ^\w+: .*?$\s+^\w+: .*?$ (with multiline)
# Problem: Multiple lazy quantifiers slow on large text
# Better: Use specific patterns for each field
^\w+: .*?$\s+^\w+: .*?$

# --- Nested Lazy ---

# Pattern: (.*?){3,5}
# Problem: Nested lazy quantifier
# Better: Use specific pattern or split logic
(.*?){3,5}

# --- Lazy with Alternation ---

# Pattern: (a.*?|b.*?|c.*?)
# Problem: Lazy with multiple alternatives
# Better: (?:a[^b]*|b[^c]*|c[^a]*)
(a.*?|b.*?|c.*?)

# --- Performance Comparison ---

# Inefficient: .*?@.*?
# For email: [^@]+@[^@]+
# The character class version is often cited as 2-3x faster; the gain depends on engine and input

# Inefficient: <div>\s*.*?\s*</div>
# Better: <div>\s*[^<]*\s*</div>
<div>\s*.*?\s*</div>

📝 Ambiguous Alternation Order regex

🟡 intermediate ⭐⭐⭐

Alternation patterns whose order causes unexpected matches

⏱️ 8 min 🏷️ correctness, logic, medium

# Ambiguous Alternation Order Samples
# Alternation patterns where order critically affects matching
# Risk Level: MEDIUM - Incorrect behavior

# --- Prefix Ambiguity ---

# Pattern: cat|category
# Problem: "cat" matches first, "category" never fully matched
# Input: "category"
# Matches: "cat" instead of "category"
cat|category

# Fix: category|cat

# Pattern: a|ab|abc
# Problem: Shortest matches first
# Input: "abc"
# Matches: "a" instead of "abc"
a|ab|abc

# Fix: abc|ab|a

# --- Partial Match Issues ---

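# Pattern: Mon|Tues|Wed|Thurs|Fri|Sat|Sun
# Problem: Order matters only when one alternative is a prefix of another
# Input: "Thursday"
# Matches: "Thurs" (no alternative here is a prefix of another, so the order is harmless)
# Risk: Adding "Thu" ahead of "Thurs" would make "Thurs" unreachable
Mon|Tues|Wed|Thurs|Fri|Sat|Sun

# Fix: Order by length descending: Thurs|Tues|Wed|...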

# Pattern: http|https
# Problem: "http" matches first
# Input: "https://example.com"
# Matches: "http" instead of "https"
http|https

# Fix: https|http
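
The prefix samples can be reproduced in Python, whose engine takes the first alternative that succeeds. The helpers `first_match` and `longest_first` are hypothetical names introduced for this sketch.

```python
import re

def first_match(pattern: str, text: str) -> str:
    """Return the text matched at the start of the input, or '' if none."""
    m = re.match(pattern, text)
    return m.group() if m else ""

def longest_first(words):
    """Build an alternation with longer literals ahead of their prefixes."""
    ordered = sorted(words, key=len, reverse=True)
    return "|".join(re.escape(w) for w in ordered)

short_first = first_match(r"cat|category", "category")
long_first = first_match(r"category|cat", "category")
scheme = first_match(r"http|https", "https://example.com")
built = longest_first(["a", "ab", "abc"])
```

Sorting by length is a simple way to keep generated alternations safe when the word list changes.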

# --- Overlapping Patterns ---

# Pattern: \d+|\d{2}
# Problem: First always wins, so \d{2} is unreachable
# Input: "12"
# Matches: "12" via \d+ not \d{2}
\d+|\d{2}

# Fix: \d{2}|\d+ or use specific patterns

# --- Logical Conflicts ---

# Pattern: foo.*|foo.*bar
# Problem: First pattern swallows second
# Input: "foobazbar"
# Matches: via foo.* (greedy), foo.*bar never tried
foo.*|foo.*bar

# Fix: foo.*bar|foo.*

# --- Word Boundaries with Alternation ---

# Pattern: \b(cat|category)\b
# Problem: Looks order-sensitive, but the trailing \b forces the engine to backtrack
# Input: "category"
# Result: "cat" is tried first, \b fails before "egory", then "category" matches
\b(cat|category)\b

# Fix: \b(category|cat)\b avoids the backtrack; either order matches the same text
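
How word boundaries interact with alternation order can be checked with a short Python sketch; the sample sentence is made up for illustration.

```python
import re

# The trailing \b rejects "cat" inside "category", so the engine
# backtracks and tries the second alternative.
short_first = re.search(r"\b(cat|category)\b", "category").group()
long_first = re.search(r"\b(category|cat)\b", "category").group()

# "cat" on its own never matches inside a longer word.
inside_word = re.search(r"\bcat\b", "category")

words = re.findall(r"\b(?:cat|category)\b", "cat category concat")
```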

📝 Confusing Double Negation regex

🟡 intermediate ⭐⭐

Double-negative patterns that are hard to read and error-prone

⏱️ 6 min 🏷️ readability, correctness, low

# Double Negation Samples
# Confusing negative patterns that are hard to read and maintain
# Risk Level: LOW - Readability and maintenance issues

# --- Negated Negated Character Class ---

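# Pattern: [^[^]]
# Problem: Looks like double negation, but classes do not nest in most engines
# Means: [^[^] (any character except [ or ^) followed by a literal ]
[^[^]]

# Fix: [\]] or just ] if not in class (when a literal closing bracket was intended)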

# Pattern: [^\D]
# Problem: Double negation for digit
# Means: Not (not digit) = digit
[^\D]

# Fix: \d or [0-9]

# Pattern: [^\W]
# Problem: Double negation for word char
# Means: Not (not word) = word char
[^\W]

# Fix: \w or [a-zA-Z0-9_]

# Pattern: [^\S]
# Problem: Double negation for whitespace
# Means: Not (not space) = space
[^\S]

# Fix: \s or [ \t\n\r\f\v]

# --- Nested Negative Lookaheads ---

# Pattern: ^(?!.*(?!pattern)).*
# Problem: Confusing double negative lookahead
^(?!.*(?!pattern)).*

# Fix: Simplify logic or use positive assertions

# Pattern: (?![^a])
# Problem: Double negative lookahead
# Means: Not followed by not 'a' = followed by 'a' or by end of input
(?![^a])

# Fix: (?=a) (note: the original also succeeds at end of input)
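
The two lookaheads differ only at the end of the input. The Python sketch below prefixes each with a literal `x` so the position being tested is easy to see.

```python
import re

DOUBLE_NEGATIVE = re.compile(r"x(?![^a])")
POSITIVE = re.compile(r"x(?=a)")

# Followed by 'a': both succeed.
both_on_xa = (DOUBLE_NEGATIVE.search("xa") is not None,
              POSITIVE.search("xa") is not None)

# Followed by another character: both fail.
both_on_xb = (DOUBLE_NEGATIVE.search("xb") is not None,
              POSITIVE.search("xb") is not None)

# At end of input there is no character to be "not a", so only
# the double-negative form succeeds.
both_on_x = (DOUBLE_NEGATIVE.search("x") is not None,
             POSITIVE.search("x") is not None)
```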

# --- Negated Everything Except ---

# Pattern: [^abc]
# Problem: Negative thinking
# Consider: What if you want to match most things?
[^abc]

# Fix: Consider if positive class is clearer for your case

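# Pattern: (?!abc).*
# Problem: Negative lookahead to exclude, ineffective without an anchor
# Input: "def" matches; "abc" is rejected only at position 0, so a search still matches "bc"
(?!abc).*

# Fix: ^(?!abc).* or consider if a positive pattern is clearer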
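
An unanchored negative lookahead only blocks one starting position. A short Python sketch shows the search sliding past it.

```python
import re

# The lookahead fails at index 0, so the search retries at index 1.
unanchored = re.search(r"(?!abc).*", "abc").group()

# Anchoring pins the lookahead to the start of the input.
anchored_abc = re.search(r"^(?!abc).*", "abc")
anchored_def = re.search(r"^(?!abc).*", "def").group()
```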

📝 Octal and Backreference Ambiguity regex

🟡 intermediate ⭐⭐⭐

Patterns where \1 may mean either an octal escape or a backreference

⏱️ 8 min 🏷️ ambiguity, compatibility, medium

# Octal Ambiguity Samples
# Patterns where \1, \2 etc. have ambiguous meanings
# Risk Level: MEDIUM - Cross-compiler compatibility issues

# --- \0 Ambiguity ---

# Pattern: \0
# Problem: \0 can mean null character or the start of an octal escape
# In JavaScript: Null character when not followed by a digit
# In Python: Octal 000 (null char)
\0

# Fix: Use \x00 for null character (more explicit)

# Pattern: \01
# Problem: Reads like a backreference to group 1 but has a leading zero
# Python, PCRE: Always octal 001
# JavaScript: Legacy octal escape, a syntax error in Unicode mode
\01

# Fix: Use \x01 for the character or \1 for the backreference

# --- Backreference vs Octal ---

# Pattern: (.)\1
# Problem: Is \1 backreference to group 1 or octal?
# With group: Backreference
# Without group: Octal (in some engines) or an error (Python)
(.)\1

# Fix: Use a named group: (?<ch>.)\k<ch>, or (?P<ch>.)(?P=ch) in Python

# Pattern: \10
# Problem: Could be backreference to group 10 or octal 010
# Depends on number of capturing groups
\10

# Fix: Use \g{10} for clarity (Perl, PCRE)

# --- Leading Zeros ---

# Pattern: \01 in (a)\01
# Problem: Looks like a backreference, but the leading zero makes it octal in Python and PCRE
(a)\01  # Matches "a" followed by U+0001, not "aa"

# Fix: \1 (or \g{1} in Perl/PCRE); avoid leading zeros

# Pattern: \001
# Problem: Definitely octal, but easy to misread
\001  # Octal 001 = decimal 1

# Fix: \x01 for hex escape

# --- Octal in Character Classes ---

# Pattern: [\01]
# Problem: Octal in character class
# Matches: character with octal value 001 (U+0001, not the null char)
[\01]

# Fix: [\x01] for clarity

# Pattern: [\0-\7]
# Problem: Octal range
# Matches: characters from null to bell (0-7 decimal)
[\0-\7]

# Fix: Use \x00-\x07 or explicit characters

# --- Cross-Engine Differences ---

# In JavaScript:
\1  # Backreference if group 1 exists; otherwise legacy octal 001 (error in Unicode mode)

# In Python:
\1  # Backreference to group 1 (error if group 1 doesn't exist)
\01  # Always octal 001

# In PCRE:
\1  # Backreference
\01  # Always octal 001
\g1  # Unambiguous backreference
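
The Python behaviour can be verified directly. This sketch covers only Python's `re` module; other engines resolve the same escapes differently.

```python
import re

# \1 after a group is a backreference.
backref = re.fullmatch(r"(.)\1", "aa") is not None

# A leading zero always means octal in Python, even when group 1 exists.
octal_after_group = re.fullmatch(r"(a)\01", "a\x01") is not None
not_a_backref = re.fullmatch(r"(a)\01", "aa") is None

# \1 with no group 1 is rejected at compile time.
try:
    re.compile(r"\1")
    bare_backref_ok = True
except re.error:
    bare_backref_ok = False

# \0 is the null character; inside a class, \01 is U+0001.
null_char = re.fullmatch(r"\0", "\x00") is not None
class_octal = re.fullmatch(r"[\01]", "\x01") is not None

# Named groups sidestep the numbering question entirely.
named = re.fullmatch(r"(?P<ch>.)(?P=ch)", "bb") is not None
```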

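📝 Catastrophic Backtracking Patterns regex

🔴 complex ⭐⭐⭐⭐⭐

Regular expression patterns that can cause exponential backtracking and enable denial-of-service attacks

⏱️ 15 min 🏷️ security, critical, redos, performance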

# Catastrophic Backtracking (ReDoS) Samples
# These patterns can cause exponential time complexity on non-matching input
# Unanchored fragments such as (a+)+ only blow up when something after them must fail, e.g. (a+)+$
# Risk Level: CRITICAL - Can cause DoS attacks

# --- Nested Quantifiers ---

# Pattern: (a+)+
# Problem: Nested quantifiers create exponential backtracking
# Dangerous Input: aaaaaaaaaaaaaaaaaaaaX (20+ 'a's followed by non-matching char)
(a+)+

# Pattern: (.*?)+
# Problem: Lazy quantifier nested in greedy quantifier
# Dangerous Input: Any input that doesn't fully match
(.*?)+

# Pattern: (.*)+
# Problem: Nested greedy quantifiers
# Dangerous Input: Input with partial match at end
(.*)+

# Pattern: ^(a+)+$
# Problem: Anchored nested quantifiers
# Dangerous Input: aaaaaaaaaaaaaaaaaaaaX
^(a+)+$

# Pattern: ^((a+)+)+$
# Problem: Triple nested quantifiers - extremely dangerous
# Dangerous Input: aaaaaaaaaaaaaaaaaaaaX
^((a+)+)+$

# --- Overlapping Alternatives ---

# Pattern: (a|a)+
# Problem: Identical alternatives in quantifier
# Dangerous Input: aaaaaaaaaaaaaaaaX
(a|a)+

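# Pattern: (ab|abc)+
# Problem: Alternatives share a prefix, so each failure retries the longer branch
# Note: Far milder than (a|a)+, since no input can be split in two different ways
# Dangerous Input: ababababababababX
(ab|abc)+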

# Pattern: (\d|\d\d)+
# Problem: Prefix relationship between alternatives
# Dangerous Input: 12345678901234567890X
(\d|\d\d)+

# --- Exponential Patterns ---

# Pattern: (a|b|c)*x
# Problem: Repeated group before a required literal; quadratic rather than exponential when searching
# Dangerous Input: aaaaaaaaaaaaaaaaaaaa (without x)
(a|b|c)*x

# Pattern: ^(a*)*$
# Problem: Nested star quantifier
# Dangerous Input: aaaaaaaaaaaaaaaaaaaaX
^(a*)*$

# Pattern: .*=(.*).*=(.*).*
# Problem: Multiple backtracking points
# Dangerous Input: a=bbbbbbbbbbbbbbbbbbbb=c=dddddddddddddddddddd
.*=(.*).*=(.*).*
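
Why nested quantifiers explode can be seen by counting how many ways the pattern can carve up the same input. In the worst case, a backtracking engine may try each of these splits before reporting failure. The sketch below is a counting model in Python, not a measurement of any engine, and the function names are hypothetical.

```python
import re
from functools import lru_cache

def splits_nested_plus(n: int) -> int:
    """Ways (a+)+ can divide a run of n 'a' characters into non-empty groups."""
    # Each of the n - 1 gaps between characters either ends a group or does not.
    return 2 ** (n - 1) if n > 0 else 0

@lru_cache(maxsize=None)
def splits_one_or_two(n: int) -> int:
    """Ways to consume n digits in pieces of length one or two."""
    if n < 0:
        return 0
    if n == 0:
        return 1
    return splits_one_or_two(n - 1) + splits_one_or_two(n - 2)

# A single quantifier accepts the same strings without the ambiguity.
SAFE = re.compile(r"a+")
rejected_quickly = SAFE.fullmatch("a" * 20 + "X") is None
```

For 20 characters the nested form has over half a million candidate splits, while the one-or-two-digit alternation grows like the Fibonacci sequence. Rewriting to a single unambiguous quantifier removes the choice altogether.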
