Regex

Basic regex patterns and rules for pattern matching and text processing

0. Regular Expression

Match, Find or Manage Text

Rules

By default, a pattern matches wherever it occurs in the text
. -> matches any single character (in most engines, anything except a line break)
[...] = Character Sets
- [abc] -> matches one of the characters between [], in this case a, b, or c
- [^abc] -> the opposite: matches any one character that is NOT listed inside [^...], i.e. anything except a, b, or c
- [a-c] -> a range: matches any one character from a to c (inclusive)
Repetitions -> + * ?
- e* -> matches '', 'e', 'ee', 'eee', ... -> # of 'e' => 0 or more
- e+ -> matches 'e', 'ee', 'eee', ... -> # of 'e' => 1 or more
- e? -> 'e' is optional -> # of 'e' => 0 or 1
{n} -> n is an integer; the character (or group) in front of {n} must repeat exactly n times
- e{n,} -> 'e' must repeat at least n times
- e{n,m} -> 'e' must repeat between n and m times (inclusive)
Capture Group -> ( ... )
- \1, \2, ... -> backreference: matches the same text that the nth group captured
- (?:...) -> non-capturing group; it is not numbered, so it can't be referenced with \n
- (c|r) -> captures c or r
- | -> alternation; it doesn't have to be inside ()
Line starts, ends -> ^ (start), $ (end)
Word Character -> \w
- letters, digits, and underscore
- no dots, colons, or other special marks (such as ? or !)
- \W -> the opposite: matches any character that is not a word character
Digit Character -> \d
Space Character -> \s (space, tab, line break)
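
A quick way to see what \w, \W, \d, and \s pick up is to run them over a short string. This is a minimal sketch that assumes Python's standard `re` module; the sample text is made up for illustration.

```python
import re

text = "user_1: ok?"  # made-up sample text

# \w+ grabs runs of letters, digits, and underscores
words = re.findall(r"\w+", text)       # ['user_1', 'ok']

# \W matches every character that \w does not
non_words = re.findall(r"\W", text)    # [':', ' ', '?']

# \d matches one digit, \s matches one whitespace character
digits = re.findall(r"\d", text)       # ['1']
spaces = re.findall(r"\s", text)       # [' ']

print(words, non_words, digits, spaces)
```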
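Lookahead -> checks what comes next without including it in the match
- positive: \d+(?=ab) -> matches digits only if 'ab' follows them
- negative: \d+(?!ab) -> matches digits only if 'ab' does not follow them
Lookbehind -> checks what comes before without including it in the match
- positive: (?<=\$)\d+ -> matches digits that come right after '$' (the $ must be escaped, otherwise it means end of line)
- negative: (?<!\$)\d+ -> matches digits that do not come right after '$'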
Flags
- global flag: find all matches instead of stopping at the first -> //g
- multiline flag: ^ and $ match at the start and end of each line -> //m
- case-insensitive flag -> //i
- all of the above can be combined -> //gm, //gmi
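
The flags can be tried out as follows. This sketch assumes Python's `re` module, which has no global flag: `re.findall` returns every match and so plays the role of //g here, while `re.IGNORECASE` and `re.MULTILINE` correspond to //i and //m. The sample text is made up.

```python
import re

text = "Cat\ncat\ndog"  # three lines, made-up sample

# No flags: ^ only matches at the very start of the text, and case matters
no_flags = re.findall(r"^cat", text)                     # []

# Case-insensitive: 'Cat' on the first line now matches
ignore_case = re.findall(r"^cat", text, re.IGNORECASE)   # ['Cat']

# Multiline: ^ matches at the start of every line
multiline = re.findall(r"^cat", text, re.MULTILINE)      # ['cat']

# Flags are combined with | in Python
combined = re.findall(r"^cat", text, re.IGNORECASE | re.MULTILINE)  # ['Cat', 'cat']

print(no_flags, ignore_case, multiline, combined)
```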
Greedy & lazy matching
- .*r -> greedy: matches as much as possible, so it runs up to the last r on the line
- .*?r -> lazy: matches as little as possible, so it stops at the first r
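
The difference between greedy and lazy matching is easiest to see on one string with several r's. A minimal sketch, assuming Python's `re` module and a made-up sample text:

```python
import re

text = "ber beer beeer"  # made-up sample

# Greedy: .* takes as much as it can, so the match runs to the LAST r
greedy = re.search(r".*r", text).group()    # 'ber beer beeer'

# Lazy: .*? takes as little as it can, so the match stops at the FIRST r
lazy = re.search(r".*?r", text).group()     # 'ber'

# Repeating the lazy pattern yields one short match per r
lazy_all = re.findall(r".*?r", text)        # ['ber', ' beer', ' beeer']

print(greedy, lazy, lazy_all)
```

Both patterns start at the same position; only the amount of text that `.*` is willing to consume changes.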

# 3.
b[aei]r # matches 'bar', 'ber', 'bir'
b[^aei]r # does not match 'bar', 'ber', 'bir'; matches e.g. 'bor', 'bur'
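
The character-set examples above can be checked like this. It is a sketch assuming Python's `re` module; the word list and the extra range pattern `b[a-e]r` are illustrative additions.

```python
import re

text = "bar ber bir bor bur"  # made-up sample

# One of a, e, i between b and r
in_set = re.findall(r"b[aei]r", text)       # ['bar', 'ber', 'bir']

# Any single character EXCEPT a, e, i between b and r
not_in_set = re.findall(r"b[^aei]r", text)  # ['bor', 'bur']

# A range: any single character from a to e
in_range = re.findall(r"b[a-e]r", text)     # ['bar', 'ber']

print(in_set, not_in_set, in_range)
```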

# 4.
be*r # 'br', 'ber', 'beer'
be+r # 'ber', 'beer'
colou?r # 'color', 'colour'
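
The repetition examples above, run against sample text. A minimal sketch assuming Python's `re` module; the sample strings (including the deliberately misspelled 'colouur') are made up.

```python
import re

text = "br ber beer"  # made-up sample

# * allows zero or more 'e'
star = re.findall(r"be*r", text)    # ['br', 'ber', 'beer']

# + requires at least one 'e'
plus = re.findall(r"be+r", text)    # ['ber', 'beer']

# ? makes the 'u' optional, but allows at most one
optional = re.findall(r"colou?r", "color colour colouur")  # ['color', 'colour']

print(star, plus, optional)
```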

# 5.
[0-9]{4} # four digits in a row: 1111, 1234, 5021, ...
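
The counted forms {n}, {n,}, and {n,m} can be compared side by side. This sketch assumes Python's `re` module and uses `re.fullmatch` so that the whole string has to fit the pattern; the candidate strings are made up.

```python
import re

candidates = ["12", "123", "1234", "12345"]  # made-up sample

# {4}: exactly four digits
exactly_four = [s for s in candidates if re.fullmatch(r"[0-9]{4}", s)]    # ['1234']

# {3,}: three or more digits
at_least_three = [s for s in candidates if re.fullmatch(r"[0-9]{3,}", s)]  # ['123', '1234', '12345']

# {2,3}: between two and three digits, inclusive
two_to_three = [s for s in candidates if re.fullmatch(r"[0-9]{2,3}", s)]   # ['12', '123']

# Without fullmatch, {4} also matches inside a longer run of digits
inside_longer = re.findall(r"[0-9]{4}", "12345")                            # ['1234']

print(exactly_four, at_least_three, two_to_three, inside_longer)
```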

# 6.
(ha)-\1, (haa)-\2 # 'ha-ha, haa-haa'
(?:ha)-ha, (haa)-\1 # 'ha-ha, haa-haa', (?:ha) group will not be captured
(c|r)at|dog # 'cat', 'rat', 'dog'
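
The group examples above can be inspected to see which groups actually get captured. A minimal sketch assuming Python's `re` module; `re.finditer` is used for the alternation because `re.findall` would return only the group contents.

```python
import re

text = "ha-ha, haa-haa"

# Two capturing groups: \1 repeats 'ha', \2 repeats 'haa'
both = re.search(r"(ha)-\1, (haa)-\2", text)
both_groups = both.groups()            # ('ha', 'haa')

# (?:ha) is not captured, so (haa) becomes group 1
one = re.search(r"(?:ha)-ha, (haa)-\1", text)
one_groups = one.groups()              # ('haa',)

# (c|r)at|dog: either 'cat'/'rat' via the group, or 'dog' via the outer |
animals = [m.group() for m in re.finditer(r"(c|r)at|dog", "cat rat dog bat")]
# ['cat', 'rat', 'dog']

print(both_groups, one_groups, animals)
```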

# 7.
^[0-9] # any line that starts with a digit (0-9)
html$ # any line that ends with 'html'
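
The two anchors can be used to filter lines. This sketch assumes Python's `re` module and tests each line separately, so no multiline flag is needed; the file names are made up.

```python
import re

lines = ["1 intro.html", "notes.html", "2 todo.txt"]  # made-up sample

# ^ pins the match to the start of the line
starts_with_digit = [line for line in lines if re.search(r"^[0-9]", line)]
# ['1 intro.html', '2 todo.txt']

# $ pins the match to the end of the line
ends_with_html = [line for line in lines if re.search(r"html$", line)]
# ['1 intro.html', 'notes.html']

print(starts_with_digit, ends_with_html)
```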

# 11. 12.
\d+(?=PM) # the digits in '3PM', '4PM' (PM must follow the digits directly, with no space)
(?<=\$)\d+ # the digits in '$1', '$100' (no space after the $)
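
Lookahead and lookbehind check the surrounding text but leave it out of the result. A minimal sketch assuming Python's `re` module; the sample strings are made up, and single-digit numbers are used in the negative cases to keep the results easy to follow.

```python
import re

# Positive lookahead: digits followed directly by 'PM'; 'PM' is not part of the match
before_pm = re.findall(r"\d+(?=PM)", "3PM 4PM 5AM")     # ['3', '4']

# Negative lookahead: digits NOT followed by 'PM'
not_before_pm = re.findall(r"\d+(?!PM)", "3PM 5AM")     # ['5']

# Positive lookbehind: digits right after a literal '$' (note the escaped \$)
after_dollar = re.findall(r"(?<=\$)\d+", "$1 $100 200")  # ['1', '100']

# Negative lookbehind: digits NOT right after '$'
not_after_dollar = re.findall(r"(?<!\$)\d+", "$5 7")     # ['7']

print(before_pm, not_before_pm, after_dollar, not_after_dollar)
```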
