Must-know regex

Posted by datahunter on Thu, 16/06/2011 - 23:50

Last updated: 2019-05-11

Introduction

regex is short for "regular expression". POSIX defines two syntaxes:

Basic Regular Expressions
Extended Regular Expressions

Extended supports more operators than Basic, such as "?", "+" and "|"

Operator: =~

Pattern Matching In Bash

Matching enabled by default

* Match zero or more characters
? Match any single character
[...] Match any one of the characters in a set. Supports "-" (range), plus "!" and "^" (negation)

e.g.

touch .txt a.txt b.txt c.txt d.txt aa.txt bb.txt aaa.txt aba.txt abc.txt

ls *.txt # every file ending in .txt, except the hidden file ".txt"

ls ??.txt # aa.txt bb.txt

ls [ac].txt # a.txt c.txt

ls [a-c].txt # a.txt b.txt c.txt

ls [!a-c].txt # d.txt

ls [^a].txt # b.txt c.txt d.txt

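Notes

Files whose names start with a dot "." are treated as hidden files.

*.txt expands to all the filenames in the current directory that end with .txt, but this expansion does not include hidden files.

To include the hidden ".txt" as well, use brace expansion: {,*}.txt expands to ".txt *.txt" before the glob is evaluated.

ls {,*}.txt

cat {,*}.txt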
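The difference between the plain glob and the brace-expansion form can be checked in a throwaway directory. This is a minimal sketch, assuming Bash with default options (dotglob off); the variable names are only for illustration.

```shell
#!/usr/bin/env bash
# Work in a temporary directory so nothing real is touched.
demo_dir=$(mktemp -d) || exit 1
cd "$demo_dir" || exit 1
touch .txt a.txt b.txt

# Plain glob: the hidden file ".txt" is skipped.
plain=$(echo *.txt)

# Brace expansion runs first and yields ".txt *.txt",
# so the hidden file is named literally and is included.
with_hidden=$(echo {,*}.txt)

echo "plain:       $plain"
echo "with_hidden: $with_hidden"

# Clean up the temporary directory.
cd / && rm -rf "$demo_dir"
```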

extglob

When enabled, Bash recognizes 5 additional pattern-matching operators

shopt -s extglob

# Check the current setting

shopt extglob

extglob         on

# Disable

shopt -u extglob

# pattern-list is one or more patterns separated by "|"
# The first 4 differ only in how many occurrences they allow

*(pattern-list) Match zero or more occurrences of the patterns
?(pattern-list) Match zero or one occurrence of the patterns
+(pattern-list) Match one or more occurrences of the patterns
@(pattern-list) Match exactly one of the patterns
# Negation
!(pattern-list) Match anything that doesn't match one of the patterns

e.g.

touch a.jpg b.txt c.txt d.zip

ls *(*.jpg|*.txt)

a.jpg b.txt c.txt

ls !(*.jpg|*.txt)

d.zip
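The same file set can be used to compare the operators side by side. The sketch below assumes Bash and collects matches into arrays (the array names are illustrative). Note that extglob has to be switched on before Bash parses a line that contains one of these operators.

```shell
#!/usr/bin/env bash
# Enable extglob first: the parser must know about it
# before it reads a line containing @(...), !(...) and so on.
shopt -s extglob

demo_dir=$(mktemp -d) || exit 1
cd "$demo_dir" || exit 1
touch a.jpg b.txt c.txt d.zip

images_or_text=( @(*.jpg|*.txt) )    # exactly one of the two patterns
everything_else=( !(*.jpg|*.txt) )   # names matching neither pattern
b_or_c_text=( +([bc]).txt )          # one or more of "b"/"c", then ".txt"

echo "images_or_text:  ${images_or_text[*]}"
echo "everything_else: ${everything_else[*]}"
echo "b_or_c_text:     ${b_or_c_text[*]}"

cd / && rm -rf "$demo_dir"
```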

=====

dotglob

If set, Bash includes filenames beginning with a "." in the results of filename expansion.

The filenames '.' and '..' must always be matched explicitly, even if dotglob is set.

When it is disabled, the set does not include any filenames beginning with "." unless the pattern or sub-pattern begins with a '.'
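A short sketch of the effect of dotglob, assuming Bash; the file names ".hidden" and "visible" are made up for the demonstration.

```shell
#!/usr/bin/env bash
demo_dir=$(mktemp -d) || exit 1
cd "$demo_dir" || exit 1
touch .hidden visible

shopt -u dotglob
without_dotglob=( * )   # the hidden file is skipped

shopt -s dotglob
with_dotglob=( * )      # the hidden file is included; "." and ".." still are not

echo "dotglob off: ${without_dotglob[*]}"
echo "dotglob on:  ${with_dotglob[*]}"

shopt -u dotglob
cd / && rm -rf "$demo_dir"
```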

=====

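Basic Regular Expressions

"?", "+", "{", "|", "(", and ")" lose their special meaning and match themselves

To get the Extended behaviour, prefix them with "\" (e.g. "\(" and "\{"); "\?", "\+" and "\|" are GNU extensions

Extended Regular Expressions

"?", "+", "{", "|", "(", and ")" are special without a backslash

To match a special character literally, escape it with "\"

.   *   (    \   ?
        is written as
\.  \*  \(   \\  \?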
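The difference is easy to see by running the same pattern through grep (BRE by default) and grep -E (ERE). A minimal sketch; the sample lines are invented for the demonstration.

```shell
#!/usr/bin/env bash
lines=$'ab\naab\na+b'

# BRE: "+" is an ordinary character, so only the literal text "a+b" matches.
bre_matches=$(printf '%s\n' "$lines" | grep 'a+b')

# ERE: "+" means "one or more", so "ab" and "aab" match, but "a+b" does not.
ere_matches=$(printf '%s\n' "$lines" | grep -E 'a+b')

# ERE with the "+" escaped: back to matching the literal text.
ere_literal=$(printf '%s\n' "$lines" | grep -E 'a\+b')

printf 'BRE a+b:\n%s\n' "$bre_matches"
printf 'ERE a+b:\n%s\n' "$ere_matches"
printf 'ERE a\\+b:\n%s\n' "$ere_literal"
```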

Quantifiers:

? Zero or one # {0,1} (directly after "(", as in "(?=", it starts a special group instead of quantifying)

+ one or more # {1,}

* Zero or more # {0,}

{n} Exactly n

{n,} n or more

{m,n} Between m and n

e.g.

\d{3} # exactly three digits
\w{1,4} # one to four word characters
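The interval forms can be tried with grep -E. ERE has no \d, so the sketch below uses [0-9] in its place, and -x forces the pattern to match the whole line. The sample numbers are arbitrary.

```shell
#!/usr/bin/env bash
samples=$'7\n42\n123\n4567'

exactly_three=$(printf '%s\n' "$samples" | grep -xE '[0-9]{3}')    # {n}
two_or_more=$(printf '%s\n' "$samples" | grep -xE '[0-9]{2,}')     # {n,}
one_to_two=$(printf '%s\n' "$samples" | grep -xE '[0-9]{1,2}')     # {m,n}

printf '{3}:\n%s\n' "$exactly_three"
printf '{2,}:\n%s\n' "$two_or_more"
printf '{1,2}:\n%s\n' "$one_to_two"
```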

Character:

. # Any character except newline (\r or \n)

[a-z] # "a", "b" through "z"

[abc] # Character a, b, or c (no commas: "[a,b,c]" would also match ",")

\ # Escape character

\d # shorthand for character class matching digits # [0-9]

\w # shorthand for “word character” # [A-Za-z0-9_]

\s, \S # \s matches whitespace (space, tab, line break); \S is the opposite of \s

\t # tab (a subset of \s)

\r, \n # carriage return, line feed

Ignore case sensitivity

(?i)STRING # modifier

e.g.

((?i)LOGIN|PLAIN|(?:CRAM|DIGEST)-MD5)

[a-bA-B]* # the same effect without a modifier: list both cases in the class

Anchors:

$ end of the line

^ start of the line

logic:

[^abc] not a, b or c # single character

| OR

Grouping:

(regex) # Capture expr for use with \1, \2, \3 ...

(?:regex) # Non-capturing group.
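As a rough illustration of numbered captures, the sketch below reorders a date with sed -E backreferences and then reads the same groups from Bash's BASH_REMATCH array. The date string and variable names are assumptions for the example. POSIX ERE has no (?:...) form, so only capturing groups appear here.

```shell
#!/usr/bin/env bash
iso_date='2011-06-16'

# Three capturing groups: \1 = year, \2 = month, \3 = day.
dmy_date=$(printf '%s\n' "$iso_date" |
    sed -E 's#([0-9]{4})-([0-9]{2})-([0-9]{2})#\3/\2/\1#')

# The same groups via [[ =~ ]]: BASH_REMATCH[0] is the whole match,
# BASH_REMATCH[1..3] are the groups.
date_re='^([0-9]{4})-([0-9]{2})-([0-9]{2})$'
if [[ $iso_date =~ $date_re ]]; then
    year=${BASH_REMATCH[1]}
    month=${BASH_REMATCH[2]}
    day=${BASH_REMATCH[3]}
fi

echo "$dmy_date"
echo "year=$year month=$month day=$day"
```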

Non-capturing group

The engine uses it to match the text but does not store what it matched, so it gets no backreference number

In the nginx example below, the group is not captured.

e.g. nginx setting

location ~* /(?:uploads|files)/.*\.php$ {
    deny all;
}

Use it when you need "OR" only for grouping, e.g. to match "uploads" or "files"

Performance of capturing and non-capturing group

* Non-capturing groups require less memory allocation

* The difference is typically very small for simple, short expressions with short matches.

time grep -ciP '(get|post).*' sample.log

time grep -ciP '(?:get|post).*' sample.log

lookahead

Lookahead (?=S) and lookbehind (?<=S) both belong to the lookaround family.

(?=regex) - positive lookahead
(?!regex) - negative lookahead
(?<=string) - positive lookbehind
(?<!string) - negative lookbehind

* Although it is wrapped in "()", a lookaround is not a capturing group

* A lookahead can contain any regex, but in most engines a lookbehind can contain only a fixed-length string

(?=(regex)) # correct: the inner group captures the text the lookahead saw
((?=regex)) # wrong: the group captures an empty string, because a lookahead consumes nothing

e.g.

# Match "A" followed by "B"

A(?=B)

# Match "A" not followed by "B"

A(?!B)

# Deny any "." folder except .well-known

<DirectoryMatch "\/\.(?!well-known)">
    Require all denied
</DirectoryMatch>

Notes

Most engines only support fixed-length strings in a lookbehind (unlike lookahead, which accepts any regex)

The reason is that the engine cannot tell how many characters to step back before checking
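The lookaround forms in this section can be tried with grep -P. This is a sketch that assumes GNU grep built with PCRE support. The sample strings and the /var/www paths are hypothetical.

```shell
#!/usr/bin/env bash
# Assumes GNU grep with the -P (PCRE) option available.
pairs=$'AB\nAC\nA'

# Lines where an "A" is followed by "B".
followed_by_b=$(printf '%s\n' "$pairs" | grep -P 'A(?=B)')

# Lines where an "A" is not followed by "B".
not_followed_by_b=$(printf '%s\n' "$pairs" | grep -P 'A(?!B)')

# -o prints only the matched text: the lookahead checks "B" but does not consume it.
matched_text=$(printf 'AB\n' | grep -oP 'A(?=B)')

# Same idea as the DirectoryMatch rule: dot-directories except .well-known.
paths=$'/var/www/.git\n/var/www/.well-known\n/var/www/public'
denied=$(printf '%s\n' "$paths" | grep -P '/\.(?!well-known)')

# Fixed-length lookbehind: digits that come right after "id=".
id_value=$(printf 'user id=42\n' | grep -oP '(?<=id=)[0-9]+')

echo "followed_by_b:     $followed_by_b"
echo "not_followed_by_b: $not_followed_by_b"
echo "matched_text:      $matched_text"
echo "denied:            $denied"
echo "id_value:          $id_value"
```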

Word boundary

\b Matches the empty string at the edge of a word. (word boundary)

e.g.

(?i)\bwww\.example\.com\b

"\<" & "\>" (GNU grep/sed syntax)

"\<" matches the empty string at the beginning of a word

"\>" matches the empty string at the end of a word

e.g.

"\<word" matches "word" only where it starts a word.

"word\>" matches "word" only where it ends a word.

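The word-boundary anchors can be compared on a small word list. The sketch assumes a grep that supports the \< and \> extensions (GNU grep does); the word list is arbitrary.

```shell
#!/usr/bin/env bash
words=$'word\nsword\nwords\npassword'

# "word" at the beginning of a word.
starts_word=$(printf '%s\n' "$words" | grep '\<word')

# "word" at the end of a word.
ends_word=$(printf '%s\n' "$words" | grep 'word\>')

# Both anchors together: "word" as a whole word.
whole_word=$(printf '%s\n' "$words" | grep '\<word\>')

printf 'starts a word:\n%s\n' "$starts_word"
printf 'ends a word:\n%s\n' "$ends_word"
printf 'whole word:\n%s\n' "$whole_word"
```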
POSIX Character Classes

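Common misconception

[:space:]

on its own is just a bracket expression containing the characters ":", "s", "p", "a", "c", "e". The class name only works inside another pair of brackets: [[:space:]]

\s is part of PCRE (Perl Compatible Regular Expressions).

It is not part of POSIX BRE (Basic Regular Expressions) or ERE (Extended Regular Expressions), the flavor that Bash's =~ uses.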

"[["

Inside the double-bracket test [[ ]], the =~ operator uses ERE

string='xyz 123'

if [[ $string =~ [[:space:]] ]];
then
        echo "whitespace"
fi

[:space:] matches whitespace characters (space, tab, newline, carriage return, form feed, vertical tab) <-- [ \t\n\r\f\v]

[:alnum:] matches alphabetic or numeric characters. In the C locale this is equivalent to A-Za-z0-9

[:alpha:] matches alphabetic characters. In the C locale this is equivalent to A-Za-z

[:blank:] matches a space or a tab
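The classes combine like any other bracket-expression content. Below is a small sketch with two helper functions (is_identifier and has_blank are hypothetical names, not part of Bash). The patterns are kept in variables, which avoids quoting surprises inside [[ ]].

```shell
#!/usr/bin/env bash
# True if the argument starts with a letter and continues with letters or digits.
is_identifier() {
    local re='^[[:alpha:]][[:alnum:]]*$'
    [[ $1 =~ $re ]]
}

# True if the argument contains a space or a tab.
has_blank() {
    local re='[[:blank:]]'
    [[ $1 =~ $re ]]
}

for candidate in abc123 1abc 'xyz 123'; do
    if is_identifier "$candidate"; then
        echo "'$candidate' is an identifier"
    elif has_blank "$candidate"; then
        echo "'$candidate' contains a blank"
    else
        echo "'$candidate' is neither"
    fi
done
```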

(?m)

The (?flags) construct sets matching options inline, such as case-insensitivity, multiline mode, greediness, etc.

(?m) turns on multiline mode: "^" and "$" match at the start and end of every line, not only at the start and end of the whole input. To let "." match across line breaks, use (?s).

(?i)param # case-insensitivity

Look-ahead and look-behind operations

(?=.*word1)(?=.*word2)(?=.*word3) # matches only if all three words appear, in any order

The syntax is X(?=Y), which means "look for X, but match only if it is followed by Y"

<DirectoryMatch "^\.|\/\.(?!well-known)">
    Require all denied
</DirectoryMatch>

Negative lookahead: X(?!Y) means "look for X, but match only if it is not followed by Y"

a(?!b)
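Chaining lookaheads at the start of a pattern acts like a logical AND, and a negative lookahead acts like NOT. A sketch assuming GNU grep with -P; the log lines are invented.

```shell
#!/usr/bin/env bash
# Assumes GNU grep with the -P (PCRE) option available.
log=$'error disk full\ndisk error\nnetwork error\ndisk ok'

# Lines containing both "error" and "disk", in either order.
both_words=$(printf '%s\n' "$log" | grep -P '^(?=.*error)(?=.*disk)')

# Lines that do not contain "error" anywhere.
no_error=$(printf '%s\n' "$log" | grep -P '^(?!.*error)')

printf 'both words:\n%s\n' "$both_words"
printf 'no error:\n%s\n' "$no_error"
```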

Tips

Tip 1

# Match http/https URLs on a.net or b.net, including any subdomain

^https?://(?:.+\.)?(?:a\.net|b\.net)$
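One way to sanity-check a pattern like this is to run it against a few URLs. The sketch below uses a form of the pattern with the dots in the domain names escaped, so that "." cannot stand for an arbitrary character. It assumes GNU grep with -P, and is_allowed_origin is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Assumes GNU grep with the -P (PCRE) option available.
origin_re='^https?://(?:.+\.)?(?:a\.net|b\.net)$'

# Succeeds (exit status 0) when the URL matches the pattern.
is_allowed_origin() {
    printf '%s\n' "$1" | grep -qP "$origin_re"
}

for url in https://a.net http://www.b.net https://axnet https://a.net.evil.com; do
    if is_allowed_origin "$url"; then
        echo "match:    $url"
    else
        echo "no match: $url"
    fi
done
```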

Other

A good tester

A good tutorial
