Last updated: 2019-05-11
Contents
- Pattern Matching In Bash
- Basic Regular Expressions
- Extended Regular Expressions
- Tips
- Other
Introduction
regex is short for regular expression. It comes in two POSIX flavors:
- Basic Regular Expressions
- Extended Regular Expressions
As the names suggest, Extended supports more operators than Basic
Operator: =~
Pattern Matching In Bash
Matching enabled by default
- * Match zero or more characters
- ? Match any single character
- [...] Match any of the characters in a set. Supports "-" for ranges and "!" or "^" for negation
e.g.
touch .txt a.txt b.txt c.txt d.txt aa.txt bb.txt aaa.txt aba.txt abc.txt
ls *.txt # does not include .txt
ls ??.txt # aa.txt bb.txt
ls [ac].txt # a.txt c.txt
ls [a-c].txt # a.txt b.txt c.txt
ls [!a-c].txt # d.txt
ls [^a].txt # b.txt c.txt d.txt
Notes
Files that start with a dot "." are considered hidden files.
*.txt expands to all the filenames in the current directory that end with .txt,
but this expansion does not include hidden files. To include the hidden ".txt" as well:
ls {,*}.txt
cat {,*}.txt
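The difference between the two expansions can be checked with a small sketch like the following. It assumes Bash with default options (dotglob off) and works in a throwaway directory; the names demo_dir, visible and with_hidden are made up for the illustration.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so the demo does not touch real files.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch .txt a.txt b.txt

# Plain glob: the hidden file ".txt" is skipped.
visible=(*.txt)

# Brace expansion first produces the two words ".txt" and "*.txt",
# so the hidden file is named explicitly and the rest are globbed.
with_hidden=({,*}.txt)

echo "visible:     ${visible[*]}"
echo "with_hidden: ${with_hidden[*]}"

# Clean up.
cd - >/dev/null && rm -rf "$demo_dir"
```

With only those three files present, the first array should hold a.txt and b.txt, and the second should also hold .txt.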
extglob
When enabled, Bash recognizes five additional pattern-matching operators
shopt -s extglob
# Check status
shopt extglob
extglob on
# Disable
shopt -u extglob
# The following four differ only slightly (pattern-list is one or more patterns separated by "|")
- *(pattern-list) Match zero or more occurrences of the patterns
- ?(pattern-list) Match zero or one occurrence of the patterns
- +(pattern-list) Match one or more occurrences of the patterns
- @(pattern-list) Match exactly one of the patterns
# Negation
- !(pattern-list) Match anything that doesn't match any of the patterns
e.g.
touch a.jpg b.txt c.txt d.zip
ls *(*.jpg|*.txt)
a.jpg b.txt c.txt
ls !(*.jpg|*.txt)
d.zip
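The same operators also work in [[ ... == pattern ]] tests, which makes them easy to try without creating files. This is a minimal sketch under that assumption; the file names are just strings, and the array names matched and others are made up for the illustration.

```shell
#!/usr/bin/env bash
shopt -s extglob   # enable before the patterns below are parsed

matched=()
others=()
for name in a.jpg b.txt c.txt d.zip; do
    # @(...) matches exactly one of the alternatives.
    if [[ $name == *.@(jpg|txt) ]]; then
        matched+=("$name")
    fi
    # !(...) matches anything that does not match the listed patterns.
    if [[ $name == !(*.jpg|*.txt) ]]; then
        others+=("$name")
    fi
done

echo "matched: ${matched[*]}"
echo "others:  ${others[*]}"
```

Only d.zip should end up in others, mirroring the ls !(*.jpg|*.txt) result above.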
=====
dotglob
If set, Bash includes filenames beginning with a "." in the results of filename expansion.
The filenames '.' and '..' must always be matched explicitly, even if dotglob is set.
When it is disabled, the set does not include any filenames beginning with "." unless the pattern or sub-pattern begins with a '.'
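A quick way to see the effect of dotglob is to expand "*" twice in a scratch directory. This is only a sketch; the names demo_dir, without_dot and with_dot are made up for the illustration.

```shell
#!/usr/bin/env bash
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch .hidden visible

shopt -u dotglob
without_dot=(*)     # dot files are skipped

shopt -s dotglob
with_dot=(*)        # dot files are included; "." and ".." still are not
shopt -u dotglob

echo "dotglob off: ${without_dot[*]}"
echo "dotglob on:  ${with_dot[*]}"

cd - >/dev/null && rm -rf "$demo_dir"
```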
=====
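Basic Regular Expressions
In BRE, "?", "+", "{", "|", "(", and ")" lose their special meaning and match literally.
Prefixing such a character (e.g. "?") with "\" gives it the ERE meaning (\?, \+ and \| are GNU extensions).
Extended Regular Expressions
In ERE these characters are special without a backslash.
To match . * ( \ ? literally, write \. \* \( \\ \?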
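The BRE/ERE difference is easiest to see by feeding the same input to grep (BRE) and grep -E (ERE). A small sketch, with made-up variable names:

```shell
#!/usr/bin/env bash
# BRE: "+" is an ordinary character, so only the literal text "a+b" matches.
bre_literal=$(printf 'a+b\naab\n' | grep 'a+b')

# ERE: "+" is a quantifier, so "aab" matches and the literal "a+b" does not.
ere_quant=$(printf 'a+b\naab\n' | grep -E 'a+b')

# BRE interval: the braces need backslashes.
bre_interval=$(printf 'aa\naaa\n' | grep 'a\{3\}')

# ERE: escape "+" to match it literally.
ere_literal=$(printf 'a+b\naab\n' | grep -E 'a\+b')

echo "BRE 'a+b'     -> $bre_literal"
echo "ERE 'a+b'     -> $ere_quant"
echo "BRE 'a\{3\}'  -> $bre_interval"
echo "ERE 'a\+b'    -> $ere_literal"
```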
Quantifiers:
? zero or one # {0,1}; in PCRE, "(?=" is not a quantifier but opens a lookahead
+ one or more # {1,}
* Zero or more # {0,}
{n} Exactly n
{n,} n or more
{m,n} Between m and n
e.g.
- \d{3} # exactly three digits (\d is PCRE)
- \w{1,4} # one to four word characters
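Bash's =~ operator uses ERE, which has the interval quantifiers but not \d, so [0-9] is used instead. The following sketch assumes a hypothetical date format (YYYY-MM-DD); the names date_re and is_date are made up for the illustration.

```shell
#!/usr/bin/env bash
# Hypothetical pattern: an ISO-style date, YYYY-MM-DD.
date_re='^[0-9]{4}-[0-9]{2}-[0-9]{2}$'

is_date() {
    # Keep the regex in a variable and leave it unquoted on the right of =~.
    [[ $1 =~ $date_re ]]
}

for candidate in 2019-05-11 19-5-11; do
    if is_date "$candidate"; then
        echo "$candidate: match"
    else
        echo "$candidate: no match"
    fi
done
```

Quoting the right-hand side of =~ would turn it into a literal string, which is why the pattern lives in a variable.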
Character:
. # Any character except newline (\n; some engines also exclude \r)
[a-z] # "a", "b" through "z"
[abc] # Character a, b, or c (no commas: [a,b,c] would also match ",")
\ # Escape character
\d # shorthand for character class matching digits # [0-9]
\w # shorthand for “word character” # [A-Za-z0-9_]
\s, \S # whitespace (space, tab, newline, ...) | \S is the opposite of \s
\t # tab (a subset of \s)
\r, \n # carriage return, newline (line feed)
Case-insensitive matching
(?i)STRING # inline modifier (PCRE)
e.g.
((?i)LOGIN|PLAIN|(?:CRAM|DIGEST)-MD5)
[a-bA-B]* # without a modifier, list both cases in the class
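The inline modifier only exists in PCRE-style engines; with plain grep the equivalent is the -i flag. A sketch comparing the two, assuming GNU grep built with -P support (the input lines are made up):

```shell
#!/usr/bin/env bash
input=$'AUTH login\nAUTH PLAIN\nAUTH gssapi\n'

# PCRE inline modifier: (?i) switches on case-insensitive matching.
inline_count=$(printf '%s' "$input" | grep -cP '(?i)auth (login|plain)')

# ERE has no inline modifiers, so the -i flag does the same job.
flag_count=$(printf '%s' "$input" | grep -ciE 'auth (login|plain)')

echo "inline (?i): $inline_count matching lines"
echo "grep -i:     $flag_count matching lines"
```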
Anchors:
$ end of line
^ start of line
Logic:
[^abc] not a, b or c # single character
| OR
Grouping:
(regex) # Capture expr for use with \1, \2, \3 ...
(?:regex) # Non-capturing group.
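Non-capturing group
The parser uses it to group and match the text, but does not store what it matched,
so the group gets no number and cannot be referenced later as \1.
e.g. nginx setting
location ~* /(?:uploads|files)/.*\.php$ { deny all; }
Use it when you need "OR" (alternation) but not the captured text, e.g. to match "uploads" or "files"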
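For contrast, here is what a capturing group gives you in Bash. ERE has no (?:...) syntax, so every group in =~ captures and shows up in BASH_REMATCH. The path and the regex below are made up for the illustration.

```shell
#!/usr/bin/env bash
path='/uploads/shell.php'
re='^/(uploads|files)/(.*)\.php$'

if [[ $path =~ $re ]]; then
    dir=${BASH_REMATCH[1]}    # text matched by the first group
    file=${BASH_REMATCH[2]}   # text matched by the second group
    echo "dir=$dir file=$file"
fi
```

In a PCRE engine, writing the first group as (?:uploads|files) would leave only one captured group, the file name.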
Performance of capturing vs. non-capturing groups
* Non-capturing groups require less memory allocation
* The difference is typically very small for simple, short expressions with short matches.
time grep -ciP '(get|post).*' sample.log
time grep -ciP '(?:get|post).*' sample.log
Lookahead
Lookahead (?=S) and lookbehind (?<=S) both belong to the lookaround family.
- (?=regex) - positive lookahead
- (?!regex) - negative lookahead
- (?<=string) - positive lookbehind
- (?<!string) - negative lookbehind
* Although a lookaround is wrapped in "()", it is not itself a capturing group.
* A lookahead may contain any regex, but in most engines a lookbehind must be fixed-length.
* To capture what a lookahead matched, put the capturing group inside it:
- (?=(regex)) # correct: captures the text matched by regex
- ((?=regex)) # wrong: captures only the empty string
e.g.
# Match "A" followed by "B"
A(?=B)
# Match "A" not followed by "B"
A(?!B)
# Deny any "." folder except .well-known
<DirectoryMatch "\/\.(?!well-known)">
    Require all denied
</DirectoryMatch>
Notes
Most engines only support fixed-length patterns in a lookbehind (unlike lookahead, which accepts any regex).
The reason is that the engine cannot know how far to step back before checking.
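Lookarounds can be tried from the command line with grep -P. This sketch assumes GNU grep built with PCRE support; the paths and the key=value string are made up for the illustration.

```shell
#!/usr/bin/env bash
paths=$'/.git\n/.well-known\n/.htaccess\n/index.html\n'

# Negative lookahead: "/." not followed by "well-known".
denied=$(printf '%s' "$paths" | grep -P '/\.(?!well-known)')

# Positive lookbehind (fixed length): word characters preceded by "key=".
# -o prints only the matched part, and the lookbehind is not part of it.
value=$(printf 'key=value\n' | grep -oP '(?<=key=)\w+')

echo "denied paths:"
echo "$denied"
echo "value: $value"
```

The lookaround text itself is never included in the match, which is why -o prints only "value".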
Word boundary
\b Matches the empty string at the edge of a word. (word boundary)
e.g.
(?i)\bwww\.example\.com\b
"\<" & "\>"
"\<" denotes the beginning of a word (GNU extension)
"\>" denotes the end of a word (GNU extension)
e.g.
"\<word" will match "word" only if it appears at the beginning of a word.
"word\>" will match "word" only if it appears at the end of a word.
POSIX Character Classes
Common misconceptions
[:space:]
On its own (outside a bracket expression) it is just a character class consisting of ":", "s", "p", "a", "c", "e". Write [[:space:]] instead.
\s
\s comes from PCRE (Perl Compatible Regular Expressions).
It is not part of POSIX BRE (Basic Regular Expressions) or ERE (Extended Regular Expressions) used in shells, although GNU tools accept it as an extension.
"[["
The =~ operator inside the double-bracket test [[ uses ERE
string='xyz 123'
if [[ $string =~ [[:space:]] ]]; then
    echo "whitespace"
fi
[:space:] matches whitespace characters: space, tab, newline, carriage return, form feed and vertical tab <-- [ \t\n\r\f\v]
[:alnum:] matches alphabetic or numeric characters. This is equivalent to A-Za-z0-9
[:alpha:] matches alphabetic characters. This is equivalent to A-Za-z
[:blank:] matches a space or a tab
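Inside a regex each class name must sit within an outer bracket expression, e.g. [[:alpha:]]. A minimal sketch with Bash's =~ (the function names and test strings are made up for the illustration):

```shell
#!/usr/bin/env bash
# Regex kept in a variable; each POSIX class sits inside an outer [ ].
class_re='^[[:alpha:]]+[[:blank:]][[:alnum:]]+$'

matches_classes() {
    [[ $1 =~ $class_re ]]
}

has_whitespace() {
    [[ $1 =~ [[:space:]] ]]
}

matches_classes 'xyz 123' && echo "'xyz 123' fits letters, one blank, alphanumerics"
has_whitespace 'xyz123'   || echo "'xyz123' contains no whitespace"
```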
(?m)
The (?flags) construct sets matching options inline, such as case-insensitivity (i), multiline (m), dot-matches-newline (s) and, in PCRE, ungreedy (U).
(?m) enables multiline mode: ^ and $ match at the start and end of every line, not only at the start and end of the whole input.
(?i)param # case-insensitivity
Look-ahead and look-behind operations
(?=.*word1)(?=.*word2)(?=.*word3) # matches only if all three words appear, in any order
Positive lookahead: X(?=Y) means "look for X, but match only if it is followed by Y"
<DirectoryMatch "^\.|\/\.(?!well-known)">
    Require all denied
</DirectoryMatch>
Negative lookahead: X(?!Y) means "look for X, but match only if it is not followed by Y"
a(?!b) # "a" not followed by "b"
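Chaining several lookaheads from the same position behaves like a logical AND. A sketch with grep -P (assuming GNU grep with PCRE support; the input lines are made up):

```shell
#!/usr/bin/env bash
lines=$'word3 word1 word2\nword1 word2\nword2 word3 word1 extra\n'

# Each lookahead is checked from the start of the line and consumes nothing,
# so a line matches only when all three words occur somewhere in it.
all_three=$(printf '%s' "$lines" | grep -cP '^(?=.*word1)(?=.*word2)(?=.*word3)')

echo "lines containing all three words: $all_three"
```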
Tips
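Tip 1: match http or https URLs for a.net or b.net, with an optional subdomain (escape the dots, otherwise "a.net" also matches "axnet")
^https?://(?:.+\.)?(?:a\.net|b\.net)$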
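A pattern like this can be checked against a few sample URLs before it goes into a config. The sketch below uses a variant with the dots escaped and assumes GNU grep with PCRE support; the URLs and the names url_re and allowed are made up for the illustration.

```shell
#!/usr/bin/env bash
# Variant of the tip with literal dots escaped.
url_re='^https?://(?:.+\.)?(?:a\.net|b\.net)$'

allowed() {
    # -q: no output, the exit status alone says whether the URL matched.
    printf '%s\n' "$1" | grep -qP "$url_re"
}

for url in https://www.a.net http://b.net https://axnet https://a.net.example.com; do
    if allowed "$url"; then
        echo "$url: allowed"
    else
        echo "$url: rejected"
    fi
done
```

With an unescaped "a.net", the third URL would be accepted, since "." matches any character.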
Other