Last updated: 2015-06-15
Introduction
re — Regular expression operations
Sample Example
Code
import re
str = 'An example word: cat!!'
result = re.search(r'word: \w\w\w', str)
Remark
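(1) When re.search finds a match, type(result) is
<type '_sre.SRE_Match'>
When there is no match, it is
<type 'NoneType'>
(2) Raw String Notation (r"text")
In a normal string literal the backslash ('\') is itself an escape character,
so without the r prefix every "\" must be prefixed with another one to escape it (\\)
(3) \w is equivalent to [a-zA-Z0-9_]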
# The if-statement after search() tests whether it succeeded
if result:
    print 'found', result.group()  # Output: found word: cat
else:
    print 'did not find'
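Because re.search returns None on failure, calling .group() on the result without a check raises AttributeError. A minimal sketch of a guard, written so that it also runs on Python 3. The helper name find_word and the second test string are hypothetical, not part of re or of the notes above:

```python
import re

def find_word(text):
    """Return the text matched by the sample pattern, or None if absent."""
    result = re.search(r'word: \w\w\w', text)
    # re.search returns None when nothing matches, so test before .group()
    if result is None:
        return None
    return result.group()

hit = find_word('An example word: cat!!')   # 'word: cat'
miss = find_word('no keyword here')         # None

# A raw string and its escaped equivalent are the same pattern text
same = (r'\w' == '\\w')                     # True
```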
re.search
Usage
re.search(pattern, string, flags=0)
Pattern
opts:
- \d [0-9]
- \w [a-zA-Z0-9_]
- \s whitespace characters [ \t\n\r\f\v]
- \S non-space character (equivalent of [^ \t\n\r\f\v])
- \n \r newline, carriage return character
- \t tab character
- \v vertical tab character
- \f form feed character
Symbols:
- . # any single character except newline '\n'
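- * # 0 or more repetitions of the preceding element
- + # 1 or more repetitions of the preceding element
- ? # 0 or 1 repetition (makes the preceding element optional)
- '|' # either A or B (A|B)
- ^ # start of the string (of each line with re.M)
- $ # end of the string (of each line with re.M)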
- \ # escapes special characters
- {m,n} # match m to n repetitions (a{3,5} => match from 3 to 5 'a' characters)
- [abc] # matches 'a' or 'b' or 'c'
- [^ab] # any char except 'a' or 'b'.
- [a-z] # a range of chars; to include a literal dash, put it where it cannot indicate a range, e.g. last: [abc-]
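The symbols above can be tried out in a few lines. This is only a sketch; the sample strings are made up for illustration:

```python
import re

# \d+ : one or more digits
digits = re.findall(r'\d+', 'tel 555 1234')               # ['555', '1234']
# a{3,5} : 3 to 5 'a' characters (greedy, so it takes 5 of the 7)
run = re.search(r'a{3,5}', 'caaaaaaat').group()           # 'aaaaa'
# colou?r : the 'u' is optional
spellings = re.findall(r'colou?r', 'color colour')        # ['color', 'colour']
# cat|dog : either alternative
pets = re.findall(r'cat|dog', 'cat dog bird')             # ['cat', 'dog']
# [^ab]+ : runs of chars other than 'a' or 'b'
others = re.findall(r'[^ab]+', 'abcdab')                  # ['cd']
# ^ and $ anchor the pattern to the start and end of the string
whole = re.search(r'^\w+$', 'hello') is not None          # True
partial = re.search(r'^\w+$', 'hello world') is not None  # False
```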
flags
re.I (ignore case)
re.L (locale dependent)
re.M (multi-line)
re.S (dot matches all)
re.U (Unicode dependent)
re.X (verbose)
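A short sketch of what the more commonly used flags change. The sample text and the date pattern are assumptions chosen for illustration; re.L and re.U are left out because their effect depends on the locale and the Python version:

```python
import re

text = 'First line\nsecond LINE'

# re.I: case-insensitive matching
both = re.findall(r'line', text, re.I)                     # ['line', 'LINE']
# re.M: ^ also matches right after each newline
starts = re.findall(r'^\w+', text, re.M)                   # ['First', 'second']
# re.S: '.' also matches the newline, so a match can span lines
spans = re.search(r'First.*LINE', text, re.S) is not None  # True
no_span = re.search(r'First.*LINE', text) is None          # True
# re.X: whitespace and comments inside the pattern are ignored
pat = re.compile(r'''
    (\d{4})   # year
    -
    (\d{2})   # month
''', re.X)
year_month = pat.search('on 2015-06-15').groups()          # ('2015', '06')
# Flags combine with the bitwise OR operator
combo = re.findall(r'^\w+ line$', text, re.I | re.M)
```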
Group
(...)
the contents of a group can be retrieved after a match has been performed
(?...)
This is an extension notation.
The first character after the '?' determines the meaning and further syntax of the construct.
(?:...)
A non-capturing version of regular parentheses.
(?iLmsux)
Sets the corresponding flags for the whole pattern, equivalent to:
- re.I (ignore case),
- re.L (locale dependent),
- re.M (multi-line),
- re.S (dot matches all),
- re.U (Unicode dependent),
- re.X (verbose)
Group(N)
group(0) returns the entire match.
group(1) returns the first parenthesized subgroup and so on.
Name the groups
(?P<NAME>...)
e.g.
r'School = (?P<school>.*)\n'
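The group notations above can be compared side by side. A minimal sketch; the value 'MIT' and the 'Dear Ms. Smith' string are invented sample data:

```python
import re

text = 'School = MIT\n'

# Named group: retrieve by name or by number
m = re.search(r'School = (?P<school>.*)\n', text)
by_name = m.group('school')      # 'MIT'
by_index = m.group(1)            # 'MIT'
as_dict = m.groupdict()          # {'school': 'MIT'}

# Non-capturing group: groups the alternation but gets no group number
m2 = re.search(r'(?:Mr|Ms)\. (\w+)', 'Dear Ms. Smith')
only_group = m2.groups()         # ('Smith',)

# Inline flag: (?i) at the start of the pattern acts like re.I
m3 = re.search(r'(?i)school', 'SCHOOL = MIT')
inline_hit = m3.group()          # 'SCHOOL'
```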
Code1:
match = re.search(r'\d\d\d', 'p123g') # found, match.group() == "123"
match = re.search(r'\w\w\w', '@@abcd!!') # found, match.group() == "abc"
Code2:
str = 'purple alice@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', str)
if match:
    print match.group() ## 'alice@google.com' (the whole match)
    print match.group(1) ## 'alice' (the username, group 1)
    print match.group(2) ## 'google.com' (the host, group 2)

str = 'purple [email protected], blah monkey [email protected] blah dishwasher'
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str) ## ['[email protected]', '[email protected]']
Code3:
# 'with' closes the file even if findall raises
with open('test.txt', 'r') as f:
    strings = re.findall(r'some pattern', f.read())
Code4: more options
match = re.search(pat, str, re.IGNORECASE)
- IGNORECASE ignore upper/lowercase differences for matching
- DOTALL allow dot (.) to match newline
- MULTILINE allow ^ and $ to match the start and end of each line.
re.compile
# Compile a regular expression pattern into a regular expression object
prog = re.compile(pattern)
result = prog.match(string)
is equivalent to
result = re.match(pattern, string)
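Compiling pays off mainly when the same pattern is applied to many strings. A sketch of that usage, assuming a made-up list of input lines:

```python
import re

# Compile once, then reuse the pattern object for many strings
prog = re.compile(r'\d+')

lines = ['order 12', 'no number', 'item 7 of 30']
first_numbers = []
for line in lines:
    m = prog.search(line)   # same result as re.search(r'\d+', line)
    first_numbers.append(m.group() if m else None)
# first_numbers == ['12', None, '7']

# Pattern objects offer the same methods as the module-level functions
all_numbers = prog.findall('item 7 of 30')   # ['7', '30']
```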
search() vs. match()
* re.match() checks for a match only at the beginning of the string
* re.search() checks for a match anywhere in the string
>>> re.match("c", "abcdef") # No match
>>> re.search("c", "abcdef") # Match
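The same comparison as a script that inspects the return values. A small sketch; the variable names are only for illustration:

```python
import re

no_match = re.match(r'c', 'abcdef')     # None: 'c' is not at position 0
found = re.search(r'c', 'abcdef')       # match object
position = found.start()                # 2
at_start = re.match(r'abc', 'abcdef')   # match object: pattern begins at position 0

# search() with '^' behaves like match() on a single-line string
anchored = re.search(r'^c', 'abcdef')   # None
```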
re.split
Usage:
re.split(pattern, string, maxsplit=0, flags=0)
e.g.
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']
re.sub
Purpose: replacement
re.sub(pat, replacement, str)
# \1 is group(1), \2 group(2) in the replacement
e.g.
print re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\[email protected]', str)
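A self-contained sketch of backreferences in the replacement string. The addresses and the example.com host are placeholders chosen for illustration, not the values from the notes above:

```python
import re

text = 'contact: alice@example.org, bob@example.net'

# \1 in the replacement is whatever group 1 (the user name) matched
same_host = re.sub(r'([\w.-]+)@([\w.-]+)', r'\1@example.com', text)
# 'contact: alice@example.com, bob@example.com'

# \2 \1 swaps the two captured words
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')   # 'world hello'

# The optional count argument limits the number of replacements
first_only = re.sub(r'\d', '#', 'a1b2c3', count=1)          # 'a#b2c3'
```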
Greedy
Greedy & Lazy
'Greedy' means matching the longest possible string.
'Lazy' means matching the shortest possible string.
Greedy & Lazy quantifiers (each greedy form followed by its lazy form)
* *? + +? ? ?? {m,n} {m,n}? ...
e.g.
<b>foo</b>
Pattern:
'(<.*>)' # Get Greedy result: "<b>foo</b>"
'(<.*?>)' # Get Lazy result: "<b>"
'([^>]*)' # [^>]* matches any run of chars other than '>', so the match stops just before the next '>'
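The three patterns can be checked against the sample string directly. A sketch; the '<[^>]*>' form, with the angle brackets added around the negated class, is an adaptation of the last pattern rather than something stated in the notes:

```python
import re

html = '<b>foo</b>'

greedy = re.search(r'(<.*>)', html).group(1)       # '<b>foo</b>'
lazy = re.search(r'(<.*?>)', html).group(1)        # '<b>'
# Negated class: '<', then anything that is not '>', then '>'
negated = re.search(r'(<[^>]*>)', html).group(1)   # '<b>'
all_tags = re.findall(r'<[^>]*>', html)            # ['<b>', '</b>']
```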
Doc
https://docs.python.org/2/library/re.html