07. Python Regular Expressions (re)

Posted by datahunter on Sun, 09/09/2012 - 23:24

Last updated: 2015-06-15

Introduction

re — Regular expression operations

Example

Code

import re

str = 'An example word: cat!!'

result = re.search(r'word: \w\w\w', str)

Remark

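(1) When re.search finds a match, type(result) is

<type '_sre.SRE_Match'>

When nothing matches, it is

<type 'NoneType'>

(2) Raw String Notation (r"text")

In an ordinary string literal the backslash ('\') starts an escape sequence, so it has a meaning of its own.

Without the r prefix, every literal backslash in a pattern must itself be escaped with another backslash (\\). A raw string passes backslashes through to the regex engine unchanged.

(3) \w is equivalent to [a-zA-Z0-9_]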

# The if-statement after search() tests whether it succeeded

if result:
    print 'found', result.group() # Output: found word: cat
else:
    print 'did not find'
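
The effect of the r prefix from remark (2) can be checked directly. The following is a minimal sketch, not part of the original notes; the sample path is made up for illustration, and single-argument print() is used so the snippet should run on Python 3 as well as Python 2.

```python
import re

# In a normal string literal '\n' is ONE character (a newline);
# in a raw string r'\n' is TWO characters: a backslash and 'n'.
normal = '\n'
raw = r'\n'
print(len(normal))
print(len(raw))

# Hypothetical sample: the 7-character string C:\temp
path = 'C:\\temp'

# To match a literal backslash the regex itself needs an escaped backslash.
# Written raw that is r'\\'; written as a normal literal it becomes '\\\\'.
m_raw = re.search(r'\\temp', path)
m_plain = re.search('\\\\temp', path)
print(m_raw.group() == m_plain.group())
```

Both searches compile to the same pattern; the raw form is simply easier to read.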

re.search

Usage

re.search(pattern, string, flags=0)

Pattern

Special sequences:

\d digit [0-9]
\w word character [a-zA-Z0-9_]
\s whitespace character [ \t\n\r\f\v]
\S non-whitespace character (equivalent to [^ \t\n\r\f\v])
\n \r newline, carriage return
\t tab character
\v vertical tab character
\f form feed character
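
As a rough illustration of the sequences listed above, here is a small sketch using re.findall. The sample text and variable names are assumptions made up for this example.

```python
import re

# Hypothetical sample text containing a space, a tab and a newline
sample = 'id 42\tname_1\nend'

digits = re.findall(r'\d+', sample)      # runs of digits
words = re.findall(r'\w+', sample)       # runs of word characters
spaces = re.findall(r'\s', sample)       # each whitespace character
non_space = re.findall(r'\S+', sample)   # runs of non-whitespace characters

print(digits)      # ['42', '1']
print(words)       # ['id', '42', 'name_1', 'end']
print(spaces)      # [' ', '\t', '\n']
print(non_space)   # ['id', '42', 'name_1', 'end']
```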

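Symbols:

. # any single character except newline '\n'
* # 0 or more repetitions of the preceding element
+ # 1 or more repetitions of the preceding element
? # 0 or 1 repetition of the preceding element
A|B # either A or B
^ # start of the string (with re.M, also the start of each line)
$ # end of the string (with re.M, also the end of each line)
\ # escapes special characters
{m,n} # match m to n repetitions (a{3,5} => match from 3 to 5 'a' characters)
[abc] # matches 'a' or 'b' or 'c'
[^ab] # any char except 'a' or 'b'.
[a-z] # any char in the range 'a' to 'z' (in [abc-] the trailing dash is a literal '-', not a range)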
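
The symbols above can be tried one at a time. This is only a sketch with made-up input strings, meant to show what each construct matches.

```python
import re

dot = re.findall(r'b.d', 'bad bed b\nd')          # '.' does not match the newline
optional = re.findall(r'colou?r', 'color colour')  # '?' makes the 'u' optional
either = re.findall(r'cat|dog', 'cat dog bird')    # '|' picks either alternative
counted = re.findall(r'a{2,3}', 'a aa aaa aaaa')   # 2 to 3 repetitions of 'a'
negated = re.findall(r'[^ab]', 'abcab')            # anything except 'a' or 'b'
dash = re.findall(r'[abc-]', 'a-z')                # trailing '-' is a literal dash

# '^' and '$' anchor the pattern to the whole string here
all_digits = re.search(r'^\d+$', '12345')
not_all_digits = re.search(r'^\d+$', '123a')

print(dot)        # ['bad', 'bed']
print(optional)   # ['color', 'colour']
print(either)     # ['cat', 'dog']
print(counted)    # ['aa', 'aaa', 'aaa']
print(negated)    # ['c']
print(dash)       # ['a', '-']
print(all_digits is not None, not_all_digits is None)
```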

flags

re.I (ignore case)
re.L (locale dependent)
re.M (multi-line)
re.S (dot matches all)
re.U (Unicode dependent)
re.X (verbose)
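
Flags can be combined with the bitwise OR operator (|). The sketch below shows re.X, which allows whitespace and comments inside the pattern, and a combination of re.I and re.M. The pattern name date_pat and the sample strings are illustrative assumptions.

```python
import re

# With re.X, whitespace in the pattern is ignored and '#' starts a comment,
# so a longer pattern can be laid out over several lines.
date_pat = re.compile(r'''
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
''', re.X)

m = date_pat.search('Last updated: 2015-06-15')
print(m.groups())   # ('2015', '06', '15')

# re.I ignores case; re.M lets '^' match at the start of every line.
hits = re.findall(r'^python', 'Python\npython\nPYTHON', re.I | re.M)
print(hits)         # ['Python', 'python', 'PYTHON']
```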

Group

(...)

the contents of a group can be retrieved after a match has been performed

(?...)

This is an extension notation.

The first character after the '?' determines the meaning and further syntax of the construct.

(?:...)

A non-capturing version of regular parentheses.

(?iLmsux)

Each letter sets the corresponding flag for the whole pattern:

re.I (ignore case),
re.L (locale dependent),
re.M (multi-line),
re.S (dot matches all),
re.U (Unicode dependent),
re.X (verbose)
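
A short sketch of the two extension notations described above: (?:...) groups without capturing, and (?i) placed at the start of a pattern behaves like passing re.I. The sample text is made up for illustration.

```python
import re

text = 'Set-Cookie: ID=abc'   # hypothetical sample

# Capturing parentheses: the alternation becomes group 1
m_cap = re.search(r'(ID|Name)=(\w+)', text)
# Non-capturing parentheses: only the value is captured
m_non = re.search(r'(?:ID|Name)=(\w+)', text)
print(m_cap.groups())   # ('ID', 'abc')
print(m_non.groups())   # ('abc',)

# Inline flag vs. flags argument: both ignore case
m_inline = re.search(r'(?i)id=\w+', text)
m_flag = re.search(r'id=\w+', text, re.I)
print(m_inline.group() == m_flag.group())
```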

Group(N)

group(0) returns the entire match.

group(1) returns the first parenthesized subgroup and so on.

Name the groups

(?P<NAME>...)

e.g.

r'School = (?P<school>.*)\n'
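
A named group can be read back by name, by number, or through groupdict(). The following sketch reuses the pattern above; the record text is a made-up example.

```python
import re

# Hypothetical input: two "key = value" lines
record = 'Name = Alice\nSchool = Example High\n'

m = re.search(r'School = (?P<school>.*)\n', record)

print(m.group('school'))   # access by name
print(m.group(1))          # the same group is still available by number
print(m.groupdict())       # all named groups as a dict
```

Because '.' does not match a newline by default, the group stops at the end of the line.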

Code1:

match = re.search(r'\d\d\d', 'p123g')      #  found, match.group() == "123"
match = re.search(r'\w\w\w', '@@abcd!!')   #  found, match.group() == "abc"

Code2:

str = 'purple alice@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', str)
if match:
    print match.group()   ## 'alice@google.com' (the whole match)
    print match.group(1)  ## 'alice' (the username, group 1)
    print match.group(2)  ## 'google.com' (the host, group 2)

str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str) ## ['alice@google.com', 'bob@abc.com']
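
When the pattern passed to re.findall contains groups, the result is a list of tuples (one tuple per match, one item per group) instead of a list of whole matches. A minimal sketch, using made-up example.com / example.org addresses:

```python
import re

# Hypothetical sample text with two addresses
text = 'purple alice@example.com, blah monkey bob@example.org blah dishwasher'

# Two groups in the pattern -> each match becomes a (user, host) tuple
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)
print(pairs)   # [('alice', 'example.com'), ('bob', 'example.org')]

for user, host in pairs:
    print(user + ' at ' + host)
```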

Code3:

# 'with' closes the file automatically once the block ends
with open('test.txt', 'r') as f:
    strings = re.findall(r'some pattern', f.read())

Code4: more options (flags)

match = re.search(pat, str, re.IGNORECASE)

IGNORECASE ignore upper/lowercase differences for matching
DOTALL allow dot (.) to match newline
MULTILINE allow ^ and $ to match the start and end of each line.
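
To see what DOTALL and MULTILINE change, the same patterns can be run with and without the flag. This is a sketch on a made-up two-line string.

```python
import re

text = 'first line\nsecond line'   # hypothetical two-line input

# Without DOTALL, '.' stops at the newline, so the pattern cannot span lines
no_dotall = re.search(r'first.*second', text)
with_dotall = re.search(r'first.*second', text, re.DOTALL)
print(no_dotall)             # None
print(with_dotall.group())

# Without MULTILINE, '^' matches only at the very start of the string
starts_default = re.findall(r'^\w+', text)
starts_multiline = re.findall(r'^\w+', text, re.MULTILINE)
print(starts_default)        # ['first']
print(starts_multiline)      # ['first', 'second']
```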

re.compile

# Compile a regular expression pattern into a regular expression object

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to

result = re.match(pattern, string)
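
Compiling is mainly useful when the same pattern is applied many times. A small sketch, with an illustrative pattern and input list that are not from the original notes:

```python
import re

# Compile once, then reuse the pattern object's methods
number_pat = re.compile(r'\d+')

lines = ['a1', 'b22', 'ccc']   # hypothetical input lines
matches = [number_pat.search(line) for line in lines]

# search() returns None for lines without a number, so filter those out
found = [m.group() for m in matches if m]
print(found)                # ['1', '22']
print(number_pat.pattern)   # the original pattern string
```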

search() vs. match()

* re.match() checks for a match only at the beginning of the string

* re.search() checks for a match anywhere in the string

>>> re.match("c", "abcdef") # No match

>>> re.search("c", "abcdef") # Match
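
The difference can be checked by looking at what each call returns. Note that re.match stays anchored at the start of the string even with re.M, whereas re.search with '^' and re.M can match at the start of a later line. A sketch:

```python
import re

m1 = re.match('c', 'abcdef')      # None: 'c' is not at the start
m2 = re.search('c', 'abcdef')     # match object: 'c' found at index 2
m3 = re.match('abc', 'abcdef')    # match object: pattern fits at the start

print(m1)
print(m2.start())
print(m3.group())

# MULTILINE does not make match() try the start of each line
m4 = re.match('b', 'a\nb', re.M)      # still None
m5 = re.search('^b', 'a\nb', re.M)    # matches at the start of line 2
print(m4)
print(m5.group())
```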

re.split

Usage:

re.split(pattern, string, maxsplit=0, flags=0)

e.g.

re.split(r'\W+', 'Words, words, words.')

['Words', 'words', 'words', '']

re.split(r'(\W+)', 'Words, words, words.')

['Words', ', ', 'words', ', ', 'words', '.', '']

re.split(r'\W+', 'Words, words, words.', 1)

['Words', 'words, words.']

re.split(r'[a-f]+', '0a3B9', flags=re.IGNORECASE)

['0', '3', '9']
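
A common use of re.split is breaking a line on several possible delimiters at once. The sketch below uses a made-up input line and also shows one way to drop the trailing empty string seen in the first example above.

```python
import re

line = 'alpha, beta;gamma   delta'   # hypothetical input

# Split on any run of commas, semicolons or whitespace
fields = re.split(r'[,;\s]+', line)
print(fields)        # ['alpha', 'beta', 'gamma', 'delta']

# maxsplit limits the number of splits; the rest is returned unsplit
first_two = re.split(r'[,;\s]+', line, maxsplit=2)
print(first_two)     # ['alpha', 'beta', 'gamma   delta']

# Filter out empty strings produced by a delimiter at the end
words = [w for w in re.split(r'\W+', 'Words, words, words.') if w]
print(words)         # ['Words', 'words', 'words']
```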

re.sub

Purpose: replacement (substitution)

re.sub(pat, replacement, str)

# \1 is group(1), \2 group(2) in the replacement

e.g.

print re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', str)
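
Besides numbered backreferences, re.sub also accepts \g<name> for named groups, a count argument, and a function as the replacement. The following is a sketch under assumed inputs; the addresses, the hidden.invalid domain and the helper shout are made up for illustration.

```python
import re

text = 'alice@example.com and bob@example.org'   # hypothetical sample

# \1 in the replacement refers to group 1 (the username)
masked = re.sub(r'([\w.-]+)@([\w.-]+)', r'\1@hidden.invalid', text)
print(masked)    # alice@hidden.invalid and bob@hidden.invalid

# Named groups are referenced with \g<name>
swapped = re.sub(r'(?P<user>[\w.-]+)@(?P<host>[\w.-]+)',
                 r'\g<host>:\g<user>', text)
print(swapped)   # example.com:alice and example.org:bob

# count limits how many occurrences are replaced
once = re.sub(r'a', 'A', 'banana', count=1)
print(once)      # bAnana

# The replacement may be a function that receives the match object
def shout(m):
    return m.group(0).upper()

upper = re.sub(r'\b\w{5}\b', shout, 'alice and bobby')
print(upper)     # ALICE and BOBBY
```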

Greedy

Greedy & Lazy

'Greedy' means the quantifier matches the longest possible string.

'Lazy' means the quantifier matches the shortest possible string.

Greedy & Lazy quantifier

*         *?
+         +?
?         ??
{m,n}     {m,n}?
...

e.g., given the string:

<b>foo</b>

Pattern:

'(<.*>)' # Get Greedy result: "<b>foo</b>"

'(<.*?>)' # Get Lazy result: "<b>"

'([^>]*)' # "All chars up to X": [^>]* matches any run of characters other than '>', so it stops just before the first '>'. Result: "<b"
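
The three patterns can be compared on the sample string. This sketch also adds a bracketed variant, '<[^>]*>', which is a common alternative to the lazy quantifier for picking out each tag.

```python
import re

html = '<b>foo</b>'

greedy = re.search(r'(<.*>)', html).group(1)      # runs to the LAST '>'
lazy = re.search(r'(<.*?>)', html).group(1)       # stops at the FIRST '>'
up_to = re.search(r'([^>]*)', html).group(1)      # everything before the first '>'

print(greedy)   # <b>foo</b>
print(lazy)     # <b>
print(up_to)    # <b

# Both of these find each tag separately
tags_lazy = re.findall(r'<.*?>', html)
tags_class = re.findall(r'<[^>]*>', html)
print(tags_lazy)    # ['<b>', '</b>']
print(tags_class)   # ['<b>', '</b>']
```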

Doc

https://docs.python.org/2/library/re.html
