07. Python Regular Expressions (re)

Last updated: 2015-06-15

Introduction

re — Regular expression operations

 


Basic Example

 

Code

import re

str = 'An example word: cat!!'

result = re.search(r'word: \w\w\w', str)

Remark

(1) When re.search finds a match, type(result) is

<type '_sre.SRE_Match'>

When nothing matches, it is

<type 'NoneType'>

(2) Raw String Notation (r"text")

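In an ordinary string literal the backslash ('\') already has a meaning of its own: it starts an escape sequence such as '\n'.

Without the r prefix, every backslash meant for the regex engine must be doubled ("\\"); with r"..." it can be written once.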

(3) \w is equivalent to [a-zA-Z0-9_]

# The if-statement after search() tests whether it succeeded

if result:
    print 'found', result.group()  # Output: found word: cat
else:
    print 'did not find'
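
Remark (2) can be checked directly. The following is a minimal sketch, not part of the original notes; the variable names are illustrative, and the code avoids print statements with multiple arguments so that it also runs under Python 3.

```python
import re

# Both spellings produce the same two-character-plus pattern text.
raw_pattern = r'\d+'
escaped_pattern = '\\d+'
same_pattern = (raw_pattern == escaped_pattern)

# '\b' in a normal string is a single backspace character, while
# r'\b' is two characters that the re module reads as a word boundary.
backspace_len = len('\b')
boundary_len = len(r'\b')

text = 'An example word: cat!!'
boundary_match = re.search(r'\bcat\b', text)   # word boundary around 'cat'
no_match = re.search('\bcat\b', text)          # looks for backspace characters

print(same_pattern)
print(boundary_match.group())
print(no_match)
```

Forgetting the r prefix often fails silently, as in the last search above, which is why raw strings are the usual choice for patterns.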

 


re.search

 

Usage

re.search(pattern, string, flags=0)

Pattern

Special sequences:

  • \d                 digit, [0-9]
  • \w                word character, [a-zA-Z0-9_]
  • \s                 whitespace character, [ \t\n\r\f\v]
  • \S                 non-whitespace character, [^ \t\n\r\f\v]
  • \n \r              newline, carriage return
  • \t                  tab character
  • \v                 vertical tab character
  • \f                  form feed character

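Symbols:

  • .           # any single character except newline '\n'
  • *          # 0 or more repetitions of the preceding element
  • +          # 1 or more repetitions of the preceding element
  • ?           # 0 or 1 repetition of the preceding element
  • A|B       # either A or B
  • ^           # start of the string (of each line with re.M)
  • $           # end of the string (of each line with re.M)
  • \           # escapes special characters
  • {m,n}   # match m to n repetitions (a{3,5} => match from 3 to 5 'a' characters)
  • [abc]     # matches 'a' or 'b' or 'c'
  • [^ab]    # any char except 'a' or 'b'
  • [a-z]     # any char in the range 'a' to 'z' (a dash placed last, as in [abc-], is a literal '-')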
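
A few of the symbols above in action. This is only an illustrative sketch; the sample strings are made up for the demonstration.

```python
import re

text = 'color colour colouur'

# '?' makes the preceding 'u' optional (0 or 1 time).
optional_u = re.findall(r'colou?r', text)

# '{m,n}' bounds how many times the preceding element may repeat.
bounded = re.findall(r'colou{1,2}r', text)

# '|' chooses between alternatives.
either = re.findall(r'cat|dog', 'cat dog bird')

# '^' and '$' pin the pattern to the start and end of the string.
exact = re.search(r'^color$', 'color') is not None
partial = re.search(r'^color$', 'colors') is not None

# '[^...]' negates a set; a dash placed last in a set is a literal '-'.
not_vowel_or_space = re.findall(r'[^aeiou ]+', 'a-b c')
with_dash = re.findall(r'[abc-]+', 'a-b c-d')

print(optional_u)
print(bounded)
print(with_dash)
```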

flags

re.I (ignore case)
re.L (locale dependent)
re.M (multi-line)
re.S (dot matches all)
re.U (Unicode dependent)
re.X (verbose)
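
The effect of the most common flags can be seen with a short sketch like the one below. The sample text and the date pattern are assumptions chosen for the demonstration.

```python
import re

text = 'First line\nsecond LINE'

# re.I: case-insensitive matching.
ignore_case = re.findall(r'line', text, re.I)

# re.M: '^' also matches right after each newline.
line_starts = re.findall(r'^\w+', text, re.M)

# re.S: '.' also matches the newline character.
without_dotall = re.search(r'First.*LINE', text)
with_dotall = re.search(r'First.*LINE', text, re.S)

# Flags combine with '|'. re.X allows whitespace and comments in the pattern.
verbose = re.compile(r'''
    (\d{4})   # year
    -
    (\d{2})   # month
''', re.X | re.I)
year_month = verbose.search('Last updated: 2015-06-15').groups()

print(ignore_case)
print(line_starts)
print(year_month)
```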

 


Group

 

(...)

Matches whatever expression is inside the parentheses and captures it as a group; the contents of a group can be retrieved after a match has been performed.

(?...)

This is an extension notation.

The first character after the '?' determines the meaning and further syntax of the construct.

(?:...)

A non-capturing version of regular parentheses.

(?iLmsux)

Each letter is equivalent to setting the corresponding flag:

  • re.I (ignore case),
  • re.L (locale dependent),
  • re.M (multi-line),
  • re.S (dot matches all),
  • re.U (Unicode dependent),
  • re.X (verbose)
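
A small sketch of the two extension forms above; the input strings are invented for illustration. The inline flag is placed at the very start of the pattern, which is the form accepted by all Python versions.

```python
import re

# (?:...) groups the alternation without creating a numbered group,
# so group(1) is still the digits.
m = re.search(r'(?:cat|dog)-(\d+)', 'id: dog-42')
whole = m.group(0)
captured = m.groups()

# (?i) at the start of the pattern behaves like passing re.I.
inline = re.search(r'(?i)word', 'An example WORD')
explicit = re.search(r'word', 'An example WORD', re.I)

print(whole)
print(captured)
print(inline.group() == explicit.group())
```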

group(N)

group(0) returns the entire match.

group(1) returns the first parenthesized subgroup and so on.

Name the groups

(?P<NAME>...)

e.g.

r'School = (?P<school>.*)\n'
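
A named group can be read back by name, by number, or through groupdict(). The sketch below applies the pattern above to a made-up record; the data is hypothetical.

```python
import re

record = 'School = Example High\nYear = 2015\n'   # hypothetical input

m = re.search(r'School = (?P<school>.*)\n', record)

by_name = m.group('school')    # access by group name
by_number = m.group(1)         # named groups are still numbered
as_dict = m.groupdict()        # all named groups as a dictionary

print(by_name)
print(as_dict)
```

Because '.' does not match a newline by default, the greedy '.*' stops at the end of the first line.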

Code1:

match = re.search(r'\d\d\d', 'p123g')      #  found, match.group() == "123"
match = re.search(r'\w\w\w', '@@abcd!!')   #  found, match.group() == "abc"

Code2:

str = 'purple alice@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', str)
if match:
   print match.group()   ## 'alice@google.com' (the whole match)
   print match.group(1)  ## 'alice' (the username, group 1)
   print match.group(2)  ## 'google.com' (the host, group 2)

str = 'purple [email protected], blah monkey [email protected] blah dishwasher'
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str) ## ['[email protected]', '[email protected]']
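
How findall() behaves with and without groups can be sketched as follows. The addresses are hypothetical placeholders on reserved example domains, not values from the original notes.

```python
import re

# Hypothetical sample text with placeholder addresses.
text = 'purple alice@example.com, blah monkey bob@example.org blah dishwasher'

# Without groups, findall() returns a list of whole matches.
whole = re.findall(r'[\w.-]+@[\w.-]+', text)

# With groups, findall() returns a list of tuples, one item per group.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)

print(whole)
print(pairs)
```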

Code3:

with open('test.txt', 'r') as f:    # the file is closed automatically
    strings = re.findall(r'some pattern', f.read())

Code4: more options (flags)

match = re.search(pat, str, re.IGNORECASE)

  • IGNORECASE      ignore upper/lowercase differences for matching
  • DOTALL              allow dot (.) to match newline
  • MULTILINE          allow ^ and $ to match the start and end of each line.

 


re.compile

 

# Compile a regular expression pattern into a regular expression object

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to

result = re.match(pattern, string)
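
Compiling pays off mainly when the same pattern is applied many times. A minimal sketch, with made-up input lines:

```python
import re

# Compile once, reuse for every line.
word_pattern = re.compile(r'word: (\w+)')

lines = ['An example word: cat!!', 'no keyword here', 'word: dog']
found = []
for line in lines:
    m = word_pattern.search(line)
    if m:
        found.append(m.group(1))

# The compiled object and the module-level function give the same result.
prog = re.compile(r'\w+')
same = prog.match('cat!!').group() == re.match(r'\w+', 'cat!!').group()

print(found)
print(same)
```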

 


search() vs. match()

 

* re.match() checks for a match only at the beginning of the string

* re.search() checks for a match anywhere in the string

>>> re.match("c", "abcdef")  # No match

>>> re.search("c", "abcdef") # Match
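
The difference can be checked with a short sketch like this one (the multi-line input is an assumption added for illustration):

```python
import re

at_start = re.match(r'c', 'abcdef')      # None: 'c' is not at position 0
anywhere = re.search(r'c', 'abcdef')     # match found at index 2
anchored = re.search(r'^c', 'abcdef')    # None: '^' pins search() to the start

# With re.M, '^' in search() also matches after a newline,
# but match() still only looks at the very beginning of the string.
multi_search = re.search(r'^c', 'ab\ncd', re.M)
multi_match = re.match(r'c', 'ab\ncd', re.M)

print(at_start)
print(anywhere.start())
print(multi_search.start())
```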

 



re.split

 

Usage:

re.split(pattern, string, maxsplit=0, flags=0)

e.g.

re.split(r'\W+', 'Words, words, words.')

['Words', 'words', 'words', '']

re.split(r'(\W+)', 'Words, words, words.')

['Words', ', ', 'words', ', ', 'words', '.', '']

re.split(r'\W+', 'Words, words, words.', 1)

['Words', 'words, words.']

re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)

['0', '3', '9']

 


re.sub

 

Purpose: replacement (substitution)

re.sub(pat, replacement, str)

In the replacement string, \1 refers to group(1), \2 to group(2), and so on.

e.g.

print re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\[email protected]', str)
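
A minimal sketch of backreferences in a replacement. The address and domains are hypothetical placeholders, and mask_user is an illustrative helper name, not something from the original notes.

```python
import re

text = 'purple alice@example.com monkey dishwasher'   # hypothetical address
pattern = r'([\w.-]+)@([\w.-]+)'

# \1 re-inserts the user name; only the host is replaced.
swapped = re.sub(pattern, r'\1@example.org', text)

# Groups can be reordered freely in the replacement.
reordered = re.sub(pattern, r'\2 (user \1)', text)

# The replacement may also be a function that receives the match object.
def mask_user(m):
    return '*' * len(m.group(1)) + '@' + m.group(2)

masked = re.sub(pattern, mask_user, text)

print(swapped)
print(reordered)
print(masked)
```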

 


Greedy

 

Greedy & Lazy

'Greedy' means the quantifier matches the longest possible string.

'Lazy' means the quantifier matches the shortest possible string.

Greedy & lazy quantifiers

Greedy    Lazy
*         *?
+         +?
?         ??
{m,n}     {m,n}?

e.g. given the input

<b>foo</b>

Pattern:

'(<.*>)'                 # Greedy result: "<b>foo</b>"

'(<.*?>)'                # Lazy result: "<b>"

'([^>]*)'                # negated set: matches every char up to, but not including, the next '>' (here "<b")
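
The three patterns can be compared side by side. This sketch uses the same input as above; the extra findall() line is an addition for illustration.

```python
import re

html = '<b>foo</b>'

greedy = re.search(r'(<.*>)', html).group(1)     # runs to the last '>'
lazy = re.search(r'(<.*?>)', html).group(1)      # stops at the first '>'
negated = re.search(r'([^>]*)', html).group(1)   # everything before the first '>'

# The lazy form is handy for pulling out each tag separately.
all_tags = re.findall(r'<.*?>', html)

print(greedy)
print(lazy)
print(negated)
print(all_tags)
```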

 


Doc

 

https://docs.python.org/2/library/re.html

 

 

 
