Regular Expressions in Python
A Regular Expression (also known as RegEx) is a sequence of characters that makes up a search pattern, which is then matched against text in a longer string. Different symbols match different characters, and different languages interpret these symbols somewhat differently. Python has included support for Regular Expressions since version 1.5. Its Regular Expression Engine is derived from Secret Labs' Regular Expression Engine, i.e. SRE, and is designed to work, more or less, the same way as Regular Expressions work in Perl. Regular Expressions in Perl are extremely flexible and powerful, and hence Python's developers strove to stay as close to the Perl syntax as possible.
In this article, I will use basic examples to explain various special characters used in Regular Expressions, as well as various functions of the built-in re module. At the end of the article, I have listed a few exercises for you to put what you have learned into practice.
Table of Contents
- re.match(), re.search() & Match Objects
- Extracting the match from a match object
- Control Characters & Literals
- Greedy & Non-Greedy Matching with *, + & ?
- re.compile()
- Matching Any Character Except Newline with .
- Matching Beginning & Ending of a String with ^ and $
- Matching either this or that with |
- Character Classes: [] and ^
- Control Sequences (\d, \w, \s, \D, \W, \S etc.) & Raw Strings
- Matching exactly m Repetitions of an Expression with {}
- Matching at least m Repetitions and at most n Repetitions of an Expression with {,}
- matchObject.group(), matchObject.groups() & matchObject.groupdict() with (?P<name>…)
- Backreferencing named and unnamed groups with (?P=name) & \number
- Not Capturing a Group with Parentheses (?:…)
- Searching for matches case-insensitively: flags
- re.fullmatch(), re.sub(), re.subn()
- re.split(), re.findall(), re.finditer()
- re.escape(), re.purge()
- Summary
- Handy Tips
- Exercises
In order to use Regular Expressions, you will need to import the builtin re module. Let's look at the objects it contains.
>>> import re
>>> print( "\t\t".join( [obj for obj in dir(re) if not obj.startswith('_')] ) )
A ASCII DEBUG DOTALL I IGNORECASE L LOCALE M MULTILINE S Scanner T TEMPLATE U UNICODE VERBOSE X compile copyreg error escape findall finditer fullmatch match purge search split sre_compile sre_parse sub subn sys template
So, the re module contains many attributes and functions. We'll cover most of these step-by-step in this article.
re.match(), re.search() & Match Objects
Let's perform a basic search using the match() function of the builtin re module. match(pattern, stringSource, flags=0) takes two positional arguments, the pattern to search for, and the string to search in. It applies the pattern at the beginning of the string and returns a match object if a match is found, and if no match is found, returns None. The fact that it looks for the pattern in the beginning of the string, is why it is said to perform an anchored search. The optional flags argument is discussed later in the article.
>>> re.match("x", "xy")
<_sre.SRE_Match object; span=(0, 1), match='x'>
>>> re.match("x", "y")   # No match, therefore, returns None
>>> re.match("x", "yx")  # No match, since 'x' is not at the beginning of the string 'yx'
The search(pattern, stringSource, flags=0) performs the unanchored search by looking for the search pattern in the entire target string, not just the beginning of it. Like match(), it returns a match object if a match is found, and if no match is found, returns None. The optional flags argument is discussed later in the article.
>>> re.search("x", "xy")
<_sre.SRE_Match object; span=(0, 1), match='x'>
>>> re.search("x", "y")  # No match, therefore, returns None
>>> re.search("x", "yx")
<_sre.SRE_Match object; span=(1, 2), match='x'>
Both the match() & search() functions return a match object. It's time we see what a match object has to offer to us.
>>> matchObject = re.search("x", "yx")
>>> matchObject
<_sre.SRE_Match object; span=(1, 2), match='x'>
>>> print( "\n".join([obj for obj in dir(matchObject) if not obj.startswith("_")]) )
end
endpos
expand
group
groupdict
groups
lastgroup
lastindex
pos
re
regs
span
start
string

# Going over a few chosen objects of the match object. Feel free to explore the rest of them.
# end(): returns index of the original string where the match ended. 2 here.
# groups(): returns a tuple of groups of the matched string. Discussed later in the article.
# groupdict(): returns a dictionary of named groups of the matched string. Discussed later in the article.
# group(): returns a specific group of the matched string. Discussed later in the article.
# re: attribute storing the regular expression object whose search()/match() produced this match object. re.compile('x') here. We will go over re.compile() in subsequent sections.
# span(): returns a 2-element tuple denoting the indexes of the original string where the match starts and ends. (1, 2) here.
# start(): returns index of the original string where the match started. 1 here.
# string: attribute which stores the original string. 'yx' here.
Extracting the match from a match object
We can obtain the match using the start() & end() functions of the match object to form a slice of the original string.
>>> matchObject = re.search("x", "yx")
>>> matchObject.string[matchObject.start():matchObject.end()]  # 'yx'[1:2]
'x'
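The slicing shown above can be wrapped in a small helper. The sketch below is only an illustration: the function name extract_match is made up for this example, and group(0), which is discussed later in the article, is used merely as a cross-check.

```python
import re


def extract_match(match_object):
    """Return the matched text by slicing the original string (hypothetical helper)."""
    start, end = match_object.span()  # same values as start() and end()
    return match_object.string[start:end]


match_object = re.search("x", "yx")
sliced = extract_match(match_object)
print(sliced)                            # x
print(sliced == match_object.group(0))   # True
```

In everyday code group(0) is the shorter route; the slice simply makes explicit what start() and end() refer to.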
Control Characters & Literals
There are characters which have special meaning in Regular Expressions, known as control characters. They are + ? . * ^ $ ( ) [ ] { } | \. Characters other than these control characters, such as alphanumeric characters, match themselves and hence are known as literals. In order to match a literal * or ^ or any other control character, you need to prefix it with a backslash, i.e. \* or \^ etc.
Let's look at control characters one by one.
Greedy & Non-Greedy Matching with *, + & ?
The control character * matches 0 or more repetitions of the preceding expression. It is a greedy matcher i.e. it will match as many repetitions as possible.
>>> re.search('x*', 'yyyxx')
<_sre.SRE_Match object; span=(0, 0), match=''>
# A match object is returned but the match component is empty. This is because as soon as the regular expression engine started to look for 'x*', it found 0 occurrences right at the beginning of 'yyyxx' and returned a match object.
>>> re.search('x*', 'xxyyy')
<_sre.SRE_Match object; span=(0, 2), match='xx'>
The control character + matches 1 or more repetitions of the preceding expression. It is also a greedy matcher.
>>> re.search('x+', 'yyyxx')
<_sre.SRE_Match object; span=(3, 5), match='xx'>
The control character ? matches 0 or 1 repetition of the preceding expression. On its own it is also greedy: it matches one repetition whenever it can. Placed right after another quantifier, however, it makes that quantifier lazy: the expressions *? & +? are the lazy versions of * & +. These versions match as little as possible, i.e. the minimum needed to satisfy the expression.
>>> re.search('x+', 'yyyxx')
<_sre.SRE_Match object; span=(3, 5), match='xx'>
>>> re.search('x+?', 'yyyxx')
<_sre.SRE_Match object; span=(3, 4), match='x'>
>>> re.search('<p>.*</p>', '<p>Paragraph 1</p><p>Paragraph 2</p>')
<_sre.SRE_Match object; span=(0, 36), match='<p>Paragraph 1</p><p>Paragraph 2</p>'>
>>> re.search('<p>.*?</p>', '<p>Paragraph 1</p><p>Paragraph 2</p>')
<_sre.SRE_Match object; span=(0, 18), match='<p>Paragraph 1</p>'>
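As a rough side-by-side sketch of greedy and lazy matching, the script below reuses the paragraph example and also compares ? with ??. The ?? form is not mentioned above; it is the lazy counterpart of ? in Python's re module. group(0), covered later in the article, is used here only to pull out the matched text.

```python
import re

html = '<p>Paragraph 1</p><p>Paragraph 2</p>'

# Greedy: .* runs to the last </p>, so both paragraphs end up in one match.
greedy = re.search('<p>.*</p>', html).group(0)

# Lazy: .*? stops at the first </p> it can reach.
lazy = re.search('<p>.*?</p>', html).group(0)

# ? on its own is greedy (it takes the x), while ?? is lazy (it takes nothing).
optional_greedy = re.search('x?', 'xx').group(0)
optional_lazy = re.search('x??', 'xx').group(0)

print(greedy)
print(lazy)
print(repr(optional_greedy), repr(optional_lazy))   # 'x' ''
```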
re.compile()
The compile(pattern, flags=0) function makes it convenient for us in case we need to apply the same pattern to different strings. The optional flags argument is discussed later in the article.
>>> regExPattern = re.compile('ab')
>>> regExPattern.search('cab')
<_sre.SRE_Match object; span=(1, 3), match='ab'>
>>> regExPattern.search('ddab')
<_sre.SRE_Match object; span=(2, 4), match='ab'>

# The following steps are equivalent.
>>> regExPattern = re.compile('pattern')
>>> matchObject = regExPattern.search('stringToSearchIn')
# IS AS GOOD AS
>>> matchObject = re.search('pattern', 'stringToSearchIn')
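To illustrate the "same pattern, different strings" case, here is a minimal sketch that compiles a pattern once and reuses it in a loop. The variable names and sample strings are invented for the example.

```python
import re

run_of_x = re.compile('x+')   # compiled once, reused for every string below

strings_to_search = ['yyxx', 'abc', 'x', 'xxxy']
matches = []
for text in strings_to_search:
    match_object = run_of_x.search(text)
    if match_object is not None:   # search() returns None when nothing matches
        matches.append(match_object.group(0))

print(matches)            # ['xx', 'x', 'xxx']
print(run_of_x.pattern)   # x+
```

The compiled object also remembers the original expression in its pattern attribute, which can be handy when debugging.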
Matching Any Character Except Newline with .
The . control character matches any character in the original string except a newline character.
>>> re.search('.', 'abc')
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>> re.search('..', 'abc')
<_sre.SRE_Match object; span=(0, 2), match='ab'>
>>> re.search('a.c', 'abc')
<_sre.SRE_Match object; span=(0, 3), match='abc'>
>>> re.search('.*', 'abc')  # 0 or more repetitions of any character except newline
<_sre.SRE_Match object; span=(0, 3), match='abc'>
>>> re.search('.+', 'abc')  # 1 or more repetitions of any character except newline
<_sre.SRE_Match object; span=(0, 3), match='abc'>
>>> re.search('.?', 'abc')  # 0 or 1 repetition of any character except newline
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>> re.search('.*', 'abc\nd')  # 0 or more repetitions of any character except newline
<_sre.SRE_Match object; span=(0, 3), match='abc'>
Matching Beginning & Ending of a String with ^ and $
The control characters ^ and $ match the beginning and the end of a string, respectively. By default, $ also matches just before a newline that ends the string.
>>> re.search('^a', 'abc')
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>> re.search('^a.', 'abc')
<_sre.SRE_Match object; span=(0, 2), match='ab'>
>>> re.search('^a', 'bac')  # No match, therefore, returns None
>>> re.search('c$', 'abc')
<_sre.SRE_Match object; span=(2, 3), match='c'>
>>> re.search('.c$', 'abc')
<_sre.SRE_Match object; span=(1, 3), match='bc'>
>>> re.search('.b$', 'abc')  # No match, therefore, returns None
Matching either this or that with |
Using a vertical pipe i.e. |, you can ask the Regular Expression Engine to match either of the expressions around the | symbol.
>>> re.search('a|b', 'abc')
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>> re.search('(a|b)*', 'abc')
<_sre.SRE_Match object; span=(0, 2), match='ab'>
>>> re.search('(a|b)*', 'abbbbc')
<_sre.SRE_Match object; span=(0, 5), match='abbbb'>
# The parentheses help to make a complex expression, by indicating precedence, so that we can qualify it with *, + or ?. They also create groups, which we will discuss shortly.
Character Classes: [] and ^
You can specify multiple characters inside square brackets to ask the searching engine to match any of the characters listed. Also, you can even use a hyphen (-) to denote alphanumeric ranges. In order to direct the searching engine to search for characters other than the ones listed inside square brackets, use the caret symbol i.e. ^ right after the opening square bracket.
>>> re.search('[ab]', 'abc')  # match 1 repetition of either a or b
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>> re.search('[ab]+', 'abc')  # match 1 or more repetitions of either a or b
<_sre.SRE_Match object; span=(0, 2), match='ab'>
>>> re.search('[a-x]+', 'abc')  # match 1 or more repetitions of any character lying between a and x
<_sre.SRE_Match object; span=(0, 3), match='abc'>
>>> re.search('[1-5a-z]+', '123abc')  # match 1 or more repetitions of any digit between 1 & 5 or any character between a and z
<_sre.SRE_Match object; span=(0, 6), match='123abc'>
>>> re.search('[1-5A-Z]+', '123abc')  # match 1 or more repetitions of any digit between 1 & 5 or any character between A and Z
<_sre.SRE_Match object; span=(0, 3), match='123'>
>>> re.search('[^1-5A-Z]+', '123abc')  # match 1 or more repetitions of any character which does not lie between 1 and 5 and is not an uppercase character either
<_sre.SRE_Match object; span=(3, 6), match='abc'>
Control Sequences (\d, \w, \s, \D, \W, \S etc.) & Raw Strings
There are 11 control sequences of characters, which have special meanings in search patterns. They are listed below, have a glance and then we'll look at a few examples.
\d       matches numerals or digits; is equivalent to the character class [0-9]
\D       matches non-numeral or non-digit characters; equivalent to [^0-9]
\s       matches whitespace characters such as the space, \t (tab), \n (linefeed), \r (carriage return), \v (vertical tab), \f (formfeed) etc.
\S       matches non-whitespace characters; equivalent to [^ \t\n\r\v\f]
\w       matches alphanumeric characters; equivalent to [a-zA-Z0-9_]
\W       matches non-alphanumeric characters; equivalent to [^a-zA-Z0-9_]
\number  matches the contents captured by the group denoted by the same number; we will discuss groups shortly.
\A       matches only at the start of the original string.
\Z       matches only at the end of the original string.
\b       matches an empty string at word boundaries; \b is defined as the boundary between a \w & \W character or between \w and the two ends of a string; note that you will need to write your regex containing \b sequences with raw string notation i.e. r'patternContaining\bSequences', since \b in regular string notation denotes the backspace character. Alternatively, you can use two backslashes to mean the same i.e. 'patternContaining\\bSequences'.
\B       matches an empty string not at word boundaries i.e. re.search('\Bam\B', 'name') will return a match whereas '\Bn' will not, since there is a word boundary right before the letter n;

# In order to match a literal backslash, use the raw string notation, i.e. re.search(r'\\', 'x\\y'). Alternatively, you can escape both the backslashes by prefixing each of them with another backslash i.e. re.search('\\\\', 'x\\y')

## EXAMPLES ##
>>> re.search('\d', '123')  # matches a single numeral/digit
<_sre.SRE_Match object; span=(0, 1), match='1'>
>>> re.search('\d+', '123')  # matches 1 or more repetitions of numerals/digits
<_sre.SRE_Match object; span=(0, 3), match='123'>
>>> re.search('\D+', '123abc')  # matches 1 or more repetitions of non-numeral or non-digit characters
<_sre.SRE_Match object; span=(3, 6), match='abc'>
>>> re.search('\s+', '123\t\rabc')  # matches 1 or more whitespace characters such as \t, \n, \r, \v, \f
<_sre.SRE_Match object; span=(3, 5), match='\t\r'>
>>> re.search('\S+', '123abcd%^&*()\tabcd123')  # matches 1 or more non-whitespace characters
<_sre.SRE_Match object; span=(0, 13), match='123abcd%^&*()'>
>>> re.search('[\S\s]+', '123abcd%^&*()\tabcd123')  # matches 1 or more whitespace or non-whitespace characters.
<_sre.SRE_Match object; span=(0, 21), match='123abcd%^&*()\tabcd123'>
>>> re.search('\A123', '123abc')  # matches '123' placed at the beginning of a string
<_sre.SRE_Match object; span=(0, 3), match='123'>
>>> re.search('bc\Z', '123abc')  # matches 'bc' placed at the end of a string
<_sre.SRE_Match object; span=(4, 6), match='bc'>
>>> re.search('\w+', 'My name is Ethan.')  # matches 1 or more alphanumeric characters
<_sre.SRE_Match object; span=(0, 2), match='My'>
>>> re.search('\W+', 'My name is Ethan.')  # matches 1 or more non-alphanumeric characters
<_sre.SRE_Match object; span=(2, 3), match=' '>
>>> re.search('\\bMy\\b', 'My name is Ethan.')  # matches 'My' placed either at the beginning or end of a string OR placed with word boundaries on either end.
<_sre.SRE_Match object; span=(0, 2), match='My'>
>>> re.search(r'\bMy\b', 'My name is Ethan.')  # same as above, written with raw string notation.
<_sre.SRE_Match object; span=(0, 2), match='My'>
>>> re.search(r'\bEthan\b', 'My name is Ethan.')  # matches 'Ethan' placed either at the beginning or end of a string OR placed with word boundaries on either end. Keep in mind that a period counts as a \W character, and hence it returns a match. A \b is defined as the boundary between a \w & \W character or between \w and the two ends of a string.
<_sre.SRE_Match object; span=(11, 16), match='Ethan'>
>>> re.search('name\\b', 'My name is Ethan.')  # matches 'name' placed either at the end of a string OR placed with a word boundary on its right side.
<_sre.SRE_Match object; span=(3, 7), match='name'>
>>> re.search('\Bam\B', 'My name is Ethan.')  # matches 'am' with at least 1 alphanumeric character on each side.
<_sre.SRE_Match object; span=(4, 6), match='am'>
>>> re.search(r'\\', 'x\\y')  # matches a single backslash; note that the span component of the match object below suggests a match of only a single backslash
<_sre.SRE_Match object; span=(1, 2), match='\\'>
>>> re.search('\\\\', 'x\\y')  # matches a single backslash; note that the span component of the match object below suggests a match of only a single backslash.
<_sre.SRE_Match object; span=(1, 2), match='\\'>
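The difference between \b in a raw string and in a regular string is easy to trip over, so here is a small sketch of it. The sample sentence is invented, and findall(), which is covered later in the article, is used only to list every match at once.

```python
import re

sentence = 'My name is Ethan. Ethanol is unrelated.'

whole_word = re.findall(r'\bEthan\b', sentence)   # boundary required on both sides
word_start = re.findall(r'\bEthan', sentence)     # boundary required on the left only

# Without the r prefix, '\b' is a single backspace character.
plain_length = len('\bMy')    # backspace, M, y
raw_length = len(r'\bMy')     # backslash, b, M, y
backspace_pattern_result = re.search('\bMy', sentence)

print(whole_word)                 # ['Ethan']
print(word_start)                 # ['Ethan', 'Ethan']
print(plain_length, raw_length)   # 3 4
print(backspace_pattern_result)   # None
```

The last line shows the failure mode: the non-raw pattern looks for a backspace followed by 'My', which the sentence does not contain.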
Matching exactly m Repetitions of an Expression with {}
Using curly bracket {}, you can quantify the number of repetitions of the preceding expression.
>>> re.search('\d{4}', 'ABC20170205CBA')  # matches 4 repetitions of digits
<_sre.SRE_Match object; span=(3, 7), match='2017'>
Matching at least m Repetitions and at most n Repetitions of an Expression with {,}
The {m,n} notation can be used to match m to n repetitions of the preceding expression. Note that a space after the comma (i.e. {m, n}) does not work the way you might expect in Python: the braces then stop acting as a repetition count and are matched as literal characters instead. This nuance can often put off Pythonistas who are in the habit of leaving a space after a comma while writing code, such as myself.
>>> re.search('\d{4,8}', 'ABC20170205000000CBA')  # match at least 4 repetitions and at most 8 repetitions of numerals/digits.
<_sre.SRE_Match object; span=(3, 11), match='20170205'>
# You can leave value of n empty to match at least m repetitions and at most any number of repetitions of the preceding expression.
>>> re.search('\d{4,}', 'ABC20170205000000CBA')
<_sre.SRE_Match object; span=(3, 17), match='20170205000000'>
Note that:
- {0,} has same meaning as *
- {1,} has same meaning as +
- {,1} has same meaning as ?
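The following sketch checks two things from this section: what happens when a space sneaks in after the comma, and that the brace forms listed above line up with *, + and ?. It assumes the behaviour of current CPython releases, where an invalid repetition such as {4, 8} is read as literal text; the sample strings are made up.

```python
import re

text = 'ABC20170205CBA'

with_quantifier = re.search(r'\d{4,8}', text)
with_space = re.search(r'\d{4, 8}', text)            # no longer a quantifier
literal_braces = re.search(r'\d{4, 8}', '7{4, 8}')   # matches the text as written

print(with_quantifier.group(0))   # 20170205
print(with_space)                 # None
print(literal_braces.group(0))    # 7{4, 8}

# The brace forms from the list above behave like *, + and ?.
equivalents = [('x*', 'x{0,}'), ('x+', 'x{1,}'), ('x?', 'x{,1}')]
same_spans = [
    re.search(short, 'xxy').span() == re.search(braces, 'xxy').span()
    for short, braces in equivalents
]
print(same_spans)   # [True, True, True]
```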
matchObject.group(), matchObject.groups() & matchObject.groupdict() with (?P<name>…)
Using round brackets () around various parts of your regular expression allows you to capture these parts of the matched string as groups. These captured parts can then be obtained using the group(), groups() and groupdict() functions of the matchedObject.
>>> matchedObject = re.search('(\w{3})\s(\d{1,2})', 'Feb 5')  # matches 3 letter month name, followed by a space, followed by a 1 to 2 digit number
>>> matchedObject.groups()  # returns a tuple of all captured groups
('Feb', '5')
>>> matchedObject.group()  # returns the entire match itself
'Feb 5'
>>> matchedObject.group(1)  # returns match of first group
'Feb'
>>> matchedObject.group(2)  # returns match of second group
'5'
In order to see the groupdict() function in action, we must tweak our original expression to include a strange looking sequence. Within the round brackets of a group, right at the beginning of it, include ?P<name>, where name is the name you want to give to that group.
# matches 3 letter month name (with group 'month'), followed by a space, followed by 1 to 2 digit number (with group 'day')
>>> matchedObject = re.search('(?P<month>\w{3})\s(?P<day>\d{1,2})', 'Feb 5')
>>> matchedObject.groupdict()
{'month': 'Feb', 'day': '5'}
>>> matchedObject.group('month')
'Feb'
>>> matchedObject.group('day')
'5'
Backreferencing named and unnamed groups with (?P=name) & \number
In order to match again whatever a named group has already matched, you can use the syntax (?P=name), where name refers to the name of the group previously matched. This is known as backreferencing a named group.
# matches literal <, followed by at-least 1 character word (with group 'tag'), literal >, followed by contents of the tag (with group 'contents'), followed by literals </, followed by value matched by group 'tag', followed by literal >
>>> matchedObject = re.search('<(?P<tag>\w{1,})>(?P<contents>\w+)</(?P=tag)>', '<strong>Hello</strong>')
>>> matchedObject.groups()
('strong', 'Hello')
# Note that (?P=name) merely matches the value of an earlier group, it does not capture the value itself. In order to capture its value, you need to wrap (?P=name) inside another pair of round brackets i.e. ((?P=name))
>>> matchedObject = re.search('<(?P<tag>\w{1,})>(?P<contents>\w+)</((?P=tag))>', '<strong>Hello</strong>')
>>> matchedObject.groups()
('strong', 'Hello', 'strong')
In order to backreference an unnamed group (it works for named groups as well), you can use the notation \number, where number denotes the group number. Note that if you wish to re-match what the first group caught, you will use \1, and not \0. Another thing to keep in mind is to use the raw string notation for this to work, as \number has a special meaning in regular Python strings. For example, "\1" is read as an octal escape and yields '\x01', i.e. the character with code 1, rather than a backslash followed by the digit 1.
# matches literal <, followed by at-least 1 character word (with group 'tag'), literal >, followed by contents of the tag (with group 'contents'), followed by literals </, followed by value matched by group # 1, followed by literal >
# using \1 to refer to value of first group matched by (?P<tag>\w{1,})
>>> matchedObject = re.search(r'<(?P<tag>\w{1,})>(?P<contents>\w+)</\1>', '<strong>Hello</strong>')
>>> matchedObject.groups()
('strong', 'Hello')
# using \1 to refer to value of first group matched by (\w{1,})
>>> matchedObject = re.search(r'<(\w{1,})>(\w+)</\1>', '<strong>Hello</strong>')
>>> matchedObject.groups()
('strong', 'Hello')
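Another place where backreferences come in handy is spotting a word that has been typed twice in a row. The sketch below shows the named and the numbered form next to each other; the pattern names and sample sentences are invented for the illustration.

```python
import re

# Named form: (?P=word) must repeat whatever the group 'word' matched.
doubled_named = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b')
# Numbered form of the same idea; the raw string keeps \1 intact.
doubled_numbered = re.compile(r'\b(\w+)\s+\1\b')

named_hit = doubled_named.search('This is is a test')
numbered_hit = doubled_numbered.search('This is is a test')
no_hit = doubled_named.search('No repeats here')

print(named_hit.group('word'), named_hit.span())   # is (5, 10)
print(numbered_hit.group(1))                       # is
print(no_hit)                                      # None
```

The \b on either end keeps the 'is' inside 'This' from being counted as the first half of a repeat.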
Not Capturing a Group with Parentheses (?:…)
If you are simply using round brackets to denote precedence or to enhance readability, and you don't want Python to capture it as a group, you can prefix the sub-expression with ?:.
>>> matchedObject = re.search('([A-Za-z]{3})\s(\d{1,2})', 'Feb 5')  # without ?: syntax; matches 3 letter month, followed by a space, followed by a 1 to 2 digit number.
>>> matchedObject.groups()
('Feb', '5')
>>> matchedObject = re.search('(?:[A-Za-z]{3})\s(\d{1,2})', 'Feb 5')  # with ?: syntax; not capturing the month, but using () for precedence and readability.
>>> matchedObject.groups()
('5',)
Searching for matches case-insensitively: flags
There are six flags that Regular Expressions in Python let you set inside the pattern itself, using the syntax (?iLmsux):
- i: tells the searching engine to ignore case
- L: makes \w, \b, and \s locale dependent
- m: enables multiline matching, i.e. ^ and $ also match at the beginning and end of each line
- s: the dotall flag; it makes the dot match all characters, including newline character
- u: makes \w, \b, \d, and \s unicode dependent
- x: makes the expression verbose i.e. ignores unescaped whitespace as well as text after # sign i.e. it treats text after # as comments.
# i: tells the searching engine to ignore case
>>> re.search('ETHAN', 'Ethan')  # does not match
>>> re.search('(?i)ETHAN', 'Ethan')  # matches
<_sre.SRE_Match object; span=(0, 5), match='Ethan'>
# s: the dotall flag; it makes the dot match all characters, including newline character
>>> re.search('...', 'Hi\n')  # does not match
>>> re.search('(?s)...', 'Hi\n')  # matches
<_sre.SRE_Match object; span=(0, 3), match='Hi\n'>
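The m and x flags from the list above can be tried out the same way. The sketch below is a minimal illustration with made-up sample text; it relies on findall() and groups(), both covered elsewhere in the article.

```python
import re

text = 'first line\nsecond line'

# Without (?m), ^ matches only at the very start of the string.
default_starts = re.findall(r'^\w+', text)
# With (?m), ^ also matches right after each newline.
multiline_starts = re.findall(r'(?m)^\w+', text)

# With (?x), whitespace and # comments inside the pattern are ignored.
month_day = re.compile(r'''(?x)
    (\w{3})     # three-letter month name
    \s          # one whitespace character
    (\d{1,2})   # one- or two-digit day
''')

print(default_starts)                       # ['first']
print(multiline_starts)                     # ['first', 'second']
print(month_day.search('Feb 5').groups())   # ('Feb', '5')
```

Note that the inline flag sits at the very start of each pattern; that is the placement used throughout this section.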
re.fullmatch(), re.sub(), re.subn()
The fullmatch(pattern, string, flags=0) performs an anchored search like match(), but returns a match object only if the expression matches the entire string. The fullmatch() method of a compiled pattern additionally accepts optional 2nd and 3rd arguments, in which case the expression has to match the whole portion of the string between those two indexes. Otherwise, it returns None.
>>> pattern = re.compile('ab[cd]')
>>> pattern.fullmatch("zabc")  # doesn't match as 'a' is not at the beginning of "zabc"
>>> pattern.fullmatch("abcde")  # doesn't match as the pattern "ab[cd]" does not match the full string
>>> pattern.fullmatch("abcde", 0, 3)  # matches within specified positions
<_sre.SRE_Match object; span=(0, 3), match='abc'>
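A typical use of fullmatch() is validating that an entire input has a given shape. Here is a small sketch under that assumption; the helper name is_four_digit_year is hypothetical.

```python
import re


def is_four_digit_year(text):
    """Hypothetical validator: True only if text consists of exactly four digits."""
    return re.fullmatch(r'\d{4}', text) is not None


print(is_four_digit_year('2017'))    # True
print(is_four_digit_year('2017x'))   # False
print(is_four_digit_year('x2017'))   # False

# match() only anchors the beginning, so trailing text slips through.
match_accepts_trailing = re.match(r'\d{4}', '2017x') is not None
print(match_accepts_trailing)        # True
```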
The sub(pattern, replacement, stringSource) function replaces the occurrences of pattern in stringSource by replacement, and returns the modified string (if no substitutions are made, it returns the original string). The pattern may be either a simple string or a regular expression. The replacement may be either a string or a function.
>>> re.sub('foo', 'bar', 'foo foo foo') 'bar bar bar'
If replacement is a string, then all the familiar escape sequences are converted to their actual representations (\n to new line, \t to a tab space), and the unfamiliar ones (such as \%) are left as they are. Backreferences (\number), are replaced with matches by group # denoted by 'number'. Let's look at an example.
>>> re.sub('(f)(o)(o)', r'barWith\1\2\3', 'foo foo foo') 'barWithfoo barWithfoo barWithfoo'
If replacement is a function, it is called for every occurrence of pattern. The function must take a match object and return the replacement string.
>>> def replacefooWithbar(matchObject):
...     return 'barWithFoo'
...
>>> re.sub('foo', replacefooWithbar, 'foo foo foo foo')
'barWithFoo barWithFoo barWithFoo barWithFoo'
>>> def replacefooWithbar(matchObject):
...     if matchObject.group(0) == 'foo':
...         return 'barWithFoo'
...     return matchObject.group(0)  # leave any other match unchanged
...
>>> re.sub('foo', replacefooWithbar, 'foo somethingElse foo foo')
'barWithFoo somethingElse barWithFoo barWithFoo'
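A replacement function earns its keep when the new text has to be computed from the match. As a minimal sketch, the hypothetical function below doubles every number it is handed; the sample sentence is invented.

```python
import re


def double_number(match_object):
    """Return twice the matched number, as a string (hypothetical helper)."""
    return str(int(match_object.group(0)) * 2)


doubled = re.sub(r'\d+', double_number, '3 apples and 12 pears')
print(doubled)   # 6 apples and 24 pears
```

The function has to return a string, hence the str() around the arithmetic.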
The sub() takes two optional arguments: count & flags. The count, if provided, refers to the maximum number of occurrences to be replaced. If not provided, or set to 0, all occurrences are replaced. The flags argument lets you pass options that affect how your expression is applied. There are 7 flags, as described below:
>>> re.sub('(f)(o)(o)', r'barWith\1\2\3', 'foo foo foo', count = 2)
'barWithfoo barWithfoo foo'

# valid values for flag argument
A  also re.ASCII       Makes \w, \W, \b, \B, \d, \D match the corresponding ASCII character categories (rather than the whole Unicode categories, which is the default) for string patterns.
I  also re.IGNORECASE  Enables case-insensitive matching.
L  also re.LOCALE      Makes \w, \W, \b, \B dependent on the current locale.
M  also re.MULTILINE   Makes "^" match the beginning of lines (after a newline) as well as the string. "$" matches the end of lines (before a newline) as well as the end of the string.
S  also re.DOTALL      Makes "." match any character at all, including the newline.
X  also re.VERBOSE     Asks Regex Engine to ignore whitespace and comments.
U  also re.UNICODE     Asks the Regex Engine to use the Unicode character set for substitution, which is default for string patterns.

>>> re.sub('(F)(o)(o)', r'barWith\1\2\3', 'Foo foo foo')
'barWithFoo foo foo'
>>> re.sub('(F)(o)(o)', r'barWith\1\2\3', 'Foo foo foo', flags = re.IGNORECASE)
'barWithFoo barWithfoo barWithfoo'
>>> re.sub('foo.', r'barNewLine ', 'foo\nfoo\nfoo\n')
'foo\nfoo\nfoo\n'
>>> re.sub('foo.', r'barNewLine ', 'foo\nfoo\nfoo\n', flags = re.DOTALL)
'barNewLine barNewLine barNewLine '
# Specifying multiple flags using |
>>> re.sub('Foo.', r'barNewLine ', 'foo\nfoo\nfoo\n', flags = re.DOTALL | re.IGNORECASE)
'barNewLine barNewLine barNewLine '
The subn() performs the same function as sub(), except for the fact that it returns a 2-element tuple, consisting of the new string after the replacement(s) and number of substitutions made.
>>> re.sub('foo', r'barWithFoo', 'foo foo foo')
'barWithFoo barWithFoo barWithFoo'
>>> re.subn('foo', r'barWithFoo', 'foo foo foo')
('barWithFoo barWithFoo barWithFoo', 3)
re.split(), re.findall(), re.finditer()
The split(pattern, stringSource, maxsplit = 0, flags = 0) splits the stringSource on the basis of occurrences of the pattern, returning a list of resulting substrings.
If capturing groups are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
The maxsplit optional argument, when nonzero, specifies the maximum number of splits which take place. The rest of the stringSource is returned as the last element of the list.
>>> re.split('foo', '1 foo 2 foo 3 foo')
['1 ', ' 2 ', ' 3 ', '']
>>> re.split('(f)(o)(o)', '1 foo 2 foo 3 foo')
['1 ', 'f', 'o', 'o', ' 2 ', 'f', 'o', 'o', ' 3 ', 'f', 'o', 'o', '']
>>> re.split('foo', '1 foo 2 foo 3 foo', maxsplit = 2)
['1 ', ' 2 ', ' 3 foo']
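Two situations where split() with a pattern does more than str.split() are sketched below: swallowing optional whitespace around a separator, and keeping the separators by wrapping them in a capturing group. The sample strings are made up.

```python
import re

# Split on commas, ignoring any whitespace around them.
fields = re.split(r'\s*,\s*', 'a , b,c ,d')

# A capturing group around the separator keeps the separators in the result.
tokens = re.split(r'([+-])', '1+2-3')

# maxsplit limits the number of splits; the remainder stays in one piece.
first_split_only = re.split(r'[+-]', '1+2-3', maxsplit=1)

print(fields)             # ['a', 'b', 'c', 'd']
print(tokens)             # ['1', '+', '2', '-', '3']
print(first_split_only)   # ['1', '2-3']
```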
The findall(pattern, stringSource, flags = 0) returns a list of all matches (including empty ones) in the stringSource. If the pattern contains a single capturing group, a list of that group's matches is returned; if it contains more than one group, a list of tuples of the matches of these groups is returned.
>>> re.findall('spam', 'spam spam spam')
['spam', 'spam', 'spam']
>>> re.findall('(s)(p)a(m)', 'spam spam spam')
[('s', 'p', 'm'), ('s', 'p', 'm'), ('s', 'p', 'm')]
The finditer(pattern, stringSource, flags = 0) works like findall(), except that it returns an iterator of match objects instead of a list. This holds whether or not the pattern contains capturing groups; the groups of each match are available through the match object itself, e.g. via groups().
>>> for element in re.finditer('spam', 'spam spam spam'):
...     print(element)
...
<_sre.SRE_Match object; span=(0, 4), match='spam'>
<_sre.SRE_Match object; span=(5, 9), match='spam'>
<_sre.SRE_Match object; span=(10, 14), match='spam'>
>>> for element in re.finditer('(s)(p)a(m)', 'spam spam spam'):
...     print(element.groups())
...
('s', 'p', 'm')
('s', 'p', 'm')
('s', 'p', 'm')
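Because finditer() hands back match objects, each hit carries its position and its named groups along with the text. The sketch below reuses the month/day expression from earlier in the article on an invented sentence and contrasts the result with findall().

```python
import re

text = 'Feb 5, Mar 12 and Dec 25'
month_day = re.compile(r'(?P<month>\w{3})\s(?P<day>\d{1,2})')

# Each match object exposes named groups and the position of the match.
dates = [
    (m.group('month'), int(m.group('day')), m.start())
    for m in month_day.finditer(text)
]

# findall() on the same pattern returns only tuples of the group contents.
group_tuples = month_day.findall(text)

print(dates)          # [('Feb', 5, 0), ('Mar', 12, 7), ('Dec', 25, 18)]
print(group_tuples)   # [('Feb', '5'), ('Mar', '12'), ('Dec', '25')]
```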
re.escape(), re.purge()
The escape(pattern) returns a copy of pattern in which all the characters except letters, numbers & underscore(_) are escaped. (Since Python 3.7, only characters that can have a special meaning in a regular expression are escaped, so the @ in the output below would be left as it is.) This is particularly useful when you are trying to reverse-engineer an expression. For example, say you wanted to match an email of the form first.last@domain.com:
>>> re.escape('first.last@domain.com') 'first\\.last\\@domain\\.com'
Once you have gotten the control characters out of the way, you can bring in the special sequences.
>>> pattern = re.compile('(\w+)\\.(\w+)\\@(\w+)\\.(\w+)')
>>> pattern.search('ethan.hunt@mi.com')
<_sre.SRE_Match object; span=(0, 17), match='ethan.hunt@mi.com'>
The escape(pattern) is useful in another situation. Say you wanted the expression to come from user input. You can use this function to create the expression, as it will automatically escape control characters in the input.
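As a minimal sketch of the user-input case, the script below searches for a piece of text that happens to contain a control character. The variable user_input stands in for whatever the user typed; it and the sample sentence are assumptions made for the illustration.

```python
import re

user_input = '1+1=2'            # stands in for text typed by a user
haystack = 'we know that 1+1=2'

# Used as-is, the + is read as "one or more 1s", so the literal text is missed.
unescaped_result = re.search(user_input, haystack)

# After escape(), every character of the input is matched literally.
escaped_result = re.search(re.escape(user_input), haystack)

print(unescaped_result)          # None
print(escaped_result.group(0))   # 1+1=2
```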
The purge() function clears the regular expression caches.
Summary
Following is a summary of the different sets of symbols, flags, special characters etc. we went over in this article.
############# CONTROL CHARACTERS ############
# All characters except the following 14 characters match themselves. The following 14 characters are control characters, with special meaning below. If you want to match any literal version of control characters, prefix it with a backslash.
*    : matches 0 or more repetitions of the preceding expression greedily
+    : matches 1 or more repetitions of the preceding expression greedily
?    : matches 0 or 1 occurrence of the preceding expression greedily; placed right after * or +, it makes that quantifier non-greedy
.    : matches any character except for \n
^    : matches the beginning of a string.
$    : matches the ending of a string.
|    : matches either of the expressions around the | symbol
[ ]  : matches any of the characters listed inside []; a hyphen (-) denotes alphanumeric ranges; ^ right after opening square bracket makes it match any character except for the listed ones.
( )  : used to capture portions of an expression; also used to denote precedence especially in complex expressions; (?:...) represents the non-capturing version of ().
{ }  : {m,n} matches m to n repetitions of the preceding expression
\    : used to escape control characters or signal a control sequence (e.g. \w, \W etc.) or special characters such as \n, \t etc.

############## PYTHON SPECIFIC EXTENSIONS ?P DISCUSSED IN THIS ARTICLE ################
(?P<name>...)  The substring matched by the group is accessible by name.
(?P=name)      Matches the text matched earlier by the group named name.

############## FLAGS IN PYTHON SPECIFIC EXTENSIONS ?P ################
i: tells the searching engine to ignore case
L: makes \w, \b, and \s locale dependent
m: enables multiline matching, i.e. ^ and $ also match at the beginning and end of each line
s: the dotall flag; it makes the dot match all characters, including newline character
u: makes \w, \b, \d, and \s unicode dependent
x: makes the expression verbose i.e. ignores unescaped whitespace as well as text after # sign i.e. it treats text after # as comments.
############# VALID VALUES FOR OPTIONAL ARGUMENT flag IN FUNCTIONS OF re MODULE ############### A also re.ASCII Makes \w, \W, \b, \B, \d, \D match the corresponding ASCII character categories (rather than the whole Unicode categories, which is the default) for string patterns. I also re.IGNORECASE Enables case-insensitive matching. L also re.LOCALE Makes \w, \W, \b, \B dependent on the current locale. M also re.MULTILINE Makes "^" match the beginning of lines (after a newline) as well as the string. "$" matches the end of lines (before a newline) as well as the end of the string. S also re.DOTALL Makes "." match any character at all, including the newline. X also re.VERBOSE Asks Regex Engine to ignore whitespace and comments. U also re.UNICODE Asks the Regex Engine to use the Unicode character set for substitution, which is default for string patterns. ############### CONTROL SEQUENCES ############### \d matches numerals or digits; is equivalent to the character class [0-9] \D matches non-numeral or non-digit characters; equivalent to [^0-9] \s matches whitespace characters such as \t (tab), \n (linefeed), \r (carriage return), \v (vertical tab), \f (formfeed) etc. \S matches non-whitespace character; equivalent to [^\t\n\r\v\f] \w matches alphanumeric characters; equivalent to [a-zA-Z0-9_] \W matches non-alphanumeric characters; equivalent to [^a-zA-Z0-9_] \number matches the contents captured by the group denoted by the same number; we will discuss groups shortly. \A matches only at the start of the original string. \Z matches only at the end of the original string. \b matches an empty string at word boundaries; \b is defined as the boundary between a \w & \W character or between \w and the two ends of a string; note that you will need to write your regex containing \b sequences with raw string notation i.e. r'patternContaining\bSequences', since \b in regular string notation denotes the backspace character. 
    Alternatively, you can use two backslashes to mean the same, i.e. 'patternContaining\\bSequences'.
\B  matches an empty string not at word boundaries, i.e. re.search(r'\Bam\B', 'name') will return a match whereas r'\Bn' will not, since there is a word boundary right before the letter n
# In order to match a literal backslash, use the raw string notation, i.e. re.search(r'\\', 'x\\y'). Alternatively, you can escape both the backslashes by prefixing each of them with another backslash, i.e. re.search('\\\\', 'x\\y')
######### FUNCTIONS OF re MODULE DISCUSSED IN THIS ARTICLE ############
* match(pattern, stringSource, flags=0) applies the pattern at the beginning of the string and returns a match object if a match is found; if no match is found, it returns None.
* search(pattern, stringSource, flags=0) scans through the entire string for the first location where the pattern matches and returns a match object; if no match is found, it returns None.
* compile(pattern, flags=0) compiles a pattern into a pattern object, which is convenient when we need to apply the same pattern to different strings.
* fullmatch(pattern, string, flags=0) performs an anchored search like match(), but returns a match object only if the expression matches the entire string. The fullmatch() method of a compiled pattern object additionally accepts optional pos and endpos arguments, in which case the expression must match the whole portion of the string between those two indexes.
* sub(pattern, replacement, stringSource, count=0, flags=0) replaces the occurrences of pattern in stringSource by replacement, and returns the modified string (if no substitutions are made, it returns the original string). The pattern may be either a simple string or a regular expression. The replacement may be either a string or a function.
* subn(pattern, replacement, stringSource, count=0, flags=0) performs the same function as sub(), except that it returns a 2-element tuple consisting of the new string after the replacement(s) and the number of substitutions made.
* split(pattern, stringSource, maxsplit=0, flags=0) splits the stringSource on the basis of occurrences of the pattern, returning a list of the resulting substrings.
* findall(pattern, stringSource, flags=0) returns a list of all matches (including empty ones) in the stringSource. If the pattern contains one capturing group, a list of the strings matched by that group is returned; if it contains several groups, a list of tuples of the matches of these groups is returned.
* finditer(pattern, stringSource, flags=0) finds the same matches as findall(), except that it returns an iterator of match objects instead of a list, whether or not the pattern contains capturing groups.
* escape(pattern) returns the pattern with the characters that have a special meaning in regular expressions escaped by backslashes (before Python 3.7, it escaped all characters except ASCII letters, numbers & underscore).
* purge() clears the regular expression caches.
############ FUNCTIONS OF MATCH OBJECT DISCUSSED IN THIS ARTICLE ############
* end(): returns the index of the original string where the match ended.
* groups(): returns a tuple of groups of the matched string.
* groupdict(): returns a dictionary of named groups of the matched string.
* group(): returns a specific group of the matched string.
* re: attribute storing the regular expression object whose search()/match() produced this match object.
* span(): returns a 2-element tuple denoting the indexes of the original string where the match starts and ends.
* start(): returns the index of the original string where the match started.
* string: attribute which stores the original string.
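To tie the function and match object summaries together, the following sketch runs several of them on one made-up string. The sample text, the group names and the helper mask_user are assumptions chosen purely for illustration.

```python
import re

text = 'alice@example.com, bob@test.org'  # made-up sample data
pattern = re.compile(r'(?P<user>\w+)@(?P<domain>\w+)\.(?P<tld>\w+)')

# findall() with several groups returns a list of tuples of strings.
pairs = pattern.findall(text)
print(pairs)   # [('alice', 'example', 'com'), ('bob', 'test', 'org')]

# finditer() yields match objects whether or not the pattern has groups.
spans = [m.span() for m in pattern.finditer(text)]
print(spans)   # [(0, 17), (19, 31)]

# Match object helpers, shown on the first match.
first = pattern.search(text)
print(first.group())               # alice@example.com
print(first.group('user'))         # alice
print(first.groups())              # ('alice', 'example', 'com')
print(first.groupdict())           # {'user': 'alice', 'domain': 'example', 'tld': 'com'}
print(first.start(), first.end())  # 0 17

# sub() accepts a function as the replacement (mask_user is a hypothetical helper).
def mask_user(m):
    return '*' * len(m.group('user')) + '@' + m.group('domain') + '.' + m.group('tld')

masked = pattern.sub(mask_user, text)
print(masked)                      # *****@example.com, ***@test.org

# subn() also returns the number of substitutions made.
replaced, count = re.subn(r'\d', '#', 'a1b22c')
print(replaced, count)             # a#b##c 3

# split() with maxsplit stops after that many splits.
pieces = re.split(r',\s*', 'a, b,c, d', maxsplit=2)
print(pieces)                      # ['a', 'b', 'c, d']

# fullmatch() succeeds only when the whole string matches.
full_ok = re.fullmatch(r'\d{4}', '2017') is not None
full_no = re.fullmatch(r'\d{4}', '2017x') is not None
print(full_ok, full_no)            # True False
```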
Handy Tips
- One way of breaking down a complex pattern is to use the string continuation character, i.e. a backslash (\) at the end of each line.
>>> re.search('(\w+)\\@(\w+)\\.(\w+)', 'ethan@mi.com')
<_sre.SRE_Match object; span=(0, 12), match='ethan@mi.com'>
>>> re.search('(\w+)\
\\@\
(\w+)\
\\.\
(\w+)', 'ethan@mi.com')
<_sre.SRE_Match object; span=(0, 12), match='ethan@mi.com'>
- Another way of breaking down a complex pattern is by using the syntax (?#...) to write comments.
>>> re.search('(\w+)\\@(\w+)\\.(\w+)', 'ethan@mi.com')
<_sre.SRE_Match object; span=(0, 12), match='ethan@mi.com'>
>>> re.search('((?#catches username)\w+)\\@((?#catches domain)\w+)\\.((?#catches type of domain)\w+)', 'ethan@mi.com')
<_sre.SRE_Match object; span=(0, 12), match='ethan@mi.com'>
>>> re.search('\
((?#catches username)\w+)\
\\@\
((?#catches domain)\w+)\
\\.\
((?#catches type of domain)\w+)', 'ethan@mi.com')
<_sre.SRE_Match object; span=(0, 12), match='ethan@mi.com'>
- Raw string notation: Throughout the article, there have been a few instances when I have mentioned this term. Whether you want to match an empty string at word boundaries using \b, match a literal backslash, or backreference an unnamed group, you can put the raw string notation to good use. If you look at examples of Regular Expressions in Python all over the internet, you will note that it is the preferred string notation. It prevents overcrowding of your pattern as well as your search string with backslashes, which helps readability.
- Pythex & Regex101:
Pythex is a real-time Python Regular Expression matching engine. You can use this awesome online tool to test your regular expressions. It matches your expression against the target string as you type, with optional flags, and it can also capture groups, generate a hyperlink to the expression and show a handy cheatsheet of symbols.
Regex101 is a similar tool that you can use as well.
- Documentation: The documentation of the re module provides a comprehensive view of all the objects that it contains, along with suitable examples.
- re.py: You can view the contents of the re module in the Lib directory of your Python installation in order to gain a better understanding of Regular Expressions in Python. You will find a file re.py in this directory (from Python 3.11 onwards, re is a package, so look for Lib/re/__init__.py instead). You may open it with your favorite text editor.
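The re.VERBOSE flag listed in the summary offers one more way of breaking down a complex pattern, alongside the first two tips. The sketch below is only an illustration; it reuses the same email-like pattern and sample address as those tips.

```python
import re

# With re.VERBOSE, unescaped whitespace and text after # are ignored,
# so the pattern can be spread over several commented lines.
email = re.compile(r'''
    (\w+)   # catches username
    @       # literal at sign
    (\w+)   # catches domain
    \.      # literal dot
    (\w+)   # catches type of domain
''', re.VERBOSE)

match = email.search('ethan@mi.com')
print(match.groups())  # ('ethan', 'mi', 'com')
print(match.span())    # (0, 12)
```

Because the triple-quoted string is a raw string, no string continuation characters or doubled backslashes are needed here.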
To understand why raw string notation is better, try >>> print('\\') and >>> print(r'\\'). The former yields \ whereas the latter yields \\. In regular string notation you need four backslashes (\\\\) to mean \\, whereas in raw string notation two backslashes (\\) mean \\.
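The difference can be checked with a short script. This is a minimal sketch; the variable names and sample strings are made up for illustration.

```python
import re

regular = '\\\\'   # four typed backslashes, two in the actual string
raw = r'\\'        # two typed backslashes, two in the actual string

same = regular == raw
print(same, len(raw))      # True 2

# Either spelling gives the regex \\ which matches one literal backslash.
backslash_span = re.search(raw, 'x\\y').span()
print(backslash_span)      # (1, 2)

# \b is a backspace in a regular string but a word boundary in a raw one.
lengths = (len('\b'), len(r'\b'))
print(lengths)             # (1, 2)

with_raw = re.findall(r'\bcat\b', 'cat concat cat')
without_raw = re.findall('\bcat\b', 'cat concat cat')
print(with_raw)            # ['cat', 'cat']
print(without_raw)         # []
```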
Exercises
- Make a regular expression that matches an 'a' followed by 4 'b's in a string. Solution: ab{4}
- Make a regular expression which matches sequences of a single uppercase letter followed by one or more lowercase letters. Solution: [A-Z][a-z]+
- Compose a regular expression that matches a word at the end of a string, ending with an optional punctuation mark. Solution: \w+[^\w\s]?$
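- Write a Python script to remove leading zeroes from parts of an IP address. Solution: re.sub(r'\b0+(\d)', r'\1', '192.168.01.01'). Unlike a plain '\.0*' replacement, this also handles the first part of the address and leaves a part that is just 0 intact.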
- Write a Python program to convert camelCase words to underscore_separated_lowercase words. Solution: re.sub('([a-z]+)([A-Z])', r'\1_\2', 'randomCamelCaseVariableOne randomCamelCaseVariableTwo').lower()
- Write a Python script to convert a date of dd-mm-yyyy format to yyyy-mm-dd format. Solution: re.sub(r'(\d{1,2})-(\d{1,2})-(\d{4})', r'\3-\2-\1', '13-02-2017')
- Make a regular expression to extract all words from a string starting with a, d or s. Solution: re.findall(r'\b[ads]\w+\b', 'apple ball cat dog eagle sea sort')
- Write a Python script to replace any occurrence of an underscore, comma, or period with an at sign (@). Solution: re.sub(r'[_,.]', '@', '_,._,.')
- Write a Python script to find all 4-character-long words in a string. Solution: re.findall(r'\b\w{4}\b', 'two three four five six seven')
- Write a Python script to find all words in a string which are either 2-letter long, 3-letter long or 4-letter long. Solution: re.findall(r'\b\w{2,4}\b', 'on two three four five six seven')
- Write a Python script to obtain a list of numbers from a string containing alphanumeric characters. Solution: re.split('[a-zA-Z]+', '1a2b3c4de5f6')
- Write a Python script to extract the contents lying between the opening and closing tags of a paragraph with an id 'para'. Solution: matchObject = re.search(r"<p id\s*=\s*'para'>(.*?)</p>", "<p id = 'para'>Contents of paragraph</p>"); print(matchObject.group(1)) # 'Contents of paragraph'
- Write a regular expression that matches two of a kind characters from a string containing characters other than a newline e.g. 'abc123abc'. Solution: matchObject = re.search(r'.*?(.).*?\1', 'abc123abc'); matchObject.groups(). For clarity, try it out on https://regex101.com/.
- Write a regular expression that matches three of a kind characters from a string containing characters other than a newline e.g. 'abc123abc123abc'. Solution: matchObject = re.search(r'.*?(.).*?\1.*?\1', 'abc123abc123abc'); matchObject.groups(). For clarity, try it out on https://regex101.com/.
- Write a Python script to remove excessive spacing from a string. Solution: re.sub(r'\s+', ' ', 'this   is  a    string   with  excessive   spacing.')
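If you would like to check your answers, a few of the exercises can be verified with a short script such as the one below. It is only a sketch: the variable names are made up, and the IP-address pattern is one possible approach that keeps a part consisting of a single 0 intact.

```python
import re

# An 'a' followed by four 'b's.
four_bs = re.search(r'ab{4}', 'xxabbbbxx').group()

# dd-mm-yyyy to yyyy-mm-dd.
iso_date = re.sub(r'(\d{1,2})-(\d{1,2})-(\d{4})', r'\3-\2-\1', '13-02-2017')

# Words starting with a, d or s.
ads_words = re.findall(r'\b[ads]\w+\b', 'apple ball cat dog eagle sea sort')

# Words that are 2 to 4 characters long.
short_words = re.findall(r'\b\w{2,4}\b', 'on two three four five six seven')

# Numbers separated by letters.
numbers = re.split(r'[a-zA-Z]+', '1a2b3c4de5f6')

# camelCase to underscore_separated_lowercase.
snake = re.sub(r'([a-z]+)([A-Z])', r'\1_\2', 'randomCamelCaseVariableOne').lower()

# Leading zeroes in an IP address; a part that is just 0 stays as it is.
ip_one = re.sub(r'\b0+(\d)', r'\1', '192.168.01.01')
ip_two = re.sub(r'\b0+(\d)', r'\1', '010.0.0.001')

# First character that occurs twice.
repeated = re.search(r'.*?(.).*?\1', 'abc123abc').groups()

print(four_bs)      # abbbb
print(iso_date)     # 2017-02-13
print(ads_words)    # ['apple', 'dog', 'sea', 'sort']
print(short_words)  # ['on', 'two', 'four', 'five', 'six']
print(numbers)      # ['1', '2', '3', '4', '5', '6']
print(snake)        # random_camel_case_variable_one
print(ip_one)       # 192.168.1.1
print(ip_two)       # 10.0.0.1
print(repeated)     # ('a',)
```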
P.S.: You will receive a 'SyntaxError: invalid character in identifier' error if you try to run these examples as they are. To get around this, replace all occurrences of inverted commas (') by typing them manually.
The Python documentation has listed many good examples for you to understand regular expressions better. I'd advise you to go through the following three: Making a Phonebook, Text Munging & Finding all Adverbs.