Using Regular Expressions in Python

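1. Introduction to Regular Expressions

A regular expression is a tool for describing text patterns. It is used to match, find, and replace characters in text. In Python, the standard-library re module provides regular expression support. Two concepts are central to writing regular expressions in Python: metacharacters and escape characters.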

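1.1 Metacharacters

Metacharacters are the basic building blocks of a regular expression. They match particular characters or positions in a string. Commonly used metacharacters include:

.   matches any single character except a newline (unless the re.DOTALL flag is set)
^   matches the start of the string
$   matches the end of the string (or the position just before a trailing newline)
*   matches the preceding element zero or more times
+   matches the preceding element one or more times
?   matches the preceding element zero or one time
\d  matches any digit; equivalent to [0-9] in ASCII mode (with str patterns it also matches other Unicode digits)
\w  matches any letter, digit, or underscore; equivalent to [a-zA-Z0-9_] in ASCII mode
\s  matches any whitespace character, including spaces, tabs, and newlines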
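The metacharacters listed above can be tried out with a short sketch. The patterns and sample strings below are illustrative choices, not taken from the original text; each pair is picked so that the pattern matches.

```python
import re

# Each pattern exercises one or two of the metacharacters listed above.
samples = {
    r'^h.llo$': 'hello',      # ^ and $ anchor the match; . matches the 'e'
    r'ab*c': 'ac',            # * allows zero 'b' characters
    r'ab+c': 'abbbc',         # + needs at least one 'b'
    r'colou?r': 'color',      # ? makes the 'u' optional
    r'\d\d\s\w\w': '42 ok',   # two digits, one whitespace, two word characters
}

for pattern, text in samples.items():
    matched = re.search(pattern, text) is not None
    print(f'{pattern!r} matches {text!r}: {matched}')

# A counter-example: + requires at least one 'b', so 'ac' does not match.
plus_needs_one = re.search(r'ab+c', 'ac')
print(plus_needs_one)
```

Every pair in the dictionary reports a match, while the final counter-example prints None.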

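1.2 Escape Characters

Because some metacharacters have a special meaning inside a regular expression, matching the character itself requires the escape character \ in front of it. Commonly used escapes include:

\   the escape character itself (write \\ to match a literal backslash)
\.  matches a literal dot
\^  matches a literal ^
\$  matches a literal $
\*  matches a literal *
\+  matches a literal +
\?  matches a literal ?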
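As a minimal sketch of why escaping matters, the example below compares an unescaped dot with an escaped one, and uses re.escape to build a literal pattern automatically. The sample strings are made up for illustration.

```python
import re

text = 'price: 3.50 (approx.)'

# Unescaped, '.' matches any character, so the pattern also accepts '3x50'.
loose = re.search(r'3.50', '3x50')
# Escaped, '\.' matches only a literal dot, so '3x50' is rejected.
strict = re.search(r'3\.50', '3x50')

# re.escape returns a pattern that matches the given text literally,
# which avoids escaping '(' , '.' and ')' by hand.
literal = re.escape('(approx.)')
found = re.search(literal, text)

print(loose is not None)
print(strict is not None)
print(found.group())
```

Using re.escape is a convenient choice when the text to match comes from user input or another variable rather than being typed into the pattern directly.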

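2. Basic Regular Expression Functions

2.1 re.match(pattern, string)

re.match tries to match the pattern at the start of the string only. If the pattern matches there, it returns a match object; otherwise it returns None.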

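import re
pattern = r'hello'
string = 'hello world'
# re.match only checks the beginning of the string
result = re.match(pattern, string)
if result is not None:
    print('match succeeded')
else:
    print('match failed')

The code above prints "match succeeded", because the pattern matches at the start of the string.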
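To make the "start of the string" rule concrete, here is a small sketch under the same 'hello world' input. It also shows re.fullmatch, a related standard-library function not covered in the text above, which requires the whole string to match.

```python
import re

text = 'hello world'

at_start = re.match(r'hello', text)      # the pattern is found at position 0
not_at_start = re.match(r'world', text)  # 'world' is present, but not at the start
whole = re.fullmatch(r'hello', text)     # fullmatch requires the entire string to match

print(at_start.span())   # start and end index of the match
print(not_at_start)
print(whole)
```

The first call returns a match covering indices 0 to 5, while the other two return None.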

2.2 re.search(pattern, string)

re.search scans the whole string for the first position where the pattern matches. If it finds one, it returns a match object; otherwise it returns None.

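import re
pattern = r'world'
string = 'hello world'
# re.search looks for the pattern anywhere in the string
result = re.search(pattern, string)
if result is not None:
    print('match succeeded')
else:
    print('match failed')

The code above prints "match succeeded", because the string contains text that matches the pattern.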
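The match object returned by re.search carries more than a yes/no answer. The sketch below, using a made-up sentence, reads the matched text and its position, and shows that only the first occurrence is reported.

```python
import re

text = 'one world, two worlds'

# The pattern occurs twice; re.search stops at the first occurrence.
m = re.search(r'world', text)

print(m.group())            # the matched text
print(m.start(), m.end())   # where the match begins and ends
print(text[m.start():m.end()])
```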

2.3 re.findall(pattern, string)

re.findall scans the whole string for all non-overlapping substrings that match the pattern and returns them as a list.

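import re
pattern = r'[\d.]+'  # one or more digits or dots
string = '1.23 plus 4 is 5.23'
result = re.findall(pattern, string)
print(result)

The code above prints ['1.23', '4', '5.23'], because the decimals 1.23 and 5.23 and the integer 4 all match the pattern. Note that [\d.]+ accepts any run of digits and dots, so it would also match text such as '...' or '1.2.3'.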
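Two details of re.findall are worth a quick sketch: its return value changes when the pattern contains groups, and a loose character class can match more than intended. The input strings and the stricter pattern \d+(?:\.\d+)? are illustrative assumptions, not part of the original example.

```python
import re

text = 'x=1, y=22, z=333'

# No groups: the result is a list of the full matches.
numbers = re.findall(r'\d+', text)
# Two groups: the result is a list of (name, value) tuples.
pairs = re.findall(r'(\w)=(\d+)', text)

noisy = 'total: 5.23 ... ok'
# [\d.]+ also picks up the run of dots.
loose = re.findall(r'[\d.]+', noisy)
# Digits, optionally followed by a dot and more digits.
strict = re.findall(r'\d+(?:\.\d+)?', noisy)

print(numbers)
print(pairs)
print(loose)
print(strict)
```

The (?:...) form is a non-capturing group, used here so that findall still returns the full match rather than only the group contents.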

3. Advanced Regular Expressions

3.1 Grouping

Parentheses () define groups in a pattern, so that parts of a match can be extracted group by group.

import re
pattern = r'(\w+),(\w+)'
string = 'hello,world'
result = re.match(pattern, string)
if result is not None:
    print(result.group(0))  # the whole match: 'hello,world'
    print(result.group(1))  # the first group: 'hello'
    print(result.group(2))  # the second group: 'world'

The code above prints "hello,world", "hello", and "world" on separate lines. Because the pattern contains groups, the group method can retrieve the content of each group after a successful match.
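Numbered groups can become hard to read as patterns grow. As a sketch of one alternative, the same 'hello,world' input can be matched with named groups via the (?P<name>...) syntax; the names first and second are chosen here purely for illustration.

```python
import re

m = re.match(r'(?P<first>\w+),(?P<second>\w+)', 'hello,world')

print(m.groups())         # all groups as a tuple
print(m.group('first'))   # a group looked up by name
print(m.groupdict())      # all named groups as a dictionary
```

Named groups remain accessible by number as well, so m.group(1) and m.group('first') return the same text.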

3.2 Substitution

re.sub replaces every part of a string that matches the pattern with a given replacement string.

import re
pattern = r'world'
string = 'hello world'
# Replace each match of the pattern with "python"
new_string = re.sub(pattern, "python", string)
print(new_string)

The code above prints "hello python", because "world" in the string was replaced with "python".
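Beyond a fixed replacement string, re.sub also accepts backreferences, a count limit, and a function. The following sketch shows all three; the input strings are invented for the example.

```python
import re

# Backreferences \1 and \2 refer to the captured groups, here swapping them.
swapped = re.sub(r'(\w+),(\w+)', r'\2,\1', 'hello,world')

# count limits how many matches are replaced.
first_only = re.sub(r'o', '0', 'hello world', count=1)

# A function receives each match object and returns the replacement text.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), '3 apples and 10 pears')

print(swapped)
print(first_only)
print(doubled)
```

A replacement function is a reasonable choice whenever the new text has to be computed from the matched text instead of being fixed in advance.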

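3.3 Greedy and Non-Greedy Matching

Quantifiers are greedy by default: they match as many characters as possible, so the longest possible match is found. To make a quantifier non-greedy, append a question mark ? to it (for example *? or +?); it then matches as few characters as possible.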

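import re
pattern = r'<.*>'  # greedy
string = '<b>world</b>'  # illustrative input; the tag name is an assumption, as the original markup was lost
result = re.findall(pattern, string)
print(result)  # ['<b>world</b>']: the greedy .* runs to the last '>'
pattern = r'<.*?>'  # non-greedy
result = re.findall(pattern, string)
print(result)  # ['<b>', '</b>']: the non-greedy .*? stops at the first '>'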
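The same contrast shows up when extracting quoted text. This sketch uses a made-up sentence and also includes a negated character class, a common alternative to a non-greedy quantifier that is not discussed in the text above.

```python
import re

text = 'say "yes" or "no"'

# Greedy: .* runs from the first quote to the last quote.
greedy = re.findall(r'".*"', text)
# Non-greedy: .*? stops at the nearest closing quote.
lazy = re.findall(r'".*?"', text)
# Alternative: match anything except a quote between the quotes.
negated = re.findall(r'"[^"]*"', text)

print(greedy)
print(lazy)
print(negated)
```

For this input the non-greedy pattern and the negated character class give the same result. They can differ on other inputs, for instance [^"] also matches newlines while . does not by default.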

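4. Conclusion

This article covered the basic concepts, core functions, and more advanced uses of regular expressions in Python: metacharacters, escape characters, the match, search, and findall functions, grouping, substitution, and greedy versus non-greedy matching. Regular expressions are a powerful text-processing tool, widely used in tasks such as data cleaning and information extraction.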
