Common Tag-Search Methods in BeautifulSoup4 | 幸福的猪窝

I have been using the BeautifulSoup library for a web scraper recently. These are brief notes on common ways to search for HTML tags.

Installation

pip install beautifulsoup4

Usage

from bs4 import BeautifulSoup

# This is just a simple example; substitute a real HTML fragment for your own use case
htmlString = "<html><body>test content</body></html>"

soup = BeautifulSoup(htmlString, "html.parser")
  1. Search by tag name
# Search using only the tag name
htmlString = "<div> <span>main content</span> <div><span>sub div content</span></div></div>"
# Re-parse so that soup reflects the new HTML string
soup = BeautifulSoup(htmlString, "html.parser")
divlist = soup.select("div")

print(divlist)

'''
Output
[<div> <span>main content</span> <div><span>sub div content</span></div></div>, <div><span>sub div content</span></div>]
'''
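As a small cross-check of the tag-name search above, the sketch below compares `select("div")` with `find_all("div")` and `select_one("div")`, which are other BeautifulSoup methods for the same kind of lookup. The variable names (`html_string`, `by_selector`, `by_find_all`, `first_div`) are chosen here for illustration and are not from the original notes.

```python
from bs4 import BeautifulSoup

# Same HTML fragment as in the example above
html_string = "<div> <span>main content</span> <div><span>sub div content</span></div></div>"
soup = BeautifulSoup(html_string, "html.parser")

# CSS-selector search: every <div>, in document order
by_selector = soup.select("div")

# Equivalent search through the find_all API
by_find_all = soup.find_all("div")

# select_one returns only the first match (or None if nothing matches)
first_div = soup.select_one("div")

print(len(by_selector))   # number of <div> tags found
print([str(t) for t in by_selector] == [str(t) for t in by_find_all])
print(first_div)
```

For a bare tag name the two methods should find the same elements; `select` becomes the more convenient choice once the query needs CSS combinators.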
  2. Search by nesting (ancestor/descendant) relationship
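    # A space-separated selector matches span tags nested at any depth inside a div
    htmlString = "<div> <span>main content</span> <div><span>sub div content</span></div></div>"
    # Re-parse so that soup reflects this HTML string
    soup = BeautifulSoup(htmlString, "html.parser")
    divlist = soup.select("div span")

    print(divlist)

    '''
    Output
    [<span>main content</span>, <span>sub div content</span>]
    '''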

    # Add one more level: only span tags inside a div that is itself inside another div
    divlist = soup.select("div div span")
    print(divlist)

    '''
    Output
    [<span>sub div content</span>]
    '''
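The selectors above use a space, which in CSS matches descendants at any depth. A strict parent-child match uses the `>` combinator instead. The sketch below contrasts the two; the HTML fragment and the variable names (`html_string`, `descendants`, `children`) are made up for this illustration and do not come from the original notes.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one span nested inside a <p>, one directly under the <div>
html_string = "<div><p><span>deep</span></p><span>direct</span></div>"
soup = BeautifulSoup(html_string, "html.parser")

# Descendant combinator (space): spans at any depth below a div
descendants = soup.select("div span")

# Child combinator (>): only spans whose immediate parent is a div
children = soup.select("div > span")

print([s.get_text() for s in descendants])  # ['deep', 'direct']
print([s.get_text() for s in children])     # ['direct']
```

When the nesting depth of the target tag is not fixed, the space form is usually the safer choice; `>` helps when a looser match would pick up unwanted nested tags.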