导航：首页 > 互联网科技 >

BeautifulSoup的介绍及作用有哪些

发表于：2025-02-05 作者：千家信息网编辑

千家信息网最后更新 2025年02月05日，本篇内容介绍了"BeautifulSoup的介绍及作用有哪些"的有关知识，在实际案例的操作过程中，不少人都会遇到这样的困境，接下来就让小编带领大家学习一下如何处理这些情况吧！希望大家仔细阅读，能够学有

千家信息网最后更新 2025年02月05日BeautifulSoup的介绍及作用有哪些

本篇内容介绍了"BeautifulSoup的介绍及作用有哪些"的有关知识，在实际案例的操作过程中，不少人都会遇到这样的困境，接下来就让小编带领大家学习一下如何处理这些情况吧！希望大家仔细阅读，能够学有所成！

一、BeautifulSoup构建

1.1 通过字符串构建

from bs4 import BeautifulSouphtml = """The Dormouse's storyOnce upon a time there were three little sisters; and their names were
"""soup = BeautifulSoup(html, 'html.parser')print(soup.prettify())

1.2 从文件加载

from bs4 import BeautifulSoupwith open(r"F:\tmp\etree.html") as fp:    soup = BeautifulSoup(fp,"lxml")print(soup.prettify())

二、Tag对象

2.1 string、strings、stripped_strings

如果一个节点只包含文本节点，可以通过string直接访问文本节点

如果不止包含文本节点，那么string为None

如果不止包含文本节点，可以通过strings、stripped_strings获取文本节点内容，strings、stripped_strings获取的都是生成器。

2.2 get_text()

只获取文本节点

soup.get_text()#可以指定不同节点之间的文本使用|分割。soup.get_text("|")# 可以指定去除空格soup.get_text("|", strip=True)

2.3 属性

tag.attrs是一个字典类型，可以通过tag['id']这样的方式获取值。下标访问的方式可能会抛出异常KeyError，所以可以使用tag.get('id')方式，如果id属性不存在，返回None。

三、contents、children与descendants

都是节点的子节点，不过： contents是列表 children是生成器

contents、children只包含直接子节点，descendants也是一个生成器，不过包含节点的子孙节点

3.1 parent、parents

parent：父节点 parents：递归父节点

3.2 next_sibling、previous_sibling

next_sibling：后一个兄弟节点 previous_sibling：前一个兄弟节点

3.3 next_element、previous_element

next_element：后一个节点 previous_element：前一个节点

next_element与next_sibling的区别是：

next_sibling从当前tag的结束标签开始解析
next_element从当前tag的开始标签开始解析

四、find、find_all

4.1 方法

find_parent:查找父节点 find_parents:递归查找父节点 find_next_siblings:查找后面的兄弟节点 find_next_sibling:查找后面满足条件的第一个兄弟节点 find_all_next:查找后面所有节点 find_next:查找后面第一个满足条件的节点 find_all_previous:查找前面所有满足条件的节点 find_previous:查找前面第一个满足条件的节点

4.2 tag名称

# 查找所有p节点soup.find_all('p')# 查找title节点，不递归soup.find_all("title", recursive=False)# 查找p节点和span节点soup.find_all(["p", "span"])# 查找第一个a节点，和下面一个find等价soup.find_all("a", limit=1)soup.find('a')

4.3 属性

# 查找id为id1的节点soup.find_all(id='id1')# 查找name属性为tim的节点soup.find_all(name="tim")soup.find_all(attrs={"name": "tim"})#查找class为clazz的p节点soup.find_all("p", "clazz")soup.find_all("p", class_="clazz")soup.find_all("p", class_="body strikeout")

4.4 正则表达式

import re# 查找与p开头的节点soup.find_all(class_=re.compile("^p"))

4.5 函数

# 查找有class属性并且没有id属性的节点soup.find_all(hasClassNoId)def hasClassNoId(tag):    return tag.has_attr('class') and not tag.has_attr('id')

4.6 文本

soup.find_all(string="tim")soup.find_all(string=["alice", "tim", "allen"])soup.find_all(string=re.compile("tim"))def onlyTextTag(s):    return (s == s.parent.string)# 查找只有文本节点的节点soup.find_all(string=onlyTextTag)# 查找文本节点为tim的a节点soup.find_all("a", string="tim")

五、select

5.1 方法

相比于find，select方法就少了很多，就2个，一个是select，另一个是select_one，区别是select_one只选择满足条件的第一个元素。

select的重点在于选择器上，所以接下来我们重点通过介绍示例介绍一些常用的选择器。如果对应css选择器不熟悉的朋友，可以先看一下后面CSS选择器的介绍。

5.2 通过tag选择

# 选择title节点soup.select("title")# 选择body节点下的所有a节点soup.select("body a")# 选择html节点下的head节点下的title节点soup.select("html head title")

通过tag选择非常简单，就是按层级，通过tag的名称使用空格分割就可以了。

5.3 id与类选择器

# 选择类名为article的节点soup.select(".article")# 选择id为id1的a节点soup.select("a#id1")# 选择id为id1的节点soup.select("#id1")# 选择id为id1、id2的节点soup.select("#id1,#id2")

id和类选择器也比较简单，类选择器使用.开头，id选择器使用#开头。

5.4 属性选择器

# 选择有href属性的a节点soup.select('a[href]')# 选择href属性为http://mycollege.vip/tim的a节点soup.select('a[href="http://mycollege.vip/tim"]')# 选择href以http://mycollege.vip/开头的a节点soup.select('a[href^="http://mycollege.vip/"]')# 选择href以png结尾的a节点soup.select('a[href$="png"]')# 选择href属性包含china的a节点soup.select('a[href*="china"]')# 选择href属性包含china的a节点soup.select("a[href~=china]")

5.5 其他选择器

# 父节点为div节点的p节点soup.select("div > p")# 节点之前有div节点的p节点soup.select("div + p")# p节点之后的ul节点(p和ul有共同父节点)soup.select("p~ul")# 父节点中的第3个p节点soup.select("p:nth-of-type(3)")

六、实例

最后我们还是通过一个小例子，来看一下BeautifulSoup的使用。

from bs4 import BeautifulSouptext = '''                            
          解忧杂货店
      [日] 东野圭吾 / 李盈春 / 南海出版公司 / 2014-5 / 39.50元
                      8.5                    (537322人评价)              
      现代人内心流失的东西，这家杂货店能帮你找回--僻静的街道旁有一家杂货店，只要写下烦恼投进卷帘门的投信口，      第二天就会在店后的牛奶箱里得到回答。因男友身患绝... 
    
'''soup = BeautifulSoup(text, 'lxml')print(soup.select_one("a.nbg").get("href"))print(soup.find("img").get("src"))title = soup.select_one("h3 a")print(title.get("href"))print(title.get("title"))print(soup.find("div", class_="pub").string)print(soup.find("span", class_="rating_nums").string)print(soup.find("span", class_="pl").string.strip())print(soup.find("p").string)

非常简单，如果对CSS选择器熟悉的话，很多复杂的结构也能轻松搞定。

七、CSS选择器

7.1 常用选择器

选择器	示例	说明
.class	.intro	选择class="intro"的所有节点
#id	#firstname	选择id="firstname"的所有节点
*	*	选择所有节点
element	p	选择所有p节点
element,element	div,p	选择所有div节点和所有p节点
element element	div p	选择div节点内部的所有p节点
element>element	div>p	选择父节点为div节点的所有p节点
element+element	div+p	选择紧接在div节点之后的所有p节点
element~element	p~ul	选择和p元素拥有相同父节点，并且在p元素之后的ul节点
[attribute^=value]	a[src^="https"]	选择其src属性值以"https"开头的每个a节点
[attribute$=value]	a[src$=".png"]	选择其src属性以".png"结尾的所有a节点
[attribute*=value]	a[src*="abc"]	选择其src属性中包含"abc"子串的每个a节点
[attribute]	[target]	选择带有target属性所有节点
[attribute=value]	[target=_blank]	选择target="_blank"的所有节点
[attribute~=value]	[title~=china]	选择title属性包含单词"china"的所有节点
[attribute\|=value]	[lang\|=zh]	选择lang属性值以"zh"开头的所有节点

div p是包含孙子节点，div > p只选择子节点

element~element选择器有点不好理解，看下面的例子：

                                ul-li1
            ul-li1
            ul-li1
        
        p标签
                    ul-li2
            ul-li2
            ul-li2
        
        h3 tag
                    ul-li3
            ul-li3
            ul-li3

7.2 位置选择器

选择器	示例	说明
:first-of-type	p:first-of-type	选择其父节点的首个p节点
:last-of-type	p:last-of-type	选择其父节点的最后p节点
:only-of-type	p:only-of-type	选择其父节点唯一的p节点
:only-child	p:only-child	选择其父节点的唯一子节点的p节点
:nth-child(n)	p:nth-child(2)	选择其父节点的第二个子节点的p节点
:nth-last-child(n)	p:nth-last-child(2)	从最后一个子节点开始计数
:nth-of-type(n)	p:nth-of-type(2)	选择其父节点第二个p节点
:nth-last-of-type(n)	p:nth-last-of-type(2)	选择其父节点倒数第二个p节点
:last-child	p:last-child	选择其父节点最后一个p节点

需要主要的是tag:nth-child(n)与tag:nth-of-type(n)，nth-child计算的时候不要求类型相同，nth-of-type计算的时候必须是相同的tag。

有点绕，可以看一下下面的示例。

    nth                 1-1p
        2-1div
        3-2p
        4-3p
        5-4p

7.3 其他选择器

选择器	示例	说明
:not(selector)	:not(p)	选择非p节点的节点
:empty	p:empty	选择没有子节点的p节点
::selection	::selection	选择被用户选取的节点
:focus	input:focus	选择获得焦点的input节点
:root	:root	选择文档的根节点
:enabled	input:enabled	选择每个启用的input节点
:disabled	input:disabled	选择每个禁用的input节点
:checked	input:checked	选择每个被选中的input节点
:link	a:link	选择所有未被访问的链接
:visited	a:visited	选择所有已被访问的链接
:active	a:active	选择活动链接
:hover	a:hover	选择鼠标指针位于其上的链接
:first-letter	p:first-letter	选择每个p节点的首字母
:first-line	p:first-line	选择每个p节点的首行
:first-child	p:first-child	选择属于父节点的第一个子节点的每个p节点
:before	p:before	在每个p节点的内容之前插入内容
:after	p:after	在每个p节点的内容之后插入内容
:lang(language)	p:lang(it)	选择带有以"it"开头的lang属性值的每个p节点