Scraping links from a page with Beautiful Soup: how do I iterate over those links?

Anonymous contribution · 275 · 2024-01-21

Here is the code I use to retrieve the links from the page.

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import re

def getExternalLinks(includeURL):
    html = urlopen(includeURL)
    bsObj = soup(html, "html.parser")
    externalLinks = []
    links = bsObj.findAll("a",
                          href=re.compile("^(http://www.homedepot.com/b)"))
    for link in links:
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    print(externalLinks)

getExternalLinks("http://www.homedepot.com/")

The links are stored in the array below.

['http://www.homedepot.com/b/Appliances/N-5yc1vZbv1w?cm_sp=d-flyout-Appliances', 'http://www.homedepot.com/b/Bath/N-5yc1vZbzb3?cm_sp=d-flyout-Bath_and_Faucets', 'http://www.homedepot.com/b/Decor/N-5yc1vZas6p?cm_sp=d-flyout-Blinds_and_Decor', 'http://www.homedepot.com/b/Building-Materials/N-5yc1vZaqns?cm_sp=d-flyout-Building_Materials', 'http://www.homedepot.com/b/Doors-Windows/N-5yc1vZaqih?cm_sp=d-flyout-Doors_and_Windows', 'http://www.homedepot.com/b/Electrical/N-5yc1vZarcd?cm_sp=d-flyout-Electrical', 'http://www.homedepot.com/b/Flooring/N-5yc1vZaq7r?cm_sp=d-flyout-Flooring_and_Area_Rugs', 'http://www.homedepot.com/b/Hardware/N-5yc1vZc21m', 'http://www.homedepot.com/b/Heating-Venting-Cooling/N-5yc1vZc4k8?cm_sp=d-flyout-Heating_and_Cooling', 'http://www.homedepot.com/b/Kitchen/N-5yc1vZar4i?cm_sp=d-flyout-Kitchen', 'http://www.homedepot.com/b/Outdoors-Garden-Center/N-5yc1vZbx6k?cm_sp=d-flyout-Lawn_and_Garden', 'http://www.homedepot.com/b/Lighting-Ceiling-Fans/N-5yc1vZbvn5?cm_sp=d-flyout-Lighting_and_Ceiling_Fans', 'http://www.homedepot.com/b/Outdoors/N-5yc1vZbx82?cm_sp=d-flyout-Outdoor_Living', 'http://www.homedepot.com/b/Paint/N-5yc1vZar2d?cm_sp=d-flyout-Paint', 'http://www.homedepot.com/b/Plumbing/N-5yc1vZbqew?cm_sp=d-flyout-Plumbing', 'http://www.homedepot.com/b/Storage-Organization/N-5yc1vZas7e?cm_sp=d-flyout-Storage_and_Organization', 'http://www.homedepot.com/b/Tools/N-5yc1vZc1xy']

Now I am trying to iterate over those links, go to each page, and pull information from it. When I run the following code, I get an error.

def getInternalLinks(includeLinks):
    internalHTML = urlopen(includeLinks)
    Inner_bsObj = soup(internalHTML, "html.parser")
    internalLinks = []
    inner_links = Inner_bsObj.findAll("a", "href")
    for inner_link in inner_links:
        if inner_link.attrs['href'] is not None:
            if inner_link.attrs['href'] not in internalLinks:
                internalLinks.append(inner_link.attrs['href'])
    print(internalLinks)

getInternalLinks(getExternalLinks("http://www.homedepot.com"))

File "C:/Users/anag/Documents/Python

Scripts/Webscrapers/BeautifulSoup/HomeDepot/HomeDepotScraper.py", line 20,

in getInternalLinks

internalHTML = urlopen(includeLinks)

File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 223, in urlopen

return opener.open(url, data, timeout)

File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 517, in open

req.timeout = timeout

AttributeError: 'NoneType' object has no attribute 'timeout'

How should I extract information from each of the web pages whose links I stored in the externalLinks array?

Answer:

That is a list, not an array. In Python, "array" usually means something quite different from a list.
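For illustration, a minimal comparison (the values here are made up): a list holds arbitrary objects, while the standard library's array.array holds uniformly typed values.

from array import array

links = ['http://www.homedepot.com/b/Paint', 42]  # a list accepts mixed types
nums = array('i', [1, 2, 3])   # an array is fixed-type numeric storage
nums.append(4)                 # fine: 4 is an int
# nums.append('x')             # would raise TypeError: arrays only accept their declared type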

The problem with your code is that getExternalLinks() returns None, and you then pass that as the argument to getInternalLinks(), a function that expects a single URL. The first function needs to return the list (or set) of URLs instead of (only) printing them, and then you need a loop over the return value that feeds each URL to the second function.
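A minimal sketch of that fix, reusing your own function body; only the return statement and the driving loop are new:

def getExternalLinks(includeURL):
    html = urlopen(includeURL)
    bsObj = soup(html, "html.parser")
    externalLinks = []
    links = bsObj.findAll("a",
                          href=re.compile("^(http://www.homedepot.com/b)"))
    for link in links:
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks  # return the list instead of only printing it

# Loop over the returned URLs and feed each one to the second function.
for url in getExternalLinks("http://www.homedepot.com/"):
    getInternalLinks(url)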

The two functions also contain nearly identical code; naming differences aside, only the argument to the findAll() method differs. I would refactor that into one common function.

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_links(url, attrs=None):
    if attrs is None:
        attrs = dict()
    links = set()
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    for a_node in soup.find_all('a', attrs):
        link = a_node.get('href')
        if link is not None:
            links.add(link)
    return links

def main():
    external_links = get_links(
        'http://www.homedepot.com/',
        {'href': re.compile('^(http://www.homedepot.com/b)')},
    )
    print(external_links)
    for link in external_links:
        #
        # TODO I am not sure if you really want to filter on <a> elements
        # with a class of 'href' but that is what your code did, so...
        #
        internal_links = get_links(link, {'class': 'href'})
        print(internal_links)

if __name__ == '__main__':
    main()
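One follow-up on the TODO above: if the intent was "every <a> element that has an href attribute" rather than "every <a> whose CSS class is literally href", BeautifulSoup can filter on attribute presence with href=True. A minimal sketch of that alternative (get_all_hrefs is a hypothetical helper name, not part of the answer above):

from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_all_hrefs(url):
    # href=True matches <a> tags that carry an href attribute at all,
    # whatever its value; {'class': 'href'} matches by CSS class instead.
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    return {a['href'] for a in soup.find_all('a', href=True)}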



Tags: Python, python-3.x