[Question] Web crawler question (BeautifulSoup) | TZULIU | PTT (批踢踢实业坊)


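Original poster: TZULIU (消费券收购商) 2017-09-02 16:20:11

I'm writing a program that computes the level (link depth) of every link reachable from web page A to other sites.
My approach is to scrape the 'href' attributes directly from each page.
When I set the maximum level to 2, there is no problem at all,
but when I set it to 3, I get the following error message:
UnboundLocalError: local variable 'soup' referenced before assignment
Could someone help me see where the problem is? Thanks!
The code is as follows: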
import lxml.html
import urllib.request
from bs4 import BeautifulSoup

stopLevel = 3  ## no problems at all if this is changed to 2
rootUrls = ['http://ps.ucdavis.edu/']
foundUrls = {}
for rootUrl in rootUrls:
    foundUrls.update({rootUrl : 0})

def getProtocolAndDomainName(url):
    # split the url on '://' and return a list
    protocolAndOther = url.split('://')
    protocol = protocolAndOther[0]
    domainName = protocolAndOther[1].split('/')[0]
    # this will only return 'https://xxxxx.com'
    return protocol + '://' + domainName

def crawl(urls, stopLevel = 5, level=1):
    nextUrls = []
    if (level <= stopLevel):
        for url in urls:
            # need to handle urls (e.g., https) that cannot be read
            try:
                openedUrl = urllib.request.urlopen(url)
                soup = BeautifulSoup(openedUrl, 'lxml')
            except:
                print('cannot read for :' + url)
            for a in soup.find_all('a', href=True):
                href = a['href']
                if href is not None:
                    # for the case where the link is a relative path
                    if '://' not in href:
                        href = getProtocolAndDomainName(url) + href
                    # check whether the url has already been visited
                    if href not in foundUrls:
                        foundUrls.update({href : level})
                        nextUrls.append(href)
        # recursive call
        crawl(nextUrls, stopLevel, level + 1)

crawl(rootUrls, stopLevel)
print(foundUrls)

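Author: stucode 2017-09-02 22:05:00

The error message already tells you: you used soup before giving it a value. Think about the circumstances under which soup would not be assigned, and you will know where the problem is.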
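
To make the hint above concrete, here is a minimal sketch of the same try/except shape. It is not the poster's crawler: int() stands in for BeautifulSoup so it runs without a network, and the names parse and demo are hypothetical.

```python
def parse(text):
    """Mimic the try/except shape in crawl(): 'soup' is bound only on success."""
    try:
        soup = int(text)  # stands in for BeautifulSoup(openedUrl, 'lxml')
    except ValueError:
        print('cannot read for :' + text)
    # If the try block raised, 'soup' was never assigned on this path,
    # so the next line raises UnboundLocalError.
    return soup


def demo():
    results = []
    for text in ['42', 'not a number']:
        try:
            results.append(parse(text))
        except UnboundLocalError as err:
            results.append(type(err).__name__)
    return results


print(demo())
```

The first input succeeds; the second one goes through the except branch, after which soup is read without ever having been assigned. In the crawler the same thing would happen whenever a URL fails before any earlier URL in the same crawl() call has succeeded, which may be why a deeper level is more likely to hit it.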

Author: Sunal (SSSSSSSSSSSSSSSSSSSSSSS) 2017-09-02 22:59:00

The exception handling could be better.
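
One possible reading of this comment, as a sketch only: catch the specific exceptions that urllib.request.urlopen is known to raise instead of a bare except, and return None so the caller can tell that the fetch failed. The helper name fetch_html and the 10-second timeout are assumptions, not part of the original code.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the URL cannot be read."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except ValueError as err:
        # urlopen raises ValueError for strings that have no URL scheme at all
        print('not a valid url: ' + url + ' (' + str(err) + ')')
    except urllib.error.URLError as err:
        # URLError also covers its subclass HTTPError (404, 500, ...)
        print('cannot read: ' + url + ' (' + str(err) + ')')
    return None
```

A caller would then check the result before parsing, for example: html = fetch_html(url), and only build the BeautifulSoup object when html is not None. A bare except would also swallow KeyboardInterrupt and real programming errors, which makes problems like the one in this thread harder to spot.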

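Original poster: TZULIU (消费券收购商) 2017-09-03 15:15:00

It seems that a few of the converted URLs cannot be opened with BeautifulSoup. My current workaround is to add a line, global soup, inside the crawl function. It makes the error go away, but I'm not sure it is a good approach. I also thought about putting a try-except under openedUrl = urllib.request.urlopen(url) instead of using global soup. Could anyone give me some advice? Thanks!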

Author: wennie0433 2017-09-03 16:04:00

How about printing out the processed URLs to see which one fails?
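
Following this suggestion, the sketch below prints what the thread's relative-path handling produces next to what urllib.parse.urljoin produces. The page URL and the href values are made up for illustration, and naive_join and resolve are hypothetical helper names; whether these are the URLs that actually failed for the poster is an assumption.

```python
from urllib.parse import urljoin, urlparse


def naive_join(page_url, href):
    """Reproduce the thread's approach: scheme + domain glued to href."""
    protocol, rest = page_url.split('://')
    return protocol + '://' + rest.split('/')[0] + href


def resolve(page_url, href):
    """Resolve href against page_url; return None for non-http(s) links."""
    absolute = urljoin(page_url, href)
    if urlparse(absolute).scheme not in ('http', 'https'):
        return None
    return absolute


# Hypothetical page and links, for illustration only.
page = 'http://ps.ucdavis.edu/people/index.html'
for href in ['/about', 'contact.html', 'mailto:someone@example.com', '#top']:
    print(repr(href), '->', naive_join(page, href), '|', resolve(page, href))
```

Plain concatenation only works for hrefs that start with a slash. A link such as contact.html or a mailto: address ends up glued onto the domain name, which gives a URL that urlopen cannot read.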

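Author: stucode 2017-09-03 16:14:00

Using global just puts soup into the global symbol table. It looks like it solves the problem, but it only happens not to raise an error. The more correct approach is to handle the control flow properly when fetching or parsing a page fails, and not use the soup variable afterwards, because it was never filled with the content it should have.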
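
The flow control described above could look roughly like this. It is a sketch under assumptions, not the poster's final code: the fetching and parsing step is passed in as a function (fetch_links, a hypothetical name) that returns a list of hrefs or None on failure, and a small in-memory dictionary replaces the real site so the example runs without a network or BeautifulSoup.

```python
def crawl(urls, fetch_links, found, stop_level=5, level=1):
    """Record the level at which each link is first seen, skipping unreadable pages."""
    if level > stop_level:
        return
    next_urls = []
    for url in urls:
        links = fetch_links(url)
        if links is None:
            # Fetching or parsing failed: skip this page instead of
            # touching a result that was never produced.
            print('cannot read for :' + url)
            continue
        for href in links:
            if href not in found:
                found[href] = level
                next_urls.append(href)
    crawl(next_urls, fetch_links, found, stop_level, level + 1)


# Hypothetical in-memory "site" standing in for urlopen + BeautifulSoup.
FAKE_SITE = {
    'root': ['broken', 'a'],
    'a': ['b'],
    'b': ['root', 'c'],
}


def fake_fetch_links(url):
    return FAKE_SITE.get(url)  # None plays the role of a failed request


found = {'root': 0}
crawl(['root'], fake_fetch_links, found, stop_level=3)
print(found)
```

Here 'broken' is the first URL visited at level 2, which is the case where the original code would read soup before assigning it; with continue the page is simply skipped. With global soup, a failed page would either hit a missing name or silently reuse the soup left over from an earlier page, so its links would be recorded under the wrong URL.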
