[Question] Beginner web-crawler question

OP: starlichin (白星羽)   2018-10-30 23:30:58
I'm teaching myself Python web crawling on Coursera.
One of the assignments asks me to take a position entered by the user,
find the URL at that position on a web page, follow that URL,
then find the URL at the same position on the next page, and repeat this counter times.
The original assignment description is:
In this assignment you will write a Python program that expands on
http://www.py4e.com/code3/urllinks.py. The program will use urllib to read
the HTML from the data files below, extract the href= values from the anchor
tags, scan for a tag that is in a particular position relative to the first
name in the list, follow that link and repeat the process a number of times
and report the last name you find.
Below is what I have so far, but it only prints the URL at the given position on the
first page. I don't know how to keep following that URL for the given counter and
print the URLs on the pages after it. I'd appreciate some help with this.
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
counter = input('Enter counter: ')
position = input('Enter position: ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the anchor tags
tags = soup('a')
lst = []
for tag in tags:
    link = tag.get('href', None)
    lst.append(link)
print(lst[int(position)-1])
Author: takingblue (takingblue)   2018-10-31 15:37:00
Send a request to each next link: write a loop around your counter.
OP: starlichin (白星羽)   2018-10-31 23:15:00
Solved! Thank you :)
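Takingblue's suggestion can be sketched as follows: wrap the fetch/parse/pick steps in a loop that runs counter times, feeding each found URL back in as the next request. This is only an illustration; the names follow_links and fetch_ignoring_ssl are made up here, and the fetch step is passed in as a parameter so the link-following logic can be tried without a network connection:

```python
import ssl
import urllib.request
from bs4 import BeautifulSoup

def follow_links(url, counter, position, fetch):
    """Starting from url, repeat counter times: fetch the page,
    collect its anchor tags, and follow the href at the given
    1-based position. Returns the list of URLs visited, in order."""
    visited = []
    for _ in range(counter):
        tags = BeautifulSoup(fetch(url), 'html.parser')('a')
        url = tags[position - 1].get('href', None)  # 1-based, as in the snippet above
        visited.append(url)
    return visited

def fetch_ignoring_ssl(url):
    # Same "ignore SSL certificate errors" context as in the original snippet
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return urllib.request.urlopen(url, context=ctx).read()

# Example over the network (assignment page):
# urls = follow_links('http://py4e-data.dr-chuck.net/known_by_Fikret.html',
#                     counter=4, position=3, fetch=fetch_ignoring_ssl)
# print(urls[-1])  # the last URL reached
```

The last element of the returned list is the page whose name the assignment asks for.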
