I wanted to take on the challenge of writing a Python crawler, but the deeper I got, the more it felt like I'd run into the final boss...
Since the target is an aspx page, visiting it with requests won't fetch the full contents of the dynamically generated table,
or have I misunderstood something?
The thing is, while implementing this with selenium, I found that data I could still scrape with requests
comes back empty under selenium, just an empty list?
import requests
from selenium import webdriver
from time import sleep
from lxml import etree
url = "https://www.ntuh.gov.tw/labmed/检验目录/Lists/2015/BC.aspx"
browser = webdriver.Chrome()
browser.get(url)
# The url is visited with Chrome correctly
root = etree.fromstring(browser.page_source, etree.HTMLParser())
root.xpath("//table[@class='ms-listviewtable']/tr")
# It gives me [] while browser.page_source is a string of html
At this point you can see that the xpath matches nothing.
However, the same xpath does work when I fetch the page with requests:
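One possible cause (an assumption on my part, since I haven't diffed the two HTML sources): browsers normalize tables by inserting a <tbody> element, so in browser.page_source the <tr> rows are no longer direct children of <table>, and a path ending in /tr stops matching. A minimal sketch with lxml:

```python
from lxml import etree

# Simulated browser.page_source: browsers insert <tbody> when rendering,
# so rows served directly under <table> end up nested one level deeper.
browser_like_html = """
<html><body>
<table class="ms-listviewtable">
  <tbody>
    <tr><td>row 1</td></tr>
    <tr><td>row 2</td></tr>
  </tbody>
</table>
</body></html>
"""

root = etree.fromstring(browser_like_html, etree.HTMLParser())

# Direct-child path fails once <tbody> is present:
direct = root.xpath("//table[@class='ms-listviewtable']/tr")
# Descendant path matches regardless of <tbody>:
nested = root.xpath("//table[@class='ms-listviewtable']//tr")

print(len(direct), len(nested))  # prints: 0 2
```

If that is indeed the cause, switching to //table[@class='ms-listviewtable']//tr (or .../tbody/tr) should make the selenium version find the rows again.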
result = ""
while result == "":
    try:
        # Certificate is not verified to bypass the SSLError
        # (not secure, though)
        result = requests.get(url, verify=False)
        break
    except requests.exceptions.RequestException:
        sleep(5)
        continue
# Transform it into an element tree
root = etree.fromstring(result.content, etree.HTMLParser())
# Parse the information with XPath
root.xpath("//table[@class='ms-listviewtable']/tr")
# It gives me many elements of tr tags
This leaves me with two questions:
1. In this situation, how do I work around it if I want to keep using selenium?
2. I found online that you can click "next page" with browser.find_element_by_xpath(xpath).click(),
but on the site I'm trying to scrape, I can't tell the xpaths of the
"previous page" and "next page" buttons apart... Could anyone give me a hint?
Or is there another way to do this?
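On question 2, one idea is to write the XPath against something that differs between the two buttons, such as an attribute of a child element, instead of relying on their position. The markup below is purely hypothetical (the real pager on the NTUH page would have to be inspected in DevTools first, and the id/alt values here are made up); it only illustrates the predicate style, and the same XPath string could then be passed to browser.find_element_by_xpath(...).click():

```python
from lxml import etree

# Hypothetical pager markup -- ids and alt texts are assumptions,
# not taken from the real site.
pager_html = """
<html><body>
<table class="ms-pager"><tr>
  <td><a id="prevLink" href="#"><img alt="Previous"></a></td>
  <td><a id="nextLink" href="#"><img alt="Next"></a></td>
</tr></table>
</body></html>
"""

root = etree.fromstring(pager_html, etree.HTMLParser())

# Distinguish the two links by a predicate on the child <img alt="...">
# rather than by their order in the document:
next_link = root.xpath("//a[img/@alt='Next']")
prev_link = root.xpath("//a[img/@alt='Previous']")

print(next_link[0].get("id"), prev_link[0].get("id"))  # prints: nextLink prevLink
```

If the real buttons differ only in an image filename or a JavaScript href, a contains() predicate on that attribute should work the same way.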