[Question] Page apparently not updating; crawler keeps writing the same post

Original poster: GHdisf45a (The_rabbit)   2022-12-15 12:39:55
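Hi everyone,
I've recently been learning how to write web crawlers, so I picked the PTT web board as a practice target.
However, after 10 posts the crawler keeps fetching the title and link of the same post over and over.
https://imgur.com/a/Bnqo2B1
My guess is that the page isn't loading new data.
But shouldn't a scroll-to-load dynamic page update whenever you scroll down?
I can also see the Selenium-driven ChromeDriver window actually scrolling, so what could be causing this?
My code is below: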
import urllib.request as req
import requests
import selenium
import schedule
import time
import json
from time import sleep
import openpyxl
import random
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
import bs4

pttWeb = openpyxl.load_workbook('pttweb.xlsx')
ws = pttWeb.active
i = 1
scroll_time = int(input("scroll_Times"))
options = Options()
# Raw string so the backslashes in the Windows path are taken literally
options.chrome_executable_path = r"C:\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(options=options)
sleep(3)
driver.get('https://www.pttweb.cc/hot/all/today')
sleep(5)
prev_ele = None
for now_time in range(1, scroll_time + 1):
    sleep(2)
    eles = driver.find_elements(by=By.CLASS_NAME, value='e7-right.ml-2')
    # If the list contains the last element from the previous pass,
    # crawl from that element through the current last element
    try:
        # print(eles)
        # print(prev_ele)
        eles = eles[eles.index(prev_ele):]
    except:
        pass
    for ele in eles:
        try:
            titleInfo = ele.find_element(by=By.CLASS_NAME, value="e7-article-default")
            title = titleInfo.text
            href = titleInfo.get_attribute('href')
            ws.cell(i, 1, i)
            ws.cell(i, 2, title)
            ws.cell(i, 3, href)
            sleep(3)
            inner = req.Request(href, headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
            })
            with req.urlopen(inner) as innerResponse:
                articleData = innerResponse.read().decode("utf-8")
            articleRoot = bs4.BeautifulSoup(articleData, "html.parser")
            main_content = articleRoot.find("div", itemprop="articleBody")
            boardInfo = articleRoot.find("span", class_="e7-board-name-standalone")
            authorInfo = articleRoot.find("span", itemprop="name")
            timeInfo = articleRoot.find("time", itemprop="datePublished")
            countInfo = articleRoot.find_all("span", class_="e7-head-content")
            board = boardInfo.text
            author = authorInfo.text
            Time = timeInfo.text
            count = countInfo[4].text
            allContent = main_content.text
            pre_text = allContent.split('
Author: lycantrope (阿宽)   2022-12-15 13:09:00
I'd suggest getting rid of the try-except: pass first. Posting the code on pastebin would also make it easier to read.
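A minimal sketch of what dropping the bare try-except: pass could look like for the list-slicing step in the code above. The helper name slice_from_previous and the logger name are hypothetical; the only assumption is that eles is a plain Python list, for which list.index raises ValueError when the item is missing. Catching just that one exception and logging it keeps every other error visible instead of silently swallowing it.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ptt_crawler")  # hypothetical logger name


def slice_from_previous(eles, prev_ele):
    """Return eles starting at prev_ele, or all of eles if prev_ele is absent."""
    try:
        start = eles.index(prev_ele)
    except ValueError:
        # list.index raises ValueError when prev_ele is not in the list;
        # report it instead of hiding it behind a bare `except: pass`.
        log.info("previous element %r not found; using the full list", prev_ele)
        return eles
    return eles[start:]
```

In the loop this would replace the first try block, e.g. eles = slice_from_previous(eles, prev_ele). Note that, like the original slice, the result still includes prev_ele itself.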
Original poster: GHdisf45a (The_rabbit)   2022-12-15 16:34:00
Update: https://pastebin.com/cyUdWYLZ (Pastebin of the code)
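Author: surimodo (好吃棉花糖)   2022-12-16 01:28:00
Blind guess: you're matching the wrong class. Titles don't only use e7-article-default; there are also e7-article-viewed and e7-article-most-recently-viewed. Also, don't pass in your try-except. It is surely raising a class-not-found error, so why pass? If you aren't going to debug, you might as well delete all the try-except blocks. Writing one and then passing is just wasted effort.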
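The two points in this reply could be sketched roughly as follows. The class names are the three mentioned above (not verified against the live page), and the helper names build_title_selector and unseen_posts are hypothetical. The idea is to match any of the title classes with a single CSS selector, and to skip links that were already written so a re-read element does not produce a repeated row.

```python
# Class names as reported in the reply above; not verified against the live page.
TITLE_CLASSES = [
    "e7-article-default",
    "e7-article-viewed",
    "e7-article-most-recently-viewed",
]


def build_title_selector(class_names):
    """Build a CSS selector that matches an element with any of the classes."""
    return ", ".join("." + name for name in class_names)


def unseen_posts(posts, seen_hrefs):
    """Yield (title, href) pairs whose href is new, recording each new href."""
    for title, href in posts:
        if href in seen_hrefs:
            continue  # already written on an earlier scroll pass
        seen_hrefs.add(href)
        yield title, href


selector = build_title_selector(TITLE_CLASSES)
# With Selenium, the selector would be passed as (assumption, same API as above):
#   ele.find_element(By.CSS_SELECTOR, selector)

# Stand-in data for two scroll passes that overlap on one post.
seen = set()
first_pass = list(unseen_posts([("A", "/a"), ("B", "/b")], seen))
second_pass = list(unseen_posts([("B", "/b"), ("C", "/c")], seen))
```

Here the second pass only yields the post with the new link, so the overlapping one is written once.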
