[Question] Page apparently not updating; crawler keeps writing the same post - GHdisf45a - PTT 批踢踢实业坊


Original poster: GHdisf45a (The_rabbit) 2022-12-15 12:39:55

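Hi everyone,
I've recently been learning how to write web crawlers, so I picked the web version of PTT as a practice target.
But I've run into a problem: after 10 posts, the crawler keeps grabbing the title and link of the same post over and over.

My guess is that the page isn't loading new data.
But doesn't a scroll-to-load dynamic page update as soon as you scroll down?
And I can see that the Selenium scroll in ChromeDriver is actually being executed, so what is causing this?
My code is below: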
import urllib.request as req
import requests
import selenium
import schedule
import time
import json
from time import sleep
import openpyxl
import random
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support import expected_conditions as EC
import bs4

# Workbook that receives one row per scraped post
pttWeb = openpyxl.load_workbook('pttweb.xlsx')
ws = pttWeb.active
i = 1
scroll_time = int(input("scroll_Times"))
options = Options()
# Raw string so the backslashes in the Windows path are not read as escapes;
# Selenium 4 takes the driver path through Service rather than Options
service = Service(executable_path=r"C:\chromedriver_win32\chromedriver.exe")
driver = webdriver.Chrome(service=service, options=options)
sleep(3)
driver.get('https://www.pttweb.cc/hot/all/today')
sleep(5)
prev_ele = None
for now_time in range(1, scroll_time+1):
    sleep(2)
    eles = driver.find_elements(by=By.CLASS_NAME, value='e7-right.ml-2')
    # If the list contains the last element from the previous pass,
    # scrape from that element through the current last element
    try:
        # print(eles)
        # print(prev_ele)
        eles = eles[eles.index(prev_ele):]
    except:
        pass
    for ele in eles:
        try:
            titleInfo = ele.find_element(by=By.CLASS_NAME, value="e7-article-default")
            title = titleInfo.text
            href = titleInfo.get_attribute('href')
            # Columns: running index, title, link
            ws.cell(i, 1, i)
            ws.cell(i, 2, title)
            ws.cell(i, 3, href)
            sleep(3)
            # Fetch the article page itself with a browser-like User-Agent
            inner = req.Request(href, headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
            })
            with req.urlopen(inner) as innerResponse:
                articleData = innerResponse.read().decode("utf-8")
            articleRoot = bs4.BeautifulSoup(articleData, "html.parser")
            main_content = articleRoot.find("div", itemprop="articleBody")
            boardInfo = articleRoot.find("span", class_="e7-board-name-standalone")
            authorInfo = articleRoot.find("span", itemprop="name")
            timeInfo = articleRoot.find("time", itemprop="datePublished")
            countInfo = articleRoot.find_all("span", class_="e7-head-content")
            board = boardInfo.text
            author = authorInfo.text
            Time = timeInfo.text
            count = countInfo[4].text
            allContent = main_content.text
            pre_text = allContent.split('

Author: lycantrope (阿宽) 2022-12-15 13:09:00

I'd suggest getting rid of the try-except: pass first, and putting the code on Pastebin so it's easier to read.
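
To illustrate the suggestion above, here is a minimal sketch of what the slicing step could look like without the bare except: pass. The helper name slice_from_previous and the logger name are hypothetical, not part of the posted code. The helper catches only the ValueError that list.index raises when the element is missing, and logs it instead of hiding it.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ptt_crawler")  # hypothetical logger name


def slice_from_previous(eles, prev_ele):
    """Return the part of eles starting at prev_ele, or all of eles if it is absent."""
    if prev_ele is None:
        # First pass: nothing was scraped before, so keep everything.
        return list(eles)
    try:
        start = eles.index(prev_ele)
    except ValueError:
        # list.index raises ValueError when the element is not in the list.
        # Log it so the situation is visible instead of being silently ignored.
        log.warning("previous element not found; processing the whole list")
        return list(eles)
    # Same behaviour as the posted code: the slice includes prev_ele itself.
    return list(eles[start:])
```

Note that, as in the posted code, the slice starts at prev_ele itself, so that element is processed a second time; starting at start + 1 would skip it. Whether that alone explains the repeated rows is not something this sketch can confirm.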

Original poster: GHdisf45a (The_rabbit) 2022-12-15 16:34:00

Update: https://pastebin.com/cyUdWYLZ (Pastebin of the code)

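Author: surimodo (好吃棉花糖) 2022-12-16 01:28:00

Blind guess: you're matching the wrong class. Titles don't only use e7-article-default; there are also e7-article-viewed and e7-article-most-recently-viewed. Also, don't pass inside try/except. It is certainly raising a class-not-found error, so why pass? If you're not going to debug, you might as well delete every try/except. Writing one and then passing is like taking your pants off to fart: extra work for nothing.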
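
A minimal sketch of the class-name guess above, assuming the title link carries one of the three classes named in the reply (worth confirming in the browser's developer tools). TITLE_CLASSES, title_selector, and keep_unseen are hypothetical names introduced for illustration. The selector string is meant to be passed to Selenium as ele.find_element(By.CSS_SELECTOR, title_selector()); the block avoids importing Selenium so it can run on its own. The seen-set helper is an extra guard that was not suggested in the thread.

```python
# Class names taken from the reply above; whether the page uses exactly
# these three is an assumption, not something verified here.
TITLE_CLASSES = (
    "e7-article-default",
    "e7-article-viewed",
    "e7-article-most-recently-viewed",
)


def title_selector(classes=TITLE_CLASSES):
    """Build a CSS selector matching an element that has any of the given classes."""
    return ", ".join(f".{name}" for name in classes)


def keep_unseen(hrefs, seen):
    """Return the hrefs not yet in seen, in order, and record them in seen."""
    fresh = []
    for href in hrefs:
        # Skip missing links and links that were already written out.
        if href and href not in seen:
            seen.add(href)
            fresh.append(href)
    return fresh
```

Inside the loop, the selector would replace the By.CLASS_NAME lookup for e7-article-default, and a single set shared across scroll passes would keep a post from being written twice even when consecutive passes overlap.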
