[Question] Problem scraping data with bs4 MOONY135 PTT 批踢踢实业坊

Original poster: MOONY135 (谈无欲) 2015-07-29 12:18:26

I want to scrape the web version of PTT and get each article's author and posting date,
as well as the article's URL.
import requests
from bs4 import BeautifulSoup
import sys

res_index = requests.get("https://www.ptt.cc/bbs/gamesale/index.html")
soup_index = BeautifulSoup(res_index.text, "html.parser")  # parse the page to get each article's URL link
main_container_index = soup_index.select('.r-ent')  # one element per article entry
for link in main_container_index:
    print(link.select('div.author')[0].text, link.select('div.date')[0].text)
    print(link.find('a')['href'])
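The line I have a question about is print(link.find('a')['href']).
I want to get the URL, but it seems I have to write it exactly this way to get
a href="/bbs/Gamesale/M.1438136421.A.732.html"
Could anyone explain why this line has to be written like this?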
================= Below is what the page looks like
thireh 7/29
<div class="title">
<a href="/bbs/Gamesale/M.1438136421.A.732.html">[PC ] 售mycard 点数85折</a>
</div>
DREAMLS 7/29
<div class="title">
<a href="/bbs/Gamesale/M.1438137518.A.6A3.html">[PSV ] 售/换 psv2007(青柠白）
+16g记忆卡+六片超值游戏</a>
</div>
CTC0115 7/29
<div class="title">
<a href="/bbs/Gamesale/M.1438137532.A.B0E.html">[PS3 ] 售 VR快打5 </a>
</div>

Author: s860134 (s860134) 2015-08-01 17:02:00

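First, understand that HTML can be read as a tree structure based on its tags (<a></a>, <div></div>, and so on), and bs4 is a package that combines building that tree and searching it. https://goo.gl/lCmf4C explains it "line by line" for you, so read through it patiently.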
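
To make the tree idea concrete, here is a minimal sketch of what link.find('a')['href'] does, run on a hand-written snippet instead of the live page. The surrounding r-ent and meta markup is an assumption modeled on the selectors in the original code, and entry_link is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified copy of one entry; the real page may contain more markup.
html = """
<div class="r-ent">
  <div class="title">
    <a href="/bbs/Gamesale/M.1438136421.A.732.html">[PC ] 售mycard 点数85折</a>
  </div>
  <div class="meta">
    <div class="date">7/29</div>
    <div class="author">thireh</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
entry = soup.select(".r-ent")[0]  # a Tag: one node of the parsed tree

a_tag = entry.find("a")  # first <a> found anywhere below this node, or None
href = a_tag["href"]     # a Tag allows dict-style access to its attributes
title = a_tag.text       # the text between <a> and </a>

print(a_tag.name)
print(href)
print(title)


def entry_link(tag):
    """Hypothetical helper: return the href of the first <a> under tag, or None."""
    a = tag.find("a")
    return a["href"] if a is not None else None
```

So the expression has two steps: find('a') walks down from the entry to the first <a> node, and ['href'] reads that node's href attribute. The URL is stored as an attribute of the tag, not as its text, which is why .text alone would return the title instead. If an entry has no <a> at all, find returns None and indexing it raises a TypeError, which is what a guard like entry_link avoids.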

Original poster: MOONY135 (谈无欲) 2015-08-02 21:24:00

Thank you.
