[Question] Scraping PTT article content

Original poster: miao2361 (Miao) 2015-04-01 17:44:26

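[Question type]:
Programming help (I want to do something in R, but I don't know how to write it in R)
[Software familiarity]:
User (has already built a fair number of things in R)
[Problem description]:
Question 1
Using the httr and XML packages, I want to save PTT articles as .txt files for later text mining.
For the Gossiping board, however, the "confirm you are 18 or older" page gets in the way and the article cannot be saved.
Question 2
A question about RCurl (details below).
Question 1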
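library(XML)
library(httr)
# `line` must already hold a string containing the article URL (see the two cases below)
start <- regexpr('www', line)[1]   # position where "www" begins
end <- regexpr('html', line)[1]    # position where "html" begins
if (start != -1 && end != -1) {
  url <- substr(line, start, end + 3)
  html <- content(GET(url), encoding="UTF8")
  doc <- xpathSApply(html, "//div[@id='main-content']", xmlValue)
  name <- strsplit(url, '/')[[1]][4]   # file name part, e.g. "M.1427811176.A.552.html"
  write(doc, gsub('html', 'txt', name))
}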
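# When `line` is a PTT article URL from a board other than Gossiping
line = "https://www.ptt.cc/bbs/StupidClown/M.1427811176.A.552.html"
a new txt file appears in the working directory, containing the content of this StupidClown article.
# When `line` is a Gossiping board URL
line = "https://www.ptt.cc/bbs/Gossiping/M.1427816656.A.450.html"
the saved txt file is empty.
My guess is that this is caused by the Gossiping board's over-18 confirmation page:
https://www.ptt.cc/bbs/Gossiping/M.1427816656.A.450.html
Could anyone on the board tell me how to get past this page?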
Question 2
Original code
url <- substr(line, start, end+3)
html <- content(GET(url), encoding="UTF8")
doc <- xpathSApply(html, "//div[@id='main-content']", xmlValue)
name <- strsplit(url, '/')[[1]][4]
write(doc, gsub('html', 'txt', name))
I originally wanted to use the RCurl package for the second line:
html <- htmlParse(getURL(url), encoding='UTF-8')
but the resulting html is not the article:
> html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr>
<center>nginx</center>
</body>
</html>
After switching to content() and GET() from the httr package it worked, but I don't understand why XD
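A likely explanation for the 301 page, offered as an assumption since nobody in the thread confirms it: the extraction starts at "www", so the "https://" prefix is dropped and the request goes out over plain HTTP, which the server answers with a redirect. httr's GET() follows redirects by default, while RCurl's getURL() does not unless followlocation is set. The base-R sketch below shows the dropped prefix and one alternative extraction that keeps the scheme; the names url_no_scheme and url_full are illustrative only, and the RCurl call is left commented out because it was not tested against the live site.

```r
# Input line, as given in the post
line <- "https://www.ptt.cc/bbs/StupidClown/M.1427811176.A.552.html"

# Extraction used in the post: it starts at "www", so "https://" is lost
start <- regexpr("www", line)[1]
end <- regexpr("html", line)[1]
url_no_scheme <- substr(line, start, end + 3)

# Alternative: match the whole URL, scheme included
url_full <- regmatches(line, regexpr("https?://[^[:space:]]+\\.html", line))

# File name for the saved text, derived from the last path component
txt_name <- sub("\\.html$", ".txt", basename(url_full))

print(url_no_scheme)
print(url_full)
print(txt_name)

# Assumption: with RCurl, passing the full URL and/or following redirects
# should avoid the 301 page (not run here, needs network access):
# html <- XML::htmlParse(RCurl::getURL(url_full, followlocation = TRUE),
#                        asText = TRUE, encoding = "UTF-8")
```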

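Author: andrew43 (讨厌有好心推文后删文者) 2015-04-01 18:52:00

Setting the over18 cookie to 1 should be enough. I don't know how to do that on the R side, though.

Original poster: miao2361 (Miao) 2015-04-01 19:13:00

Right, I've seen Python code that does exactly this, but I don't know how to do it in R...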

Author: celestialgod (天) 2015-04-01 20:04:00

http://tinyurl.com/pn8cvwj RCurl has an option for setting this; please see the RCurl website. I'm out at the moment and can't test it, sorry. RCurl website: http://tinyurl.com/o3ckae3
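As a rough idea of what the RCurl route might look like (an untested sketch, not necessarily what the linked pages describe): getURL() passes curl options through, so the cookie and redirect handling can be given as the cookie and followlocation options. The helper name fetch_ptt_rcurl is hypothetical, and the live request is left commented out.

```r
library(RCurl)
library(XML)

# Hypothetical helper: fetch a PTT article with RCurl, sending the over18
# cookie and following redirects, then return the main-content text.
fetch_ptt_rcurl <- function(url) {
  raw <- getURL(url,
                cookie = "over18=1",      # pretend the age check was confirmed
                followlocation = TRUE,    # follow a 301 instead of returning it
                .encoding = "UTF-8")
  page <- htmlParse(raw, asText = TRUE, encoding = "UTF-8")
  xpathSApply(page, "//div[@id='main-content']", xmlValue)
}

# Usage (needs network access, so it is not run here):
# doc <- fetch_ptt_rcurl("https://www.ptt.cc/bbs/Gossiping/M.1427816656.A.450.html")
```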

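Original poster: miao2361 (Miao) 2015-04-08 17:54:00

Thank you very much to the poster above! It's solved, and it was far simpler than I expected XD The code is GET(url, config=set_cookies("over18"="1"),...) and that's all it takes.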
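For readers who want the whole flow in one place, here is a minimal sketch built around the set_cookies() call above. The helper name fetch_ptt_text is hypothetical, and parsing through content(..., as = "text") followed by htmlParse() is an assumption chosen so the sketch does not depend on what content() returns by default in a given httr version.

```r
library(httr)
library(XML)

# Cookie that tells PTT the over-18 confirmation was already accepted
over18 <- set_cookies(over18 = "1")

# Hypothetical helper: download one article and return its main-content text
fetch_ptt_text <- function(url) {
  resp <- GET(url, over18)
  stop_for_status(resp)   # fail early on HTTP errors
  page <- htmlParse(content(resp, as = "text", encoding = "UTF-8"),
                    asText = TRUE, encoding = "UTF-8")
  xpathSApply(page, "//div[@id='main-content']", xmlValue)
}

# Usage (needs network access, so it is not run here):
# url <- "https://www.ptt.cc/bbs/Gossiping/M.1427816656.A.450.html"
# write(fetch_ptt_text(url), sub("\\.html$", ".txt", basename(url)))
```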

You're welcome, and thank you for reporting back.
