Re: [Question] What to pass into htmlParse - oldjojotenya - PTT (批踢踢实业坊)

Original poster: oldjojotenya (旧舅舅) 2015-01-31 13:46:49

Afterwards I picked two kinds of web pages and ran some tests:
I. A page that has all of its information on a single page:
https://tw.stock.yahoo.com/d/s/company_2330.html
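1.
library(XML)
url <- "https://tw.stock.yahoo.com/d/s/company_2330.html"
content0 <- htmlParse(url)
Result: succeeded, but with a warning message: XML content does not seem to be XML
I then looked it up on Stack Overflow, where someone answered with a way to
handle this situation:
"You can use RCurl to fetch the content and then XML seems to be able to
handle it", meaning that fetching the page with RCurl's getURL first makes it work.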
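2.
library(RCurl)
html <- getURL("https://tw.stock.yahoo.com/d/s/company_2330.html")  # page source as a string
content1 <- htmlParse(html, asText = TRUE)  # parse the string itself, not a file name
Result: succeeded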
3.
url <- "https://tw.stock.yahoo.com/d/s/company_2330.html"
f <- file(url)
f_size <- file.info(url)$size
content2 <- readChar(f, f_size)
close(f)
Result:
# Error in readChar(f, f_size) : cannot open the connection
In addition: Warning message:
In readChar(f, f_size) : unsupported URL scheme
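
The fetch-then-parse approach from test 2 can be sketched as follows. This is only a minimal sketch, assuming the XML and RCurl packages are installed; `parse_page` is a hypothetical helper name, not part of either package, and the inline HTML string is made up so the example runs offline. Passing `asText = TRUE` tells `htmlParse` to treat its argument as HTML text rather than as a file name or URL.

```r
library(XML)    # htmlParse, xpathSApply, xmlValue

# Hypothetical helper: download the page with RCurl, then parse the text.
parse_page <- function(url) {
  html <- RCurl::getURL(url)        # page source as one character string
  htmlParse(html, asText = TRUE)    # parse the string, not a file name
}

# Offline illustration with an inline HTML string (made-up content).
doc <- htmlParse("<html><body><p>hello</p></body></html>", asText = TRUE)
xpathSApply(doc, "//p", xmlValue)
```

Calling `parse_page()` on one of the URLs above would need network access, and whether an https page downloads cleanly may depend on the local RCurl and SSL setup.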
II. A search page:
http://www.taifex.com.tw/chinese/3/7_12_1.asp
1.
url <- "http://www.taifex.com.tw/chinese/3/7_12_1.asp"
content0 <- htmlParse(url)
Result: succeeded
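2.
html <- getURL("http://www.taifex.com.tw/chinese/3/7_12_1.asp")  # page source as a string
content1 <- htmlParse(html, asText = TRUE)  # parse the string itself, not a file name
Result: succeeded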
3.
url <- "http://www.taifex.com.tw/chinese/3/7_12_1.asp"
f <- file(url)
f_size <- file.info(url)$size
content2 <- readChar(f, f_size)
close(f)
Result:
# Error: invalid 'nchars' argument
I checked how readChar is used: nchars must not be NA, yet for some reason the
f_size passed in here is NA.
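
The NA can be reproduced without any network access. A minimal sketch, using only base R and a made-up temporary file: `file.info()` looks its argument up as a path on the local file system, so a URL string that is not a local path gives `NA` for `size`, and that `NA` is what `readChar()` then rejects.

```r
# A local file: file.info() reports its size in bytes.
tmp <- tempfile(fileext = ".html")
writeLines("<html><body>demo</body></html>", tmp)
local_size <- file.info(tmp)$size

# A URL string is looked up as a (non-existent) local path, so size is NA.
remote_size <- file.info("http://www.taifex.com.tw/chinese/3/7_12_1.asp")$size

# readChar() needs a valid character count, so guard against NA first.
if (!is.na(local_size)) {
  txt <- readChar(tmp, local_size)
}
is.na(remote_size)
unlink(tmp)
```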
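Summary:
1. Whatever the page, fetching it with getURL first is the safer choice.
2. When file.info points at a local file, the size it returns is that file's
size, but when it points at a file on the network, for some reason it never
reads a valid size (it always shows NA), so readChar cannot be used to grab
the page content.
Could someone tell me why this is?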
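
One way around the missing size is to read from the connection line by line instead of asking for a character count up front. A minimal sketch, assuming base R only; `read_all` is a hypothetical helper, and the temporary file stands in for a web page so the example runs offline. For a remote page the connection would be created with `url()` rather than `file()`, and whether a given scheme such as https is supported depends on the R build.

```r
# Hypothetical helper: read everything from a connection into one string.
read_all <- function(con) {
  on.exit(close(con))                      # always release the connection
  lines <- readLines(con, warn = FALSE)    # no size needed up front
  paste(lines, collapse = "\n")
}

# Offline stand-in for a web page.
tmp <- tempfile(fileext = ".html")
writeLines(c("<html>", "<body>demo</body>", "</html>"), tmp)
page <- read_all(file(tmp))
unlink(tmp)

# For a remote page (not run here):
# page <- read_all(url("http://www.taifex.com.tw/chinese/3/7_12_1.asp"))
```

The resulting string can then be handed to `htmlParse(page, asText = TRUE)` in the same way as the output of getURL.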

Author: Wush978 (拒看低质媒体) 2015-02-01 22:43:00

Thank you for your spirit of inquiry!
