Re: [Question] table problem (web page encoding) (celestialgod, PTT)

Original poster: celestialgod (天) 2016-06-06 16:23:37

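※ Quoting vicror84 (阿汘):
: ※ Quoting celestialgod (天):
: : I don't know how the page's header relates to its body text.
: : The declared charset is big5, but to actually read the table data I still have to use UTF-8.
: : So fetching the content with big5 works fine, and then I call read_html with encoding utf8.
: : After that I convert the utf8 back to big5 (only needed on Windows; on mac and linux you can skip the stri_conv part).
: : Then the table content displays correctly.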
: : library(httr)
: : library(pipeR)
: : library(xml2)
: : library(stringi)
: : library(stringr)
: : tableContent <- GET("http://depart.femh.org.tw/HMC/wholebody.html") %>>%
: :   content("text", encoding = "BIG5") %>>% read_html("UTF-8") %>>%
: :   xml_find_all("//tr/td/table/tbody/tr") %>>% lapply(function(x){
: :     output_text <- xml_find_all(x, "td") %>>% xml_text %>>%
: :       stri_conv(from = "UTF-8", to = "Big5") %>>% str_replace_all("\\s", "")
: :     if (length(output_text) >= 9 && length(output_text) <= 11)
: :     {
: :       return(c(rep("", 12-length(output_text)), output_text))
: :     } else if (length(output_text) == 8)
: :     {
: :       return(c(output_text[1:4], unlist(rbind(output_text[5:8], rep("", 4)))))
: :     } else
: :     {
: :       return(output_text)
: :     }
: :   }) %>>% do.call(what = rbind)
: : Screenshot of the result: (image not preserved)

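: : For explanations of the functions, look back through my earlier posts; one of them (#1N9lFXFI (R_Language)) has some notes underneath.
: : If anything is still unclear, reply and ask.
: : How to use that # string: http://evolutionbrain.blogspot.tw/2015/08/ptt.html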
: I'm a beginner who has just started learning R, so I can't quite follow the str_replace_all("\\s", "") part,
: and also,
str_replace_all(string, pattern, replacement):
http://www.inside-r.org/packages/cran/stringr/docs/str_replace_all
It replaces every substring of string that matches pattern with replacement.
For example: str_replace_all("aa bb cc", "\\s", "")
"\\s" is the regexp for whitespace and "" is a zero-length string, so after the replacement you get "aabbcc".
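If you want to try this without installing stringr, here is a minimal sketch in base R. It assumes gsub() is an acceptable stand-in for str_replace_all() for this simple pattern; the sample strings are made up for illustration.

```r
# Base R counterpart of str_replace_all(x, "\\s", "")
x <- c("aa bb cc", " 1,200 \n", "no_spaces")

# "\\s" matches any whitespace character (space, tab, newline);
# replacing every match with "" deletes it.
cleaned <- gsub("\\s", "", x)
print(cleaned)
```

In the scraping code this is what strips the stray spaces and line breaks from each table cell.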
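For %>>%, search the board for magrittr; there is a short tutorial post, and the part about %>% applies.
I only rely on a feature specific to %>>% in one place here (the do.call(what = rbind) step).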
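For reference, that last step can also be written without any pipe. This is only a sketch with made-up data (row_list is a hypothetical name standing in for what lapply() returns); it shows that do.call(what = rbind, args = ...) stacks a list of equal-length vectors into a matrix.

```r
# Hypothetical stand-in for the list that lapply() returns: one character
# vector per table row, all of the same length.
row_list <- list(
  c("a", "b", "c"),
  c("d", "e", "f")
)

# Same as rbind(row_list[[1]], row_list[[2]]): each list element
# becomes one row of the resulting character matrix.
mat <- do.call(what = rbind, args = row_list)
print(dim(mat))
print(mat)
```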
:     if (length(output_text) >= 9 && length(output_text) <= 11)
:     {
:       return(c(rep("", 12-length(output_text)), output_text))
:     } else if (length(output_text) == 8)
:     {
:       return(c(output_text[1:4], unlist(rbind(output_text[5:8], rep("", 4)))))
:     } else
:     {
:       return(output_text)
:     }
:   }) %>>% do.call(what = rbind)
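: I especially don't understand what those numbers stand for. Line-by-line comments would be even better, but no worries if that's inconvenient.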
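The numbers are there because the table on the page has cells that span several columns, so a scraped row does not necessarily contain all 12 columns.
You therefore have to pad each row out to 12 columns by hand; that is all I am doing here.
That is why, in my reply to your other post, I said to see my next post for the table handling.
It refers to this part: after extracting each row yourself, you pad it out or shift the empty slots.
This may be a bit hard for someone new to R, but there is no ready-made function that reads column-spanning cells directly.
If you only know readHTMLTable from XML, every row ends up left-aligned and the remaining cells are filled with "".
From there you can only shift things into place yourself, bit by bit.
I won't write line-by-line comments; this is just the concept. You need to pull the table down yourself,
look at what each scraped row looks like, and then see how the code handles each case.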
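To make the idea concrete, here is a small sketch of the padding logic on made-up rows. The helper name pad_row and the sample data are hypothetical; the three branches mirror the if/else in the code above, and it assumes, as that code does, that a complete row has 12 columns.

```r
# Hypothetical helper mirroring the if/else in the scraping code.
pad_row <- function(output_text, n_col = 12) {
  n <- length(output_text)
  if (n >= 9 && n <= 11) {
    # Leading cells are missing: prepend "" until the row has n_col cells.
    c(rep("", n_col - n), output_text)
  } else if (n == 8) {
    # Keep the first 4 cells, then put an empty cell after each of the last 4.
    # rbind() stacks the values over a row of blanks, and c() flattens the
    # resulting 2 x 4 matrix column by column, which interleaves them.
    c(output_text[1:4], unlist(rbind(output_text[5:8], rep("", 4))))
  } else {
    # Rows of any other length are returned unchanged.
    output_text
  }
}

# Made-up rows with 12, 10 and 8 cells.
rows <- list(as.character(1:12), as.character(1:10), as.character(1:8))

# First look at how many cells each scraped row has.
print(lengths(rows))

# Then pad every row and stack the results into one matrix.
padded <- do.call(what = rbind, args = lapply(rows, pad_row))
print(padded)
```

On the real page you would run lengths() on the scraped rows first, see which lengths actually occur, and adjust the branches to match.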

Author: vicror84 (阿汘) 2016-06-07 22:47:00

Thank you~~ This language is really hard!!
