[Question] Scraping a web page david31408 PTT (批踢踢实业坊)

[Question] Scraping a web page

Original poster: david31408 (Hope) 2016-08-12 18:05:15

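[Software familiarity]:
Please delete whichever of the following do not apply
Beginner (have written programs in other languages, just unfamiliar with the syntax)
[Problem description]:
Please briefly describe what you want to do, or the purpose of this program
Hi everyone, I'm new to R, so I've been practicing lately.
I wanted to try using the XML package to scrape data from baseballreference.
Since I'm very much a novice, I started by just trying things at random; the code and messages are below.
Is it possible that not every web page can be scraped with XML?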
> library("XML", lib.loc="~/R/win-library/3.2")
> url <- "http://www.baseball-reference.com/leaders/H_career.shtml"
> Hits <- readHTMLTable(url)
Error in UseMethod("xpathApply") :
no applicable method for 'xpathApply' applied to an object of class "NULL"
In the case above, I don't know why this error message appears,
but my guess is that the page itself is not a table.
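Editor's note: one way to test the "page is not a table" guess is to parse the HTML first and count its <table> elements before calling readHTMLTable(). The sketch below is only an illustration under stated assumptions: the HTML string is a made-up stand-in, not the real baseball-reference markup, and the player names and numbers are placeholders. A live page could be downloaded to text first and passed in the same way.

```r
library(XML)

# Hypothetical stand-in for a downloaded page (NOT the real site markup).
html <- '<html><body>
<table id="leaders">
<tr><th>Rank</th><th>Player</th><th>H</th></tr>
<tr><td>1</td><td>Player A</td><td>300</td></tr>
<tr><td>2</td><td>Player B</td><td>250</td></tr>
</table>
</body></html>'

# Parse the text with the HTML parser (asText = TRUE: input is a string, not a file or URL).
doc <- htmlParse(html, asText = TRUE)

# Count the <table> elements before trying to read them.
n_tables <- length(getNodeSet(doc, "//table"))

# readHTMLTable() on a parsed document returns a list with one entry per table.
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)
leaders <- tables[[1]]

print(n_tables)
print(leaders)
```

If n_tables is 0, the page really has no tables to read. If the document itself is NULL or empty, the problem is in fetching or parsing rather than in the table structure.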
After that I tried a second approach:
> url <- "http://www.baseball-reference.com/leaders/H_career.shtml"
> x <- xmlParse(url)
The error messages are as follows:
Specification mandate value for attribute itemscope
attributes construct error
Couldn't find end of Start Tag html line
Extra content at the end of the document
Error: 1: Specification mandate value for attribute itemscope
2: attributes construct error
3: Couldn't find end of Start Tag html line 1
4: Extra content at the end of the document
Could baseballreference be blocking this kind of access?
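Editor's note: the messages above are what a strict XML parser typically reports when it is given HTML5. xmlParse() expects well-formed XML, where every attribute must have a value, while HTML allows valueless boolean attributes such as itemscope. The XML package also provides htmlParse(), which uses the tolerant HTML parser. The sketch below is a minimal illustration, assuming a made-up HTML fragment rather than the actual page; it does not show whether the site blocks scrapers.

```r
library(XML)

# Hypothetical HTML5 fragment; "itemscope" is a boolean attribute with no value.
html <- '<!DOCTYPE html>
<html itemscope itemtype="http://schema.org/WebPage">
<head><title>Demo</title></head>
<body><p>Career hits leaders</p></body>
</html>'

# Strict XML parsing rejects the valueless attribute, so capture the error.
xml_result <- tryCatch(xmlParse(html, asText = TRUE), error = function(e) e)
xml_failed <- inherits(xml_result, "error")

# The HTML parser accepts the same text.
doc <- htmlParse(html, asText = TRUE)
title <- xpathSApply(doc, "//title", xmlValue)

print(xml_failed)
print(title)
```

The resulting doc can then be queried with XPath or handed to readHTMLTable().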
Thanks, everyone, for the help :)
[Keywords]:
MLB, XML

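Author: andrew43 (讨厌有好心推文后删文者) 2016-08-12 20:26:00

Please search the board's earlier posts first. Also, "trying things at random" like this is not a good way to learn. Read the documentation and other people's earlier examples more.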

Original poster: david31408 (Hope) 2016-08-12 20:33:00

Thanks. Does this count as a web crawler?

Author: celestialgod (天) 2016-08-12 22:20:00

Yes, it is a web crawler.

Original poster: david31408 (Hope) 2016-08-12 23:43:00

Got it!! Thanks :)
