[Problem type]:
Programming question (I want to do something in R, but I don't know how to write it)
[Familiarity with the software]:
Beginner (I have written programs in other languages; I am just unfamiliar with the syntax)
[Problem description]:
Using R with rhdfs 1.0.8 from RHadoop
[Code example]:
In my current experimental environment, I need to read very large CSV files stored on Hadoop HDFS (almost all of them are over 20 GB). I am using the rhdfs R package from RHadoop.
Ref.
https://github.com/RevolutionAnalytics/RHadoop/wiki
I develop in the RStudio web version; the source code is as follows:
*************************************************************************************************
# Point the R session at the local Hadoop installation
Sys.setenv(HADOOP_CMD="/usr/lib/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar")
Sys.setenv(HADOOP_COMMON_LIB_NATIVE_DIR="/usr/lib/hadoop/lib/native/")
library(rmr2)
library(rhdfs)
library(lubridate)
hdfs.init()
# Open the file on HDFS for reading with a 100 MB buffer
f = hdfs.file("/bigdata/rawdata/201312.csv", "r", buffersize = 104857600)
m = hdfs.read(f)                              # read from the connection (raw bytes)
c = rawToChar(m)                              # convert raw bytes to a character string
data = read.table(textConnection(c), sep = ",")
*************************************************************************************************
After the read finished, I found it had only read in the first 1,500 or so records, when the file should contain over 100 million.
*************************************************************************************************
I Googled a bit and found the following direction for a solution:
"rhdfs uses the java api for reading files stored in hdfs.
That api will not necessarily read the entire file in one shot.
It will return some number of bytes for each read.
When it reaches the end of the file it returns -1.
In the case of rhdfs, an end of file will return NULL.
So, you need to loop on the hdfs.read call until NULL is returned"
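Based on that quoted description, I guess the read loop would look something like the sketch below (untested; it assumes the same environment setup and hdfs.init() as in the code above, and that hdfs.read() returns one raw chunk per call and NULL at end of file):

```r
# Sketch only: keep calling hdfs.read() until it returns NULL,
# collect the raw chunks, then parse the whole buffer as CSV.
f = hdfs.file("/bigdata/rawdata/201312.csv", "r", buffersize = 104857600)
chunks = list()
repeat {
  m = hdfs.read(f)                 # raw vector for this chunk, or NULL at EOF
  if (is.null(m)) break
  chunks[[length(chunks) + 1]] = m
}
hdfs.close(f)
data = read.table(textConnection(rawToChar(do.call(c, chunks))),
                  sep = ",")
```

Even if this works, the whole 20 GB file would have to fit in memory after the loop, so processing each chunk as it arrives (or pushing the work into an rmr2 mapreduce job) might be more practical.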
However, looking at the rhdfs manual, this behavior of hdfs.read() is not described in any detail :<
Does anyone have experience with this?
[Keywords]:
R, Large Scale Data Set, Big Data, Hadoop, RHadoop, CSV, HDFS, rhdfs
Thanks in advance!