Re: [Question] Splitting raw data into fields with data.table

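Original poster: celestialgod (天)   2015-03-02 10:44:06
I changed the output to a CharacterMatrix,
which saves the time previously spent on rbind.
PS: I was not very familiar with CharacterMatrix before, which is why I did not use it earlier.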
library(data.table)
library(magrittr)
library(Rcpp)
sourceCpp(code = '
#include <Rcpp.h>
#include <string>
#include <vector>
using namespace Rcpp;
// loc holds 0-based cut positions; field i covers [loc[i], loc[i+1]).
// [[Rcpp::export]]
CharacterMatrix dat_split_f(std::vector<std::string> strings, NumericVector loc) {
  // n cut positions delimit n - 1 fields, so the matrix needs n - 1 columns
  int num_fields = loc.size() - 1, num_strings = strings.size();
  CharacterMatrix output(num_strings, num_fields);
  for (int j = 0; j < num_strings; j++)
  {
    for (int i = 0; i < num_fields; i++)
      output(j, i) = strings[j].substr(loc[i], loc[i+1] - loc[i]);
  }
  return output;
}')
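To make the cut positions easier to read, here is a rough base-R sketch of the same fixed-width split, without Rcpp. It is only an illustration: `split_fixed` is a made-up helper name, not part of the code above, and it will be slower than the Rcpp version on large inputs. The one thing to watch is that `loc` is 0-based (C++ style) while `substring()` is 1-based.

```r
# Hypothetical helper, for illustration only (not from the original post).
split_fixed <- function(strings, loc) {
  starts <- loc[-length(loc)] + 1  # 0-based offsets -> 1-based start positions
  ends   <- loc[-1]                # each field ends where the next one begins
  # one vectorised substring() call per field, one column per field
  out <- vapply(seq_along(starts),
                function(k) substring(strings, starts[k], ends[k]),
                character(length(strings)))
  matrix(out, nrow = length(strings))
}

x <- c("001female2019920404", "002male  3019920505")
res <- split_fixed(x, c(0, 3, 9, 11, 19))
print(res)
```

With c(0, 3, 9, 11, 19) the fields are characters 1-3 (id), 4-9 (gender, padded to six characters), 10-11 (age) and 12-19 (birthday).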
# Test with the 400,000-row case. If you reuse the data below, keep the string literal on a single line.
dat = fread(paste0(rep("001female2019920404\n002male  3019920505\n003male  4019920606\n004female5019920707\n", 100000), collapse=""), sep="\n", sep2="", header=FALSE)
tt = proc.time()
dat_split = dat_split_f(dat[[1]], c(0, 3, 9, 11, 19)) %>% data.table()
dat_split[, ':='(V1 = as.numeric(V1), V3 = as.numeric(V3))]
proc.time() - tt
# user system elapsed
# 0.97 0.06 1.03
# Test with the 4,000,000-row case. If you reuse the data below, keep the string literal on a single line.
dat = fread(paste0(rep("001female2019920404\n002male  3019920505\n003male  4019920606\n004female5019920707\n", 1000000), collapse=""), sep="\n", sep2="", header=FALSE)
tt = proc.time()
dat_split = dat_split_f(dat[[1]], c(0, 3, 9, 11, 19)) %>% data.table()
dat_split[, ':='(V1 = as.numeric(V1), V3 = as.numeric(V3))]
proc.time() - tt
# user system elapsed
# 7.73 0.21 7.98
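# Ten times the data takes only about eight times as long, which is a much better result.
Environment: Windows 64-bit, R 3.1.2, i7-3770K @ 4.4GHz
I also ran the 400,000-row case on a B830 @ 1.8GHz CPU: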
# user system elapsed
# 1.14 0.05 1.19
For comparison, here is the regular expression approach:
library(data.table)
library(plyr)
library(dplyr)
library(magrittr)
dat = fread(paste0(rep("001female2019920404\n002male  3019920505\n003male  4019920606\n004female5019920707\n", 100000), collapse=""), sep="\n", sep2="", header=FALSE)
tt = proc.time()
dat_regex = dat %>% select(V1) %>% extract2(1) %>%
regexec("([0-9]{3})(female|male\\s{2})([0-9]{2})([0-9]{8})", text = .)
dat_split = dat %>% select(V1) %>% extract2(1) %>%
regmatches(dat_regex) %>% do.call(rbind, .) %>% data.table() %>%
select(2:ncol(.)) %>% setnames(c("id", "gender", "age", "birthday"))
proc.time() - tt
# user system elapsed
# 14.99 0.34 23.76
(timed on the B830 @ 1.8GHz)
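A small sketch of what the regexec/regmatches pair returns may help explain why the pipeline above drops the first column. It uses two sample rows and plain base R instead of the dplyr/magrittr chain; the variable names `m`, `mat` and `fields` are only for illustration.

```r
pattern <- "([0-9]{3})(female|male\\s{2})([0-9]{2})([0-9]{8})"
x <- c("001female2019920404", "002male  3019920505")

# regmatches() returns one character vector per input string:
# the full match first, then one element per capture group.
m <- regmatches(x, regexec(pattern, x))
mat <- do.call(rbind, m)          # rows = strings, columns = full match + 4 groups

# Drop column 1 (the full match) and name the remaining fields.
fields <- mat[, -1, drop = FALSE]
colnames(fields) <- c("id", "gender", "age", "birthday")
print(fields)
```

Building a list of per-row vectors and then binding them with rbind is probably a large part of why this route is slower than writing directly into a CharacterMatrix, though that has not been profiled here.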
Author: sacidoO (阿骂)   2015-03-03 02:19:00
I tried it and it worked. Thanks for sharing, C; your code really is super fast. Upvoted~~
