Re: [Question] About repeated-measures data

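Poster: celestialgod (天)   2015-03-02 14:27:34
※ Quoting yummy7922 (crucify):
: [Problem description]:
: My data are repeated measurements: multiple measurements for each of 13820 patients,
: but the number of observations is not the same for every patient.
: For each patient, I want to compute one mean for every three records,
: and a final group with fewer than three records should also get its own mean.
: I don't know how to do this, so I'd like to ask the experts here. Thanks.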
library(plyr) # only used in data generation
library(dplyr)
library(data.table)
library(magrittr)
# data generation
n = 13820 # the question states 13820 patients
# each patient gets between 2 and 15 observations
dat = data.table(id = 1:n, len = sample(2:15, n, replace = TRUE)) %>%
  mdply(function(id, len) data.table(id = rep(id, len), values = rnorm(len)))
# mean
k = 3
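dat = select(dat, c(id, values)) # optional: this only drops the len column
start_time = Sys.time()
# newgroup numbers each run of k consecutive rows within a patient
# (1,1,1,2,2,2,...); the last run may be shorter than k
result = dat %>% group_by(id) %>%
  mutate(newgroup = rep(seq_along(values), each = k,
                        length.out = length(values))) %>%
  group_by(id, newgroup) %>%
  summarise(value = mean(values)) # named value so result$value works below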
Sys.time() - start_time
# Time difference of 0.558032 secs
# Other method
k = 3
dat = select(dat, c(id, values)) # optional: this only undoes the earlier changes
# library(reshape2) # run this line if you are not using data.table
start_time = Sys.time()
# assumes the rows of dat are already sorted by id
dat$newgroup = unlist(tapply(dat$values, dat$id, function(x){
  rep(seq_along(x), each = k, length.out = length(x))
}))
result2 = melt(tapply(dat$values, list(dat$id, dat$newgroup), mean))
result2 = result2[!is.na(result2$value),]
Sys.time() - start_time
# Time difference of 1.556089 secs
# sort by patient, then by group, to match the row order of result
result2 = result2[order(result2$Var1, result2$Var2),]
all.equal(result$value, result2$value)
# TRUE
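
A small worked example may make the grouping step easier to follow. The sketch below uses base R only, on made-up toy data (the names toy and toy_means are not from the post). It builds the group index with ceiling(seq_along(x) / k), which should give the same index as the rep(..., each = k) call above.

```r
# Made-up toy data: patient 1 has 7 measurements, patient 2 has 2
toy <- data.frame(id = c(rep(1, 7), rep(2, 2)),
                  values = c(1, 2, 3, 4, 5, 6, 7, 10, 20))
k <- 3

# Group index within each patient: 1,1,1,2,2,2,3 for id 1 and 1,1 for id 2
toy$newgroup <- ave(toy$values, toy$id,
                    FUN = function(x) ceiling(seq_along(x) / k))

# Mean of each (id, newgroup) group, sorted by patient and then by group
toy_means <- aggregate(values ~ id + newgroup, data = toy, FUN = mean)
toy_means <- toy_means[order(toy_means$id, toy_means$newgroup), ]
print(toy_means)
# means in row order: 2, 5, 7 (patient 1) and 15 (patient 2)
```

Patient 1 ends with a group of one record (7) and patient 2 has a single group of two records, so both leftover cases from the question are covered.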
Added, following the data format provided by board member aaron77217:
library(dplyr)
library(data.table)
library(magrittr)
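# simulate N_patient patients, each with 1..max_obs_time visits and n_vars measurements
dat_gen_f = function(N_patient, max_obs_time, n_vars){
  dat = sample(max_obs_time, N_patient, replace = TRUE) %>% {
    # lapply (not sapply) so the result is always a list before unlist()
    cbind(rep(1:N_patient, times = .), lapply(., seq, from = 1) %>% unlist())
  } %>% cbind(matrix(rnorm(nrow(.)*n_vars), , n_vars)) %>% data.table()
  setnames(dat, c("id", "obs_times", paste0("V", 1:n_vars)))
}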
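mean_dat_f = function(dat, k){
  result = dat %>% group_by(id) %>%
    # newgroup: index of each run of k visits; n_combine: number of visits in that run
    mutate(newgroup = rep(1:ceiling(length(obs_times)/k), each = k,
                          length.out = length(obs_times)),
           n_combine = (length(obs_times) %/% k) %>% {c(rep(k, . * k),
             rep(length(obs_times) - . * k, length(obs_times) - . * k))}) %>%
    # label each run by its first and last visit number, e.g. "1-3" (uses k, not a fixed 3)
    ungroup() %>% mutate(times_combine = paste((newgroup-1)*k+1,
      (newgroup-1)*k + n_combine, sep="-"))
  # keep id, the measurement columns and times_combine, then average per run;
  # as.data.table() makes sure the [ call below uses data.table's by=
  result = result %>% select(match(c(names(dat)[names(dat)!="obs_times"],
    "times_combine"), names(result))) %>% as.data.table() %>%
    extract(, lapply(.SD, mean), by = "id,times_combine")
  result
}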
start_time = Sys.time()
dat = dat_gen_f(30000, 20, 15)
Sys.time() - start_time
# Time difference of 1.503086 secs
start_time = Sys.time()
result = mean_dat_f(dat, 3)
Sys.time() - start_time
# Time difference of 4.236243 secs
start_time = Sys.time()
dat = dat_gen_f(13820, 15, 1)
Sys.time() - start_time
# Time difference of 0.4750271 secs
start_time = Sys.time()
result = mean_dat_f(dat, 3)
Sys.time() - start_time
# Time difference of 1.848106 secs
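
For comparison, the same idea can be sketched with data.table alone. This is not the code from the post, just an illustration on made-up data: toy, toy_result and the use of the min/max of obs_times for the label are assumptions, and the label only matches the post's times_combine when obs_times runs 1, 2, 3, ... within each patient.

```r
library(data.table)

# Made-up toy data: patient 1 has 4 visits, patient 2 has 2
toy <- data.table(id = c(1, 1, 1, 1, 2, 2),
                  obs_times = c(1, 2, 3, 4, 1, 2),
                  V1 = c(1, 2, 3, 10, 4, 6))
k <- 3

# Group index within each patient: rows 1..k get 1, rows k+1..2k get 2, ...
toy[, newgroup := (seq_len(.N) - 1L) %/% k + 1L, by = id]

# Label each group by its first and last observation number, e.g. "1-3"
toy[, times_combine := paste(min(obs_times), max(obs_times), sep = "-"),
    by = .(id, newgroup)]

# Average the measurement column(s) within each group
toy_result <- toy[, lapply(.SD, mean), by = .(id, times_combine),
                  .SDcols = "V1"]
print(toy_result)
# rows: (1, "1-3", 2), (1, "4-4", 10), (2, "1-2", 5)
```

With more measurement columns, .SDcols would list all of them (for example paste0("V", 1:n_vars) for data shaped like the output of dat_gen_f).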
Author: yummy7922 (crucify)   2015-03-07 13:47:00
Thanks.
