Re: [Question] Cumulative sums with groupby()

Author: ccwang002 (亮)   2015-03-26 00:05:39
(Original post snipped.) I do this kind of data processing with pandas, because it's convenient.
import pandas as pd
The data has three fields: zipcode date revenue
mock = [[106, 20150101, 100],
        [106, 20150101, 200],
        [106, 20150201, 300],
        [106, 20150201, 400],
        [220, 20150201, 200],
        [220, 20150201, 300],
        [220, 20150301, 400],
        [220, 20150301, 500]]
# Load the data
df = pd.DataFrame(mock, columns=['zipcode', 'date', 'revenue'])
# If needed, date can be converted to datetime values
# df.date = pd.to_datetime(df.date, format='%Y%m%d')
# Run the groupby you want
merged_df = df.groupby(['zipcode', 'date']).sum()
# Convert the DataFrame to a list of tuples (which can then be turned into a list of lists)
list(merged_df.to_records())
Output:
[(106, 20150101, 300),
(106, 20150201, 700),
(220, 20150201, 500),
(220, 20150301, 900)]
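In case the list-of-lists form mentioned above is what you actually need, here is a minimal sketch. It assumes the same mock data; as_index=False and the name rows are my own choices for illustration, not something from the original question.

```python
import pandas as pd

mock = [[106, 20150101, 100],
        [106, 20150101, 200],
        [106, 20150201, 300],
        [106, 20150201, 400],
        [220, 20150201, 200],
        [220, 20150201, 300],
        [220, 20150301, 400],
        [220, 20150301, 500]]

df = pd.DataFrame(mock, columns=['zipcode', 'date', 'revenue'])
# as_index=False keeps zipcode and date as ordinary columns instead of an index
merged_df = df.groupby(['zipcode', 'date'], as_index=False).sum()
# .values.tolist() gives nested Python lists, one inner list per row
rows = merged_df.values.tolist()
print(rows)
```

Keeping the group keys as columns means no reset_index() is needed before converting.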
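EDIT: Sorry, I missed the second part earlier, and I'm not sure I fully understood what you meant.
Do you want each area's monthly revenue accumulated month by month?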
# Turn it back into an ordinary DataFrame
df = merged_df.reset_index()
# Make sure rows are sorted by zipcode ascending, then date oldest to newest
df.sort_values(['zipcode', 'date'], inplace=True)
# Group by zipcode and take the running total of revenue, stored in a new cumsum column
grouped = df.groupby('zipcode', sort=False)
df['cumsum'] = grouped['revenue'].cumsum()
# Output
list(df.to_records(index=False))
Output:
[(106, 20150101, 300, 300),
(106, 20150201, 700, 1000),
(220, 20150201, 500, 500),
(220, 20150301, 900, 1400)]
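If you would rather have both steps in one place, with date parsed as real dates, something like the following might work. This is only a sketch under my reading of the question; monthly_cumulative_revenue is a hypothetical helper name, and casting date to str before parsing is an assumption about how the raw values arrive.

```python
import pandas as pd

def monthly_cumulative_revenue(rows):
    """Hypothetical helper: per-zipcode revenue per date, plus a running total."""
    df = pd.DataFrame(rows, columns=['zipcode', 'date', 'revenue'])
    # Parse YYYYMMDD integers as dates; cast to str first so the format applies
    df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')
    # Sum revenue for each (zipcode, date) pair, keeping the keys as columns
    monthly = (df.groupby(['zipcode', 'date'], as_index=False)['revenue']
                 .sum()
                 .sort_values(['zipcode', 'date']))
    # Running total of revenue within each zipcode, in date order
    monthly['cumsum'] = monthly.groupby('zipcode')['revenue'].cumsum()
    return monthly

mock = [[106, 20150101, 100],
        [106, 20150101, 200],
        [106, 20150201, 300],
        [106, 20150201, 400],
        [220, 20150201, 200],
        [220, 20150201, 300],
        [220, 20150301, 400],
        [220, 20150301, 500]]

result = monthly_cumulative_revenue(mock)
print(result)
```

Selecting ['revenue'] before cumsum() keeps the running total limited to the revenue column, so the other columns are left alone.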
Hope this helps~
