python - Using df.apply() with a Pandas MuliIndex / carrying out operations on hierarchical index rows? -
question
efficient way carry out numerical operations hierarchical index rows?
problem
i have large dataframe, on 1gb, indexed year , country code. small subset shown below. each country has multiple observations per year. i'd take average of each country's observations in year , return overall average. desired end result dataframe indexed year each countries yearly average.
conceptually, i'd like:
df.ix[:,['x3yv_e', 'x3yv_d', 'x1yv_e', 'x1yv_d']].groupby(df.year).groupby(level=1).apply(lambda x: np.mean(x))
heres dataset:
x3yv_e x3yv_d x1yv_e x1yv_d year 2003 12 0.000000 0.000000 0.000000 0.000000 34 0.009953 0.001400 0.007823 0.000950 12 0.010210 0.001136 0.008333 0.000722 34 0.011143 0.006319 0.007520 0.006732 72 0.018791 0.016717 0.018808 0.015206 2004 0 0.009115 0.000000 0.010243 0.000000 38 0.009059 0.000932 0.010042 0.000573 53 0.009618 0.001152 0.010724 0.000729 70 0.000000 0.000000 0.000000 0.000000 70 0.020655 0.018411 0.012965 0.011640
what i've tried
benefits of panda's multiindex?
how apply condition on level of pandas.multiindex?
because of large size of dataframe, i'm looking avoid loops , copying dataframe multiple times solutions 2 questions above suggest.
any ideas on efficient solution? taking look!
create data
in [12]: df = dataframe(randn(10,4),columns=list('abcd')) in [13]: df['year'] = 2003 in [14]: df['id'] = [12,34,12,34,72,0,38,53,70,70] in [16]: df.loc[:5,'year'] = 2004 in [17]: df out[17]: b c d year id 0 -1.917262 0.228599 -0.463695 0.776567 2004 12 1 2.064658 -0.716104 -1.399685 0.402077 2004 34 2 -1.282627 0.338368 0.757658 -0.114086 2004 12 3 1.190319 -1.592282 0.942431 -0.778128 2004 34 4 1.928094 0.532387 -0.352155 -0.039304 2004 72 5 0.535093 -1.655569 -0.309651 0.438992 2004 0 6 0.332428 -0.427696 -1.324072 2.158907 2003 38 7 -1.343306 -0.288373 0.544344 -1.361189 2003 53 8 0.959273 -0.420134 0.691108 -0.469833 2003 70 9 0.692352 0.101226 -0.161140 -0.100968 2003 70
groupby year , id, mean
in [21]: df.groupby(['year','id']).mean() out[21]: b c d year id 2003 38 0.332428 -0.427696 -1.324072 2.158907 53 -1.343306 -0.288373 0.544344 -1.361189 70 0.825812 -0.159454 0.264984 -0.285401 2004 0 0.535093 -1.655569 -0.309651 0.438992 12 -1.599944 0.283483 0.146981 0.331241 34 1.627488 -1.154193 -0.228627 -0.188025 72 1.928094 0.532387 -0.352155 -0.039304
by year mean
in [24]: df.groupby(['year']).mean() out[24]: b c d id year 2003 0.160187 -0.258744 -0.062440 0.056729 57.750000 2004 0.419713 -0.477434 -0.137516 0.114353 27.333333
by id
in [25]: df.groupby(['id']).mean() out[25]: b c d year id 0 0.535093 -1.655569 -0.309651 0.438992 2004 12 -1.599944 0.283483 0.146981 0.331241 2004 34 1.627488 -1.154193 -0.228627 -0.188025 2004 38 0.332428 -0.427696 -1.324072 2.158907 2003 53 -1.343306 -0.288373 0.544344 -1.361189 2003 70 0.825812 -0.159454 0.264984 -0.285401 2003 72 1.928094 0.532387 -0.352155 -0.039304 2004
Comments
Post a Comment