python - Using df.apply() with a Pandas MuliIndex / carrying out operations on hierarchical index rows? -


question

efficient way carry out numerical operations hierarchical index rows?

problem

i have large dataframe, on 1gb, indexed year , country code. small subset shown below. each country has multiple observations per year. i'd take average of each country's observations in year , return overall average. desired end result dataframe indexed year each countries yearly average.

conceptually, i'd like:

df.ix[:,['x3yv_e', 'x3yv_d', 'x1yv_e', 'x1yv_d']].groupby(df.year).groupby(level=1).apply(lambda x: np.mean(x)) 

heres dataset:

                  x3yv_e      x3yv_d      x1yv_e      x1yv_d year                     2003        12   0.000000    0.000000    0.000000    0.000000             34   0.009953    0.001400    0.007823    0.000950             12   0.010210    0.001136    0.008333    0.000722             34   0.011143    0.006319    0.007520    0.006732             72   0.018791    0.016717    0.018808    0.015206 2004        0    0.009115    0.000000    0.010243    0.000000             38   0.009059    0.000932    0.010042    0.000573             53   0.009618    0.001152    0.010724    0.000729             70   0.000000    0.000000    0.000000    0.000000             70   0.020655    0.018411    0.012965    0.011640 

what i've tried

benefits of panda's multiindex?

how apply condition on level of pandas.multiindex?

because of large size of dataframe, i'm looking avoid loops , copying dataframe multiple times solutions 2 questions above suggest.

any ideas on efficient solution? taking look!

create data

in [12]: df = dataframe(randn(10,4),columns=list('abcd'))  in [13]: df['year'] = 2003  in [14]: df['id'] = [12,34,12,34,72,0,38,53,70,70] in [16]: df.loc[:5,'year'] = 2004  in [17]: df out[17]:                    b         c         d  year  id 0 -1.917262  0.228599 -0.463695  0.776567  2004  12 1  2.064658 -0.716104 -1.399685  0.402077  2004  34 2 -1.282627  0.338368  0.757658 -0.114086  2004  12 3  1.190319 -1.592282  0.942431 -0.778128  2004  34 4  1.928094  0.532387 -0.352155 -0.039304  2004  72 5  0.535093 -1.655569 -0.309651  0.438992  2004   0 6  0.332428 -0.427696 -1.324072  2.158907  2003  38 7 -1.343306 -0.288373  0.544344 -1.361189  2003  53 8  0.959273 -0.420134  0.691108 -0.469833  2003  70 9  0.692352  0.101226 -0.161140 -0.100968  2003  70 

groupby year , id, mean

in [21]: df.groupby(['year','id']).mean() out[21]:                          b         c         d year id                                         2003 38  0.332428 -0.427696 -1.324072  2.158907      53 -1.343306 -0.288373  0.544344 -1.361189      70  0.825812 -0.159454  0.264984 -0.285401 2004 0   0.535093 -1.655569 -0.309651  0.438992      12 -1.599944  0.283483  0.146981  0.331241      34  1.627488 -1.154193 -0.228627 -0.188025      72  1.928094  0.532387 -0.352155 -0.039304 

by year mean

in [24]: df.groupby(['year']).mean() out[24]:                       b         c         d         id year                                                    2003  0.160187 -0.258744 -0.062440  0.056729  57.750000 2004  0.419713 -0.477434 -0.137516  0.114353  27.333333 

by id

in [25]: df.groupby(['id']).mean() out[25]:                     b         c         d  year id                                               0   0.535093 -1.655569 -0.309651  0.438992  2004 12 -1.599944  0.283483  0.146981  0.331241  2004 34  1.627488 -1.154193 -0.228627 -0.188025  2004 38  0.332428 -0.427696 -1.324072  2.158907  2003 53 -1.343306 -0.288373  0.544344 -1.361189  2003 70  0.825812 -0.159454  0.264984 -0.285401  2003 72  1.928094  0.532387 -0.352155 -0.039304  2004 

Comments