import pandas as pd
import numpy as np
import scipy.stats
I've recently been working with some data that requires multi-indexing for manipulation and have had more than enough head-scratching moments. So here is my write-up of lessons learned that can be returned to anytime so I don't have to learn all these lessons again.
First I'll generate a simple DataFrame with a two-layered multi-index. Then I'll perform some grouping, joining, and reindexing manipulations that I've needed during my project.
arrays = [np.array(['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D']),
np.array(['1', '2', '3', '1', '2', '1', '2', '3', '1'])]
df = pd.DataFrame([0,1,1,1,1,2,4,2,0], index=arrays, columns=['orig']); df
Mean¶
Something I needed to do a fair amount was group the data by it's top level index and generate the mean for the group. I then had to fiddle around to join this top-level grouping back to my original DataFrame. Turns out none of this was necessary and can be achieved in one quick swoop like so:
df['mean'] = df['orig'].mean(level=0).reindex(df.index, level=0); df
Sum¶
sum
also takes a level
argument and performs the grouping without need for a groupby
.
df['sum'] = df['orig'].sum(level=0).reindex(df.index, level=0); df
Mode¶
Pandas doesn't have a builtin operation to get the mode of groups as far as I'm aware, but it can be achieved using the agg
function and the method from scipy. The operation is pretty slow and not without issues (e.g. cases where there are multiple modes) but for my use case it worked fine.
The function agg
doesn't accept an argument about the grouping level like the previous two examples. From this point on we need to perform the grouping first before applying any aggregation calculation.
df['mode'] = df['orig'].groupby(level=0).agg(lambda x: scipy.stats.mode(x)[0][0]).reindex(df.index, level=0); df
NUnique¶
Here's an example to get the number of unique values in a grouping and join it back onto the original multi-indexed DataFrame:
df['nunique'] = df['orig'].groupby(level=0).apply(lambda x: x.nunique()).reindex(df.index, level=0); df
Size¶
Now the size or "count" of instances in each grouping:
df['size'] = df.groupby(level=0).size().reindex(df.index, level=0); df
First¶
df['first'] = df['orig'].groupby(level=0).first().reindex(df.index, level=0); df
Align¶
df2 = pd.Series(np.random.randn(4), index=['A','B','C','D'], name='align'); df2
a1, a2 = df.align(df2, level=0, axis=0); a2
a1
is the original DataFrame (df
) and a2
is the Series which has been reindexed to match the original DataFrame.
You could also use reindex
again like the following:
df2.reindex(df.index, level=0)
Update or FillNA¶
I had a case during data manipulation where one column was missing data. I was able to fill in missing data by performing a calculation on data from another column. However, in some cases this other column was missing data where the calculatable column was not. This meant that if I performed the calculation, then replaced all values, I would lose some data.
arrays = [np.array(['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D']),
np.array(['1', '2', '3', '1', '2', '1', '2', '3', '1'])]
df = pd.DataFrame([0,None,1,1,1,2,4,None,None], index=arrays, columns=['orig']); df
df['sum'] = pd.Series([4,None,None,3], index=['A','B','C','D']).reindex(df.index, level=0); df
Imagine the missing value in A
was 3, or that there were other rows missing from this data that meant the value for sum
is really correct.
Here we want to fill in the values for B
and C
without overwriting the values for A
or D
as these cannot be recalculated using the data we are given.
replacement = df[~df['orig'].isnull()]['orig'].sum(level=0).reindex(df.index, level=0); replacement
fillna
will only update values in the original series where they are already None
- it won't overwrite existing values. This way we can keep the value for A
and D
and only update the missing cases that we need.
df['sum'] = df['sum'].fillna(replacement); df
Now instead imagine that we know the value for A
is wrong and we want to overwrite it using our new calculation. Since the original data fro D
is missing we're going to rely on the fact that it's correct and don't want to overwrite that. So we want to update values for A
, B
, and C
. Using update will overwrite the existing values where our new series is not None
. This is done "inplace".
df['sum'].update(replacement); df
These fillna
and update
functions aren't anything special to MultiIndexed DataFrames, this demonstration to myself is more to show the full workflow once the recalculated Series has been properly indexed to align with what we're filling or updating. None of these calculations will work if the new Series is only indexed at the higher level, and sometimes the code doesn't even throw errors so you won't know that it's failed.