Come with me now on a journey through code and data...

Lessons in Pandas Multi Indexing

In [1]:
import pandas as pd
import numpy as np
import scipy.stats

I've recently been working with some data that requires multi-indexing for manipulation and have had more than enough head-scratching moments. So here is my write-up of lessons learned that can be returned to anytime so I don't have to learn all these lessons again.

First I'll generate a simple DataFrame with a two-layered multi-index. Then I'll perform some grouping, joining, and reindexing manipulations that I've needed during my project.

In [2]:
arrays = [np.array(['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D']),
          np.array(['1', '2', '3', '1', '2', '1', '2', '3', '1'])]

df = pd.DataFrame([0,1,1,1,1,2,4,2,0], index=arrays, columns=['orig']); df
Out[2]:
orig
A 1 0
2 1
3 1
B 1 1
2 1
C 1 2
2 4
3 2
D 1 0

Mean

Something I needed to do a fair amount was group the data by it's top level index and generate the mean for the group. I then had to fiddle around to join this top-level grouping back to my original DataFrame. Turns out none of this was necessary and can be achieved in one quick swoop like so:

In [3]:
df['mean'] = df['orig'].mean(level=0).reindex(df.index, level=0); df
Out[3]:
orig mean
A 1 0 0.666667
2 1 0.666667
3 1 0.666667
B 1 1 1.000000
2 1 1.000000
C 1 2 2.666667
2 4 2.666667
3 2 2.666667
D 1 0 0.000000

Sum

sum also takes a level argument and performs the grouping without need for a groupby.

In [4]:
df['sum'] = df['orig'].sum(level=0).reindex(df.index, level=0); df
Out[4]:
orig mean sum
A 1 0 0.666667 2
2 1 0.666667 2
3 1 0.666667 2
B 1 1 1.000000 2
2 1 1.000000 2
C 1 2 2.666667 8
2 4 2.666667 8
3 2 2.666667 8
D 1 0 0.000000 0

Mode

Pandas doesn't have a builtin operation to get the mode of groups as far as I'm aware, but it can be achieved using the agg function and the method from scipy. The operation is pretty slow and not without issues (e.g. cases where there are multiple modes) but for my use case it worked fine.

The function agg doesn't accept an argument about the grouping level like the previous two examples. From this point on we need to perform the grouping first before applying any aggregation calculation.

In [5]:
df['mode'] = df['orig'].groupby(level=0).agg(lambda x: scipy.stats.mode(x)[0][0]).reindex(df.index, level=0); df
Out[5]:
orig mean sum mode
A 1 0 0.666667 2 1
2 1 0.666667 2 1
3 1 0.666667 2 1
B 1 1 1.000000 2 1
2 1 1.000000 2 1
C 1 2 2.666667 8 2
2 4 2.666667 8 2
3 2 2.666667 8 2
D 1 0 0.000000 0 0

NUnique

Here's an example to get the number of unique values in a grouping and join it back onto the original multi-indexed DataFrame:

In [6]:
df['nunique'] = df['orig'].groupby(level=0).apply(lambda x: x.nunique()).reindex(df.index, level=0); df
Out[6]:
orig mean sum mode nunique
A 1 0 0.666667 2 1 2
2 1 0.666667 2 1 2
3 1 0.666667 2 1 2
B 1 1 1.000000 2 1 1
2 1 1.000000 2 1 1
C 1 2 2.666667 8 2 2
2 4 2.666667 8 2 2
3 2 2.666667 8 2 2
D 1 0 0.000000 0 0 1

Size

Now the size or "count" of instances in each grouping:

In [7]:
df['size'] = df.groupby(level=0).size().reindex(df.index, level=0); df
Out[7]:
orig mean sum mode nunique size
A 1 0 0.666667 2 1 2 3
2 1 0.666667 2 1 2 3
3 1 0.666667 2 1 2 3
B 1 1 1.000000 2 1 1 2
2 1 1.000000 2 1 1 2
C 1 2 2.666667 8 2 2 3
2 4 2.666667 8 2 2 3
3 2 2.666667 8 2 2 3
D 1 0 0.000000 0 0 1 1

First

In [8]:
df['first'] = df['orig'].groupby(level=0).first().reindex(df.index, level=0); df
Out[8]:
orig mean sum mode nunique size first
A 1 0 0.666667 2 1 2 3 0
2 1 0.666667 2 1 2 3 0
3 1 0.666667 2 1 2 3 0
B 1 1 1.000000 2 1 1 2 1
2 1 1.000000 2 1 1 2 1
C 1 2 2.666667 8 2 2 3 2
2 4 2.666667 8 2 2 3 2
3 2 2.666667 8 2 2 3 2
D 1 0 0.000000 0 0 1 1 0

Align

In [9]:
df2 = pd.Series(np.random.randn(4), index=['A','B','C','D'], name='align'); df2
Out[9]:
A   -0.614518
B    0.335981
C   -1.092953
D   -0.200379
Name: align, dtype: float64
In [10]:
a1, a2 = df.align(df2, level=0, axis=0); a2
Out[10]:
A  1   -0.614518
   2   -0.614518
   3   -0.614518
B  1    0.335981
   2    0.335981
C  1   -1.092953
   2   -1.092953
   3   -1.092953
D  1   -0.200379
Name: align, dtype: float64

a1 is the original DataFrame (df) and a2 is the Series which has been reindexed to match the original DataFrame.

You could also use reindex again like the following:

In [11]:
df2.reindex(df.index, level=0)
Out[11]:
A  1   -0.614518
   2   -0.614518
   3   -0.614518
B  1    0.335981
   2    0.335981
C  1   -1.092953
   2   -1.092953
   3   -1.092953
D  1   -0.200379
Name: align, dtype: float64

Update or FillNA

I had a case during data manipulation where one column was missing data. I was able to fill in missing data by performing a calculation on data from another column. However, in some cases this other column was missing data where the calculatable column was not. This meant that if I performed the calculation, then replaced all values, I would lose some data.

In [25]:
arrays = [np.array(['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D']),
          np.array(['1', '2', '3', '1', '2', '1', '2', '3', '1'])]

df = pd.DataFrame([0,None,1,1,1,2,4,None,None], index=arrays, columns=['orig']); df
df['sum'] = pd.Series([4,None,None,3], index=['A','B','C','D']).reindex(df.index, level=0); df
Out[25]:
orig sum
A 1 0.0 4.0
2 NaN 4.0
3 1.0 4.0
B 1 1.0 NaN
2 1.0 NaN
C 1 2.0 NaN
2 4.0 NaN
3 NaN NaN
D 1 NaN 3.0

Imagine the missing value in A was 3, or that there were other rows missing from this data that meant the value for sum is really correct.

Here we want to fill in the values for B and C without overwriting the values for A or D as these cannot be recalculated using the data we are given.

In [26]:
replacement = df[~df['orig'].isnull()]['orig'].sum(level=0).reindex(df.index, level=0); replacement
Out[26]:
A  1    1.0
   2    1.0
   3    1.0
B  1    2.0
   2    2.0
C  1    6.0
   2    6.0
   3    6.0
D  1    NaN
Name: orig, dtype: float64

fillna will only update values in the original series where they are already None - it won't overwrite existing values. This way we can keep the value for A and D and only update the missing cases that we need.

In [27]:
df['sum'] = df['sum'].fillna(replacement); df
Out[27]:
orig sum
A 1 0.0 4.0
2 NaN 4.0
3 1.0 4.0
B 1 1.0 2.0
2 1.0 2.0
C 1 2.0 6.0
2 4.0 6.0
3 NaN 6.0
D 1 NaN 3.0

Now instead imagine that we know the value for A is wrong and we want to overwrite it using our new calculation. Since the original data fro D is missing we're going to rely on the fact that it's correct and don't want to overwrite that. So we want to update values for A, B, and C. Using update will overwrite the existing values where our new series is not None. This is done "inplace".

In [28]:
df['sum'].update(replacement); df
Out[28]:
orig sum
A 1 0.0 1.0
2 NaN 1.0
3 1.0 1.0
B 1 1.0 2.0
2 1.0 2.0
C 1 2.0 6.0
2 4.0 6.0
3 NaN 6.0
D 1 NaN 3.0

These fillna and update functions aren't anything special to MultiIndexed DataFrames, this demonstration to myself is more to show the full workflow once the recalculated Series has been properly indexed to align with what we're filling or updating. None of these calculations will work if the new Series is only indexed at the higher level, and sometimes the code doesn't even throw errors so you won't know that it's failed.