In [1]:

import pandas as pd
import numpy as np
import scipy.stats

I've recently been working with some data that requires multi-indexing for manipulation and have had more than enough head-scratching moments. So here is my write-up of lessons learned that can be returned to anytime so I don't have to learn all these lessons again.

First I'll generate a simple DataFrame with a two-layered multi-index. Then I'll perform some grouping, joining, and reindexing manipulations that I've needed during my project.

In [2]:

arrays = [np.array(['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D']),
          np.array(['1', '2', '3', '1', '2', '1', '2', '3', '1'])]

df = pd.DataFrame([0,1,1,1,1,2,4,2,0], index=arrays, columns=['orig']); df

Out[2]:

		orig
A	1	0
	2	1
	3	1
B	1	1
B	2	1
C	1	2
	2	4
	3	2
D	1	0

Mean¶

Something I needed to do a fair amount was group the data by it's top level index and generate the mean for the group. I then had to fiddle around to join this top-level grouping back to my original DataFrame. Turns out none of this was necessary and can be achieved in one quick swoop like so:

In [3]:

df['mean'] = df['orig'].mean(level=0).reindex(df.index, level=0); df

Out[3]:

		orig	mean
A	1	0	0.666667
	2	1	0.666667
	3	1	0.666667
B	1	1	1.000000
B	2	1	1.000000
C	1	2	2.666667
	2	4	2.666667
	3	2	2.666667
D	1	0	0.000000

Sum¶

sum also takes a level argument and performs the grouping without need for a groupby.

In [4]:

df['sum'] = df['orig'].sum(level=0).reindex(df.index, level=0); df

Out[4]:

		orig	mean	sum
A	1	0	0.666667	2
	2	1	0.666667	2
	3	1	0.666667	2
B	1	1	1.000000	2
B	2	1	1.000000	2
C	1	2	2.666667	8
	2	4	2.666667	8
	3	2	2.666667	8
D	1	0	0.000000	0

Mode¶

Pandas doesn't have a builtin operation to get the mode of groups as far as I'm aware, but it can be achieved using the agg function and the method from scipy. The operation is pretty slow and not without issues (e.g. cases where there are multiple modes) but for my use case it worked fine.

The function agg doesn't accept an argument about the grouping level like the previous two examples. From this point on we need to perform the grouping first before applying any aggregation calculation.

In [5]:

df['mode'] = df['orig'].groupby(level=0).agg(lambda x: scipy.stats.mode(x)[0][0]).reindex(df.index, level=0); df

Out[5]:

		orig	mean	sum	mode
A	1	0	0.666667	2	1
	2	1	0.666667	2	1
	3	1	0.666667	2	1
B	1	1	1.000000	2	1
B	2	1	1.000000	2	1
C	1	2	2.666667	8	2
	2	4	2.666667	8	2
	3	2	2.666667	8	2
D	1	0	0.000000	0	0

NUnique¶

Here's an example to get the number of unique values in a grouping and join it back onto the original multi-indexed DataFrame:

In [6]:

df['nunique'] = df['orig'].groupby(level=0).apply(lambda x: x.nunique()).reindex(df.index, level=0); df

Out[6]:

		orig	mean	sum	mode	nunique
A	1	0	0.666667	2	1	2
	2	1	0.666667	2	1	2
	3	1	0.666667	2	1	2
B	1	1	1.000000	2	1	1
B	2	1	1.000000	2	1	1
C	1	2	2.666667	8	2	2
	2	4	2.666667	8	2	2
	3	2	2.666667	8	2	2
D	1	0	0.000000	0	0	1

Size¶

Now the size or "count" of instances in each grouping:

In [7]:

df['size'] = df.groupby(level=0).size().reindex(df.index, level=0); df

Out[7]:

		orig	mean	sum	mode	nunique	size
A	1	0	0.666667	2	1	2	3
	2	1	0.666667	2	1	2	3
	3	1	0.666667	2	1	2	3
B	1	1	1.000000	2	1	1	2
B	2	1	1.000000	2	1	1	2
C	1	2	2.666667	8	2	2	3
	2	4	2.666667	8	2	2	3
	3	2	2.666667	8	2	2	3
D	1	0	0.000000	0	0	1	1

First¶

In [8]:

df['first'] = df['orig'].groupby(level=0).first().reindex(df.index, level=0); df

Out[8]:

		orig	mean	sum	mode	nunique	size	first
A	1	0	0.666667	2	1	2	3	0
	2	1	0.666667	2	1	2	3	0
	3	1	0.666667	2	1	2	3	0
B	1	1	1.000000	2	1	1	2	1
B	2	1	1.000000	2	1	1	2	1
C	1	2	2.666667	8	2	2	3	2
	2	4	2.666667	8	2	2	3	2
	3	2	2.666667	8	2	2	3	2
D	1	0	0.000000	0	0	1	1	0

Align¶

In [9]:

df2 = pd.Series(np.random.randn(4), index=['A','B','C','D'], name='align'); df2

Out[9]:

A   -0.614518
B    0.335981
C   -1.092953
D   -0.200379
Name: align, dtype: float64

In [10]:

a1, a2 = df.align(df2, level=0, axis=0); a2

Out[10]:

A  1   -0.614518
   2   -0.614518
   3   -0.614518
B  1    0.335981
   2    0.335981
C  1   -1.092953
   2   -1.092953
   3   -1.092953
D  1   -0.200379
Name: align, dtype: float64

a1 is the original DataFrame (df) and a2 is the Series which has been reindexed to match the original DataFrame.

You could also use reindex again like the following:

In [11]:

df2.reindex(df.index, level=0)

Out[11]:

A  1   -0.614518
   2   -0.614518
   3   -0.614518
B  1    0.335981
   2    0.335981
C  1   -1.092953
   2   -1.092953
   3   -1.092953
D  1   -0.200379
Name: align, dtype: float64

Update or FillNA¶

I had a case during data manipulation where one column was missing data. I was able to fill in missing data by performing a calculation on data from another column. However, in some cases this other column was missing data where the calculatable column was not. This meant that if I performed the calculation, then replaced all values, I would lose some data.

In [25]:

arrays = [np.array(['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D']),
          np.array(['1', '2', '3', '1', '2', '1', '2', '3', '1'])]

df = pd.DataFrame([0,None,1,1,1,2,4,None,None], index=arrays, columns=['orig']); df
df['sum'] = pd.Series([4,None,None,3], index=['A','B','C','D']).reindex(df.index, level=0); df

Out[25]:

		orig	sum
A	1	0.0	4.0
	2	NaN	4.0
	3	1.0	4.0
B	1	1.0	NaN
B	2	1.0	NaN
C	1	2.0	NaN
	2	4.0	NaN
	3	NaN	NaN
D	1	NaN	3.0

Imagine the missing value in A was 3, or that there were other rows missing from this data that meant the value for sum is really correct.

Here we want to fill in the values for B and C without overwriting the values for A or D as these cannot be recalculated using the data we are given.

In [26]:

replacement = df[~df['orig'].isnull()]['orig'].sum(level=0).reindex(df.index, level=0); replacement

Out[26]:

A  1    1.0
   2    1.0
   3    1.0
B  1    2.0
   2    2.0
C  1    6.0
   2    6.0
   3    6.0
D  1    NaN
Name: orig, dtype: float64

fillna will only update values in the original series where they are already None - it won't overwrite existing values. This way we can keep the value for A and D and only update the missing cases that we need.

In [27]:

df['sum'] = df['sum'].fillna(replacement); df

Out[27]:

		orig	sum
A	1	0.0	4.0
	2	NaN	4.0
	3	1.0	4.0
B	1	1.0	2.0
B	2	1.0	2.0
C	1	2.0	6.0
	2	4.0	6.0
	3	NaN	6.0
D	1	NaN	3.0

Now instead imagine that we know the value for A is wrong and we want to overwrite it using our new calculation. Since the original data fro D is missing we're going to rely on the fact that it's correct and don't want to overwrite that. So we want to update values for A, B, and C. Using update will overwrite the existing values where our new series is not None. This is done "inplace".

In [28]:

df['sum'].update(replacement); df

Out[28]:

		orig	sum
A	1	0.0	1.0
	2	NaN	1.0
	3	1.0	1.0
B	1	1.0	2.0
B	2	1.0	2.0
C	1	2.0	6.0
	2	4.0	6.0
	3	NaN	6.0
D	1	NaN	3.0

These fillna and update functions aren't anything special to MultiIndexed DataFrames, this demonstration to myself is more to show the full workflow once the recalculated Series has been properly indexed to align with what we're filling or updating. None of these calculations will work if the new Series is only indexed at the higher level, and sometimes the code doesn't even throw errors so you won't know that it's failed.

		orig	sum
A	1	0.0	4.0
	2	NaN	4.0
	3	1.0	4.0
B	1	1.0	NaN
B	2	1.0	NaN
C	1	2.0	NaN
	2	4.0	NaN
	3	NaN	NaN
D	1	NaN	3.0

		orig	sum
A	1	0.0	4.0
	2	NaN	4.0
	3	1.0	4.0
B	1	1.0	2.0
B	2	1.0	2.0
C	1	2.0	6.0
	2	4.0	6.0
	3	NaN	6.0
D	1	NaN	3.0

		orig	sum
A	1	0.0	1.0
	2	NaN	1.0
	3	1.0	1.0
B	1	1.0	2.0
B	2	1.0	2.0
C	1	2.0	6.0
	2	4.0	6.0
	3	NaN	6.0
D	1	NaN	3.0

		orig	sum
A	1	0.0	4.0
	2	NaN	4.0
	3	1.0	4.0
B	1	1.0	NaN
B	2	1.0	NaN
C	1	2.0	NaN
	2	4.0	NaN
	3	NaN	NaN
D	1	NaN	3.0

		orig	sum
A	1	0.0	4.0
	2	NaN	4.0
	3	1.0	4.0
B	1	1.0	2.0
B	2	1.0	2.0
C	1	2.0	6.0
	2	4.0	6.0
	3	NaN	6.0
D	1	NaN	3.0

		orig	sum
A	1	0.0	1.0
	2	NaN	1.0
	3	1.0	1.0
B	1	1.0	2.0
B	2	1.0	2.0
C	1	2.0	6.0
	2	4.0	6.0
	3	NaN	6.0
D	1	NaN	3.0

		orig	sum
A	1	0.0	4.0
	2	NaN	4.0
	3	1.0	4.0
B	1	1.0	NaN
B	2	1.0	NaN
C	1	2.0	NaN
	2	4.0	NaN
	3	NaN	NaN
D	1	NaN	3.0

		orig	sum
A	1	0.0	4.0
	2	NaN	4.0
	3	1.0	4.0
B	1	1.0	2.0
B	2	1.0	2.0
C	1	2.0	6.0
	2	4.0	6.0
	3	NaN	6.0
D	1	NaN	3.0

		orig	sum
A	1	0.0	1.0
	2	NaN	1.0
	3	1.0	1.0
B	1	1.0	2.0
B	2	1.0	2.0
C	1	2.0	6.0
	2	4.0	6.0
	3	NaN	6.0
D	1	NaN	3.0