import warnings
warnings.filterwarnings("always")
import pandas as pd
import numpy as np
import scipy.stats
I've placed the category names and descriptions in a dictionary in another file so I can lookup and organize the different features there.
from column_categories import *
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('fivethirtyeight')
train_df = pd.read_csv('../input/train.csv')
train_df.set_index([household_id, individual_id], inplace=True)
train_df.head()
Check for class imbalance¶
from column_categories import target_values
def target_table_breakdown(df, target_desc=target_values):
household_target_sizes = df[target_column].value_counts().to_frame()
household_target_sizes.columns = ['total']
household_target_sizes['proportion'] = household_target_sizes['total']/household_target_sizes['total'].sum()
household_target_sizes['target description'] = household_target_sizes.index.map(target_desc.get)
return household_target_sizes
Based on individuals, how are the target values distributed?
target_table_breakdown(train_df)
Based on households, how are the target values distributed?
Before we can do this we need to decide how to group taget values in the case where they aren't consistent across the household.
Verify target values are consistent across households¶
is_target_consistent = train_df.groupby(household_id)[target_column].apply(lambda x: x.nunique() == 1)
inconsistent_targets = is_target_consistent[is_target_consistent != True]
print('There are %d households with inconsistent target values' % len(inconsistent_targets))
train_df.loc[inconsistent_targets.index][[head_of_household,target_column]].head()
The organizers say to update the household with the value stated by the head, but I think it should be the mode.
Use the following to get target from household head:
hh_target = pd.DataFrame(train_df[train_df[head_of_household] == 1][target_column] \
.groupby(household_id).first().astype(int))
print(len(hh_target))
There are cases where there is no head of household so some household targets are missing here.
Use the following to get target from mode of values given by individuals:
hh_target = pd.DataFrame(train_df[target_column].groupby(household_id).agg(lambda x: scipy.stats.mode(x)[0][0]))
print(len(hh_target))
Make sure we're not overwriting with any null values, so remove cases where household targets are missing - this is needed for when we are taking the household head's target value as a truth. (I didn't think update works this way but for some reason it's overwriting existing values with null in the case where a household head is missing).
df = train_df[[target_column]].join(hh_target, lsuffix='_x')
df = df[~df[target_column].isnull()]
df.head()
train_df.update(df[[target_column]])
is_target_consistent = df.groupby(household_id)[target_column].apply(lambda x: x.nunique() == 1)
inconsistent_targets = is_target_consistent[is_target_consistent != True]
print('There are %d households with inconsistent target values' % len(inconsistent_targets))
target_table_breakdown(train_df.groupby(household_id).first())
We need to be aware of this imbalance when manipulating data.
Ensure all features are numeric¶
Inspect data types, ensure all data is numeric
def get_column_dtypes(df):
columns_by_dtype = df.columns.groupby(df.dtypes)
return {k.name: v for k, v in columns_by_dtype.items()}
get_column_dtypes(train_df)
Convert dependency
, edjefe
, edjefa
to numerics.
Dependency¶
train_df['dependency'].value_counts().head()
Mix of string/boolean and continuous floats
This feature is defined as:
dependency
: Dependency rate, calculated = (number of members of the household younger than 19 or older than 64)/(number of member of household between 19 and 64)
From the features available to us we can see this calculation is:
hogar_nin
+hogar_mayor
/hogar_adul
-hogar_mayor
This is related to SQBdependency
, so we should check whether that feature needs to be updated too.
First lets look at the distribution of the numeric values we already have:
numeric_dep = train_df[(train_df['dependency']!='yes') & (train_df['dependency']!='no')]['dependency'].astype(float)
sns.distplot(numeric_dep, hist=False, rug=True)
There is a max cutoff value being used for the case where there are no adults between 19 and 64 (as it would cause division by zero). This is creating a peak at the value 8 which doesn't really exist.
This suggests it would be good to mark household that have no adults to depend on as this may signal poverty level.
Let's calculate the rate ourselves. As there's an issue with division by zero for the case where there are no adults under 65, let's instead calculate the dependecy rate as the number of dependents proportional of the household size overall.
The original calculation also misses out on information about disabled adults - potentially adults who cannot work and are dependents on the household. We'll incorporate them into the dependency calculation.
adult_dis = pd.DataFrame(train_df[(train_df['dis']==1) &
(train_df['age']>=19) &
(train_df['age']<=64)].groupby(household_id).size()).rename(columns={0:'adult-dis'})
train_df = train_df.join(adult_dis).fillna(0)
train_df[train_df['adult-dis']>1][['adult-dis','age','dis']].head()
depedents = train_df['hogar_nin']+train_df['hogar_mayor']+train_df['adult-dis']
calculated_dep = depedents/train_df['hogar_total']
sns.distplot(calculated_dep, hist=False, rug=True)
We see a peak around 0 where households have no dependents, and a small peak at 1 where households have no adults to depend on. The distribution between 0.1 - 0.9 seems to be the true dependency rate we're looking for.
Let's add a couple of features to indicate this and also fix the dependency rate values.
train_df['0-adults'] = ((train_df['hogar_adul']-train_df['hogar_mayor']-train_df['adult-dis'])==0).astype(int)
train_df['0-dependents'] = ((train_df['hogar_nin']+train_df['hogar_mayor']+train_df['adult-dis'])==0).astype(int)
train_df['dependency'] = calculated_dep
train_df['SQBdependency'] = train_df['dependency']**2
train_df[['0-adults','0-dependents','dependency','SQBdependency']].head()
'%d rows are missing a value' % len(train_df[train_df['dependency'].isnull()])
EdjeFe & EdjeFa¶
edjefe
and edjefa
represent the years of education of male or female head of household respectively. This value is based on the interaction of escolari
(years of education), head of household, and gender and the documentation states that yes=1 and no=0. Let's double check that we can replace strings with binary values.
Feature definitions:
escolari
represents the number of years in educationinstlevel#
represents the level of education reached (there are 9 binary values, let's compress them into a single column). We can check level of education against number of years in education for consistency.
def compress_columns(df, new_col, old_cols):
df[new_col] = (np.argmax(np.array(df[old_cols]), axis = 1))
return df
df = compress_columns(train_df, 'education-level', ['instlevel1', 'instlevel2', 'instlevel3', 'instlevel6', 'instlevel4', 'instlevel7', 'instlevel5', 'instlevel8', 'instlevel9'])
df['education-level'].head()
Now we can check both the number of years of education and the level of education reached in the cases where edjefe
or edjefa
are 'yes' to verify that the value should be set to 1.
cols = ['education-level','escolari','edjefe','edjefa','male','female']
head_educ = df[(df[head_of_household]==1) & ((df['edjefe']=='yes') | (df['edjefa']=='yes'))][cols]
head_educ.head()
It looks like this is correct, the 'yes' values should be set to 1. Before updating let's double check for any cases where the years of education or education-level is greater than 1.
len(head_educ[head_educ['escolari']>1])
len(head_educ[head_educ['education-level']>1]) # (1 = incomplete primary school, expected)
This is consistent. In all cases where the head of household education value is set to yes
the education level and years is 1
. Let's replace.
def convert_to_binary(df, feature):
df[feature].replace('no','0',inplace=True)
df[feature].replace('yes','1',inplace=True)
df[feature] = df[feature].astype(int)
return df
train_df = convert_to_binary(train_df, 'edjefe')
train_df = convert_to_binary(train_df, 'edjefa')
There also exists SQBedjefe
- let's check this is consistent by seeing if there are any instances where edjefe
squared does not match this value.
len(train_df[train_df['SQBedjefe']!=train_df['edjefe']**2])
Let's add the equivalent for female head-of-household's for consistency. We'll check later whether this is a helpful feature.
train_df['SQBedjefa']=train_df['edjefa']**2
Other potentially helpful features that capture parts of this information could be:
- is the head of household male/female
- how many years of education does the head-of-household have (regardless of gender)
- how many years of education do males in the household have / females in the household have (perhaps only adults)
- what level of education does the h-o-h have?
- what level of eduacation do males in the household have / females in the household have
This is a lot to add, we'll make note now and add them later.
Verify all values are numeric¶
get_column_dtypes(train_df)
All values are now numeric
Fill in missing values¶
nulls = train_df.isnull().sum(axis=0)
nulls[nulls!=0]/len(train_df)
Feature definitions for missing features:
v2a1
: Monthly rent paymentv18q1
: Number of tablets household ownsrez_esc
: Years behind in schoolmeaneduc
: Average years of education for adults (18+)SQBmeaned
: Average years of education for adults (18+) squared
Fill in values for monthly rent payment¶
cols = ['v2a1','tipovivi1','tipovivi2','tipovivi3','tipovivi4','tipovivi5']
train_df[cols].head()
Rent is only owed in the case where the household are renting or paying installments, i.e. tipovivi2
or tipovivi3
is marked as 1. Let's check whether there are any missing values for this case.
len(train_df[((train_df['tipovivi2']==1) | (train_df['tipovivi3']==1)) & (train_df['v2a1'].isnull())])
In all cases where we expect to see values for monthly payments, the payments are present. We can set missing values to 0. We can also add a marker to show whether a household makes monthly payments for their accomodation, and compress the binary columns that indicate the stability of their accomodation.
train_df['v2a1'] = train_df['v2a1'].fillna(0)
train_df['owes-montly-payments'] = ((train_df['tipovivi2']==1) | (train_df['tipovivi3']==1)).astype(int)
train_df = compress_columns(train_df, 'residence-stability', ['tipovivi5','tipovivi4','tipovivi3','tipovivi2','tipovivi1'])
Fill in values for number of tablets a household owns¶
First let's check what kind of values we're looking at:
train_df['v18q1'].value_counts()
We have another feature that represents at an individual's level whether or not they own a tablet. We can sum this for each household to generate the missing values.
tablets = pd.DataFrame(train_df.groupby(household_id)['v18q'].sum()).rename(columns={'v18q':'v18q1'})
It looked like some existing values for this feature were wrong so we'll drop the column and completely replace it with the generated value.
train_df = train_df.drop(columns=['v18q1']).join(tablets['v18q1'])
train_df[['v18q','v18q1']].head(7)
Fill in missing school years¶
School starts age 7 so the value for escolari
should be only 7 years less than a persons age (up until the age of 18).
tmp = train_df.copy()
tmp['age'] = train_df['age'].apply(lambda x: min(x, 18))
tmp['gap'] = (tmp['age']-tmp['escolari']-7).apply(lambda x: max(x, 0))
columns = ['rez_esc', 'gap']
tmp[~tmp['rez_esc'].isnull()][columns].head()
This calculation seems to work so let's generate values for this feature.
train_df['rez_esc'] = (train_df['age'].apply(lambda x: min(x, 18)) -tmp['escolari']-7).apply(lambda x: max(x, 0))
Fill in mean education values¶
meaneduc
represents mean years of education for members of household that are 18+
Let's have a look at the case where we're missing values:
cols = ['hogar_total','hogar_adul','hogar_nin','hogar_mayor','age','escolari','meaneduc']
train_df[train_df['meaneduc'].isnull()][cols]
We can calculate the mean education level of adults of 18 and over using the individual's feature representing their years of schooling and group by household. Let's generate this value and compare it to existing values to verify it's the correct approach.
tmp1 = pd.DataFrame(train_df[train_df['age']>=18]['escolari'].groupby(household_id).mean().round(4))
tmp2 = pd.DataFrame(train_df['meaneduc'].groupby(household_id).first().round(4))
tmp3 = pd.DataFrame(train_df['SQBmeaned'].groupby(household_id).first().round(4))
adult_educ = tmp1.join(tmp2).join(tmp3)
adult_educ['sq'] = adult_educ['escolari']**2
adult_educ.head()
Now create a DataFrame of replacement values where meaneduc
is null.
replacements = adult_educ[adult_educ['meaneduc'].isnull()] \
.drop(columns=['meaneduc','SQBmeaned']) \
.rename(columns={'escolari':'meaneduc','sq':'SQBmeaned'})
replacements
Fill in missing values for meaneduc
and SQBmeaned
:
train_df = train_df.fillna(replacements)
Verify there are no more missing values
nulls = train_df.isnull().sum(axis=0)
nulls[nulls!=0]/len(train_df)