Come with me now on a journey through code and data...

Poverty Prediction Kaggle Competition: Feature Generation

The workflow from the previous post has been moved to a utility file. First, load the data and check that the utility functions have been applied correctly.

In [24]:
import pandas as pd
from data_cleaner import *  # utility functions and column-name constants from the previous post

df = (load_training_df()
      .pipe(clean_targets)
      .pipe(clean_non_numerics)
      .pipe(clean_missing_values))
In [2]:
# A household's target is consistent when all of its members share a single value
is_target_consistent = df[target_column].groupby(household_id).nunique() == 1
inconsistent_targets = is_target_consistent[~is_target_consistent]
print('There are %d households with inconsistent target values' % len(inconsistent_targets))
There are 0 households with inconsistent target values
In [3]:
for k, v in df.columns.groupby(df.dtypes).items():
    print('There are %d features of type %s' % (len(v), k.name))
There are 136 features of type int64
There are 8 features of type float64
In [4]:
nulls = df.isnull().sum(axis=0)
nulls = nulls[nulls!=0]/len(df)
print('There are %d features of type null %s' % (len(nulls), ','.join(nulls.index)))
There are 0 features of type null 

New features for individuals

Before considering data at a household level, there are some new features that may be useful to generate at the individual level.

Education-level

There are 9 columns used as a binary one-hot encoding of the individual's level of education. We can compress these down to a single value representing how far through education the individual has progressed.

The ordering here is slightly unusual: instlevel4 and instlevel6 represent incomplete secondary school (academic or technical), while instlevel5 and instlevel7 represent complete secondary school. The order used below was chosen by inspecting correlation with household wealth.
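The compress_columns helper lives in the utility file and isn't shown in this post. As a rough sketch of the idea, and not necessarily how the real helper is written, the hypothetical function below assumes each row has at most one of the columns set, scores it by the 1-based position of that column in the supplied order (0 if none is set), and drops the original columns:

```python
import numpy as np
import pandas as pd


def compress_one_hot_sketch(df, new_col, cols_to_compress):
    """Hypothetical stand-in for compress_columns (the real helper is not shown).

    Assumes at most one of cols_to_compress is 1 in each row. The new column
    holds the 1-based position of that column in the list, or 0 if none is set.
    """
    one_hot = df[cols_to_compress].to_numpy()
    ranks = np.arange(1, len(cols_to_compress) + 1)
    out = df.drop(columns=cols_to_compress)
    # Multiplying each indicator by its rank and summing picks out the rank of the set column
    out[new_col] = (one_hot * ranks).sum(axis=1)
    return out


# Toy data: three individuals, each with a different level set
toy = pd.DataFrame({
    'instlevel1': [1, 0, 0],
    'instlevel6': [0, 1, 0],
    'instlevel4': [0, 0, 1],
})
toy_compressed = compress_one_hot_sketch(
    toy, 'education-level', ['instlevel1', 'instlevel6', 'instlevel4'])
```

Because the score depends only on position in the list, reordering cols_to_compress is all that's needed to change which level counts as "higher".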

In [5]:
df = df.pipe(compress_columns, new_col='education-level', 
        cols_to_compress=['instlevel1', 'instlevel2', 'instlevel3', 'instlevel6', 'instlevel4', 'instlevel7', 
                          'instlevel5', 'instlevel8', 'instlevel9'])
In [6]:
df[['education-level']].head(2)
Out[6]:
                        education-level
idhogar   Id
21eb7fcc1 ID_279628684                4
0e5d7a658 ID_f29eb3ddd                7

New features for households

All new features from this point on will be descriptions at a household level so we'll append them all to a DataFrame indexed at household level.

In [7]:
hh_idx = df.index.get_level_values(level=household_id).drop_duplicates()
hh_df = pd.DataFrame(index=hh_idx)

Household size

In some cases, features about the household are more useful when broken down to a per-person ratio, such as the number of tablets per person. There are quite a few different features indicating the number of people living in a household.

In [8]:
existing_features = ['tamviv','tamhog','hhsize','hogar_total','r4t3']
hh_size = df.groupby(household_id).size().rename('hh_size').reindex(hh_idx)

hhsizes = df[existing_features].groupby(household_id).first().join(hh_size)

features_are_equal = hhsizes.apply(lambda x: (max(x)-min(x))==0, axis=1)
hhsizes[~features_are_equal].sort_values('tamviv', ascending=False).head()
Out[8]:
           tamviv  tamhog  hhsize  hogar_total  r4t3  hh_size
idhogar
d4e1dc02c      15       9       9            9     9        9
3fb291710      13       4       4            4     4        4
d43a04997      13       9       9            9     9        9
29024a31c      13       4       4            4     4        4
0592dc939      11       7       7            7     7        7
In [9]:
hhsizes = df[['r4t3','tamhog','hhsize','hogar_total']].groupby(household_id).first().join(hh_size)
features_are_equal = hhsizes.apply(lambda x: (max(x)-min(x))==0, axis=1)
hhsizes[~features_are_equal].head()
Out[9]:
           r4t3  tamhog  hhsize  hogar_total  hh_size
idhogar
03c6bdf85     5       5       5            5        2
048d64af0     4       2       2            2        2
053f09ebb     2       2       2            2        1
09b195e7a     4       4       4            4        1
0ccab16a8     5       5       5            5        2

Unfortunately, the different features fairly frequently have inconsistent values. tamviv is often the largest and has the description "number of persons living in the household"; it's not clear how this differs from tamhog, "size of the household", or hhsize, "household size". My guess would be that tamviv also counts non-family members living in the household.

It may be a good indication of poverty if a household has additional members from outside the family.

In [10]:
additional_hh_members = df[df['tamviv']!=df['tamhog']][target_column].groupby(household_id).first()
additional_hh_members.value_counts()
Out[10]:
4    64
2    23
1    13
3    12
Name: Target, dtype: int64
In [11]:
additional = df['tamviv']-df['tamhog']
df[additional>1][target_column].groupby(household_id).first().value_counts()
Out[11]:
4    48
2    17
3     9
1     9
Name: Target, dtype: int64
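Raw counts like these are hard to judge without knowing how common each target value is overall. One way to compare, sketched here on made-up toy data rather than the competition data (the has_extra column is a hypothetical flag standing in for tamviv != tamhog), is to normalise both distributions and put them side by side:

```python
import pandas as pd

# Toy household-level data: the values are invented purely to show the mechanics
toy = pd.DataFrame({
    'Target':    [4, 4, 2, 1, 4, 3, 4, 2],
    'has_extra': [True, False, True, False, False, False, True, False],
})

# Share of households at each target value, overall and within the flagged group
overall = toy['Target'].value_counts(normalize=True).sort_index()
with_extra = toy.loc[toy['has_extra'], 'Target'].value_counts(normalize=True).sort_index()

# Align on target value; a value missing from the flagged group gets a share of 0
comparison = pd.DataFrame({'overall': overall, 'with_extra': with_extra}).fillna(0)
```

If the flagged group's shares look much like the overall shares, the flag on its own probably says little about poverty level.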

Our generated household size is often smaller than the others. This is probably because it only counts individuals for whom data has been collected, not necessarily everyone in the household. We should use our generated value when computing a mean or proportion over data we are grouping ourselves, and one of the other values when breaking down numbers that are already provided at household level, such as the number of phones or tablets a household owns.
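To make the distinction concrete, here is a small sketch on toy data. The household ids and values are invented, and hogar_total is picked as the reported size purely for illustration; any of the other reported size columns could be used instead.

```python
import pandas as pd

# Toy individual-level data: household hh_a reports 4 members but only 2 have rows
idx = pd.MultiIndex.from_tuples(
    [('hh_a', 'ID_1'), ('hh_a', 'ID_2'), ('hh_b', 'ID_3')],
    names=['idhogar', 'Id'])
toy = pd.DataFrame({
    'hogar_total':  [4, 4, 1],     # household-level value repeated on each member
    'qmobilephone': [2, 2, 1],     # household-level value repeated on each member
    'escolari':     [6, 10, 11],   # individual-level value
}, index=idx)

by_household = toy.groupby(level='idhogar')
observed_size = by_household.size()                 # rows we actually have
reported_size = by_household['hogar_total'].first() # size as reported in the data

# Individual-level data we aggregate ourselves: divide by the observed size
mean_escolari = by_household['escolari'].sum() / observed_size

# Household-level count already provided: divide by the reported size
phones_per_person = by_household['qmobilephone'].first() / reported_size
```

For hh_a this gives a mean of 8 years of schooling over the two observed members, and 0.5 phones per person over the four reported members.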

Supporters-Dependents

The data given to us includes a dependency rate, which compares the number of adults between 19 and 64 (working age) with the number of children or adults of 65+. This is presumably because adults of working age will be supporting the household. Let's define a couple of terms:

  • supporter : Household member aged 19-64 who has not been marked as having a disability
  • dependent : Household member aged 0-18 or 65+, or who has been marked as having a disability

We saw when cleaning the data that there are cases in which households have no supporters. We can add a couple of features to indicate whether there are no supporters in the household, or also no dependents in the household.

In [12]:
# Supporters: working age (19-64) and not marked as having a disability
supporters = df[(df['age']>=19) & (df['age']<=64) & (df['dis']==0)]
# Dependents: everyone else (under 19, over 64, or marked as having a disability)
dependents = df[(df['age']<19) | (df['age']>64) | (df['dis']==1)]

hh_df['num_supporters'] = supporters.groupby(household_id).size().reindex(hh_idx, fill_value=0)
hh_df['num_dependents'] = dependents.groupby(household_id).size().reindex(hh_idx, fill_value=0)

hh_df['supporters_rate'] = (hh_df['num_supporters'] / hh_size).round(2)
hh_df['dependents_rate'] = (hh_df['num_dependents'] / hh_size).round(2)

hh_df['0_supporters'] = (hh_df['num_supporters']==0).astype(int)
hh_df['0_dependents'] = (hh_df['num_dependents']==0).astype(int)
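Under the definitions above (supporters are aged 19-64 without a disability, dependents are everyone else), every household member falls into exactly one group, so the two counts should add up to the household size. A small sanity check on invented toy data, including members at the boundary ages and a household with no supporters:

```python
import pandas as pd

# Toy data with boundary ages 18, 19, 64, 65 and one working-age member with a disability
idx = pd.MultiIndex.from_tuples(
    [('hh_a', 'ID_1'), ('hh_a', 'ID_2'), ('hh_a', 'ID_3'),
     ('hh_b', 'ID_4'), ('hh_b', 'ID_5')],
    names=['idhogar', 'Id'])
toy = pd.DataFrame({'age': [18, 19, 64, 65, 30],
                    'dis': [0, 0, 0, 0, 1]}, index=idx)

hh_idx = toy.index.get_level_values('idhogar').drop_duplicates()

is_supporter = toy['age'].between(19, 64) & (toy['dis'] == 0)  # between() is inclusive
supporters = toy[is_supporter]
dependents = toy[~is_supporter]

# reindex(..., fill_value=0) restores households that dropped out of a filtered group
num_supporters = supporters.groupby(level='idhogar').size().reindex(hh_idx, fill_value=0)
num_dependents = dependents.groupby(level='idhogar').size().reindex(hh_idx, fill_value=0)
hh_size = toy.groupby(level='idhogar').size().reindex(hh_idx)

counts_add_up = bool((num_supporters + num_dependents == hh_size).all())
```

Household hh_b has no supporters at all, so it only appears in num_supporters because of the reindex with fill_value=0. This is the same pattern that makes the 0_supporters flag possible.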

Here our dependents_rate is the same idea as the provided dependency feature. However, it's possible that some information is missing, as it has been calculated from individual-level data. We can add in the provided dependency rate as well in case it's more helpful or accurate.

Remember that in this case the number of dependents is gathered from the features hogar_nin and hogar_mayor and the number of adults marked as disabled. The household size used is hogar_total, which is likely a higher value than our generated hh_size.

In [13]:
hh_df['dependency_rate'] = df['dependency'].groupby(household_id).first()

It may be useful to know the gender breakdown of supporters, since there is a gender-driven pay gap in most countries and this may have some effect on the wealth of the family.

In [14]:
m_supporters = supporters[supporters['male']==1] 
f_supporters = supporters[supporters['female']==1] 

hh_df['num_m_supporters'] = m_supporters.groupby(household_id).size().reindex(hh_idx, fill_value=0)
hh_df['num_f_supporters'] = f_supporters.groupby(household_id).size().reindex(hh_idx, fill_value=0)

Education

Education-level of household supporters is likely to have a large impact on the wealth of the family as well. We already have the mean education of adults in the household, but let's make a new value for supporters, and supporters broken down by gender.

In [15]:
hh_df['meaneduc'] = df['meaneduc'].groupby(household_id).first().reindex(hh_idx, fill_value=0)
hh_df['meaneduc_s'] = supporters['escolari'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)
hh_df['meaneduc_m'] = m_supporters['escolari'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)
hh_df['meaneduc_f'] = f_supporters['escolari'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)

hh_df['ed_lev_s'] = supporters['education-level'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)
hh_df['ed_lev_m'] = m_supporters['education-level'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)
hh_df['ed_lev_f'] = f_supporters['education-level'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)

Since one member of the household has been assigned 'head-of-household', it's possible that details relating to this individual offer significant information about the household. We can add extra features from combinations of details about them.

In [16]:
hoh = df[(df[head_of_household]==1)].groupby(household_id).first()

hh_df['male_hoh'] = (hoh['male']==1).astype(int).reindex(hh_idx, fill_value=0)
hh_df['educ_hoh'] = hoh['escolari'].reindex(hh_idx, fill_value=0)
hh_df['educ_hoh_m'] = hoh['edjefe'].reindex(hh_idx, fill_value=0)
hh_df['educ_hoh_f'] = hoh['edjefa'].reindex(hh_idx, fill_value=0)
hh_df['SQeduc_hoh_m'] = hoh['SQBedjefe'].reindex(hh_idx, fill_value=0)
hh_df['SQeduc_hoh_f'] = hoh['SQBedjefa'].reindex(hh_idx, fill_value=0)
hh_df['ed_lev_hoh'] = hoh['education-level'].reindex(hh_idx, fill_value=0)

hh_df['hoh_is_sup'] = (((hoh['age']>=19) & (hoh['age']<=64) & (hoh['dis']==0))
                       .astype(int)
                       .reindex(hh_idx, fill_value=0))

Missing education is more significant for children, as it indicates that they are falling behind rather than just showing the number of years they have been in education. Let's check for those aged 18 or under who are falling behind in school. We'll only consider children without disabilities; otherwise the disability itself, rather than a lack of wealth, might be the cause of falling behind in school.

In [17]:
minors = df[(df['age']<=18) & (df['dis']==0)]

# Mean years behind in school per household (0 where a household has no minors in the group)
hh_df['missing_school'] = minors['rez_esc'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)
hh_df['missing_school_m'] = minors[minors['male']==1]['rez_esc'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)
hh_df['missing_school_f'] = minors[minors['female']==1]['rez_esc'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)

Rent

We can add in the values we were working with for whether monthly rent payments are owed, how much, and the stability of the household's residence.

In [18]:
hh_df['rent'] = df['v2a1'].groupby(household_id).first().reindex(hh_idx, fill_value=0)
hh_df['pays_rent'] = df['owes-montly-payments'].groupby(household_id).first().reindex(hh_idx, fill_value=0)
hh_df['residence-stability'] = df['residence-stability'].groupby(household_id).first().reindex(hh_idx, fill_value=0)

A household that owes rent payments but has no supporters may be in particular difficulty, so let's flag that combination:

In [19]:
hh_df['rent_problems'] = ((hh_df['pays_rent']==1) & (hh_df['0_supporters']==1)).astype(int)

Possessions

While cleaning the data we also saw the feature describing the number of tablets a household owns. Let's add that and look more into features around the possessions of the household.

In [20]:
# Binary values
hh_df['refrig'] = df['refrig'].groupby(household_id).first().reindex(hh_idx)
hh_df['computer'] = df['computer'].groupby(household_id).first().reindex(hh_idx)
hh_df['television'] = df['television'].groupby(household_id).first().reindex(hh_idx)
# Count of how many owned
hh_df['tablets_ratio'] = (df['v18q1'].groupby(household_id).first().reindex(hh_idx)/hh_size).round(2)
hh_df['mobilephones_ratio'] = (df['qmobilephone'].groupby(household_id).first().reindex(hh_idx)/hh_size).round(2)

Combinations of electronic possessions like these probably give some indication of wealth.

In [21]:
hh_df['electronics'] = (hh_df['refrig'] + hh_df['computer'] + hh_df['television'] +
                        (hh_df['tablets_ratio']>0).astype(int) +
                        (hh_df['mobilephones_ratio']>0).astype(int))

Overcrowding

Information has been provided about numbers of rooms and overcrowding. Let's take a look.

Overcrowding is the number of people divided by the number of bedrooms. Where overcrowding is 3 or above, the binary value for overcrowding by bedroom (hacdor) is set to True.
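As a quick check of that description, the ratio and flag can be recomputed by hand. This sketch assumes hhsize is the household size used in the ratio. The first three rows use hhsize and bedrooms values that appear in the data, and the last row is a hypothetical household added to show the flag switching on:

```python
import pandas as pd

toy = pd.DataFrame({
    'hhsize':   [1, 1, 4, 6],   # last row is hypothetical
    'bedrooms': [1, 2, 3, 2],
})

# People per bedroom
toy['overcrowding'] = toy['hhsize'] / toy['bedrooms']

# Binary flag, using the threshold of 3 described above
toy['hacdor'] = (toy['overcrowding'] >= 3).astype(int)
```

The first three ratios come out as 1.0, 0.5 and roughly 1.33, all below the threshold, while the hypothetical six-person, two-bedroom household sits exactly at 3.0 and is flagged.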

In [22]:
df[['rooms','hacapo','bedrooms','hacdor','overcrowding','hhsize']].head()
Out[22]:
                        rooms  hacapo  bedrooms  hacdor  overcrowding  hhsize
idhogar   Id
21eb7fcc1 ID_279628684      3       0         1       0      1.000000       1
0e5d7a658 ID_f29eb3ddd      4       0         1       0      1.000000       1
2c7317ea8 ID_68de51c94      8       0         2       0      0.500000       1
2b58d945f ID_d671db89c      5       0         3       0      1.333333       4
          ID_d56d6f5f5      5       0         3       0      1.333333       4