The workflow from the previous post has been moved to a utility file. First, load the data and check that the utility functions have been applied correctly.
import pandas as pd  # used directly below as pd.DataFrame
from data_cleaner import *
df = load_training_df()\
.pipe(clean_targets)\
.pipe(clean_non_numerics)\
.pipe(clean_missing_values)
is_target_consistent = df[target_column].groupby(household_id).apply(lambda x: x.nunique() == 1)
inconsistent_targets = is_target_consistent[is_target_consistent != True]
print('There are %d households with inconsistent target values' % len(inconsistent_targets))
for k, v in df.columns.groupby(df.dtypes).items():
    print('There are %d features of type %s' % (len(v), k.name))
nulls = df.isnull().sum(axis=0)
nulls = nulls[nulls!=0]/len(df)
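# Join the column names (the index); the values are float fractions and can't be joined as strings
print('There are %d features with null values: %s' % (len(nulls), ','.join(nulls.index)))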
New features for individuals
Before considering data at a household level there are some new features that may be useful to generate at an individual's level.
Education-level
There are 9 columns used as a binary one-hot encoding of the individual's level of education. We can compress these down to a single value representing how far through education the individual has progressed.
There's a slightly weird ordering here, as instlevel4 and instlevel6 represent incomplete secondary school (academic or technical) and instlevel5 and instlevel7 represent complete secondary school. The ordering has been arranged based on inspecting correlation with household wealth.
df = df.pipe(compress_columns, new_col='education-level',
cols_to_compress=['instlevel1', 'instlevel2', 'instlevel3', 'instlevel6', 'instlevel4', 'instlevel7',
'instlevel5', 'instlevel8', 'instlevel9'])
df[['education-level']].head(2)
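The compress_columns helper lives in the utility file and isn't shown in this post. As a rough idea of how such a helper could work, here is a minimal sketch on toy data. The function name compress_one_hot, the toy column names, and the choice to rank from 1 with 0 meaning "no flag set" are all assumptions for illustration, not the actual implementation.

```python
import numpy as np
import pandas as pd

def compress_one_hot(df, new_col, cols_to_compress):
    """Hypothetical helper: collapse ordered one-hot columns into one ordinal column.

    The position of each column in `cols_to_compress` defines its rank (1-based).
    Rows with no flag set get 0.
    """
    flags = df[cols_to_compress].to_numpy()
    # argmax gives the position of the first 1 in each row
    rank = flags.argmax(axis=1) + 1
    # rows where no column is set must not be mistaken for rank 1
    rank = np.where(flags.sum(axis=1) > 0, rank, 0)
    out = df.drop(columns=cols_to_compress)
    out[new_col] = rank
    return out

toy = pd.DataFrame({
    'level_a': [1, 0, 0, 0],
    'level_b': [0, 0, 1, 0],
    'level_c': [0, 1, 0, 0],
})
# The list order sets the ranking, so it need not follow the column names
compressed = toy.pipe(compress_one_hot, new_col='level',
                      cols_to_compress=['level_a', 'level_c', 'level_b'])
print(compressed['level'].tolist())
```

Passing the columns in ranked order is what lets the incomplete and complete secondary levels be interleaved as described above.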
New features for households
All new features from this point on will be descriptions at a household level so we'll append them all to a DataFrame indexed at household level.
hh_idx = df.index.get_level_values(level=household_id).drop_duplicates()
hh_df = pd.DataFrame(index=hh_idx)
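To make the indexing pattern concrete, here is a small sketch on toy data. It assumes, as the code above suggests, that the individual-level frame carries the household identifier as a named index level. The level names 'idhogar' and 'Id' and the toy values are assumptions for illustration.

```python
import pandas as pd

household_id = 'idhogar'  # assumed name of the household index level
toy = pd.DataFrame(
    {'age': [34, 8, 70, 41]},
    index=pd.MultiIndex.from_tuples(
        [('h1', 'p1'), ('h1', 'p2'), ('h2', 'p3'), ('h3', 'p4')],
        names=[household_id, 'Id']),
)

# One entry per household, in order of first appearance
toy_hh_idx = toy.index.get_level_values(household_id).drop_duplicates()
toy_hh_df = pd.DataFrame(index=toy_hh_idx)

# Any household-level Series aligns on that index when assigned as a column
toy_hh_df['n_people'] = toy.groupby(household_id).size()
print(toy_hh_df)
```

Starting from an empty frame with the right index means each new feature can be added as a column without worrying about row order.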
Household size
In some cases there are features about the household that are more useful when broken down to a per-person ratio, such as the number of tablets per person. There are quite a few different features indicating the number of people living in a household.
existing_features = ['tamviv','tamhog','hhsize','hogar_total','r4t3']
hh_size = df.groupby(household_id).size().rename('hh_size').reindex(hh_idx)
hhsizes = df[existing_features].groupby(household_id).first().join(hh_size)
features_are_equal = hhsizes.apply(lambda x: (max(x)-min(x))==0, axis=1)
hhsizes[~features_are_equal].sort_values('tamviv', ascending=False).head()
hhsizes = df[['r4t3','tamhog','hhsize','hogar_total']].groupby(household_id).first().join(hh_size)
features_are_equal = hhsizes.apply(lambda x: (max(x)-min(x))==0, axis=1)
hhsizes[~features_are_equal].head()
Unfortunately the different features have inconsistent values fairly frequently. tamviv is often the largest and has the description "number of persons living in the household"; it's not clear what this means in contrast to tamhog, "size of the household", or hhsize, "household size". My guess would be that there are non-family members living in this household.
It may be a good indication of poverty if a household has additional members from outside the family.
additional_hh_members = df[df['tamviv']!=df['tamhog']][target_column].groupby(household_id).first()
additional_hh_members.value_counts()
additional = df['tamviv']-df['tamhog']
df[additional>1][target_column].groupby(household_id).first().value_counts()
Our generated value of household size is often smaller than the others. This is probably because it only counts individuals that data has been collected for, not necessarily the number of people in the household as a whole. We should use our generated value when taking the mean or proportion of data we are grouping ourselves. We should choose one of the other values when breaking down numbers that are already provided at household level, such as the number of phones or tablets a household owns.
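The difference between the two denominators can be sketched with toy numbers. The household below reports four members but only has two rows of individual data. Using hhsize as the provided size is an assumption here, as are the index level names and all the values.

```python
import pandas as pd

household_id = 'idhogar'  # assumed name of the household index level
toy = pd.DataFrame(
    {'hhsize': [4, 4, 2],          # provided household size
     'qmobilephone': [2, 2, 1],    # provided household-level count
     'escolari': [11, 6, 9]},      # individual-level value
    index=pd.MultiIndex.from_tuples(
        [('h1', 'p1'), ('h1', 'p2'), ('h2', 'p3')],
        names=[household_id, 'Id']),
)
per_hh = toy.groupby(household_id)

# Individual-level data: average over the rows we actually have
rows_per_hh = per_hh.size()
mean_escolari = per_hh['escolari'].sum() / rows_per_hh

# Household-level counts: divide by a provided size instead
phones_per_person = per_hh['qmobilephone'].first() / per_hh['hhsize'].first()
print(mean_escolari.to_dict(), phones_per_person.to_dict())
```

Dividing the phone count for h1 by its two observed rows would give 1.0 phones per person rather than 0.5, which overstates what the household owns.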
Supporters-Dependents
The data given to us calculates a dependency rate which looks at the number of adults between 19 and 64 (working age) vs the number of children or adults of 65+. This is likely to be due to the fact adults of working age will be supporting the household. Let's define a couple of terms:
supporter: Household member aged 19-64 who has not been marked as having a disability
dependent: Household member aged 0-18, 65+, or marked as having a disability
We saw when cleaning the data that there are cases in which households have no supporters. We can add a couple of features to indicate whether a household has no supporters, or no dependents.
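# Age bands follow the definitions above: supporters are 19-64 with no disability
supporters = df[(df['age']>=19) & (df['age']<=64) & (df['dis']==0)]
# Dependents are everyone under 19, over 64, or marked as having a disability
dependents = df[(df['age']<19) | (df['age']>64) | (df['dis']==1)]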
hh_df['num_supporters'] = supporters.groupby(household_id).size().reindex(hh_idx, fill_value=0)
hh_df['num_dependents'] = dependents.groupby(household_id).size().reindex(hh_idx, fill_value=0)
hh_df['supporters_rate'] = (supporters.groupby(household_id).size().reindex(hh_idx, fill_value=0) / hh_size).round(2)
hh_df['dependents_rate'] = (dependents.groupby(household_id).size().reindex(hh_idx, fill_value=0) / hh_size).round(2)
hh_df['0_supporters'] = (hh_df['num_supporters']==0).astype(int)
hh_df['0_dependents'] = (hh_df['num_dependents']==0).astype(int)
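The counting pattern used here (filter, group, then reindex with a fill value) can be seen on a toy frame. This is a sketch under assumptions: the index level names and values are invented, and dependents are taken as the complement of supporters, which is one way to guarantee nobody is counted in both groups.

```python
import pandas as pd

household_id = 'idhogar'  # assumed name of the household index level
toy = pd.DataFrame(
    {'age': [40, 10, 70, 30], 'dis': [0, 0, 0, 1]},
    index=pd.MultiIndex.from_tuples(
        [('h1', 'p1'), ('h1', 'p2'), ('h2', 'p3'), ('h2', 'p4')],
        names=[household_id, 'Id']),
)
toy_idx = toy.index.get_level_values(household_id).drop_duplicates()

# between() is inclusive on both ends, so this is ages 19 to 64
is_supporter = toy['age'].between(19, 64) & (toy['dis'] == 0)
toy_supporters = toy[is_supporter]
toy_dependents = toy[~is_supporter]  # everyone else, so nobody is counted twice

# h2 has no supporters, so it is absent from the groupby result;
# reindex adds it back and fill_value turns the gap into 0
n_sup = toy_supporters.groupby(household_id).size().reindex(toy_idx, fill_value=0)
n_dep = toy_dependents.groupby(household_id).size().reindex(toy_idx, fill_value=0)
print(n_sup.to_dict(), n_dep.to_dict())
```

Without the reindex step, households with no supporters would simply be missing from the result and would show up as NaN once assigned to the household frame.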
Here our dependents_rate is the same idea as our previously calculated dependency feature; however, it's possible that some information is missing, as it's been calculated from individual-level data. We can add in the dependency rate as well in case it's more helpful or accurate.
Remember that in this case the number of dependents is gathered from the features hogar_nin, hogar_mayor, and the number of adults marked as disabled, and the household size value used is hogar_total, which is likely a higher value than our generated hh_size.
hh_df['dependency_rate'] = df['dependency'].groupby(household_id).first()
It may be useful to know the gender breakdown of supporters, since there is a gender-driven pay gap in most countries and this may have some effect on the wealth of the family.
m_supporters = supporters[supporters['male']==1]
f_supporters = supporters[supporters['female']==1]
hh_df['num_m_supporters'] = m_supporters.groupby(household_id).size().reindex(hh_idx, fill_value=0)
hh_df['num_f_supporters'] = f_supporters.groupby(household_id).size().reindex(hh_idx, fill_value=0)
Education
Education-level of household supporters is likely to have a large impact on the wealth of the family as well. We already have the mean education of adults in the household, but let's make a new value for supporters, and supporters broken down by gender.
hh_df['meaneduc'] = df['meaneduc'].groupby(household_id).first().reindex(hh_idx, fill_value=0)
hh_df['meaneduc_s'] = supporters['escolari'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)
hh_df['meaneduc_m'] = m_supporters['escolari'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)
hh_df['meaneduc_f'] = f_supporters['escolari'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)
hh_df['ed_lev_s'] = supporters['education-level'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)
hh_df['ed_lev_m'] = m_supporters['education-level'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)
hh_df['ed_lev_f'] = f_supporters['education-level'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)
Since one member of the household has been assigned 'head-of-household', it's possible that details relating to this individual offer significant information about the household. We can add extra features from combinations of details about them.
hoh = df[(df[head_of_household]==1)].groupby(household_id).first()
hh_df['male_hoh'] = (hoh['male']==1).astype(int).reindex(hh_idx, fill_value=0)
hh_df['educ_hoh'] = hoh['escolari'].reindex(hh_idx, fill_value=0)
hh_df['educ_hoh_m'] = hoh['edjefe'].reindex(hh_idx, fill_value=0)
hh_df['educ_hoh_f'] = hoh['edjefa'].reindex(hh_idx, fill_value=0)
hh_df['SQeduc_hoh_m'] = hoh['SQBedjefe'].reindex(hh_idx, fill_value=0)
hh_df['SQeduc_hoh_f'] = hoh['SQBedjefa'].reindex(hh_idx, fill_value=0)
hh_df['ed_lev_hoh'] = hoh['education-level'].reindex(hh_idx, fill_value=0)
# Same supporter definition as above: aged 19-64 with no disability
hh_df['hoh_is_sup'] = (((hoh['age']>=19) & (hoh['age']<=64) & (hoh['dis']==0))
                       .astype(int)
                       .reindex(hh_idx, fill_value=0))
Missing education is more significant for children, as it indicates that they are falling behind rather than just showing the number of years they have been in education. Let's check for those aged 18 or under who are falling behind in school. We'll only consider children without disabilities; otherwise the disability itself might be the cause of falling behind in school, rather than a lack of household wealth.
minors = df[(df['age']<=18) & (df['dis']==0)]
# Series.mean(level=...) was removed in pandas 2.0, so group by the household level instead
hh_df['missing_school'] = minors['rez_esc'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)
hh_df['missing_school_m'] = minors[minors['male']==1]['rez_esc'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)
hh_df['missing_school_f'] = minors[minors['female']==1]['rez_esc'].groupby(household_id).mean().round(2).reindex(hh_idx, fill_value=0)
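Here is a small sketch of the per-household mean on toy data, grouping on the index level. The level names and values are assumptions for illustration; rez_esc is treated as the number of years behind in school.

```python
import pandas as pd

household_id = 'idhogar'  # assumed name of the household index level
toy = pd.DataFrame(
    {'age': [12, 15, 45, 9], 'dis': [0, 0, 0, 1], 'rez_esc': [0, 2, 0, 3]},
    index=pd.MultiIndex.from_tuples(
        [('h1', 'p1'), ('h1', 'p2'), ('h2', 'p3'), ('h3', 'p4')],
        names=[household_id, 'Id']),
)
toy_idx = toy.index.get_level_values(household_id).drop_duplicates()

# Keep minors without a disability, as in the main analysis
toy_minors = toy[(toy['age'] <= 18) & (toy['dis'] == 0)]
behind = (toy_minors['rez_esc']
          .groupby(level=household_id).mean()
          .round(2)
          .reindex(toy_idx, fill_value=0))
print(behind.to_dict())
```

Note that filling with 0 gives the same value to a household with no qualifying minors (h2 and h3 here) as to one whose children are not behind at all, which may be worth keeping in mind when interpreting the feature.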
Rent
We can add in the values we were working with for whether monthly rent payments are owed, how much, and the stability of the household's residence.
hh_df['rent'] = df['v2a1'].groupby(household_id).first().reindex(hh_idx, fill_value=0)
hh_df['pays_rent'] = df['owes-montly-payments'].groupby(household_id).first().reindex(hh_idx, fill_value=0)
hh_df['residence-stability'] = df['residence-stability'].groupby(household_id).first().reindex(hh_idx, fill_value=0)
A household that owes rent payments but has no supporters may be in particular difficulty, so let's flag that combination:
hh_df['rent_problems'] = ((hh_df['pays_rent']==1) & (hh_df['0_supporters']==1)).astype(int)
Possessions
While cleaning the data we also saw the feature describing the number of tablets a household owns. Let's add that and look more into features around the possessions of the household.
# Binary values
hh_df['refrig'] = df['refrig'].groupby(household_id).first().reindex(hh_idx)
hh_df['computer'] = df['computer'].groupby(household_id).first().reindex(hh_idx)
hh_df['television'] = df['television'].groupby(household_id).first().reindex(hh_idx)
# Count of how many owned
hh_df['tablets_ratio'] = (df['v18q1'].groupby(household_id).first().reindex(hh_idx)/hh_size).round(2)
hh_df['mobilephones_ratio'] = (df['qmobilephone'].groupby(household_id).first().reindex(hh_idx)/hh_size).round(2)
Combinations of electronic possessions like these probably give some indication of wealth.
hh_df['electronics'] = (hh_df['refrig'] + hh_df['computer'] + hh_df['television'] +
(hh_df['tablets_ratio']>0).astype(int) +
(hh_df['mobilephones_ratio']>0).astype(int))
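To see what the combined score looks like, here is the same idea on two toy households. The values and the index name are invented, and summing over column lists is just a more compact way of writing the same addition.

```python
import pandas as pd

toy_hh = pd.DataFrame(
    {'refrig': [1, 0], 'computer': [1, 0], 'television': [1, 1],
     'tablets_ratio': [0.25, 0.0], 'mobilephones_ratio': [0.5, 0.33]},
    index=pd.Index(['h1', 'h2'], name='idhogar'),  # assumed index name
)
binary_cols = ['refrig', 'computer', 'television']
ratio_cols = ['tablets_ratio', 'mobilephones_ratio']

# Ratios are reduced to "owns at least one" so every item contributes 0 or 1
toy_hh['electronics'] = (toy_hh[binary_cols].sum(axis=1)
                         + (toy_hh[ratio_cols] > 0).sum(axis=1))
print(toy_hh['electronics'].to_dict())
```

The score ranges from 0 to 5 and treats every item as equally informative, which is a simplification.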
Overcrowding
Information has been provided about numbers of rooms and overcrowding. Let's take a look.
Overcrowding is the number of people divided by the number of bedrooms. Where overcrowding is 3 or above, the binary value for overcrowding by bedroom (hacdor) is set to True.
df[['rooms','hacapo','bedrooms','hacdor','overcrowding','hhsize']].head()
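The arithmetic described above can be sketched on toy values. Which household size the provided overcrowding column is based on is an assumption here (hhsize is used), and the _check column names are made up so they aren't confused with the real features.

```python
import pandas as pd

toy = pd.DataFrame({'hhsize': [6, 4, 3], 'bedrooms': [2, 2, 3]})

# People per bedroom
toy['overcrowding_check'] = toy['hhsize'] / toy['bedrooms']
# Flag at 3 or above, following the description in the text
toy['hacdor_check'] = (toy['overcrowding_check'] >= 3).astype(int)
print(toy)
```

Recomputing the values like this and comparing them against the provided columns is one way to confirm the description matches the data.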