## blawg

# Predicting Survival of Titanic Passengers

### Getting started with the Titanic data set

Garrett Mayock

*posted 2019-01-02 23:48:26 UTC*

The past two weeks have been hectic with the holidays, but I’ve spent a lot of time trying my hand at some applied data science. This particular effort was focused on the Titanic data set via Kaggle’s competition. Most people have heard about the Titanic – it was a big passenger ship, they bragged about it being unsinkable, and in a tragic irony, it sank. For those of you who haven’t heard of Kaggle, it’s a platform for companies to put out real-money contests for data scientists, with prizes sometimes in the hundreds of thousands of dollars. However, it’s also a platform for data scientists to learn, both by competing in the big-money contests and through “Getting Started” competitions that are designed to kick-start the learning process.

The Titanic Competition is one of those. In it, you’re given information on the survival of 891 passengers and are tasked with building a classifier to predict the survival of the remaining 418 passengers. It’s a popular training exercise because getting good results requires dealing with many of the data issues that you’ll face in the real world. There are missing values, there are strings containing non-normalized data (i.e., they contain title, first name, and last name, all in one column), there’s categorical data both ordinal (ticket class) and non-ordinal (port of embarkation), and so on. I went through several iterations, at first using Kaggle’s IPython notebooks for data exploration while maintaining a clean copy of the code in a script file. This duplication, while useful for keeping a clean master copy of the code, became repetitive, and after a few attempts I switched to using just notebooks.

I’ve made all of my notebooks public. The most recent was my attempt to get in the top 10% - I encourage you to take a look here. I’m two right answers from hitting the top 10%, but at this point I feel I can learn more by stepping away from this for a while and checking off some of the other items on my list. Those items are detailed in the “What’s next?” section of my previous blog post.

Anyway, here’s a list of the major categories of tasks I’ve undertaken while working on the competition:

**Cleaning data**

- Handling NaNs: Fare, Age
- Parsing text strings (to find title, last name, cabin group)
- Aliasing using dictionaries
- Binning age, fare

**Feature engineering**

- Encoded categorical data
- Mapped non-numeric ordinal data to ordinal fields
- Binned age and fare data
- Iterated through rows in the data to make relationships between rows explicit
- Scaled features

**Classifier selection**

- Basic comparison of classifiers
- For loop of classifiers
- GridSearchCV with KNeighborsClassifier and RandomForestClassifier

**Feature selection / elimination**

- chi2
- Recursive Feature Elimination (RFE)
- Recursive Feature Elimination with Cross-Validation (RFECV)

What follows is some additional detail about each of those steps.

## Cleaning data

I’ve enjoyed using a few different ways to approach each problem.

I filled the single NaN value for Fare by using the mode of the dataset.

```
combined_data['Fare'] = combined_data['Fare'].fillna(combined_data['Fare'].mode()[0])
```

I grouped the data by gender, passenger class, and a cleaned title (extracted from the Name field and mapped to an alias) before filling in the NaN Age values with each group’s median.

```
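# Fill missing ages with the median age of passengers sharing sex, class, and title.
grouped = combined_data.groupby(['female', 'Pclass', 'cleaned_title'])
# transform() keeps the original row index, so the result lines up with combined_data.
combined_data['Age'] = grouped['Age'].transform(lambda x: x.fillna(x.median()))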
```

I used a for-loop to pull out the titles (and the cabin groups):

```
title = []
for i in combined_data['Name']:
    period = i.find(".")
    comma = i.find(",")
    title_value = i[comma+2:period]
    title.append(title_value)
combined_data['title'] = title
```

Whereas I used a lambda function to split out the last name:

```
combined_data['last_name'] = combined_data['Name'].apply(lambda x: str.split(x, ",")[0])
```
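The same title extraction can also be done without a loop, and the aliasing step mentioned earlier fits on top of it. The following is only a sketch: the sample names and the `title_aliases` dictionary are made up for illustration and are not the mapping used in the kernel.

```python
import pandas as pd

# Hypothetical sample rows in the same "Last, Title. First" format as the Name column.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Aubart, Mme. Leontine Pauline",
    "Uruchurtu, Don. Manuel E",
])

# Capture the text between the comma and the first period in one vectorized step.
title = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Illustrative alias dictionary (an assumption, not the author's actual mapping).
title_aliases = {"Mme": "Mrs", "Mlle": "Miss", "Ms": "Miss", "Don": "Rare"}

# Titles without an alias map to NaN, so fall back to the original title for those.
cleaned_title = title.map(title_aliases).fillna(title)
print(cleaned_title.tolist())
```

Grouping rare or foreign-language titles under a shared alias keeps the later one-hot encoding from producing many nearly empty columns.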

The whole process can be viewed in my Kaggle kernel.

## Feature engineering

I encoded categorical data such as cabin_group, the titles, and the ports of embarkation using one-hot encoding (specifically, I used the pandas function get_dummies()).
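As a rough illustration of what get_dummies() does, here is a minimal sketch on a made-up three-row frame. Only the Embarked and Fare column names come from the data set; the values and the `embarked` prefix are assumptions.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the combined Titanic data.
df = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.46]})

# One indicator column per category; the original Embarked column is dropped.
encoded = pd.get_dummies(df, columns=["Embarked"], prefix="embarked")
print(encoded.columns.tolist())
```

Each row ends up with exactly one of the `embarked_*` columns set, so no ordering between ports is implied.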

I mapped non-numeric ordinal data to ordinal values using the pandas Series.map() method. Specifically, I used it to encode the order in which passengers were picked up, although I ended up not using that as a feature in the final classifiers.

```
port = {
    'S': 1,
    'C': 2,
    'Q': 3
}
combined_data['pickup_order'] = combined_data['Embarked'].map(port)
```

I binned age and fare data using numpy.select().

```
import numpy as np

# Flag passengers who paid nothing, then bin the remaining fares at 7.9, 14.4, and 31.
combined_data['boarded_free'] = combined_data['Fare'] == 0
fare_bin_conditions = [
    combined_data['Fare'] == 0,
    (combined_data['Fare'] > 0) & (combined_data['Fare'] <= 7.9),
    (combined_data['Fare'] > 7.9) & (combined_data['Fare'] <= 14.4),
    (combined_data['Fare'] > 14.4) & (combined_data['Fare'] <= 31),
    combined_data['Fare'] > 31
]
fare_bin_outputs = [0, 1, 2, 3, 4]
# An integer default keeps the result numeric; -1 would mark a fare matching no condition.
combined_data['fare_bin'] = np.select(fare_bin_conditions, fare_bin_outputs, default=-1)
```

I also used a fun technique to look at groups of data – specifically, people riding on the same ticket, people with the same last name, and people with the same last name and same fare. The idea behind this was to see if riding in a group had any effect on your chance of survival.

The algorithm looking at people riding on the same ticket creates a new feature called ticket_rider_survival and sets it to the mean of the target. Then it groups the passengers by the ticket ID. If the group has more than one member, it looks at the known data. If there’s a known survivor other than the row in question on that ticket, it sets ticket_rider_survival to 1. If there’s no known survivor, but a known death, it sets ticket_rider_survival to 0. Otherwise, it leaves it untouched at the mean.

The chi-squared value I calculated for this feature was 27.16, compared to a critical value of 23.68. Because the statistic exceeds the critical value, the null hypothesis that riding on the same ticket has no relationship with survival can be rejected at the 5% significance level.

Here’s the code:

```
# Start every passenger at the overall survival rate of the known rows.
combined_data['ticket_rider_survival'] = combined_data['Survived'].mean()
for ticket_group, ticket_group_df in combined_data[['Survived', 'Ticket', 'PassengerId']].groupby('Ticket'):
    if len(ticket_group_df) != 1:
        for index, row in ticket_group_df.iterrows():
            # Look only at the other passengers on this ticket (NaNs are skipped by max/min).
            others = ticket_group_df.drop(index)['Survived']
            smax = others.max()
            smin = others.min()
            if smax == 1.0:
                combined_data.loc[combined_data['PassengerId'] == row['PassengerId'], 'ticket_rider_survival'] = 1
            elif smin == 0.0:
                combined_data.loc[combined_data['PassengerId'] == row['PassengerId'], 'ticket_rider_survival'] = 0
```

Finally, I scaled features using both a standard scaler and a min-max scaler, which improved results. However, I still need to better understand when each of them is the right choice.
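To make the difference between the two scalers concrete, here is a small sketch using scikit-learn's StandardScaler and MinMaxScaler on a made-up column with one large value. The numbers are illustrative and not taken from the Titanic data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with one outlier, similar in spirit to Fare.
fares = np.array([[0.0], [10.0], [20.0], [30.0], [140.0]])

# StandardScaler: subtract the column mean, then divide by the standard deviation.
standardized = StandardScaler().fit_transform(fares)

# MinMaxScaler: rescale linearly so the minimum maps to 0 and the maximum to 1.
min_maxed = MinMaxScaler().fit_transform(fares)

print(standardized.ravel().round(2))
print(min_maxed.ravel().round(2))
```

With min-max scaling the outlier pins the top of the range and squeezes the other values toward 0, while standardization leaves the output unbounded. Distance-based models such as k-nearest neighbors tend to be the most sensitive to which one is used.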

## Classifier selection

Truth be told, I just started out using some basic classifiers I was familiar with: sklearn.linear_model.LogisticRegression, sklearn.tree.DecisionTreeClassifier, sklearn.ensemble.RandomForestClassifier, and sklearn.neighbors.KNeighborsClassifier.

Although I expanded the list throughout the various iterations, understanding how to choose a classifier is definitely an area I need to learn more about.
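A basic comparison loop might look something like the sketch below. It runs on synthetic data from make_classification as a stand-in for the engineered Titanic features, and the classifier settings are assumptions rather than the ones from my notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered features and the Survived target.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "k_neighbors": KNeighborsClassifier(),
}

# Mean 5-fold cross-validated accuracy for each classifier.
scores = {
    name: cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    for name, clf in classifiers.items()
}

# Print the classifiers from best to worst mean accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Cross-validated scores are a fairer basis for comparison than a single train/test split, especially on a data set this small.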

Nevertheless, I did get some decent exposure to classifier optimization using GridSearchCV. GridSearchCV helps optimize classifiers by scoring the results of a classifier many times with different settings each time, and then returning the best score and the optimal parameters.
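Here is a minimal sketch of that workflow with a k-nearest-neighbors classifier. The data is synthetic and the parameter grid is an illustrative assumption, not the grid from my notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the engineered features and the Survived target.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Every combination of these settings is scored with 5-fold cross-validation.
param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# Best settings found and their mean cross-validated accuracy.
print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a classifier refit on all of the data with the winning settings.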

It can take a long time to run depending on the type of model and the number of settings to compare, so running it is not always a decision to take lightly. The code I wrote to use GridSearchCV on the KNeighborsClassifier took about three minutes to run in the notebook, whereas the code I wrote to test a huge number of settings on the RandomForestClassifier took about three hours. And keep in mind: this was on roughly a thousand rows of data, which in the grand scheme of things is a trivial amount.
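The run time grows quickly because a grid search fits one model per parameter combination per cross-validation fold. The arithmetic can be sketched as follows; the grid values here are hypothetical, not the ones I actually searched.

```python
from itertools import product

# Hypothetical random-forest grid; the values are illustrative only.
param_grid = {
    "n_estimators": [100, 300, 500, 1000],
    "max_depth": [4, 6, 8, 10, None],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2"],
}
cv_folds = 5

# One candidate per element of the Cartesian product of all value lists.
combinations = list(product(*param_grid.values()))
total_fits = len(combinations) * cv_folds
print(len(combinations), total_fits)  # 120 600
```

Adding one more parameter with four values would multiply the total by four, which is how a search goes from minutes to hours.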

## Feature selection / elimination

I used chi-squared values to measure feature importance. A chi-squared statistic measures how far the observed counts are from the counts you would expect if the feature and the target were unrelated. It is used to test the null hypothesis that the feature (independent variable) has no relationship with the dependent variable; a large enough statistic is evidence against that hypothesis.

Although I was able to make that work, I am not sure how to best apply that method yet. It felt a little clunkier to implement than I expected, which indicates to me that there’s much more to learn.
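For intuition, the textbook chi-squared statistic for a contingency table can be computed by hand. The counts below are invented for illustration, and this is not necessarily the same calculation a library's chi2 scorer performs.

```python
# Hypothetical counts: rows are a binary feature (0 / 1), columns are died / survived.
observed = [[80, 20], [40, 60]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_squared = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        # Count expected in this cell if the feature and the outcome were independent.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_squared += (count - expected) ** 2 / expected

critical_value = 3.841  # 5% critical value for 1 degree of freedom
print(round(chi_squared, 2), chi_squared > critical_value)  # 33.33 True
```

Here the statistic is far above the critical value, so independence between this made-up feature and survival would be rejected at the 5% level.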

I also played around with recursive feature elimination (RFE), a method of determining which features are important by repeatedly fitting a model and discarding the weakest features. That is to say, if there are ten total features, RFE trains the classifier on all ten, ranks them by the importance the model assigns to each (its coefficients or feature importances), drops the lowest-ranked feature, and repeats on the remaining nine, and so forth, until it reaches the requested number of features. The cross-validated variant (RFECV) scores each step in order to find the optimal number of features.
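A minimal sketch of scikit-learn's RFE on synthetic data is shown below. The estimator choice and the target of four features are assumptions for illustration. Note that scikit-learn's RFE ranks features by the fitted model's coefficients (or feature importances) and drops the weakest at each step.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 4 carry signal.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

# Drop one feature per iteration until four remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4, step=1)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the features that were kept
print(selector.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```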

I set about doing this by using Yellowbrick's RFECV visualizer (RFE with Cross-Validation), which produced some visually pleasing graphs, but didn’t output quite what I expected to go along with that. I expected a list of the important features for each classifier I ran it on. But it turns out Yellowbrick is a diagnostic tool set, and I had some struggles getting results from scikit-learn’s own RFE tool. So I decided to put a pin in it and come back when I understand a bit more about how to go about feature selection more intelligently.
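For when I come back to this: scikit-learn's own RFECV exposes a boolean mask of the selected features, which can be turned into the list of names I was looking for. This is only a sketch on synthetic data, and the `feature_names` list is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 4 carry signal.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# Cross-validation decides how many features to keep.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5, scoring="accuracy")
selector.fit(X, y)

# Turn the boolean mask into a readable list of the selected feature names.
selected = [name for name, keep in zip(feature_names, selector.support_) if keep]
print(selector.n_features_, selected)
```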

## What’s next?

I still have the same top three from my previous blog post:

1. Data Engineering

2. Business Intelligence

3. Data Science

I’ll keep you posted!
