Titanic exploratory data analysis

View the complete project on GitHub

This notebook has two goals: analyzing the Titanic dataset to uncover its underlying patterns, and predicting whether a passenger would have survived given their information.

Steps

  1. Data wrangling
  2. Data analysis
  3. Machine learning

Data wrangling

Data summary

A summary of the dataset shows that it is mostly clean; the two exceptions are Age and Cabin, which have many missing values.

Data cleaning

The Cabin feature has so many missing values that it would add little to the analysis, so it is removed. The Ticket feature is removed too, because it is mostly unique to each passenger and may depend on other values.

The PassengerId feature should also be removed, as it is essentially the index.

The Age feature, however, is kept: its missing values are imputed later with the mean, using SimpleImputer.
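The cleaning steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the small DataFrame is made-up stand-in data, and only the column names follow the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny stand-in for the Titanic training table (values are illustrative only).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803"],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Drop the columns the text marks for removal.
cleaned = df.drop(columns=["PassengerId", "Ticket", "Cabin"])

# Replace missing ages with the mean of the observed ages.
imputer = SimpleImputer(strategy="mean")
cleaned["Age"] = imputer.fit_transform(cleaned[["Age"]]).ravel()

print(cleaned)
```

When a separate test split is used, the imputer would normally be fitted on the training split only and then applied to the test split with `transform`, so that the test ages do not influence the imputed mean.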

Data analysis

Data visualization

More on the conclusions can be found in the notebook.

Group analysis

The group analysis produced almost nothing useful. Potentially interesting findings:

  1. 1st class had a higher survival rate (first label)
  2. 3rd class travelled in larger families (third label)
  3. 2nd and 3rd class fares were very close
  4. Larger families had more children (low-age passengers) in them

In general, for all classes:

  • In every age group, women were more likely to be saved
  • Among children, this difference is minimal
  • On average, children were more likely to be saved than adults
  • Young men (15-25 years old) and old men survived least of all (although the sample there is too small to be sure)
  • Women aged over 50 survived most of all

Analysis by class:

  • In classes 1 and 2, the vast majority of women were saved
  • In classes 1 and 2 there were very few children and quite a lot of old people
  • There were very few people over 50 in class 3
  • In the 2nd and 3rd classes, very few men were saved
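Comparisons like the ones above come down to grouped survival rates. The sketch below shows one way to compute them with pandas. The data is made up, and the age bands are an assumption, since the notebook's exact bins are not stated here.

```python
import pandas as pd

# Made-up rows using the dataset's column names (not real passengers).
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0, 1, 0],
    "Sex": ["female", "male", "female", "male", "female", "male", "male", "female"],
    "Age": [4, 20, 30, 60, 55, 8, 2, 40],
    "Pclass": [1, 3, 2, 1, 1, 3, 2, 3],
})

# Assumed age bands: [0, 15), [15, 25), [25, 50), [50, 120).
bins = [0, 15, 25, 50, 120]
labels = ["child", "15-25", "25-50", "50+"]
df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels, right=False)

# The mean of the 0/1 Survived column is the survival rate within each group.
by_age_sex = df.groupby(["AgeGroup", "Sex"], observed=True)["Survived"].mean()
by_class_sex = df.groupby(["Pclass", "Sex"])["Survived"].mean()

print(by_age_sex)
print(by_class_sex)
```

Printing the group sizes alongside the rates (for example with `.agg(["mean", "count"])`) helps spot groups that are too small to support a conclusion, such as the old men mentioned above.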
EDA results

Let's sum up our discoveries concerning the importance of the given features:

  • PassengerId - useless
  • Pclass - super useful. Should be categorical, not numerical or even ordinal, in my opinion
  • Name - full name, including some titles. There is lots of information there, but it doesn't seem to have any predictive power
  • Sex - super useful (pun intended)
  • Age - useful, especially for children (higher survival rate). Problem: lots of missing values! Should probably be converted into categories such as child, adult, unknown
  • SibSp + Parch - somewhat useful, but not explicitly. Maybe should be aggregated into one feature, 'family_size'
  • Ticket - seems to be useless
  • Fare - maybe useful, maybe not. Should be tested
  • Cabin - lots of missing values. Consists of a deck (letter) and a room number. The room number is useless; the deck letter can probably be useful
  • Embarked - maybe useful, maybe not. Should be tested. There are a few missing values, which can be safely filled with the most popular value
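Several of the suggestions in this list translate directly into pandas operations. The following is a sketch on made-up rows, under stated assumptions: the age cutoff of 15 for "child" and the `+ 1` that counts the passenger in `family_size` are illustrative choices, and `age_group` and `deck` are hypothetical column names.

```python
import numpy as np
import pandas as pd

# Made-up rows using the dataset's column names (not real passengers).
df = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Age": [4.0, np.nan, 35.0],
    "SibSp": [1, 0, 2],
    "Parch": [2, 0, 1],
    "Cabin": ["C85", np.nan, "E46"],
    "Embarked": ["S", np.nan, "S"],
})

# Treat passenger class as a category rather than a number.
df["Pclass"] = df["Pclass"].astype("category")

# Aggregate SibSp and Parch into one feature; the +1 counts the passenger (assumption).
df["family_size"] = df["SibSp"] + df["Parch"] + 1

# Convert Age into child / adult / unknown; the cutoff of 15 is an assumption.
df["age_group"] = np.select(
    [df["Age"].isna(), df["Age"] < 15],
    ["unknown", "child"],
    default="adult",
)

# Keep only the deck letter from Cabin; missing cabins become "unknown".
df["deck"] = df["Cabin"].str[0].fillna("unknown")

# Fill missing Embarked values with the most frequent port.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df)
```

Whether these derived features actually help is an empirical question, in the same spirit as the "should be tested" notes above.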
Machine learning

Three algorithms (logistic regression, support vector machine, and random forest) are applied to find the best model for predicting survival. In addition, a neural network is used as a reference, to test whether deep learning is always the best choice.

In general, random forest outperforms the other two algorithms, with an accuracy of 97.98% on the training set and 100% on the test set (more information about the accuracy of the other algorithms can be found here).

The neural network, however, does not work well in this situation: it quickly overfits, even though the hyperparameters have been fine-tuned and the learning rate is quite small.
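A comparison of the three classical models can be sketched with scikit-learn as below. This is not the notebook's code: the features are synthetic stand-ins generated with `make_classification`, the hyperparameters are arbitrary defaults, and the scores it prints say nothing about the Titanic results quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the prepared Titanic features (not the real data).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Logistic regression and SVM are scale-sensitive, so they get a scaler.
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "support vector machine": make_pipeline(StandardScaler(), SVC()),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Fit each model and record accuracy on both splits.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train": model.score(X_train, y_train),
        "test": model.score(X_test, y_test),
    }
    print(f"{name}: train={scores[name]['train']:.3f} test={scores[name]['test']:.3f}")
```

Reporting train and test accuracy side by side makes overfitting visible: a model whose training accuracy is far above its test accuracy has memorized the training set rather than learned a general rule.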