Titanic - Machine Learning from Disaster

Visit My Titanic Notebook!

First Kaggle Competition

Kaggle is an exciting platform for getting introduced to the world of data science. The platform allows its users to learn machine learning and data science in a very hands-on way through Kaggle’s notebooks, competitions, learning modules, datasets, discussions, and social community. My favorite part is being able to use both markdown and code in the same place in order to better organize and learn various data-related technologies.

My first ‘competition’ was called Titanic - Machine Learning from Disaster. The reason I put ‘competition’ in quotes is that this competition is very beginner-friendly and is typically used as the first competition for brand-new Kagglers. So, to me it did not quite feel like a real competition, but it was extremely helpful in introducing me to the world of machine learning.

Prior to beginning this competition, I was under the impression that machine learning was going to consist of all kinds of fancy code and boujee algorithms. While machine learning certainly has its share of advanced algorithms, it seems to me that you won’t get very far in the discipline without the ability to quickly and accurately understand and manipulate your data. Being able to manipulate data frames and display important information about your dataset in various ways is crucial for building successful machine learning models and predictions. “Brilliance in the basics” of data is a vital stepping stone in advancing one’s knowledge of machine learning.

In this competition, you were given a dataset of 891 passengers as the training data and had to predict the survival of 418 passengers in the test dataset. The training dataset had 12 different columns that could be used to try to predict who survived the disaster.

The columns included:

  • PassengerId
  • Survived (1=Yes, 0=No)
  • Pclass (1=first, 2=second, 3=third)
  • Name
  • Sex (“male”, “female”)
  • Age (Age in years)
  • SibSp (# of siblings / spouses aboard the Titanic)
  • Parch (# of parents / children aboard the Titanic)
  • Ticket (Ticket number)
  • Fare (Passenger fare)
  • Cabin (Cabin number)
  • Embarked (Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton)

Submission Results

My first submission used the features Sex, Pclass, and Parch with the RandomForestClassifier algorithm to predict which passengers on the Titanic would survive. The RandomForestClassifier did all the hard work of building multiple component trees and averaging their outputs to decide whether each person survived. My score for this submission was 0.76315, meaning 76.3% of my test data predictions were correct.
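For reference, here is a minimal sketch of what that first pipeline looks like, assuming the standard Kaggle train.csv and test.csv files (the n_estimators and max_depth values are illustrative, not necessarily my exact settings):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # Standard Kaggle Titanic files; columns match the list above
    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    features = ["Sex", "Pclass", "Parch"]
    X = pd.get_dummies(train[features])       # one-hot encodes Sex; Pclass/Parch stay numeric
    X_test = pd.get_dummies(test[features])
    y = train["Survived"]

    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
    model.fit(X, y)
    predictions = model.predict(X_test)

    # Kaggle scores a CSV with PassengerId and the predicted Survived column
    output = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": predictions})
    output.to_csv("submission.csv", index=False)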

For my second submission, I played around with the features and the Mean Absolute Error to determine which features might best go together to raise my score. I found that Sex and Parch resulted in a pretty low MAE of 0.206278 (see the sketch below). My score for this submission was 0.76794, meaning I was 76.79% accurate on the test dataset. Improvement.
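The feature comparison can be done with a hold-out validation split. A rough sketch of that experiment (the exact subsets I looped over varied):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error

    train = pd.read_csv("train.csv")
    y = train["Survived"]

    # Compare validation MAE across candidate feature subsets
    for features in (["Sex", "Pclass", "Parch"], ["Sex", "Parch"]):
        X = pd.get_dummies(train[features])
        train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
        model = RandomForestClassifier(n_estimators=100, random_state=1)
        model.fit(train_X, train_y)
        mae = mean_absolute_error(val_y, model.predict(val_X))
        print(features, round(mae, 6))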

Some Things I Learned for Reference

training data: the data used to fit the model

validation data: the data held out from training and used to evaluate the model

features: the columns that are fed into our model and later used to make predictions (conventionally X)

target: the column we want to predict (conventionally y)

overfitting: when a model matches the training data almost perfectly but does poorly on validation and other new data

underfitting: when a model fails to capture important distinctions and patterns in the data, so it performs poorly on even the training data
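One way to see both failure modes is to sweep a model-complexity knob and compare training error against validation error. A minimal sketch using the same train.csv as above (the max_leaf_nodes values are arbitrary picks):

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error

    train = pd.read_csv("train.csv")
    X = pd.get_dummies(train[["Sex", "Pclass", "Parch"]])
    y = train["Survived"]
    train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

    # Small trees underfit (both errors high); huge trees overfit
    # (training error low, validation error noticeably higher)
    for max_leaf_nodes in (2, 50, 5000):
        model = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=1)
        model.fit(train_X, train_y)
        train_mae = mean_absolute_error(train_y, model.predict(train_X))
        val_mae = mean_absolute_error(val_y, model.predict(val_X))
        print(max_leaf_nodes, train_mae, val_mae)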

Mean Absolute Error (MAE): the average amount our predictions are off by

error = actual - predicted
take the absolute value of each error
average the absolute errors
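A tiny worked example with made-up values, just to show the arithmetic:

    from sklearn.metrics import mean_absolute_error

    actual    = [1, 0, 1, 1]
    predicted = [1, 1, 1, 0]
    # errors: 0, -1, 0, 1 -> absolute values: 0, 1, 0, 1 -> average = 0.5
    print(mean_absolute_error(actual, predicted))  # 0.5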

Random Forest: technique that makes a prediction based on the average of many component trees

scikit-learn: most popular library for modeling the types of data stored in DataFrames

import sklearn                                        # top-level package is named sklearn
from sklearn.ensemble import RandomForestClassifier   # most tools live in submodules

Steps to Building a Model

  1. Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
  2. Fit: Capture patterns from the provided data. This is the heart of modeling.
  3. Predict: Use the fitted model to make predictions on new data.
  4. Evaluate: Determine how accurate the model’s predictions are. (All four steps are sketched below.)
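A minimal, self-contained sketch of the four steps, using made-up toy data purely for illustration:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Toy features and labels standing in for a real dataset
    X = [[0, 3], [1, 1], [0, 2], [1, 3], [0, 1], [1, 2], [0, 3], [1, 1]]
    y = [0, 1, 0, 1, 1, 1, 0, 1]
    train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

    model = DecisionTreeClassifier(max_depth=2, random_state=1)  # 1. Define
    model.fit(train_X, train_y)                                  # 2. Fit
    preds = model.predict(val_X)                                 # 3. Predict
    print(accuracy_score(val_y, preds))                          # 4. Evaluate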

Supervised Learning: A type of machine learning where an algorithm learns from labeled training data to make predictions or decisions. The goal is for the algorithm to learn the relationship between input features and output labels so that it can predict the labels for new, unseen data.

input --(predict)--> output
  • Regression: predicting continuous data
  • Classification: predicting discrete data
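In scikit-learn, this split shows up right in the estimator names; for example:

    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

    # Same algorithm family, two kinds of target
    RandomForestRegressor()   # regression: continuous targets, e.g. Fare
    RandomForestClassifier()  # classification: discrete targets, e.g. Survived (0/1)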

Unsupervised Learning: The dataset only contains input features without any corresponding labels or target values. It aims to uncover hidden patterns or groupings in the data without any prior knowledge of the output.

  • Clustering: automatic grouping of similar objects into sets (see the sketch after this list)
  • Dimensionality Reduction: used to reduce the dimensionality of high-dimensional data while preserving its important structure and relationships
  • Anomaly Detection: aims to identify rare or unusual data points or patterns in a dataset
  • Association Rule Mining: discovering interesting relationships / associations among variables in a dataset
  • Density Estimation: aims to estimate the probability density function of a dataset
  • Generative Models: used to learn the underlying probability distribution of the data
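As one concrete example, clustering takes only a few lines. A minimal sketch with made-up points:

    import numpy as np
    from sklearn.cluster import KMeans

    # Six made-up 2-D points with no labels; KMeans finds two groups on its own
    points = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(points)
    print(kmeans.labels_)  # e.g. [1 1 1 0 0 0]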