This is my first Kaggle challenge. I needed help to complete it because I have only studied deep learning, not data analysis, and I did not know where to start or what to look for. So I closely followed a previously published solution to the Titanic challenge and tried to build an intuition for everything it did. This Jupyter notebook is that effort, plus the machine learning part at the end after the data analysis.

from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
This notebook is my work for a Kaggle challenge. The cells above hold my imports and the code that mounts my Google Drive for use in Google Colaboratory. Here is the Kaggle challenge link: https://www.kaggle.com/competitions/titanic/overview
Step 1: Explore the dataset and do data analysis.
Used https://www.kaggle.com/code/startupsci/titanic-data-science-solutions for data analysis inspiration
# Read in data and see a preview of it using .head()
data = pd.read_csv('./drive/MyDrive/Titanic_Challenge/train.csv')
data.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Convert categorical datatypes 'Sex' and 'Embarked' into numeric
# to do correlation calculation
numeric_df = data.select_dtypes(include=[float, int]).copy()  # .copy() avoids a SettingWithCopyWarning when adding columns
numeric_df['Sex'] = data['Sex'].map({'male': 0, 'female':1})
numeric_df['Embarked'] = data['Embarked'].map({'S':0, 'C':1, 'Q':2})
# numeric_df.head()
numeric_df.corr()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Sex | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PassengerId | 1.000000 | -0.005007 | -0.035144 | 0.036847 | -0.057527 | -0.001652 | 0.012658 | -0.042939 | -0.030555 |
| Survived | -0.005007 | 1.000000 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 | 0.543351 | 0.108669 |
| Pclass | -0.035144 | -0.338481 | 1.000000 | -0.369226 | 0.083081 | 0.018443 | -0.549500 | -0.131900 | 0.043835 |
| Age | 0.036847 | -0.077221 | -0.369226 | 1.000000 | -0.308247 | -0.189119 | 0.096067 | -0.093254 | 0.012186 |
| SibSp | -0.057527 | -0.035322 | 0.083081 | -0.308247 | 1.000000 | 0.414838 | 0.159651 | 0.114631 | -0.060606 |
| Parch | -0.001652 | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.000000 | 0.216225 | 0.245489 | -0.079320 |
| Fare | 0.012658 | 0.257307 | -0.549500 | 0.096067 | 0.159651 | 0.216225 | 1.000000 | 0.182333 | 0.063462 |
| Sex | -0.042939 | 0.543351 | -0.131900 | -0.093254 | 0.114631 | 0.245489 | 0.182333 | 1.000000 | 0.118593 |
| Embarked | -0.030555 | 0.108669 | 0.043835 | 0.012186 | -0.060606 | -0.079320 | 0.063462 | 0.118593 | 1.000000 |
Gain an intuition for how the correlation calculation works by using a tiny example. What I found: the strongest correlations come from graphs with a monotonic relationship, like the sigmoid function, rather than curves that snake up and down, like the piecewise function graphed below.
# Prove to myself that outputs going up and down will give a bad correlation
# Pearson correlation mainly rewards monotonic (roughly linear) relationships
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
def piecewise(x):
    if x < -5:
        return 0
    elif -5 <= x < 0:
        return 1
    elif 0 <= x < 5:
        return 0
    else:
        return 1
# Generate x values
x = np.linspace(-10, 10, 400) # 400 points from -10 to 10
# Apply the piecewise function to each x value
y = np.array([piecewise(xi) for xi in x])
# Plotting the piecewise function
plt.figure(figsize=(10, 6))
plt.plot(x, y, label='Piecewise Function')
plt.title('Piecewise Function')
plt.xlabel('X values')
plt.ylabel('Piecewise Output')
plt.grid(True)
plt.legend()
plt.show()
# Create a DataFrame
data_piecewise = pd.DataFrame({'X': x, 'Y': y})
# Calculate correlation
correlation = data_piecewise['X'].corr(data_piecewise['Y'])
print("Correlation coefficient:", correlation)
# Monotonic function - Sigmoid function
def sigmoid(x):
return 1 / (1 + np.exp(-x))
# Apply the sigmoid function to each x value
y_sigmoid = np.array([sigmoid(xi) for xi in x])
# Plotting the Sigmoid function on the same plot
plt.plot(x, y_sigmoid, label='Sigmoid Function', linestyle='--')
plt.legend()
plt.show()
# Create a DataFrame for the sigmoid function
data_sigmoid = pd.DataFrame({'X': x, 'Y_Sigmoid': y_sigmoid})
# Calculate correlation for sigmoid
correlation_sigmoid = data_sigmoid['X'].corr(data_sigmoid['Y_Sigmoid'])
print("Sigmoid Correlation coefficient:", correlation_sigmoid)

Correlation coefficient: 0.43301405506325596

Sigmoid Correlation coefficient: 0.9362677445388511
The plot below shows why pandas .corr() may not be the most reliable metric for deciding which features (Age, Embarked, etc.) matter most for whether a passenger survived. The data goes up and down: as Age increases there is no clear rise or fall in survival, so it behaves more like the piecewise function than like the sigmoid function above.
# Completely useless plot
# But it shows why the correlation calculation would not work.
# The data is up and down like the piece wise function above
plt.figure(figsize=(50,6))
plt.scatter(data['Age'], data['Survived'], alpha=1, color='red')
plt.title("Scatter Plot of Survived vs. Age")
plt.xlabel('Age')
plt.ylabel('Survived')
plt.xticks(range(1, 101, 1))
plt.grid(True)
plt.show()

# Determine Age bands and check correlation with Survived
# print(pd.cut(data['Age'],5))
# print(data['Age'].head())
data['AgeBand'] = pd.cut(data['Age'], 5)
data[['AgeBand','Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
| AgeBand | Survived |
| --- | --- |
| (0.34, 16.336] | 0.550000 |
| (16.336, 32.252] | 0.369942 |
| (32.252, 48.168] | 0.404255 |
| (48.168, 64.084] | 0.434783 |
| (64.084, 80.0] | 0.090909 |
These are the graphs I mainly relied on for my analysis, because they show the relationship between survival and the other features well: "Oh look, the lower-class passengers tended not to be survivors."
# These graphs are gold!
# Load and analyze the data
import matplotlib.pyplot as plt
import seaborn as sns
# Set the style for the plots
sns.set(style="whitegrid")
# Columns to plot, excluding 'PassengerId', 'Name', 'Ticket', 'Cabin' due to their specificity or lack of relevance
columns_to_plot = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
# Define a function to plot survival count based on a given feature
def plot_survival(data, column):
    if column == 'Age' or column == 'Fare':
        g = sns.FacetGrid(data, col='Survived', height=5, aspect=1)
        g.map(sns.histplot, column, bins=20, kde=False)
    else:
        # Create a count plot for categorical data
        plt.figure(figsize=(10, 6))
        sns.countplot(x=column, hue='Survived', data=data, palette='viridis')
        plt.title(f'Survival Count by {column}')
        plt.legend(title='Survived', loc='upper right', labels=['No', 'Yes'])
    plt.show()
# Create a loop to generate plots for each relevant column
for column in columns_to_plot:
plot_survival(data, column)







# Read in training and test dataset
# Previous analysis above was also on training dataset it just had a different variable name
# Now calling training dataset train_df
train_df = pd.read_csv('./drive/MyDrive/Titanic_Challenge/train.csv')
test_df = pd.read_csv('./drive/MyDrive/Titanic_Challenge/test.csv')
# Get info on the train vs. test datasets
train_df.info()
print('_'*40)
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null object
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 417 non-null float64
9 Cabin 91 non-null object
10 Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
# Get statistics on the numerical columns in the dataset, like Age and Pclass (1, 2, 3)
train_df.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
# Describe categorical data
# the include=['O'] means to include types that are objects
train_df.describe(include='O')
| | Name | Sex | Ticket | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- |
| count | 891 | 891 | 891 | 204 | 889 |
| unique | 891 | 2 | 681 | 147 | 3 |
| top | Braund, Mr. Owen Harris | male | 347082 | B96 B98 | S |
| freq | 1 | 577 | 7 | 4 | 644 |
Mean statistics of each class. Because Survived is 0 or 1, the mean over a group is just its survival rate:
(count with Survived = 1) / (count with Survived = 0 + count with Survived = 1).
A quick sanity check of this calculation follows the table below.
# Pclass vs. Survived mean stat
# For each Pclass, take the mean of Survived - that is the survival rate of that class
# Look at Class 3!!!
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
| Pclass | Survived |
| --- | --- |
| 1 | 0.629630 |
| 2 | 0.472826 |
| 3 | 0.242363 |
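To convince myself that the groupby mean really is survivors divided by total, a quick check (a sketch of mine; it just recomputes the Pclass = 1 row of the table above by hand):
# Sanity check: survival rate for first class computed directly
first_class = train_df[train_df['Pclass'] == 1] if 'train_df' in dir() else data[data['Pclass'] == 1]
print(first_class['Survived'].sum() / len(first_class))  # ~0.6296, matching the table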
train_df[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
# Females were more likely to survive than males
| Sex | Survived |
| --- | --- |
| female | 0.742038 |
| male | 0.188908 |
train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
# The station people left from might be correlated with their PClass
| Embarked | Survived |
| --- | --- |
| C | 0.553571 |
| Q | 0.389610 |
| S | 0.336957 |
# Number of Parents+Children => Parch
train_df[['Parch', 'Survived']].groupby(['Parch'], as_index=False).mean().sort_values(by='Parch', ascending=True)
| Parch | Survived |
| --- | --- |
| 0 | 0.343658 |
| 1 | 0.550847 |
| 2 | 0.500000 |
| 3 | 0.600000 |
| 4 | 0.000000 |
| 5 | 0.200000 |
| 6 | 0.000000 |
# SibSp => number of siblings + spouses aboard
train_df[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean().sort_values(by='SibSp', ascending=True)
| SibSp | Survived |
| --- | --- |
| 0 | 0.345395 |
| 1 | 0.535885 |
| 2 | 0.464286 |
| 3 | 0.250000 |
| 4 | 0.166667 |
| 5 | 0.000000 |
| 8 | 0.000000 |
train_df[['SibSp', 'Pclass']].groupby(['SibSp', 'Pclass'], as_index=False).value_counts()
| SibSp | Pclass | count |
| --- | --- | --- |
| 0 | 1 | 137 |
| 0 | 2 | 120 |
| 0 | 3 | 351 |
| 1 | 1 | 71 |
| 1 | 2 | 55 |
| 1 | 3 | 83 |
| 2 | 1 | 5 |
| 2 | 2 | 8 |
| 2 | 3 | 15 |
| 3 | 1 | 3 |
| 3 | 2 | 1 |
| 3 | 3 | 12 |
| 4 | 3 | 18 |
| 5 | 3 | 5 |
| 8 | 3 | 7 |
train_df[['Parch', 'Pclass']].groupby(['Parch', 'Pclass'], as_index=False).value_counts()
| Parch | Pclass | count |
| --- | --- | --- |
| 0 | 1 | 163 |
| 0 | 2 | 134 |
| 0 | 3 | 381 |
| 1 | 1 | 31 |
| 1 | 2 | 32 |
| 1 | 3 | 55 |
| 2 | 1 | 21 |
| 2 | 2 | 16 |
| 2 | 3 | 43 |
| 3 | 2 | 2 |
| 3 | 3 | 3 |
| 4 | 1 | 1 |
| 4 | 3 | 3 |
| 5 | 3 | 5 |
| 6 | 3 | 1 |
Clearer-view graphs. These give a clearer view of who survived vs. who did not, broken down by the specific features we are interested in.
# Get a clear view of Age range and who did not survive (0) vs. who survived (1)
g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=20)

# Age range and P Class for who did not survive(0) vs. who survived(1)
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=0.9, bins=20)
grid.add_legend();

# Station embarked, male/female, and Pclass for who did not survive vs. who survived
# Female survival rates are higher than male survival rates at every port
# The survival rate generally drops off for the lower Pclasses
# (except at station C, where female Pclass 3 passengers do relatively well)
grid = sns.FacetGrid(train_df, row='Embarked', height=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()
/usr/local/lib/python3.10/dist-packages/seaborn/axisgrid.py:718: UserWarning: Using the pointplot function without specifying `order` is likely to produce an incorrect plot.
warnings.warn(warning)
/usr/local/lib/python3.10/dist-packages/seaborn/axisgrid.py:723: UserWarning: Using the pointplot function without specifying `hue_order` is likely to produce an incorrect plot.
warnings.warn(warning)

# Different embarked stations and male vs. female at those stations
# Survived vs did not survive
grid = sns.FacetGrid(train_df, row='Embarked', col='Survived', height=2.2, aspect=1.6)
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, errorbar=None)
grid.add_legend()
/usr/local/lib/python3.10/dist-packages/seaborn/axisgrid.py:718: UserWarning: Using the barplot function without specifying `order` is likely to produce an incorrect plot.
warnings.warn(warning)

Next, prepare the data for the models:
- Drop features that should have little bearing on survival, like ticket number and cabin.
- Transform categorical features such as the embarkation port into numeric features, because the ML algorithms cannot work with categorical values directly.
- Create new features out of the existing ones that may be helpful, like extracting titles from names.
- Fill in null values, since some ML classifiers cannot handle nulls.
# Correcting by dropping features
# Drop ticket and cabin number features
combine = [train_df, test_df]
print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]
"After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape
Before (891, 12) (418, 11) (891, 12) (418, 11)
('After', (891, 10), (418, 9), (891, 10), (418, 9))
The solution I am following creates a new Title feature out of the Name feature, because the title may be correlated with who survived and who did not.
# Create title feature
for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
pd.crosstab(train_df['Title'], train_df['Sex'])
| Title | female | male |
| --- | --- | --- |
| Capt | 0 | 1 |
| Col | 0 | 2 |
| Countess | 1 | 0 |
| Don | 0 | 1 |
| Dr | 1 | 6 |
| Jonkheer | 0 | 1 |
| Lady | 1 | 0 |
| Major | 0 | 2 |
| Master | 0 | 40 |
| Miss | 182 | 0 |
| Mlle | 2 | 0 |
| Mme | 1 | 0 |
| Mr | 0 | 517 |
| Mrs | 125 | 0 |
| Ms | 1 | 0 |
| Rev | 0 | 6 |
| Sir | 0 | 1 |
# Mean statistic on survival with new title feature
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
        'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean().sort_values('Survived')
train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean().sort_values('Survived')
| Title | Survived |
| --- | --- |
| Mr | 0.156673 |
| Rare | 0.347826 |
| Master | 0.575000 |
| Miss | 0.702703 |
| Mrs | 0.793651 |
# Map the titles to ordinal values; any title not in the mapping becomes 0
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
# See transformed dataset
train_df.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | Title |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.2500 | S | 1 |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C | 3 |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.9250 | S | 2 |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1000 | S | 3 |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.0500 | S | 1 |
# Drop name and PassengerId features for test and training datasets
train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Name'], axis=1)
combine = [train_df, test_df]
train_df.shape, test_df.shape
# Convert categorical sex feature (male, female) to numeric feature
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
train_df.head()
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | 0 | 22.0 | 1 | 0 | 7.2500 | S | 1 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | C | 3 |
| 1 | 3 | 1 | 26.0 | 0 | 0 | 7.9250 | S | 2 |
| 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | S | 3 |
| 0 | 3 | 0 | 35.0 | 0 | 0 | 8.0500 | S | 1 |
# Completing values for missing or null values
# For which columns?
# Age
print(len(train_df['Age']))
train_df['Age'].unique()
print(train_df['Age'].isna().sum())
# More accurate way of guessing missing values is to use other correlated features.
# In our case we note correlation among Age, Gender, and Pclass.
# Guess Age values using median values for Age across sets of Pclass and Gender feature combinations.
# So, median Age for Pclass=1 and Gender=0, Pclass=1 and Gender=1, and so on..
grid = sns.FacetGrid(train_df, row='Pclass', col='Sex', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=1, bins=20)
grid.add_legend()
891
177

# Guessed values of Age for different combinations of Pclass and Sex
# Fill in the guessed age values where age is null
guess_ages = np.zeros((2,3))
# Iterate over Sex (0,1) and Pclass(1,2,3) to calculate guessed values of Age for the six combinations
for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) &
                               (dataset['Pclass'] == j+1)]['Age'].dropna()
            # age_mean = guess_df.mean()
            # age_std = guess_df.std()
            # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)
            age_guess = guess_df.median()
            # Round the guessed age to the nearest 0.5
            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
    # Fill in the guessed age values where age is null in the dataset
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),
                        'Age'] = guess_ages[i,j]
    dataset['Age'] = dataset['Age'].astype(int)
train_df.head()
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | 0 | 22 | 1 | 0 | 7.2500 | S | 1 |
| 1 | 1 | 1 | 38 | 1 | 0 | 71.2833 | C | 3 |
| 1 | 3 | 1 | 26 | 0 | 0 | 7.9250 | S | 2 |
| 1 | 1 | 1 | 35 | 1 | 0 | 53.1000 | S | 3 |
| 0 | 3 | 0 | 35 | 0 | 0 | 8.0500 | S | 1 |
# Create Age Bands and determine correlations with Survived
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
| AgeBand | Survived |
| --- | --- |
| (-0.08, 16.0] | 0.550000 |
| (16.0, 32.0] | 0.337374 |
| (32.0, 48.0] | 0.412037 |
| (48.0, 64.0] | 0.434783 |
| (64.0, 80.0] | 0.090909 |
# Replace age with ordinals for these bands
for dataset in combine:
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
train_df.head()
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | AgeBand |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | 0 | 1 | 1 | 0 | 7.2500 | S | 1 | (16.0, 32.0] |
| 1 | 1 | 1 | 2 | 1 | 0 | 71.2833 | C | 3 | (32.0, 48.0] |
| 1 | 3 | 1 | 1 | 0 | 0 | 7.9250 | S | 2 | (16.0, 32.0] |
| 1 | 1 | 1 | 2 | 1 | 0 | 53.1000 | S | 3 | (32.0, 48.0] |
| 0 | 3 | 0 | 2 | 0 | 0 | 8.0500 | S | 1 | (32.0, 48.0] |
# Remove the age band feature
train_df = train_df.drop(['AgeBand'], axis=1)
combine = [train_df, test_df]
train_df.head()
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | 0 | 1 | 1 | 0 | 7.2500 | S | 1 |
| 1 | 1 | 1 | 2 | 1 | 0 | 71.2833 | C | 3 |
| 1 | 3 | 1 | 1 | 0 | 0 | 7.9250 | S | 2 |
| 1 | 1 | 1 | 2 | 1 | 0 | 53.1000 | S | 3 |
| 0 | 3 | 0 | 2 | 0 | 0 | 8.0500 | S | 1 |
# Create a new feature called 'FamilySize' which combines Parch and SibSp.
# This will enable us to drop Parch and SibSp from the datasets
for dataset in combine:
    # We are adding 1 to count ourselves as a person
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)
| FamilySize | Survived |
| --- | --- |
| 4 | 0.724138 |
| 3 | 0.578431 |
| 2 | 0.552795 |
| 7 | 0.333333 |
| 1 | 0.303538 |
| 5 | 0.200000 |
| 6 | 0.136364 |
| 8 | 0.000000 |
| 11 | 0.000000 |
# Calculate statistics for people that survived who were alone
# Vs. ones that had family
for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
| IsAlone | Survived |
| --- | --- |
| 0 | 0.505650 |
| 1 | 0.303538 |
# Find the most frequently occurring port
freq_port = train_df.Embarked.dropna().mode()[0]
freq_port
# Fill missing port values with the most frequent occurring port
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
| Embarked | Survived |
| --- | --- |
| C | 0.553571 |
| Q | 0.389610 |
| S | 0.339009 |
# Convert categorical port to numeric port feature
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
train_df.head()
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | FamilySize | IsAlone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | 0 | 1 | 1 | 0 | 7.2500 | 0 | 1 | 2 | 0 |
| 1 | 1 | 1 | 2 | 1 | 0 | 71.2833 | 1 | 3 | 2 | 0 |
| 1 | 3 | 1 | 1 | 0 | 0 | 7.9250 | 0 | 2 | 1 | 1 |
| 1 | 1 | 1 | 2 | 1 | 0 | 53.1000 | 0 | 3 | 2 | 0 |
| 0 | 3 | 0 | 2 | 0 | 0 | 8.0500 | 0 | 1 | 1 | 1 |
# The test dataset has a single missing Fare value, so fill it in with the median fare
# The model algorithms need to operate on non-null values
test_df['Fare'].fillna(test_df['Fare'].dropna().median(), inplace=True)
test_df.head()
| PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | FamilySize | IsAlone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 892 | 3 | 0 | 2 | 0 | 0 | 7.8292 | 2 | 1 | 1 | 1 |
| 893 | 3 | 1 | 2 | 1 | 0 | 7.0000 | 0 | 3 | 2 | 0 |
| 894 | 2 | 0 | 3 | 0 | 0 | 9.6875 | 2 | 1 | 1 | 1 |
| 895 | 3 | 0 | 1 | 0 | 0 | 8.6625 | 0 | 1 | 1 | 1 |
| 896 | 3 | 1 | 1 | 1 | 1 | 12.2875 | 0 | 3 | 3 | 0 |
# Fareband and survived mean stat
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)
train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
| FareBand | Survived |
| --- | --- |
| (-0.001, 7.91] | 0.197309 |
| (7.91, 14.454] | 0.303571 |
| (14.454, 31.0] | 0.454955 |
| (31.0, 512.329] | 0.581081 |
# Convert the Fare Feature to ordinal values based on FareBand
for dataset in combine:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
train_df = train_df.drop(['FareBand'], axis=1)
combine = [train_df, test_df]
train_df.head(10)
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | FamilySize | IsAlone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 2 | 0 |
| 1 | 1 | 1 | 2 | 1 | 0 | 3 | 1 | 3 | 2 | 0 |
| 1 | 3 | 1 | 1 | 0 | 0 | 1 | 0 | 2 | 1 | 1 |
| 1 | 1 | 1 | 2 | 1 | 0 | 3 | 0 | 3 | 2 | 0 |
| 0 | 3 | 0 | 2 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
| 0 | 3 | 0 | 1 | 0 | 0 | 1 | 2 | 1 | 1 | 1 |
| 0 | 1 | 0 | 3 | 0 | 0 | 3 | 0 | 1 | 1 | 1 |
| 0 | 3 | 0 | 0 | 3 | 1 | 2 | 0 | 4 | 5 | 0 |
| 1 | 3 | 1 | 1 | 0 | 2 | 1 | 0 | 3 | 3 | 0 |
| 1 | 2 | 1 | 0 | 1 | 0 | 2 | 1 | 3 | 2 | 0 |
# One last look at test_df
test_df.head(10)
| PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | FamilySize | IsAlone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 892 | 3 | 0 | 2 | 0 | 0 | 0 | 2 | 1 | 1 | 1 |
| 893 | 3 | 1 | 2 | 1 | 0 | 0 | 0 | 3 | 2 | 0 |
| 894 | 2 | 0 | 3 | 0 | 0 | 1 | 2 | 1 | 1 | 1 |
| 895 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
| 896 | 3 | 1 | 1 | 1 | 1 | 1 | 0 | 3 | 3 | 0 |
| 897 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
| 898 | 3 | 1 | 1 | 0 | 0 | 0 | 2 | 2 | 1 | 1 |
| 899 | 2 | 0 | 1 | 1 | 1 | 2 | 0 | 1 | 3 | 0 |
| 900 | 3 | 1 | 1 | 0 | 0 | 0 | 1 | 3 | 1 | 1 |
| 901 | 3 | 0 | 1 | 2 | 0 | 2 | 0 | 1 | 3 | 0 |
Begin Machine Learning Portion
# Imports for Machine Learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
# We are ready to start training a model
# This is supervised learning: we train on labeled examples from the training set
# and the task is classification (predict Survived = 0 or 1)
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape
((891, 10), (891,), (418, 10))
# Logistic Regression
# How it works:
# We model the survival probability as sigmoid(Wx + b) and learn the weights W and the bias b.
# A threshold on the sigmoid output turns the probability into a 0 or 1 prediction.
# Each coefficient in W describes the linear relationship between that feature and the
# predicted log-odds of survival: w1*x1 + w2*x2 + ... ; the larger the weight, the more
# changing that feature moves the prediction.
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
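To make that concrete, a small sketch of my own (not part of the original solution) that reproduces the model's predicted probability for one row by hand from the learned coefficients:
# Sketch: sigmoid(w.x + b) for the first test row, compared against predict_proba
x0 = X_test.iloc[0].values
prob_by_hand = 1 / (1 + np.exp(-(np.dot(logreg.coef_[0], x0) + logreg.intercept_[0])))
print(prob_by_hand, logreg.predict_proba(X_test.iloc[[0]])[0, 1])  # the two numbers should agree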
# logreg.coef_
# Inspect the coefficients from Logistic Regression
# They show how each feature relates to Survived in the fitted model:
# a positive coefficient means that increasing the feature increases the
# predicted odds of survival; a negative one decreases them.
coeff_df = pd.DataFrame(train_df.columns.delete(0))
coeff_df.columns = ['Feature']
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])
coeff_df.sort_values(by='Correlation', ascending=False)
| Feature | Correlation |
| --- | --- |
| Sex | 2.140744 |
| Title | 0.461005 |
| Embarked | 0.262530 |
| Fare | 0.215071 |
| Age | -0.034734 |
| Parch | -0.034867 |
| SibSp | -0.228568 |
| FamilySize | -0.263568 |
| IsAlone | -0.512509 |
| Pclass | -0.699627 |
# Support Vector Machines
# We look for the decision boundary that best separates the two classes: the one
# with the largest margin, i.e. the largest distance to the closest points
# (the support vectors) of each class.
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc
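To relate this to the margin idea, a small extra check of mine (assuming svc was fit as above): decision_function returns a signed score for each sample, where the sign picks the predicted class and values near 0 lie close to the boundary.
# Sketch: signed scores of the first few training rows relative to the SVC boundary
print(svc.decision_function(X_train[:5]))
print(svc.predict(X_train[:5]))  # positive scores correspond to the 'survived' prediction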
# Decision Tree
# A decision tree repeatedly splits the data on the feature and threshold that give
# the most information gain (entropy criterion) or the greatest purity (Gini criterion).
# The largest information gain comes from the split whose children have the lowest
# entropy; see the graph below.
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
The graph below shows how a split is chosen based on entropy calculations in the Decision Tree algorithm.
- There is an entropy-based way of splitting the decision tree and a Gini-index way of splitting it. Both perform similarly.
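Before the graph, a tiny worked example of the two impurity measures (a sketch of the idea, not scikit-learn's internals); here p is the fraction of survivors in a node:
# Sketch: entropy and Gini impurity of a node as a function of its survivor fraction p
def entropy(p):
    if p in (0, 1):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
def gini(p):
    return 1 - (p**2 + (1 - p)**2)
print(entropy(0.5), gini(0.5))  # 1.0 and 0.5: a 50/50 node is maximally impure
print(entropy(1.0), gini(1.0))  # 0.0 and 0.0: a pure node
# A split is chosen to maximize the drop in impurity (the information gain)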

# Random Forest
# Wikipedia
# Random forests or random decision forests is an ensemble learning method for classification,
# regression and other tasks that operates by constructing a multitude of decision trees at training time.
# For classification tasks, the output of the random forest is the class selected by most trees.
# For regression tasks, the mean or average prediction of the individual trees is returned.
# Random decision forests correct for decision trees' habit of overfitting to their training set.
#
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
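As a sketch of the majority vote described above (my own illustration; note that scikit-learn actually averages the trees' predicted probabilities, which usually agrees with a hard vote):
# Sketch: ask each tree in the fitted forest for its vote on the first test row
first_row = X_test.iloc[[0]].values
tree_votes = np.array([tree.predict(first_row)[0] for tree in random_forest.estimators_])
print(tree_votes.mean())                    # fraction of trees voting 'survived'
print(random_forest.predict(first_row)[0])  # the forest's final prediction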
# Perceptron
# How is a Perceptron different from Logistic Regression?
# Logistic Regression always uses the logistic (sigmoid) function and outputs probabilities,
# while a Perceptron uses a hard threshold and only updates its weights on misclassified examples.
# The details also depend on how scikit-learn implements each of them.
perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron
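A minimal sketch of the classic perceptron update rule (illustration only; scikit-learn's Perceptron adds learning-rate and penalty options on top of this):
# Sketch: one perceptron update step with labels encoded as +1 / -1
w = np.zeros(X_train.shape[1]); b = 0.0
x = X_train.iloc[0].values
y = 1 if Y_train.iloc[0] == 1 else -1
if y * (np.dot(w, x) + b) <= 0:  # the point is misclassified (or on the boundary)
    w = w + y * x                # nudge the weights toward the correct side
    b = b + y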
# K-Nearest-Neighbors (kNN)
# Choosing k: you select the number of nearest neighbors, k, which is a critical parameter;
# the choice of k affects how robust the model is to noise.
# Distance metric: compute the distance between the point you are trying to classify
# (the query point) and all other points in the dataset.
# Common distance metrics include Euclidean, Manhattan, and Hamming distance.
# Find nearest neighbors: identify the k closest points to the query point in feature space.
# Majority vote for classification: the label of the query point is the majority label
# among its k nearest neighbors.
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
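A small by-hand version of that vote (a sketch of mine, assuming the fitted knn above with its default Euclidean metric):
# Sketch: classify the first test row with k=3 by Euclidean distance + majority vote
query = X_test.iloc[0].values
dists = np.sqrt(((X_train.values - query) ** 2).sum(axis=1))  # distance to every training row
nearest = np.argsort(dists)[:3]                               # the 3 closest training rows
print(Y_train.iloc[nearest].mode()[0])  # majority label; should match knn.predict up to tie-breaking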
# Gaussian Naive Bayes
# Plain Naive Bayes multiplies the class prior by the per-feature likelihoods, e.g.
# P(classA | word1, word2) is proportional to
# P(classA) * P(word1 | classA) * P(word2 | classA)
# Gaussian Naive Bayes does the same multiplication, but each likelihood comes from a
# Gaussian distribution fitted to that feature within the class.
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian
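A sketch of the Gaussian-likelihood piece of that explanation (my own illustration, using one feature instead of the full product over all features):
# Sketch: class prior times a Gaussian likelihood for the Age feature of the first test row
def gaussian_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
survivors = X_train[Y_train == 1]
prior = (Y_train == 1).mean()
likelihood_age = gaussian_pdf(X_test.iloc[0]['Age'], survivors['Age'].mean(), survivors['Age'].var())
print(prior * likelihood_age)  # the full model multiplies a likelihood like this for every feature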
# Linear SVC - Support Vector Classifier
# What is a Linear SVC?
# It is like an SVM whose decision boundary is restricted to be linear, which makes it
# more efficient: computing a linear boundary is simpler than computing non-linear
# boundaries, which require transformations and extra calculations via the kernel trick.
# Use cases: best when you have reason to believe the data can be separated by a linear
# boundary, or when computational efficiency is a concern.
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
print(acc_linear_svc)
# print(linear_svc.coef_[0])
w = linear_svc.coef_[0]
print(w)
80.13
[-0.24660274 0.83744421 -0.02822989 -0.10231388 -0.01159634 0.0504932
0.08623507 0.17555587 -0.08942259 -0.21129048]
/usr/local/lib/python3.10/dist-packages/sklearn/svm/_base.py:1244: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
warnings.warn(
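The ConvergenceWarning above means the liblinear solver stopped before converging. A common remedy (not applied in this notebook, just a suggestion) is to allow more iterations or to scale the features first:
# Possible fix for the warning: give the solver more iterations
linear_svc = LinearSVC(max_iter=10000)
linear_svc.fit(X_train, Y_train)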
# Stochastic Gradient Descent
# Linear classifiers (SVM, logistic regression, etc.) with SGD training. (From the docs)
# It is "stochastic" because the gradient of the loss is computed on small random subsets
# (single samples or mini-batches) instead of the entire dataset,
# and the weights are updated after each of these partial gradients.
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd
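To make the mini-batch idea concrete, a sketch of my own (assuming the same X_train and Y_train) that feeds the data to the classifier in chunks via partial_fit instead of a single fit call:
# Sketch: train an SGDClassifier on mini-batches of 64 rows at a time
sgd_mb = SGDClassifier()
classes = np.unique(Y_train)
for start in range(0, len(X_train), 64):
    X_batch = X_train.iloc[start:start + 64]
    y_batch = Y_train.iloc[start:start + 64]
    sgd_mb.partial_fit(X_batch, y_batch, classes=classes)  # one update per batch
print(round(sgd_mb.score(X_train, Y_train) * 100, 2))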
models = pd.DataFrame({
'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
'Random Forest', 'Naive Bayes', 'Perceptron',
'Stochastic Gradient Descent', 'Linear SVC',
'Decision Tree'],
'Score': [acc_svc, acc_knn, acc_log,
acc_random_forest, acc_gaussian, acc_perceptron,
acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
| Model | Score |
| --- | --- |
| Random Forest | 89.11 |
| Decision Tree | 89.11 |
| KNN | 85.07 |
| Support Vector Machines | 82.04 |
| Logistic Regression | 80.25 |
| Linear SVC | 80.13 |
| Naive Bayes | 78.45 |
| Perceptron | 77.33 |
| Stochastic Gradient Descent | 72.62 |
# Build the submission file
# Y_pred currently holds the predictions of the last model that was fit (the SGD classifier),
# so regenerate it with the best-scoring model (Random Forest) before writing the file
Y_pred = random_forest.predict(X_test)
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": Y_pred
})
submission.to_csv('./submission.csv', index=False)