Applying 7 Classification Algorithms on the Titanic Dataset

Eshita Goel · Published in Geek Culture · 7 min read · Jul 1, 2021

In search of the most accurate algorithm!

If you’re just starting out with data science, the Titanic: Machine Learning from Disaster project on Kaggle is one of the best ways to learn Classification Algorithms! In this article, I go through how I applied 7 different classification algorithms on this dataset, and we find out which gives us the best accuracy.

I’ve even submitted the models to the Kaggle competition, so we find out exactly how each model performs on unknown data!

Firstly, if you’re going to use any of these models, you’ll have to clean your data. I’ve gone through the various ways to clean and visualise the Titanic dataset in this article. So if you’re wondering how to handle those missing values, or how each feature correlates with the Survived column, check that article out first.

Let’s look at the prerequisites for building the models:

Storing our data in variables

The data from Kaggle comes as .csv files; we store that data in variables.
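
If you’re following along, the cleaned train and test DataFrames first have to be read in with pandas. A minimal sketch, assuming the raw Kaggle files have already been cleaned and the engineered columns (Prefix, Family, and the Q and S dummies) added as described in the previous article; the file names are just the Kaggle defaults:

import pandas as pd

# File names assume the default Kaggle download; adjust the paths to your setup
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Keep an untouched copy of the test set so PassengerId stays available for the submission DataFrames
test2 = test.copy()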

X = train[["Pclass","Sex","Age","Fare","Cabin","Prefix","Q","S","Family"]]
Y = train["Survived"]
X_TEST = test[["Pclass","Sex","Age","Fare","Cabin","Prefix","Q","S","Family"]]

Standardisation of the Data

Standardisation is not needed for every algorithm, but it matters for distance-based methods like K-Nearest Neighbours, which rely on Euclidean or Manhattan distance, and it also helps models like Logistic Regression converge. Therefore, we standardise our data.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)
X_TEST = sc.transform(X_TEST)

Note that we fit the standard scaler only on the training data and only transform the test data. This prevents any information leaking from the test data into training.

Train-Test-Split

We perform a train-test split on the training data so that we can evaluate each model locally before submitting:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size = 0.2, random_state=1)

Making the Models

1. K-Nearest Neighbours Algorithm

The K-Nearest Neighbours algorithm works well for classification if the right k value is chosen. We can find it with a small for-loop that tests the accuracy for each k from 1 to 19.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

acc = []

for i in range(1, 20):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    yhat = knn.predict(X_test)
    acc.append(accuracy_score(y_test, yhat))
    print("For k =", i, ":", accuracy_score(y_test, yhat))

To visualise and compare these values, we draw a line graph to find out which k value gives us the best accuracy:

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.plot(range(1, 20), acc, marker="o")
plt.xlabel("Value of k")
plt.ylabel("Accuracy Score")
plt.title("Finding the right k")
plt.xticks(range(1, 20))
plt.show()

Our preferred value for k that gives us the highest accuracy is k = 9.
We can now use this k value for making our model:

KNN = KNeighborsClassifier(n_neighbors = 9)
KNN.fit(X,Y)
y_pred = KNN.predict(X_TEST)
df_KNN = pd.DataFrame()
df_KNN["PassengerId"] = test2["PassengerId"]
df_KNN["Survived"] = y_pred
KNN Output
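
To turn this DataFrame into an actual submission, it has to be written out as a CSV in the two-column format Kaggle expects. A one-line sketch (the file name is just an example):

# index=False keeps the file to the two columns Kaggle expects: PassengerId and Survived
df_KNN.to_csv("submission_knn.csv", index=False)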

When we submit this model to the Kaggle Competition to see how well our model performs, we get an accuracy score of 77.27%

2. Decision Tree Algorithm

We try out the Decision Tree algorithm for this classification problem. We need to find the right maximum depth for the tree, since without limiting the depth the model can easily overfit.

This can be done using cross-validation; for beginners, a simple for-loop can also help you compare depths. A cross-validated sketch is shown below, followed by the for-loop used in this article.
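
Roughly, the cross-validated route could look like this (GridSearchCV and the 5-fold setting are my additions, not part of the original notebook):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Search the same depth range as the for-loop below, scored with 5-fold cross-validation
grid = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=100),
    param_grid={"max_depth": list(range(1, 8))},
    scoring="accuracy",
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)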

from sklearn.tree import DecisionTreeClassifier

depth = []

for i in range(1, 8):
    clf_tree = DecisionTreeClassifier(criterion="entropy", random_state=100, max_depth=i)
    clf_tree.fit(X_train, y_train)
    yhat = clf_tree.predict(X_test)
    depth.append(accuracy_score(y_test, yhat))
    print("For max depth =", i, ":", accuracy_score(y_test, yhat))

Here too, we can plot and see which depth gives us the most accurate predictions:

plt.figure(figsize=(8,6))
plt.plot(range(1,8),depth,color="red", marker = "o")
plt.xlabel("Depth of Tree")
plt.ylabel("Accuracy Score")
plt.title("Finding the right depth with highest accuracy")
plt.xticks(range(1,8))
plt.show()

The highest accuracy is obtained with depth = 3, so we now train and predict with this depth.

clf_tr = DecisionTreeClassifier(criterion="entropy", random_state = 100, max_depth = 3)
clf_tr.fit(X,Y)
pred_tree = clf_tr.predict(X_TEST)
df_TREE = pd.DataFrame()
df_TREE["PassengerId"] = test2["PassengerId"]
df_TREE["Survived"] = pred_tree
df_TREE.head()
Decision Tree Output

When we submit this model to the Kaggle Competition to see how well our model performs, we get an accuracy score of 78.46%

3. Random Forest Algorithm

Random Forest is an ensemble technique that combines multiple decision trees. Let’s see how it performs on our data:

from sklearn.ensemble import RandomForestClassifier

clf_forest = RandomForestClassifier(random_state=0)
clf_forest.fit(X_train,y_train)
yhat = clf_forest.predict(X_test)
print("Accuracy for training data : ",accuracy_score(y_test,yhat))
Accuracy for training data : 0.776536312849162
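
As a side note (my own addition, not part of the original walkthrough), a fitted random forest exposes feature_importances_, which gives a quick sense of which columns the trees rely on. A short sketch, assuming the same feature list used to build X:

import pandas as pd

# Same order as the columns selected for X earlier
feature_names = ["Pclass", "Sex", "Age", "Fare", "Cabin", "Prefix", "Q", "S", "Family"]
importances = pd.Series(clf_forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))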

We now retrain on the full training data, save the predictions, and submit them to Kaggle:

clf_for = RandomForestClassifier(random_state=0)
clf_for.fit(X,Y)
y_forest = clf_for.predict(X_TEST)
df_FOREST = pd.DataFrame()
df_FOREST["PassengerId"] = test2["PassengerId"]
df_FOREST["Survived"] = y_forest
df_FOREST.head()
Random Forest Output

We submit our predictions for this model to the Titanic: Machine Learning from Disaster competition on Kaggle and check our accuracy.

Our accuracy is 77.27%

4. Support Vector Machine

We try out the Support Vector Machine Algorithm for this classification problem.

from sklearn.svm import SVC

clf_svm = SVC(gamma='auto')
clf_svm.fit(X_train, y_train)
yhat = clf_svm.predict(X_test)
print("Validation accuracy : ", accuracy_score(y_test, yhat))

clf_SVM = SVC(gamma='auto')
clf_SVM.fit(X, Y)
pred_svm = clf_SVM.predict(X_TEST)
df_SVM = pd.DataFrame()
df_SVM["PassengerId"] = test2["PassengerId"]
df_SVM["Survived"] = pred_svm
df_SVM.head()
SVM Output

We submit our predictions for this model to the Titanic: Machine Learning from Disaster competition on Kaggle and check our accuracy.

Our accuracy is 77.51%
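
SVC defaults to the RBF kernel; if you’re curious how the other built-in kernels behave on this split, a small comparison loop like the one below works. This is my own addition, not part of the original notebook.

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Compare the built-in kernels on the held-out validation split
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="auto")
    clf.fit(X_train, y_train)
    print(kernel, ":", accuracy_score(y_test, clf.predict(X_test)))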

5. Naive Bayes Algorithm

We try out the Naive Bayes Algorithm for this classification problem.

from sklearn.naive_bayes import GaussianNB
clf_NB = GaussianNB()
clf_NB.fit(X_train,y_train)
y_hat = clf_NB.predict(X_test)
print("Accuracy for training data : ",accuracy_score(y_test,y_hat))
clf_NB = GaussianNB()
clf_NB.fit(X,Y)
pred_NB = clf_NB.predict(X_TEST)
df_NB = pd.DataFrame()
df_NB["PassengerId"] = test2["PassengerId"]
df_NB["Survived"] = pred_NB
df_NB.head()
Naive Bayes Output
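
GaussianNB models each feature with a per-class Gaussian and assumes the features are conditionally independent. If you want to peek at what it learned, the fitted attributes below are available (this inspection is my own addition):

# Per-class feature means and the estimated class priors P(Survived = 0) and P(Survived = 1)
print(clf_NB.theta_)        # shape (2, n_features), in the same order as the columns of X
print(clf_NB.class_prior_)  # roughly [0.62, 0.38] on the Titanic training data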

We submit our predictions for this model to the Titanic: Machine Learning from Disaster competition on Kaggle and check our accuracy.

Our accuracy is 72.72%

6. Logistic Regression Algorithm

We try out the Logistic Regression Algorithm for this classification problem.

from sklearn.linear_model import LogisticRegression
regr = LogisticRegression(solver='liblinear', random_state=1)
regr.fit(X_train,y_train)
yhat = regr.predict(X_test)
print("Accuracy for training data : ",accuracy_score(y_test,y_hat))
reg = LogisticRegression(solver='liblinear', random_state=1)
reg.fit(X,Y)
y_LR = reg.predict(X_TEST)
df_LR = pd.DataFrame()
df_LR["PassengerId"] = test2["PassengerId"]
df_LR["Survived"] = y_LR
df_LR.head()
Logistic Regression Output

We submit our predictions for this model to the Titanic: Machine Learning from Disaster competition on Kaggle and check our accuracy.

Our accuracy is 76.55%
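
One advantage of logistic regression is that its coefficients are easy to read: after standardisation, a positive weight pushes a passenger towards Survived = 1. Inspecting them is my own addition and looks roughly like this:

import pandas as pd

# coef_ has shape (1, n_features); the order matches the columns used to build X
feature_names = ["Pclass", "Sex", "Age", "Fare", "Cabin", "Prefix", "Q", "S", "Family"]
print(pd.Series(reg.coef_[0], index=feature_names).sort_values())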

7. Stochastic Gradient Descent Classifier

We try out the Stochastic Gradient Descent Classifier for this classification problem.

from sklearn.linear_model import SGDClassifier

# loss="squared_error" was called "squared_loss" in older scikit-learn versions;
# the negative tol effectively disables early stopping, so training runs for the full max_iter
clf_SGD = SGDClassifier(loss="squared_error", penalty="l2", max_iter=4500, tol=-1000, random_state=1)
clf_SGD.fit(X_train, y_train)
yhat = clf_SGD.predict(X_test)
print("Validation accuracy : ", accuracy_score(y_test, yhat))
clf_SGD = SGDClassifier(loss="squared_error", penalty="l2", max_iter=4500, tol=-1000, random_state=1)
clf_SGD.fit(X,Y)
y_SGD = clf_SGD.predict(X_TEST)
df_SGD = pd.DataFrame()
df_SGD["PassengerId"] = test2["PassengerId"]
df_SGD["Survived"] = y_SGD
df_SGD.head()
SGD Classifier Output

We submit our predictions for this model to the Titanic: Machine Learning from Disaster competition on Kaggle and check our accuracy.

Our accuracy is 76.79%
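
For comparison, SGDClassifier’s default loss is "hinge", which amounts to training a linear SVM with stochastic gradient descent. A quick check on the held-out split (my own addition, not part of the original notebook):

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# Default hinge loss, same penalty and iteration budget as above
clf_hinge = SGDClassifier(loss="hinge", penalty="l2", max_iter=4500, random_state=1)
clf_hinge.fit(X_train, y_train)
print("Validation accuracy : ", accuracy_score(y_test, clf_hinge.predict(X_test)))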

Final Results

Let’s plot the Kaggle accuracies we received so far and see which model performed the best for us:

# Kaggle scores collected above, in the same order as the model labels below
KNN_accuracy, TREE_accuracy, FOREST_accuracy, SVM_accuracy = 0.7727, 0.7846, 0.7727, 0.7751
NB_accuracy, LR_accuracy, SGD_accuracy = 0.7272, 0.7655, 0.7679

plt.figure(figsize=(8, 6))
plt.plot(range(1, 8), [KNN_accuracy, TREE_accuracy, FOREST_accuracy, SVM_accuracy, NB_accuracy, LR_accuracy, SGD_accuracy], marker='o')
plt.xticks(range(1, 8), ['KNN', 'Decision Tree', 'Random Forest', 'SVM', 'Naive Bayes', 'Log Regression', 'SGD'], rotation=25)
plt.title('Accuracy of Various Models')
plt.xlabel('Model Names')
plt.ylabel("Accuracy Score")
plt.show()

The Decision Tree gives us the best accuracy here. Keep in mind that this is what we get from applying each model straight away, without any hyperparameter tuning or cross-validation; with those additional steps the accuracy is bound to improve. Still, this simple application is a good way to learn about applying models to classification problems!

This marks the end of our Titanic: Machine Learning from Disaster project!

You can view the complete code with all the steps — cleaning data, preprocessing, standardising and applying the model, on my GitHub:
