# Part 2: Titanic — model building, scikit-learn

I started the Titanic project with the data analysis part; you can read that text here and see both notebooks on GitHub.

The next step was to build a model that predicts whether a person survived or not. I built several models just to test which model type works best. It was also a great way to learn the scikit-learn and Fast.ai packages.

Model building was much harder than the initial data analysis, but I learned several new things along the way.

I created a separate notebook for the model-building part, which is why I started with the basics again.

```python
# Import the packages
import pandas as pd
import numpy as np
```

After loading the necessary packages, I loaded the data:

`df = pd.read_csv("train.csv")`

Then I had a quick look at the data:

```python
df.head()   # prints the first five rows of the table
df.tail()   # prints the last five rows of the table
df.shape    # shows the number of rows and columns
```

# Replace missing values

First of all, I prepared the dataset used for model training. That means choosing which features to use and which to leave out.

I started by checking how many missing values there are.

```python
df.isnull().sum()   # count missing values in each column
```

Then I thought about what to do with all the missing values so that they wouldn’t hurt the model accuracy. I chose to fill missing Age values with the mean value.

```python
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Fills missing values in the "Age" column with the mean value.
```
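As a side note, mean imputation is sensitive to outliers; the median is a common, more robust alternative. A small sketch on made-up ages (not the Titanic data):

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value and one outlier
ages = pd.DataFrame({"Age": [22.0, 30.0, np.nan, 80.0]})

# The median (30.0) is less affected by the 80-year-old outlier than the mean (44.0)
ages["Age"] = ages["Age"].fillna(ages["Age"].median())
print(ages["Age"].tolist())  # [22.0, 30.0, 30.0, 80.0]
```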

There were only a few missing Embarked values, so I chose to fill them with the previous row’s value.

`df["Embarked"].fillna( method ='ffill', inplace = `**True**)

# Missing values are filled with previous row value

Most of the Cabin values are missing, so I chose to fill them with the placeholder value “Empty”.

```python
df["Cabin"].fillna("Empty", inplace=True)
# All missing values in the "Cabin" column are filled with the word "Empty".
```

# Preprocessing data

Preprocessing the data is important because it helps build models that are as accurate as possible.

```python
from sklearn import preprocessing
# This module provides common utility functions and transformer classes to change
# raw feature vectors into a representation more suitable for downstream estimators.
# It can be imported at the beginning together with pandas and numpy.
```

I encoded the categorical columns with get_dummies.

```python
df = pd.get_dummies(df, prefix_sep='_', drop_first=True)
df.head()
# prefix_sep='_'  - separator to use when appending the prefix.
# drop_first=True - create k-1 dummies out of k categorical levels by removing the first level.
```

Using get_dummies makes categorical values easier to quantify and compare. The models need numeric input, so dummy variables are used to represent the different levels of each categorical variable.
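To make the encoding concrete, here is what get_dummies does to a tiny made-up frame with one categorical column (not the actual Titanic data):

```python
import pandas as pd

# Toy frame with one categorical column
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# drop_first=True drops the alphabetically first level ("C"),
# so each remaining level gets its own 0/1 column
dummies = pd.get_dummies(toy, prefix_sep='_', drop_first=True)
print(dummies.columns.tolist())  # ['Embarked_Q', 'Embarked_S']
```

A row that is neither Q nor S (both columns 0) is implicitly C, which is why one column can be dropped without losing information.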

# Training and test sets

The next step is to create separate datasets for training and testing.

```python
from sklearn.model_selection import train_test_split
# Splits arrays or matrices into random train and test subsets.
# This import can also go at the top together with pandas and numpy.
```

First of all, I need to choose the dependent variable I want the model to predict. As my goal is to predict whether a person survived, I chose the “Survived” column.

`y = df["Survived"]`

Then I chose which columns to use as model inputs.

```python
df_train_input = df.loc[:, df.columns != 'Survived']
# The input set is built from all the columns except "Survived".
```

After choosing the columns it’s time to create the training and testing sets.

```python
X_train, X_test, y_train, y_test = train_test_split(df_train_input, y, test_size=0.2)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

This creates four sets: X_train, X_test, y_train, and y_test. The inputs are the columns I defined earlier, and test_size=0.2 puts 20% of the rows into the test set and the remaining 80% into the training set.
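Two optional parameters of train_test_split that I found useful later (not used in my notebook above): random_state makes the split reproducible, and stratify keeps the class proportions the same in both halves. A sketch on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 30% positive class
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 30 + [0] * 70)

# random_state fixes the shuffle; stratify keeps the 30/70 ratio in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print(X_tr.shape, X_te.shape)  # (80, 1) (20, 1)
print(y_te.mean())             # 0.3 thanks to stratification
```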

# Building the model

Finally, it is time to start training models. I used scikit-learn for the model building, but there are also several other ways to build a model.

I started with Gaussian Naive Bayes, a simple classification algorithm based on Bayes’ theorem.
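For reference, the classifier picks the class $y$ with the highest posterior probability, which Bayes’ theorem plus the “naive” independence assumption gives as:

```latex
P(y \mid x_1, \ldots, x_n) = \frac{P(y)\,\prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \ldots, x_n)}
```

The “Gaussian” part means each per-feature likelihood $P(x_i \mid y)$ is modeled as a normal distribution.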

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

gnb = GaussianNB()
pred = gnb.fit(X_train, y_train).predict(X_test)
print("Naive Bayes accuracy : ", accuracy_score(y_test, pred, normalize=True))
```

I also tested LinearSVC, which stands for Linear Support Vector Classification.

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

svc_model = LinearSVC(random_state=0)
pred = svc_model.fit(X_train, y_train).predict(X_test)
print("LinearSVC accuracy : ", accuracy_score(y_test, pred, normalize=True))
```
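One thing I learned later: linear SVMs are sensitive to feature scale, and this is exactly where the preprocessing module imported earlier comes in. A hedged sketch on synthetic data (the notebook itself would use the Titanic features; names like svc_pipe are my own):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# StandardScaler centers each feature and scales it to unit variance,
# which usually helps a linear SVM converge and score better
svc_pipe = make_pipeline(StandardScaler(), LinearSVC(random_state=0))
svc_pipe.fit(X_tr, y_tr)
print("scaled LinearSVC accuracy:", svc_pipe.score(X_te, y_te))
```

The pipeline also makes sure the scaler is fitted only on the training data, so no information leaks from the test set.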

The last model type I tried was the K-Neighbors classifier.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
pred = neigh.predict(X_test)
print("KNeighbors accuracy score : ", accuracy_score(y_test, pred))
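The n_neighbors=3 above was an arbitrary choice. A simple way to pick k is to try a few values and keep the one that scores best on held-out data; a sketch on synthetic data (the notebook would reuse its own train/test sets):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=300, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=1)

# Try a few odd values of k and keep the one that scores best on the held-out set
scores = {}
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = knn.score(X_te, y_te)

best_k = max(scores, key=scores.get)
print("best k:", best_k, "accuracy:", scores[best_k])
```

Odd values avoid ties in the majority vote for binary problems; for a more robust choice, cross-validation would be preferable to a single split.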

As you can see, none of these models performed very well. I was not satisfied with the results, so I started learning how to extract features to build a better model. Here I used all the features just as they were in the CSV file, but I realized that is not a good way to build a model.
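As a small taste of the feature extraction I moved on to, here are two features commonly engineered from the Titanic columns: family size from SibSp and Parch, and a flag for whether a cabin was recorded at all. The column names follow the Kaggle CSV; the tiny frame below is made up for illustration:

```python
import pandas as pd

# Tiny made-up sample using the Kaggle column names
sample = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
    "Cabin": ["C85", None, None],
})

# FamilySize: the passenger plus their siblings/spouses and parents/children
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

# HasCabin: whether a cabin number was recorded at all
sample["HasCabin"] = sample["Cabin"].notnull().astype(int)

print(sample[["FamilySize", "HasCabin"]])
```

Note how HasCabin turns the mostly-missing Cabin column into a simple signal instead of the “Empty” placeholder used above.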

Thank you to the highly trained monkey (Risto Hinno) for editing my texts and inspiring me!