In [1]:
import pandas as pd
import numpy as np
Load the train and test datasets to create two DataFrames¶
In [2]:
# Read the Kaggle Titanic train/test splits from the local data directory.
train = pd.read_csv("./data/train.csv")
test = pd.read_csv("./data/test.csv")
In [3]:
# Use print as a function: the old `print "..."` statement is a syntax
# error on Python 3, while single-argument print(...) works on both 2 and 3.
print("train len: {}".format(len(train)))
print("test len: {}".format(len(test)))
# data example — bare last expression renders the rich HTML table
train.head()
Out[3]:
In [4]:
# Count the missing values per column of interest.
age_null = train["Age"].isnull()
# print() calls are Python 3 compatible; the original messages were a
# copy-paste slip ("missed age embarked value" etc.) — each line now
# reports the column it actually counts.
print("missed age count: {}".format(len(train[age_null])))
print("missed embarked count: {}".format(len(train[train["Embarked"].isnull()])))
print("missed fare count: {}".format(len(train[train["Fare"].isnull()])))
print("missed cabin count: {}".format(len(train[train["Cabin"].isnull()])))
In [5]:
print "For {} passangers fare is 0.".format(len(train[train["Fare"] == 0]))
Which passengers are missing the Embarked field?¶
In [6]:
train[train["Embarked"].isnull()]
Out[6]:
Boarding info is missing for 2 first-class passengers. Let's see how many passengers from the different ports held a first-class ticket¶
In [7]:
# Exclude zero fares, keep first class only, and summarise Fare per port.
zero_fare = train["Fare"] == 0
first_class = train["Pclass"] == 1
# Use string aggregator names: passing raw numpy callables (np.size, np.min,
# ...) to aggfunc is deprecated in modern pandas and mislabels the min/max
# columns as 'amin'/'amax'.
train[(~zero_fare) & first_class].pivot_table(values='Fare', index='Embarked', aggfunc=["size", "mean", "min", "max"])
Out[7]:
In [8]:
train[(~zero_fare) & first_class & (train["Fare"] == 5.0)]
Out[8]:
It seems most likely that passengers 62 and 830 boarded in Southampton¶
In [9]:
# Passengers 62 and 830 (index labels 61 and 829) are the only rows with a
# missing Embarked value (shown two cells above).  Impute via the null mask
# rather than hardcoded index labels: this is idempotent and stays correct
# even if the row labels ever shift.
train.loc[train["Embarked"].isnull(), "Embarked"] = 'S'
Let's look at the passengers that have a fare of 0¶
In [10]:
train[zero_fare]
Out[10]:
All of them boarded in Southampton; let's see what we have for Southampton¶
In [11]:
# Per-class fare statistics for Southampton boarders with a non-zero fare.
from_southampton = train["Embarked"] == "S"
paid_from_s = ~zero_fare & from_southampton
train[paid_from_s].groupby("Pclass")["Fare"].describe()
Out[11]:
While looking at how to propagate the missing fare, I saw that some 3rd-class passengers paid much more than others.¶
In [12]:
# Third-class passengers whose fare is suspiciously high (> 69).
third_class = train["Pclass"] == 3
suspiciously_expensive = train["Fare"] > 69
train[suspiciously_expensive & third_class]
Out[12]:
This is a big family; it seems this is the fare for the whole family: 69.55 / 10 = 6.955. Mean for 3rd-class passengers from Southampton¶
In [13]:
# Fare statistics for 3rd-class Southampton boarders, excluding zero fares
# and the Sage family's shared ticket (CA. 2343), which skews the mean.
third_from_s = from_southampton & third_class
not_sage_ticket = train["Ticket"] != "CA. 2343"
train[~zero_fare & third_from_s & not_sage_ticket].groupby("Pclass")["Fare"].describe()
Out[13]:
Let's fill in the missing fare data¶
But first, I'll create a new dataset by combining the train and test datasets in order to get a more precise mean of "fare"
In [14]:
all_data = train.copy().append(test.copy())
As was discovered with the Sage family, some passengers share the same ticket number, with a combined ticket price for the whole family/group.¶
Let's create a new column with the count of passengers that share the same ticket number
In [15]:
all_data["group"] = all_data.groupby("Ticket")["PassengerId"].transform("count")
Now I can calculate ticket price per passenger¶
In [16]:
# The recorded Fare is for the whole ticket; divide by the number of
# passengers sharing it to get a per-passenger price.
all_data["ticket_price"] = all_data["Fare"].div(all_data["group"])
# Summarise the positive per-passenger prices by class.
positive_price = all_data["ticket_price"] > 0
all_data[positive_price].groupby("Pclass").ticket_price.describe()
Out[16]:
I noticed an outlier in the data for 1st-class passengers: max ticket_price of 128.082300 with a mean of 34. Interesting¶
In [17]:
all_data[all_data["ticket_price"] > 128]
Out[17]:
According to https://www.encyclopedia-titanica.org/titanic-survivor/thomas-cardeza.html, that was mother and son with 2 their servants. They occupied most expensive cabins
In [18]:
all_data[(all_data["SibSp"] + all_data["Parch"]) + 1 > all_data["group"]].head()
Out[18]:
It seems that not all family members have the same/shared ticket number.¶
Let's update the "group" column with the maximum of the current value or Parch + SibSp + 1, whichever is bigger
In [19]:
# "group" already holds the per-ticket passenger count computed earlier and
# has not been modified since, so the original duplicate recomputation of
# the transform("count") is dropped here.
# Declared family size: siblings/spouses + parents/children + the passenger.
all_data["family"] = all_data["SibSp"] + all_data["Parch"] + 1
# Widen "group" to the family size whenever the family is larger than the
# set of passengers sharing the ticket.
all_data["group"] = all_data[["family", "group"]].max(axis=1)
Now I can calculate ticket_price for the passengers that originally had a missing "Fare" value¶
In [20]:
# A fare is "missing" when it is recorded as 0 or is NaN.
zero_fare = all_data["Fare"].isnull() | (all_data["Fare"] == 0)
first_class = all_data["Pclass"] == 1
second_class = all_data["Pclass"] == 2
third_class = all_data["Pclass"] == 3
# All passengers with a missing fare value boarded in Southampton, so the
# imputation mean is taken over Southampton boarders of the same class.
from_southampton = all_data["Embarked"] == "S"
first_from_s = from_southampton & first_class
second_from_s = from_southampton & second_class
third_from_s = from_southampton & third_class
# For each class: fill the missing per-passenger prices with the mean
# per-passenger price of same-class Southampton boarders who did pay.
for in_class, in_class_from_s in ((first_class, first_from_s),
                                  (second_class, second_from_s),
                                  (third_class, third_from_s)):
    fill_value = all_data[~zero_fare & in_class_from_s].ticket_price.mean()
    all_data.loc[zero_fare & in_class, "ticket_price"] = fill_value
Time to investigate age data¶
In [21]:
# Age statistics over the rows where Age is present.
age_null = all_data["Age"].isnull()
all_data.loc[~age_null, "Age"].describe()
Out[21]:
In [22]:
all_data[age_null].groupby("Pclass")["PassengerId"].count()
Out[22]:
In [23]:
all_data[~age_null & third_class].groupby("Sex").Age.describe()
Out[23]:
In [24]:
all_data[~age_null & first_class].groupby("Sex").Age.describe()
Out[24]:
It seems that the median age of first-class passengers is higher than 3rd class; also, women are younger than men.¶
Let's use this information to fill in our missing data
In [25]:
# Impute missing ages with the mean age of the matching (class, sex) group.
# The six copy-pasted assignments are replaced by a loop over the class and
# sex masks — same assignments, same order, no repetition.
women = all_data["Sex"] == "female"
men = all_data["Sex"] == "male"
for class_mask in (first_class, second_class, third_class):
    for sex_mask in (women, men):
        group_mean_age = all_data[~age_null & class_mask & sex_mask].Age.mean()
        all_data.loc[age_null & class_mask & sex_mask, "Age"] = group_mean_age
And finally, propagate the calculated age and fare back to our original train and test datasets¶
In [26]:
# Split the imputed all_data back into its train and test parts by
# PassengerId membership.
# NOTE(review): the assignments below rely on pandas index alignment —
# all_data was built by appending test to train, so its train part keeps
# train's original row index and its test part keeps test's; alignment
# therefore maps each imputed value back to the right row.  Confirm this
# still holds if the construction of all_data ever changes.
train_filter = all_data[all_data.PassengerId.isin(train.PassengerId)]
test_filter = all_data[all_data.PassengerId.isin(test.PassengerId)]
# Fill only the originally-missing ages; copy the new columns wholesale.
train.loc[train["Age"].isnull(), "Age"] = train_filter.Age
test.loc[test["Age"].isnull(), "Age"] = test_filter.Age
train["ticket_price"] = train_filter.ticket_price
test["ticket_price"] = test_filter.ticket_price
train["group"] = train_filter.group
test["group"] = test_filter.group
Prepare data for ml, converting categorical classes to integer representation¶
In [27]:
# One-hot encode the categorical predictors for both splits.
categorical_cols = ["Pclass", "Sex", "Embarked"]
df_categorical_train = pd.get_dummies(train[categorical_cols])
df_categorical_test = pd.get_dummies(test[categorical_cols])
# Target labels as a plain numpy array for the sklearn estimators.
target_train = train["Survived"].values
In [28]:
#add visual utils
from inspect import getsourcefile
import os.path as path, sys
current_dir = path.dirname(path.dirname(path.abspath(getsourcefile(lambda:0))))
sys.path.insert(0, current_dir[:current_dir.rfind(path.sep)])
import visuals as vs
import metrics as ms
In [29]:
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
In [30]:
# Assemble the model input: numeric features plus the one-hot columns,
# converted to plain numpy arrays.
features = ["Age", "group", "ticket_price"]
numeric_train = train[features]
numeric_test = test[features]
selected_features_train = pd.concat([numeric_train, df_categorical_train], axis=1).values
selected_features_test = pd.concat([numeric_test, df_categorical_test], axis=1).values
In [31]:
# Initialize the three models
# Initialize the three ensemble models.
# Pin random_state so the stochastic learners give reproducible results
# across kernel restarts (no seed was set anywhere in the notebook).
clf_A = AdaBoostClassifier(random_state=42)
clf_B = GradientBoostingClassifier(random_state=42)
clf_C = RandomForestClassifier(random_state=42)
# Calculate the number of samples for 1%, 10%, and 100% of the training data
samples_1 = int(len(selected_features_train) * 0.01)
samples_10 = int(len(selected_features_train) * 0.1)
samples_100 = len(selected_features_train)
# Collect results on the learners: results[classifier name][sample-size idx]
results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        # ms.train_predict is a project-local helper; presumably it fits clf
        # on `samples` rows and returns its metrics — confirm in metrics.py.
        results[clf_name][i] = ms.train_predict(clf, samples, selected_features_train, target_train, selected_features_test)
# Run metrics visualization for the three supervised learning models chosen
vs.evaluate(results)
In [32]:
# NOTE(review): clf_A's fitted state is whatever the last ms.train_predict
# call left it with (the 100% sample) — confirm that is the model intended
# for the submission.
predictions = clf_A.predict(selected_features_test)
# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
passenger_ids = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(predictions, index=passenger_ids, columns=["Survived"])
# Write the solution next to the notebook instead of the hardcoded absolute
# local path (/home/denys/...), so the notebook runs on any machine.
my_solution.to_csv("AdaBoostClassifier.csv", index_label=["PassengerId"])