Some time ago, I found a great course on deep learning (fast.ai). I watched a few videos and was fascinated by the results that can be achieved by using DL for visual recognition.
I wanted to try it out myself, and found an active playground competition on Kaggle.
The goal of the competition is to identify the breed of the dog in a picture.
The training data contains approximately 10K photos of dogs of 120 different breeds. The test data contains about 10K images as well.
I started by trying to distinguish only 2 breeds: Doberman vs. Pomeranian. For training I had 157 images, and for testing 30 (15 for each breed).
First I tried a simple architecture with only one convolutional layer. I ran the training with 3-fold cross-validation, 100 epochs for each fold.
Training time per epoch: 4s
I got an accuracy of 0.8 on my validation set. Full training took approximately 20 minutes.
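To make the 3-fold setup concrete, here is a minimal sketch in plain Python of how 157 training images could be split into three folds. The helper name `k_fold_indices` and the fixed seed are my own assumptions for illustration; the post does not show the splitting code that was actually used.

```python
import random


def k_fold_indices(n_samples, n_folds=3, seed=0):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split.

    Hypothetical helper for illustration, not the code used in the post.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# 157 is the number of training images mentioned above.
folds = list(k_fold_indices(157, n_folds=3))
for fold_no, (train_idx, val_idx) in enumerate(folds):
    print("fold {}: {} train / {} validation".format(
        fold_no, len(train_idx), len(val_idx)))
```

Each image ends up in a validation fold exactly once, so the network is trained three times, once per fold.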
The whole source code can be found on GitHub.
Programing tips
IT, Hadoop, Big Data, Java, Scala, Android
Friday, January 19, 2018
Friday, February 10, 2017
Titanic: Machine Learning from Disaster
In [1]:
import pandas as pd
import numpy as np
Load the train and test datasets to create two DataFrames
In [2]:
train_url = "./data/train.csv"
train = pd.read_csv(train_url)
test_url = "./data/test.csv"
test = pd.read_csv(test_url)
In [3]:
print("train len: {}".format(len(train)))
print("test len: {}".format(len(test)))
# data example
train.head()
train len: 891
test len: 418
Out[3]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 
In [4]:
age_null = train["Age"].isnull()
print("missing Age count: {}".format(len(train[age_null])))
print("missing Embarked count: {}".format(len(train[train["Embarked"].isnull()])))
print("missing Fare count: {}".format(len(train[train["Fare"].isnull()])))
print("missing Cabin count: {}".format(len(train[train["Cabin"].isnull()])))
missing Age count: 177
missing Embarked count: 2
missing Fare count: 0
missing Cabin count: 687
In [5]:
print("For {} passengers the fare is 0.".format(len(train[train["Fare"] == 0])))
For 15 passengers the fare is 0.
Who is missing the Embarked field?
In [6]:
train[train["Embarked"].isnull()]
Out[6]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | 
| 829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | 
Boarding info is missing for 2 first-class passengers. Let's see how many passengers from the different ports have a first-class ticket
In [7]:
zero_fare = train["Fare"] == 0
first_class = train["Pclass"] == 1
train[(~zero_fare) & first_class].pivot_table(values='Fare', index='Embarked', aggfunc=[np.size, np.mean, np.min, np.max])
Out[7]:
| Embarked | size | mean | amin | amax |
|---|---|---|---|---|
| C | 85.0 | 104.718529 | 26.55 | 512.3292 |
| Q | 2.0 | 90.000000 | 90.00 | 90.0000 |
| S | 122.0 | 73.248668 | 5.00 | 263.0000 |
In [8]:
train[(~zero_fare) & first_class & (train["Fare"] == 5.0)]
Out[8]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 872 | 873 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 33.0 | 0 | 0 | 695 | 5.0 | B51 B53 B55 | S | 
It seems more likely that passengers 62 and 830 boarded in Southampton
In [9]:
train.loc[61, "Embarked"] = 'S'
train.loc[829, "Embarked"] = 'S'
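One way to make the Southampton guess concrete is to compare the known fare of 80.0 with the mean first-class fare per port from the pivot table above. This is only a rough heuristic of my own for illustration, not necessarily the exact reasoning behind the choice; note also that the Q mean rests on just 2 passengers.

```python
# Mean first-class fares per port, copied from the pivot table above.
mean_fare_by_port = {"C": 104.718529, "Q": 90.000000, "S": 73.248668}
# Number of first-class passengers behind each mean (the "size" column).
passengers_by_port = {"C": 85, "Q": 2, "S": 122}

known_fare = 80.0  # fare paid by passengers 62 and 830

# Absolute distance between the known fare and each port's mean fare.
distance = {port: abs(known_fare - mean)
            for port, mean in mean_fare_by_port.items()}
closest_port = min(distance, key=distance.get)

for port in sorted(distance):
    print("{}: |80 - {:.2f}| = {:.2f} ({} passengers)".format(
        port, mean_fare_by_port[port], distance[port], passengers_by_port[port]))
print("closest port by mean fare: {}".format(closest_port))
```

Southampton is both the closest by mean fare and the port with the most first-class passengers, which is consistent with the assignment made in the cell above.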
Let's look at the passengers that have a fare of 0
In [10]:
train[zero_fare]
Out[10]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 179 | 180 | 0 | 3 | Leonard, Mr. Lionel | male | 36.0 | 0 | 0 | LINE | 0.0 | NaN | S | 
| 263 | 264 | 0 | 1 | Harrison, Mr. William | male | 40.0 | 0 | 0 | 112059 | 0.0 | B94 | S | 
| 271 | 272 | 1 | 3 | Tornquist, Mr. William Henry | male | 25.0 | 0 | 0 | LINE | 0.0 | NaN | S | 
| 277 | 278 | 0 | 2 | Parkes, Mr. Francis "Frank" | male | NaN | 0 | 0 | 239853 | 0.0 | NaN | S | 
| 302 | 303 | 0 | 3 | Johnson, Mr. William Cahoone Jr | male | 19.0 | 0 | 0 | LINE | 0.0 | NaN | S | 
| 413 | 414 | 0 | 2 | Cunningham, Mr. Alfred Fleming | male | NaN | 0 | 0 | 239853 | 0.0 | NaN | S | 
| 466 | 467 | 0 | 2 | Campbell, Mr. William | male | NaN | 0 | 0 | 239853 | 0.0 | NaN | S | 
| 481 | 482 | 0 | 2 | Frost, Mr. Anthony Wood "Archie" | male | NaN | 0 | 0 | 239854 | 0.0 | NaN | S | 
| 597 | 598 | 0 | 3 | Johnson, Mr. Alfred | male | 49.0 | 0 | 0 | LINE | 0.0 | NaN | S | 
| 633 | 634 | 0 | 1 | Parr, Mr. William Henry Marsh | male | NaN | 0 | 0 | 112052 | 0.0 | NaN | S | 
| 674 | 675 | 0 | 2 | Watson, Mr. Ennis Hastings | male | NaN | 0 | 0 | 239856 | 0.0 | NaN | S | 
| 732 | 733 | 0 | 2 | Knight, Mr. Robert J | male | NaN | 0 | 0 | 239855 | 0.0 | NaN | S | 
| 806 | 807 | 0 | 1 | Andrews, Mr. Thomas Jr | male | 39.0 | 0 | 0 | 112050 | 0.0 | A36 | S | 
| 815 | 816 | 0 | 1 | Fry, Mr. Richard | male | NaN | 0 | 0 | 112058 | 0.0 | B102 | S | 
| 822 | 823 | 0 | 1 | Reuchlin, Jonkheer. John George | male | 38.0 | 0 | 0 | 19972 | 0.0 | NaN | S | 
All of them boarded in Southampton. Let's see what we have for Southampton
In [11]:
from_southampton = train["Embarked"] == "S"
train[(~zero_fare) & from_southampton].groupby("Pclass")["Fare"].describe()
Out[11]:
Pclass       
1       count    124.000000
        mean      73.357560
        std       57.743728
        min        5.000000
        25%       30.375000
        50%       52.827100
        75%       84.231250
        max      263.000000
2       count    158.000000
        mean      21.099367
        std       13.285582
        min       10.500000
        25%       13.000000
        50%       14.500000
        75%       26.000000
        max       73.500000
3       count    349.000000
        mean      14.811923
        std       13.259006
        min        6.237500
        25%        7.875000
        50%        8.050000
        75%       16.100000
        max       69.550000
Name: Fare, dtype: float64
While looking into how to fill in the missing fares, I saw that some 3rd-class passengers paid much more than others.
In [12]:
third_class = train["Pclass"] == 3
train[(train["Fare"] > 69) & third_class]
Out[12]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 159 | 160 | 0 | 3 | Sage, Master. Thomas Henry | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S | 
| 180 | 181 | 0 | 3 | Sage, Miss. Constance Gladys | female | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S | 
| 201 | 202 | 0 | 3 | Sage, Mr. Frederick | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S | 
| 324 | 325 | 0 | 3 | Sage, Mr. George John Jr | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S | 
| 792 | 793 | 0 | 3 | Sage, Miss. Stella Anna | female | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S | 
| 846 | 847 | 0 | 3 | Sage, Mr. Douglas Bullen | male | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S | 
| 863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.55 | NaN | S | 
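This is a big family, so 69.55 seems to be the fare for the whole family: roughly 69.55 / 10 = 6.955 per person. Here are the fare statistics for the other 3rd-class passengers from Southampton, with this ticket excluded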
In [13]:
third_from_s = from_southampton & third_class
train[(~ zero_fare) & third_from_s & (train["Ticket"] != "CA. 2343")].groupby("Pclass")["Fare"].describe()
Out[13]:
Pclass       
3       count    342.000000
        mean      13.691554
        std       10.800204
        min        6.237500
        25%        7.854200
        50%        8.050000
        75%       15.900000
        max       56.495800
Name: Fare, dtype: float64
Let's fill in the missing fare data
But first, I'll create a new dataset by combining the train and test datasets in order to get a more precise mean of "Fare".
In [14]:
all_data = pd.concat([train, test])  # DataFrame.append was removed in pandas 2.0
As we discovered with the Sage family, some passengers share the same ticket number, and the fare is the combined price for the whole family/group.
Let's create a new column with the count of passengers that share the same ticket number.
In [15]:
all_data["group"] = all_data.groupby("Ticket")["PassengerId"].transform("count")
Now I can calculate the ticket price per passenger
In [16]:
all_data["ticket_price"] = all_data["Fare"] / all_data["group"]
all_data[all_data["ticket_price"] > 0].groupby("Pclass").ticket_price.describe()
Out[16]:
Pclass       
1       count    316.000000
        mean      34.661682
        std       14.675124
        min        5.000000
        25%       26.550000
        50%       30.000000
        75%       39.133350
        max      128.082300
2       count    271.000000
        mean      11.663652
        std        2.031927
        min        5.250000
        25%       10.500000
        50%       12.650000
        75%       13.000000
        max       16.000000
3       count    704.000000
        mean       7.370788
        std        1.367423
        min        3.170800
        25%        7.061975
        50%        7.750000
        75%        7.925000
        max       19.966700
Name: ticket_price, dtype: float64
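The two cells above boil down to "count the passengers per ticket, then divide the fare by that count". Here is a small plain-Python sketch of the same idea, using a few rows that appear in the tables of this post; the variable names are my own and it is meant only to show the arithmetic, not to replace the pandas code.

```python
from collections import Counter

# (PassengerId, Ticket, Fare) rows copied from the tables in this post.
passengers = [
    (259, "PC 17755", 512.3292),
    (680, "PC 17755", 512.3292),
    (738, "PC 17755", 512.3292),
    (1235, "PC 17755", 512.3292),
    (1, "A/5 21171", 7.25),
]

# Plain-Python counterpart of
# groupby("Ticket")["PassengerId"].transform("count").
group_size = Counter(ticket for _, ticket, _ in passengers)

# Fare is the price of the whole ticket, so split it across the group.
ticket_price = {pid: fare / group_size[ticket]
                for pid, ticket, fare in passengers}

for pid, ticket, fare in passengers:
    print("{:>5} {:<10} group={} ticket_price={:.4f}".format(
        pid, ticket, group_size[ticket], ticket_price[pid]))
```

The four passengers on ticket PC 17755 each end up with a quarter of 512.3292, while a passenger travelling on a ticket of their own keeps the full fare as their price.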
I noticed an outlier in the 1st-class data: the max ticket_price is 128.082300, while the mean is about 34.66. Interesting
In [17]:
all_data[all_data["ticket_price"] > 128]
Out[17]:
| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket | group | ticket_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 258 | 35.0 | NaN | C | 512.3292 | Ward, Miss. Anna | 0 | 259 | 1 | female | 0 | 1.0 | PC 17755 | 4 | 128.0823 | 
| 679 | 36.0 | B51 B53 B55 | C | 512.3292 | Cardeza, Mr. Thomas Drake Martinez | 1 | 680 | 1 | male | 0 | 1.0 | PC 17755 | 4 | 128.0823 | 
| 737 | 35.0 | B101 | C | 512.3292 | Lesurer, Mr. Gustave J | 0 | 738 | 1 | male | 0 | 1.0 | PC 17755 | 4 | 128.0823 | 
| 343 | 58.0 | B51 B53 B55 | C | 512.3292 | Cardeza, Mrs. James Warburton Martinez (Charlo... | 1 | 1235 | 1 | female | 0 | NaN | PC 17755 | 4 | 128.0823 | 
According to https://www.encyclopedia-titanica.org/titanic-survivor/thomas-cardeza.html, these were a mother and son with their 2 servants. They occupied the most expensive cabins.
In [18]:
all_data[(all_data["SibSp"] + all_data["Parch"]) + 1 > all_data["group"]].head()
Out[18]:
| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket | group | ticket_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | NaN | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | 0.0 | A/5 21171 | 1 | 7.2500 | 
| 38 | 18.0 | NaN | S | 18.0000 | Vander Planke, Miss. Augusta Maria | 0 | 39 | 3 | female | 2 | 0.0 | 345764 | 2 | 9.0000 | 
| 40 | 40.0 | NaN | S | 9.4750 | Ahlin, Mrs. Johan (Johanna Persdotter Larsson) | 0 | 41 | 3 | female | 1 | 0.0 | 7546 | 1 | 9.4750 | 
| 68 | 17.0 | NaN | S | 7.9250 | Andersson, Miss. Erna Alexandra | 2 | 69 | 3 | female | 4 | 1.0 | 3101281 | 1 | 7.9250 | 
| 69 | 26.0 | NaN | S | 8.6625 | Kink, Mr. Vincenz | 0 | 70 | 3 | male | 2 | 0.0 | 315151 | 1 | 8.6625 | 
It seems that not all family members share the same ticket number.
Let's update the "group" column with the maximum of its current value and Parch + SibSp + 1, whichever is bigger.
In [19]:
all_data["group"] = all_data.groupby("Ticket")["PassengerId"].transform("count")
all_data["family"] =  all_data["SibSp"] + all_data["Parch"] + 1
all_data["group"] = all_data[["family", "group"]].max(axis=1)
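As a quick sanity check of the rule in the cell above, here is the same max(family, ticket group) logic applied by hand to three rows from the Out[18] table. The function name `effective_group` is hypothetical and used only for this illustration.

```python
# (Name, SibSp, Parch, ticket-based group) copied from the Out[18] table.
rows = [
    ("Braund, Mr. Owen Harris", 1, 0, 1),
    ("Vander Planke, Miss. Augusta Maria", 2, 0, 2),
    ("Andersson, Miss. Erna Alexandra", 4, 2, 1),
]


def effective_group(sibsp, parch, ticket_group):
    """Return the larger of the family size and the ticket-based group size."""
    family = sibsp + parch + 1  # +1 counts the passenger themselves
    return max(family, ticket_group)


updated = {name: effective_group(sibsp, parch, group)
           for name, sibsp, parch, group in rows}

for name, size in updated.items():
    print("{}: group = {}".format(name, size))
```

Note that, as far as the cells shown here go, ticket_price was computed in In [16] before this update, so it still reflects the ticket-based group size unless it is recomputed.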
Now I can calculate ticket_price for the passengers whose original "Fare" value was missing or zero
In [20]:
zero_fare = (all_data["Fare"] == 0) | (all_data["Fare"].isnull())
first_class = all_data["Pclass"] == 1
second_class = all_data["Pclass"] == 2
third_class = all_data["Pclass"] == 3
# all passengers with a missing "Fare" value boarded in Southampton
from_southampton = all_data["Embarked"] == "S"
first_from_s = from_southampton & first_class
second_from_s = from_southampton & second_class
third_from_s = from_southampton & third_class
all_data.loc[zero_fare & first_class, "ticket_price"] = all_data[~zero_fare & first_from_s].ticket_price.mean()
all_data.loc[zero_fare & second_class, "ticket_price"] = all_data[~zero_fare & second_from_s].ticket_price.mean()
all_data.loc[zero_fare & third_class, "ticket_price"] = all_data[~zero_fare & third_from_s].ticket_price.mean()
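The cell above is mean imputation per passenger class. A minimal plain-Python sketch of that technique follows, assuming a handful of made-up prices (they are hypothetical, not values from the dataset) and leaving out the Southampton filter for brevity.

```python
# Hypothetical per-passenger prices; None or 0 marks a missing fare.
rows = [
    {"Pclass": 1, "ticket_price": 30.0},
    {"Pclass": 1, "ticket_price": 40.0},
    {"Pclass": 1, "ticket_price": 0.0},
    {"Pclass": 3, "ticket_price": 7.0},
    {"Pclass": 3, "ticket_price": 8.0},
    {"Pclass": 3, "ticket_price": None},
]


def is_missing(price):
    """Treat None and 0 as missing, similar to the zero_fare mask above."""
    return price is None or price == 0


def class_means(rows):
    """Mean ticket_price per Pclass, ignoring rows with a missing price."""
    sums, counts = {}, {}
    for row in rows:
        price = row["ticket_price"]
        if is_missing(price):
            continue
        sums[row["Pclass"]] = sums.get(row["Pclass"], 0.0) + price
        counts[row["Pclass"]] = counts.get(row["Pclass"], 0) + 1
    return {pclass: sums[pclass] / counts[pclass] for pclass in sums}


means = class_means(rows)

# Copy each row and fill missing prices with the mean of the same class.
filled = []
for row in rows:
    new_row = dict(row)
    if is_missing(new_row["ticket_price"]):
        new_row["ticket_price"] = means[new_row["Pclass"]]
    filled.append(new_row)

for row in filled:
    print(row)
```

Computing the means only from rows with a known price matters here: including the zero fares would pull every class mean down.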
Time to investigate age data
In [21]:
age_null = all_data["Age"].isnull()
all_data[~age_null]["Age"].describe() 
Out[21]:
count    1046.000000
mean       29.881138
std        14.413493
min         0.170000
25%        21.000000
50%        28.000000
75%        39.000000
max        80.000000
Name: Age, dtype: float64
In [22]:
all_data[age_null].groupby("Pclass")["PassengerId"].count()
Out[22]:
Pclass
1     39
2     16
3    208
Name: PassengerId, dtype: int64
In [23]:
all_data[~age_null & third_class].groupby("Sex").Age.describe()
Out[23]:
Sex          
female  count    152.000000
        mean      22.185329
        std       12.205254
        min        0.170000
        25%       16.000000
        50%       22.000000
        75%       30.000000
        max       63.000000
male    count    349.000000
        mean      25.962264
        std       11.682415
        min        0.330000
        25%       20.000000
        50%       25.000000
        75%       32.000000
        max       74.000000
Name: Age, dtype: float64
In [24]:
all_data[~age_null & first_class].groupby("Sex").Age.describe()
Out[24]:
Sex          
female  count    133.000000
        mean      37.037594
        std       14.272460
        min        2.000000
        25%       24.000000
        50%       36.000000
        75%       48.000000
        max       76.000000
male    count    151.000000
        mean      41.029272
        std       14.578529
        min        0.920000
        25%       30.000000
        50%       42.000000
        75%       50.000000
        max       80.000000
Name: Age, dtype: float64
It seems that the median age in first class is higher than in third class, and that women are younger than men.
Let's use this information to fill in our missing data
In [25]:
women = all_data["Sex"] == "female"
men = all_data["Sex"] == "male"
all_data.loc[age_null & first_class & women, "Age"] = all_data[~age_null & first_class & women].Age.mean()
all_data.loc[age_null & first_class & men, "Age"] = all_data[~age_null & first_class & men].Age.mean()
all_data.loc[age_null & second_class & women, "Age"] = all_data[~age_null & second_class & women].Age.mean()
all_data.loc[age_null & second_class & men, "Age"] = all_data[~age_null & second_class & men].Age.mean()
all_data.loc[age_null & third_class & women, "Age"] = all_data[~age_null & third_class & women].Age.mean()
all_data.loc[age_null & third_class & men, "Age"] = all_data[~age_null & third_class & men].Age.mean()
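The six assignments above can also be written as a single group-wise fill. The sketch below shows that alternative on a tiny made-up frame (the ages are hypothetical); it assumes pandas is available and uses `groupby(...).transform("mean")` together with `fillna`, which should give the same per-class, per-sex means.

```python
import pandas as pd

# Tiny hypothetical frame; the notebook itself works on all_data.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 40.0, float("nan"), 20.0, 30.0, float("nan")],
})

# Mean age of each (Pclass, Sex) group, broadcast back to every row.
# Missing ages are skipped when the mean is computed.
group_mean_age = df.groupby(["Pclass", "Sex"])["Age"].transform("mean")

# Only rows with a missing Age receive their group's mean.
df["Age"] = df["Age"].fillna(group_mean_age)

print(df)
```

The explicit masks in the notebook are easier to follow step by step; the grouped version is shorter and does not need a new line for every class and sex combination.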
And finally, propagate the calculated age and fare back to our original train and test datasets
In [26]:
train_filter = all_data[all_data.PassengerId.isin(train.PassengerId)]
test_filter = all_data[all_data.PassengerId.isin(test.PassengerId)]
train.loc[train["Age"].isnull(), "Age"] = train_filter.Age
test.loc[test["Age"].isnull(), "Age"] = test_filter.Age
train["ticket_price"] = train_filter.ticket_price
test["ticket_price"] = test_filter.ticket_price
train["group"] = train_filter.group
test["group"] = test_filter.group
Prepare the data for ML by converting the categorical columns to a numeric (dummy-variable) representation
In [27]:
df_categorical_train = pd.get_dummies(train[["Pclass", "Sex", "Embarked"]])
df_categorical_test = pd.get_dummies(test[["Pclass", "Sex", "Embarked"]])
target_train = train["Survived"].values
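To see what `pd.get_dummies` produces for these three columns, here is a small sketch on the first three passengers from the `train.head()` output above. It assumes pandas is available; the variable names `sample` and `encoded` are mine.

```python
import pandas as pd

# Pclass, Sex and Embarked of the first three rows shown in train.head().
sample = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# String columns are expanded into one indicator column per category;
# the integer Pclass column is passed through unchanged.
encoded = pd.get_dummies(sample[["Pclass", "Sex", "Embarked"]])

print(list(encoded.columns))
print(encoded)
```

Because Pclass is an integer column, it stays a single numeric feature. If separate indicator columns per class are wanted, the `columns` argument of `pd.get_dummies` can list "Pclass" explicitly.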
In [28]:
#add visual utils
from inspect import getsourcefile
import os.path as path, sys
current_dir = path.dirname(path.dirname(path.abspath(getsourcefile(lambda:0))))
sys.path.insert(0, current_dir[:current_dir.rfind(path.sep)])
import visuals as vs
import metrics as ms
In [29]:
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
In [30]:
features = ["Age", "group", "ticket_price"]
selected_features_train = pd.concat([train[features], df_categorical_train], axis=1).values
selected_features_test = pd.concat([test[features], df_categorical_test], axis=1).values
In [31]:
# Initialize the three models
clf_A = AdaBoostClassifier()
clf_B = GradientBoostingClassifier()
clf_C = RandomForestClassifier()
# Calculate the number of samples for 1%, 10%, and 100% of the training data
samples_1 = int(len(selected_features_train) * 0.01)
samples_10 = int(len(selected_features_train) * 0.1)
samples_100 = len(selected_features_train)
# Collect results on the learners
results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = ms.train_predict(clf, samples, selected_features_train, target_train, selected_features_test)
# Run metrics visualization for the three supervised learning models chosen
vs.evaluate(results)
AdaBoostClassifier trained on 8 samples.
AdaBoostClassifier trained on 89 samples.
AdaBoostClassifier trained on 891 samples.
GradientBoostingClassifier trained on 8 samples.
GradientBoostingClassifier trained on 89 samples.
GradientBoostingClassifier trained on 891 samples.
RandomForestClassifier trained on 8 samples.
RandomForestClassifier trained on 89 samples.
RandomForestClassifier trained on 891 samples.
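The `metrics` module is not shown in this post, so the body of `ms.train_predict` is unknown. Judging only by the call signature and the log lines above, it might look roughly like the hypothetical sketch below: fit the learner on the first `sample_size` rows, time it, and predict. The returned keys and the `MajorityClassifier` stand-in (used so the sketch runs without scikit-learn) are my assumptions.

```python
import time


class MajorityClassifier(object):
    """Stand-in estimator with a scikit-learn-like fit/predict interface."""

    def fit(self, X, y):
        counts = {}
        for label in y:
            counts[label] = counts.get(label, 0) + 1
        self.majority_ = max(counts, key=counts.get)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


def train_predict(learner, sample_size, X_train, y_train, X_test):
    """Hypothetical sketch: fit on the first `sample_size` rows, then predict."""
    start = time.time()
    learner.fit(X_train[:sample_size], y_train[:sample_size])
    train_time = time.time() - start

    # Accuracy on the rows the learner was trained on.
    predictions_train = learner.predict(X_train[:sample_size])
    hits = sum(1 for predicted, actual
               in zip(predictions_train, y_train[:sample_size])
               if predicted == actual)

    print("{} trained on {} samples.".format(
        learner.__class__.__name__, sample_size))
    return {
        "train_time": train_time,
        "acc_train": hits / float(sample_size),
        "pred_test": learner.predict(X_test),
    }


# Age and Survived of the five passengers shown in train.head().
X = [[22.0], [38.0], [26.0], [35.0], [35.0]]
y = [0, 1, 1, 1, 0]
result = train_predict(MajorityClassifier(), 5, X, y, [[30.0], [40.0]])
print(result["acc_train"], result["pred_test"])
```

Since the Kaggle test set has no Survived column, a helper like this can only report training-side metrics; a held-out split of the training data would be needed for an honest accuracy estimate.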
In [32]:
predictions = clf_A.predict(selected_features_test)
# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(predictions, PassengerId, columns = ["Survived"])
# Write your solution to a csv file with the name my_solution.csv
my_solution.to_csv("/home/denys/AdaBoostClassifier.csv", index_label = ["PassengerId"])

