Work through the Kaggle Titanic competition found here using bagging, random forest, and GBM boosting.
Libraries I might use
library(tidyverse)
library(class)
library(ISLR)
library(randomForest)
library(gbm)
library(MASS)
library(rpart)
library(caTools)
Loading in the Data
train_data = read.csv("train.csv") # Train data is really my overall sample
test_data = read.csv("test.csv") # New predictions will be performed on this data set
Encoding Factors and forcing NA for char columns
train_data[train_data == ""] <- NA
train_data$Sex = factor(train_data$Sex, levels = c("male", "female"), labels = c(0, 1))
train_data$Pclass = factor(train_data$Pclass, levels = c(1, 2, 3), labels = c(0, 1, 2))
train_data$Embarked = factor(train_data$Embarked, levels = c("C", "Q", "S"), labels = c(0, 1, 2))
train_data$Survived = factor(train_data$Survived)
train_data
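The test set will need the same encoding before any model can predict on it. Here is a minimal sketch of one way to avoid repeating the steps by hand. The helper name `encode_titanic` and the tiny made-up data frame are assumptions for illustration and are not part of the original notebook.

```r
# Hypothetical helper: applies the same NA handling and factor encoding
# used above to any data frame that has the Titanic columns.
encode_titanic <- function(df) {
  # Treat empty strings as missing values
  df[df == ""] <- NA
  df$Sex      <- factor(df$Sex,      levels = c("male", "female"), labels = c(0, 1))
  df$Pclass   <- factor(df$Pclass,   levels = c(1, 2, 3),          labels = c(0, 1, 2))
  df$Embarked <- factor(df$Embarked, levels = c("C", "Q", "S"),    labels = c(0, 1, 2))
  # test.csv has no Survived column, so only encode it when present
  if ("Survived" %in% names(df)) {
    df$Survived <- factor(df$Survived)
  }
  df
}

# Toy rows standing in for the real CSV (values are made up)
toy <- data.frame(
  Sex      = c("male", "female", "female"),
  Pclass   = c(3, 1, 2),
  Embarked = c("S", "", "C"),
  stringsAsFactors = FALSE
)
toy_encoded <- encode_titanic(toy)
str(toy_encoded)
```

With a helper like this, train_data and test_data could both be passed through the same function so their factor levels stay identical.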
Exploring the data
PassengerId    Survived  Pclass  Name              Sex    Age            SibSp          Parch           Ticket            Fare
Min.   :  1.0  0:549     0:216   Length:891        0:577  Min.   : 0.42  Min.   :0.000  Min.   :0.0000  Length:891        Min.   :  0.00
1st Qu.:223.5  1:342     1:184   Class :character  1:314  1st Qu.:20.12  1st Qu.:0.000  1st Qu.:0.0000  Class :character  1st Qu.:  7.91
Median :446.0            2:491   Mode  :character         Median :28.00  Median :0.000  Median :0.0000  Mode  :character  Median : 14.45
Mean   :446.0                                             Mean   :29.70  Mean   :0.523  Mean   :0.3816                    Mean   : 32.20
3rd Qu.:668.5                                             3rd Qu.:38.00  3rd Qu.:1.000  3rd Qu.:0.0000                    3rd Qu.: 31.00
Max.   :891.0                                             Max.   :80.00  Max.   :8.000  Max.   :6.0000                    Max.   :512.33
                                                          NA's   :177

Cabin             Embarked
Length:891        0   :168
Class :character  1   : 77
Mode  :character  2   :644
                  NA's:  2
I’m predicting Survived
Intuitively, I’d expect Pclass to be a significant predictor
Likewise, don’t women and children always go first? So age and sex should be valuable predictors.
SibSp represents the number of siblings/spouses aboard. Maybe important? If there are people with you they might help you escape, but you might also stay behind for them.
Parch is the number of parents or children aboard. If you are young, I’d expect a higher number of parents to increase your chances, while if you are older, having children aboard might cause you to hold back to help them.
The information in Fare is probably already contained in Pclass.
Cabin/Ticket seem unlikely to provide much meaningful information
Embarked is the port of embarkation. Unlikely to help much, but I’m not that confident about it.
All told, our big factors are Pclass, Age, Sex, SibSp, and Parch. Maybe Embarked.
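One quick way to check intuitions like these is to tabulate the survival rate within each level of a candidate predictor. The sketch below uses a small made-up data frame rather than the real train_data, and `survival_rate_by` is a hypothetical helper name, so the numbers it prints say nothing about the Titanic itself.

```r
# Toy data standing in for train_data (values are made up, not Titanic data)
toy <- data.frame(
  Survived = factor(c(0, 1, 1, 0, 1, 0, 0, 1)),
  Sex      = factor(c(0, 1, 1, 0, 1, 0, 1, 0)),
  Pclass   = factor(c(2, 0, 1, 2, 0, 2, 2, 0))
)

# Hypothetical helper: share of survivors within each level of one predictor
survival_rate_by <- function(df, predictor) {
  counts <- table(df[[predictor]], df$Survived)
  # margin = 1 normalises each row, so each row sums to 1;
  # column "1" is then the survival rate for that level
  prop.table(counts, margin = 1)[, "1"]
}

rates_by_sex   <- survival_rate_by(toy, "Sex")
rates_by_class <- survival_rate_by(toy, "Pclass")
print(rates_by_sex)
print(rates_by_class)
```

Run against the encoded train_data, large gaps between levels would support treating that column as a useful predictor, while near-equal rates would suggest it adds little on its own.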
ggplot(train_data)+ geom_violin(aes(x=Survived, y = Age, fill = Survived))
Right away we are reminded of the 177 missing Age values seen in the summary.
We can also see that the group that did not survive has more of its density concentrated around age 25.
ggplot(train_data[which(is.na(train_data$Age)),])+ geom_bar(aes(x=Survived, fill = Survived))
Most of the passengers with a missing Age did not survive.
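The bar chart can be backed up with a cross-tabulation of "Age is missing" against the outcome. This is a minimal sketch on a made-up data frame, so the counts are illustrative only. The same calls should work on train_data, assuming Survived has been encoded as a factor as above.

```r
# Toy data standing in for train_data (values are made up)
toy <- data.frame(
  Survived = factor(c(0, 0, 1, 0, 1, 0)),
  Age      = c(22, NA, 38, NA, NA, 35)
)

# Cross-tabulate "Age is missing" against the outcome
missing_by_outcome <- table(AgeMissing = is.na(toy$Age), Survived = toy$Survived)
print(missing_by_outcome)

# Share of each outcome among the rows with a missing Age
missing_share <- prop.table(missing_by_outcome, margin = 1)["TRUE", ]
print(missing_share)
```

Since 549 of the 891 passengers (about 62%) died overall, missingness only looks informative if the share of deaths among the missing-Age rows is noticeably higher than that baseline.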