Austin Bikes
Austin Bike Data Exploration
Data set and description can be found here
GitHub page with cleaned-up data and presentation here
Libraries I might use
library(ISLR)         # data sets from An Introduction to Statistical Learning
library(tidyverse)    # data wrangling and plotting
library(randomForest) # random forests
library(gbm)          # gradient boosted models
library(MASS)         # classic modeling functions and data sets
library(lubridate)    # date handling
library(zoo)          # time series helpers (na.locf)
library(GGally)       # pairwise plot extensions for ggplot2
library(e1071)        # SVM and NB library
Loading in the Data
bikedata = read.csv("Austin_MetroBike_Trips.csv")
weatherdata = read.csv("austindaily.csv")
bikedata
weatherdata
Data Cleaning
Weather Data
weatherclean = filter(weatherdata, year(X1938.06.01) > 2012) # X1938.06.01 is the raw name of the date column; keep only dates after 2012
weatherclean = weatherclean[, c(1, 2)] # keep only the date and average temperature columns
colnames(weatherclean) <- c("Date","AvgTemp")
weatherclean$Date = as.Date(strptime(weatherclean$Date, format = '%Y-%m-%d'))
summary(weatherclean)
      Date               AvgTemp
 Min.   :2013-01-01   Min.   :-9.20
 1st Qu.:2015-02-12   1st Qu.:15.20
 Median :2017-03-26   Median :22.00
 Mean   :2017-03-26   Mean   :20.86
 3rd Qu.:2019-05-07   3rd Qu.:27.50
 Max.   :2021-06-20   Max.   :34.00
                      NA's   :88
But we can see that 88 dates have missing temperature values. I would like to impute each missing temperature as the average of the surrounding days, but is that a sound strategy? To check, I remove the same proportion of values from the complete cases, impute them, and compare the imputed values against the truth.
set.seed(123)
weathercleannn = na.omit(weatherclean) # complete cases only
fakeweather = weathercleannn
nas = sample(1:nrow(weathercleannn), nrow(weathercleannn) * 0.02934703) # knock out values at the same rate of missingness as in our original data set
fakeweather$AvgTemp[nas] <- NA
fakeweather$AvgTemp = (na.locf(fakeweather$AvgTemp) + rev(na.locf(rev(fakeweather$AvgTemp)))) / 2 # fill each missing value with the average of the nearest non-missing values on either side
mean((weathercleannn$AvgTemp[nas] - fakeweather$AvgTemp[nas])^2) # MSE of the imputed values against the true temperatures
[1] 5.109943
Great! An MSE of about 5.1 corresponds to a root mean squared error of roughly 2.3 degrees, so the imputed values are typically within a couple of degrees of the true temperature. That is close enough for our purposes, so we can apply this method to the full dataset.
weatherclean$AvgTemp = (na.locf(weatherclean$AvgTemp) + rev(na.locf(rev(weatherclean$AvgTemp)))) / 2 # impute the real missing values the same way
weatherclean
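A quick sanity check on the imputed series, as a minimal sketch in base R using the weatherclean data frame from above: confirm that no missing temperatures remain and that the distribution of AvgTemp is essentially unchanged.
# sanity check: no NAs should remain after imputation
sum(is.na(weatherclean$AvgTemp))

# the distribution of AvgTemp should be essentially unchanged after imputation
summary(weatherclean$AvgTemp)

# quick look at the imputed daily temperature series
plot(weatherclean$Date, weatherclean$AvgTemp, type = "l",
     xlab = "Date", ylab = "AvgTemp")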