Austin Bike Data Exploration
The data set and its description can be found here.
The GitHub page with the cleaned-up data and presentation is here.
Libraries I might use
library(ISLR)
library(tidyverse)
library(randomForest)
library(gbm)
library(MASS)
library(lubridate)
library(zoo)
library(GGally)
library(e1071) # SVM and Naive Bayes library
Loading in the Data
bikedata = read.csv("Austin_MetroBike_Trips.csv")
weatherdata = read.csv("austindaily.csv")
weatherclean = filter(weatherdata, year(X1938.06.01) > 2012)
weatherclean = weatherclean[, c(1, 2)]
colnames(weatherclean) <- c("Date", "AvgTemp")
weatherclean$Date = as.Date(strptime(weatherclean$Date, format = '%Y-%m-%d'))
      Date               AvgTemp
 Min.   :2013-01-01   Min.   :-9.20
 1st Qu.:2015-02-12   1st Qu.:15.20
 Median :2017-03-26   Median :22.00
 Mean   :2017-03-26   Mean   :20.86
 3rd Qu.:2019-05-07   3rd Qu.:27.50
 Max.   :2021-06-20   Max.   :34.00
                      NA's   :88
But the summary shows 88 dates with missing temperature values. I would like to impute each missing value from its neighboring days, but is that a sound strategy? We can test it: take the complete cases, blank out values at the same rate of missingness, impute them, and measure the error against the held-out truth.
set.seed(123)
weathercleannn = na.omit(weatherclean)
fakeweather = weathercleannn
# Blank out values at the same rate of missingness as in our original data set
nas = sample(1:nrow(weathercleannn), nrow(weathercleannn) * 0.02934703)
fakeweather$AvgTemp[nas] <- NA
# Fill each missing value with the average of the nearest observed values on either side:
# na.locf() carries the last observation forward; the reversed call carries it backward
fakeweather$AvgTemp = (na.locf(fakeweather$AvgTemp) + rev(na.locf(rev(fakeweather$AvgTemp)))) / 2
# Mean squared error of the imputed values against the held-out truth
mean((weathercleannn$AvgTemp[nas] - fakeweather$AvgTemp[nas])^2)
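As a quick sanity check of the fill trick itself, here is how the forward/backward `na.locf()` average behaves on a toy vector (the values are made up purely for illustration):

```r
library(zoo)

x <- c(10, NA, 20, NA, NA, 30)
fwd <- na.locf(x)            # carry last observation forward:  10 10 20 20 20 30
bwd <- rev(na.locf(rev(x)))  # carry next observation backward: 10 20 20 30 30 30
(fwd + bwd) / 2              # midpoint of flanking values:     10 15 20 25 25 30
```

Note that every value in a run of consecutive `NA`s receives the same midpoint of the two flanking observations; `zoo::na.approx()` would instead interpolate linearly across the run, which could matter for longer gaps.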
Great! The imputed values land within a reasonable range of the true temperatures, so we can apply this method to the full data set.
# Impute the 88 missing temperatures by averaging the nearest observed values on either side
weatherclean$AvgTemp = (na.locf(weatherclean$AvgTemp) + rev(na.locf(rev(weatherclean$AvgTemp)))) / 2
</weatherclean>
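One caveat worth flagging (the weather series here evidently begins and ends with observed values, so it does not bite in this case): by default `na.locf()` drops leading `NA`s, which shortens the vector and would break the element-wise average. Passing `na.rm = FALSE` preserves the length:

```r
library(zoo)

x <- c(NA, 5, NA, 9, NA)
length(na.locf(x))  # 4 — the leading NA was dropped, so the lengths no longer match
# na.rm = FALSE keeps the length; values with no neighbor on one side remain NA
(na.locf(x, na.rm = FALSE) + rev(na.locf(rev(x), na.rm = FALSE))) / 2
# NA  5  7  9 NA
```

If the series could start or end with a gap, those endpoints would need a separate rule (e.g. single-sided fill).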