Austin Bike Data Exploration

Data set and description can be found here

Github page with cleaned up data and presentation here

Libraries I might use

library(ISLR)
library(tidyverse)
library(randomForest)
library(gbm)
library(MASS)
library(lubridate)
library(zoo)
library(GGally)
library(e1071) #SVM and NB library

Loading in the Data

bikedata = read.csv("Austin_MetroBike_Trips.csv")
weatherdata = read.csv("austindaily.csv")
bikedata
weatherdata

Data Cleaning

Weather Data

# The raw CSV appears to have no header row, so the first observation ("1938-06-01")
# became a column name; filter on it to keep only dates after 2012
weatherclean = filter(weatherdata, year(X1938.06.01) > 2012)
weatherclean = weatherclean[, c(1, 2)] # keep only the date and average-temperature columns
colnames(weatherclean) <- c("Date", "AvgTemp")
weatherclean$Date = as.Date(weatherclean$Date, format = '%Y-%m-%d')
summary(weatherclean)
      Date               AvgTemp
 Min.   :2013-01-01   Min.   :-9.20
 1st Qu.:2015-02-12   1st Qu.:15.20
 Median :2017-03-26   Median :22.00
 Mean   :2017-03-26   Mean   :20.86
 3rd Qu.:2019-05-07   3rd Qu.:27.50
 Max.   :2021-06-20   Max.   :34.00
                      NA's   :88     

The summary shows 88 dates with missing temperature values. I would like to impute each gap from the temperatures around it, but is that a sound strategy? We can test it: take only the complete cases, knock out values at the same rate as the original data, impute them, and compare against the true temperatures.

set.seed(123)
weathercleannn=na.omit(weatherclean)
fakeweather = weathercleannn
nas = sample(1:nrow(weathercleannn), nrow(weathercleannn)*0.02934703) # knock out temperatures at the same rate of missingness as the original data (88 NAs)
fakeweather$AvgTemp[nas] <- NA
fakeweather$AvgTemp = (na.locf(fakeweather$AvgTemp) + rev(na.locf(rev(fakeweather$AvgTemp))))/2 # fill each gap with the average of the nearest observed temperatures before and after it
mean((weathercleannn$AvgTemp[nas]-fakeweather$AvgTemp[nas])^2) # mean squared error of the imputed values
[1] 5.109943
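To make the fill concrete, here is a toy sketch (invented values, not from the dataset) of what averaging a forward and a backward na.locf pass produces:

```r
library(zoo)

x <- c(10, NA, 14, NA, NA, 20)
fwd <- na.locf(x)           # carry the last observed value forward:  10 10 14 14 14 20
bwd <- rev(na.locf(rev(x))) # carry the next observed value backward: 10 14 14 20 20 20
(fwd + bwd) / 2             # 10 12 14 17 17 20
```

A single interior NA becomes the midpoint of its two neighbors, while every NA in a longer run gets the same midpoint of the bracketing values; for long gaps, zoo's na.approx (linear interpolation) may behave better. Note also that this assumes the first and last entries are observed, since na.locf drops leading NAs by default.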

Great! A mean squared error of about 5.1 corresponds to a typical error of roughly 2.3 degrees, which is within a reasonable range of the true temperature, so we can apply this method to our dataset.

weatherclean$AvgTemp = (na.locf(weatherclean$AvgTemp) + rev(na.locf(rev(weatherclean$AvgTemp))))/2
weatherclean