Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. This project examined data from accelerometers on the belt, forearm, arm, and dumbbell of six young men performing a weight-lifting exercise, with the aim of predicting whether they performed the exercise correctly. The analysis found that the random forest model was the most accurate of the four models considered.
The first step involved loading the data.
fileUrl1<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv?accessType=DOWNLOAD"
download.file(fileUrl1, destfile = "pml-training.csv", method="curl")
dateDownloaded<-date()
fileUrl2<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv?accessType=DOWNLOAD"
download.file(fileUrl2, destfile = "pml-testing.csv", method="curl")
dateDownloaded<-date()
traindata<-read.csv("pml-training.csv", header=TRUE,sep=",", stringsAsFactors = FALSE)
testdata<-read.csv("pml-testing.csv", header=TRUE,sep=",", stringsAsFactors = FALSE)
The training data set comprises 19,622 observations and 160 variables. As the goal of this project is to predict whether each participant performed the exercise correctly, it is useful to tabulate the variable classe. The classe variable has five levels, A to E: A indicates that the exercise was performed correctly, and categories B to E indicate some form of mistake.
As well as the classe variable, the training data set included readings from accelerometers on the belt, forearm, arm, and dumbbell of the six participants. However, many variables consisted largely of missing values, so it was necessary to remove variables with many NAs or blank entries from both the training and the test data sets.
table(traindata$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
traindata<-select(traindata, (8:11),(37:49),(60:68),(84:86),102,(113:124),140,(151:159),160)
testdata<-select(testdata, (8:11),(37:49),(60:68),(84:86),102,(113:124),140,(151:159))
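The column indices above were chosen by inspecting the data by hand. An equivalent programmatic approach is to drop any column that is mostly NA or empty; the sketch below demonstrates the idea on a toy data frame (the 0.5 threshold is a judgment call, and for the real data the same mask computed on traindata would be applied to testdata as well):

```r
# Sketch: drop columns that are mostly NA or empty strings.
# Demonstrated on a toy data frame standing in for traindata.
toy <- data.frame(
  good   = 1:10,
  sparse = c(rep(NA, 9), 1),      # 90% missing
  blank  = c(rep("", 9), "x"),    # 90% empty strings
  classe = rep(c("A", "B"), 5),
  stringsAsFactors = FALSE
)
# TRUE for columns where more than half the values are NA or ""
mostly_missing <- sapply(toy, function(x) mean(is.na(x) | x == "") > 0.5)
cleaned <- toy[, !mostly_missing]
names(cleaned)  # "good" "classe"
```

This keeps the cleaning reproducible if the column layout of the source CSV ever changes.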
As a result of this data cleaning, the number of variables was reduced to 53.
dim(traindata)
## [1] 19622 53
In order to assess the performance of alternative models, it is necessary to split the original training set (comprising 19,622 observations) into two parts: a training data set and a testing data set. The latter provides an opportunity to understand how each model performs when it is required to predict out of sample.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
inTrain<-createDataPartition(y=traindata$classe, p=0.75, list=FALSE)
training<-traindata[inTrain,]
testing<-traindata[-inTrain,]
table(training$classe)
##
## A B C D E
## 4185 2848 2567 2412 2706
The analysis looked at four approaches suitable for predicting categorical data: Linear Discriminant Analysis (LDA), tree-based methods, boosting, and random forests. For each model, we (a) estimated the model parameters using the training data set; (b) generated predictions from the fitted model using the testing data set; and (c) assessed the model's performance based on the output of confusionMatrix.
Of particular interest was the accuracy rate for each model, as this shows the proportion of correct predictions when the model predicts out of sample. Note that a seed was set for the tree-based, boosting, and random forest models to ensure reproducibility.
library(caret)
set.seed(109)
model1<-train(classe~., data=training, method="lda")
pmodel1<-predict(model1, newdata=testing)
testing$classe<-as.factor(testing$classe)
confusionMatrix(pmodel1, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1127 138 82 52 35
## B 26 632 99 32 158
## C 114 108 546 81 100
## D 123 33 111 612 98
## E 5 38 17 27 510
##
## Overall Statistics
##
## Accuracy : 0.6988
## 95% CI : (0.6858, 0.7116)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6191
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8079 0.6660 0.6386 0.7612 0.5660
## Specificity 0.9125 0.9204 0.9005 0.9110 0.9783
## Pos Pred Value 0.7859 0.6674 0.5753 0.6264 0.8543
## Neg Pred Value 0.9228 0.9199 0.9219 0.9511 0.9092
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2298 0.1289 0.1113 0.1248 0.1040
## Detection Prevalence 0.2924 0.1931 0.1935 0.1992 0.1217
## Balanced Accuracy 0.8602 0.7932 0.7695 0.8361 0.7722
The LDA model had an accuracy of around 0.70.
library(caret)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
set.seed(123)
model2<-train(classe~., method="rpart", data=training)
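The fitted tree was then plotted and evaluated on the held-out set. A sketch of the code that produced the output below, assuming the model2 and testing objects created above (the plotting and prediction chunk was not echoed in the original):

```r
# Sketch: visualise the fitted classification tree and evaluate it
# out of sample. Assumes model2 and testing as created above.
fancyRpartPlot(model2$finalModel)
pmodel2 <- predict(model2, newdata = testing)
confusionMatrix(pmodel2, as.factor(testing$classe))
```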
Plotting the fitted tree showed that the model performed relatively poorly within sample, predicting that 52% of observations in the training data set were in category A when the actual percentage in class A was 28%. The model also performed badly out of sample (in the testing set), with an accuracy of around 0.49.
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1243 359 398 345 119
## B 22 339 28 139 106
## C 109 251 429 320 265
## D 0 0 0 0 0
## E 21 0 0 0 411
##
## Overall Statistics
##
## Accuracy : 0.4939
## 95% CI : (0.4798, 0.508)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3402
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8910 0.35722 0.50175 0.0000 0.45616
## Specificity 0.6520 0.92541 0.76661 1.0000 0.99475
## Pos Pred Value 0.5045 0.53470 0.31223 NaN 0.95139
## Neg Pred Value 0.9377 0.85714 0.87932 0.8361 0.89043
## Prevalence 0.2845 0.19352 0.17435 0.1639 0.18373
## Detection Rate 0.2535 0.06913 0.08748 0.0000 0.08381
## Detection Prevalence 0.5024 0.12928 0.28018 0.0000 0.08809
## Balanced Accuracy 0.7715 0.64131 0.63418 0.5000 0.72546
Boosting is an approach which improves on the predictions of a single decision tree. It works by fitting trees sequentially to the errors of the previous trees, placing greater weight on the observations with large errors, which generates more accurate predictions than simple tree-based models.
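The output below came from a boosted tree model fitted with caret. A sketch of the fitting code, assuming the same training/testing split and seed as above (method = "gbm" is caret's stochastic gradient boosting; verbose = FALSE suppresses the iteration-by-iteration log):

```r
# Sketch: gradient boosted trees via caret (method = "gbm").
# Assumes training and testing as created above.
set.seed(109)
model3 <- train(classe ~ ., data = training, method = "gbm", verbose = FALSE)
pmodel3 <- predict(model3, newdata = testing)
confusionMatrix(pmodel3, as.factor(testing$classe))
```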
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1376 29 0 0 1
## B 10 903 22 7 13
## C 5 15 825 30 5
## D 2 1 6 757 11
## E 2 1 2 10 871
##
## Overall Statistics
##
## Accuracy : 0.9649
## 95% CI : (0.9594, 0.9699)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9556
## Mcnemar's Test P-Value : 2.632e-07
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9864 0.9515 0.9649 0.9415 0.9667
## Specificity 0.9915 0.9869 0.9864 0.9951 0.9963
## Pos Pred Value 0.9787 0.9455 0.9375 0.9743 0.9831
## Neg Pred Value 0.9946 0.9884 0.9925 0.9886 0.9925
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2806 0.1841 0.1682 0.1544 0.1776
## Detection Prevalence 0.2867 0.1947 0.1794 0.1584 0.1807
## Balanced Accuracy 0.9889 0.9692 0.9757 0.9683 0.9815
The accuracy for this model was around 0.96.
Random forest models build on the basic principles of decision trees but use an approach which leads to improved accuracy. By considering only a random subset of the predictors at each split, they place less weight on strong predictors and so avoid the problem of highly correlated trees. This makes the resulting predictions less variable and more reliable.
A disadvantage of random forests is that they can be slow to estimate. For this reason, the parallel package was used in conjunction with the caret package. See https://github.com/lgreski/datasciencectacontent/blob/master/markdown/pml-randomForestPerformance.md for further discussion.
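Following the approach described at that link, the random forest can be fitted with a parallel backend. A sketch, assuming the training/testing objects created above (the cluster size and 5-fold cross-validation are assumptions):

```r
# Sketch: random forest with a parallel backend, per the linked write-up.
# Assumes training and testing as created above.
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1)  # leave one core for the OS
registerDoParallel(cluster)
fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
set.seed(109)
model4 <- train(classe ~ ., data = training, method = "rf",
                trControl = fitControl)
stopCluster(cluster)
registerDoSEQ()
pmodel4 <- predict(model4, newdata = testing)
confusionMatrix(pmodel4, as.factor(testing$classe))
```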
## Loading required package: foreach
## Loading required package: iterators
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 8 0 0 0
## B 0 939 10 0 0
## C 0 2 844 11 0
## D 0 0 1 791 3
## E 0 0 0 2 898
##
## Overall Statistics
##
## Accuracy : 0.9925
## 95% CI : (0.9896, 0.9947)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9905
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9895 0.9871 0.9838 0.9967
## Specificity 0.9977 0.9975 0.9968 0.9990 0.9995
## Pos Pred Value 0.9943 0.9895 0.9848 0.9950 0.9978
## Neg Pred Value 1.0000 0.9975 0.9973 0.9968 0.9993
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2845 0.1915 0.1721 0.1613 0.1831
## Detection Prevalence 0.2861 0.1935 0.1748 0.1621 0.1835
## Balanced Accuracy 0.9989 0.9935 0.9920 0.9914 0.9981
The random forest model had the highest accuracy at around 0.99.
For this reason, the random forest model was used to predict the variable classe in the test data. It predicted correctly for 100% of the observations in the test data set.
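The final predictions were generated along these lines, where model_rf stands for the fitted random forest object (hypothetical name) and testdata is the cleaned 20-observation test set read in earlier:

```r
# Sketch: predict classe for the 20 held-out test cases.
# model_rf is the fitted random forest (name assumed); testdata is the
# cleaned test set from the data-preparation step.
predict(model_rf, newdata = testdata)
```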