Summary

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. This project examined data from accelerometers on the belt, forearm, arm, and dumbbell of six young men who performed a weight-lifting exercise, with the aim of predicting whether the exercise was performed correctly. The analysis found that the random forest model was the most accurate of the four models considered.

Exploratory data analysis

The first step involved loading the data.

# Download the training and test data sets and record the download date
fileUrl1<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv?accessType=DOWNLOAD"
download.file(fileUrl1, destfile = "pml-training.csv", method="curl")

fileUrl2<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv?accessType=DOWNLOAD"
download.file(fileUrl2, destfile = "pml-testing.csv", method="curl")
dateDownloaded<-date()

# Read both CSV files into R, keeping character columns as strings
traindata<-read.csv("pml-training.csv", header=TRUE, sep=",", stringsAsFactors = FALSE)
testdata<-read.csv("pml-testing.csv", header=TRUE, sep=",", stringsAsFactors = FALSE)

The training data set comprises 19,622 observations and 160 variables. As the goal of this project is to predict whether each participant performed the exercise correctly, it is useful to tabulate the variable classe. The classe variable has five levels, A to E: A indicates that the exercise was performed correctly, while categories B to E indicate some form of mistake.

As well as the classe variable, the training data set includes data from accelerometers on the belt, forearm, arm, and dumbbell of the six participants. However, many variables consist largely of NAs or missing values, so it was necessary to remove them from both the training and test data sets.

table(traindata$classe)
## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
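The column indices used below were chosen by inspecting the data. As an illustrative sketch (the incomplete object name and the 19,000 threshold are assumptions, not taken from the analysis above), the heavily incomplete columns could be identified programmatically as follows:

# Sketch: count NA or empty entries per column to find candidates for
# removal (the 19,000 threshold is illustrative)
incomplete<-colSums(is.na(traindata) | traindata == "")
names(traindata)[incomplete > 19000]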
# Keep only the accelerometer measurements (plus classe in the training set)
traindata<-select(traindata, (8:11),(37:49),(60:68),(84:86),102,(113:124),140,(151:159),160)
testdata<-select(testdata, (8:11),(37:49),(60:68),(84:86),102,(113:124),140,(151:159))

As a result of this data cleaning, the number of variables in the training data set was reduced to 53.

dim(traindata)
## [1] 19622    53

Creating a testing data set

In order to assess the performance of alternative models, it is necessary to split the original training set (comprising 19,622 observations) into two parts: a training data set and a testing data set. The testing data set provides an opportunity to understand how each model performs when it is required to predict out of sample.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
inTrain<-createDataPartition(y=traindata$classe, p=0.75, list=FALSE)
training<-traindata[inTrain,]
testing<-traindata[-inTrain,]
table(training$classe)
## 
##    A    B    C    D    E 
## 4185 2848 2567 2412 2706

Development of four models based on the training data set

The analysis looked at four approaches which are suitable for predicting categorical data: Linear Discriminant Analysis (LDA), tree-based methods, boosting, and random forests. For each model, we (a) estimated the parameters based on the training data set; (b) generated predictions using the testing data set; and (c) assessed the performance of each model based on the output of the confusionMatrix function.

Of particular interest was the accuracy rate for each model, as this shows the proportion of correct predictions when the model predicts out of sample. Note that a seed was set for the tree-based, boosting, and random forest models to ensure reproducibility.

Linear Discriminant Analysis

library(caret)
set.seed(109)
# Fit the LDA model on the training set, then assess its out-of-sample
# performance on the testing set
model1<-train(classe~., data=training, method="lda")
pmodel1<-predict(model1, newdata=testing)
testing$classe<-as.factor(testing$classe)
confusionMatrix(pmodel1, testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1127  138   82   52   35
##          B   26  632   99   32  158
##          C  114  108  546   81  100
##          D  123   33  111  612   98
##          E    5   38   17   27  510
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6988          
##                  95% CI : (0.6858, 0.7116)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6191          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8079   0.6660   0.6386   0.7612   0.5660
## Specificity            0.9125   0.9204   0.9005   0.9110   0.9783
## Pos Pred Value         0.7859   0.6674   0.5753   0.6264   0.8543
## Neg Pred Value         0.9228   0.9199   0.9219   0.9511   0.9092
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2298   0.1289   0.1113   0.1248   0.1040
## Detection Prevalence   0.2924   0.1931   0.1935   0.1992   0.1217
## Balanced Accuracy      0.8602   0.7932   0.7695   0.8361   0.7722

The LDA model had an accuracy of around 0.70.

Tree-based model

library(caret)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
set.seed(123)
model2<-train(classe~., method="rpart", data=training)
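A minimal sketch of the plotting and evaluation step that produced the chart and the confusion matrix below (assuming fancyRpartPlot from the already loaded rattle package was used to draw the fitted tree; the object name pmodel2 is illustrative):

# Draw the fitted classification tree and evaluate on the testing set
fancyRpartPlot(model2$finalModel)
pmodel2<-predict(model2, newdata=testing)
confusionMatrix(pmodel2, testing$classe)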

The chart shows that the model performed relatively poorly within sample, predicting that 52% of observations in the training data set were in class A when the actual percentage was 28%. The model also performed poorly out of sample (on the testing set), with an accuracy of around 0.49.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1243  359  398  345  119
##          B   22  339   28  139  106
##          C  109  251  429  320  265
##          D    0    0    0    0    0
##          E   21    0    0    0  411
## 
## Overall Statistics
##                                          
##                Accuracy : 0.4939         
##                  95% CI : (0.4798, 0.508)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3402         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8910  0.35722  0.50175   0.0000  0.45616
## Specificity            0.6520  0.92541  0.76661   1.0000  0.99475
## Pos Pred Value         0.5045  0.53470  0.31223      NaN  0.95139
## Neg Pred Value         0.9377  0.85714  0.87932   0.8361  0.89043
## Prevalence             0.2845  0.19352  0.17435   0.1639  0.18373
## Detection Rate         0.2535  0.06913  0.08748   0.0000  0.08381
## Detection Prevalence   0.5024  0.12928  0.28018   0.0000  0.08809
## Balanced Accuracy      0.7715  0.64131  0.63418   0.5000  0.72546

Boosting

Boosting is an approach which improves on the predictions of a single decision tree. It works by fitting trees sequentially to the errors, placing greater weight on large errors; as a result, it generates more accurate predictions than simple tree-based models.
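A minimal sketch of the model fit behind the output below, assuming the gbm method in caret (the seed value and the object names model3 and pmodel3 are illustrative):

set.seed(123)                 # illustrative seed
# Fit a gradient boosted tree model and evaluate on the testing set
model3<-train(classe~., method="gbm", data=training, verbose=FALSE)
pmodel3<-predict(model3, newdata=testing)
confusionMatrix(pmodel3, testing$classe)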

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1376   29    0    0    1
##          B   10  903   22    7   13
##          C    5   15  825   30    5
##          D    2    1    6  757   11
##          E    2    1    2   10  871
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9649          
##                  95% CI : (0.9594, 0.9699)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9556          
##  Mcnemar's Test P-Value : 2.632e-07       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9864   0.9515   0.9649   0.9415   0.9667
## Specificity            0.9915   0.9869   0.9864   0.9951   0.9963
## Pos Pred Value         0.9787   0.9455   0.9375   0.9743   0.9831
## Neg Pred Value         0.9946   0.9884   0.9925   0.9886   0.9925
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2806   0.1841   0.1682   0.1544   0.1776
## Detection Prevalence   0.2867   0.1947   0.1794   0.1584   0.1807
## Balanced Accuracy      0.9889   0.9692   0.9757   0.9683   0.9815

The accuracy for this model was around 0.96.

Random Forest

Random forest models build on the basic principles of decision trees but use an approach which leads to improved accuracy. By considering only a random subset of the predictors at each split, the algorithm places less weight on strong predictors, which avoids the problem of highly correlated trees. This makes the resulting predictions less variable and more reliable.

A disadvantage of random forests is that they can be slow to estimate. For this reason, the parallel package was used in conjunction with the caret package; see https://github.com/lgreski/datasciencectacontent/blob/master/markdown/pml-randomForestPerformance.md for further discussion.
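A minimal sketch of this parallel set-up, following the linked article (the number of folds, the seed, and the object names model4 and pmodel4 are assumptions):

library(parallel)
library(doParallel)
cluster<-makeCluster(detectCores() - 1)   # leave one core for the OS
registerDoParallel(cluster)
# Use k-fold cross-validation so the folds can be fitted in parallel
fitControl<-trainControl(method="cv", number=5, allowParallel=TRUE)
set.seed(123)                             # illustrative seed
model4<-train(classe~., method="rf", data=training, trControl=fitControl)
stopCluster(cluster)
registerDoSEQ()
pmodel4<-predict(model4, newdata=testing)
confusionMatrix(pmodel4, testing$classe)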

## Loading required package: foreach
## Loading required package: iterators
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    8    0    0    0
##          B    0  939   10    0    0
##          C    0    2  844   11    0
##          D    0    0    1  791    3
##          E    0    0    0    2  898
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9925          
##                  95% CI : (0.9896, 0.9947)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9905          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9895   0.9871   0.9838   0.9967
## Specificity            0.9977   0.9975   0.9968   0.9990   0.9995
## Pos Pred Value         0.9943   0.9895   0.9848   0.9950   0.9978
## Neg Pred Value         1.0000   0.9975   0.9973   0.9968   0.9993
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1915   0.1721   0.1613   0.1831
## Detection Prevalence   0.2861   0.1935   0.1748   0.1621   0.1835
## Balanced Accuracy      0.9989   0.9935   0.9920   0.9914   0.9981

The random forest model had the highest accuracy at around 0.99.

Prediction using the test data

Given its superior accuracy, the random forest model was used to predict the variable classe in the test data. It predicted correctly for 100% of the observations in the test data set.
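A minimal sketch of this final step, assuming the random forest fit is stored in model4 as above:

# Predict classe for the held-out test cases
predict(model4, newdata=testdata)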