We illustrate the task of binary classification on the spam dataset from the UCI machine learning repository (Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science).
Short description of the spam dataset, as it appears on the repository: The “spam” concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography… Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word ‘george’ and the area code ‘650’ are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter. The last column of ‘spambase.data’ denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters.
Load and parse the data as follows.
library(RCurl)
## Loading required package: bitops
downloaded.url <- getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data')
connection <- textConnection(downloaded.url)
Spam <- read.csv(connection, header=FALSE)
str(Spam)
## 'data.frame': 4601 obs. of 58 variables:
## $ V1 : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
## $ V2 : num 0.64 0.28 0 0 0 0 0 0 0 0.12 ...
## $ V3 : num 0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
## $ V4 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ V5 : num 0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
## $ V6 : num 0 0.28 0.19 0 0 0 0 0 0 0.32 ...
## $ V7 : num 0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
## $ V8 : num 0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
## $ V9 : num 0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
## $ V10: num 0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
## $ V11: num 0 0.21 0.38 0.31 0.31 0 0.96 0 0.76 0 ...
## $ V12: num 0.64 0.79 0.45 0.31 0.31 0 1.28 0 0.92 0.64 ...
## $ V13: num 0 0.65 0.12 0.31 0.31 0 0 0 0 0.25 ...
## $ V14: num 0 0.21 0 0 0 0 0 0 0 0 ...
## $ V15: num 0 0.14 1.75 0 0 0 0 0 0 0.12 ...
## $ V16: num 0.32 0.14 0.06 0.31 0.31 0 0.96 0 0 0 ...
## $ V17: num 0 0.07 0.06 0 0 0 0 0 0 0 ...
## $ V18: num 1.29 0.28 1.03 0 0 0 0.32 0 0.15 0.12 ...
## $ V19: num 1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
## $ V20: num 0 0 0.32 0 0 0 0 0 3.53 0.06 ...
## $ V21: num 0.96 1.59 0.51 0.31 0.31 0 0.64 0 2 0.71 ...
## $ V22: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V23: num 0 0.43 1.16 0 0 0 0 0 0 0.19 ...
## $ V24: num 0 0.43 0.06 0 0 0 0 0 0.15 0 ...
## $ V25: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V26: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V27: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V28: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V29: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V30: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V31: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V32: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V33: num 0 0 0 0 0 0 0 0 0.15 0 ...
## $ V34: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V35: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V36: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V37: num 0 0.07 0 0 0 0 0 0 0 0 ...
## $ V38: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V39: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V40: num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ V41: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V42: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V43: num 0 0 0.12 0 0 0 0 0 0.3 0 ...
## $ V44: num 0 0 0 0 0 0 0 0 0 0.06 ...
## $ V45: num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ V46: num 0 0 0.06 0 0 0 0 0 0 0 ...
## $ V47: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V48: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V49: num 0 0 0.01 0 0 0 0 0 0 0.04 ...
## $ V50: num 0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
## $ V51: num 0 0 0 0 0 0 0 0 0 0 ...
## $ V52: num 0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
## $ V53: num 0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
## $ V54: num 0 0.048 0.01 0 0 0 0 0 0.022 0 ...
## $ V55: num 3.76 5.11 9.82 3.54 3.54 ...
## $ V56: int 61 101 485 40 40 15 4 11 445 43 ...
## $ V57: int 278 1028 2259 191 191 54 112 49 1257 749 ...
## $ V58: int 1 1 1 1 1 1 1 1 1 1 ...
attach(Spam)
n = nrow(Spam)
The data contain 57 attributes for each of 4601 observations. We predict the label of an e-mail (spam or non-spam, 1 or 0, corresponding to the last column, denoted V58) from all 57 predictors, starting with a logistic regression model.
The logistic regression model is fitted using the command glm (Generalised Linear Model). We must pass family=binomial in its arguments to specify the link function to be used in the model; here, the logistic link. The coefficients are extracted using the function coef. To make predictions from the fitted model, we use the command predict. To tell R that we want the conditional probabilities \(P(Y=1|X=x)\), we must set type = "response" in its arguments.
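Before applying this to the spam data, the glm workflow above can be sketched on synthetic data (all names below are illustrative and not part of the Spam analysis):

```r
## Minimal glm sketch: fit, extract coefficients, predict probabilities.
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(1 + 2*x))   # true model: logit P(Y=1|x) = 1 + 2x
toy.fit <- glm(y ~ x, family = binomial)
coef(toy.fit)                          # estimated intercept and slope
p.hat <- predict(toy.fit, newdata = data.frame(x = 0), type = "response")
p.hat                                  # estimated P(Y=1 | X=0)
```

Without type = "response", predict would return values on the linear-predictor (logit) scale rather than probabilities.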
To test the prediction accuracy of the logistic regression model, we randomly divide the dataset into two subsets: train and test. We use 3000 observations to train the model, and the remaining 1601 to assess its performance.
n.train = 3000
n.test = n-n.train
train = sample(n, n.train)
test = setdiff(1:n, train)
test.data.x = Spam[test, -ncol(Spam)]
test.data.y = Spam[test, ncol(Spam)]
lr.fit = glm(V58~., data=Spam, family=binomial, subset=train)
lr.probs=predict(lr.fit, test.data.x, type="response")
lr.pred = rep(0, n.test)
lr.pred[lr.probs >.5] = 1
table(lr.pred, test.data.y)
## test.data.y
## lr.pred 0 1
## 0 956 65
## 1 53 527
The test error is:
1-mean(lr.pred == test.data.y)
## [1] 0.07370394
In addition, we plot a ROC curve to assess the test error for different threshold values \(t\): classify as spam if the predicted probability is larger than \(t\), and as non-spam otherwise. Let TP denote the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives. A ROC curve plots the sensitivity (TP/(TP+FN)) as a function of the specificity (TN/(TN+FP)); see chapter SL: FOUNDATIONS. We use the function roc, which is part of the package pROC. The Area Under the Curve (AUC) is returned when auc = TRUE.
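To make these definitions concrete, sensitivity and specificity can be computed by hand for a single threshold on toy labels and scores (illustrative values, not from the Spam analysis):

```r
## Hand-rolled sensitivity and specificity at threshold t = 0.5.
y      <- c(0, 0, 0, 1, 1, 1)          # true labels
scores <- c(0.1, 0.4, 0.6, 0.3, 0.7, 0.9)  # predicted probabilities
t <- 0.5
pred <- as.integer(scores > t)
TP <- sum(pred == 1 & y == 1); FN <- sum(pred == 0 & y == 1)
TN <- sum(pred == 0 & y == 0); FP <- sum(pred == 1 & y == 0)
sensitivity <- TP / (TP + FN)          # here 2/3
specificity <- TN / (TN + FP)          # here 2/3
```

Sweeping \(t\) from 0 to 1 and plotting the resulting (specificity, sensitivity) pairs traces out exactly the ROC curve that pROC draws.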
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc(test.data.y, lr.probs, auc=TRUE, plot=TRUE)
##
## Call:
## roc.default(response = test.data.y, predictor = lr.probs, auc = TRUE, plot = TRUE)
##
## Data: lr.probs in 1009 controls (test.data.y 0) < 592 cases (test.data.y 1).
## Area under the curve: 0.9703
We perform the analysis again using LDA.
library(MASS)
lda.fit = lda(V58~., data=Spam, subset=train)
lda.pred = predict(lda.fit, test.data.x)
names(lda.pred)
## [1] "class" "posterior" "x"
lda.class = lda.pred$class
table(lda.class, test.data.y)
## test.data.y
## lda.class 0 1
## 0 956 110
## 1 53 482
1-mean(lda.class == test.data.y)
## [1] 0.1018114
LDA performs worse than the logistic model on this dataset.
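The fit / predict / confusion-table pattern used above is generic. As a minimal self-contained sketch, the same steps on the built-in iris data (illustrative, not part of the spam analysis):

```r
## lda() on iris: fit on a random subset, evaluate on the held-out rows.
library(MASS)
set.seed(1)
tr <- sample(nrow(iris), 100)                 # random training indices
fit <- lda(Species ~ ., data = iris, subset = tr)
pred <- predict(fit, iris[-tr, ])$class       # predicted classes
table(pred, iris$Species[-tr])                # confusion matrix
err <- mean(pred != iris$Species[-tr])        # held-out error rate
```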
We now load the MNIST digit recognition training dataset into R, from Kaggle: https://www.kaggle.com/c/digit-recognizer/data
Each image is 28 by 28 pixels for a total of \(d=784\) pixels. Each pixel value is an integer between 0 and 255. We make use of the training dataset only, called ‘train.csv’, which has 785 columns, the first column being the label of the image.
data <- read.csv("train.csv")
head(names(data))
## [1] "label" "pixel0" "pixel1" "pixel2" "pixel3" "pixel4"
We divide the dataset into training and test examples. Two thirds of the original data are retained for training (corresponding to \(28000\) observations), and the remaining third is used to check the model's predictive accuracy (\(14000\) observations).
n = nrow(data)
train = 1:(2*n/3)
train.X = data[train,-1]
train.Y = data[train, 1]
test.X = data[-train,-1]
test.Y = data[-train, 1]
We reduce the dimensionality of the input space by keeping the principal components accounting for most of the variability in the data. An overview of PCA and its implementation can be found in the chapter UL: PRINCIPAL COMPONENT ANALYSIS.
pc.train = prcomp(train.X)
pc.train.X = pc.train$x
pc.var = pc.train$sdev^2
pve = pc.var/sum(pc.var)
cumsum(pve)[80]
## [1] 0.8910555
We observe that the first 80 components explain about 89% of the variance in the data. The principal components calculated on the training data are used to project the test data.
pc.test.X = predict(pc.train, newdata = test.X)
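What predict does on a prcomp object can be verified on toy data: it centres the new observations with the training means and multiplies by the training rotation matrix (illustrative names below):

```r
## predict.prcomp = centre with training means, then apply training rotation.
set.seed(1)
X.tr <- matrix(rnorm(50 * 4), 50, 4)   # toy training matrix
X.te <- matrix(rnorm(10 * 4), 10, 4)   # toy test matrix
pc <- prcomp(X.tr)                     # default: centred, unscaled
proj1 <- predict(pc, newdata = X.te)
proj2 <- scale(X.te, center = pc$center, scale = FALSE) %*% pc$rotation
stopifnot(isTRUE(all.equal(proj1, proj2, check.attributes = FALSE)))
```

Using the training means and rotation for the test data is essential: recomputing PCA on the test set would place the two sets in different coordinate systems.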
With the pre-processed data pc.train.X and pc.test.X at hand, we test the logistic regression, LDA and QDA models on the first 80 principal components. For multinomial logistic regression, we use the function multinom from the package nnet. We pass "class" as the type argument of predict to indicate that we are interested in the final classification; if one is interested in the class probabilities, one should pass "probs" instead.
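As a small self-contained sketch of the multinom interface, again on the built-in iris data (illustrative; the real analysis below uses the 80 principal components):

```r
## multinom() on iris: "class" gives labels, "probs" gives class probabilities.
library(nnet)
set.seed(1)
tr <- sample(nrow(iris), 100)
fit <- multinom(Species ~ ., data = iris, subset = tr, trace = FALSE)
pred  <- predict(fit, iris[-tr, ], "class")   # final classification
probs <- predict(fit, iris[-tr, ], "probs")   # one column per class
err <- mean(pred != iris$Species[-tr])        # held-out error rate
```

Each row of probs sums to 1, one probability per class.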
train.data = data.frame(cbind(train.Y, pc.train.X[,1:80]))
library(nnet)
lr.fit = multinom(train.Y~., data = train.data)
## # weights: 820 (729 variable)
## initial value 64472.382604
## iter 10 value 12598.427421
## iter 20 value 11945.954268
## iter 30 value 11851.976385
## iter 40 value 11827.130885
## iter 50 value 11819.976844
## iter 60 value 11818.273682
## iter 70 value 11817.781694
## iter 80 value 11817.318624
## iter 90 value 11816.441876
## iter 100 value 11812.742751
## final value 11812.742751
## stopped after 100 iterations
lr.pred = predict(lr.fit, newdata = data.frame(pc.test.X[,1:80]), "class")
table(lr.pred, test.Y)
## test.Y
## lr.pred 0 1 2 3 4 5 6 7 8 9
## 0 1052 0 9 2 4 20 11 1 3 8
## 1 1 1531 22 15 22 46 5 23 67 22
## 2 21 7 1166 47 6 11 15 19 20 3
## 3 11 8 21 1255 1 75 0 2 55 23
## 4 18 0 13 2 1198 20 12 20 15 84
## 5 210 9 8 47 1 1010 18 1 35 8
## 6 46 3 39 12 33 37 1294 1 17 3
## 7 27 2 30 20 5 14 0 1380 13 115
## 8 20 16 37 27 8 22 3 3 1118 10
## 9 5 1 8 15 38 19 0 27 26 1147
1-mean(lr.pred == test.Y)
## [1] 0.1320714
lda.fit = lda(train.Y~., data = train.data)
lda.pred = predict(lda.fit, newdata = data.frame(pc.test.X[,1:80]))
lda.class = lda.pred$class
table(lda.class, test.Y)
## test.Y
## lda.class 0 1 2 3 4 5 6 7 8 9
## 0 1322 0 10 3 3 22 18 7 6 10
## 1 0 1521 38 21 18 16 15 29 89 12
## 2 2 5 1104 50 9 3 11 12 8 3
## 3 5 2 34 1206 0 64 1 3 37 23
## 4 5 2 27 2 1163 23 19 27 12 84
## 5 38 10 14 54 6 1011 32 2 59 11
## 6 19 1 24 12 13 27 1252 0 13 2
## 7 3 3 22 28 2 11 0 1286 5 46
## 8 14 32 67 35 9 71 9 5 1104 9
## 9 3 1 13 31 93 26 1 106 36 1223
1-mean(lda.class == test.Y)
## [1] 0.1291429
qda.fit = qda(train.Y~., data = train.data)
qda.pred = predict(qda.fit, newdata = data.frame(pc.test.X[,1:80]))
qda.class = qda.pred$class
table(qda.class, test.Y)
## test.Y
## qda.class 0 1 2 3 4 5 6 7 8 9
## 0 1389 0 4 0 2 6 11 2 3 4
## 1 0 1520 0 1 1 0 1 0 3 0
## 2 6 20 1314 21 5 4 2 19 10 7
## 3 2 1 7 1370 0 19 1 2 16 14
## 4 0 4 1 3 1277 0 1 8 2 23
## 5 3 0 0 11 0 1221 28 1 5 4
## 6 1 1 2 0 2 3 1307 0 1 0
## 7 0 2 2 2 2 0 0 1399 2 22
## 8 10 29 23 31 7 19 7 15 1322 30
## 9 0 0 0 3 20 2 0 31 5 1319
1-mean(qda.class == test.Y)
## [1] 0.04014286
The QDA approach performs best here, with a test error rate of about 4%.