Adapted from ISLR Chapter 4 Lab
Load ISLR library
## Loading required package: ISLR
Check dataset Smarket
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
##       Year           Lag1                Lag2                Lag3
##  Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000
##  1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000
##  Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500
##  Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716
##  3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750
##  Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000
##       Lag4                Lag5              Volume           Today
##  Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000
##  1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500
##  Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500
##  Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138
##  3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750
##  Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000
##  Direction
##  Down:602
##  Up  :648
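The column names and summary above are the kind of output produced by `names()` and `summary()` on the `Smarket` data frame. The calls themselves are not shown on this page, so the following is a minimal sketch of what they could look like, assuming the ISLR package is installed.

```r
# Minimal sketch (assumption: the ISLR package is installed).
library(ISLR)

# Column names of the Smarket data frame.
names(Smarket)

# Per-column summary: quartiles for numeric columns, counts for the factor.
summary(Smarket)

# Number of rows and columns.
dim(Smarket)
```

The `Direction` counts in the summary (602 Down, 648 Up) add up to the 1250 trading days in the dataset.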
Create a dataframe for data browsing
Plot the data with smaller dots
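The code for these two steps is not shown here. One plausible version, sketched under assumptions: `smarket_df` is a hypothetical name for the browsing copy, and a scatterplot matrix drawn with `pairs()` is assumed to be the plot in question. In base R graphics, `cex` below 1 shrinks the plotting symbols.

```r
# Minimal sketch (assumptions: ISLR is installed; `smarket_df` is a
# hypothetical name; the plot is assumed to be a scatterplot matrix).
library(ISLR)

# Copy of the data as a plain data frame for browsing.
smarket_df <- as.data.frame(Smarket)
head(smarket_df)

# Scatterplot matrix of all columns, coloured by Direction.
# cex = 0.5 halves the symbol size; pch = 20 draws small filled dots.
pairs(smarket_df,
      col = as.integer(smarket_df$Direction),
      cex = 0.5,
      pch = 20)
```

In an interactive session, `View(smarket_df)` would open the same data frame in a spreadsheet-style viewer.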
Logistic regression
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Smarket)
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -1.446  -1.203   1.065   1.145   1.326
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## (Dispersion parameter for binomial family taken to be 1)
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
## Number of Fisher Scoring iterations: 3
##         1         2         3         4         5
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812
##         Direction
## glm.pred Down  Up
##     Down  145 141
##     Up    457 507
## [1] 0.5216
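The model summary, the first five fitted probabilities, the confusion matrix, and the 0.5216 accuracy above can be reproduced with calls along the following lines. This is a reconstruction, not the page's original code: it assumes the ISLR package is installed, takes the formula from the `Call:` output, and reuses the object names `glm.fit`, `glm.probs`, and `glm.pred` that appear in the code fragments further down. The name `accuracy` is introduced here for illustration.

```r
# Reconstruction sketch (assumption: ISLR is installed).
library(ISLR)

# Fit logistic regression on all 1250 days, matching the Call: output.
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               data = Smarket, family = binomial)
summary(glm.fit)

# Fitted probabilities of "Up" for the training observations.
glm.probs <- predict(glm.fit, type = "response")
glm.probs[1:5]

# Classify each day with a 0.5 threshold.
glm.pred <- ifelse(glm.probs > 0.5, "Up", "Down")

# Confusion matrix: predictions in rows, observed direction in columns.
table(glm.pred, Direction = Smarket$Direction)

# Training accuracy: share of days classified correctly.
accuracy <- mean(glm.pred == Smarket$Direction)
accuracy
```

The accuracy is the sum of the diagonal of the confusion matrix divided by the number of days, (145 + 507) / 1250.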
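Make training and test sets (train on 2001 to 2004, test on 2005)
train = Smarket$Year<2005
glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
data=Smarket,family=binomial, subset=train)
glm.probs=predict(glm.fit,newdata=Smarket[!train,],type="response")
glm.pred=ifelse(glm.probs >0.5,"Up","Down")
Direction.2005=Smarket$Direction[!train]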
##         Direction.2005
## glm.pred Down Up
##     Down   77 97
##     Up     34 44
## [1] 0.4801587
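Fit a smaller model with only Lag1 and Lag2 (no predictor in the original model was significant at conventional levels)
glm.fit=glm(Direction~Lag1+Lag2,
data=Smarket,family=binomial, subset=train)
glm.probs=predict(glm.fit,newdata=Smarket[!train,],type="response")
glm.pred=ifelse(glm.probs >0.5,"Up","Down")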
##         Direction.2005
## glm.pred Down  Up
##     Down   35  35
##     Up     76 106
## [1] 0.5595238
Check the accuracy rate on days when the model predicts Up, 106/(76+106)
## [1] 0.5824176
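Both rates for the smaller model follow directly from its confusion matrix. The sketch below rebuilds that matrix from the printed counts in base R, so it runs without the dataset; the names `conf`, `accuracy`, and `up_precision` are introduced here for illustration.

```r
# Confusion matrix for the smaller model, typed in from the printed counts.
# Rows are predictions, columns are the observed 2005 directions.
conf <- matrix(c(35, 76,    # observed Down: predicted Down, predicted Up
                 35, 106),  # observed Up:   predicted Down, predicted Up
               nrow = 2,
               dimnames = list(glm.pred = c("Down", "Up"),
                               Direction.2005 = c("Down", "Up")))

# Overall accuracy: correct predictions (the diagonal) over all 252 days.
accuracy <- sum(diag(conf)) / sum(conf)

# Accuracy restricted to days predicted "Up": correct "Up" calls
# over all "Up" calls.
up_precision <- conf["Up", "Up"] / sum(conf["Up", ])

print(accuracy)
print(up_precision)
```

The first value is 141/252 and the second is 106/182, which match the 0.5595238 and 0.5824176 printed above.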
##      Up
## Down  0
## Up    1
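The three-line output above has the shape of R's dummy coding for a two-level factor. The call that produced it is not shown, so assuming it came from `contrasts()` on `Direction`, the following base R sketch reproduces the same coding with a stand-in factor. The names `direction` and `coding` are hypothetical.

```r
# Stand-in factor with the same two levels as Smarket$Direction.
direction <- factor(c("Down", "Up"), levels = c("Down", "Up"))

# Treatment (dummy) coding: the first level is the baseline (0),
# the second level is coded 1.
coding <- contrasts(direction)
print(coding)
```

Because Up is coded 1, the fitted probabilities from the logistic regression are probabilities of the market going up, which is why values above 0.5 are classified as "Up".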
Can you interpret the results?
Logistic regression models the probability that the response belongs to a particular category. Here the response, Direction, is a factor with levels Down and Up that records whether the market had a negative or positive return on a given day, and the model estimates the probability that Direction is Up. The predictors are the percentage returns for the previous five trading days (Lag1 to Lag5) and the volume of shares traded (Volume).

Each coefficient gives the change in the log odds of an Up day for a one-unit increase in that predictor, holding the other predictors fixed. For example, a one-unit increase in Lag1 lowers the log odds of a positive return by about 0.07. The negative sign suggests that if the market rose yesterday, it is slightly less likely to rise today. However, the p-value of 0.145 gives no clear evidence of a real association between Lag1 and Direction, and no other predictor has a smaller p-value.

The full model classifies 52.2% of the daily movements correctly, but that rate is computed on the same data used to fit the model, so it is optimistic. When the model is fit on 2001 to 2004 and evaluated on 2005, accuracy drops to 48.0%, which is worse than random guessing. The smaller model reaches 56.0% accuracy on the 2005 data, and on the days when it predicts an increase in the market it is correct 58.2% of the time.
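The log-odds interpretation can be made concrete with a little arithmetic on the Lag1 row of the coefficient table. The sketch below uses only base R and the printed estimate and z value; the names `b_lag1`, `z_lag1`, `odds_ratio`, and `p_value` are introduced here for illustration.

```r
# Values copied from the Lag1 row of the coefficient table.
b_lag1 <- -0.073074   # estimate, on the log-odds scale
z_lag1 <- -1.457      # Wald z statistic

# exp(coefficient) is the multiplicative change in the odds of "Up"
# for a one-unit increase in Lag1, other predictors held fixed.
odds_ratio <- exp(b_lag1)

# Two-sided p-value implied by the z statistic under a standard normal.
p_value <- 2 * pnorm(-abs(z_lag1))

print(round(odds_ratio, 3))   # roughly 0.93
print(round(p_value, 3))      # roughly 0.145
```

An odds ratio of about 0.93 means a one-point higher return yesterday multiplies the odds of an up day by about 0.93, a small effect that the p-value says is not distinguishable from zero.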