Use the TEDS2016 dataset to run a logit (logistic regression) model with female as the sole predictor. The dependent variable is the vote (1-0) for Tsai Ing-wen, the female candidate of the Democratic Progressive Party (DPP), then the opposition party. Load the dataset with the following code:

library(haven)
TEDS_2016<-read_stata("https://github.com/datageneration/home/blob/master/DataProgramming/data/TEDS_2016.dta?raw=true")

Check the dataset

names(TEDS_2016)
##  [1] "District"        "Sex"             "Age"             "Edu"            
##  [5] "Arear"           "Career"          "Career8"         "Ethnic"         
##  [9] "Party"           "PartyID"         "Tondu"           "Tondu3"         
## [13] "nI2"             "votetsai"        "green"           "votetsai_nm"    
## [17] "votetsai_all"    "Independence"    "Unification"     "sq"             
## [21] "Taiwanese"       "edu"             "female"          "whitecollar"    
## [25] "lowincome"       "income"          "income_nm"       "age"            
## [29] "KMT"             "DPP"             "npp"             "noparty"        
## [33] "pfp"             "South"           "north"           "Minnan_father"  
## [37] "Mainland_father" "Econ_worse"      "Inequality"      "inequality5"    
## [41] "econworse5"      "Govt_for_public" "pubwelf5"        "Govt_dont_care" 
## [45] "highincome"      "votekmt"         "votekmt_nm"      "Blue"           
## [49] "Green"           "No_Party"        "voteblue"        "voteblue_nm"    
## [53] "votedpp_1"       "votekmt_1"

Logistic regression

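# Logit model of the Tsai vote on gender; glm() drops rows with missing values by default
teds.fit <- glm(votetsai ~ female, data = TEDS_2016, family = binomial)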
summary(teds.fit)
## 
## Call:
## glm(formula = votetsai ~ female, family = binomial, data = TEDS_2016)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4180  -1.3889   0.9546   0.9797   0.9797  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.54971    0.08245   6.667 2.61e-11 ***
## female      -0.06517    0.11644  -0.560    0.576    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1666.5  on 1260  degrees of freedom
## Residual deviance: 1666.2  on 1259  degrees of freedom
##   (429 observations deleted due to missingness)
## AIC: 1670.2
## 
## Number of Fisher Scoring iterations: 4

Female voters are not more likely to vote for Tsai: the coefficient for “female” (-0.065) is negative, small, and not statistically significant (p = 0.576), so this model finds no evidence of a gender gap in the vote.
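
Logit coefficients are on the log-odds scale, so it can help to translate them into probabilities. The sketch below plugs the two estimates from the summary above into the inverse logit function (`plogis` in base R). The object names `b0`, `b_female`, `p_male`, `p_female`, and `or_female` are illustrative, and the coefficients are copied by hand from the printed output rather than extracted from `teds.fit`.

```r
# Estimates copied from summary(teds.fit) above
b0       <- 0.54971   # intercept: log-odds of voting for Tsai when female = 0
b_female <- -0.06517  # change in log-odds when female = 1

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
p_male   <- plogis(b0)
p_female <- plogis(b0 + b_female)

# Odds ratio for female relative to male respondents
or_female <- exp(b_female)

round(c(male = p_male, female = p_female, odds_ratio = or_female), 3)
```

The predicted probabilities come out at roughly 0.63 for men and 0.62 for women, a gap of about 1.5 percentage points. With the fitted object in hand, `predict(teds.fit, newdata = data.frame(female = c(0, 1)), type = "response")` should give the same values.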

Improve the model by adding party ID variables (KMT, DPP) and other demographic variables (Age, edu, income)

teds.fit2 <- glm(votetsai ~ female + KMT + DPP + Age + edu + income,
                 data = TEDS_2016, family = binomial)
summary(teds.fit2)
## 
## Call:
## glm(formula = votetsai ~ female + KMT + DPP + Age + edu + income, 
##     family = binomial, data = TEDS_2016)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7416  -0.3658   0.2370   0.3098   2.5712  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.73673    0.50898   3.412 0.000644 ***
## female       0.04276    0.17769   0.241 0.809828    
## KMT         -3.14616    0.25036 -12.567  < 2e-16 ***
## DPP          2.90604    0.26860  10.819  < 2e-16 ***
## Age         -0.18582    0.08132  -2.285 0.022307 *  
## edu         -0.21355    0.08135  -2.625 0.008660 ** 
## income       0.01534    0.03447   0.445 0.656222    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1661.76  on 1256  degrees of freedom
## Residual deviance:  833.61  on 1250  degrees of freedom
##   (433 observations deleted due to missingness)
## AIC: 847.61
## 
## Number of Fisher Scoring iterations: 6
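
One rough way to gauge how much the party ID and demographic variables improve the fit is McFadden's pseudo-R², taken here as one minus the ratio of residual to null deviance. This is an added illustration rather than part of the original analysis: the deviances are copied by hand from the two summaries above, and `mcfadden` is a hypothetical helper name.

```r
# McFadden's pseudo-R-squared from glm deviances.
# For a 0/1 outcome the deviance equals -2 * log-likelihood, so this ratio
# matches the usual log-likelihood definition.
mcfadden <- function(resid_dev, null_dev) 1 - resid_dev / null_dev

r2_fit1 <- mcfadden(1666.2, 1666.5)    # female only
r2_fit2 <- mcfadden(833.61, 1661.76)   # + KMT, DPP, Age, edu, income

round(c(fit1 = r2_fit1, fit2 = r2_fit2), 3)
```

The gender-only model explains essentially none of the deviance, while the second model cuts it roughly in half. The two models are estimated on slightly different samples (429 versus 433 observations dropped), so the comparison is indicative rather than exact.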

Add more variables to further improve the model

teds.fit3 <- glm(votetsai ~ female + KMT + DPP + Age + edu + income +
                   Independence + Econ_worse + Govt_dont_care +
                   Minnan_father + Mainland_father + Taiwanese,
                 data = TEDS_2016, family = binomial)
summary(teds.fit3)
## 
## Call:
## glm(formula = votetsai ~ female + KMT + DPP + Age + edu + income + 
##     Independence + Econ_worse + Govt_dont_care + Minnan_father + 
##     Mainland_father + Taiwanese, family = binomial, data = TEDS_2016)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0923  -0.3137   0.1752   0.4018   2.7948  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.30622    0.58758   0.521  0.60226    
## female          -0.09986    0.18979  -0.526  0.59878    
## KMT             -2.91362    0.25916 -11.243  < 2e-16 ***
## DPP              2.47566    0.27566   8.981  < 2e-16 ***
## Age             -0.01681    0.08932  -0.188  0.85075    
## edu             -0.12769    0.08846  -1.444  0.14887    
## income           0.02281    0.03643   0.626  0.53127    
## Independence     0.99884    0.25097   3.980 6.89e-05 ***
## Econ_worse       0.31991    0.19007   1.683  0.09236 .  
## Govt_dont_care  -0.02141    0.18852  -0.114  0.90960    
## Minnan_father   -0.23182    0.25413  -0.912  0.36166    
## Mainland_father -1.04536    0.39853  -2.623  0.00872 ** 
## Taiwanese        0.89430    0.19939   4.485 7.28e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1661.76  on 1256  degrees of freedom
## Residual deviance:  767.27  on 1244  degrees of freedom
##   (433 observations deleted due to missingness)
## AIC: 793.27
## 
## Number of Fisher Scoring iterations: 6

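With the additional variables, Age and edu are no longer statistically significant, while the two party ID variables (KMT and DPP) remain highly significant. Among the new variables, “Independence,” “Mainland_father,” and “Taiwanese” are significant at the 0.05 level. “Econ_worse” is only marginally significant (p = 0.092), and “Govt_dont_care” and “Minnan_father” are not significant.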
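
Whether the six added variables jointly improve on the second model can be checked with a likelihood-ratio test. The sketch below is an added illustration: it uses the residual deviances and degrees of freedom printed in the two summaries, which is reasonable here because both models report the same null deviance and the same number of dropped observations (433), so they are fitted to the same 1,257 cases. The object names are illustrative.

```r
# Residual deviance and residual degrees of freedom, copied from the summaries
dev2 <- 833.61; df2 <- 1250   # teds.fit2
dev3 <- 767.27; df3 <- 1244   # teds.fit3

lr_stat <- dev2 - dev3        # likelihood-ratio chi-square statistic
lr_df   <- df2 - df3          # number of added parameters

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(statistic = lr_stat, df = lr_df, p_value = p_value)
```

The statistic is about 66.3 on 6 degrees of freedom, far beyond conventional critical values, so the larger model fits significantly better. The drop in AIC from 847.61 to 793.27 points the same way. With the fitted objects available, `anova(teds.fit2, teds.fit3, test = "Chisq")` runs the same test directly.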


Logistic regression in Stata

Logistic regression

logit votetsai Independence Econ_worse Govt_dont_care Minnan_father Mainland_father Taiwanese KMT DPP age edu female

Output (Stata results table, not reproduced here)

The R model includes “income,” while the Stata model does not. The Stata command also uses the lowercase variable “age,” whereas the R models use “Age.” Despite these differences, the two models give similar results: the same variables are significant in both, with the same signs.