Knowledge Mining: Text mining

File: Lab_sentiment_tidytext01.R

Theme: Running sentiment analysis using the tidytext package

Data: Twitter data via REST API

### install.packages(c("easypackages","rtweet","tidyverse","RColorBrewer","tidytext","syuzhet", "plotly"))
library(easypackages)
libraries("rtweet","tidyverse","RColorBrewer","tidytext","data.table","tidyr", "plotly")
## Loading required package: rtweet
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()  masks stats::filter()
## x purrr::flatten() masks rtweet::flatten()
## x dplyr::lag()     masks stats::lag()
## Loading required package: RColorBrewer
## Loading required package: tidytext
## Loading required package: data.table
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## All packages loaded successfully

In this step, I follow the instructions at https://datageneration.org/adp/twitter/ (see the link for details).

I use the API method to collect Twitter data, i.e., the tweets are retrieved programmatically from within R via Twitter's REST API.

I first created an Essential developer account on Twitter but wasn't able to download any data.

I then created an academic research developer account on Twitter, which enabled me to obtain the data.

Once my academic research developer account was approved, I created a project named Knowledge_Mining.

The Knowledge_Mining project came with "Keys and tokens" for direct authentication when accessing Twitter data.

Enter the keys and tokens from the Twitter academic research developer account:

(Developer Portal-> Projects & Apps -> Effect of nonprofits on community subjective wellbeing -> Knowledge_Mining -> Keys and tokens)

twitter_token <- rtweet::create_token(
  app             = "Knowledge_Mining",
  consumer_key    = "YOUR_CONSUMER_KEY",    # credentials redacted; never publish real keys
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET")
tw <- search_tweets("taiwan", n=100, retryonratelimit = TRUE)

Plot tweet frequency over time

ts_plot(tw,"mins",cex=.25,alpha=1) +
  theme_bw() +
  theme(text = element_text(family="Palatino"),
        plot.title = element_text(hjust = 0.5),plot.subtitle = element_text(hjust = 0.5),plot.caption = element_text(hjust = 0.5)) +
  labs(title = "Frequency of keyword 'Taiwan' in the last 100 tweets",
       subtitle = "Tweet counts aggregated per one-minute interval",
       caption = "\nSource: Data collected from Twitter's REST API via rtweet")

Preprocess text data

twtxt <- tw$text                  # keep the raw text vector for reference
textDF <- tibble(txt = tw$text)   # wrap the text in a tibble for tidytext
tidytwt <- textDF %>%
  unnest_tokens(word, txt)        # tokenize: one lowercased word per row
tidytwt <- tidytwt %>% anti_join(stop_words) # remove stopwords
## Joining, by = "word"
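To see what `unnest_tokens()` and `anti_join(stop_words)` do, here is a minimal toy example (the sentence is made up; it assumes tidytext and dplyr are installed):

```r
library(tibble)
library(tidytext)
library(dplyr)

# A made-up one-row tibble standing in for the tweet text column
toy <- tibble(txt = "Taiwan is a beautiful island")

toy_tidy <- toy %>%
  unnest_tokens(word, txt) %>%        # lowercase, one word per row
  anti_join(stop_words, by = "word")  # drop "is", "a", etc.

toy_tidy$word
## [1] "taiwan"    "beautiful" "island"
```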
tidytwt %>%
  count(word, sort = TRUE) %>%
  filter(n > 10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab("Keyword") + ylab("Count") +
  coord_flip() + theme_bw()

tidytwt <- tidytwt %>%
  mutate(linenumber = row_number()) # create linenumber

Join the Bing sentiment lexicon and compute net sentiment per block of 12 words, roughly the average length of a tweet.

sentiment_tw <- tidytwt %>%          
  inner_join(get_sentiments("bing")) %>%
  count(index = linenumber %/% 12, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(sentiment_tw, aes(index, sentiment)) +
  geom_col(show.legend = FALSE)+theme_bw()
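The `index = linenumber %/% 12` grouping relies on integer division: consecutive word rows fall into fixed 12-word blocks. A quick base-R illustration:

```r
# Integer division buckets consecutive row numbers into fixed-size blocks:
# rows 1-11 fall into block 0, rows 12-23 into block 1, rows 24-25 into block 2.
linenumber <- 1:25
index <- linenumber %/% 12
table(index)
## index
##  0  1  2 
## 11 12  2
```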

sentiment_tw$posneg <- ifelse(sentiment_tw$sentiment > 0, 1,
                              ifelse(sentiment_tw$sentiment < 0, -1, 0))
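The nested `ifelse()` above is equivalent to base R's `sign()`, which returns 1, -1, or 0 according to the sign of its argument:

```r
# Toy vector standing in for sentiment_tw$sentiment
x <- c(3, -2, 0, 7)

# The nested ifelse() and sign() produce the same -1/0/1 coding
stopifnot(identical(ifelse(x > 0, 1, ifelse(x < 0, -1, 0)), sign(x)))
```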

Plot a stacked density chart of the sentiment scores (posneg is converted to a factor so each class gets a discrete fill color)

ggplot(sentiment_tw, aes(sentiment, fill = factor(posneg))) + 
  geom_density(alpha = 0.5, position = "stack") + 
  ggtitle("Stacked sentiment density chart") + theme_bw()
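The plotly package loaded at the top can turn a static ggplot into an interactive chart via `ggplotly()`. A minimal sketch with made-up data (the `toy` data frame is hypothetical, standing in for `sentiment_tw`):

```r
library(ggplot2)

# Hypothetical stand-in for sentiment_tw: a net score and its -1/0/1 class
toy <- data.frame(sentiment = c(-2, -1, 0, 1, 2, 3),
                  posneg    = c(-1, -1, 0, 1, 1, 1))

p <- ggplot(toy, aes(sentiment, fill = factor(posneg))) +
  geom_density(alpha = 0.5, position = "stack")

# plotly::ggplotly(p)  # renders an interactive version in the Viewer
```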

bing_word_counts <- tidytwt %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Sentiments toward Taiwan, March 9, 2022",
       x = NULL) +
  coord_flip() + theme_bw()+ theme(strip.text.x = element_text(family="Palatino"), 
                                   axis.title.x=element_text(face="bold", size=15,family="Palatino"),
                                   axis.title.y=element_text(family="Palatino"), 
                                   axis.text.x = element_text(family="Palatino"), 
                                   axis.text.y = element_text(family="Palatino"))
## Selecting by n