
Jendela Statistika

Seeing the World Through Data as an Investment

WORDCLOUD AND SENTIMENT ANALYSIS OF TWITTER DATA USING R

Hello, guys!
Good evening, and welcome back to my blog.

In the previous post, WAYS COLLECTING DATA FROM TWITTER USING R, we covered how to collect the data:
# authenticate with the Twitter API via the twitteR package
library(twitteR)
api_key = "your api key"
api_secret = "your api secret"
access_token = "your access token"
access_token_secret = "your access token secret"
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

The fundamental question after we have collected the data from Twitter is: what will we do with it?
# collecting English tweets about "ahok" posted between 5 and 8 February 2017
tweets <- searchTwitteR("ahok", n=1000, lang = "en", since = "2017-02-05", until = "2017-02-08")
nDocs <- length(tweets)
The search above returns 1000 tweets; each tweet carries 16 variables, as we will see once it is converted to a data frame. How will we manage this data?
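Before anything else, it helps to peek at a few raw tweets (a quick sketch; the cut at three tweets is arbitrary):
# inspect the raw text of the first three status objects
sapply(head(tweets, 3), function(t) t$getText())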

The first step after gathering the data is to transform the tweets into a data frame:
# transform the list of tweets into a data frame format
df <- do.call("rbind", lapply(tweets, as.data.frame))
dim(df)
## [1] 1000   16
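As an aside, the twitteR package also ships a helper that performs the same conversion in one call; this is an equivalent alternative, not an extra required step:
# same result as the do.call/rbind idiom above
df <- twListToDF(tweets)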
How will we extract insight from this data?


In this post I would like to try a wordcloud analysis, to see how the words are spread, and then a sentiment analysis. The sentiment algorithm used here is based on the NRC Word-Emotion Association Lexicon of Saif Mohammad and Peter Turney. The idea is that these researchers have built a dictionary/lexicon containing lots of words with associated scores for eight different emotions and two sentiments (positive/negative). Each individual word in the lexicon has a "yes" (one) or "no" (zero) for each emotion and sentiment, and we can calculate the total sentiment of a sentence by adding up the individual sentiments for each word in the sentence.
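To make that scoring concrete, here is a minimal sketch with the syuzhet package (which we will also use below); the sample sentence is invented purely for illustration:
library(syuzhet)
# returns a one-row data frame with ten columns:
# anger, anticipation, disgust, fear, joy, sadness, surprise, trust, negative, positive
get_nrc_sentiment("I love this wonderful day, but the traffic makes me angry")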

At this stage we will clean the data with the tm package in R.

Before doing the tm cleaning, we first strip some strange text (Unicode characters, etc.):
# clean text to remove odd characters
df$text <- sapply(df$text, function(row) iconv(row, "latin1", "ASCII", sub=""))
library(tm)
myCorpus <- Corpus(VectorSource(df$text))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, content_transformer(tolower), mc.cores=1)
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english")) # removes common prepositions and conjunctions
myCorpus <- tm_map(myCorpus, removeWords, c("example")) # replace "example" with your own custom stop words
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x) # URLs have already lost their punctuation, so this pattern still catches them
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
myCorpus <- tm_map(myCorpus, stripWhitespace)
corpus_clean <- tm_map(myCorpus, PlainTextDocument) ## this ensures the final output of the corpus transformations is a PlainTextDocument
## optional: remove word stems (cleaning, cleaned, cleaner would all become clean):
## corpus_clean <- tm_map(corpus_clean, stemDocument)
To reach our goal of getting insight from the data, we must turn the text into values in the form of a matrix. The commands above are the cleaning and vectorisation that prepare the text to be laid out in such a matrix.

After that, we can create the term-document matrix:
# create the term-document matrix for analysis
myTdm <- TermDocumentMatrix(corpus_clean, control=list(wordLengths=c(1,Inf)))
myTdm
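As a quick sanity check on the matrix (a small sketch; the threshold of 20 occurrences is an arbitrary choice of mine), we can list the most frequent terms:
# list every term that occurs at least 20 times across the corpus
findFreqTerms(myTdm, lowfreq = 20)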
Once the term-document matrix is formed, we can draw the wordcloud with the following steps:
library(wordcloud)
## create a word cloud from the corpus_clean data
pal <- brewer.pal(9,"YlGnBu")
pal <- pal[-(1:4)] # drop the lightest shades so the words stay readable
set.seed(123)
wordcloud(words = corpus_clean, scale=c(4,0.5), max.words=50, random.order=FALSE,
          rot.per=0.35, use.r.layout=TRUE, colors=pal)
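If you prefer to drive the cloud from the term-document matrix we just built (a sketch that assumes myTdm is small enough to densify in memory), you can compute the word frequencies yourself:
# convert the TDM to a dense matrix and sum each term's occurrences over all documents
m <- as.matrix(myTdm)
wordFreqs <- sort(rowSums(m), decreasing = TRUE)
wordcloud(words = names(wordFreqs), freq = wordFreqs, max.words = 50,
          random.order = FALSE, colors = pal)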
Then, to find out how the tweets feel about the word "ahok", we take a sentiment analysis approach:
# sentiment analysis
library(syuzhet)
mySentiment <- get_nrc_sentiment(df$text)
df <- cbind(df, mySentiment)
sentimentTotals <- data.frame(colSums(df[,c(17:26)])) ## select the ten sentiment columns appended after the original 16 variables
names(sentimentTotals) <- "count"
sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals), sentimentTotals)
rownames(sentimentTotals) <- NULL ## the plot would be messy if these row names were kept
and make a visualization of it with ggplot2:
#plot sentiment
library(ggplot2)
ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Score")


The results above show users' sentiment on Twitter about the word "ahok" from 5 to 8 February 2017.
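As a possible follow-up (a sketch assuming the created timestamp column that twitteR puts in the data frame, plus the NRC columns we appended above), you could track how the positive and negative scores move across those days:
# average positive and negative score per tweet, per day
df$day <- as.Date(df$created)
dailySentiment <- aggregate(cbind(positive, negative) ~ day, data = df, FUN = mean)
ggplot(dailySentiment, aes(x = day)) +
  geom_line(aes(y = positive), colour = "darkgreen") +
  geom_line(aes(y = negative), colour = "red") +
  xlab("Date") + ylab("Mean score per tweet")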

Since I don't really want to intervene in the DKI election, I will leave it to you to interpret the results.

Very important:
I think the one thing that needs to be a subject of discussion here is the iconv() line at the start of the cleaning step.

It is a bugfix: when the corpus is formed, unique codes in the tweets (location codes, accented Latin letters, and other special characters) cannot be turned into a vector by the tm package, so we convert the text to plain ASCII first.

If anyone understands this behaviour more deeply and wants to share, please comment.
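To see what that line actually does, here is a minimal demonstration (the sample string is my own; "\xe9" is the Latin-1 byte for an accented e):
# the accented character cannot be represented in ASCII, so sub="" drops it
iconv("caf\xe9 ahok", "latin1", "ASCII", sub="")
## [1] "caf ahok"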

I hope this can be useful and serve as shared learning material.


Best regards,

Jendela Statistik
