Differences

This shows you the differences between two versions of the page.

rstudio:text [2013/03/22 14:21]
angie [Word Cloud]
rstudio:text [2016/05/13 20:45] (current)
Line 3: Line 3:
 ==== Initializing text ==== ==== Initializing text ====
  
-To start doing any type of text analysis, you need to initialize the text so R knows how to interpret it. The way to do this is by using the ''​initializeText''​() function,+To start doing any type of text analysis, you need to initialize the text so R knows how to interpret it. The way to do this is by using the ''​InitializeText''​() function,
 <code r> <code r>
-TwitterText = initializeText(twitterwithdate$message)+TwitterText = InitializeText(twitterwithdate$message)
 </​code>​ </​code>​
  
-As with everything else, you need to give your new data a name so it doesn’t just print out in the Console. After running this command, you should be able to see your new variable in the Workspace tab, as in Figure 22\\  +As with everything else, you need to give your new data a name so it doesn’t just print out in the Console. After running this command, you should be able to see your new variable in the Workspace tab, 
-However, if you click on it, you’ll likely get a warning like the one in Figure 23 +
 {{ :​rstudio:​textloaded.png?​direct&​700 |Text initialized}} {{ :​rstudio:​textloaded.png?​direct&​700 |Text initialized}}
-Text initialized, as seen in the Workspace pane. +However, if you click on it, you’ll likely get a warning like this: 
 {{ :​rstudio:​textpreviewwarning.png?​direct&​700 |Text preview warning}} {{ :​rstudio:​textpreviewwarning.png?​direct&​700 |Text preview warning}}
-Warning about previewing text data- I recommend saying “No”+I recommend saying “No”
  
 ==== Looking at Text Data ==== ==== Looking at Text Data ====
  
-If you want to look at text data and not get a warning like the one in 23, use the command inspect() and the head() command,+If you want to look at text data and not get a warning like the one above, use the command ​''​inspect()'' ​and the ''​head()'' ​command,
 <code r> <code r>
 inspect(head(TwitterText)) inspect(head(TwitterText))
Line 52: Line 49:
  
 ==== Processing Text==== ==== Processing Text====
-After you’ve initialized the text, you probably want to process it to convert all the words to lowercase or remove numbers or stem words. To do that, use the function ​processText(),+After you’ve initialized the text, you probably want to process it to convert all the words to lowercase or remove numbers or stem words. To do that, use the function ​''​ProcessText()''​,
  
 <code r> <code r>
-GoodText = processText(TwitterText)+GoodText = ProcessText(TwitterText)
 </​code>​ </​code>​
  
-By default, this function converts all the text to lowercase, removes ​punctionation, and removes numbers. It does not automatically remove whitespace, remove stopwords, or “stem” words (e.g. convert both “walked” and “walking” to “walk”). If you want to do any of those things (or if you don’t want to use the defaults), you can change the options,+By default, this function converts all the text to lowercase, removes ​punctuation, and removes numbers. It does not automatically remove whitespace, remove stopwords ​(e.g. words like '​the',​ '​and',​ '​there'​), or “stem” words (e.g. convert both “walked” and “walking” to “walk”). If you want to do any of those things (or if you don’t want to use the defaults), you can change the options,
  
 <code r> <code r>
-GoodText = processText(TwitterText, removenumbers = FALSE, removestopwords = TRUE)+GoodText = ProcessText(TwitterText, stopwords.list = stopwords("SMART"), removestopwords = TRUE, removenumbers = FALSE)
 </​code>​ </​code>​
 +
 +Something to note is that there are standard stopword lists. The ''ProcessText()'' function uses the "EN" list by default; the "SMART" list is a more extensive alternative. To choose one or the other, set the ''stopwords.list'' argument to ''stopwords("SMART")'' or ''stopwords("EN")''.
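To get a feel for what these processing steps do, here is a minimal sketch in base R. It is not how ''ProcessText()'' is implemented; the example messages and the tiny ''my_stopwords'' list are made up for illustration.

<code r>
# Hypothetical example messages; not taken from the twitterwithdate data.
messages <- c("The 3 dogs walked to the park!",
              "There and back again, walking.")

# Hypothetical, deliberately tiny stopword list (assumption, for illustration only).
my_stopwords <- c("the", "and", "there", "to")

clean <- tolower(messages)                # convert to lowercase
clean <- gsub("[[:punct:]]", "", clean)   # remove punctuation
clean <- gsub("[[:digit:]]", "", clean)   # remove numbers

# Split on whitespace, then drop empty strings and stopwords.
words <- unlist(strsplit(clean, "[[:space:]]+"))
words <- words[words != "" & !(words %in% my_stopwords)]
print(words)
</code>

Notice that "walked" and "walking" are still separate words, because this sketch does not do any stemming.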
  
 ==== Bar Plot of Words ==== ==== Bar Plot of Words ====
  
-Much like the normal barplot, the command ​MobilizeWordBar ​creates a bar plot of frequently occurring words in the dataset.+Much like the normal barplot, the command ​''​MakeWordBar'' ​creates a bar plot of frequently occurring words in the dataset.
 <code r> <code r>
-MobilizeWordBar(GoodText)+MakeWordBar(GoodText)
 </​code>​ </​code>​
 +{{ :​rstudio:​wordbarchart.jpeg?​direct&​300 |}}
 +
  
-{{ :​rstudio:​wordbar.jpg?​direct&​300 |Word Bar}} 
 By default, it only shows words that appeared at least 2 times, but you can change that, too. By default, it only shows words that appeared at least 2 times, but you can change that, too.
  
 <code r> <code r>
-MobilizeWordBar(GoodText, min.freq = 10)+MakeWordBar(GoodText, min.freq = 25)
 </​code>​ </​code>​
 +{{ :​rstudio:​wordbarchart25.jpeg?​direct&​300 |}}
  
-{{ :rstudio:wordbar2.jpg?​direct&​300 |Word Bar 2}}+Finally, if you want to look at the top 5 words (by total counts) you can use 
 +<code r> 
 +MakeWordBar(GoodText,​ top=5, format='​count'​) 
 +</​code>​ 
 +{{ :rstudio:wordbarchart5c.jpeg?​direct&​300 |}}
  
-Since only a few words are appearing, you might want to consider rotating them with las=2, as described in Section 8.5.5.+or you can look at the top 5% of words by using 
 +<code r> 
 +MakeWordBar(GoodText, top=5, format='count') 
 +</​code>​ 
 +{{ :​rstudio:​wordbarchart5p.jpeg?​direct&​300 |}}
  
 ==== Word Cloud ==== ==== Word Cloud ====
  
-To make a word cloud, use the function ''​MobilizeWordCloud''​. ​+To make a word cloud, use the function ''​MakeWordCloud''​. ​
 <code r> <code r>
-MobilizeWordCloud(GoodText)+MakeWordCloud(GoodText)
 </​code>​ </​code>​
 +{{ :​rstudio:​wordcloud.jpeg?​direct&​300 |}}
  
-{{ :​rstudio:​wordcloud1.jpg?​direct&​300 |Word Cloud}} 
  
-This function has two parameters you can modify. The first is color, which should be a specification for a color range. Unlike most plots, we want to have a range of colors instead of just one, so we need to specify the colors differently. The default value is "BuGn", but you can see the list by using ''display.brewer.all()'',+This function has a few parameters you can modify. The first is color, which should be a specification for a color range. Unlike most plots, we want to have a range of colors instead of just one, so we need to specify the colors differently. The default value is "BuGn", but you can see the list by using ''display.brewer.all()'',
 <code r> <code r>
 display.brewer.all(type = "​seq"​) display.brewer.all(type = "​seq"​)
Line 98: Line 107:
 {{ :​rstudio:​displaycolors.jpg?​direct&​300 |Display Colors}} {{ :​rstudio:​displaycolors.jpg?​direct&​300 |Display Colors}}
  
-The other parameter is the same as for the MobilizeWordBar(), min.freq, which again defaults to only showing words that occurred two or more times.+Another parameter is the same as for the ''MakeWordBar()'', ''min.freq'', which again defaults to only showing words that occurred two or more times.
 <code r> <code r>
-MobilizeWordCloud(GoodText, color = "​OrRd",​ min.freq = 10)+MakeWordCloud(GoodText, color = "​OrRd",​ min.freq = 10)
 </​code>​ </​code>​
  
-{{ :rstudio:wordcloud2.jpg?​direct&​300 |Word Cloud 2}}+{{ :rstudio:wordcloudorrd.jpeg?​direct&​300 |}} 
 + 
 +And just like the MakeWordBar() function, we can specify to display the top 5 (by count) words 
 +<code r>
 +MakeWordCloud(GoodText,​ top = 5, format = "​count"​) 
 +</​code>​ 
 +{{ :​rstudio:​wordcloud5c.jpeg?​direct&​300 |}} 
 + 
 +or you can display the top 5% of words 
 +<code r>
 +MakeWordCloud(GoodText,​ top = 5, format = "​count"​) 
 +</​code>​ 
 +{{ :​rstudio:​wordcloud5p.jpeg?​direct&​300 |}}
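The difference between a minimum frequency, a top number of words, and a top percentage of words can be sketched in base R. This only illustrates the idea; it does not show how ''MakeWordBar()'' or ''MakeWordCloud()'' handle ''min.freq'', ''top'', or ''format'' internally, and the word vector and variable names below are made up.

<code r>
# Hypothetical word vector standing in for processed text.
words <- c("data", "data", "data", "science", "science", "fun",
           "data", "fun", "class", "science")

# Count each word and sort from most to least frequent.
counts <- sort(table(words), decreasing = TRUE)

# Keep only words that appear at least min_freq times.
min_freq <- 2
frequent <- counts[counts >= min_freq]

# The top 2 words by count.
top_n <- head(counts, 2)

# The top 50% of distinct words, rounded up.
top_pct <- head(counts, ceiling(length(counts) * 0.50))

print(frequent)
print(top_n)
print(top_pct)
</code>

With only four distinct words, the top 2 words and the top 50% of words happen to be the same set. On real text the two selections usually differ.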