# Differences

This shows you the differences between two versions of the page.


rstudio:examining_data [2013/03/31 14:14] angie [Descriptive statistics] |
rstudio:examining_data [2016/05/13 20:45] (current) |
||
---|---|---|---|

Line 1: | Line 1: | ||

=====Examining Data===== | =====Examining Data===== | ||

====The Dollar Sign==== | ====The Dollar Sign==== | ||

- | One of the most useful things to know in R is that the dollar sign, $, lets you access variables within a data set. For example, if you’re looking at the dataset called labike, you might want to access the variable bike_count_pm to make a plot, to calculate the average, etc. To tell R you want that variable, use this syntax:\\ | + | One of the most useful things to know in R is that the dollar sign, ''$'', lets you access variables within a data set. For example, if you’re looking at the dataset called ''labike'', you might want to access the variable ''bike_count_pm'' to make a plot, to calculate the average, etc. To tell R you want that variable, use this syntax:\\ |

<code r> | <code r> | ||

labike$bike_count_pm | labike$bike_count_pm | ||

Line 18: | Line 18: | ||

<code r> | <code r> | ||

mean(labike$bike_count_pm) | mean(labike$bike_count_pm) | ||

+ | ## [1] 132.9 | ||

+ | </code> | ||

+ | In this example, we are taking the mean of the variable ''bike_count_pm'', and it is 132.9. Because this is only one number, the index [1] appears next to it. | ||
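To try the dollar sign without loading ''labike'', here is a minimal sketch on a made-up data frame. The name `bikes` and all of its values are assumptions for illustration only; they are not the real data.

```r
# Hypothetical stand-in for a few rows of labike; the values are made up.
bikes <- data.frame(
  name = c("A St", "B Ave", "C Blvd"),
  bike_count_pm = c(100, 150, 200)
)

# The dollar sign pulls one column out of the data frame as a vector.
counts <- bikes$bike_count_pm

# That vector can be handed to any function, such as mean().
avg <- mean(counts)
print(avg)
```

Here `avg` is 150, and R prints it with the same `[1]` index described above.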

+ | |||

+ | ==== Square brackets ==== | ||

+ | Square brackets, ''[]'', are another useful thing to know about. These brackets let you access values in a dataset by their index (described on the [[rstudio:Useful_Terms|useful terms]] page). A data frame has two dimensions, rows and then columns, so you need to either specify two numbers or leave one position blank if you want everything along that dimension. | ||

+ | |||

+ | <code r> | ||

+ | labike[10, 2] | ||

+ | ## [1] -118.3 | ||

</code> | </code> | ||

- | ''## [1] 132.9''\\ | + | <code r> |

- | In this example, we are taking the mean of the variable bike_count_pm, and it is 132.9. Because this is only one number, the index [1] appears next to it. | + | labike[10, ] |

+ | </code> | ||

+ | ''## name longitude latitude type bike_count_pm ped_count_pm \\ | ||

+ | ## 10 Echo Park & Sunset -118.3 34.08 none 121 1369''\\ | ||

+ | | ||

+ | The first example here is showing us the value in the 10th row and the second column (which is the longitude) of the ''labike'' dataset. The second example is not specifying a column, so it shows us the entire 10th row of the dataset. | ||
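The same indexing pattern can be sketched on a tiny made-up data frame. The `bikes` object and its values are hypothetical and only stand in for ''labike''.

```r
# Hypothetical three-row stand-in for labike; the values are made up.
bikes <- data.frame(
  name = c("A St", "B Ave", "C Blvd"),
  longitude = c(-118.1, -118.2, -118.3),
  bike_count_pm = c(100, 150, 200)
)

# Row 2, column 3: a single value.
one_value <- bikes[2, 3]

# Row 2, column position left blank: every column of that row (a one-row data frame).
one_row <- bikes[2, ]

# Row position left blank, column 3: every row of that column (a plain vector).
one_column <- bikes[, 3]

print(one_value)
print(one_row)
print(one_column)
```

Leaving a position blank always means "everything along this dimension", so `bikes[, 3]` gives the same values as `bikes$bike_count_pm`.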

==== Summaries of data (“Frequency tables”) ==== | ==== Summaries of data (“Frequency tables”) ==== | ||

- | If you want to know some basic frequencies or statistics about a particular variable, summary() is a very useful command. | + | If you want to know some basic frequencies or statistics about a particular variable, ''summary()'' is a very useful command. |

<code r> | <code r> | ||

summary(labike$type) | summary(labike$type) | ||

Line 29: | Line 44: | ||

{{:rstudio:summary_labike_type_.png?direct|}} | {{:rstudio:summary_labike_type_.png?direct|}} | ||

- | So, for the variable type, the summary() command tells us how many responses there were in each of those categories. | + | So, for the variable type, the ''summary()'' command tells us how many responses there were in each of those categories. |

<code r> | <code r> | ||

summary(labike$bike_count_pm) | summary(labike$bike_count_pm) | ||

Line 35: | Line 50: | ||

{{:rstudio:summary_labike_bike_count_pm_.png?direct|}} | {{:rstudio:summary_labike_bike_count_pm_.png?direct|}} | ||

- | If you use summary() on a variable that is numeric, like bike_count_pm, it gives us some basic statistics, like the minimum and maximum values, the median and mean, and the 1st and 3rd quartiles. | + | If you use ''summary()'' on a variable that is numeric, like ''bike_count_pm'', it gives us some basic statistics, like the minimum and maximum values, the median and mean, and the 1st and 3rd quartiles. |
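Since the output above is only shown as screenshots, here is a rough sketch of both behaviours of ''summary()'' on made-up vectors. The names `type` and `count` and their values are assumptions, not columns taken from ''labike''.

```r
# Hypothetical categorical variable, stored as a factor.
type <- factor(c("bike lane", "none", "none", "bike route"))

# Hypothetical numeric variable.
count <- c(10, 20, 30, 40)

# On a factor, summary() counts how many values fall in each category.
type_summary <- summary(type)
print(type_summary)

# On a numeric vector, summary() reports the minimum, quartiles, median, mean and maximum.
count_summary <- summary(count)
print(count_summary)
```

The same function gives different kinds of output because it looks at the class of the variable it is given.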

- | ==== Sorting and ordering ==== | + | ====Length and dimension ==== |

+ | Sometimes you want to know how many values are contained within a dataset or a variable. ''length()'' and ''dim()'' allow you to find that out. | ||

- | Because R is a statistical programming language, not a spreadsheet program like Excel, it isn’t as natural to sort a data set. If you think about it, why do we like sorting datasets? Well, it allows us to know the maximum and minimum value of a variable. But, summary() will tell us the maximum and minimum values. Sorting also allows us to scroll through and see the distribution of values, but it’s hard to hold enough information in our head to really grasp the distribution from scrolling (people can usually only remember 7 numbers at once, which is why we have 7-digit phone numbers!). A better way to see a distribution of values would be making a histogram, hist().\\ | + | <code r> |

+ | dim(labike) | ||

+ | </code> | ||

+ | ''## [1] 38 6''\\ | ||

- | With that said, if you want to sort your dataset, there is a way to do it. We use the order() command. But, if we use the command alone, it doesn’t give us what we want: | ||

<code r> | <code r> | ||

- | order(labike$bike_count_pm) | + | length(labike$latitude) |

+ | ## [1] 38 | ||

</code> | </code> | ||

- | {{:rstudio:order_labike_bike_count_pm_.png?direct|}} | + | This is what we expected, because the variable has exactly one value for each row of the dataset. |
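A small sketch of how ''dim()'' and ''length()'' relate, using a made-up data frame (`bikes` and its values are hypothetical):

```r
# Hypothetical stand-in with 3 rows and 2 columns.
bikes <- data.frame(
  latitude = c(34.0, 34.1, 34.2),
  longitude = c(-118.1, -118.2, -118.3)
)

# dim() gives the number of rows first, then the number of columns.
size <- dim(bikes)
print(size)

# length() of a single variable is the number of values in it, one per row.
n_values <- length(bikes$latitude)
print(n_values)

# Careful: length() of the whole data frame counts columns, not rows.
n_columns <- length(bikes)
print(n_columns)
```

If you only want one of the two dimensions, `nrow(bikes)` and `ncol(bikes)` return them separately.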

- | What are these values? Well, look through the data in the viewing pane– what row has the smallest number in the bike_count_pm column? It’s a tie between row [21] and row [30], which both have 35 in that column. Notice that the first two numbers that R printed out were 21 and 30, so it’s just giving us the list of indices, ordered by the values in bike_count_pm.\\ | + | ==== Tables (contingency tables)==== |

+ | To count how many observations fall into a given category (or combination of categories), ''table()'' is a great command. If you run it on a variable, you get a table of counts (sometimes called a contingency table).\\ | ||

+ | <code r> | ||

+ | table(cdc$gender) | ||

- | So to get the results we really want, we need to apply it to the dataset | + | ## |

- | <code r> | + | ## Female Male |

- | labike[order(labike$bike_count_pm),] | + | ## 7036 6992 |

- | </code> | + | |

- | MISSING IMAGE\\ | + | table(cdc$gender, cdc$general_health) |

- | This is what we wanted, right? Our Console window isn’t wide enough to see all the columns at once, so R is printing out the last two columns after the rest of the dataset, but we can see that the data is sorted by bike_count_pm.\\ | + | ## |

- | + | ## Excellent Very good Good Fair Poor | |

- | But, this is just printing the values out in the Console window. If you look up at your file-viewing pane, the labike data has not changed order. Again, this is because R is a programming language, not a spreadsheet program. If we want the data in the viewing pane to change, we need to save over the old version of labike, like so | + | ## Female 615 1597 2661 1106 183 |

- | <code r> | + | ## Male 1294 1982 2053 578 118 |

- | labike = labike[order(labike$bike_count_pm), ] | + | |

</code> | </code> | ||

+ | |||

+ | Notice that providing two variables as arguments resulted in a two-way contingency table.\\ | ||
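To see how the counts come about, here is a minimal sketch on four made-up respondents. The vectors `gender` and `health` are hypothetical and much smaller than the ''cdc'' data.

```r
# Hypothetical answers from four respondents.
gender <- c("Female", "Male", "Female", "Female")
health <- c("Good", "Good", "Fair", "Good")

# One variable: a count for each category.
one_way <- table(gender)
print(one_way)

# Two variables: a count for each combination of categories.
two_way <- table(gender, health)
print(two_way)

# Individual cells can be looked up by category name.
print(two_way["Female", "Good"])
```

Combinations that never occur, such as Male and Fair here, still appear in the two-way table with a count of 0.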

+ | |||

==== Determining data types==== | ==== Determining data types==== | ||

- | R has a few useful functions for learning about variables. One is class(), | + | R has a few useful functions for learning about variables. One is ''class()'', |

<code r> | <code r> | ||

class(labike) | class(labike) | ||

Line 79: | Line 102: | ||

</code> | </code> | ||

- | A similar function that provides a little more information is attributes(), | + | A similar function that provides a little more information is ''attributes()'', |

<code r> | <code r> | ||

Line 94: | Line 117: | ||

''## NULL''\\ | ''## NULL''\\ | ||

- | Notice that class(labike$latitude) told us something about the latitude variable, but that attributes(labike$latitude returned a NULL response. So, attributes() isn’t always the better choice.\\ | + | Notice that ''class(labike$latitude)'' told us something about the latitude variable, but that ''attributes(labike$latitude)'' returned a ''NULL'' response. So, ''attributes()'' isn’t always the better choice.\\ |

- | | + | |

- | ==== Tables (contingency tables)==== | + | |

- | To count how many observations fall into a given category (or combination of categories), table() is a great command. If you run it on a variable, you get a table of counts (sometimes called a contingency table).\\ | + |

- | <code r> | + | |

- | table(cdc$gender) | + | |

- | | + | |

- | ## | + | |

- | ## Female Male | + | |

- | ## 7036 6992 | + | |

- | | + | |

- | table(cdc$gender, cdc$general_health) | + | |

- | | + | |

- | ## | + | |

- | ## Excellent Very good Good Fair Poor | + | |

- | ## Female 615 1597 2661 1106 183 | + | |

- | ## Male 1294 1982 2053 578 118 | + | |

- | </code> | + | |

- | | + | |

- | Notice that providing two variables as arguments resulted in a two-way contingency table.\\ | + | |

- | | + | |

- | ==== Descriptive statistics==== | + | |

- | We’ve discussed summary() before, if you run this on a numeric variable you get a fair number of descriptive statistics.\\ | + | |

- | <code r> | + | |

- | summary(cdc$weight) | + | |

- | ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's | + |

- | ## 34.5 56.7 65.3 68.6 77.1 181.0 979 | + | |

- | </code> | + | |

- | | + | |

- | But if you want specific descriptive statistics, they also have their own separate functions that are called just what you would expect | + | |

- | <code r> | + | |

- | mean(cdc$weight, na.rm = TRUE) | + | |

- | | + | |

- | ## [1] 68.55 | + | |

- | | + | |

- | median(cdc$weight, na.rm = TRUE) | + | |

- | | + | |

- | ## [1] 65.32 | + | |

- | | + | |

- | min(cdc$weight, na.rm = TRUE) | + | |

- | | + | |

- | ## [1] 34.47 | + | |

- | | + | |

- | max(cdc$weight, na.rm = TRUE) | + | |

- | | + | |

- | ## [1] 181 | + | |

- | </code> | + | |

- | Notice the only tricky thing here– including na.rm=TRUE. If you don’t include that option, all these functions will return NA, because they are trying to compute a number and are encountering NA values. By passing na.rm=TRUE, you are telling the function to remove NA values, hence the name na.rm.\\ | + | For the dataset ''labike'', the ''attributes()'' function told us the names of every column in the dataset (the variables) as well as the row names. In this case, the row names are just numbers, but you could imagine cases when the rows would have names, maybe students in a class. Then, it tells us the class, which is the same thing the ''class()'' function told us– data.frame. |

+ | For the ''type'' variable, we got a little more information. Not only do we learn that the ''class'' is ''factor'', we also see all the ''levels''. Levels are possible values for a factor variable. If you had a multiple-choice question, maybe the levels would be A, B, C, D. In this case, they’re things like "bike route". | ||
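The difference between a factor and a plain numeric variable can be sketched with two made-up vectors. The names and values below are hypothetical, not taken from ''labike''.

```r
# Hypothetical factor variable and numeric variable.
type <- factor(c("bike lane", "none", "none"))
latitude <- c(34.0, 34.1, 34.2)

# A factor has a class and a set of levels (its possible values).
type_class <- class(type)
type_levels <- levels(type)
print(type_class)
print(type_levels)
print(attributes(type))

# A plain numeric vector has a class but no attributes at all.
lat_class <- class(latitude)
lat_attrs <- attributes(latitude)
print(lat_class)
print(lat_attrs)
```

Each level is listed once, however many times it occurs in the data, and `attributes(latitude)` is `NULL` because a bare numeric vector carries no extra information.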

- | ==== Transforming data==== | ||

- | One of the great things about R is that it acts like a calculator, and it can be used to transform data very easily. For example, say you wanted to transform the height variable in the cdc data set from height in centimeters to height in inches. We know the formula is inches = centimeters × 0.3937, and that’s easy to do in R, |