Three Different Ways to Store Your Datasets in R
Data.frames (df) are the most common type of data structure you will encounter in R. If you import your dataset from Excel, you will most likely get a df. This makes perfect sense, because most data fits naturally into this class: columns for the variables and rows for the observations – that is the simple pattern of a data.frame, and of nearly all tabular data. There are of course plain vectors and time series as well, but those make up only a small minority of cases.
Fortunately, R offers several solid ways to deal with such datasets. We can see the same thing in R graphing, where you can choose between base R graphics (the standard that ships with R itself), the lattice package (with plots especially suited for scientific publications), and, for some years now, ggplot2 (which has become the default way of producing quality plots in R). Those are the three main ways to do the same thing in R: producing great graphs. There are alternatives and add-ons, of course, but those three are the dominant ones.
So how does this relate to data.frames and data management, you might ask?
Well, we have the same situation here. There is the function data.frame from base R, which you might already know quite well. But there is also a whole different system that does the same job: data.table from the package of the same name. You can store the same kind of data in a dt, and it offers some advantages over a standard data.frame.
And there is also data_frame, which is part of the dplyr package by Hadley Wickham. You can use it to store the same kind of data, and it covers a few of the blind spots of data.frames as well.
So again we have three solid ways to do the same things: we can store, filter, clean, or subset our data in three different ways.
So now let’s take a closer look at those three classes and compare them!
1. Data.frame (df)
With a standard data.frame, things are quite straightforward: it is already on your machine, and you do not need to load any external packages. Depending on your R skills, you will find this class more or less easy to work with. One of the main drawbacks of a data.frame is that strings – for example a column of names – are imported as factors. There are ways to prevent this, but it was the default setting of the function for a long time (until R 4.0.0 changed the stringsAsFactors default to FALSE), and it has caused a lot of confusion among users.
Another feature of a data.frame is recycling: if the data provided for a column is too short, it is automatically repeated to match the length of the other columns. Recycling occurs in a data.table as well, but data_frame eliminates this feature.
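A minimal sketch of this difference (the column names here are made up for illustration):

```r
# data.frame recycles a shorter column if its length divides the longer one:
df <- data.frame(id = 1:6, grp = c("A", "B"))  # grp is recycled to length 6
nrow(df)  # 6 rows; grp becomes A B A B A B

# data_frame (dplyr) refuses to recycle anything longer than length 1:
# data_frame(id = 1:6, grp = c("A", "B"))      # this would throw an error
```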
Furthermore – and this is unique among the three classes we are discussing here – you can provide row names for your dataset. Depending on your data this might be helpful; however, the general consensus is that row names are better avoided.
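For illustration, here is what row names look like in practice (column and row names are invented for this sketch):

```r
# row names are a data.frame-only feature among the three classes
df <- data.frame(height = c(180, 165), row.names = c("Paul", "Kim"))
rownames(df)   # "Paul" "Kim"
df["Paul", ]   # rows can then be indexed by name
```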
Let’s take a look at an example here:
```r
mydf = data.frame(a = c("Paul", "Kim", "Nora", "Sue", "Paul", "Kim"),
                  b = c("A", "A", "B", "B", "B", "C"),
                  c = rnorm(2))
mydf
##      a b          c
## 1 Paul A 0.01534917
## 2  Kim A 1.41145962
## 3 Nora B 0.01534917
## 4  Sue B 1.41145962
## 5 Paul B 0.01534917
## 6  Kim C 1.41145962
```
We have three columns: column a contains six names, column b holds a group ID, and column c is a random measurement – but only of length 2, which means recycling should happen in column c.
```r
sapply(mydf, class)
##         a         b         c
##  "factor"  "factor" "numeric"
```
If we check the class of each of those three columns, we can see that the string columns a and b are indeed stored as factors, which was not intended in this case. To prevent this you would need to set the argument stringsAsFactors = FALSE (the default since R 4.0.0).
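A short sketch of that fix, reusing the same toy names:

```r
# keep strings as character instead of factor
mydf2 <- data.frame(a = c("Paul", "Kim"), stringsAsFactors = FALSE)
class(mydf2$a)  # "character"
```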
2. Data.table (dt)
A data.table is superior to a standard df in that it requires less code: the whole class can be manipulated with fewer function calls, which not only saves time but also helps to reduce errors. Furthermore, a dt requires less computing time, which enables you to handle much larger datasets. And there are other nice side effects, such as the possibility of ordered joins or the great documentation of the package.
Note that data.table carries data.frame as a second class. That basically means that if you are working with your dt and call a function from a package that does not recognize the data.table format, R will simply fall back to data.frame and treat the dt as a df in that case. So you get all the upsides of data.table where possible, but no downsides, since the backup class data.frame covers you. This feature is also available with data_frame.
In fact, the class data.table has its own query structure, which is quite similar to SQL syntax if you are already familiar with that. In the documentation you will quite likely see something like this:
```r
nameDT[i, j, by]
```
Once you understand this structure, you can filter, sort, and order your data with it.
So let’s see what those letters actually mean:
- i stands for the subset of our dt we want to work with
- j is the actual calculation that will be performed on the subset i
- and the whole calculation will be grouped by by
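A minimal sketch of this pattern, with made-up data and column names:

```r
library(data.table)

dt <- data.table(name = c("Paul", "Kim", "Nora", "Sue"),
                 grp  = c("A", "A", "B", "B"),
                 val  = c(1, 2, 3, 4))

# i: filter rows, j: compute, by: group
dt[grp == "B", mean(val)]           # mean of val where grp == "B" -> 3.5
dt[, .(avg = mean(val)), by = grp]  # mean of val for every group
```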
```r
library(data.table)
mytable = data.table(a = c("Paul", "Kim", "Nora", "Sue", "Paul", "Kim"),
                     b = c("A", "A", "B", "B", "B", "C"),
                     c = rnorm(2))
mytable
##       a b        c
## 1: Paul A 1.153581
## 2:  Kim A 1.397833
## 3: Nora B 1.153581
## 4:  Sue B 1.397833
## 5: Paul B 1.153581
## 6:  Kim C 1.397833

sapply(mytable, class)
##           a           b           c
## "character" "character"   "numeric"
```
If we take a look at the output of a data.table in R, we can see that the row IDs are clearly separated by a colon. Furthermore, a string or character column is not automatically transformed by a dt. This is quite a nice feature, because an unnoticed factor in your data can lead to confusion later on.
There are no row names in a data.table, just the row IDs at the beginning.
By the way, if you print a dt whose number of rows exceeds the print limit (100 rows by default), only the first and last five rows are shown.
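You can see this truncated printing with a quick throwaway example:

```r
library(data.table)
big <- data.table(x = seq_len(1000))
print(big)  # only the first and last 5 rows are shown once nrow exceeds 100
```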
3. Data_frame
The first thing to keep in mind here is that you need columns of equal length. If they are not equally long, you will get an error message telling you which column needs to be fixed. The only exception to this rule is a column of length 1, which will be recycled.
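A short sketch of both rules (note that data_frame has since been superseded by tibble() in current dplyr, but behaves the same here; names are made up):

```r
library(dplyr)

# a length-1 column is the one exception: it gets recycled
ok <- data_frame(name = c("Paul", "Kim", "Nora"), grp = "A")
nrow(ok)  # 3

# ... but any other length mismatch raises an error
# data_frame(name = c("Paul", "Kim", "Nora"), grp = c("A", "B"))  # error
```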
As with a dt, when you view a large dataset only the first few rows are displayed, not the whole set as with a df.
```r
library(dplyr)
my_df = data_frame(a = c("Paul", "Kim", "Nora", "Sue", "Paul", "Kim"),
                   b = c("A", "A", "B", "B", "B", "C"),
                   c = rnorm(6))  # c needs to be of length 6
my_df
## Source: local data frame [6 x 3]
##
##      a b          c
## 1 Paul A -1.2487364
## 2  Kim A -2.7467359
## 3 Nora B  0.2346139
## 4  Sue B  0.3249917
## 5 Paul B  0.5882327
## 6  Kim C -0.2813162

sapply(my_df, class)
##           a           b           c
## "character" "character"   "numeric"
```
So we can see that the output looks quite similar to a standard df. We had to provide a column c of the same length as a and b, and we also get character instead of factor for a and b.
Conclusion
If we compare those three classes, we can see that for extended data management tasks it is beneficial to use an advanced tool like data.table. And since data.table as well as data_frame carry the standard data.frame as a backup class, you will not get stuck in case a function you are using does not recognize the class.
```r
class(mydf)
## [1] "data.frame"
class(mytable)
## [1] "data.table" "data.frame"
class(my_df)
## [1] "tbl_df"     "tbl"        "data.frame"
```
Keep those classes in mind next time you are dealing with extended datasets and data management tasks!