R Tutorials | Famous and Very Useful Pre-Installed Exercise Datasets in R

R Tutorials
15 Jul, 2015

As most of you surely know, R ships with many exercise datasets. As soon as you have installed base R, which includes the package ‘datasets’, you have ample opportunity to explore R with real-world data.
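If you want a quick overview of what is available, something along these lines should work in a fresh R session. It is only a sketch; the variable name `available` is my own choice and not something R defines.

```r
# List the datasets that ship with the 'datasets' package.
# data() returns an object whose $results element is a character matrix
# with the columns "Package", "LibPath", "Item" and "Title".
available <- data(package = "datasets")$results[, c("Item", "Title")]
head(available)

# The datasets can be used right away, without any import step.
class(iris)        # a data frame
head(mtcars, 3)    # first three rows
```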

For me as a course content creator, those datasets help tremendously: I can be sure that my students have exactly the same data available and that the results are reproducible. For you as a student this is at least as beneficial. If you reach out on a discussion board like Stack Overflow, you will see that standard datasets are used most of the time to illustrate nearly any imaginable problem.

On top of that, it saves time! Using a new dataset each time would mean spending a significant amount of time on understanding it. Each dataset has different features, such as the number and type of variables, the number of observations, completeness and many more. Since you are here to learn R as quickly and thoroughly as possible, and not to study iris biology or car mechanics, I want to reduce the time spent on dataset familiarization as much as possible.

Finally, I want to remark that by using those datasets the only step omitted from the whole data analysis procedure is the data import. There are several videos on data import in the courses R Basics and R Level 1. A .csv file, once imported properly, behaves like any other data frame and hence like ‘mtcars’.
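To see that an imported file really behaves like a built-in data frame, you could round-trip ‘mtcars’ through a temporary .csv file. This is a minimal sketch; `csv_path` and `imported` are names I made up for the illustration.

```r
# Write 'mtcars' to a temporary .csv file and read it back in.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path)

# row.names = 1 turns the first column (the car names) back into row names.
imported <- read.csv(csv_path, row.names = 1)

class(imported)                              # "data.frame", just like mtcars
dim(imported)                                # same shape as mtcars
identical(names(imported), names(mtcars))    # same variable names
```

One difference to be aware of: read.csv() guesses column types from the text, so whole-number columns such as cyl may come back as integer instead of double.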

In this article I will show you the most common standard datasets, which you will find frequently in R-Tutorials training materials as well as in the R user community. You will learn about the features of those specific datasets. All of them ship with base R (in the package ‘datasets’) except the ‘diamonds’ dataset, which comes with ggplot2.

Iris

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa

Nearly every R user has come across this dataset. It belongs to R like the Eiffel Tower belongs to Paris. You will find it in pretty much any tutorial. The data frame consists of 5 variables and 150 observations. The data is derived from a biological question: how do the flower measurements (sepal and petal length and width) differ between three iris species?

Each of the three species (setosa, versicolor, virginica) has exactly 50 observations, which makes the dataset perfectly balanced. It is also a clean dataset without missing data or nonsensical values.

The same data is also available as a 3-dimensional array called ‘iris3’.
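A short look at the array version might help here. The sketch below only uses base R; the object name `setosa` is my own.

```r
# iris3 stores the same measurements as a 3-dimensional array:
# 50 flowers x 4 measurements x 3 species.
dim(iris3)
dimnames(iris3)[[3]]           # species names used in the array

# Pull out the 50 x 4 matrix for a single species.
setosa <- iris3[, , "Setosa"]
head(setosa, 3)
```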

Variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width (continuous, numeric) and Species (categorical).

Usage: the strong point of the dataset is the variable Species as grouping factor. With the three distinct groups you can do plenty of analysis and plotting on the continuous variables. We are talking about boxplots, ANOVA and all sorts of classification methods. Of course the continuous variables can be used for regressions as well.
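The points above can be checked and tried out with a few lines of base R. Take this as a sketch of one possible workflow; the names `petal_means` and `fit` are mine, and petal length is just one of the four variables you could pick.

```r
# Check the claims about balance and completeness.
table(iris$Species)    # 50 observations per species
anyNA(iris)            # FALSE: no missing values

# Group-wise means of one continuous variable.
petal_means <- tapply(iris$Petal.Length, iris$Species, mean)
petal_means

# One-way ANOVA: does the mean petal length differ between species?
fit <- aov(Petal.Length ~ Species, data = iris)
summary(fit)
```

For the graphical counterpart, boxplot(Petal.Length ~ Species, data = iris) draws one box per species.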

Mtcars

                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C         17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE        16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL        17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3

Another classic dataset. This one offers more variety in variable types, so it can be used with a wider range of analytics tools. As the name indicates (‘mt’ stands for Motor Trend), the data compares different cars and was extracted from the 1974 Motor Trend US magazine. There are 32 observations (cars) on 11 variables; the printout above shows only the first rows. All of the variables are numeric, some continuous, some discrete. The dataset is clean, without missing values.

The variables vs and am (codes for engine type and transmission type, respectively) can be converted to factors, which simply allows you to put the observations into different groups. Once converted, they behave much like Species in ‘iris’.

	mpg	Miles/(US) gallon, continuous
	cyl	Number of cylinders, discrete
	disp	Displacement (cu.in.), continuous
	hp	Gross horsepower, continuous
	drat	Rear axle ratio, continuous
	wt	Weight (lb/1000), continuous
	qsec	1/4 mile time, continuous
	vs	Engine shape (0 = V-shaped, 1 = straight), factor!
	am	Transmission (0 = automatic, 1 = manual), factor!
	gear	Number of forward gears, discrete
	carb	Number of carburetors, discrete
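Converting vs and am to factors could look like the sketch below. The labels for vs are taken from the help page ?mtcars rather than from this article, and `cars_f` is simply a name I chose for the working copy.

```r
# Work on a copy so the original 'mtcars' stays untouched.
cars_f <- mtcars

# Convert the 0/1 codes into labelled factors.
cars_f$vs <- factor(cars_f$vs, levels = c(0, 1),
                    labels = c("V-shaped", "straight"))
cars_f$am <- factor(cars_f$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))

table(cars_f$am)                                 # group sizes
aggregate(mpg ~ am, data = cars_f, FUN = mean)   # mean mpg per transmission type
```

After the conversion, am and vs can be used as grouping variables in the same way as Species in ‘iris’.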

Usage: The dataset has a strong point in its variety. You can use it for a wide range of analyses as long as the 32 observations are enough to get reliable results. The fact that the data frame is rather small also makes it easy for data science novices to grasp general statistics concepts.

Cars

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
7    10   18
8    10   26
9    10   34
10   11   17

Not to be confused with ‘mtcars’. This dataset deals with the speed of cars and the corresponding stopping distance. In case you are wondering why the speed measurements are quite low: the data was recorded in the 1920s. You will see the ‘cars’ dataset used occasionally in R tutorials all over the web, but it is not as popular as ‘mtcars’. This is understandable since ‘cars’ is less diverse and has far fewer variables.

You will get 50 observations on 2 variables, both numeric and continuous.

	speed	numeric	Speed (mph), continuous
	dist	numeric	Stopping distance (ft), continuous

Usage: This is a simple data frame which is great for introductions to regression and correlation analysis as well as scatterplots. But due to the small number of variables (2) its applicability is limited.
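A first correlation and regression on ‘cars’ might look like this. It is a minimal sketch; `speed_dist_cor` and `fit_cars` are names I picked for the example.

```r
# Strength of the linear relationship between speed and stopping distance.
speed_dist_cor <- cor(cars$speed, cars$dist)
speed_dist_cor

# Simple linear regression: stopping distance as a function of speed.
fit_cars <- lm(dist ~ speed, data = cars)
coef(fit_cars)     # intercept and slope
```

To see the fit, plot(cars) followed by abline(fit_cars) draws the scatterplot with the regression line on top.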

Lynx

Time Series:
Start = 1821
End = 1934
Frequency = 1
[1]    269  321  585  871 1475 2821 3928 5943 4950 2577  523   98  184  279  409 2285 2685
[18]  3409 1824  409  151   45   68  213  546 1033 2129 2536  957  361  377  225  360  731
[35]  1638 2725 2871 2119  684  299  236  245  552 1623 3311 6721 4254  687  255  473  358
[52]   784 1594 1676 2251 1426  756  299  201  229  469  736 2042 2811 4431 2511  389   73
[69]    39   49   59  188  377 1292 4031 3495  587  105  153  387  758 1307 3465 6991 6313
[86]  3794 1836  345  382  808 1388 2713 3800 3091 2985 3790  674   81   80  108  229  399
[103] 1132 2432 3574 2935 1537  529  485  662 1000 1590 2657 3396

A rather simple dataset that is rarely used in the R community. It covers the annual number of lynx trapped in Canada between 1821 and 1934. Basically it is a univariate time series. It shows the discrete counts of trapped lynx per year. There are no missing values.

Usage: I like to use ‘lynx’ to show one dimensional plots like histograms. Simple data structuring and order processes can be exemplified with this kind of data. The classic use would be in time series analysis.
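Here is a small sketch of how ‘lynx’ can be inspected. The names `peak_year` and `lynx_hist` are my own; the histogram is computed without drawing so the numbers can be looked at directly.

```r
class(lynx)                    # "ts": a time series object
start(lynx); end(lynx)         # first and last year
length(lynx)                   # number of yearly counts

# Year with the highest number of trapped lynx.
peak_year <- time(lynx)[which.max(lynx)]
peak_year

# Histogram bins without drawing; drop plot = FALSE to see the plot.
lynx_hist <- hist(lynx, plot = FALSE)
lynx_hist$counts

# Ordering: the five largest yearly counts.
head(sort(as.numeric(lynx), decreasing = TRUE), 5)
```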

Rivers

[1]   735  320  325  392  524  450 1459  135 465 600 330  336  280 315  870  906  202
[18]  329  290 1000  600  505 1450  840 1243 890 350 407  286  280 525  720  390  250
[35]  327  230  265  850  210  630  260  230 360 730 600  306  390 420  291  710  340
[52]  217  281  352  259  250  470  680  570 350 300 560  900  625 332 2348 1171 3710
[69] 2315 2533  780  280  410  460  260  255 431 350 760  618  338 981 1306  500  696
[86]  605  250  411 1054  735  233  435  490 310 460 383  375 1270 545  445 1885  380
[103] 300  380  377  425  276  210  800  420 350 360 538 1100 1205 314  237  610  360
[120] 540 1038  424  310  300  444  301  268 620 215 652  900  525 246  360  529  500
[137] 720  270  430  671 1770

Like ‘lynx’ it is a very simple dataset. In this case we have a plain numeric vector containing the lengths (in miles) of 141 major rivers in North America. This dataset is rarely used in the R community.

Usage: I like to use it for simple vector operations, histograms or simple data structuring exercises.
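Typical vector exercises with ‘rivers’ could look like the following sketch. The object names `rivers_km` and `long_rivers` and the 1000-mile threshold are arbitrary choices of mine.

```r
length(rivers)                 # 141 values
summary(rivers)                # minimum, quartiles, mean and maximum

# Vectorised arithmetic: convert miles to kilometres.
rivers_km <- rivers * 1.609344

# Logical indexing: keep only rivers longer than 1000 miles.
long_rivers <- rivers[rivers > 1000]

# Sorting: the five longest rivers.
head(sort(rivers, decreasing = TRUE), 5)
```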

Diamonds

  carat     cut color clarity depth table price    x    y    z
1  0.23   Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21 Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23    Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29 Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31    Good     J     SI2  63.3    58   335 4.34 4.35 2.75

This is definitely a big one. It is part of the package ‘ggplot2’, so you have to install this very popular and advanced graphics package to use it. I first discovered the dataset while reading the intro book to ‘ggplot2’ by Hadley Wickham. It is one of the main datasets used by Hadley in the book.

You will find 53,940 observations (diamonds) on 10 variables. Each observation has measurements and categories like color or cut, which correlate more or less with the continuous variable price. Price can be seen as the key variable in the dataset.

Usage: I keep that one for advanced courses like “Machine Learning” since it is very large. You can use the dataset for nearly any type of analysis. It is also great for practicing the handling of large datasets as well as data cleaning and imputation.
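Since ‘diamonds’ is not part of base R, a first look needs ggplot2 to be installed. The sketch below assumes that is the case and simply skips the example otherwise; the median price per cut is just one summary I chose for illustration.

```r
# 'diamonds' lives in ggplot2, which is not part of base R.
# Install it once with install.packages("ggplot2") if needed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  diamonds <- ggplot2::diamonds

  print(dim(diamonds))            # rows and columns
  print(table(diamonds$cut))      # counts per cut quality

  # Median price per cut category.
  print(aggregate(price ~ cut, data = diamonds, FUN = median))
} else {
  message("ggplot2 is not installed; skipping the diamonds example.")
}
```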