Famous and Very Useful Pre-Installed Exercise Datasets in R
As most of you surely know, R comes with many exercise datasets already installed. As soon as you have installed base R, which includes the package ‘datasets’, you have ample opportunity to explore R with real-world data.
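As a quick sketch of how to see what is available: the snippet below lists the contents of the ‘datasets’ package and peeks at one of them. The object names `available` and `first_rows` are only illustrative.

```r
# List the names of all datasets that ship with the 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# The datasets are attached by default, so they can be used by name right away
first_rows <- head(iris, 3)
print(first_rows)

# Each dataset has its own help page, e.g. ?iris or ?mtcars
```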
For me as a course content creator, those datasets help tremendously: I can be sure that my students have exactly the same data available and that the results are reproducible. For you as a student this is at least as beneficial. If you reach out on a discussion board like Stack Overflow, you will find that standard datasets are used most of the time to illustrate nearly any imaginable problem.
On top of that, it saves time! Using a new dataset each time would mean spending a significant amount of time on understanding it. Each dataset has different features, such as the number and type of variables, the number of observations, completeness and many more. Since you are here to learn R as quickly and wholly as possible, and not to study iris biology or car mechanics, I want to reduce the time spent on dataset familiarization as much as possible.
Finally, I want to remark that the only step of the data analysis procedure you skip by using those datasets is the data import. There are several videos on data import in the courses R Basics and R Level 1. A .csv file, once imported properly, behaves like any other data frame and hence like ‘mtcars’.
In this article I will show you the most common standard datasets, which you will find frequently in R-Tutorials training materials as well as in the R user community. You will learn about the features of those specific datasets. All of them are part of base R, except the ‘diamonds’ dataset, which comes with ggplot2.
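To illustrate the point about imported data, here is a minimal sketch that writes ‘mtcars’ to a temporary CSV file and reads it back. The file path and the name `mtcars_csv` are made up for this example; with your own data you would point `read.csv()` at your file instead.

```r
# Write mtcars to a temporary CSV file (keeping the car names as row names)
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = TRUE)

# Read it back; the first column holds the row names
mtcars_csv <- read.csv(csv_path, row.names = 1)

# The imported object is an ordinary data frame with the same shape
is.data.frame(mtcars_csv)
dim(mtcars_csv)
head(mtcars_csv, 3)
```

Note that `read.csv()` guesses the column types, so whole-number columns such as cyl may come back as integer rather than double.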
Iris
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
Nearly every R user has worked with this dataset. It belongs to R like the Eiffel Tower to Paris. You will find it in pretty much any tutorial. The data frame consists of 5 variables and 150 observations. The data is derived from a biological question: how do the flower measurements (sepals and petals) differ between three iris species?
Each of the three species (setosa, versicolor, virginica) has exactly 50 observations, which makes it perfectly balanced. And it is also a clean dataset without missing data or nonsensical values in it.
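These properties are easy to check yourself. The following sketch counts the observations per species and looks for missing values; `species_counts` and `has_missing` are just illustrative names.

```r
# Structure: 150 observations of 5 variables
str(iris)

# Number of observations per species
species_counts <- table(iris$Species)
print(species_counts)

# TRUE if there is at least one NA anywhere in the data frame
has_missing <- anyNA(iris)
print(has_missing)
```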
The same data is also available as a three-dimensional array, called ‘iris3’.
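A short sketch of how the array version is organized: one 50 x 4 slice of measurements per species. The name `setosa_matrix` is only used for this example.

```r
# Dimensions: observations x measurements x species
dim(iris3)
dimnames(iris3)

# Extract the slice for the first species as a plain matrix
setosa_matrix <- iris3[, , 1]
head(setosa_matrix, 3)
```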
Variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width (continuous, numeric) and Species (categorical).
Usage: The strong point of the dataset is the variable Species as a grouping factor. With the three distinct groups you can do plenty of analysis and plotting on the continuous variables: boxplots, ANOVA and all sorts of classification methods. Of course, the continuous variables can be used for regressions as well.
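As a small sketch of such a grouped analysis, the code below draws a boxplot of sepal length per species and runs a one-way ANOVA on the same variable. The choice of Sepal.Length and the names `fit_iris` and `p_value` are arbitrary picks for illustration.

```r
# Boxplot of one continuous variable, grouped by Species
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal length by species", ylab = "Sepal length (cm)")

# One-way ANOVA: does mean sepal length differ between the species?
fit_iris <- aov(Sepal.Length ~ Species, data = iris)
summary(fit_iris)

# Extract the p-value of the Species term from the ANOVA table
p_value <- summary(fit_iris)[[1]][["Pr(>F)"]][1]
print(p_value)
```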
Mtcars
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C         17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE        16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL        17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Another classic dataset. This one offers more variety in variable types, so it can be used with a wider range of analytical tools. As the name indicates, the data is a comparison of different cars, taken from the 1974 Motor Trend US magazine. There are 32 observations (cars) on 11 variables. All of the variables are stored as numeric, some continuous, some discrete. The dataset is clean without missing values.
The variables vs and am (codes for engine type and transmission type, respectively) can be converted to factors, which simply allows you to put the observations in different groups. Once converted, they play a similar role to Species in ‘iris’.
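The conversion could look like the sketch below. It works on a copy (here called `cars_f`, a name chosen for this example) so that the original ‘mtcars’ stays untouched; the labels follow the coding given on the dataset's help page.

```r
# Work on a copy so the built-in dataset is not modified
cars_f <- mtcars

# Turn the 0/1 codes into labelled factors
cars_f$vs <- factor(cars_f$vs, levels = c(0, 1),
                    labels = c("V-shaped", "straight"))
cars_f$am <- factor(cars_f$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))

# The factors can now be used as grouping variables
am_counts <- table(cars_f$am)
print(am_counts)
tapply(cars_f$mpg, cars_f$am, mean)
```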
mpg | Miles per (US) gallon, continuous
cyl | Number of cylinders, discrete
disp | Displacement (cu. in.), continuous
hp | Gross horsepower, continuous
drat | Rear axle ratio, continuous
wt | Weight (1000 lbs), continuous
qsec | 1/4 mile time (seconds), continuous
vs | Engine type (0 = V-shaped, 1 = straight), binary, usable as a factor
am | Transmission (0 = automatic, 1 = manual), binary, usable as a factor
gear | Number of forward gears, discrete
carb | Number of carburetors, discrete
Usage: The strong point of this dataset is its variety. You can use it for a wide range of analyses, as long as 32 observations are enough to get reliable results. The fact that the data frame is rather small also makes it easy for data science novices to grasp general statistics concepts.
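As one possible example of that variety, the sketch below fits a small multiple regression and computes a grouped summary. The model formula and the names `fit_mpg` and `cyl_means` are arbitrary choices for illustration, not a recommended analysis.

```r
# Multiple linear regression: fuel economy explained by weight and horsepower
fit_mpg <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit_mpg)

# Grouped summary: mean mpg for each number of cylinders
cyl_means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(cyl_means)
```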
Cars
   speed dist
1      4    2
2      4   10
3      7    4
4      7   22
5      8   16
6      9   10
7     10   18
8     10   26
9     10   34
10    11   17
Not to be confused with ‘mtcars’. This dataset deals with the speed of cars and the corresponding stopping distance. In case you are wondering why the speeds are quite low: the data was recorded in the 1920s. You will see the ‘cars’ dataset used occasionally in R tutorials all over the web, but it is not as popular as ‘mtcars’. This is understandable, since ‘cars’ is less diverse and has only two variables.
You will get 50 observations on 2 variables, both numeric and continuous.
speed | numeric | Speed (mph), continuous
dist | numeric | Stopping distance (ft), continuous
Usage: This is a simple data frame which is great for introductions to regression and correlation analysis as well as scatterplots. But with only two variables, its applicability is limited.
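A minimal sketch of such an introduction: a scatterplot, the correlation coefficient and a simple linear regression with its fitted line. The names `r_cars` and `fit_cars` are illustrative only.

```r
# Scatterplot of stopping distance against speed
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")

# Pearson correlation between the two variables
r_cars <- cor(cars$speed, cars$dist)
print(r_cars)

# Simple linear regression and the fitted line added to the plot
fit_cars <- lm(dist ~ speed, data = cars)
abline(fit_cars)
coef(fit_cars)
```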
Lynx
Time Series:
Start = 1821
End = 1934
Frequency = 1
  [1]  269  321  585  871 1475 2821 3928 5943 4950 2577  523   98  184  279  409 2285 2685
 [18] 3409 1824  409  151   45   68  213  546 1033 2129 2536  957  361  377  225  360  731
 [35] 1638 2725 2871 2119  684  299  236  245  552 1623 3311 6721 4254  687  255  473  358
 [52]  784 1594 1676 2251 1426  756  299  201  229  469  736 2042 2811 4431 2511  389   73
 [69]   39   49   59  188  377 1292 4031 3495  587  105  153  387  758 1307 3465 6991 6313
 [86] 3794 1836  345  382  808 1388 2713 3800 3091 2985 3790  674   81   80  108  229  399
[103] 1132 2432 3574 2935 1537  529  485  662 1000 1590 2657 3396
A rather simple dataset that is rarely used in the R community. It covers the number of lynx trapped in Canada from 1821 to 1934. Basically it is a univariate time series: it holds the discrete count of trapped lynx for each year. There are no missing values.
Usage: I like to use ‘lynx’ to show one-dimensional plots like histograms. Simple data structuring and sorting operations can be demonstrated with this kind of data. The classic use would be in time series analysis.
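The sketch below shows a few of these uses: inspecting the time series attributes, drawing a histogram and a line plot, and finding the year with the highest count. `peak_year` is a name introduced just for this example.

```r
# lynx is a time series object with yearly observations
class(lynx)
start(lynx)
end(lynx)
frequency(lynx)

# Histogram of the yearly counts and a line plot over time
hist(lynx, main = "Distribution of yearly lynx trappings")
plot(lynx, ylab = "Lynx trapped")

# Year with the highest number of trapped lynx
peak_year <- time(lynx)[which.max(lynx)]
print(peak_year)
```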
Rivers
  [1]  735  320  325  392  524  450 1459  135  465  600  330  336  280  315  870  906  202
 [18]  329  290 1000  600  505 1450  840 1243  890  350  407  286  280  525  720  390  250
 [35]  327  230  265  850  210  630  260  230  360  730  600  306  390  420  291  710  340
 [52]  217  281  352  259  250  470  680  570  350  300  560  900  625  332 2348 1171 3710
 [69] 2315 2533  780  280  410  460  260  255  431  350  760  618  338  981 1306  500  696
 [86]  605  250  411 1054  735  233  435  490  310  460  383  375 1270  545  445 1885  380
[103]  300  380  377  425  276  210  800  420  350  360  538 1100 1205  314  237  610  360
[120]  540 1038  424  310  300  444  301  268  620  215  652  900  525  246  360  529  500
[137]  720  270  430  671 1770
Like ‘lynx’, it is a very simple dataset. In this case we have a plain numeric vector containing the lengths (in miles) of 141 major rivers in North America. You will rarely find this dataset used in the R community.
Usage: I like to use it for simple vector operations, histograms or simple data structuring exercises.
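A few such exercises, sketched on ‘rivers’: a summary, a unit conversion, logical subsetting and sorting. The names `rivers_km`, `long_rivers` and `top5` are made up for this example.

```r
# Basic overview of the vector
length(rivers)
summary(rivers)

# Vectorized arithmetic: convert miles to kilometres
rivers_km <- rivers * 1.609344
head(rivers_km)

# Logical subsetting: rivers longer than 1000 miles
long_rivers <- rivers[rivers > 1000]
print(long_rivers)

# Sorting: the five longest rivers
top5 <- sort(rivers, decreasing = TRUE)[1:5]
print(top5)

# Histogram of river lengths
hist(rivers, main = "Lengths of major North American rivers")
```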
Diamonds
  carat     cut color clarity depth table price    x    y    z
1  0.23   Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21 Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23    Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29 Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31    Good     J     SI2  63.3    58   335 4.34 4.35 2.75
This is definitely a big one. It is part of the package ‘ggplot2’, so you have to install this very popular and advanced graphics package to use it. I first discovered the dataset while reading the intro book to ‘ggplot2’ by Hadley Wickham. It is one of the main datasets used by Hadley in the book.
You will find 53,940 observations (diamonds) on 10 variables. Each observation has measurements and categories like color or cut, which correlate more or less with the continuous variable price. Price can be seen as the key variable in the dataset.
Usage: I keep that one for advanced courses like “Machine Learning” since it is very large compared to the other datasets in this article. You can use it for nearly any type of analysis. It is also great for practicing the handling of larger datasets as well as data cleaning and imputation.
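A minimal sketch of a first look at ‘diamonds’, assuming ggplot2 is installed (the code skips itself otherwise). The checks for missing values and zero dimensions are only one possible starting point for a data cleaning exercise; `zero_dims` is an illustrative name.

```r
# Only run if ggplot2 is installed; it is not part of base R
if (requireNamespace("ggplot2", quietly = TRUE)) {
  # Copy the dataset from the package namespace
  diamonds <- as.data.frame(ggplot2::diamonds)

  # Size and structure
  print(dim(diamonds))
  str(diamonds)

  # Observations per cut category
  print(table(diamonds$cut))

  # Missing values per column
  print(colSums(is.na(diamonds)))

  # Rows where a physical dimension is recorded as zero (worth a closer look)
  zero_dims <- sum(diamonds$x == 0 | diamonds$y == 0 | diamonds$z == 0)
  print(zero_dims)
}
```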