R Tutorials | R Exercises – 51-60 – Data Pre-Processing with Data.Table

R Exercises – 51-60 – Data Pre-Processing with Data.Table

R Tutorials
30 Aug, 2016
Exercise Database
data management, data pre-processing, exercise database

Required packages for the excises

install.packages("data.table")
install.packages("ISLR") #for the dataset College

library(data.table)
library(ISLR)

dtcollege = data.table(College, keep.rownames = T); class(dtcollege)

1. ‘College’ dataset – Basic row manipulations

a. Transform ‘College’ from ‘ISLR’ to data.table. Make sure to keep the University identifier. We will use this new data.table called ‘dtcollege’ throughout this block of exercises.

b. Get familiar with the dataset and its variables.

c. Extract rows 40 to 60 as a new data.table (‘mysubset’).

#expected result

                       rn Private Apps Accept Enroll Top10perc Top25perc  [...]
 1:     Baylor University     Yes 6075   5349   2367        34        66
 2:        Beaver College     Yes 1163    850    348        23        56
 3:    Bellarmine College     Yes  807    707    308        39        63
 4: Belmont Abbey College     Yes  632    494    129        17        36
 5:    Belmont University     Yes 1220    974    481        28        67
[...]

d. What is the average enrollment number in this subset?

#expected result

[1] 597.8571

e. Round the average enrollment to one digit.

#expected result

[1] 597.9

Reveal solution

a.
dtcollege = data.table(College, keep.rownames = T); class(dtcollege)

b.

?College

head(dtcollege)

summary(dtcollege)

c.
mysubset = dtcollege[40:60]; mysubset

d.
mean.enrollment = mysubset[,mean(Enroll)]; mean.enrollment

e.
round(mean.enrollment, digits = 1)

2. ‘College’ dataset – Advanced row selection

a. Get a data.table with all rows except the ones with an ‘Outstate’ fee between 8000-14000 USD.

#expected result

                                  rn Private  Apps  Accept Enroll Top10perc Top25perc  [...]
  1:    Abilene Christian University     Yes  1660    1232    721        23        52
  2:       Alaska Pacific University     Yes   193     146     55        16        44
  3:                Albright College     Yes  1038     839    227        30        63
  4:               Alfred University     Yes  1732    1425    472        37        75
  5:               Allegheny College     Yes  2652    1900    484        44        77
 --- 
381: Worcester Polytechnic Institute     Yes  2768    2314    682        49        86
382:         Worcester State College      No  2197    1515    543         4        26
383:  Xavier University of Louisiana     Yes  2097    1915    695        34        61
384:                 Yale University     Yes 10705    2453   1317        95        99
385:    York College of Pennsylvania     Yes  2989    1855    691        28        63

Reveal solution

a.
dtcollege[!(Outstate) %between% c(8000,14000)]

3. ‘College’ dataset – Basic column operations

a. What are the top 10 Universities in terms of top 10% High School students (Top10perc)?

#expected result

 [1] "Massachusetts Institute of Technology"
 [2] "Harvey Mudd College" 
 [3] "University of California at Berkeley" 
 [4] "Yale University" 
 [5] "Duke University" 
 [6] "Harvard University" 
 [7] "Princeton University" 
 [8] "Georgia Institute of Technology" 
 [9] "Brown University" 
[10] "Dartmouth College"

b. What are the top 10 Universities in terms of student enrollment vs. accepted applications (highest student enrollment ratio)? Add a new column to the data.table. Code this exercise step by step in several lines.

#expected result

                                   rn   earatio Enroll Accept
 1:    California Lutheran University 1.0000000    247    247
 2:            Brewton-Parker College 0.9788274   1202   1228
 3:  Mississippi University for Women 0.9382716    380    405
 4:                Peru State College 0.9141717    458    501
 5:       Indiana Wesleyan University 0.8652482    366    423
 6: Brigham Young University at Provo 0.8543132   4615   5402
 7:               Columbia College MO 0.8354430    132    158
 8:         Hardin-Simmons University 0.8254717    350    424
 9:              University of Mobile 0.8126888    269    331
10:         National-Louis University 0.8040346    279    347

c. What are the top 10 Universities in terms of favorable S.F.Ratio with a Room.Board cost lower 4000 USD?

#expected result

                               rn S.F.Ratio Room.Board
 1:      University of Charleston       2.5       3540
 2:         Siena Heights College       7.2       3880
 3:        University of the Arts       7.5       3860
 4: Marian College of Fond du Lac       8.4       3400
 5:    Wisconsin Lutheran College       8.5       3700
 6:             Bethel College KS       8.8       3580
 7:           Buena Vista College       8.8       3797
 8:    Tennessee Wesleyan College       8.9       3640
 9:           Christendom College       9.3       3600
10:             Albertson College       9.4       3335

Reveal solution

a.
head(dtcollege[order(-Top10perc), rn],10)

b.

earatio = dtcollege$Enroll/dtcollege$Accept

dtcollegeplus = data.table(dtcollege, earatio); head(dtcollegeplus)

head(dtcollegeplus[order(-earatio), .(rn, earatio, Enroll, Accept)],10)

rb4000 = dtcollege[Room.Board < 4000] 
head(rb4000[order(S.F.Ratio), .(rn, S.F.Ratio, Room.Board)],10)

4. ‘College’ dataset – Permanently changing the column order

a. Create a new data.table ‘college.gradnr’ which consists of: ‘rn’, ‘F.Undergrad’, ‘P.Undergrad’, ‘Accept’, ‘Enroll’, ‘Apps’.

#expected result

                             rn F.Undergrad P.Undergrad Accept Enroll Apps
1: Abilene Christian University        2885         537   1232    721 1660
2:           Adelphi University        2683        1227   1924    512 2186
3:               Adrian College        1036          99   1097    336 1428
4:          Agnes Scott College         510          63    349    137  417
5:    Alaska Pacific University         249         869    146     55  193
6:            Albertson College         678          41    479    158  587

b. Permanently change the column order as seen below:

#expected result

                             rn Apps Accept Enroll F.Undergrad P.Undergrad
1: Abilene Christian University 1660   1232    721        2885         537
2:           Adelphi University 2186   1924    512        2683        1227
3:               Adrian College 1428   1097    336        1036          99
4:          Agnes Scott College  417    349    137         510          63
5:    Alaska Pacific University  193    146     55         249         869
6:            Albertson College  587    479    158         678          41

Reveal solution

college.gradnr = dtcollege[,c("rn", "F.Undergrad", "P.Undergrad", "Accept", "Enroll", "Apps"), with = F]; head(college.gradnr)

b.
setcolorder(college.gradnr, c(1,6,4,5,2,3)); head(college.gradnr)

5. ‘College’ dataset – Adding and removing new columns

a. Add a new column called ‘HighInterest’ to the data.table. The column has an integer 1 for all observations with a number of applications higher 1000.

#expected result

                                rn Private  Apps [...] HighInterest
1: Worcester Polytechnic Institute     Yes  2768                N/A
2:         Worcester State College      No  2197                N/A
3:               Xavier University     Yes  1959                N/A
4:  Xavier University of Louisiana     Yes  2097                N/A
5:                 Yale University     Yes 10705                  1
6:    York College of Pennsylvania     Yes  2989                N/A

b. Remove the ‘HighInterest’ column again.

Reveal solution

a.
dtcollege[Apps > 10000, HighInterest := 1L]; tail(dtcollege)

b.
dtcollege[, c("HighInterest") := NULL]; tail(dtcollege)

6. ‘College’ dataset – Adding new columns; Advanced

a. Add a new column ‘undergradratio’ which is the undergraduate ratio (F.Undergrad/P.Undergrad). Get the first ten observations starting with the highest ratio. The whole exercise should be coded in one line – Use chaining!

#expected result

                                   rn Private Apps [...] udergradratio
 1:                College of Wooster     Yes 2100           1704.0000
 2:                    Kenyon College     Yes 2212           1445.0000
 3:          Hampden - Sydney College     Yes  817            945.0000
 4:         Claremont McKenna College     Yes 1860            887.0000
 5:     Washington and Lee University     Yes 3315            528.0000
 6: Hobart and William Smith Colleges     Yes 2688            358.4000
 7:                   Amherst College     Yes 4302            318.6000
 8:                  Colorado College     Yes 3207            270.2857
 9:  Rensselaer Polytechnic Institute     Yes 4996            268.1875
10:                  Davidson College     Yes 2373            266.8333

Reveal solution

head(dtcollege[,undergradratio := F.Undergrad/P.Undergrad]

[order(-undergradratio)],10)

#same line

7. ‘College’ dataset – Counting observations

a. How many Universities have instructional expenditures of over 20000 USD per year?

#expected result

[1] 28

b. How many Universities have a combined ‘Books’ and ‘Room.Board’ costs of over 7000 USD per year?

#expected result

[1] 37

c. How many Universities are public and how many are private?

#expected result

   Private   N
1:     Yes 565
2:      No 212

Reveal solution

a.
dtcollege[Expend > 20000, .N]

b.
dtcollege[Books+Room.Board > 7000, .N]

c.
dtcollege[, by = Private, .N]

8. ‘College’ dataset – Working with keys and subsetting

a. Set two keys to your ‘College’ data.table: ‘F.Undergrad’ and ‘P.Undergrad’. Check if the order has changed.

b. Get a subset of the ‘College’ data with ‘F.Undergrad’ lower 1000 and ‘P.Undergrad’ lower 100 students.

#expected result
                            rn Private Apps [...]
 1:        Christendom College     Yes   81
 2: Wisconsin Lutheran College     Yes  152
 3:  Montreat-Anderson College     Yes  263
 4:          McPherson College     Yes  420
 5:             Bethel College     Yes  502
[...]

c. Is there a college with exactly 393 full-time and 4 part-time undergraduate students?

#expected result

St. John's College

Reveal solution

head(dtcollege)
setkey(dtcollege, F.Undergrad, P.Undergrad); 
head(dtcollege)

b.
dtcollege[F.Undergrad < 1000 & P.Undergrad < 100]

c.
dtcollege[.(393, 4)]

9. ‘College’ dataset – Selecting existing columns and reshaping

a. Get a data.table with all columns except ‘Apps’, ‘Accept’, ‘Enroll’. Use at least two different ways for this.

#expected result

                                       rn Private Top10perc [...]
 1:                   Christendom College     Yes        33 
 2:                        Wilson College     Yes        16 
 3:                 College of St. Joseph     Yes        12 
 4:             Alaska Pacific University     Yes        16 
 5:            Wisconsin Lutheran College     Yes        17 
 --- 
773:            Michigan State University      No        23 
774:    Brigham Young University at Provo     Yes        48 
775: Pennsylvania State Univ. Main Campus      No        48 
776:        University of Texas at Austin      No        48 
777:   Texas A&amp;M Univ. at College Station      No        49

b. Get a data.table with the three columns ‘Apps’, ‘Accept’, ‘Enroll’. Use at least two different code efficient methods.

#expected result

      Apps Accept Enroll
  1:    81     72     51
  2:   167    130     46
  3:   141    118     55
  4:   193    146     55
  5:   152    128     75
 --- 
773: 18114  15096   6180
774:  7365   5402   4615
775: 19315  10344   3450
776: 14752   9572   5329
777: 14474  10519   6392

Reveal solution

dtcollege[, !c("Apps", "Accept", "Enroll"), with = F]



dtcollege[, -c("Apps", "Accept", "Enroll"), with = F]

dtcollege[, Apps:Enroll, with = F]



dtcollege[, 3:5, with = F]

10. ‘College’ dataset – Getting counts for grouped data

a. How many Colleges with less than 800 applications received, have a Top 10 student percentage above 40?

#expected result

   Top10perc   N
1:     FALSE 193
2:      TRUE   9

b. How many Colleges with less than 900 applications received and an ‘Out of state tuition’ below 10000, have a top 10 student percentage above 30?

#expected result

   Top10perc   N
1:      TRUE  13
2:     FALSE 135

c. How many Colleges with less than 1000 applications received, have a ‘Top10perc’ above 20 OR a ‘Top25perc’ above 30?

#expected result

   Top10perc   N
1:      TRUE 234
2:     FALSE  41

Reveal solution

a.
dtcollege[Apps < 800, .N, by = (Top10perc > 40)]

b.
dtcollege[Apps < 900 & Outstate < 10000, .N, by = (Top10perc > 30)]

c.
dtcollege[Apps < 1000, .N, by = (Top10perc > 20 | Top25perc > 30 )]

R Tutorials Training