R Exercises – 51-60 – Data Pre-Processing with Data.Table
Required packages for the excises
install.packages("data.table") install.packages("ISLR") #for the dataset College library(data.table) library(ISLR) dtcollege = data.table(College, keep.rownames = T); class(dtcollege)
1. ‘College’ dataset – Basic row manipulations
a. Transform ‘College’ from ‘ISLR’ to data.table. Make sure to keep the University identifier. We will use this new data.table called ‘dtcollege’ throughout this block of exercises.
b. Get familiar with the dataset and its variables.
c. Extract rows 40 to 60 as a new data.table (‘mysubset’).
#expected result rn Private Apps Accept Enroll Top10perc Top25perc [...] 1: Baylor University Yes 6075 5349 2367 34 66 2: Beaver College Yes 1163 850 348 23 56 3: Bellarmine College Yes 807 707 308 39 63 4: Belmont Abbey College Yes 632 494 129 17 36 5: Belmont University Yes 1220 974 481 28 67 [...]
d. What is the average enrollment number in this subset?
#expected result [1] 597.8571
e. Round the average enrollment to one digit.
#expected result [1] 597.9
dtcollege = data.table(College, keep.rownames = T); class(dtcollege)
b.
?College
head(dtcollege)
summary(dtcollege)
c.
mysubset = dtcollege[40:60]; mysubset
d.
mean.enrollment = mysubset[,mean(Enroll)]; mean.enrollment
e.
round(mean.enrollment, digits = 1)
2. ‘College’ dataset – Advanced row selection
a. Get a data.table with all rows except the ones with an ‘Outstate’ fee between 8000-14000 USD.
#expected result rn Private Apps Accept Enroll Top10perc Top25perc [...] 1: Abilene Christian University Yes 1660 1232 721 23 52 2: Alaska Pacific University Yes 193 146 55 16 44 3: Albright College Yes 1038 839 227 30 63 4: Alfred University Yes 1732 1425 472 37 75 5: Allegheny College Yes 2652 1900 484 44 77 --- 381: Worcester Polytechnic Institute Yes 2768 2314 682 49 86 382: Worcester State College No 2197 1515 543 4 26 383: Xavier University of Louisiana Yes 2097 1915 695 34 61 384: Yale University Yes 10705 2453 1317 95 99 385: York College of Pennsylvania Yes 2989 1855 691 28 63
dtcollege[!(Outstate) %between% c(8000,14000)]
3. ‘College’ dataset – Basic column operations
a. What are the top 10 Universities in terms of top 10% High School students (Top10perc)?
#expected result [1] "Massachusetts Institute of Technology" [2] "Harvey Mudd College" [3] "University of California at Berkeley" [4] "Yale University" [5] "Duke University" [6] "Harvard University" [7] "Princeton University" [8] "Georgia Institute of Technology" [9] "Brown University" [10] "Dartmouth College"
b. What are the top 10 Universities in terms of student enrollment vs. accepted applications (highest student enrollment ratio)? Add a new column to the data.table. Code this exercise step by step in several lines.
#expected result rn earatio Enroll Accept 1: California Lutheran University 1.0000000 247 247 2: Brewton-Parker College 0.9788274 1202 1228 3: Mississippi University for Women 0.9382716 380 405 4: Peru State College 0.9141717 458 501 5: Indiana Wesleyan University 0.8652482 366 423 6: Brigham Young University at Provo 0.8543132 4615 5402 7: Columbia College MO 0.8354430 132 158 8: Hardin-Simmons University 0.8254717 350 424 9: University of Mobile 0.8126888 269 331 10: National-Louis University 0.8040346 279 347
c. What are the top 10 Universities in terms of favorable S.F.Ratio with a Room.Board cost lower 4000 USD?
#expected result rn S.F.Ratio Room.Board 1: University of Charleston 2.5 3540 2: Siena Heights College 7.2 3880 3: University of the Arts 7.5 3860 4: Marian College of Fond du Lac 8.4 3400 5: Wisconsin Lutheran College 8.5 3700 6: Bethel College KS 8.8 3580 7: Buena Vista College 8.8 3797 8: Tennessee Wesleyan College 8.9 3640 9: Christendom College 9.3 3600 10: Albertson College 9.4 3335
head(dtcollege[order(-Top10perc), rn],10)
b.
earatio = dtcollege$Enroll/dtcollege$Accept
dtcollegeplus = data.table(dtcollege, earatio); head(dtcollegeplus)
head(dtcollegeplus[order(-earatio), .(rn, earatio, Enroll, Accept)],10)
c.
rb4000 = dtcollege[Room.Board < 4000]
head(rb4000[order(S.F.Ratio), .(rn, S.F.Ratio, Room.Board)],10)
4. ‘College’ dataset – Permanently changing the column order
a. Create a new data.table ‘college.gradnr’ which consists of: ‘rn’, ‘F.Undergrad’, ‘P.Undergrad’, ‘Accept’, ‘Enroll’, ‘Apps’.
#expected result rn F.Undergrad P.Undergrad Accept Enroll Apps 1: Abilene Christian University 2885 537 1232 721 1660 2: Adelphi University 2683 1227 1924 512 2186 3: Adrian College 1036 99 1097 336 1428 4: Agnes Scott College 510 63 349 137 417 5: Alaska Pacific University 249 869 146 55 193 6: Albertson College 678 41 479 158 587
b. Permanently change the column order as seen below:
#expected result rn Apps Accept Enroll F.Undergrad P.Undergrad 1: Abilene Christian University 1660 1232 721 2885 537 2: Adelphi University 2186 1924 512 2683 1227 3: Adrian College 1428 1097 336 1036 99 4: Agnes Scott College 417 349 137 510 63 5: Alaska Pacific University 193 146 55 249 869 6: Albertson College 587 479 158 678 41
college.gradnr = dtcollege[,c("rn", "F.Undergrad", "P.Undergrad", "Accept", "Enroll", "Apps"), with = F]; head(college.gradnr)
b.
setcolorder(college.gradnr, c(1,6,4,5,2,3)); head(college.gradnr)
5. ‘College’ dataset – Adding and removing new columns
a. Add a new column called ‘HighInterest’ to the data.table. The column has an integer 1 for all observations with a number of applications higher 1000.
#expected result rn Private Apps [...] HighInterest 1: Worcester Polytechnic Institute Yes 2768 N/A 2: Worcester State College No 2197 N/A 3: Xavier University Yes 1959 N/A 4: Xavier University of Louisiana Yes 2097 N/A 5: Yale University Yes 10705 1 6: York College of Pennsylvania Yes 2989 N/A
b. Remove the ‘HighInterest’ column again.
dtcollege[Apps > 10000, HighInterest := 1L]; tail(dtcollege)
b.
dtcollege[, c("HighInterest") := NULL]; tail(dtcollege)
6. ‘College’ dataset – Adding new columns; Advanced
a. Add a new column ‘undergradratio’ which is the undergraduate ratio (F.Undergrad/P.Undergrad). Get the first ten observations starting with the highest ratio. The whole exercise should be coded in one line – Use chaining!
#expected result rn Private Apps [...] udergradratio 1: College of Wooster Yes 2100 1704.0000 2: Kenyon College Yes 2212 1445.0000 3: Hampden - Sydney College Yes 817 945.0000 4: Claremont McKenna College Yes 1860 887.0000 5: Washington and Lee University Yes 3315 528.0000 6: Hobart and William Smith Colleges Yes 2688 358.4000 7: Amherst College Yes 4302 318.6000 8: Colorado College Yes 3207 270.2857 9: Rensselaer Polytechnic Institute Yes 4996 268.1875 10: Davidson College Yes 2373 266.8333
head(dtcollege[,undergradratio := F.Undergrad/P.Undergrad]
[order(-undergradratio)],10)
#same line
7. ‘College’ dataset – Counting observations
a. How many Universities have instructional expenditures of over 20000 USD per year?
#expected result [1] 28
b. How many Universities have a combined ‘Books’ and ‘Room.Board’ costs of over 7000 USD per year?
#expected result [1] 37
c. How many Universities are public and how many are private?
#expected result Private N 1: Yes 565 2: No 212
dtcollege[Expend > 20000, .N]
b.
dtcollege[Books+Room.Board > 7000, .N]
c.
dtcollege[, by = Private, .N]
8. ‘College’ dataset – Working with keys and subsetting
a. Set two keys to your ‘College’ data.table: ‘F.Undergrad’ and ‘P.Undergrad’. Check if the order has changed.
b. Get a subset of the ‘College’ data with ‘F.Undergrad’ lower 1000 and ‘P.Undergrad’ lower 100 students.
#expected result rn Private Apps [...] 1: Christendom College Yes 81 2: Wisconsin Lutheran College Yes 152 3: Montreat-Anderson College Yes 263 4: McPherson College Yes 420 5: Bethel College Yes 502 [...]
c. Is there a college with exactly 393 full-time and 4 part-time undergraduate students?
#expected result St. John's College
head(dtcollege)
setkey(dtcollege, F.Undergrad, P.Undergrad);
head(dtcollege)
b.
dtcollege[F.Undergrad < 1000 & P.Undergrad < 100]
c.
dtcollege[.(393, 4)]
9. ‘College’ dataset – Selecting existing columns and reshaping
a. Get a data.table with all columns except ‘Apps’, ‘Accept’, ‘Enroll’. Use at least two different ways for this.
#expected result rn Private Top10perc [...] 1: Christendom College Yes 33 2: Wilson College Yes 16 3: College of St. Joseph Yes 12 4: Alaska Pacific University Yes 16 5: Wisconsin Lutheran College Yes 17 --- 773: Michigan State University No 23 774: Brigham Young University at Provo Yes 48 775: Pennsylvania State Univ. Main Campus No 48 776: University of Texas at Austin No 48 777: Texas A&M Univ. at College Station No 49
b. Get a data.table with the three columns ‘Apps’, ‘Accept’, ‘Enroll’. Use at least two different code efficient methods.
#expected result Apps Accept Enroll 1: 81 72 51 2: 167 130 46 3: 141 118 55 4: 193 146 55 5: 152 128 75 --- 773: 18114 15096 6180 774: 7365 5402 4615 775: 19315 10344 3450 776: 14752 9572 5329 777: 14474 10519 6392
dtcollege[, !c("Apps", "Accept", "Enroll"), with = F]
dtcollege[, -c("Apps", "Accept", "Enroll"), with = F]
b.
dtcollege[, Apps:Enroll, with = F]
dtcollege[, 3:5, with = F]
10. ‘College’ dataset – Getting counts for grouped data
a. How many Colleges with less than 800 applications received, have a Top 10 student percentage above 40?
#expected result Top10perc N 1: FALSE 193 2: TRUE 9
b. How many Colleges with less than 900 applications received and an ‘Out of state tuition’ below 10000, have a top 10 student percentage above 30?
#expected result Top10perc N 1: TRUE 13 2: FALSE 135
c. How many Colleges with less than 1000 applications received, have a ‘Top10perc’ above 20 OR a ‘Top25perc’ above 30?
#expected result Top10perc N 1: TRUE 234 2: FALSE 41
dtcollege[Apps < 800, .N, by = (Top10perc > 40)]
b.
dtcollege[Apps < 900 & Outstate < 10000, .N, by = (Top10perc > 30)]
c.
dtcollege[Apps < 1000, .N, by = (Top10perc > 20 | Top25perc > 30 )]