R Exercises – 61-70 – R String Manipulation | Working with ‘gsub’ and ‘regex’ | Regular Expressions in R
Required packages and datasets
install.packages("ISLR")
install.packages("stringr") #version 1.1.0 (or higher)
install.packages("tidyr")
library(ISLR) #for the college dataset
library(stringr) #string manipulations
library(tidyr)
#modified datasets to be used in some exercises
college.names = rownames(College); college.names
mtcars.names = rownames(mtcars); mtcars.names
1. ‘College’ dataset – Colleges in Texas
a. Get familiar with the ‘college’ dataset and its row names.
b. Get a vector with the college names (‘college.names’), which you will need in the remaining steps of this exercise and in later exercises.
#expected result
[1] "Abilene Christian University"
[2] "Adelphi University"
[3] "Adrian College"
[4] "Agnes Scott College"
[5] "Alaska Pacific University"
[6] "Albertson College"
[7] "Albertus Magnus College"
[8] "Albion College"
[9] "Albright College"
[...]
c. Get a vector (‘texas.college’) which contains all colleges with ‘Texas’ in their names. Summarize the vector.
#expected result V1 Texas: 11 NA's :766
d. Which rows contain ‘Texas’?
#expected result [1] 179 582 583 584 585 586 587 658 685 686 687
e. Extract the subset of ‘Texas’ colleges from the original ‘College’ dataset.
#expected result
                                   Private  Apps Accept [...]
East Texas Baptist University          Yes   379    341
Texas A&M Univ. at College Station      No 14474  10519
Texas A&M University at Galveston       No   529    481
Texas Christian University             Yes  4095   3079
Texas Lutheran College                 Yes   497    423
Texas Southern University               No  4345   3245
Texas Wesleyan University              Yes   592    501
University of North Texas               No  4418   2737
University of Texas at Arlington        No  3281   2559
University of Texas at Austin           No 14752   9572
University of Texas at San Antonio      No  4217   3100
[...]
head(College)
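rownames(College)
#row names are an attribute of the data frame, not a column, so College$rownames would return NULL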
b.
college.names = rownames(College);
college.names
c.
library(stringr)
texas.college = str_match(college.names, "Texas");
summary(texas.college)
d.
rownumbers = which(texas.college == "Texas");
rownumbers
e.
College[rownumbers,]
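The same lookup can also be sketched in base R without ‘stringr’. This is only an illustration: ‘sample.names’ is a small stand-in for ‘college.names’, and the variable names are assumptions, not part of the original solution. It may also be useful if summary() on the character matrix returned by str_match prints Length/Class/Mode instead of counts, which can depend on the R version.

```r
# Small stand-in for rownames(College); no packages needed.
sample.names <- c("Abilene Christian University",
                  "East Texas Baptist University",
                  "Texas Christian University",
                  "Adrian College")

# grepl() returns TRUE/FALSE per element; grep() returns the matching positions.
has.texas  <- grepl("Texas", sample.names, fixed = TRUE)
texas.rows <- grep("Texas", sample.names, fixed = TRUE)

table(has.texas)           # counts of non-matches and matches
texas.rows                 # positions of the matching names
sample.names[texas.rows]   # the matching names themselves
```

With the real data, College[grep("Texas", rownames(College)), ] would subset the data frame in one step.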
2. ‘College’ dataset – General string manipulations
a. Use the ‘college.names’ vector from the previous exercise and split it into single words.
#expected result
[[1]]
[1] "Abilene"    "Christian"  "University"
[[2]]
[1] "Adelphi"    "University"
[[3]]
[1] "Adrian"  "College"
[[4]]
[1] "Agnes"   "Scott"   "College"
[[5]]
[1] "Alaska"     "Pacific"    "University"
[[6]]
[1] "Albertson" "College"
b. Convert the single words to lower case.
#expected result
[1] "c(\"abilene\", \"christian\", \"university\")"
[2] "c(\"adelphi\", \"university\")"
[3] "c(\"adrian\", \"college\")"
[4] "c(\"agnes\", \"scott\", \"college\")"
[5] "c(\"alaska\", \"pacific\", \"university\")"
[...]
c. Convert the single words to upper case using the function ‘casefold’.
#expected result
[1] "C(\"ABILENE\", \"CHRISTIAN\", \"UNIVERSITY\")"
[2] "C(\"ADELPHI\", \"UNIVERSITY\")"
[3] "C(\"ADRIAN\", \"COLLEGE\")"
[4] "C(\"AGNES\", \"SCOTT\", \"COLLEGE\")"
[5] "C(\"ALASKA\", \"PACIFIC\", \"UNIVERSITY\")"
[...]
mycorpus = strsplit(college.names, " ");
head(mycorpus)
b.
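tolower(mycorpus)
#applied to a list, tolower() converts each element to its deparsed string, e.g. "c(\"adrian\", \"college\")"
c.
casefold(mycorpus, upper = TRUE)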
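Because tolower() and casefold() coerce a list to a character vector first, the results above are deparsed strings rather than word vectors. If the list structure should be kept, one possible approach is to apply the function to each element with lapply(). The sketch below uses a two-name stand-in (‘mycorpus.sample’, an assumed name) so that it runs on its own.

```r
# Two names standing in for the full 'mycorpus' list.
mycorpus.sample <- strsplit(c("Abilene Christian University", "Adrian College"), " ")

# lapply() applies the function to every list element and returns a list.
lower.list <- lapply(mycorpus.sample, tolower)
upper.list <- lapply(mycorpus.sample, casefold, upper = TRUE)

lower.list[[1]]   # "abilene" "christian" "university"
upper.list[[2]]   # "ADRIAN" "COLLEGE"
```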
3. ‘College’ dataset – Counting strings
a. Get a vector of all rows of the ‘College’ dataset containing the term ‘University’.
#expected result
  [1]   1   2   5  11  18  19  20  21  22  24  26  28  34  39  40
 [16]  44  46  60  62  64  65  66  70  71  73  75  78  80  82  85
 [31]  88  92  95 103 104 105 108 112 113 116 118 119 124 142 145
 [46] 148 149 153 162 163 164 166 167 172 173 175 177 178 179 181
 [61] 182 190 192 193 197 198 202 204 206 208 209 212 214 215 216
 [76] 219 220 222 224 228 234 244 246 248 249 251 258 265 270 271
 [91] 274 275 276 278 280 283 284 285 289 290 299 304 305 306 307
[106] 310 313 315 316 317 321 325 326 328 329 330 341 345 346 351
[...]
b. How many ‘Universities’ are in the dataset vs ‘Colleges’?
#expected result
#University
[1] 345
#College
[1] 406
universities = str_count(college.names, "University");
universities
which(universities == 1)
b.
sum(str_count(college.names, "University"))
sum(str_count(college.names, "College"))
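Note that summing str_count() counts occurrences of the term, not names. The two numbers only agree when no name contains the term more than once. A minimal base R sketch of the difference follows; the third name is invented for illustration and is not in the ‘College’ dataset.

```r
# Hypothetical names; the third one is made up so that the term appears twice.
sample.names <- c("Adelphi University",
                  "Adrian College",
                  "University of Example State University")

# Number of names that contain the term at least once.
n.names <- sum(grepl("University", sample.names, fixed = TRUE))

# Total number of occurrences across all names.
hits <- gregexpr("University", sample.names, fixed = TRUE)
n.occurrences <- sum(lengths(regmatches(sample.names, hits)))

n.names         # 2
n.occurrences   # 3
```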
4. String padding
a. Let’s say you have a string containing the single word ‘get’. Pad it with zeros on both sides so that the resulting string has a length of 10. (hint: ‘str_pad’).
#expected result [1] "000get0000"
b. Pad the word ‘Oslo’ with spaces on the left side to a total length of 10. Call it ‘paddedstring’.
#expected result [1] "      Oslo"
c. Trim the ‘paddedstring’ back. (hint: ‘str_trim’).
#expected result [1] "Oslo"
d. Truncate ‘paddedstring’ to a width of seven (hint: ‘str_trunc’).
#expected result [1] "...Oslo"
str_pad("get", width = 10, pad = "0", side = "both")
b.
paddedstring = str_pad("Oslo", width = 10);
paddedstring
c.
str_trim(paddedstring)
d.
str_trunc(paddedstring, width = 7, side = "left")
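For comparison, left and right padding with spaces can also be sketched in base R with sprintf(), and trimming with trimws(). This is an alternative to the ‘stringr’ solution, not a replacement for it: base R has no direct equivalent of padding on both sides, and the variable names below are assumptions.

```r
# "%10s" right-justifies in a field of width 10, i.e. pads with spaces on the left.
left.padded  <- sprintf("%10s", "Oslo")
# "%-10s" left-justifies, i.e. pads with spaces on the right.
right.padded <- sprintf("%-10s", "Oslo")

nchar(left.padded)      # 10
trimws(left.padded)     # "Oslo"

# Zero-padding works for integers via the "0" flag.
sprintf("%03d", 7L)     # "007"
```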
5. Separating and uniting string columns of a data.frame
a. Get the data frame ‘mydata’ as below:
mydata = data.frame(
name = c("Hank", "Mike", "Jane", "Sue"),
measurements = c("183m", "179m", "172f", "169f"),
residency = c("London", "Sydney", "Prague", "Dublin"));
mydata
b. Use the library ‘tidyr’ to separate the column ‘measurements’ into ‘height’ and ‘sex’. Specify the split position twice: once counting from the left and once counting from the right.
#expected result
  name height sex residency
1 Hank    183   m    London
2 Mike    179   m    Sydney
3 Jane    172   f    Prague
4  Sue    169   f    Dublin
c. Combine the columns ‘name’ and ‘residency’ into a single column ‘name_res’.
#expected result
     name_res measurements
1 Hank_London         183m
2 Mike_Sydney         179m
3 Jane_Prague         172f
4  Sue_Dublin         169f
mydata = data.frame(
name = c("Hank", "Mike", "Jane", "Sue"),
measurements = c("183m", "179m", "172f", "169f"),
residency = c("London", "Sydney", "Prague", "Dublin"));
mydata
b.
library(tidyr)
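separate(data = mydata, col = measurements,
into = c("height", "sex"), sep = 3)
#split after the third character, counting from the left
separate(data = mydata, col = measurements,
into = c("height", "sex"), sep = -1)
#split before the last character, counting from the right
#(tidyr versions before 0.8.0 needed sep = -2 for the same result)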
c.
unite(data = mydata, col = name_res,
c(1, 3))
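The splitting and uniting steps can also be sketched in base R with substr() and paste(), which avoids depending on how a particular ‘tidyr’ version interprets a negative ‘sep’. The two-row data frame ‘mydata.sample’ and the other names below are assumptions made for this illustration.

```r
# Two rows standing in for 'mydata'.
mydata.sample <- data.frame(
  name = c("Hank", "Jane"),
  measurements = c("183m", "172f"),
  residency = c("London", "Prague"),
  stringsAsFactors = FALSE
)

# Everything except the last character is the height; the last character is the sex.
n      <- nchar(mydata.sample$measurements)
height <- as.numeric(substr(mydata.sample$measurements, 1, n - 1))
sex    <- substr(mydata.sample$measurements, n, n)

# paste() with sep = "_" mirrors the default separator of unite().
name_res <- paste(mydata.sample$name, mydata.sample$residency, sep = "_")

result <- data.frame(name_res, height, sex, stringsAsFactors = FALSE)
result
```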
6. String separation without patterns
a. Get the data frame ‘stringdf’ as outlined below:
a = c("Tom Hastings", "Brian Wall", "Sue Klark")
b = c(T, F, T)
c = rnorm(3)
stringdf = data.frame(Names = as.character(a),
Indicator = b, Measurement = c,
stringsAsFactors = F);
stringdf
b. Attach the ‘stringdf’ and check the class of ‘Names’.
#expected result [1] "character"
c. Create a data.frame with two columns: one for the first and one for the last name.
#expected result
     X1       X2
1   Tom Hastings
2 Brian     Wall
3   Sue    Klark
d. Add the first and last names to the original data.frame and reorder the columns as shown.
#expected result (the Measurement values come from rnorm and will differ)
  FirstName LastName Indicator Measurement
1       Tom Hastings      TRUE  0.89027801
2     Brian     Wall     FALSE -0.04893415
3       Sue    Klark      TRUE -0.77613727
a = c("Tom Hastings", "Brian Wall", "Sue Klark")
b = c(T, F, T)
c = rnorm(3)
stringdf = data.frame(Names = as.character(a),
Indicator = b, Measurement = c,
stringsAsFactors = F);
stringdf
b.
attach(stringdf)
# attach to make the work easier
class(Names)
#check the class of "Names", avoid factors here
c.
namelist = strsplit(stringdf$Names, split = " ")
namelist
# as a list
namedf = data.frame(do.call(rbind, namelist));
# turning the list to a df
namedf
d.
stringdf$FirstName = namedf$X1
#adding the first name
stringdf$LastName = namedf$X2
#adding the last name
stringdf
#checking the df
ResultDF = stringdf[,c(4,5,2,3)];
#reordering the df for final result
ResultDF
7. String separation into two columns – alternative method
a. Use the same data.frame as in exercise #6.
a = c("Tom Hastings", "Brian Wall", "Sue Klark")
b = c(T, F, T)
c = rnorm(3)
stringdf = data.frame(Names = as.character(a),
Indicator = b, Measurement = c,
stringsAsFactors = F);
stringdf
attach(stringdf)
#attach to make the work easier
class(Names)
#check that the class of Names is character, not factor
namelist = strsplit(stringdf$Names, split = " ")
b. This time split the names into a vector.
#expected result
[1] "Tom"      "Hastings" "Brian"    "Wall"     "Sue"
[6] "Klark"
c. Organize the vector in a data.frame as below (hint: matrix and data.frame).
#expected result
  FirstName LastName
1       Tom Hastings
2     Brian     Wall
3       Sue    Klark
a = c("Tom Hastings", "Brian Wall", "Sue Klark")
b = c(T, F, T)
c = rnorm(3)
stringdf = data.frame(Names = as.character(a),
Indicator = b, Measurement = c,
stringsAsFactors = F);
stringdf
attach(stringdf)
#attach to make the work easier
class(Names)
#check that the class of Names is character, not factor
namelist = strsplit(stringdf$Names, split = " ")
b.
namevec = unlist(namelist); namevec
#as a vector
c.
namedf2 <- as.data.frame(matrix(namevec, ncol=2, byrow=TRUE, dimnames = list(c(1:3),c("FirstName", "LastName"))));
namedf2
#organizing the data in a matrix and subsequently a dataframe
#now you can work with FirstName and LastName as usual
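A third way to get the same two columns, sketched here as an assumption-labeled alternative, is to pick the first and second element of each split name with vapply(). The names ‘full.names’, ‘parts’, ‘first’, ‘last’ and ‘namedf3’ are made up for this example, and it assumes every name consists of exactly two words.

```r
full.names <- c("Tom Hastings", "Brian Wall", "Sue Klark")
parts <- strsplit(full.names, " ", fixed = TRUE)

# `[` is the subsetting function; vapply() checks that each result is one string.
first <- vapply(parts, `[`, character(1), 1)
last  <- vapply(parts, `[`, character(1), 2)

namedf3 <- data.frame(FirstName = first, LastName = last,
                      stringsAsFactors = FALSE)
namedf3
```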
8. ‘Mtcars’ dataset – ‘regex’
a. Get the vector ‘mtcars.names’ which contains the row names of ‘mtcars’.
#expected result
 [1] "Mazda RX4"         "Mazda RX4 Wag"
 [3] "Datsun 710"        "Hornet 4 Drive"
 [5] "Hornet Sportabout" "Valiant"
 [7] "Duster 360"        "Merc 240D"
 [9] "Merc 230"          "Merc 280"
[...]
b. Step by step: remove all digits, punctuation, whitespace, and upper- and lower-case letters from the names, so that you end up with a vector of empty strings.
#expected result
 [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[22] "" "" "" "" "" "" "" "" "" "" ""
mtcars.names = rownames(mtcars);
mtcars.names
b.
mtcars.names = gsub("[[:digit:]]", "", mtcars.names); mtcars.names
mtcars.names = gsub("[[:punct:]]", "", mtcars.names); mtcars.names
mtcars.names = gsub("[[:space:]]", "", mtcars.names); mtcars.names
mtcars.names = gsub("[[:upper:]]", "", mtcars.names); mtcars.names
mtcars.names = gsub("[[:lower:]]", "", mtcars.names); mtcars.names
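The five POSIX classes can also be combined inside a single bracket expression, so the same result can be reached in one call. The sketch below uses three names as a stand-in for ‘mtcars.names’ (the name ‘car.names’ is an assumption) and relies on [:alnum:] covering digits as well as upper- and lower-case letters.

```r
# Three names standing in for rownames(mtcars).
car.names <- c("Mazda RX4 Wag", "Fiat X1-9", "Merc 240D")

# One bracket expression can hold several POSIX classes at once.
stripped <- gsub("[[:alnum:][:punct:][:space:]]", "", car.names)
stripped   # three empty strings
```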
9. ‘Mtcars’ dataset – ‘gsub’
a. Change the row names containing ‘Merc’ to the full name ‘Mercedes’.
#expected result
 [1] "Mazda RX4"         "Mazda RX4 Wag"
 [3] "Datsun 710"        "Hornet 4 Drive"
 [5] "Hornet Sportabout" "Valiant"
 [7] "Duster 360"        "Mercedes 240D"
 [9] "Mercedes 230"      "Mercedes 280"
[...]
mtcars.names = rownames(mtcars);
mtcars.names
mtcars.names = gsub("Merc", "Mercedes", mtcars.names);
mtcars.names
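One thing to watch with the plain pattern "Merc" is that running the replacement a second time also matches inside "Mercedes". A possible way around this, shown here only as a sketch with a three-name stand-in (‘car.names’ is an assumed name), is to anchor the pattern to the start of the string and include the following space.

```r
car.names <- c("Mazda RX4", "Merc 240D", "Merc 230")

# "^Merc " only matches the abbreviation at the start, followed by a space.
once  <- sub("^Merc ", "Mercedes ", car.names)
twice <- sub("^Merc ", "Mercedes ", once)
identical(once, twice)   # TRUE: a second run changes nothing

# The unanchored pattern matches again inside "Mercedes" on a second run.
unanchored.twice <- gsub("Merc", "Mercedes", gsub("Merc", "Mercedes", car.names))
unanchored.twice[2]      # "Mercedesedes 240D"
```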
10. Advanced string manipulations with ‘gsub’
a. Get ‘mystring’ as outlined below:
mystring = "Gabriel-Henry.Tedd.-John (Yorkshire)"
b. Use at least two different methods to replace the dot (.) with a space.
#expected result [1] "Gabriel-Henry Tedd -John (Yorkshire)"
c. How would you delete the parentheses together with their content: ‘(Yorkshire)’?
#expected result [1] "Gabriel-Henry.Tedd.-John"
mystring = "Gabriel-Henry.Tedd.-John (Yorkshire)"
b.
gsub(".", " ", mystring, fixed = T)
gsub("\\.", " ", mystring)
gsub("[.]", " ", mystring)
c.
gsub(" *\\([^)]*\\)", "", mystring)
#both parentheses are escaped so that they are matched literally
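As a further method for part b, base R's chartr() translates single characters without any regular expression, and the two steps can be chained. This is a sketch: the intermediate names ‘no.dots’ and ‘cleaned’ are assumptions.

```r
mystring <- "Gabriel-Henry.Tedd.-John (Yorkshire)"

# chartr() maps each character of the first argument to the one in the second.
no.dots <- chartr(".", " ", mystring)
no.dots   # "Gabriel-Henry Tedd -John (Yorkshire)"

# Then drop the parenthesised part together with the space in front of it.
cleaned <- gsub(" *\\([^)]*\\)", "", no.dots)
cleaned   # "Gabriel-Henry Tedd -John"
```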