R Exercises – 61-70 – R String Manipulation | Working with ‘gsub’ and ‘regex’ | Regular Expressions in R

Required packages and datasets

install.packages("ISLR")
install.packages("stringr") #version 1.1.0 (or higher)
install.packages("tidyr")

library(ISLR) #for the college dataset
library(stringr) #string manipulations
library(tidyr)

#modified datasets to be used in some exercises
college.names = rownames(College); college.names
mtcars.names = rownames(mtcars); mtcars.names

1. ‘College’ dataset – Colleges in Texas

a. Get familiar with the ‘College’ dataset and its row names.

b. Get a vector with the college names (‘college.names’), which you will need in the later steps of this exercise and in the following exercises.

#expected result

 [1] "Abilene Christian University" 
 [2] "Adelphi University" 
 [3] "Adrian College" 
 [4] "Agnes Scott College" 
 [5] "Alaska Pacific University" 
 [6] "Albertson College" 
 [7] "Albertus Magnus College" 
 [8] "Albion College" 
 [9] "Albright College" 
 [...]

c. Get a vector (‘texas.college’) that marks every college with ‘Texas’ in its name. Summarize the vector.

#expected result

     V1 
 Texas: 11 
 NA's :766 

d. Which rows contain ‘Texas’?

#expected result

 [1] 179 582 583 584 585 586 587 658 685 686 687

e. Extract the subset of ‘Texas’ colleges from the original ‘College’ dataset.

#expected result

                                   Private  Apps Accept  [...]
East Texas Baptist University          Yes   379    341
Texas A&M Univ. at College Station      No 14474  10519
Texas A&M University at Galveston       No   529    481
Texas Christian University             Yes  4095   3079
Texas Lutheran College                 Yes   497    423
Texas Southern University               No  4345   3245
Texas Wesleyan University              Yes   592    501
University of North Texas               No  4418   2737
University of Texas at Arlington        No  3281   2559
University of Texas at Austin           No 14752   9572
University of Texas at San Antonio      No  4217   3100
[...]
a.
head(College)

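rownames(College)
#the college names are stored as row names, not as a column, so College$rownames would return NULL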


b.
college.names = rownames(College);
college.names


c.
library(stringr)

texas.college = str_match(college.names, "Texas");
summary(texas.college)


d.
rownumbers = which(texas.college == "Texas");
rownumbers


e.
College[rownumbers,]
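
A base R alternative, as a minimal sketch: grep() and grepl() find the matching names without stringr. The vector ‘demo.names’ is a hypothetical stand-in for ‘college.names’, so the snippet runs without the ISLR package.

```r
# Hypothetical stand-in for college.names (not the real dataset)
demo.names <- c("East Texas Baptist University", "Adrian College",
                "Texas Christian University", "Albion College")

# grep() returns the positions of the elements that contain the term
texas.rows <- grep("Texas", demo.names, fixed = TRUE)
texas.rows

# grepl() returns a logical vector that can be used for subsetting
texas.names <- demo.names[grepl("Texas", demo.names, fixed = TRUE)]
texas.names
```

With the real data, the positions returned by grep() could be used in the same way as ‘rownumbers’ above, i.e. College[texas.rows, ].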

2. ‘College’ dataset – General string manipulations

a. Use the ‘college.names’ vector from the previous exercise and split it into single words.

#expected result

[[1]]
[1] "Abilene" "Christian" "University"

[[2]]
[1] "Adelphi" "University"

[[3]]
[1] "Adrian" "College"

[[4]]
[1] "Agnes" "Scott" "College"

[[5]]
[1] "Alaska" "Pacific" "University"

[[6]]
[1] "Albertson" "College" 

b. Convert the single words to lower case.

#expected result

 [1] "c(\"abilene\", \"christian\", \"university\")" 
 [2] "c(\"adelphi\", \"university\")" 
 [3] "c(\"adrian\", \"college\")" 
 [4] "c(\"agnes\", \"scott\", \"college\")" 
 [5] "c(\"alaska\", \"pacific\", \"university\")"
 [...]

c. Put the single words to upper-case using the function: ‘casefold’.

#expected result

 [1] "C(\"ABILENE\", \"CHRISTIAN\", \"UNIVERSITY\")" 
 [2] "C(\"ADELPHI\", \"UNIVERSITY\")" 
 [3] "C(\"ADRIAN\", \"COLLEGE\")" 
 [4] "C(\"AGNES\", \"SCOTT\", \"COLLEGE\")" 
 [5] "C(\"ALASKA\", \"PACIFIC\", \"UNIVERSITY\")"
 [...]

 

a.
mycorpus = strsplit(college.names, " ");
head(mycorpus)


b.
tolower(mycorpus)

c.
casefold(mycorpus, upper = T)
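
Note that tolower() and casefold() coerce the list to a character vector first, which is why the expected results show deparsed strings such as "c(\"abilene\", ...)". If you would rather keep one character vector per college, one possible sketch uses lapply(); ‘demo.names’ is again a hypothetical stand-in for ‘college.names’.

```r
# Hypothetical stand-in for college.names
demo.names <- c("Abilene Christian University", "Adrian College")

# Split every name into its single words (returns a list)
words <- strsplit(demo.names, " ", fixed = TRUE)

# Apply the case conversion to each list element separately,
# so the list structure is preserved
lower.words <- lapply(words, tolower)
upper.words <- lapply(words, casefold, upper = TRUE)

lower.words
upper.words
```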

3. ‘College’ dataset – Counting strings

a. Get a vector of all rows of the ‘College’ dataset containing the term ‘University’.

#expected result

  [1]   1   2   5  11  18  19  20  21  22  24  26  28  34  39  40
 [16]  44  46  60  62  64  65  66  70  71  73  75  78  80  82  85
 [31]  88  92  95 103 104 105 108 112 113 116 118 119 124 142 145
 [46] 148 149 153 162 163 164 166 167 172 173 175 177 178 179 181
 [61] 182 190 192 193 197 198 202 204 206 208 209 212 214 215 216
 [76] 219 220 222 224 228 234 244 246 248 249 251 258 265 270 271
 [91] 274 275 276 278 280 283 284 285 289 290 299 304 305 306 307
[106] 310 313 315 316 317 321 325 326 328 329 330 341 345 346 351
 [...]

b. How many ‘Universities’ are in the dataset vs ‘Colleges’?

#expected result

#University
[1] 345

#College
[1] 406

 

a.
universities = str_count(college.names, "University");
universities

which(universities == 1)


b.
sum(str_count(college.names, "University"))

sum(str_count(college.names, "College"))
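
Keep in mind that summing the counts gives the number of occurrences, not the number of rows: a name that contains the term twice is counted twice, and which(universities == 1) would skip it. The sketch below shows the difference in base R on a small vector; ‘demo.names’ and its first entry are made up for illustration.

```r
# Hypothetical names; the first one contains the term twice
demo.names <- c("University of Demo University Center",
                "Adrian College",
                "Adelphi University")

# Occurrences per name (base R counterpart of a per-element count)
hits <- gregexpr("University", demo.names, fixed = TRUE)
per.name <- lengths(regmatches(demo.names, hits))

total.occurrences <- sum(per.name)        # every occurrence counts
rows.with.term <- which(per.name > 0)     # rows with at least one occurrence
n.rows <- sum(grepl("University", demo.names, fixed = TRUE))

per.name
total.occurrences
rows.with.term
n.rows
```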

4. String padding

a. Let’s say you have a string containing the single word ‘get’. Add zeros on both sides of the word so that the whole string has a length of 10 (hint: ‘str_pad’).

#expected result

[1] "000get0000"

b. Pad the word ‘Oslo’ with spaces on the left side to a total length of 10. Call it ‘paddedstring’.

#expected result

[1] "      Oslo"

c. Trim the ‘paddedstring’ back. (hint: ‘str_trim’).

#expected result

[1] "Oslo"

d. Truncate ‘paddedstring’ to a width of seven (hint: ‘str_trunc’).

#expected result

[1] "...Oslo"

 

a.
str_pad("get", width = 10, pad = "0", side = "both")

b.
paddedstring = str_pad("Oslo", width = 10);
paddedstring


c.
str_trim(paddedstring)

d.
str_trunc(paddedstring, width = 7, side = "left")
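
For comparison, a rough base R sketch of the padding and trimming steps. formatC() and trimws() are base functions; ‘pad_both’ is a hypothetical helper written for this example, and it assumes that the extra character goes to the right side when the padding cannot be split evenly, as in the expected result of a.

```r
# Hypothetical helper: pad a string on both sides with a single character.
# Assumption: with an odd number of pad characters, the right side gets one more.
pad_both <- function(x, width, pad = " ") {
  extra <- pmax(width - nchar(x), 0)   # number of pad characters needed
  left <- extra %/% 2
  right <- extra - left
  paste0(strrep(pad, left), x, strrep(pad, right))
}

padded.get <- pad_both("get", width = 10, pad = "0")
padded.get

# Left padding with spaces and trimming it again
padded.oslo <- formatC("Oslo", width = 10)
padded.oslo
trimmed.oslo <- trimws(padded.oslo)
trimmed.oslo
```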

5. Separating and uniting string columns of a data.frame

a. Get the data frame ‘mydata’ as below:

 mydata = data.frame(
        name = c("Hank", "Mike", "Jane", "Sue"),
        measurements = c("183m", "179m", "172f", "169f"),
        residency = c("London", "Sydney", "Prague", "Dublin")
); mydata

b. Use the library ‘tidyr’ to separate the column ‘measurements’ into ‘height’ and ‘sex’. Code the split position twice: once counted from the left and once counted from the right.

#expected result

  name height sex residency
1 Hank    183   m    London
2 Mike    179   m    Sydney
3 Jane    172   f    Prague
4  Sue    169   f    Dublin

c. Get a data frame in which ‘name’ and ‘residency’ are combined into a single column ‘name_res’.

#expected result

     name_res measurements
1 Hank_London         183m
2 Mike_Sydney         179m
3 Jane_Prague         172f
4  Sue_Dublin         169f

 

a.
mydata = data.frame(
name = c("Hank", "Mike", "Jane", "Sue"),
measurements = c("183m", "179m", "172f", "169f"),
residency = c("London", "Sydney", "Prague", "Dublin"));

mydata


b.
library(tidyr)

separate(data = mydata, col = measurements,
into = c("height", "sex"), 3)

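#counting from the right: in current tidyr releases -1 splits before the last character
#(some older releases needed -2 here)
separate(data = mydata, col = measurements,
into = c("height", "sex"), -1)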


c.
unite(data = mydata, col = name_res, c(1, 3))
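
The same result can be sketched in base R with substr() and paste(), which makes the position logic explicit. This is only an illustration of the idea, not a replacement for the tidyr functions; the column names follow the expected results above.

```r
mydata <- data.frame(
  name = c("Hank", "Mike", "Jane", "Sue"),
  measurements = c("183m", "179m", "172f", "169f"),
  residency = c("London", "Sydney", "Prague", "Dublin"),
  stringsAsFactors = FALSE
)

# Split position counted from the right: the last character is the sex
n <- nchar(mydata$measurements)
mydata$height <- as.numeric(substr(mydata$measurements, 1, n - 1))
mydata$sex <- substr(mydata$measurements, n, n)

# Unite two columns with an underscore as separator
mydata$name_res <- paste(mydata$name, mydata$residency, sep = "_")

mydata
```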

6. String separation without patterns

a. Get the data frame ‘stringdf’ as outlined below:

a = c("Tom Hastings", "Brian Wall", "Sue Klark")

b = c(T, F, T) 

c = rnorm(3)

stringdf = data.frame(Names = as.character(a), 
                     Indicator = b,
                     Measurement = c,
                     stringsAsFactors = F); stringdf

b. Attach the ‘stringdf’ and check the class of ‘Names’.

#expected result

[1] "character"

c. Create a data.frame with two columns: one for the first and one for the last name.

#expected result

     X1       X2
1   Tom Hastings
2 Brian     Wall
3   Sue    Klark

d. Add the first and last names to the original data.frame as columns ‘FirstName’ and ‘LastName’, and reorder the columns as shown below.

#expected result

  FirstName LastName Indicator Measurement
1       Tom Hastings      TRUE  0.89027801
2     Brian     Wall     FALSE -0.04893415
3       Sue    Klark      TRUE -0.77613727

 

a.
a = c("Tom Hastings", "Brian Wall", "Sue Klark")

b = c(T, F, T)

c = rnorm(3)

stringdf = data.frame(Names = as.character(a),
Indicator = b, Measurement = c,
stringsAsFactors = F);
stringdf


b.
attach(stringdf)
# attach to make the work easier

class(Names)
#check the class of "Names", avoid factors here


c.
namelist = strsplit(stringdf$Names, split = " ")

namelist
# as a list

namedf = data.frame(do.call(rbind, namelist));
# turning the list to a df

namedf


d.
stringdf$FirstName = namedf$X1
#adding the first name

stringdf$LastName = namedf$X2
#adding the last name

stringdf
#checking the df

ResultDF = stringdf[,c(4,5,2,3)];
#reordering the df for final result

ResultDF
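
Another way to get from the list to named columns, sketched under the assumption that every name consists of exactly two words: pick the first and the second element of each list entry with vapply(). The object names ‘full.names’, ‘parts’ and ‘name.df’ are chosen for this example only.

```r
full.names <- c("Tom Hastings", "Brian Wall", "Sue Klark")

# One character vector of length two per person
parts <- strsplit(full.names, " ", fixed = TRUE)

# Pick element 1 and element 2 of every list entry;
# character(1) states that each pick must be a single string
name.df <- data.frame(
  FirstName = vapply(parts, `[`, character(1), 1),
  LastName  = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

name.df
```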

7. Strings separation into two columns – alternative method

a. Use the same data.frame as in exercise #6.

a = c("Tom Hastings", "Brian Wall", "Sue Klark")

b = c(T, F, T) 

c = rnorm(3)

stringdf = data.frame(Names = as.character(a), 
                     Indicator = b,
                     Measurement = c,
                     stringsAsFactors = F); stringdf

attach(stringdf) #attach to make the work easier

class(Names) #check that Names is of class character, not factor

namelist = strsplit(stringdf$Names, split = " ")

b. This time flatten the split names into a single vector.

#expected result

[1] "Tom" "Hastings" "Brian" "Wall" "Sue" 
[6] "Klark"

c. Organize the vector in a data.frame as below (hint: matrix and data.frame).

#expected result

  FirstName LastName
1       Tom Hastings
2     Brian     Wall
3       Sue    Klark

 

a.
a = c("Tom Hastings", "Brian Wall", "Sue Klark")

b = c(T, F, T)

c = rnorm(3)

stringdf = data.frame(Names = as.character(a),
Indicator = b, Measurement = c,
stringsAsFactors = F);
stringdf

attach(stringdf)
#attach to make the work easier

class(Names)
#check that Names is of class character, not factor

namelist = strsplit(stringdf$Names, split = " ")


b.
namevec = unlist(namelist); namevec
#as a vector


c.
namedf2 <- as.data.frame(matrix(namevec, ncol=2, byrow=TRUE, dimnames = list(c(1:3),c("FirstName", "LastName"))));
namedf2
#organizing the data in a matrix and subsequently a data frame
#now you can work with the columns FirstName and LastName as usual
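
The argument byrow = TRUE is what keeps first and last names together. A small sketch of the difference, using the vector from b. typed out by hand:

```r
namevec <- c("Tom", "Hastings", "Brian", "Wall", "Sue", "Klark")

# Filled row by row: each row holds one person
by.row <- matrix(namevec, ncol = 2, byrow = TRUE)
by.row

# Default (column by column): first and last names get mixed up
by.col <- matrix(namevec, ncol = 2)
by.col
```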

8. ‘Mtcars’ dataset – ‘regex’

a. Get the vector ‘mtcars.names’ which contains the row names of ‘mtcars’.

#expected result

 [1] "Mazda RX4"         "Mazda RX4 Wag" 
 [3] "Datsun 710"        "Hornet 4 Drive" 
 [5] "Hornet Sportabout" "Valiant" 
 [7] "Duster 360"        "Merc 240D" 
 [9] "Merc 230"          "Merc 280" 
 [...]

b. Step by step: remove all digits, punctuation, whitespace, upper-case letters and lower-case letters from the data, so that you end up with a vector of empty strings.

#expected result

 [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[22] "" "" "" "" "" "" "" "" "" "" ""

 

a.
mtcars.names = rownames(mtcars);
mtcars.names


b.
mtcars.names = gsub("[[:digit:]]", "", mtcars.names); mtcars.names
mtcars.names = gsub("[[:punct:]]", "", mtcars.names); mtcars.names
mtcars.names = gsub("[[:space:]]", "", mtcars.names); mtcars.names
mtcars.names = gsub("[[:upper:]]", "", mtcars.names); mtcars.names
mtcars.names = gsub("[[:lower:]]", "", mtcars.names); mtcars.names
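
Once the single steps are clear, the character classes can also be combined in one bracket expression. A minimal sketch; ‘cleaned’ is a name chosen for this example, and [:alnum:] is used here as a shorthand for digits plus upper- and lower-case letters.

```r
mtcars.names <- rownames(mtcars)

# One bracket expression with several POSIX classes:
# letters and digits, punctuation, and whitespace
cleaned <- gsub("[[:alnum:][:punct:][:space:]]", "", mtcars.names)

cleaned
```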

9. ‘Mtcars’ dataset – ‘gsub’

a. Change the row names containing ‘Merc’ to the full name ‘Mercedes’.

#expected result

 [1] "Mazda RX4"         "Mazda RX4 Wag" 
 [3] "Datsun 710"        "Hornet 4 Drive" 
 [5] "Hornet Sportabout" "Valiant" 
 [7] "Duster 360"        "Mercedes 240D" 
 [9] "Mercedes 230"      "Mercedes 280"
 [...]

 

a.
mtcars.names = rownames(mtcars);
mtcars.names

mtcars.names = gsub("Merc", "Mercedes", mtcars.names);
mtcars.names
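
A plain gsub("Merc", ...) replaces the letters wherever they occur, also inside longer words, and running it twice would turn ‘Mercedes’ into ‘Mercedesedes’. If that matters for your data, one option is to match ‘Merc’ as a whole word with word boundaries (\\b). The vector ‘demo.cars’ below is hypothetical and only serves to show the effect.

```r
# Hypothetical names: an abbreviation, a longer word, an already full name
demo.cars <- c("Merc 230", "Mercury Cougar", "Mercedes 280")

# Naive replacement: also changes "Mercury" and "Mercedes"
naive <- gsub("Merc", "Mercedes", demo.cars)
naive

# Whole-word replacement: only the stand-alone "Merc" is changed
whole.word <- gsub("\\bMerc\\b", "Mercedes", demo.cars, perl = TRUE)
whole.word
```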

10. Advanced string manipulations with ‘gsub’

a. Get ‘mystring’ as outlined below:

mystring = "Gabriel-Henry.Tedd.-John (Yorkshire)"

b. Use at least two different methods to replace the dot (.) with a space.

#expected result

[1] "Gabriel-Henry Tedd -John (Yorkshire)"

c. How would you delete the brackets including their content: ‘(Yorkshire)’?

#expected result

[1] "Gabriel-Henry.Tedd.-John"

 

a.
mystring = "Gabriel-Henry.Tedd.-John (Yorkshire)"

b.
gsub(".", " ", mystring, fixed = T)
gsub("\\.", " ", mystring)
gsub("[.]", " ", mystring)


c.
gsub(" *\\([^)]*\\)", "", mystring)
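
To see how the pattern for c. is built, it can help to read it piece by piece. The sketch below uses a version with both parentheses escaped and adds chartr() as one more possible way to swap single characters in b.; the object names are chosen for this example.

```r
mystring <- "Gabriel-Henry.Tedd.-John (Yorkshire)"

# " *"     optional spaces in front of the bracket
# "\\("    a literal opening parenthesis
# "[^)]*"  any run of characters that are not a closing parenthesis
# "\\)"    a literal closing parenthesis
pattern <- " *\\([^)]*\\)"
no.brackets <- gsub(pattern, "", mystring)
no.brackets

# chartr() translates single characters, here dot to space
no.dots <- chartr(".", " ", mystring)
no.dots
```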
