Core functions you need to know and probably share with 5 of your friends

Let’s talk more about functions. Most of what you’ll do in R will be learning how to use functions. You’ll rarely need to write things from scratch. Let’s look at some common functions that we’ll use every now and then.

Housekeeping: install.packages(), library(), getwd(), setwd()

In R, we often deal with packages (or sets of functions that others wrote to make our lives easier). To use those packages, we first need to install those packages into our computers. To do this, we will use install.packages(x) where x is the name of the package. Let’s say we want to install a package called “babynames”. We can simply type:

install.packages("babynames")

And you’ll soon see some colored output from R telling you how things are going.

Now that we have installed the package on our computer, we need to load it into R before we can start using it. We can use library to do this.

library("babynames")

This command will make the functions (or data) in this package available for us to use.
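A common habit is to install a package only when it is missing, so that rerunning a script does not reinstall it every time. Here is a minimal sketch of that idea. The helper name is_installed is our own invention (it is not part of R); it simply wraps the base function requireNamespace. We try it on stats, which ships with R, so the example runs without downloading anything.

```r
# Hypothetical helper (the name is ours, not part of R):
# TRUE if the package can be loaded, FALSE otherwise.
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with every R installation, so this is TRUE.
is_installed("stats")

# The same check can guard an installation, for example:
# if (!is_installed("babynames")) install.packages("babynames")
```

The last line is left as a comment so that nothing gets installed unless you decide to run it.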

Handling working directories in R can be tricky, but let’s prepare for the worst and hope for the best. In R, there is this notion of a “working directory”: the address in your file system that R is running at. Later, we’ll need to read and write files, and hence we’ll need to handle the file system properly. To do that, we’ll first need to know where we are so that we can point to the right file. I can’t give you directions to the Empire State if I don’t know what your starting point is.

We’ll use getwd() to “get the working directory” that R recognizes. Files in this working directory can be read easily by using their names. For example, let’s say I want to read a table called names.csv (the following code will not run; it is just for illustration):

data <- read.csv('names.csv')

It will only work if there is a table called names.csv in the current working directory. How do we find out what that directory is? Here we use

getwd()

More often, we want to change that working directory. Let’s say that the table names.csv is in another sub-sub-sub-sub directory. In this case, you will need to read the table using the path to the file, starting from the current working directory:

data <- read.csv('folder/subfolder/subsubfolder/subsubsubfolder/names.csv')

That’s because R is standing at Franklin Ave station and you are referring to a building at 34th Station. To teleport from one directory to another, we can use setwd() and change the working directory:

setwd('folder/subfolder/subsubfolder/subsubsubfolder')

This will make the subsubsubfolder our new working directory and now we can simply call the file names.csv without using any folders before the name.

This is very important as we dive deeper to read and write data. For now, however, you can just move on.
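If you want to see the whole round trip in one place, here is a small sketch. It is only an illustration: the folder lives inside R's temporary directory (from tempdir()), the file contents are made up, and the variable names old_wd and demo_dir are our own.

```r
# Remember where we started so we can come back later.
old_wd <- getwd()

# Build a throwaway folder inside R's temporary directory (illustration only).
demo_dir <- file.path(tempdir(), "folder", "subfolder")
dir.create(demo_dir, recursive = TRUE, showWarnings = FALSE)

# Put a tiny made-up table there.
write.csv(data.frame(name = c("Roy", "Tania")),
          file.path(demo_dir, "names.csv"), row.names = FALSE)

# Move into that folder; now the bare file name is enough.
setwd(demo_dir)
data <- read.csv("names.csv")

# Go back to where we started.
setwd(old_wd)
```

Saving the old working directory first is a design choice worth keeping: it lets the rest of your script keep working as if nothing happened.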

Statistical functions: mean, median, sum, var, sd, max, min, table, sort, unique

R is a statistical language and it has tons of functions doing all sorts of statistics. Let’s discuss a few functions and explore the rest as we progress in this course. Those functions usually require a set of numbers as inputs. Let’s define a set of numbers as we did last time (we called it a vector):

numbers <- c(4,3,6,6,3,3,5,3,2,7,3,5,6,3,2,1,5,6,2,12,3,4,5,5)

Calculate the sum of those numbers:

sum(numbers)
## [1] 104

Calculate the length of this list of numbers:

length(numbers)
## [1] 24

We can calculate the mean of those numbers by dividing the sum by the length:

sum(numbers) / length(numbers)
## [1] 4.333333

Or we can simply use mean

mean(numbers)
## [1] 4.333333

We can also find the median

median(numbers)
## [1] 4

Or the variance of those numbers

var(numbers)
## [1] 5.188406

And from the variance, we can find the standard deviation using another function: the square root

sqrt(var(numbers)) 
## [1] 2.277807

Or we can use sd

sd(numbers)
## [1] 2.277807
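Just as we rebuilt the mean from sum and length, we can rebuild the variance by hand. The sketch below assumes the usual sample variance, which divides by the length minus one; that is what var computes. The names squared_diffs and manual_var are our own.

```r
numbers <- c(4,3,6,6,3,3,5,3,2,7,3,5,6,3,2,1,5,6,2,12,3,4,5,5)

# Squared distance of every number from the mean.
squared_diffs <- (numbers - mean(numbers))^2

# Sample variance: divide by n - 1, not n.
manual_var <- sum(squared_diffs) / (length(numbers) - 1)

# Both comparisons should come out TRUE.
all.equal(manual_var, var(numbers))
all.equal(sqrt(manual_var), sd(numbers))
```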

We can also use sort to sort those numbers from lowest to highest:

sort(numbers)
##  [1]  1  2  2  2  3  3  3  3  3  3  3  4  4  5  5  5  5  5  6  6  6  6  7
## [24] 12

or from the highest to the lowest by changing the value of the decreasing argument:

sort(numbers, decreasing=TRUE)
##  [1] 12  7  6  6  6  6  5  5  5  5  5  4  4  3  3  3  3  3  3  3  2  2  2
## [24]  1

Remember, sort returns another vector that we can easily play with. For example, we can use the result of sort to find the minimum number in the list by selecting the first element after the sorting:

sort(numbers)[1]
## [1] 1

Or the maximum number

sort(numbers, decreasing=TRUE)[1]
## [1] 12

Notice that we can use length(numbers) instead of 1 to index the last element of a vector, so sort(numbers)[length(numbers)] also gives the maximum.

We can instead use max and min functions to get the same results:

max(numbers)
## [1] 12
min(numbers) 
## [1] 1

Let’s see one more function: unique which returns a list of unique elements in a given set of numbers. We make use of this function all the time.

numbers <- c(1,1,1,2,2,2,3,3,3,4,4,4)
unique(numbers)
## [1] 1 2 3 4

Now, I am going to give you a function and ask you to guess what this function is doing!

table(numbers)
## numbers
## 1 2 3 4 
## 3 3 3 3

Do you have any ideas? Well, how do we know what functions do and where to read help? We can type a question mark before the name of the function which will give us a readable explanation of what that function is doing with examples and free donuts.

?table
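Once you have read the help page, here is one way to put table to work: finding the most common value in our first set of numbers. This is only a sketch; the names values, counts and most_common are our own.

```r
values <- c(4,3,6,6,3,3,5,3,2,7,3,5,6,3,2,1,5,6,2,12,3,4,5,5)

# Count how many times each distinct value appears.
counts <- table(values)
counts

# which.max gives the position of the largest count;
# names() turns that position back into the value itself (as text).
most_common <- names(counts)[which.max(counts)]
as.numeric(most_common)   # 3, which appears 7 times
```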

Variable Information functions: length, class, is.numeric, as.numeric, is.character, as.character

We have already talked about length but we have a few more functions that are designed to manipulate variables or test specific things about those variables. For example, we can use class to find the recognized type of any variable:

class(numbers) 
## [1] "numeric"

And we can use is.numeric to ask if R recognizes a variable as numeric

is.numeric(numbers)
## [1] TRUE

There is also as.numeric, which will convert a given convertible variable to its numeric form. For example, let’s say we have:

numbers_in_char_form <- c('100', '-100', '2.5')

Now we do recognize those as numerals, but they are stored in R as characters. We can see that with class

class(numbers_in_char_form)
## [1] "character"

which gives us the character type, meaning that we can’t really do any calculations on them. Have you ever divided your name by your height? How do we tell R that those are actually numerals and convert them to numbers? Using as.numeric

as.numeric(numbers_in_char_form)
## [1]  100.0 -100.0    2.5

And now we have those numbers in a numeric form. We will need this later when we get to know different classes of variables.

Similarly, we have as.character() and is.character() to do the same with character data.
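What happens when something is not convertible? Here is a small sketch with a made-up vector called mixed. R turns anything it cannot convert into NA (its marker for a missing value) and raises a warning, which we silence here with suppressWarnings to keep the output tidy.

```r
mixed <- c("100", "2.5", "hello")

is.character(mixed)   # TRUE: everything here is text

# "hello" cannot become a number, so it turns into NA.
converted <- suppressWarnings(as.numeric(mixed))
converted
is.na(converted)      # FALSE FALSE TRUE

# And the other way around: numbers into text.
as.character(c(1, 2.5))   # "1" "2.5"
```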

Utility functions: seq, rep

Now there are still some functions that we’ll use every now and then. Take seq, short for sequence.

seq(from=10, to=1000, by=2)
##   [1]   10   12   14   16   18   20   22   24   26   28   30   32   34   36
##  [15]   38   40   42   44   46   48   50   52   54   56   58   60   62   64
##  [29]   66   68   70   72   74   76   78   80   82   84   86   88   90   92
##  [43]   94   96   98  100  102  104  106  108  110  112  114  116  118  120
##  [57]  122  124  126  128  130  132  134  136  138  140  142  144  146  148
##  [71]  150  152  154  156  158  160  162  164  166  168  170  172  174  176
##  [85]  178  180  182  184  186  188  190  192  194  196  198  200  202  204
##  [99]  206  208  210  212  214  216  218  220  222  224  226  228  230  232
## [113]  234  236  238  240  242  244  246  248  250  252  254  256  258  260
## [127]  262  264  266  268  270  272  274  276  278  280  282  284  286  288
## [141]  290  292  294  296  298  300  302  304  306  308  310  312  314  316
## [155]  318  320  322  324  326  328  330  332  334  336  338  340  342  344
## [169]  346  348  350  352  354  356  358  360  362  364  366  368  370  372
## [183]  374  376  378  380  382  384  386  388  390  392  394  396  398  400
## [197]  402  404  406  408  410  412  414  416  418  420  422  424  426  428
## [211]  430  432  434  436  438  440  442  444  446  448  450  452  454  456
## [225]  458  460  462  464  466  468  470  472  474  476  478  480  482  484
## [239]  486  488  490  492  494  496  498  500  502  504  506  508  510  512
## [253]  514  516  518  520  522  524  526  528  530  532  534  536  538  540
## [267]  542  544  546  548  550  552  554  556  558  560  562  564  566  568
## [281]  570  572  574  576  578  580  582  584  586  588  590  592  594  596
## [295]  598  600  602  604  606  608  610  612  614  616  618  620  622  624
## [309]  626  628  630  632  634  636  638  640  642  644  646  648  650  652
## [323]  654  656  658  660  662  664  666  668  670  672  674  676  678  680
## [337]  682  684  686  688  690  692  694  696  698  700  702  704  706  708
## [351]  710  712  714  716  718  720  722  724  726  728  730  732  734  736
## [365]  738  740  742  744  746  748  750  752  754  756  758  760  762  764
## [379]  766  768  770  772  774  776  778  780  782  784  786  788  790  792
## [393]  794  796  798  800  802  804  806  808  810  812  814  816  818  820
## [407]  822  824  826  828  830  832  834  836  838  840  842  844  846  848
## [421]  850  852  854  856  858  860  862  864  866  868  870  872  874  876
## [435]  878  880  882  884  886  888  890  892  894  896  898  900  902  904
## [449]  906  908  910  912  914  916  918  920  922  924  926  928  930  932
## [463]  934  936  938  940  942  944  946  948  950  952  954  956  958  960
## [477]  962  964  966  968  970  972  974  976  978  980  982  984  986  988
## [491]  990  992  994  996  998 1000

It clearly makes a sequence of numbers in a given range. We also have rep, short for repeat.

rep(c(1,2,3), each=3)
## [1] 1 1 1 2 2 2 3 3 3

Which repeats each element of a given vector a given number of times (hint: try times=3 instead of each=3 and see what happens).
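Going back to seq for a moment: instead of printing a long sequence, we can just ask how long it is. And instead of giving the step with by, we can ask for a fixed number of evenly spaced values with length.out. A short sketch (the name evenly_spaced is our own):

```r
# How many numbers did seq produce above?
length(seq(from = 10, to = 1000, by = 2))   # 496

# Five evenly spaced numbers between 0 and 1.
evenly_spaced <- seq(from = 0, to = 1, length.out = 5)
evenly_spaced   # 0.00 0.25 0.50 0.75 1.00
```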

So now that we have learned about a few functions, you should ask: how do you know if a function actually exists? Nobody really knows, but we use Google, and so should you. However, things are less painful if you keep a cheatsheet of common functions at hand, so you know what is at your disposal (but you’ll google it anyway, so why bother?). I personally use those cheatsheets just to assess how much I know about R’s core functions. You’ll probably need about 10% of the functions in the cheatsheets, but you still want to be friends with them.

Jump from vectors to matrices

A vector is just a set of numbers. In real life we often deal with several sets of numbers arranged in rows and columns, usually called a table or a matrix. In this section we’ll see how to create a matrix and how to select specific elements in rows and columns.

Let’s create a simple matrix:

my_matrix <- matrix(seq(from=1, to=20, by=1), nrow=5, ncol=4)

Which should create a matrix numbered from 1 to 20 with 5 rows and 4 columns.

my_matrix
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
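Notice that R filled the matrix column by column: 1 to 5 went down the first column, 6 to 10 down the second, and so on. If you would rather fill it row by row, matrix has a byrow argument. A quick sketch (the name by_row is our own, and 1:20 is shorthand for the same seq call):

```r
# Same numbers, but filled across the rows instead of down the columns.
by_row <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)
by_row

by_row[1, ]   # 1 2 3 4
```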

Now that we have a matrix, what can we do with it? A matrix is simply a table of numbers. Let’s see some useful functions that help us handle a matrix (or any table). Meet dim, which returns the dimensions of the matrix: the number of rows followed by the number of columns.

dim(my_matrix)
## [1] 5 4

Now, sum will return the sum of the whole matrix

sum(my_matrix)
## [1] 210

To find the sum of rows or columns separately, we need to use special functions: rowSums which returns a list of the sum of each row, and colSums which does the same with columns.

rowSums(my_matrix)
## [1] 34 38 42 46 50
colSums(my_matrix)
## [1] 15 40 65 90
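What if we want something other than a sum for each row or column? One option is apply, which runs a function of our choice over the rows (second argument 1) or the columns (second argument 2). Here is a small sketch that rebuilds the matrix so it runs on its own; row_totals and col_max are names we made up.

```r
my_matrix <- matrix(1:20, nrow = 5, ncol = 4)

# 1 means "go row by row": this matches rowSums.
row_totals <- apply(my_matrix, 1, sum)
row_totals   # 34 38 42 46 50

# 2 means "go column by column": the largest number in each column.
col_max <- apply(my_matrix, 2, max)
col_max      # 5 10 15 20

# For averages there are ready-made shortcuts, too.
rowMeans(my_matrix)   # 8.5 9.5 10.5 11.5 12.5
```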

Row and Column Indexing

Let’s now see how we can select specific rows and columns. To select a specific row, let’s say 3rd row, we simply need to type the position inside the bracket:

my_matrix[3,]
## [1]  3  8 13 18

And to select multiple rows, we can type those rows inside a c()

my_matrix[c(1,3),]
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    3    8   13   18

Which will return the first and third row of the table. Let’s now select the 2nd and 4th column

my_matrix[,c(2,4)]
##      [,1] [,2]
## [1,]    6   16
## [2,]    7   17
## [3,]    8   18
## [4,]    9   19
## [5,]   10   20

The only thing that has changed is the position of the index, which now comes after the comma. So anything before the comma indexes the rows, and anything after the comma indexes the columns.

We can also use a logical index. For example, let’s select only the first two rows. First, we need to create an index (or a sequence) of all rows:

row_index <- 1:5 # we can also use    seq(from=1, to=5, by=1)

Now, we want to retrieve the first two rows using logical indexing. To select the first 2 numbers:

row_index < 3
## [1]  TRUE  TRUE FALSE FALSE FALSE

And we can use that as our index:

my_matrix[row_index < 3,]
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17

Which will return the first two rows, because only the first two values of the index evaluate to TRUE.

We can use logical indexing to filter records from large tables, and you’ll definitely make use of it all the time.
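The index does not have to come from row numbers; it can come from the values inside the matrix. As a sketch, let’s keep only the rows whose first column holds an even number (%% gives the remainder of a division; keep and even_rows are names we made up):

```r
my_matrix <- matrix(1:20, nrow = 5, ncol = 4)

# TRUE where the first column is divisible by 2.
keep <- my_matrix[, 1] %% 2 == 0
keep   # FALSE TRUE FALSE TRUE FALSE

# Use it before the comma to pick those rows.
even_rows <- my_matrix[keep, ]
even_rows
```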

Matrices to DataFrames

Now let’s talk about actual data that you might find in the real-world. First of all, we can see that the matrix is already some kind of data, but it lacks labels and names. So let’s make that into a table with names using data.frame:

df <- data.frame(my_matrix)
df
##   X1 X2 X3 X4
## 1  1  6 11 16
## 2  2  7 12 17
## 3  3  8 13 18
## 4  4  9 14 19
## 5  5 10 15 20

Now, we see that our table has column headings and row numbers. To clean things up a little bit, we can use the names function to change the column names into some fictional names (sorry real-world):

names(df) <- c('age', 'sex', 'day', 'time')
df
##   age sex day time
## 1   1   6  11   16
## 2   2   7  12   17
## 3   3   8  13   18
## 4   4   9  14   19
## 5   5  10  15   20

Which should improve things for us. Do you know why? Because we can now select columns by their names, instead of by their positions as we did in matrices. To select the first column (i.e., age), we used this before:

df[,c(1)]
## [1] 1 2 3 4 5

But now in the new world of data.frames, we can do this:

df[,c('age')]
## [1] 1 2 3 4 5

Or we can simply write the $ sign and then the column name:

df$age
## [1] 1 2 3 4 5

All those ways go to Rome. But we aren’t really going there, so I’ll stick with the $ notation to select single columns. If we want to select multiple columns, then we can do either one of the first two options (by position or by name).
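For several columns at once, here is a quick sketch of both options side by side. It rebuilds the same table so it can run on its own; by_name and by_position are names we made up.

```r
df <- data.frame(matrix(1:20, nrow = 5, ncol = 4))
names(df) <- c('age', 'sex', 'day', 'time')

# Two columns picked by name...
by_name <- df[, c('age', 'time')]

# ...and the same two picked by position.
by_position <- df[, c(1, 4)]

identical(by_name, by_position)   # TRUE
```

Picking by name is usually the safer choice: it keeps working even if someone later reorders the columns.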

We can use the functions we learned about, such as dim, along with nrow and ncol, which return the number of rows and the number of columns. We also have a few more functions to learn about. Let’s use fictional data:

x <- data.frame(student_name=c('Roy','Tania','Sara'), 
          age=c(35, 23, 28),
          sex=c('m','f','f'))
x
##   student_name age sex
## 1          Roy  35   m
## 2        Tania  23   f
## 3         Sara  28   f

We can select the age column and deal with it as a vector of numbers:

x$age
## [1] 35 23 28

And this means we can filter rows based on age. For example, let’s use logical indexing for rows where age is bigger than 25:

x$age > 25
## [1]  TRUE FALSE  TRUE

And we can use that (with potentially any other conditions) to filter rows:

x[x$age>25, ]
##   student_name age sex
## 1          Roy  35   m
## 3         Sara  28   f

Let’s find the names of the students whose age is bigger than 25.

x$student_name[x$age > 25]
## [1] Roy  Sara
## Levels: Roy Sara Tania

Or we can simply type

x[x$age > 25, 'student_name'] 
## [1] Roy  Sara
## Levels: Roy Sara Tania

See how in the row section we used a filter and in the column (after the comma) we selected a specific column. We can also do:

x[x$age > 25,]$student_name
## [1] Roy  Sara
## Levels: Roy Sara Tania

All those are valid ways of filtering and selecting elements in our table.
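Conditions can also be combined: & means “and”, | means “or”. Here is a small sketch on the same fictional students. One assumption to flag: we added stringsAsFactors = FALSE so that names stay plain text. That is the default from R 4.0.0 onward; older versions store text as factors, which is where the “Levels:” line in the output above comes from.

```r
x <- data.frame(student_name = c('Roy', 'Tania', 'Sara'),
                age = c(35, 23, 28),
                sex = c('m', 'f', 'f'),
                stringsAsFactors = FALSE)

# Both conditions must hold: older than 25 AND female.
older_women <- x[x$age > 25 & x$sex == 'f', ]
older_women$student_name   # "Sara"

# Either condition is enough: younger than 25 OR male.
young_or_male <- x[x$age < 25 | x$sex == 'm', ]
young_or_male$student_name   # "Roy" "Tania"
```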

Now that we have explored this fake dataset, let’s see some real data.

We’ll deal with a baby names dataset that tracks the popularity of individual baby names, using data from the U.S. Social Security Administration. To get the data, we’ll install a package and then use the library command to load it.

install.packages('babynames')
library(babynames)

We first want to look at the first few rows to see what we have:

head(babynames)
## # A tibble: 6 x 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179
## 6  1880 F     Margaret   1578 0.0162

We have 5 columns: year, sex, name, n (which I assume is the number of babies with that name at the given year and sex) and prop (i.e., proportion).

We can also look at the last few rows using tail

tail(babynames)
## # A tibble: 6 x 5
##    year sex   name       n       prop
##   <dbl> <chr> <chr>  <int>      <dbl>
## 1  2017 M     Zyhier     5 0.00000255
## 2  2017 M     Zykai      5 0.00000255
## 3  2017 M     Zykeem     5 0.00000255
## 4  2017 M     Zylin      5 0.00000255
## 5  2017 M     Zylis      5 0.00000255
## 6  2017 M     Zyrie      5 0.00000255

Selecting rows (also known as: filtering)

Just like in matrices, we can filter rows in a dataframe using logical indexing. For example, let’s filter only records of 2017:

babynames [ babynames$year==2017 , ]
## # A tibble: 32,469 x 5
##     year sex   name          n    prop
##    <dbl> <chr> <chr>     <int>   <dbl>
##  1  2017 F     Emma      19738 0.0105 
##  2  2017 F     Olivia    18632 0.00994
##  3  2017 F     Ava       15902 0.00848
##  4  2017 F     Isabella  15100 0.00805
##  5  2017 F     Sophia    14831 0.00791
##  6  2017 F     Mia       13437 0.00717
##  7  2017 F     Charlotte 12893 0.00688
##  8  2017 F     Amelia    11800 0.00629
##  9  2017 F     Evelyn    10675 0.00569
## 10  2017 F     Abigail   10551 0.00563
## # … with 32,459 more rows

Which we can deal with as another table. We can simply ask how many records we have by using nrow or dim:

nrow(babynames [ babynames$year==2017 , ])
## [1] 32469
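There is a shortcut for counting: R treats TRUE as 1 and FALSE as 0, so summing a logical index counts the matching rows. The sketch below uses a tiny made-up table called toy (invented names and counts, not the real babynames data) with the same column names, so it runs without the package.

```r
# Made-up miniature of the babynames table (illustration only).
toy <- data.frame(year = c(2016, 2016, 2017, 2017, 2017),
                  sex  = c('F', 'M', 'F', 'F', 'M'),
                  name = c('Ann', 'Bob', 'Ann', 'Cleo', 'Bob'),
                  n    = c(10, 12, 9, 7, 11),
                  stringsAsFactors = FALSE)

is_2017 <- toy$year == 2017

# Two ways to count the 2017 records; both give 3.
sum(is_2017)
nrow(toy[is_2017, ])

# Filters can be combined: girls' names recorded in 2017.
girls_2017 <- toy[toy$year == 2017 & toy$sex == 'F', ]
girls_2017$name   # "Ann" "Cleo"
```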

Let’s make a very simple plot of a name’s frequency across all years. We will use a function called plot, which requires values for the x-axis and the y-axis. Both x and y should be vectors of numbers of the same length. For example, we will plot the frequency of Sarah across all years:

result_sarah_n <- babynames$n[babynames$name=='Sarah' & babynames$sex == 'F']
result_sarah_year <- babynames$year[babynames$name=='Sarah' & babynames$sex == 'F']

Now we are ready to use plot

plot(result_sarah_year, result_sarah_n, type='l')

Did you like that simple plot? We’ll do more plotting next lab and it won’t be this ugly, but now let’s master the basics.

I want to know what the top 5 names in the year 1989 are. How should we approach this? Here we will combine lots of what we have learned previously: sort with logical indexing. Let’s start with the single most frequent name in 1989. We first need to filter the data for 1989:

data_subset <- babynames[babynames$year == 1989,]

Now, we’ll sort the counts in n, setting decreasing = TRUE, and select the first element (or we can use max(data_subset$n)):

most_freq_n <- sort(data_subset$n, decreasing = TRUE)[1]
most_freq_n
## [1] 65382

Now, we will look for the records whose n equals what we just got and print those records:

data_subset[ data_subset$n == most_freq_n , ]
## # A tibble: 1 x 5
##    year sex   name        n   prop
##   <dbl> <chr> <chr>   <int>  <dbl>
## 1  1989 M     Michael 65382 0.0312
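To go from the single top name to the top 5, one option is order. Where sort returns the sorted values, order returns the row positions that would sort them, so we can use it to rearrange the whole table. Here is a sketch on a made-up table called toy (invented names and counts, not the real 1989 data); ranking and top5 are names we made up.

```r
# Made-up names and counts (illustration only).
toy <- data.frame(name = c('Ann', 'Bob', 'Cleo', 'Dan', 'Eve', 'Finn'),
                  n    = c(40, 75, 10, 55, 90, 25),
                  stringsAsFactors = FALSE)

# Row positions from the largest n to the smallest.
ranking <- order(toy$n, decreasing = TRUE)

# Rearrange the rows, then keep the first five.
top5 <- head(toy[ranking, ], 5)
top5$name   # "Eve" "Bob" "Dan" "Ann" "Finn"
```

With the real data, the same two steps should work on data_subset in place of toy.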

If you really want to read a few good articles about this dataset, then here are a few links:

We’ll continue with this dataset later on – hopefully after you skim through those links.

Exercise