A vector is just a sequence of values. In real life we usually deal with many such sequences together, arranged as a table or a matrix. In this section we’ll see how to create a matrix and how to select specific rows, columns, and elements.
Let’s create a simple matrix:
my_matrix <- matrix(seq(from=1,to=20,by=1), nrow=5,ncol=4)
Which should create a matrix numbered from 1 to 20 with 5 rows and 4 columns.
my_matrix
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
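Notice in the output above that the numbers run down each column first. As a small sketch of why (using the shorthand 1:20 instead of seq, and two names, by_col and by_row, that are made up for this illustration), we can compare the default fill order with the byrow argument of matrix:

```r
# By default, matrix() fills column by column
by_col <- matrix(1:20, nrow = 5, ncol = 4)

# With byrow = TRUE, it fills row by row instead
by_row <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)

by_col[1, ]  # first row when filled by column: 1 6 11 16
by_row[1, ]  # first row when filled by row:    1 2 3 4
```

Both matrices hold the same 20 numbers; only their arrangement differs.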
Now that we have a matrix, what can we do with it? A matrix is simply a table of numbers. Let’s see some useful functions that help us handle a matrix (or any table). Meet dim, which returns the dimensions of the matrix: the number of rows followed by the number of columns.
dim(my_matrix)
## [1] 5 4
Now, sum will return the sum of all the elements in the matrix:
sum(my_matrix)
## [1] 210
To find the sum of rows or columns separately, we need to use special functions: rowSums, which returns a vector with the sum of each row, and colSums, which does the same for columns.
rowSums(my_matrix)
## [1] 34 38 42 46 50
colSums(my_matrix)
## [1] 15 40 65 90
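To see what rowSums and colSums are doing, here is a minimal sketch that reproduces them with other base R functions. The variable names are made up for this illustration, and apply is just one possible alternative, not the recommended one:

```r
my_matrix <- matrix(1:20, nrow = 5, ncol = 4)

# rowSums is a shortcut for applying sum() over each row (margin 1 = rows)
row_totals <- rowSums(my_matrix)
row_totals_apply <- apply(my_matrix, 1, sum)

# colMeans works like colSums but averages; it equals colSums / number of rows
col_means <- colMeans(my_matrix)
col_means_manual <- colSums(my_matrix) / nrow(my_matrix)

row_totals  # 34 38 42 46 50
col_means   # 3 8 13 18
```

The dedicated functions are shorter to type and usually faster, so they are a reasonable default.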
Let’s now see how we can select specific rows and columns. To select a specific row, say the 3rd row, we simply type its position inside the brackets, before the comma:
my_matrix[3,]
## [1] 3 8 13 18
And to select multiple rows, we can put their positions inside c():
my_matrix[c(1,3),]
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 3 8 13 18
Which will return the first and third rows of the table. Let’s now select the 2nd and 4th columns:
my_matrix[,c(2,4)]
## [,1] [,2]
## [1,] 6 16
## [2,] 7 17
## [3,] 8 18
## [4,] 9 19
## [5,] 10 20
The only thing that has changed is the position of the index, which is now after the comma. Anything before the comma indexes rows, and anything after the comma indexes columns.
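Since rows go before the comma and columns after it, the two can be combined in one expression. A short sketch (the name sub_matrix is made up for this example):

```r
my_matrix <- matrix(1:20, nrow = 5, ncol = 4)

# A single element: row 3, column 2
my_matrix[3, 2]  # 8

# Rows 1 and 3, columns 2 and 4, all at once
sub_matrix <- my_matrix[c(1, 3), c(2, 4)]
sub_matrix

# A negative index drops rows instead of selecting them
all_but_first <- my_matrix[-1, ]
dim(all_but_first)  # 4 4
```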
We can also use a logical index. For example, let’s select only the first two rows. First, we need to create an index (or a sequence) of all rows:
row_index <- 1:5 # we can also use seq(from=1, to=5, by=1)
Now, we want to retrieve the first two rows using logical indexing. To select the first 2 numbers:
row_index < 3
## [1] TRUE TRUE FALSE FALSE FALSE
And we can use that as our index:
my_matrix[row_index < 3,]
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
Which will return the first two rows, because only the first two elements evaluate to TRUE.
Logical indexing is how we filter records from large tables, and you’ll definitely make use of it all the time.
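The logical index does not have to come from row positions. As a sketch, we can build it from the values stored in the matrix itself, here keeping the rows whose first column is an even number (keep and even_rows are names invented for this example):

```r
my_matrix <- matrix(1:20, nrow = 5, ncol = 4)

# TRUE where the value in column 1 is even
keep <- my_matrix[, 1] %% 2 == 0
keep  # FALSE TRUE FALSE TRUE FALSE

# Only the rows marked TRUE are returned
even_rows <- my_matrix[keep, ]
even_rows

# which() turns the logical index back into positions
which(keep)  # 2 4
```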
Now let’s talk about actual data that you might find in the real world. The matrix is already some kind of data, but it lacks labels and names. So let’s turn it into a table with names using data.frame:
df <- data.frame(my_matrix)
df
## X1 X2 X3 X4
## 1 1 6 11 16
## 2 2 7 12 17
## 3 3 8 13 18
## 4 4 9 14 19
## 5 5 10 15 20
Now we see that our table has column headings and row numbers. To clean things up a little, we can use the names function to change the column names to some fictional names (sorry real-world):
names(df) <- c('age', 'sex', 'day', 'time')
df
## age sex day time
## 1 1 6 11 16
## 2 2 7 12 17
## 3 3 8 13 18
## 4 4 9 14 19
## 5 5 10 15 20
Which should improve things for us. Do you know why? Because we can now select columns by their names, instead of by their positions as we did in matrices. To select the first column (i.e., age) by position, we would write:
df[,c(1)]
## [1] 1 2 3 4 5
But now in the new world of data.frames, we can do this:
df[,c('age')]
## [1] 1 2 3 4 5
Or we can simply write the $ sign and then the column name:
df$age
## [1] 1 2 3 4 5
All those ways go to Rome. But we aren’t really going there, so I’ll stick with the $ notation to select single columns. If we want to select multiple columns, then we can do either one of the first two options (by position or by name).
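For multiple columns, a minimal sketch of selecting by name looks like this. It also shows the drop argument, which controls whether a single selected column comes back as a plain vector or stays a one-column table (the result names are made up for this example):

```r
df <- data.frame(matrix(1:20, nrow = 5, ncol = 4))
names(df) <- c('age', 'sex', 'day', 'time')

# Two columns by name: the result is still a data frame
two_cols <- df[, c('age', 'time')]

# One column: a plain data.frame simplifies this to a vector by default
one_col_vec <- df[, 'age']

# drop = FALSE keeps it as a one-column data frame
one_col_df <- df[, 'age', drop = FALSE]

is.data.frame(two_cols)     # TRUE
is.data.frame(one_col_vec)  # FALSE
is.data.frame(one_col_df)   # TRUE
```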
We can use all the functions we learned about: dim, nrow, ncol, etc. We also have a few more functions to learn about. Let’s use fictional data:
x <- data.frame(student_name=c('Roy','Tania','Sara'),
                age=c(35, 23, 28),
                sex=c('m','f','f'))
x
## student_name age sex
## 1 Roy 35 m
## 2 Tania 23 f
## 3 Sara 28 f
We can select the age column and treat it as a vector of numbers:
x$age
## [1] 35 23 28
And this means we can filter rows based on age. For example, let’s use logical indexing for rows where age is bigger than 25:
x$age > 25
## [1] TRUE FALSE TRUE
And we can use that (with potentially any other conditions) to filter rows:
x[x$age>25, ]
## student_name age sex
## 1 Roy 35 m
## 3 Sara 28 f
Let’s find the name of the students whose age is bigger than 25.
x$student_name[x$age > 25]
## [1] Roy Sara
## Levels: Roy Sara Tania
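(The Levels line appears because older versions of R store text columns as factors by default; from R 4.0 on, you will see a plain character vector instead.) Or we can simply type: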
x[x$age > 25, 'student_name']
## [1] Roy Sara
## Levels: Roy Sara Tania
See how in the row section we used a filter and in the column (after the comma) we selected a specific column. We can also do:
x[x$age > 25,]$student_name
## [1] Roy Sara
## Levels: Roy Sara Tania
All those are valid ways of filtering and selecting elements in our table.
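Conditions can also be combined. As a sketch on the same fictional students, & keeps rows where both conditions hold and | keeps rows where at least one holds (the result names are invented for this example):

```r
x <- data.frame(student_name = c('Roy', 'Tania', 'Sara'),
                age = c(35, 23, 28),
                sex = c('m', 'f', 'f'))

# Older than 25 AND female: only Sara
older_women <- x[x$age > 25 & x$sex == 'f', ]
older_women

# Male OR younger than 25: Roy and Tania
men_or_young <- x[x$sex == 'm' | x$age < 25, ]
men_or_young
```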
Now that we have explored this fake dataset, let’s see some real data.
We’ll deal with a baby names dataset that tracks the popularity of individual baby names, with data from the U.S. Social Security Administration. To get the data, we’ll install a package and then use the library command to load it.
install.packages('babynames')
library(babynames)
We first want to look at the first few rows to see what we have:
head(babynames)
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
We have 5 columns: year, sex, name, n (which I assume is the number of babies with that name at the given year and sex) and prop (i.e., proportion).
We can also look at the last few rows using tail:
tail(babynames)
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2017 M Zyhier 5 0.00000255
## 2 2017 M Zykai 5 0.00000255
## 3 2017 M Zykeem 5 0.00000255
## 4 2017 M Zylin 5 0.00000255
## 5 2017 M Zylis 5 0.00000255
## 6 2017 M Zyrie 5 0.00000255
Just like in matrices, we can filter rows in a dataframe using logical indexing. For example, let’s filter only records of 2017:
babynames [ babynames$year==2017 , ]
## # A tibble: 32,469 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2017 F Emma 19738 0.0105
## 2 2017 F Olivia 18632 0.00994
## 3 2017 F Ava 15902 0.00848
## 4 2017 F Isabella 15100 0.00805
## 5 2017 F Sophia 14831 0.00791
## 6 2017 F Mia 13437 0.00717
## 7 2017 F Charlotte 12893 0.00688
## 8 2017 F Amelia 11800 0.00629
## 9 2017 F Evelyn 10675 0.00569
## 10 2017 F Abigail 10551 0.00563
## # … with 32,459 more rows
Which we can treat as just another table. For example, we can ask how many records it has using nrow or dim:
nrow(babynames [ babynames$year==2017 , ])
## [1] 32469
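As an aside, the same count can be read straight off the logical index, because R treats TRUE as 1 and FALSE as 0 when summing. The sketch below uses a tiny made-up table called toy (not the babynames data) so that it runs on its own:

```r
# Made-up records, only for illustration
toy <- data.frame(year = c(2016, 2017, 2017, 2017),
                  name = c('A', 'B', 'C', 'D'),
                  n = c(10, 20, 30, 40))

# Count by filtering first, then asking for the number of rows
count_by_filter <- nrow(toy[toy$year == 2017, ])

# Count by summing the logical index directly
count_by_sum <- sum(toy$year == 2017)

count_by_filter  # 3
count_by_sum     # 3
```

With the real data, sum(babynames$year == 2017) should agree with the nrow result above.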
Let’s make a very simple plot of name frequencies across all years. We will use a function called plot, which requires values for an x-axis and a y-axis. Both x and y should be vectors of numbers of the same length. For example, we will plot the frequency of Sarah across all years:
result_sarah_n <- babynames$n[babynames$name=='Sarah' & babynames$sex == 'F']
result_sarah_year <- babynames$year[babynames$name=='Sarah' & babynames$sex == 'F']
Now we are ready to use plot
plot(result_sarah_year, result_sarah_n, type='l')
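A slightly tidier variant is to filter once and take both axes from the same filtered table, which guarantees that x and y have the same length. The sketch below assumes a small made-up table called toy (the counts are invented, not real babynames values) and adds axis labels through the xlab, ylab, and main arguments of plot:

```r
# Made-up yearly counts, only for illustration
toy <- data.frame(year = c(2000, 2001, 2002, 2003),
                  name = c('Sarah', 'Sarah', 'Sarah', 'Sarah'),
                  n = c(12, 15, 11, 9))

# Filter once, then pull both axes from the same filtered table
sarah <- toy[toy$name == 'Sarah', ]

# Line plot with labelled axes and a title
plot(sarah$year, sarah$n, type = 'l',
     xlab = 'Year', ylab = 'Number of babies', main = 'Sarah')
```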
Did you like that simple plot? We’ll do more plotting next lab and it won’t be this ugly, but now let’s master the basics.
I want to know what the top 5 names in the year 1989 are. How should we approach this? Here we will combine a lot of what we have learned previously: sort with logical indexing. Let’s start with the single most frequent name in 1989. We first need to filter the data for 1989:
data_subset <- babynames[babynames$year == 1989,]
Now, we’ll sort the counts (n) with decreasing = TRUE and select the first element (or we can use max(data_subset$n)):
most_freq_n <- sort(data_subset$n, decreasing = TRUE)[1]
most_freq_n
## [1] 65382
Now, we will look for the records whose n equals what we just got and print those records:
data_subset[ data_subset$n == most_freq_n , ]
## # A tibble: 1 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1989 M Michael 65382 0.0312
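To go from the single top name to the top few, one possible approach is order, which returns the row positions in sorted order so that whole rows can be rearranged. This is a sketch on a tiny made-up table called toy (names and counts are invented), taking the top 2 instead of the top 5:

```r
# Made-up records, only for illustration
toy <- data.frame(year = c(1989, 1989, 1989, 1990),
                  name = c('Aaa', 'Bbb', 'Ccc', 'Ddd'),
                  n = c(50, 200, 120, 999))

# Step 1: keep only the year we care about
subset_1989 <- toy[toy$year == 1989, ]

# Step 2: reorder the rows from the largest n to the smallest
ranked <- subset_1989[order(subset_1989$n, decreasing = TRUE), ]

# Step 3: keep the first few rows
top_2 <- head(ranked, 2)
top_2
```

On the real data, the same three steps with data_subset and head(ranked, 5) should give the five most frequent records for 1989.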
If you want to read a few good articles about this dataset, here are a few links:
We’ll continue with this dataset later on – hopefully after you skim through those links.
mean), and compare your result with colMeans.
babynames dataset, do the following (and you are free to use any function now):
plot function. What about other names? Just type as many names as you can until you see names that have interesting trends. When you see something interesting, just use it in your final solution and tell me why you think it is interesting (probably on Blackboard when you submit the assignment).