In R you have multiple options when repeating calculations: vectorized operations, `for` loops, and `apply` functions.
This lesson is an extension of Analyzing Multiple Data Sets. In that lesson, we introduced how to run a custom function, `analyze`, over multiple data files:
analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # Input is character string of a csv file.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}
filenames <- list.files(path = "data", pattern = "inflammation", full.names = TRUE)
A key difference between R and many other languages is a topic known as vectorization. When you wrote the `total` function, we mentioned that R already has `sum` to do this; `sum` is much faster than the interpreted `for` loop because `sum` is coded in C to work with a vector of numbers. Many of R's functions work this way; the loop is hidden from you in C. Learning to use vectorized operations is a key skill in R.
For example, to add pairs of numbers contained in two vectors
a <- 1:10
b <- 1:10
You could loop over the pairs, adding each in turn, but that would be very inefficient in R:
res <- numeric(length = length(a))
for (i in seq_along(a)) {
  res[i] <- a[i] + b[i]
}
res
[1] 2 4 6 8 10 12 14 16 18 20
Instead, `+` is a vectorized function which can operate on entire vectors at once:
res2 <- a + b
all.equal(res, res2)
[1] TRUE
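The same contrast applies to the `total` versus `sum` comparison mentioned earlier. The sketch below assumes a loop-based `total`; the exact definition from the earlier lesson is not shown in this section, so the version here is a hypothetical reconstruction. Timings are omitted because they vary from machine to machine.

```r
# Hypothetical reconstruction of a loop-based total(); the original
# definition from the earlier lesson is not reproduced in this section.
total <- function(vec) {
  vec_sum <- 0
  for (num in vec) {
    vec_sum <- vec_sum + num
  }
  vec_sum
}

x <- as.numeric(1:1e6)

# Both approaches give the same answer.
all.equal(total(x), sum(x))

# Compare the elapsed times; the built-in is usually the faster of the two.
system.time(total(x))
system.time(sum(x))
```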
When performing vector operations in R, it is important to know about recycling. If you perform an operation on two or more vectors of unequal length, R will recycle elements of the shorter vector(s) to match the longest vector. For example:
a <- 1:10
b <- 1:5
a + b
[1] 2 4 6 8 10 7 9 11 13 15
The elements of `a` and `b` are added together starting from the first element of both vectors. When R reaches the end of the shorter vector `b`, it starts again at the first element of `b` and continues until it reaches the last element of the longest vector `a`. This behaviour may seem crazy at first glance, but it is very useful when you want to perform the same operation on every element of a vector. For example, say we want to multiply every element of our vector `a` by 5:
a <- 1:10
b <- 5
a * b
[1] 5 10 15 20 25 30 35 40 45 50
Remember there are no scalars in R, so `b` is actually a vector of length 1; in order to multiply every element of `a` by its value, it is recycled to match the length of `a`.
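One way to see what recycling does is to spell it out by hand with `rep`. This is only an illustration of the behaviour described above, not how R implements it internally; the name `b_recycled` is made up for the example.

```r
a <- 1:10
b <- 1:5

# Repeat b until it is as long as a, which is what recycling does implicitly.
b_recycled <- rep(b, length.out = length(a))
b_recycled

# The explicit version matches the implicit one.
identical(a + b, a + b_recycled)

# A length-1 vector is recycled in the same way.
identical(a * 5, a * rep(5, times = length(a)))
```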
When the length of the longer object is a multiple of the shorter object length (as in our example above), the recycling occurs silently. When the longer object length is not a multiple of the shorter object length, a warning is given:
a <- 1:10
b <- 1:7
a + b
Warning in a + b: longer object length is not a multiple of shorter object
length
[1] 2 4 6 8 10 12 14 9 11 13
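If recycling would hide a mistake in your own code, you can check lengths before operating. The helper below, `add_strict`, is a hypothetical function written for this example (it is not part of base R); it allows equal lengths or a length-1 operand and stops otherwise.

```r
# add_strict is a hypothetical helper, not a base R function.
add_strict <- function(x, y) {
  # Allow equal lengths or a length-1 operand; reject everything else.
  if (length(x) != length(y) && length(x) != 1 && length(y) != 1) {
    stop("x and y must have the same length (or one must have length 1)")
  }
  x + y
}

add_strict(1:10, 1:10)  # equal lengths: allowed
add_strict(1:10, 5)     # length-1 operand: allowed

# Mismatched lengths now raise an error; capture its message for inspection.
res <- tryCatch(add_strict(1:10, 1:7),
                error = function(e) conditionMessage(e))
res
```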
`for` or `apply`?

A `for` loop is used to apply the same function calls to a collection of objects. R has a family of functions, the `apply` family, which can be used in much the same way. You've already used one of the family, `apply`, in the first lesson. The `apply` family members include
- `apply`: apply over the margins of an array (e.g. the rows or columns of a matrix)
- `lapply`: apply over an object and return a list
- `sapply`: apply over an object and return a simplified object (an array) if possible
- `vapply`: similar to `sapply`, but you specify the type of object returned by the iterations

Each of these has an argument `FUN` which takes a function to apply to each element of the object. Instead of looping over `filenames` and calling `analyze`, as you did earlier, you could `sapply` over `filenames` with `FUN = analyze`:
sapply(filenames, FUN = analyze)
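To make the differences between the family members concrete, here is a small sketch on made-up data. The list `readings` and the matrix `m` are stand-ins invented for this example, not the inflammation files.

```r
# A small stand-in list; the values are made up for illustration.
readings <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# lapply always returns a list, one element per input element.
as_list <- lapply(readings, FUN = mean)

# sapply simplifies the result to a named numeric vector when it can.
as_vector <- sapply(readings, FUN = mean)

# vapply does the same, but stops with an error if any result
# is not a single numeric value.
checked <- vapply(readings, FUN = mean, FUN.VALUE = numeric(1))

# apply works over the margins of a matrix: 1 = rows, 2 = columns.
m <- matrix(1:6, nrow = 2)
col_means <- apply(m, 2, mean)
```

`vapply` is the more defensive choice inside functions, because a result of unexpected type or length is reported immediately rather than silently changing the shape of the output.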
Deciding whether to use `for` or one of the `apply` family is really personal preference. Using an `apply` family function forces you to encapsulate your operations as a function rather than as separate calls within a `for` loop. A `for` loop is more natural in some circumstances; for several related operations, a `for` loop saves you from having to pass a lot of extra arguments to your function.
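The point about extra arguments can be sketched as follows. The function `scale_and_shift` and the variables `multiplier` and `offset` are hypothetical names chosen for this example. With `sapply`, any extra values must be passed through to `FUN` explicitly; inside a `for` loop they are simply in scope.

```r
# scale_and_shift is a hypothetical helper used only for illustration.
scale_and_shift <- function(x, multiplier, offset) {
  x * multiplier + offset
}

values <- c(1, 2, 3)
multiplier <- 10
offset <- 1

# With sapply, extra arguments are forwarded to FUN by name.
via_sapply <- sapply(values, FUN = scale_and_shift,
                     multiplier = multiplier, offset = offset)

# With a for loop, the same variables are used directly in the loop body.
via_loop <- numeric(length(values))
for (i in seq_along(values)) {
  via_loop[i] <- values[i] * multiplier + offset
}

identical(via_sapply, via_loop)
```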
Are loops in R slow? No, they are not, if you follow some golden rules, in particular:

- Don't grow objects (via `c`, `cbind`, etc.) during the loop: R has to create a new object and copy across the information just to add a new element or row/column.

As an example, we'll create a new version of `analyze` that will return the mean inflammation per day (column) of each file.
analyze2 <- function(filenames) {
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    res <- apply(fdata, 2, mean)
    if (f == 1) {
      out <- res
    } else {
      # The loop is slowed by this call to cbind that grows the object
      out <- cbind(out, res)
    }
  }
  return(out)
}
system.time(avg2 <- analyze2(filenames))
user system elapsed
0.040 0.004 0.045
Note how we add a new column to `out` at each iteration? This is a cardinal sin of writing a `for` loop in R.
Instead, we can create an empty matrix with the right dimensions (rows/columns) to hold the results. Then we loop over the files, but this time we fill in the `f`th column of our results matrix `out`. This time there is no copying/growing for R to deal with.
analyze3 <- function(filenames) {
  out <- matrix(ncol = length(filenames), nrow = 40) ## assuming 40 here from files
  for (f in seq_along(filenames)) {
    fdata <- read.csv(filenames[f], header = FALSE)
    out[, f] <- apply(fdata, 2, mean)
  }
  return(out)
}
system.time(avg3 <- analyze3(filenames))
user system elapsed
0.048 0.000 0.047
In this simple example there is little difference in the compute time of `analyze2` and `analyze3`. This is because we are only iterating over 12 files and hence we only incur 12 copy/grow operations. If we were doing this over more files or the data objects we were growing were larger, the penalty for copying/growing would be much larger.
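To see the penalty at a larger scale without needing more data files, the sketch below simulates the two strategies on generated columns. The functions `grow` and `prealloc`, and the sizes `n` and `n_days`, are assumptions made for this example. No timings are shown because they depend on the machine, but the gap between the two calls generally widens as `n` increases.

```r
n <- 2000     # number of simulated "files"; an arbitrary choice
n_days <- 40  # rows per column, matching the 40 assumed in the text above

# Grow the result one column at a time, as analyze2 does.
grow <- function(n, n_days) {
  out <- NULL
  for (i in seq_len(n)) {
    # cbind builds a new, larger matrix and copies the old one into it.
    out <- cbind(out, rep(as.numeric(i), n_days))
  }
  out
}

# Allocate the full matrix once and fill it in, as analyze3 does.
prealloc <- function(n, n_days) {
  out <- matrix(NA_real_, nrow = n_days, ncol = n)
  for (i in seq_len(n)) {
    out[, i] <- rep(as.numeric(i), n_days)
  }
  out
}

system.time(grown <- grow(n, n_days))
system.time(filled <- prealloc(n, n_days))

# Both strategies produce the same values.
all.equal(unname(grown), unname(filled))
```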
Note that `apply` handles these memory allocation issues for you, but then you have to write the loop part as a function to pass to `apply`. At its heart, `apply` is just a `for` loop with extra convenience.
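As a rough sketch of what that looks like, the per-file work can be written as a function and handed to `sapply`, which assembles the result matrix for you. The name `analyze4` is hypothetical, and the example writes three small made-up CSV files to a temporary directory so that it runs without the lesson's data.

```r
# Create three small stand-in CSV files (made-up data, not the lesson's files).
tmp_dir <- tempfile("inflammation_demo")
dir.create(tmp_dir)
for (i in 1:3) {
  m <- matrix(i, nrow = 4, ncol = 5)
  write.table(m, file = file.path(tmp_dir, paste0("demo-0", i, ".csv")),
              sep = ",", row.names = FALSE, col.names = FALSE)
}
demo_files <- list.files(path = tmp_dir, pattern = "demo", full.names = TRUE)

# analyze4 is a hypothetical name: the loop body becomes a function,
# and sapply collects the per-file column means into a matrix.
analyze4 <- function(filenames) {
  sapply(filenames, FUN = function(f) {
    fdata <- read.csv(f, header = FALSE)
    apply(fdata, 2, mean)
  }, USE.NAMES = FALSE)
}

avg4 <- analyze4(demo_files)
dim(avg4)  # one row per column of the input, one column per file
```

Unlike `analyze3`, this version does not need to know the number of columns in advance, because `sapply` sizes the result from what each call returns.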