Data preparation is an important early step in data analysis. It involves casting data into the right format for downstream use, and reshaping data in R is often easier said than done. This article shows how to convert a dataset between wide and long format in R.
reshape numeric vectors
To reshape a numeric vector, the simplest tool is the matrix command. Let’s see an example:
a <- seq(1, 50)
b <- matrix(a, 5, 10)                # filled column by column (the default)
b <- matrix(a, 5, 10, byrow = TRUE)  # filled row by row
Note the byrow option: with byrow = TRUE the matrix is filled row by row (horizontally), whereas the default, byrow = FALSE, fills it column by column (vertically).
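To make the difference concrete, here is a minimal sketch on a much smaller vector. The names v, by_col and by_row are made up for this illustration and are not part of the example above.

```r
# A small vector so both fill orders are easy to read
v <- 1:6

# Default (byrow = FALSE): values run down each column first
by_col <- matrix(v, nrow = 2, ncol = 3)

# byrow = TRUE: values run along each row first
by_row <- matrix(v, nrow = 2, ncol = 3, byrow = TRUE)

print(by_col)  # first row is 1 3 5
print(by_row)  # first row is 1 2 3
```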
reshape matrix back to a vector
Suppose the last variable b is still in our workspace. We can convert it back to a vector with:
as.vector(t(b))
Notice that we had to transpose the matrix b (with t(b)) because b was filled by row, while as.vector reads a matrix column by column and returns its columns joined end to end.
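A quick way to check this round trip is to compare the result with and without the transpose. This is only a sketch; the names wrong and right are introduced here for illustration.

```r
a <- seq(1, 50)
b <- matrix(a, 5, 10, byrow = TRUE)

# Without transposing, the columns of b are read one after another
wrong <- as.vector(b)

# Transposing first turns the rows of b into columns, restoring the order
right <- as.vector(t(b))

print(head(wrong))          # starts 1, 11, 21, ...
print(identical(right, a))  # TRUE
```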
reshape data frames in R
This is where it becomes interesting. Data frames are objects in which each row corresponds to one observation, characterised by several properties (one per column).
So what does it mean to reshape a data frame? The key point is that, so far, each row represents a unique observation. This is called the “Wide” format. But there is another useful representation of the same data, in which each row represents a single measurement, even if we end up with several rows for the same observation, one per measurement. This is called the “Long” format.
See the two examples below of the two formats:
Data Frame in Wide format
  id age height sex
1  1  19     89   M
2  2  34     65   F
3  3  40     74   M
4  4  20     65   M
Data Frame in Long format
         id   subj score
1.age     1    age    19
2.age     2    age    34
3.age     3    age    40
4.age     4    age    20
1.height  1 height    89
2.height  2 height    65
3.height  3 height    74
4.height  4 height    65
1.sex     1    sex     M
2.sex     2    sex     F
3.sex     3    sex     M
4.sex     4    sex     M
As you can see, each variable is now unfolded into rows, and there are two new variables called subj and score. Assuming the wide data frame above is stored in a, this can be obtained with the reshape command as follows:
d <- reshape(a,
             varying   = c('age', 'height', 'sex'),
             timevar   = "subj",
             v.names   = "score",
             times     = c('age', 'height', 'sex'),
             direction = 'long')
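The article never shows how a was created, so the following sketch rebuilds it from the wide table above before calling reshape. The data.frame call is an assumption about how the data might have been entered. Because score ends up holding both numbers and letters, R is expected to store it as text.

```r
# Assumed construction of the wide data frame shown earlier
a <- data.frame(
  id     = 1:4,
  age    = c(19, 34, 40, 20),
  height = c(89, 65, 74, 65),
  sex    = c("M", "F", "M", "M"),
  stringsAsFactors = FALSE
)

# Wide to long: one row per (id, measurement) pair
d <- reshape(a,
             varying   = c("age", "height", "sex"),  # columns to stack
             v.names   = "score",                    # column holding the stacked values
             timevar   = "subj",                     # column naming the source variable
             times     = c("age", "height", "sex"),  # labels written into subj
             direction = "long")

print(d)
```

Since a already has a column called id, reshape uses it as the identifier, which is why the row names look like 1.age, 2.age and so on.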
Complicated? Eyes crossed? Yes, mine too. So what is the easiest way to reshape a data frame?
The easy way to reshape a data frame is to use melt and the cast functions from the reshape2 package. (There is also an older package called reshape; use the newer reshape2.)
install.packages('reshape2')
library(reshape2)
melt(a)
   sex variable value
1    M       id     1
2    F       id     2
3    M       id     3
4    M       id     4
5    M      age    19
6    F      age    34
7    M      age    40
8    M      age    20
9    M   height    89
10   F   height    65
11   M   height    74
12   M   height    65
In this case the simple command melt(a) tried to guess the id variables and chose sex, because it is the only non-numeric column. We can instead tell melt explicitly which column is the id variable:
d <- melt(a, id.vars = 'id')
d
   id variable value
1   1      age    19
2   2      age    34
3   3      age    40
4   4      age    20
5   1   height    89
6   2   height    65
7   3   height    74
8   4   height    65
9   1      sex     M
10  2      sex     F
11  3      sex     M
12  4      sex     M
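The choice of id variables also decides what ends up in value. As a sketch, assuming a is built as below (the article does not show its construction) and reshape2 is installed, keeping both id and sex as id variables leaves only the numeric columns to be melted, so value stays numeric. The name d2 is introduced here for illustration.

```r
library(reshape2)

# Assumed construction of the wide data frame shown earlier
a <- data.frame(
  id     = 1:4,
  age    = c(19, 34, 40, 20),
  height = c(89, 65, 74, 65),
  sex    = c("M", "F", "M", "M"),
  stringsAsFactors = FALSE
)

# id and sex identify each row; age and height are stacked into value
d2 <- melt(a, id.vars = c("id", "sex"))

print(d2)
print(is.numeric(d2$value))
```

This gives eight rows instead of twelve, because sex travels along as a label rather than being stacked as a measurement.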
cast data frame back to wide format?
Well, that is also easy. reshape2 has two commands for it, depending on the wide format you want. If you want a wide matrix, vector, or array, use acast:
acast(d, id ~ ...)
  age  height sex
1 "19" "89"   "M"
2 "34" "65"   "F"
3 "40" "74"   "M"
4 "20" "65"   "M"
But if you want the result to be a data frame, use dcast:
dcast(d, id ~ ...)
  id age height sex
1  1  19     89   M
2  2  34     65   F
3  3  40     74   M
4  4  20     65   M
Reshaping data is straightforward once you know what the arguments of these functions mean. The first argument is the data in long format that you want to convert to wide. The second argument is a formula: the variables to the left of ~ identify the rows, and the variables to the right of ~ supply the new column names. In our case the left side is id, and ... stands for the one remaining variable, variable, while the value column fills in the cells. Formulas also have two special tokens: . (dot) means no variable, and ... means all remaining variables.
I hope that converting data between wide and long format is now clear to you. Try it.
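To see the formula at work, the sketch below spells out the right-hand side instead of relying on the ... shorthand. It assumes reshape2 is installed and rebuilds a from the wide table, since its construction is not shown in the article. The names long and wide are introduced here for illustration.

```r
library(reshape2)

# Assumed construction of the wide data frame shown earlier
a <- data.frame(
  id     = 1:4,
  age    = c(19, 34, 40, 20),
  height = c(89, 65, 74, 65),
  sex    = c("M", "F", "M", "M"),
  stringsAsFactors = FALSE
)

# Wide to long, with id as the only id variable
long <- melt(a, id.vars = "id")

# Long to wide: id identifies rows, the levels of variable become columns,
# and the cells are filled from the value column
wide <- dcast(long, id ~ variable, value.var = "value")

print(wide)
```

Because value had to hold both numbers and letters in the long format, the columns of wide come back as text. Convert them with as.numeric where needed.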