How to reshape data in R? Quick Reference

In data analysis, preparing the data is usually an important early step. It involves casting the data into the right format for later use in downstream steps. Reshaping data into the proper format in R is easier said than done. This reference explains how to reshape vectors, matrices and data frames in R.

reshape numeric vectors

The simplest way to reshape a vector is to use the matrix command. Let’s see an example:

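    a <- seq(1, 50)
    # 5 x 10 matrix, filled column by column (the default)
    b <- matrix(a, 5, 10)
    # same shape, filled row by row; this overwrites the previous b
    b <- matrix(a, 5, 10, byrow = TRUE)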

It is important to notice the byrow option: it tells matrix to fill the matrix row by row (horizontally) instead of column by column (vertically), which is the default.
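A smaller vector makes the two fill orders easier to see. This is only an illustrative sketch; the names v, by_col and by_row are made up for the example.

```r
# Six values are enough to see the difference between the two fill orders.
v <- 1:6

by_col <- matrix(v, nrow = 2, ncol = 3)                # default: column by column
by_row <- matrix(v, nrow = 2, ncol = 3, byrow = TRUE)  # row by row

print(by_col)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

print(by_row)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
```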

reshape matrix back to a vector

Suppose the last variable b is still in our workspace. We can convert it back to a vector by using:

    as.vector(t(b))

Notice that we had to transpose the matrix b (with t(b)) because b was filled by row, while as.vector reads a matrix column by column.
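A quick way to convince yourself that the transpose is needed is to compare both results with the original vector. This is a small check, with flat_right and flat_wrong as made-up names:

```r
a <- seq(1, 50)
b <- matrix(a, 5, 10, byrow = TRUE)

flat_right <- as.vector(t(b))  # transpose first, then read column by column
flat_wrong <- as.vector(b)     # reads the by-row matrix column by column

identical(flat_right, a)  # TRUE: the original order is recovered
identical(flat_wrong, a)  # FALSE: the values come out interleaved
head(flat_wrong)          # 1 11 21 31 41 2
```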

reshape data frames in R

This is where it becomes interesting… data frames are objects where each row corresponds to one observation, characterised by several different properties (one per column).

So what does it mean to reshape a data frame? The main consideration here is to understand that each row represents a unique observation. This is called the “Wide” format. But there is also another representation of the same data that is useful. In this representation we want each row to represent a single measurement, even if we end up with several rows for the same observation, one per measurement. This is called the “Long” format.

See the two examples below of the two formats:

Data Frame in Wide format

      id age height sex
    1  1  19     89   M
    2  2  34     65   F
    3  3  40     74   M
    4  4  20     65   M
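To follow along, the wide data frame above can be built by hand. The reshape and melt calls further down refer to it as a, so the sketch below reuses that name (replacing the vector a from the matrix example). Keeping sex as plain character rather than a factor is an assumption made here.

```r
# Wide format: one row per observation, one column per property.
a <- data.frame(
  id     = 1:4,
  age    = c(19, 34, 40, 20),
  height = c(89, 65, 74, 65),
  sex    = c("M", "F", "M", "M"),
  stringsAsFactors = FALSE  # assumption: keep sex as character, not factor
)

str(a)  # inspect the column types
```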

Data Frame in Long format

             id   subj score
    1.age     1    age    19
    2.age     2    age    34
    3.age     3    age    40
    4.age     4    age    20
    1.height  1 height    89
    2.height  2 height    65
    3.height  3 height    74
    4.height  4 height    65
    1.sex     1    sex     M
    2.sex     2    sex     F
    3.sex     3    sex     M
    4.sex     4    sex     M

As you can see, each variable is now unfolded and you have two new variables called subj and score. Assuming the wide data frame above is stored in a, this can be obtained by using the reshape command as follows:

    d <- reshape(a, varying = c('age', 'height', 'sex'),
                 timevar = 'subj', v.names = 'score',
                 times = c('age', 'height', 'sex'), direction = 'long')

Complicated? Are your eyes crossing? Yes, mine too. So what is the easiest way to reshape a data frame?

The easy way to reshape a data frame is to use melt and the cast functions (acast and dcast) from the reshape2 package. (There is also an older package called reshape; use the newer reshape2.)

    install.packages('reshape2')
    library(reshape2)
    melt(a)
       sex variable value
    1    M       id     1
    2    F       id     2
    3    M       id     3
    4    M       id     4
    5    M      age    19
    6    F      age    34
    7    M      age    40
    8    M      age    20
    9    M   height    89
    10   F   height    65
    11   M   height    74
    12   M   height    65

In this case the simple command melt(a) tried to guess the id variables. By default it treats factor and character columns as id variables and everything else as measurements, so it chose sex. We can explicitly tell melt which is the id variable:

    d <- melt(a, id.vars = 'id')
    d
       id variable value
    1   1      age    19
    2   2      age    34
    3   3      age    40
    4   4      age    20
    5   1   height    89
    6   2   height    65
    7   3   height    74
    8   4   height    65
    9   1      sex     M
    10  2      sex     F
    11  3      sex     M
    12  4      sex     M

EASY!
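melt can also name the two new columns directly through its variable.name and value.name arguments, which reproduces the subj and score columns of the long format shown earlier. The sketch below assumes reshape2 is installed and rebuilds the example data frame with sex as character; d_long and d_num are made-up names.

```r
library(reshape2)  # assumes reshape2 is already installed

a <- data.frame(
  id     = 1:4,
  age    = c(19, 34, 40, 20),
  height = c(89, 65, 74, 65),
  sex    = c("M", "F", "M", "M"),
  stringsAsFactors = FALSE
)

# Same melt as before, but with explicit names for the two new columns.
d_long <- melt(a, id.vars = "id", variable.name = "subj", value.name = "score")
names(d_long)  # "id" "subj" "score"

# Keeping sex as a second id variable leaves only numeric measurements,
# so the value column stays numeric instead of being coerced to character.
d_num <- melt(a, id.vars = c("id", "sex"))
is.numeric(d_num$value)  # TRUE
```

Mixing numeric and character measurements in one value column forces everything to character, which is why the second form is often more convenient for later calculations.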

cast data frame back to wide format?

Well, that is also easy. reshape2 has two commands for it, depending on the final wide format you want. If you want a wide matrix/vector/array, use acast:

    acast(d, id ~ ...)
      age  height sex
    1 "19" "89"   "M"
    2 "34" "65"   "F"
    3 "40" "74"   "M"
    4 "20" "65"   "M"

But if you want the result as a data frame, use dcast:

    dcast(d, id ~ ...)
      id age height sex
    1  1  19     89   M
    2  2  34     65   F
    3  3  40     74   M
    4  4  20     65   M
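The difference between the two functions is easiest to see by checking the type of what they return. This is a small sketch assuming reshape2 is installed; it passes value.var explicitly so that the casts do not have to guess the value column, and m and df are made-up names.

```r
library(reshape2)  # assumes reshape2 is already installed

a <- data.frame(
  id     = 1:4,
  age    = c(19, 34, 40, 20),
  height = c(89, 65, 74, 65),
  sex    = c("M", "F", "M", "M"),
  stringsAsFactors = FALSE
)
d <- melt(a, id.vars = "id")

m  <- acast(d, id ~ variable, value.var = "value")  # matrix, ids become row names
df <- dcast(d, id ~ variable, value.var = "value")  # data frame, id stays a column

is.matrix(m)       # TRUE
is.data.frame(df)  # TRUE
rownames(m)        # "1" "2" "3" "4"
```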

Reshaping data this way is straightforward, but you need to know what the arguments of the functions mean. The first argument is the data in the long format that you want to convert to wide. The second argument is a formula: the variables on the left of ~ identify the rows of the result, and the variables on the right become its columns. In our case id is on the left, ... expands to the one remaining variable (variable), and the value column fills the cells. Formulas also have two special sequences: . (dot) means no variable and ... (three dots) means all remaining variables. Experiment with it to get different results.
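As a starting point for experimenting, the sketch below casts the same long data with three formulas. It assumes reshape2 is installed; wide_named, wide_dots and flipped are made-up names.

```r
library(reshape2)  # assumes reshape2 is already installed

a <- data.frame(
  id     = 1:4,
  age    = c(19, 34, 40, 20),
  height = c(89, 65, 74, 65),
  sex    = c("M", "F", "M", "M"),
  stringsAsFactors = FALSE
)
d <- melt(a, id.vars = "id")

# Naming the column variable explicitly...
wide_named <- dcast(d, id ~ variable, value.var = "value")
# ...gives the same layout as letting ... pick up the remaining variable.
wide_dots  <- dcast(d, id ~ ..., value.var = "value")

# Swapping the two sides flips the table: one row per variable, one column per id.
flipped <- dcast(d, variable ~ id, value.var = "value")

names(wide_named)  # "id" "age" "height" "sex"
dim(flipped)       # 3 rows, 5 columns (variable plus one column per id)
```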