How to reshape data in R? Quick Reference

Reshaping data, or casting it into the proper format, is easier said than done in R, and it is easy to forget how to do it properly. This quick reference explains how to reshape data in R.

reshape numeric vectors

The best way to reshape a vector is to use the matrix command. Let’s see an example:

    a <- seq(1, 50)
    b <- matrix(a, 5, 10)                # filled column by column (the default)
    b <- matrix(a, 5, 10, byrow = TRUE)  # filled row by row

It is important to notice the byrow option: it tells matrix to fill the matrix row by row (horizontally) instead of column by column (vertically), which is the default.
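To make the difference concrete, here is a minimal sketch on a six-element vector. The names v, by_col and by_row are made up for this illustration.

```r
v <- 1:6

# Default: the values run down the columns
by_col <- matrix(v, nrow = 2, ncol = 3)

# byrow = TRUE: the values run along the rows
by_row <- matrix(v, nrow = 2, ncol = 3, byrow = TRUE)

by_col[1, ]  # first row is 1 3 5
by_row[1, ]  # first row is 1 2 3
```

Both matrices hold the same six values; only the order in which the cells are filled changes.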

reshape matrix back to a vector

Let’s imagine that we still have the last variable b in our workspace. We can convert it back to a vector by using:

    as.vector(t(b))

Notice that we had to transpose the matrix b (with t(b)) because b was filled by row, while as.vector reads a matrix column by column.
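A quick way to convince yourself is to check the round trip. This sketch rebuilds a and b as above and compares the results with and without the transpose.

```r
a <- seq(1, 50)
b <- matrix(a, 5, 10, byrow = TRUE)

# With the transpose we get the original vector back
identical(as.vector(t(b)), a)  # TRUE

# Without it the values come out column by column
identical(as.vector(b), a)     # FALSE
head(as.vector(b))             # 1 11 21 31 41 2
```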

reshape data frames in R

This is where it becomes interesting… data frames are objects where each line corresponds to one observation characterised by several different properties (each column entry).

So what does it mean to reshape a data frame? The main thing to understand is that each row represents a unique observation. This is called the “Wide” format. But there is also another representation of the same data that is useful. In this representation we want each row to represent a single measurement, even if we end up with several rows for the same observation, one per measurement. This is called the “Long” format.

See the two examples below of the two formats:

Data Frame in Wide format

      id age height sex
    1  1  19     89   M
    2  2  34     65   F
    3  3  40     74   M
    4  4  20     65   M
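The commands in the rest of this section expect this wide data frame to live in a variable called a. The original listing does not show how it was built, so the following is only one way to do it, typed in by hand from the table above. Note that it overwrites the vector a from the previous section.

```r
# Wide data frame matching the table above
a <- data.frame(
  id     = 1:4,
  age    = c(19, 34, 40, 20),
  height = c(89, 65, 74, 65),
  sex    = c("M", "F", "M", "M")
)
a
```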

Data Frame in Long format

             id   subj score
    1.age     1    age    19
    2.age     2    age    34
    3.age     3    age    40
    4.age     4    age    20
    1.height  1 height    89
    2.height  2 height    65
    3.height  3 height    74
    4.height  4 height    65
    1.sex     1    sex     M
    2.sex     2    sex     F
    3.sex     3    sex     M
    4.sex     4    sex     M

As you can see each variable is now unfolded and basically you have two new variables called subj and score. This can be obtained by using the reshape command as follows:

    d  <- reshape(a, varying = c('age','height','sex'), 
              timevar = "subj", v.names="score",
              times = c('age','height','sex'), direction = 'long')

Complicated? Your eyes twisted? Yes, mine too. So what is the easiest way to reshape a data frame?

The easy way to reshape a data frame is to use melt and cast from the reshape2 package (There’s a reshape package that is an older version. Use the new reshape2 version.)

    install.packages('reshape2')
    library(reshape2)
    melt(a)
       sex variable value
    1    M       id     1
    2    F       id     2
    3    M       id     3
    4    M       id     4
    5    M      age    19
    6    F      age    34
    7    M      age    40
    8    M      age    20
    9    M   height    89
    10   F   height    65
    11   M   height    74
    12   M   height    65

In this case the simple command melt(a) had to guess the id variables. It treats the non-numeric columns (such as factors and character strings) as id variables, so it chose sex. But we can explicitly tell melt which is the id variable:

    d <- melt(a, id.vars = 'id')
    d
       id variable value
    1   1      age    19
    2   2      age    34
    3   3      age    40
    4   4      age    20
    5   1   height    89
    6   2   height    65
    7   3   height    74
    8   4   height    65
    9   1      sex     M
    10  2      sex     F
    11  3      sex     M
    12  4      sex     M

EASY!

cast data frame back to wide format?

Well, that is also easy. reshape2 has two commands for it, depending on the final Wide format you want. If you want a Wide matrix/vector/array you use acast:

    acast(d, id ~  ...)
      age  height sex
    1 "19" "89"   "M"
    2 "34" "65"   "F"
    3 "40" "74"   "M"
    4 "20" "65"   "M"

But if you want to cast the data frame into a data frame you use dcast:

    dcast(d, id ~  ...)
      id age height sex
    1  1  19     89   M
    2  2  34     65   F
    3  3  40     74   M
    4  4  20     65   M

Reshaping data is very straightforward, but you need to know what the arguments of the functions mean. The first argument is the data in the long format that you want to convert to wide. The second argument is a formula: the variables on the left-hand side define the rows of the result and the variables on the right-hand side define its columns. In our case id is on the left and ... stands for the remaining variable, which is variable; the cells are filled from the value column. Formulas also have special sequences: . (dot) means no variable and ... (three dots) means all remaining variables. Experiment with it to get different results.
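As a starting point for experimenting, here is a small sketch of three formulas. It assumes reshape2 is installed and uses a numeric-only version of the example data (without sex), so that the value column stays numeric. The names wide, flipped and means are made up for the illustration.

```r
library(reshape2)

# Numeric-only version of the example data
a <- data.frame(id = 1:4, age = c(19, 34, 40, 20), height = c(89, 65, 74, 65))
d <- melt(a, id.vars = "id")

# id on the left: one row per id, one column per measured variable
wide <- dcast(d, id ~ variable)

# Swap the sides: one row per measured variable, one column per id
flipped <- dcast(d, variable ~ id)

# "." means no variable on that side, so an aggregation function is
# needed to collapse the four values of each variable into one
means <- dcast(d, variable ~ ., fun.aggregate = mean)
means
```

The first formula gives back the wide data frame, the second its transpose, and the third one mean per variable.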

R Tip: define ggplot axis labels

Formatting the text and labels of a ggplot2 axis is easy. A common task when producing plots for publication is to replace the default labels. Default axis labels tend to reflect the names of the variables used, and these are not always the most descriptive labels, at least not when you are publishing the plots in a scientific journal. So let’s break down some ways to personalise ggplot plot axes.


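For this formatting example I’ll use the movies dataset. It used to ship with ggplot2 and now lives in the ggplot2movies package. The first thing we need to do is load the ggplot2 library and then the movies dataset:

    library(ggplot2)
    library(ggplot2movies)  # the movies dataset moved here from ggplot2
    data(movies)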

The default ggplot axis labels

By default ggplot sets the axis labels directly from the aesthetics selected, e.g.:

    p0 <- ggplot(data=movies, aes(x=year))
    p0 <- p0 + geom_point(aes(y=rating)) + geom_smooth(aes(y=rating))
    p0

[Figure: the plot with its default x and y axis labels]

To change the axis labels we can use xlab and ylab, which set the x and y axis labels directly.

    p0 + xlab('The glorious years of the movies') + ylab('The public ratings')

Setting axes labels in ggplot with scales

    p0 +
      scale_x_continuous('The glorious years of the movies (with scales)') +
      scale_y_continuous('The public ratings (with scales)')

Also worth investigating is the labs function, which allows you to change the axes labels and the title, e.g.:

    p0 + labs(
      x='The glorious years of the movies (with labs)',
      y='The public ratings (with labs)'
      )
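The labs example above only changes the axes. As a sketch of the title part, here is a self-contained variant. It uses the mtcars dataset that ships with R instead of movies, purely so that it runs without extra packages; the label texts are made up.

```r
library(ggplot2)

# mtcars ships with base R; it stands in for the movies data here
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title = "Fuel economy against weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )
p
```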

Formatting labels text for size and rotation?

ggplot can change the orientation, size and colour of axis labels. To rotate the labels you just add the angle property. To change the size you use size, and for colour you use color (ggplot2 accepts both the color and colour spellings). Finally, note that you can use the face property to make the font bold or italic.

    p0 + xlab('The Years of Cinema') +
      ylab('Public Ratings') +
      theme(
        axis.text.x=element_text(angle=90, size=8),
        axis.title.x=element_text(angle=10, color='red'),
        axis.title.y=element_text(angle=80, color='blue', face='bold', size=14)
        )

The formatting of the text in the labels is a bit counterintuitive because it uses a slightly different nomenclature. The formatting is done with the theme function, by defining an element_text with the wanted format. In the example above axis.text.x defines the format of the tick labels and the axis.title.? entries define the format of the axis labels.

A good way to learn all the elements that a ggplot theme can format is the help page, which you open by entering ?theme. These examples just scratch the surface of what you can do, but I hope they get you started in formatting text size and orientation inside ggplot plots.

Side Note: Did you notice how crappy the movies from the 70s, 80s and 90s were?

R: Extracting the code from a Sweave document

Producing reproducible documents is very easy with Sweave in R. But sometimes I want to extract only the R code into a separate file, without all the LaTeX boilerplate (even though it is by far the best text editing environment in the world). To do so, you can use the following command to produce a file big_sweave_code.R with all the code chunks that exist in the Sweave file.

    Stangle("big_sweave_doc.Rnw", output="big_sweave_code.R")
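To try this without a real Sweave document at hand, the sketch below writes a tiny one to a temporary folder and extracts its code. The file names and the document content are made up for the illustration.

```r
# Write a tiny Sweave document to a temporary folder (made-up content)
rnw_file <- file.path(tempdir(), "tiny_sweave_doc.Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "Some text.",
  "<<sum-chunk>>=",
  "x <- 1 + 1",
  "@",
  "\\end{document}"
), rnw_file)

# Extract only the R code into a separate file
r_file <- file.path(tempdir(), "tiny_sweave_code.R")
Stangle(rnw_file, output = r_file)

# Show the extracted code
cat(readLines(r_file), sep = "\n")
```

The extracted file contains the chunk code plus a few comment lines that mark where each chunk came from.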

Social Network Analysis in R and some housekeeping

The field of Social Network Analysis is increasingly topical, in science and beyond. In 2010 I taught a course on Software for Social Network Analysis at a Winter School, in which I gave an introduction to using R for network analysis. R is not only useful for social network analysis: it also serves to produce documents with plots in an automatic and reproducible way, to run all kinds of statistical analysis, to manipulate big data quickly, and so on. In truth R is a real workhorse that lends itself to several stages of data manipulation and analysis.

In the field of Social Network Analysis (SNA), R has some packages that deserve a look. One of them is the igraph package, which has many of the features needed to study networks, from generating graphs according to given models to analysing their properties and detecting communities. The igraph website itself has an online book about igraph that can help anyone starting out with the package. Anyone studying SNA for the first time can also look at Hanneman’s tutorials, although in some cases they do not use R but other software such as Ucinet or Pajek.

For those just starting out with R, however, there are other tutorials and presentations that will help you get into the language. If you need an introduction in Portuguese, see these PDFs produced at IST here and here.

Finding communities in networks with R and igraph


Finding communities in networks is a common task under the paradigm of complex systems. Doing it in R is easy. There are several ways to do community partitioning of graphs using very different packages. I’m going to use igraph to illustrate how communities can be extracted from given networks.

igraph is a lovely library to work with graphs. 95% of what you’ll ever need is available in igraph. It has the advantage that the libraries are written in C and are fast as hell.

algorithms for community detection in networks

walktrap.community

This algorithm finds densely connected subgraphs by performing random walks. The idea is that random walks will tend to stay inside communities instead of jumping to other communities.

Pascal Pons, Matthieu Latapy: Computing communities in large networks using random walks, http://arxiv.org/abs/physics/0512106

edge.betweenness.community

This algorithm is the Girvan-Newman algorithm. It is a divisive algorithm where at each step the edge with the highest betweenness is removed from the graph. For each division you can compute the modularity of the graph. At the end, choose to cut the dendrogram where the process gives you the highest value of modularity.

M Newman and M Girvan: Finding and evaluating community structure in networks, Physical Review E 69, 026113 (2004)

fastgreedy.community

This algorithm is the Clauset-Newman-Moore algorithm. In this case the algorithm is agglomerative. At each step two groups merge. The merging is decided by optimising modularity. This is a fast algorithm, but it has the disadvantage of being a greedy algorithm. Thus, it might not produce the best overall community partitioning, although I find it useful and accurate.

A Clauset, MEJ Newman, C Moore: Finding community structure in very large networks, http://www.arxiv.org/abs/cond-mat/0408187

spinglass.community

This algorithm uses a spin-glass model and simulated annealing to find the communities inside a network.

J. Reichardt and S. Bornholdt: Statistical Mechanics of Community Detection, Phys. Rev. E, 74, 016110 (2006), http://arxiv.org/abs/cond-mat/0603718

M. E. J. Newman and M. Girvan: Finding and evaluating community structure in networks, Phys. Rev. E 69, 026113 (2004)

An example:

    # First we load the igraph package
    library(igraph)

    # let's generate two networks and merge them into one graph.
    g2 <- barabasi.game(50, power=2, directed=FALSE)
    g1 <- watts.strogatz.game(1, size=100, nei=5, p=0.05)
    g <- graph.union(g1, g2)

    # let's remove multi-edges and loops
    g <- simplify(g)

    # let's see if we have communities here using the
    # Girvan-Newman algorithm
    # 1st we calculate the edge betweenness, merges, etc...
    ebc <- edge.betweenness.community(g, directed=FALSE)

    # Now we have the merges/splits and we need to calculate the modularity
    # for each merge. For this we'll use a function that, for each edge
    # removed, will create a second graph, check its membership and use
    # that membership to calculate the modularity
    mods <- sapply(0:ecount(g), function(i){
      g2 <- delete.edges(g, ebc$removed.edges[seq_len(i)])
      cl <- clusters(g2)$membership
      # March 13, 2014 - compute modularity on the original graph g
      # (Thank you to Augustin Luna for detecting this typo) and not on the induced one g2.
      modularity(g, cl)
    })

    # we can now plot all modularities
    plot(mods, pch=20)

    # Now, let's color the nodes according to their membership
    g2 <- delete.edges(g, ebc$removed.edges[seq_len(which.max(mods)-1)])
    V(g)$color <- clusters(g2)$membership

    # Let's choose a layout for the graph
    g$layout <- layout.fruchterman.reingold

    # plot it
    plot(g, vertex.label=NA)

    # if we wanted to use the fastgreedy.community algorithm we would do
    fc <- fastgreedy.community(g)
    # membership() returns the partition with the highest modularity
    V(g)$color <- as.integer(membership(fc))
    g$layout <- layout.fruchterman.reingold
    plot(g, vertex.label=NA)
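For a shorter comparison of two of the algorithms described above, here is a sketch on Zachary's karate club network, which is built into igraph. It assumes a recent igraph release, where the functions go by the newer names cluster_walktrap and cluster_fast_greedy instead of walktrap.community and fastgreedy.community. The variable names are made up.

```r
library(igraph)

# Zachary's karate club: a small, well-known social network (34 members)
karate <- make_graph("Zachary")

wc <- cluster_walktrap(karate)     # random walks
fc <- cluster_fast_greedy(karate)  # greedy modularity optimisation

# Compare the partitions: number of communities and modularity
data.frame(
  algorithm   = c("walktrap", "fast greedy"),
  communities = c(length(wc), length(fc)),
  modularity  = c(modularity(wc), modularity(fc))
)

# Colour the nodes by their walktrap community
plot(karate, vertex.color = as.integer(membership(wc)), vertex.label = NA)
```

Different algorithms may split the same network differently, so comparing the modularity values is a reasonable first sanity check.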

try it!

R in the Top 20 of Programming Languages

Programming languages come and go, but it’s nice to see what’s gaining momentum and what’s not. In the latest TIOBE report for January 2012 we can see some interesting surprises in the top 20 chart of programming languages. C is still in high demand and closing in on Java. Together they account for 1/3 of the programming languages panorama.

Another interesting aspect is the fading of Python. Python lost half of its market share. Maybe this is because of Python 3 and its incompatibilities with Python 2.x, which might have sent many programmers in search of other solutions. Another problem might be the GIL, which hinders threaded programming in Python at a time when programming is moving towards concurrent and distributed programming.

Also interesting is the rise of R. R is one of my favorite languages for science. It makes reproducibility of research results very easy (especially if you use Sweave with R) and for any kind of statistical analysis it is almost perfect. It also produces great plots for scientific publications.

How to plot multiple data series in R?


I usually use ggplot2 to plot multiple data series, but if I don’t use ggplot2, there are TWO simple ways to plot multiple data series in R. I’ll go over both today.

Matlab users can easily plot multiple data series in the same figure. They use hold on and plot the data series as usual. Every data series goes into the same plot until they use hold off.

But can the same thing be done in R? R is getting big as a programming language so plotting multiple data series in R should be trivial.

The R points and lines way

Solution 1: just plot one data series and then use the points or lines commands to plot the other data series in the same figure, creating the multiple data series plot:

    plot(time, series1, type='l', xlab='t /s', ylab='s1')
    points(time, series2, type='l')
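The snippet above assumes that time, series1 and series2 already exist. Here is a self-contained sketch with made-up data, using lines for the second series and a legend to tell the two apart.

```r
# Synthetic data, made up for illustration
time    <- seq(0, 20, by = 0.1)
series1 <- 0.5 + 0.4 * sin(time)
series2 <- 0.5 + 0.3 * cos(time)

# The first call creates the figure and fixes the axis ranges
plot(time, series1, type = "l", xlab = "t /s", ylab = "s1, s2")

# lines() adds to the existing figure instead of starting a new one
lines(time, series2, lty = 2)
legend("topright", legend = c("series1", "series2"), lty = c(1, 2))
```

Keep in mind that the axis ranges come from the first series only, so values of the second series that fall outside them are not shown.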

Plot Multiple Data Series the Matlab way

Solution 2: this one mimics Matlab’s hold on/off behaviour. It uses the new graphical parameter, which is set with par. Let’s see how:

Setting new to TRUE tells R NOT to clean the previous frame before drawing the new one. It’s a bit counterintuitive, but R is saying “Hey, there’s a new plot for the same figure, so don’t erase whatever is there before plotting the new data series”.

Example (plot series2 on the same plot as series1):

    plot(time, series1, type='l', xlim=c(0.0,20.0),
         ylim=c(0.0,1.0), xlab='t /s', ylab='s1')
    par(new=TRUE)
    plot(time, series2, type='l', xlim=c(0.0,20.0),
         ylim=c(0.0,1.0), xlab='', ylab='', axes=FALSE)
    par(new=FALSE)

Calling par with new set to TRUE tells R to make the second plot without cleaning the first. Two things to consider though: in the second plot set axes to FALSE, and xlab and ylab to empty strings, or in the final result you’ll see the labels and axes overlapping and bleeding into each other.

Finally, because of all this superimposing you need to know your axes ranges and set them up equally in all plot commands (xlim and ylim in this example are set to the ranges [0,20] and [0,1]).

R doesn’t adjust the axes automatically, because it uses neither the first frame nor the other data series as a reference. You need to supply these values or you’ll end up with a wrong-looking plot, like Marge Simpson’s hair.
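If you would rather not hard-code the ranges, one option is to compute them from the data with range and pass the same limits to every plot call. This is a sketch with made-up data; the variable names are assumptions.

```r
# Synthetic data, made up for illustration
time    <- seq(0, 20, by = 0.1)
series1 <- exp(-time / 5)
series2 <- 1 - exp(-time / 5)

# Common limits that cover both series
xlim <- range(time)
ylim <- range(series1, series2)

plot(time, series1, type = "l", xlim = xlim, ylim = ylim,
     xlab = "t /s", ylab = "s1, s2")
par(new = TRUE)
plot(time, series2, type = "l", lty = 2, xlim = xlim, ylim = ylim,
     xlab = "", ylab = "", axes = FALSE)
par(new = FALSE)
```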

In conclusion, either solution will work to plot multiple data series inside R, but sometimes one will be better than the other. Sometimes your data series represent different properties and you’ll need to specify the y ranges individually. In this case the latter option might be useful. Other times you just want a quick exploratory data analysis plot, or your data series are measuring the same property and the former method suffices.
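As a sketch of the first case, where the two series measure different properties, the second plot below gets its own y range and its own axis on the right-hand side. The data and the names temperature and pressure are made up for the illustration.

```r
# Two made-up series on very different scales
time        <- seq(0, 20, by = 0.5)
temperature <- 20 + 5 * sin(time / 3)
pressure    <- 1000 + 30 * cos(time / 3)

par(mar = c(5, 4, 4, 4) + 0.1)  # leave room for the right-hand axis
plot(time, temperature, type = "l", xlab = "t /s", ylab = "temperature")

par(new = TRUE)
# No ylim here: the second series uses its own y range
plot(time, pressure, type = "l", lty = 2, axes = FALSE, xlab = "", ylab = "")
axis(side = 4)                         # y axis for the second series
mtext("pressure", side = 4, line = 3)  # label for the right-hand axis
par(new = FALSE)
```

Both plots share the same time values, so the x axes line up without setting xlim.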