5 essential tricks for R users

R is very powerful and is becoming the language of data scientists. But some things require a bit of learning and are not obvious to the R newcomer. Here are five useful tips if you are just starting out:

  1. Sometimes data is not in the correct format and you need to reshape data to use it in R. Instead of using external software you can do it directly in R.
  2. Plotting multiple series in the same figure. You can do this with the ggplot2 library, which produces better-looking graphics.
  3. If you do Network Analysis, you’ll need to partition the graph into communities. Finding communities in R is easy with the igraph package.
  4. Still playing with graphs, you can colour different nodes according to some data property. Check how to colour graph nodes in R.
  5. ggplot2 lets you accept most of the defaults and still get great plots, but sometimes you might want to customise them further. Check how to customise ggplot2 axes labels.
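
As a taste of tips 3 and 4, here is a minimal sketch that finds communities and then colours the nodes by community. It assumes the igraph package is installed; the Zachary karate club graph and the fast greedy algorithm are only illustrative choices, not necessarily the ones used in the linked articles.

```r
library(igraph)

# Illustrative graph: Zachary's karate club, bundled with igraph
g <- make_graph("Zachary")

# Partition the graph into communities (fast greedy modularity optimisation)
comm <- cluster_fast_greedy(g)
memb <- as.integer(membership(comm))

# One colour per community, assigned to each node through its membership
community_colours <- rainbow(max(memb))
V(g)$color <- community_colours[memb]

plot(g, vertex.label = NA, vertex.size = 8)
```

Any other node attribute can drive the colour in the same way, as long as you map it to a vector of colours with one entry per node.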

Extra: If you use Sweave to automate your reports with live data in R, you might sometimes want to extract the R snippets to a new R file. Instead of copying and pasting, try this:

Stangle("big_sweave_doc.Rnw", output="big_sweave_code.R")

R Tip: define ggplot axis labels

Formatting text and labels on ggplot2 axes is easy. A common task when producing plots for publication is to replace the default labels. Default axis labels tend to reflect the names of the variables used, and these are not always the most descriptive labels, at least not when you are publishing the plots in a scientific journal. So let’s break down some ways to personalise ggplot axes.

For this formatting example I’ll use the movies dataset. The first thing we need to do is load the ggplot2 library and then the movies dataset.
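
The loading code might look like the sketch below. One assumption to flag: current ggplot2 releases no longer bundle the movies data, which now lives in the separate ggplot2movies package, so that package needs to be installed too.

```r
library(ggplot2)
library(ggplot2movies)  # assumption: provides the movies dataset in recent ggplot2 versions

data(movies)
# Quick look at the year and rating columns
head(movies[, c("year", "rating")])
```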


The default ggplot axis labels

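By default, ggplot takes the axis labels directly from the aesthetics you select, e.g.:

# y = rating and geom_point() are assumed here so the plot has a y axis to label
p0 <- ggplot(data = movies, aes(x = year, y = rating)) + geom_point()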


To change the axis labels we can use xlab and ylab, which set the x and y axis labels directly.

p0+xlab('The glorious years of the movies')+ylab('The public ratings')

Setting axes labels in ggplot with scales

p0 +
  scale_x_continuous('The glorious years of the movies (with scales)') +
  scale_y_continuous('The public ratings (with scales)')

Also worth investigating is the labs function, which lets you change the axis labels and the title, e.g.:

p0 + labs(
  x='The glorious years of the movies (with labs)',
  y='The public ratings (with labs)'
)
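
Since labs can also set the title, here is a small self-contained sketch of that. It uses the built-in mtcars data rather than movies, and the label strings are made up for illustration.

```r
library(ggplot2)

# Scatter plot with title and both axis labels set in a single labs() call
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title = "Fuel economy versus weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )
print(p)
```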

Formatting labels text for size and rotation?

ggplot can change the orientation, size and colour of axis labels. To rotate the text you just set the angle property. To change the size you use size, and for colour you use color (ggplot2 accepts both the US spelling color and the British colour). Finally, note that you can use the face property to make the font bold or italic.

p0 + xlab('The Years of Cinema')+
  ylab('Public Ratings')+
  theme(
    axis.text.x=element_text(angle=90, size=8),
    axis.title.x=element_text(angle=10, color='red'),
    axis.title.y=element_text(angle=80, color='blue', face='bold', size=14)
  )

The formatting of the text in the labels is a bit counterintuitive because it uses a slightly different nomenclature. It is done with the theme function, by defining element_text objects with the wanted format. In the example above, axis.text.x sets the format of the tick labels, while axis.title.x and axis.title.y set the format of the axis labels.

A good way to learn all the elements that a ggplot theme can format is to open the help page by entering ?theme. These examples are just scratching the surface of what you can do, but I hope they get you started with formatting text size and orientation in ggplot plots.
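
Besides reading the help page, you can also inspect what a theme currently sets for a given element. A minimal sketch, using ggplot2's calc_element on the default grey theme (the two elements inspected here are just examples):

```r
library(ggplot2)

# Resolve the fully inherited settings for the x axis title and the x tick labels
title_x <- calc_element("axis.title.x", theme_grey())
ticks_x <- calc_element("axis.text.x", theme_grey())

print(title_x)
print(ticks_x)
```

Comparing the printed values before and after adding your own theme settings is a quick way to check that an override landed on the element you intended.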

Side Note: Did you notice how crappy the movies from the 70s, 80s and 90s were?

How to plot multiple data series in ggplot for quality graphs?


I've already shown how to plot multiple data series in R with a traditional plot by using the par(new=T), par(new=F) trick. Now I'll show how to do it within ggplot2.

First let's generate two data series, y1 and y2, and plot them with the traditional plot and points functions:

x <- seq(0, 4 * pi, 0.1)
n <- length(x)
y1 <- 0.5 * runif(n) + sin(x)
y2 <- 0.5 * runif(n) + cos(x) - sin(x)
plot(x, y1, col = "blue", pch = 20)
points(x, y2, col = "red", pch = 20)

That is all the R code needed for the plot: just the plot and points functions, one call per data series. It works, but it is not the smartest-looking R code, and ggplot produces better plots.

Plotting with ggplot2

Now, let's try this with ggplot2.

First we need to create a data.frame with our series.

If we have very few series we can just add one geom_point layer per series.

library(ggplot2)
df <- data.frame(x, y1, y2)
# The y and color given in ggplot() are placeholders; each layer overrides them
ggplot(df, aes(x, y = value, color = variable)) + 
    geom_point(aes(y = y1, col = "y1")) + 
    geom_point(aes(y = y2, col = "y2"))

But if we have many series to plot, an alternative is to use melt (from the reshape2 package) to reshape the data.frame, and with this plot an arbitrary number of series. For example:

library(reshape2)
# This creates a new data frame with columns x, variable and value
# x is the id, variable holds the name of each of our time series
df.melted <- melt(df, id = "x")
ggplot(data = df.melted, aes(x = x, y = value, color = variable)) +
    geom_point()

And that's how to plot multiple data series using ggplot. The basic trick is that you need to melt your data into a new data.frame. Remember, in data.frames each row represents an observation.
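
To see what melt actually does, it can help to try it on a tiny data frame. The sketch below assumes the reshape2 package is installed; the data values are made up.

```r
library(reshape2)

# Wide format: one column per series
wide <- data.frame(x = 1:3, y1 = c(10, 20, 30), y2 = c(5, 6, 7))

# Long format: each (x, series) pair becomes one row, so 3 rows x 2 series = 6 rows
long <- melt(wide, id.vars = "x")
print(long)
```

The long version has the columns x, variable and value, which is the shape that the color = variable mapping expects.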


Another option, pointed out to me in the comments by Cosmin Saveanu (Thanks!), is to plot the multiple data series with facets (good for B&W):

ggplot(data = df.melted, aes(x = x, y = value)) +
geom_point() + facet_grid(variable ~ .)