There are two ways of doing statistical analysis: one that I call the Excel approach, and the other that I call the Experts approach.

You use the former if you are a mouse user and like to point and click at things without understanding what is happening. You also forgo control in the process.

And you use the latter for finer control over the outputs. It requires some understanding of the theoretical aspects of the analysis, and a lot more punching of the keyboard.

Introducing JASP

I’ve been experimenting with a statistics package called JASP that aims to be a small statistics program — ‘an alternative to SPSS’. It falls in the realm of Excel users but has the power of the Experts underneath. What? Is THAT POSSIBLE?

JASP has only a few, well… FOUR, categories of analysis — t-tests, ANOVA, regression, and frequencies — and this is fine for most use cases. I find it particularly interesting for EDA (Exploratory Data Analysis), where you just want to look at the data you’re working with.

Everything is visual and done with the mouse, from loading data, to clicking through the options of the analysis you want to perform.

The output is beautiful but limited to HTML, with the images embedded as PNG data URIs. It would be great to be able to export the results for use in academic papers; EPS or PDF would be ideal for LaTeX. This makes JASP a little limiting.

In any case, JASP is still in its infancy and will evolve rapidly. As for exploratory data analysis software, there are alternatives. One of my favourites is Mondrian. Mondrian is old: it started in 1997, the latest stable release is from 2011 and the latest beta is from 2013. Mondrian doesn’t have JASP’s beautiful graphics either. But sometimes simple things are the best, and Mondrian just works.

But isn’t JASP just a GUI for R?

The cool thing about JASP is that all the clicks and clanks are just a way to pass parameters to R functions. The backend is R, so you can do everything the hard way if you really need to improve on the existing functions. For example, sometimes you need to make your plot scales logarithmic. There is no way to do that in JASP, but if you use the underlying R function you solve your problem — and you can export PDFs for paper production.
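As a minimal sketch of what dropping down to R buys you (this is generic base R, not code generated by JASP; the data and file name are made up for illustration):

```r
# Hypothetical data with exponential growth: a natural fit for a log scale
x <- 1:100
y <- exp(x / 10)

# A PDF device gives vector output suitable for LaTeX, which the JASP UI cannot do
pdf("log_scale_plot.pdf", width = 6, height = 4)
plot(x, y, log = "y", type = "l",
     xlab = "x", ylab = "y (log scale)")
dev.off()
```

The `log = "y"` argument to `plot` is the one-liner that JASP has no checkbox for.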

The R scripts that power JASP are available on GitHub. If you want to use them in your own advanced work in R, just go ahead and read them.

5 essential tricks for R users

R is very powerful and is becoming the language of data scientists. But some things require a bit of learning and are not obvious to the R newcomer. Here are five useful tips if you are just starting out:

1. Sometimes data is not in the correct format and you need to reshape it before you can use it in R. Instead of using external software, you can do it directly in R.
2. Plotting multiple series in the same figure. This can be accomplished with R’s ggplot2 library, producing better-looking graphics.
3. If you do Network Analysis, you’ll need to partition the graph into communities. Finding communities in R is easy with the igraph package.
4. Still playing with graphs, you can colour different nodes according to some data property. Check how to colour graph nodes in R.
5. R’s ggplot2 lets you accept most of the defaults and still get great plots, but sometimes you might want to customise them further. Check how to customise ggplot2 axis labels.
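To give a flavour of tip 5, here is a minimal sketch of customising ggplot2 axis labels (the data frame and label text are invented for the example):

```r
library(ggplot2)

# Hypothetical data: labs() renames the axes, theme() restyles the title text
df <- data.frame(x = 1:10, y = (1:10)^2)
p <- ggplot(df, aes(x, y)) +
  geom_point() +
  labs(x = "Time (s)", y = "Distance (m)", title = "Customised labels") +
  theme(axis.title = element_text(size = 14, face = "bold"))
print(p)
```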

Extra: If you use Sweave to automate your reports with live data in R, you might sometimes want to extract the R snippets to a new R file. Instead of copying and pasting, try this:

Stangle("big_sweave_doc.Rnw", output="big_sweave_code.R")

Lorenz Attractor in R.

The Lorenz system is one of the most famous systems of equations in the realm of chaotic systems, first studied by Edward Lorenz. Its double lobe, reminiscent of a butterfly’s wings, captures the imagination of any complex systems enthusiast.

Lorenz Attractor Equations

The three equations that govern its behavior are:

dX/dt = s (Y - X)
dY/dt = X (r - Z) - Y
dZ/dt = X Y - b Z

where usually s = 10, r = 28 and b = 8/3.

So let's define the parameters, the initial state of the system and the equations in the form of a function called Lorenz:

parameters <- c(s = 10, r = 28, b = 8/3)
state <- c(X = 0, Y = 1, Z = 1)

Lorenz <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dX <- s * (Y - X)
    dY <- X * (r - Z) - Y
    dZ <- X * Y - b * Z
    list(c(dX, dY, dZ))
  })
}

We can now start processing this and plotting it.

library(deSolve)

times <- seq(0, 50, by = 0.01)
out <- ode(y = state, times = times, func = Lorenz, parms = parameters)

par(oma = c(0, 0, 3, 0))
plot(out, xlab = "time", ylab = "-")
plot(out[, "Y"], out[, "Z"], pch = ".", type = "l")
mtext(outer = TRUE, side = 3, "Lorenz model", cex = 1.5)

The above example is mostly copied from the deSolve package documentation, which uses the same example. What I'd like to point out is how simple it is to solve and plot a system of differential equations in R. The language is simple and clear, and it is much more practical than implementing everything from scratch.

If you want to play with a couple of Lorenz attractors synchronizing please visit this Java implementation.

Up next: Animating the Lorenz Attractor in R.

How to plot multiple data series in ggplot for quality graphs?

I've already shown how to plot multiple data series in R with a traditional plot by using the par(new=T), par(new=F) trick. Now I'll show how to do it with ggplot2.

First let's generate two data series, y1 and y2, and plot them with the traditional plot and points methods:

x <- seq(0, 4 * pi, 0.1)
n <- length(x)
y1 <- 0.5 * runif(n) + sin(x)
y2 <- 0.5 * runif(n) + cos(x) - sin(x)

plot(x, y1, col = "blue", pch = 20)
points(x, y2, col = "red", pch = 20)

This is exactly the R code that produced the above plot: just simple plot and points calls to draw multiple data series. It is not the smartest-looking R code you could use, though. Better plots can be done in R with ggplot2.

Plotting with ggplot2

Now, let's try this with ggplot2.

First we need to create a data.frame with our series.

If we have very few series we can just plot adding geom_point as needed.

library(ggplot2)

df <- data.frame(x, y1, y2)
ggplot(df, aes(x)) +
  geom_point(aes(y = y1, col = "y1")) +
  geom_point(aes(y = y2, col = "y2"))

But if we have many series to plot, an alternative is to use melt to reshape the data.frame, and with this plot an arbitrary number of series. For example:

library(reshape)

# This creates a new data frame with columns x, variable and value.
# x is the id; variable holds each of our time series' designation.
df.melted <- melt(df, id = "x")

ggplot(data = df.melted, aes(x = x, y = value, color = variable)) +
  geom_point()

And that's how to plot multiple data series using ggplot2. The basic trick is that you need to melt your data into a new data.frame. Remember: in data.frames, each row represents an observation.

Update:

Another option, pointed out to me in the comments by Cosmin Saveanu (thanks!), is to plot the multiple data series with facets (good for B&W):

library(reshape)

ggplot(data = df.melted, aes(x = x, y = value)) +
  geom_point() +
  facet_grid(variable ~ .)

R: Extracting code from a Sweave document

Producing reproducible documents is very easy with Sweave in R. But sometimes I want to extract only the R code into a separate file, without all the LaTeX boilerplate (even though LaTeX is by far the best text-editing environment in the world). To do so, you can use the following command to produce a file big_sweave_code.R containing all the code chunks in the Sweave file.

Stangle("big_sweave_doc.Rnw", output="big_sweave_code.R")

Finding communities in networks with R and igraph

Finding communities in networks is a common task under the paradigm of complex systems. Doing it in R is easy. There are several ways to do community partitioning of graphs using very different packages. I’m going to use igraph to illustrate how communities can be extracted from given networks.

igraph is a lovely library to work with graphs. 95% of what you’ll ever need is available in igraph. It has the advantage that the libraries are written in C and are fast as hell.

Algorithms for community detection in networks

walktrap.community

This algorithm finds densely connected subgraphs by performing random walks. The idea is that random walks will tend to stay inside communities instead of jumping to other communities.

Pascal Pons, Matthieu Latapy: Computing communities in large networks using random walks, http://arxiv.org/abs/physics/0512106
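As a minimal sketch of walktrap in practice (the graph here is invented for illustration; in recent igraph versions the community functions return a communities object, so membership() and modularity() can be called on the result directly):

```r
library(igraph)

# A small random graph to partition
set.seed(1)
g <- simplify(watts.strogatz.game(1, size = 100, nei = 5, p = 0.05))

# Random-walk based community detection
wc <- walktrap.community(g)

membership(wc)   # community id for each vertex
modularity(wc)   # modularity of the detected partition
plot(wc, g, vertex.label = NA)
```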

edge.betweenness.community

This algorithm is the Girvan-Newman algorithm. It is a divisive algorithm where at each step the edge with the highest betweenness is removed from the graph. For each division you can compute the modularity of the graph. At the end, choose to cut the dendrogram where the process gives you the highest value of modularity.

M Newman and M Girvan: Finding and evaluating community structure in networks, Physical Review E 69, 026113 (2004)

fastgreedy.community

This algorithm is the Clauset-Newman-Moore algorithm. In this case the algorithm is agglomerative. At each step two groups merge. The merging is decided by optimising modularity. This is a fast algorithm, but it has the disadvantage of being a greedy algorithm. Thus, it might not produce the best overall community partitioning, although I find it useful and accurate.

A Clauset, MEJ Newman, C Moore: Finding community structure in very large networks, http://www.arxiv.org/abs/cond-mat/0408187

spinglass.community

This algorithm uses a spin-glass model and simulated annealing to find the communities inside a network.

J. Reichardt and S. Bornholdt: Statistical Mechanics of Community Detection, Phys. Rev. E, 74, 016110 (2006), http://arxiv.org/abs/cond-mat/0603718

M. E. J. Newman and M. Girvan: Finding and evaluating community structure in networks, Phys. Rev. E 69, 026113 (2004)

An example:

# First we load the igraph package
library(igraph)

# Let's generate two networks and merge them into one graph.
g2 <- barabasi.game(50, p=2, directed=F)
g1 <- watts.strogatz.game(1, size=100, nei=5, p=0.05)
g <- graph.union(g1, g2)

# Let's remove multi-edges and loops
g <- simplify(g)

# Let's see if we have communities here using the
# Girvan-Newman algorithm.
# First we calculate the edge betweenness, merges, etc.
ebc <- edge.betweenness.community(g, directed=F)

# Now we have the merges/splits and we need to calculate the modularity
# for each merge. For this we'll use a function that, for each set of edges
# removed, creates a second graph, checks its membership and uses
# that membership to calculate the modularity.
mods <- sapply(0:ecount(g), function(i) {
  g2 <- delete.edges(g, ebc$removed.edges[seq(length=i)])
  cl <- clusters(g2)$membership
  # March 13, 2014 - compute modularity on the original graph g
  # (thank you to Augustin Luna for detecting this typo) and not on the induced one g2.
  modularity(g, cl)
})

# We can now plot all modularities
plot(mods, pch=20)

# Now, let's colour the nodes according to their membership
g2 <- delete.edges(g, ebc$removed.edges[seq(length=which.max(mods)-1)])
V(g)$color <- clusters(g2)$membership

# Let's choose a layout for the graph
g$layout <- layout.fruchterman.reingold

# Plot it
plot(g, vertex.label=NA)

# If we wanted to use the fastgreedy.community algorithm we would do
fc <- fastgreedy.community(g)
com <- community.to.membership(g, fc$merges, steps=which.max(fc$modularity)-1)
V(g)$color <- com$membership + 1
g$layout <- layout.fruchterman.reingold
plot(g, vertex.label=NA)
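The spinglass algorithm follows the same pattern; a minimal sketch (the graph is again invented, and I use the modern igraph API where community functions return a communities object; note that spinglass needs a connected graph and, being based on simulated annealing, is stochastic):

```r
library(igraph)

# A connected random graph (Watts-Strogatz with nei = 5 is connected)
set.seed(2)
g <- simplify(watts.strogatz.game(1, size = 100, nei = 5, p = 0.05))

# Spin-glass model + simulated annealing community detection
sc <- spinglass.community(g)

# Colour nodes by their community and plot
V(g)$color <- membership(sc)
plot(g, vertex.label = NA)
```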

try it!