programming – sixhat.net

JASP or not to JASP? Bayesian statistical methods for free.

There are two ways of doing statistical analysis. One that I call the Excel approach and the other that I call the Experts approach.

You use the former if you are a mouse user, and like to point and click at things without understanding what is happening. You can also forego control too.

And you use the latter for finer control over outputs. They require some understanding of the theoretical aspects of the analysis, and a lot more punching of the keyboard.

Introducing JASP

I’ve been experimenting a statistics software called JASP that aims to be a small statistics software — ‘an alternative to SPSS’. It falls in the realm of Excel users but has the power of Experts underneath. What? Is THAT POSSIBLE?

JASP has only a few, well… FOUR, categories of analysis — t-tests, anova, regression, and frequencies — and this is fine for most of the use cases. I find it particularly interesting for EDA (Exploratory Data Analysis) where you just want to look at the data your working with.

Everything is visual and done with the mouse, from loading data, to clicking through the options of the analysis you want to perform.

The output is beautiful but limited to HTML with the images embedded as png data:uris. It would be great to export the results for use in academic papers — eps or pdfs would be great for LaTeX. This makes JASP a little limiting.

What about Mondrian?

In any case, JASP is still in its infancy and will evolve rapidly. As for exploratory data analysis software there are alternatives. One of my favourites is Mondrian. Mondrian is old as it started in 1997, latest stable is from 2011 and latest beta is from 2013. Mondrian doesn’t have the beautiful graphics of JASP either. Sometimes simple things are the best and Mondrian just works.

But isn’t JASP just a gui for R?

The cool thing about JASP is that all the clicks and clanks are just a way to pass parameters to R functions. The backend is R and therefore you can do everything the hard way if you really need to improve on the existing functions. For example, sometimes you need to make your plot scales logarithmic. There is no way to do that in JASP but if you use the R function underlying it… you solve your problem — and you can export PDFs for paper production.

The R scripts that power JASP are available at Github. If you want to use them in your own advanced work in R just go ahead and read them.

5 essential tricks for R users

R is very powerful and is becoming the language of data scientists. But some things require a bit of learning and are not obvious to the R newcomer. Here are five useful tips if you are just starting out:

Sometimes data is not in the correct format and you need to reshape data to use it in R. Instead of using external software you can do it directly in R.
Plotting multiple series in the same figure. This can be accomplished using R ggplot2 library producing better looking graphics.
If you do Network Analysis, you’ll need to partition the graph into communities. Finding communities in R is easy with the igraph package.
Still playing with graphs, you can colour different nodes according to some data property. Check how to colour graph nodes in R.
R ggplot2 allows you to accept most of the defaults and have great plots, but sometimes you might want to customise them further. Check how to customise ggplot2 axes labels.

Extra: If you use Sweave to automate your reports with live data in R, you might sometimes want to extract the R snippets to a new R file. Instead of copying and pasting, try this:

Stangle("big_sweave_doc.Rnw", output="big_sweave_code.R")

awk length of each line sorted

This is just an awk one-liner for helping me find long lines, usually in latex documents, that makes merging different versions of my paper easier. It just identifies the long lines and you can then easily break them into smaller ones. No more 3000 character lines.

awk '{print length($0), ++nn}' file.txt | sort -n

Systems that are Sensitive to Initial Conditions

Systems that are sensitive to initial conditions are those that whose trajectories diverge in ways that are not predictable. One of those systems is the chaotic Lorenz attractor, but even simpler systems can show dependence on the initial conditions.

In this example the diverging behaviour is obtained by the totally deterministic algorithm 2 times modulo 1 or (2x%1) that you can even try out in your calculator.

take a random number between 0 and 1 (e.g. 0.823)
multiply that number by 2 (e.g. 2*0.823=1.646)
calculate the modulo 1 (%) of this value (e.g. 1.646 % 1=0.646). modulo 1 corresponds in this case to make the integer part of the number 0.
use the result obtained in 3. and repeat from 2.

Now start with a second number very similar to the first (e.g. 0.825) and repeat the process (in the animation the initial difference is just 0.01% between the two values).

You’ll see that for the first iterations the calculations (the trajectory) are similar but suddenly they jump all around. After some iterations you can’t predict the behavior of the second trajectory, even if you know the first trajectory. This clearly shows that the system is sensitive to initial conditions. A very simple and strange system indeed.

R Tip: define ggplot axis labels

Formatting text and labels in ggplot or ggplot2 axis is easy. A common task when producing plots for publication is to replace default labels. Default labels in axes tend to reflect the name of variables used and sometimes these are not the most descriptive labels. At least not when you are publishing the plots in a scientific journal. So let’s try to break down some ways to personalise ggplot plot axes.

Quick Navigation:

The default ggplot axis labels
Setting axes labels in ggplot with scales
Formatting labels text for size and rotation?

For this formatting example I’ll use the movies dataset that is available in R. First thing we need to do is to load ggplot2 library and then the movies dataset

library(ggplot2)
data(movies)

The default ggplot axis labels

Traditionally the labels are set in the axis directly by ggplot from the aesthetics selected e.g.:

p0<-ggplot(data=movies, aes(x=year))
p0<-p0+geom_point(aes(y=rating))+geom_smooth(aes(y=rating))
p0

plot of define x and y axis ggplot

To make ggplot axes’ labels different we can use xlab and ylab. This defines x and y axis in ggplot easily.

p0+xlab('The glorious years of the movies')+ylab('The public ratings')

Setting axes labels in ggplot with scales

p0+
  scale_x_continuous('The glorious years of the movies (with scales)')+
  scale_y_continuous('The public ratings (with scales)')

Also worth investigating is the labs function that allow the change of the axes and the title e.g.:

p0+labs(
  x='The glorious years of the movies (with labs)',
  y='The public ratings (with labs)'
  )

Formatting labels text for size and rotation?

Ggplot can change axis label orientation, size and colour. To rotate the axes in ggplot you just add the angle property. To change size ou use size and for colour you uses color (Notice that a ggplot uses US-english spelling). Finally, note that you can use the face property to define if the font is bold or italic.

p0 + xlab('The Years of Cinema')+
  ylab('Public Ratings')+
  theme(
    axis.text.x=element_text(angle=90, size=8),
    axis.title.x=element_text(angle=10, color='red'),
    axis.title.y=element_text(angle=80, color='blue', face='bold', size=14)
    )

The formatting of the text in the labels is a bit counter intuitive because it uses a slightly different nomenclature. The formatting is done with the theme function and by defining element_text’s with the wanted format. In the example above the axis.text.x defines the ticks format and the axis.title.? define the labels format.

A good way to learn all the elements that a ggplot theme can format can be obtained from the help menu by entering ?theme. These examples are just scrapping the surface of what you can do but hope they can get you started in formatting text size and orientation inside ggplot plots.

Side Note: Did you noticed how crappy the movies from the 70s, 80s and 90s were?

Lorenz Attractor in R.

Lorenz Attractor

The Lorenz System is one of the most famous system of equations in the realm of chaotic systems first studied by Edward Lorenz. The double lob remembering a butterfly wing is on the imagination of any complex systems enthusiast.

Lorenz Attractor Equations

The three equations that govern its behavior are:

$\frac{dx}{dt}=\sigma(y-x)$

$\frac{dy}{dt}=x(\rho-z)-y$

$\frac{dz}{dt}=xy-\beta z$

where usually $\sigma=10,\rho=28,\beta=8/3$

So let's define the parameters, the initial state of the system and the equations in the form of a function called Lorenz

parameters <- c(s = 10, r = 28, b = 8/3)
state <- c(X = 0, Y = 1, Z = 1)
 
Lorenz <- function(t, state, parameters) {
    with(as.list(c(state, parameters)), {
        dX <- s * (Y - X)
        dY <- X * (r - Z) - Y
        dZ <- X * Y - b * Z
        list(c(dX, dY, dZ))
    })
}

We can now start processing this and plotting it.

times <- seq(0, 50, by = 0.01)
library(deSolve)
out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
 
par(oma = c(0, 0, 3, 0))
plot(out, xlab = "time", ylab = "-")
plot(out[, "Y"], out[, "Z"], pch = ".", type = "l")
mtext(outer = TRUE, side = 3, "Lorenz model", cex = 1.5)

The above example is mainly copied from the the deSolve package documentation that uses the same example. What I'd like to point out in this is how simple it is to solve and plot a system of differential equations in R. The language is simple and clear and it is much more practical than implementing everything from scratch.

If you want to play with a couple of Lorenz attractors synchronizing please visit this Java implementation.

Up next: Animating the Lorenz Attractor in R.

How to plot multiple data series in ggplot for quality graphs?

plot multiple data series with ggplot

I've already shown how to plot multiple data series in R with a traditional plot by using the par(new=T), par(new=F) trick. Now I'll show how to do it within ggplot2.

First let's generate two data series y1 and y2 and plot them with the traditional points methods

x <- seq(0, 4 * pi, 0.1)
n <- length(x)
y1 <- 0.5 * runif(n) + sin(x)
y2 <- 0.5 * runif(n) + cos(x) - sin(x)
plot(x, y1, col = "blue", pch = 20)
points(x, y2, col = "red", pch = 20)

This is exactly the R code that produced the above plot. It is just a simple plot and points functions to plot multiple data series. It is not really the greatest, smart looking R code you want to use. Better plots can be done in R with ggplot.

Plotting with Ggplot2

Now, let's try this with ggplot2.

First we need to create a data.frame with our series.

If we have very few series we can just plot adding geom_point as needed.

library(ggplot2)
df <- data.frame(x, y1, y2)
ggplot(df, aes(x, y = value, color = variable)) + 
    geom_point(aes(y = y1, col = "y1")) + 
    geom_point(aes(y = y2, col = "y2"))

But if we have many series to plot an alternative is using melt to reshape the data.frame and with this plot an arbitrary number of rows. For example:

library(reshape)
# This creates a new data frame with columns x, variable and value
# x is the id, variable holds each of our timeseries designation
df.melted <- melt(df, id = "x")
 
ggplot(data = df.melted, aes(x = x, y = value, color = variable)) +
  geom_point()

And thats how to plot multiple data series using ggplot. The basic trick is that you need to melt your data into a new data.frame. Remember, in data.frames each row represents an observation.

update:

Another option, pointed to me in the comments by Cosmin Saveanu (Thanks!), it to plot the multiple data series with facets (good for B&W):

library(reshape)
ggplot(data = df.melted, aes(x = x, y = value)) +
geom_point() + facet_grid(variable ~ .)