New York Times Writes in Praise of <R>, the Open Source Statistics Package

mkrishnan · Jan 7, 2009

http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html

To some people R is just the 18th letter of the alphabet. To others, it’s the rating on racy movies, a measure of an attic’s insulation or what pirates in movies say.

R is also the name of a popular programming language used by a growing number of data analysts inside corporations and academia. It is becoming their lingua franca partly because data mining has entered a golden age, whether being used to set ad prices, find new drugs more quickly or fine-tune financial models. Companies as diverse as Google, Pfizer, Merck, Bank of America, the InterContinental Hotels Group and Shell use it.

But R has also quickly found a following because statisticians, engineers and scientists without computer programming skills find it easy to use.

“R is really important to the point that it’s hard to overvalue it,” said Daryl Pregibon, a research scientist at Google, which uses the software widely. “It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems.”

Excellent publicity for one of the best, and yet most underappreciated, applications in the OSS world.

The article glosses over one thing that makes <R> really well suited to the statistics world... A SAS representative makes this gibe:

“I think it addresses a niche market for high-end data analysts that want free, readily available code,” said Anne H. Milley, director of technology product marketing at SAS. She adds, “We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.”

But one of the hidden but major problems with programs like SAS, SPSS, LISREL, Mplus, etc., is that they are closed source. For complicated statistics, where there is no completely well-established numerical methodology, their source code is not published, so the community cannot analyze it to determine whether the assumptions behind these analyses are actually warranted. Even in relatively simple analyses like basic SEM, it's fairly common that results from different packages are slightly different, with little clarity as to why.

R has a huge advantage, in principle, that even for its most complex statistics, how it arrives at its results is completely available for analysis.
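To make the "slightly different results" point concrete, here is a minimal Python sketch (not taken from the article or from any of the packages named above) that computes the same sample variance with two algebraically equivalent formulas. The function names and the data are made up for illustration, and nothing here is a claim about which formula any particular package uses.

```python
def variance_one_pass(values):
    """Textbook shortcut: (sum of squares - (sum)^2 / n) / (n - 1)."""
    n = len(values)
    total = 0.0
    total_sq = 0.0
    for v in values:
        total += v
        total_sq += v * v
    return (total_sq - total * total / n) / (n - 1)


def variance_two_pass(values):
    """Compute the mean first, then sum squared deviations from it."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


# Hypothetical data: a small spread sitting on a large offset.
data = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]

one_pass = variance_one_pass(data)
two_pass = variance_two_pass(data)
print("one-pass:", one_pass)
print("two-pass:", two_pass)
```

The exact sample variance of this data is 30. In double precision the two-pass version recovers it, while the one-pass shortcut loses it to rounding because the squared values are too large to be stored exactly. Neither is a bug in the usual sense, which is the kind of discrepancy that is hard to track down without being able to read the code.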

Blue Velvet · Jan 7, 2009

Is it mere coincidence that it's right next to Q in the alphabet? Enquiring minds need to know.

mkrishnan · Jan 7, 2009

Blue Velvet said:
Is it mere coincidence that it's right next to Q in the alphabet? Enquiring minds need to know.

Let me run some numbers and get back to you.

Incidentally, the interface for R in OS X is particularly nice.

gotzero · Jan 17, 2009

Blue Velvet said:
Is it mere coincidence that it's right next to Q in the alphabet? Enquiring minds need to know.

There is much more of a coincidence that it is right next to "S".

I love R, and am always surprised that it does not get more attention from businesses and people learning statistics.

mkrishnan said:
Incidentally, the interface for R in OS X is particularly nice.

Agreed! For once the OS X version of a truly multi-platform program is not way behind...

Sesshi · Jan 17, 2009

mkrishnan said:
R has a huge advantage, in principle, that even for its most complex statistics, how it arrives at its results is completely available for analysis.

In principle. It remains to be seen if flaws in the results get picked up on and fixed, even if they are readily visible. Personally, I think that element of FOSS being touted as an advantage is at best a debatable one in practice.

twoodcc · Jan 17, 2009

interesting. thanks for posting! i have never heard of R, and i plan to check it out!

gotzero · Jan 17, 2009

Sesshi said:
In principle. It remains to be seen if flaws in the results get picked up on and fixed, even if they are readily visible. Personally, I think that element of FOSS being touted as an advantage is at best a debatable one in practice.

Taking that bait, I have found plenty of errors in the closed platforms too, and in my opinion, at least with R they are well discussed, and you can often see why there is a problem. It is a little rough around the edges, and most people now are terrified of anything with a terminal, but if money is an object, it sure fills a gap.

One of the advantages of FOSS here is that you can move around companies and computers and know that you can get an additional unit of the software without breaking the bank.

R is not going to overtake SAS, but I think it is nice to have a platform where everyone can share data/output.

Sesshi · Jan 17, 2009

That depends on your focus in terms of how you acquire and use software.

To me, unless we're talking about a custom development, the cost of acquiring the software is actually the cheapest part of the software lifecycle - and for most packaged applications, a completely negligible part of the operating cost of the application as a whole.

If you're a very large organisation and custom-develop many of your applications, then OSS makes a lot of sense as a means of using or adapting existing code as a time-compression method. We're also currently doing that. Similarly if your rollout is measured in millions of units, then the cost of packages starts to become a sizeable sum which has to be considered.

However, in many situations, I'm not sure whether nerds who can't see the wood for the trees have progressed higher up the management chain or whether there's just been a general dumbing-down of decision-making sysadmins, but increasingly I'm hearing the same arguments for FOSS, not all of which actually make sense over the lifecycle of the product, in a professional scenario, for the applications being considered.

gotzero · Jan 18, 2009

From a top-of-the-enterprise argument, I am sure you are right, and it highlights to me that you play in much larger pools than I do right now.

My biggest problem was when managing a department of less than 20 people in a large organization, and we needed better software FAST. Even though money was not necessarily an object, time was, and it can take weeks or months to go from the "I need this" stage to having it in hand. R saved me a couple of times simply because I could sweet talk IT into getting proven FOSS platforms on my team's systems often same day.

I guess to me part of the benefit of FOSS is the speed. You can be up and running in a few minutes if need be. It also allows you to prove the benefit of the software before you have to justify purchasing it. A specific benefit for me was that after leaving my bank, I was able to take R with me, and I now am able to use it for school and when I do consulting. Mastering one program helped me with several jobs, an advantage over non-monopolistic closed-source programs or custom applications.

Sesshi · Jan 19, 2009

I see the budgetary reason. However that's quite a band-aid approach, although I don't see it as uncommon, and not necessarily a guarantee of the best solution, as often in that case you haven't actually used the competing monetised applications 'in anger'. It's 'what you can get' as opposed to 'what you might actually need'.

mkrishnan · Jan 19, 2009

I think in statistics and other "technical" computing, there are some additional issues. First, by tradition, the support from the companies that make technical software is not always great. As I understand it, SAS is less of an issue, but the people who make SPSS have no customer management skills whatsoever. They routinely find their software broken by operating system updates and then take months to fix it. They mysteriously post warning messages six or eight months after an operating system version comes out indicating that statistics might be miscomputed on that version even though the software appears to run. All because they programmed their software (in both Windows and OS X) in some godforsaken development environment that flouts every convention of both operating systems (although recently they have "possibly" fixed this with their transition to using Java).

As for the value of open code, in basic statistics, it's not a big deal, but in primarily academic advanced statistics, it can be crucial. But I do agree with Sesshi in that, although R is very advanced, there hasn't always been a lightning-speed advancement of its features based on the fact that its algorithms are freely available and that many leading statisticians are working on it. I'd say, in comparison to the Mozilla foundation, for instance, the management of R is a bit haphazard.

gotzero · Jan 19, 2009

mkrishnan said:
I'd say, in comparison to the Mozilla foundation, for instance, the management of R is a bit haphazard.

Agreed, but the market is also a lot smaller.

Sesshi said:
I see the budgetary reason. However that's quite a band-aid approach, although I don't see it as uncommon, and not necessarily a guarantee of the best solution, as often in that case you haven't actually used the competing monetised applications 'in anger'. It's 'what you can get' as opposed to 'what you might actually need'.

A lot of times for me it is not the budget but the speed. There is a big difference between "what I need" and "what I need right now", especially if I am billing, or getting billed, by the hour.

You are both right that it is not the most mature, but I have a lot of high hopes for it. Teaching statistics, it is quite refreshing to be able to point people to something they can D/L and use at home, and through my rose-colored glasses, it will help the next generation of the statistically inclined get some exposure at a younger age, not only to stat, but also to a command line.

haiggy · Jan 19, 2009

We use this in my statistics class... never thought I'd see anything mentioned about it. I don't find it particularly easy to use unless you've gone through quite a few tutorials... then I guess you'd get the hang of it.

mkrishnan · Jan 19, 2009

haiggy said:
We use this in my statistics class... never thought I'd see anything mentioned about it. I don't find it particularly easy to use unless you've gone through quite a few tutorials... then I guess you'd get the hang of it.

I think R is much harder to use than SPSS and even SAS, but the big difference is that once you understand R's way of doing things, it's much more consistent than SPSS in particular. I found the statement in the Times article kind of ironic, actually, when they said that R found a following among scientists who didn't know how to program. R is so much easier to use if you understand the basics of object-oriented programming.

pooky · Jan 19, 2009

mkrishnan said:
http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html
But one of the hidden but major problems with programs like SAS, SPSS, LISREL, Mplus, etc., is that they are closed source. For complicated statistics, where there is no completely well-established numerical methodology, their source code is not published, so the community cannot analyze it to determine whether the assumptions behind these analyses are actually warranted. Even in relatively simple analyses like basic SEM, it's fairly common that results from different packages are slightly different, with little clarity as to why.

R is wonderful, and what you've written above is a perfect example of what pisses me off about the commercial scientific software community, particularly SAS. Not only do users have no access to their algorithms, users are FORCED to do things in a particular way. Want to do a regression? On SAS, you have virtually no options as far as controlling how it is computed, unless it is an option that you have been granted permission to use. In R, if you don't like the way it is computed, you can write your own damn code.
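As a rough sketch of what "write your own code" can look like, here is ordinary least-squares regression for a single predictor, written out by hand. It is in Python rather than R purely for illustration, `fit_line` is a made-up name, and it is not a description of how R or SAS computes a regression internally.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more paired observations")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of squared deviations of x, and of cross-products with y.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical data lying exactly on y = 1 + 2x.
intercept, slope = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(intercept, slope)
```

Every step is visible, so trying a different estimator (weighted, robust, whatever) means editing a few lines rather than waiting for a vendor to expose an option.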

I've noticed in my field in particular (Ecology/Evolutionary biology), the past few decades have been SAS-dominated, with older workers entering a comfort zone with the software that isn't always entirely logical. I've had reviewers ask for computations (e.g. least squares means, type 3 sums of squares) that were completely invented by the SAS Institute without much support for their actual utility. It's one thing when bad, overpriced software is all that is available, but when that software can completely change the scientific culture, you're in trouble.

In other words, R kicks ass, and while it may not be the end-all-be-all of statistics, it will certainly force the commercial developers to innovate to remain competitive. This is a Good Thing.

gotzero · Jan 21, 2009

For what it is worth, Ashlee provided an update on an NYT blog (link) after what must have been a deluge of geekmail.

Seeing the ~800% difference in the estimate of the R user base (goes up with commercial dependence, how random), does anyone know where to find an intelligent estimate of the number of SAS users? It would be interesting to even roughly compare the number of eyeballs staring at each.

mkrishnan · Jan 21, 2009

pooky said:
I've noticed in my field in particular (Ecology/Evolutionary biology), the past few decades have been SAS-dominated, with older workers entering a comfort zone with the software that isn't always entirely logical.

Try psychology...

SPSS is already considered "advanced" software and most people will not go anywhere near SAS because of its perceived difficulty.

I would be curious to know the SAS userbase size, too (and SPSS while we're at it). I think it's quite large, though.
