Archive for the ‘statistical analysis’ Category

When to use the R language

Saturday, April 17th, 2010

When you have to explore data. At the start of an analytic project, it’s a good idea to create a bunch of graphical visualizations of your data to get a sense of what’s inside it. In terms of its graphical capabilities, R exists in a whole separate dimension from Excel. This was perhaps the most shocking part to me about using R for the first time: I really thought I had a handle on data analysis even though I’d restricted my software to Excel, but boy was I wrong. The visualizations you can create in R are much more sophisticated and much more nuanced. And, philosophically, you can tell that the visualization tools in R were created by people more interested in good thinking about data than about beautiful presentation. (The result, ironically, is a much more beautiful presentation, IMHO.)

Here’s how I’d put the difference to someone who’s familiar with Excel but not yet with R. The graphics creation options that Excel gives you are all based in the graphical user interface. This is what makes Excel relatively easy to use—all your options are laid out before you with nice buttons and fill-in-the-blank boxes. But in order to create a graphical interface that’s easy to use, the creators of Excel had to make a bunch of decisions about what sorts of graphics you are and are not likely to want. With too many choices, the graphical interface becomes cumbersome and frustrating, so to achieve simplicity they had to eliminate options.

And this isn’t a gripe or anything. I can’t say I’d have done a better job designing Excel’s charting graphical interface. I cut my teeth on it.

These limitations become a problem when you want to inspect data visually in a bunch of different ways in order to explore it. R, through a combination of its well-designed base graphics package, the exceptionally well-designed lattice graphics package, and the jaw-droppingly well-designed ggplot2 graphics package, offers a breathtaking array of visualization options that you access through the command line or scripts. It has power that you just can’t get using a graphical interface to generate your charts.

We are living through the age of statistics

Wednesday, March 17th, 2010

Boltzmann did a great deal to introduce statistics to science, but most scientists of his generation fought against the idea and insisted that Newton’s mechanical view of the universe was correct. The true era of statistics began on December 14th of 1900, when Max Planck first presented the Planck postulate. Since that time, statistics have conquered science. If you were to list things that were dominant motifs of the 20th century, you might include skyscrapers, airplanes, rockets, penicillin and statistics.

Nevertheless, statistics are not the truth about our universe. They are a useful tool, but as a world view, they take us no closer to the truth than Newton’s mechanical universe. As I write these words, I’m thinking of what Kuhn said, that science adopts a new paradigm, not because it is a better representation of reality, but because it is useful to the work that scientists are now doing.

For these reasons, it is important to remember the many problems that science faces when relying on statistical methods:

Statistical problems also afflict the “gold standard” for medical research, the randomized, controlled clinical trials that test drugs for their ability to cure or their power to harm. Such trials assign patients at random to receive either the substance being tested or a placebo, typically a sugar pill; random selection supposedly guarantees that patients’ personal characteristics won’t bias the choice of who gets the actual treatment. But in practice, selection biases may still occur, Vance Berger and Sherri Weinstein noted in 2004 in Controlled Clinical Trials. “Some of the benefits ascribed to randomization, for example that it eliminates all selection bias, can better be described as fantasy than reality,” they wrote.

Randomization also should ensure that unknown differences among individuals are mixed in roughly the same proportions in the groups being tested. But statistics do not guarantee an equal distribution any more than they prohibit 10 heads in a row when flipping a penny. With thousands of clinical trials in progress, some will not be well randomized. And DNA differs at more than a million spots in the human genetic catalog, so even in a single trial differences may not be evenly mixed. In a sufficiently large trial, unrandomized factors may balance out, if some have positive effects and some are negative. (See Box 3) Still, trial results are reported as averages that may obscure individual differences, masking beneficial or harmful effects and possibly leading to approval of drugs that are deadly for some and denial of effective treatment to others.
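To put rough numbers on the coin-flip point, here is a minimal Python sketch. It is my own illustration, not something from the quoted article: the function names, the 1,000 sequences, and the 100-patient trial with 20 carriers of some trait are all hypothetical choices. It computes the exact chance of 10 heads in a row, the chance that at least one of many independent sequences produces such a run, and the chance that a fair 50/50 randomization still splits a trait lopsidedly between the two arms.

```python
from fractions import Fraction
from math import comb


def prob_all_heads(n_flips):
    """Exact probability that a fair coin lands heads on every one of n_flips."""
    return Fraction(1, 2) ** n_flips


def prob_at_least_one_run(n_sequences, n_flips):
    """Chance that at least one of n_sequences independent sequences is all heads."""
    p_miss = 1 - prob_all_heads(n_flips)
    return float(1 - p_miss ** n_sequences)


def prob_arm_imbalance(n_patients, n_carriers, min_in_one_arm):
    """Chance that one arm of a two-arm trial gets at least min_in_one_arm carriers.

    Assumes (hypothetically) that half the patients are assigned to each arm
    uniformly at random, so the carrier count in one arm is hypergeometric.
    """
    arm = n_patients // 2
    total = comb(n_patients, arm)
    lowest = max(0, n_carriers - (n_patients - arm))
    highest = min(n_carriers, arm)
    p = Fraction(0)
    for k in range(lowest, highest + 1):
        # k carriers land in the first arm, the rest in the second arm.
        if max(k, n_carriers - k) >= min_in_one_arm:
            ways = comb(n_carriers, k) * comb(n_patients - n_carriers, arm - k)
            p += Fraction(ways, total)
    return p


if __name__ == "__main__":
    print("10 heads in a row:", prob_all_heads(10))
    print("at least one such run in 1,000 tries:", prob_at_least_one_run(1000, 10))
    # Hypothetical trial: 100 patients, 20 carry some unmeasured trait.
    print("14 or more carriers in one arm:", float(prob_arm_imbalance(100, 20, 14)))
```

A single run of 10 heads is rare, but rare events become likely once you repeat the experiment often enough, which is the situation with thousands of trials running at once.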

“Determining the best treatment for a particular patient is fundamentally different from determining which treatment is best on average,” physicians David Kent and Rodney Hayward wrote in American Scientist in 2007. “Reporting a single number gives the misleading impression that the treatment-effect is a property of the drug rather than of the interaction between the drug and the complex risk-benefit profile of a particular group of patients.”

Another concern is the common strategy of combining results from many trials into a single “meta-analysis,” a study of studies. In a single trial with relatively few participants, statistical tests may not detect small but real and possibly important effects. In principle, combining smaller studies to create a larger sample would allow the tests to detect such small effects. But statistical techniques for doing so are valid only if certain criteria are met. For one thing, all the studies conducted on the drug must be included — published and unpublished. And all the studies should have been performed in a similar way, using the same protocols, definitions, types of patients and doses. When combining studies with differences, it is necessary first to show that those differences would not affect the analysis, Goodman notes, but that seldom happens. “That’s not a formal part of most meta-analyses,” he says.
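For readers who have not seen how pooling works, here is a minimal sketch of one textbook approach, fixed-effect inverse-variance weighting. This is my own illustration with made-up numbers, not the method or data of any study mentioned in the quoted article. The effect sizes and standard errors are hypothetical, and the method assumes every study estimates the same underlying effect, which is exactly the "performed in a similar way" condition the passage says is seldom checked.

```python
from math import sqrt

# Hypothetical effect estimates and standard errors for five small studies.
HYPOTHETICAL_EFFECTS = [0.20, 0.15, 0.25, 0.18, 0.22]
HYPOTHETICAL_STD_ERRS = [0.15, 0.14, 0.16, 0.15, 0.13]


def z_score(effect, std_err):
    """Effect size measured in units of its own standard error."""
    return effect / std_err


def pool_fixed_effect(effects, std_errs):
    """Combine study estimates, weighting each by the inverse of its variance.

    Returns the pooled estimate and its standard error. Only meaningful if
    the studies really do measure one common effect.
    """
    weights = [1 / se ** 2 for se in std_errs]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = sqrt(1 / total_weight)
    return pooled, pooled_se


if __name__ == "__main__":
    for effect, se in zip(HYPOTHETICAL_EFFECTS, HYPOTHETICAL_STD_ERRS):
        print(f"single study: effect={effect:.2f}, z={z_score(effect, se):.2f}")
    pooled, pooled_se = pool_fixed_effect(HYPOTHETICAL_EFFECTS, HYPOTHETICAL_STD_ERRS)
    print(f"pooled: effect={pooled:.3f}, z={z_score(pooled, pooled_se):.2f}")
```

With these invented numbers, no single study clears the usual 1.96 threshold for significance at the 5% level, while the pooled estimate does. That is the appeal of meta-analysis, and also why the validity conditions matter so much: the arithmetic will happily combine studies that should never have been combined.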

Meta-analyses have produced many controversial conclusions. Common claims that antidepressants work no better than placebos, for example, are based on meta-analyses that do not conform to the criteria that would confer validity. Similar problems afflicted a 2007 meta-analysis, published in the New England Journal of Medicine, that attributed increased heart attack risk to the diabetes drug Avandia. Raw data from the combined trials showed that only 55 people in 10,000 had heart attacks when using Avandia, compared with 59 people per 10,000 in comparison groups. But after a series of statistical manipulations, Avandia appeared to confer an increased risk.

Crime continues to decline but the public disagrees

Monday, February 15th, 2010

Why do people think crime is getting worse?

The year 2009 was a grim one for many Americans, but there was one pleasant surprise amid all the drear: Citizens, though ground down and nerve-racked by the recession, still somehow resisted the urge to rob and kill one another, and they resisted in impressive numbers. Across the country, FBI data show that crime last year fell to lows unseen since the 1960s – part of a long trend that has seen crime fall steeply in the United States since the mid-1990s.

At the same time, however, another change has taken place: a steady rise in the percentage of Americans who believe crime is getting worse. The vast majority of Americans – nearly three-quarters of the population – thought crime got worse in the United States in 2009, according to Gallup’s annual crime attitudes poll. That, too, is part of a running trend. As crime rates have dropped for the past decade, the public belief in worsening crime has steadily grown. The more lawful the country gets, the more lawless we imagine it to be.

I’ve written before about how safe New York City is.

I was just recently in Atlanta. I ran into a woman who seemed to think there was more war in the world now than ever before. I told her what I’ve read, which is that there is less war now than at any other time known to historians. She looked at me like I was crazy.

I am not sure what is going on, but it seems like a lot of the public wants to believe the world is in worse shape than it is. There is, after all, a segment of the population who believes that the threat of Islamic terrorism is the worst threat America has ever faced – as if the Soviet Union, with enough nuclear weapons to end all life on earth, was somehow a joke, something to laugh at.

Are these misperceptions due to Americans’ dislike of reading history? I get that the economy is bad, but violence everywhere seems contained. Why do people believe otherwise? What reference points do they use?

New York has come of age as a start-up hub

Saturday, January 23rd, 2010

Obviously I’m biased, since I’m trying to do a start-up in New York, but everything about this rings true:

Tumblr and Posterous are the two most prominent “tumblogging” sites, i.e. sites that make blogging more straightforward by making it easier to post media. Both were launched within six months of each other. (Actually, Posterous was started later than Tumblr.)

But now Tumblr has been an Alexa Top 100 site for a while and is still growing strong. Meanwhile Posterous has about 4 times less uniques. Yet Posterous has everything to win: it’s a Y Combinator company with top-tier investors like Chris Sacca and Mitch Kapor. Its founders are experienced software engineers with computer science degrees from Stanford. How come it’s eating dust from a small startup started by a high school dropout?

The answer is as easy as it is counter-intuitive: Tumblr is a New York company and Posterous is a Silicon Valley company.

Or, to put it another way: Posterous is an engineered product, while Tumblr is a designed product.

Posterous is extremely well engineered. There’s nothing wrong with it. Every single thing about it is well thought out. But it’s not just that it’s less pretty (though it is). It’s just not designed as well as Tumblr is.

…In fact, everything about Posterous is nice. It’s very nice. I’m not here to bash Posterous, I think it’s a tremendous product and I wish them the best of luck.

But everything about Tumblr is better designed. I used the landing page as one example, but there are tons of features where Tumblr shines by its gorgeous design.

Meanwhile Posterous is typical of the Silicon Valley engineering mindset where everything is measured, ranked, weighted. It’s like Google. And having terrible design like Google is great if you have a technology edge. But if you’re in a market where what matters is design edge, that’s not enough. There needs to be great design, by which I don’t mean looks (though they’re important), but how it works for the end user.

…The first is that New York has truly come of age as a startup hub, with its own “style”, its own way of doing things, its own mindset, which can sometimes — not always, but sometimes — kick Silicon Valley’s ass.

Incanter: an R-like statistical package for the JVM

Monday, January 4th, 2010

Incanter is a statistical package written in Clojure. It brings some bits of the R-language to the JVM.