The relational database is up against its limits on big sites
10 years ago, when I was getting serious about programming, I read Phillip Greenspun’s book about websites and databases. Greenspun does a great job of explaining some of the thinking, and even traditions, that shaped relational databases. I read that book and formed certain opinions about databases, for instance, that a database should not store the same info in multiple places (that is, the database should be normalized). You would be a fool to store a user’s phone number in 3 different places, because when they change their phone number, there is a risk that one database table with update but another won’t, and then you have inconsistent data about the user.
I’ve been fascinated with the articles, more and more over the last 3 years, arguing for a new style of storing data:
Distributed vs. Relational Databases
Traditional relational databases are 30 years old, are well understood and have a huge ecosystem of tools around them. For that reason, it’s a compelling option when building your application. Postgres, MySQL, and Oracle are all relational databases modeling a schema on entities and relations between those entities. That’s a good, powerful programming model with interesting theoretical properties. But companies with large amounts of data have already gone past what you can reasonably fit on a single machine, even on high-end hardware, and it’s provably impossible to keep the traditional relational model, in particular the ACID properties, while scaling across multiple machines. Even if you’re willing to give up availability, scaling reads (via caching and replication) is difficult with relational databases, and scaling writes by partitioning is either very expensive, very painful from an application programming and operations standpoint, or both.
Cassandra is taking the approach that, given that you’re going to have to give up some parts of the relational model to scale, let’s start over and rethink things. Let’s add things like transparent replication and failover, built-in partitioning and load balancing, multiple data center support, and the ability to add capacity without ever disturbing applications running against the database.