Archive for the ‘datastores’ Category

Gay marriage: how to design the database

Thursday, March 11th, 2010

Gay marriage: the database engineering perspective. This is great stuff. The essay is ostensibly about how to handle the recording of gay marriage in a database. But it uses the issue of gay marriage to go through every classic issue of database design, from foreign keys to normalization to the degree of abstraction needed to handle polyamorous marriages. From now on, when I’ve got a friend trying to learn how to design databases, I will send them to this essay.

There are various objections to expanding the conventional, up-tight, as-God-intended “one man, one woman” notion of marriage but by far the least plainly bigoted ones I am aware of are the bureaucratic ones.

To be blunt, the systems aren’t set up to handle it. The paper forms have a space for the husband’s name and a space for the wife’s name. Married people carefully enter their details in block capitals and post the forms off to depressed paper-pushers who then type that information into software front-ends whose forms are laid out and named in precisely the same fashion. And then they hit “submit” and the information is filed away electronically in databases which simply keel over or belch integrity errors when presented with something so profound as a man and another man who love each other enough to want to file joint tax returns.

Speaking as a computery-type person, altering the paper forms is not my department. It’s probably expensive and there are probably millions of existing incorrect forms which would need returning or recycling or burning instead of using. Or maybe it’s simple. I don’t know. The real question from my perspective is how you store a marriage in a computer.

Altering your database schema to accommodate gay marriage can be easy or difficult depending on how smart you were when you originally set up your system to accommodate heterosexuality only. Let’s begin.

A denormalisation dictionary

Sunday, December 20th, 2009

A fascinating bit of database optimization:

Having run two crowdsourcing projects I can tell you this: the single most important piece of code you will write is the code that gives someone something new to review. Both of our projects had big “start reviewing” buttons. Both were broken in different ways.

The first time round, the mistakes were around scalability. I used a SQL “ORDER BY RAND()” statement to return the next page to review. I knew this was an inefficient operation, but I assumed that it wouldn’t matter since the button would only be clicked occasionally.

Something like 90% of our database load turned out to be caused by that one SQL statement, and it only got worse as we loaded more pages in to the system. This caused multiple site slow downs and crashes until we threw together a cron job that pushed 1,000 unreviewed page IDs in to memcached and made the button pick one of those at random.

This solved the performance problem, but meant that our user activity wasn’t nearly as well targeted. For optimum efficiency you really want everyone to be looking at a different page—and a random distribution is almost certainly the easiest way to achieve that.

The second time round I turned to my new favourite in-memory data structure server, redis, and its SRANDMEMBER command (a feature I requested a while ago with this exact kind of project in mind). The system maintains a redis set of all IDs that needed to be reviewed for an assignment to be complete, and a separate set of IDs of all pages had been reviewed. It then uses redis set intersection (the SDIFFSTORE command) to create a set of unreviewed pages for the current assignment and then SRANDMEMBER to pick one of those pages.

This is where the bug crept in. Redis was just being used as an optimisation—the single point of truth for whether a page had been reviewed or not stayed as MySQL. I wrote a couple of Django management commands to repopulate the denormalised Redis sets should we need to manually modify the database. Unfortunately I missed some—the sets that tracked what pages were available in each document. The assignment generation code used an intersection of these sets to create the overall set of documents for that assignment. When we deleted some pages that had accidentally been imported twice I failed to update those sets.

This meant the “next page” button would occasionally turn up a page that didn’t exist. I had some very poorly considered fallback logic for that—if the random page didn’t exist, the system would return the first page in that assignment instead. Unfortunately, this meant that when the assignment was down to the last four non-existent pages every single user was directed to the same page—which subsequently attracted well over a thousand individual reviews.

Next time, I’m going to try and make the “next” button completely bullet proof! I’m also going to maintain a “denormalisation dictionary” documenting every denormalisation in the system in detail—such a thing would have saved me several hours of confused debugging.

The relational database is up against its limits on big sites

Thursday, September 24th, 2009

10 years ago, when I was getting serious about programming, I read Phillip Greenspun’s book about websites and databases. Greenspun does a great job of explaining some of the thinking, and even traditions, that shaped relational databases. I read that book and formed certain opinions about databases, for instance, that a database should not store the same info in multiple places (that is, the database should be normalized). You would be a fool to store a user’s phone number in 3 different places, because when they change their phone number, there is a risk that one database table with update but another won’t, and then you have inconsistent data about the user.

I’ve been fascinated with the articles, more and more over the last 3 years, arguing for a new style of storing data:

Distributed vs. Relational Databases

Traditional relational databases are 30 years old, are well understood and have a huge ecosystem of tools around them. For that reason, it’s a compelling option when building your application. Postgres, MySQL, and Oracle are all relational databases modeling a schema on entities and relations between those entities. That’s a good, powerful programming model with interesting theoretical properties. But companies with large amounts of data have already gone past what you can reasonably fit on a single machine, even on high-end hardware, and it’s provably impossible to keep the traditional relational model, in particular the ACID properties, while scaling across multiple machines. Even if you’re willing to give up availability, scaling reads (via caching and replication) is difficult with relational databases, and scaling writes by partitioning is either very expensive, very painful from an application programming and operations standpoint, or both.

Cassandra is taking the approach that, given that you’re going to have to give up some parts of the relational model to scale, let’s start over and rethink things. Let’s add things like transparent replication and failover, built-in partitioning and load balancing, multiple data center support, and the ability to add capacity without ever disturbing applications running against the database.

A site converts SQL into Propel code (for Symfony, for instance)

Monday, September 7th, 2009

I think I stumbled upon this site last year, when I was first learning Symfony. At that time, the site was very rough. It is still in alpha, even now, but it looks like it has reached a point where it is useful. It converts SQL into the code needed for Propel.

So it converts this:

person.gender = ‘M’ AND (person.location IN (’Birmingham’, ‘Coventry’) OR person.location = ‘Manchester’) AND (person.enabled <> 0) AND person.age > 16

Into this:

$c = new Criteria();
$crit0 = $c->getNewCriterion(PersonPeer::GENDER, ‘M’);
$crit1 = $c->getNewCriterion(PersonPeer::LOCATION, array(’Birmingham’, ‘Coventry’), Criteria::IN);
$crit2 = $c->getNewCriterion(PersonPeer::LOCATION, ‘Manchester’);

// Perform OR at level 1 ($crit1 $crit2 )
$crit1->addOr($crit2);
$crit3 = $c->getNewCriterion(PersonPeer::ENABLED, 0, Criteria::NOT_EQUAL);
$crit4 = $c->getNewCriterion(PersonPeer::AGE, 16, Criteria::GREATER_THAN);

// Perform AND at level 0 ($crit0 $crit1 $crit3 $crit4 )
$crit0->addAnd($crit1);
$crit0->addAnd($crit3);
$crit0->addAnd($crit4);

// Remember to change the peer class here for the correct one in your model
$c->add($crit0);
$result = TablePeer::doSelect($c);

// This loop will of course need to be edited to work
foreach ($result as $obj)
{
//$val = $obj->getValue();
}

Very useful, since the worst thing about Symfony is how under-documented Propel is.

On future Symfony projects, I may convert to Doctrine.

JOINs are evil, part II

Saturday, August 8th, 2009

It’s odd that I’m suddenly seeing this advice everywhere. Ten years ago, when I was learning about databases, I never heard anyone suggest that JOINs are evil:

#4: Avoid joins

As your tables get bigger and bigger joins cannot be done in memory and spill over to disk. Once that happens your performance is gone. For smaller tables it doesn’t matter but as tables get bigger avoid joining when you can. Try using unions, subselects, or denormalization instead. If you have two big tables that are 1-to-1 that you are joining a lot combine them into one wide table.

JOINs are evil?

Saturday, August 1st, 2009

Interesting conversation in the comments after Don MacAskill talks about compression and CPU usage for MySql tables on a server running OpenSolaris:

Don MacAskill:
We don’t do JOINs. As in, really, no JOINs. JOINs are evil. :) Still, our read workload is likely very random, so I see where you’re coming from. But I’m fairly certain that some of the same data applies.

I have no idea, though, what granularity the compression is at. That’s a good question. I’ll see if I can get an answer. I’ll bet it’s each “record” (aka block), but I could be wrong.

A follow up comment:

Justin Swanhart:
I don’t subscribe to the JOINS are evil camp, but I understand your frustration. Personally I am a fan of materializing views for fast read performance on aggregated tables, but still the advantages of normalization for OLTP applications.

But even without JOINs, InnoDB reads may be fairly random if you are using non PRIMARY indexes. As you probably know InnoDB tables are organized arround the primary key of the table, or around an automatically provided 64bit surrogate key. Secondary indexes have to do an additional lookup on the primary key.

Are your tables read-only? Insertion and update may have really big penalties on compressed filesystems too. Also, have you considered the compression options of the new pluggable version of InnoDB?

JournalSpace loses all data in its database and has no backup

Saturday, January 10th, 2009

This is a horrifying failure of risk management and system administration good practice:  JournalSpace loses all data in its database and has no backup:

Blogging platform JournalSpace (which I’d never heard of to date) has ceased to be, following a wipe-out of the main database for which there was no back-up in place. According to the JournalSpace blog, the database was overwritten as a result of a malicious act from a disgruntled ex-employee.

The lack of backups is the fault of management, for they had the authority to make better decisions, and they had the ethical responsibility to protect the data of their users. Nevertheless, they try to shift the blame to one of their employees:

It was the guy handling the IT (and, yes, the same guy who I caught stealing from the company, and who did a slash-and-burn on some servers on his way out) who made the choice to rely on RAID as the only backup mechanism for the SQL server. He had set up automated backups for the HTTP server which contains the PHP code, but, inscrutibly, had no backup system in place for the SQL data. The ironic thing here is that one of his hobbies was telling everybody how smart he was.

The employee might be guilty of criminal actions here, but that doesn’t let management off the hook for having been so unprepared. If it hadn’t been an employee it might have been a tornado or earthquake or some other disaster – and the blame still would have belonged to management. Multiple backups, in different locations, is the precaution that a responsible company must make.

cb sums it up well in the comments:

lol, gotta love an internet company that has ‘a guy handling IT’. As if the IT side of things is an afterthought-which apparently it was in this case.

There is a second part to this story that I find very sad. One of the users of JournalSpace, a woman calling herself tinythoughts, shows up in the comments at TechCrunch and expresses her sadness, whereupon she is immediately attacked for her having ever used JournalSpace. I am puzzled and worried by the attitude that would defend the company and blame the customer.

This is tinythoughts:

i had one of the oldest journals on journalspace. i am really upset about losing about 6 years of writing, and my layouts which i made. it was bad enough when they lost years of comments. this is far worse. i am pretty sure i archived most of everything up til about a year ago on my external hd. i’m actually a lot sadder about this than i thought i would be.

This is the criticism that is then thrown at her:

If you value your work so much, you shouldn’t be using something that’s free and expect not to lose it.

At the end of the day, piss all you want. It’s your damn fault for leeching off a free service and expect it to continue to provide for you.

The day of FREE is over!

And then this was her response:

it wasn’t free. i was a paying customer for most of the time i was on there, until the first big data loss. after that, i did not get a pro account anymore and also began to write less on there.
as for losing my stuff, which i did value, as i said, i did back it up myself after that first data loss. however, i liked it where and how it was, accessible online to me and anyone else. we are talking about almost 6 years of content. that is not some small thing. even if you’re dumb, you should be able to understand that.
btw, i work in this industry myself, and i am pretty sure the days of free are not over. but all the best to you on being rude anonymously to others online.

She also adds:

I don’t believe their story. I think there is more to it. They’ve had problems before and they always lay it out like their users are technically stupid and willing to accept any dumb answer given to them. What happened really was a great loss for many users, who had been there for years, a community of really great people. The greatest loss is all of the time, life, love, and community each of those users put into journalspace, where it was all documented and washed away like sandcastles on the beach. I might be upset for my own loss, but not nearly as sad as I am for many of my friends there. I think journalspace owes them more than a lame excuse and an empty sorry.

The fact that someone is willing to attack the customers in this case, rather than the grossly irresponsible company, actually saddens me more than the already sad fact that a lot of people lost years worth of work. (Though if I lost that much work, I’d cry for days.)

Database war stories

Sunday, July 29th, 2007

This article regarding data storage and its use is a goldmine of information. The comments that follow are also excellent. The article is focused on Bloglines and Memeorandum, but the comments evolved into a “flat files versus databases” debate. Some choice excerpts from both the article and the comments follow:

[For Bloglines] As evidenced by our design, traditional database systems were not appropriate (or at least the best fit) for large parts of our system. There’s no trace of SQL anywhere (by definition we never do an ad hoc query, so why take the performance hit of a SQL front-end?), we resort to using external (to the databases at least) caches, and a majority of our data is stored in flat files. Sure, we could have just gone with Oracle running on a big SAN, but that would have been very expensive overkill, both on the hardware and on the software licenses (and features, for that matter). And relational databases oftentimes are not the most efficient mechanism to store data, so we’d still most likely have to resort to using memcacheds.

In the comments, a fellow named Phil Swenson said that using flat files, instead of a database, was always a terrible idea. Many comments then followed, offering opposing views.

John Hart said:

Databases are useful for ad-hoc queries, schema flexibility, transactional semantics, and flexible indexing.

If you don’t need these things, you can get a _lot_ of wins by not using a database. If you’ve been developing within a database for a long time, seeing the performance you can get from flat files will blow you away. You realize, “oh, wow, computers are a lot faster than I realized because the database has been in my way for so long.”

As a funny example, what’s the quickest way to get a sorted 1TB dataset from Oracle? No matter how you define your indices, it’s still quicker to export unsorted data via SQL*Loader and then sort it yourself. Crazy, but (in my experience) true.

Scott Lewis:

Part of Phil’s point may be that databases provide a consistent, highly-referenced system for storing and retrieving data, whereas flat-file systems tend to be idiosyncratic. This affects not only ongoing support issues, but the actual value of a product, potentially. New purchasers of an online site, for example, would want surety that the data can be easily and reliably managed in the future. Databases tend to be better at providing that than flat-file systems, no matter how robust or fast.

Patrick Tufts disagreed with Phil’s assertion that “if you ever wants to extract/analyze data it will be painful.” Patrick wrote:

In my experience, doing very large-scale data analysis at Amazon and IBM, this is not the case. I prefer to work with flat files (either logfiles or full database extracts) because I’m going to do multiple passes over nearly every record, and most databases, while great for random access to records, are terrible when you just want everything.

Andy writes:

I have a huge directed graph stored in MySQL as a single table with (parent, child) relationships. As an optimization for other purposes, this gets flattened into a complete mapping of all ancestors to all descendents. The graph is maintained in an SQL table for maintenance reasons (we already have the tools that can manipulate the graph as stored in MySQL). The graph, and thus the table, is relatively small, 4000 nodes and 2000 edges.

Traversing the entire graph using SQL queries in sequence (query each node when it is reached) took 25 minutes. Selecting everything from the table and generating the same data structure (parent->child relationships, not nested structures) in perl cut the run time down to 45 seconds. That’s 45 seconds to traverse the entire graph. Of course, once it’s all in memory, that kind of speed is to be expected.

Chris M writes:

one advantage of a database is it makes it easy to change your schema, as phil pointed out. If your project is at all visionary or cutting-edge, that freedom is indispensible, since you never know what direction you’ll be going in next. What kind of relationships will be the most important ones, etc.

If you’re lucky though, and then smart, you’ll be able to cache essentially all of your application in “flat” html files in front of your database. You regenerate only the ones you need to when the database is modified, which should be infrequently. This is the approach Rails takes. Then also your pages can be comprised of up to dozens of small bits and pieces, so that when a master page is invalidated, the database is only hit to regenerate just the part that was changed and the regeneration only takes a half second or something.

Gavin writes:

The time to learn a relational db and then special cases to back up, etc., are big time inputs needed for rdbms. Do not forget that in the ‘against’ category..

There is so much data on the web that all the rules of databases need to be rewritten

Tuesday, July 24th, 2007

The modern relational database came together around 1970. SQL (Structured Query Language), the language for getting data into and out of databases, was created shortly thereafter. The early relational databases were great for when a company needed to manage a few million records. Banks could keep track of their customers transactions, and airlines could keep track of their customers tickets. But now, it seems, we’ve reached a point where relational databases no longer work. At least, not for the giant stores of data that exist on the web, stores that might have billions, or tens of billions, of records. Apparently the increased speed of CPUs and the much greater availability of RAM has not been enough to give databases the extra speed they need to handle the larger data sets. The implications of that are worrisome.

Joe Gregorio has seen a new pattern emerge for handling very large data sets:

Sure you can store a lot of data in a relational database, but when I say large, I mean really large; a billion or more records. I know we need this because I keep seeing people build it.

He goes on to outline the main features of the new pattern that he sees:

Distributed
The data has to be distributed across multiple machines.
Joinless
No joins, and no referential integrity, at least at the data store level.
De-Normalized
No one said this explicily, but I presume there is a lot of de-normalization going on if you are avoiding joins.
Transcationless
No transactions