Database war stories
This article regarding data storage and its use is a goldmine of information. The comments that follow are also excellent. The article is focused on Bloglines and Memeorandum, but the comments evolved into a “flat files versus databases” debate. Some choice excerpts from both the article and the comments follow:
[For Bloglines] As evidenced by our design, traditional database systems were not appropriate (or at least the best fit) for large parts of our system. There’s no trace of SQL anywhere (by definition we never do an ad hoc query, so why take the performance hit of a SQL front-end?), we resort to using external (to the databases at least) caches, and a majority of our data is stored in flat files. Sure, we could have just gone with Oracle running on a big SAN, but that would have been very expensive overkill, both on the hardware and on the software licenses (and features, for that matter). And relational databases oftentimes are not the most efficient mechanism to store data, so we’d still most likely have to resort to using memcacheds.
In the comments, a fellow named Phil Swenson said that using flat files, instead of a database, was always a terrible idea. Many comments then followed, offering opposing views.
John Hart said:
Databases are useful for ad-hoc queries, schema flexibility, transactional semantics, and flexible indexing.
If you don’t need these things, you can get a _lot_ of wins by not using a database. If you’ve been developing within a database for a long time, seeing the performance you can get from flat files will blow you away. You realize, “oh, wow, computers are a lot faster than I realized because the database has been in my way for so long.”
As a funny example, what’s the quickest way to get a sorted 1TB dataset from Oracle? No matter how you define your indices, it’s still quicker to export unsorted data via SQL*Loader and then sort it yourself. Crazy, but (in my experience) true.
Scott Lewis:
Part of Phil’s point may be that databases provide a consistent, highly-referenced system for storing and retrieving data, whereas flat-file systems tend to be idiosyncratic. This affects not only ongoing support issues, but the actual value of a product, potentially. New purchasers of an online site, for example, would want surety that the data can be easily and reliably managed in the future. Databases tend to be better at providing that than flat-file systems, no matter how robust or fast.
Patrick Tufts disagreed with Phil’s assertion that “if you ever wants to extract/analyze data it will be painful.” Patrick wrote:
In my experience, doing very large-scale data analysis at Amazon and IBM, this is not the case. I prefer to work with flat files (either logfiles or full database extracts) because I’m going to do multiple passes over nearly every record, and most databases, while great for random access to records, are terrible when you just want everything.
Andy writes:
I have a huge directed graph stored in MySQL as a single table with (parent, child) relationships. As an optimization for other purposes, this gets flattened into a complete mapping of all ancestors to all descendents. The graph is maintained in an SQL table for maintenance reasons (we already have the tools that can manipulate the graph as stored in MySQL). The graph, and thus the table, is relatively small, 4000 nodes and 2000 edges.
Traversing the entire graph using SQL queries in sequence (query each node when it is reached) took 25 minutes. Selecting everything from the table and generating the same data structure (parent->child relationships, not nested structures) in perl cut the run time down to 45 seconds. That’s 45 seconds to traverse the entire graph. Of course, once it’s all in memory, that kind of speed is to be expected.
Chris M writes:
one advantage of a database is it makes it easy to change your schema, as phil pointed out. If your project is at all visionary or cutting-edge, that freedom is indispensible, since you never know what direction you’ll be going in next. What kind of relationships will be the most important ones, etc.
If you’re lucky though, and then smart, you’ll be able to cache essentially all of your application in “flat” html files in front of your database. You regenerate only the ones you need to when the database is modified, which should be infrequently. This is the approach Rails takes. Then also your pages can be comprised of up to dozens of small bits and pieces, so that when a master page is invalidated, the database is only hit to regenerate just the part that was changed and the regeneration only takes a half second or something.
Gavin writes:
The time to learn a relational db and then special cases to back up, etc., are big time inputs needed for rdbms. Do not forget that in the ‘against’ category..
August 3rd, 2007 at 9:22 am
[...] Database War Stories from new weblog, Closer to the Ideal. [...]