Archive for the ‘datastores’ Category

Chris Renner hates database abstraction layers

Sunday, May 9th, 2010

Chris Renner hates database abstraction layers, but I think he doesn’t know how to use them. He writes:

Propel actually does manage to provide some support for joins. I do not like Propel’s reliance on the “*” wildcard in the linked example, but perhaps that is just for simplicity in the documentation. Good SQL explicity names the columns to retrieve, and except for in-development testing, you should rarely if ever use “*” …its lazy coding and wastes resources (see Rudy Limeback’s Simply SQL for a more detailed argument against “*”). I’d like to see Propel, or any other ORM/DAL class deal, however, with a query such as this:

SELECT SQL_CALC_FOUND_ROWS
projects.id
, faculty
, author
, author2
, author3
, date_created
, users.last_name as faculty_last
, users.first_name as faculty_first
, users2.last_name as author_last
, users2.first_name as author_first
, dept.dept_name
, project_title
, project_type
, staff_no
, project_status
, organization.org_name
, project_date
FROM projects
LEFT OUTER JOIN users as faculty ON projects.faculty = users.id
LEFT OUTER JOIN users as users2 ON projects.author = users2.id
LEFT OUTER JOIN dept ON projects.dept = dept.id
LEFT OUTER JOIN organization ON project.org = organization.id
LEFT OUTER JOIN staff ON project.staff_no = staff.id
WHERE (users.last_name LIKE ‘%smith%’ OR users.first_name LIKE ‘%smith%’)
AND project.title LIKE ‘%Material Composition of Unobtanium%’
ORDER BY Items.Entry_Date DESC
LIMIT 0, 15

The above example is a sanitized (names anonymized to protect the innocent) version of a real query I use to return paginated results in one of our web apps. Some of the joins are included dynamically based on search criteria, others are always included. The WHERE clause is likewise generated dynamically, as well as the ORDER BY, which is used for column-based sorting.

Show me a DAL/ORM class that can deal with this example and I’ll be your best friend forever.

I do almost all my development work nowadays with Symfony, using Propel as the ORM, so I work with Propel a great deal. One trick for handling multiple joins is to use a database view. This is a trick that halfer mentioned to me on the Symfony forums.

Recently I’ve been working for a client who is taking an old system and porting it to Symfony. The old system had a lot of complicated JOINs. So, for instance, I take this:

SELECT
c.id, c.textid, c.title, c.title_link, c.content_date, c.issue_date,
c.protected AS protect, c.allow_comments, c.show_comments, c.content_hint,
c.author1, c.author2, c.author3, c.author4, c.author_str,
c.subtitle, c.abstract, c.direct_url,
c.imgpath1, c.imgpath2, c.imgcaption1, c.imgcaption2,
c.content_type, c.live_date,
c.reference_link_1, c.alt_info_1,
b.blog_name, b.blog_id,
cao_author_xref.author_id, cao_author_xref.state,
cao_author_xref.content_type AS content_t,
it.issue_text_date,
d.literal, d.alias, d.cdaobject, d.cdaaction, d.channel, d.adsite,
d.adzone, d.indexable
FROM cao_author_xref
LEFT OUTER JOIN cao_content c on (cao_author_xref.content_id = c.id
and cao_author_xref.state=c.state)
LEFT OUTER JOIN cao_issue it on c.issue_date = it.issue_date
LEFT OUTER JOIN cao_blogs b on c.reference_link_1 = b.blog_id
LEFT OUTER JOIN cao_content_type d on c.content_type=d.id

and I turn it into a view:

CREATE VIEW cao_paginated_content_by_author AS
SELECT
c.id, c.textid, c.title, c.title_link, c.content_date, c.issue_date,
c.protected AS protect, c.allow_comments, c.show_comments, c.content_hint,
c.author1, c.author2, c.author3, c.author4, c.author_str,
c.subtitle, c.abstract, c.direct_url,
c.imgpath1, c.imgpath2, c.imgcaption1, c.imgcaption2,
c.content_type, c.live_date,
c.reference_link_1, c.alt_info_1,
b.blog_name, b.blog_id,
cao_author_xref.author_id, cao_author_xref.state,
cao_author_xref.content_type AS content_t,
it.issue_text_date,
d.literal, d.alias, d.cdaobject, d.cdaaction, d.channel, d.adsite,
d.adzone, d.indexable
FROM cao_author_xref
LEFT OUTER JOIN cao_content c on (cao_author_xref.content_id = c.id
and cao_author_xref.state=c.state)
LEFT OUTER JOIN cao_issue it on c.issue_date = it.issue_date
LEFT OUTER JOIN cao_blogs b on c.reference_link_1 = b.blog_id
LEFT OUTER JOIN cao_content_type d on c.content_type=d.id

Now I add it to the config/schema.yml file as if it was a real database table:

cao_paginated_content_by_author:
id: ~
textid: { type: varchar(200) }
title: { type: varchar(200) }
title_link: { type: varchar(150) }
content_date: { type: timestamp }
issue_date: { type: date }
protect: { type: varchar(1) }
allow_comments: { type: varchar(1) }
show_comments: { type: varchar(1) }
content_hint: { type: varchar(1) }
author1: { type: integer }
author2: { type: integer }
author3: { type: integer }
author4: { type: integer }
author_str: { type: longvarchar }
subtitle: { type: longvarchar }
abstract:
peerName: cao_abstract
phpName: cao_abstract
type: LONGVARCHAR
required: true
direct_url: { type: varchar(250) }
imgpath1: { type: varchar(150) }
imgpath2: { type: varchar(150) }
imgcaption1: { type: longvarchar }
imgcaption2: { type: longvarchar }
content_type: { type: integer }
reference_link_1: { type: integer }
alt_info_1: { type: varchar(200) }
blog_name: { type: varchar(200) }
blog_id: { type: integer }
issue_text_date: { type: varchar(100) }
author_id: { type: integer }
state: { type: varchar(1) }
content_t: { type: integer }
live_date: { type: timestamp }
literal: { type: varchar(50) }
alias: { type: varchar(50) }
cdaobject: { type: varchar(50) }
cdaaction: { type: varchar(50) }
channel: { type: varchar(50) }
adsite: { type: varchar(50) }
adzone: { type: varchar(50) }
indexable: { type: varchar(1) }

Now I run the Symfony command that autogenerates my Propel classes based on the above schema.yml.

Now to run the above JOIN statement, using Propel, I only need a few lines of code. If I want all the rows from the JOIN where STATE equals “P” I write:

$c = new Criteria();
$c->add(CaoPaginatedContentByAuthorPeer::STATE, “P”);
$arrayOfItems = CaoPaginatedContentByAuthorPeer:doSelect($c);

That is 3 lines of code. And it keeps the SQL out of my code, which I think is a big plus.

Is MySQL Doomed to Extinction?

Sunday, May 9th, 2010

Interesting conversation on Hacker News about the article, Is MySQL Doomed to Extinction?

I know that it’s become fashionable to write off MySQL or SQL in general and many large sites are migrating away from it to non-relational data stores. However, for many sites, MySQL does a wonderful job and relational features cut down development time. The founder of Flickr spoke to this at a Django conference saying something to the effect of, “ever since there was more than 200 sites on the internet, it has been better to target a framework at the rest of the internet than at the top 100 sites”. Specifically, he was speaking to sharding and other highly scalable techniques that Django doesn’t deal with natively and saying that the vast, vast majority of sites will never have to deal with those questions.

Relational technology definitely has its place and does offer developers advantages. And relational technology does scale for some high-traffic operations. Look at Wikipedia. It’s the 6th most trafficked site in the world and backed by MySQL. Granted, they do use caching and other techniques as well, but it’s a highly available, highly trafficked site. In fact, I can’t remember the last time I heard about Wikipedia going down.

New technology directions are great. I’m really excited about Cassandra. Do we really have to declare something dead and something else the winner?

Document based data storage: going without a schema using MongoDb

Saturday, May 8th, 2010

Interesting remarks by Tim Hawkins, over on a LinkedIn discussion, regarding schemaless data storage design, and using MongoDb.

There are loads of projects on the go to create Zend database adaptors for mongo, however in my opinion, these are all missing the point, they all try to make the mongo system look like Zend_Db system or an ORM interface, this goes back to trying to impose a schema on things. If you let go of schema and embrace the document orientated approach fully, you find you dont need any of the complexity. The native mongo php interface is rich enough and simple enough to use directly without abstraction layers. All the abstraction just gets in the way. It took me a while to just give that all up and stop crying about the lack of joins etc, and start looking at the datasets in a different way.

My final observation is that Mongo is fast, so fast it took my breath away when we first got this system up and running, so fast that we don’t need memcache anymore. And it produces results in our high load application, much faster than an RDBMS, and with far less cpu and memory consumption. I also believe mongo will scale far further than any mysql setup. I know people are pushing SSD, multicore, but eventually you hit a load point where one data store just cant cut it anymore, and you need to shard, and once you do that you can kiss goodbye to most of the advantages of SQL as you RDBMS systems fall back to being little more than keyed record stores.
Posted 4 days ago | Reply Privately

When to use the R language

Saturday, April 17th, 2010

When to use the R language

When you have to explore data. At the start of an analytic project, it’s a good idea to create a bunch of graphical visualizations of your data to get a sense of what’s inside it. In terms of its graphical capabilities, R exists in a whole separate dimension from Excel. This was perhaps the most shocking part to me about using R for the first time: I really thought I had a handle on data analysis even though I’d restricted my software to Excel, but boy was I wrong. The visualizations you can create in R are much more sophisticated and much more nuanced. And, philosophically, you can tell that the visualization tools in R were created by people more interested in good thinking about data than about beautiful presentation. (The result, ironically, is a much more beautiful presentation, IMHO.)

Here’s how I’d put the difference to someone who’s familiar with Excel but not yet with R. The graphics creation options that Excel gives you are all based in the graphical user interface. This is what makes Excel relatively easy to use—all your options are laid out before you with nice buttons and fill-in-the-blank boxes. But in order to create a graphical interface that’s easy to use, the creators of Excel had to make a bunch of decisions about what sorts of graphics you are and are not likely to want. With too many choices, the graphical interface becomes cumbersome and frustrating, so to achieve simplicity they had to eliminate options.

And this isn’t a gripe or anything. I can’t say I’d have done a better job designing Excel’s charting graphical interface. I cut my teeth on it.

These limitations become a problem when you want to inspect data visually in a bunch of different ways in order to explore it. R, through a combination of its well-designed base graphics package, the exceptionally well-designed lattice graphics package, and the jaw-droppingly well-designed ggplot2 graphics package, offers a breathtaking array of visualization options that you access through the command line or scripts. It has power that you just can’t get using a graphical interface to generate your charts.

Document databases

Sunday, April 11th, 2010

Interesting look at document databases:

A document database is, at its core, a key/value store with one major exception. Instead of just storing any blob in it, a document db requires that the data will be store in a format that the database can understand. The format can be XML, JSON, Binary JSON (MongoDB), or just about anything, as long as the database can understand it.

Why is this such a big thing? Because when the database can understand the format of the data that you send it, it can now do server side operations on that data. In most doc dbs, that means that we can now allow queries on the document data.The known format also means that it is much easier to write tooling for the database, since it is possible to show, display and edit the data.

Is MySql a really terrible database system?

Thursday, March 25th, 2010

I was surprised by this headline:

DIGG: 4000% PERFORMANCE INCREASE BY SORTING IN PHP RATHER THAN MYSQL

I thought, “How can that be?” Sorting in a database should always be faster than sorting in PHP. Still the article made a lot of good sounding points:

1.) Precompute on writes, make reads fast. This is an oldie as a scaling strategy, but it’s valuable to see how SimpleGeo is applying it to their problem of finding entities within a certain geographical region. Using Cassandra they’ve built two clusters: one for indexes and one for records. The records cluster, as you might imagine, is a simple data lookup. The index cluster has a carefully constructed key for every lookup scenario. The indexes are computed on the write, so reads are very fast. As reads dominate, this makes a lot of sense. Queries based on time are also precomputed. Joe mentions some special algorithms for spreading out data, which tends to cluster around geographical regions, but does not mention what these are.

3.) The relation tool chain has failed for real-time. The relational database tool chain is not evolving. It has failed for large scale, real-time environments. Building scalable systems on a relational database requires building sharding, load balancing, resharding, cluster management, worrying about consistency, implementing distributed queries, and other layers yourself, so why bother? Cassandra does all that for you out of the box. Shot off a server and Cassandra will handle all the remapping and rerouting automatically.

4.) Scaling practices turn a relational database into a non-relational database. To scale at Digg they followed a set of practices very similar to those used at eBay. No joins, no foreign key constraints (to scale writes), primary key look-ups only, limited range queries, and joins were done in memory. When implementing the comment feature a 4,000 percent increase in performance was created by sorting in PHP instead of MySQL. All this effort required to make a relational database scale basically meant you were using a non-relational database anyway. So why not just use a non-relational database from the start?

6.) Scaling equals specialization. To scale often requires building highly custom, problem specific solutions.

But then Denis Forbes offered this counter-point:

Shocked by the incredibly poor database performance described on the Digg technology blog, baffled that they cast it as demonstrative of performance issues with RDBMS’ in general, I was motivated to create a simile of their database problem.

While they posted that entry six months ago, they recently followed up with more statements on the NoSQL / RDBMS divide, and are now being heavily used as a citation of sorts.

For instance Dare Obasanjo held Digg’s moves as a rebuttal of my prior entry on SQL scaling (though my entry actually explicitly excluded incredibly rare edge cases like Digg’s, and my core point was that the majority of database uses don’t have the needs of a site like Digg, I’m always one to take on a challenge), which then got picked up in other blogs.

I would say Digg’s case is an example of a bottom-feeder RDBMS product (apologies for being incendiary, but why does the problem always come down to MySQL? These examples always end up being “we moved from MySQL to NoSQL” rather than “We moved from Sybase ASE to NoSQL”), used arguably suboptimally on unpowered hardware, and it proves nothing of substance about either database technology. Yet it’s held as demonstrative of something, which is why I focus on it. They are different tools in the toolbox, arguably for different purposes, and that isn’t the focus of this entry.

He offers some interesting test results. He also suggests that more companies should take advantage of big servers and cheap RAM:

Database servers really like having a lot of RAM. Ideally you should have more RAM than you have data, allowing it to cache the entirety of your DB (or at least the working-set quantity of DB on that partition) making incredible read performance achievable.

Joining rows is not a hard activity for database servers. It can do it at unfathomable rates if the data can be fed to it at the appropriate pace and in the right form. Even heavily normalized databases can be high performance.

What normally makes joins a performance issue is data locality: if you have to load two rows from different places on the disk, that’s two seeks instead of just one (or three, four, five or more instead of one). When seeks are as costly as they are on a magnetic disk, you avoid it (either by striving for a database that fits in memory, which paradoxical often calls for heavy normalization, or by de-normalizing).

…If I had 48GB of RAM in the test machine (which is fairly pedestrian outside of gerbil-sized cloud instances. Note that you can now add 128GB of RAM to servers for around $4000 in some cases), outside of the initial caching period the select rates would be stratospheric regardless of storage medium, though SSDs would still come in a very, very strong lead when it came to write performance.

For the same $4000 you could chain five Intel X25-E drives for 320GB of intensely high performance – and persistent – storage. Just keep going up until you have more throughput, I/O and storage than you could dream of.

Some high-end enterprise solutions now tier storage and automatically place data as appropriate, choosing between magnetic, SSD, and memory caching systems. The pages of the table that are never touched end up on the magnetic storage while the hot area – say Diggs within the last 6 months – are moved to SSDs or to huge banks of memory caching.

Gay marriage: how to design the database

Thursday, March 11th, 2010

Gay marriage: the database engineering perspective. This is great stuff. The essay is ostensibly about how to handle the recording of gay marriage in a database. But it uses the issue of gay marriage to go through every classic issue of database design, from foreign keys to normalization to the degree of abstraction needed to handle polyamorous marriages. From now on, when I’ve got a friend trying to learn how to design databases, I will send them to this essay.

There are various objections to expanding the conventional, up-tight, as-God-intended “one man, one woman” notion of marriage but by far the least plainly bigoted ones I am aware of are the bureaucratic ones.

To be blunt, the systems aren’t set up to handle it. The paper forms have a space for the husband’s name and a space for the wife’s name. Married people carefully enter their details in block capitals and post the forms off to depressed paper-pushers who then type that information into software front-ends whose forms are laid out and named in precisely the same fashion. And then they hit “submit” and the information is filed away electronically in databases which simply keel over or belch integrity errors when presented with something so profound as a man and another man who love each other enough to want to file joint tax returns.

Speaking as a computery-type person, altering the paper forms is not my department. It’s probably expensive and there are probably millions of existing incorrect forms which would need returning or recycling or burning instead of using. Or maybe it’s simple. I don’t know. The real question from my perspective is how you store a marriage in a computer.

Altering your database schema to accommodate gay marriage can be easy or difficult depending on how smart you were when you originally set up your system to accommodate heterosexuality only. Let’s begin.

A denormalisation dictionary

Sunday, December 20th, 2009

A fascinating bit of database optimization:

Having run two crowdsourcing projects I can tell you this: the single most important piece of code you will write is the code that gives someone something new to review. Both of our projects had big “start reviewing” buttons. Both were broken in different ways.

The first time round, the mistakes were around scalability. I used a SQL “ORDER BY RAND()” statement to return the next page to review. I knew this was an inefficient operation, but I assumed that it wouldn’t matter since the button would only be clicked occasionally.

Something like 90% of our database load turned out to be caused by that one SQL statement, and it only got worse as we loaded more pages in to the system. This caused multiple site slow downs and crashes until we threw together a cron job that pushed 1,000 unreviewed page IDs in to memcached and made the button pick one of those at random.

This solved the performance problem, but meant that our user activity wasn’t nearly as well targeted. For optimum efficiency you really want everyone to be looking at a different page—and a random distribution is almost certainly the easiest way to achieve that.

The second time round I turned to my new favourite in-memory data structure server, redis, and its SRANDMEMBER command (a feature I requested a while ago with this exact kind of project in mind). The system maintains a redis set of all IDs that needed to be reviewed for an assignment to be complete, and a separate set of IDs of all pages had been reviewed. It then uses redis set intersection (the SDIFFSTORE command) to create a set of unreviewed pages for the current assignment and then SRANDMEMBER to pick one of those pages.

This is where the bug crept in. Redis was just being used as an optimisation—the single point of truth for whether a page had been reviewed or not stayed as MySQL. I wrote a couple of Django management commands to repopulate the denormalised Redis sets should we need to manually modify the database. Unfortunately I missed some—the sets that tracked what pages were available in each document. The assignment generation code used an intersection of these sets to create the overall set of documents for that assignment. When we deleted some pages that had accidentally been imported twice I failed to update those sets.

This meant the “next page” button would occasionally turn up a page that didn’t exist. I had some very poorly considered fallback logic for that—if the random page didn’t exist, the system would return the first page in that assignment instead. Unfortunately, this meant that when the assignment was down to the last four non-existent pages every single user was directed to the same page—which subsequently attracted well over a thousand individual reviews.

Next time, I’m going to try and make the “next” button completely bullet proof! I’m also going to maintain a “denormalisation dictionary” documenting every denormalisation in the system in detail—such a thing would have saved me several hours of confused debugging.

The relational database is up against its limits on big sites

Thursday, September 24th, 2009

10 years ago, when I was getting serious about programming, I read Phillip Greenspun’s book about websites and databases. Greenspun does a great job of explaining some of the thinking, and even traditions, that shaped relational databases. I read that book and formed certain opinions about databases, for instance, that a database should not store the same info in multiple places (that is, the database should be normalized). You would be a fool to store a user’s phone number in 3 different places, because when they change their phone number, there is a risk that one database table with update but another won’t, and then you have inconsistent data about the user.

I’ve been fascinated with the articles, more and more over the last 3 years, arguing for a new style of storing data:

Distributed vs. Relational Databases

Traditional relational databases are 30 years old, are well understood and have a huge ecosystem of tools around them. For that reason, it’s a compelling option when building your application. Postgres, MySQL, and Oracle are all relational databases modeling a schema on entities and relations between those entities. That’s a good, powerful programming model with interesting theoretical properties. But companies with large amounts of data have already gone past what you can reasonably fit on a single machine, even on high-end hardware, and it’s provably impossible to keep the traditional relational model, in particular the ACID properties, while scaling across multiple machines. Even if you’re willing to give up availability, scaling reads (via caching and replication) is difficult with relational databases, and scaling writes by partitioning is either very expensive, very painful from an application programming and operations standpoint, or both.

Cassandra is taking the approach that, given that you’re going to have to give up some parts of the relational model to scale, let’s start over and rethink things. Let’s add things like transparent replication and failover, built-in partitioning and load balancing, multiple data center support, and the ability to add capacity without ever disturbing applications running against the database.

A site converts SQL into Propel code (for Symfony, for instance)

Monday, September 7th, 2009

I think I stumbled upon this site last year, when I was first learning Symfony. At that time, the site was very rough. It is still in alpha, even now, but it looks like it has reached a point where it is useful. It converts SQL into the code needed for Propel.

So it converts this:

person.gender = ‘M’ AND (person.location IN (’Birmingham’, ‘Coventry’) OR person.location = ‘Manchester’) AND (person.enabled <> 0) AND person.age > 16

Into this:

$c = new Criteria();
$crit0 = $c->getNewCriterion(PersonPeer::GENDER, ‘M’);
$crit1 = $c->getNewCriterion(PersonPeer::LOCATION, array(’Birmingham’, ‘Coventry’), Criteria::IN);
$crit2 = $c->getNewCriterion(PersonPeer::LOCATION, ‘Manchester’);

// Perform OR at level 1 ($crit1 $crit2 )
$crit1->addOr($crit2);
$crit3 = $c->getNewCriterion(PersonPeer::ENABLED, 0, Criteria::NOT_EQUAL);
$crit4 = $c->getNewCriterion(PersonPeer::AGE, 16, Criteria::GREATER_THAN);

// Perform AND at level 0 ($crit0 $crit1 $crit3 $crit4 )
$crit0->addAnd($crit1);
$crit0->addAnd($crit3);
$crit0->addAnd($crit4);

// Remember to change the peer class here for the correct one in your model
$c->add($crit0);
$result = TablePeer::doSelect($c);

// This loop will of course need to be edited to work
foreach ($result as $obj)
{
//$val = $obj->getValue();
}

Very useful, since the worst thing about Symfony is how under-documented Propel is.

On future Symfony projects, I may convert to Doctrine.

JOINs are evil, part II

Saturday, August 8th, 2009

It’s odd that I’m suddenly seeing this advice everywhere. Ten years ago, when I was learning about databases, I never heard anyone suggest that JOINs are evil:

#4: Avoid joins

As your tables get bigger and bigger joins cannot be done in memory and spill over to disk. Once that happens your performance is gone. For smaller tables it doesn’t matter but as tables get bigger avoid joining when you can. Try using unions, subselects, or denormalization instead. If you have two big tables that are 1-to-1 that you are joining a lot combine them into one wide table.

JOINs are evil?

Saturday, August 1st, 2009

Interesting conversation in the comments after Don MacAskill talks about compression and CPU usage for MySql tables on a server running OpenSolaris:

Don MacAskill:
We don’t do JOINs. As in, really, no JOINs. JOINs are evil. :) Still, our read workload is likely very random, so I see where you’re coming from. But I’m fairly certain that some of the same data applies.

I have no idea, though, what granularity the compression is at. That’s a good question. I’ll see if I can get an answer. I’ll bet it’s each “record” (aka block), but I could be wrong.

A follow up comment:

Justin Swanhart:
I don’t subscribe to the JOINS are evil camp, but I understand your frustration. Personally I am a fan of materializing views for fast read performance on aggregated tables, but still the advantages of normalization for OLTP applications.

But even without JOINs, InnoDB reads may be fairly random if you are using non PRIMARY indexes. As you probably know InnoDB tables are organized arround the primary key of the table, or around an automatically provided 64bit surrogate key. Secondary indexes have to do an additional lookup on the primary key.

Are your tables read-only? Insertion and update may have really big penalties on compressed filesystems too. Also, have you considered the compression options of the new pluggable version of InnoDB?

JournalSpace loses all data in its database and has no backup

Saturday, January 10th, 2009

This is a horrifying failure of risk management and system administration good practice:  JournalSpace loses all data in its database and has no backup:

Blogging platform JournalSpace (which I’d never heard of to date) has ceased to be, following a wipe-out of the main database for which there was no back-up in place. According to the JournalSpace blog, the database was overwritten as a result of a malicious act from a disgruntled ex-employee.

The lack of backups is the fault of management, for they had the authority to make better decisions, and they had the ethical responsibility to protect the data of their users. Nevertheless, they try to shift the blame to one of their employees:

It was the guy handling the IT (and, yes, the same guy who I caught stealing from the company, and who did a slash-and-burn on some servers on his way out) who made the choice to rely on RAID as the only backup mechanism for the SQL server. He had set up automated backups for the HTTP server which contains the PHP code, but, inscrutibly, had no backup system in place for the SQL data. The ironic thing here is that one of his hobbies was telling everybody how smart he was.

The employee might be guilty of criminal actions here, but that doesn’t let management off the hook for having been so unprepared. If it hadn’t been an employee it might have been a tornado or earthquake or some other disaster – and the blame still would have belonged to management. Multiple backups, in different locations, is the precaution that a responsible company must make.

cb sums it up well in the comments:

lol, gotta love an internet company that has ‘a guy handling IT’. As if the IT side of things is an afterthought-which apparently it was in this case.

There is a second part to this story that I find very sad. One of the users of JournalSpace, a woman calling herself tinythoughts, shows up in the comments at TechCrunch and expresses her sadness, whereupon she is immediately attacked for her having ever used JournalSpace. I am puzzled and worried by the attitude that would defend the company and blame the customer.

This is tinythoughts:

i had one of the oldest journals on journalspace. i am really upset about losing about 6 years of writing, and my layouts which i made. it was bad enough when they lost years of comments. this is far worse. i am pretty sure i archived most of everything up til about a year ago on my external hd. i’m actually a lot sadder about this than i thought i would be.

This is the criticism that is then thrown at her:

If you value your work so much, you shouldn’t be using something that’s free and expect not to lose it.

At the end of the day, piss all you want. It’s your damn fault for leeching off a free service and expect it to continue to provide for you.

The day of FREE is over!

And then this was her response:

it wasn’t free. i was a paying customer for most of the time i was on there, until the first big data loss. after that, i did not get a pro account anymore and also began to write less on there.
as for losing my stuff, which i did value, as i said, i did back it up myself after that first data loss. however, i liked it where and how it was, accessible online to me and anyone else. we are talking about almost 6 years of content. that is not some small thing. even if you’re dumb, you should be able to understand that.
btw, i work in this industry myself, and i am pretty sure the days of free are not over. but all the best to you on being rude anonymously to others online.

She also adds:

I don’t believe their story. I think there is more to it. They’ve had problems before and they always lay it out like their users are technically stupid and willing to accept any dumb answer given to them. What happened really was a great loss for many users, who had been there for years, a community of really great people. The greatest loss is all of the time, life, love, and community each of those users put into journalspace, where it was all documented and washed away like sandcastles on the beach. I might be upset for my own loss, but not nearly as sad as I am for many of my friends there. I think journalspace owes them more than a lame excuse and an empty sorry.

The fact that someone is willing to attack the customers in this case, rather than the grossly irresponsible company, actually saddens me more than the already sad fact that a lot of people lost years worth of work. (Though if I lost that much work, I’d cry for days.)

Database war stories

Sunday, July 29th, 2007

This article regarding data storage and its use is a goldmine of information. The comments that follow are also excellent. The article is focused on Bloglines and Memeorandum, but the comments evolved into a “flat files versus databases” debate. Some choice excerpts from both the article and the comments follow:

[For Bloglines] As evidenced by our design, traditional database systems were not appropriate (or at least the best fit) for large parts of our system. There’s no trace of SQL anywhere (by definition we never do an ad hoc query, so why take the performance hit of a SQL front-end?), we resort to using external (to the databases at least) caches, and a majority of our data is stored in flat files. Sure, we could have just gone with Oracle running on a big SAN, but that would have been very expensive overkill, both on the hardware and on the software licenses (and features, for that matter). And relational databases oftentimes are not the most efficient mechanism to store data, so we’d still most likely have to resort to using memcacheds.

In the comments, a fellow named Phil Swenson said that using flat files, instead of a database, was always a terrible idea. Many comments then followed, offering opposing views.

John Hart said:

Databases are useful for ad-hoc queries, schema flexibility, transactional semantics, and flexible indexing.

If you don’t need these things, you can get a _lot_ of wins by not using a database. If you’ve been developing within a database for a long time, seeing the performance you can get from flat files will blow you away. You realize, “oh, wow, computers are a lot faster than I realized because the database has been in my way for so long.”

As a funny example, what’s the quickest way to get a sorted 1TB dataset from Oracle? No matter how you define your indices, it’s still quicker to export unsorted data via SQL*Loader and then sort it yourself. Crazy, but (in my experience) true.

Scott Lewis:

Part of Phil’s point may be that databases provide a consistent, highly-referenced system for storing and retrieving data, whereas flat-file systems tend to be idiosyncratic. This affects not only ongoing support issues, but the actual value of a product, potentially. New purchasers of an online site, for example, would want surety that the data can be easily and reliably managed in the future. Databases tend to be better at providing that than flat-file systems, no matter how robust or fast.

Patrick Tufts disagreed with Phil’s assertion that “if you ever wants to extract/analyze data it will be painful.” Patrick wrote:

In my experience, doing very large-scale data analysis at Amazon and IBM, this is not the case. I prefer to work with flat files (either logfiles or full database extracts) because I’m going to do multiple passes over nearly every record, and most databases, while great for random access to records, are terrible when you just want everything.

Andy writes:

I have a huge directed graph stored in MySQL as a single table with (parent, child) relationships. As an optimization for other purposes, this gets flattened into a complete mapping of all ancestors to all descendents. The graph is maintained in an SQL table for maintenance reasons (we already have the tools that can manipulate the graph as stored in MySQL). The graph, and thus the table, is relatively small, 4000 nodes and 2000 edges.

Traversing the entire graph using SQL queries in sequence (query each node when it is reached) took 25 minutes. Selecting everything from the table and generating the same data structure (parent->child relationships, not nested structures) in perl cut the run time down to 45 seconds. That’s 45 seconds to traverse the entire graph. Of course, once it’s all in memory, that kind of speed is to be expected.

Chris M writes:

one advantage of a database is it makes it easy to change your schema, as phil pointed out. If your project is at all visionary or cutting-edge, that freedom is indispensible, since you never know what direction you’ll be going in next. What kind of relationships will be the most important ones, etc.

If you’re lucky though, and then smart, you’ll be able to cache essentially all of your application in “flat” html files in front of your database. You regenerate only the ones you need to when the database is modified, which should be infrequently. This is the approach Rails takes. Then also your pages can be comprised of up to dozens of small bits and pieces, so that when a master page is invalidated, the database is only hit to regenerate just the part that was changed and the regeneration only takes a half second or something.

Gavin writes:

The time to learn a relational db and then special cases to back up, etc., are big time inputs needed for rdbms. Do not forget that in the ‘against’ category..

There is so much data on the web that all the rules of databases need to be rewritten

Tuesday, July 24th, 2007

The modern relational database came together around 1970. SQL (Structured Query Language), the language for getting data into and out of databases, was created shortly thereafter. The early relational databases were great for when a company needed to manage a few million records. Banks could keep track of their customers transactions, and airlines could keep track of their customers tickets. But now, it seems, we’ve reached a point where relational databases no longer work. At least, not for the giant stores of data that exist on the web, stores that might have billions, or tens of billions, of records. Apparently the increased speed of CPUs and the much greater availability of RAM has not been enough to give databases the extra speed they need to handle the larger data sets. The implications of that are worrisome.

Joe Gregorio has seen a new pattern emerge for handling very large data sets:

Sure you can store a lot of data in a relational database, but when I say large, I mean really large; a billion or more records. I know we need this because I keep seeing people build it.

He goes on to outline the main features of the new pattern that he sees:

Distributed
The data has to be distributed across multiple machines.
Joinless
No joins, and no referential integrity, at least at the data store level.
De-Normalized
No one said this explicily, but I presume there is a lot of de-normalization going on if you are avoiding joins.
Transcationless
No transactions