Archive for the ‘catastrophic failure’ Category

BitBucket gets hacked and, since they rely on Amazon Web Services, they were dependent on Amazon to fix the problem

Monday, October 5th, 2009

Stuff like this makes me afraid to ever rely on anyone’s cloud services.

BitBucket got hit with a denial of service attack. But they host everything on Amazon Web Services (EC2, EBS, etc.), so they were dependent on Amazon to figure out what the problem was. Amazon took 16 long hours to do that, and in the meantime Amazon kept telling BitBucket that everything was fine:

What came from that, was 5 or 6 hours of advice, some of which were obvious timesinks, while others were somewhat credible. What they kept coming back to was that EBS is a “shared network resource” and performance would vary. We were also told to use RAID0 to distribute our load over several EBS instances to increase the throughput.

At this point, we were getting less throughput than you can pull off of a 1.44MB floppy, so we didn’t accept this for an answer. We did some more tests, trying to measure the bandwidth of the machine by fetching their “100mb.bin” files, which we couldn’t do. We again emphasized that this was in fact, in all likelihood, a network problem.

At this point, our outage was well known, especially in the Twittosphere. We have some rather large customers relying on service with us, and some of these customers have some hefty support contracts with Amazon. Emails were sent.

Shortly after this, I requested an additional phone-call from Amazon, this time to our system administrator. He had been compiling some rather worrying numbers over the past hours, since up until now, the support had refused to acknowledge a problem with the service. They claimed that everything was working fine, when clearly, it was not.

This time, a different support rep. called, and this time, they were ready to acknowledge our problem as “very serious.” We sent them our aggregated logs, and shortly thereafter, they reported that “we have found something out of the ordinary with your volume.”

We had been extremely frustrated up until this point, because 1) we couldn’t actually *do* anything about it, and 2) we were being told that everything should be fine. It felt like there was an elephant right in front of us, and a person next to us was insisting that there wasn’t.

Software project failures

Tuesday, July 28th, 2009

Coding Horror has a nice set of quotes and links related to catastrophic software project failures. Of interest to anyone who does project management on software projects:

If Las Vegas sounds too tame for you, software might just be the right gamble. Software projects include a glut of risks that would give Vegas oddsmakers nightmares. The odds of a large project finishing on time are close to zero. The odds of a large project being canceled are an even-money bet (Jones 1991).

In 1988, Peat Marwick found that about 35 percent of 600 firms surveyed had at least one runaway software project (Rothfeder 1988). The damage done by runaway software projects makes the Las Vegas prize fights look as tame as having high tea with the queen. Allstate set out in 1982 to automate all of its office operations. They set a 5-year timetable and an $8 million budget. Six years and $15 million later, Allstate set a new deadline and readjusted its sights on a new budget of $100 million. In 1988, Westpac Banking Corporation decided to redefine its information systems. It set out on a 5-year, $85 million project. Three years later, after spending $150 million with little to show for it, Westpac cut its losses, canceled the project, and eliminated 500 development jobs (Glass 1992). Even Vegas prize fights don’t get this bloody.

How does diversity help a project?

Monday, July 27th, 2009

Whether we are talking about the evolution of finches in the Galápagos Islands or the evolution of the software projects that we work on, my sense is that diversity offers its greatest benefit during a crisis. The worst thing about monoculture is the powerful reward it offers to pathogens. Or, as Wikipedia says:

The dependence on monoculture crops can lead to large scale failures when the single genetic variant or cultivar becomes susceptible to a pathogen or when a change in weather patterns occur.

Extending that as a metaphor for business, groupthink (a monoculture of thought) can lead to catastrophic failure when some foundational assumption of the group is proven wrong. A monoculture of thought offers a powerful reward to pathogenic behavior. Consider the meltdown at Enron, where top executives all agreed on the profitability of reckless energy trades, and they continued to agree with each other almost till the very moment the company declared bankruptcy. Likewise, the top executives at AIG were certain that they had distributed risk in such a way that the downside of that risk would never catch up with them; people with dissident viewpoints were squeezed out of their jobs. Or consider the 40-year decline of the United States auto industry, an industry that has suffered more than most from groupthink and inaccurate assumptions. The executives of the 1970s and 1980s felt, despite the gathering evidence, that price was more important to Americans than quality, and that quality automatically meant expensive, and so they lost a generation of car buyers.

A corporate culture that values homogeneity is at grave risk of punishing non-conformists. A good manager is always on guard against the kinds of social bullying, however subtle, that can cause people to censor their opinions. This is a basic task of risk management: reduce risk by challenging core assumptions. Make sure divergent viewpoints are heard.

I should add, if you are working at a new start-up, struggling to find its place in the world, you should treat every day as a crisis.

Genetic diversity allows a population multiple avenues to move forward when a radical change in the external environment dooms the existing species in their current forms. It helps sub-sections of those populations evolve into new forms. Likewise, when a corporation faces a crisis, having a diverse range of opinions is healthy, and the more those differences of opinion reach down to core assumptions, the healthier. In boom times, such diversity of opinion could potentially be viewed as annoyingly disruptive of the good times, but in a crisis, what’s needed is the maximum of diversity: in viewpoint, in history, in current circumstances, in goals, in future expectations, etc.

In theory, a genius of a manager could possibly assemble a team made up solely of white males, which still had enough diversity of opinion to perform well in a crisis, but as a practical matter, the most reliable way to put together a diverse team is to recruit people from different backgrounds, different genders, different races and, where possible, different countries.

I regard the cultivation of diversity on a project as a fundamental survival technique, so I devote a lot of time to recruiting newcomers to the field of programming. And so, I read with interest Kirrily Robert’s discussion of recruiting women to work on an open source project (what follows is from Robert’s blog post):

———————————————–

I surveyed women on the Dreamwidth and AO3 projects and asked them about their experiences. You can read a fuller report of their responses on my earlier blog post, Dispatches from the revolution.

One of the first things I asked them was whether they had previously been involved in open source projects. They gave answers like:

I’d never contributed to an open source project before, or even considered that I could.

I didn’t feel like I was wanted.

I never got the impression that outsiders were welcome.

I considered getting involved in Debian, but the barriers to entry seemed high.

Those who got a little further along still found it hard to become productive on those projects:

It’s kind of like being handed a box full of random bicycle parts: it doesn’t help when you don’t know how they go together and just want to learn how to ride a bike.

People without a ton of experience get shunted off to side areas like docs and support, and those areas end up as the ladies’ auxiliary.

But on Dreamwidth and AO3…

What I like most is that there isn’t any attitude of “stand aside and leave the code to the grown-ups”. If there’s something that I’m able to contribute, however small, then the contribution is welcome.

And this one, which is my favourite:

Deep down, I had always assumed coding required this kind of special aptitude, something that I just didn’t have and never would. It lost its forbidding mystique when I learned that people I had assumed to be super-coders (surely born with keyboard attached!) had only started training a year ago. People without any prior experience! Women! Like me! Jesus! It’s like a barrier broke down in my mind.

So, what can we learn from this? Well, one thing I’ve learnt is that if anyone says, “Women just aren’t interested in technology” or “Women aren’t interested in open source,” it’s just not true. Women are interested, willing, able, and competent. They’re just not contributing to existing, dare I say “mainstream”, open source projects.

And this is great news! It’s great news for new projects. If you are starting up a new open source project, you have the opportunity to recruit these women.

JournalSpace loses all data in its database and has no backup

Saturday, January 10th, 2009

This is a horrifying failure of risk management and of good system administration practice: JournalSpace lost all the data in its database and had no backup:

Blogging platform JournalSpace (which I’d never heard of to date) has ceased to be, following a wipe-out of the main database for which there was no back-up in place. According to the JournalSpace blog, the database was overwritten as a result of a malicious act from a disgruntled ex-employee.

The lack of backups is the fault of management, for they had the authority to make better decisions, and they had the ethical responsibility to protect the data of their users. Nevertheless, they try to shift the blame to one of their employees:

It was the guy handling the IT (and, yes, the same guy who I caught stealing from the company, and who did a slash-and-burn on some servers on his way out) who made the choice to rely on RAID as the only backup mechanism for the SQL server. He had set up automated backups for the HTTP server which contains the PHP code, but, inscrutibly, had no backup system in place for the SQL data. The ironic thing here is that one of his hobbies was telling everybody how smart he was.

The employee might be guilty of criminal actions here, but that doesn’t let management off the hook for having been so unprepared. If it hadn’t been an employee, it might have been a tornado, an earthquake, or some other disaster, and the blame still would have belonged to management. Keeping multiple backups, in different locations, is the precaution that a responsible company must take.
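
To make the "multiple backups, in different locations" point concrete, here is a minimal sketch in Python. It is only an illustration, not anything JournalSpace ran: the function names `sha256_of` and `replicate_backup` are hypothetical, and the destination directories stand in for what would really be separate machines or off-site storage mounted on the server. The idea is simply to copy a database dump to at least two places and check each copy against the original's checksum.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate_backup(dump_file: Path, destinations: List[Path]) -> List[Path]:
    """Copy dump_file into every destination directory and verify each copy.

    Raises ValueError if fewer than two destinations are given, and
    IOError if a copy's checksum does not match the source file.
    """
    if len(destinations) < 2:
        # A single copy (or RAID alone) is not a backup strategy.
        raise ValueError("need at least two backup destinations")
    expected = sha256_of(dump_file)
    copies = []
    for directory in destinations:
        directory.mkdir(parents=True, exist_ok=True)
        # shutil.copy2 returns the path of the newly created file.
        copy_path = Path(shutil.copy2(dump_file, directory))
        if sha256_of(copy_path) != expected:
            raise IOError(f"checksum mismatch for {copy_path}")
        copies.append(copy_path)
    return copies
```

A scheduled job could call `replicate_backup(Path("nightly_dump.sql"), [site_a, site_b])` after each database dump, where `site_a` and `site_b` are assumed mount points for storage in different physical locations. The checksum step matters because a backup that was never verified may turn out to be unreadable on the day it is needed.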

cb sums it up well in the comments:

lol, gotta love an internet company that has ‘a guy handling IT’. As if the IT side of things is an afterthought-which apparently it was in this case.

There is a second part to this story that I find very sad. One of the users of JournalSpace, a woman calling herself tinythoughts, shows up in the comments at TechCrunch and expresses her sadness, whereupon she is immediately attacked for ever having used JournalSpace. I am puzzled and worried by the attitude that would defend the company and blame the customer.

This is tinythoughts:

i had one of the oldest journals on journalspace. i am really upset about losing about 6 years of writing, and my layouts which i made. it was bad enough when they lost years of comments. this is far worse. i am pretty sure i archived most of everything up til about a year ago on my external hd. i’m actually a lot sadder about this than i thought i would be.

This is the criticism that is then thrown at her:

If you value your work so much, you shouldn’t be using something that’s free and expect not to lose it.

At the end of the day, piss all you want. It’s your damn fault for leeching off a free service and expect it to continue to provide for you.

The day of FREE is over!

And then this was her response:

it wasn’t free. i was a paying customer for most of the time i was on there, until the first big data loss. after that, i did not get a pro account anymore and also began to write less on there.
as for losing my stuff, which i did value, as i said, i did back it up myself after that first data loss. however, i liked it where and how it was, accessible online to me and anyone else. we are talking about almost 6 years of content. that is not some small thing. even if you’re dumb, you should be able to understand that.
btw, i work in this industry myself, and i am pretty sure the days of free are not over. but all the best to you on being rude anonymously to others online.

She also adds:

I don’t believe their story. I think there is more to it. They’ve had problems before and they always lay it out like their users are technically stupid and willing to accept any dumb answer given to them. What happened really was a great loss for many users, who had been there for years, a community of really great people. The greatest loss is all of the time, life, love, and community each of those users put into journalspace, where it was all documented and washed away like sandcastles on the beach. I might be upset for my own loss, but not nearly as sad as I am for many of my friends there. I think journalspace owes them more than a lame excuse and an empty sorry.

The fact that someone is willing to attack the customers in this case, rather than the grossly irresponsible company, actually saddens me more than the already sad fact that a lot of people lost years’ worth of work. (Though if I lost that much work, I’d cry for days.)

Newspapers are doomed

Saturday, December 27th, 2008

Sam Zell made a terrible mistake when he bought the Tribune Company:

When the Tribune Company announced that it was filing for bankruptcy, last Monday, Sam Zell, the man who bought the company a year ago, for $8.2 billion, said that its problems were the result of a “perfect storm.” You take readers and advertisers who were already migrating away from print, and add a steep recession, and you’ve got serious trouble. What Zell failed to mention was that his acquisition of the company had buried it beneath such a heavy pile of debt that any storm at all would likely have sunk it. But although Zell was making excuses for his own mismanagement, the perfect storm is real enough, and it is threatening to destroy newspapers as we know them. Layoffs and buyouts have become routine. The Miami Herald and the San Diego Union-Tribune are reportedly on the selling block, while lawmakers in Connecticut are trying to keep two newspapers there afloat. Even the New York Times Company has slashed its dividend and announced that it would borrow against its headquarters to avoid cash-flow problems. There’s no mystery as to the source of all the trouble: advertising revenue has dried up. In the third quarter alone, it dropped eighteen per cent, or almost two billion dollars, from last year.

Newspapers are simply a method of delivering ads:

It turns out that subscribers are more expensive, not less expensive, than online readers. Yes, they pay more — but they’re not paying for intensive reporting, experienced editors, and the like. They’re paying for printing presses, mobbed-up newspaper delivery operations, and the whole enormous physical infrastructure involved in getting thousands of tonnes of newsprint delivered to millions of front doors every morning. It’s a hugely expensive operation, and its costs are nowhere near covered by subscription revenues.

There’s an old saying that you’ll never understand newspaper economics until you understand why newspaper vending machines are designed so that you can take as many papers as you like for your quarter. Newspapers are, first and last, devices for delivering ads to readers. It’s the ads which account for all the profits, not the cash coming from subscribers or people who buy their paper at the newsstand. Yes, news itself is free, nowadays. But it always has been. What we’ve been paying for all these years was never news, it was papers.

Lately there have been a lot of articles about the newspaper industry. We are told that this recession is killing off newspapers and that everything is moving to the web. There is some mourning; people wonder how journalism will survive once the newspapers are gone. But consider the other side: where will the ad dollars go? When the next boom hits, America will have far fewer newspapers than it has had in the past. And much more of the public will think it natural to get their news online. Doesn’t it seem that at some point quite a bit of ad revenue must become available to online ventures?

Smart, rational managers tend to manage their companies to bankruptcy

Friday, December 12th, 2008

Almost every complaint that Frank Sommers has with JavaFX is something that I am pleased about. Apparently he wants Sun to focus on its current customers, rather than on its future customers. To me, that attitude is what often leads to bankruptcy. Smart, well-managed companies often manage themselves rationally to bankruptcy. To my mind, JavaFX is Sun’s attempt to break out of that death spiral, to do something new. I am hoping this will prove as much of a new and positive direction for Sun as the iPod proved for Apple.

Frank Sommers writes:

In spite of a thriving Swing community, and despite Swing’s large user base, Sun has re-focused its efforts around JavaFX over the past year-and-a-half, at the expense of Swing development. The most visible aspects of that change in focus is that many of the most experienced Swing developers left the company, such as Chet Haase (see Artima’s interview with Chet Haase), Hans Muller, or Scott Violet. The important Swing-related JSRs have also been stale for a long time now: the latest JSR 295 and 296 updates occurred in June, 2006, according to the JCP’s Web site.

Most recently, SwingX contributor Jeanette Winzenburg wrote on the project’s online forum that Sun all but abandoned its support for SwingX, because its engineers are busy working on JavaFX:

… the official terminus is “frozen” – but as that happened already in July and everybody in the core team is well over their ears into FX it looks rather permanent to me…

I think it quite funny that [Sun engineer Richard Bair argues] in favour of that support/evolution mainly by stating that nobody (definitely not the experienced engineers/architects) at Sun has any time for it – because you all are wasting it on FX (biased me again) Fancy demos – especially in a language unrelated to the project at the center of this forum’s topic – are just that: fancy demos. They don’t solve any real world problems. Chet’s termed the effort to produce them so cutely as CDD – Conference Driven Design.

Winzenburg’s note seems to echo the sentiment of many experienced Swing developers. JGoodies’ Karsten Lentzsch noted, for example, that:

I’m worried that Scott Violet, Chet Haase, and now Hans Muller left Sun. AFAIK Jeff Dinkins isn’t working on Swing anymore, and Amy Fowler has changed her focus too.

None of my Java customers is interested in JavaFX. They want to get their Swing UIs running. They are looking for Java desktop blueprints, for a cook book that explains how to address the everyday Swing task. My customers were excited about the JSR 296 (Swing app framework) and 295 (beans binding). But now it’s unclear what’ll happen to these projects. I don’t see Sun’s Swing strategy.

If you look at the JavaOne 2006, 2007 and now 2008 what have we got for Swing, or in other words for the Java desktop *now*? A cool demo (Aerith) in 2006, more cool demos in 2007, and JavaFX in 2008. All my customers do the “boring” stuff: editors, forms, navigation, buffering, data binding, layout. That’s how they make money. Who cares about them? They need a framework, better components, not animated 3D flipping images.

And Kirill Grouchnikov recently wrote that:

I don’t know what the future holds for JavaFX. Sun is heavily betting on it… All I know is that JavaFX has effectively halted all core Swing development. Over the last 18 months, we have seen significant architectural initiatives (JSR 295 and JSR 296) changing leads and frozen. All client-facing improvements in Java2D, AWT and Swing in Java 6 Update 10 are completely driven by the requirements of JavaFX.

…Do you agree that Sun’s recent focus on JavaFX has hurt the cause of client-side Java?

The worst software project failure ever

Saturday, June 9th, 2007

The modern, affluent standard of living depends on society’s infrastructure, both physical and intellectual: airports, highways, bridges, ports, telecommunication networks, databases, power plants, and the software that makes it all run. As recently as 150 years ago it was still common for massive bridge projects to meet with catastrophic failure. The engineers of the 1800s slowly mastered the art of working with iron and then steel. The projects grew in scale, and the skills of the project managers needed to keep pace.

Nowadays we rarely hear of catastrophic bridge failures. This is a particular type of infrastructure that has come to be well understood, both in its engineering and in the project management that oversees its construction. However, we, as a society, are still struggling to find the right way to develop large software projects. This is the newest type of infrastructure, and the correct way to engineer it and manage its development is still a source of controversy.

Robert L. Glass has compiled a book of entertaining and instructive stories regarding the massive software project failures of our time: Software Runaways: Monumental Software Disasters. Anyone with an interest in the management of software projects should read it.

The worst of these project failures, the most expensive and the most ambitious, was the attempt made by the Federal Aviation Administration to modernize the computer system it uses to keep track of what planes are in the air. The effort began in 1981 and ended in complete failure in 1994. The government hired IBM to do the actual work, and over the course of 14 years, IBM burned through $3.7 billion. Nothing was accomplished. The project was finally shut down by Congress. Nothing came out of the project: not a single piece of software, not even a line of code, was ever used for anything.

Robert N. Britcher, who was involved with the project, offers a full write-up of it. I will try to entice you to buy the book with some excerpts:

The Advanced Automation System began, in concept, in 1981 and ended in 1994, “terminated for convenience” by the government. Billions of dollars were spent on it. It is hard to describe. You can’t learn anything from the name. You know it’s about air traffic control because I told you, or because you read about it in the papers. Maybe part of the problem was the name. It sounds like the system to end all systems.

One engineer I know described the AAS this way. You’re living in a modest house and you see the refrigerator going. The ice sometimes melts, and the door isn’t flush, and the repairman comes out, it seems, once a month. And now you notice it’s bulky and doesn’t save energy, and you’ve seen the new ones at Sears. So it’s time. The first thing you do is look into some land a couple of states over and think about a new house. Then you get I.M. Pei and some of the other great architects and hold a design run-off. This takes awhile, so you have to put up with the fridge, which is now making a buzzing noise that keeps you awake at night. You look at several plans and even build a prototype or two. Time goes on and you finally choose a design. There is a big bash before building starts. Then you build. And build. The celebrating continues; each brick thrills. Then you change your mind. You really wanted a Japanese house with redwood floors and a formal garden. So you start to re-engineer what you have. Move a few bricks and some sod. Finally, you have something that looks pretty good. Then, one night, you go to bed and notice the buzzing in the refrigerator is gone. Something’s wrong. The silence keeps you awake. You’ve spent too much money! You don’t really want to move! And now you find out the kids don’t like the new house. In fact, your daughter says “I hate it”. So you cut your losses. Fifteen years and a few billion dollars later, the old refrigerator is still running. Somehow.

At $3.7 billion, the Advanced Automation System was one of the largest civilian computer contracts ever; maybe the largest. It was the largest single contract in IBM’s history. From the moment it was awarded, until near the project’s demise, IBM patted itself on the back. There was something for everyone, beginning with a great ball in Union Station, featuring Chubby Checker and “The Twist”.

At its peak the project employed over 2,000 people. About a million dollars a day. If you thought like the IBM project manager, this was a good deal. Many people were working and money was being made. It was going to last forever. No one considered that it wouldn’t. And everyone was getting ahead. (One of the ironies of conceptual work is that it is easy to believe you are farther along than you are.)

The AAS must have been the most supervised project in history: this atop its enormous size and complexity, and the extreme and constantly changing requirements. One programmer described it this way: “Working on the project was like working on a car inside the garage with the motor running. Eventually, even the crickets hopping around the tires suffocate.”

What I saw on the FAA’s Advanced Automation System would have made Sisyphus weep… Whatever commitment and discipline there was… was worn down by a battery of watchfulness that I can only ascribe to fear of failure. In spite of tens of millions of dollars spent on new computers for AAS, the most important piece of equipment on the project was the overhead projector. There were endless meetings attended by dozens of people – as if we were never quite sure about the whole thing. The people in charge simply lacked the confidence and the finesse of the space team: NASA, contractors, and astronauts.

It has been noted by everyone from the New York Times to the Vice-President of the United States that the main problem on the Advanced Automation System was “changing requirements”. For those involved in large-scale computer systems, that is nothing new. No one can perfectly surmise the shape and feel of a system years in advance. Even replacing some aspect of a system you know by heart is not immune from thinking twice about it. … [But] the requirements churn (it was called) on the Advanced Automation System was not normal. It was the result of our enchantment with the computer-human interface, the CHI. The new controller workstation, fronted by a 20″ by 20″ color display, because it was capable of seemingly endless variety of presentations, mesmerized the population of AAS like the O.J. Simpson trial mesmerized the nation…

The project was handed over to human factor pundits, who then drove the design. Requirements became synonymous with preferences. Thousands of labor-months were spent designing, discussing, and demonstrating the possibilities: colors, fonts, overlays, reversals, serpentine lists, toggling, zooming, opaque windows, the list is huge. It was something to see. (Virtually all of the marketing brochures – produced prematurely and in large numbers – sparkled with some rendition or other of the new controller console.) It just wasn’t usable…

The cost of what turned out to be a 14-year human factors study did not pay off. Shortly before the project was terminated a controller on the CBS evening news said: “It takes me 12 commands to do what I used to do with one.” I believe he spoke for everyone with common sense.

Rummaging through one of the closets at the far end of the hall on the fifth floor one day, looking for some standards document, I found an envelope left by someone who left the company – as many did after so many years advancing against stone, while the wheels of commerce were accelerating on what everyone referred to as “the outside”. It contained “A Brief History Of The Advanced Automation System”. It was printed by hand and left, perhaps inadvertently, or perhaps with the hope that some anthropologist might some day discover it and make a pronouncement. In every important way, it is the truth:

“A young man, recently hired, devotes years to a specification written to the bit level for programs that will never be coded. Another, to a specification that will be replaced. Programmers marry one another, then divorce and marry someone in another subsystem. Program designs are written to severe formats, then forgotten. The formats endure. A man decides to become a woman and succeeds before system testing starts. As testing approaches, she begins a second career on local television, hosting a show on witchcraft. An architect chases a new technology, then another, then changes his mind and goes into management. A veteran programmer writes the same program a dozen times, then transfers. The price of money increases eight times. Programmers sleep in the halls. Committees convene for years to discuss keystroking. An ambitious training manager builds an encyclopedia of manuals no one will ever use. Decisions are scheduled weeks in advance. Workers sit in the hallways. Notions of computing begin in the epoch of A, edge toward B, then come down hard on A + B. Human factors experts achieve Olympian status. The Berlin Wall collapses. The map of Europe is redrawn. Everything is counted. Quality becomes mixed with quantity. Morale is reduced to a quotient, then counted. Dozens of men and women argue for thousands of hours: What is a requirement? A generation of workers retire. The very mission changes and only a few notice. Programming theories come and go. Managers cling to expectations, like a child to a blanket. Presentations are polished to create an impression, then curbed to cut costs. Then they are studied. The work spikes and spikes again. Offices are changed a dozen times. Management retires and returns. The contractor is sold. Software is blamed. Executives are promoted. The years rip by with no end in sight. A company president gets an idea: make large small. Turn methods over to each programmer. Dress down. Count on the inscrutability of programming. 
Promote good news. Turn a leaf away from the sun. Maybe start over.”