Improve Linux performance

    May 9, 2003

Performance breakthroughs seem to come in two varieties: easy and hard. That’s no platitude; the boundary between the two is surprisingly clear.

When you hear about some — the easy ones — you clap your hands and say, “wow” or “of course” or “slick.” Although in some cases it has taken considerable genius to realize their first application, they’re easy to understand.

The other kind involve careful measurement, specific knowledge, and a fair amount of tuning. These are often both frustratingly and rewardingly contingent on “local conditions” such as hardware specifics.

Good programmers can operate in either the “hard” or “easy” mode. As performance is an important topic for Linux programmers, I offer a paired collection of four hard and easy tales from real (programming) life.

It always starts with requirements

Death and taxes have reputations for inevitability. Of nearly equal importance for developers is careful consideration of requirements; that’s one subject that comes up in every kind of programming.

While performance is certainly important, the best way to handle this requirement is not always obvious. Time and again, I’ve experienced a software challenge that followed roughly this pattern: a program is in use. Its functionality is correct. A user stops in, though, to report that it’s “too slow” and needs acceleration. Someone on the team quickly hacks in a “monitor” that slows performance a bit but keeps the user informed about how much time remains for a long-running computation. Satisfaction settles in.

I first observed this phenomenon late in the 1960s and have seen it every decade since. My guess is that in some sense it’s been going on since the beginning of computing. The point is not that end users are gullible or can be distracted with shiny decorations. They simply have larger goals than our narrow, specialists’ notions of “performance.” In many cases, when they say a calculation needs to be faster, what they really mean is that they need to know more quickly and reliably how long the calculation will take. With accurate information on that score, end users can be happy scheduling other tasks around the computer’s delay.

This suggests a “warmup exercise” for everyone involved with performance. First, read or reread your favorite references on requirements management; the Resources section below provides a couple of useful starting points.

Second, keep in your toolkit a few little “progress monitors,” “progress bars,” or “stopwatches” that you can quickly drop into applications when you find you’ve left your users too much in the dark. These don’t have to be expensive at all, as the two following examples illustrate.

Two countdown examples

Performing a long computation while simultaneously keeping a user informed of its progress models an interesting problem in application design. It’s an instance of concurrency, or “multitasking.” Many developers believe concurrency demands elaborate mechanisms, and difficult threading code in particular, for a proper solution.

It’s not true. The Resources section below lists an earlier developerWorks column in which I sketch the range of technologies involved in concurrency. What’s more, simple concurrent programming has been available on essentially all UNIX hosts since at least 1988, which is when ksh standardized its “co-processes.” If you save the ksh source in Listing 1, below, as ex1.ksh and then run it, you’ll see a “countdown” — “ten, nine, eight, etc.” — display, then the result of the ksh subprocess: “All done.” In practice, you might use something like this for a long-running chemical calculation or database retrieval: launch the operation as a subprocess, but keep users informed about what’s going on or how much time is left before completion. It only takes a couple of lines of source code to update the display.

Listing 1. ksh source code for example countdown display
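A minimal sketch, assuming ksh93’s “|&” co-process syntax; the ten-second sleep stands in for a real long-running computation:

```ksh
#!/bin/ksh
# Launch the long-running work as a co-process. The sleep here is a
# stand-in for a real chemical calculation or database retrieval.
(sleep 10; print "All done.") |&

# Keep the user informed while the co-process runs.
for word in ten nine eight seven six five four three two one; do
    print "$word"
    sleep 1
done

# Collect and display the co-process's result.
read -p result
print "$result"
```

Run it as “ksh ex1.ksh”; the countdown display and the subprocess proceed concurrently, and the script only blocks on the co-process once the countdown is exhausted.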


“Real-time” information for your users requires no more than a modest amount of coding; even character-based applications can do it. If a graphical user interface (GUI) fits your situation better, that’s also easy. Many toolkits build in “busy” or “waiting” or “progress” displays. If you don’t already have one at hand, a few lines of Tk or another modern GUI toolkit will suffice to make a little “analogue” countdown clock face or progress bar. Here’s an example “from scratch” that fits on a couple of pages:

Listing 2. Tk source code for example countdown display
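A sketch in plain Tcl/Tk; the ten-minute duration and the widget names are illustrative choices:

```tcl
#!/usr/bin/env wish
# Countdown clock face with minute and second hands.
set remaining 600                    ;# seconds left; illustrative duration

canvas .clock -width 200 -height 200 -background white
pack .clock
.clock create oval 20 20 180 180 -width 2

# Draw (or redraw) one hand of the given length at the given angle,
# in degrees measured clockwise from twelve o'clock.
proc hand {tag length degrees} {
    set r [expr {$degrees * acos(-1) / 180.0}]
    set x [expr {100 + $length * sin($r)}]
    set y [expr {100 - $length * cos($r)}]
    .clock delete $tag
    .clock create line 100 100 $x $y -width 3 -tags $tag
}

proc tick {} {
    global remaining
    if {$remaining < 0} {exit}
    hand second 70 [expr {($remaining % 60) * 6}]
    hand minute 45 [expr {($remaining / 60.0) * 6}]
    incr remaining -1
    after 1000 tick                  ;# reschedule in one second
}
tick
```

The “after 1000 tick” idiom is all the concurrency this needs: the event loop updates the display once a second while the rest of the application remains free to work.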


This countdown shows a minute and second hand, as in Figure 1, with the same information content as the previous program.

Figure 1. Display of analogue countdown clock coded in Tk

Remember: information can substitute for functionality. The more your end users know about the internal state of your application, the less they’ll require of you. You can solve many apparent performance problems just by teaching your programs to show what they’re doing.

Sorting is hard

That was an “easy” lesson; “civilians” with no particular programming background can understand the previous paragraph. Sorting, though — that’s in the “hard” category. And it must be: Donald Knuth devotes an entire volume of his classic series on computing to sorting and searching (see Resources for a link to a review of Knuth’s The Art of Computer Programming).

Sorting turns out to be at the heart of many performance challenges for a deep reason. Performance is an issue only for large-scale computations. Humans can understand only a few items at a time, so the way we grasp problems large enough to take a long time is to organize or structure them in some way. Sorting is the most common of these ways; a sorted list can be binary-searched rather than linearly searched. That reduces manual lookup in the New York City telephone directory, for example, from a problem on the scale of a million to a problem on the scale of the base-2 logarithm of a million (about twenty comparisons, say).
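The arithmetic is easy to check with Python’s standard bisect module; the million-entry “directory” here is an illustrative list of integers:

```python
import bisect
import math

haystack = list(range(1_000_000))   # a sorted "directory" of a million entries

# A linear scan might examine all million entries; binary search needs
# only about log2(1,000,000) probes to find any one of them.
position = bisect.bisect_left(haystack, 654_321)

print(position)                              # 654321
print(math.ceil(math.log2(len(haystack))))   # 20
```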

That’s typical of sorting work; superior algorithms often outperform inferior ones by a factor of a thousand or more. It pays to sort intelligently.

Most intelligent of all, however, is to avoid sorting altogether, or at least limit it to occasions when it won’t be noticed. That’s common with database management systems; among other benefits, they allow for the construction of “insert-time” indexes. When a sorted result is needed, it can be read directly from the index. The cost is a slightly longer time to create or insert elements — but users won’t notice this if the application’s workflow is typical.
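SQLite, bundled with Python, is enough to illustrate the idea; the table and column names are invented for the example:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE revenue (name TEXT, amount INTEGER)")
# The "insert-time" index: the sorting cost is paid incrementally,
# a little on each insert, rather than all at once at query time.
db.execute("CREATE INDEX idx_amount ON revenue (amount)")
db.executemany("INSERT INTO revenue VALUES (?, ?)",
               [("Jane Jones", 15000), ("Dan Smith", 6000), ("Kim Black", 40000)])

# A sorted result now comes straight off the index; no separate sort step.
rows = db.execute("SELECT name FROM revenue ORDER BY amount DESC").fetchall()
print(rows)   # [('Kim Black',), ('Jane Jones',), ('Dan Smith',)]
```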

Knuth’s reference describes plenty of other tactics, such as the “Boyer-Moore” and “Rabin-Karp” algorithms, for accelerating searching operations in specific situations. One common principle that unifies several of these is that it can be easier to solve a general problem than a more specific one. Mathematicians are familiar with this; common problems in algebra are more tractable when considered over the complex numbers than over the reals, even though complex arithmetic is more difficult.

That’s how I think about “decorate-sort-undecorate” (DSU). Suppose you have this dataset:

“Jane Jones” -> 15,000
“Dan Smith” -> 6,000
“Kim Black” -> 40,000

Say you need the names sorted by their associated revenues. A naive approach might be to use a library entry point for sorting, supplied with a comparison function to say that “Kim Black” precedes “Dan Smith” because 40,000 is greater than 6,000, and so on. That’s straightforward and effective for small problems.

Far more efficient, though — especially when sorting a million items at a time, as is common in bioscience, finance, and other domains — is to go to the trouble of creating a special-purpose “reverse table.” While this operation takes time, it allows the sorting to be very fast. Python’s list comprehensions make this particularly economical to express; the same principle applies in any language you might have at hand, though:

Listing 3. Minimal decorate-sort-undecorate illustration, in Python
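A sketch using the dataset above:

```python
# The dataset from the text: names mapped to revenues.
revenues = {"Jane Jones": 15000, "Dan Smith": 6000, "Kim Black": 40000}

# Decorate: build tuples that put the sort key first.
decorated = [(amount, name) for name, amount in revenues.items()]

# Sort: plain tuple comparison, with no per-element comparison callback.
decorated.sort(reverse=True)

# Undecorate: strip the keys back off.
by_revenue = [name for amount, name in decorated]

print(by_revenue)   # ['Kim Black', 'Jane Jones', 'Dan Smith']
```

The payoff is that the sort itself runs entirely on cheap built-in tuple comparisons instead of calling back into a user-supplied comparison function once per comparison.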


Sorting the derived dataset can easily accelerate the overall operation by an order of magnitude or more. Calculations of computational complexity seem to be one of the harder subjects for students; there’s no doubt of their value, though, when they lead to such speedups.

500 times as quick

Independent consultant Alex Martelli experienced an even more dramatic breakthrough just last year. As he related in the Python Business Forum mailing list, he had implemented an XML processor that required eight hours to complete its task. That was unacceptable; end-users simply couldn’t wait that long. He combined several fixes to bring the time down to one minute — a factor of 500 over where he started!

First, he switched from the built-in Python XML libraries to the specialized pyRXP parser. This is the same pyRXP that made an appearance in a previous developerWorks article on the performance of XML processing; see Resources for a link to the article. Along with its thriftier use of processing cycles, pyRXP also dramatically slashes memory footprint, when compared to other Python-bound XML engines. This makes an enormous difference, and is typical of many performance bottlenecks: applications are starved for memory. Even formally favorable algorithms can slow catastrophically if they use so much memory as to require swapping. That was Martelli’s biggest problem.

So, when evaluating alternative products or algorithms, benchmark not just their apparent throughput but also their memory impact. The latter often dominates the effective scalability of production applications. Moreover, if you decide to try solving a performance problem by throwing hardware at it, as is often wise, start by considering more memory. An increase in main memory is an inexpensive experiment that often yields dramatic results.
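One inexpensive way to put numbers on memory impact is Python’s standard tracemalloc module; the parser below is only a stand-in for whatever candidates you happen to be comparing:

```python
import tracemalloc

def parse_naive(text):
    # Stand-in for a memory-hungry parse: every field becomes its own object.
    return [line.split() for line in text.splitlines()]

document = "field1 field2 field3\n" * 10_000

tracemalloc.start()
tree = parse_naive(document)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"peak allocation: {peak} bytes for {len(tree)} records")
```

Running each candidate parser under the same harness gives a peak-allocation figure to set beside the throughput numbers from an ordinary benchmark.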

That, then, is the second easy principle: get more memory, and make better use of what you have. Lots of people, including managers, seem to appreciate that memory matters.

Disk drives deserve doubt

A final tip for the performance-hungry is to be suspicious of disk drives. Mass storage has made amazing strides in density and reliability over the last few decades, but what I’ve seen lately is that at least some manufacturers appear to be stretching the quality of their operations too thin. The consequences include some units that fail outright and far more that simply perform erratically.

Unless you measure carefully, you don’t know the true average access time or throughput of your disk subsystems. This can be particularly frustrating when mass storage is configured as RAID, SAN, or other modern technologies. A particular unit might fail frequently, but the only consequence apparent at the application level is spooky variations in overall performance. It can easily happen, for example, that most of the elapsed time of a particular program goes to error-correction within a RAID unit that has effectively lost one entire spindle.
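Even a crude probe beats guessing. The sketch below times a write and a re-read of a scratch file; a real measurement should repeat many runs and defeat the operating system’s page cache, which this simple version does not attempt:

```python
import os
import time

path = "scratch.bin"                # illustrative scratch-file name
block = os.urandom(1024 * 1024)     # one megabyte of incompressible data
count = 16                          # sixteen megabytes total

start = time.perf_counter()
with open(path, "wb") as f:
    for _ in range(count):
        f.write(block)
    f.flush()
    os.fsync(f.fileno())            # force the data out to the device
write_seconds = time.perf_counter() - start

start = time.perf_counter()
with open(path, "rb") as f:
    while f.read(len(block)):
        pass
read_seconds = time.perf_counter() - start
os.remove(path)

megabytes = count * len(block) / 1e6
print(f"write: {megabytes / write_seconds:.0f} MB/s, "
      f"read: {megabytes / read_seconds:.0f} MB/s")
```

Run periodically and logged, even a probe this rough makes an ailing spindle show up as a trend rather than a mystery.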

What can you do about this? Plenty of remedies are possible, but they’re poorly documented; I know them largely as the folklore that system administrators pass on in person. Here are a few highlights:

  • Buy equipment you trust. Mass storage is a domain where you can pay for dependable products rather than accepting lowest bids.
  • Don’t push the envelope. Let others shake down SANs, the latest-generation SCSI, gigaether storage, and other offerings.
  • Look for measurability in the products you buy. Find disk drives that monitor their own performance and circumstances, including temperature. Keep records so you know what to expect from your equipment. If your projects are big enough, you will run into disk mysteries. You’ll be far better off with a documented baseline than trying to track down a failure after it has begun to happen.
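For the record-keeping, a cron-ready fragment along these lines is one common approach, assuming the smartmontools package is installed; the device name and log path are examples only:

```sh
# Append a dated snapshot of a drive's self-monitoring attributes,
# including temperature, to a baseline log. Requires root; the
# smartmontools package provides smartctl.
date >> /var/log/smart-baseline.log
smartctl -A /dev/sda >> /var/log/smart-baseline.log
```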

Some performance fixes take only minutes to conceive or even implement; others require investments of days of research before they pay off. To track down the true limits on your applications’ speed, you need to be prepared to tackle both the hard and the easy problems of performance.

Resources

  • This page of readings on requirements acquisition descends from the Usenet newsgroup.
  • Cameron admires the intent more than the realization of Requirements Management Place, but he thinks that might largely be a matter of personal taste. In any case, everyone involved in requirements acquisition — and certainly all programmers and administrators — should at least have an acquaintance with the range of resources and ideas that bear on analysis of requirements.
  • “Concurrency for grown-ups” explains that there’s a lot more to multi-tasking than just threading.
  • Cameron’s personal page on ksh provides pointers for more information on this interesting and capable language. Make sure you install a ksh93; many Linux distributions install ksh88 by default.
  • Danny Yee’s review of Donald Knuth’s The Art of Computer Programming gives a bit of background on the latter’s authoritative treatment of Sorting and Searching.
  • Brian Goetz discusses some of the most common performance mistakes he’s seen in projects using the Java language.
  • “Adventures in high-performance XML persistence, Part 2” mentions pyRXP, which is, for some applications, the fastest supported XML parser.
  • The old Server/Workstation Expert had a great column on storage concepts and practice that remains valuable reading. I don’t know of any comparable current introduction to the hard problems of managing mass storage.
  • For the complete story on IBM’s storage offerings, visit the IBM TotalStorage page.
  • The Tivoli Storage Management page is a good starting point for information on how Tivoli can help manage drives, backups, SANs, and more.
  • Find more resources for Linux developers in the developerWorks Linux zone.

First published by IBM developerWorks at

Reprinted with permission from the author.

Cameron Laird is a full-time developer for independent consultancy Phaseit, Inc. He writes and speaks frequently on open source and other technical topics. Most of his own development work has been in systems programming and on the server side. You can contact him at