The following article originally appeared in the September 6, 1996 issue of HPCwire. It is reproduced here with their permission.
PAPERS DEVELOPS NOVEL SUPERCOMPUTING CAPABILITY FOR NOWS   09.06.96
by Alan Beck, editor in chief   HPCwire
=============================================================================

West Lafayette, Indiana -- PAPERS (Purdue's Adapter for Parallel Execution and Rapid Synchronization) is custom hardware that allows a cluster of unmodified PCs and/or workstations to function as a fine-grain parallel computer capable of MIMD, SIMD, and VLIW execution. Its developers assert that the total time taken to perform a typical barrier synchronization using PAPERS is about 3 microseconds, and that a wide range of aggregate communication operations are also supported with very low latency.

In order to learn more about the functionality and pragmatic potential of PAPERS, HPCwire interviewed Hank Dietz, associate professor of electrical and computer engineering at Purdue University and a principal PAPERS developer. Following are selected excerpts from that discussion.

----------------------

HPCwire: Please tell us about the background and fundamental concepts of PAPERS.

DIETZ: "I've worked in compilers for a long time. In doing very fine-grained compiler code scheduling and timing analysis, we found that the most useful hardware construct was a barrier synchronization mechanism. About three years ago, we found a way to build a very efficient one that could plug into standard PCs or workstations. That's how the PAPERS project was born.

"This barrier mechanism doesn't just do barriers, however. In order to implement the full barrier mechanism that we wanted, we also had to build in some communication originally intended for collecting and transmitting barrier group masks -- basically bit masks saying which processors were involved in each barrier.
The hardware that does that turns out to be a special case of a more general notion called aggregate function communication, where, as a side effect of every barrier synchronization, each processor can put out a piece of data and specify exactly what function of the data collected from all processors in the barrier group it would like to receive in return.

"So basically PAPERS is a very simple piece of custom hardware that plugs into a group of workstations, PCs or other machines and gives not just a very low-latency communication mechanism but also one capable of sampling global state very cheaply. With PAPERS, one operation is sufficient to sample everybody's state."

HPCwire: Is PAPERS a unique technology?

DIETZ: "So far. Normally people talk about shared memory and message passing. Even though the PAPERS logic -- the aggregate function model -- can fake either of those, it's fundamentally a different computation model in terms of the interaction between parallel-processing elements."

HPCwire: Your literature claims that PAPERS can turn a NOW (network of workstations) into something virtually indistinguishable from a supercomputer. Isn't this an overstatement?

DIETZ: "It's an overstatement in the sense that obviously if you take a bunch of 386s and tie them together, you still have a bunch of 386s. But it's not an overstatement in this sense: Most of the differences between a traditional NOW and a traditional supercomputer revolve around the fact that the latter has very low-latency communication and a way of sampling the global state, so you have a cheap way of effecting global control -- for example, SIMD and VLIW execution models. It turns out that PAPERS actually does provide that.

"True, we're not quite as fast as some supercomputers have been in providing those functions. But we're much closer to the fastest supercomputers than to the traditional NOWs.
And, in fact, PAPERS provides much faster global control than some supercomputers, such as the Intel Paragon or IBM SP2."

HPCwire: Can you document these claims with benchmark figures?

DIETZ: "We have performed benchmarks against the Paragon and other machines. The figures are posted on-line and have also been published in several papers. For the Paragon, minimum communication time between a couple of processors is on the order of a couple of hundred microseconds, whereas for the PAPERS unit, a barrier sync across everybody is on the order of three microseconds. And other kinds of aggregate communication operations end up taking no more than tens of microseconds."

HPCwire: How many workstations can be linked before there's a slowdown?

DIETZ: "Normally people think about networks as switched. PAPERS has no switching. Consequently, when you scale it up, all you're adding is wire and gate delays: there are no extra logic delays, buffering stages or switching stages. We've already built prototypes that can literally scale up to thousands of processors, and the slowdown on the basic operations is on the order of one or two hundred nanoseconds."

HPCwire: Aren't there peculiar programming challenges involved?

DIETZ: "Absolutely. Most people use PVM or MPI as the programming environment for workstation clusters. There, typically one PE (processing element) initiates, and one PE receives. That's not the way PAPERS works. Our programming environment, AFAPI (Aggregate Function Application Program Interface), has a full set of operations and can even do things that look like ordinary network communications -- for example, a permutation communication across all the processors. The catch is that it's not one operation per processor but one operation involving all the processors.

"Let's say I want to do an arbitrary multibroadcast. Each processor outputs its piece of data and says who it wants to read the data from.
Because of the way the hardware is structured, this becomes a single operation for the library. This is quite different from each processor asynchronously deciding to talk to another processor. It's not a whole bunch of point-to-point links. It's literally an N-to-N communication."

HPCwire: Exactly what does this mean for the programmer?

DIETZ: "AFAPI is different. If you're writing C, you can't just take your PVM or MPI codes and run them unchanged. You have to sit down and think a little bit. But if you do this, you get vastly improved performance and virtually no operating system overhead.

"I believe it's actually easier to use than PVM or MPI for two reasons. One is that there's no concept of buffer management. The other is that, since all our operations are cheap, you don't have to worry about restructuring your code, hiding latency, vectorizing messages, etc.

"We also have a major compiler effort going on. Don't forget: PAPERS started out as a compiler project. Jointly with Will Cohen at the University of Alabama at Huntsville, we have developed a port of the MasPar C dialect called MPL. We've taken the full compiler for that and retargeted it to generate code for PAPERS clusters.

"Rather than thinking of a PAPERS cluster as a traditional NOW, it's better to conceive of it as a dedicated parallel supercomputer that just happens to be made out of commodity boxes with a custom connection."

HPCwire: PAPERS requires hardware. How costly is it?

DIETZ: "The software and hardware designs are not proprietary; they're all fully public-domain. And we like keeping things that way. For a two- or four-processor system, a custom board is not even required. You can buy the parts from Radio Shack and assemble them on your kitchen table; it would cost about $50 to $60. Scalable versions are a bit more expensive to build, because the PAPERS modules have additional hardware for hooking them together.
"Also, although the AFAPI is normally used with a cluster of UNIX systems connected by both PAPERS and a conventional network, the same programming interface works with other hardware configurations. For example, SHMAPERS AFAPI uses UNIX System V shared memory on SMP hardware; CAPERS AFAPI works with just two machines connected by a standard 'LapLink' cable."

HPCwire: What kind of applications are PAPERS systems currently supporting?

DIETZ: "Seven universities are currently playing with it -- using the technology in everything from scientific applications to a chess-playing program. At least one company is looking into an embedded PAPERS-based system for medical equipment. We also are developing VGA video-wall applications."

---------------------

For more information, see the PAPERS Web site http://garage.ecn.purdue.edu/~papers
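The aggregate function model described in the interview can be made concrete with a small simulation. The following is a minimal Python sketch, not PAPERS or AFAPI code: it assumes ordinary threads stand in for the processors and a pair of threading.Barrier objects stand in for the PAPERS hardware, and the names AggregateGroup, exchange and run are hypothetical. Each call to exchange is one collective operation in which every processor contributes a value and names the function of all contributions it wants back, so a global sum, a global OR used to sample shared state, and a multibroadcast each take a single step.

```python
import threading
from typing import Any, Callable, List, Sequence, Tuple


class AggregateGroup:
    """Hypothetical software stand-in for a PAPERS-style aggregate function
    barrier. Each exchange() call is one collective operation: every PE
    contributes a value, then receives its own chosen function of all values."""

    def __init__(self, n: int) -> None:
        self.n = n
        self._slots: List[Any] = [None] * n
        # Barrier 1: no PE reads until every contribution has been written.
        self._filled = threading.Barrier(n)
        # Barrier 2: no PE overwrites its slot until every PE has read.
        self._drained = threading.Barrier(n)

    def exchange(self, pe: int, value: Any,
                 fn: Callable[[Sequence[Any]], Any]) -> Any:
        self._slots[pe] = value
        self._filled.wait()
        # Apply this PE's requested function to a snapshot of all the data.
        result = fn(tuple(self._slots))
        self._drained.wait()
        return result


def run(n: int = 4) -> List[Tuple[Any, ...]]:
    group = AggregateGroup(n)
    results: List[Tuple[Any, ...]] = [()] * n

    def worker(pe: int) -> None:
        # Global reduction: every PE gets the sum of all contributions.
        total = group.exchange(pe, pe + 1, sum)
        # Sampling global state: a global OR of one flag per PE.
        flagged = group.exchange(pe, pe == 2, any)
        # Multibroadcast: each PE outputs its data and names the PE it reads
        # from (here its right-hand neighbor), all in one operation.
        src = (pe + 1) % n
        received = group.exchange(pe, f"data{pe}", lambda vals: vals[src])
        results[pe] = (total, flagged, received)

    threads = [threading.Thread(target=worker, args=(pe,)) for pe in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for pe, r in enumerate(run()):
        print(pe, r)
```

The sketch shows only the semantics, not the performance; it says nothing about the roughly three-microsecond hardware barrier. Note that no message is addressed to a particular receiver and there are no buffers to manage: every processor calls exchange the same number of times, and the multibroadcast is one operation involving all processors rather than a set of point-to-point messages.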