Prof. Hank Dietz,
[email protected]
Randy Fisher,
[email protected]
Originally Posted July 30, 1998 at http://dynamo.ecn.purdue.edu/~hankd/SWAR/Scc.html
SWARC, pronounced "swh-are-see," is a C-like language designed to simplify writing portable code using SWAR parallelism. SWAR is our generic name for a variation on SIMD, the Single Instruction Multiple Data parallelism that has been used in supercomputers for decades. Unlike traditional SIMD hardware, SWAR (Simd Within A Register) hardware parallelism width is not a constant, but is a function of the precision of the values being operated upon -- lower precision yields greater parallelism. The primary motivation behind the SWARC language is that SWAR hardware mechanisms in the latest generation of microprocessors offer excellent speedups, but vary widely and have highly machine-dependent programming models that require hand coding of individual instructions and offer virtually no compile-time optimization.
Although having a very smart C compiler automatically generate optimized native SWAR code is a good long-term goal, quite a lot of new compiler technology must be developed for this, and building such a C compiler is not currently practical. To develop and test the new compiler technology, and also to provide an immediately usable SWAR programming abstraction, SWARC is a relatively small and simple language that can be used to generate code modules that can call and be called by ordinary C code. Thus, for example, SWARC has no I/O operations, etc.; these are all provided by the interface to C.
The following sections of this HTML document briefly overview the SWARC language, the SWARC compiler (Scc), a simple example program, and an "evil" example program.
The key concept in SWAR is the use of operations on word-sized values to perform parallel operations on fields within a word. Thus, the most distinctive feature of SWAR programming is the need to specify the field sizes and their alignment within a word. The field sizes and alignments are highly machine dependent, so, for portability, the programmer's specification should not require particular layouts, but should provide minimal constraints on how a particular layout can be selected by the compiler. To interface to C as a module compiler, SWARC also needs to be able to specify data types that have layouts equivalent to ordinary C data.
SWARC's solution is very simple. A SWARC data type essentially specifies a first-class single-dimension array type whose elements are either ordinary C-typed objects or are fields with a specified minimum precision. For example, the C declarations:
char c[100]; short s[100]; int i[100]; unsigned int u[100];
are equivalent to the SWARC C-layout declarations:
char[100] c; short[100] s; int[100] i; unsigned int[100] u;
The slight notational difference between C and SWARC is due to the fact that C arrays are not fully first-class objects. For example, given the above declarations, i=s; is a perfectly legal SWARC array assignment that does element-by-element conversion from short to int copying the array value of s into i. In C, i=s; would be illegal; in fact, the closest that we could come to making this work for C would be to wrap each array in a C struct (since they are first-class C objects), but that still would not work, because the two struct types would be incompatible.
However, specifying first-class arrays of C objects is only the C interface portion of SWARC's data typing mechanism. Borrowing the bit-field specification from C struct declaration syntax, SWARC allows first-class arrays of appropriately-aligned SWAR data to be declared as:
char:7[100] ca; short:[100] ia;
This notation uses any of the basic C types char, short, int, or float, but specifies the minimum number of bits for each field and allows the SWARC compiler to select the most appropriate internal layout (i.e., field alignment and ordering). The declaration of ca specifies that it is an array of at least 100 elements, each of which is an integer field with at least 7 bits. Similarly, ia is an array of at least 100 elements, each of which is an integer field with at least as many bits as a C short (i.e., 16 bits). Thus, the programmer does not know very much about how ca and ia will be represented, but ia=ca; works as expected -- and so would ia=c; or i=ca;.
About now is when you might be thinking that C++ classes and templates could solve these problems without requiring a whole new language. The answer is not quite. It would be very difficult for the C++ compiler to optimize across multiple first-class array operations, or even to do the basic static analysis involving SWAR optimizations.
In addition to this major difference between C types and SWARC types, there are a variety of minor differences. To simplify the compiler analysis, SWARC does not support C struct or union types, pointers, nor even multi-dimensional arrays; of course, there is nothing preventing programmers from coding using these types in C functions that interact with SWARC code. To make SWARC more usable for programmers, there are also a variety of extensions. In addition to the C-like sizeof built-in operation, SWARC provides widthof and precisionof; widthof and precisionof can be applied to an expression to return constant values that are, respectively, the number of elements or the number of bits of precision. For declaring variables or type casting, typeof can be used to generate a type specification from an expression. For type casting, degenerate null-subscripted types can be used to change the precision of elements without changing the number of elements. For example:
int x; int:8[3] y; int[3] z; typeof(y) y2; /* equivalent to int:8[3] y2; */ x = sizeof(z); /* x = # bytes in z's memory image */ x = widthof(z); /* x = 3 */ x = precisionof(z); /* x = 8 */ y = ((int:8[]) z); /* y = (z's value as an int:8[3]) */
It is important to note that, unlike C's sizeof, the SWARC construct returns a potentially non-constant value -- the SWARC compiler is even free to use multiple different-sized layouts for values of the same declared type. This operator is used only to pass C code the current size of the value's memory image.
There are also several type specifiers closely related to C storage classes. As in C, data and functions that are defined either later in this file or in another, separately compiled, file can be declared as extern. Also like C, data and functions whose names are to be visible only within the current file are declared static, and data can be declared register as a hint to the compiler that the value need not be placed in memory or as const to suggest that the data is temporarily read-only. SWARC integer values have attributes similar to C integers by default, signed and using modular arithmetic, but can also be explicitly declared as such; the alternatives are, respectively, unsigned and saturation.
The language specification is long and not yet complete, but understanding SWARC data typing is really the key. Operating on these first-class arrays is done using ordinary C operators and SIMD semantics. For example, the if statement, if given an array-valued conditional, will perform the usual SIMD enable-masking when executing the clauses. There are also a few new statements, operators, and some extended semantics. For example, there is a where construct that behaves like if, but always executes the clauses (i.e., unlike if, it implements enable masking only, not control flow). There are new binary operators for maximum, minimum, and average. Further, as was done in Thinking Machines' C* language, assignment operators can be used to perform unary reductions, e.g., +=y would return the single value that is the sum of the elements of y. Most of these extensions should sound familiar if you have used SIMD supercomputers like the old Thinking Machines, MasPar, and DAP systems.
This is not a serious summary, but you can get a reasonable feel for the language by scanning through the PCCTS grammar that was used to build the SWARC compiler's front-end:
swarc : ( CCODE | ( { "extern" | "static" } ( data | func ) ) )+ ; data : type Name ( "," Name )* ";" ; func : "void" Name "(" ( "void" | ( type Name ( "," type Name )* ) ) ")" ( ( block ) | ";" ) ; type : ( "register" | "const" | "modular" | "saturation" | "signed" | "unsigned" )* ( "char" | "short" | "int" | "float" ) { ":" { num } } { "[" ( expr | ) "]" } | "typeof" "(" expr ")" ; block : "{" ( data )* { stat ( stat )* } "}" ; stat : block | CCODE | Name ":" stat | "goto" Name ";" | "if" "(" expr ")" stat { "else" stat } | "while" "(" expr ")" stat | "do" stat "while" "(" expr ")" ";" | "for" "(" expr ";" expr ";" expr ")" stat | "continue" { expr } ";" | "break" { expr } ";" | "return" ";" | ";" | "where" "(" expr ")" stat { "elsewhere" stat } | "everywhere" stat | ident "(" { expr ( "," expr )* } ")" ";" | expr ";" ; expr : expr0 ; expr0 : expr1 { "=" expr | "&&=" expr | "||=" expr | "?>=" expr | "?<=" expr | "+/=" expr | "+=" expr | "-=" expr | "*=" expr | "/=" expr | "%=" expr | ">>=" expr | "<<=" expr | "&=" expr | "^=" expr | "|=" expr } ; expr1 : expr2 { "?" expr2 ":" expr2 } ; expr2 : expr3 ( "||" expr3 )* ; expr3 : expr4 ( "&&" expr4 )* ; expr4 : expr5 ( "|" expr5 )* ; expr5 : expr6 ( "^" expr6 )* ; expr6 : expr7 ( "&" expr7 )* ; expr7 : expr8 ( "==" expr8 | "!=" expr8 )* ; expr8 : expr9 ( "<" expr9 | ">" expr9 | "<=" expr9 | ">=" expr9 | "?>" expr9 | "?<" expr9 | "+/" expr9 )* ; expr9 : expr10 ( ">>" expr10 | "<<" expr10 )* ; expr10 : expr11 ( "+" expr11 | "-" expr11 )* ; expr11 : expr12 ( "*" expr12 | "/" expr12 | "%" expr12 )* ; expr12 : "-" expr12 | "!" expr12 | "~" expr12 | "++" expr12 | "--" expr12 | "(" type ")" expr12 | "sizeof" "(" ( type | ident ) ")" | expr13 { ( "[" expr "]" ) | ( "[<<" expr "]" ) | ( "[<<%" expr "]" ) | ( "[>>" expr "]" ) | ( "[>>%" expr "]" ) | "++" | "--" } | "&&=" expr12 | "||=" expr12 | "?>=" expr12 | "?<=" expr12 | "+/=" expr12 | "+=" expr12 | "&=" expr12 | "*=" expr12 | "|=" expr12 | "^=" expr12 ; expr13 : num | vnum | ident | "(" expr ")" ; ident : Name ; vnum : "{" expr { ".." expr } ( "," expr { ".." expr } )* "}" | String ( String )* ; num : Octal_Number | Decimal_Number | Hex_Number | Binary_Number | Char | "widthof" "(" expr ")" | "precisionof" "(" expr ")" ;
The SWARC compiler is called Scc (for Swar C compiler). It is a very sophisticated, roughly 50K source line, research compiler, using a variety of compiler optimization technologies that are truly "bleeding edge." It was created from scratch using PCCTS and other compiler tools developed by our group at Purdue, and the intent is to make it a Public Domain release, stable enough to be usable for production work.
Right now, as an alpha-test version, a lot of stuff works and a lot (e.g., 3DNow! floating point) doesn't. Currently, it only generates code for MMX variants... although it does understand the variations quite well. We are currently pursuing funding to continue our work in implementing new targets and developing new optimizations.
Despite its "researchy" status, Scc provides a command line interface very similar to that of the Gnu C compiler, gcc; in fact, Scc automatically invokes portions of gcc to handle SWARC source preprocessing and the relatively mundane compilation of Scc-generated C code. Like gcc, Scc associates file types with the suffix of the file name. The following file types are understood:
The command line options implemented thus far for Scc include:
Thus, the basic output of Scc is C code (including inline native assembly code for SWAR operations) and a C interface header file, but Scc is capable of directly producing executable files. In use, Scc handles SWARC code nearly identically to how gcc handles C code.
The following is a simple SWARC program, test.Sc:
void addfour(int[2]x) { x += 4; } ${ main() { int a[2]; a[0] = 1; a[1] = 2; addfour(a); printf("a={%d,%d}\n", a[0], a[1]); } $}
This program doesn't really show-off many of the language or compiler features, but is short enough that we can reasonably include all phases of the code for this example. The addfour function is SWARC code that takes a C-layout first-class array of 2 integers (passed by address), and simply adds 4 to each of its elements. The main is ordinary C code, which can be used in-line within SWARC programs as shown above.
Compiled by Scc -Sc -P, the header file Scout.h is generated as:
/* compiled by Scc alpha-test v980728 for generic IA32 MMX target */ extern void addfour(int *x);
And the C code file Scout.c:
/* compiled by Scc alpha-test v980728 for generic IA32 MMX target */ #include "Sc.h" void addfour(int *x) { extern mmx_t mmx_cpool[]; register mmx_t *_cpool = &(mmx_cpool[0]); { movq_m2r(*(((mmx_t *) x) + 0), mm0); paddd_m2r(*(_cpool + 2), mm0); movq_r2m(mm0, *(((mmx_t *) x) + 0)); } _return: emms(); } main() { int a[2]; a[0] = 1; a[1] = 2; addfour(a); printf("a={%d,%d}\n", a[0], a[1]); } /* MMX constant pool */ mmx_t mmx_cpool[] = { /* 0 */ 0x0000000000000000LL, /* 1 */ 0xffffffffffffffffLL, /* 2 */ 0x0000000400000004LL };
The actual assembly language translation of the program, as generated by Scc test.Sc -S -O2 is:
.file "Scout.c" .version "01.01" gcc2_compiled.: .text .align 4 .globl addfour .type addfour,@function addfour: pushl %ebp movl %esp,%ebp movl 8(%ebp),%edx movl $mmx_cpool,%eax #APP movq (%edx), %mm0 paddd 16(%eax), %mm0 movq %mm0, (%edx) emms #NO_APP leave ret .Lfe1: .size addfour,.Lfe1-addfour .section .rodata .LC0: .string "a={%d,%d}\n" .text .align 4 .globl main .type main,@function main: pushl %ebp movl %esp,%ebp subl $8,%esp movl $1,-8(%ebp) movl $2,-4(%ebp) leal -8(%ebp),%eax pushl %eax call addfour pushl -4(%ebp) pushl -8(%ebp) pushl $.LC0 call printf leave ret .Lfe2: .size main,.Lfe2-main .globl mmx_cpool .data .align 4 .type mmx_cpool,@object mmx_cpool: .long 0 .long 0 .long -1 .long -1 .long 4 .long 4 .size mmx_cpool,24 .ident "GCC: (GNU) 2.7.2.3"
Notice that the MMX instructions appear without additional overhead, simply using gcc's inline assembly facility with our libmmx wrappers. However, the gcc code does have a significant flaw: the last .align 4 should really have been an .align 8 to ensure optimum performance, but this minor bug will be fixed in future versions of Scc (if it is not fixed in gcc first).
Compiling test.Sc directly into an executable file called test is done by Scc test.Sc -o test. When the file test is executed on an MMX-capable machine, the output is simply:
a={5,6}
Ok, so by now you're starting to wonder how clever this compiler is... after all, you could have written those three MMX instructions yourself. Well, you asked for it. Here's a trivial little example that doesn't generate trivial little MMX code:
char[14] c; void nasty(unsigned int:2[32]two, typeof(two) twotoo, unsigned int:4[32]four) { /* Two bit divide... */ two = two / twotoo; /* Four bit field average... but first must promote 2->4 bit */ four +/= two; /* Eight bit add... of missaligned c to itself rotated */ c += c[<<% 1]; }
This little gem is called torture.Sc, and was compiled by Scc torture.Sc -Sc -P -O2000 to generate the following code. But wait a second... -O2000? Doesn't that mean the Scc optimizer can take up to 2000 seconds? Yup... but it actually took 0.86 seconds for the full compile, meaning that the block was scheduled optimally (at least according to our timing model for a generic MMX target). The output C code is:
/* compiled by Scc alpha-test v980728 for generic IA32 MMX target */ #include "Sc.h" char c[14]; void nasty(mmx_t *two, mmx_t *twotoo, mmx_t *four) { extern mmx_t mmx_cpool[]; register mmx_t *_cpool = &(mmx_cpool[0]); { movq_m2r(*(((mmx_t *) two) + 0), mm0); movq_m2r(*(((mmx_t *) twotoo) + 0), mm1); movq_r2r(mm1, mm2); psrlq_i2r(1, mm1); movq_r2r(mm0, mm3); psrlq_i2r(1, mm0); pandn_r2r(mm3, mm1); movq_r2r(mm3, mm4); pand_r2r(mm0, mm3); movq_r2r(mm2, mm5); pandn_r2r(mm0, mm2); por_r2r(mm3, mm1); por_r2r(mm2, mm1); pandn_r2r(mm4, mm5); pand_m2r(*(_cpool + 2), mm1); pand_m2r(*(_cpool + 3), mm5); por_r2r(mm5, mm1); movq_m2r(*(((mmx_t *) four) + 0), mm6); movq_r2m(mm1, *(((mmx_t *) two) + 0)); movq_m2r(*(((mmx_t *) four) + 1), mm7); movq_r2r(mm1, mm5); pand_m2r(*(_cpool + 4), mm1); movq_r2r(mm5, mm4); pand_m2r(*(_cpool + 5), mm5); psllq_i2r(16, mm5); por_r2r(mm5, mm1); movq_r2r(mm1, mm5); pand_m2r(*(_cpool + 6), mm1); pand_m2r(*(_cpool + 7), mm5); psllq_i2r(8, mm1); por_r2r(mm1, mm5); movq_r2r(mm5, mm1); pand_m2r(*(_cpool + 8), mm5); pand_m2r(*(_cpool + 9), mm1); psllq_i2r(4, mm5); por_r2r(mm5, mm1); movq_r2r(mm1, mm5); pand_m2r(*(_cpool + 10), mm1); pand_m2r(*(_cpool + 11), mm5); psllq_i2r(2, mm1); psrlq_i2r(32, mm4); por_r2r(mm1, mm5); movq_r2r(mm4, mm1); pand_m2r(*(_cpool + 4), mm4); pand_m2r(*(_cpool + 5), mm1); psllq_i2r(16, mm1); por_r2r(mm1, mm4); movq_r2r(mm4, mm1); pand_m2r(*(_cpool + 6), mm4); pand_m2r(*(_cpool + 7), mm1); psllq_i2r(8, mm4); por_r2r(mm4, mm1); movq_r2r(mm1, mm4); pand_m2r(*(_cpool + 8), mm1); pand_m2r(*(_cpool + 9), mm4); psllq_i2r(4, mm1); por_r2r(mm1, mm4); movq_r2r(mm4, mm1); pand_m2r(*(_cpool + 10), mm4); pand_m2r(*(_cpool + 11), mm1); psllq_i2r(2, mm4); movq_r2r(mm6, mm0); psrlq_i2r(1, mm6); por_r2r(mm4, mm1); pand_m2r(*(_cpool + 12), mm6); movq_r2r(mm5, mm4); psrlq_i2r(1, mm5); pand_m2r(*(_cpool + 12), mm5); por_r2r(mm4, mm0); paddd_r2r(mm5, mm6); pand_m2r(*(_cpool + 13), mm0); movq_r2r(mm7, mm5); psrlq_i2r(1, mm7); paddd_r2r(mm0, mm6); pand_m2r(*(_cpool + 12), mm7); movq_r2r(mm1, mm4); psrlq_i2r(1, mm1); pand_m2r(*(_cpool + 12), mm1); por_r2r(mm4, mm5); paddd_r2r(mm1, mm7); pand_m2r(*(_cpool + 13), mm5); movq_r2m(mm6, *(((mmx_t *) four) + 0)); paddd_r2r(mm5, mm7); movq_m2r(*(((mmx_t *) &c) + 0), mm5); movq_r2m(mm7, *(((mmx_t *) four) + 1)); movq_m2r(*(((mmx_t *) &c) + 1), mm4); movq_r2r(mm4, mm1); pand_m2r(*(_cpool + 14), mm4); movq_r2r(mm4, mm0); psllq_i2r(8, mm4); movq_r2r(mm5, mm3); psrlq_i2r(56, mm5); pand_m2r(*(_cpool + 15), mm4); pand_m2r(*(_cpool + 16), mm5); movq_r2r(mm3, mm2); psllq_i2r(8, mm3); por_r2r(mm5, mm4); pand_m2r(*(_cpool + 15), mm3); psrlq_i2r(40, mm0); pand_m2r(*(_cpool + 17), mm0); por_r2r(mm0, mm3); paddb_r2r(mm1, mm4); paddb_r2r(mm3, mm2); pand_m2r(*(_cpool + 14), mm4); movq_r2m(mm2, *(((mmx_t *) &c) + 0)); pand_m2r(*(_cpool + 18), mm1); por_r2r(mm1, mm4); movq_r2m(mm4, *(((mmx_t *) &c) + 1)); } _return: emms(); } /* MMX constant pool */ mmx_t mmx_cpool[] = { /* 0 */ 0x0000000000000000LL, /* 1 */ 0xffffffffffffffffLL, /* 2 */ 0x5555555555555555LL, /* 3 */ 0xaaaaaaaaaaaaaaaaLL, /* 4 */ 0x000000000000ffffLL, /* 5 */ 0x00000000ffff0000LL, /* 6 */ 0x0000ff000000ff00LL, /* 7 */ 0x000000ff000000ffLL, /* 8 */ 0x00f000f000f000f0LL, /* 9 */ 0x000f000f000f000fLL, /* 10 */ 0x0c0c0c0c0c0c0c0cLL, /* 11 */ 0x0303030303030303LL, /* 12 */ 0x7777777777777777LL, /* 13 */ 0x1111111111111111LL, /* 14 */ 0x0000ffffffffffffLL, /* 15 */ 0xffffffffffffff00LL, /* 16 */ 0x00000000000000ffLL, /* 17 */ 0x0000000000ffffffLL, /* 18 */ 0xffff000000000000LL };
Ok. So our compiler really does do something non-trivial.
Still, the real question is how fast does the output code run? We don't really know... we don't yet have one of each of the different types of MMX targets. However, the compiler timing estimates on this code were surprisingly consistent. The AMD K6 and Cyrix 6x86MX and MII processors all yielded code that Scc estimated was about 4.7x as fast as word-oriented serial C code; for the AMD K6-2 and Intel PentiumII and Pentium with MMX, Scc predicts a 6.0x speedup (as it also does for the generic MMX model). Why? In this case, all the targets are using the same MMX instructions, so it is just a matter of pipeline scheduling, and having two pipes beats having only one.