Rather than asking the programmer to recode the program for every target architecture or data characteristic, some languages offer higher-level data structures and operations that hide the lower-level operations, and may provide a mechanism for the user to hint or suggest how to optimize them. For example, Fortran90 provides array operations, so the above programs could be expressed directly in that language, and HPF additionally provides "structured comments" that allow the programmer to say how the rows and columns of those arrays should be distributed among processors, which in turn implies which processor will be used to perform operations. Even so, for any particular language, that approach only solves the issue for those particular abstractions which the language designers have chosen to include as primitives, and even then, only in those cases and for those platforms which the compiler (and runtime) implementers have spent the time and effort to optimize. Hinting and mapping mechanisms, like those used in HPF, are also restricted in their expressiveness (i.e. they rely on the "owner computes" rule).
Software Cabling provides a much more general approach to solving such problems, one which does not depend on language designers and implementers staying a step ahead of any particular programmer's wishes. The general approach is for the programmer to first state the algorithm in the most general and least restrictive terms possible, by specifying the operations that must be performed on each input to produce outputs while excluding mention of the order in which operations must execute relative to one another, except for cases where one operation directly depends upon the other. The result of this step is an executable algorithm, or program, which can run on a number of platforms, but it may not be obvious to a runtime system how to execute it optimally on some or all of them. After the initial programming task, Software Cabling then allows the programmer to offer much more help in suggesting how it might be executed on a particular platform using mapping and specialization.

This module is a board (as indicated by its graphical, schematic-like representation) as opposed to a chip (which would be represented in a common textual programming language). The module has 6 pins (roughly corresponding to arguments), one for each of the interface memories (aka i-memories), which are indicated by the rectangles separated into three zones by solid lines. The pins (arguments) have the same names as those i-memories: inmat1 and inmat2 are the input matrices, outmat is the result matrix, size1 is the number of rows in inmat1 and outmat, size2 is the number of columns in inmat2 and outmat, and sizemid is the other dimension of both input arrays. (SC arrays don't really have rows and columns: The terms are used here only for convenience to mean the first and second dimensions of the arrays, respectively.) inmat1, inmat2, and outmat are shown to be two-dimensional arrays by the two hashmarks along their left edges. The size of inmat1 in its second dimension, and of inmat2 in its first dimension, is given by the range in sizemid, as indicated by the wires (lines) from sizemid to the corresponding dimension hashmarks on the others. The board also contains a socket (circle) with six wires (lines). Each wire corresponds to a receptacle on the socket, and those with names (in this case, v1, v2, vlen, and o) can accept a pin of the same name on whatever module is plugged into the socket--in this case, the dot product module, called fdotprod. The other two wires, going to size1 and size2, are described below, but the reason they don't accept pins is that their names (after any special prefixes like "*", "+", "@", or "?") begin with "_", and this effectively hides these wires from the module in that socket.
The fmatmult module does its job by replicating the middle socket once for every element in the output array--i.e. once for every value in the range present on the size1 memory and for every value in the range present on the size2 memory. This replication is performed by the "dupall" operation, represented by the asterisk prefixing the label on these wires (lines). Each resultant socket (i.e. resulting from the replication), instead of seeing the original value ranges on these wires, sees a unique combination of integers from within those ranges. The socket's binding modifiers (shown by the arrows within the circle together with the legend to its upper right) show that the value from the size1 range for each socket instance will be used to index the first dimension of both the inmat1 array and the outmat array, while the value from the size2 range for each socket instance will be used to index the second dimension of both the inmat2 array and the outmat array.
The result, then, is that each socket instance will contain the fdotprod module (as per the label), and each will be bound to a single row of inmat1 on its v1 receptacle, a single column of inmat2 on its v2 receptacle, and one element of outmat on its o receptacle, with the vlen receptacle being bound to the sizemid memory. The fdotprod module, described below, will know only that it is getting two input vectors on its v1 and v2 pins and a length on its vlen pin, and that it is producing a scalar on its o pin.
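As a rough analogy only (Python rather than SC notation), the structure described above can be sketched as one independent dot product per output element. The function and parameter names mirror the pins and receptacles in the text; representing ranges as Python range objects starting at 0, representing arrays as nested lists, and using a shuffle to stand in for the absence of any ordering between socket instances are all assumptions of this sketch.

```python
import random
from itertools import product

def fdotprod(v1, v2, vlen):
    """Sees only two vectors and an index range; produces a scalar."""
    return sum(v1[k] * v2[k] for k in vlen)

def fmatmult(inmat1, inmat2, size1, size2, sizemid):
    """One fdotprod instance per output element, run in arbitrary order."""
    outmat = [[None] * len(size2) for _ in size1]
    # Analogue of dupall on both wires: one instance per (i, j) combination.
    instances = list(product(size1, size2))
    random.shuffle(instances)  # no ordering between instances is implied
    for i, j in instances:
        row = inmat1[i]                        # v1: first dimension of inmat1 indexed by i
        col = [inmat2[k][j] for k in sizemid]  # v2: second dimension of inmat2 indexed by j
        outmat[i][j] = fdotprod(row, col, sizemid)  # o: a single element of outmat
    return outmat
```

Because each instance touches its own output element and only reads the inputs, the result does not depend on the order chosen by the shuffle.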
Recall that, during the execution of a board in SC, the color, or control state, of each (element of each) memory can change, rather like the data state (the "value") on the memory (element). As its name implies, control state controls when modules can access the memory (element). Specifically, the pin of a module can only access a memory (element) through a ready receptacle--i.e. one represented by a wire which is the same color as the memory it connects to. In this case, then, each replicated socket containing fdotprod can only access the memories when they are green, which is the control state that all memories start with. The colored signal dot(s) immediately adjoining where the wire connects to the memory rectangle shows the color(s) which the memory might change to after the access, so in this case, the inmat1 and inmat2 elements will remain green, as will vlen, but the outmat elements will turn red as each is produced. size1 and size2 will turn red after all of the replicated sockets have been created (as dictated by the semantics of dupall).
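The readiness rule can be illustrated with a toy model in Python. This is not SC's runtime: the Memory and Wire classes, the single signal color per wire, and the fire function are all hypothetical simplifications introduced here (the text allows a wire to have several signal dots).

```python
from dataclasses import dataclass

@dataclass
class Memory:
    value: object = None
    color: str = "green"   # all memories start green in the example above

@dataclass
class Wire:
    memory: Memory
    color: str    # the receptacle is ready only while the memory has this color
    signal: str   # color the memory takes after the access (one dot assumed)

def ready(wires):
    """A socket can run only when every one of its receptacles is ready."""
    return all(w.memory.color == w.color for w in wires)

def fire(wires, action):
    """Run the action if the socket is ready, then apply the signal colors."""
    if not ready(wires):
        return False
    action()
    for w in wires:
        w.memory.color = w.signal
    return True

# Usage: an input that stays green and an output that turns red when produced.
inmat = Memory(value=2.0)
outmat = Memory()
wires = [Wire(inmat, "green", "green"), Wire(outmat, "green", "red")]

def produce():
    outmat.value = inmat.value * 2

first = fire(wires, produce)    # ready: runs and turns outmat red
second = fire(wires, produce)   # outmat is no longer green, so this does not run
```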
Recall also that an interface memory (like all of the memories in this board) will only be green if the pin it represents is in turn in a ready receptacle in its socket (i.e. in the "parent" board), and when an interface memory turns the color of one of its posting dots (i.e. the dots in the rightmost zone), the board is "done" with the memory (at least for the time being) and control (i.e. the control state, or color) returns to the memory in the parent board bound to the corresponding receptacle in the parent socket. In this case, each i-memory is flagged as "atomic" (with an "A" in the upper right corner), meaning that the fmatmult board won't even start executing until all of its pins are in ready receptacles in its parent socket, and that the i-memories will remain green until changed to some other color on the fmatmult board itself (rather than by something on the parent board). The atomicity of inmat1, inmat2, and sizemid i-memories also means that even though their control states never change to the color of their posting dots (red), the parent board can still regard them as "predictable" (i.e. not written to, and with only one posting dot), so the parent can predictively repossess the control state.

This next diagram represents the board named fdotprod, which is used above to perform a floating point dot product. The board has four pins in its interface: v1 and v2, which are the input vectors, vlen, which is the input length (actually the range of indices) of those vectors, and o, which is the resulting scalar. The board contains two other memories, prods and an unnamed memory in the lower right, which are not i-memories (and therefore do not correspond to pins), as indicated by the fact that they are separated into only two zones (by one solid line). [A newer revised notation will make this difference more apparent.] v1, v2, and prods are all one-dimensional arrays (vectors), as denoted by the single hashmark on their left side.
The socket (circle) on the left is replicated once for every number in the range found on vlen, as represented by the asterisk (dupall) prefix on the corresponding wire's label. Each socket instance performs the pairwise floating point multiplication ($fmult) of corresponding elements from v1 and v2 into the corresponding element of prods (all as shown by the three binding modifier arrows within the socket), and the socket on the right adds the resulting floating point products from prods together, in no particular order, producing the output memory, o. The dollar sign in front of fmult signifies that it is an SC-defined constant (in this case, module). Note that, as a result of the dupall and binding modifiers on the left socket, each $fmult instance operates in a very simple environment where it gets one floating point number on its i1 receptacle, another on its i2 receptacle, and produces a result on its o receptacle. The $fmult instances are oblivious to the fact that array bindings are used.
The module in the right socket, ac_red, is defined below, and can perform any associative and commutative scalar reduction which it finds on its red_op receptacle. In this case, that memory is bound to a memory initialized (using the "=" in the label) to a floating point add ($fadd) operation. ac_red finds the input elements for the reduction on its i receptacle (in this case bound to the prods vector), in the range found on its vlen receptacle (here bound to the vlen memory), producing a result on its o receptacle (here bound to the o memory).
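The two stages of fdotprod can be sketched in Python as follows. This is an illustration, not SC code: the red_op and identity parameters are assumptions of the sketch (passing the identity explicitly is a convenience here), and the shuffles stand in for "no particular order". Floating point addition is only approximately associative, so different orders can differ in the last bits for general inputs.

```python
import operator
import random
from functools import reduce

def fdotprod(v1, v2, vlen, red_op=operator.add, identity=0.0):
    """Pairwise products into prods, then an order-free reduction into o."""
    prods = [None] * len(v1)
    order = list(vlen)
    random.shuffle(order)             # left socket: one multiply per index, any order
    for k in order:
        prods[k] = v1[k] * v2[k]      # each multiply sees only two scalars
    random.shuffle(order)             # right socket: reduce prods in any order
    return reduce(red_op, (prods[k] for k in order), identity)
```

For example, `fdotprod([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], range(3))` adds the products 4.0, 10.0, and 18.0 in some order.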

The socket on the left will necessarily be the first one to have all of its receptacles ready. Instead of being labeled with a module name, it has a special receptacle called $mod, from which the socket itself will read the module to be placed into the socket (in this case, the reduction operator provided by the parent), after which that memory is turned red as per the associated signal dot. Only one receptacle for the reduction operator module, called identity, is bound by that socket, and it will write the identity for the provided reduction operator into the o memory and turn it red.
Once the control state of red_op has changed to red, it allows the socket on the top right to replicate (as per the dupall wire) and the replicants to execute, in no particular order, performing the reduction operator for every input element (on i) to o, while also changing the corresponding elements of todo to red. Note that different receptacles (and therefore pins) are bound in this socket than in the left socket. This approach can be used to effect overloading.
When all of the elements of todo (within the range on vlen) have turned red, the socket at the bottom, holding the predefined do-nothing module $null, can execute, having the primary function of turning the o memory to the color of its posting dot, thus returning the answer to the parent board and signifying completion. Note that the legend for the bottom socket shows that the binding modifier is a dimension number ("1") in brackets ("[..]"), meaning that the range on memory vlen is used to restrict the elements of todo to which the socket is bound to those in the first dimension within that range.
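The three phases of this board (write the identity, fold in every input in any order, finish once todo is entirely red) can be mimicked in Python. The RedOp class below is a hypothetical stand-in for a reduction operator module that exposes both an identity and a combining operation; it is not part of SC, and the color strings are only labels.

```python
import operator
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class RedOp:
    """Hypothetical stand-in for a reduction operator module."""
    identity: float
    combine: Callable[[float, float], float]

def ac_red(red_op, i, vlen):
    o = red_op.identity                  # left socket: write the identity into o
    todo = {k: "green" for k in vlen}    # one control state per input element
    pending = list(vlen)
    random.shuffle(pending)              # replicated instances run in no particular order
    for k in pending:
        o = red_op.combine(o, i[k])      # fold one input element into o
        todo[k] = "red"                  # record that this element has been reduced
    # Bottom socket: may run only once every element of todo is red.
    assert all(state == "red" for state in todo.values())
    return o

FADD = RedOp(identity=0.0, combine=operator.add)   # analogue of $fadd
total = ac_red(FADD, [1.0, 2.0, 3.0], range(3))
```

Supplying a different RedOp, such as `RedOp(float("-inf"), max)`, reuses the same board-like structure for another associative and commutative reduction.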

Like the previous implementation, the top part of the board takes care of the reduction, the todo memory keeps track of when the reductions are done, and the bottom part handles the finishing up, but now, instead of reducing all of the input into a single scalar memory (o), a vector of partial results is kept (partials). The three sockets along the top each perform a single scalar reduction, but each in a different way. The first reduces any two input values together (from the vector on the i pin) to produce a partial result (into any element of the partials vector). The second reduces any one partial result into any other. The third reduces any one input element into any partial result. Elements of the partials vector are green when unused (or emptied), and red when they hold a valid partial value. These three sockets can be performed in any order or in parallel. (In fact, assuming that the reduction op is atomic, which it usually will be, any one of these may execute concurrently with another instance of itself.) All three of these sockets obtain the reduction to be performed (i.e. their $mod receptacle) from the board's red_op pin, and each again binds only the receptacles it needs.
This implementation performs one fewer reduction operation than the original board, so one element of the todo vector must be changed to red without a reduction. This is performed by the middle socket on the bottom (holding the $null module). The other two sockets at the bottom (both containing a $copy chip) can only run when todo is all red, and one or the other is used to copy the answer to the output memory, o. There are two of them because the answer could be in two different places: It will usually be in partials, as the only red element there, but in the special case where there is only one input element to the board, no reductions need to occur, so the answer will still be in the only (still green) element in i. Internally, the $copy chip itself often doesn't copy anything: For example, in this case, the SC runtime can just internally rename the memory holding the result to "o".
The "any" prefix, represented by a question mark (?), is used on many of the wires here, to mean that that wire can be bound to any of the elements of the associated array memory, so it can only be used for arrays with known size--i.e. with a wire connected to each hashmark on their left. (For arrays of unknown size, the same goal can be achieved using the longhand dupany binding modifier, similar to the dupall discussed previously but using a "+" prefix instead of "*".) If there are two "any" wires from the same socket to the same memory, they will not both be bound to the same element.
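A small Python simulation can make the behavior of this revised board concrete. It is a sketch under stated assumptions: the random choices stand in for "any", the partials vector is modeled as a list holding only the currently valid (red) values, at least one input element is assumed, and the returned count of reductions is added purely for illustration.

```python
import operator
import random

def ac_red_partials(combine, i, vlen):
    """Reduce by repeatedly firing any one of the three enabled rules."""
    inputs = [i[k] for k in vlen]   # input elements not yet consumed
    partials = []                   # partial results currently holding a valid value
    reductions = 0
    while len(inputs) + len(partials) > 1:
        rules = []
        if len(inputs) >= 2:
            rules.append("input+input")
        if len(partials) >= 2:
            rules.append("partial+partial")
        if inputs and partials:
            rules.append("input+partial")
        rule = random.choice(rules)          # any enabled rule may fire next
        if rule == "input+input":            # two distinct inputs into a new partial
            a = inputs.pop(random.randrange(len(inputs)))
            b = inputs.pop(random.randrange(len(inputs)))
            partials.append(combine(a, b))
        elif rule == "partial+partial":      # one partial into another
            a = partials.pop(random.randrange(len(partials)))
            b = partials.pop(random.randrange(len(partials)))
            partials.append(combine(a, b))
        else:                                # one input into an existing partial
            a = inputs.pop(random.randrange(len(inputs)))
            p = random.randrange(len(partials))
            partials[p] = combine(partials[p], a)
        reductions += 1
    # The answer is the single remaining partial or, for one input, that input.
    result = partials[0] if partials else inputs[0]
    return result, reductions
```

Every rule replaces two live values with one, so n inputs always take n - 1 reductions regardless of which rules fire, and a single input takes none, which corresponds to the special case handled by the second $copy socket.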
For applications where those sorts of cases are likely, we can consider an alternate matrix multiply in a more streaming form, which is capable of performing any dot product as soon as its own unique input and output elements are ready, instead of waiting for all of them first. In fact, since the dot product has already been written in a streaming form, any single multiplication or addition is capable of being performed as soon as possible:

The main difference between this and the original implementation is the loss of the "A"s (to represent "atomic") on some of the i-memories, and the addition of two memory arrays (t1 and t2) and four sockets containing $null modules. These are needed to figure out when the board is finished with each input element, so that they can be returned to the parent at the appropriate time. Simply put, a row of inmat1 can be returned when the associated row of o has been computed, and a column of inmat2 can be returned when the associated column of o has been computed. However, since the elements of o (and their control states) are returned to the parent just as soon as they are produced, and since the parent may even return some of those control states for the next iteration/matrix before the last iteration/matrix has finished, the control states of o are not a reliable source for that information. To make up for that, the control states of o are echoed into local arrays t1 and t2, and the $null modules between t1 and inmat1, and between t2 and inmat2 fire when those conditions are satisfied, returning the appropriate row or column to the parent and resetting the control state of that row or column in t1 or t2 for the next iteration/matrix.
The other two $null sockets (at the top and bottom) are solely to change the control state of the input elements from green to red as they become available. This is necessary because without it, both the wire and the accompanying signal dots between the fdotprod socket and the inmat1 and inmat2 arrays would need to be green--and signal dots in non-atomic i-memories are not allowed to be green. (Reasons for this constraint are sound, but beyond the scope of this document.)
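The bookkeeping role of t1 and t2 can be approximated in Python with counters. This is a simplification rather than a rendering of the board: the counters replace the echoed control states, the events list is a hypothetical trace added for illustration, and the shuffle models dot products completing in any order.

```python
import random
from itertools import product

def streaming_matmult(inmat1, inmat2, n_rows, n_cols, n_mid):
    """Emit each output element as it completes; release inputs when unneeded."""
    outmat = [[None] * n_cols for _ in range(n_rows)]
    t1 = [0] * n_rows   # per row of inmat1: output elements finished so far
    t2 = [0] * n_cols   # per column of inmat2: output elements finished so far
    events = []
    work = list(product(range(n_rows), range(n_cols)))
    random.shuffle(work)
    for r, c in work:
        outmat[r][c] = sum(inmat1[r][k] * inmat2[k][c] for k in range(n_mid))
        events.append(("out", r, c))          # element goes back to the parent at once
        t1[r] += 1
        t2[c] += 1
        if t1[r] == n_cols:                   # whole output row computed
            events.append(("return_row", r))  # row r of inmat1 can be returned
            t1[r] = 0                         # reset for the next matrix
        if t2[c] == n_rows:                   # whole output column computed
            events.append(("return_col", c))  # column c of inmat2 can be returned
            t2[c] = 0
    return outmat, events
```

Keeping this state locally, rather than inspecting the output elements, matches the reasoning in the text: once an output element has been handed back, its control state belongs to the parent and can no longer be relied upon.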
Although the actual techniques used by Software Cabling tools to effect optimization are beyond the scope of this document, the general question is: How does one take these sorts of factors into account while still keeping the program portable--in fact, without changing the program at all, in some sense? SC uses two approaches: Mapping and specialization. Mapping means providing instructions to the scheduler about how the operations should be assigned to processors. Specialization means providing instructions about which operations should not execute: it is used to help a scheduler make decisions when it has many different choices (such as in the top half of the revised ac_red board). Mapping cannot change the possible behavior of the program, only the performance. For specialization, care must be taken to ensure it does not cause some portion of the computation to stop before the original program says it is done (thereby violating either the liveness semantics of the model, or the semantics of the program).
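The liveness caveat for specialization can be illustrated with a toy Python model of the revised reduction. This is not how SC expresses specialization: the allowed set of rule names and the None return value for a stalled computation are inventions of this sketch, meant only to show how forbidding operations can stop a computation early.

```python
import operator

def reduce_with_rules(combine, values, allowed):
    """Reduce using only the permitted rules; return None if nothing can fire."""
    inputs = list(values)
    partials = []
    while len(inputs) + len(partials) > 1:
        if "input+input" in allowed and len(inputs) >= 2:
            b = inputs.pop()
            a = inputs.pop()
            partials.append(combine(a, b))
        elif "input+partial" in allowed and inputs and partials:
            partials[-1] = combine(partials[-1], inputs.pop())
        elif "partial+partial" in allowed and len(partials) >= 2:
            b = partials.pop()
            a = partials.pop()
            partials.append(combine(a, b))
        else:
            return None   # stalled: work remains but no permitted rule is enabled
    return partials[0] if partials else inputs[0]

ALL = {"input+input", "input+partial", "partial+partial"}
TREE_ONLY = {"input+input", "partial+partial"}

full = reduce_with_rules(operator.add, [1.0, 2.0, 3.0], ALL)           # completes
even = reduce_with_rules(operator.add, [1.0, 2.0, 3.0, 4.0], TREE_ONLY)  # completes
odd = reduce_with_rules(operator.add, [1.0, 2.0, 3.0], TREE_ONLY)      # stalls
```

In this model the tree-only restriction finishes for four inputs but leaves one input and one partial stranded for three, which is the kind of premature stop that a specialization must be checked against.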
A Software Cabling implementation also serves as a high-level design and documentation for the program, which necessarily remains valid throughout the program lifetime, because it is the program. And while some of the examples shown here have dealt with implementation issues beyond what one would usually encounter in other languages, this has primarily resulted from the fact that many of the complexities of those other languages have been dealt with quickly, and the additional expressiveness and flexibility of Software Cabling has made it possible to then address more advanced issues. Specifically, the resulting programs have latency-hiding and formal correctness properties that are lacking in most other approaches, even while avoiding specific architecture-dependent properties, and this will become ever more important as high-performance architectures become larger, more complex, more varied, and more distributed.