Many-to-Many

It is common for processes that work together on a given problem to need to share data in other than a one-to-one association. That is, many processes may require the data held in one process, one process may require data held in many processes, or information may need to be shuffled or transposed among several processes. Although CDS allows processes to just move sharable data into comm cells to make it available to other processes using the standard communication mechanisms, it is often possible to minimize overall latency required by such one-to-many, many-to-one, and many-to-many operations by taking advantage of architecture-specific support and/or topology. CDS therefore provides a few high-level operations for these purposes, with the same basic philosophy as other CDS operations--i.e. that of allowing the programmer to express policy rather than mechanism. (The MPI term for similar operations is "collective", partially to describe the involvement of many processes to achieve the outcome, but CDS only requires the direct involvement of one process to perform these operations.)

Logical Model

To achieve the goals of efficient multi-process data exchange, CDS provides "multicast" versions of its enq and write operations, called cds_enqm and cds_writem, as well as two novel operations, called cds_rb and cds_tpose. The two multicast operations are identical to their unitary counterparts, except that they take a set or processes as targets instead of a single process, and the region is delivered into the specified context and cell within each of them. For the two novel operations, unlike every other CDS communication operation, the comm heap is not explicitly used--they effectively move data directly from cells in one set of processes (called the from-set) to cells in another set of processes (the to-set). It is quite common for the two sets to be identical, or for one (or both) of the sets to contain just one element. (In fact, if both sets contain only one element, then either operation will effectively just move a single region from one cell to another.)

cds_rb (which stands for "reduce-broadcast") effectively deqs a region from a specified cell in each process in the from-set, reduces those regions into a single region (by applying a commutative operation of the programmer's choice to the first item in each region to produce the first item in the result region, the second item in each region to produce the second, etc.), and then enqs that resulting region into a specified cell in each process in the to-set. Note that if the from-set contains only a single process, then cds_rb is effectively a broadcast (i.e. it just moves a single region from one process into a cell in one or more other processes), and if the to-set contains a single process, then cds_rb is effectively a reduction (i.e. it takes a region from one or more processes, reduces them into one region, and places that into a cell in the single process of the to-set).

cds_tpose (which stands for "transpose") serves to gather and/or scatter data from a specified cell in the from-set to a specified cell in the to-set. Specifically, if the from-set contains f process IDs, and the to-set contains t process IDs, then cds_tpose effectively deqs a region containing t data items from a specified cell in each process in the from-set, and enqs a region containing f data items into a specified cell in each process in the to-set. The ith data item in the region enq'd into the jth process is equal to the jth data item in the region deq'd from the ith process. Note that, if the from-set contains only a single process, then this scatters a single region over the to-set processes, and if the to-set contains only a single process, this gathers data from the from-set processes into a single region.

Both cds_rb and cds_tpose may be expensive to start up, since each may spend a significant amount of time analyzing the from-set and to-set to determine the most efficient way to perform the operation. Partially for this reason, each of these operations takes a replicator argument--i.e. an integer count, which can be infinity--and the operation is performed that number of times before completing. One iteration does not necessarily complete before the next begins, but regions will always be deq'd or enq'd from any particular cell in the proper order. The iterator has the effect of converting some cells to full-time use by the many-to-many operations for some period, thereby making a broadcast or reduction during that period as simple as enqing the inputs and deqing the results.

Syntax

cds_enqm (rgid,nprocs,procs,cntxt,cell,perm)
cds_writem(rgid,nprocs,procs,cntxt,cell,perm,waitflg)
cds_rb(iter,nitems,type,reduct,
       fmsetn,fmset,fmcntxt,fmcell,
       tosetn,toset,tocntxt,tocell)
cds_tpose(iter,nbytes,
       fmsetn,fmset,fmcntxt,fmcell,
       tosetn,toset,tocntxt,tocell)
cds_irb(&ithrd,iter,nitems,type,reduct,
       fmsetn,fmset,fmcntxt,fmcell,
       tosetn,toset,tocntxt,tocell)
cds_itpose(&ithrd,iter,nbytes,
       fmsetn,fmset,fmcntxt,fmcell,
       tosetn,toset,tocntxt,tocell)

nprocs and procs are the number of processes and a vector of process IDS. iter is the number of times to perform the operation (and may be CDS_INFIN). insetn, fmset, fmcntxt, and fmcell are, respectively, the number of elements in the from-set, the from-set itself (as a vector of process IDs), and the context and cell to obtain the input regions. tosetn, toset, tocntxt, and tocell are the corresponding values for the to-set. nitems is the number of items to be read and written in each region, type is the data type of each of those elements, and reduct is the reduction to perform from the list (very similar to MPI): maximum (CDS_MAX), minimum (CDS_MIN), sum (CDS_SUM), product (CDS_PROD), bit-wise and (CDS_AND), bit-wise or (CDS_OR), bit-wise xor (CDS_XOR), max value with location (CDS_MAXLOC), and min value with location (CDS_MINLOC). ithrd is the thread ID for the non-blocking versions.

In cases where the from-set is identical to the to-set, it is possible (and recommended) that tosetn be set to CDS_SAME, and toset be specified as zero (i.e. null pointer). In other cases, where the from-set and to-set do not intersect at all, the caller can make the call slightly more efficient by negating the value of tosetn.

[Index] On to Process Control