basic

Basic Communication

All communication in CDS is based upon just a few basic primitives, to be introduced here. The user can use these primitives directly, or can use higher-level interfaces, described later, which are probably more familiar, and which will be formally defined in terms of these primitives.

The semantics (i.e. behavior) of the primitives will first be explained in the simplest way (under "Logical Model"), with no reference to implementation details. This description may make them seem rather inefficient. Then some methods of implementing those semantics efficiently on real machines is described (under "Physical Model"). While anybody can program CDS knowing only the logical model, knowledge of the most common physical implementations can sometimes help to tune programs for maximum efficiency and portability.

Logical Model

To perform communication, a CDS process is created by augmenting each traditional user process (i.e. the standard program code, stack, and heap, shown in yellow below) with two other areas (shown in blue):

A Comm Heap (which is logically private to the process) and
A set of Comm Cells (which are logically public to all processes), contained in one or more Cell Contexts

The comm heap is much like the standard program heap: regions are allocated using rgalloc (like Unix malloc), and logically deallocated using rgfree (like Unix free), and the user can (in most circumstances) directly access the data in these regions (e.g. using pointers).

Each comm cell is a globally-accessible container which can hold zero or more regions in a queue-like structure. The user cannot directly access the data in the regions held by a cell. Communication is performed by transferring regions between the local comm heap and the comm cells in any process. To access a particular cell, the accessing process must name the process holding the cell, the context within that process holding the cell, and the cell number.

Specifically, the communication primitives in CDS are:

rgalloc: Create a region in the Comm Heap
rgfree: Delete a region in the Comm Heap
enq: Copy a region from the Comm Heap and add it to the tail of a Cell
deq: Remove a region from the head of a Cell and put it into the Comm Heap.
write: Copy a region from the Comm Heap to a Cell, destroying any other regions which might be in the cell.
read: Copy the head region from a cell into the Comm Heap, without disturbing the Cell.
zap: Delete all regions in a cell.

To help CDS perform these actions efficiently (described later under Physical Model), the user process must agree to "cooperate" by informing CDS before modifying the contents of a region in the comm heap. This can be performed with yet anotherprimitive, rgmod, or by passing appropriate arguments to the last other primitive to handle the region before the modification.

deq and read allow the user to specify a time limit (which can be infinite) specifying the maximum time to block waiting for the cell to become non-empty. enq and write never block indefinitely, but there is another primitive, benq, which is identical to enq except that it will block indefinitely (or until it fails at the expiration of a user-specified time-out period) unless (1) the target cell is empty and (2) there is already a deq operation waiting on that cell. This provides functionality similar to a "synchronous send" in MPI, or (if a zero time-out is specified) a "ready send" in MPI. A time limit can also be specified for enq, meaning only that the caller advises (though does not require) that the primitive behave like benq for that long, after which it must proceed and unblock normally. This is most useful to try to avoid intermediate buffering when dealing with very large regions where the deq on the receiving end is expected to be part of a recv function (discussed in the next section).

deq, read, and benq each have a split transaction mode (similar to "non-blocking" operations in MPI), where the operation initiation and completion are separated, most useful when accessing distant cells to compensate for latency. (There is no significant benefit, or cost, to using the split form of these operations for local cell access.) benq's split transaction mode is useful for both local and remote cells. In each case, the first operation of the split transaction (ideq, iread, and ibenq) is semantically identical to starting a new thread containing only the blocking operation (i.e. minus the i prefix) and returning a logical thread id, and the second operation (wait) is semantically identical to waiting for the specified thread to complete. In each case, a new thread, visible to the user using traditional thread-management tools or utilities, may or may not actually be created. wait also allows a time limit to be specified, and whether or not the thread should be canceled if that limit is exceeded. This permits wait to double as an MPI-like cancel operation (e.g. by supplying a zero time limit).

[A time limit cannot be used on both the wait primitive and the primitive being waited on. One or the other must be specified as "indefinite" ("infinite")]

Syntax

The syntax of the above operations is as follows:

cds_rgalloc(&rgid,size)
cds_rgfree(rgid)
cds_rgmod(rgid,waitflg)
cds_write(rgid,proc,cntxt,cell,perm)
cds_enq(rgid,proc,cntxt,cell,perm,waitflg)
cds_benq(rgid,proc,cntxt,cell,perm,timeout)
cds_read(&rgid,proc,cntxt,cell,perm,timeout)
cds_deq(&rgid,proc,cntxt, cell,perm,timeout)
cds_ideq(&ithrd,&rgid,proc,cntxt,cell,perm)
cds_zap(proc,cntxt,cell)
cds_iread(&ithrd,&rgid,proc,cntxt,cell,perm)
cds_ibenq(&ithrd,rgid,proc,cntxt,cell,perm)
cds_wait(ithrd,timeout,cancel)

rgid is a region ID (returned by the operation if preceded by &). size is the size of the region to create, in bytes. In all cases, proc, cntxt, and cell together denote a specific cell by specifying the process in which the cell lives, the context within that process, and the cell number within that context, respectively. perm specifies what sort of access the process wants to the region after the operation:

CDS_PWRIT: modify permission, equivalent to executing cds_rgmod after the operation

CDS_PREAD: only read permission (no effect)

CDS_PNONE: no permission, equivalent to executing rgfree after the operation

CDS_NODTA: the region is retained (i.e. not rgfreed), but the data within the region cannot be accessed directly. (Rarely used, for efficiency only, see below.)

timeout specifies how long to block until failing, and may be CDS_BLOCK to wait forever. cancel (on cds_wait) specifies to cancel the thread on timeout if true (non-zero), else the thread continues to be waited on again later. >

Transfer interrupted!

/tt> on enq advises that the operation may be more efficient if it acts like a benq for a short period of time before reverting to standard enq semantics. A non-zero value of waitflg on rgmod advises that the operation may be more efficient if it blocks for a short period of time. (See Physical Model for more discussion of waitflg.) In terms of the logical model, the value of waitflg on these operations has no effect whatsoever.

*rgid serves as a pointer to the data on the region (i.e. **rgid addresses the first byte in the region), unless CDS_NODTA was specified as a permission. The value of *rgid may change when rgid is passed to rgmod or another primitive with CDS_PWRIT permission, so it should not be held in a temporary variable across such a call.

Physical Model

Cells will typically hold only pointers to regions, rather than regions themselves, so if the process performing a primitive and the process holding the queue share memory, the communication primitives (enq, deq, write, and read) will not actually copy any data: they just copy pointers. All of the Comm Heaps owned by processes which physically share memory are typically treated as a single shared entity. The region pointers within cells are virtualized to be independent of any one address space. This allows non-SPMD programs, because the comm heaps can be mapped into each process at a different address.

As a result, multiple processes may end up holding pointers to the same physical region, so a reference count is kept with each region indicating the number of pointers pointing at the region. This is incremented whenever a logical copy is made (i.e. with enq, write, and read), and decremented whenever an rgfree is performed. The rgmod primitive ensures that the reference count is zero, and if not, makes a new copy of the region for the calling process, to modify without affecting other processes reading the original region.

The waitflg argument on rgmod is specifically for the caller to suggest that all references to the region are expected to disappear relatively soon, and that therefore if references are indeed found by the operation, it might make sense for it to wait a bit (e.g. for one-half the time it would take to copy the region) before performing a copy, so that the other references to the region might vanish, making the copy unnecessary. This is only advice to the CDS runtime system, however, and in all cases, CDS is free to determine for itself if and when to perform a copy within rgmod.

The waitflg argument on enq addresses a similar issue, and is also advisory. This value allows the caller to advise CDS that a deq operation is expected to be posted within a short time (if not already), and so if it is not yet present, it might take longer to perform the necessary queuing, etc., in some cases, than to simply wait a short time for the deq operation. Again, CDS is free to make this determination on its own, without regard to the waitflg, based on a more intimate knowledge of the inner implementation factors at play.

In standard practice, waitflg in both of the situations above will be specified as either 0 (for no waiting advised), 1 (for nominal wait advised, to minimize average overall time by minimizing overhead), or 2 (for maximum wait advised, to minimize the amount of consumed in the comm heap even if it slows the program significantly). Even when a value of 2 is provided, the operation is not allowed to wait indefinitely (or it would violate the non-blocking semantics of these operations), but the meaning of "indefinitely" is left for the CDS implementation/implementor to interpret.

When a deq or read is issued to an empty cell, information regarding the calling process (or continuation) can be entered into the cell (in lieu of a pointer), to be processed when the next write or enq operation is issued for the cell. Similarly, when a benq is issued to an empty cell, the information regarding the calling process can be recorded there for the next deq operation to process it.

[Index] On to Copying, (Un)Packing, Converting