The semantics (i.e. behavior) of the primitives will first be explained in the simplest way (under "Logical Model"), with no reference to implementation details. This description may make them seem rather inefficient. Then some methods of implementing those semantics efficiently on real machines is described (under "Physical Model"). While anybody can program CDS knowing only the logical model, knowledge of the most common physical implementations can sometimes help to tune programs for maximum efficiency and portability.
The comm heap is much like the standard program heap: regions are allocated using rgalloc (like Unix malloc), and logically deallocated using rgfree (like Unix free), and the user can (in most circumstances) directly access the data in these regions (e.g. using pointers).
Each comm cell is a globally-accessible container which can hold zero or more regions in a queue-like structure. The user cannot directly access the data in the regions held by a cell. Communication is performed by transferring regions between the local comm heap and the comm cells in any process. To access a particular cell, the accessing process must name the process holding the cell, the context within that process holding the cell, and the cell number.
Specifically, the communication primitives in CDS are:
To help CDS perform these actions efficiently (described later under Physical Model), the user process must agree to "cooperate" by informing CDS before modifying the contents of a region in the comm heap. This can be performed with yet anotherprimitive, rgmod, or by passing appropriate arguments to the last other primitive to handle the region before the modification.
deq and read allow the user to specify a time limit (which can be infinite) specifying the maximum time to block waiting for the cell to become non-empty. enq and write never block indefinitely, but there is another primitive, benq, which is identical to enq except that it will block indefinitely (or until it fails at the expiration of a user-specified time-out period) unless (1) the target cell is empty and (2) there is already a deq operation waiting on that cell. This provides functionality similar to a "synchronous send" in MPI, or (if a zero time-out is specified) a "ready send" in MPI. A time limit can also be specified for enq, meaning only that the caller advises (though does not require) that the primitive behave like benq for that long, after which it must proceed and unblock normally. This is most useful to try to avoid intermediate buffering when dealing with very large regions where the deq on the receiving end is expected to be part of a recv function (discussed in the next section).
deq, read, and benq each have a split transaction mode (similar to "non-blocking" operations in MPI), where the operation initiation and completion are separated, most useful when accessing distant cells to compensate for latency. (There is no significant benefit, or cost, to using the split form of these operations for local cell access.) benq's split transaction mode is useful for both local and remote cells. In each case, the first operation of the split transaction (ideq, iread, and ibenq) is semantically identical to starting a new thread containing only the blocking operation (i.e. minus the i prefix) and returning a logical thread id, and the second operation (wait) is semantically identical to waiting for the specified thread to complete. In each case, a new thread, visible to the user using traditional thread-management tools or utilities, may or may not actually be created. wait also allows a time limit to be specified, and whether or not the thread should be canceled if that limit is exceeded. This permits wait to double as an MPI-like cancel operation (e.g. by supplying a zero time limit).
[A time limit cannot be used on both the wait primitive and the primitive being waited on. One or the other must be specified as "indefinite" ("infinite")]
cds_rgalloc(&rgid,size)rgid is a region ID (returned by the operation if preceded by &). size is the size of the region to create, in bytes. In all cases, proc, cntxt, and cell together denote a specific cell by specifying the process in which the cell lives, the context within that process, and the cell number within that context, respectively. perm specifies what sort of access the process wants to the region after the operation:
cds_rgfree(rgid)
cds_rgmod(rgid,waitflg)
cds_write(rgid,proc,cntxt,cell,perm)
cds_enq(rgid,proc,cntxt,cell,perm,waitflg)
cds_benq(rgid,proc,cntxt,cell,perm,timeout)
cds_read(&rgid,proc,cntxt,cell,perm,timeout)
cds_deq(&rgid,proc,cntxt, cell,perm,timeout)
cds_ideq(&ithrd,&rgid,proc,cntxt,cell,perm)
cds_zap(proc,cntxt,cell)
cds_iread(&ithrd,&rgid,proc,cntxt,cell,perm)
cds_ibenq(&ithrd,rgid,proc,cntxt,cell,perm)
cds_wait(ithrd,timeout,cancel)
timeout specifies how long to block until failing, and may be CDS_BLOCK to wait forever. cancel (on cds_wait) specifies to cancel the thread on timeout if true (non-zero), else the thread continues to be waited on again later. >CDS_PWRIT: modify permission, equivalent to executing cds_rgmod after the operation CDS_PREAD: only read permission (no effect) CDS_PNONE: no permission, equivalent to executing rgfree after the operation CDS_NODTA: the region is retained (i.e. not rgfreed), but the data within the region cannot be accessed directly. (Rarely used, for efficiency only, see below.)
*rgid serves as a pointer to the data on the region (i.e. **rgid addresses the first byte in the region), unless CDS_NODTA was specified as a permission. The value of *rgid may change when rgid is passed to rgmod or another primitive with CDS_PWRIT permission, so it should not be held in a temporary variable across such a call.
As a result, multiple processes may end up holding pointers to the same physical region, so a reference count is kept with each region indicating the number of pointers pointing at the region. This is incremented whenever a logical copy is made (i.e. with enq, write, and read), and decremented whenever an rgfree is performed. The rgmod primitive ensures that the reference count is zero, and if not, makes a new copy of the region for the calling process, to modify without affecting other processes reading the original region.
The waitflg argument on rgmod is specifically for the caller to suggest that all references to the region are expected to disappear relatively soon, and that therefore if references are indeed found by the operation, it might make sense for it to wait a bit (e.g. for one-half the time it would take to copy the region) before performing a copy, so that the other references to the region might vanish, making the copy unnecessary. This is only advice to the CDS runtime system, however, and in all cases, CDS is free to determine for itself if and when to perform a copy within rgmod.
The waitflg argument on enq addresses a similar issue, and is also advisory. This value allows the caller to advise CDS that a deq operation is expected to be posted within a short time (if not already), and so if it is not yet present, it might take longer to perform the necessary queuing, etc., in some cases, than to simply wait a short time for the deq operation. Again, CDS is free to make this determination on its own, without regard to the waitflg, based on a more intimate knowledge of the inner implementation factors at play.
In standard practice, waitflg in both of the situations above will be specified as either 0 (for no waiting advised), 1 (for nominal wait advised, to minimize average overall time by minimizing overhead), or 2 (for maximum wait advised, to minimize the amount of consumed in the comm heap even if it slows the program significantly). Even when a value of 2 is provided, the operation is not allowed to wait indefinitely (or it would violate the non-blocking semantics of these operations), but the meaning of "indefinitely" is left for the CDS implementation/implementor to interpret.
When a deq or read is issued to an empty cell, information regarding the calling process (or continuation) can be entered into the cell (in lieu of a pointer), to be processed when the next write or enq operation is issued for the cell. Similarly, when a benq is issued to an empty cell, the information regarding the calling process can be recorded there for the next deq operation to process it.
[Index] On to Copying, (Un)Packing, Converting
Copyright 2000 © elepar All rights reserved