Cell Messaging Layer
Frequently Asked Questions
What is the Cell Messaging Layer?
The Cell Messaging Layer is a communication library for the Cell Broadband Engine (Cell), which many people recognize as the microprocessor in the PlayStation 3. Each Cell contains eight high-speed vector processors called synergistic processing elements (SPEs). The Cell Messaging Layer makes it easy for the SPEs within a Cell and across any number of networked Cells to transfer data to each other and coordinate their operations. Specifically, the Cell Messaging Layer implements a small subset of the de facto standard Message Passing Interface (MPI) library, which makes it easy for an MPI programmer to adapt to Cell programming.
What do I need to run the Cell Messaging Layer?
To build the Cell Messaging Layer, you'll need version 3 of the Cell Software Development Kit, including the Developer package. These are available from the Downloads section of IBM's Cell Broadband Engine resource center. If you have a cluster of Cell processors, you'll need an MPI implementation such as Open MPI to perform the internode communication. If you don't have any Cell hardware, you may be able to use IBM's Full System Simulator and Sysroot Image for development and testing. These are provided by the Extras package at the Cell Broadband Engine resource center.
Where can I get the Cell Messaging Layer?
You can download the Cell Messaging Layer source code from SourceForge.net.
Is the Cell Messaging Layer free? Is it open-source software?
Yes! The Cell Messaging Layer is open-source software released under the GNU General Public License, version 2, with an additional "modifications must be clearly marked" clause.
How do I install the Cell Messaging Layer?
The Installation button in the menu bar links to complete installation instructions. The basic procedure, though, is to edit the Makefile as needed, then run make config, make, and make install.
Which MPI functions are currently supported?
The Documentation button in the menu bar links to a list of MPI functions supported by the Cell Messaging Layer.
Isn't MPI huge? Won't the Cell Messaging Layer use up all of my local store?

First, remember that the Cell Messaging Layer implements only a small subset of MPI. Even so, the Cell Messaging Layer tries to keep its footprint small by dividing MPI functions across a number of object files. The linker is smart enough to include only those object files that are actually referenced in the final executable. For example, if a program never calls MPI_Reduce() (either explicitly or implicitly by calling MPI_Allreduce()), the coll.o object file will not be used and will therefore take up no space. Table 1 lists the number of bytes used by each section of each Cell Messaging Layer object file at the time of this writing.

Table 1: Bytes per object file
Filename     text   data    bss  common   Total
---------  ------  -----  -----  ------  ------
barrier.o    1356      0    332       0    1688
bcast.o      1096      4    152       0    1252
coll.o        552      0      0       0     552
globals.o       0      0      0     256     256
info.o        256      0      0       0     256
init.o       1484      0     64       0    1548
pt2pt.o      2512      0   1360       0    3872
reduce.o     2812      8    648       0    3468
rpc.o         296      0      0       0     296
time.o        128      0      8       0     136
---------  ------  -----  -----  ------  ------
Max          2812      8   1360     256    4436
Total       10492     12   2564     256   13324

As indicated by Table 1, the maximum amount of memory that the Cell Messaging Layer will ever use—if at least one function in every object file is invoked—is 13,324 bytes. In practice, this number can be reduced substantially by using overlays. For example, if all Cell Messaging Layer code (text segment) is made to overlay application code, the Cell Messaging Layer will require only 2,832 bytes of resident data (data, bss, and common segments) in the worst case. As an alternative, if data overlays are also used, all of the Cell Messaging Layer's code and private data can fit in a shared 4,180-byte segment (worst case) with the 256 bytes of global data kept resident.

The point is that there are a number of space-vs.-performance tradeoffs that a programmer can make when using the Cell Messaging Layer. Plus, a program that uses few MPI functions will need less space than a program that uses many MPI functions.

How fast is SPE-to-SPE communication?
Running the pingpong example program on the PowerXCell 8i version of the Cell, the current release of the Cell Messaging Layer measures 0.281 μs of zero-byte latency and 22.3 GiB/s of bandwidth for a 128 KiB message. Within a two-Cell blade, pingpong measures 0.841 μs of zero-byte latency across the BIF link.
What does a typical Cell Messaging Layer program look like?

The SPE program looks like any other MPI program. The PPE and—if running in hybrid mode—host programs contain a small amount of boilerplate code plus any RPC functions they want to make available to the SPEs. An online Hello, world code example shows a simple SPE program and the corresponding PPE code.
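
To give a flavor, the following is a minimal sketch of what the SPE side of such a program might look like. It assumes that MPI_Init(), MPI_Comm_rank(), MPI_Comm_size(), and MPI_Finalize() are among the supported functions (the Documentation page lists the definitive set); see the online Hello, world example for the authoritative SPE and PPE code.

    /* Illustrative SPE-side sketch only; consult the online Hello, world
     * example for the authoritative version and the PPE boilerplate. */
    #include <stdio.h>
    #include <mpi.h>

    int main (int argc, char *argv[])
    {
      int rank, nranks;

      MPI_Init (&argc, &argv);                  /* Initialize the messaging layer */
      MPI_Comm_rank (MPI_COMM_WORLD, &rank);    /* This SPE's global rank */
      MPI_Comm_size (MPI_COMM_WORLD, &nranks);  /* Total number of SPE ranks */
      printf ("Hello, world, from rank %d of %d.\n", rank, nranks);
      MPI_Finalize ();
      return 0;
    }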

Does my entire program have to fit in local store? Can I use main memory?
The Cell Messaging Layer doesn't change the way the SPEs are used. If your data do not fit in local store, you can use libspe's DMA functions to transfer data to and from main memory, just like in any SPE program.
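
As a concrete illustration, an SPE program might stage a large array through local store one block at a time using the SDK's MFC intrinsics from spu_mfcio.h, exactly as it would without the Cell Messaging Layer. In the sketch below, the block size, tag number, and buffer name are arbitrary choices made for the example.

    /* Illustrative sketch: DMA one block from main memory into local store,
     * process it, then write it back.  The usual MFC rules apply (16-byte
     * alignment, size a multiple of 16 bytes, at most 16 KiB per transfer). */
    #include <spu_mfcio.h>

    #define BLOCK_SIZE 16384        /* Arbitrary block size for this example */
    #define DMA_TAG    3            /* Arbitrary MFC tag ID (0-31) */

    static volatile char buffer[BLOCK_SIZE] __attribute__((aligned(128)));

    void process_block (unsigned long long ea)   /* Effective address in main memory */
    {
      mfc_get (buffer, ea, BLOCK_SIZE, DMA_TAG, 0, 0);   /* Main memory -> local store */
      mfc_write_tag_mask (1 << DMA_TAG);
      mfc_read_tag_status_all ();                        /* Wait for completion */

      /* ... operate on buffer[] here ... */

      mfc_put (buffer, ea, BLOCK_SIZE, DMA_TAG, 0, 0);   /* Local store -> main memory */
      mfc_write_tag_mask (1 << DMA_TAG);
      mfc_read_tag_status_all ();
    }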
How are MPI ranks assigned?
Ranks fill one Cell at a time. For example, in a cluster of two 6-SPE Cells the first Cell's SPEs will be assigned MPI ranks 0–5, and the second Cell's SPEs will be assigned MPI ranks 6–11.
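
Put as arithmetic, and assuming every Cell manages the same number of SPEs, the mapping from a rank to its Cell and to its position within that Cell is simply:

    /* Illustrative arithmetic only; spes_per_cell would be 6 in the example above. */
    int cell_index = rank / spes_per_cell;   /* Which Cell this rank lives on */
    int local_spe  = rank % spes_per_cell;   /* Position within that Cell (0-based) */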
How can I determine the number of SPEs per PPE?

On the PPE, the cellmsg_spes_per_ppe() function returns the number of SPEs managed by each PPE.

On the SPE, there are two alternatives. Either use the Cell Messaging Layer-specific MPI_CML_LOCAL_NEIGHBORS key with MPI_Comm_get_attr() or use the Cell Messaging Layer-specific MPI_COMM_MEM_DOMAIN communicator with MPI_Comm_size().
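
For example, here is a hedged sketch of the second alternative on the SPE side. It assumes MPI_COMM_MEM_DOMAIN groups exactly the SPEs that share this rank's PPE; the precise semantics of both alternatives are described on the Documentation page.

    /* Illustrative SPE-side sketch; confirm the exact semantics of the
     * Cell Messaging Layer-specific names against the documentation. */
    #include <mpi.h>

    int spes_on_my_ppe (void)
    {
      int nlocal;

      /* Size of the memory-domain communicator, assumed to contain exactly
       * the SPEs managed by this rank's PPE. */
      MPI_Comm_size (MPI_COMM_MEM_DOMAIN, &nlocal);

      /* (The other alternative is to query the MPI_CML_LOCAL_NEIGHBORS
       * attribute via MPI_Comm_get_attr().) */
      return nlocal;
    }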

How can I limit the number of SPEs per PPE (e.g., to use only one core per socket)?
By default, the number of SPEs per PPE is calculated automatically. However, the CMLMAXLOCALSPES environment variable can be used to control the number of SPEs per PPE. For example, when running Open MPI on the PPEs (or on the host in hybrid mode) one might run mpirun -x CMLMAXLOCALSPES=1 -np 4 ./myprogram to run a four-PPE program with one SPE per PPE.
Where can I learn more about the Cell Messaging Layer?

Some early performance data appear on a Cell Messaging Layer poster that was displayed in Los Alamos National Laboratory's booth at the SC08 conference (November 2008).

The following conference paper details the Cell Messaging Layer's implementation and presents a wealth of performance data:

Scott Pakin. Receiver-initiated Message Passing over RDMA Networks. In Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, Florida, April 2008.

Scott Pakin, pakin@lanl.gov