Cell Messaging Layer
Frequently Asked Questions
What is the Cell Messaging Layer?
The Cell Messaging Layer is a communication library for the Cell Broadband Engine (Cell), which many people recognize as the microprocessor in the PlayStation 3. Each Cell contains eight high-speed vector processors called synergistic processing elements (SPEs). The Cell Messaging Layer makes it easy for the SPEs within a Cell and across any number of networked Cells to transfer data to each other and coordinate their operations. Specifically, the Cell Messaging Layer implements a small subset of the de facto standard Message Passing Interface (MPI) library, which makes it easy for an MPI programmer to adapt to Cell programming.
What do I need to run the Cell Messaging Layer?
To build the Cell Messaging Layer, you'll need version 3 of the Cell Software Development Kit, including the Developer package. These are available from the Downloads section of IBM's Cell Broadband Engine resource center. If you have a cluster of Cell processors, you'll need an MPI implementation such as Open MPI to perform the internode communication. If you don't have any Cell hardware, you may be able to use IBM's Full System Simulator and Sysroot Image for development and testing. These are provided by the Extras package at the Cell Broadband Engine resource center.
Where can I get the Cell Messaging Layer?
You can download the Cell Messaging Layer source code from SourceForge.net.
Is the Cell Messaging Layer free? Is it open-source software?
Yes! The Cell Messaging Layer is open-source software released under the GNU General Public License, version 2, with an additional "modifications must be clearly marked" clause.
How do I install the Cell Messaging Layer?
The Installation button in the menu bar links to complete installation instructions. The basic procedure, though, is to edit the Makefile as needed, then run make config, make, and make install.
Which MPI functions are currently supported?
The Documentation button in the menu bar links to a list of MPI functions supported by the Cell Messaging Layer.
Isn't MPI huge? Won't the Cell Messaging Layer use up all of my local store?

First, remember that the Cell Messaging Layer implements only a small subset of MPI. Even so, the Cell Messaging Layer tries to keep its footprint small by dividing MPI functions across a number of object files. The linker is smart enough to include only those object files that are actually referenced in the final executable. For example, if a program never calls MPI_Reduce() (either explicitly or implicitly by calling MPI_Allreduce()), the coll.o object file will not be used and will therefore take up no space. Table 1 lists the number of bytes used by each section of each Cell Messaging Layer object file at the time of this writing.

Table 1: Bytes per object file
Filename     text   data    bss  common   Total
---------  ------  -----  -----  ------  ------
barrier.o    1356      0    332       0    1688
bcast.o      1096      4    152       0    1252
coll.o        552      0      0       0     552
globals.o       0      0      0     256     256
info.o        256      0      0       0     256
init.o       1484      0     64       0    1548
pt2pt.o      2512      0   1360       0    3872
reduce.o     2812      8    648       0    3468
rpc.o         296      0      0       0     296
time.o        128      0      8       0     136
---------  ------  -----  -----  ------  ------
Max          2812      8   1360     256    4436
Total       10492     12   2564     256   13324

As indicated by Table 1, the maximum amount of memory that the Cell Messaging Layer will ever use—if at least one function in every object file is invoked—is 13,324 bytes. In practice, this number can be reduced substantially by using overlays. For example, if all Cell Messaging Layer code (text segment) is made to overlay application code, the Cell Messaging Layer will require only 2,832 bytes of resident data (data, bss, and common segments) in the worst case. As an alternative, if data overlays are also used, all of the Cell Messaging Layer's code and private data can fit in a shared 4,180-byte segment (worst case) with the 256 bytes of global data kept resident.

The point is that there are a number of space-vs.-performance tradeoffs that a programmer can make when using the Cell Messaging Layer. Plus, a program that uses few MPI functions will need less space than a program that uses many MPI functions.

How fast is SPE-to-SPE communication?
Running the pingpong example program on the PowerXCell 8i version of the Cell, the current release of the Cell Messaging Layer measures 0.281 μs of zero-byte latency and 22.3 GiB/s of bandwidth for a 128 KiB message. Within a two-Cell blade, pingpong measures 0.841 μs of zero-byte latency across the BIF link.
What does a typical Cell Messaging Layer program look like?

The SPE program looks like any other MPI program. The PPE and—if running in hybrid mode—host programs contain a small amount of boilerplate code plus any RPC functions they want to make available to the SPEs. An online Hello, world code example shows a simple SPE program and the corresponding PPE code.
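
To give a flavor, the following is a minimal sketch of what the SPE side of such a program might look like. It assumes that MPI_Init(), MPI_Comm_rank(), MPI_Comm_size(), and MPI_Finalize() are among the supported functions (the Documentation page lists the definitive set); see the online Hello, world example for the authoritative SPE and PPE code.

    /* Illustrative SPE-side sketch only; consult the online Hello, world
     * example for the authoritative version and the PPE boilerplate. */
    #include <stdio.h>
    #include <mpi.h>

    int main (int argc, char *argv[])
    {
      int rank, nranks;

      MPI_Init (&argc, &argv);                  /* Initialize the messaging layer */
      MPI_Comm_rank (MPI_COMM_WORLD, &rank);    /* This SPE's global rank */
      MPI_Comm_size (MPI_COMM_WORLD, &nranks);  /* Total number of SPE ranks */
      printf ("Hello, world, from rank %d of %d.\n", rank, nranks);
      MPI_Finalize ();
      return 0;
    }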

Does my entire program have to fit in local store? Can I use main memory?
The Cell Messaging Layer doesn't change the way the SPEs are used. If your data do not fit in local store, you can use libspe's DMA functions to transfer data to and from main memory, just like in any SPE program.
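
As a concrete illustration, an SPE program might stage a large array through local store one block at a time using the SDK's MFC intrinsics from spu_mfcio.h, exactly as it would without the Cell Messaging Layer. In the sketch below, the block size, tag number, and buffer name are arbitrary choices made for the example.

    /* Illustrative sketch: DMA one block from main memory into local store,
     * process it, then write it back.  The usual MFC rules apply (16-byte
     * alignment, size a multiple of 16 bytes, at most 16 KiB per transfer). */
    #include <spu_mfcio.h>

    #define BLOCK_SIZE 16384        /* Arbitrary block size for this example */
    #define DMA_TAG    3            /* Arbitrary MFC tag ID (0-31) */

    static volatile char buffer[BLOCK_SIZE] __attribute__((aligned(128)));

    void process_block (unsigned long long ea)   /* Effective address in main memory */
    {
      mfc_get (buffer, ea, BLOCK_SIZE, DMA_TAG, 0, 0);   /* Main memory -> local store */
      mfc_write_tag_mask (1 << DMA_TAG);
      mfc_read_tag_status_all ();                        /* Wait for completion */

      /* ... operate on buffer[] here ... */

      mfc_put (buffer, ea, BLOCK_SIZE, DMA_TAG, 0, 0);   /* Local store -> main memory */
      mfc_write_tag_mask (1 << DMA_TAG);
      mfc_read_tag_status_all ();
    }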
How are MPI ranks assigned?
Ranks fill one Cell at a time. For example, in a cluster of two 6-SPE Cells the first Cell's SPEs will be assigned MPI ranks 0–5, and the second Cell's SPEs will be assigned MPI ranks 6–11.
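
Put as arithmetic, and assuming every Cell manages the same number of SPEs, the mapping from a rank to its Cell and to its position within that Cell is simply:

    /* Illustrative arithmetic only; spes_per_cell would be 6 in the example above. */
    int cell_index = rank / spes_per_cell;   /* Which Cell this rank lives on */
    int local_spe  = rank % spes_per_cell;   /* Position within that Cell (0-based) */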
How can I determine the number of SPEs per PPE?

On the PPE, the cellmsg_spes_per_ppe() function returns the number of SPEs managed by each PPE.

On the SPE, there are two alternatives. Either use the Cell Messaging Layer-specific MPI_CML_LOCAL_NEIGHBORS key with MPI_Comm_get_attr() or use the Cell Messaging Layer-specific MPI_COMM_MEM_DOMAIN communicator with MPI_Comm_size().
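
For example, here is a hedged sketch of the second alternative on the SPE side. It assumes MPI_COMM_MEM_DOMAIN groups exactly the SPEs that share this rank's PPE; the precise semantics of both alternatives are described on the Documentation page.

    /* Illustrative SPE-side sketch; confirm the exact semantics of the
     * Cell Messaging Layer-specific names against the documentation. */
    #include <mpi.h>

    int spes_on_my_ppe (void)
    {
      int nlocal;

      /* Size of the memory-domain communicator, assumed to contain exactly
       * the SPEs managed by this rank's PPE. */
      MPI_Comm_size (MPI_COMM_MEM_DOMAIN, &nlocal);

      /* (The other alternative is to query the MPI_CML_LOCAL_NEIGHBORS
       * attribute via MPI_Comm_get_attr().) */
      return nlocal;
    }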

How can I limit the number of SPEs per PPE (e.g., to use only one core per socket)?
By default, the number of SPEs per PPE is calculated automatically. However, the CMLMAXLOCALSPES environment variable can be used to control the number of SPEs per PPE. For example, when running Open MPI on the PPEs (or on the host in hybrid mode) one might run mpirun -x CMLMAXLOCALSPES=1 -np 4 ./myprogram to run a four-PPE program with one SPE per PPE.
Where can I learn more about the Cell Messaging Layer?

Some early performance data appear on a Cell Messaging Layer poster that was displayed in Los Alamos National Laboratory's booth at the SC08 conference (November 2008).

The following conference paper details the Cell Messaging Layer's implementation and presents a wealth of performance data:

Scott Pakin. Receiver-initiated Message Passing over RDMA Networks. In Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, Florida, April 2008.

Scott Pakin, pakin@lanl.gov