Background

SCI is getting more and more popular and acceptance increases due to lower prices and higher performance. Although the initial intention of SCI was to offer a global Coherent Shared Memory for pure shared memory applications, SCI is often used as communication medium for other communication methods such as Message Passing. These applications can profit from the extremely low latency offered by the underlying SCI network. Data transfers can be initiated by simply accessing remote memory. One large advantage when using SCI for message passing applications only is that there's no need for global cache coherency in most cases. Hence, SCI adapters can be simplified. But shared memory used as working tool for message passing hasn't advantages only. When messages to be sent getting larger, the shared memory method becomes inefficient. This results from mainly three points:

The CPU(s) participates actively on communication process. That is, the CPU is running a small loop copying memory from local to remote memory (in case of a send operation). This hinders the CPU from other work it could do otherwise.
The copy-process is inefficient by nature. Data has to be fetched from memory into CPU registers and then written to remote memory. So by definition there is no real zero-copy data transmission possible. In contrast, real zero-copy means that data is transferred directly from memory to memory without any indirections.
Data transmission from CPU to other regions than local memory is often not very efficient in view of throughput. Measurments for pure CPU register-to-PCI bandwith showed that the maximum achievable PCI performance isn't reached by far. The following figure (upper six values) describes measured troughput for different machines (430FX with P-133, 440FX with PPro-200, and 440LX with PII-266, while both the 440 chipsets were dual processor machines with only one CPU used). The test was a simple copy loop from 32 bit register to PCI space and reverse.

In contrast, the lower six values show throughput between PCI space and memory, where the PCI device acts as master and is able to access local memory without CPU intervention (DMA transfer).

Today the bandwidth is looking better when writing onto the PCI bus using new boards (440BX ChipSet). Practically, the same values as in case of the Pentium/Triton combination are reached.
However, there's still a gap of about 30MB/s between active writing into PCI Memory and reading data out of main memory using DMA engines.

We made some practical measurements on Dolphins PCI-SCI hardware (D310 with LC-2 chip) and with a Dual 350MHz PII board with BX chipset. The results are as following:

The picture is showing one graph for a ping-pong bandwidth using SCI shared memory and two graphs using DMA. In case of DMA one test was done by direct access to PCI-SCI hardware registers of the user process and the other by going through the Linux kernel.
All three tests are based on zero-copy (no additional buffers were involved).

As you can see, this picture explains the reverse case as it's suggested by the PCI measurements above: DMA on PCI is not faster then accesses initiated by the CPU.
However, the problem here is not the PCI bus or the PCI chipset, but the DMA engine inside the PCI-SCI card. DMA implementations for Myrinet (BIP) have shown a very high bandwidth close to the PCI limit.

When looking at the curves for SCI shared memory and DMA bandwidth the following question may come up: When DMA is always slower than shared memory, why do we need DMA?

Indeed, this is a serious question and in view of peak communication bandwidth there is a simple answer: We don't need DMA!
But things are different when looking at general system performance where everything from calculation to communication is included. Remember that in case of SCI shared memory communication the CPU copies data from local to remote locations. Hence, the CPU is completely blocked for the duration of the copy operation. But in case of DMA once descriptors for specifying source and destination are prepared and notified to the hardware, the data transfer happens in background and the CPU can continue working.
Of course, the memory traffic caused by the DMA engine slows down the CPU slightly. On a Dual 350MHz PII with BX chipset we pointed out a CPU performance loss of about 15% in worst case (no cache use for data). The better the cache is used, the lower the loss.

Based on this information it's possible to compare SCI shared memory and DMA not only in view of absolute bandwidth, but also in view of CPU utilization. The following picture illustrates the compare method.

The faster SCI shared memory is compared with the slower DMA (in case of Dolphins card). The time used in case of the shared memory copy operation (tSHM) is shorter than the time used for the DMA case (tDMA). But during DMA operation CPU can work in parallel, but not so fast (we assume our worst case CPU performance loss of 15%). In case of SCI shared memory, after processing the copy operation there's still some time available until tDMA is reached. So it can be calculated for both cases, how much time there's available for other things than data transmission (as a function of transmission size). The picture below shows the two resulting graphs.

Important Notices: These graphs give no absolutely precise values. For this, precise timing values are required which are difficult to measure for us. This is true especially for the time which is really required for the pure DMA transfer (no ping-pong). However, we used the DMA ping-pong values although there is some more time included as the DMA engine actually produces traffic on the local PCI and memory bus (ping-pong implies a round-trip). This is bad especially for very short transmission sizes. In case of SCI shared memory we assumed a constant bandwidth of about 82MB/s for packet sizes of 64Bytes and larger (no ping-pong values). This assumption is correct and also reported by Dolphin (see paper "High-performance cluster-computing with Dolphin's CluStar PCI adapter card" presented at SCI Europe'98).

The picture shows that there's a switching point at about 128Bytes when DMA begins to be more CPU-friendly than shared memory. This point is at a surprisingly low message size. And even if this point must be doubled due to the above described inaccuracies, it is still low. Therefore this demonstrates the importance of DMA, even if the total bandwidth is looking not so good.

Anyways, the picture above with graphs of message size versus bandwidth shows also the advance of a user-level mechanism over a kernel-controlled one. Although the difference here seems to be not very large, it is significant over relatively wide and especially in our applications often used message sizes (128Bytes-4kBytes). In contrast to direct hardware access we measured an additional latency of 4.5us for zero-data packets (11.5us vs. 7us).
Besides the increased latency this is a lot of time for a CPU and a lot of other computations could be done instead. It's also important to notice that the 4.5us for kernel entry are not enough for calculation the lost calculation time. It takes also time to get back from kernel into the user process. We measured about 3us here.

Back to the comparison of SCI shared memory and DMA, we also see that the shared memory approach is still the best for short messages. Therefore it would be very useful to have both mechanisms available in future.

In order to achieve a user-level access to the DMA hardware we must take a look besides SCI, because this was never a concern of SCI. Realizung user level DMA is one thing. But in an open system where multiple processes (and possibly multiple users) are involved, we need Protected User Level DMA. This is required, because the check which process may access which memory locations is no longer performed by the kernel, but direct inside the hardware.

The most interesting projects on the area of protected user level communication are the SHRIMP and the U-NET projects.
Both ones try to eliminate OS calls while adding appropriate protection mechanisms to the communication hardware (although SHRIMP uses the conventional memory protection mechanism offered by the CPU to a great extend).

The latest initiative on this area is the Virtual Interface Architecture (VIA).
The VIA doesn't only specify a mechanism for Protected User Level DMA (which is close to the one used on U-Net), but it also defines basic mechanisms for data and control flow.

We decided to take the new VIA approach and bring it together with SCI shared memory. Initially, the VIA in its V1.0 specification offers pure message passing only. That is, you can't share memory among different nodes so that nodes may communicate via simple memory references. Although the VIA decreases latency dramatically due to the protected user level hardware access, there's still some overhead generated by fetching descriptors etc.
So the idea is to offer shared memory capabilities besides the VIA functionality. For short messages the overhead caused by shared memory communication will be even lower than in case of VIA communication.

The shared memory will be very useful especially for internal communication library transfers which are not seen directly by the application (no explicite send/receives). These transfers are typically very small ones used for syncronization stuff like barriers. Additionally, the shared memory interface may not only offer shared memory "as it is". For example, special operations like implicite fetch&add may be offered directly by hardware. This speeds up special operations even more.

Another approach is to use the additional low-latency shared memory for VIA internal communication only. That is, the shared memory will be hidden inside the (very thin) VIA layer.

Last Updated: March 10th 1999
By Digital Force / Mario Trams