Sorry, but there is not much to report here yet. Although the hardware already supports at least a so-called Manual Packet Mode, we have not run any performance tests with it so far. In any case, the Manual Packet Mode is intended only for some control functions where high speed is not really required. Nevertheless, we can give some estimates of what to expect from the raw hardware. We will present concrete measurements and curves here when they are available (first values are expected at the end of March 2000 or the beginning of April 2000).
Note: Wherever the following text mentions Megabytes (or MBytes, or MB), this refers to 'real' Megabytes. That is, 1MByte = 1MB = 1048576Byte.
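For readers who want to compare the figures on this page with data sheets that use decimal megabytes (10^6 bytes), the following small Python sketch converts between the two conventions. The function names are only illustrative and not part of any tool belonging to this project.

```python
MBYTE = 1048576  # bytes per 'real' MByte, as defined in the note above


def mbyte_to_decimal_mb(mbyte_per_s):
    """Convert a rate in 'real' MByte/s into decimal MB/s (10**6 bytes/s)."""
    return mbyte_per_s * MBYTE / 1e6


def decimal_mb_to_mbyte(mb_per_s):
    """Convert a rate in decimal MB/s (10**6 bytes/s) into 'real' MByte/s."""
    return mb_per_s * 1e6 / MBYTE


if __name__ == "__main__":
    # 145 is the DPM write figure quoted further down on this page.
    print(f"145 MByte/s = {mbyte_to_decimal_mb(145):.1f} decimal MB/s")
```

The difference is just under 5%, which matters when a measured value is compared with a theoretical bus peak given in decimal units.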
Shared Memory Latency
The one-way latency for transparent transactions (accesses to remote memory) will probably be in the range of 4-5 microseconds. This value should be the same for both write and read transactions. Since a read needs a complete round trip, a read transaction will take about 8-10 microseconds to complete.
Shared Memory Write Bandwidth
Tests on a Samsung Alpha UX board (533MHz 21164 Alpha, 66MHz Memory Bus,
21174 ChipSet) have shown a write bandwidth to raw Dual Ported Memory
(DPM, see also the Hardware page) of
about 145MByte/s. This is an upper limit for the transparent write
bandwidth.
Since transparent writes need slightly more operations
than DPM writes, their bandwidth should be around 130MByte/s.
But newer Alpha ChipSets, SPARC ChipSets, or perhaps even ChipSets for
the Intel architecture that implement a 64Bit PCI Bus may reach a higher
general PCI write bandwidth here.
21164 Alpha processor systems collect only up to 32Bytes and write them out
in a single PCI burst (only 4 PCI data phases). Larger bursts would
result in a significantly higher bandwidth.
21264 systems have twice the buffer capacity (4 outstanding
64Byte write buffers). Assuming that this simply leads to a longer PCI burst
(8 data phases) and no other overhead, these systems should reach a raw PCI
write performance of about 185MB/s.
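The relation between burst length and write bandwidth can be illustrated with a very simple model. The Python sketch below is only a back-of-the-envelope calculation, not a description of the real bus protocol. It rests on two assumptions that are not stated on this page: a PCI clock of 33.33MHz, and a fixed overhead of 3 clock cycles per burst. The overhead value was chosen here only because it reproduces the measured 145MByte/s for 4 data phases. The 8 bytes per data phase follow from the 32Bytes in 4 data phases mentioned above.

```python
MBYTE = 1048576          # 'real' MByte as used on this page
PCI_CLOCK_HZ = 33.33e6   # ASSUMPTION: 33MHz PCI clock (not stated on this page)
BYTES_PER_PHASE = 8      # 64Bit PCI: 32Bytes in 4 data phases
OVERHEAD_CYCLES = 3      # ASSUMPTION: per-burst overhead, fitted to 145MByte/s


def burst_bandwidth(data_phases, overhead_cycles=OVERHEAD_CYCLES):
    """Estimated write bandwidth in MByte/s for bursts of the given length."""
    bytes_per_burst = data_phases * BYTES_PER_PHASE
    cycles_per_burst = data_phases + overhead_cycles
    return bytes_per_burst * PCI_CLOCK_HZ / cycles_per_burst / MBYTE


if __name__ == "__main__":
    for phases in (4, 8, 16):
        print(f"{phases:2d} data phases: {burst_bandwidth(phases):6.1f} MByte/s")
```

Under these assumptions the model gives roughly 145MByte/s for 4 data phases and roughly 185MByte/s for 8 data phases, which is consistent with the two figures above. It also shows why longer bursts help: the fixed overhead is spread over more data.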
Shared Memory Read Bandwidth
Reading is a completely different story from writing. Because of
cache coherency problems with PCI memory, ChipSets generally do not
prefetch any data. That is, only the data that the CPU has actually
requested is read. The 21174 may be a minor exception
because it always reads 64bits even if only 32bits are
requested. But the additional data is discarded, so
this is no real prefetching.
Because no PCI bursts are generated during read transactions, the
resulting bandwidth is very poor: about 8.8MByte/s for the
Alpha board mentioned above.
Besides the generally poor PCI read performance, there is
another problem: read transactions cannot be pipelined through
the communication system. That is, from the processor's point of view
they need a complete round-trip time before they finish.
This not only results in poor performance; in most cases the processor
is also blocked and can do nothing but wait
until the outstanding transaction completes.
To mitigate this problem somewhat, Dolphin has implemented a
so-called 'Aggressive Prefetching Mode' in their PCI-SCI bridges. In this
mode the hardware can speculatively read data from remote nodes
even though it is not yet required. When the CPU later actually
requests this data, it is available faster. Nevertheless, it is still slow
because the PCI read bottleneck remains.
Because our hardware is intended for message passing, where sending
data involves more write transactions than read transactions, we
will not put much effort into speeding up transparent read operations.
To conclude, the bandwidth of transparent reads will probably be around 3-4MByte/s.
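Why non-pipelined reads are so slow can be seen from a simple calculation: if every read has to wait for a complete round trip before the next one can start, the bandwidth is just the number of bytes per read divided by the round-trip time. The Python sketch below uses the 8-10 microseconds round trip from the latency section. The number of bytes delivered per read is a hypothetical parameter; this page does not say which value applies to the real hardware.

```python
MBYTE = 1048576  # 'real' MByte as used on this page


def blocking_read_bandwidth(bytes_per_read, round_trip_us):
    """Bandwidth in MByte/s when each read must finish before the next starts."""
    return bytes_per_read / (round_trip_us * 1e-6) / MBYTE


if __name__ == "__main__":
    # bytes_per_read values are ASSUMPTIONS chosen for illustration only.
    for bytes_per_read in (4, 8, 32):
        for round_trip_us in (8, 10):
            bw = blocking_read_bandwidth(bytes_per_read, round_trip_us)
            print(f"{bytes_per_read:2d} bytes per read, "
                  f"{round_trip_us:2d} us round trip: {bw:5.2f} MByte/s")
```

With 32 bytes per read, this model yields between about 3 and 4 MByte/s, which is in the same range as the estimate above. Whether that estimate was actually derived this way is not stated here, so the match should be taken only as a plausibility check.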
Remote DMA Latency
At the moment it is too early to specify any value here.
Remote DMA Bandwidth
Since the PCI-SCI bridge, or rather the PCI Master inside it, will be able to merge consecutive local memory reads and writes, we expect to reach a very high bandwidth here, perhaps somewhere around 200MByte/s.
This high value should be reachable at least between two nodes A and B, where B is the remote node and no other node C communicates with B. Otherwise it is difficult to merge PCI transactions on node B, since incoming packets are normally processed in strict order.
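The effect of a third node on merging can be illustrated with a toy model. The Python sketch below is purely hypothetical and does not describe the real bridge logic: it assumes that node B handles incoming packets strictly in arrival order and can extend the current PCI burst only if the next packet continues exactly at the address where the previous one ended. The packet size of 64 bytes and the addresses are made up for the example.

```python
def count_bursts(packets):
    """Count PCI bursts needed for a stream of (address, length) packets.

    Packets are handled strictly in order; a packet is merged into the
    current burst only if it starts where the previous packet ended.
    """
    bursts = 0
    prev_end = None
    for addr, length in packets:
        if addr != prev_end:
            bursts += 1          # not contiguous: a new burst must be started
        prev_end = addr + length
    return bursts


if __name__ == "__main__":
    # Hypothetical traffic: A and C each send four contiguous 64-byte packets.
    from_a = [(0x1000 + 64 * i, 64) for i in range(4)]
    from_c = [(0x8000 + 64 * i, 64) for i in range(4)]
    interleaved = [p for pair in zip(from_a, from_c) for p in pair]

    print("A alone:          ", count_bursts(from_a), "burst(s)")
    print("A and C interleaved:", count_bursts(interleaved), "burst(s)")
```

In this model the four packets from A alone merge into one burst, while the same packets interleaved with traffic from C cannot be merged at all, so every packet needs its own burst.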