8 August 2001

Hi,

Here's a semi-formal report of what I've found so far. Doug asked that the
NFS 'getattr' call be measured; I answer this in the broader context of
Chris's question: 'what does it take to make NFS go fast?'

The findings are summarized as follows:

(1) getattr is fast, very fast.

(2) There is a lot of overhead in the XDR/RPC code: 85%-90% of the total
    time is spent in this 'network' code.

(3) There is a lot of latency in the network: viz. a single (non-threaded)
    client issuing getattr as fast as it can can only use up 25% of the
    server cpu and 8% of the network bandwidth.

There are several possible directions for future work:

(A) Look to understand the XDR/RPC/UDP/device-driver overhead better (and
    start looking for ways to tune it).

(B) Determine if other network cards can significantly alter latency
    and/or overhead (either due to better device drivers, better hardware,
    or different (e.g. 1 gigabit) speeds).

(C) Measure read performance, to see just how big the XDR/RPC/UDP overhead
    is in that case. This 'overhead' is probably small, at least on a
    100 Mbit network, but there's a hint that it may not look that way on
    a 1 gigabit network.

(D) Do nothing until a better understanding of the SPEC NFS benchmark is
    gained. If the dominant components are large file reads/writes, then
    the 'network' overhead may pale in comparison.

--linas

=====================================================================

Below follows a detailed writeup of the results. Warning: it's long,
probably longer than what you were expecting.

Client NFS Driver
-----------------
To be able to exercise an NFS server while having explicit control over
the protocol being sent, a small, simple user-space NFS client was
developed. The client consists of the Mount and NFS v2 RPC IDL files
(easily obtained on the net), compiled with the C library 'rpcgen' command
to create stubs for the mount and nfs protocols.
The stubs were filled in with enough code to work correctly, and
instrumented with some simple elapsed-time clocks. Currently, the client
takes a hostname, a directory path, and a count N. It mounts the directory
from the host, invokes 'getattr' N times, and unmounts. It reports the
elapsed time for each of these operations. N=100K or 1M is needed to get
something that runs for more than a few seconds.

CPU Cycle Counts
----------------
To be able to determine the number of cpu clock cycles it takes to get
from point A to point B in the Linux kernel, a simple infrastructure
around the rdtsc instruction was developed. Clock-cycle counts can be
measured with the rdtsc instruction. It counts at the rate of the CPU
clock: it will count 1700M ticks per second on a 1700MHz Intel machine,
and 700M per second on a 700MHz Athlon. The gcc compiler/inline-assembler
code for this is:

    static inline unsigned long long rdtsc_insn (void)
    {
        unsigned long long xtime;
        asm volatile ("rdtsc" : "=A" (xtime));
        return xtime;
    }

See also the linux kernel header file /usr/include/asm/msr.h

CPU Cycles Spent in the Kernel NFS getattr Call
-----------------------------------------------
If we wrap a pair of these counters around the interior of the kernel
function fs/nfsd/nfsproc.c:nfsd_proc_getattr() and then run the NFS
client, we find that on average we spend 911 cycles in getattr(), with a
standard deviation of 195 cycles. (There is a standard deviation because
we average over 100K calls; it's fairly large because various network
events, interrupts, etc., as well as virtual memory details, will cause
timing variations.) 911 cycles is *fast* by almost any yardstick: 900-odd
instructions is not a lot in the grand scheme of things.

Elapsed Time From the Client's Point of View
--------------------------------------------
We can measure the amount of time spent in 'getattr' in a different way:
by looking at the elapsed time, from the client's point of view.
This elapsed time includes the overhead of the client creating and sending
out the RPC call, the network delay, and the time on the server to receive
& decode the network traffic, perform the getattr, and respond. This
measurement is made simple by the fact that the NFS protocol specifies a
'null call': a call that goes all the way from client to server, but does
nothing on the server. The null call *does* generate an RPC response which
does go back on the net: the null call behaves like an 'empty' subroutine.

With both client and server on the same machine, I invoked 'null call' 1
million times via the custom nfs client. We find (for a 1700 MHz cpu):

    null call:      17.5 microseconds
    getattr call:   20.3 microseconds

If we think of 'null call' as 'network overhead', then the *actual*
getattr() takes 20.3 - 17.5 = 2.75 microseconds (or 2.75 * 1700MHz = 4675
clock cycles ... I also measured on a 700 MHz Athlon; the microsecond
times are exactly 1700/700 slower: i.e. it's still 4675 clock cycles on
the Athlon).

Why is 4675 clocks so much larger than 911 clocks? I'm not sure.
Presumably, some of this additional overhead is due to the slightly larger
UDP packets involved with getattr, as well as differences in XDR
encode/decode effort, etc. I tried hunting around deeper in the kernel to
make up the difference, but found only half the 'missing' cycles in the
XDR/RPC code. The other half presumably is in the UDP and ethernet device
driver.

The important thing to note here is: 'protocol overhead' accounts for 85%
of the elapsed time! (N.B. running the knfsd as a loadable module, as
opposed to compiled into the kernel, resulted in a 10-15% performance
improvement, although maybe this was a fluke, as there was other activity
on the machine ...)

Elapsed Time Over the Network
-----------------------------
The above tests were run over the 'loopback' network interface, i.e. lo at
127.0.0.1. For comparison, we run the same tests over ethernet.
The performance is much, much worse:

    null call:      114 microseconds
    getattr call:   134 microseconds

What's going on? Why? I also monitored the cpu usage on the server during
this test, along with the network traffic, the interrupts, and the context
switches. These loads differ between 'null call' and 'getattr', but not by
a lot. The network is 100Mbit ethernet:

    Ethernet controller: Intel Corporation 82820 820 (Camino 2)
    Chipset Ethernet (rev 1).

network traffic during 'null call':
    recv bytes/sec =  725KB/sec = 5.7 Mbit/sec =  6% of network max
    xmit bytes/sec =  800KB/sec = 6.4 Mbit/sec =  6% of network max
    (server cpu usage during null call = 27% sys 0% user)
    (server 18.5K ints/sec, 27.5K ctxt switch/sec)

network traffic during 'getattr call':
    recv bytes/sec =  848KB/sec = 6.8 Mbit/sec =  7% of network max
    xmit bytes/sec = 1230KB/sec = 9.8 Mbit/sec = 10% of network max
    (server cpu usage during getattr = 24% sys 0% user)
    (server 15.5K ints/sec, 23.1K ctxt switch/sec)

Note that if we multiply the interrupt rate by the total elapsed time and
divide by the number of calls, we find that there are *exactly* two
interrupts per RPC call, and three context switches.

Soooo ... what's the bottleneck? Neither the CPU nor the network is maxed
out. Let's rule out the obvious culprits:

Device Driver Problems? No ...
------------------------------
We can rule out most device driver problems by verifying the transfer
speed of a large file (1 GB) by ftp. We find:

    recv bytes/sec =  267KB/sec =  2.1 Mbit/sec =  2% of network max
    xmit bytes/sec = 12.1MB/sec = 96.8 Mbit/sec = 97% of network max
    (server cpu usage during ftp = 16% sys 0% user)
    (server 5.28K ints/sec, 0.67K ctxt switch/sec)
    (note this includes disk i/o interrupts)

UDP or XDR Problems? No ...
---------------------------
However, ftp uses TCP/IP, not UDP/IP, and doesn't use RPC/XDR. We can
eliminate problems there with an NFS write test.
Mount a volume via NFS in the 'normal' way, and copy a large file
(actually, not copy, but dd bs=100k count=10000), which moves 1GB through
NFS:

    recv bytes/sec =  233KB/sec =  2.0 Mbit/sec =  2% of network max
    xmit bytes/sec = 12.0MB/sec = 95.8 Mbit/sec = 96% of network max
    (server cpu usage during nfs read = 12% sys 0% user)
    (server 2.67K ints/sec, 4.25K ctxt switch/sec)
    (note this includes disk i/o interrupts)

Note this is comparable to the ftp performance, with slightly less cpu
load! So moving large gobs of data through NFS/XDR/RPC/UDP is not a
problem.

Just Plain Old Latency? Yes ...
-------------------------------
Although there may be more subtle device driver and/or UDP issues
associated with small packets, the above two tests exercised large packets
only. Can we test small packets only? Sure. By launching e.g. 12 clients,
we can overlap the network latency and try to make sure that both the
network and the cpu are fully loaded. Unfortunately, 12 clients max out
the client cpu at 100%, so we weren't able to stress the nfs server to its
limits:

network traffic during 'null call':
    recv bytes/sec = 2.38MB/sec = 19 Mbit/sec = 19% of network max
    xmit bytes/sec = 1.95MB/sec = 16 Mbit/sec = 16% of network max
    (server cpu usage during null call = 69% sys 0% user)
    (server 44.2K ints/sec, 75.2K ctxt switch/sec)
    (client cpu usage: 97%)

network traffic during 'getattr call':
    recv bytes/sec = 3.10MB/sec = 25 Mbit/sec = 25% of network max
    xmit bytes/sec = 3.77MB/sec = 30 Mbit/sec = 30% of network max
    (server cpu usage during getattr = 71% sys 0% user)
    (server 41.7K ints/sec, 72.3K ctxt switch/sec)
    (client cpu usage: 99%)

But we came close. Lesson of the day: as long as there is just *one*
active client issuing small work requests, the major component of response
time is determined by the system & network latency. Having lots of clients
running in parallel does a much better job of loading the server.
Conclusion
----------
Summary from Pei Cao:

> Based on the numbers in the report, I would think that the breakdown of
> the cycles consumed by each stage of the implementation of getattr to be
> the following:
>
>     2.75 microsecond    the actual getattr kernel routine itself
>    17.5  microsecond    UDP/RPC/XDR + loopback interface + 2 ctxt sw. code
>   ~12-13 microsecond    ethernet driver + 2 ethernet interrupts
>                         + 1 more ctxt sw
>
> I got the third row of numbers by calculating how many CPU cycles the
> ethernet-based testing of getattr would consume from the system CPU
> utilization numbers.