Kernel Path Lengths for NFS Operations
--------------------------------------
24 August 2001 -- Linas Vepstas

Summary:
--------
This report focuses on the path lengths of the read operation in the
Linux kernel NFS-server implementation.  We present a detailed
accounting of some of the paths, but fail to account for most of the
CPU time spent on the server.  We can conclude that the NFS server
itself, along with the kernel file-system buffers, is quite fast,
accounting for only a third or less of the total CPU time spent on
the server.  We guess that most of the missing time is lost in the
ethernet device driver.

Intro
-----
Let's assume a single-threaded NFS client driving a (multi-threaded)
NFS server as fast as it can; i.e. the client issues an NFS request
as soon as it has received the reply to its last request.  We can
break down the sequence of events that contribute to the total
elapsed time for one iteration of this loop:

   client creates request packet
   client sends request on wire
   wire flight time
   server ethernet hardware i/o handling
   server udp/ip stack packet read
   server rpc/nfs processing
   server udp/ip stack packet send
   server ethernet hardware i/o handling
   wire flight time
   client read of response packet

By instrumenting the kernel with the rdtsc instruction, I've measured
in detail the following components:

   server rpc/nfs processing
   server udp/ip stack packet send

for the NFS 'nullcall', the NFS 'getattr', and the NFS read of a
900-byte file.

Measurement Tool Kernel Module
------------------------------
The tool is a loadable kernel module, 'ks.o'.  To instrument the
kernel, one inserts the macros START_STAMP(timer_name) and
STOP_STAMP(timer_name) around the components of interest.  Each time
these macros are executed, the elapsed time between them is measured.
The results can be viewed through the /proc file system: the kernel
module creates the entry /proc/kstamp/stats.  'cat'ing this file
shows:

   -- number of invocations
   -- average elapsed time
   -- min elapsed time (over all measurements)
   -- max elapsed time (over all measurements)
   -- square deviation (I need to implement a kernel-safe integer
      sqrt function)

sample output:

    0 inner_calib        count=4       avg=154  lo=120  hi=220    dev=1668
    1 outer_calib        count=4       avg=468  lo=408  hi=524    dev=1736
    2 svc_process        count=1000000 avg=8920 lo=8048 hi=144884 dev=821885
    3 nfsd_cache_lookup  count=0       avg=0    lo=0    hi=0      dev=0
    4 nfsd_dispatch      count=1000000 avg=2100 lo=1764 hi=116368 dev=106241
    5 pc_decode          count=1000000 avg=279  lo=264  hi=27676  dev=9356
    6 pc_func            count=1000000 avg=1024 lo=796  hi=29672  dev=49900
    7 pc_encode          count=1000000 avg=177  lo=160  hi=28384  dev=6096
    8 nfsd_proc_null     count=0       avg=0    lo=0    hi=0      dev=0
    9 nfsd_proc_getattr  count=1000000 avg=757  lo=544  hi=29388  dev=38947
   10 nfsd_proc_read     count=0       avg=0    lo=0    hi=0      dev=0
   11 nfsd_cache_update  count=0       avg=0    lo=0    hi=0      dev=0

The counts can be reset by writing anything to the stats file, e.g.
'echo "anything" > /proc/kstamp/stats'.

(This is an enhanced version of the tool used previously.)
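This report does not reproduce the module source, but to make the
technique concrete, here is a minimal sketch of what rdtsc-based
START_STAMP/STOP_STAMP macros could look like.  This is an
illustration under stated assumptions, not the actual ks.o code: the
kstamp_timer structure, its field names, and the isqrt() helper are
all hypothetical.

    /* Minimal sketch of rdtsc-based stamping; NOT the actual ks.o
     * source.  Assumes 32-bit x86, where the "=A" constraint returns
     * the 64-bit time-stamp counter in edx:eax (2.4-era kernels ship
     * a similar rdtscll() helper in <asm/msr.h>). */

    typedef unsigned long long u64;

    struct kstamp_timer {
        const char *name;
        u64 start;              /* tsc value at START_STAMP */
        u64 count;              /* number of invocations */
        u64 total;              /* summed cycles, for the average */
        u64 lo, hi;             /* fastest and slowest observations */
    };

    #define RDTSC(val) \
        __asm__ __volatile__("rdtsc" : "=A" (val))

    #define START_STAMP(t)  do { RDTSC((t).start); } while (0)

    #define STOP_STAMP(t)                                   \
        do {                                                \
            u64 now, delta;                                 \
            RDTSC(now);                                     \
            delta = now - (t).start;                        \
            (t).count++;                                    \
            (t).total += delta;                             \
            if ((t).count == 1 || delta < (t).lo)           \
                (t).lo = delta;                             \
            if (delta > (t).hi)                             \
                (t).hi = delta;                             \
        } while (0)

    /* kernel-safe integer square root (shift-and-subtract, no
     * floating point) -- the sort of helper needed to turn the
     * accumulated square deviation into a standard deviation */
    static u64 isqrt(u64 x)
    {
        u64 res = 0;
        u64 bit = 1ULL << 62;   /* highest power of 4 <= 2^63 */

        while (bit > x)
            bit >>= 2;
        while (bit) {
            if (x >= res + bit) {
                x -= res + bit;
                res = (res >> 1) + bit;
            } else {
                res >>= 1;
            }
            bit >>= 2;
        }
        return res;
    }

In use, one would declare a static struct kstamp_timer per measured
routine and wrap the call site -- e.g. START_STAMP/STOP_STAMP around
the svc_process() call in fs/nfsd/nfssvc.c -- and /proc/kstamp/stats
would then report the count, avg, lo, and hi columns shown above.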
Results
-------
The server component has three major blocks: receive packet, fulfill
request, send packet.  The last two are easy to measure; the first is
harder, because its timing depends on ethernet hardware interrupts
(and thus wasn't measured here -- maybe later).

The 'fulfill request' path starts in fs/nfsd/nfssvc.c, in the
infinite loop circa line 209 where svc_process() is called.
svc_process() is a generic rpc routine that does some rpc packet
decoding and then calls back into the nfsd code for specific handling
(see net/sunrpc/svc.c).

The detailed break-out of calls is shown below.  Note that
microseconds * 1700 == cycle count, since the server is a 1700 MHz
machine.  Where two numbers are quoted, this is the range observed
across the nullcall, getattr, and nfs read calls; the nullcall is
always at the low end of the range and the read at the high end.

   subroutine                  cycle      micro
   name                        count      secs       comments
   ----------                  -------    -------    ----------------------------------
   enter svc_process
     svc_authenticate          200        0.12       // rpc svc - checks client ip addr
     svc_putlong               96         0.06       // rpc svc - write rpc reply packet header
     memset                    170-325    0.10-0.19  // rpc svc - zero out reply packets
     enter nfsd_dispatch                             // rpc call to nfsd svc dispatch handler
       nfsd_cache_lookup       0                     // zero because nfsd read, getattr not cached here
       pc_decode               200        0.12       // nfsd xdr decode of request packet contents
       enter pc_func                                 // call the actual nfsd routine
         nfsd_proc_null        0                     // nfsd null call -- takes no time
           -- or --
         nfsd_proc_getattr     657        0.39       // nfsd getattr asks kernel for file data
           -- or --
         nfsd_proc_read        8590       5.05       // nfsd call to kernel to read 900 byte file
       exit pc_func            0                     // approx zero overhead on top of children
       pc_encode               100-200    0.06-0.12  // nfsd xdr encode of results to wire protocol
       nfsd_cache_update       0                     // cache inactive for getattr, read
     exit nfsd_dispatch        0                     // approx zero overhead
     enter svc_send                                  // rpc svc - send rpc packet out
       svc_release_skb         530-690    0.31-0.41  // rpc svc - release prior request skb
       sock_sendmsg            5080-7450  3.00-4.38  // kernel socket code - send packet
       svc_sock_release        0                     // a no-op here
     exit svc_send             0                     // approx zero overhead
   exit svc_process            200        0.12       // approx overhead not measured above

Measurement Notes:
------------------
1) There is a *lot* of jitter in the measurements.  Over a million
   iterations (typically), the *fastest* time for any given routine
   is about 10%-15% faster than the average, while the *slowest* time
   is usually 50-100 times slower than the average.  Note that the
   path lengths above are the averages, not the fastest times.

2) There are strong cache/TLB contention effects everywhere.
   Inserting 'dead code' of varying length will improve or degrade
   performance by 10-15% (and sometimes more).  This makes the
   timings difficult to measure, and leads to inaccuracies of about
   10% in all measurements.

   (FYI: TLB == translation look-aside buffer, a kind of cache for
   the otherwise time-consuming virtual-to-real address translation
   computations.)

   (FYI: Cache contention occurs when multiple distinct addresses
   alias to the same cache line.  If, for example, the cache is 2-way
   set-associative and three hot addresses map to the same set, then
   every pass through a loop *always* evicts lines from that set; a
   4-way set-associative cache would never evict in the same case.
   Note that such evictions can cost hundreds of CPU cycles, on the
   same order as the finest-grained measurements here.)

Sanity Check
------------
We can add up the numbers in the call graph above to create a 'sanity
check', and compare the synthetic totals to direct measurements of
svc_process:

              synthetic    measured    (measured)
              cycle cnt    cycle cnt   (microsec)
   nullcall     6576         5877         3.46
   getattr      7233         7727         4.55
   read        17951        19646        11.56

So the call graph overestimates the null call by about 10% and
underestimates the read by about 10%.

Not the End of the Story
------------------------
The above figures are interesting, and reasonably accurate, but in
fact fail to account for most of the elapsed time spent in the
kernel.  From earlier measurements, we know that the total elapsed
time, from client to server and back, is on the order of 100 to 300
microseconds.  Above, we've accounted in detail for only about 3% of
that time!!

From old data, we have:

              round-trip    server    client
              time usecs    % busy    % busy
   nullcall      114         21.5%      22%
   getattr       135         22.5%      27%
   read          300         12%        24%

So we can compute the total CPU time consumed on the server by the
processing of one request:

   null      21.5% cpu busy * 114 usec = 24.5 usec
   getattr   22.5% cpu busy * 135 usec = 30.4 usec
   read      12%   cpu busy * 300 usec = 36   usec

This is not a happy state of affairs: e.g. for the read of the
900-byte file, of the total 36 usecs we have accounted for only 11.6
usecs in the read-request handling and the packet sending.  Where did
the other 24 usecs go?  Presumably into packet receive, the ethernet
device driver, context switching, etc.

If we do the same calculation on the client, then we can estimate the
amount of elapsed time spent on the 'wire' itself.  Total client cpu
time:

   null      22% cpu busy * 114 usec = 25 usec
   getattr   27% cpu busy * 135 usec = 36 usec
   read      24% cpu busy * 300 usec = 72 usec

Wire-flight time (total elapsed time minus client cpu time minus
server cpu time):

   null      114 - 25 - 25 = 64  usec
   getattr   135 - 30 - 36 = 68  usec
   read      300 - 36 - 72 = 192 usec

Note that the 'wire time' accounts for half or more of the total
elapsed time.  I'm not sure if this is due to the ethernet CSMA/CD
algorithms, or because the ethernet cards have big buffers, or what.
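This bookkeeping is easy to get wrong by hand, so here is a small
stand-alone sketch (ordinary user-space C, not kernel code) that
reproduces the server, client, and wire-time estimates from the
measured round-trip times and busy percentages.  The numbers are
copied from the tables above; the program itself is just the
arithmetic, written out.

    #include <stdio.h>

    int main(void)
    {
        /* measured round-trip times and CPU-busy fractions,
         * copied from the 'old data' table above */
        const char  *name[]        = { "nullcall", "getattr", "read" };
        const double rtt_usec[]    = { 114.0, 135.0, 300.0 };
        const double server_busy[] = { 0.215, 0.225, 0.12 };
        const double client_busy[] = { 0.22,  0.27,  0.24 };
        int i;

        for (i = 0; i < 3; i++) {
            /* CPU time per request = busy fraction * round trip */
            double server = server_busy[i] * rtt_usec[i];
            double client = client_busy[i] * rtt_usec[i];
            /* whatever is left is time spent 'on the wire' */
            double wire   = rtt_usec[i] - server - client;

            printf("%-8s  server=%5.1f  client=%5.1f  wire=%5.1f usec\n",
                   name[i], server, client, wire);
        }
        return 0;
    }

Running this prints roughly 64, 68, and 192 usec of wire time for
the nullcall, getattr, and read respectively, matching (to rounding)
the hand calculation above.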
Conclusion
----------
The NFS subsystem still looks 'fast', as before.  A lot of time seems
to be lost in packet receive, context switching, interrupt handling,
or, what seems most likely, in the ethernet device driver.