Kernel Path Lengths for NFS Operations
--------------------------------------
24 August 2001 -- Linas Vepstas

Summary:
--------
This report focuses on the path lengths of the read operation in the
Linux kernel NFS-server implementation.  We present a detailed
accounting of some of the paths, but fail to account for most of the
CPU time spent on the server.  We can conclude that the NFS server
itself, along with the kernel file-system buffers, is quite fast,
accounting for only a third or less of the total CPU time spent on
the server.  We guess that most of the missing time is lost in the
ethernet device driver.

Intro
-----
Let's assume a single-threaded NFS client driving a (multi-threaded)
NFS server as fast as it can; i.e. the client issues an NFS request
as soon as it has received the reply to its last request.  We can
break down the sequence of events that contribute to the total
elapsed time for one iteration of this loop:

   client creates request packet
   client sends request on wire
   wire flight time
   server ethernet hardware i/o handling
   server udp/ip stack packet read
   server rpc/nfs processing
   server udp/ip stack packet send
   server ethernet hardware i/o handling
   wire flight time
   client read of response packet

By instrumenting the kernel with the rdtsc instruction, I've measured
in detail the following components:

   server rpc/nfs processing
   server udp/ip stack packet send

for the NFS 'nullcall', the NFS 'getattr', and the NFS read of a
900-byte file.

Measurement Tool Kernel Module
------------------------------
The tool is a loadable kernel module, 'ks.o'.  To instrument the
kernel, one inserts the macros START_STAMP(timer_name) and
STOP_STAMP(timer_name) around the components of interest.  Each time
these macros are executed, the elapsed time between them is measured.
The results can be viewed through the /proc file system: the kernel
module creates the entry /proc/kstamp/stats.  'cat'ing this file
shows:

   -- number of invocations
   -- average elapsed time
   -- min elapsed time (over all measurements)
   -- max elapsed time (over all measurements)
   -- square deviation (I need to implement a kernel-safe integer
      sqrt function)

sample output:

    0 inner_calib        count=4       avg=154  lo=120  hi=220    dev=1668
    1 outer_calib        count=4       avg=468  lo=408  hi=524    dev=1736
    2 svc_process        count=1000000 avg=8920 lo=8048 hi=144884 dev=821885
    3 nfsd_cache_lookup  count=0       avg=0    lo=0    hi=0      dev=0
    4 nfsd_dispatch      count=1000000 avg=2100 lo=1764 hi=116368 dev=106241
    5 pc_decode          count=1000000 avg=279  lo=264  hi=27676  dev=9356
    6 pc_func            count=1000000 avg=1024 lo=796  hi=29672  dev=49900
    7 pc_encode          count=1000000 avg=177  lo=160  hi=28384  dev=6096
    8 nfsd_proc_null     count=0       avg=0    lo=0    hi=0      dev=0
    9 nfsd_proc_getattr  count=1000000 avg=757  lo=544  hi=29388  dev=38947
   10 nfsd_proc_read     count=0       avg=0    lo=0    hi=0      dev=0
   11 nfsd_cache_update  count=0       avg=0    lo=0    hi=0      dev=0

The counts can be reset by writing anything to the stats file, e.g.
'echo "anything" > /proc/kstamp/stats'.

(This is an enhanced version of the tool used previously.)
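This report does not reproduce the module source, but to make the
technique concrete, here is a minimal sketch of what rdtsc-based
START_STAMP/STOP_STAMP macros could look like.  This is an
illustration under stated assumptions, not the actual ks.o code: the
kstamp_timer structure, its field names, and the isqrt() helper are
all hypothetical.

    /* Minimal sketch of rdtsc-based stamping; NOT the actual ks.o
     * source.  Assumes 32-bit x86, where the "=A" constraint returns
     * the 64-bit time-stamp counter in edx:eax (2.4-era kernels ship
     * a similar rdtscll() helper in <asm/msr.h>). */

    typedef unsigned long long u64;

    struct kstamp_timer {
        const char *name;
        u64 start;              /* tsc value at START_STAMP */
        u64 count;              /* number of invocations */
        u64 total;              /* summed cycles, for the average */
        u64 lo, hi;             /* fastest and slowest observations */
    };

    #define RDTSC(val) \
        __asm__ __volatile__("rdtsc" : "=A" (val))

    #define START_STAMP(t)  do { RDTSC((t).start); } while (0)

    #define STOP_STAMP(t)                                   \
        do {                                                \
            u64 now, delta;                                 \
            RDTSC(now);                                     \
            delta = now - (t).start;                        \
            (t).count++;                                    \
            (t).total += delta;                             \
            if ((t).count == 1 || delta < (t).lo)           \
                (t).lo = delta;                             \
            if (delta > (t).hi)                             \
                (t).hi = delta;                             \
        } while (0)

    /* kernel-safe integer square root (shift-and-subtract, no
     * floating point) -- the sort of helper needed to turn the
     * accumulated square deviation into a standard deviation */
    static u64 isqrt(u64 x)
    {
        u64 res = 0;
        u64 bit = 1ULL << 62;   /* highest power of 4 <= 2^63 */

        while (bit > x)
            bit >>= 2;
        while (bit) {
            if (x >= res + bit) {
                x -= res + bit;
                res = (res >> 1) + bit;
            } else {
                res >>= 1;
            }
            bit >>= 2;
        }
        return res;
    }

In use, one would declare a static struct kstamp_timer per measured
routine and wrap the call site -- e.g. START_STAMP/STOP_STAMP around
the svc_process() call in fs/nfsd/nfssvc.c -- and /proc/kstamp/stats
would then report the count, avg, lo, and hi columns shown above.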
Results
-------
The server component has three major blocks: receive packet, fulfill
request, send packet.  The last two are easy to measure; the first is
harder, because its timing depends on ethernet hardware interrupts
(and thus wasn't measured here -- maybe later).

The 'fulfill request' path starts in fs/nfsd/nfssvc.c, in the
infinite loop circa line 209 where svc_process() is called.
svc_process() is a generic rpc routine that does some rpc packet
decoding and then calls back into the nfsd code for specific handling
(see net/sunrpc/svc.c).

The detailed break-out of calls is shown below.  Note that
microseconds * 1700 == cycle count, since the server is a 1700 MHz
machine.  Where two numbers are quoted, this is the range observed
across the nullcall, getattr, and nfs read calls; the nullcall is
always at the low end of the range and the read at the high end.

   subroutine                  cycle      micro
   name                        count      secs       comments
   ----------                  -------    -------    ----------------------------------
   enter svc_process
     svc_authenticate          200        0.12       // rpc svc - checks client ip addr
     svc_putlong               96         0.06       // rpc svc - write rpc reply packet header
     memset                    170-325    0.10-0.19  // rpc svc - zero out reply packets
     enter nfsd_dispatch                             // rpc call to nfsd svc dispatch handler
       nfsd_cache_lookup       0                     // zero because nfsd read, getattr not cached here
       pc_decode               200        0.12       // nfsd xdr decode of request packet contents
       enter pc_func                                 // call the actual nfsd routine
         nfsd_proc_null        0                     // nfsd null call -- takes no time
           -- or --
         nfsd_proc_getattr     657        0.39       // nfsd getattr asks kernel for file data
           -- or --
         nfsd_proc_read        8590       5.05       // nfsd call to kernel to read 900 byte file
       exit pc_func            0                     // approx zero overhead on top of children
       pc_encode               100-200    0.06-0.12  // nfsd xdr encode of results to wire protocol
       nfsd_cache_update       0                     // cache inactive for getattr, read
     exit nfsd_dispatch        0                     // approx zero overhead
     enter svc_send                                  // rpc svc - send rpc packet out
       svc_release_skb         530-690    0.31-0.41  // rpc svc - release prior request skb
       sock_sendmsg            5080-7450  3.00-4.38  // kernel socket code - send packet
       svc_sock_release        0                     // a no-op here
     exit svc_send             0                     // approx zero overhead
   exit svc_process            200        0.12       // approx overhead not measured above

Measurement Notes:
------------------
1) There is a *lot* of jitter in the measurements.  Over a million
   iterations (typically), the *fastest* time for any given routine
   is about 10%-15% faster than the average, while the *slowest* time
   is usually 50-100 times slower than the average.  Note that the
   path lengths above are the averages, not the fastest times.

2) There are strong cache/TLB contention effects everywhere.
   Inserting 'dead code' of varying length will improve or degrade
   performance by 10-15% (and sometimes more).  This makes the
   timings difficult to measure, and leads to inaccuracies of about
   10% in all measurements.

   (FYI: TLB == translation look-aside buffer, a kind of cache for
   the otherwise time-consuming virtual-to-real address translation
   computations.)

   (FYI: Cache contention occurs when multiple distinct addresses
   alias to the same cache line.  If, for example, the cache is 2-way
   set-associative and three hot addresses map to the same set, then
   every pass through a loop *always* evicts lines from that set; a
   4-way set-associative cache would never evict in the same case.
   Note that such evictions can cost hundreds of CPU cycles, on the
   same order as the finest-grained measurements here.)

Sanity Check
------------
We can add up the numbers in the call graph above to create a 'sanity
check', and compare the synthetic totals to direct measurements of
svc_process:

              synthetic    measured    (measured)
              cycle cnt    cycle cnt   (microsec)
   nullcall     6576         5877         3.46
   getattr      7233         7727         4.55
   read        17951        19646        11.56

So the call graph overestimates the null call by about 10% and
underestimates the read by about 10%.

Not the End of the Story
------------------------
The above figures are interesting, and reasonably accurate, but in
fact fail to account for most of the elapsed time spent in the
kernel.  From earlier measurements, we know that the total elapsed
time, from client to server and back, is on the order of 100 to 300
microseconds.  Above, we've accounted in detail for only about 3% of
that time!!

From old data, we have:

              round-trip    server    client
              time usecs    % busy    % busy
   nullcall      114         21.5%      22%
   getattr       135         22.5%      27%
   read          300         12%        24%

So we can compute the total CPU time consumed on the server by the
processing of one request:

   null      21.5% cpu busy * 114 usec = 24.5 usec
   getattr   22.5% cpu busy * 135 usec = 30.4 usec
   read      12%   cpu busy * 300 usec = 36   usec

This is not a happy state of affairs: e.g. for the read of the
900-byte file, of the total 36 usecs we have accounted for only 11.6
usecs in the read-request handling and the packet sending.  Where did
the other 24 usecs go?  Presumably into packet receive, the ethernet
device driver, context switching, etc.

If we do the same calculation on the client, then we can estimate the
amount of elapsed time spent on the 'wire' itself.  Total client cpu
time:

   null      22% cpu busy * 114 usec = 25 usec
   getattr   27% cpu busy * 135 usec = 36 usec
   read      24% cpu busy * 300 usec = 72 usec

Wire-flight time (total elapsed time minus client cpu time minus
server cpu time):

   null      114 - 25 - 25 = 64  usec
   getattr   135 - 30 - 36 = 68  usec
   read      300 - 36 - 72 = 192 usec

Note that the 'wire time' accounts for half or more of the total
elapsed time.  I'm not sure if this is due to the ethernet CSMA/CD
algorithms, or because the ethernet cards have big buffers, or what.
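This bookkeeping is easy to get wrong by hand, so here is a small
stand-alone sketch (ordinary user-space C, not kernel code) that
reproduces the server, client, and wire-time estimates from the
measured round-trip times and busy percentages.  The numbers are
copied from the tables above; the program itself is just the
arithmetic, written out.

    #include <stdio.h>

    int main(void)
    {
        /* measured round-trip times and CPU-busy fractions,
         * copied from the 'old data' table above */
        const char  *name[]        = { "nullcall", "getattr", "read" };
        const double rtt_usec[]    = { 114.0, 135.0, 300.0 };
        const double server_busy[] = { 0.215, 0.225, 0.12 };
        const double client_busy[] = { 0.22,  0.27,  0.24 };
        int i;

        for (i = 0; i < 3; i++) {
            /* CPU time per request = busy fraction * round trip */
            double server = server_busy[i] * rtt_usec[i];
            double client = client_busy[i] * rtt_usec[i];
            /* whatever is left is time spent 'on the wire' */
            double wire   = rtt_usec[i] - server - client;

            printf("%-8s  server=%5.1f  client=%5.1f  wire=%5.1f usec\n",
                   name[i], server, client, wire);
        }
        return 0;
    }

Running this prints roughly 64, 68, and 192 usec of wire time for
the nullcall, getattr, and read respectively, matching (to rounding)
the hand calculation above.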
Conclusion
----------
The NFS subsystem still looks 'fast', as before.  A lot of time seems
to be lost in packet receive, context switching, interrupt handling,
or, what seems most likely, in the ethernet device driver.