8 August 2001

Hi,

Here's a semi-formal report of what I've found so far. Doug asked that the
NFS 'getattr' call be measured; I answer this in the broader context of
Chris's question: 'what does it take to make NFS go fast?'

The findings are summarized as follows:

(1) getattr is fast, very fast.

(2) There is a lot of overhead in the XDR/RPC code: 85%-90% of the total
    time is spent in this 'network' code.

(3) There is a lot of latency in the network: viz. a single (non-threaded)
    client issuing getattr as fast as it can can only use up 25% of the
    server cpu and 8% of the network bandwidth.

There are several possible directions for future work:

(A) Look to understand the XDR/RPC/UDP/device-driver overhead better (and
    start looking for ways to tune it).

(B) Determine if other network cards can significantly alter latency
    and/or overhead (either due to better device drivers, better hardware,
    or different (e.g. 1 gigabit) speeds).

(C) Measure read performance, to see just how big the XDR/RPC/UDP overhead
    is in that case. This 'overhead' is probably small, at least on a
    100 Mbit network, but there's a hint that it may not look that way on
    a 1 gigabit network.

(D) Do nothing until a better understanding of the SPEC NFS benchmark is
    gained. If the dominant components are large file reads/writes, then
    the 'network' overhead may pale in comparison.

--linas

=====================================================================

Below follows a detailed writeup of the results. Warning: it's long,
probably longer than what you were expecting.

Client NFS Driver
-----------------
To be able to exercise an NFS server while having explicit control over
the protocol being sent, a small, simple user-space NFS client was
developed. The client consists of the Mount and NFS v2 RPC IDL files
(easily obtained on the net), compiled with the C library 'rpcgen' command
to create stubs for the mount and nfs protocols.
The stubs were filled in with enough code to work correctly, and
instrumented with some simple elapsed-time clocks. Currently, the client
takes a hostname, a directory path, and a count N. It mounts the directory
from the host, invokes 'getattr' N times, and unmounts. It reports the
elapsed time for each of these operations. N=100K or 1M is needed to get
something that runs for more than a few seconds.

CPU Cycle Counts
----------------
To be able to determine the number of cpu clock cycles it takes to get
from point A to point B in the Linux kernel, a simple infrastructure
around the rdtsc instruction was developed. Clock-cycle counts can be
measured with the rdtsc instruction. It counts at the rate of the CPU
clock: it will count 1700M ticks per second on a 1700MHz Intel machine,
and 700M per second on a 700MHz Athlon. The gcc compiler/inline-assembler
code for this is:

    static inline unsigned long long rdtsc_insn (void)
    {
        unsigned long long xtime;
        asm volatile ("rdtsc" : "=A" (xtime));
        return xtime;
    }

See also the linux kernel header file /usr/include/asm/msr.h

CPU Cycles Spent in the Kernel NFS getattr Call
-----------------------------------------------
If we wrap a pair of these counters around the interior of the kernel
function fs/nfsd/nfsproc.c:nfsd_proc_getattr() and then run the NFS
client, we find that on average we spend 911 cycles in getattr(), with a
standard deviation of 195 cycles. (There is a standard deviation because
we average over 100K calls; it's fairly large because various network
events, interrupts, etc., as well as virtual memory details, will cause
timing variations.) 911 cycles is *fast* by almost any yardstick: 900-odd
instructions is not a lot in the grand scheme of things.

Elapsed Time From the Client's Point of View
--------------------------------------------
We can measure the amount of time spent in 'getattr' in a different way:
by looking at the elapsed time, from the client's point of view.
This elapsed time includes the overhead of the client creating and sending
out the RPC call, the network delay, and the time on the server to receive
& decode the network traffic, perform the getattr, and respond. This
measurement is made simple by the fact that the NFS protocol specifies a
'null call': a call that goes all the way from client to server, but does
nothing on the server. The null call *does* generate an RPC response which
does go back on the net: the null call behaves like an 'empty' subroutine.

With both client and server on the same machine, I invoked 'null call' 1
million times via the custom nfs client. We find (for a 1700 MHz cpu):

    null call:      17.5 microseconds
    getattr call:   20.3 microseconds

If we think of 'null call' as 'network overhead', then the *actual*
getattr() takes 20.3 - 17.5 = 2.75 microseconds (or 2.75 * 1700MHz = 4675
clock cycles ... I also measured on a 700 MHz Athlon; the microsecond
times are exactly 1700/700 slower: i.e. it's still 4675 clock cycles on
the Athlon).

Why is 4675 clocks so much larger than 911 clocks? I'm not sure.
Presumably, some of this additional overhead is due to the slightly larger
UDP packets involved with getattr, as well as differences in XDR
encode/decode effort, etc. I tried hunting around deeper in the kernel to
make up the difference, but found only half the 'missing' cycles in the
XDR/RPC code. The other half presumably is in the UDP and ethernet device
driver.

The important thing to note here is: 'protocol overhead' accounts for 85%
of the elapsed time! (N.B. running the knfsd as a loadable module, as
opposed to compiled into the kernel, resulted in a 10-15% performance
improvement, although maybe this was a fluke, as there was other activity
on the machine ...)

Elapsed Time Over the Network
-----------------------------
The above tests were run over the 'loopback' network interface, i.e. lo at
127.0.0.1. For comparison, we run the same tests over ethernet.
The performance is much, much worse:

    null call:      114 microseconds
    getattr call:   134 microseconds

What's going on? Why? I also monitored the cpu usage on the server during
this test, along with the network traffic, the interrupts, and the context
switches. These loads differ between 'null call' and 'getattr', but not by
a lot. The network is 100Mbit ethernet:

    Ethernet controller: Intel Corporation 82820 820 (Camino 2)
    Chipset Ethernet (rev 1).

network traffic during 'null call':
    recv bytes/sec =  725KB/sec = 5.7 Mbit/sec =  6% of network max
    xmit bytes/sec =  800KB/sec = 6.4 Mbit/sec =  6% of network max
    (server cpu usage during null call = 27% sys 0% user)
    (server 18.5K ints/sec, 27.5K ctxt switch/sec)

network traffic during 'getattr call':
    recv bytes/sec =  848KB/sec = 6.8 Mbit/sec =  7% of network max
    xmit bytes/sec = 1230KB/sec = 9.8 Mbit/sec = 10% of network max
    (server cpu usage during getattr = 24% sys 0% user)
    (server 15.5K ints/sec, 23.1K ctxt switch/sec)

Note that if we multiply the interrupt rate by the total elapsed time and
divide by the number of calls, we find that there are *exactly* two
interrupts per RPC call, and three context switches.

Soooo ... what's the bottleneck? Neither the CPU nor the network is maxed
out. Let's rule out the obvious culprits:

Device Driver Problems? No ...
------------------------------
We can rule out most device driver problems by verifying the transfer
speed of a large file (1 GB) by ftp. We find:

    recv bytes/sec =  267KB/sec =  2.1 Mbit/sec =  2% of network max
    xmit bytes/sec = 12.1MB/sec = 96.8 Mbit/sec = 97% of network max
    (server cpu usage during ftp = 16% sys 0% user)
    (server 5.28K ints/sec, 0.67K ctxt switch/sec)
    (note this includes disk i/o interrupts)

UDP or XDR Problems? No ...
---------------------------
However, ftp uses TCP/IP, not UDP/IP, and doesn't use RPC/XDR. We can
eliminate problems there with an NFS write test.
Mount a volume via NFS in the 'normal' way, and copy a large file
(actually, not copy, but dd bs=100k count=10000), which moves 1GB through
NFS:

    recv bytes/sec =  233KB/sec =  2.0 Mbit/sec =  2% of network max
    xmit bytes/sec = 12.0MB/sec = 95.8 Mbit/sec = 96% of network max
    (server cpu usage during nfs read = 12% sys 0% user)
    (server 2.67K ints/sec, 4.25K ctxt switch/sec)
    (note this includes disk i/o interrupts)

Note this is comparable to the ftp performance, with slightly less cpu
load! So moving large gobs of data through NFS/XDR/RPC/UDP is not a
problem.

Just Plain Old Latency? Yes ...
-------------------------------
Although there may be more subtle device driver and/or UDP issues
associated with small packets, the above two tests exercised large packets
only. Can we test small packets only? Sure. By launching e.g. 12 clients,
we can overlap the network latency and try to make sure that both the
network and the cpu are fully loaded. Unfortunately, 12 clients max out
the client cpu at 100%, so we weren't able to stress the nfs server to its
limits:

network traffic during 'null call':
    recv bytes/sec = 2.38MB/sec = 19 Mbit/sec = 19% of network max
    xmit bytes/sec = 1.95MB/sec = 16 Mbit/sec = 16% of network max
    (server cpu usage during null call = 69% sys 0% user)
    (server 44.2K ints/sec, 75.2K ctxt switch/sec)
    (client cpu usage: 97%)

network traffic during 'getattr call':
    recv bytes/sec = 3.10MB/sec = 25 Mbit/sec = 25% of network max
    xmit bytes/sec = 3.77MB/sec = 30 Mbit/sec = 30% of network max
    (server cpu usage during getattr = 71% sys 0% user)
    (server 41.7K ints/sec, 72.3K ctxt switch/sec)
    (client cpu usage: 99%)

But we came close. Lesson of the day: as long as there is just *one*
active client issuing small work requests, the major component of response
time is determined by the system & network latency. Having lots of clients
running in parallel does a much better job of loading the server.
Conclusion
----------
Summary from Pei Cao:

> Based on the numbers in the report, I would think that the breakdown of
> the cycles consumed by each stage of the implementation of getattr to be
> the following:
>
>     2.75 microsecond    the actual getattr kernel routine itself
>    17.5  microsecond    UDP/RPC/XDR + loopback interface + 2 ctxt sw. code
>   ~12-13 microsecond    ethernet driver + 2 ethernet interrupts
>                         + 1 more ctxt sw
>
> I got the third row of numbers by calculating how many CPU cycles the
> ethernet-based testing of getattr would consume from the system CPU
> utilization numbers.