Tuning Linux NFS for VMware ESXi Datastore Performance


Configuring the Linux Kernel for Hardware RAID

On my system, a 2.8TB ESXi datastore was served over NFS. The disks were in a hardware RAID 5 array on an HP P800 SAS controller. Before tuning, average I/O throughput was poor, never exceeding 15MB/s.

The first point to note here is that the RAID is hardware based, so there is no need for the Linux I/O scheduler to reorder and queue requests on its behalf. Let the RAID controller handle all of the disk I/O.

Changing the Linux I/O scheduler from the default [cfq] to [noop] gave a major improvement. On my system, this change alone increased average I/O speed to 54MB/s, with a maximum read speed of 95MB/s.

echo "noop" > /sys/block/cciss\!c0d0/queue/scheduler
echo "noop" > /sys/block/cciss\!c0d1/queue/scheduler

The RAID controller and network cards must not share interrupts.


To check for shared interrupts:
cat /proc/interrupts
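
To narrow the output down to just the RAID controller and the network cards (the device names here are assumptions; adjust them to match your hardware):
grep -E 'cciss|eth' /proc/interrupts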

Stop all unnecessary services as shown in the example below.

chkconfig --list
chkconfig --level 3 xfs off
chkconfig --level 3 sendmail off
chkconfig --level 3 gpm off
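
Note that chkconfig only changes what starts at the next boot. To stop the same services immediately:
service xfs stop
service sendmail stop
service gpm stop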

Increasing Linux NFSD Threads for ESXi Performance

For heavy workloads you will need to increase the number of NFS server threads. Use 128 threads as a starting point. 

/etc/sysconfig/nfs
RPCNFSDCOUNT=128

The thread count can also be changed at runtime:
echo 128 > /proc/fs/nfsd/threads
ps ax | grep nfs
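
To confirm the new thread count took effect:
cat /proc/fs/nfsd/threads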

Monitoring Linux NFS Server Thread Performance

Now you will need to monitor the performance of those 128 threads.

cat /proc/net/rpc/nfsd

rc 0 58708094 377082036
fh 0 0 0 0 0
io 4191757345 576031420
th 128 931051 15634.270 7895.289 27271.733 3240.960 229.407 127.945 68.418 85.028 55.789 1935.526
ra 32 376818531 0 0 0 0 0 0 0 0 0 809
net 435790183 0 435790174 3325
rpc 435788524 0 0 0 0
proc2 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
proc3 22 0 142652 256 106273 8674 0 376819530 58706883 312 118 0 0 303 124 76 0 0 1241 2892 0 0 0
proc4ops 40 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

In the output above, we are interested in the line that starts with "th".
128 = number of NFS server threads
931051 = number of times that all 128 threads were busy at the same time. In this case, the system had been running for a period of 3 weeks.

The remaining ten numbers form a histogram showing, in seconds, how long a given percentage of the threads was active, in 10% steps.

For example:
15634.270 (up to 10% of the threads were active)
27271.733 (up to 30% of the threads were active)
The last three numbers, which indicate usage of 80%, 90% and 100%, should stay low. If they are high, your system needs additional NFS threads to cope with the load.
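
A small sketch to pull out just the thread count and the last three histogram buckets (field positions assume the 13-field "th" line shown above):
awk '/^th / {print "threads:", $2, " 80%:", $(NF-2), " 90%:", $(NF-1), " 100%:", $NF}' /proc/net/rpc/nfsd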

Configuring HP RAID Controllers For NFS Performance

Download the hpacucli utility from the HP website. Search for "HP Array Configuration Utility CLI for Linux".

http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?swItem=MTX-66b08e49c28f4bd49f4641ed80

# df -h

Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
128G 11G 111G 9% /
/dev/cciss/c0d0p1 99M 15M 79M 16% /boot
tmpfs 2.0G 0 2.0G 0% /dev/shm
/dev/cciss/c0d1 2.8T 2.5T 158G 95% /mnt/nfs

Using the nfsstat command, we saw 80% reads and 20% writes on /mnt/nfs. Set RAID Controller cache ratio to 75% read and 25% write.

hpacucli ctrl all show config detail
hpacucli ctrl slot=2 modify cacheratio=75/25
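
The 80/20 split itself came from the server-side statistics (nfsstat -s). A quick sketch that reproduces the measurement from /proc/net/rpc/nfsd, assuming the standard NFSv3 procedure order on the proc3 line (READ and WRITE are the 7th and 8th procedure counters):
awk '/^proc3/ {r=$9; w=$10; printf "reads: %.0f%%  writes: %.0f%%\n", 100*r/(r+w), 100*w/(r+w)}' /proc/net/rpc/nfsd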


Configure Linux Disk Read-Ahead for NFS Performance

The read-ahead disk cache was set to 512KB. However, the right value varies depending on what the system is used for. A video streaming server reads mostly sequentially and benefits from a larger read-ahead, while a database server seeks randomly and gains little from it.

For SSDs, disable read-ahead: since there is no rotational delay, there is no point in reading extra data just because it might be needed. Otherwise, set the read-ahead (the value is in 512-byte sectors, so 1024 equals 512KB) as follows:

/sbin/blockdev --setra 1024 /dev/cciss/c0d0
/sbin/blockdev --setra 1024 /dev/cciss/c0d1
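
To verify the current read-ahead value:
/sbin/blockdev --getra /dev/cciss/c0d1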

Configuring Linux Filesystems for NFS Performance

For faster disk writes, use the data=journal mount option, and prevent updates to file access times with noatime, which would otherwise result in additional data being written to disk.

/dev/cciss/c0d1 /mnt/nfs ext4 defaults,noatime,data=journal 0 0

The data=journal mode provides full data and metadata journaling. All new data is written to the journal first, and then to its final location. In the event of a crash, the journal can be replayed, bringing both data and metadata into a consistent state. This mode is the slowest, except when data needs to be read from and written to disk at the same time, where it outperforms all other modes.
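
As an alternative to the fstab option, the data journaling mode can also be stored in the filesystem superblock with tune2fs (a sketch; the filesystem must be remounted for it to take effect):
tune2fs -o journal_data /dev/cciss/c0d1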

Configuring Linux NFS Exports for ESXi Performance

Configure the NFS server to write to disk without delay using the no_wdelay export option. Although using "async" might produce better results, for data integrity I prefer the "sync" option.

/etc/exports
/mnt/nfs *(rw,insecure,all_squash,sync,no_wdelay)
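
After editing /etc/exports, apply the changes without restarting the NFS service and verify the active options:
exportfs -ra
exportfs -v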

Configure Linux TCP/IP for NFS Performance

/etc/sysctl.conf
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65535 16777216
net.core.netdev_max_backlog = 30000
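
Apply the new values without a reboot:
sysctl -p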

Configuring Linux Network for NFS Performance

/sbin/ifconfig eth0 txqueuelen 10000
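
The txqueuelen setting does not survive a reboot. One option (an assumption, based on the RHEL/CentOS-style init used on this host) is to append the command to /etc/rc.local:
echo '/sbin/ifconfig eth0 txqueuelen 10000' >> /etc/rc.local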

Configuring Linux CPU I/O Wait Times

High I/O wait is not necessarily a bad thing, since CPUs have become much faster than disk subsystems. We will tell Linux to flush dirty data as quickly as possible so that individual writes stay small.
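
To see how much time the CPUs actually spend waiting on I/O, watch the %iowait column of the CPU report:
iostat -c 5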


Linux Kernel Settings for NFS Write Performance

In the Linux kernel, dirty page writeback is controlled by two parameters: vm.dirty_ratio and vm.dirty_background_ratio. Both are expressed as a percentage of available memory, that is, free memory plus reclaimable memory. The first controls the point at which a process must start writing out its own dirty data; the second controls when the kernel writeback thread ([pdflush] on older kernels) is woken up to start writing out global dirty data.

dirty_background_ratio is always kept below dirty_ratio: if dirty_background_ratio >= dirty_ratio, the kernel automatically sets it to dirty_ratio / 2.

Kernels also provide dirty_background_bytes and dirty_bytes, which define the limits in bytes instead of as a percentage. In this scenario I have used vm.dirty_ratio and vm.dirty_background_ratio.

Monitor the effect on disk and memory with:
iostat -x /dev/cciss/c0d1
vmstat 1 10

/etc/sysctl.conf
vm.dirty_background_ratio = 1
vm.dirty_expire_centisecs = 1000
vm.dirty_ratio = 10
vm.dirty_writeback_centisecs = 100
vm.vfs_cache_pressure = 40
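
The effect on the dirty page pool can be watched directly; Dirty and Writeback are standard /proc/meminfo fields:
watch -n 1 "grep -E 'Dirty|Writeback' /proc/meminfo"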

