Interrupt coalescing for high bandwidth packet capture?
Tags: linux, redhat, ethernet, nic, packet-capture
I have an application which does packet capture from an ethernet card. Once in a while we see packets dropped (we suspect because the buffer in the network card or kernel is being overrun). I am trying to figure out whether turning on interrupt coalescing will help or worsen the situation. On the one hand, there should be less work for the CPU since there are fewer interrupts to process. On the other hand, if the IRQs are not processed as frequently, there seems to be a higher probability of a buffer being overrun. Does that mean I should turn it on and also increase the rmem_max setting?
UPDATED TO INCLUDE OS/HW Details:
Dell PowerEdge 1950, Dual Quad-Core Xeon X5460 @ 3.16GHz
Broadcom NetXtreme II BCM5708
/proc/sys/net/core:
dev_weight 64
netdev_budget 300
rmem_default 110592
somaxconn 128
wmem_max 16777216
xfrm_aevent_rseqth 2
message_burst 10
netdev_max_backlog 65536
rmem_max 16777216
warnings 1
xfrm_acq_expires 30
xfrm_larval_drop 1
message_cost 5
optmem_max 20480
rps_sock_overflow_entries 0
wmem_default 110592
xfrm_aevent_etime 10
Without knowing why you’re dropping packets, it’s impossible to know whether it’ll help or not. Your analysis is fundamentally correct: if interrupts are serviced less often, there’s a greater chance of buffers filling up, all else being equal. Until you know where the packets are being lost, though, you can’t tell whether that change will improve the situation.
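To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. It assumes a gigabit link saturated with minimum-size frames and a few hypothetical ring sizes; none of these figures come from the poster’s system, and the function names are made up for illustration.

```python
# Back-of-the-envelope check: how long can an RX ring absorb traffic
# before it is full, if nothing drains it? All inputs are assumptions.

def max_packet_rate(link_bps: float, frame_bytes: int) -> float:
    """Packets per second at line rate.

    Adds 20 bytes per frame for the Ethernet preamble and inter-frame gap.
    """
    return link_bps / ((frame_bytes + 20) * 8)


def ring_fill_time_us(ring_slots: int, packets_per_second: float) -> float:
    """Microseconds until a ring with `ring_slots` descriptors is full."""
    return ring_slots / packets_per_second * 1_000_000


if __name__ == "__main__":
    worst_case_pps = max_packet_rate(1e9, 64)   # gigabit, 64-byte frames
    for slots in (256, 1024, 4096):             # hypothetical ring sizes
        print(slots, "slots:", round(ring_fill_time_us(slots, worst_case_pps)), "us")
```

Under these assumptions a 256-slot ring fills in roughly 172 microseconds, so a coalescing delay well below that should not overflow the ring by itself. Real traffic is rarely all minimum-size frames, so the actual margin is probably larger.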
Personally, I find throwing good-quality NICs with good drivers into a good quality server makes all my problems go away. Much cheaper than spending days grovelling through debug data.
Okay, you’ve not given some of the basic information (like particular OS distribution or kernel version). That matters because the sysctl/kernel setting defaults differ across distros and certain tunables aren’t exposed in some Linux systems. You’re working with a server from 2008, so how do we know that your OS and kernel aren’t from the same era?
Looking at your network parameters, though, I’d increase the default buffer sizes. A recent system setup for high-frequency trading I deployed had much higher rmem_default settings. Try “8388608” to start and see if that helps. It’s a basic change, but usually the first step…
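If you control the application’s source, the same idea can be tried per socket instead of system-wide. The sketch below is only an illustration: it uses a UDP socket as a stand-in for whatever socket your capture code really opens, and `request_receive_buffer` is a made-up helper name. On Linux the kernel caps the request at rmem_max and reports back a doubled value to account for bookkeeping overhead.

```python
import socket


def request_receive_buffer(sock: socket.socket, requested_bytes: int) -> int:
    """Ask for a larger receive buffer and return what the kernel granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    # A UDP socket stands in for the real capture socket (an assumption).
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        before = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        granted = request_receive_buffer(s, 8388608)
        print("default:", before, "granted:", granted)
```

Comparing the granted value with the request is a quick way to see whether rmem_max is the limiting factor.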
I would also look at changing the realtime priorities of your (presumably custom) application. Are you using any form of CPU affinity (taskset, cgroups) in your app or wrapper script? How about the realtime priority of your app? Look into the chrt command and its options to see what would be appropriate for your situation. Is your application multithreaded?
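As a rough sketch of what that could look like from inside the application, assuming it can call the Linux scheduler APIs directly: the snippet below does approximately what `taskset -c` and `chrt -f` do from a shell. The CPU number and priority are placeholders, and `pin_and_prioritize` is a made-up name.

```python
import os


def pin_and_prioritize(cpu: int, rt_priority: int) -> bool:
    """Pin this process to one CPU and try to switch it to SCHED_FIFO.

    Returns True if the realtime policy was applied, False if the kernel
    refused it (typically because the process lacks the privilege).
    """
    os.sched_setaffinity(0, {cpu})  # pid 0 means the calling process
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(rt_priority))
    except PermissionError:
        return False
    return True


if __name__ == "__main__":
    # Placeholder values: pick a CPU that is not also servicing the NIC IRQ.
    cpu = min(os.sched_getaffinity(0))
    print("realtime applied:", pin_and_prioritize(cpu, 10))
```

A realtime process that spins can starve everything else on its CPU, so test with a modest priority first.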
Luckily, the 5400-series CPU doesn’t have hyperthreading to deal with, but how are your other BIOS settings? Did you disable power management and C-states? Are there any unnecessary daemons running on the system? Is
Now, as to the hardware you’re using, if this is for HFT use, you’re behind; literally THREE jumps in CPU and architectural changes… The Nehalem (5500-series) brought a big jump in tech over the 5400-series you’re using. Westmere (5600) was even better. Sandy Bridge was a big enough change over the 5500/5600 to spur another hardware refresh in my environments.
It also sounds like you’re using the onboard NICs. There were some hoops we needed to jump through when dealing with Broadcom… But you’re not at that point yet. How does CPU load look when you encounter dropped packets? What kind of data rate are you seeing during your captures? This may just be a case of your system not keeping up.
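One way to get at the flow-rate question is to sample the kernel’s per-interface counters twice and take the difference. This is a minimal sketch that reads /proc/net/dev; the helper names are made up, and these counters do not cover every place a packet can be lost (NIC-specific and capture-socket statistics may tell a different story).

```python
import time


def parse_rx_counters(proc_net_dev_text: str) -> dict:
    """Map interface name -> (rx_packets, rx_dropped) from /proc/net/dev text."""
    counters = {}
    for line in proc_net_dev_text.splitlines()[2:]:  # skip two header lines
        name, _, fields = line.partition(":")
        values = fields.split()
        if len(values) >= 4:
            # Receive columns start with: bytes, packets, errs, drop
            counters[name.strip()] = (int(values[1]), int(values[3]))
    return counters


def rx_rates(before: dict, after: dict, interval_s: float) -> dict:
    """Per-interface (packets/s, drops/s) between two snapshots."""
    return {
        name: ((after[name][0] - before[name][0]) / interval_s,
               (after[name][1] - before[name][1]) / interval_s)
        for name in after if name in before
    }


if __name__ == "__main__":
    def snapshot() -> dict:
        with open("/proc/net/dev") as f:
            return parse_rx_counters(f.read())

    first = snapshot()
    time.sleep(1.0)
    print(rx_rates(first, snapshot(), 1.0))
```

Running this during a capture, alongside a look at per-CPU load, should show whether drops line up with traffic bursts or with a saturated core.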
There are a lot of knobs to tweak/tune here. A better understanding of what you’re working with will help us narrow things down, though.
Edit: you mentioned Red Hat. The options for EL5 and EL6 differ, but the suggestions above do apply in theory.
Edit: It’s good that you’re on RHEL 6. There’s a lot you can do. Try setting the priority of your app and test. Another useful guide is the RHEL MRG tuning guide. Not all of the features will be available to your kernel, but this will give you some ideas and explanations for some of the things you can modify for more deterministic performance.