Slow command execution time in Linux

Folks on Superuser suggested that I post this here.

We are experiencing slow execution of simple commands in Linux, possibly related to procfs slowness.

For example, a simple uptime command can take a few seconds to execute.
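
For anyone measuring the same symptom, a minimal sketch (the actual traces from this host are further below): time shows the wall-clock cost, and with strace the -r flag prints the delta since the previous syscall while -T prints the time spent inside each syscall.

# time uptime
# strace -r -T uptime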

Here are the inputs:

  • Platform: AWS
  • Instance: x1.32xlarge (128 vCPUs, 2 TB RAM)
  • OS: Ubuntu 16.04
  • Kernel: 4.4.0-1043-aws

We run Docker with around 250 containers.

  • Docker version: 17.09-ce

Utilization:

  • CPU utilization: <50%
  • Mem utilization: <50%

Some OS stats:

# cat /proc/loadavg
100.45 108.30 109.41 35/254951 544357

# vmstat 1 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
110  2 791584 485552640 50354496 687472448    0    0   426   183    1    1 10  8 82  1  0
13  0 791584 485495104 50353940 687477568    0    0 22820 47984 196555 326943 12 12 75  1  0
33  1 791584 485385632 50352428 687473536    0    0 38932 52892 166486 389428 13 14 72  1  0

# ps axu | wc -l
3792
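
For reference when reading these numbers: the fourth field of /proc/loadavg is runnable/total kernel scheduling entities (processes plus threads), so this host carries roughly 255K tasks while ps counts only ~3.8K processes; the rest are threads. A quick cross-check, assuming procps-style ps:

# awk '{print "runnable/total tasks:", $4}' /proc/loadavg
# ps -eL --no-headers | wc -l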

What exactly happens?

Launching simple commands takes time whenever the command touches procfs in any way. For example, running ls in a directory with a couple of files gets stuck on an open syscall against procfs:

# strace -r ls
...
0.000084 open("/proc/filesystems", O_RDONLY) = 3
3.652504 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
...

Or uptime:

# strace -r uptime
...
0.000035 open("/proc/filesystems", O_RDONLY) = 3
11.014154 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
...
0.000044 open("/proc/uptime", O_RDONLY) = 3
1.554646 lseek(3, 0, SEEK_SET)
...

Quick Q/A and what we already tried:

  • This slowness is only present at the host level; inside containers we don’t see such issues. We played around a bit and can also reproduce it in a container started with both the --ipc=host and --pid=host flags (see the reproduction sketch after this list).
  • We have traced this slowness in procfs to the mutex_lock call here: https://github.com/torvalds/linux/blob/v4.4/fs/namei.c#L3082
  • Too many containers
    • No, we have 600 on other host and all good
  • Too many processes
    • No, we have 10K on other host and all good
  • Too many threads
    • Possibly. We don’t have that many threads on any other host, but we tried to reproduce the issue on a clean x1.32xlarge instance and failed. So it might be threads plus something else.
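
A minimal reproduction sketch of the container case mentioned above (the image and command are placeholders; any image with a shell will do). With both flags the container shares the host’s PID and IPC namespaces, so its /proc lists every host task:

# docker run --rm --pid=host --ipc=host ubuntu:16.04 bash -c 'time cat /proc/uptime'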

Any ideas and suggestions are welcome.

Solution:

We found the root cause.

We ran

/usr/share/bcc/tools/funcslower -m 250 -T proc_pid_readdir

to find which processes were making calls to proc_pid_readdir slower than 250 ms (-m 250 sets the threshold, -T adds timestamps).

It reported a few processes:

zabbix_agent
atop
byobu

All of these were making really long calls:

TIME       COMM           PID    LAT(ms)             RVAL FUNC
20:01:35   zabbix_agentd  921144 1258.01                0 proc_pid_readdir 
20:01:38   zabbix_agentd  921144 2692.71                0 proc_pid_readdir 
20:01:39   zabbix_agentd  921145 1276.88                0 proc_pid_readdir
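
A related sketch for confirming the same thing, assuming the same bcc tools path: funclatency prints a latency histogram for the kernel function instead of per-call lines, which makes the overall distribution easier to see (here, traced for 30 seconds, in milliseconds):

# /usr/share/bcc/tools/funclatency -m -d 30 proc_pid_readdir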

When we stopped all of these, we saw a great improvement.

So it looks like combining a large number of processes and threads with frequent calls to proc_pid_readdir causes severe procfs slowness.
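
A rough sketch for checking how often a given agent re-walks /proc (each full walk goes through proc_pid_readdir); <PID> is a placeholder for e.g. the zabbix_agentd PID, and -y makes strace print the path behind each file descriptor:

# timeout 30 strace -y -f -e trace=getdents,getdents64 -p <PID> 2>&1 | grep -c '</proc>'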

We will try the 4.14 kernel today and see if it gets better.

UPDATE:

Kernel 4.14 solved our issue as well. The simple per-inode mutex has been replaced there with an rwsem.
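
For anyone applying the same fix, a minimal verification sketch after rebooting into the newer kernel: confirm the running version and re-run the trace that was slow before; the deltas on the /proc open/fstat calls should drop back to microseconds:

# uname -r
# strace -r uptime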
