# cat /etc/sysctl.conf 
fs.aio-max-nr=99999999
fs.file-max=99999999
kernel.pid_max=4194304
kernel.threads-max=99999999
kernel.sem=32768 1073741824 2000 32768
kernel.shmmni=32768
kernel.msgmni=32768
kernel.msgmax=65536
kernel.msgmnb=65536
vm.max_map_count=1048576

# cat /etc/security/limits.conf
 * soft core unlimited
 * hard core unlimited
 * soft data unlimited
 * hard data unlimited
 * soft fsize unlimited
 * hard fsize unlimited
 * soft memlock unlimited
 * hard memlock unlimited
 * soft nofile 1048576
 * hard nofile 1048576
 * soft rss unlimited
 * hard rss unlimited
 * soft stack unlimited
 * hard stack unlimited
 * soft cpu unlimited
 * hard cpu unlimited
 * soft nproc unlimited
 * hard nproc unlimited
 * soft as unlimited
 * hard as unlimited
 * soft maxlogins unlimited
 * hard maxlogins unlimited
 * soft maxsyslogins unlimited
 * hard maxsyslogins unlimited
 * soft locks unlimited
 * hard locks unlimited
 * soft sigpending unlimited
 * hard sigpending unlimited
 * soft msgqueue unlimited
 * hard msgqueue unlimited

# cat /etc/systemd/logind.conf
[Login]
UserTasksMax=infinity

# free -g 
              total        used        free      shared  buff/cache   available
Mem:            117           5          44          62          67          48
Swap:            15           8           7

# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       194G  121G   74G  63% /

# cat /proc/meminfo
MemTotal:       123665416 kB
MemFree:        90979152 kB
MemAvailable:   95376636 kB
Buffers:           72260 kB
Cached:         25964076 kB
SwapCached:            0 kB
Active:          8706568 kB
Inactive:       22983044 kB
Active(anon):    7568968 kB
Inactive(anon): 18871224 kB
Active(file):    1137600 kB
Inactive(file):  4111820 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      16777212 kB
SwapFree:       16777212 kB
Dirty:                20 kB
Writeback:             0 kB
AnonPages:       5653128 kB
Mapped:           185100 kB
Shmem:          20786924 kB
KReclaimable:     281732 kB
Slab:             541000 kB
SReclaimable:     281732 kB
SUnreclaim:       259268 kB
KernelStack:       34384 kB
PageTables:        93216 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    78609920 kB
Committed_AS:   63750908 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       46584 kB
VmallocChunk:          0 kB
Percpu:            18944 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      183484 kB
DirectMap2M:     5058560 kB
DirectMap1G:    122683392 kB

And for the user account used to run the scripts:

$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) unlimited
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Yet the scripts still fail with:

./somescript.sh: fork: retry: Resource temporarily unavailable

The server has a medium load (a load average of about 20 at the moment) and runs many scripts that fork extensively (i.e. $(somecode) inside many scripts). The server (a Google Cloud instance) has 16 cores and 128GB RAM, with a 100GB tmpfs drive and 16GB swap. The message shows even when the CPU, the memory and the swap are all under 50% use.

It is hard to believe it would be hitting any of these already high upper limits. I suspect there is some other setting that affects this.

What else can be tuned to avoid this fork: retry: Resource temporarily unavailable issue?

1 Answer

After more debugging I finally found the answer. It seems valuable to share, as others may run into this too. It may also be a bug in Ubuntu (TBD).

My scripts made the following change (in-script) in various places:

ulimit -u 20000 2>/dev/null

The 20000 number would vary from 2000 to 40000 depending on the script/situation.
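
As a minimal sketch of how such an in-script setting is scoped: a ulimit call changes the limit of the shell that runs it and of everything that shell forks afterwards, but not of its parent. The helper name limit_in_subshell and the value 1000 are made up for illustration, and the soft limit is used so the demo does not touch the hard limit.

```bash
#!/usr/bin/env bash
# Sketch: a ulimit change made in a child shell does not leak back to the parent.
# limit_in_subshell is a hypothetical helper name, not taken from the scripts.
limit_in_subshell() {
    local requested=$1
    (
        # Soft limit only. Errors (for example a value above the hard limit)
        # are silenced, mirroring the 2>/dev/null used in the scripts.
        ulimit -S -u "$requested" 2>/dev/null
        # ulimit is a builtin, so printing the value needs no further fork.
        ulimit -S -u
    )
}

outer_before=$(ulimit -S -u)
inner=$(limit_in_subshell 1000)
outer_after=$(ulimit -S -u)

echo "parent before: $outer_before"
echo "inside child:  $inner"
echo "parent after:  $outer_after"
```

The parent values should match each other, while the child reports the requested value if the hard limit allowed the change. Every script that sets its own ulimit -u therefore carries its own private limit into all of its children.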

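What seems to happen is that a number of processes somehow "maxed out" the maximum total of open files (1048576), which would seem easy to do even with a limited number of scripts, each multiplied by its respective ulimit setting. The result was that at most about 2000-2200 threads would be started. Note also that ulimit -u sets RLIMIT_NPROC, which the kernel counts per user across all of that user's processes and threads, so a script that lowers its own limit to 2000 can no longer fork once the account as a whole is running about that many threads.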

I removed all the ulimit -u settings, and I no longer get any fork: retry: Resource temporarily unavailable errors, nor any other related fork errors.
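
For anyone hunting for the same thing, here is a rough sketch of how the in-script overrides could be located. The function name find_nproc_overrides and the two demo files are invented for this example; in practice you would point it at your own script directory. The pattern is deliberately simple and may miss unusual spellings.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list lines that call ulimit with the -u (max user
# processes) option, optionally combined with -S or -H, under a directory.
find_nproc_overrides() {
    local dir=$1
    grep -rnE 'ulimit[[:space:]]+-[SH]*u([[:space:]]|$)' "$dir"
}

# Self-contained demo on throwaway files (the file names are made up).
demo_dir=$(mktemp -d)
printf '%s\n' '#!/bin/bash' 'ulimit -u 20000 2>/dev/null' 'echo work' > "$demo_dir/a.sh"
printf '%s\n' '#!/bin/bash' 'ulimit -n 1048576' > "$demo_dir/b.sh"

matches=$(find_nproc_overrides "$demo_dir")
echo "$matches"

rm -rf "$demo_dir"
```

Only a.sh is reported, since b.sh changes the open files limit rather than the process limit.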

htop now also shows many more than 2000-2200 threads:

Tasks: 2349, 22334 thr, 318 kthr; 32 running
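
To put a thread count like the one above next to an in-script limit, something along these lines may help. It is only a sketch: it assumes Linux with /proc mounted, counts tasks by directory ownership (which approximates, but is not identical to, what the kernel counts for RLIMIT_NPROC), and the names count_user_tasks and nproc_headroom are made up.

```bash
#!/usr/bin/env bash
# Count the tasks (threads) under /proc that are owned by the current user.
# Linux-only approximation of the per-user task count.
count_user_tasks() {
    local n=0 t
    for t in /proc/[0-9]*/task/[0-9]*; do
        [ -O "$t" ] && n=$((n + 1))
    done
    echo "$n"
}

# Headroom left under an nproc limit. Zero or negative means a further fork
# would be refused for a non-root user.
nproc_headroom() {
    local limit=$1 tasks=$2
    echo $(( limit - tasks ))
}

tasks_now=$(count_user_tasks)
echo "tasks owned by this user right now: $tasks_now"

# Figures from this post: an in-script limit of 20000 against 22334 threads.
echo "headroom at 20000 with 22334 threads: $(nproc_headroom 20000 22334)"
```

With the numbers from this post the headroom is negative, so a script that still carried ulimit -u 20000 would not be able to fork at the thread count htop reports now.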

Now my machine becomes overloaded/unresponsive, but that is another problem (the server is likely swapping), and a much more enjoyable one than the fork issue :)

(As an interesting sidenote and reference, https://stackoverflow.com/questions/30757919/the-limit-of-ulimit-hn describes how to increase the max number of open files to an amount greater than 1048576.)

It should be easy to set up a test for this (a nested-fork bash script with a ulimit -n ${some_large_value} set inside each forked thread).