10

One of the services on my system is spamming the log and generating a lot of data. I don't know which service it is.

With regular log files, I can use ls -lha to print the size of all the files in the /var/log directory, which will point me to the culprit. But with systemd, all the logs are in one 'journal' file.

How do I generate a list of units with their corresponding log size? I'd like something like this:

+----------------------+-------+
|         UNIT         | SIZE  |
+----------------------+-------+
| system.slice         | 35.2M |
| cron.service         | 17.0M |
| zabbix-agent.service |  8.9M |
|          ...         |  ...  |

By adapting the answer here: https://unix.stackexchange.com/questions/727066/how-to-get-only-the-unit-names-using-systemctl I created this rather slow method, which, by the way, also has a command-injection vulnerability via the unit names:

systemctl show '*' --state=loaded --property=Id --value --no-pager | grep . | sort | uniq | xargs -i sh -c ' journalctl -u {} | wc -c | tr -d "\n" && echo -n " " && echo {}' | sort -n

Breaking this down:

systemctl show '*' ## Show all units.
 --state=loaded    ## Only loaded units.
--property=Id      ## Show only the Id property (the unit name).
--value            ## Print just the value, without the "Id=" prefix.
--no-pager         ## Don't invoke a pager.
| grep .           ## Drop empty lines.
| sort             ## Sort by name.
| uniq             ## Collapse duplicate names.
| xargs -i sh -c ' ## For each name, run the following shell snippet; open (1):
journalctl -u {}   ## Run journalctl to print that unit's journal.
| wc -c            ## Count the number of bytes in the output.
| tr -d "\n"       ## Strip the trailing newline from the output of 'wc'.
 && echo -n " "    ## Add a space for 'sort' later.
&& echo {}'        ## Append the name of the unit. Close (1).
| sort -n          ## Sort the combined output numerically by size, ascending.

It 'works', but is obviously very, very slow, having to crawl through the entire journal. If the journal is big, this could take a long time. I wonder if there is a far better way of doing this?
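For reference, the injection risk comes from xargs -i substituting {} directly into the sh -c command string. A variant that avoids it (a sketch with the same logic, reading each name in a while loop and only ever expanding it inside quotes) looks like this:

```shell
systemctl show '*' --state=loaded --property=Id --value --no-pager |
  grep . | sort -u |    ## unique, non-empty unit names
  while IFS= read -r unit; do
    ## The name is expanded only inside quotes, so it is never
    ## re-parsed as shell code.
    printf '%s %s\n' "$(journalctl -u "$unit" | wc -c)" "$unit"
  done |
  sort -n
```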

muru
  • 207,228
npr_se
  • 101

3 Answers

7

Rather than polling systemctl for the loaded units and then running journalctl separately on each one, you could have journalctl output just the unit name for each message, then sort and count those. For example:

$ journalctl --output=cat --output-fields=UNIT | sort | uniq -c | sort -nr | head
  14195 anacron.service
   4104 NetworkManager-dispatcher.service
   2627 apt-daily.service
   2580 motd-news.service
   2499 fwupd.service
   2415 cups.service
   2356 fwupd-refresh.service
   1884 cups.path
   1876 cups.socket

or, if you prefer awk:

journalctl --output=cat --output-fields=UNIT | gawk '{count[$0]++} 
  END {
    PROCINFO["sorted_in"] = "@val_num_desc";
    for(unit in count) printf "%8d %s\n", count[unit], unit
  }' | head

Note that there's a slight difference between counting journal entries for unit foo and counting lines in the journalctl -u foo output:

steeldriver@steeldriver-virtualbox:~$ journalctl -u motd-news | wc -l
2668
steeldriver@steeldriver-virtualbox:~$ journalctl -u motd-news | grep -c -v '^-- Boot'
2580

If you want to accumulate the size of the messages rather than the number of log entries then you could get at that via the JSON output, e.g.

journalctl --output=json --output-fields=UNIT,SYSLOG_IDENTIFIER,MESSAGE | jq -r '
  [.UNIT // .SYSLOG_IDENTIFIER, (.MESSAGE|length)] | @tsv' | gawk -F'\t' '{messlen[$1] += $2} 
    END {
      PROCINFO["sorted_in"] = "@val_num_desc"; 
      for (unit in messlen) printf "%8d\t%s\n", messlen[unit], unit
    }' | head

or, doing it all in jq:

journalctl --output=json --output-fields=UNIT,SYSLOG_IDENTIFIER,MESSAGE | jq -r --slurp '
  map({name: (.UNIT // .SYSLOG_IDENTIFIER), size: (.MESSAGE|length)}) | group_by(.name) | 
  map({name: .[0].name, size: map(.size) | add}) | sort_by(.size) | reverse[] | [.size,.name] | @tsv
' | head

The pure jq version ran comparably fast to the jq + awk version when I restricted it to a single boot (i.e. journalctl -b --output=json ...) but apparently got killed when I tried it with the full journal - that may just be due to memory limits on the VM I'm testing it on though.
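If it's the --slurp that exhausts memory, jq can also aggregate in a streaming fashion: with -n, reduce inputs folds one entry at a time into a running total instead of loading the whole journal first. A sketch along the same lines (the // "unknown" fallback is my addition, for entries lacking both fields):

```shell
journalctl --output=json --output-fields=UNIT,SYSLOG_IDENTIFIER,MESSAGE | jq -rn '
  reduce inputs as $e ({};                            # fold entries one at a time
    .[$e.UNIT // $e.SYSLOG_IDENTIFIER // "unknown"]
      += ($e.MESSAGE | length))                       # accumulate message sizes
  | to_entries | sort_by(-.value)[]
  | "\(.value)\t\(.key)"
' | head
```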

steeldriver
  • 142,475
3


The culprit of the slowness is the journalctl command.

I believe that for your purpose you don't need a list of log sizes, but rather an indication of what is logging the most.

So it should be enough to reduce the size of the inspected time window, so that journalctl has less input to work through1.

E.g., restricting the window to the last 4 hours can be done by adding -S "$(date -d "- 4 hours" +"%Y-%m-%d %T")" to the journalctl command:

journalctl -S "$(date -d "- 4 hours" +"%Y-%m-%d %T")" -u {}

I also took the liberty of optimizing your script a bit further. There is no need for --no-pager in a pipe, and no need for grep . | sort | uniq: systemctl show '*' --state=loaded --property=Id --value only outputs unique lines plus empty lines, and the empty lines are discarded by xargs. I also changed tr -d "\n" && echo -n " " && echo {} to the simpler and more readable awk "{print \$1,\"{}\"}":

systemctl show '*' --state=loaded --property=Id --value |
xargs -i sh -c 'journalctl -S "$(date -d "- 4 hours" +"%Y-%m-%d %T")" -u {} | wc -c | awk "{print \$1,\"{}\"}"' |
sort -n

This outputs exactly what your command outputs (namely, for each loaded unit, the size in bytes of the journalctl -u <unit> output, followed by the unit's name, sorted by size), which I assume you like as is, since you didn't complain about it. It could be trimmed to output only the relevant lines, but since this is meant to be a quick tool to find out who's misbehaving, it's probably fine as it is. These are the last lines the command outputs on my system:

17 uuidd.service
17 uuidd.socket
17 veritysetup.target
17 vgauth.service
69 session-250.scope
266 smartmontools.service
402 user-1000.slice
550 systemd-logind.service
560 motd-news.service
811 ssh.service
1808 phpsessionclean.service
2466 init.scope
2466 -.slice
2524 cron.service
4151 system.slice
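As a side note, per systemd.time(7) journalctl also understands relative time specifications directly, so --since=-4h should be equivalent to the date invocation above. Combined with passing the unit name as a positional argument (which also sidesteps the injection issue mentioned in the question), a sketch of the same pipeline would be:

```shell
systemctl show '*' --state=loaded --property=Id --value |
  xargs -n1 sh -c '
    ## $1 is the unit name, passed as an argument instead of being
    ## substituted into the command string by xargs.
    printf "%s %s\n" "$(journalctl --since=-4h -u "$1" | wc -c)" "$1"
  ' _ |
  sort -n
```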

On my system, reducing the inspected time window to the last 4 hours cuts the execution time by about a factor of 8 (from ~2 minutes to ~15 seconds); I put together this ugly script, which drops the caches before running each of the two commands, and ran it to make the comparison:

#!/usr/bin/env bash

sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches'

time { systemctl show '*' --state=loaded --property=Id --value | grep . | sort | uniq | xargs -i sh -c ' journalctl -u {} | wc -c | tr -d "\n" && echo -n " " && echo {}' | sort -n; } &>/dev/null

sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches'

time { systemctl show '*' --state=loaded --property=Id --value | xargs -i sh -c 'journalctl -S "$(date -d "- 4 hours" +"%Y-%m-%d %T")" -u {} | wc -c | awk "{print \$1,\"{}\"}"' | sort -n; } &>/dev/null

exit 0

These are the results:

% ./script.sh

real    1m58.843s
user    1m46.212s
sys     0m5.423s

real    0m13.133s
user    0m2.175s
sys     0m2.922s
./script.sh  108.40s user 8.36s system 88% cpu 2:12.38 total

The inspected time window can be fine-tuned to your needs (even narrowed further, if the issue is very prominent).

1. I'm still looking for a better way to tackle this though.

kos
  • 41,268
-1

Indeed your own script works fine. If it takes too long to execute, you could always reduce the journal size before running it.

Check the size:

journalctl --disk-usage

Reduce the size to, say, 100M:

journalctl --vacuum-size=100M

Run your one-liner again; it will now be fast. This way you should easily be able to find your "spammer".
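To keep the journal from simply growing back afterwards, the size can also be capped persistently (this assumes persistent journaling to /var/log/journal; see journald.conf(5) for details):

```ini
# /etc/systemd/journald.conf (excerpt)
[Journal]
SystemMaxUse=100M
```

Then apply it with sudo systemctl restart systemd-journald.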

guntbert
  • 13,475