
I have a log file sorted by IP address, and I want to find the number of occurrences of each unique IP address. How can I do this with bash? Ideally listing the number of occurrences next to each IP, such as:

5.135.134.16 count: 5
13.57.220.172: count 30
18.206.226 count:2

and so on.

Here’s a sample of the log:

5.135.134.16 - - [23/Mar/2019:08:42:54 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
5.135.134.16 - - [23/Mar/2019:08:42:55 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
5.135.134.16 - - [23/Mar/2019:08:42:55 -0400] "POST /wp-login.php HTTP/1.1" 200 3836 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
5.135.134.16 - - [23/Mar/2019:08:42:55 -0400] "POST /wp-login.php HTTP/1.1" 200 3988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
5.135.134.16 - - [23/Mar/2019:08:42:56 -0400] "POST /xmlrpc.php HTTP/1.1" 200 413 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:05 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:06 -0400] "POST /wp-login.php HTTP/1.1" 200 3985 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:07 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:08 -0400] "POST /wp-login.php HTTP/1.1" 200 3833 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:09 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:11 -0400] "POST /wp-login.php HTTP/1.1" 200 3836 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:12 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:15 -0400] "POST /wp-login.php HTTP/1.1" 200 3837 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.220.172 - - [23/Mar/2019:11:01:17 -0400] "POST /xmlrpc.php HTTP/1.1" 200 413 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
13.57.233.99 - - [23/Mar/2019:04:17:45 -0400] "GET / HTTP/1.1" 200 25160 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
18.206.226.75 - - [23/Mar/2019:21:58:07 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "https://www.google.com/url?3a622303df89920683e4421b2cf28977" "Mozilla/5.0 (Windows NT 6.2; rv:33.0) Gecko/20100101 Firefox/33.0"
18.206.226.75 - - [23/Mar/2019:21:58:07 -0400] "POST /wp-login.php HTTP/1.1" 200 3988 "https://www.google.com/url?3a622303df89920683e4421b2cf28977" "Mozilla/5.0 (Windows NT 6.2; rv:33.0) Gecko/20100101 Firefox/33.0"
18.213.10.181 - - [23/Mar/2019:14:45:42 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
18.213.10.181 - - [23/Mar/2019:14:45:42 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
18.213.10.181 - - [23/Mar/2019:14:45:42 -0400] "GET /wp-login.php HTTP/1.1" 200 2988 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
j0h

8 Answers


You can use the cut and uniq tools:

cut -d ' ' -f1 test.txt | uniq -c
      5 5.135.134.16
      9 13.57.220.172
      1 13.57.233.99
      2 18.206.226.75
      3 18.213.10.181

Explanation:

  • cut -d ' ' -f1 : extract the first field (the IP address)
  • uniq -c : collapse adjacent repeated lines and prefix each with its number of occurrences (this works here because the log is already sorted by IP)
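If the asked-for "IP count: N" layout matters, the uniq -c output can be reshaped with one more awk stage. A sketch (the three-line log created here is stand-in data, playing the role of test.txt):

```shell
# Stand-in data: a tiny pre-sorted log whose first field is the IP.
printf '%s\n' \
  '5.135.134.16 - - "GET /wp-login.php"' \
  '5.135.134.16 - - "POST /wp-login.php"' \
  '13.57.220.172 - - "GET /wp-login.php"' > log

# uniq -c emits "count IP"; awk swaps the fields into "IP count: N".
cut -d ' ' -f1 log | uniq -c | awk '{print $2, "count: " $1}'
# 5.135.134.16 count: 2
# 13.57.220.172 count: 1
```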

If you don't specifically require the given output format, then I would recommend the already posted cut + uniq based answer.

If you really need the given output format, a single-pass way to do it in Awk would be

awk '{c[$1]++} END{for(i in c) print i, "count: " c[i]}' log

This is somewhat non-ideal when the input is already sorted since it unnecessarily stores all the IPs into memory - a better, though more complicated, way to do it in the pre-sorted case (more directly equivalent to uniq -c) would be:

awk '
  NR==1 {last=$1} 
  $1 != last {print last, "count: " c[last]; last = $1} 
  {c[$1]++} 
  END {print last, "count: " c[last]}
'

Ex.

$ awk 'NR==1 {last=$1} $1 != last {print last, "count: " c[last]; last = $1} {c[$1]++} END{print last, "count: " c[last]}' log
5.135.134.16 count: 5
13.57.220.172 count: 9
13.57.233.99 count: 1
18.206.226.75 count: 2
18.213.10.181 count: 3
steeldriver

You can use grep and uniq for the list of addresses, loop over them and grep again for the count:

for i in $(<log grep -o '^[^ ]*' | uniq); do
  printf '%s count %d\n' "$i" $(<log grep -c "$i")
done

grep -o '^[^ ]*' outputs every character from the beginning of each line (^) up to the first space, and uniq removes repeated lines, leaving a list of IP addresses. Thanks to command substitution, the for loop iterates over this list, printing the currently processed IP followed by “ count ” and the count. The latter is computed by grep -c, which counts the number of lines with at least one match.

Example run

$ for i in $(<log grep -o '^[^ ]*'|uniq);do printf '%s count %d\n' "$i" $(<log grep -c "$i");done
5.135.134.16 count 5
13.57.220.172 count 9
13.57.233.99 count 1
18.206.226.75 count 2
18.213.10.181 count 3
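One caveat worth adding (not in the original answer): grep -c "$i" treats the IP as an unanchored regex, so the dots match any character and a shorter IP can also match inside a longer one elsewhere in the file. Anchoring the pattern to the first field is safer; a sketch with stand-in data chosen to show the problem (escaping the dots would be stricter still):

```shell
# Stand-in data: 18.206.226.7 is a prefix of 18.206.226.75, which an
# unanchored grep -c would overcount.
printf '%s\n' \
  '18.206.226.7 - - "GET /"' \
  '18.206.226.75 - - "GET /"' \
  '18.206.226.75 - - "POST /"' > log

# "^$i " pins the match to the start of the line plus a trailing space,
# so each IP only counts its own lines.
for i in $(<log grep -o '^[^ ]*' | uniq); do
  printf '%s count %d\n' "$i" "$(grep -c "^$i " log)"
done
# 18.206.226.7 count 1
# 18.206.226.75 count 2
```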
dessert

Here is one possible solution:

IN_FILE="file.log"
for IP in $(awk '{print $1}' "$IN_FILE" | sort -u)
do
    echo -en "${IP}\tcount: "
    grep -c "$IP" "$IN_FILE"
done
  • Replace file.log with the actual file name.
  • The command substitution $(awk '{print $1}' "$IN_FILE" | sort -u) provides a list of the unique values in the first column.
  • grep -c then counts each of these values within the file.

$ IN_FILE="file.log"; for IP in $(awk '{print $1}' "$IN_FILE" | sort -u); do echo -en "${IP}\tcount: "; grep -c "$IP" "$IN_FILE"; done
13.57.220.172   count: 9
13.57.233.99    count: 1
18.206.226.75   count: 2
18.213.10.181   count: 3
5.135.134.16    count: 5
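As with the other grep -c loop, grep -c "$IP" counts any line containing the IP anywhere (with the dots matching any character). A variant of this answer that compares the first field exactly, using awk instead of grep, is sketched below with stand-in data in place of file.log:

```shell
# Stand-in data.
printf '%s\n' \
  '5.135.134.16 - - "GET /"' \
  '5.135.134.16 - - "POST /"' \
  '13.57.233.99 - - "GET /"' > file.log

IN_FILE="file.log"
for IP in $(awk '{print $1}' "$IN_FILE" | sort -u); do
    # Exact string comparison on field 1; n+0 prints 0 if no line matched.
    printf '%s\tcount: %s\n' "$IP" \
        "$(awk -v ip="$IP" '$1 == ip {n++} END {print n+0}' "$IN_FILE")"
done
# 13.57.233.99    count: 1
# 5.135.134.16    count: 2
```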
pa4080

Some Perl:

$ perl -lae '$k{$F[0]}++; }{ print "$_ count: $k{$_}" for keys(%k)' log 
13.57.233.99 count: 1
18.206.226.75 count: 2
13.57.220.172 count: 9
5.135.134.16 count: 5
18.213.10.181 count: 3

This is the same idea as Steeldriver's awk approach, but in Perl. The -a causes perl to automatically split each input line into the array @F, whose first element (the IP) is $F[0]. So, $k{$F[0]}++ will create the hash %k, whose keys are the IPs and whose values are the number of times each IP was seen. The }{ is funky perlspeak for "do the rest at the very end, after processing all input". So, at the end, the script will iterate over the keys of the hash and print the current key ($_) along with its value ($k{$_}).

And, just so people don't think that perl forces you to write scripts that look like cryptic scribblings, this is the same thing in a less condensed form:

perl -e '
  while (my $line=<STDIN>){
    @fields = split(/ /, $line);
    $ip = $fields[0];
    $counts{$ip}++;
  }
  foreach $ip (keys(%counts)){
    print "$ip count: $counts{$ip}\n"
  }' < log
terdon

Maybe this is not what the OP wants; however, if we know that the IP address length is limited to 15 characters, a quicker way to display the counts of unique IPs from a huge log file can be achieved using the uniq command alone:

$ uniq -w 15 -c log

5 5.135.134.16 - - [23/Mar/2019:08:42:54 -0400] ...
9 13.57.220.172 - - [23/Mar/2019:11:01:05 -0400] ...
1 13.57.233.99 - - [23/Mar/2019:04:17:45 -0400] ...
2 18.206.226.75 - - [23/Mar/2019:21:58:07 -0400] ...
3 18.213.10.181 - - [23/Mar/2019:14:45:42 -0400] ...

Options:

  • -w N : compare no more than the first N characters of each line
  • -c : prefix each line with the number of occurrences

Alternatively, for exactly the given output format I prefer awk (it should also work for IPv6 addresses); YMMV.

$ awk 'NF { print $1 }' log | sort -h | uniq -c | awk '{printf "%s count: %d\n", $2,$1 }'

5.135.134.16 count: 5
13.57.220.172 count: 9
13.57.233.99 count: 1
18.206.226.75 count: 2
18.213.10.181 count: 3

Note that uniq won't detect repeated lines in the input file if they are not adjacent, so it may be necessary to sort the file first.
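To make that concrete, a sketch with stand-in, deliberately unsorted data:

```shell
# Stand-in data: the same IP appears on non-adjacent lines.
printf '%s\n' \
  '5.135.134.16 - - "GET /"' \
  '13.57.233.99 - - "GET /"' \
  '5.135.134.16 - - "POST /"' > log

# Without the sort, uniq -c would see three separate runs and report
# 1, 1, 1; sorting first makes equal IPs adjacent so they count together.
cut -d ' ' -f1 log | sort | uniq -c
#       1 13.57.233.99
#       2 5.135.134.16
```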


FWIW, Python 3:

from collections import Counter

with open('sample.log') as file:
    counts = Counter(line.split()[0] for line in file)

for ip_address, count in counts.items():
    print('%-15s  count: %d' % (ip_address, count))

Output:

13.57.233.99     count: 1
18.213.10.181    count: 3
5.135.134.16     count: 5
18.206.226.75    count: 2
13.57.220.172    count: 9
wjandrea
cut -f1 -d- my.log | sort | uniq -c

Explanation: take the first dash-delimited field of my.log and sort it; uniq needs sorted input, and -c tells it to count occurrences.
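One detail worth noting (an addition to the answer): with -d- the first field is everything up to the first dash, so it keeps the trailing space after the IP. That is harmless for counting, just slightly untidy in the output. A sketch with stand-in data:

```shell
# Stand-in data.
printf '%s\n' \
  '5.135.134.16 - - "GET /"' \
  '5.135.134.16 - - "POST /"' > my.log

# Field 1 is "5.135.134.16 " (note the trailing space before the dash).
cut -f1 -d- my.log | sort | uniq -c
#       2 5.135.134.16
```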

PhD