This question is not a duplicate of "what is the difference between sort -u and sort | uniq".

In essence, this is a word-count program.

My confusion comes from the following commands:

    root@sanctum:~/datascience# cat data 
    this is a file that is supposed to be a file

This gives incorrect output:

    root@sanctum:~/datascience# cat data | sed 's/ /\n/g' | uniq -c
          1 this
          1 is
          1 a
          1 file
          1 that
          1 is
          1 supposed
          1 to
          1 be
          1 a
          1 file

Piping the output to sort and then to uniq gives the correct answer:

    root@sanctum:~/datascience# cat data | sed 's/ /\n/g' | sort | uniq -c
          2 a
          1 be
          2 file
          2 is
          1 supposed
          1 that
          1 this
          1 to

Output when piped only to sort:

    root@sanctum:~/datascience# cat data | sed 's/ /\n/g' | sort
    a
    a
    be
    file
    file
    is
    is
    supposed
    that
    this
    to

How does the line number at which a line appears affect the count of its occurrences in the file? I am not sure how to phrase it, but you get the point.

Basically, why can't cat data | sed 's/ /\n/g' | uniq -c give the required result?

1 Answer

This is not random behavior. From man uniq:

> Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'.

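Essentially, uniq only collapses (and, with -c, counts) runs of identical adjacent lines, so it gives per-line totals only when the input is sorted or otherwise grouped. In other words, this is by design.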
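As a minimal illustration of the adjacency rule, the sketch below feeds the same three lines to uniq -c with and without sorting. The function names unsorted_counts and sorted_counts are hypothetical and exist only for this example.

```shell
# Hypothetical helper: count runs of identical adjacent lines.
# The two "a" lines are separated by "b", so they form two separate runs.
unsorted_counts() {
  printf 'a\nb\na\n' | uniq -c
}

# Hypothetical helper: sort first so that duplicates become adjacent,
# then count the runs.
sorted_counts() {
  printf 'a\nb\na\n' | sort | uniq -c
}

unsorted_counts
sorted_counts
```

The first call reports a count of 1 for each of the three lines, while the second reports 2 for a and 1 for b. The exact column padding of the counts differs between uniq implementations.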

Your main question, however, is:

> how does the line number of appearance of a line have an effect on the count of the occurrences in the file

To answer this question, you'd really have to look at uniq's source code:

    while (!feof (stdin))
      {
        char *thisfield;
        size_t thislen;
        /* Read the next line; stop at end of input. */
        if (readlinebuffer_delim (thisline, stdin, delimiter) == 0)
          break;
        thisfield = find_field (thisline);
        thislen = thisline->length - 1 - (thisfield - thisline->buffer);
        /* Print the line only if it is the first one or differs from
           the previous line. */
        if (prevline->length == 0
            || different (thisfield, prevfield, thislen, prevlen))
          {
            fwrite (thisline->buffer, sizeof (char),
                    thisline->length, stdout);
            /* The line just printed becomes the new "previous" line. */
            SWAP_LINES (prevline, thisline);
            prevfield = thisfield;
            prevlen = thislen;
          }
      }

The key here is that the input is read line by line, and each line is compared only with the previous line. The comparison happens in the function different(), which returns true if the two lines differ and false if they are the same. The reason is memory: to compare each line against all earlier lines, uniq would have to remember every distinct line it has seen, which can take a large amount of memory when the input has many lines. Keeping only the current and the previous line lets uniq run in constant memory and stay fast, whatever the size of the input.
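To make that trade-off concrete, here is a minimal sketch in shell and awk. It is not uniq's actual implementation, and the helper names adjacent_count and global_count are hypothetical. The first helper remembers only the previous line, in the spirit of the loop above; the second keeps one counter per distinct line, so it needs no sorting but its memory use grows with the number of distinct lines.

```shell
# Hypothetical helper: mimic "uniq -c" while remembering only the previous line.
adjacent_count() {
  awk '
    # A line that differs from the previous one ends the current run.
    NR > 1 && ($0 "") != prev { print n, prev; n = 0 }
    # Remember this line as a string and extend the current run.
    { prev = $0 ""; n++ }
    # Flush the last run, unless the input was empty.
    END { if (NR > 0) print n, prev }
  '
}

# Hypothetical helper: count every distinct line without sorting first.
# One array entry is kept in memory per distinct line.
global_count() {
  awk '{ count[$0]++ } END { for (line in count) print count[line], line }'
}
```

Assuming the same data file as in the question, something like tr -s ' ' '\n' < data | global_count should give the word totals without a sort step. The order in which awk walks the array is unspecified, so pipe the result through sort if a stable order matters.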

Pablo Bianchi