After asking a question on ubuntuforums.org and not getting a satisfying answer, I've decided to ask the question again here on Ask Ubuntu. I need the answer to be very detailed. Specifically, I need to know which lines get compared every time a line is printed using uniq for the following two examples:

file1.txt:

$ cat -A file1.txt
aaa^Iupc$
b$
c$
aaa^Iztp$
b$
c$
C$
A$
B$
B$
b$

$ sort file1.txt | uniq -f 1
A
aaa    upc
aaa    ztp
b

and file2.txt:

$ cat -A file2.txt
aaa^Iupc$
b$
c$
aaa^Iztp$
b$
c$
C$
A$
B$
B$
bbb^Ixpz$

$ sort file2.txt | uniq -f 1
A
aaa    upc
aaa    ztp
b
bbb    xpz
c

I'm confused about the second example. I don't understand how come uppercase B doesn't end up in the final output. Shouldn't the line with uppercase B be printed given that lines B and bbb xpz are both adjacent to each other? If:

B ---> (empty)

and

bbb ---> xpz

an empty value and xpz are both unique so both lines should be printed. Or am I missing something?

muru

2 Answers

The answer lies in the sort order and in what uniq uses as the comparison value when a line has fewer fields than the number N given to -f N.

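Your data is plain ASCII, so the sort order is predictable. Note that the output below is in a locale-aware collation, where b sorts just before B, rather than in plain byte order: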

% sort file2.txt
A
aaa upc
aaa ztp
b
b
B
B
bbb xpz
c
c
C

Now let's use uniq -f 1, which skips the first (whitespace-separated) field of each line when comparing adjacent lines:

% sort file2.txt | uniq -f 1
A
aaa upc
aaa ztp
b
bbb xpz
c

The important thing to note is that uniq compares an empty string for any line that has no more fields than the number being skipped (1 in this case). Every line with only one field therefore compares as empty: it is equal to any adjacent single-field line and different from any adjacent line that has two or more fields.
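
To see the value that gets compared, one can strip the first field by hand. This is only a minimal sketch, not how uniq is implemented: show_keys is a hypothetical helper, and the sed expression approximates the field skipping as "optional blanks, then a run of non-blanks". The brackets mark where the remaining text starts and ends.

```shell
# Hypothetical helper: print each input line next to the part that
# `uniq -f 1` would compare (whatever is left after the first field).
show_keys() {
  while IFS= read -r line; do
    # Drop leading blanks plus the first run of non-blank characters.
    key=$(printf '%s\n' "$line" | sed -E 's/^[[:blank:]]*[^[:blank:]]*//')
    printf '%s -> [%s]\n' "$line" "$key"
  done
}

# A few lines from the question (fields are separated by a tab).
printf 'aaa\tupc\nb\nB\nbbb\txpz\n' | show_keys
```

The single-field lines b and B should both show empty brackets, which is why uniq treats them as duplicates of each other, while the two-field lines keep their tab and second field.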

So, from the sort file2.txt output:

b
b
B
B

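are all treated as the same, and only the first line of the run (b) is kept; hence the b in the output. The last B is indeed adjacent to bbb xpz, but by then it has already been dropped as a duplicate of the b above it. bbb xpz compares as xpz, which differs from the empty value, so it is printed.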

Similarly, from:

c
c
C

only the first c ends up in uniq's output.
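
The size of each run can be checked with uniq -c, which prefixes every printed line with the number of adjacent lines it stands for. In this sketch the sorted lines are written out literally so that the result does not depend on the locale sort runs in. sorted_file1 and sorted_file2 are hypothetical helper names, and the file1 order is an assumption based on the same collation as the file2 output above.

```shell
# Sorted content of file2.txt, as shown in the sort output above.
sorted_file2() {
  printf 'A\naaa\tupc\naaa\tztp\nb\nb\nB\nB\nbbb\txpz\nc\nc\nC\n'
}

# Sorted content of file1.txt, assuming the same collation order.
sorted_file1() {
  printf 'A\naaa\tupc\naaa\tztp\nb\nb\nb\nB\nB\nc\nc\nC\n'
}

# -c adds the run length; sed only trims the padding before the count.
echo 'file2:'
sorted_file2 | uniq -c -f 1 | sed -E 's/^ +//'
echo 'file1:'
sorted_file1 | uniq -c -f 1 | sed -E 's/^ +//'
```

For file2 the b line should carry a count of 4 (b, b, B, B) and the c line a count of 3. For file1 nothing with a second field separates the single-field lines after aaa ztp, so they should all collapse into one b line with a count of 8.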

heemayl

Here's a table that may help you to work through the process:

----------------+---------------+----------+----------------+
    sort        |     Remove    | Adjacent |                |
 (C locale)     |    field #1   |  match?  |    Output      |
----------------+---------------+----------+----------------+
A               |               |    N*    |A               |
B               |               |    Y     |                |
B               |               |    Y     |                |
C               |               |    Y     |                |
aaa     upc     |   upc         |    N     |aaa     upc     |
aaa     ztp     |   ztp         |    N     |aaa     ztp     |
b               |               |    N     |b               |
b               |               |    Y     |                |
bbb     xpz     |   xpz         |    N     |bbb     xpz     |
c               |               |    N     |c               |
c               |               |    Y     |                |
----------------+---------------+----------+----------------+
* the first line has no adjacent above, so is always output
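
The table can be reproduced with a small trace. This is a rough sketch rather than what uniq does internally: file2_lines and trace_uniq_f1 are hypothetical helpers, the input is the content of file2.txt from the question, and awk's sub() stands in for the "remove field #1" column.

```shell
# Content of file2.txt from the question (fields separated by a tab).
file2_lines() {
  printf 'aaa\tupc\nb\nc\naaa\tztp\nb\nc\nC\nA\nB\nB\nbbb\txpz\n'
}

# Sort in the C locale, then label each line "print" or "skip" depending
# on whether its remainder (after field 1) equals the previous line's.
trace_uniq_f1() {
  LC_ALL=C sort | awk '
    {
      key = $0
      sub(/^[ \t]*[^ \t]*/, "", key)   # drop the first field, like -f 1
      if (NR > 1 && (key "") == (prev ""))
        verdict = "skip"               # same as the line above: not printed
      else
        verdict = "print"              # first line, or remainder changed
      printf "%-5s %s\n", verdict, $0
      prev = key
    }'
}

file2_lines | trace_uniq_f1
```

The lines labelled print should be the same six lines that LC_ALL=C sort file2.txt | uniq -f 1 outputs, and the skip lines correspond to the Y rows in the table.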
steeldriver