I have this code in a shell script:
sort input | uniq -c | sort -nr > output
The input file has no leading whitespace, but the output does. How do I fix this? This is in bash.
By default, uniq -c right-justifies the count in a field 7 characters wide, then separates the count from the line with a single space.
Source: https://www.thelinuxrain.com/articles/tweaking-uniq-c (Wayback Machine)
Remove the leading spaces with sed:
$ sort input | uniq -c | sort -nr | sed 's/^\s*//' > output
uniq -c adds leading whitespace. E.g.
$ echo test
test
$ echo test | uniq -c
      1 test
You could add a command at the end of the pipeline to remove it. E.g.
$ echo test | uniq -c | sed 's/^\s*//'
1 test
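If you also want control over the separator, the same clean-up can be done in awk. The following is only a sketch: `count_lines` is a hypothetical helper name, and it assumes `uniq -c` pads the count with plain spaces and follows it with a single space, as described above.

```shell
#!/bin/sh
# Hypothetical helper: count duplicate lines on stdin, most frequent first,
# and print "count<TAB>line" with no padding.
count_lines() {
    sort | uniq -c | sort -nr |
        awk '{
            count = $1               # first field is the count; padding is skipped
            sub(/^ *[0-9]+ /, "")    # drop padding, count, and the separator space
            print count "\t" $0      # the rest of the line is left untouched
        }'
}

printf 'pear\napple\npear\n' | count_lines
```

Because only the prefix is removed, spaces inside the counted lines are preserved.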
FWIW, you can use a different tool for the counting and sorting if you want more flexibility. Python is one such tool.
#!/usr/bin/python3
import sys, operator, collections
counter = collections.Counter(map(operator.methodcaller('rstrip', '\n'), sys.stdin))
for item, count in counter.most_common():
    print(count, item)
In theory this would even be faster than the sort tool for large inputs, since the above program uses a hash table to identify duplicate lines instead of a sorted list. (Alas, it orders lines with identical counts by first appearance (Python 3.7+) rather than in a natural sorted order; this can be amended and still be faster than two sort invocations.)
If you want more flexibility on the output format you can look into the print() and format() built-in functions.
For instance, if you want to print the count in octal, zero-padded to 8 digits, followed by a tab instead of a space, and with a NUL terminator instead of a newline, replace the last line with:
    print(format(count, '08o'), item, sep='\t', end='\0')
Store the script in a file, say sort_count.py, and invoke it with Python:
python3 sort_count.py < input
uniq -c -i | tr -s ' ' | cut -c 2-
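Squeeze each run of spaces into a single space with tr -s, then print the output from the 2nd character onward with cut -c 2-. Note that tr -s also squeezes runs of spaces inside the lines themselves, not just the leading padding.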
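To see the difference on a line that contains consecutive spaces, here is a small comparison. It is only a sketch: `strip_leading` is a hypothetical helper name, and it assumes the padding written by `uniq -c` consists of plain spaces.

```shell
#!/bin/sh
# Hypothetical helper: remove only the spaces at the start of each line.
strip_leading() {
    sed 's/^ *//'
}

# The input line contains two consecutive spaces between "a" and "b".
printf 'a  b\n' | uniq -c | tr -s ' ' | cut -c 2-    # prints "1 a b"
printf 'a  b\n' | uniq -c | strip_leading            # prints "1 a  b"
```

The cut -c 2- step also relies on every line starting with at least one space, so it would cut off a digit if a count were ever wide enough to fill the whole field.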