4

I have a file that looks like this:

    7  C00000002 score:  -41.156 nathvy =  49 nconfs =         2251
    8  C00000002 score:  -39.520 nathvy =  49 nconfs =         3129
    9  C00000004 score:  -38.928 nathvy =  24 nconfs =          150
   10  C00000002 score:  -38.454 nathvy =  49 nconfs =         9473
   11  C00000004 score:  -37.704 nathvy =  24 nconfs =          156
   12  C00000001 score:  -37.558 nathvy =  41 nconfs =           51
    2  C00000002 score:  -48.649 nathvy =  49 nconfs =         3878
    3  C00000001 score:  -44.988 nathvy =  41 nconfs =         1988
    4  C00000002 score:  -42.674 nathvy =  49 nconfs =         6740
    5  C00000002 score:  -42.453 nathvy =  49 nconfs =         4553
    6  C00000002 score:  -41.829 nathvy =  49 nconfs =         7559

My second column are some IDs that are not sorted here, some of them are repeating, such as (C00000001) for example. All of them have a different number assigned followed by score: (number most often starts with -).

What I would like to do is:

1) read second column (non sorted IDs) and to always pick the first one that appears. So in case of C00000001 it would pick the on with score : -37.558.

2) now when I have unique values presented, I would like to sort them based on the number after score:, meaning the most negative number to be on the first position while the most positive one to be on the last position.

I would like to have output printed out the same way as my input file (same structure).

Ravexina
  • 57,256
djordje
  • 311

3 Answers3

8
$ sort -k2,2 -u < filename | sort -k4,4n

7  C00000002 score:  -41.156 nathvy =  49 nconfs =         2251
9  C00000004 score:  -38.928 nathvy =  24 nconfs =          150
12 C00000001 score:  -37.558 nathvy =  41 nconfs =           51

Explanation:

  1. sort -k2,2 -u: sorts the lines based on second column and does not change the order of them (cause they're basically the same value) and keep the first one.
  2. sort -k4,4n: sort numerically based on the scores (there is no need for -r to reverse it).
Ravexina
  • 57,256
1

With GNU awk > 4.0:

$ gawk '
    !seen[$2] {seen[$2] = $0} 
    END {PROCINFO["sorted_in"] = "@val_num_asc"; for (i in seen) print seen[i]}
  ' file
    7  C00000002 score:  -41.156 nathvy =  49 nconfs =         2251
    9  C00000004 score:  -38.928 nathvy =  24 nconfs =          150
   12  C00000001 score:  -37.558 nathvy =  41 nconfs =           51
steeldriver
  • 142,475
0

Contributing with an additional single-line command that can easily be configured

for row in $(cat tmp |  awk '{print $2}' | sort | uniq); do cat tmp | grep $row | head -n 1; done | sort -r --key=4

7  C00000002 score:  -41.156 nathvy =  49 nconfs =         2251
9  C00000004 score:  -38.928 nathvy =  24 nconfs =          150
12  C00000001 score:  -37.558 nathvy =  41 nconfs =           51
toerq
  • 11
  • 1