1

Yesterday I asked this question, and got awesome answers, it's really a joy to ask questions on this site.

Today I got a slightly different question

say I have csv1

1,2,3,4
5,6,7,8 --
9,10,11,12
13,14,15 --

and csv2 has

1,2,3,4,5 --
20,21,22,23,24
24,25,26,27,28
9,10,11,12,30 --
45,46,47,48,60

How can I print only those rows whose 1st 4 fields are only present in one of the two files? In other words, discard all lines from each file whose 1st four fields are also present in a line in the other file.

1,2,3,4
9,10,11,12
20,21,22,23,24
24,25,26,27,28
45,46,47,48,60

Note that -- doesn't exist in the actual files, i added them to help you notice the difference.

So far, I'm loading everything in numpy arrays and comparing each element,

if a[i] == b[i] and ...

But I want to know if there's a better way to do it using Linux tools.

Edit

Every line in csv2 has a corresponding line in csv1 and there are no duplicate lines in the same file. basically i'm trying to remove csv2 from csv1 and output the rest of csv1.

Lynob
  • 6,775

1 Answers1

3

Here's one way:

$ awk -F, 'NR==FNR{a[$1$2$3$4]++; next}!a[$1$2$3$4]' csv2 csv1
110,12,31,345
1,12,14,55 
12,53,22,10
1,12,32,44 

Explanation

  • -F, : set the field separator to ,. Now, the first comma-separated field of each line will be $1, the second $2 and so on.
  • NR==FNR : these are two awk special variables. NR is the current input line and FNR is the line number of the current file. The two will be equal only while the 1st file is being read.
  • NR==FNR{a[$1$2$3$4]++; next} : while reading the 1st file, save the 1st 4 fields as a key in the array a and set their value to 1. This basically saves all 1st 4 fields of csv1. The next ensures that we immediately skip to the next line and don't process the rest of the script.
  • !a[$1$2$3$4] : the default action of awk is to print the current line. So, if you use something that evaluates to true, awk understands that it should print this line. !a[ $1$2$3$4] is true when a[$1$2$3$4] is not define which will happen for lines in csv1 whose 1st 4 fields were not present in any lines of csv2. Therefore, this directive will cause all lines whose 1st 4 fields have never been seen (so their value in the a array is not 1) to be printed.
terdon
  • 104,119