Difference of two big files

Question

I have "test1.csv" and it contains

200,400,600,800
100,300,500,700
50,25,125,310

and test2.csv and it contains

100,4,2,1,7
200,400,600,800
21,22,23,24,25
50,25,125,310
50,25,700,5

now

diff test2.csv test1.csv > result.csv

is different than

diff test1.csv test2.csv > result.csv

I don't know which is the correct order but I want something else, both of the commands above will output something like

2 > 100,4,2,1,7
   3 2,3c3,5
   4 < 100,300,500,700
   5 < 50,25,125,310
   6 \ No newline at end of file
   7 ---
   8 > 21,22,23,24,25
   9 > 50,25,125,310

I want to output only the difference, thus results.csv should look like this

100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5

I tried diff -q and diff -s but they didn't do the trick. Order doesn't matter, what matters is that I want to see only the difference, no > nor < nor blank space.

grep -FvF did the trick on smaller files not on big ones

first file contains more than 5 million lines, second file contains 1300.

so results.csv should result in ~4,998,700 lines

I also tried grep -F -x -v -f which didn't work.

score 21 · Accepted Answer · answered Jun 23 '15 at 22:23

Sounds like a job for comm:

$ comm -3 <(sort test1.csv) <(sort test2.csv)
100,300,500,700
    100,4,2,1,7
    21,22,23,24,25
    50,25,700,5

As explained in man comm:

   -1     suppress column 1 (lines unique to FILE1)

   -2     suppress column 2 (lines unique to FILE2)

   -3     suppress column 3 (lines that appear in both files)

So, the -3 means that only lines that are unique to one of the files will be printed. However, those are indented according to which file they were found in. To remove the tab, use:

$ comm -3 <(sort test1.csv) <(sort test2.csv) | tr -d '\t'
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5

In this case, you don't really even need to sort the files and you can simplify the above to:

comm -3 test1.csv test2.csv | tr -d '\t' > difference.csv

score 6 · Answer 2 · edited Apr 13 '17 at 12:24

Using grep with bash process substitution:

$ cat <(grep -vFf test2.csv test1.csv) <(grep -vFf test1.csv test2.csv)
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5

To save the output as results.csv:

cat <(grep -vFf test2.csv test1.csv) <(grep -vFf test1.csv test2.csv) >results.csv

<() is the bash process substitution pattern
grep -vFf test2.csv test1.csv will find the lines unique to only test1.csv
grep -vFf test1.csv test2.csv will find the lines unique to only test2.csv
Finally we are summing up the results by cat

Or as Oli suggested, you can use command grouping also:

$ { grep -vFf test2.csv test1.csv; grep -vFf test1.csv test2.csv; }
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5

Or just run one after another, as they are both writing to STDOUT they will ultimately get added:

$ grep -vFf test2.csv test1.csv; grep -vFf test1.csv test2.csv
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5

muru · Answer 3 · 2015-06-23T22:14:53.837

4

If the order of rows is not relevant, use awk or perl:

awk '{seen[$0]++} END {for (i in seen) {if (seen[i] == 1) {print i}}}' 1.csv 2.csv

Use grep to get the common lines and filter those out:

grep -hxvFf <(grep -Fxf 1.csv 2.csv) 1.csv 2.csv

The internal grep gets the common lines, then the external grep finds lines which don't match these common lines.

edited Jun 23 '15 at 22:14

answered Jun 23 '15 at 21:53

muru

207,228

Volker Siegel · Answer 4 · 2015-06-24T07:17:16.213

Use the `--*-line-format=...` options of `diff`

You can tell diff exactly what you need - explained below:

diff --old-line-format='%L' --new-line-format='%L' --unchanged-line-format='' f1.txt f2.txt

It is possible to specify the output of diff in a very detailed way, similar to a printf number format.

The lines from the first file, test1.csv are called "old" lines, and the lines from the second, test2.csv, are "new" lines. That makes sense when diff is used to see what changed in a file.

The options we need are the ones to set the format for "old" lines, "new" lines, and "unchanged" lines.
The formats we need are very simple:
For the changed lines, new and old, we want to output only the text of the lines. %L is the format symbol for the line text.
For the unchanged lines, we want to show nothing.

With this, we can write options like --old-line-format='%L', and put it all together, using your example data:

$ diff --old-line-format='%L' --new-line-format='%L' --unchanged-line-format='' test1.csv test2.csv
100,4,2,1,7
100,300,500,700
21,22,23,24,25
50,25,700,5

Notes on performance

Because the files have different size, try to exchange the input files if it does not matter, it could be that the inner workings of diff can handle one way better than the other. Better is either needing less memory, or less computation.

There is an optimisation option for using diff with large files: --speed-large-files. It uses assumptions about the file structure, so it's not clear whether it helps in your case, but worth trying it.

The format options are described in the man diff under --LTYPE-line-format=LFMT.

kos · Answer 5 · 2015-06-23T22:17:26.053

3

Since the order doesn't need to be preserved, simply:

sort test1.csv test2.csv | uniq -u

sort test1.csv test2.csv: merges and sorts test1.csv and test2.csv
uniq -u: prints only the lines which have no duplicate

edited Jun 23 '15 at 22:17

answered Jun 23 '15 at 22:14

kos

41,268

Difference of two big files

5 Answers5

Use the `--*-line-format=...` options of `diff`

Notes on performance

Linked

Difference of two big files

5 Answers5

Use the --*-line-format=... options of diff

Notes on performance

Linked

Use the `--*-line-format=...` options of `diff`