
I have 10 text files in which each line contains a 65-character hex value. Each text file is 6.5 GB in size (99999999 lines).

i.e. file1 - 6.5 GB, file2 - 6.5 GB, file3 - 6.5 GB, file4 - 6.5 GB, ... file10 - 6.5 GB

I need to find duplicate lines across all of these 10 text files combined, and I need to be able to tell which file(s) each duplicate line came from and how many files contain duplicate lines.

How can I do that?

I am currently using this command:

sort *.txt | uniq -d > dup

But it hangs and sometimes crashes. If I want to check 65 GB of data, do I need double that amount of computer memory, i.e. do I need to install more RAM?

Is there any other way to do this?


1 Answer


Assuming GNU sort

sort does not need an amount of RAM larger than, or even equal to, the size of the processed file(s). It uses the available memory plus temporary files on disk to sort big files in batches. It is very efficient and needs no user intervention when reading directly from files. However, when reading from a pipe or STDIN, setting the buffer size with the --buffer-size=SIZE option might be needed for efficiency.
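
For illustration only (the 2G buffer size is just an example value, and this is not needed when sort reads the files directly as in the commands below), using that option when piping into sort would look like:

cat *.txt | sort --buffer-size=2G | uniq -d > dupfile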

So what you most likely need is enough disk space that can be freely utilized under /tmp ... if the available disk space is not enough, you can try the --compress-program=PROG option (PROG is the compression program to use, e.g. gzip; you need to specify it and it needs to be installed on your system) to compress and decompress the temporary files during the sorting process, like so:

sort --compress-program=gzip *.txt | uniq -d > dupfile

The crashes are most likely due to sort running more processes/threads in parallel than your system can handle at once. You can limit that to reduce the system load using the --parallel=N option (N is the number of sorts run concurrently; by default it is the number of available processors, capped at 8. The lower the number, the slower the processing, but the system load will be lower as well and the crashes should stop), like so:

sort --parallel=2 *.txt | uniq -d > dupfile

These two options can also be used together like so:

sort --compress-program=gzip --parallel=2 *.txt | uniq -d > dupfile

Alternatively, you can do it in two steps: first, pre-sort the files one by one, and then use the --merge option on the already sorted files to merge them without a full sort (a sketch of the pre-sort step follows the command below), like so:

sort --merge *.txt | uniq -d > dupfile
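
For the first step, one way to pre-sort each file in place is a loop like the following (a sketch; -o lets sort write the sorted output back over the same input file, and the --compress-program and --parallel options from above can be added here as well):

for f in *.txt; do sort -o "$f" "$f"; done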

And of course you can use all three options on pre-sorted files to reduce the load on your system like so:

sort --compress-program=gzip --parallel=2 --merge  *.txt | uniq -d > dupfile

To know which duplicate lines came from which file(s), you can use grep with the -F option, which treats the patterns as fixed strings rather than regular expressions (and should give you better performance), together with the -x option, which matches only whole lines, like so:

grep -Fx -f dupfile *.txt > resultfile
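
If you also want to count in how many files each duplicate line appears, one possible follow-up (a sketch that assumes the file:line output format produced by the grep command above, and that the hex lines themselves contain no ":" characters) is:

awk -F: '!seen[$0]++ { count[$2]++ } END { for (line in count) print count[line], line }' resultfile

This prints each duplicate line prefixed with the number of files it was found in, counting each file only once per line.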