How do I combine two files while excluding lines that exist in both files?

Question

I have two files that cannot be sorted. Each contains a list of words, one per line. I am trying to compare the two files and create a new one that omits every line matched in both files. That is, if a line in file A is also found in file B, it should not appear in the output.

There is a big problem with many questions and sites whose titles say "Deleting Duplicates" when they actually mean "Merging Duplicates & Showing A Unique One". These two things are very different: the second does not delete duplicate lines, it only merges them into one.

For this particular case I do need to DELETE THEM for real. So if lines are found in both files, they should not appear in the result at all.

I have already tested comm and it fails. I have also tested several other approaches I have seen, such as awk and grep. The rules for both files are as follows:

They have different sizes (they do not have the same number of lines)
A line counts as a duplicate only if the whole line matches a whole line anywhere in the other file
The files cannot be sorted

Here is some information about the files: they contain lists of emails, one email per line. Because the files are not the same size, they will not contain all the same emails. Within each file every email is unique; it is just that some emails might be in both files. Where an email is in both files, the output should not show it.

score 1 · Answer 1 · answered Feb 16 '22 at 07:34

There are more efficient ways, but here is a solution. I was unsure how you would want the files merged, so in this solution the lines unique to file1 are written to the new file first, followed by the lines unique to file2.

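# remove_dupes.py
import sys

# Use 'out' as the output file if no third argument is given.
out_path = sys.argv[3] if len(sys.argv) > 3 else "out"

# splitlines() drops line endings, so a last line without a trailing
# newline still compares equal to the same line in the other file.
with open(sys.argv[1]) as infile1:
    if1_arr = infile1.read().splitlines()
with open(sys.argv[2]) as infile2:
    if2_arr = infile2.read().splitlines()

exclude = []
with open(out_path, "w") as outfile:
    # Write the lines of file1 that do not appear in file2.
    for line in if1_arr:
        if line in if2_arr:
            exclude.append(line)
        else:
            outfile.write(line + "\n")
    # Write the lines of file2 that were not matched in file1.
    for line in if2_arr:
        if line not in exclude:
            outfile.write(line + "\n")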

To run:

python3 remove_dupes.py <file1> <file2> <output_file>

If you'd like to turn this into a quicker command-line tool, move the script to a long-term place and add the following line to your .bashrc, .bash_aliases, .zshrc, or equivalent file.

alias mydiff='python3 <path_to_script> '

You can replace 'mydiff' with whatever you'd like to call it. After that you can run the script with:

mydiff <file1> <file2> <output_file>
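
As mentioned at the top of this answer, there are more efficient ways. One possible sketch, assuming both files fit in memory, does the membership checks against a set instead of scanning a list for every line. The function name exclusive_lines and the sample addresses are illustrative only and are not part of the script above.

```python
def exclusive_lines(lines_a, lines_b):
    """Return lines found in exactly one of the two lists, keeping original order."""
    # The set intersection holds every line present in both inputs.
    common = set(lines_a) & set(lines_b)
    # Keep file A's surviving lines first, then file B's, as the script above does.
    kept_a = [line for line in lines_a if line not in common]
    kept_b = [line for line in lines_b if line not in common]
    return kept_a + kept_b


if __name__ == "__main__":
    # Hypothetical sample data standing in for the two email lists.
    file_a = ["ann@example.com", "bob@example.com", "cy@example.com"]
    file_b = ["cy@example.com", "dee@example.com"]
    for line in exclusive_lines(file_a, file_b):
        print(line)
```

Neither input is sorted or reordered here; the set is used only to decide which lines to drop.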

score 0 · Answer 2 · edited Apr 08 '22 at 02:32

Simple solution

diff --suppress-common-lines fileB fileC

I tested this with lists of filenames from a directory:
```
$ ls *c* > fileC
$ ls *b* > fileB
```

I used the sdiff tool to show the differences between the two files side by side. Here are the first few lines:

ACM Queue - Databases Only.recipe                         <
ACM Queue Magazine Database Only.recipe                   <
acm_queue.txt                                             <
blighted.csv                                                blighted.csv
Brave Passwords_4.csv                                       Brave Passwords_4.csv
conda_history.txt                                         <
conda_install_altair.log2~                                <
copied.url                                                <
copy_in.sh                                                <

I found that there were two lines in common: blighted.csv and Brave Passwords_4.csv

diff --suppress-common-lines fileB fileC

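shows the files minus the common lines. Note that diff compares the files in sequence, so this relies on the shared lines appearing in the same relative order in both files, as they do in the sorted ls output used here.

To test: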

  $ grep "blighted.csv" fileB
  blighted.csv
  $ grep "blighted.csv" fileC
  blighted.csv
  $ diff --suppress-common-lines fileC fileB | grep "blighted.csv"
  $ (no output)

One last tip: remove the editing marks that diff adds to the output

diff --suppress-common-lines fileB fileC | grep "^<\|^>" | sed "s/^. //g"
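
For anyone post-processing the diff output in a script instead of a shell pipeline, the same filtering step can be sketched in Python. This is only an illustration of what the grep and sed stages do; the function name strip_diff_markers and the sample diff text are made up for the example.

```python
import re


def strip_diff_markers(diff_output):
    """Keep only '<' and '>' lines and remove the marker and the space after it."""
    kept = []
    for line in diff_output.splitlines():
        # Equivalent of: grep "^<\|^>"
        if line.startswith(("<", ">")):
            # Equivalent of: sed "s/^. //"
            kept.append(re.sub(r"^. ", "", line))
    return kept


if __name__ == "__main__":
    # Hypothetical normal-format diff output for two small files.
    sample = "1,2d0\n< acm_queue.txt\n< copied.url\n3a2\n> notes.txt\n"
    for name in strip_diff_markers(sample):
        print(name)
```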
