
I have about 167k files in a single folder (for now), renamed by the script from Renaming bunch of files, but only part of the title.
How can I find files whose names duplicate each other (only the digits in that specific spot) and delete the oldest file of each pair? Example names:

    Aaaaaaa.bbb - 0000125 tag tag_tag 9tag
    Aaaaaaa.bbb - 0000002 tag 9tag
    Aaaaaaa.bbb - 0000002 tag tag_tag 9tag

None of the tools I tried provide such functionality, so only a script can help.

Ceslovas

1 Answer


Below is a find, sort and awk one-liner.

The basic idea is to list the files, sort them numerically (which works unless Aaaaaaa.bbb or the tags are themselves numbers), and then let awk store the third field of each filename in a prev variable and compare it against the third field of the current line. If they match, print a message.

find . -type f -print | sort --numeric | awk '{if(prev == $3) print $0" is duplicate of "prevEntry}{prev=$3; prevEntry=$0}'

Below is a small demo:

    $ seq 6 10 | xargs printf "%07d\n" | xargs -I {} touch "Aaaaaaa.bbb - {} tag 9tag"

    $ seq 1 20 | xargs printf "%07d\n" | xargs -I {} touch "Aaaaaaa.bbb - {} tag tag_tag 9tag"

    $ find . -type f -print | sort --numeric | awk '{if(prev == $3) print $0" is duplicate of "prevEntry}{prev=$3; prevEntry=$0}'
    ./Aaaaaaa.bbb - 0000006 tag tag_tag 9tag is duplicate of ./Aaaaaaa.bbb - 0000006 tag 9tag
    ./Aaaaaaa.bbb - 0000007 tag tag_tag 9tag is duplicate of ./Aaaaaaa.bbb - 0000007 tag 9tag
    ./Aaaaaaa.bbb - 0000008 tag tag_tag 9tag is duplicate of ./Aaaaaaa.bbb - 0000008 tag 9tag
    ./Aaaaaaa.bbb - 0000009 tag tag_tag 9tag is duplicate of ./Aaaaaaa.bbb - 0000009 tag 9tag
    ./Aaaaaaa.bbb - 0000010 tag tag_tag 9tag is duplicate of ./Aaaaaaa.bbb - 0000010 tag 9tag
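The one-liner only reports duplicates; the question also asks to delete the oldest copy of each pair. One possible sketch, assuming GNU find, sort, and xargs, and that filenames contain no newlines (`-maxdepth`, `-printf`, and `xargs -d` are GNU extensions):

```shell
# List each file prefixed with its mtime and sort newest first; after the
# timestamp column, the digit field of the filename is awk's field $4
# ("mtime ./Aaaaaaa.bbb - NNNNNNN tags...").  The first (newest) file for
# each number is kept; every later (older) line is passed on to rm.
find . -maxdepth 1 -type f -name '* - *' -printf '%T@ %p\n' |
  sort -k1,1nr |
  awk '{ if ($4 in seen) print; else seen[$4] = 1 }' |
  cut -d' ' -f2- |
  xargs -d '\n' -r rm --
```

Run it on a copy of the folder first: if the tags ever contain the digit string by accident, or the naming pattern varies, the field positions above no longer hold.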
Byte Commander