How to find (and delete) duplicate files

Question

I have a largish music collection and there are some duplicates in there. Is there any way to find duplicate files? At a minimum, by hashing the files and checking whether two of them have the same hash.

Bonus points for also finding files with the same name apart from the extension. I think I have some songs in both mp3 and ogg versions.

I'm happy using the command line if that is the easiest way.

score 189 · Accepted Answer · edited Jul 03 '23 at 21:38

fdupes

I use fdupes for this. It is a command-line program that can be installed from the repositories with sudo apt install fdupes. Call it as fdupes -r /dir/ect/ory and it prints a list of duplicates. fdupes also has a README on GitHub and a Wikipedia article, which lists some more programs.
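If fdupes is not installed, the hash comparison the question asks about can be approximated with standard tools. This is a minimal sketch, not how fdupes works internally: list_dupes is a hypothetical helper name, and it assumes GNU find, sha256sum, sort and uniq (the -w and --all-repeated options are GNU extensions). File names containing newlines or backslashes are not handled.

```shell
# list_dupes DIR: print groups of files under DIR (default: .) whose
# SHA-256 digests match. Hypothetical helper, assumes GNU tools.
list_dupes() {
    find "${1:-.}" -type f -exec sha256sum {} + |
        LC_ALL=C sort |
        # Compare only the 64-character digest at the start of each line and
        # print every line of each repeated group, groups separated by a blank line.
        uniq -w64 --all-repeated=separate
}
```

Calling list_dupes ~/Music prints one block per set of identical files; nothing is deleted.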

answered Sep 08 '10 at 19:20 by qbi · edited by Pablo Bianchi

score 71 · Answer 2 · edited May 24 '20 at 19:22

List of programs, scripts, and bash solutions that can find duplicates and run under *nix:

dupedit: Compares many files at once without checksumming. Avoids comparing files against themselves when multiple paths point to the same file.
dupmerge: runs on various platforms (Win32/64 with Cygwin, *nix, Linux etc.)
dupseek: Perl with algorithm optimized to reduce reads.
fdf: Perl/C-based and runs across most platforms (Win32, *nix and probably others). Uses MD5, SHA1 and other checksum algorithms.
freedups: shell script that searches through the directories you specify. When it finds two identical files, it hard links them together. The two or more files still exist in their respective directories, but only one copy of the data is stored on disk; all directory entries point to the same data blocks.
fslint: has a command-line interface and a GUI.
liten: Pure Python deduplication command line tool, and library, using md5 checksums and a novel byte comparison algorithm. (Linux, Mac OS X, *nix, Windows)
liten2: A rewrite of the original Liten, still a command line tool but with a faster interactive mode using SHA-1 checksums (Linux, Mac OS X, *nix)
rdfind: One of the few that rank duplicates by the order of the input parameters (directories to scan), so that it does not delete files in "original/well-known" sources when multiple directories are given. Uses MD5 or SHA1.
rmlint: Fast finder with a command-line interface and many options for finding other lint too (uses MD5). Since 18.04 LTS there is an rmlint-gui package with a GUI (launch it with rmlint --gui or from the desktop launcher named Shredder Duplicate Finder).
ua: Unix/Linux command line tool, designed to work with find (and the like).
findrepe: free Java-based command-line tool designed for efficient searching of duplicate files; it can search within zips and jars. (GNU/Linux, Mac OS X, *nix, Windows)
fdupe: a small script written in Perl that does its job quickly and efficiently.
ssdeep: identify almost identical files using Context Triggered Piecewise Hashing
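
The hard-linking approach described in the freedups entry can be sketched with standard tools. This is not freedups' code: link_duplicate is a hypothetical helper, and it assumes both files live on the same filesystem, since hard links cannot cross filesystems.

```shell
# link_duplicate KEEP DUPLICATE: replace DUPLICATE with a hard link to KEEP,
# but only if the two files are byte-for-byte identical. Hypothetical helper.
link_duplicate() {
    # $1 = file to keep, $2 = duplicate to replace with a hard link
    cmp -s -- "$1" "$2" || return 1   # refuse unless the contents are identical
    ln -f -- "$1" "$2"                # both names now refer to the same data
}
```

After a successful call, editing the file through either name changes both, so this suits files that are not meant to diverge, such as music.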

score 65 · Answer 3 · edited Oct 16 '21 at 23:57

FSlint has a GUI and some other features. The explanation of the duplicate checking algorithm from their FAQ:

1. exclude files with unique lengths
2. handle files that are hardlinked to each other
3. exclude files with unique md5(first_4k(file))
4. exclude files with unique md5(whole file)
5. exclude files with unique sha1(whole file) (in case of md5 collisions).
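
The staged filtering in that list can be sketched in shell. This is an illustration, not FSlint's code: it assumes GNU find, awk and md5sum, skips the hard-link step (2) and the sha1 step (5), and does not handle paths containing tabs or newlines. staged_dupes and keep_repeated are hypothetical names.

```shell
# keep_repeated: read "KEY<TAB>PATH" lines and print the PATH of every line
# whose KEY occurs more than once.
keep_repeated() {
    awk -F'\t' '{ key[NR] = $1; path[NR] = $2; count[$1]++ }
                END { for (i = 1; i <= NR; i++) if (count[key[i]] > 1) print path[i] }'
}

# staged_dupes DIR: cheap checks first, the full-file hash only for survivors.
staged_dupes() {
    find "${1:-.}" -type f -printf '%s\t%p\n' |
        keep_repeated |    # step 1: files with a unique length drop out
        while IFS= read -r f; do
            printf '%s\t%s\n' "$(head -c 4096 "$f" | md5sum | cut -d' ' -f1)" "$f"
        done |
        keep_repeated |    # step 3: files with a unique md5(first 4k) drop out
        while IFS= read -r f; do
            printf '%s\t%s\n' "$(md5sum < "$f" | cut -d' ' -f1)" "$f"
        done |
        LC_ALL=C sort |
        uniq -w32 --all-repeated=separate    # step 4: keep repeated md5(whole file)
}
```

The point of the ordering is that most files are eliminated by the size check or the 4k hash, so only a small remainder is ever read in full.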

score 7 · Answer 4 · answered Sep 08 '10 at 21:46

If your deduplication task is music related, first run the Picard application to correctly identify and tag your music (so that you find duplicate .mp3/.ogg files even if their names are incorrect). Note that Picard is also available as an Ubuntu package.

Once that is done, you can find all your duplicate songs by comparing their musicip_puid tags.

score 6 · Answer 5 · answered Apr 22 '14 at 07:34

Another script that does this job is rmdupe. From the author's page:

rmdupe uses standard linux commands to search within specified folders for duplicate files, regardless of filename or extension. Before duplicate candidates are removed they are compared byte-for-byte. rmdupe can also check duplicates against one or more reference folders, can trash files instead of removing them, allows for a custom removal command, and can limit its search to files of specified size. rmdupe includes a simulation mode which reports what will be done for a given command without actually removing any files.
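
The byte-for-byte check and the simulation mode mentioned in that description can be illustrated with cmp. This is a sketch of the idea, not rmdupe's implementation: remove_if_identical and its --simulate argument are hypothetical names chosen for this example.

```shell
# remove_if_identical KEEP CANDIDATE [--simulate]
# Remove CANDIDATE only if it is byte-for-byte identical to KEEP.
remove_if_identical() {
    # Never act when both paths refer to the same file (same path or hard link).
    if [ "$1" -ef "$2" ]; then return 1; fi
    # cmp -s is silent and exits non-zero as soon as the files differ.
    cmp -s -- "$1" "$2" || return 1
    if [ "${3:-}" = "--simulate" ]; then
        printf 'would remove %s\n' "$2"   # report only, change nothing
    else
        rm -- "$2"
    fi
}
```

Running it with --simulate first shows what would be deleted before anything is actually removed.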

score 4 · Answer 6 · answered Jul 03 '23 at 21:48

jdupes

I found jdupes very easy to use and extremely fast.

jdupes is a program for identifying and taking actions upon duplicate files such as deleting, hard linking, symlinking, and block-level deduplication (also known as "dedupe" or "reflink"). It is faster than most other duplicate scanners. It prioritizes data safety over performance while also giving expert users access to advanced (and sometimes dangerous) features.

# Search a single directory:
jdupes path/to/directory
# Search multiple directories:
jdupes directory1 directory2
# Search all directories recursively:
jdupes --recurse path/to/directory
# Search a directory recursively and let the user choose which files to preserve:
jdupes --delete --recurse path/to/directory
# Search multiple directories and follow subdirectories under directory2, not directory1:
jdupes directory1 --recurse: directory2
# Search multiple directories and keep the directory order in the result:
jdupes -O directory1 directory2 directory3
# Exclude files over 1M, summarize info, recurse:
jdupes -X size+=:1000k --summarize --recurse ~

score 4 · Answer 7 · edited Feb 19 '15 at 22:00

I use komparator (sudo apt-get install komparator, Ubuntu 10.04+) as a GUI tool for finding duplicates manually.

answered Dec 29 '13 at 12:15 by N0rbert

score 3 · Answer 8 · edited Feb 19 '15 at 15:55

Have you tried

finddup

or

finddup -l

I guess it works fine.

answered Jul 05 '14 at 04:34 by xerostomus · edited by blade19899

score 2 · Answer 9 · edited Jul 03 '23 at 21:43

For music-related duplicate identification and deletion, Picard (open source) by http://musicbrainz.org/ and Jaikoz (proprietary) are the best solutions. I believe Jaikoz automatically tags your music based on the data of the song file. You don't even need the name of the song for it to identify the song and assign all metadata to it. The free version can tag only a limited number of songs in one run, but you can run it as many times as you want.

score 1 · Answer 10 · answered Feb 23 '21 at 15:33

dupeGuru has a dedicated mode for music. It is a cross-platform GUI program and, as of today (February 2021), it is in active development, although it is unclear which releases work on which systems. Check its documentation.

answered by Diego V

score 0 · Answer 11 · answered Jul 03 '23 at 21:21

Now that fslint is no longer supported, I've switched to fclones. As requested, it matches by hash, and can output a list, or replace files with hard or soft links.

I've been using it like this to replace duplicate files with hard links:

fclones group <dir/to/recursively/search> | fclones link

score 0 · Answer 12 · answered Oct 24 '23 at 16:17

If you're fine with a GUI tool instead, I can highly recommend Czkawka. You can very easily find duplicate files, filter out the files you want (they are grouped by default) and delete the files you do not need. Searching is also very fast and the results are cached, so the next run is even faster.
