17.5 Duplicate Files
20191229 A common challenge is to find duplicate files, such as photos, music, or documents. When available disk space becomes tight it is also a good time for a clean-up.
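A simple trick to find duplicates is to calculate an MD5 signature for each file and then to use that signature to find other files with the same signature. In general the mapping from the contents of a file to its signature is effectively unique: different files almost always have different signatures, though collisions are possible, so a matching signature is strong evidence rather than proof that two files are identical.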
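The signature trick can be sketched with standard tools. This is a minimal illustration only, not how fdupes works internally. It assumes GNU coreutils (md5sum, and a uniq that supports -w and --all-repeated), and the function name list_md5_duplicates is hypothetical.

```shell
# Sketch: group the files in a directory by MD5 signature.
# Assumes GNU coreutils; list_md5_duplicates is a hypothetical helper name.
# File names containing newlines or backslashes are not handled.
list_md5_duplicates() {
    dir="${1:-.}"
    # Hash every regular file directly inside $dir, then sort by signature
    # so that files with identical content end up on adjacent lines.
    find "$dir" -maxdepth 1 -type f -exec md5sum {} + |
        sort |
        # Compare only the first 32 characters (the MD5 hex digest) and
        # print every line whose digest repeats, with a blank line
        # between groups.
        uniq -w32 --all-repeated=separate
}

# Usage: list_md5_duplicates ~/Documents
```

Each group in the output shares one signature. As noted above, that is strong evidence of identical content but not a guarantee.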
The fdupes package provides the fdupes command, which incorporates the MD5 signature within a more thorough pipeline to guarantee the files are duplicates. The pipeline for checking for duplicate files begins with a file size comparison, followed by a partial MD5 signature comparison, a full MD5 signature comparison, and finally a byte-by-byte comparison.
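The stages above can be sketched for a single pair of files, cheapest test first, so that most non-duplicates are rejected before any file is read in full. This is an illustration of the idea rather than fdupes' actual implementation. The function name same_file and the 4096-byte prefix used for the partial signature are assumptions.

```shell
# Sketch of a staged duplicate check for two files, cheapest test first.
# Not fdupes' own code: same_file and the 4096-byte prefix are assumptions.
same_file() {
    a="$1"; b="$2"
    # 1. Different sizes: the files cannot be duplicates.
    [ "$(wc -c < "$a")" -eq "$(wc -c < "$b")" ] || return 1
    # 2. Partial signature: hash only the first 4096 bytes of each file.
    [ "$(head -c 4096 "$a" | md5sum)" = "$(head -c 4096 "$b" | md5sum)" ] || return 1
    # 3. Full signature: hash the entire contents.
    [ "$(md5sum < "$a")" = "$(md5sum < "$b")" ] || return 1
    # 4. Byte-by-byte comparison rules out a signature collision.
    cmp -s "$a" "$b"
}

# Usage: same_file one.pdf two.pdf && echo "duplicates"
```

The function returns success only when all four stages agree, and it stops at the first stage that finds a difference.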
A summary, as obtained using the -m option (--summarize), is often useful to begin with:
$ fdupes --summarize .
13567 duplicate files (in 6407 sets), occupying 16996.0 megabytes
fdupes requires at least one command line argument (a path to a directory). In the above a period (.) is used to indicate the current directory.
With no options fdupes lists groups of duplicated files in the specified directory:
$ fdupes .
./20180323_thesis_02.pdf
./20180323_thesis_01.pdf
./20180323_thesis.pdf

./20030102_pakdd01_03.pdf
./20031012_pakdd01.pdf

./20200531_siunits_01.pdf
./20200531_siunits.pdf
Use the -r option to recurse into subdirectories.
fdupes can delete duplicates, retaining the first listed file. A general heuristic is to keep the original rather than files with versioned file names, noting they contain exactly the same content. Ordering the list by name and then reversing the order can be useful:
$ fdupes --order='name' --reverse .
./20180323_thesis.pdf
./20180323_thesis_01.pdf
./20180323_thesis_02.pdf

./20031012_pakdd01.pdf
./20030102_pakdd01_03.pdf

./20200531_siunits.pdf
./20200531_siunits_01.pdf
The following command will delete duplicates, keeping the first file in the list, the list being ordered in reverse by the filename:
$ fdupes --delete --noprompt --order='name' --reverse .
The -f option will generate a list of duplicate files excluding the first of the duplicates. This list can be saved to a file and used to generate a script to manually delete the duplicate files if desired.
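One way to turn such a list into a reviewable script is sketched below. The helper name dupes_to_rm_script and the output file name remove_dupes.sh are hypothetical. The sketch assumes one path per line with blank lines between groups, and it does not handle file names containing newlines.

```shell
# dupes_to_rm_script is a hypothetical helper: it reads a list of paths on
# standard input (one per line, blank lines between groups, as produced by
# fdupes -f) and writes one quoted rm command per path.
dupes_to_rm_script() {
    echo '#!/bin/sh'
    # Drop the blank group separators, escape any single quotes in a path,
    # then wrap each path in single quotes after "rm --".
    sed -e '/^$/d' \
        -e "s/'/'\\\\''/g" \
        -e "s/^/rm -- '/" \
        -e "s/\$/'/"
}

# Usage:
#   fdupes -f . | dupes_to_rm_script > remove_dupes.sh
#   Review remove_dupes.sh, remove lines for files to keep, then: sh remove_dupes.sh
```

Generating a script rather than deleting immediately leaves a chance to inspect each line before anything is removed.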
Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0.