calibre plugins: duplicate finder

The next few posts will cover some of the more useful plugins for calibre. Plugins are small pieces of code that add functionality to calibre. The calibre application itself ships with, and makes extensive use of, plugins written by the calibre development team. The calibre plugin programming interface (API) lets users write their own plugins that add features they find useful or override calibre's default behaviour. For a full list of available calibre plugins click here. There is a help forum on MobileRead dedicated to plugins. Please post there if you need help developing or using a plugin.

Today we will discuss one popular plugin, “Find Duplicates”. Its goal is to find duplicate entries in your collection so that you can delete or merge them. The plugin was developed by Grant Drake. The help forum dedicated to this particular plugin can be found here.

Getting the Find Duplicates plugin: In the main calibre window, click Preferences, then click “Plugins” at the bottom left corner of the window that opens. A window opens with the list of available plugins. “Find Duplicates” is not on the default list, so you first have to get it: click the “Get new plugins” button at the bottom left corner, choose the Find Duplicates plugin from the list that appears, and click the install button at the bottom right corner. Restart calibre to see the “Find Duplicates” button in the main calibre toolbar.

Now click on “Find Duplicates” and you will see the following window:

Duplicates by author or title: The first choice of duplicate search type, as shown in the figure above, is Title/Author. This lets you locate duplicates by author name, by title, or by both. If you want to use only one of the two, set the other column to ignore. The figure below shows the duplicates found in my collection with the “Title Matching” column set to identical and the “Author Matching” column set to ignore. I have also selected “Show all groups at once with highlighting”.

The above figure shows all sets of books with the same title, irrespective of the author names. Books with the same title are highlighted in the same color and grouped together. The two duplicate entries for, say, “The Columbus Dispatch” (highlighted in green) can be combined by selecting them both, clicking the little arrow next to “Edit Metadata” and selecting “Merge book records”. Here you can choose to delete the extra files or keep them. Once this is done, only one set of duplicates remains in the above figure.
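To make the idea of “identical title, ignore author” concrete, here is a minimal Python sketch of that kind of grouping. It is not the plugin's code: the function name `group_by_title`, the `(title, author)` record shape, the author names, and the choice to ignore case and surrounding spaces are all assumptions made for illustration.

```python
from collections import defaultdict

def group_by_title(books):
    """Group (title, author) records whose titles match, ignoring the author.

    Assumption for this sketch: titles count as identical when they are
    equal after trimming spaces and lowercasing.
    """
    groups = defaultdict(list)
    for title, author in books:
        groups[title.strip().lower()].append((title, author))
    # Only titles that occur more than once form a duplicate group.
    return [group for group in groups.values() if len(group) > 1]

# Hypothetical library entries; the author names are made up.
library = [
    ("The Columbus Dispatch", "Author A"),
    ("The Columbus Dispatch", "Author B"),
    ("Pride and Prejudice", "Jane Austen"),
]
for group in group_by_title(library):
    print(group)
```

Each list that comes back corresponds to one highlighted group in the plugin's results.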

Similar: The figure below shows the effect of setting the “Title Matching” column to identical and the “Author Matching” column to similar. Similar authors differ only in the punctuation or order of their names.
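A rough Python sketch of what “differ only in the punctuation or order of their names” can mean in practice is given here. The helper `similar_author_key` is hypothetical and only illustrates the idea; the plugin's own rules may differ.

```python
import re

def similar_author_key(name):
    """Build a comparison key that ignores case, punctuation and word order."""
    # Turn anything that is not a letter, digit, underscore or space into a space.
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    # Sorting the words makes "Austen, Jane" and "Jane Austen" compare equal.
    return tuple(sorted(cleaned.split()))

print(similar_author_key("Austen, Jane") == similar_author_key("Jane Austen"))  # True
print(similar_author_key("Jane Austen") == similar_author_key("Jane Austin"))   # False
```

Two authors are treated as “similar” in this sketch when their keys are equal, so a spelling difference such as “Austin” is not caught at this level.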

Soundex: The figure below shows the effect of setting the “Title Matching” column to identical and the “Author Matching” column to soundex. Like similar matching, soundex matching catches authors that differ only in the punctuation or order of their names, but it also catches author names with minor spelling differences that sound alike, such as “Austen” and “Austin” in the example below.
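For readers curious why “Austen” and “Austin” end up together, here is a small Python sketch of the classic American Soundex code. This is the textbook algorithm, written out as an illustration; the plugin's own soundex variant and code length may differ.

```python
def soundex(word):
    """Return the classic 4-character American Soundex code for a word."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()          # the first letter is kept as-is
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        digit = codes.get(ch, "")        # vowels, h, w and y have no code
        if digit and digit != prev:
            result += digit
        if ch not in "hw":               # h and w do not separate equal codes
            prev = digit
    return (result + "000")[:4]          # pad or truncate to 4 characters

print(soundex("Austen"))  # A235
print(soundex("Austin"))  # A235
```

Both spellings produce the same code, which is why a soundex comparison groups them even though the letters differ.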

Fuzzy: The figure below shows the effect of setting the “Title Matching” column to identical and the “Author Matching” column to fuzzy. Fuzzy author matching compares only the surname and the first initial, so it treats “Jane Austen” and “J. Austen” as the same author, as shown below. This is the most general setting for finding duplicates.
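The surname-plus-initial comparison can be sketched in a few lines of Python. The helper `fuzzy_author_key` is hypothetical, and it assumes names are written “First Last”, which is a simplification of whatever the plugin actually does.

```python
import re

def fuzzy_author_key(name):
    """Reduce an author name to (surname, first initial)."""
    # Assumption: the name is written "First Last", so the last word is the surname.
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    if not words:
        return ("", "")
    surname = words[-1]
    # A single-word name has no separate first name to take an initial from.
    initial = words[0][0] if len(words) > 1 else ""
    return (surname, initial)

print(fuzzy_author_key("Jane Austen"))  # ('austen', 'j')
print(fuzzy_author_key("J. Austen"))    # ('austen', 'j')
```

Because only one letter of the first name survives, this key matches far more loosely than the similar or soundex comparisons, so it will also produce more false positives.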


Duplicates by ISBN or binary: You can look for duplicates by comparing ISBNs. This finds books with identical ISBNs, and the author and title matching options are disabled. Similarly, binary searching finds duplicate occurrences of a file even when the title and author are different, provided the actual files are identical. As shown in the figure below, “P&P” by “Unknown” contains the same file as “Pride and Prejudice” by “Austen Jane”, “Jane Austin” and “J. Austen”: they all contain the same text (.txt) file, whereas “Pride and Prejudice” by “Jane Austen” contains a different epub file.
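One common way to detect byte-for-byte identical files is to compare content hashes. The Python sketch below illustrates that idea outside calibre; the function names and the choice of SHA-256 are assumptions for illustration, not a description of how the plugin is implemented.

```python
import hashlib
from collections import defaultdict

def file_digest(path, chunk_size=65536):
    """Hash a file's contents in chunks so large books need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_binary_duplicates(paths):
    """Return groups of paths whose file contents are identical."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path)].append(str(path))
    # A digest shared by two or more paths marks a set of duplicate files.
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Calling `find_binary_duplicates` with a list of file paths returns one list per set of identical files. Metadata such as title and author plays no part, which is why “P&P” by “Unknown” can match “Pride and Prejudice”.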

Hope you found this post useful. See you in about a week with details on some more plugins.
