Friday, November 11, 2011

calibre plugins: duplicate finder

The next few posts will cover some of the more useful plugins in calibre. Plugins are small pieces of code that add functionality to calibre. calibre itself ships with, and makes extensive use of, plugins written by the calibre development team. The calibre plugin programming interface (API) also lets users write their own plugins, which can add features they find useful or override calibre's default behaviour. For a full list of available calibre plugins click here. There is a help forum on MobileRead dedicated to plugins. Please post here if you need help with developing or using a plugin.

Today we will discuss one popular plugin: "Find Duplicates". The goal of this plugin is to find duplicate entries in your collection so that you can delete or merge them. The plugin was developed by Grant Drake. The help forum dedicated to this particular plugin can be found here.

Getting the Find Duplicates plugin: In the main calibre window click on Preferences and at the bottom left corner of the new window that opens click "Plugins". Now a window opens up with the list of available plugins. The "Find Duplicates" plugin is not on the default list so you will have to first get it. To do this, click on the "Get new plugins" button at the bottom left corner. A window with a list of plugins shows up. Choose the Find Duplicates plugin and click the install button on the bottom right corner. Restart calibre to see the "Find Duplicates" button in the main calibre tool bar.

Now click on "Find Duplicates" and you will see the following window:




Duplicates by author or title: The first choice in the duplicate search type as shown in the figure above is Title/Author. This allows you to locate duplicates by either author name or by title or both. You can set one column to ignore if you want only to use the other. The figure below shows the duplicates obtained in my collection by setting the "Title Matching" column to identical and the "Author Matching" column to ignore. I have also selected "Show all groups at once with highlighting".

The above figure shows all sets of books with the same title, irrespective of the author names. Books with the same title are highlighted in the same color and grouped together. The two duplicate entries for, say, "The Columbus Dispatch" (highlighted in green) can be combined by selecting them both, clicking the little arrow next to "Edit Metadata" and selecting "Merge book records". Here you can choose to delete the extra files or keep them. Having done this, in the above figure, we are left with only one set of duplicates.

Similar: The figure below shows the effect of setting the "Title Matching" column to identical and the "Author Matching" column to similar. Similar authors differ only in the punctuation or order of their names.
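
To make the idea of "similar" concrete, here is a minimal Python sketch of one way to compare names while ignoring punctuation and word order. The function name `similar_author_key` is made up for this illustration, and the plugin's own matching code may well work differently.

```python
import re

def similar_author_key(name):
    """Reduce an author name to a key that ignores case, punctuation and word order.

    Hypothetical helper for illustration only, not the plugin's actual code.
    """
    # Split on anything that is not a letter, digit or underscore,
    # which drops commas, periods and extra spaces.
    tokens = re.findall(r"\w+", name.lower())
    # Sorting the words makes "Austen, Jane" and "Jane Austen" identical.
    return " ".join(sorted(tokens))

# Names that differ only in punctuation or order share a key.
print(similar_author_key("Austen, Jane") == similar_author_key("Jane Austen"))
```

Note that a key like this still treats "J. Austen" and "Jane Austen" as different authors, which is where the looser settings come in.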

Soundex: The figure below shows the effect of setting the "Title Matching" column to identical and the "Author Matching" column to soundex. Soundex matching also catches authors whose names differ only in punctuation or order, and in addition it catches names with minor spelling differences that sound alike, such as "Austen" and "Austin" in the example below.
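
For the curious, below is a small Python sketch of the classic American Soundex code, which turns a word into a letter followed by three digits so that similar-sounding words get the same code. This is the textbook algorithm; I am assuming the plugin uses something along these lines, but its exact variant and code length may differ.

```python
# Letter groups of the classic Soundex scheme; vowels, H, W and Y get no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the 4-character Soundex code of a word (textbook version)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W are skipped and do not separate repeated codes.
            continue
        code = CODES.get(ch, "")
        # Record a digit only when it differs from the previous letter's code.
        if code and code != prev:
            result += code
        # A vowel resets prev, so the same digit may appear again after it.
        prev = code
    # Pad with zeros or truncate to one letter plus three digits.
    return (result + "000")[:4]

print(soundex("Austen"), soundex("Austin"))  # both print A235
```

Because "Austen" and "Austin" differ only in a vowel, they end up with the same code and would be grouped together.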



Fuzzy: The figure below shows the effect of setting the "Title Matching" column to identical and the "Author Matching" column to fuzzy. Fuzzy author matching compares only the surname and the first initial, so it treats "Jane Austen" and "J. Austen" as the same author, as shown below. This is the most general setting for finding duplicates.
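
A rough Python sketch of the surname plus first initial idea follows. The helper `fuzzy_author_key` is hypothetical, and it assumes names are written either "First Last" or "Last, First"; the plugin's real rules are probably more careful than this.

```python
import re

def fuzzy_author_key(name):
    """Return (surname, first initial) for an author name.

    Hypothetical helper for illustration only. Assumes "First Last"
    or "Last, First" ordering.
    """
    name = name.strip()
    if "," in name:
        # Flip "Austen, Jane" into "Jane Austen" before splitting.
        last, _, first = name.partition(",")
        name = first + " " + last
    tokens = re.findall(r"\w+", name.lower())
    if not tokens:
        return ("", "")
    surname = tokens[-1]
    # With a single word there is no first name to take an initial from.
    initial = tokens[0][0] if len(tokens) > 1 else ""
    return (surname, initial)

# "Jane Austen" and "J. Austen" reduce to the same key.
print(fuzzy_author_key("Jane Austen") == fuzzy_author_key("J. Austen"))
```

With this key a misspelled surname such as "Austin" would still not match "Austen", since only the first name is shortened.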





Duplicates by ISBN or binary:
You can also look for duplicates by comparing ISBNs. This searches for books with identical ISBNs, and the author and title matching options are disabled. Similarly, binary searching lets you find duplicate occurrences of a file even when the title and author are different, provided the actual files are identical. As shown in the figure below, "P&P" by "Unknown" contains the same file as "Pride and Prejudice" by "Austen Jane", "Jane Austin" and "J. Austen". They all contain the same text (.txt) file, while "Pride and Prejudice" by "Jane Austen" contains a different epub file.
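
If you wonder how identical files can be spotted regardless of their metadata, one common approach is to compare a hash of each file's contents. The Python sketch below groups files by their SHA-256 digest. It is only an illustration of the general technique; I am not claiming this is how the plugin does it, and the function names are my own.

```python
import hashlib
from collections import defaultdict

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to save memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_binary_duplicates(paths):
    """Group file paths whose contents are byte-for-byte identical.

    Returns a list of groups, each containing two or more paths.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

You would call `find_binary_duplicates` with a list of file paths of your choosing. Two files with different names, titles or authors land in the same group as long as their bytes match, which is exactly the "P&P" case above.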



Hope you found this post useful. See you in about a week with details on some more plugins.

24 comments:

  1. Is there a way to display books that appear as a single title but have multiple formats?

    For example I have Through the Looking Glass - it has an epub format and a mobi format. I have other books that have just a mobi format.

    Is there a method to search for books that have two or more formats in them?

    I think I explained that clearly - but if not let me know.

  2. Expand the Formats entry in the tag browser and select the formats, say epub and mobi, using ctrl. At this point both should have a green + sign to the left of them. Just above the cover browser you will see the typed text "formats:true or formats:"=EPUB" or formats:"=MOBI"". Change that to "formats:true or formats:"=EPUB" and formats:"=MOBI"", i.e. replace the second "or" with an "and", and then press Enter.
    That should do it.

  3. Nice plugin. I have too many binary duplicates. Is there a way to auto-delete all duplicates?

  4. What if the Find Duplicates plugin finds a few hundred duplicates? Is there a way to select and remove them all at the same time? I really don't want to go through and individually select and delete each one.

    Replies
    1. This comment has been removed by the author.

    2. If you do a search for duplicates and have them come up one group at a time, you can use Ctrl-A to select all, deselect the ones you want to keep, and then hit Del.

    3. I found the easiest way to remove hundreds of duplicates is to create a new library and then add the books from the library containing the duplicates; when asked 'Add Duplicates', say No. As a check, the number of books in the original library minus the number of duplicates should equal the number in the new library. I hope that makes sense.

    4. I found the expanded 'Remove Books - Files of a specific format' to be the fastest. Select ALL the duplicates found, and delete only the .mobi files or the .epub files as you see fit. If you are then left with a few remnant books (books with metadata and covers but no associated ebook file), sort the same list by size and erase the zero-size entries.

      I hope this helps.
      /Jaime

  5. I cannot find how to launch the 'Find Duplicates' plugin after I install it. I have restarted many times, but there is no apparent way in the application to launch it.

    Replies
    1. Update: after reinstalling the plugin, it appears on the right side of the task bar for 5-10 secs after I launch Calibre, then vanishes and cannot be found until I restart Calibre, with the same results again.
      Running version 0.8.41 on Win7 x64

    2. Mine is also not showing

    3. I added it to mine but have too many icons on top. If you have a double arrow, hit that. The icon you're looking for should be there. Once it came up I was surprised that I had 5 dups I didn't know about.

      I hope that helps.

      I love this program.

  6. Is it possible to use this plugin to weed out duplicate authors or variations on author name? My library has gotten rather large, and I'd like an efficient way to combine authors with variations in their names.

    An example:

    Patterson, James
    James Patterson
    Patterson and others, James
    etc.

    Thanks!

  7. @Weekendmedic: Yes, it is. The blog post explains how to do that.

  8. I am having a hard time installing the plugin. I am running version 0.8.8 but it keeps telling me I need at least 0.8.57. How do I fix this?

    Replies
    1. Sorry, but the dot is not a decimal point. 57 > 8, so you need version 0.8.57 or greater.
      Numbering goes 0.8.8, 0.8.9, 0.8.10, 0.8.11 and so on.

  9. This comment has been removed by the author.

  10. Sorry if this is obvious, but I cannot get the Find Duplicates plugin to highlight more than the first duplicate instance. Like T-Roy (January 31, 2012 4:11 AM), I would like to have the plugin highlight all duplicate instances so I can delete all duplicates at the same time, in one fell swoop, rather than repeating a delete on each duplicate. No matter what I do, the Find Duplicate plugin only highlights the first instance.

    I'll keep plugging away at it, but if you can shed some light on it I'd appreciate it.

    (I'm dealing with the case where I allowed duplicates to be included when I added some new books. Boy, was that the wrong thing to do).

    Thanks!!!

  11. Inquiry:
    I have the same problem as those above. I would like to know; is it possible to uninstall and reinstall calibre so that it would reuse my library? Would that be a good way to delete all duplicates?

    Replies
    1. Uninstalling calibre won't remove duplicates. Is it that you can't figure out how to launch the Find Duplicates plugin?

  12. Reinstalling calibre will not harm your library in any way. You can continue to use the library after reinstallation.

  13. I'm using "Duplicate Files Deleter" for this kind of issue, guaranteed easy fix.
