Saturday, October 8, 2011

Custom news fetching

The "Fetch News" feature in calibre is very powerful and versatile. It has over 1,000 built-in news recipes, spanning more than 30 languages and 50 countries, for the websites of newspapers, magazines and blogs. With a few clicks you can set up regular downloads of the news source of your choice. Not only can you choose from this large variety of available recipes, you can also create a customized recipe of your own or tweak one of the available recipes to make it better suited to you. In the past, creating polished news recipes was a job for those with some expertise in Python. However, thanks to the "auto clean up" feature recently included in calibre, anyone can now follow a simple formula to write a recipe for many websites or combinations of websites. This may not work for a small fraction of complicated websites, but it is well worth a try. The "Fetch News" icon in the top calibre toolbar is shown in the figure below.







Basic news fetching: To get started, just click on the above icon in the top toolbar of the main calibre window. A new window, shown in Figure 1 below, opens up. It lists all available news sources by language and country. For example, "Arabic[2]" on the second line indicates that there are 2 news sources available in the Arabic language. To see what these news sources are, click the little arrow to the left of "Arabic". You will notice that under English there are a number of entries. This is because a number of different countries publish English newspapers, magazines and blogs. The list indicates that there are 7 news sources from Australia, 1 from Bulgaria, 23 from Canada and so on.




If you are looking for a particular news source, say the "Washington Post", you can search for it as shown in Figure 2 above. Type all or some of the letters into the search bar and press Enter. The list below shrinks to include only the matches to your search, so in Figure 2 only 3 news sources match the word "washington". Now if you select "The Washington Post", options appear on the right side that allow you to download the news immediately (the "Download Now" button at the bottom right) or to set up a schedule for automatic downloads. If calibre is not running at the scheduled time, the automatic download occurs the first time calibre is run on your computer after that time.

Adding RSS feeds to existing news recipes: There are some very simple things you can do to customize the news recipes to your taste.
News recipes contain different RSS feeds for different sections such as Sports, Politics and Business. You may be interested in a feed that is not included, say Entertainment.
It is very simple to add that feed to the recipe.

First go to the news website of interest. Search the page for RSS (RSS feeds), usually indicated by a little square orange icon. Click on this link and it should take you to a page with RSS links to the various sections of the news, such as Sports, Politics and Business.

Start up calibre, click the little arrow next to "Fetch News" and then click on "Add a custom news source". A new window opens up. In its bottom left corner, click on "Customize builtin recipe". A little window now opens up where you can pick the recipe of the news source you wish to customize.

Example: The Los Angeles Times

Now on the left column of the previous window The Los Angeles Times should show up. Select it and the recipe will show up on the right column.

If you look at the recipe you will find a block with RSS feeds that begins with

    feeds = [ ...
             


Suppose instead that it looked like this, with fewer feeds and without the Sports feed that, say, you are interested in:

    feeds = [
              (u'Top News'             , u'http://feeds.latimes.com/latimes/news'                           )
             ,(u'Local News'           , u'http://feeds.latimes.com/latimes/news/local'                     )
             ,(u'National'             , u'http://feeds.latimes.com/latimes/news/nationworld/nation'        )
             ,(u'National Politics'    , u'http://feeds.latimes.com/latimes/news/politics/'                 )
             ,(u'Business'             , u'http://feeds.latimes.com/latimes/business'                       )
             ,(u'Education'            , u'http://feeds.latimes.com/latimes/news/education'                 )
             ,(u'Environment'          , u'http://feeds.latimes.com/latimes/news/science/environment'       )
             ,(u'Religion'             , u'http://feeds.latimes.com/latimes/features/religion'              )
             ,(u'Science'              , u'http://feeds.latimes.com/latimes/news/science'                   )
             ,(u'Technology'           , u'http://feeds.latimes.com/latimes/technology'                     )
             ,(u'Africa'               , u'http://feeds.latimes.com/latimes/africa'                         )
            ]


Then all you have to do is follow the same syntax: add a new line with the section name (Sports) and the link to the Sports feed, which you get from the RSS page of the Los Angeles Times website.

Make sure the link you get looks similar to the other links in the recipe. If not try to copy the link from the little square orange icon.
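As a minimal sketch of the step described above, here is a shortened feed list with a Sports entry added. The Sports URL is an assumption made up for illustration, so copy the real link from the newspaper's RSS page. The helper `looks_like_other_links` is hypothetical and not part of calibre; it only shows one way to check that a new link resembles the links already in the recipe.

```python
from urllib.parse import urlparse

# Shortened feed list from the recipe: (section name, RSS link) pairs.
feeds = [
    (u'Top News', u'http://feeds.latimes.com/latimes/news'),
    (u'Business', u'http://feeds.latimes.com/latimes/business'),
]

# ASSUMPTION: this Sports URL is illustrative only; use the link you
# copied from the RSS page of the website.
feeds.append((u'Sports', u'http://feeds.latimes.com/latimes/sports'))


def looks_like_other_links(url, existing_feeds):
    """Hypothetical helper: True if url is served from the same host
    as the first feed already in the list."""
    return urlparse(url).netloc == urlparse(existing_feeds[0][1]).netloc


print(looks_like_other_links(feeds[-1][1], feeds))
```

In the recipe editor itself you do not need `append`; you simply type the new `(name, link)` line inside the brackets, keeping the commas between entries.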

If there are RSS feeds corresponding to sections that do not interest you, you can delete the names and links for those sections. This will make the download faster and remove clutter.
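The effect of deleting sections can be sketched in a few lines of standalone Python. The choice of "Religion" and "Africa" as unwanted sections is just an assumption for the example. In the recipe editor you would simply delete those lines by hand; the filter below only shows what the resulting list looks like.

```python
# Four of the feeds from the example recipe above.
feeds = [
    (u'Top News', u'http://feeds.latimes.com/latimes/news'),
    (u'Religion', u'http://feeds.latimes.com/latimes/features/religion'),
    (u'Science', u'http://feeds.latimes.com/latimes/news/science'),
    (u'Africa', u'http://feeds.latimes.com/latimes/africa'),
]

# ASSUMPTION: sections picked arbitrarily for the example.
unwanted = {u'Religion', u'Africa'}

# Keep every (name, link) pair whose name is not in the unwanted set.
feeds = [(name, link) for (name, link) in feeds if name not in unwanted]

for name, link in feeds:
    print(name, link)
```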

Then click the Add/Update recipe button at the bottom left corner. Now a new "replace recipe?" window opens up. Click replace recipe and you are done!

To access this recipe go to the main calibre window and click "Fetch News" and you get a list of news sources. The first entry is Custom. Click on it and it will expand to show the list of your customized news sources.

Auto clean up: This is a powerful feature that enables lay users to make custom recipes. You may be interested in making a single news recipe that has RSS feeds from different blogs and news sources that you visit. This can be done quite easily with "Auto clean up". The following recipe obtains the RSS feeds for the politics section of 3 different news sources, namely, "The Seattle Times", "The San Francisco Chronicle" and "The Los Angeles Times":

from calibre.web.feeds.news import BasicNewsRecipe


class Politics(BasicNewsRecipe):
    title          = u'Politics'
    language       = 'en'
    __author__     = 'Krittika Goyal'
    oldest_article = 3 #days
    max_articles_per_feed = 20
    use_embedded_content = False

    no_stylesheets = True
    auto_cleanup = True
    auto_cleanup_keep = '//div[@class="thumbnail"]'


    feeds          = [
        ('Seattle Times',
         'http://seattletimes.nwsource.com/rss/politics.xml'),
        ('San Francisco Chronicle',
         'http://feeds.sfgate.com/sfgate/rss/feeds/blogs/sfgate/nov05election/index_rss2'),
        ('LA Times',
         'http://feeds.latimes.com/latimes/news/politics/'),
    ]

The first line in red (the import statement) must be in every news recipe. The next block of code in grey holds information such as the title and author, which you should change to suit your recipe. The next two lines in red (no_stylesheets and auto_cleanup) are what clean up the web page, removing advertisements and other unwanted material. "auto_cleanup" uses statistical analysis to extract the useful content from the news website or blog. I will return to the blue line (auto_cleanup_keep) later. The last grey block of code lists the feeds of interest. The output you get (without the blue line) for one of the pages of "The San Francisco Chronicle" is shown in the figure below.



"auto_cleanup" is usually great at picking out the relevant content from a variety of websites, so you do not need to clean up each website manually. As a result you can use feeds from different websites even if their articles have very different structures. Sometimes, however, "auto_cleanup" can be overzealous and remove content that is in fact relevant, such as a picture. To fix this you need to understand a little bit of HTML. Use Firebug in Firefox, or a similar tool, to find the tags corresponding to the part that you would like to keep.

In the above example "auto_cleanup" was removing the picture at the beginning of the articles in "The Los Angeles Times".  To fix that we had to add the blue line of code. Now the picture is included (see figure below).



Without the blue line of code, you would still get the text of the article but the picture would be missing. The "auto_cleanup" feature is based on code from the ReadItLater open source project.
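If you are curious what an expression like the one in auto_cleanup_keep selects, here is a small sketch you can run outside calibre. The page below is a made-up, heavily simplified article, and the class names "ad" and "story" are assumptions for illustration. It uses Python's built-in ElementTree module, which understands only a subset of XPath and needs a leading "." before "//"; in the recipe you keep the expression exactly as written above.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: a hypothetical, simplified article page. Real pages are
# far messier and are usually not well-formed XML.
page = """
<html>
  <body>
    <div class="ad">Buy now!</div>
    <div class="thumbnail"><img src="photo.jpg" alt="Lead photo"/></div>
    <div class="story"><p>Text of the article.</p></div>
  </body>
</html>
""".strip()

root = ET.fromstring(page)

# Same idea as '//div[@class="thumbnail"]' in the recipe: every <div>
# whose class attribute is exactly "thumbnail".
kept = root.findall('.//div[@class="thumbnail"]')

# Collect the pictures inside the matching <div> elements.
kept_images = [img.get('src') for div in kept for img in div.iter('img')]
print(kept_images)
```

Only the div holding the picture matches, which is the part you want "auto_cleanup" to leave alone.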


For the more advanced user: Finally, for those of you who are adventurous or experienced in programming, the news customization feature in calibre is very powerful. For tips on using it to its full potential, visit the "tips for developing new recipes" section of the calibre user manual.

The great thing about calibre is that its features are accessible at many levels, so both lay users and advanced tinkerers find it useful and enjoyable.

Finally, sorry for this blog post being later than usual, but Kovid and I moved to India today and the moving procedure kept me very busy. We still need to settle in, so the next few blog posts may be a little off schedule as well and I may be a little slow in responding to comments. I will do my best to be on time. As always, see you in about a week, and I hope you found this post useful.

4 comments:

  1. Thank you for this marvellous piece of software !

  2. Hi, Thank you for this excellent description.
    I used this description with my daily preferred news feeds and it works great.
    Only thing is that in the following feeds the articles get canceled by autoclean.
    I tried the thing with firebug as you described but was not able to find out the correct tag for keeping the articles.

    If you could help me with that I would highly appreciate that feed:

    http://www.all-in.de/nachrichten/allgaeu/polizeimeldungen/Polizeimeldungen-polizei-sachschaden-verletzung-Verkehrsunfall-mit-schwerverletztem-Fussgaenger-in-Sulzschneid;art2756,1144546?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+memmingen+%28all-in.de%2Frss+-+Memmingen+und+Unterallg%C3%A4u%29

    My current entry in Calibre is:

    from calibre.web.feeds.news import BasicNewsRecipe


    class Memminger_Zeitung(BasicNewsRecipe):
        title = u'Memminger Zeitung'
        language = 'de'
        __author__ = 'Memminger Zeitung GmbH'
        oldest_article = 1 #days
        max_articles_per_feed = 20
        use_embedded_content = False

        no_stylesheets = True
        auto_cleanup = True
        auto_cleanup_keep = '//div[@class="thumbnail"]'


        feeds = [
            (u'Politik und Gesellschaft', u'http://www.augsburger-allgemeine.de/politik/rss'),
            (u'Wirtschaft', u'http://www.augsburger-allgemeine.de/wirtschaft/rss'),
            (u'Aus aller Welt', u'http://www.augsburger-allgemeine.de/panorama/rss'),
            (u'Sport', u'http://www.augsburger-allgemeine.de/sport/rss'),
            (u'Bayern', u'http://www.augsburger-allgemeine.de/bayern/rss'),
            (u'Allgäu', u'http://feeds.all-in.de/allgaeu'),
            (u'Lokales Memmingen', u'http://feeds.all-in.de/memmingen'),
        ]

  3. Thank you very much, Kovid Goyal:
    calibre is one of the best programs dedicated to e-books.
    I wish you success, and we look forward to what you bring us next.
