
Making crawling and scraping websites slightly less painful with Anemone

By Glen Crawford, 23 Jan 2014

If you have ever had to scrape a website to harvest data (and I sincerely hope that you haven't), then you will know the pain of writing a script to automate the trawling of a site and parsing flaky, inconsistent HTML to get to the data that you need. It can be frustrating, depressing, and painful, but luckily there are tools to make it slightly less unbearable.

One of those tools is called Anemone. Anemone is a Ruby gem that lets you write scripts that crawl one or more websites, parse their pages, and harvest information. It takes care of the basics, like making the requests, collecting the URLs on a page, and following redirects, and gives you hooks into parts of the process so that you can specify which links to follow, which pages to parse, and so on.

There are three main areas of Anemone to explain: options, "verbs" and the page objects. I'll explain them as I work through an example. The following code is a quick experiment using Anemone to crawl the reInteractive blog and count how many posts each author has written (conveniently forgetting that we already have an Atom feed for that).

Configuring the crawling behaviour

Start by adding the anemone gem to your project (place it into your Gemfile, or install it yourself and require 'anemone').

Then invoke Anemone.crawl with the starting URL of your crawler. You can also pass in options to customize Anemone's behaviour. Anemone comes with some sensible defaults, however, which you can see here.

The focus_crawl method allows you to pass in a block that selects which links on each page should be followed. page.links returns the links from the href attributes of the <a> elements on the page that point to the same domain, as absolute URI objects (which is why the example calls to_s on each one). In the example below I'm testing each link against two different regular expressions, so that only URLs from blog navigation links (the "Newer" and "Older" links at the bottom of the pages) and links to blog posts are followed. Of course, many URLs will pop up on multiple pages, but don't worry, Anemone will only crawl each URL once, no matter how many times the URL is found.

The on_every_page method allows you to perform an action on each page that the crawler visits. The block you pass in has access to the page object (Anemone::Page) that represents the page. It gives you access to the page's body, the HTTP status code, referer, etc. In my example, if the page was returned with a 200 OK and is a blog post page, then I am passing the page into a method to process the blog post.

Anemone provides various methods to customise its crawling behaviour. As well as focus_crawl and on_every_page, there are after_crawl, on_pages_like, and skip_links_like. But the first two are the ones that you will be using most often.

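require 'anemone'

# BLOG_NAVIGATION_URLS and BLOG_POST_URLS are the two regular expressions
# described above; process_blog_post is defined under "Parsing the page"
# and must be loaded before the crawl starts.
Anemone.crawl("http://www.reinteractive.net/blog", :verbose => false, :depth_limit => 5) do |anemone|
  # Only follow blog navigation links and links to blog posts.
  anemone.focus_crawl do |page|
    page.links.select{|link| link.to_s.match(BLOG_NAVIGATION_URLS) || link.to_s.match(BLOG_POST_URLS)}
  end
  # Only process pages that loaded successfully and are blog posts.
  anemone.on_every_page do |page|
    process_blog_post(page) if page.code == 200 && page.url.to_s.match(BLOG_POST_URLS)
  end
end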
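
The two regular expressions aren't shown above, so here is a rough sketch of what the link filter and the page check could look like in isolation. The patterns are guesses at a URL scheme, not the ones actually used for this site, and FakePage is a hypothetical stand-in for Anemone::Page so that the logic can be tried without hitting the network.

```ruby
require 'uri'

# Hypothetical patterns; the real site's URL scheme may differ.
BLOG_NAVIGATION_URLS = %r{/blog\?page=\d+\z}
BLOG_POST_URLS       = %r{/posts/\d+-[\w-]+\z}

# Hypothetical stand-in for Anemone::Page with only the fields used here.
FakePage = Struct.new(:url, :code, :links)

# Same test as the focus_crawl block: keep navigation and post links only.
def links_to_follow(page)
  page.links.select do |link|
    url = link.to_s
    url.match?(BLOG_NAVIGATION_URLS) || url.match?(BLOG_POST_URLS)
  end
end

# Same guard as the on_every_page block.
def blog_post?(page)
  page.code == 200 && page.url.to_s.match?(BLOG_POST_URLS)
end

index = FakePage.new(
  URI('http://www.reinteractive.net/blog'),
  200,
  [
    URI('http://www.reinteractive.net/blog?page=2'),
    URI('http://www.reinteractive.net/posts/42-some-post'),
    URI('http://www.reinteractive.net/about')
  ]
)
post    = FakePage.new(URI('http://www.reinteractive.net/posts/42-some-post'), 200, [])
missing = FakePage.new(URI('http://www.reinteractive.net/posts/7-gone'), 404, [])

puts links_to_follow(index).map(&:to_s)
puts blog_post?(post)
puts blog_post?(index)
puts blog_post?(missing)
```

Pulling the two checks out into small methods like this makes it possible to try the patterns against a handful of known URLs before letting the crawler loose.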

Parsing the page

Now that you have the page and you know that you want to parse it, you can start picking the values that you want out of the body of the document. The page object has an attribute called doc, which returns a Nokogiri::HTML::Document representing the page's body. You can search through the document for the values that you want by using the css and xpath methods. The Nokogiri site itself has a great tutorial on searching through an HTML document using Nokogiri.

In this example, I'm using CSS selectors to locate the values that I want: the blog post's title and author. I'm then just building up a hash of these authors and posts, i.e., {"Glen Crawford" => ["Post #1", "Post #2"]}.

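# Maps each author's name to the titles of the posts they have written.
@authors_and_posts = {}

def process_blog_post(page)
  # page.doc is the Nokogiri document for the page's body.
  title = page.doc.css(".blog-header h1").text.strip
  author = page.doc.css("meta[name='author']").attr("content").value

  # Record each title only once per author.
  @authors_and_posts[author] ||= []
  @authors_and_posts[author] << title unless @authors_and_posts[author].include?(title)
end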
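
Once the crawl has finished, the hash still needs to be turned into per-author counts. A minimal sketch of that last step, using made-up author names and titles in the same shape that process_blog_post builds:

```ruby
# Sample data in the shape built by process_blog_post; names and titles are made up.
authors_and_posts = {
  'Author A' => ['Post #1', 'Post #2', 'Post #3'],
  'Author B' => ['Post #4'],
  'Author C' => ['Post #5', 'Post #6']
}

# Count the posts per author and list the most prolific authors first.
post_counts = authors_and_posts
  .map { |author, titles| [author, titles.size] }
  .sort_by { |_author, count| -count }

post_counts.each { |author, count| puts "#{author}: #{count}" }
```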

Conclusion

And that's all there really is to it. We configured Anemone, gave it a starting URL, told it how to decide which links to follow, what to do with each matching page that it found, and then simply parsed out the values that we needed. This makes it easy to see that Chloé and Mikel are well in front in terms of blog posts published, with 41 and 22 respectively, with Leonard catching up with 9 posts.

Obviously, scraping a website like this to harvest data isn't ideal. It takes time to run and generates a bunch of HTTP requests, it can be a pain to implement the regular expressions, CSS selectors and XPaths, and most importantly, the website could be modified or rebuilt, changing the URLs and HTML structure of the pages. If that happens, your crawler would likely break, and you would have to fix it or do it again from scratch.

In most cases, it is far better to pull the data from an API of some sort. But if the website that you want the data from doesn't have one, or won't implement one, or their API doesn't give you all the data that you need, then you might have to turn to crawling and scraping the site. And if you have to do that, then Anemone is a great tool for making that process a whole lot less painful than it can be.
