Parse.ly Crawler


What does the Crawler do?

The Parse.ly Crawler is a backend system that makes HTTP requests to your web server in order to download content and metadata from your site pages. In its most typical use, the Crawler fetches a news site’s article pages so that they appear as “Posts” inside Parse.ly Dash, and so that other metadata such as author, section, publication date, image, and topics is automatically extracted.

Technical details about the Crawler

Parse.ly’s Crawler will send out crawl requests to your pages as the pageviews stream in from those URLs. It will crawl your content using the user-agent: Mozilla/5.0 (compatible; parse.ly scraper/0.14; +http://parsely.com). Note that the version field can change over time.

Here is the current set of IP addresses for our Crawler worker machines: 198.61.208.242, 198.61.209.12, 198.61.209.16.
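If you want to recognize Crawler traffic in your own logs or firewall rules, a check along the following lines may help. This is a minimal sketch, not part of any Parse.ly library: the function and constant names are hypothetical, the IP addresses are copied from the list above and may change, and the version number in the user-agent is deliberately not pinned because it can change over time.

```python
import re

# IP addresses copied from the list above; treat them as a snapshot.
PARSELY_CRAWLER_IPS = {"198.61.208.242", "198.61.209.12", "198.61.209.16"}

# Match the documented user-agent without pinning the version field.
PARSELY_UA_PATTERN = re.compile(
    r"^Mozilla/5\.0 \(compatible; parse\.ly scraper/[\d.]+; \+http://parsely\.com\)$"
)


def is_parsely_crawler(user_agent, remote_ip):
    """Return True if both the user-agent and the IP look like the Crawler."""
    ua_matches = bool(PARSELY_UA_PATTERN.match(user_agent))
    return ua_matches and remote_ip in PARSELY_CRAWLER_IPS
```

Checking both values is a design choice: a user-agent alone is easy to spoof, and an IP alone says nothing about which software made the request.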

You may be worried about additional load that this crawler might put on your server. This is generally not a concern. The Crawler takes the following steps to be polite to the servers it crawls:

  • It limits the number of concurrent requests it opens to your server to ensure it doesn’t affect your concurrency throughput.
  • It caches articles it has already seen.
  • It introduces a small delay between HTTP requests to ensure the load is spread out.
  • It does not proactively spider your site; instead, pages are crawled only as they are visited by users. This way, archived articles that are not visited are not needlessly crawled.
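To make two of these policies concrete (caching and a delay between requests), here is a toy illustration. It is not Parse.ly’s implementation: the class name, the injected fetch callable, and the delay value are all assumptions made for the example, and the concurrency limit is left out because the sketch is single-threaded.

```python
import time


class PoliteFetcher:
    """Toy fetcher that caches seen URLs and spaces out its requests."""

    def __init__(self, fetch, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch          # hypothetical callable: url -> content
        self._min_delay = min_delay  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._cache = {}
        self._last_request = None

    def get(self, url):
        # A URL that has already been seen is served from the cache.
        if url in self._cache:
            return self._cache[url]
        # Otherwise wait until min_delay has passed since the last request.
        if self._last_request is not None:
            wait = self._min_delay - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        self._last_request = self._clock()
        content = self._fetch(url)
        self._cache[url] = content
        return content
```

The clock and sleep functions are parameters only so that the behavior can be checked without real waiting.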

In roughly the first month of integration, you will see more crawling activity than in later months, because Parse.ly will be loading older articles that receive occasional visits. This activity will wane over time.

Finally, we must emphasize that crawling is an entirely back-end operation. That is, crawling in no way affects the pageload performance of your visitors coming to your site. It is done entirely asynchronously by Parse.ly’s servers “after the fact”.

Crawling through pay walls

Some Parse.ly customers do not have all of their content accessible to the public due to a “pay wall”. In these cases, you must coordinate with our support team to arrange for a special login account, only accessible to our Crawler, which can gain access to your pay wall content. The credentials for this account will only be used by the Crawler and will not be shared with anyone else.

Providing Page Attributes Manually

Our typical process is to write a custom crawler for each of our customers. However, you may also provide page attributes manually if they are available in your CMS, which may result in more reliable output from our crawler.

This is done via the parsely-page META tag on your article pages, which can add arbitrary metadata to existing posts.

Here is the integration approach for parsely-page:

  1. Add a META tag to your page with the name “parsely-page” and a content field which is a serialized (and, as needed, quote-escaped) JSON string. For example:
<meta name='parsely-page'
      content='{"title": "Obama gives speech on Iraq",
                "link": "http://nytimes.com/2152/obama-iraq",
                "image_url": "http://nytimes.com/img/2152.jpg",
                "type": "post",
                "post_id": "2152",
                "pub_date": "2011-05-25T13:00:00Z",
                "section": "Politics",
                "author": "Josh Jones",
                "tags": ["election 2012","editorials","obama barack","romney mitt"]
               }'>

Note

Please do not copy-paste the above example directly into your page templates. You must review the Technical Caveats below to ensure that line breaks, string escaping, and literal values are chosen correctly.

  2. Notify the Parse.ly team that the META tag has been integrated. We can confirm that the parsely-page attribute is accessible on all article pages and use this to power our metadata extraction.

Here is a quick guide to the fields above:

Field      Description
title      Post or page title (article headline)
link       Canonical URL for post/page
image_url  URL for image associated with post/page
type       One of “post”, “frontpage”, “sectionpage”
post_id    String that uniquely identifies this post
pub_date   Publication date, as ISO 8601 UTC timezone string
section    Section of the site (e.g. A+E, Politics)
author     Author who wrote the post
tags       An array of Tags associated with this post
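Before serializing, it can help to check a candidate dictionary against the fields in the table above. The following is a rough sketch under two assumptions that this page does not state: that every field is required, and that pub_date always uses the exact format of the example (2011-05-25T13:00:00Z). The function name is hypothetical.

```python
from datetime import datetime

ALLOWED_TYPES = {"post", "frontpage", "sectionpage"}
STRING_FIELDS = ("title", "link", "image_url", "post_id", "section", "author")


def validate_parsely_page(data):
    """Return a list of problems found in a parsely-page dict (empty if none)."""
    problems = []
    for field in STRING_FIELDS:
        if not isinstance(data.get(field), str):
            problems.append("%s should be a string" % field)
    if data.get("type") not in ALLOWED_TYPES:
        problems.append("type should be one of post, frontpage, sectionpage")
    try:
        # Assumed format: second precision with a literal trailing Z.
        datetime.strptime(data.get("pub_date", ""), "%Y-%m-%dT%H:%M:%SZ")
    except (TypeError, ValueError):
        problems.append("pub_date should look like 2011-05-25T13:00:00Z")
    tags = data.get("tags")
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        problems.append("tags should be a list of strings")
    return problems
```

Returning a list of messages rather than raising on the first error makes it easier to report every problem in a template at once.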

Technical Caveats

Remove all line breaks. In this example, the line breaks are included only for display purposes. META tags typically should not contain line breaks in the content attribute to ensure the highest level of browser compatibility. Typically this can be achieved by replacing all instances of the newline character "\n" with the empty string "", e.g. value.replace("\n", "").

Escape all single and double quotes in JSON item values. All single quotes should be replaced with the JSON unicode equivalent \u0027 and double quotes should be escaped with \".

Values in parsely-page will appear literally inside Parse.ly Dash. String values supplied here, specifically title, author, and section, will display in Dash exactly as they appear in the tag (after HTML decoding). As a result, make sure to use proper capitalization and specify section names as you expect them to appear inside Dash.
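One way to check that a rendered page follows the caveats above is to read the tag back and decode it. The sketch below uses only the Python standard library; the class and function names are hypothetical, and the sample markup is made up for illustration. It shows that a title containing an escaped single quote (\u0027) and escaped double quotes (\") survives the round trip.

```python
import json
from html.parser import HTMLParser


class ParselyPageExtractor(HTMLParser):
    """Collects the content attribute of the parsely-page META tag."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "parsely-page":
            self.content = attrs.get("content")


def read_parsely_page(html):
    """Return the decoded parsely-page dict, or None if the tag is absent."""
    parser = ParselyPageExtractor()
    parser.feed(html)
    parser.close()
    if parser.content is None:
        return None
    return json.loads(parser.content)


# Made-up sample: one line, single quote as \u0027, double quotes as \".
sample = r"""<meta name='parsely-page' content='{"title": "Mayor\u0027s \"big\" plan", "type": "post"}' />"""
page = read_parsely_page(sample)
print(page["title"])
```

If a single quote were left unescaped, the HTML parser would end the attribute early and json.loads would be expected to raise a ValueError, which makes this a cheap smoke test for templates.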

parsely-page Examples in Different Languages and Content Management Systems

Implementing <meta name="parsely-page" ... /> greatly helps with the accuracy of data you see in Dash, but we want to make it as easy as possible for you to implement this meta tag no matter what content management system (CMS) or programming language you’re using.

If you’d like to see your CMS supported then drop us a line and let us know.

Let’s take an example: here is a hypothetical article published on the Awesome-Publisher.com site.

Variable   Value
Title      Man known to lie in road is run over and killed \n wife is “Stunned”
Link       http://www.awesome-publisher.com/123456/man-known-to-lie-in-road-is-run-over-and-killed
Image URL  http://images.awesome-publisher.com/123456123456.png
Type       post
Post ID    123456
Pub Date   2012-01-01 at 11:34:02AM EST
Section    News
Author     John Doe
Tags       news, traffic, local

Notice that the title has a few characters that could be troublesome for JSON documents (namely a newline \n as well as double quotes "). In addition, the article’s publication date is given in Eastern Standard Time (EST), whereas parsely-page requires UTC.
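The time zone conversion on its own can be sketched with the Python standard library. This assumes a fixed UTC-5 offset for EST, which is correct for the January date in the example but ignores daylight saving time; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset for EST; this ignores daylight saving time.
EST = timezone(timedelta(hours=-5), "EST")


def to_parsely_pub_date(local_dt):
    """Format a timezone-aware datetime as an ISO 8601 UTC string."""
    return local_dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


published = datetime(2012, 1, 1, 11, 34, 2, tzinfo=EST)
print(to_parsely_pub_date(published))  # 2012-01-01T16:34:02Z
```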

How would we output the parsely-page meta tag given the example above? Let’s take a look at a few examples in both popular CMSs as well as programming languages.

WordPress

Getting Parse.ly Dash working on your WordPress site/blog is easy with the wp-parsely WordPress plugin. Just follow the installation instructions and you’ll be ready to go in no time.

We’re always updating this plugin, so current users should keep an eye out for updates in the WordPress settings screen.

PHP

<?php
$title      = "Man known to lie in road is run over and killed\nwife is \"Stunned\"";
$link       = "http://www.awesome-publisher.com/123456/man-known-to-lie-in-road-is...";
$imageURL   = "http://images.awesome-publisher.com/123456123456.png";
$section    = "News";
$author     = "John Doe";
$tags       = array("news", "traffic", "local");
$pubDate    = DateTime::createFromFormat("Y-m-d H:i:s P", "2012-01-01 11:34:02 -05:00");

function getCleanParselyPageValue($val) {
    $val = str_replace("\n", "", $val);
    $val = str_replace("\r", "", $val);
    return $val;
}

$parselyPage = array();
$parselyPage["title"]       = getCleanParselyPageValue($title);
$parselyPage["link"]        = $link;
$parselyPage["image_url"]   = $imageURL;
$parselyPage["type"]        = "post";
$parselyPage["post_id"]     = "123456";
$parselyPage["pub_date"]    = gmdate("Y-m-d\TH:i:s\Z", $pubDate->getTimestamp());
$parselyPage["section"]     = getCleanParselyPageValue($section);
$parselyPage["author"]      = getCleanParselyPageValue($author);
$parselyPage["tags"]        = $tags;

$output = "<meta name='parsely-page' content='" . json_encode($parselyPage, JSON_HEX_APOS | JSON_HEX_QUOT) . "' />";
?>

Python

import datetime
import json
import pytz

def get_clean_parsely_page_value(value):
    # Line breaks are not allowed inside the content attribute.
    return value.replace('\n', ' ')


title       = "Man known to lie in road is run over and killed\nwife is \"Stunned\""
link        = "http://www.awesome-publisher.com/123456/man-known-to-lie-in-road-is..."
image_url   = "http://images.awesome-publisher.com/123456123456.png"
section     = "News"
author      = "John Doe"
# Use localize() rather than tzinfo=...: passing a pytz zone directly as
# tzinfo applies the zone's historical local mean time offset, not EST.
pub_date    = pytz.timezone("US/Eastern").localize(datetime.datetime(2012, 1, 1, 11, 34, 2))
tags        = ["news", "traffic", "local"]

parsely_page = {}
parsely_page["title"]     = get_clean_parsely_page_value(title)
parsely_page["link"]      = link
parsely_page["image_url"] = image_url
parsely_page["type"]      = "post"
parsely_page["post_id"]   = "123456"
parsely_page["pub_date"]  = pub_date.astimezone(pytz.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
parsely_page["section"]   = get_clean_parsely_page_value(section)
parsely_page["author"]    = get_clean_parsely_page_value(author)
parsely_page["tags"]      = tags

# json.dumps escapes double quotes as \". Single quotes are replaced after
# serialization so they cannot end the single-quoted content attribute.
content = json.dumps(parsely_page).replace("'", "\\u0027")
output = "<meta name='parsely-page' content='%s' />" % content

Ruby

require 'json'

title       = "Man known to lie in road is run over and killed\nwife is \"Stunned\""
link        = "http://www.awesome-publisher.com/123456/man-known-to-lie-in-road-is..."
image_url   = "http://images.awesome-publisher.com/123456123456.png"
section     = "News"
author      = "John Doe"
pub_date    = Time.new(2012, 1, 1, 11, 34, 2, '-05:00')
tags        = ["news", "traffic", "local"]

# Line breaks are not allowed inside the content attribute.
def get_clean_parsely_page_value(val)
    val.gsub(/\n/, "")
end

parsely_page = {}
parsely_page[:title]      = get_clean_parsely_page_value(title)
parsely_page[:link]       = link
parsely_page[:image_url]  = image_url
parsely_page[:type]       = "post"
parsely_page[:post_id]    = "123456"
parsely_page[:pub_date]   = pub_date.utc.strftime("%FT%TZ")
parsely_page[:section]    = get_clean_parsely_page_value(section)
parsely_page[:author]     = get_clean_parsely_page_value(author)
parsely_page[:tags]       = tags

# JSON.generate escapes double quotes as \". Single quotes are replaced after
# serialization so they cannot end the single-quoted content attribute.
content = JSON.generate(parsely_page).gsub("'") { "\\u0027" }
output = "<meta name='parsely-page' content='#{content}' />"