Parse.ly Crawler¶
What does the Crawler do?¶
The Parse.ly Crawler is a backend system that makes HTTP requests to your web server in order to download content and metadata from your site pages. In its most typical use, news sites will have their article pages crawled by the crawler so that they may appear as “Posts” inside Parse.ly Dash, and so that other pieces of metadata such as author, section, publication date, image, and topics are automatically extracted.
Technical details about the Crawler¶
Parse.ly’s Crawler will send out crawl requests to your pages as the pageviews stream in from those URLs. It will crawl your content using the user-agent: Mozilla/5.0 (compatible; parse.ly scraper/0.14; +http://parsely.com). Note that the version field can change over time.
Here is the current set of IP addresses for our Crawler worker machines: 198.61.208.242, 198.61.209.12, 198.61.209.16.
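If you want to recognize Crawler traffic in your logs or allow it through a firewall, a check along the following lines may help. This is a minimal sketch, not an official Parse.ly tool: the function name `is_parsely_crawler` is hypothetical, the IP addresses are copied from the list above and may change, and the pattern deliberately accepts any version number because the version field can change over time.

```python
import ipaddress
import re

# Addresses copied from the list above; treat them as a snapshot, not a guarantee.
CRAWLER_IPS = {
    ipaddress.ip_address("198.61.208.242"),
    ipaddress.ip_address("198.61.209.12"),
    ipaddress.ip_address("198.61.209.16"),
}

# Accept any dotted version number, since the version field can change.
CRAWLER_UA = re.compile(r"parse\.ly scraper/\d+(?:\.\d+)*")


def is_parsely_crawler(user_agent, remote_ip):
    """Return True if both the user-agent and the source IP match the Crawler."""
    try:
        ip = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False  # not a parseable IP address
    return bool(CRAWLER_UA.search(user_agent or "")) and ip in CRAWLER_IPS
```

Checking both values together is a design choice: a user-agent string alone is easy to spoof, while the IP list alone says nothing about which software sent the request.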
You may be worried about additional load that this crawler might put on your server. This is generally not a concern. The Crawler takes the following steps to be polite to the servers it crawls:
- It limits the number of concurrent requests it opens to your server to ensure it doesn’t affect your concurrency throughput.
- It caches articles it has already seen.
- It introduces a small delay between HTTP requests to ensure the load is spread out.
- It does not pro-actively spider your site; instead, pages are crawled only as they are visited by users. This way, archived articles that are not visited are not needlessly crawled.
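To make the politeness steps above concrete, here is a purely illustrative sketch of how a crawler could combine them. It is not Parse.ly's implementation: the class `PoliteFetcher`, its parameters, and the `fetch` callable are all hypothetical names chosen for this example.

```python
import threading
import time


class PoliteFetcher:
    """Illustrative only: caps concurrency, caches seen URLs, spaces out requests."""

    def __init__(self, fetch, max_concurrent=2, delay_seconds=1.0):
        self._fetch = fetch  # hypothetical callable: fetch(url) -> content
        self._slots = threading.Semaphore(max_concurrent)
        self._delay = delay_seconds
        self._cache = {}
        self._lock = threading.Lock()

    def on_pageview(self, url):
        """Crawl a URL only when a visitor's pageview reports it (no spidering)."""
        with self._lock:
            if url in self._cache:  # already seen: no new request is made
                return self._cache[url]
        with self._slots:  # limit the number of concurrent requests
            content = self._fetch(url)
            time.sleep(self._delay)  # small delay before the slot is released
        with self._lock:
            self._cache[url] = content
        return content
```

Because crawling is triggered from `on_pageview`, a URL that no visitor requests is never fetched, which mirrors the last point in the list.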
In roughly the first month after integration, you will see more crawling activity than in later months, because Parse.ly will be loading older articles that still receive occasional visits. This activity will decrease over time.
Finally, we must emphasize that crawling is an entirely back-end operation. That is, crawling in no way affects the pageload performance of your visitors coming to your site. It is done entirely asynchronously by Parse.ly’s servers “after the fact”.
Crawling through pay walls¶
Some Parse.ly customers do not have all of their content accessible to the public due to a “pay wall”. In these cases, you must coordinate with our support team to arrange for a special login account, only accessible to our Crawler, which can gain access to your pay wall content. The credentials for this account will only be used by the Crawler and will not be shared with anyone else.
Providing Page Attributes Manually¶
Our typical process is to write a custom crawler for each of our customers. However, you may also provide page attributes manually if they are available in your CMS, which may result in more reliable output from our crawler.
You can do this via the parsely-page META tag on your article pages, which can add arbitrary metadata to existing posts.
Here is the integration approach for parsely-page:
- Add a META tag to your page with the name “parsely-page” and a content field which is a serialized (and, as needed, quote-escaped) JSON string. For example:
<meta name='parsely-page'
      content='{"title": "Obama gives speech on Iraq",
                "link": "http://nytimes.com/2152/obama-iraq",
                "image_url": "http://nytimes.com/img/2152.jpg",
                "type": "post",
                "post_id": "2152",
                "pub_date": "2011-05-25T13:00:00Z",
                "section": "Politics",
                "author": "Josh Jones",
                "tags": ["election 2012","editorials","obama barack","romney mitt"]
               }'>
Note
Please do not copy-paste the above example directly into your page templates. You must review the Technical Caveats below to ensure that line breaks, string escaping, and literal values are chosen correctly.
- Notify the Parse.ly team that the META tag has been integrated. We can confirm that the parsely-page attribute is accessible on all article pages and use this to power our metadata extraction.
Here is a quick guide to the fields above:
| Field | Description |
| --- | --- |
| title | Post or page title (article headline) |
| link | Canonical URL for post/page |
| image_url | URL for image associated with post/page |
| type | One of “post”, “frontpage”, “sectionpage” |
| post_id | String that uniquely identifies this post |
| pub_date | Publication date, as ISO 8601 UTC timezone string |
| section | Section of the site (e.g. A+E, Politics) |
| author | Author who wrote the post |
| tags | An array of Tags associated with this post |
Technical Caveats¶
Remove all line breaks. In this example, the line breaks are included only for display purposes. META tags typically should not contain line breaks in the content attribute to ensure the highest level of browser compatibility. Typically this can be achieved by replacing all instances of the newline character "\n" with the empty string "", e.g. value.replace("\n", "").
Escape all single and double quotes in JSON item values. All single quotes should be replaced with the JSON unicode equivalent \u0027 and double quotes should be escaped with \".
Values in parsely-page will appear literally inside Parse.ly Dash. String values supplied here, specifically title, author, and section, will display in Dash exactly as they appear in the tag (after HTML decoding as described above). As a result, make sure to use proper capitalization and specify section names as you expect them to appear inside Dash.
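Before notifying the Parse.ly team, it can be useful to check a rendered page against the caveats above. The following is a minimal sketch using only the Python standard library; the function name `check_parsely_page` is hypothetical, and the checks cover only rules stated on this page (no line breaks, parseable JSON, the allowed `type` values, a string `post_id`, a UTC `pub_date`, and an array of `tags`). Fields that are absent are not reported, since this page does not say which fields are mandatory.

```python
import json
import re
from html.parser import HTMLParser

VALID_TYPES = {"post", "frontpage", "sectionpage"}
UTC_TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")


class _ParselyPageFinder(HTMLParser):
    """Collects the content attribute of <meta name="parsely-page">."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "parsely-page":
            self.content = attrs.get("content")


def check_parsely_page(html):
    """Return a list of problems found in the page's parsely-page tag."""
    finder = _ParselyPageFinder()
    finder.feed(html)
    if finder.content is None:
        return ["no parsely-page META tag found"]
    problems = []
    if "\n" in finder.content or "\r" in finder.content:
        problems.append("content attribute contains line breaks")
    try:
        data = json.loads(finder.content)
    except ValueError as exc:
        return problems + ["content is not valid JSON: %s" % exc]
    if not isinstance(data, dict):
        return problems + ["content is not a JSON object"]
    if "type" in data and data["type"] not in VALID_TYPES:
        problems.append("type must be one of post, frontpage, sectionpage")
    if "post_id" in data and not isinstance(data["post_id"], str):
        problems.append("post_id must be a string")
    pub_date = data.get("pub_date")
    if "pub_date" in data and not (
        isinstance(pub_date, str) and UTC_TIMESTAMP.match(pub_date)
    ):
        problems.append("pub_date must look like 2011-05-25T13:00:00Z")
    if "tags" in data and not isinstance(data["tags"], list):
        problems.append("tags must be an array")
    return problems
```

An empty list means none of these checks failed; it does not guarantee that the values will display as intended in Dash.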
parsely-page Examples in Different Languages and Content Management Systems¶
Implementing <meta name="parsely-page" ... /> greatly helps with the accuracy of data you see in Dash, but we want to make it as easy as possible for you to implement this meta tag no matter what content management system (CMS) or programming language you’re using.
If you’d like to see your CMS supported then drop us a line and let us know.
Let’s take an example. Here is a theoretical article published on the Awesome-Publisher.com site:
| Variable | Value |
| --- | --- |
| Title | Man known to lie in road is run over and killed \n wife is “Stunned” |
| Link | http://www.awesome-publisher.com/123456/man-known-to-lie-in-road-is-run-over-and-killed |
| Image URL | http://images.awesome-publisher.com/123456123456.png |
| Type | post |
| Post ID | 123456 |
| Pub Date | 2012-01-01 at 11:34:02AM EST |
| Section | News |
| Author | John Doe |
| Tags | news, traffic, local |
Notice that the title contains characters that can be troublesome in JSON documents, namely a newline (\n) and double quotes ("). In addition, the publication date is given in Eastern Standard Time (EST), whereas parsely-page requires UTC.
How would we output the parsely-page meta tag given the example above? Let’s take a look at a few examples in both popular CMSs as well as programming languages.
WordPress¶
Getting Parse.ly Dash working on your WordPress site/blog is easy with the wp-parsely WordPress plugin. Just follow the installation instructions and you’ll be ready to go in no time.
We’re always updating this plugin so for current users it’s a good idea to keep an eye out for updates within your WordPress settings screen.
PHP¶
<?php
$title = "Man known to lie in road is run over and killed\nwife is \"Stunned\"";
$link = "http://www.awesome-publisher.com/123456/man-known-to-lie-in-road-is...";
$imageURL = "http://images.awesome-publisher.com/123456123456.png";
$section = "News";
$author = "John Doe";
$tags = array("news", "traffic", "local");
// The publication time is given in EST (UTC-05:00).
$pubDate = DateTime::createFromFormat("Y-m-d H:i:s P", "2012-01-01 11:34:02 -05:00");

// Remove line breaks, which should not appear in the content attribute.
function getCleanParselyPageValue($val) {
    $val = str_replace("\n", "", $val);
    $val = str_replace("\r", "", $val);
    return $val;
}

$parselyPage = array();
$parselyPage["title"] = getCleanParselyPageValue($title);
$parselyPage["link"] = $link;
$parselyPage["image_url"] = $imageURL;
$parselyPage["type"] = "post";
$parselyPage["post_id"] = "123456";
// gmdate() formats the timestamp in UTC.
$parselyPage["pub_date"] = gmdate("Y-m-d\TH:i:s\Z", $pubDate->getTimestamp());
$parselyPage["section"] = getCleanParselyPageValue($section);
$parselyPage["author"] = getCleanParselyPageValue($author);
$parselyPage["tags"] = $tags;

// JSON_HEX_APOS and JSON_HEX_QUOT encode ' and " as \u0027 and \u0022.
$output = "<meta name='parsely-page' content='" . json_encode($parselyPage, JSON_HEX_APOS | JSON_HEX_QUOT) . "' />";
?>
Python¶
import datetime
import json

import pytz


def get_clean_parsely_page_value(string):
    # Remove line breaks; quotes are escaped on the serialized JSON below.
    return string.replace("\r", "").replace("\n", " ")


title = "Man known to lie in road is run over and killed\nwife is \"Stunned\""
link = "http://www.awesome-publisher.com/123456/man-known-to-lie-in-road-is..."
image_url = "http://images.awesome-publisher.com/123456123456.png"
section = "News"
author = "John Doe"
# Use localize() rather than tzinfo=: passing a pytz zone directly to the
# datetime constructor can yield a historical local-mean-time offset.
eastern = pytz.timezone("US/Eastern")
pub_date = eastern.localize(datetime.datetime(2012, 1, 1, 11, 34, 2))
tags = ["news", "traffic", "local"]

parsely_page = {}
parsely_page["title"] = get_clean_parsely_page_value(title)
parsely_page["link"] = link
parsely_page["image_url"] = image_url
parsely_page["type"] = "post"
parsely_page["post_id"] = "123456"
parsely_page["pub_date"] = pub_date.astimezone(pytz.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
parsely_page["section"] = get_clean_parsely_page_value(section)
parsely_page["author"] = get_clean_parsely_page_value(author)
parsely_page["tags"] = tags

# json.dumps escapes double quotes as \". Single quotes are replaced with
# \u0027 afterwards so they cannot close the single-quoted attribute.
content = json.dumps(parsely_page).replace("'", "\\u0027")
output = "<meta name='parsely-page' content='%s' />" % content
Ruby¶
require 'json'

title = "Man known to lie in road is run over and killed\nwife is \"Stunned\""
link = "http://www.awesome-publisher.com/123456/man-known-to-lie-in-road-is..."
image_url = "http://images.awesome-publisher.com/123456123456.png"
section = "News"
author = "John Doe"
# The publication time is given in EST (UTC-05:00).
pub_date = Time.new(2012, 1, 1, 11, 34, 2, '-05:00')
tags = ["news", "traffic", "local"]

# Remove line breaks; quotes are escaped on the serialized JSON below.
def get_clean_parsely_page_value(val)
  val.gsub(/[\r\n]/, "")
end

parsely_page = {}
parsely_page[:title] = get_clean_parsely_page_value(title)
parsely_page[:link] = link
parsely_page[:image_url] = image_url
parsely_page[:type] = "post"
parsely_page[:post_id] = "123456"
parsely_page[:pub_date] = pub_date.utc.strftime("%FT%TZ")
parsely_page[:section] = get_clean_parsely_page_value(section)
parsely_page[:author] = get_clean_parsely_page_value(author)
parsely_page[:tags] = tags

# JSON.generate escapes double quotes as \". Single quotes are replaced with
# \u0027 afterwards so they cannot close the single-quoted attribute.
content = JSON.generate(parsely_page).gsub("'") { "\\u0027" }
output = "<meta name='parsely-page' content='#{content}' />"
Link Aliases¶
Parse.ly Dash doesn’t just track individual URLs, but actually groups links that refer to the same Post as one. This allows easier, simpler, and more accurate tracking of your content.
Let’s consider a wildly popular article by one of our flagship customers, The Atlantic. In November 2011, they published a cover story about the changing demographics of society, specifically with regard to single women. However, this article didn’t appear as one URL – it appeared at each of the following URLs on their site:
- Main Article URL: http://www.theatlantic.com/magazine/archive/2011/11/all-the-single-ladies/8654/
- Page 2: http://www.theatlantic.com/magazine/archive/2011/11/all-the-single-ladies/8654/2/
- Page 3: http://www.theatlantic.com/magazine/archive/2011/11/all-the-single-ladies/8654/3/
- Page 4: http://www.theatlantic.com/magazine/archive/2011/11/all-the-single-ladies/8654/4/
- Page 5: http://www.theatlantic.com/magazine/archive/2011/11/all-the-single-ladies/8654/5/
- Printable: http://www.theatlantic.com/magazine/print/2011/11/all-the-single-ladies/8654/
- Single Page: http://www.theatlantic.com/magazine/archive/2011/11/all-the-single-ladies/8654/?single_page=true
- Mobile Page: http://m.theatlantic.com/magazine/archive/2011/11/all-the-single-ladies/8654/
That’s 8 unique URLs, but all representing one “Post”, which is this specific article, “All The Single Ladies”, written by Kate Bolick and published on November 2, 2011. With Dash, these unique URLs are grouped together into a single “Post” which represents all traffic – direct, social, search, etc. – for this logical article.
If you are a customer using a custom crawler, our team handles this for your site automatically. If you are providing page attributes manually, you achieve this by tagging each of the above pages with the same parsely-page “link” field. You’ll be able to break out the specific URLs that led to the logical Post within Dash.
It’s true that most traffic for this article typically ends up at the “Main Article URL”, but tracking the post across all its incarnations is important, especially when you consider social media and search channels. For example, a search engine might get a hit on page 3 of the article for certain keywords. Your Twitter audience might choose to tweet the single-page version of the article, or the mobile version, rather than the “main” article URL. In Dash, all of these URLs are consolidated into tracking under the single Post.
Internally, we refer to this feature as “Link Aliases”. Every Post has a “Canonical Link” (the Main Article URL above), but also has any number of link aliases, which Dash considers logically equivalent to the canonical link.
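The grouping described in this section can be pictured with a short sketch. It is only an illustration of the idea, not how Dash is implemented: the function `group_by_canonical_link` and the shape of its input (pairs of visited URL and that page's parsely-page “link” value) are assumptions made for this example.

```python
from collections import Counter, defaultdict


def group_by_canonical_link(pageviews):
    """Group (url, canonical_link) pageview pairs into per-Post URL counts.

    `pageviews` is a hypothetical list of tuples: the URL that was visited and
    the "link" value from that page's parsely-page tag.
    """
    posts = defaultdict(Counter)
    for url, canonical_link in pageviews:
        posts[canonical_link][url] += 1
    return posts


canonical = "http://www.theatlantic.com/magazine/archive/2011/11/all-the-single-ladies/8654/"
pageviews = [
    (canonical, canonical),
    (canonical + "3/", canonical),  # page 3 of the article
    ("http://m.theatlantic.com/magazine/archive/2011/11/all-the-single-ladies/8654/", canonical),
    (canonical, canonical),
]
posts = group_by_canonical_link(pageviews)
total_for_post = sum(posts[canonical].values())  # traffic for the logical Post
```

Each key of `posts` plays the role of a canonical link, and the URLs counted under it are its link aliases, so the per-URL breakdown remains available alongside the combined total.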