Building an AMP Proxy for Blogger
We had a customer with a unique problem: they were using Blogger and needed an AMP version of their website to stay competitive.
I was recently fortunate to dive deep into a technical project that was unique in a few ways. It would deliver web pages in AMP format, a recent specification for fast delivery of web content to mobile browsers. AMP is relatively new and there are not many implementations in the wild yet. The data source would be of an equally rare sort: Google's Blogger platform.
The customer for whom we were building this project is a niche-focused news website that has been on Blogger for a decade. They are easily one of the top traffic sites on the service, with 50,000+ posts getting between 5 and 6 million page views a month. While their content technically consists of Blogger "posts", functionally these are much more like standard news articles: think specialty journalism. As most of their competition is using AMP to deliver mobile results, they too needed to implement AMP to stay relevant in the growing mobile space.
This project has novel endpoints: a new mobile target environment, AMP, and Blogger as a data source.
As our end technologies were Blogger and AMP, these were our immovable objects. Our job was to find the means of bringing them together. It was going to have to be some new component in the middle, which gave us some flexibility to decide the how and what.
Blogger is one of the original blog hosting services available to the public. It was built in 1999 and bought by Google in 2003.
It has a rudimentary templating system that is edited in an HTML text field(!), with no FTP, SFTP or pull methods for uploading new templates. Even modifying templates requires a lot of copy and paste if you want to use a proper text editor. If you mis-copy or somehow delete a character in the text field, it is a path to endless frustration. Source control using a copy/paste input area is likewise a horrorshow.
I may sound like I am down on Blogger, but that is not true. For a SaaS blogging system that is somewhat templatable, easy for neophytes to use, and integrated with Google, it is a wonderful product. You cannot beat the price for that functionality: free. It only gets sticky when you start looking for ways to extend the product. At that point it becomes a more frustrating experience. However, for almost all of its users' needs the built-in capabilities are sufficient.
Without provisions for any proper extensibility, any idea of adding AMP support directly on the Blogger site was out immediately. Our solution would have to live outside of the Blogger platform. Fortuitously there is a REST API available which has enough access to our data to get us going.
Accelerated Mobile Pages (AMP) is the latest in a 20-year history of using the web to deliver content to mobile devices. I have been developing in the mobile space since 1999, when my employer at the time was pushing early and often into mobile solutions. This project is not my first at building a web proxy for a mobile three-letter acronym, just my first this decade!
Mobile development for web pages is about getting content delivered into specially constrained environments: content consumable by a browser on a mobile device. (Nothing in this post addresses actual applications written for iOS or Android; that is its own area of concern.) As time went on, the browsing environments became more aligned with standard web browsers. Once media queries became available, responsive design became the technique of choice for mobile sites. Same site, different look.
AMP was created as part of the Google obsession with the three-second rule: mobile users abandon sites that take longer than three seconds to load. The current technique for mobile sites, responsive design, has encouraged web sites to use a single site for both desktop and mobile. Desktop assets that mobile devices never use are slowing down loading times for the mobile experience. AMP is Google's attempt to re-focus mobile development on performance, specifically to create the best (as in fastest) mobile user experience. The project is significant because it is backed by Google and its use will have an impact on how your site is presented in search results. That alone makes AMP worth pursuing.
AMP is a new step in mobile
How AMP works
Essentially, AMP requires that your markup tell the AMP runtime the size of every component on the page. Parameters like height, width and layout, typically optional in a browser, become mandatory at the tag level. The AMP runtime loaded into the browser renders the page using this sizing information and layout guidance to reserve space for all the deferred and slower-loading content, allowing the text of the page to be viewed while images and ads are still loading.
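To make the space-reservation idea concrete, here is a minimal sketch of the arithmetic that becomes possible once width and height are declared up front. The helper name `reservedHeight` is hypothetical and this is not the AMP runtime's code. It only assumes a responsive layout in which an element fills the container width and keeps its declared aspect ratio.

```javascript
// Hypothetical helper: how tall a box to reserve for an element with a
// declared width/height once it is stretched to the container's width.
function reservedHeight(containerWidth, declaredWidth, declaredHeight) {
  if (declaredWidth <= 0 || declaredHeight <= 0) {
    throw new RangeError('width and height must be positive');
  }
  // Only the ratio matters; the actual pixels can arrive later.
  return Math.round(containerWidth * (declaredHeight / declaredWidth));
}
```

Because the box is known before the image bytes arrive, the text around it does not jump when the image finally loads.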
There are lots of resources available to look deeper into how AMP works.
For the component in the middle we went with Node, which fit the project's needs well:
- Efficiently handling numerous requests that could each kick off multiple calls to fetch external resources before resolving
- Working with REST JSON APIs
- Availability of libraries to handle things like route-based pattern matching and DOM manipulation
- The ability to write files asynchronously
Other support technologies used in this project included Handlebars, Moment, SuperAgent and Sass. Node excels at having a wide range of libraries available for free use. So much of the Node developer base is web-focused that you will find a tremendous number of support libraries for almost any web-related task (frustratingly excepting Blogger).
Especially helpful was Nock, a mocking library I was able to use to mock up all the Blogger API calls. This made testing quick and therefore an integral part of the development process. Per usual, it took some labor to get testing running well, but it paid back again and again as the project developed. The Blogger nock mocks are interesting enough to release on their own, but there is no discernible demand for generic Blogger API mocking. Like none. I don't see developers using Blogger much for anything, so, as with this testing code, there are very few reusable Blogger development tools, or even much useful discussion for that matter. We were on our own for most of the Blogger aspect of this project. If you are developing anything serious for Blogger, be prepared to do a lot of independent work.
The proxy server
As Blogger is a closed system, anything we did would have to live outside of the Blogger hosting environment. The approach we chose was to build a distinct proxy server for the Blogger site that would mirror the existing structure of the site URLs. Take any URL on the Blogger site, switch the protocol to https and replace "www" with "amp", and you get the AMP version of whatever content was at that location. This approach is not uncommon; it is how Reddit does it, for example.
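The URL convention is simple enough to sketch in a few lines. This is an illustration only: the function name `toAmpUrl` is hypothetical, and it assumes the site lives on a `www.` hostname as described above.

```javascript
// Map a www URL on the Blogger site to its counterpart on the AMP proxy:
// force https and swap the leading "www." for "amp.".
function toAmpUrl(wwwUrl) {
  const url = new URL(wwwUrl);
  url.protocol = 'https:';
  url.hostname = url.hostname.replace(/^www\./, 'amp.');
  return url.href;
}
```

The path is left untouched, which is the whole point: the proxy mirrors the site's structure exactly.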
At minimum we were going to have to figure out how to translate all the possible URLs into Blogger API calls, then how to transform the API results into AMP, and then how best to deliver them to the browser. First, which URLs should we expect to serve?
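As a rough sketch of the first step, the Blogger API v3 can look up a post by its site path (`posts/bypath`) and list posts by label (the `labels` query parameter). The helper names, the `blogId` value and the API key below are placeholders, and this is not necessarily how the project built its requests.

```javascript
const API_ROOT = 'https://www.googleapis.com/blogger/v3';

// Build the request URL for looking up a single post by its site path.
function postByPathUrl(blogId, postPath, apiKey) {
  const url = new URL(`${API_ROOT}/blogs/${encodeURIComponent(blogId)}/posts/bypath`);
  url.searchParams.set('path', postPath);
  url.searchParams.set('key', apiKey);
  return url.href;
}

// Build the request URL for listing the posts that carry a given label.
function postsByLabelUrl(blogId, label, apiKey) {
  const url = new URL(`${API_ROOT}/blogs/${encodeURIComponent(blogId)}/posts`);
  url.searchParams.set('labels', label);
  url.searchParams.set('key', apiKey);
  return url.href;
}
```

Since the AMP site mirrors the www paths, the path of the incoming request can be passed straight through as `postPath`.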
Blogger only has 4 paths we need to worry about for this application:
|/||The homepage|
|/p/[slug].html||A web page|
|/[year]/[month]/[slug].html||A blog post|
|/search/label/[label]||A label-based listing|
If we can handle these paths, that will be just about everything our AMP user will need to see. The homepage will be a custom page focusing on the latest and trending posts. The other pages will generally reflect the content of their web counterparts, formatted for AMP of course.
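Route matching for these paths can be sketched with a small table of regular expressions. The type names and the `classifyPath` function are hypothetical, and the patterns assume four-digit years and two-digit months as seen in the site's post URLs. A routing library would do the same job.

```javascript
// One entry per kind of path the proxy has to serve.
const ROUTES = [
  { type: 'home', pattern: /^\/$/, keys: [] },
  { type: 'page', pattern: /^\/p\/([^/]+)\.html$/, keys: ['slug'] },
  { type: 'post', pattern: /^\/(\d{4})\/(\d{2})\/([^/]+)\.html$/, keys: ['year', 'month', 'slug'] },
  { type: 'label', pattern: /^\/search\/label\/([^/]+)\/?$/, keys: ['label'] },
];

// Return the route type and its named parameters, or null for unknown paths.
function classifyPath(pathname) {
  for (const { type, pattern, keys } of ROUTES) {
    const match = pattern.exec(pathname);
    if (match) {
      const params = {};
      keys.forEach((key, i) => { params[key] = match[i + 1]; });
      return { type, params };
    }
  }
  return null;
}
```

Anything that returns `null` can be answered with a 404 without ever touching the Blogger API.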
Static site generator!
Looking at Blogger URL structures, they all look like standard file locations. This was my first clue that we could make this setup extremely efficient. We could replicate the exact Blogger URL structure with our static AMP files served from a standard web server. This is the concept that informs all the architecture. We would then be serving much of our content as a static website: with all the inherent benefits of a static site.
Where http://www.site.com/2017/01/some-post-title.html may not be an actual path on Blogger (closed system, so who knows), on our AMP proxy it can be a real file, an AMP file.
https://amp.site.com/2017/01/some-post-title.html could be a static file, located at [webroot]/2017/01/some-post-title.html. Our AMP proxy would be a dynamically built static collection of AMP files representing the website.
A key to the whole setup is running a static-optimized web server in front of the application, in this case NGINX. When you consider the content, almost all pages need to be dynamically rendered into AMP only one time, after which they can be served statically from the front-facing web server forever. There are a few exceptions: any page listing content should be updated often as the site is constantly churning, the homepage most of all. However, the 50,000+ standard posts, except for the most recent, will likely never be updated again. That means we do the heavy lifting of AMP translation once and never need to worry about that page again for the lifetime of the server.
Using a typical try_files / proxy_pass NGINX setup, the webserver can be generally ignorant of the application's work, knowing only to forward requests to the application if it cannot find the AMP file in its webroot.
A neat trick for mapping static site URL structures is using an index file for directories. Sometimes the dynamically parsed URL does not include an HTML filename. You often see this in use for a homepage: the path "/" will result in the "index.html" file located at the webroot being sent. Any URL without a filename can be created statically as a directory, so https://amp.site.com/foo/bar can be delivered through some smart try_files NGINX config and an index.html file at [webroot]/foo/bar/index.html.
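On the application side, the same mapping can be sketched as a function from URL path to file location. The name `staticFileFor` and the example webroot are assumptions for illustration. The normalization step is a precaution I have added so that a path containing ".." cannot escape the webroot.

```javascript
const path = require('path');

// Map a request path to the file the web server would look for.
function staticFileFor(webroot, urlPath) {
  // Normalizing against "/" collapses any ".." so the result stays in webroot.
  const safe = path.posix.normalize('/' + urlPath);
  return safe.endsWith('.html')
    ? path.posix.join(webroot, safe)
    : path.posix.join(webroot, safe, 'index.html'); // directory-style URL
}
```

For example, `staticFileFor('/var/www/amp', '/foo/bar')` points at `/var/www/amp/foo/bar/index.html`, matching the index-file trick described above.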
Managing static files
The Node application's job will be to populate the static locations of the files as they are requested. It will know the file corresponding to the path request is not present because if it were the request would have been handled directly by the webserver. After generating the AMP page it will write the result to the proper location in the webroot.
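A minimal sketch of that last step, using Node's promise-based file API. The function name `writeAmpPage` is hypothetical, and the write-then-rename step is my own assumption rather than something the project is known to do. It is there so the web server cannot pick up a half-written file.

```javascript
const fs = require('fs/promises');
const path = require('path');

// Write a rendered AMP page into the webroot, creating directories as needed.
async function writeAmpPage(filePath, html) {
  await fs.mkdir(path.dirname(filePath), { recursive: true });
  // Write under a temporary name first, then rename into place.
  const tmpPath = `${filePath}.${process.pid}.tmp`;
  await fs.writeFile(tmpPath, html, 'utf8');
  await fs.rename(tmpPath, filePath);
  return filePath;
}
```

The response to the current request can be sent without waiting for the write to finish. The file only matters for the next visitor.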
The application will treat these generated resources as a cache would, taking responsibility for their lifecycle and invalidating pages after a set duration. The nature of the generated page will dictate how long it lives. Invalidation is as simple as removing the file from the web server's webroot directory. Then, should that URL be requested again, it will be re-generated and re-cached.
Generally the homepage will be cached for a short time, a few minutes, while listings, posts and pages each last progressively longer.
Now that we know what URLs will be AMPed and how they will be delivered and stored, the only thing left is to AMP them.
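The lifetime rule can be sketched as a lookup table plus an expiry check. Every duration below is a made-up placeholder. Only the ordering (homepage shortest, then listings, posts and pages) comes from the description above.

```javascript
const MINUTE = 60 * 1000;

// Hypothetical lifetimes per page type, in milliseconds.
const TTL_BY_TYPE = {
  home: 5 * MINUTE,
  label: 30 * MINUTE,
  post: 24 * 60 * MINUTE,
  page: 7 * 24 * 60 * MINUTE,
};

// True when a file written at `writtenAtMs` has outlived its type's lifetime.
function isExpired(type, writtenAtMs, nowMs = Date.now()) {
  const ttl = TTL_BY_TYPE[type];
  if (ttl === undefined) throw new Error(`unknown page type: ${type}`);
  return nowMs - writtenAtMs >= ttl;
}
```

A periodic sweep could compare each file's modification time against this check and delete the expired ones, leaving regeneration to the next request.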
For templating I eventually settled on Handlebars.js. My initial gut instinct was to use Mustache.js largely because AMP has a mustache component and should we ever use it having the same templating language through the platform would be a nice thing. Typically I tend towards Pug all things being equal. However, after a little mustaching it became clear that having some logic and helpers in the templates was going to ease development quite a bit. So a little more enhanced, generally Mustache-ish templating language it was.
Actually AMPing the content was much much trickier. Having to modify legacy content into the AMP spec is one of the pain points introduced by the move to a more restrictive mobile environment.
If you do web work long enough you encounter many system migrations: the movement of a site's dynamically served content from one system to a new database, content management framework or whatever. You get used to having to massage content a little along the way to fit pre-conceived notions of what should or should not be allowed in the content at its most basic level. This content conversion ends up being a lot of that.
HTML is one of those things that sneaks into content often. It is just a lot easier to build a web CMS that allows HTML in the content than to come up with some intermediate format that keeps the text "pure" but retains the intentions present in HTML. You just end up eventually allowing HTML. Normally this is not a huge issue, except for AMP.
AMP allows a subset of HTML and only that subset. Any other HTML elements or unspecified markup will cause a validation error, and your AMP page will not officially be an AMP page: no Google for you! In this instance a decade of copy/paste and the looser requirements of HTML in practice allowed a lot of extra markup to sneak into the site content.
Turns out, when you mix in a decade, multiple editors, copy/paste and loose input validation you get a lot of weird markup mixed into your content. Easily the most common and frustrating of these is embedded Microsoft Word markup. I've been dealing with this specific problem for years and never seem to be able to scrub it all reliably. Even worse, during this project I ran into odd bits of embedded XML here and there that looked like early attempts at schema.org-type markup. Needless to say there had to be a lot of content munging.
The text munging toolkit
I used two indispensable tools in HTML transformation to tame the site content: a DOM parsing and manipulation library (cheerio.js in this case) and the ever-amazing power of regular expressions. We load the source content via cheerio as a DOM, extract the offending elements, serialize the result back to HTML and use regex to handle any of the less clean manipulations. Even with these tools in play the process is not pretty, but it does get done. This was the most detailed and time-consuming part of the process.
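To give a flavor of the regex half of that toolkit, here is a small sketch that strips a few common kinds of Microsoft Word debris. The function name and the exact patterns are illustrative assumptions, not the project's rules, and the DOM-level cleanup done with cheerio is not shown.

```javascript
// Remove some typical Word paste residue from an HTML fragment.
function scrubWordMarkup(html) {
  return html
    // Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    .replace(/<!--\[if[\s\S]*?<!\[endif\]-->/gi, '')
    // Namespaced Office tags such as <o:p> and </o:p>
    .replace(/<\/?\w+:\w+[^>]*>/g, '')
    // class="MsoNormal" and friends
    .replace(/\s+class="?Mso\w*"?/gi, '')
    // style attributes that carry mso- declarations
    .replace(/\s+style="[^"]*mso-[^"]*"/gi, '');
}
```

Patterns like these only catch the predictable cases. The long tail still needs eyeballing, which is why this step ate so much time.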
Images represented a special difficulty, both for how Blogger handles images and for how AMP expects them. For reasons previously outlined the <img> tag is not allowed. This site in particular is very image-oriented, and the images are embedded as part of the HTML. It's not too difficult, using our existing munging methods, to extract all the image tags from the content and replace them with <amp-img> tags. This is where things started to get novel and really tricky.
<amp-img> requires the explicit height and width of the image to be attributes on the tag "which is used by the AMP runtime to determine the aspect ratio without fetching the image." This is a mostly fair approach: they are your images, you should know their dimensions, so spare the runtime the effort of loading and re-rendering if you already know the image sizes. This shifts the burden on to the site itself to know the image sizes, but you have the image files somewhere nearby right? Just read them and get the sizing. Easy! Enter Blogger...
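The tag swap itself can be sketched as follows, assuming the dimensions have already been collected into a `Map` keyed by image URL. The function name, the choice of `layout="responsive"` and the decision to drop images of unknown size are all assumptions made for this illustration.

```javascript
// Replace <img> tags with <amp-img> tags using known dimensions.
// `sizes` is a Map from image URL to { width, height }.
function toAmpImg(html, sizes) {
  return html.replace(/<img\b[^>]*\bsrc="([^"]+)"[^>]*>/gi, (tag, src) => {
    const size = sizes.get(src);
    // Assumption: an image we cannot size is dropped rather than left invalid.
    if (!size) return '';
    return `<amp-img src="${src}" width="${size.width}" height="${size.height}" layout="responsive"></amp-img>`;
  });
}
```

A fuller version would also carry over the `alt` text. The hard part, as it turns out, is filling that `sizes` map.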
In Blogger the content's embedded image tags are like most image tags embedded on websites: no guarantee of height and width information, mostly just the image's source URL. Unlike a normal site, we do not have ready access to our images; they are not stored anywhere we can access them directly. Blogger stores its images in the Googleplex Cloud Whateverspace, usually referenced via some https://[number].[two letters].blogspot.com domain like so:
or via the googleusercontent.com domain:
Now I was starting to worry: was I going to have to download every image (at least partially) to get its raw dimensions? Some of these images are huge, so this could be a real performance killer. It's not like we are reading a few files on disk; we would have to download megabytes of images to our server just to create a text file of a few KB! Some of these articles had a dozen or more images embedded in them! Downloading the images would have represented a disproportionate burden on our infrastructure. Even worse, it would have made the initial creation and serving of the AMP page slow, and slow is the worst thing you can be in AMP.
Blogger's weird image URLs
I started looking for ways around this problem by seeing if there were some way to exploit an interesting property of the Google image host.
A Google-hosted image has a novel URL with a special trick: you can alter its sizing and properties by modifying one segment of the URL.
That s320 up there in the URL is shorthand telling the server to give me an image scaled to 320px. You can also ask for a cropped square image by adding -c, so s320-c would be a 320px square. It's not just for cropping and re-sizing: want to download the image? Use -d. You can combine these as well, and it is worth looking into if you are serious about how you want to show your Google Clouded images.
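Swapping that segment is a one-line string operation. In this sketch the function name and the example URL are made up, and the pattern assumes the size segment sits directly before the filename, as in the s320 case described above.

```javascript
// Replace the size segment (e.g. "s320" or "s320-c") that sits just
// before the filename with a different option string.
function withSizeOption(imageUrl, option) {
  return imageUrl.replace(/\/s\d+(?:-[a-z]+)*\/(?=[^/]*$)/, `/${option}/`);
}

// Hypothetical URL, for illustration only.
const example = 'https://1.bp.blogspot.com/-AAA/BBB/CCC/DDD/s320/photo.jpg';
const squareThumb = withSizeOption(example, 's320-c');
```

URLs without a recognizable size segment come back unchanged, so the function is safe to run over every image in a post.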
This is handy, and will be handy for serving images but it still does not get me the height and width of the original image.
I started digging more into the undocumented world of Blogger and found another control term for the URL: -g. This was an odd one. Unlike the other control terms, -g does not give you an image file. Instead it gives you an XML file: real, actual, gloriously parsable text! Blogger, man, sometimes you are weird but great!
<TileInfo tile_width="512" tile_height="512" full_pyramid_depth="2" origin="TOP_LEFT" base_url="http://lh3.googleusercontent.com/-jrlpdLzy4_0/ThzZ62NS6rI/AAAAAAAACgQ/qS3Z-oe7Ho8/%255BUNSET%255D" tiler_version_number="2" image_width="604" image_height="453">
  <pyramid_level num_tiles_x="1" num_tiles_y="1" inverse_scale="2" empty_pels_x="210" empty_pels_y="286"/>
  <pyramid_level num_tiles_x="2" num_tiles_y="1" inverse_scale="1" empty_pels_x="420" empty_pels_y="59"/>
</TileInfo>
TileInfo? This is not a tile image, looks like we may have stumbled upon some Google Maps requirement that snuck into the Google Cloud Whatever image serving platform. Either way this solves our problem in a nice way, rather than download the image we can get this XML file with the image dimensions and parse them out. I'll take a half-K of text file over a MB+ of image data any day.
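Since only two attributes matter, pulling them out does not even need an XML parser. This is a minimal sketch with a hypothetical function name. It relies only on the `image_width` and `image_height` attributes visible in the response above.

```javascript
// Extract the original image dimensions from a TileInfo XML response.
// Returns null when the expected attributes are missing.
function parseTileInfo(xml) {
  const width = /\bimage_width="(\d+)"/.exec(xml);
  const height = /\bimage_height="(\d+)"/.exec(xml);
  if (!width || !height) return null;
  return { width: Number(width[1]), height: Number(height[1]) };
}
```

Run against the response shown above, this yields a width of 604 and a height of 453.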
This still requires a separate request for each image we are going to use in our AMP files. While that is a considerable burden, it is not as much as downloading the image files. We still need to obtain the sizing for every image used in a post before we can render the AMP page for delivery. That will be a lot of requesting, waiting, more requesting, more waiting and then rendering. Fortuitously we chose a development environment which lends itself very well to handling the resolution of multiple asynchronous requests simultaneously. Hooray Node!
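The fan-out can be sketched with `Promise.all`. The lookup itself is passed in as `fetchSize`, a placeholder for whatever function requests and parses the sizing XML for one image, so nothing here depends on the exact URL format. Treating a failed lookup as `null` is my own assumption.

```javascript
// Resolve the dimensions of many images concurrently.
// `fetchSize(src)` is a caller-supplied async function returning { width, height }.
async function resolveImageSizes(srcs, fetchSize) {
  const unique = [...new Set(srcs)]; // do not ask twice for the same image
  const entries = await Promise.all(
    unique.map(async (src) => {
      try {
        return [src, await fetchSize(src)];
      } catch (err) {
        return [src, null]; // one failed lookup should not sink the whole page
      }
    })
  );
  return new Map(entries);
}
```

All lookups are in flight at once, so the wait is roughly that of the slowest single request rather than the sum of them all.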
Encircle the user?
Do we keep the user linked to our AMP site or link them out to the www. site? When an AMP user clicks on a URL on an AMP page, should the link go to the full www. URL, or should it keep the user on AMP by linking to the AMP version of that content? There is surprisingly little debate about this available and I could see it going either way. One option is to use AMP for the initial page reached via a search engine, then move the user over to your regular responsive site for subsequent links.
We chose to keep the user on AMP, rewriting all the embedded content URLs to reference the AMP site. If they navigated to AMP they were already in an environment where AMP is supported and works well. It may be jarring to go from AMP to a responsive mobile site. That being said, every page features multiple links to take the user to the regular responsive website. While AMP becomes the default, any user who wishes may opt out at any time after their first AMP page.
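Rewriting the embedded links can be sketched as a targeted replacement on `href` attributes. The function name and hostnames are placeholders, and a DOM-based rewrite would work just as well.

```javascript
// Point links to the www site at the AMP site instead, leaving
// links to other hosts untouched.
function rewriteLinks(html, wwwHost, ampHost) {
  // Escape regex metacharacters in the hostname (the dots, mainly).
  const escaped = wwwHost.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(`(href=")https?://${escaped}(?=[/"])`, 'gi');
  return html.replace(pattern, `$1https://${ampHost}`);
}
```

This runs over the post content only. The opt-out links to the regular site would live in the page templates, so they are not affected.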
I don't think the debate on this topic is over.
You can access the Blogger API via the Google API Developers Console. By default you get 10,000 requests a day to the API. This is probably enough for most Blogger sites, but not for the top tier. AMP services do not see high utilization at this time, and the architecture keeps requests to a minimum, but with five times as many posts as the default daily allotment of requests, our limit can be reached quickly.
Who is by far the largest consumer of AMP pages? The Google crawler, of course. I had thought asking for an API bump would be an automatic process. There is a form right there in the API console. Turns out for Blogger it is not. While there is the standard form to fill out, submitting the form resulted in a message along the lines of "This means of request does not work with Blogger."
Google to the left of us, Google to the right of us! One causing our trouble, one preventing us from alleviating it. Frustrating! Blogger never seems to be on the same tier as the rest of the Google services. Eventually we were able to use a personal contact at Blogger to get the API bump, but it took some time. You literally need to contact a single person at Blogger to get this performed. This reiterates my point: Blogger is a great platform for some things, but it never has been a developer's platform. Things that are normally very simple in other environments are like skating uphill.
The AMP specification
You can't view the AMP specification as a finished product. It is rolling along, already issuing deprecation errors for some currently valid AMP constructs. Normally this is not much of a thing, as browsers have historically been forgiving with what they receive. Not so with AMP, since pages must validate to be useful in Google's AMP ecosystem. In the early life of this standard I would expect to be tweaking the output of the AMP pages constantly for a while.
By the huge for the large
AMP is going to be tough on the smaller content providers. They will essentially have to create a separate website for mobile. It is complex enough that you will need to know what you are doing as a developer to create it as well. If you are using a ubiquitous CMS, AMP may come easy. If you are not...
AMP was designed in partnership with the large publishers who were most likely to use it. Big content publishers have something in common: in-house web tech expertise. Implementing AMP on unique systems may be tough, but it is well within their capabilities. If a small publisher is on a unique or non-extensible platform, they are going to have to do some real web development or miss out.
AMP sites may "officially" not get bumped in rankings (yet), but their placement in an AMP carousel is a huge bonus. AMP also brings faster loading times and other user-friendliness signals that we know do impact mobile search results. AMP will turn into a "keep up with the tech to keep up in the results" type of situation, which will be a lot harder for some publishers. If it becomes a long-term adopted environment it could also end up being another nail in the coffin for smaller publishers.
How did it go?
The end result has been good thus far, and scaling has been a non-issue. Load is only a factor during the initial start of a server, while the static cache is getting filled. Once it is filled we are able to run the service on a very light setup with almost no server impact at all. Because of all the caching already performed by Google (and us) I don't foresee any real load problems in the future should AMP really take off.
The only thing I would like to see more of is transparency from Google on how AMP pages are considered for placement in search. I know this is whistling in the wind, asking Google to be transparent on any aspect of search results. At the same time, they have been pushing AMP and it would be very helpful to know why some AMP articles show up in the carousel and others in a listing, and why some have images and some do not. Simple things that go a little past the validation, schema, and error tools.
I am very happy we went with Node early on. Every tricky spot was handled well by the very nature of Node. Although I have been moving over to Go for personal projects, Node is still king for a lot of web serving tasks.