On RSS Scalability
September 13th, 2004 • General
Tod Friendly on Kuro5hin says:
RSS is not a good mechanism for getting new content to a large number of people; it’s just too hit and miss.
He says that RSS is doomed to fail because it’s pull rather than push aggregation. Personally, I think the main reason for RSS’s success is that it’s based on something we already have, namely HTTP, which is (mostly) a pull protocol. Anyone can come up with a structured news format, a competing DTD, or even start a whole new trend of XML feed formats; it’s the community that decides to either adopt a format or throw it away, and now that’s hit and miss. But when it comes to finding a real solution, well, it’s just too damn difficult! Sure, HTTP headers can help,
<ttl> might help too, but that only happens in a perfect world, where feed readers obey the standards, where developers aren’t too lazy to implement these features, where we don’t have to know 9 "compatible" RSS formats just to figure out how to parse 2 or 3 news items.
Robert Scoble assumes it’s the hits that matter (if his math is correct), that they’re the cause of RSS being out of control, but the real problem with RSS is that it eats bandwidth, not that it generates a lot of traffic. Many busy sites have survived huge loads and unexpected spikes, even Slashdot, but that came at a price: the cost of bandwidth. It doesn’t matter if I get 1,000,000 hits per month if 60% of them get a 0.5KB response; I can afford that. But when all of them get a full 25KB response, now that’s going to hurt.
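To make that concrete, here’s the back-of-the-envelope arithmetic using the hypothetical numbers above (they’re assumptions for illustration, not measurements):

```python
# Assumed numbers from the paragraph above: 1,000,000 hits per month,
# a full feed response of 25 KB, and a 304 Not Modified of roughly 0.5 KB.
HITS = 1_000_000
FULL_KB = 25
NOT_MODIFIED_KB = 0.5

# Worst case: every single poll downloads the full feed.
naive_kb = HITS * FULL_KB

# With conditional GET: 60% of polls answered with a cheap 304,
# the remaining 40% get the full feed.
conditional_kb = HITS * 0.6 * NOT_MODIFIED_KB + HITS * 0.4 * FULL_KB

print(f"naive:       {naive_kb / 1024**2:.1f} GB/month")
print(f"conditional: {conditional_kb / 1024**2:.1f} GB/month")
```

Roughly 24GB a month versus 10GB, from nothing more than honoring `If-Modified-Since`. The hit count is identical in both cases; only the response size changed.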
RSS is essentially screen scraping in a fancy dress, except that now it has proper tools; it’s not a bunch of Perl scripts anymore, but the idea is still the same: staying up to date. So we shouldn’t blame RSS for its misbehavior. After all, it’s not a transport protocol; it doesn’t know anything about caching, load balancing, or methods of distribution, so it can be used anywhere. If those who wrote the standards had decided that RSS would be polled using POP3 (as crazy as it might seem), RSS wouldn’t mind, though it would still consume the same bandwidth. RSS worked because I, the user, get to choose when I want to update and how I want to update, and you can’t send me v1@gra unless I really want it, and even then, I know exactly where it came from and how to get rid of it. My aggregator defaults to 5 minutes; it’s hard-coded, and the developers say it’s too technically difficult to change, but I like my aggregator very much. So what are you, the site owner, going to do? Deprive me of my right to read your content?! What if I’m really interested and my company is about to offer you the best business deal ever? Would you still want to ban me?!
Solving this problem should not, in any way, be on the user side. I don’t mean the client side, but the user side. Solving it on the client side means that we, as developers, have to agree on following the standards; they’re only there for our own good: send
If-Modified-Since headers to ask the server if there were any updates, and if there were, download them gzipped; follow
301 Moved Permanently responses and update the feed’s location; store and honor
<ttl>; and generally do a lot of boring work. The user doesn’t have to know the netiquette; it’s our job to make the user happy and not ruin everyone’s day.
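For the curious, here’s roughly what that client-side etiquette boils down to. This is a sketch, not a real aggregator: the `Response` class and the `cache` dict are stand-ins I made up for whatever HTTP library and storage your aggregator actually uses; only the header logic matters.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Response:
    """Stand-in for an HTTP library's response object (assumption)."""
    status: int                      # 200, 301, 304, ...
    headers: dict = field(default_factory=dict)
    body: bytes = b""

def request_headers(cache: dict) -> dict:
    """Build the headers for a polite, conditional, compressed request."""
    headers = {"Accept-Encoding": "gzip"}        # ask for gzipped feeds
    if "last_modified" in cache:
        headers["If-Modified-Since"] = cache["last_modified"]
    if "etag" in cache:
        headers["If-None-Match"] = cache["etag"]
    return headers

def handle(response: Response, cache: dict) -> Optional[bytes]:
    """Update the cache and return the new feed body, or None if unchanged."""
    if response.status == 304:       # nothing new: tiny response, no parsing
        return None
    if response.status == 301:       # feed moved permanently: remember it
        cache["url"] = response.headers["Location"]
        return None
    if response.status == 200:       # fresh content: store the validators
        if "Last-Modified" in response.headers:
            cache["last_modified"] = response.headers["Last-Modified"]
        if "ETag" in response.headers:
            cache["etag"] = response.headers["ETag"]
        return response.body
    return None
```

None of this is rocket science, which is rather the point: the boring work is a couple of headers and three status codes.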
Most of the suggested solutions, like Nick Bradbury’s, Simon Willison’s, and Charles Miller’s, require smart feed aggregators, and since most of the available ones aren’t that smart, these solutions would probably fail. Other solutions suggest providing only summaries in RSS and full content on the site. The MSDN blogs did that, and guess what happened? They lost readership. I’m going to make a bold assumption here: I think providing summaries costs much more bandwidth than providing full feeds! For instance, Joel’s posts are a must-read for me, so if he provides summaries I’m not even going to bother reading them; I’ll download his feed anyway just to click on the titles and open a browser. That way I’d be consuming bandwidth reading each post individually rather than downloading the feed once and reading it in my aggregator, since RSS is usually less than 10% markup versus 20-30% markup in HTML files (and that’s on well-designed sites). So think twice before you decide to provide summaries and only summaries.
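A quick sanity check on that bold assumption, with made-up but plausible sizes (all numbers here are my own assumptions for illustration):

```python
# Hypothetical sizes for one subscriber over one day (assumptions):
FULL_FEED_KB = 25     # full-content feed, 15 items
SUMMARY_FEED_KB = 8   # the same feed, summaries only
HTML_PAGE_KB = 30     # one full post rendered as an HTML page
POSTS_READ = 5        # posts the reader clicks through to

# Full feed: one download, everything read in the aggregator.
full_feed_cost = FULL_FEED_KB

# Summary feed: download the feed anyway, then fetch each post as HTML.
summary_cost = SUMMARY_FEED_KB + POSTS_READ * HTML_PAGE_KB

print(full_feed_cost, summary_cost)
```

Under these assumptions the summaries-only setup costs over six times as much, because every click-through drags in a full HTML page, markup, navigation, and all.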
Here’s an idea: Simon was looking for a way to track RSS feeds, and he suggested issuing an HTTP redirect to a URL with a unique ID, but this requires that the aggregator store the new location and continue updating from it. Well, this wouldn’t work in most current aggregators, so here’s a suggestion: why not issue a unique URL every time the subscription page is requested? Something like http://mysite.com/rss/d41d8cd98f00b204e9800998ecf8427e. This ensures that each and every user gets a new ID when they grab the URL, and once it’s added into the aggregator it never changes. No need for HTTP work-arounds and redirects; this way you can track each and every feed. Even better, rather than always giving every feed the latest 15 news items, why not give the n latest? Not everyone is equally out of date: some people might only need 2 or 3 items since their last update, others might need the latest 30, so why not update them with exactly what they missed? If there are no updates, just send a blank RSS file, one with no items; it’s perfectly valid, and it’s hardly 0.5KB in most cases.
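A minimal sketch of the bookkeeping this would take, assuming an in-memory dict standing in for the site’s database and made-up names throughout:

```python
import uuid

ITEMS: list = []        # published items, newest last; append to publish
last_seen: dict = {}    # token -> index of the newest item already served

def subscribe() -> str:
    """Mint a unique feed URL for a new subscriber (one per request)."""
    token = uuid.uuid4().hex
    last_seen[token] = len(ITEMS)    # brand-new subscriber has missed nothing
    return f"http://mysite.com/rss/{token}"

def feed(token: str) -> list:
    """Return only the items this subscriber hasn't seen.

    An empty list means we'd serve a blank (but valid) RSS file.
    """
    missed = ITEMS[last_seen.get(token, 0):]
    last_seen[token] = len(ITEMS)    # everything up to now is served
    return missed
```

The tracking falls out for free: every token is one subscriber, and each poll tells you exactly how far behind they were.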
I’m aware of this method’s limitations: the processing overhead it causes, the database load, and the fact that there is no single location for my RSS, but all of that fades in comparison to the bandwidth costs on busy sites. I coded a little demonstration of this; you can try it here, or download the source. When the script is first requested it gives you a page with a URL to add to your aggregator. The first time that URL is requested it returns RSS with no items, the second time it returns two posts, the third and fourth return blank RSS again, the fifth time you get a single post, and all subsequent requests get blank RSS. Combined with some caching mechanism, this method should work in most cases. It’s just a quick hack, I didn’t put much time into it, but you get the idea. Hopefully someone will find a better solution to this RSS mess.
A theory on why RSS traffic is growing out of control
RSS is sticky traffic
Thoughts on RSS and bandwidth
RSS Bandwidth Usage
Syndication Bandwidth Consumption
Dynamic RSS Feeds and Bandwidth Consumption
RSS Consumption or Production
Making RSS Scale
When RSS Traffic Looks Like a DDoS
RSS growing pains
HTTP Conditional Get for RSS Hackers
Joel’s RSS problem
Lowering bandwidth usage with TTL in RSS
The myth of RSS compatibility
How to track an RSS feed