Snippets
Chamberlain: A User-Serving Model for Identity Management - 11-05-2009
Dr. Ernie writes about user-centric identity management in the wake ...
Best Warning Message Ever - 09-30-2009
From my diagnostic warning messages today: Instantiating NSNavExpansionButtonCell (superclass of ...
Toy Chest: Online Tools for Non-Programmers - 08-12-2009
UC Santa Barbara maintains The Toy Chest: a great list ...
Topia TermExtract - 08-12-2009
This little library looks fun - Topia TermExtract applies a ...
Terminology Watch: Log or Sign In? - 07-31-2009
Tanin Ehrami took the time to collate what terms are ...
Making the Web Smarter - 07-31-2009
Fred Wilson writes about the new Common Tag consortium. The ...
Curating the Real-Time Web - 07-31-2009
A catchy tag phrase from publish2. It's another way to ...
Shiny boxes for bits - 06-12-2009
From Core77, a design review of recent "digital talisman" products. ...
Shadow Physics Game - 06-12-2009
Intriguing demo video of a platformer where you control the ...
Hive: Petabyte-scale data warehousing on Hadoop - 06-12-2009
An engineering note from Facebook about Hive, their Hadoop-based data ...

Paying for nytimes.com

01-22-2010

A day after the Chapter 11 filing of my local newspaper, Nicholas Carr analyzes the New York Times' plans to install a "metered access" payment system for their online content. I think his analysis is dead on.

The most interesting difference between this plan, and the Times' earlier attempt with TimesSelect, is the understanding that the nytimes.com site has to be open to incoming links. By pulling users in from search and social media sites, and saving the conversion attempt for the next click, the Times has a chance to generate revenue from a much larger community of users.

On Namespaces in JSON

01-21-2010

David Baron wrote a thought-provoking short post on Distributed Extensibility two months ago, which I had been meaning to comment on. I'm particularly interested in how to handle distributed extensibility of "RPC-like" network messages, which these days usually means talking about REST and JSON. A couple years ago, it meant talking about XML Namespaces, which is a topic that often causes cries of dismay among developers, so the design pressure to do something different (and hopefully better) is strong.

A tiny introduction to XML Namespaces (or, XMLNS) and their discontents, for the uninitiated

To create XML documents which contain elements defined by more than one authority, the XMLNS spec allows authors to embed prefix declarations into an XML document. These prefixes can be used to "namespace-qualify" any other element, indicating that the element belongs to the namespace the prefix is bound to.

In theory, this means that you can save lots of room by only declaring an external namespace once (which is good, since namespace identifiers tend to be big URLs). In practice, it means that the tag name of an element in an XML document cannot be reasoned about without constructing a data structure encoding all the parent elements of the element and dealing with a variety of tricky corner cases. This means that simple lexical scanners (i.e. regular expressions) cannot (100%) correctly process a document that contains namespace prefixes.
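A minimal sketch of the scoping problem, using Python's standard xml.etree.ElementTree. The document and the example.com namespace URLs are made up for illustration. The same prefix is bound to two different namespaces at different depths, so a lexical scan sees two identical tags while a real parser resolves them to two different names.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document: the prefix "a" is rebound inside <inner>.
DOC = """
<root xmlns:a="http://example.com/one">
  <a:item/>
  <inner xmlns:a="http://example.com/two">
    <a:item/>
  </inner>
</root>
""".strip()

# A lexical scanner sees the same text twice and cannot tell them apart.
lexical = re.findall(r"<a:item", DOC)

# The parser tracks prefix bindings down the tree and expands each tag
# to the form "{namespace-uri}localname".
root = ET.fromstring(DOC)
expanded = [el.tag for el in root.iter() if el.tag.endswith("}item")]

print(lexical)
print(expanded)
```

The parser can only produce the expanded names because it carries the prefix bindings of every ancestor element along as it descends.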

A number of important XML specifications, most notably XSL, have an uncomfortable relationship with prefixed elements. A number of other important specifications, such as XML Exclusive Canonicalization (which is a critical piece of XML Signature), have to jump through a number of hoops to interact nicely with XMLNS.

Then, a couple days ago, no less an authority than Tim Bray weighed in with some recommendations on how to think about JSON extensibility. The short version is this: when extending JSON, use globally-unique names, which encode the design authority of the extender, and use the unique names everywhere. This implies that receivers should follow a MustIgnore policy for messages they receive, since any message could contain extensions a receiver doesn't understand. (It is significant that the SOAP specification considers this an important enough feature to encode "MustUnderstand" as a required attribute of the root-level extensibility definition: Bray's proposal has no such mechanism. It could easily be handled at a higher level, as part of the envelope-level wrapping of a JSON message, though.)
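A rough sketch of what a MustIgnore receiver might look like. The name format (authority, slash, member name) and every name below are assumptions for illustration, not taken from Bray's post.

```python
import json

# Hypothetical globally-unique names that this receiver understands.
KNOWN = {"example.com/title", "example.com/author"}

def read_message(text):
    """Keep the members this receiver understands and ignore the rest."""
    message = json.loads(text)
    return {name: value for name, value in message.items() if name in KNOWN}

# The sender has added an extension from another (made-up) authority.
incoming = json.dumps({
    "example.com/title": "On Namespaces in JSON",
    "example.com/author": "someone",
    "other.example.org/rating": 5,
})

print(read_message(incoming))
```

Because every name carries its authority with it, the receiver needs no surrounding context to decide what to keep; it simply drops what it does not recognize.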

I spent a long time dealing with corner cases of XML Namespaces, especially when I was implementing XML Security, so I have a lot of personal scar tissue around what that specification does well, and does not so well. For exploratory purposes, I thought I'd make a quick table of how these approaches stack up.

Goal | XML Namespaces | Global Names in JSON
Globally Unique | Up to each domain host, which is manageable | Same. But terser.
Efficient for Transport | Yes: a normalized XML document declares each namespace only once | Not particularly. The namespace identifier is effectively redeclared for each attribute.
Self-documenting | Good: The URL can be loaded in a web browser | Not great: There's a hint to the controlling authority and that's it. Is that a problem? You can probably just pop it into a search engine and you're done.
Document fragments are legal | Definitely not: Removing a subelement from an XML document with namespaces requires complicated DOM-level manipulation of the document tree. | Yes, trivially
Handles versioning reasonably | Not particularly: In most cases, bumping an XML Namespace version means dropping in an entirely new set of element handlers | Maybe: Bray's scheme could handle refinement at the level of a single attribute.

On balance, I think Bray's proposal comes out ahead. It's verbose, but that's what gzip is for. The ability to process JSON fragments with more-or-less context-free lexical scanners is a big win. And I think we've learned that having namespace URIs that resolve to documents wasn't all that important.
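To get a feel for the "that's what gzip is for" point, here is a small sketch that repeats one long, made-up global name across many records and compares the raw and gzipped sizes. The name and the record shape are assumptions; real messages will compress differently.

```python
import gzip
import json

NAME = "example.com/extensions/rating"  # hypothetical global name

# The long name is spelled out in full in every record.
records = [{NAME: i % 5, "id": i} for i in range(200)]
raw = json.dumps(records).encode("utf-8")
packed = gzip.compress(raw)

print(len(raw), len(packed))
```

The repeated name costs many bytes in the raw message, but a general-purpose compressor replaces each repeat with a short back-reference.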

Identity in the Browser, Nov. 2009

12-01-2009

Our work at Mozilla on Identity in the Browser is picking up. Here's a quick link dump of recent work:

  • Labs UX lead Aza Raskin writes about Identity in the Browser, with our most recent mockups and interface thoughts.
  • I have started a formal project to define and construct an Account Manager for Firefox. Our goal is to provide browser-level support for end-user identity management, which covers the current tasks of login, logout, signup, information change, password change, and information revocation. I'll write more about it soon.
  • I'm pursuing another project that is currently called Contact Pool; this project is about providing a rich in-browser library for making applications that act on social entities (mostly people!).

An Informatics Model for News?

11-19-2009

Dan Conover writes a consistently good blog about the future of journalism over at Xark!. This week, he wrote movingly about the future of journalism as an information product producer.

The future value of journalism -- what I contend will be the next successful evolutionary step in media development -- will be in creating information products based on thoughtful structures. That doesn't mean the end of narrative, or the end of the live report from the field, but it does mean that journalists will have to learn to view "their story" as a subset of a larger file that stores information in ways that machines can search for interesting patterns.

I call this The Informatics Model, and I think it sounds a lot more complex than it really is. But once we've established it, everyone will come to understand that the asset that journalism creates and owns is the structure in which it assembles and stores freely available (but expensive to gather) information. No individual fact has an appreciable value. The structure in which each resides, complete with metadata that tells us its "aboutness," will be the resource that we sell not only to news consumers, but to researchers, businesses, networks and specialized clients.

Give away the stories. Sell the structures.

Follow the link on The Informatics Model for a much more detailed exploration of what the idea means.

It's a provocative idea, and a somewhat troubling one. Among other things, it suggests that we should be training librarians and engineers to be journalists, instead of starting with writers. It also suggests that journalists are competing head-on with Google, in an effort to "organize the world's information". The image that leaps to mind there, unfortunately, is of John Henry with his hammer.

SPDY

11-13-2009

The Chromium team at Google has announced SPDY, an experimental new application-level message framing protocol for the web. SPDY is designed to speed up web browsing by reducing latency and minimizing the effect of lossy and slow networks. I worked for a time on the WAAS WAN Optimization product at Cisco, and was particularly interested in optimizations to HTTP, so I'm glad to see Google making an effort to work on this problem.

The SPDY whitepaper has a good overview of the current understanding of the performance issues in HTTP. I'll try to illustrate those problems, and explain how SPDY proposes to address them.

In theory, a web browser simply opens a connection to a server, requests a page, and receives the page's data. In practice, of course, most web pages are made of many objects, most of which need to be retrieved for the page to be rendered correctly. All modern browsers open multiple TCP connections to retrieve these objects, and interleave the requests for the objects across all the TCP streams, like this:

(In these diagrams, each grey box indicates a single data stream; the blue portion is when we are sending or receiving data. The length of each blue portion is a function of the client send time, the time for the packet to reach the server, the time for the server to process the request, and the time for all of the packets of the response to reach the client. The exact length of the response time is therefore a function of the network latency and the TCP window size, which is more detail than I'm going to go into now!)

I'm illustrating a couple points with this diagram:

  • The client can't start requesting objects until it finishes receiving the first page. That's because the web browser needs to parse the HTML to discover the addresses of all the embedded objects. (Some browsers now start requesting embedded objects before they finish parsing the page, which means that the very first blue bar doesn't need to be done before the new black bars start).
  • Some requests are handled very quickly, while others take more time -- the fastest are cache hits, where the server simply returns a 304 Not Modified response with no body. The client has no way of knowing which requests will be fast when it makes them (though it could try to guess). That's too bad, because the client can actually make a pretty good guess about which requests are the important ones (prioritizing, say, the big image right in front of the user, instead of the image in the footer that is scrolled off the screen right now). In practice, HTTP has a head-of-line blocking problem: a slow response holds up every request queued behind it on the same connection.
  • The transmission time includes the time to send both the message headers and the message body. For small messages, the message headers are much larger than the body -- in a cache hit case, the message is 100% headers.
  • More TCP streams lead directly to lower latency. Most modern web browsers open six connections to a server, which reduces perceived latency significantly. But those streams are not free: they consume flow records in a load balancer, a file descriptor in the web server's operating system, and (in most server implementations) a thread or process in the web server.
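The header-overhead point in the list above can be made concrete with a little arithmetic. This is a sketch only: the header lines, the ETag value and the 43-byte body are invented for illustration, and real responses usually carry more headers than this.

```python
def header_share(header_lines, body=b""):
    """Fraction of the bytes on the wire that are headers rather than body."""
    head = ("\r\n".join(header_lines) + "\r\n\r\n").encode("ascii")
    return len(head) / (len(head) + len(body))

# Hypothetical responses; the header values are illustrative only.
cache_hit = [
    "HTTP/1.1 304 Not Modified",
    "Date: Fri, 13 Nov 2009 12:00:00 GMT",
    'ETag: "abc123"',
]
small_image = [
    "HTTP/1.1 200 OK",
    "Date: Fri, 13 Nov 2009 12:00:00 GMT",
    "Content-Type: image/gif",
    "Content-Length: 43",
    'ETag: "abc123"',
]
tiny_gif = b"\x00" * 43  # stand-in for a very small image body

print(header_share(cache_hit))              # no body at all
print(header_share(small_image, tiny_gif))  # headers outweigh the body
```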

Now, in a real-world network, the story gets a bit uglier. Here's a diagram that shows what happens in a high-packet-loss network:


In these diagrams, the red region shows where a packet was lost. Under TCP retransmission rules, the entire stream is stalled while the sending party waits for the ACK, decides that packet loss has occurred, and retransmits. In practice, this can take a second or more.

The picture gets even uglier if the packet loss occurs early in the flow:


In this picture, a packet was lost during the initial page request (or even, worst case, during the initial SYN of the TCP handshake). In that case, the entire page load is stalled waiting for that one stream to retransmit.
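A toy timing model of the two loss cases, to show why early loss hurts more. Everything here is an assumption: the one-second stall, the request durations, and the scheduling rule (each object goes to whichever connection is idle first). It ignores TCP windows, slow start and bandwidth sharing entirely.

```python
import heapq

STALL_MS = 1000  # assumed retransmission stall ("a second or more")

def page_load_ms(first_page_ms, object_ms, connections=6, lost=frozenset()):
    """Toy model: fetch the page, then fetch objects on parallel connections.

    `lost` holds the indexes of requests that suffer one packet loss:
    index 0 is the page itself, 1..n are the embedded objects.
    """
    # No object can be requested until the page has arrived.
    clock = first_page_ms + (STALL_MS if 0 in lost else 0)
    free_at = [clock] * connections      # time at which each connection is idle
    heapq.heapify(free_at)
    for i, duration in enumerate(object_ms, start=1):
        start = heapq.heappop(free_at)   # earliest idle connection
        stall = STALL_MS if i in lost else 0
        heapq.heappush(free_at, start + duration + stall)
    return max(free_at)

objects = [100] * 12  # twelve embedded objects, 100 ms each (made-up numbers)
print(page_load_ms(200, objects))              # no loss
print(page_load_ms(200, objects, lost={1}))    # loss on one object stream
print(page_load_ms(200, objects, lost={0}))    # loss on the initial page request
```

With a loss on one object stream, the other five connections keep working and only one object arrives late. With a loss on the page request, nothing else can even start until the retransmission completes.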

How SPDY Changes The Picture

SPDY proposes to keep the TCP substrate of HTTP, and to preserve the request/response message exchange format, but to replace the stream-oriented protocol with a more sophisticated multiplexed message framing protocol. It looks like this:


  • The client opens a single TCP connection to the server, and sends HTTP requests down it.
  • Requests and responses can be multiplexed on this single connection.
  • Headers are compressed (notice that the cached message pairs are much shorter).
  • The server has the option of initiating "server push", by delivering some responses that the client did not ask for, because it knows they will be needed.
  • Individual bars take a bit longer, because the available bandwidth is shared.
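The multiplexing idea in the list above can be sketched with a made-up framing format. This is not the SPDY wire format: the 6-byte header (stream id plus length), the chunk size and the round-robin interleaving are all assumptions chosen to keep the example short.

```python
import struct
from itertools import zip_longest

# Assumed frame header: 4-byte stream id, 2-byte payload length, network order.
HEADER = struct.Struct("!IH")

def frames(stream_id, payload, chunk=4):
    """Split one message into small frames tagged with its stream id."""
    for i in range(0, len(payload), chunk):
        part = payload[i:i + chunk]
        yield HEADER.pack(stream_id, len(part)) + part

def multiplex(messages, chunk=4):
    """Interleave frames from several streams, round-robin, on one connection."""
    per_stream = [frames(sid, data, chunk) for sid, data in messages.items()]
    wire = bytearray()
    for round_of_frames in zip_longest(*per_stream):
        for frame in round_of_frames:
            if frame is not None:        # shorter streams run out first
                wire += frame
    return bytes(wire)

def demultiplex(wire):
    """Reassemble each stream's payload from the interleaved frames."""
    out = {}
    pos = 0
    while pos < len(wire):
        sid, length = HEADER.unpack_from(wire, pos)
        pos += HEADER.size
        out[sid] = out.get(sid, b"") + wire[pos:pos + length]
        pos += length
    return out

messages = {1: b"<html>page</html>", 3: b"body{}", 5: b"image-bytes-here"}
wire = multiplex(messages)
print(demultiplex(wire) == messages)
```

Because every frame names its stream, a short response can finish while a long one is still in flight on the same connection, at the cost of a few header bytes per frame.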

In the best case, SPDY can be much faster than HTTP-over-multiple-TCP.
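A sketch of why compressing headers on a long-lived connection pays off. It uses zlib with one compression context shared across requests, which is my assumption for the sake of the example rather than a description of what SPDY puts on the wire; the request headers are also invented.

```python
import zlib

def request_headers(path):
    # Hypothetical request; the header values are illustrative only.
    return (
        f"GET {path} HTTP/1.1\r\n"
        "Host: www.example.com\r\n"
        "User-Agent: ExampleBrowser/1.0\r\n"
        "Accept: text/html,application/xhtml+xml\r\n"
        "Accept-Encoding: gzip,deflate\r\n"
        "\r\n"
    ).encode("ascii")

# One compressor and one decompressor live as long as the connection does,
# so later requests can refer back to text seen in earlier ones.
compressor = zlib.compressobj()
decompressor = zlib.decompressobj()

sizes = []
restored = []
for path in ["/", "/style.css", "/logo.png", "/app.js"]:
    raw = request_headers(path)
    # Z_SYNC_FLUSH emits everything so far without ending the stream.
    packed = compressor.compress(raw) + compressor.flush(zlib.Z_SYNC_FLUSH)
    restored.append(decompressor.decompress(packed) == raw)
    sizes.append((len(raw), len(packed)))

for raw_len, packed_len in sizes:
    print(raw_len, packed_len)
```

The first request compresses only modestly; the later ones shrink much further because almost every header line repeats what the shared context has already seen.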

The SPDY designers mandated that it must run over SSL. While they claim that this is for the security benefit, I think it far more likely that it is because it allowed them to tunnel through application-aware networking infrastructure.

Alternative Approaches

The Stream Control Transmission Protocol attacks the problem at the transport layer. SCTP proposes to replace TCP, a single-connection, stream-oriented protocol, with an association-based, multi-stream protocol, still running over IP.

There is an Internet Society paper, "Why is SCTP needed given TCP and UDP are widely available?", which does a good job explaining the advantages of SCTP. Some experimental work on supporting HTTP over SCTP has been done, and a prototype of HTTP-over-SCTP in Firefox has been demonstrated.

SCTP has been around for almost ten years, and hasn't really seen much uptake, despite having many attractive characteristics. It would require updates to many pieces of the computing infrastructure, both in application-aware networking gear (load balancers, firewalls, etc.) and in client and server operating systems and applications.

So, what next?

It looks like the Google team has done some great work, and has tried hard to strike a balance between progress and compatibility. As with all attempts to improve the infrastructure of a system that is under heavy use, there are a lot of hard questions to ask about it.

  • The SPDY research (and other work done by other teams) has shown that HTTP header compression has major benefits, especially on low-bandwidth uplinks. I'd be interested in analyzing the speedup of just adding header compression to HTTP/1.1.
  • Tunneling over SSL allows SPDY to hide from large chunks of the application-aware network infrastructure, but there is still a huge deployed base of SSL-terminating load balancers and reverse proxies. Upgrading every data center in the world to work with SPDY would be a huge task -- I'd like to see more thinking about how we could gracefully upgrade the world.
  • The server push model proposed by SPDY raises some interesting possibilities for server-side optimization of perceived client latency. I can imagine a "site compiler" that builds a resource manifest for each page and prepositions the content -- the problem, of course, is that a cache hit beats a server push every time. Perhaps the server should push an object manifest to the client, which would allow the client to make a single request for all the objects that it doesn't have cached yet.
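The object-manifest idea in the last item could look something like the sketch below. It is purely hypothetical: the manifest shape (URL mapped to a validator such as an ETag), the function name and the URLs are all made up, and nothing like this is part of SPDY as proposed.

```python
def objects_to_request(manifest, cache):
    """Return the URLs the client still needs, given a server manifest.

    manifest: {url: validator} pushed by the server for this page
    cache:    {url: validator} for objects the client already holds
    """
    return sorted(
        url for url, validator in manifest.items()
        if cache.get(url) != validator   # missing or stale in the cache
    )

# Hypothetical manifest and cache contents.
manifest = {"/style.css": "v7", "/logo.png": "v2", "/app.js": "v12"}
cache = {"/style.css": "v7", "/app.js": "v11"}

print(objects_to_request(manifest, cache))
```

The client would then ask for just that list in a single request, so cached objects cost nothing and the server never pushes bytes the client already has.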