Thoughts on Do-Not-Track

01-23-2011

By Michael Hanson, Mozilla Labs

There has been a lot of activity in the last month or two on web user tracking and the creation of various Do-Not-Track mechanisms. I wanted to take the time to write down my current thinking on the topic.

Let me begin by defining a couple of ways that users relate to the websites they visit.

A first-party relationship is established between a web user and a website with the user's awareness, and presumably, consent. In the most obvious cases, this relationship involves creating an account and logging in to a site. In some cases, it may just represent a visit to the site.

A third-party relationship is one that exists between a web user and some website because of actions taken by another website. When a user visits a news site, which directs the user's browser to open an image from an ad network's site, the relationship is third-party. In many cases, the user does not know that the relationship exists.

So, then we can define different kinds of web tracking.

First-party tracking is intended to preserve a first-party relationship. The trivial example is session tracking within a single site - displaying your account information in the header of each page, for example. More complex is tracking of a user across multiple web servers that work together to provide a web experience - the way sports.yahoo.com and news.yahoo.com work together, for example. More complex yet is tracking of a user across multiple properties of a single company - think of sharing a login cookie between flickr.com and yahoo.com. And most complex of all is tracking a user through a federated login system, such as Facebook Connect, which conveys some user tracking data from Facebook to any website on the Internet.

Third-party tracking is intended to provide some persistence to a third-party relationship. The use that most people think of is behavioral advertising, in which a user's search keywords are identified and communicated to an ad server, where they are used to select a display ad. Many other uses exist for this technology, though. Behavioral metrics, such as those provided by Google Analytics or Quantcast, are one example.

Some kinds of tracking are subtle to peg. For example, when Google Analytics tracks a user for metrics analysis, the data it collects is aggregated only on a per-site basis and made available only to the site owner. That is, legally and conceptually, a first-party relationship, despite the fact that the cookie was generated by code from a third party (see edit note below). If the site opts in to analytics data sharing, user data is tracked and aggregated more broadly, which makes the tracking relationship more third-party. As far as I can tell, there is no way to know whether a given site has opted to share its data with Google's broader tracking program. (This is not to call out Google's tracking practices; there are many metrics tracking systems with similar properties.)

(Edit 1/25/2011: I incorrectly use Google Analytics as an example of a third-party cookie here; Analytics uses a first-party cookie that is issued by a third party.)

So, having defined my terms, how do the various technical approaches to Do-Not-Track stack up?

1. Per-Company Opt-Out Cookies and an Opt-Out Registry

The NAI, an advertising industry consortium, has created a Behavioral Advertising Opt-Out registry. At the registry site users can click a checkbox to indicate to a site that they do not want to receive behaviorally-targeted ads. In practice, this means that each site places a new cookie on the user's browser that encodes the user's "do not track me" intent.

This approach has some serious flaws. The most critical of them is that the user's Do-Not-Track intent is stored in a cookie. If the user follows the recommended practice of clearing their cookies to protect their privacy, the advertiser will go ahead and start tracking them again as soon as they return to the web! This is a fundamental flaw with storing preferences about cookie usage in the cookie system.
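The flaw is easy to see in a toy model. The sketch below is not any advertiser's real implementation: the cookie names ("uid", "optout"), the domain ads.example, and the helper functions are all assumptions made for illustration.

```python
# Hypothetical model of a browser cookie jar: domain -> {cookie name: value}.
# The cookie names "uid" and "optout" are illustrative assumptions.

def is_tracked(cookie_jar, domain):
    """A cooperating advertiser tracks the user unless an opt-out cookie is present."""
    return cookie_jar.get(domain, {}).get("optout") != "1"

def opt_out(cookie_jar, domain):
    """What the registry does: set one more cookie for this advertiser."""
    cookie_jar.setdefault(domain, {})["optout"] = "1"

def clear_cookies(cookie_jar):
    """What a privacy-conscious user does: wipe everything, opt-outs included."""
    cookie_jar.clear()

jar = {"ads.example": {"uid": "abc123"}}

opt_out(jar, "ads.example")
tracked_after_opt_out = is_tracked(jar, "ads.example")  # opt-out cookie is present

clear_cookies(jar)
tracked_after_clear = is_tracked(jar, "ads.example")    # opt-out cookie is gone
```

In this model the preference lives in the same store the user is told to empty, so the second check reports the user as trackable again. An advertiser the user never opted out of is trackable from the start.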

Pros:

  • The mechanism is simple, and addresses behavioral advertising, a major source of end-user concern

Cons:

  • The lack of tracking is essentially on the honor system. Advertisers still receive the information they would need to track the user; they just agree not to. (If you look at what happens to your browser when you click the "Opt-Out" button at the NAI website, you'll notice that many advertisers set a new "optout" cookie on you, but don't delete the old ones. Just in case you change your mind!)
  • Clearing cookies returns the user to a "track me" status, which is the opposite of their intent.
  • Opt-out is required on a per-advertiser basis; if a new advertiser comes online the user has to revisit the registry and opt out again.
  • There is no mechanism here to prevent tracking for non-advertising applications

2. Browser-Based Request, or Header, Blocking

A different approach is to modify the web browser to prevent communication with the tracking site. If no data is exchanged, obviously no tracking can occur. This approach is sometimes called "load blocking", as it blocks the initial loading of data from the tracking party.

The strongest version of this is to block all web requests to a domain, or a subdomain. The AdBlock addon for Firefox, and the proposed Tracking Protection feature in IE 9, take this approach. Such a system requires a blacklist of sites which should not be communicated with, which means that there is an ongoing maintenance burden for any such system. Websites can circumvent the system by changing their URLs, or by changing their domain names, but this is an administrative hassle. The browser can tell the difference between a first-party request (such as when the user visits the target site) and a third-party request (such as from an embedded element) and block only the third-party requests.
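As a rough illustration of the blocking decision, here is a minimal sketch. It is not how AdBlock or IE 9 is implemented. The function names and example hosts are hypothetical, and the first-party test is deliberately naive: it compares hostnames by suffix, where a real browser would compare registrable domains.

```python
def host_matches(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)

def is_third_party(request_host, page_host):
    """Naive check (an assumption): related hostnames count as first-party."""
    return not (host_matches(request_host, page_host)
                or host_matches(page_host, request_host))

def should_block(request_host, page_host, blocklist):
    """Block a request only if it is third-party AND its host is on the list."""
    if not is_third_party(request_host, page_host):
        return False
    return any(host_matches(request_host, domain) for domain in blocklist)

blocklist = {"tracker.example"}

# An embedded pixel from a listed domain is blocked.
embedded = should_block("pixel.tracker.example", "news.example", blocklist)

# Visiting the listed site directly is a first-party request and is allowed.
direct = should_block("tracker.example", "tracker.example", blocklist)
```

The sketch also shows why renaming works as a circumvention: a request to a host that is not on the list passes straight through.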

A downside to this approach is that it is "all or nothing" — there is no way to express the willingness to contact a site, but the unwillingness to be tracked. There is no way for a user to say, for example to an advertising company, "You may show me advertisements, but I do not want you to use behavioral tracking to target advertisements to me." This makes it hard for good actors to understand and respect user intent while fulfilling their business goals.

Pros:

  • For sites on the list, no tracking is possible

Cons:

  • The list of sites needs to be maintained, and could be subject to abuse, gaming, or staleness
  • Denies good actors a way to communicate with the user
  • Blocking all requests to a site can break non-tracking functionality

2a. Browser-Based Cookie Blocking

A weaker version of this is to block the transmission or handling of certain headers, especially the Cookie and Set-Cookie headers, to certain sites or under certain conditions. This has been attempted by many browser makers in various ways at various times; here is a quick summary of the state of the art:

  • Internet Explorer, while running in the "Medium" security mode, requires that a site provide a P3P header before a third-party Set-Cookie request is honored. As far as the community of internet users is concerned, the header is just a magical invocation that turns on cookies, rather than any kind of principled declaration of a privacy policy. That's a pretty sad failure of a system that was intended to do much more; the P3P system was proposed with the best of intentions but has not been adopted in a meaningful way.
  • Safari blocks third-party cookie setting by default, only accepting cookies from the top-level domain of a page. Some sites work around this by redirecting the user to the third party, setting a cookie, and then redirecting back to the embedding page. Safari allows third-party cookie reading, however.
  • Firefox has a preference to disable third-party cookies, which is False by default. This works well in practice but few users turn the preference on.
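To make the difference between blocking cookie setting and blocking cookie reading concrete, here is a small sketch. The Browser class, its two flags, and the host and cookie names are hypothetical; it models the behaviors described above rather than any browser's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class Browser:
    # Hypothetical policy flags, not real browser preferences.
    allow_third_party_set: bool
    allow_third_party_read: bool
    jar: dict = field(default_factory=dict)  # host -> {cookie name: value}

    def set_cookie(self, host, name, value, third_party):
        """Handle a Set-Cookie response header; return True if it was stored."""
        if third_party and not self.allow_third_party_set:
            return False
        self.jar.setdefault(host, {})[name] = value
        return True

    def cookies_for(self, host, third_party):
        """Return the cookies that would be sent in the Cookie request header."""
        if third_party and not self.allow_third_party_read:
            return {}
        return dict(self.jar.get(host, {}))

# A policy like the one described for Safari: no third-party setting, reading allowed.
browser = Browser(allow_third_party_set=False, allow_third_party_read=True)

# An embedded ad tries to set a cookie and is refused.
blocked = browser.set_cookie("ads.example", "uid", "abc123", third_party=True)

# The redirect workaround: the ad host is briefly the top-level page.
stored = browser.set_cookie("ads.example", "uid", "abc123", third_party=False)

# Back on the embedding page, the cookie is sent along with the ad request.
sent_when_embedded = browser.cookies_for("ads.example", third_party=True)
```

Under this model, a policy that blocks only setting is defeated by a single redirect, while one that also blocks reading keeps the cookie from being sent in an embedded context.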

Pros:

  • Targets the most common tracking mechanism directly

Cons:

  • Browser makers can't agree on how this should work, and the mechanisms are easily defeated by workarounds

3. A Do-Not-Track Header

Okay, so that brings us to the newest proposal. What if we conclude that it is the browser's job to let the web know whether a user wants to consent to tracking?

The obvious way to implement this is with a new HTTP request header. Let's not get too focused on what the exact name of the header will be - but let's call it "Do-Not-Track", and give it a value of 0 or 1. If the browser sends "Do-Not-Track: 1" with a request, it means that the user doesn't want to be tracked. What that means, exactly, is the website's problem. (The actual header proposal, as it turns out, is "X-Tracking-Choice: do-not-track", but the concepts are identical)
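A minimal sketch of both ends of that exchange follows, using the placeholder header name "Do-Not-Track" from this post. The helper names are hypothetical, and two choices are assumptions rather than part of any proposal: the header is omitted when the user has expressed no preference, and any value other than "1" is treated as no request.

```python
def add_do_not_track(headers, user_opted_out):
    """Browser side: return a copy of the request headers, adding the signal if asked."""
    out = dict(headers)
    if user_opted_out:
        out["Do-Not-Track"] = "1"
    return out

def wants_no_tracking(headers):
    """Server side: True only if the request carries Do-Not-Track: 1."""
    for name, value in headers.items():
        if name.lower() == "do-not-track":  # HTTP header names are case-insensitive
            return value.strip() == "1"
    return False

request_headers = add_do_not_track(
    {"User-Agent": "ExampleBrowser/1.0"}, user_opted_out=True
)
server_sees_opt_out = wants_no_tracking(request_headers)
```

Because the preference is stored in the browser's settings and attached to every request, clearing cookies does not affect it.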

The header clearly doesn't prevent all possible privacy harms, since the browser is still potentially sending all the information that would be required to track the user. What a Do-Not-Track header would do, however, is create a clear statement of user intent -- or, in more traditional words, a paper trail.

In actual practice, a Do-Not-Track header would be a piece of a consumer protection scheme. By creating a paper trail of user intent, it could allow a regulatory body to investigate claims of improper data usage. If a firm were found to track users in spite of the presence of affirmative Do-Not-Track headers, and after a reasonable length of time for implementation had elapsed, a stronger case could be made that it was infringing its users' privacy. This obviously does not work for sites that are willing to ignore user intent or break laws; stronger technical countermeasures will be necessary in those cases. So let us restrict this part of the discussion to sites that want to respect their users and operate within a framework of consent.

One can imagine this feature allowing websites to be upfront with users about their tracking needs: "To use this site, you have to enable tracking." But tracking by whom? By their site? By their advertising networks? In a degenerate case, users could be required to turn off Do-Not-Track for the entire web, just to get into a site. (This is what happens with Firefox's third-party cookie setting today.) Our design should be aware of these cases and avoid those failure modes.

But what would this header be asking the site to do, exactly? Is the user asking the website to stop third-party tracking only, or first-party tracking too? What about cross-site tracking within a content network? Tracking for metrics? Personalization of web widgets, like the Facebook "X of your friends liked this" frame?

I propose that the user's intent can be captured in a simple rule: If the Do-Not-Track header is present, and the site has a "tracking opt-out" mechanism, the mechanism should be activated. If the site does not have an explicit opt-out mechanism, the user should experience only content from their first-party relationship with the page being viewed.
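Expressed as code, the rule is a three-way decision. The function and the result labels below are hypothetical names chosen for this sketch; what each outcome means in practice is still up to the site.

```python
def choose_response(do_not_track, has_opt_out_mechanism):
    """Apply the proposed rule to one request.

    do_not_track: whether the request carried the Do-Not-Track signal.
    has_opt_out_mechanism: whether this site has an explicit tracking opt-out.
    """
    if not do_not_track:
        return "normal"            # no signal: behave as the site does today
    if has_opt_out_mechanism:
        return "opt-out"           # treat the user as having opted out
    return "first-party-only"      # serve only first-party content

ad_server = choose_response(do_not_track=True, has_opt_out_mechanism=True)
small_site = choose_response(do_not_track=True, has_opt_out_mechanism=False)
```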

For behavioral advertising servers, the intent of a Do-Not-Track header is quite clear: it should be interpreted as though the user visited the Opt-Out Registry and clicked the checkbox. This is a clear win over the piecemeal approach we have today.

For web metrics servers, like Google Analytics, the meaning of the header is a bit more complicated. Presumably it indicates that the user is okay with being counted in aggregate statistics, but does not want their unique identity used for metrical analysis. Do we need to make a list of potential uses of user data and provide a list of which data should not be tracked in the header (yes to metrics, no to ads)? That is technically possible, but adds complexity.

For an embedded widget, the correct behavior is probably to render whatever content would be given to an anonymous user. To say it in technical terms: When web content is rendered in a third-party context, and the Do-Not-Track header is present, the web server should ignore any user ID or session tracking cookies. This may be disconcerting to some users -- they may want cross-site personalization of their web experience. Do browsers need to implement per-site whitelisting of the Do-Not-Track header?
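A sketch of that widget behavior might look like the following. The cookie names ("uid", "session"), the function names, and the output strings are hypothetical. How the server learns that it is being rendered in a third-party context is left out, so it is passed in as a flag.

```python
# Hypothetical names for the cookies that identify a user or session.
TRACKING_COOKIES = {"uid", "session"}

def cookies_to_honor(cookies, do_not_track, third_party_context):
    """Drop identifying cookies when Do-Not-Track is set in a third-party context."""
    if do_not_track and third_party_context:
        return {k: v for k, v in cookies.items() if k not in TRACKING_COOKIES}
    return dict(cookies)

def render_widget(cookies, do_not_track, third_party_context):
    """Render personalized content only if an honored cookie identifies the user."""
    honored = cookies_to_honor(cookies, do_not_track, third_party_context)
    user = honored.get("uid")
    if user is None:
        return "anonymous view"
    return "personalized view for " + user

cookies = {"uid": "alice", "theme": "dark"}
embedded_view = render_widget(cookies, do_not_track=True, third_party_context=True)
direct_view = render_widget(cookies, do_not_track=True, third_party_context=False)
```

Non-identifying cookies, such as the "theme" value here, are left alone, and a first-party visit to the widget's own site is unaffected by the header.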

For a federated login provider, the behavior is even more complicated. The OpenID protocol supports an "immediate" mode, in which the user's identity is relayed to the relying site immediately, with no user check required (though in practice many OpenID servers will get the user's consent once per relying site). Should the presence of a Do-Not-Track header stop this behavior? Obviously, in the non-immediate mode, the user is being asked to consent to a form of tracking, and the header is not very meaningful.

By moving the discussion from a technical domain to a policy domain, the header could change the debate. Obviously, gathering feedback from the technical community is necessary, but another important next step would be to validate the merits of the scheme with legal and regulatory experts.

Pros:

  • Clear signal of user intent which persists in the browser
  • Doesn't break existing web functionality
  • Works for all domains, including advertising, personalization, and metrics

Cons:

  • Has no effect until sites have an incentive to adopt it
  • Does not prevent malicious or covert tracking
  • May require additional browser-level intelligence with per-domain, or per-type-of-use, logic


Comparing these three approaches, the Do-Not-Track header has clear advantages. It is a clear statement of user intent, which persists across cookie deletion, which does not require a central registry or blacklist, and which gives good actors the information they need to treat users with respect. It will require some new work by web servers to implement, but the work is straightforward and uncomplicated. It puts control in the hands of the users, and creates transparency for users, web service providers, and advertisers. A complete solution for users will require additional, more technical, countermeasures, but the header-based approach will provide a powerful new tool.

Michael Hanson is a principal engineer of Mozilla Labs, an arm of the Mozilla Foundation, which is a non-profit dedicated to preserving freedom, openness, and participation on the Web. This blog represents his personal opinions, not those of his employer.

You can also follow him on Twitter.
