De-Anonymizing Web Communities with Gravatar

Introduction

Gravatar is a web service for hosting avatars, thumbnail-sized images which represent users on online communities such as Stack Overflow and WordPress.com. Gravatar allows a user to associate an avatar with their e-mail address; websites that integrate with Gravatar will display this avatar alongside posts provided that the user uses the same e-mail address on both sites.

http://www.gravatar.com/avatar/30b3db431ea2a3dbed966d71c98d205c?s=48

Above is the URL of my avatar. Note that it contains a 32-digit hexadecimal string, 30b3db431ea2a3dbed966d71c98d205c, which Gravatar explains is the MD5 hash of my e-mail address after some trivial normalization (trimming surrounding whitespace and converting to lowercase).
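As a minimal sketch of how such a URL is derived, the following Python computes the hash for an address. It assumes the normalization consists of trimming whitespace and lowercasing, as Gravatar's documentation describes; the function names gravatar_hash and gravatar_url are illustrative, not part of any Gravatar library.

```python
import hashlib


def gravatar_hash(email: str) -> str:
    """Return the 32-digit hex MD5 digest of the normalized address."""
    # Assumed normalization: strip surrounding whitespace, then lowercase.
    normalized = email.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def gravatar_url(email: str, size: int = 48) -> str:
    """Build an avatar URL in the same shape as the one shown above."""
    return f"http://www.gravatar.com/avatar/{gravatar_hash(email)}?s={size}"
```

Because the hash is unsalted and depends only on the address, anyone who can guess the address can reproduce the hash exactly.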

Gravatar's e-mail address hashing is ostensibly used to prevent spammers from harvesting e-mail addresses. However, while it does obfuscate the user's address, this hashing technique is insufficient to protect the anonymity of the user:

  • A user's e-mail address can be recovered through a preimage attack (hashing candidate addresses until one matches), and
  • comments made by a single user posting with multiple identities can be matched.

Obviously, the latter is not much of an issue for users who willingly associate an avatar with their address. However, it is important to note that Gravatar generates a URL even for those who do not opt in to the service. If no avatar is found, a generic image is returned:

http://www.gravatar.com/avatar/b43000bc3d5bf291287e8f90213ed339?s=48

Case Study: RPInsider

To demonstrate this technique, I harvested user comments from RPInsider, a blog aimed at students at Rensselaer Polytechnic Institute. Comments on the blog tend to be anonymous, but RPInsider requires an e-mail address to publish a comment; Gravatar is used to display avatars alongside the comments.
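The harvesting step can be sketched roughly as follows. This is not the script used for the study; it only assumes that each comment's avatar appears in the page markup as a Gravatar URL of the form shown earlier. The name extract_hashes is hypothetical, and pairing each hash with its author name would depend on the blog's actual markup, which is not reproduced here.

```python
import re

# A Gravatar avatar URL embeds the 32-digit hex MD5 digest in its path.
GRAVATAR_RE = re.compile(r"gravatar\.com/avatar/([0-9a-f]{32})", re.IGNORECASE)


def extract_hashes(html: str) -> list[str]:
    """Return every Gravatar hash found in a page, in document order."""
    return [digest.lower() for digest in GRAVATAR_RE.findall(html)]
```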

Here is a sample of anonymous comment authors found on RPInsider:

Author           Gravatar Hash
Lindy Hop        2d0a2fd8af2df830d5ffe17904f376f2
HousingWoes      43db340a19ee7a43d881d5cfc9ac1bf3
Shirley Jackson  b6986f5e176b44e0d11c54882d445dba
Anonymous        f83c0460432e5c60e3fa1a32ead7eb09

All members of the RPI community are given an @rpi.edu e-mail address derived from their name; it is likely that many RPInsider comment authors will use this address. Since the space of possible RPI e-mail addresses is relatively small, given a hash it is trivial to perform a brute-force search to recover the plaintext.
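A brute-force search of this kind might look like the sketch below. The candidate format is an assumption made purely for illustration (one to max_letters lowercase letters followed by an optional single digit); the text does not specify the real RPI naming scheme, and the function names are hypothetical.

```python
import hashlib
import itertools
import string


def candidate_local_parts(max_letters: int):
    """Yield local parts in the assumed format: letters plus an optional digit."""
    suffixes = [""] + list(string.digits)
    for length in range(1, max_letters + 1):
        for letters in itertools.product(string.ascii_lowercase, repeat=length):
            stem = "".join(letters)
            for suffix in suffixes:
                yield stem + suffix


def brute_force(target_hashes, domain: str = "rpi.edu", max_letters: int = 6):
    """Map each recovered hash to the address that produces it."""
    remaining = set(target_hashes)
    recovered = {}
    for local in candidate_local_parts(max_letters):
        if not remaining:
            break  # every target has been recovered
        email = f"{local}@{domain}"
        digest = hashlib.md5(email.encode("ascii")).hexdigest()
        if digest in remaining:
            recovered[digest] = email
            remaining.discard(digest)
    return recovered
```

Each candidate is hashed once and checked against the whole set of harvested hashes, so the cost of the search depends on the size of the address space rather than on the number of comments.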

In all, 673 comments were harvested, representing 290 unique hash-username pairs. Of these pairs, 98 (34%) were successfully associated with an RPI e-mail address. De-anonymization took 7 hours 37 minutes on a 2.83 GHz Intel Xeon CPU.

Author           Gravatar Hash                     E-mail Address
Lindy Hop        2d0a2fd8af2df830d5ffe17904f376f2  d--m@rpi.edu
HousingWoes      43db340a19ee7a43d881d5cfc9ac1bf3  s-----2@rpi.edu
Shirley Jackson  b6986f5e176b44e0d11c54882d445dba  l-----6@rpi.edu
Anonymous        f83c0460432e5c60e3fa1a32ead7eb09  v---r@rpi.edu

Further analysis of the results revealed several users who posted with multiple usernames but used the same e-mail address. In more than one case a user posted with his or her real identity as well as with a pseudonym.
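Linking identities does not require recovering the address at all: grouping the harvested pairs by hash is enough. A minimal sketch, assuming the comments are available as (author, hash) tuples; the function name and the sample data are hypothetical and not taken from the study.

```python
from collections import defaultdict


def names_by_hash(comments):
    """Return hashes that were used with more than one author name."""
    groups = defaultdict(set)
    for author, digest in comments:
        groups[digest].add(author)
    return {d: sorted(names) for d, names in groups.items() if len(names) > 1}


# Hypothetical data: two names share one hash, a third name stands alone.
sample = [
    ("Jane Doe", "a" * 32),
    ("anon123", "a" * 32),
    ("Somebody Else", "b" * 32),
]
linked = names_by_hash(sample)
```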

In addition, correlating the results with the Rensselaer staff directory identified two staff members who posted anonymously.

It should be noted that this preimage attack was made feasible by the fact that RPI e-mail addresses follow a predictable format; the space of arbitrary e-mail addresses is too vast for a generalized brute-force attack. That said, e-mail addresses harvested through other means (e.g., web crawling), or an attack targeted at a specific user, could be used instead of brute force.
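The list-based variant can be sketched in a few lines. It assumes a set of harvested hashes and a list of addresses obtained elsewhere; match_known_addresses is a hypothetical name, and the normalization is the same trimming and lowercasing assumed for Gravatar's hashing.

```python
import hashlib


def match_known_addresses(target_hashes, known_addresses):
    """Map each target hash to the known address that produces it, if any."""
    targets = set(target_hashes)
    matches = {}
    for address in known_addresses:
        normalized = address.strip().lower()
        digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
        if digest in targets:
            matches[digest] = normalized
    return matches
```

For a targeted attack, known_addresses would simply hold the handful of addresses the specific user is suspected of using.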