Introduction
Gravatar is a web service for hosting avatars: thumbnail-sized images that represent users in online communities such as Stack Overflow and WordPress.com. Gravatar lets a user associate an avatar with their e-mail address; websites that integrate with Gravatar display this avatar alongside the user's posts, provided that the user gives the same e-mail address on both sites.
To the left is my avatar. Note that the URL for this image contains a 32-digit hexadecimal string, 30b3db431ea2a3dbed966d71c98d205c, which Gravatar explains is the MD5 hash of my e-mail address after some trivial normalization.
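As an illustration of how such a hash is derived, here is a minimal Python sketch. It assumes the "trivial normalization" consists of trimming surrounding whitespace and lowercasing the address, which is what Gravatar's documentation describes; the function names and the example address are hypothetical.

```python
import hashlib


def gravatar_hash(email: str) -> str:
    """Return the 32-digit hex MD5 of a normalized e-mail address."""
    # Assumed normalization: strip surrounding whitespace, then lowercase.
    normalized = email.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def avatar_url(email: str) -> str:
    """Build the avatar URL that a site would embed next to a comment."""
    return "https://www.gravatar.com/avatar/" + gravatar_hash(email)


if __name__ == "__main__":
    # Hypothetical address; differently cased input yields the same URL.
    print(avatar_url("  Someone@Example.com "))
```

Because the hash is deterministic and unsalted, anyone who can guess the address can confirm the guess by recomputing the hash.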
Gravatar's e-mail address hashing is ostensibly used to prevent spammers from harvesting e-mail addresses. However, while it does obfuscate the user's address, this hashing technique is insufficient to protect the anonymity of the user:
- A user's e-mail address can be recovered through a preimage attack, and
- comments made by a single user posting with multiple identities can be matched.
Obviously, the latter is not much of an issue for users who willingly associate an avatar with their address. However, it is important to note that a Gravatar URL is generated even for those who have not opted in to the service. If no avatar is found, a generic image is returned (right).
Case Study: RPInsider
To demonstrate these weaknesses, I harvested user comments from RPInsider, a blog aimed at students at Rensselaer Polytechnic Institute. Comments on the blog tend to be anonymous, but RPInsider requires an e-mail address to publish a comment; Gravatar is used to display avatars alongside the comments.
Here is a sample of anonymous comment authors found on RPInsider:
| Author | Gravatar Hash |
|---|---|
| Lindy Hop | 2d0a2fd8af2df830d5ffe17904f376f2 |
| HousingWoes | 43db340a19ee7a43d881d5cfc9ac1bf3 |
| Shirley Jackson | b6986f5e176b44e0d11c54882d445dba |
| Anonymous | f83c0460432e5c60e3fa1a32ead7eb09 |
All members of the RPI community are given an @rpi.edu e-mail address derived from the owner's name, and many RPInsider comment authors are likely to use it. Because the space of possible RPI addresses is relatively small, recovering the plaintext address behind a given hash is a simple brute-force search.
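The search can be sketched roughly as follows. This is not the code used for the study: the exact RPI address format is not spelled out here, so the sketch assumes a hypothetical format of a few lowercase letters optionally followed by one digit, and the names `candidates` and `recover` are made up for illustration.

```python
import hashlib
import itertools
import string


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def candidates(domain: str, max_letters: int):
    """Yield every address in the assumed format: 1..max_letters lowercase
    letters, optionally followed by a single digit, at the given domain."""
    for n in range(1, max_letters + 1):
        for letters in itertools.product(string.ascii_lowercase, repeat=n):
            local = "".join(letters)
            yield f"{local}@{domain}"
            for digit in string.digits:
                yield f"{local}{digit}@{domain}"


def recover(target_hashes, domain: str, max_letters: int) -> dict:
    """Map each target hash to the address that produces it, if one is found."""
    remaining = set(target_hashes)
    found = {}
    for address in candidates(domain, max_letters):
        digest = md5_hex(address)
        if digest in remaining:
            found[digest] = address
            remaining.discard(digest)
            if not remaining:  # stop early once every hash is resolved
                break
    return found


if __name__ == "__main__":
    # Toy demonstration on a placeholder domain with a tiny search space.
    target = md5_hex("ab3@example.edu")
    print(recover({target}, "example.edu", max_letters=2))
```

The cost grows by a factor of 26 with each additional letter, which is why a predictable, short address format matters so much to the attacker.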
In all, 673 comments were harvested, representing 290 unique hash-username pairs. Of these pairs, 98 (34%) were successfully associated with an RPI e-mail address. De-anonymization took 7 hours 37 minutes on a 2.83 GHz Intel Xeon CPU.
| Author | Gravatar Hash | E-mail Address |
|---|---|---|
| Lindy Hop | 2d0a2fd8af2df830d5ffe17904f376f2 | d--m@rpi.edu |
| HousingWoes | 43db340a19ee7a43d881d5cfc9ac1bf3 | s-----2@rpi.edu |
| Shirley Jackson | b6986f5e176b44e0d11c54882d445dba | l-----6@rpi.edu |
| Anonymous | f83c0460432e5c60e3fa1a32ead7eb09 | v---r@rpi.edu |
Further analysis of the results revealed several users who posted with multiple usernames but used the same e-mail address. In more than one case a user posted with his or her real identity as well as with a pseudonym.
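Matching identities does not even require recovering the address; grouping comments by hash is enough. A minimal sketch, using made-up author names and placeholder hash strings rather than the harvested data:

```python
from collections import defaultdict


def shared_identities(pairs):
    """Given (author, hash) pairs, return the hashes used under more than
    one author name, mapped to the sorted list of those names."""
    names_by_hash = defaultdict(set)
    for author, digest in pairs:
        names_by_hash[digest].add(author)
    return {
        digest: sorted(names)
        for digest, names in names_by_hash.items()
        if len(names) > 1
    }


if __name__ == "__main__":
    # Hypothetical data: "hash_a" stands in for a real 32-digit hash.
    comments = [
        ("Jane Doe", "hash_a"),
        ("CampusCritic", "hash_a"),
        ("Anonymous", "hash_b"),
    ]
    print(shared_identities(comments))
```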
In addition, correlating the results with the Rensselaer staff directory identified two staff members who posted anonymously.
It should be noted that this preimage attack was made feasible by the fact that RPI e-mail addresses follow a predictable format; for arbitrary addresses, the search space is too vast for a generalized brute-force attack. That said, a list of e-mail addresses harvested through other means (e.g., web crawling), or a guess aimed at a specific user, could be hashed and compared against the target hashes instead.
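The list-based variant amounts to a dictionary lookup. In the sketch below, the address list and the name `build_lookup` are hypothetical, and the normalization (trim and lowercase) is the same assumption as before:

```python
import hashlib


def build_lookup(addresses):
    """Precompute hash -> address for a list of known or guessed addresses."""
    table = {}
    for address in addresses:
        normalized = address.strip().lower()
        digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
        table[digest] = normalized
    return table


if __name__ == "__main__":
    # Hypothetical harvested list; any observed hash can then be checked
    # with a single dictionary lookup.
    table = build_lookup(["jane.doe@example.org", "j.smith@example.com"])
    observed = hashlib.md5(b"jane.doe@example.org").hexdigest()
    print(table.get(observed, "not in list"))
```

Once built, the table can be reused against every site that exposes these hashes, so the hashing cost is paid only once per address.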