Be careful with your random tokens

We use Resque extensively in Meldium. Most of what our product does involves interacting with third-party web apps, and in order to keep the site snappy (and avoid long-running Heroku requests), we have a very common pattern:

  1. Kick off a job (e.g. create user accounts in GitHub and Trello).
  2. Return a token to the client.
  3. Have the client poll on the token until the job has finished.

During a recent code review, we found a problem with this pattern that has security implications: the tokens we returned in step 2 were not sufficiently random, which opened up a potential vulnerability in our app. Specifically, the job_id returned by resque-status 0.3.3 and earlier is predictable. We've fixed the issue, but I wanted to share what we found to help you avoid a similar mistake in your app.

Full disclosure: the production version of Meldium was vulnerable to the attack I describe in this post. We've audited the related code and determined that, due to other checks we had in place, an attacker could not have used this exploit to gain any sensitive data. The worst case is that an attacker could have learned the status of another user's jobs (queued, working, failed, etc.). But depending on how your app is implemented, versions of this attack could be used to gain access to much more sensitive data.

Polling and Resque::Status

The vulnerable code in our app used version 0.3.3 of the Resque::Status plugin in addition to Resque, and exposed the job_ids generated by Resque::Status back to the end user. Here's roughly how those jobs worked:
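A minimal sketch of the pattern, not Meldium's actual code: `StatusStore` and `JobsEndpoint` are hypothetical names, an in-memory hash stands in for the status data that Resque::Status keeps in Redis, and the id generator is a stand-in for the library-generated job_id.

```ruby
# Hypothetical in-memory stand-in for the job status data kept in Redis.
class StatusStore
  def initialize
    @statuses = {}
  end

  def set(job_id, status)
    @statuses[job_id] = status
  end

  def get(job_id)
    @statuses[job_id]
  end
end

# Hypothetical web-facing endpoints; the names are illustrative only.
class JobsEndpoint
  def initialize(store, &id_generator)
    @store = store
    @id_generator = id_generator
  end

  # Steps 1 and 2: kick off the job, then hand its job_id to the client.
  def create
    job_id = @id_generator.call
    @store.set(job_id, "queued")
    { job_id: job_id }
  end

  # Step 3: the client polls with the job_id. Possession of the id is the
  # only check, so the id is effectively an authorization token.
  def status(job_id)
    { status: @store.get(job_id) || "unknown" }
  end
end

endpoint = JobsEndpoint.new(StatusStore.new) do
  # Stand-in for the library-generated job_id (assumption, not library code).
  format("%032x", Process.clock_gettime(Process::CLOCK_REALTIME, :nanosecond))
end

job_id = endpoint.create[:job_id]
puts endpoint.status(job_id)
```

The thing to notice is that `status` answers anyone who presents a valid job_id, with no check on who created the job.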

This code seems okay - we're passing a random-looking job_id back to the user to poll on. Since Meldium uses SSL everywhere, that token can't be sniffed by another user. However, the problem is that the job_ids are not random - they just look that way. An attacker who can generate one job can easily guess the IDs of other jobs.

UUIDs have low entropy

To see where this breaks down, look at how resque-status 0.3.3 generates its job IDs: it calls UUID.generate(:compact).

UUID.generate(:compact) gives you back a nice, random-looking string. 32 hex characters at 4 bits each gives us a 128-bit space.

But these UUIDs look a lot less random if you generate ten of them in a row: most of the characters are identical from one ID to the next.

The UUID gem docs tell us exactly what's going on here:

A UUID is 128 bit long, and consists of a 60-bit time value, a 16-bit sequence number and a 48-bit node identifier.

The time value is taken from the system clock, and is monotonically incrementing. However, since it is possible to set the system clock backward, a sequence number is added. The sequence number is incremented each time the UUID generator is started. The combination guarantees that identifiers created on the same machine are unique with a high degree of probability.
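To make that layout concrete, here is a simplified sketch that packs a time value, a sequence number and a node identifier into 32 hex characters. It is not the UUID gem's implementation: `v1_style_id` and `candidate_ids` are hypothetical helpers, the field values are made up, and real version 1 UUIDs also reserve a few bits for version and variant markers.

```ruby
# Pack the three fields into 32 hex characters, roughly in RFC 4122 order.
# Illustrative only; this is not how the uuid gem builds its output.
def v1_style_id(time_100ns, sequence, node)
  time_low = time_100ns & 0xFFFF_FFFF        # low 32 bits of the 60-bit time
  time_mid = (time_100ns >> 32) & 0xFFFF     # next 16 bits
  time_hi  = (time_100ns >> 48) & 0x0FFF     # top 12 bits
  format("%08x%04x%04x%04x%012x",
         time_low, time_mid, 0x1000 | time_hi,
         sequence & 0xFFFF, node & 0xFFFF_FFFF_FFFF)
end

# An attacker who holds one id already knows the sequence and node fields
# and an approximate time, so guessing means walking the time field forward.
def candidate_ids(known_time, sequence, node, window)
  (1..window).map { |delta| v1_style_id(known_time + delta, sequence, node) }
end

# Made-up values for one long-running worker process.
node      = 0x0123456789ab
sequence  = 0x2a2a
base_time = 0x1e2d3c4b5a69788

ids = (0...3).map { |i| v1_style_id(base_time + i * 10, sequence, node) }
ids.each { |id| puts id }

# Everything after the first 8 characters is identical across the ids.
puts ids.map { |id| id[8..-1] }.uniq.size
```

Only the time field changes between ids generated close together, so the space an attacker has to search is the size of the time window, not 2^128.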

And RFC 4122, which defines the UUID standard, goes one step further. (Update: thanks to Douglas Murth for pointing out in the comments that this is a "version 1" UUID. A "version 4" UUID uses 122 pseudorandom bits.)

Do not assume that UUIDs are hard to guess; they should not be used as security capabilities (identifiers whose mere possession grants access), for example. A predictable random number source will exacerbate the situation.

It's important to note that this problem is not a bug or vulnerability in resque-status - the library makes no claim to return random job IDs. The bug here is in the application misusing the IDs by treating them as authorization tokens.

Fixing the glitch

So how did we fix the bug? One option would be to upgrade to a newer version (0.4.0 or later) of resque-status, which now uses SecureRandom.hex to generate job_ids. This isn't really a reliable fix: resque-status makes no promises about the IDs it returns, and the implementation may change again in the future.

Because we're paranoid about these things, we're generating our own random tokens (using SecureRandom.hex) and passing them into each job. That way we know where the tokens are coming from, independent of changes in our library code, and we don't need an extra authentication check on job ownership. Here's a snippet of our base Job class that shows how we return results from a Resque job:
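A minimal sketch of that approach, not our production class: `BaseJob`, `EchoJob`, `start`, `run` and the `RESULTS` hash are hypothetical, and the in-memory hash stands in for wherever results are actually stored. The only library call it relies on is `SecureRandom.hex`, which by default returns 16 random bytes as 32 hex characters.

```ruby
require "securerandom"

# Hypothetical in-memory stand-in for the real result store (e.g. Redis).
RESULTS = {}

class BaseJob
  # Called by the web tier: generate the token ourselves, record the job,
  # and return the token for the client to poll on.
  def self.start(*args)
    token = SecureRandom.hex
    RESULTS[token] = { status: "queued" }
    # A real app would enqueue here, e.g. Resque.enqueue(self, token, *args)
    token
  end

  # Called by the worker, with our token as the first argument.
  def self.perform(token, *args)
    RESULTS[token] = { status: "completed", result: run(*args) }
  rescue StandardError => e
    RESULTS[token] = { status: "failed", error: e.message }
  end
end

# Example subclass: the actual work lives in `run`.
class EchoJob < BaseJob
  def self.run(message)
    message.upcase
  end
end

token = EchoJob.start("hello")
EchoJob.perform(token, "hello")   # in production the worker does this
puts RESULTS[token][:status]
```

Because the token comes from our own code, a change in how the library generates its job IDs cannot make it predictable again.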

And remember, this is just one instance of a class of vulnerabilities. Every time your app gives out a handle or token that authorizes the user to do something or read some sensitive data, think about where that token comes from and whether it can be guessed.