[java] How good is Java's UUID.randomUUID?

I know that randomized UUIDs have a very, very, very low probability for collision in theory, but I am wondering, in practice, how good Java's randomUUID() is in terms of not having collision? Does anybody have any experience to share?

This question is related to java uuid

The answer is


Many of the answers discuss how many UUIDs would have to be generated to reach a 50% chance of a collision. But a 50%, 25%, or even 1% chance of collision is worthless for an application where collision must be (virtually) impossible.

Do programmers routinely dismiss as "impossible" other events that can and do occur?

When we write data to a disk or memory and read it back again, we take for granted that the data are correct. We rely on the device's error correction to detect any corruption. But the chance of undetected errors is actually around 2-50.

Wouldn't it make sense to apply a similar standard to random UUIDs? If you do, you will find that an "impossible" collision is possible in a collection of around 100 billion random UUIDs (236.5).

This is an astronomical number, but applications like itemized billing in a national healthcare system, or logging high frequency sensor data on a large array of devices could definitely bump into these limits. If you are writing the next Hitchhiker's Guide to the Galaxy, don't try to assign UUIDs to each article!


Since most answers focused on the theory I think I can add something to the discussion by giving a practical test I did. In my database I have around 4.5 million UUIDs generated using Java 8 UUID.randomUUID(). The following ones are just some I found out:

c0f55f62-b990-47bc-8caa-f42313669948

c0f55f62-e81e-4253-8299-00b4322829d5

c0f55f62-4979-4e87-8cd9-1c556894e2bb


b9ea2498-fb32-40ef-91ef-0ba00060fe64

be87a209-2114-45b3-9d5a-86d00060fe64


4a8a74a6-e972-4069-b480-bdea1177b21f

12fb4958-bee2-4c89-8cf8-edea1177b21f

If it was truly random, the probability of having these kind of similar UUIDs would be considerably low (see edit), since we're considering only 4.5 million entries. So, although this function is good, in terms of not having collisions, for me it doesn't seem that good as it would be in theory.

Edit:

A lot of people seem to not understand this answer so I'll clarify my point: I know that the similarities are "small" and far from a full collision. However, I just wanted to compare the Java's UUID.randomUUID() with a true random number generator, which is the actual question.

In a true random number generator, the probability of the last case happening would be around = 0.007%. Therefore, I think my conclusion stands.

Formula is explained in this wiki article en.wikipedia.org/wiki/Birthday_problem


The original generation scheme for UUIDs was to concatenate the UUID version with the MAC address of the computer that is generating the UUID, and with the number of 100-nanosecond intervals since the adoption of the Gregorian calendar in the West. By representing a single point in space (the computer) and time (the number of intervals), the chance of a collision in values is effectively nil.


Wikipedia has a very good answer http://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions

the number of random version 4 UUIDs which need to be generated in order to have a 50% probability of at least one collision is 2.71 quintillion, computed as follows:

...

This number is equivalent to generating 1 billion UUIDs per second for about 85 years, and a file containing this many UUIDs, at 16 bytes per UUID, would be about 45 exabytes, many times larger than the largest databases currently in existence, which are on the order of hundreds of petabytes.

...

Thus, for there to be a one in a billion chance of duplication, 103 trillion version 4 UUIDs must be generated.


At a former employer we had a unique column that contained a random uuid. We got a collision the first week after it was deployed. Sure, the odds are low but they aren't zero. That is why Log4j 2 contains UuidUtil.getTimeBasedUuid. It will generate a UUID that is unique for 8,925 years so long as you don't generate more than 10,000 UUIDs/millisecond on a single server.


I'm not an expert, but I'd assume that enough smart people looked at Java's random number generator over the years. Hence, I'd also assume that random UUIDs are good. So you should really have the theoretical collision probability (which is about 1 : 3 × 10^38 for all possible UUIDs. Does anybody know how this changes for random UUIDs only? Is it 1/(16*4) of the above?)

From my practical experience, I've never seen any collisions so far. I'll probably have grown an astonishingly long beard the day I get my first one ;)


We have been using the Java's random UUID in our application for more than one year and that to very extensively. But we never come across of having collision.


Does anybody have any experience to share?

There are 2^122 possible values for a type-4 UUID. (The spec says that you lose 2 bits for the type, and a further 4 bits for a version number.)

Assuming that you were to generate 1 million random UUIDs a second, the chances of a duplicate occurring in your lifetime would be vanishingly small. And to detect the duplicate, you'd have to solve the problem of comparing 1 million new UUIDs per second against all of the UUIDs you have previously generated1!

The chances that anyone has experienced (i.e. actually noticed) a duplicate in real life are even smaller than vanishingly small ... because of the practical difficulty of looking for collisions.

Now of course, you will typically be using a pseudo-random number generator, not a source of truly random numbers. But I think we can be confident that if you are using a creditable provider for your cryptographic strength random numbers, then it will be cryptographic strength, and the probability of repeats will be the same as for an ideal (non-biased) random number generator.

However, if you were to use a JVM with a "broken" crypto- random number generator, all bets are off. (And that might include some of the workarounds for "shortage of entropy" problems on some systems. Or the possibility that someone has tinkered with your JRE, either on your system or upstream.)


1 - Assuming that you used "some kind of binary btree" as proposed by an anonymous commenter, each UUID is going to need O(NlogN) bits of RAM memory to represent N distinct UUIDs assuming low density and random distribution of the bits. Now multiply that by 1,000,000 and the number of seconds that you are going to run the experiment for. I don't think that is practical for the length of time needed to test for collisions of a high quality RNG. Not even with (hypothetical) clever representations.


UUID uses java.security.SecureRandom, which is supposed to be "cryptographically strong". While the actual implementation is not specified and can vary between JVMs (meaning that any concrete statements made are valid only for one specific JVM), it does mandate that the output must pass a statistical random number generator test.

It's always possible for an implementation to contain subtle bugs that ruin all this (see OpenSSH key generation bug) but I don't think there's any concrete reason to worry about Java UUIDs's randomness.


I play at lottery last year, and I've never won .... but it seems that there lottery has winners ...

doc : http://tools.ietf.org/html/rfc4122

Type 1 : not implemented. collision are possible if the uuid is generated at the same moment. impl can be artificially a-synchronize in order to bypass this problem.

Type 2 : never see a implementation.

Type 3 : md5 hash : collision possible (128 bits-2 technical bytes)

Type 4 : random : collision possible (as lottery). note that the jdk6 impl dont use a "true" secure random because the PRNG algorithm is not choose by developer and you can force system to use a "poor" PRNG algo. So your UUID is predictable.

Type 5 : sha1 hash : not implemented : collision possible (160 bit-2 technical bytes)