[encryption] Hash function that produces short hashes?

Is there a hashing method that can take a string of any length and produce a sub-10-character hash? I want to produce reasonably unique IDs based on message contents rather than randomly.

If arbitrary-length strings are impossible, though, I can live with constraining the messages to integer values. In that case, however, the hashes of two consecutive integers must not be similar.

Tags: encryption, uniqueidentifier


If you don't need an algorithm that's strong against intentional modification, I've found a checksum called Adler-32 that produces pretty short (~8-character) results. Choose it from the dropdown here to try it out:

http://www.sha1-online.com/
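If you'd rather compute it locally, Adler-32 is in Python's standard zlib module. A minimal sketch (the input string is just an example):

import zlib

# Adler-32 returns an unsigned 32-bit integer, which is at most
# 8 hex characters when formatted.
checksum = zlib.adler32(b"some message")
print(f"{checksum:08x}")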


Just summarizing an answer that was helpful to me (noting @erasmospunk's comment about using base-64 encoding). My goal was to have a short string that was mostly unique...

I'm no expert, so please correct this if it has any glaring errors (in Python again like the accepted answer):

import base64
import hashlib
import uuid

unique_id = uuid.uuid4()
# unique_id = UUID('8da617a7-0bd6-4cce-ae49-5d31f2a5a35f')

sha1 = hashlib.sha1(str(unique_id).encode("UTF-8"))  # named 'sha1' to avoid shadowing the built-in hash()
# sha1.hexdigest() = '882efb0f24a03938e5898aa6b69df2038a2c3f0e'

result = base64.b64encode(sha1.digest())
# result = b'iC77DySgOTjliYqmtp3yA4osPw4='

The result here uses more than just hex characters (which is what you'd get from sha1.hexdigest()), so a string of the same length carries more bits and is less likely to collide; that is, it should be safer to truncate than a hex digest.

Note: Using UUID4 (random). See http://en.wikipedia.org/wiki/Universally_unique_identifier for the other types.
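Putting it together, here's a minimal sketch of the truncation this enables (urlsafe_b64encode and the 10-character cutoff are my choices, not part of the original answer; the URL-safe variant keeps '/' and '+' out of the IDs):

import base64
import hashlib
import uuid

digest = hashlib.sha1(str(uuid.uuid4()).encode("UTF-8")).digest()
short_id = base64.urlsafe_b64encode(digest)[:10].decode("ascii")
# Each base-64 character carries 6 bits, so 10 characters keep
# ~60 bits, versus ~40 bits for 10 hex characters.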


Simply run this in a terminal (on macOS or Linux):

crc32 <(echo "some string")

The output is 8 hex characters long.
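Note that echo appends a trailing newline, which becomes part of the checksummed data. A Python equivalent using the standard zlib module (a sketch) would be:

import zlib

data = b"some string\n"  # echo appends the trailing newline
print(f"{zlib.crc32(data):08x}")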


If you need "sub-10-character hash" you could use Fletcher-32 algorithm which produces 8 character hash (32 bits), CRC-32 or Adler-32.

CRC-32 is slower than Adler-32 by 20% to 100%.

Fletcher-32 is slightly more reliable than Adler-32 and has a lower computational cost than the Adler checksum (see this Fletcher vs. Adler comparison).

A sample program with a few Fletcher implementations is given below:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h> // for uint32_t

    uint32_t fletcher32_1(const uint16_t *data, size_t len)
    {
            uint32_t c0, c1;
            unsigned int i;

            for (c0 = c1 = 0; len >= 360; len -= 360) {
                    for (i = 0; i < 360; ++i) {
                            c0 = c0 + *data++;
                            c1 = c1 + c0;
                    }
                    c0 = c0 % 65535;
                    c1 = c1 % 65535;
            }
            for (i = 0; i < len; ++i) {
                    c0 = c0 + *data++;
                    c1 = c1 + c0;
            }
            c0 = c0 % 65535;
            c1 = c1 % 65535;
            return (c1 << 16 | c0);
    }

    uint32_t fletcher32_2(const uint16_t *data, size_t l)
    {
        uint32_t sum1 = 0xffff, sum2 = 0xffff;

        while (l) {
            unsigned tlen = l > 359 ? 359 : l;
            l -= tlen;
            do {
                sum2 += sum1 += *data++;
            } while (--tlen);
            sum1 = (sum1 & 0xffff) + (sum1 >> 16);
            sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        }
        /* Second reduction step to reduce sums to 16 bits */
        sum1 = (sum1 & 0xffff) + (sum1 >> 16);
        sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        return (sum2 << 16) | sum1;
    }

    int main(void)
    {
        const char *str1 = "abcde";
        const char *str2 = "abcdef";

        size_t len1 = (strlen(str1) + 1) / 2; // round up to 16-bit words; the trailing '\0' pads odd lengths
        size_t len2 = (strlen(str2) + 1) / 2;

        uint32_t f1 = fletcher32_1((const uint16_t *)str1, len1);
        uint32_t f2 = fletcher32_2((const uint16_t *)str1, len1);

        printf("%u %X \n",   f1, f1);
        printf("%u %X \n\n", f2, f2);

        f1 = fletcher32_1((const uint16_t *)str2, len2);
        f2 = fletcher32_2((const uint16_t *)str2, len2);

        printf("%u %X \n", f1, f1);
        printf("%u %X \n", f2, f2);

        return 0;
    }

Output:

4031760169 F04FC729
4031760169 F04FC729

1448095018 56502D2A
1448095018 56502D2A

This agrees with the test vectors:

"abcde"  -> 4031760169 (0xF04FC729)
"abcdef" -> 1448095018 (0x56502D2A)

Adler-32 has a weakness for short messages of a few hundred bytes, because the checksums for these messages have poor coverage of the 32 available bits. Check this:

The Adler32 algorithm is not complex enough to compete with comparable checksums.
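You can see the poor coverage directly: for short inputs the two internal sums stay small, so the high bits are mostly zero. A quick illustration:

import zlib

# For short inputs, Adler-32 values cluster in a narrow band of the
# 32-bit space: the sums barely grow past the byte values themselves.
for s in (b"a", b"ab", b"abc"):
    print(s, f"{zlib.adler32(s):08x}")
# b'a'   00620062
# b'ab'  012600c4
# b'abc' 024d0127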


You can use Python's hashlib library. The shake_128 and shake_256 algorithms provide variable-length hashes. Here's some working code (Python 3):

>>> import hashlib
>>> my_string = 'hello shake'
>>> hashlib.shake_256(my_string.encode()).hexdigest(5)
'34177f6a0a'

Notice that with a length parameter x (5 in the example) the function returns a hash value of length 2x: the parameter counts bytes, and each byte is rendered as two hex characters.
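For comparison, the raw digest of the same input is x bytes:

>>> hashlib.shake_256(my_string.encode()).digest(5)
b'4\x17\x7fj\n'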


I needed something along the lines of a simple string reduction function recently. Basically, the code looked something like this (C/C++ code ahead):

size_t ReduceString(char *Dest, size_t DestSize, const char *Src, size_t SrcSize, bool Normalize)
{
    size_t x, x2 = 0, z = 0;

    memset(Dest, 0, DestSize);

    for (x = 0; x < SrcSize; x++)
    {
        Dest[x2] = (char)(((unsigned int)(unsigned char)Dest[x2]) * 37 + ((unsigned int)(unsigned char)Src[x]));
        x2++;

        if (x2 == DestSize - 1)
        {
            x2 = 0;
            z++;
        }
    }

    // Normalize the alphabet if it looped.
    if (z && Normalize)
    {
        unsigned char TempChr;
        size_t y = (z > 1 ? DestSize - 1 : x2);
        for (x = 1; x < y; x++)
        {
            TempChr = ((unsigned char)Dest[x]) & 0x3F;

            if (TempChr < 10)  TempChr += '0';
            else if (TempChr < 36)  TempChr = TempChr - 10 + 'A';
            else if (TempChr < 62)  TempChr = TempChr - 36 + 'a';
            else if (TempChr == 62)  TempChr = '_';
            else  TempChr = '-';

            Dest[x] = (char)TempChr;
        }
    }

    return (SrcSize < DestSize ? SrcSize : DestSize);
}

It probably has more collisions than might be desired, but it isn't intended for use as a cryptographic hash function. If you get too many collisions, try various multipliers (i.e. change the 37 to another prime number). One of the interesting features of this snippet is that when Src is shorter than Dest, Dest ends up containing the input string as-is (0 * 37 + value = value). If you want something "readable" at the end of the process, Normalize will adjust the transformed bytes at the cost of increasing collisions.

Source:

https://github.com/cubiclesoft/cross-platform-cpp/blob/master/sync/sync_util.cpp
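For readers who don't speak C, here is a rough Python transliteration of the core loop above (normalization omitted; reduce_string is just an illustrative name):

def reduce_string(src: bytes, dest_size: int) -> bytes:
    # Accumulate each input byte into the output buffer with a
    # multiply-by-37 rolling hash, wrapping before the final byte
    # (which the C version reserves as a NUL terminator).
    dest = bytearray(dest_size)
    x2 = 0
    for b in src:
        dest[x2] = (dest[x2] * 37 + b) & 0xFF
        x2 = (x2 + 1) % (dest_size - 1)
    return bytes(dest)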


You need to hash the contents to come up with a digest. There are many hashes available, but 10 characters is pretty small for the result set. Way back, people used CRC-32, which produces a 32-bit hash (4 bytes). There is also CRC-64, which produces a 64-bit hash. MD5, which produces a 128-bit (16-byte) hash, is considered broken for cryptographic purposes because two messages can be found which have the same hash. It should go without saying that any time you create a 16-byte digest out of an arbitrary-length message you're going to end up with duplicates. The shorter the digest, the greater the risk of collisions.

However, your concern that the hash not be similar for two consecutive messages (whether integers or not) is addressed by any good hash: even a single-bit change in the original message should produce a vastly different digest (the avalanche effect).

So, using something like CRC-64 (and base-64'ing the result) should get you in the neighborhood you're looking for.
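CRC-64 isn't in the Python standard library, but the recipe looks like this with CRC-32 from zlib (a sketch; substitute any 64-bit checksum for more bits):

import base64
import struct
import zlib

crc = zlib.crc32(b"some message")
token = base64.urlsafe_b64encode(struct.pack(">I", crc)).rstrip(b"=")
print(token.decode("ascii"))  # 6 characters encoding 32 bits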


It is now 2019 and there are better options, namely xxHash.

~ echo test | xxhsum
2d7f1808da1fa63c  stdin
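The default xxhsum output above is the 64-bit variant (16 hex characters); the 32-bit variant fits the sub-10-character budget. A sketch using the third-party xxhash Python package (an assumption on my part; it's installed separately):

import xxhash  # third-party: pip install xxhash

print(xxhash.xxh32(b"test\n").hexdigest())  # 8 hex characters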

You could use an existing hash algorithm that produces something short, like MD5 (128 bits) or SHA-1 (160 bits). Then you can shorten that further by XORing sections of the digest with other sections. This increases the chance of collisions, but not as much as simply truncating the digest.

Also, you could include the length of the original data as part of the result to make it more unique. For example, XORing the first half of an MD5 digest with the second half yields 64 bits. Add 32 bits for the length of the data (or fewer if you know the length will always fit into fewer bits). That gives a 96-bit (12-byte) result that you could render as a 24-character hex string, or shorter still with base-64 encoding.
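A minimal sketch of that folding scheme (short_hash is an illustrative name; only hashlib, base64, and struct from the standard library are assumed):

import base64
import hashlib
import struct

def short_hash(data: bytes) -> str:
    digest = hashlib.md5(data).digest()                            # 16 bytes
    folded = bytes(a ^ b for a, b in zip(digest[:8], digest[8:]))  # fold to 8 bytes
    tagged = folded + struct.pack(">I", len(data) & 0xFFFFFFFF)    # append 32-bit length
    return base64.b64encode(tagged).decode("ascii")                # 16 characters

print(short_hash(b"hello world"))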