[regex] RegEx to parse or validate Base64 data

Is it possible to use a RegEx to validate, or sanitize Base64 data? That's the simple question, but the factors that drive this question are what make it difficult.

I have a Base64 decoder that can not fully rely on the input data to follow the RFC specs. So, the issues I face are issues like perhaps Base64 data that may not be broken up into 78 (I think it's 78, I'd have to double check the RFC, so don't ding me if the exact number is wrong) character lines, or that the lines may not end in CRLF; in that it may have only a CR, or LF, or maybe neither.

So, I've had a hell of a time parsing Base64 data formatted as such. Due to this, examples like the following become impossible to decode reliably. I will only display partial MIME headers for brevity.

Content-Transfer-Encoding: base64

VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu

Ok, so parsing that is no problem, and is exactly the result we would expect. And in 99% of the cases, using any code to at least verify that each char in the buffer is a valid base64 char, works perfectly. But, the next example throws a wrench into the mix.

Content-Transfer-Encoding: base64

http://www.stackoverflow.com
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu

This a version of Base64 encoding that I have seen in some viruses and other things that attempt to take advantage of some mail readers desire to parse mime at all costs, versus ones that go strictly by the book, or rather RFC; if you will.

My Base64 decoder decodes the second example to the following data stream. And keep in mind here, the original stream is all ASCII data!

[0x]86DB69FFFC30C2CB5A724A2F7AB7E5A307289951A1A5CC81A5CC81CDA5B5C1B19481054D0D
2524810985CD94D8D08199BDC8814DD1858DAD3DD995C999B1BDDC8195E1B585C1B194B8

Anyone have a good way to solve both problems at once? I'm not sure it's even possible, outside of doing two transforms on the data with different rules applied, and comparing the results. However if you took that approach, which output do you trust? It seems that ASCII heuristics is about the best solution, but how much more code, execution time, and complexity would that add to something as complicated as a virus scanner, which this code is actually involved in? How would you train the heuristics engine to learn what is acceptable Base64, and what isn't?


UPDATE:

Do to the number of views this question continues to get, I've decided to post the simple RegEx that I've been using in a C# application for 3 years now, with hundreds of thousands of transactions. Honestly, I like the answer given by Gumbo the best, which is why I picked it as the selected answer. But to anyone using C#, and looking for a very quick way to at least detect whether a string, or byte[] contains valid Base64 data or not, I've found the following to work very well for me.

[^-A-Za-z0-9+/=]|=[^=]|={3,}$

And yes, this is just for a STRING of Base64 data, NOT a properly formatted RFC1341 message. So, if you are dealing with data of this type, please take that into account before attempting to use the above RegEx. If you are dealing with Base16, Base32, Radix or even Base64 for other purposes (URLs, file names, XML Encoding, etc.), then it is highly recommend that you read RFC4648 that Gumbo mentioned in his answer as you need to be well aware of the charset and terminators used by the implementation before attempting to use the suggestions in this question/answer set.

This question is related to regex base64 standards-compliance rfc

The answer is


The answers presented so far fail to check that the Base64 string has all pad bits set to 0, as required for it to be the canonical representation of Base64 (which is important in some environments, see https://tools.ietf.org/html/rfc4648#section-3.5) and therefore, they allow aliases that are different encodings for the same binary string. This could be a security problem in some applications.

Here is the regexp that verifies that the given string is not just valid base64, but also the canonical base64 string for the binary data:

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$

The cited RFC considers the empty string as valid (see https://tools.ietf.org/html/rfc4648#section-10) therefore the above regex also does.

The equivalent regular expression for base64url (again, refer to the above RFC) is:

^(?:[A-Za-z0-9_-]{4})*(?:[A-Za-z0-9_-][AQgw]==|[A-Za-z0-9_-]{2}[AEIMQUYcgkosw048]=)?$

Neither a ":" nor a "." will show up in valid Base64, so I think you can unambiguously throw away the http://www.stackoverflow.com line. In Perl, say, something like

my $sanitized_str = join q{}, grep {!/[^A-Za-z0-9+\/=]/} split /\n/, $str;

say decode_base64($sanitized_str);

might be what you want. It produces

This is simple ASCII Base64 for StackOverflow exmaple.


The answers presented so far fail to check that the Base64 string has all pad bits set to 0, as required for it to be the canonical representation of Base64 (which is important in some environments, see https://tools.ietf.org/html/rfc4648#section-3.5) and therefore, they allow aliases that are different encodings for the same binary string. This could be a security problem in some applications.

Here is the regexp that verifies that the given string is not just valid base64, but also the canonical base64 string for the binary data:

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$

The cited RFC considers the empty string as valid (see https://tools.ietf.org/html/rfc4648#section-10) therefore the above regex also does.

The equivalent regular expression for base64url (again, refer to the above RFC) is:

^(?:[A-Za-z0-9_-]{4})*(?:[A-Za-z0-9_-][AQgw]==|[A-Za-z0-9_-]{2}[AEIMQUYcgkosw048]=)?$

Neither a ":" nor a "." will show up in valid Base64, so I think you can unambiguously throw away the http://www.stackoverflow.com line. In Perl, say, something like

my $sanitized_str = join q{}, grep {!/[^A-Za-z0-9+\/=]/} split /\n/, $str;

say decode_base64($sanitized_str);

might be what you want. It produces

This is simple ASCII Base64 for StackOverflow exmaple.


^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$

This one is good, but will match an empty String

This one does not match empty string :

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{4})$

Neither a ":" nor a "." will show up in valid Base64, so I think you can unambiguously throw away the http://www.stackoverflow.com line. In Perl, say, something like

my $sanitized_str = join q{}, grep {!/[^A-Za-z0-9+\/=]/} split /\n/, $str;

say decode_base64($sanitized_str);

might be what you want. It produces

This is simple ASCII Base64 for StackOverflow exmaple.


Neither a ":" nor a "." will show up in valid Base64, so I think you can unambiguously throw away the http://www.stackoverflow.com line. In Perl, say, something like

my $sanitized_str = join q{}, grep {!/[^A-Za-z0-9+\/=]/} split /\n/, $str;

say decode_base64($sanitized_str);

might be what you want. It produces

This is simple ASCII Base64 for StackOverflow exmaple.


To validate base64 image we can use this regex

/^data:image/(?:gif|png|jpeg|bmp|webp)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}

  private validBase64Image(base64Image: string): boolean {
    const regex = /^data:image\/(?:gif|png|jpeg|bmp|webp)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}/;
    return base64Image && regex.test(base64Image);
  }

Here's an alternative regular expression:

^(?=(.{4})*$)[A-Za-z0-9+/]*={0,2}$

It satisfies the following conditions:

  • The string length must be a multiple of four - (?=^(.{4})*$)
  • The content must be alphanumeric characters or + or / - [A-Za-z0-9+/]*
  • It can have up to two padding (=) characters on the end - ={0,2}
  • It accepts empty strings

The best regexp which I could find up till now is in here https://www.npmjs.com/package/base64-regex

which is in the current version looks like:

module.exports = function (opts) {
  opts = opts || {};
  var regex = '(?:[A-Za-z0-9+\/]{4}\\n?)*(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)';

  return opts.exact ? new RegExp('(?:^' + regex + '$)') :
                    new RegExp('(?:^|\\s)' + regex, 'g');
};

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$

This one is good, but will match an empty String

This one does not match empty string :

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{4})$

To validate base64 image we can use this regex

/^data:image/(?:gif|png|jpeg|bmp|webp)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}

  private validBase64Image(base64Image: string): boolean {
    const regex = /^data:image\/(?:gif|png|jpeg|bmp|webp)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}/;
    return base64Image && regex.test(base64Image);
  }

Here's an alternative regular expression:

^(?=(.{4})*$)[A-Za-z0-9+/]*={0,2}$

It satisfies the following conditions:

  • The string length must be a multiple of four - (?=^(.{4})*$)
  • The content must be alphanumeric characters or + or / - [A-Za-z0-9+/]*
  • It can have up to two padding (=) characters on the end - ={0,2}
  • It accepts empty strings

The best regexp which I could find up till now is in here https://www.npmjs.com/package/base64-regex

which is in the current version looks like:

module.exports = function (opts) {
  opts = opts || {};
  var regex = '(?:[A-Za-z0-9+\/]{4}\\n?)*(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)';

  return opts.exact ? new RegExp('(?:^' + regex + '$)') :
                    new RegExp('(?:^|\\s)' + regex, 'g');
};

Examples related to regex

Why my regexp for hyphenated words doesn't work? grep's at sign caught as whitespace Preg_match backtrack error regex match any single character (one character only) re.sub erroring with "Expected string or bytes-like object" Only numbers. Input number in React Visual Studio Code Search and Replace with Regular Expressions Strip / trim all strings of a dataframe return string with first match Regex How to capture multiple repeated groups?

Examples related to base64

How to convert an Image to base64 string in java? How to convert file to base64 in JavaScript? How to convert Base64 String to javascript file object like as from file input form? How can I encode a string to Base64 in Swift? ReadFile in Base64 Nodejs Base64: java.lang.IllegalArgumentException: Illegal character Converting file into Base64String and back again Convert base64 string to image How to encode text to base64 in python Convert base64 string to ArrayBuffer

Examples related to standards-compliance

Declaration of Methods should be Compatible with Parent Methods in PHP What is the "-->" operator in C/C++? Is it valid to have a html form inside another html form? RegEx to parse or validate Base64 data Can an html element have multiple ids?

Examples related to rfc

Are email addresses case sensitive? What is the behavior difference between return-path, reply-to and from? How to overcome root domain CNAME restrictions? RegEx to parse or validate Base64 data