[php] urlencode vs rawurlencode?

Proof is in the source code of PHP.

I'll take you through a quick process of how to find out this sort of thing on your own in the future any time you want. Bear with me, there'll be a lot of C source code you can skim over (I explain it). If you want to brush up on some C, a good place to start is our SO wiki.

Download the source (or use http://lxr.php.net/ to browse it online), grep all the files for the function name, you'll find something such as this:

PHP 5.3.6 (most recent at time of writing) describes the two functions in their native C code in the file url.c.

RawUrlEncode()

PHP_FUNCTION(rawurlencode)
{
    char *in_str, *out_str;
    int in_str_len, out_str_len;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &in_str,
                              &in_str_len) == FAILURE) {
        return;
    }

    out_str = php_raw_url_encode(in_str, in_str_len, &out_str_len);
    RETURN_STRINGL(out_str, out_str_len, 0);
}

UrlEncode()

PHP_FUNCTION(urlencode)
{
    char *in_str, *out_str;
    int in_str_len, out_str_len;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &in_str,
                              &in_str_len) == FAILURE) {
        return;
    }

    out_str = php_url_encode(in_str, in_str_len, &out_str_len);
    RETURN_STRINGL(out_str, out_str_len, 0);
}

Okay, so what's different here?

They both are in essence calling two different internal functions respectively: php_raw_url_encode and php_url_encode

So go look for those functions!

Lets look at php_raw_url_encode

PHPAPI char *php_raw_url_encode(char const *s, int len, int *new_length)
{
    register int x, y;
    unsigned char *str;

    str = (unsigned char *) safe_emalloc(3, len, 1);
    for (x = 0, y = 0; len--; x++, y++) {
        str[y] = (unsigned char) s[x];
#ifndef CHARSET_EBCDIC
        if ((str[y] < '0' && str[y] != '-' && str[y] != '.') ||
            (str[y] < 'A' && str[y] > '9') ||
            (str[y] > 'Z' && str[y] < 'a' && str[y] != '_') ||
            (str[y] > 'z' && str[y] != '~')) {
            str[y++] = '%';
            str[y++] = hexchars[(unsigned char) s[x] >> 4];
            str[y] = hexchars[(unsigned char) s[x] & 15];
#else /*CHARSET_EBCDIC*/
        if (!isalnum(str[y]) && strchr("_-.~", str[y]) != NULL) {
            str[y++] = '%';
            str[y++] = hexchars[os_toascii[(unsigned char) s[x]] >> 4];
            str[y] = hexchars[os_toascii[(unsigned char) s[x]] & 15];
#endif /*CHARSET_EBCDIC*/
        }
    }
    str[y] = '\0';
    if (new_length) {
        *new_length = y;
    }
    return ((char *) str);
}

And of course, php_url_encode:

PHPAPI char *php_url_encode(char const *s, int len, int *new_length)
{
    register unsigned char c;
    unsigned char *to, *start;
    unsigned char const *from, *end;

    from = (unsigned char *)s;
    end = (unsigned char *)s + len;
    start = to = (unsigned char *) safe_emalloc(3, len, 1);

    while (from < end) {
        c = *from++;

        if (c == ' ') {
            *to++ = '+';
#ifndef CHARSET_EBCDIC
        } else if ((c < '0' && c != '-' && c != '.') ||
                   (c < 'A' && c > '9') ||
                   (c > 'Z' && c < 'a' && c != '_') ||
                   (c > 'z')) {
            to[0] = '%';
            to[1] = hexchars[c >> 4];
            to[2] = hexchars[c & 15];
            to += 3;
#else /*CHARSET_EBCDIC*/
        } else if (!isalnum(c) && strchr("_-.", c) == NULL) {
            /* Allow only alphanumeric chars and '_', '-', '.'; escape the rest */
            to[0] = '%';
            to[1] = hexchars[os_toascii[c] >> 4];
            to[2] = hexchars[os_toascii[c] & 15];
            to += 3;
#endif /*CHARSET_EBCDIC*/
        } else {
            *to++ = c;
        }
    }
    *to = 0;
    if (new_length) {
        *new_length = to - start;
    }
    return (char *) start;
}

One quick bit of knowledge before I move forward, EBCDIC is another character set, similar to ASCII, but a total competitor. PHP attempts to deal with both. But basically, this means byte EBCDIC 0x4c byte isn't the L in ASCII, it's actually a <. I'm sure you see the confusion here.

Both of these functions manage EBCDIC if the web server has defined it.

Also, they both use an array of chars (think string type) hexchars look-up to get some values, the array is described as such:

/* rfc1738:

   ...The characters ";",
   "/", "?", ":", "@", "=" and "&" are the characters which may be
   reserved for special meaning within a scheme...

   ...Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
   reserved characters used for their reserved purposes may be used
   unencoded within a URL...

   For added safety, we only leave -_. unencoded.
 */

static unsigned char hexchars[] = "0123456789ABCDEF";

Beyond that, the functions are really different, and I'm going to explain them in ASCII and EBCDIC.

Differences in ASCII:

URLENCODE:

  • Calculates a start/end length of the input string, allocates memory
  • Walks through a while-loop, increments until we reach the end of the string
  • Grabs the present character
  • If the character is equal to ASCII Char 0x20 (ie, a "space"), add a + sign to the output string.
  • If it's not a space, and it's also not alphanumeric (isalnum(c)), and also isn't and _, -, or . character, then we , output a % sign to array position 0, do an array look up to the hexchars array for a lookup for os_toascii array (an array from Apache that translates char to hex code) for the key of c (the present character), we then bitwise shift right by 4, assign that value to the character 1, and to position 2 we assign the same lookup, except we preform a logical and to see if the value is 15 (0xF), and return a 1 in that case, or a 0 otherwise. At the end, you'll end up with something encoded.
  • If it ends up it's not a space, it's alphanumeric or one of the _-. chars, it outputs exactly what it is.

RAWURLENCODE:

  • Allocates memory for the string
  • Iterates over it based on length provided in function call (not calculated in function as with URLENCODE).

Note: Many programmers have probably never seen a for loop iterate this way, it's somewhat hackish and not the standard convention used with most for-loops, pay attention, it assigns x and y, checks for exit on len reaching 0, and increments both x and y. I know, it's not what you'd expect, but it's valid code.

  • Assigns the present character to a matching character position in str.
  • It checks if the present character is alphanumeric, or one of the _-. chars, and if it isn't, we do almost the same assignment as with URLENCODE where it preforms lookups, however, we increment differently, using y++ rather than to[1], this is because the strings are being built in different ways, but reach the same goal at the end anyway.
  • When the loop's done and the length's gone, It actually terminates the string, assigning the \0 byte.
  • It returns the encoded string.

Differences:

  • UrlEncode checks for space, assigns a + sign, RawURLEncode does not.
  • UrlEncode does not assign a \0 byte to the string, RawUrlEncode does (this may be a moot point)
  • They iterate differntly, one may be prone to overflow with malformed strings, I'm merely suggesting this and I haven't actually investigated.

They basically iterate differently, one assigns a + sign in the event of ASCII 20.

Differences in EBCDIC:

URLENCODE:

  • Same iteration setup as with ASCII
  • Still translating the "space" character to a + sign. Note-- I think this needs to be compiled in EBCDIC or you'll end up with a bug? Can someone edit and confirm this?
  • It checks if the present char is a char before 0, with the exception of being a . or -, OR less than A but greater than char 9, OR greater than Z and less than a but not a _. OR greater than z (yeah, EBCDIC is kinda messed up to work with). If it matches any of those, do a similar lookup as found in the ASCII version (it just doesn't require a lookup in os_toascii).

RAWURLENCODE:

  • Same iteration setup as with ASCII
  • Same check as described in the EBCDIC version of URL Encode, with the exception that if it's greater than z, it excludes ~ from the URL encode.
  • Same assignment as the ASCII RawUrlEncode
  • Still appending the \0 byte to the string before return.

Grand Summary

  • Both use the same hexchars lookup table
  • URIEncode doesn't terminate a string with \0, raw does.
  • If you're working in EBCDIC I'd suggest using RawUrlEncode, as it manages the ~ that UrlEncode does not (this is a reported issue). It's worth noting that ASCII and EBCDIC 0x20 are both spaces.
  • They iterate differently, one may be faster, one may be prone to memory or string based exploits.
  • URIEncode makes a space into +, RawUrlEncode makes a space into %20 via array lookups.

Disclaimer: I haven't touched C in years, and I haven't looked at EBCDIC in a really really long time. If I'm wrong somewhere, let me know.

Suggested implementations

Based on all of this, rawurlencode is the way to go most of the time. As you see in Jonathan Fingland's answer, stick with it in most cases. It deals with the modern scheme for URI components, where as urlencode does things the old school way, where + meant "space."

If you're trying to convert between the old format and new formats, be sure that your code doesn't goof up and turn something that's a decoded + sign into a space by accidentally double-encoding, or similar "oops" scenarios around this space/20%/+ issue.

If you're working on an older system with older software that doesn't prefer the new format, stick with urlencode, however, I believe %20 will actually be backwards compatible, as under the old standard %20 worked, just wasn't preferred. Give it a shot if you're up for playing around, let us know how it worked out for you.

Basically, you should stick with raw, unless your EBCDIC system really hates you. Most programmers will never run into EBCDIC on any system made after the year 2000, maybe even 1990 (that's pushing, but still likely in my opinion).

Examples related to php

I am receiving warning in Facebook Application using PHP SDK Pass PDO prepared statement to variables Parse error: syntax error, unexpected [ Preg_match backtrack error Removing "http://" from a string How do I hide the PHP explode delimiter from submitted form results? Problems with installation of Google App Engine SDK for php in OS X Laravel 4 with Sentry 2 add user to a group on Registration php & mysql query not echoing in html with tags? How do I show a message in the foreach loop?

Examples related to urlencode

How to URL encode in Python 3? How to encode a URL in Swift Swift - encode URL Java URL encoding of query string parameters Objective-C and Swift URL encoding How to URL encode a string in Ruby Should I URL-encode POST data? How to set the "Content-Type ... charset" in the request header using a HTML link Parsing GET request parameters in a URL that contains another URL How to urlencode a querystring in Python?

Examples related to url-encoding

A html space is showing as %2520 instead of %20 Sharing a URL with a query string on Twitter What is %2C in a URL? How to do URL decoding in Java? How to encode URL to avoid special characters in Java? urlencoded Forward slash is breaking URL Test if string is URL encoded in PHP URL encoding the space character: + or %20? In a URL, should spaces be encoded using %20 or +? urlencode vs rawurlencode?