Sanitizing strings to make them URL and filename safe

Question

I am trying to come up with a function that does a good job of sanitizing certain strings so that they are safe to use in the URL (like a post slug) and also safe to use as file names. For example, when someone uploads a file I want to make sure that I remove all dangerous characters from the name.

So far I have come up with the following function, which I hope solves this problem and allows foreign UTF-8 data also.

    /**
     * Convert a string to the file/URL safe "slug" form
     *
     * @param string $string the string to clean
     * @param bool $is_filename TRUE will allow additional filename characters
     * @return string
     */
    function sanitize($string = '', $is_filename = FALSE)
    {
        // Replace all weird characters with dashes
        $string = preg_replace('/[^\w\-'. ($is_filename ? '~_\.' : ''). ']+/u', '-', $string);

        // Only allow one dash separator at a time (and make string lowercase)
        return mb_strtolower(preg_replace('/--+/u', '-', $string), 'UTF-8');
    }

Does anyone have any tricky sample data I can run against this, or know of a better way to safeguard our apps from bad names?

$is_filename allows some additional characters, like temp vim files.

Update: removed the star character since I could not think of a valid use.

User · Answer

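This should make your filenames safe:

    $string = preg_replace(array('/\s/', '/\.[\.]+/', '/[^\w_\.\-]/'), array('_', '.', ''), $string);

A deeper solution to this is:

    // Remove special accented characters - ie. sí.
    $clean_name = strtr($string, array(
        'Š' => 'S', 'Ž' => 'Z', 'š' => 's', 'ž' => 'z', 'Ÿ' => 'Y',
        'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A',
        'Ç' => 'C', 'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E',
        'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I', 'Ñ' => 'N',
        'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O', 'Ø' => 'O',
        'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U', 'Ý' => 'Y',
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a',
        'ç' => 'c', 'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'ñ' => 'n',
        'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o', 'ø' => 'o',
        'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u', 'ý' => 'y', 'ÿ' => 'y'));
    $clean_name = strtr($clean_name, array(
        'Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss',
        'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u'));

    $clean_name = preg_replace(array('/\s/', '/\.[\.]+/', '/[^\w_\.\-]/'), array('_', '.', ''), $clean_name);

This assumes that you want a dot in the filename. If you want it converted to lowercase, just use

    $clean_name = strtolower($clean_name);

for the last line.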

User · Answer

Depending on how you will use it, you might want to add a length limit to protect against buffer overflows.
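To make the length-limit idea concrete, here is a minimal sketch. The function name `limit_filename_length` and the 255-byte default are assumptions for illustration (255 matches the per-name limit on common Linux filesystems), and it assumes the name has already been reduced to ASCII, since `substr` could otherwise cut a multi-byte character in half.

```php
<?php
// Hypothetical helper: cap a filename at $maxBytes bytes while keeping
// the extension, which usually matters more than the base name.
function limit_filename_length(string $name, int $maxBytes = 255): string
{
    if (strlen($name) <= $maxBytes) {
        return $name;
    }

    $ext    = pathinfo($name, PATHINFO_EXTENSION);
    $base   = pathinfo($name, PATHINFO_FILENAME);
    $suffix = $ext === '' ? '' : '.' . $ext;

    // Bytes left for the base name once the extension is accounted for.
    $room = max(0, $maxBytes - strlen($suffix));

    // The outer substr() is a guard for absurdly long extensions.
    return substr(substr($base, 0, $room) . $suffix, 0, $maxBytes);
}
```

Truncating the base name rather than the whole string means `very-long-name.txt` still ends in `.txt` after shortening.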

User · Answer

CLEAN ILLEGAL CHARACTERS function clean filename  source file         search               search       amp         search               search               search               search               search               search               search               search               search               search               search               search                replace               replace      and        replace      S        replace               replace              replace              replace              replace              replace              replace              replace              replace              replace              replace              return str replace  search  replace  source file

User · Answer

Try this:

    function normal_chars($string)
    {
        $string = htmlentities($string, ENT_QUOTES, 'UTF-8');
        $string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', $string);
        $string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
        $string = preg_replace(array('~[^0-9a-z]~i', '~[ -]+~'), ' ', $string);

        return trim($string, ' -');
    }

Examples:

    echo normal_chars('Álix----_Ãxel!?!?'); // Alix Axel
    echo normal_chars('áéíóúÁÉÍÓÚ'); // aeiouAEIOU
    echo normal_chars('üÿÄËÏÖÜŸåÅ'); // uyAEIOUYaA

Based on the selected answer in this thread: URL Friendly Username in PHP?

User · Answer

There are already several solutions provided for this question but I have read and tested most of the code here and I ended up with this solution which is a mix of what I learned here   The function  The function is bundled here in a Symfony2 bundle but it can be extracted to be used as plain PHP  it only has a dependency with the iconv function that must be enabled   Filesystem php    lt  php  namespace COil Bundle COilCoreBundle Component HttpKernel Util   use Symfony Component HttpKernel Util Filesystem as BaseFilesystem          Extends the Symfony filesystem object      class Filesystem extends BaseFilesystem                  Make a filename safe to use in any function   Accents  spaces  special chars            The iconv function must be activated                 param string   fileName       The filename to sanitize  with or without extension          param string   defaultIfEmpty The default string returned for a non valid filename  only special chars or separators          param string   separator      The default separator         param boolean  lowerCase      Tells if the string must converted to lower case                author COil  lt https   github com COil gt          see    http   stackoverflow com questions 2668854 sanitizing-strings-to-make-them-url-and-filename-safe                return string             public function sanitizeFilename  fileName   defaultIfEmpty    default    separator         lowerCase   true               Gather file informations and store its extension      fileInfos   pathinfo  fileName        fileExt     array key exists  extension    fileInfos         strtolower  fileInfos  extension                  Removes accents      fileName    iconv  UTF-8    us-ascii  TRANSLIT    fileInfos  filename             Removes all characters that are not separators  letters  numbers  dots or whitespaces      fileName   preg replace      a-zA-Z   preg quote  separator     d   s          lowerCase   strtolower  fileName     fileName        
   Replaces all successive separators into a single one      fileName   preg replace       preg quote  separator    s   u    separator   fileName           Trim beginning and ending seperators      fileName   trim  fileName   separator           If empty use the default string     if  empty  fileName              fileName    defaultIfEmpty             return  fileName   fileExt            The unit tests  What is interesting is that I have created PHPUnit tests  first to test edge cases and so you can check if it fits your needs   If you find a bug  feel free to add a test case   FilesystemTest php    lt  php  namespace COil Bundle COilCoreBundle Tests Unit Helper   use COil Bundle COilCoreBundle Component HttpKernel Util Filesystem          Test the Filesystem custom class      class FilesystemTest extends  PHPUnit Framework TestCase                  test sanitizeFilename               public function testFilesystem              fs   new Filesystem          this- gt assertEquals  logo orange gif    fs- gt sanitizeFilename  --log                  ora    -- g  -- gif       sanitizeFilename   handles complex filename with specials chars         this- gt assertEquals  coilstack    fs- gt sanitizeFilename  cOiLsTaCk       sanitizeFilename   converts all characters to lower case         this- gt assertEquals  cOiLsTaCk    fs- gt sanitizeFilename  cOiLsTaCk    default        false      sanitizeFilename   lower case can be desactivated  passing false as the 4th argument         this- gt assertEquals  coil stack    fs- gt sanitizeFilename  coil stack       sanitizeFilename   convert a white space to a separator         this- gt assertEquals  coil-stack    fs- gt sanitizeFilename  coil stack    default    -       sanitizeFilename   can use a different separator as the 3rd argument         this- gt assertEquals  coil stack    fs- gt sanitizeFilename  coil          stack       sanitizeFilename   removes successive white spaces to a single separator         this- gt 
assertEquals  coil stack    fs- gt sanitizeFilename         coil stack       sanitizeFilename   removes spaces at the beginning of the string         this- gt assertEquals  coil stack    fs- gt sanitizeFilename  coil   stack                sanitizeFilename   removes spaces at the end of the string         this- gt assertEquals  coilstack    fs- gt sanitizeFilename  coil      stack       sanitizeFilename   removes non-ASCII characters         this- gt assertEquals  coil stack    fs- gt sanitizeFilename  coil stack         sanitizeFilename   keeps separators         this- gt assertEquals  coil stack    fs- gt sanitizeFilename   coil        stack       sanitizeFilename   converts successive separators into a single one         this- gt assertEquals  coil stack gif    fs- gt sanitizeFilename  cOil Stack GiF       sanitizeFilename   lower case filename and extension         this- gt assertEquals  copy of coil stack exe    fs- gt sanitizeFilename  Copy of coil stack exe       sanitizeFilename   keeps dots before the extension         this- gt assertEquals  default doc    fs- gt sanitizeFilename               doc       sanitizeFilename   returns a default file name if filename only contains special chars         this- gt assertEquals  default docx    fs- gt sanitizeFilename           -  --                                  docx       sanitizeFilename   returns a default file name if filename only contains special chars         this- gt assertEquals  logo edition 1314352521 jpg    fs- gt sanitizeFilename  logo edition 1314352521 jpg       sanitizeFilename   returns the filename untouched if it does not need to be modified         userId   rand 1  10        this- gt assertEquals  user doc     userId    doc    fs- gt sanitizeFilename        doc    user doc     userId      sanitizeFilename   returns the default string  the 2nd argument  if it can  t be sanitized              The test results   checked on Ubuntu with PHP 5 3 2 and MacOsX with PHP 5 3 17   All tests pass   
phpunit -c app  src COil Bundle COilCoreBundle Tests Unit Helper FilesystemTest php PHPUnit 3 6 10 by Sebastian Bergmann   Configuration read from  var www strangebuzz com app phpunit xml dist     Time  0 seconds  Memory  5 75Mb  OK  1 test  17 assertions

User · Answer

I don't think having a list of chars to remove is safe. I would rather use the following:

For filenames: use an internal ID or a hash of the file content. Save the document name in a database. This way you can keep the original filename and still find the file.

For URL parameters: use urlencode() to encode any special characters.
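A rough sketch of that approach follows. All three function names are hypothetical, the array stands in for whatever database row you would actually store, and SHA-256 is just one reasonable choice of hash.

```php
<?php
// Hypothetical helper: the on-disk name comes from the file contents,
// never from the user-supplied name.
function storage_name_for(string $contents): string
{
    return hash('sha256', $contents);
}

// Hypothetical record, standing in for a database row that maps the
// stored name back to the name the user originally uploaded.
function make_upload_record(string $originalName, string $contents): array
{
    return [
        'stored_as'     => storage_name_for($contents),
        'original_name' => $originalName,
    ];
}

// For URL parameters, encode the original name rather than stripping it.
function download_url(array $record): string
{
    return '/download?name=' . urlencode($record['original_name']);
}
```

Because the stored name is always 64 hexadecimal characters, nothing the uploader types can reach the filesystem; the original name only ever travels through the database and the encoded URL.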

User · Answer

And this is the Joomla 3.3.2 version, from JFile::makeSafe($file):

    public static function makeSafe($file)
    {
        // Remove any trailing dots, as those aren't ever valid file names.
        $file = rtrim($file, '.');

        $regex = array('#(\.){2,}#', '#[^A-Za-z0-9\.\_\- ]#', '#^\.#');

        return trim(preg_replace($regex, '', $file));
    }

User · Answer

This is a good function:

    public function getFriendlyURL($string)
    {
        setlocale(LC_CTYPE, 'en_US.UTF8');
        $string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
        $string = preg_replace('~[^\-\pL\pN\s]+~u', '-', $string);
        $string = str_replace(' ', '-', $string);
        $string = trim($string, "-");
        $string = strtolower($string);
        return $string;
    }

User · Answer

I found this larger function in the Chyrp code          Function  sanitize    Returns a sanitized string  typically for URLs        Parameters          string - The string to sanitize          force lowercase - Force the string to lowercase          anal - If set to  true   will remove all non-alphanumeric characters      function sanitize  string   force lowercase   true   anal   false         strip   array                                           amp                                                                                                           amp  8216      amp  8217      amp  8220      amp  8221      amp  8211      amp  8212                                                      lt           gt                    clean   trim str replace  strip      strip tags  string          clean   preg replace    s      -    clean        clean     anal    preg replace     a-zA-Z0-9          clean     clean       return   force lowercase             function exists  mb strtolower                  mb strtolower  clean   UTF-8                 strtolower  clean             clean      and this one in the wordpress code         Sanitizes a filename replacing whitespace with dashes       Removes special characters that are illegal in filenames on certain    operating systems and special characters requiring special escaping    to manipulate at the command line  Replaces spaces and consecutive    dashes with a single dash  Trim period  dash and underscore from beginning    and end of filename         since 2 1 0        param string  filename The filename to be sanitized     return string The sanitized filename     function sanitize file name   filename          filename raw    filename       special chars   array                                  lt      gt                                amp                                                                 special chars   apply filters  sanitize file name chars    special chars   filename raw        filename   str replace  
special chars       filename        filename   preg replace     s-       -    filename        filename   trim  filename    -         return apply filters  sanitize file name    filename   filename raw       Update Sept 2012  Alix Axel has done some incredible work in this area  His phunction framework includes several great text filters and transformations    Unaccent Slug Filter

User · Answer

Some observations on your solution:

1. 'u' at the end of your pattern means that the pattern, and not the text it's matching, will be interpreted as UTF-8 (I presume you assumed the latter?).
2. \w matches the underscore character. You specifically include it for files, which leads to the assumption that you don't want them in URLs, but in the code you have, URLs will be permitted to include an underscore.
3. The inclusion of "foreign UTF-8" seems to be locale-dependent. It's not clear whether this is the locale of the server or client. From the PHP docs:

    A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

Creating the slug

You probably shouldn't include accented etc. characters in your post slug since, technically, they should be percent encoded (per URL encoding rules), so you'll have ugly looking URLs.

So, if I were you, after lowercasing, I'd convert any "special" characters to their equivalent (e.g. é -> e) and replace non [a-z] characters with '-', limiting to runs of a single '-' as you've done. There's an implementation of converting special characters here: https://web.archive.org/web/20130208144021/http://neo22s.com/slug

Sanitization in general

OWASP have a PHP implementation of their Enterprise Security API which, among other things, includes methods for safe encoding and decoding of input and output in your application.

The Encoder interface provides:

    canonicalize (string $input, [bool $strict = true])
    decodeFromBase64 (string $input)
    decodeFromURL (string $input)
    encodeForBase64 (string $input, [bool $wrap = false])
    encodeForCSS (string $input)
    encodeForHTML (string $input)
    encodeForHTMLAttribute (string $input)
    encodeForJavaScript (string $input)
    encodeForOS (Codec $codec, string $input)
    encodeForSQL (Codec $codec, string $input)
    encodeForURL (string $input)
    encodeForVBScript (string $input)
    encodeForXML (string $input)
    encodeForXMLAttribute (string $input)
    encodeForXPath (string $input)

https://github.com/OWASP/PHP-ESAPI
https://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API
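The slug recipe described in this answer (convert special characters, lowercase, replace everything else with a single '-') could look roughly like the sketch below. The name `slugify` and the tiny character map are assumptions made for illustration; a real map would need far more entries, and keeping digits as well as a-z is my choice rather than something the answer specifies.

```php
<?php
// Hypothetical slug helper. The map is deliberately small and only
// illustrates the idea; unmapped non-ASCII characters end up as '-'.
function slugify(string $text): string
{
    $map = [
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a',
        'ç' => 'c',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'î' => 'i', 'ï' => 'i',
        'ñ' => 'n',
        'ô' => 'o', 'ö' => 'o',
        'û' => 'u', 'ü' => 'u',
        'É' => 'e', 'Ü' => 'u',
    ];

    // Convert accented characters first, then lowercase plain ASCII.
    $text = strtolower(strtr($text, $map));

    // Any run of characters outside a-z and 0-9 becomes one dash.
    $text = preg_replace('/[^a-z0-9]+/', '-', $text);

    return trim($text, '-');
}
```

Doing the character conversion before lowercasing avoids relying on `mb_strtolower`, so the sketch runs without the mbstring extension.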

User · Answer

This isn t exactly an answer as it doesn t provide any solutions  yet    but it s too big to fit on a comment       I did some testing  regarding file names  on Windows 7 and Ubuntu 12 04 and what I found out was that   1  PHP Can t Handle non-ASCII Filenames  Although both Windows and Ubuntu can handle Unicode filenames  even RTL ones as it seems  PHP 5 3 requires hacks to deal even with the plain old ISO-8859-1  so it s better to keep it ASCII only for safety   2  The Lenght of the Filename Matters  Specially on Windows   On Ubuntu  the maximum length a filename can have  incluinding extension  is 255  excluding path     var www uploads 123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345    However  on Windows 7  NTFS  the maximum lenght a filename can have depends on it s absolute path    0   0   244   11 chars  C  1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234 1234567 txt  0   3   240   11 chars  C  123 123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 1234567 txt  3   3   236   11 chars  C  123 456 12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456 1234567 txt   Wikipedia says that      NTFS allows each path component  directory or filename  to be 255   characters long    To the best of my 
knowledge  and testing   this is wrong   In total  counting slashes  all these examples have 259 chars  if you strip the C   that gives 256 characters  not 255     The directories where created using the Explorer and you ll notice that it restrains itself from using all the available space for the directory name  The reason for this is to allow the creation of files using the 8 3 file naming convention  The same thing happens for other partitions   Files don t need to reserve the 8 3 lenght requirements of course    255 chars  E  12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901 txt   You can t create any more sub-directories if the absolute path of the parent directory has more than 242 characters  because 256   242   1       8       3  Using Windows Explorer  you can t create another directory if the parent directory has more than 233 characters  depending on the system locale   because 256   233   10       8       3  the 10 here is the length of the string New folder   Windows file system poses a nasty problem if you want to assure inter-operability between file systems   3  Beware of Reserved Characters and Keywords  Aside from removing non-ASCII  non-printable and control characters  you also need to re place move         lt  gt       Just removing these characters might not be the best idea because the filename might lose some of it s meaning  I think that  at the very least  multiple occurences of these characters should be replaced by a single underscore      or perhaps something more representative  this is just an idea         -        -  -   -     -     lt  -     gt  -      There are also special keywords that should be avoided  like NUL   although I m not sure how to overcome that  Perhaps a black list with a random name fallback would be a good approach 
to solve it   4  Case Sensitiveness  This should go without saying  but if you want so ensure file uniqueness across different operating systems you should transform file names to a normalized case  that way my file txt and My File txt on Linux won t both become the same my file txt file on Windows   5  Make Sure It s Unique  If the file name already exists  a unique identifier should be appended to it s base file name   Common unique identifiers include the UNIX timestamp  a digest of the file contents or a random string   6  Hidden Files  Just because it can be named doesn t mean it should     Dots are usually white-listed in file names but in Linux a hidden file is represented by a leading dot   7  Other Considerations  If you have to strip some chars of the file name  the extension is usually more important than the base name of the file  Allowing a considerable maximum number of characters for the file extension  8-16  one should strip the characters from the base name  It s also important to note that in the unlikely event of having a more than one long extension - such as   graphmlz tag gz -   graphmlz tag only   should be considered as the file base name in this case   8  Resources  Calibre handles file name mangling pretty decently     src calibre utils filenames py  src calibre library save to disk py   Wikipedia page on file name mangling and linked chapter from Using Samba     If for instance  you try to create a file that violates any of the rules 1 2 3  you ll get a very useful error   Warning  touch    Unable to create file     because No error in     on line
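Points 3, 4 and 6 above can be combined into one small pass. This is only a sketch under stated assumptions: the function names and the `file` fallback prefix are made up for illustration, every reserved character is mapped to an underscore (the simplest of the options floated above), and the reserved-name list covers only the classic Windows device names.

```php
<?php
// Device names that Windows reserves regardless of extension.
const WINDOWS_RESERVED = [
    'CON', 'PRN', 'AUX', 'NUL',
    'COM1', 'COM2', 'COM3', 'COM4', 'COM5', 'COM6', 'COM7', 'COM8', 'COM9',
    'LPT1', 'LPT2', 'LPT3', 'LPT4', 'LPT5', 'LPT6', 'LPT7', 'LPT8', 'LPT9',
];

// Hypothetical check: compares the part before the first dot.
function is_reserved_on_windows(string $name): bool
{
    $dot  = strpos($name, '.');
    $base = $dot === false ? $name : substr($name, 0, $dot);

    return in_array(strtoupper($base), WINDOWS_RESERVED, true);
}

// Hypothetical hardening pass for an already-ASCII filename.
function harden_filename(string $name, string $fallback = 'file'): string
{
    // Replace each run of reserved characters with a single underscore.
    $name = preg_replace('~["*/:<>?\\\\|]+~', '_', $name);

    // A leading dot would make the file hidden on Linux.
    $name = ltrim($name, '.');

    // Normalize case so names cannot collide on case-insensitive systems.
    $name = strtolower($name);

    if ($name === '') {
        return $fallback;
    }

    // Prefix reserved device names instead of rejecting the upload.
    return is_reserved_on_windows($name) ? $fallback . '_' . $name : $name;
}
```

Uniqueness (point 5) is left out on purpose; appending a timestamp or content digest to the base name would be a separate step.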

User · Answer

Why not simply use PHP's urlencode? It replaces "dangerous" characters with their hex representation for URLs (e.g. %3F for a question mark; note that a space becomes + with urlencode, and %20 only with rawurlencode).
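A quick sketch of what the two built-in encoders do. The wrapper names are hypothetical; `urlencode` and `rawurlencode` are the standard PHP functions.

```php
<?php
// Hypothetical wrapper: rawurlencode() follows RFC 3986, so a space
// becomes %20. This is the usual choice for path segments.
function encode_path_segment(string $segment): string
{
    return rawurlencode($segment);
}

// Hypothetical wrapper: urlencode() uses form encoding, so a space
// becomes '+'. This is the usual choice for query string values.
function encode_query_value(string $value): string
{
    return urlencode($value);
}
```

Encoding makes a string safe to place in a URL, but it does not make it safe as a filename: `rawurlencode('../a')` still starts with two dots, so filenames need their own treatment.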

User · Answer

I've always thought Kohana did a pretty good job of it.

    public static function title($title, $separator = '-', $ascii_only = FALSE)
    {
        if ($ascii_only === TRUE)
        {
            // Transliterate non-ASCII characters
            $title = UTF8::transliterate_to_ascii($title);

            // Remove all characters that are not the separator, a-z, 0-9, or whitespace
            $title = preg_replace('![^'.preg_quote($separator).'a-z0-9\s]+!', '', strtolower($title));
        }
        else
        {
            // Remove all characters that are not the separator, letters, numbers, or whitespace
            $title = preg_replace('![^'.preg_quote($separator).'\pL\pN\s]+!u', '', UTF8::strtolower($title));
        }

        // Replace all separator characters and whitespace by a single separator
        $title = preg_replace('!['.preg_quote($separator).'\s]+!u', $separator, $title);

        // Trim separators from the beginning and end
        return trim($title, $separator);
    }

The handy UTF8::transliterate_to_ascii() will turn stuff like ñ => n.

Of course, you could replace the other UTF8::* stuff with mb_* functions.

User · Answer

This is the code used by Prestashop to sanitize urls    replaceAccentedChars   is used by   str2url   to remove diacritics  function replaceAccentedChars  str         patterns   array             Lowercase                x 0105  x 00E0  x 00E1  x 00E2  x 00E3  x 00E4  x 00E5   u               x 00E7  x 010D  x 0107   u               x 010F   u               x 00E8  x 00E9  x 00EA  x 00EB  x 011B  x 0119   u               x 00EC  x 00ED  x 00EE  x 00EF   u               x 0142  x 013E  x 013A   u               x 00F1  x 0148   u               x 00F2  x 00F3  x 00F4  x 00F5  x 00F6  x 00F8   u               x 0159  x 0155   u               x 015B  x 0161   u               x 00DF   u               x 0165   u               x 00F9  x 00FA  x 00FB  x 00FC  x 016F   u               x 00FD  x 00FF   u               x 017C  x 017A  x 017E   u               x 00E6   u               x 0153   u               Uppercase                x 0104  x 00C0  x 00C1  x 00C2  x 00C3  x 00C4  x 00C5   u               x 00C7  x 010C  x 0106   u               x 010E   u               x 00C8  x 00C9  x 00CA  x 00CB  x 011A  x 0118   u               x 0141  x 013D  x 0139   u               x 00D1  x 0147   u               x 00D3   u               x 0158  x 0154   u               x 015A  x 0160   u               x 0164   u               x 00D9  x 00DA  x 00DB  x 00DC  x 016E   u               x 017B  x 0179  x 017D   u               x 00C6   u               x 0152   u          replacements   array               a    c    d    e    i    l    n    o    r    s    ss    t    u    y    z    ae    oe                A    C    D    E    L    N    O    R    S    T    U    Z    AE    OE                  return preg replace  patterns   replacements   str      function str2url  str        if  function exists  mb strtolower             str   mb strtolower  str   utf-8          str   trim  str       if   function exists  mb strtolower             str   replaceAccentedChars  str           Remove all 
non-whitelist chars       str   preg replace     a-zA-Z0-9 s          - pL  u        str        str   preg replace     s          -            str        str   str replace array             -    str           If it was not possible to lowercase the string with mb strtolower  we do it after the transformations         This way we lose fewer special chars      if   function exists  mb strtolower             str   strtolower  str        return  str

User · Answer

This is a nice way to secure an upload filename:

    $file_name = trim(basename(stripslashes($name)), ".\x00..\x20");
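To see what that one-liner actually strips, here it is wrapped in a function (the name `upload_basename` is made up for this sketch). `basename` drops any directory part, and the `trim` character list removes dots plus every byte from `\x00` through `\x20` (control characters and space) from both ends.

```php
<?php
// Hypothetical wrapper around the one-liner above.
function upload_basename(string $name): string
{
    // ".\x00..\x20" means: a literal dot, then the byte range 0x00-0x20.
    return trim(basename(stripslashes($name)), ".\x00..\x20");
}
```

Note that a name made only of dots comes back as an empty string, so the caller still needs a fallback name for that case.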

User · Answer

I recommend URLify for PHP (480+ stars on Github), the PHP port of URLify.js from the Django project. It transliterates non-ASCII characters for use in URLs.

Basic usage:

To generate slugs for URLs:

    <?php

    echo URLify::filter (' J\'étudie le français ');
    // "jetudie-le-francais"

    echo URLify::filter ('Lo siento, no hablo español.');
    // "lo-siento-no-hablo-espanol"

    ?>

To generate slugs for file names:

    <?php

    echo URLify::filter ('фото.jpg', 60, '', true);
    // "foto.jpg"

    ?>

None of the other suggestions matched my criteria:

- Should be installable via composer
- Should not depend on iconv, since it behaves differently on different systems
- Should be extendable to allow overrides and custom character replacements
- Popular (for instance, many stars on Github)
- Has tests

As a bonus, URLify also removes certain words and strips away all characters not transliterated.

Here is a test case with tons of foreign characters being transliterated properly using URLify: https://gist.github.com/motin/a65e6c1cc303e46900d10894bf2da87f

User · Answer

Solution  1  You have ability to install PHP extensions on server  hosting   For transliteration of  almost every single language on the planet Earth  to ASCII characters    Install PHP Intl extension first  This is command for Debian  Ubuntu   sudo aptitude install php5-intl This is my fileName function  create test php and paste there following code      x000D   x000D   lt  doctype html gt  x000D   lt html lang  en  gt  x000D   lt head gt  x000D   lt meta charset  utf-8  gt  x000D   lt title gt Test lt  title gt  x000D   lt  head gt  x000D   lt body gt  x000D   lt  php x000D   x000D  function pr  string    x000D    print   lt hr gt    x000D    print       fileName  string         x000D    print   lt br gt    x000D    print        string        x000D    x000D   x000D  function fileName  string    x000D       remove html tags x000D     clean   strip tags  string   x000D       transliterate x000D     clean   transliterator transliterate  Any-Latin Latin-ASCII     clean   x000D       remove non-number and non-letter characters x000D     clean   str replace  --    -   preg replace     a-z0-9-    i       preg replace array  x000D         s     x000D           w-   -    x000D       array  x000D            x000D         x000D        clean     x000D       replace  -  for     x000D     clean   strtr  clean  array  x000D       -    gt      x000D        x000D       remove double      x000D     positionInString   stripos  clean         x000D    while   positionInString     false    x000D       clean   str replace             clean   x000D       positionInString   stripos  clean         x000D      x000D       remove     from the end and beginning of the string x000D     clean   rtrim ltrim  clean              x000D       lowercase the string x000D    return strtolower  clean   x000D    x000D  pr   replace     amp   a-z  1 2   ac134 56f4315981743 8765475  lt7nl2  5  n  138y  73t  7  lute     x000D  pr htmlspecialchars   lt script gt alert   hacked    lt  script gt      x000D  
pr    lix----   xel        x000D  pr                          x000D  pr                           x000D  pr  nie4c a a    n      a      x000D  pr                x000D  pr         x000D  pr                  x000D  pr                   x000D  pr         -        x000D  pr                x000D  pr  Mao Tr ch     ng    x000D  pr         x000D  pr                 x000D    gt  x000D   lt  body gt  x000D   lt  html gt  x000D   x000D   x000D    This line is core        transliterate    clean   transliterator transliterate  Any-Latin Latin-ASCII     clean     Answer based on this post   Solution  2  You don t have ability to install PHP extensions on server  hosting     Pretty good job is done in transliteration module for CMS Drupal  It supports almost every single language on the planet Earth  I suggest to check plugin repository if you want to have really complete solution sanitizing strings

User · Answer

Here's CodeIgniter's implementation.

    /**
     * Sanitize Filename
     *
     * @param   string  $str            Input file name
     * @param   bool    $relative_path  Whether to preserve paths
     * @return  string
     */
    public function sanitize_filename($str, $relative_path = FALSE)
    {
        $bad = array(
            '../', '<!--', '-->', '<', '>',
            "'", '"', '&', '$', '#',
            '{', '}', '[', ']', '=',
            ';', '?', '%20', '%22',
            '%3c',      // <
            '%253c',    // <
            '%3e',      // >
            '%0e',      // >
            '%28',      // (
            '%29',      // )
            '%2528',    // (
            '%26',      // &
            '%24',      // $
            '%3f',      // ?
            '%3b',      // ;
            '%3d'       // =
        );

        if ( ! $relative_path)
        {
            $bad[] = './';
            $bad[] = '/';
        }

        $str = remove_invisible_characters($str, FALSE);
        return stripslashes(str_replace($bad, '', $str));
    }

And the remove_invisible_characters dependency:

    function remove_invisible_characters($str, $url_encoded = TRUE)
    {
        $non_displayables = array();

        // every control character except newline (dec 10),
        // carriage return (dec 13) and horizontal tab (dec 09)
        if ($url_encoded)
        {
            $non_displayables[] = '/%0[0-8bcef]/';  // url encoded 00-08, 11, 12, 14, 15
            $non_displayables[] = '/%1[0-9a-f]/';   // url encoded 16-31
        }

        $non_displayables[] = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]+/S';   // 00-08, 11, 12, 14-31, 127

        do
        {
            $str = preg_replace($non_displayables, '', $str, -1, $count);
        }
        while ($count);

        return $str;
    }

User · Answer

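I adapted this from another source and added a couple of extras; it may be a little overkill:

    /**
     * Convert a string into a URL-safe address.
     *
     * @param string $unformatted
     * @return string
     */
    public function formatURL($unformatted) {

        $url = trim($unformatted);

        // Replace accented characters (foreign languages)
        $search = array('À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ÿ', 'Ā', 'ā', 'Ă', 'ă', 'Ą', 'ą', 'Ć', 'ć', 'Ĉ', 'ĉ', 'Ċ', 'ċ', 'Č', 'č', 'Ď', 'ď', 'Đ', 'đ', 'Ē', 'ē', 'Ĕ', 'ĕ', 'Ė', 'ė', 'Ę', 'ę', 'Ě', 'ě', 'Ĝ', 'ĝ', 'Ğ', 'ğ', 'Ġ', 'ġ', 'Ģ', 'ģ', 'Ĥ', 'ĥ', 'Ħ', 'ħ', 'Ĩ', 'ĩ', 'Ī', 'ī', 'Ĭ', 'ĭ', 'Į', 'į', 'İ', 'ı', 'Ĳ', 'ĳ', 'Ĵ', 'ĵ', 'Ķ', 'ķ', 'Ĺ', 'ĺ', 'Ļ', 'ļ', 'Ľ', 'ľ', 'Ŀ', 'ŀ', 'Ł', 'ł', 'Ń', 'ń', 'Ņ', 'ņ', 'Ň', 'ň', 'ŉ', 'Ō', 'ō', 'Ŏ', 'ŏ', 'Ő', 'ő', 'Œ', 'œ', 'Ŕ', 'ŕ', 'Ŗ', 'ŗ', 'Ř', 'ř', 'Ś', 'ś', 'Ŝ', 'ŝ', 'Ş', 'ş', 'Š', 'š', 'Ţ', 'ţ', 'Ť', 'ť', 'Ŧ', 'ŧ', 'Ũ', 'ũ', 'Ū', 'ū', 'Ŭ', 'ŭ', 'Ů', 'ů', 'Ű', 'ű', 'Ų', 'ų', 'Ŵ', 'ŵ', 'Ŷ', 'ŷ', 'Ÿ', 'Ź', 'ź', 'Ż', 'ż', 'Ž', 'ž', 'ſ', 'ƒ', 'Ơ', 'ơ', 'Ư', 'ư', 'Ǎ', 'ǎ', 'Ǐ', 'ǐ', 'Ǒ', 'ǒ', 'Ǔ', 'ǔ', 'Ǖ', 'ǖ', 'Ǘ', 'ǘ', 'Ǚ', 'ǚ', 'Ǜ', 'ǜ', 'Ǻ', 'ǻ', 'Ǽ', 'ǽ', 'Ǿ', 'ǿ');
        $replace = array('A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', 'D', 'N', 'O', 'O', 'O', 'O', 'O', 'O', 'U', 'U', 'U', 'U', 'Y', 's', 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'n', 'o', 'o', 'o', 'o', 'o', 'o', 'u', 'u', 'u', 'u', 'y', 'y', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 'C', 'c', 'C', 'c', 'D', 'd', 'D', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 'g', 'G', 'g', 'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'IJ', 'ij', 'J', 'j', 'K', 'k', 'L', 'l', 'L', 'l', 'L', 'l', 'L', 'l', 'l', 'l', 'N', 'n', 'N', 'n', 'N', 'n', 'n', 'O', 'o', 'O', 'o', 'O', 'o', 'OE', 'oe', 'R', 'r', 'R', 'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'S', 's', 'T', 't', 'T', 't', 'T', 't', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Y', 'Z', 'z', 'Z', 'z', 'Z', 'z', 's', 'f', 'O', 'o', 'U', 'u', 'A', 'a', 'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'A', 'a', 'AE', 'ae', 'O', 'o');
        $url = str_replace($search, $replace, $url);

        // Lowercase after transliterating, so that replacements such as 'A' or 'AE'
        // are not stripped by the a-z filter at the end
        $url = strtolower($url);

        // Replace common characters
        $search = array('&', '£', '$');
        $replace = array('and', 'pounds', 'dollars');
        $url = str_replace($search, $replace, $url);

        // Replace spaces and joining characters with -
        // (double quotes, so that \r\n and \n are real line breaks)
        $find = array(' ', '&', "\r\n", "\n", '+', ',', '//');
        $url = str_replace($find, '-', $url);

        // Delete or replace the remaining special characters
        $find = array('/[^a-z0-9\-<>]/', '/[\-]+/', '/<[^>]*>/');
        $replace = array('', '-', '');
        $uri = preg_replace($find, $replace, $url);

        return $uri;
    }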

User · Answer

In terms of file uploads, you would be safest to prevent the user from controlling the file name. As has already been hinted at, store the canonicalised filename in a database along with a randomly chosen and unique name which you'll use as the actual filename.

Using OWASP ESAPI, these names could be generated thus:

    $userFilename = ESAPI::getEncoder()->canonicalize($input_string);
    $safeFilename = ESAPI::getRandomizer()->getRandomFilename();

You could append a timestamp to the $safeFilename to help ensure that the randomly generated filename is unique without even checking for an existing file.

In terms of encoding for URL, and again using ESAPI:

    $safeForURL = ESAPI::getEncoder()->encodeForURL($input_string);

This method performs canonicalisation before encoding the string and will handle all character encodings.
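If ESAPI is not available, the same idea can be sketched with PHP's built-in functions. This is only an illustration, assuming PHP 7 or later (for `random_bytes`); the function name `randomStorageName` and the extension whitelist are hypothetical and not part of ESAPI or of the answer above.

```php
<?php
/**
 * Hypothetical helper: build a storage name that does not depend on
 * anything the user typed, apart from an optional, validated extension.
 */
function randomStorageName(string $extension = ''): string
{
    // 16 bytes from the CSPRNG, rendered as 32 hex characters.
    $name = bin2hex(random_bytes(16));

    // Timestamp prefix, as suggested above, to make collisions even less likely.
    $name = time() . '-' . $name;

    // Keep the extension only if it is short and purely alphanumeric.
    if (preg_match('/\A[a-z0-9]{1,10}\z/i', $extension) === 1) {
        $name .= '.' . strtolower($extension);
    }

    return $name;
}
```

The name the user supplied would then be stored in the database purely as display data, next to the generated name, and never used to build a path on disk.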

User · Answer

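I have entry titles with all kinds of weird Latin characters, as well as some HTML tags, that I needed to translate into a useful dash-delimited filename format. I combined @SoLoGHoST's answer with a couple of items from @Xeoncross's answer and customized it a bit.

    function sanitize($string, $force_lowercase = true) {
        // Clean up titles for filenames
        $clean = strip_tags($string);
        $clean = strtr($clean, array('Š' => 'S', 'Ž' => 'Z', 'š' => 's', 'ž' => 'z', 'Ÿ' => 'Y', 'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A', 'Ç' => 'C', 'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E', 'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I', 'Ñ' => 'N', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O', 'Ø' => 'O', 'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U', 'Ý' => 'Y', 'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a', 'ç' => 'c', 'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e', 'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'ñ' => 'n', 'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o', 'ø' => 'o', 'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u', 'ý' => 'y', 'ÿ' => 'y'));
        $clean = strtr($clean, array('Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss', 'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u', '—' => '-'));

        // Whitespace becomes a dash, then everything except letters, digits and dashes is dropped.
        // The inner class is written [^\w.\-] because [^\w-\.\-] is rejected as an invalid range by PHP 7.3+.
        $clean = str_replace('--', '-', preg_replace('/[^a-z0-9-]/i', '', preg_replace(array('/\s/', '/[^\w.\-]/'), array('-', ''), $clean)));

        return ($force_lowercase)
            ? ((function_exists('mb_strtolower'))
                ? mb_strtolower($clean, 'UTF-8')
                : strtolower($clean))
            : $clean;
    }

I needed to manually add the em dash character (—) to the translation array. There may be others, but so far my file names are looking good.

So: "Part 1: My dad’s “Žurburts”—they’re (not) the best!"

becomes: part-1-my-dads-zurburts-theyre-not-the-best

I just add ".html" to the returned string.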
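One edge case that may be worth checking with this approach: `str_replace('--', '-', ...)` makes a single pass, so a run of three or more dashes is shortened but not fully collapsed. A title such as "Tom - Jerry" would, as far as I can tell, come out as "tom--jerry". The sketch below compares the single pass with a regex that collapses whole runs; the helper names `collapseOnce` and `collapseRuns` are hypothetical and exist only for this illustration.

```php
<?php
// Single pass, as in the function above: longer runs leave doubles behind.
function collapseOnce(string $s): string
{
    return str_replace('--', '-', $s);
}

// Regex variant: any run of dashes becomes one dash; edge dashes are trimmed.
function collapseRuns(string $s): string
{
    return trim(preg_replace('/-+/', '-', $s), '-');
}

echo collapseOnce('a----b'), "\n"; // a--b
echo collapseRuns('a----b'), "\n"; // a-b
```

Whether leading and trailing dashes should also be trimmed is a design choice; for slugs it usually looks cleaner, while for filenames it may not matter.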

User · Answer

There are two good answers for slugifying your data: use https://stackoverflow.com/a/3987966/971619 or https://stackoverflow.com/a/7610586/971619

User · Answer

This post seems to work the best among all that I have tried: http://gsynuh.com/php-string-filename-url-safe/205
