Turn a string into a valid filename

Question

I have a string that I want to use as a filename, so I want to remove all characters that wouldn't be allowed in filenames, using Python.

I'd rather be strict than otherwise, so let's say I want to retain only letters, digits, and a small set of other characters like "_-.() ". What's the most elegant solution?

The filename needs to be valid on multiple operating systems (Windows, Linux and Mac OS): it's an MP3 file in my library with the song title as the filename, and it is shared and backed up between 3 machines.

User · Answer

    >>> import string
    >>> safechars = bytearray(('_-.()' + string.digits + string.ascii_letters).encode())
    >>> allchars = bytearray(range(0x100))
    >>> deletechars = bytearray(set(allchars) - set(safechars))
    >>> filename = u'#ab\xa0c.$%.txt'
    >>> safe_filename = filename.encode('ascii', 'ignore').translate(None, deletechars).decode()
    >>> safe_filename
    'abc..txt'

It doesn't handle empty strings or special filenames ('nul', 'con', etc.).

User · Answer

You can look at the Django framework for how they create a "slug" from arbitrary text. A slug is URL- and filename-friendly.

The Django text utils define a function, slugify(), that's probably the gold standard for this kind of thing. Essentially, their code is the following:

    import unicodedata
    import re

    def slugify(value, allow_unicode=False):
        """
        Taken from https://github.com/django/django/blob/master/django/utils/text.py
        Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
        dashes to single dashes. Remove characters that aren't alphanumerics,
        underscores, or hyphens. Convert to lowercase. Also strip leading and
        trailing whitespace, dashes, and underscores.
        """
        value = str(value)
        if allow_unicode:
            value = unicodedata.normalize('NFKC', value)
        else:
            value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
        value = re.sub(r'[^\w\s-]', '', value.lower())
        return re.sub(r'[-\s]+', '-', value).strip('-_')

And the older version (written for Python 2, hence the unicode() calls):

    def slugify(value):
        """
        Normalizes string, converts to lowercase, removes non-alpha characters,
        and converts spaces to hyphens.
        """
        import unicodedata
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
        value = unicode(re.sub('[^\w\s-]', '', value).strip().lower())
        value = unicode(re.sub('[-\s]+', '-', value))
        # ...
        return value

There's more, but I left it out, since it deals with escaping rather than slugification.

User · Answer

What is the reason to use the strings as file names? If human readability is not a factor, I would go with the base64 module, which can produce file-system-safe strings. The result won't be readable, but you won't have to deal with collisions and it is reversible.

    import base64

    # Python 3: urlsafe_b64encode() takes and returns bytes
    file_name_string = base64.urlsafe_b64encode(your_string.encode('utf-8')).decode('ascii')

Update: Changed based on Matthew's comment.
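
To illustrate the reversibility claim, here is a minimal round-trip sketch for Python 3, where the text has to be encoded to bytes before it is Base64-encoded. The helper names `to_safe_name` and `from_safe_name` and the sample title are made up for this example.

```python
import base64


def to_safe_name(title: str) -> str:
    """Encode arbitrary text with the URL-safe Base64 alphabet (hypothetical helper)."""
    return base64.urlsafe_b64encode(title.encode("utf-8")).decode("ascii")


def from_safe_name(name: str) -> str:
    """Recover the original text from a name produced by to_safe_name."""
    return base64.urlsafe_b64decode(name.encode("ascii")).decode("utf-8")


encoded = to_safe_name("AC/DC: Back in Black?")
print(encoded)                  # only A-Z, a-z, 0-9, '-', '_' and '=' padding
print(from_safe_name(encoded))  # the original title again
```

Two caveats worth keeping in mind: the encoded name is roughly a third longer than the UTF-8 input, and Base64 is case-sensitive, so on a case-insensitive file system two different encoded names could in principle differ only by letter case.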

User · Answer

Another issue that the other comments haven't addressed yet is the empty string, which is obviously not a valid filename. You can also end up with an empty string from stripping too many characters.

What with the Windows reserved filenames and issues with dots, the safest answer to the question "how do I normalise a valid filename from arbitrary user input?" is "don't even bother trying": if you can find any other way to avoid it (e.g. using integer primary keys from a database as filenames), do that.

If you must, and you really need to allow spaces and '.' for file extensions as part of the name, try something like:

    import re

    # anything outside the whitelist, a leading/trailing dot or space, or an empty name
    badchars = re.compile(r'[^A-Za-z0-9_. ]+|^\.|\.$|^ | $|^$')
    # Windows device names, bare or followed by an extension
    # ('nul' added and matching made case-insensitive, as Windows ignores case here)
    badnames = re.compile(r'(aux|com[1-9]|con|lpt[1-9]|nul|prn)(\.|$)', re.IGNORECASE)

    def makeName(s):
        name = badchars.sub('_', s)
        if badnames.match(name):
            name = '_' + name
        return name

Even this can't be guaranteed right, especially on unexpected OSs: for example, RISC OS hates spaces and uses '.' as a directory separator.

User · Answer

Most of these solutions don't work.

    '/hello/world'  -> 'helloworld'
    '/helloworld/'  -> 'helloworld'

This generally isn't what you want. Say you are saving the HTML for each link: you're going to overwrite the HTML for a different webpage.

I pickle a dict such as:

    {'helloworld':
        (
        {'/hello/world': 'helloworld', '/helloworld/': 'helloworld1'},
        2)
    }

2 represents the number that should be appended to the next filename.

I look up the filename each time from the dict. If it's not there, I create a new one, appending the max number if needed.
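
The bookkeeping described in this answer might be sketched as follows. This version keeps the mapping in memory and leaves persistence (the answer uses pickle) out; `NameRegistry`, `assign`, and the `sanitize` callback are hypothetical names, and the suffix scheme is an assumption rather than the answer's exact code.

```python
from typing import Callable, Dict, Set


class NameRegistry:
    """Map original strings to unique sanitized names (illustrative sketch)."""

    def __init__(self, sanitize: Callable[[str], str]) -> None:
        self._sanitize = sanitize
        self._assigned: Dict[str, str] = {}  # original string -> filename
        self._taken: Set[str] = set()        # every filename handed out so far

    def assign(self, original: str) -> str:
        # A string we have already seen keeps the name it was given.
        if original in self._assigned:
            return self._assigned[original]
        base = self._sanitize(original)
        candidate, suffix = base, 0
        # Append 1, 2, 3, ... until the candidate is not taken.
        while candidate in self._taken:
            suffix += 1
            candidate = f"{base}{suffix}"
        self._taken.add(candidate)
        self._assigned[original] = candidate
        return candidate


registry = NameRegistry(lambda s: "".join(c for c in s if c.isalnum()))
print(registry.assign("/hello/world"))  # helloworld
print(registry.assign("/helloworld/"))  # helloworld1
print(registry.assign("/hello/world"))  # helloworld (same input, same name)
```

Keeping a set of names already handed out, rather than one counter per base name, also covers the case where a suffixed name happens to equal another input's sanitized form.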

User · Answer

This whitelist approach (i.e. allowing only the chars present in valid_chars) will work if there aren't limits on the formatting of the files or combinations of valid chars that are illegal (like ".."). For example, what you describe would allow a filename named " . txt", which I think is not valid on Windows. As this is the simplest approach, I'd try to remove whitespace from valid_chars and prepend a known valid string in case of error. Any other approach will have to know what is allowed where in order to cope with Windows file naming limitations, and will thus be a lot more complex.

    >>> import string
    >>> valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
    >>> valid_chars
    '-_.() abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
    >>> filename = "This Is a (valid) - filename%$&$ .txt"
    >>> ''.join(c for c in filename if c in valid_chars)
    'This Is a (valid) - filename .txt'

User · Answer

Just like S.Lott answered, you can look at the Django framework for how they convert a string to a valid filename.

The most recent version is found in utils/text.py and defines get_valid_filename, which is as follows:

    import re

    def get_valid_filename(s):
        # trim, turn spaces into underscores, then drop anything that is not
        # a word character, a hyphen, or a dot
        s = str(s).strip().replace(' ', '_')
        return re.sub(r'(?u)[^-\w.]', '', s)

(See https://github.com/django/django/blob/master/django/utils/text.py)

User · Answer

Just to further complicate things, you are not guaranteed to get a valid filename just by removing invalid characters. Since allowed characters differ between file systems, a conservative approach could end up turning a valid name into an invalid one. You may want to add special handling for the cases where:

- The string is all invalid characters (leaving you with an empty string).
- You end up with a string with a special meaning, e.g. "." or "..".
- On Windows, certain device names are reserved. For instance, you can't create a file named "nul", "nul.txt" (or nul.anything, in fact). The reserved names are: CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.

You can probably work around these issues by prepending some string to the filenames that can never result in one of these cases, and stripping invalid characters.
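
As a rough sketch of that special handling: the function name `make_safe`, the `file_` prefix, and the exact whitelist below are assumptions chosen for the example, not part of the answer.

```python
import string

# Device names Windows reserves, with or without an extension.
RESERVED = {"CON", "PRN", "AUX", "NUL"}
RESERVED |= {f"COM{i}" for i in range(1, 10)} | {f"LPT{i}" for i in range(1, 10)}
ALLOWED = set(string.ascii_letters + string.digits + "-_.() ")


def make_safe(name: str, prefix: str = "file_") -> str:
    """Whitelist characters, then guard the special cases listed above."""
    cleaned = "".join(c for c in name if c in ALLOWED)
    # Drop leading spaces and trailing dots/spaces; this also turns
    # "." and ".." into the empty string, which is handled below.
    cleaned = cleaned.lstrip(" ").rstrip(". ")
    # Part before the first dot, e.g. "nul" for "nul.txt".
    stem = cleaned.split(".", 1)[0].strip()
    if not cleaned or stem.upper() in RESERVED:
        cleaned = prefix + cleaned
    return cleaned


print(make_safe("nul.txt"))          # file_nul.txt
print(make_safe(".."))               # file_
print(make_safe("Song (live).mp3"))  # Song (live).mp3
```

Every input that sanitizes to nothing ends up as the bare prefix, so a real implementation would still need some way to keep such names distinct.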

User · Answer

UPDATE

All links are broken beyond repair in this 6-year-old answer.

Also, I wouldn't do it this way anymore: just base64 encode or drop unsafe chars. Python 3 example:

    import re

    t = re.compile("[a-zA-Z0-9.,_-]")
    unsafe = "abc∂éåß®∆˚˙©¬ñ√ƒµ©∆∫ø"
    # join the surviving characters back into a string
    safe = "".join(ch for ch in unsafe if t.match(ch))
    # => 'abc'

With base64 you can encode and decode, so you can retrieve the original filename again.

But depending on the use case, you might be better off generating a random filename and storing the metadata in a separate file or DB.

    from random import choice
    from string import ascii_lowercase, ascii_uppercase, digits

    allowed_chr = ascii_lowercase + ascii_uppercase + digits
    safe = ''.join([choice(allowed_chr) for _ in range(16)])
    # => e.g. 'CYQ4JDKE9JfcRzAZ'

ORIGINAL LINKROTTEN ANSWER:

The bobcat project contains a Python module that does just this.

It's not completely robust, see this post and this reply.

So, as noted, base64 encoding is probably a better idea if readability doesn't matter.

Docs: https://svn.origo.ethz.ch/bobcat/src-doc/safefilename-module.html
Source: https://svn.origo.ethz.ch/bobcat/trunk/src/bobcatlib/safefilename.py
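
The "random filename plus separate metadata" idea could look roughly like this. It is only a sketch under several assumptions: the index is a single JSON file, the helper names `random_name` and `register` are invented here, and `secrets.choice` is swapped in for `random.choice`.

```python
import json
import secrets
import string
from pathlib import Path

# Lowercase letters and digits only, so names stay distinct on
# case-insensitive file systems as well.
ALPHABET = string.ascii_lowercase + string.digits


def random_name(length: int = 16) -> str:
    """Return a random name drawn from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def register(original: str, index_path: Path) -> str:
    """Pick a random filename for `original` and record the pair in a JSON index."""
    if index_path.exists():
        index = json.loads(index_path.read_text(encoding="utf-8"))
    else:
        index = {}
    name = random_name()
    while name in index:  # extremely unlikely, but cheap to rule out
        name = random_name()
    index[name] = original
    index_path.write_text(
        json.dumps(index, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return name
```

The original title then lives only in the index, so the index file has to be backed up together with the renamed files.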

User · Answer

You can use a list comprehension (or, as here, a generator expression) together with the string methods.

    >>> s
    'foo-bar#baz?qux@127/\\9]'
    >>> "".join(x for x in s if x.isalnum())
    'foobarbazqux1279'

User · Answer

Why not just wrap the OS-level open call (for example Python's open()) in a try/except and let the underlying OS sort out whether the filename is valid?

This seems like much less work and is valid no matter which OS you use.
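
A minimal sketch of that idea, assuming a helper named `can_create` (not from the answer) that probes a directory by actually creating the file and removing it again:

```python
import os
import tempfile


def can_create(directory: str, name: str) -> bool:
    """Return True if the OS lets us create `name` inside `directory`."""
    path = os.path.join(directory, name)
    try:
        # "x" = exclusive creation, so an existing file is never truncated.
        with open(path, "x"):
            pass
    except FileExistsError:
        # Already there: acceptable only if it is a regular file.
        return os.path.isfile(path)
    except (OSError, ValueError):
        # OSError: rejected by the OS; ValueError: embedded NUL byte.
        return False
    os.remove(path)  # clean up the probe file
    return True


with tempfile.TemporaryDirectory() as tmp:
    print(can_create(tmp, "song title.mp3"))
    print(can_create(tmp, "bad\0name.mp3"))
```

Note that this only answers the question for the machine and file system it runs on. A name accepted on Linux can still be rejected on Windows, which matters when the files are shared between several operating systems as in the question. The name is also joined to the directory unchecked, so input containing path separators should be rejected beforehand.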

User · Answer

This is the solution I ultimately used:

    import string
    import unicodedata

    validFilenameChars = "-_.() %s%s" % (string.ascii_letters, string.digits)

    def removeDisallowedFilenameChars(filename):
        # decode back to str so that the loop below sees characters, not byte values (Python 3)
        cleanedFilename = unicodedata.normalize('NFKD', filename).encode('ASCII', 'ignore').decode('ASCII')
        return ''.join(c for c in cleanedFilename if c in validFilenameChars)

The unicodedata.normalize call replaces accented characters with the unaccented equivalent, which is better than simply stripping them out. After that, all disallowed characters are removed.

My solution doesn't prepend a known string to avoid possible disallowed filenames, because I know they can't occur given my particular filename format. A more general solution would need to do so.

User · Answer

I liked the python-slugify approach here, but it was also stripping dots away, which was not desired. So I adapted it for uploading a clean filename to S3 this way:

    pip install python-slugify

Example code:

    import os
    from slugify import slugify

    s = 'Very / Unsafe / file\nname hähä \n\r .txt'
    # slugify the base name and the extension (without its dot) separately
    clean_basename = slugify(os.path.splitext(s)[0])
    clean_extension = slugify(os.path.splitext(s)[1][1:])
    if clean_extension:
        clean_filename = '{}.{}'.format(clean_basename, clean_extension)
    elif clean_basename:
        clean_filename = clean_basename
    else:
        clean_filename = 'none'  # only unclean characters

Output:

    >>> clean_filename
    'very-unsafe-file-name-haha.txt'

This is quite failsafe: it works with filenames without an extension, and it even works for filenames consisting only of unsafe characters (the result is none in that case).

User · Answer

If you don't mind installing a package, this should be useful: https://pypi.org/project/pathvalidate/

From https://pypi.org/project/pathvalidate/#sanitize-a-filename:

    from pathvalidate import sanitize_filename

    fname = "fi:l*e/p\"a?t>h|.t<xt"
    print(f"{fname} -> {sanitize_filename(fname)}\n")
    fname = "\0_a*b:c<d>e%f/(g)h+i_0.txt"
    print(f"{fname} -> {sanitize_filename(fname)}\n")

Output:

    fi:l*e/p"a?t>h|.t<xt -> filepath.txt
    _a*b:c<d>e%f/(g)h+i_0.txt -> _abcde%f(g)h+i_0.txt

User · Answer

I realise there are many answers, but they mostly rely on regular expressions or external modules, so I'd like to throw in my own answer: a pure Python function, no external module needed, no regular expression used. My approach is not to clean invalid chars, but to only allow valid ones.

    def normalizefilename(fn):
        validchars = "-_.() "
        out = ""
        for c in fn:
            # note: str.isalpha() and str.isdigit() also accept non-ASCII letters and digits
            if str.isalpha(c) or str.isdigit(c) or (c in validchars):
                out += c
            else:
                out += "_"
        return out

If you like, you can add your own valid chars to the validchars variable at the beginning, such as your national letters that don't exist in the English alphabet. This is something you may or may not want: some file systems that don't run on UTF-8 might still have problems with non-ASCII chars.

This function is meant for a single file name, so it will replace path separators with _, considering them invalid chars. If you want to allow them, it is trivial to modify the if to include the OS path separator.

User · Answer

You could use the re.sub() method to replace anything not "filelike". But in effect, every character could be valid, so there are no prebuilt functions (I believe) to get it done.

    import os
    import re

    name = "File!name?.txt"  # renamed from "str" to avoid shadowing the built-in
    path = os.path.join("/tmp", re.sub(r'[^-a-zA-Z0-9_.() ]+', '', name))
    f = open(path, "w")

This would result in a filehandle to /tmp/Filename.txt.

User · Answer

You have to be careful, though. Your intro does not clearly say whether you are dealing only with Latin-script languages. Some words can become meaningless, or take on another meaning, if you sanitize them down to ASCII characters only.

Imagine you have "forêt poésie" (forest poetry): your sanitization might give "fort-posie" ("strong" plus something meaningless).

It gets worse if you have to deal with Chinese characters: your system might end up producing "---", which is doomed to fail after a while and is not very helpful. So if you are dealing only with files, I would encourage you either to give them a generic name that you control or to keep the characters as they are. For URIs, about the same applies.
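
The difference between dropping non-ASCII characters outright and decomposing them first can be seen in a small experiment. The two helper names are made up for this sketch, and "日本語" is just a sample string chosen for the example.

```python
import unicodedata


def strip_non_ascii(text: str) -> str:
    """Drop every non-ASCII character outright."""
    return text.encode("ascii", "ignore").decode("ascii")


def fold_to_ascii(text: str) -> str:
    """Decompose accented letters first (NFKD), then drop what is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


sample = "for\u00eat po\u00e9sie"                    # "forêt poésie"
print(strip_non_ascii(sample))                        # fort posie
print(fold_to_ascii(sample))                          # foret poesie
print(repr(fold_to_ascii("\u65e5\u672c\u8a9e")))      # '' for "日本語"
```

Normalizing first rescues accented Latin letters, but it does nothing for scripts without an ASCII decomposition, which is the case this answer warns about.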

User · Answer

Here, this should cover all the bases. It handles all types of issues for you, including (but not limited to) character substitution.

It works on Windows, *nix, and almost every other file system, and allows printable characters only.

    import re

    def txt2filename(txt, chr_set='normal'):
        """Converts txt to a valid Windows/*nix filename with printable characters only.

        args:
            txt: The str to convert.
            chr_set: 'normal', 'universal', or 'extended'.
                'universal':    ' -.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
                'normal':       Every printable character except those disallowed on Windows/*nix.
                'extended':     All 'normal' characters plus the extended character ASCII codes 128-255
        """

        FILLER = '-'

        # Step 1: Remove excluded characters.
        if chr_set == 'universal':
            # With a set, the filtering loop below is O(n) instead of O(n * x) for a str.
            printables = set(' -.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz')
        else:
            if chr_set == 'normal':
                max_chr = 127
            elif chr_set == 'extended':
                max_chr = 256
            else:
                raise ValueError(f'The chr_set argument may be normal, extended or universal; not {chr_set!r}')
            EXCLUDED_CHRS = set(r'<>:"/\|?*')               # Illegal characters in Windows filenames.
            EXCLUDED_CHRS.update(chr(127))                  # DEL (non-printable).
            printables = set(chr(x)
                             for x in range(32, max_chr)
                             if chr(x) not in EXCLUDED_CHRS)
        result = ''.join(x if x in printables else FILLER   # Allow printable characters only.
                         for x in txt)

        # Step 2: Device names, '.', and '..' are invalid filenames in Windows.
        DEVICE_NAMES = set('CON PRN AUX NUL COM1 COM2 COM3 COM4 '
                           'COM5 COM6 COM7 COM8 COM9 LPT1 LPT2 '
                           'LPT3 LPT4 LPT5 LPT6 LPT7 LPT8 LPT9 '
                           'CONIN$ CONOUT$ .. .'.split())
        if result.upper() in DEVICE_NAMES:                  # Windows ignores case for device names.
            result = f'-{result}-'

        # Step 3: Maximum length of filename is 255 bytes in Windows and Linux (other *nix flavors may allow longer names).
        result = result[:255]                               # Counts characters, which equals bytes for ASCII only.

        # Step 4: Windows does not allow filenames to end with '.' or ' ' or begin with ' '.
        result = re.sub(r'[. ]$', FILLER, result)
        result = re.sub(r'^ ', FILLER, result)

        return result

This solution needs no external libraries. It substitutes non-printable characters too, because they are not always simple to deal with.

User · Answer

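I'm sure this isn't a great answer, since it modifies the string it's looping over, but it seems to work alright:

    import string

    allowed = string.ascii_letters + string.digits + '_'
    # the loop walks the original string object; your_string is rebound on each change
    for ch in your_string:
        if ch == ' ':
            your_string = your_string.replace(' ', '_')
        # the original test ("not a letter or not a digit") was true for every character
        elif ch not in allowed:
            your_string = your_string.replace(ch, '')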

User · Answer

Keep in mind, there are actually no restrictions on filenames on Unix systems other than:

- It may not contain \0
- It may not contain /

Everything else is fair game.

    $ touch "
    > even multiline
    > haha
    > ^[[31m red ^[[0m
    > evil"
    $ ls -la
    -rw-r--r--       0 Nov 17 23:39 ?even multiline?haha??[31m red ?[0m?evil
    $ ls -lab
    -rw-r--r--       0 Nov 17 23:39 \neven\ multiline\nhaha\n\033[31m\ red\ \033[0m\nevil
    $ perl -e 'for my $i ( glob(q{./*even*}) ){ print $i; } '
    ./
    even multiline
    haha
     red
    evil

Yes, I just stored ANSI colour codes in a file name and had them take effect.

For entertainment, put a BEL character in a directory name and watch the fun that ensues when you cd into it.
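
Expressed as a check in Python, the two rules above come down to very little. The function name `is_valid_posix_component` is hypothetical, and the extra rejection of the empty string, "." and ".." is an addition for this sketch, since those cannot be used as names for new files.

```python
def is_valid_posix_component(name: str) -> bool:
    """Check a single path component against the two rules above."""
    if name in ("", ".", ".."):
        return False
    return "\0" not in name and "/" not in name


# Newlines and ANSI escape sequences pass, as in the shell session above.
print(is_valid_posix_component("\neven multiline\nhaha\n\x1b[31m red \x1b[0m\nevil"))
print(is_valid_posix_component("a/b"))
```

Many file systems additionally cap the length of a single component (255 bytes is a common limit), which this sketch does not check.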

User · Answer

There is a nice project on GitHub called python-slugify.

Install:

    pip install python-slugify

Then use:

    >>> from slugify import slugify
    >>> txt = "This\ is/ a%#$ test ---"
    >>> slugify(txt)
    'this-is-a-test'

User · Answer

Keep in mind  there are actually no restrictions on filenames on Unix systems other than    It may not contain  0  It may not contain      Everything else is fair game       touch     even multiline   haha      31m red    0m   evil    ls -la  -rw-r--r--       0 Nov 17 23 39  even multiline haha   31m red   0m evil   ls -lab -rw-r--r--       0 Nov 17 23 39  neven  multiline nhaha n 033 31m  red   033 0m nevil   perl -e  for my  i   glob q    even       print  i         even multiline haha  red  evil   Yes  i just stored ANSI Colour Codes in a file name and had them take effect    For entertainment  put a BEL character in a directory name and watch the fun that ensues when you CD into it

User · Answer

You can look at the Django framework for how they create a  quot slug quot  from arbitrary text   A slug is URL- and filename- friendly  The Django text utils define a function  slugify    that s probably the gold standard for this kind of thing  Essentially  their code is the following  import unicodedata import re  def slugify value  allow unicode False        quot  quot  quot      Taken from https   github com django django blob master django utils text py     Convert to ASCII if  allow unicode  is False  Convert spaces or repeated     dashes to single dashes  Remove characters that aren t alphanumerics      underscores  or hyphens  Convert to lowercase  Also strip leading and     trailing whitespace  dashes  and underscores       quot  quot  quot      value   str value      if allow unicode          value   unicodedata normalize  NFKC   value      else          value   unicodedata normalize  NFKD   value  encode  ascii    ignore   decode  ascii       value   re sub r    w s-        value lower        return re sub r  - s      -   value  strip  -     And the older version  def slugify value        quot  quot  quot      Normalizes string  converts to lowercase  removes non-alpha characters      and converts spaces to hyphens       quot  quot  quot      import unicodedata     value   unicodedata normalize  NFKD   value  encode  ascii    ignore       value   unicode re sub     w s-        value  strip   lower        value   unicode re sub   - s      -   value                 return value  There s more  but I left it out  since it doesn t address slugification  but escaping

User · Answer

What is the reason to use the strings as file names  If human readability is not a factor I would go with base64 module which can produce file system safe strings  It won t be readable but you won t have to deal with collisions and it is reversible   import base64 file name string   base64 urlsafe b64encode your string    Update  Changed based on Matthew comment

User · Answer

Most of these solutions don t work     hello world  -   helloworld     helloworld   -   helloworld   This isn t what you want generally  say you are saving the html for each link  you re going to overwrite the html for a different webpage   I pickle a dict such as     helloworld                 hello world    helloworld     helloworld     helloworld1        2          2 represents the number that should be appended to the next filename   I look up the filename each time from the dict  If it s not there  I create a new one  appending the max number if needed

User · Answer

Another issue that the other comments haven t addressed yet is the empty string  which is obviously not a valid filename  You can also end up with an empty string from stripping too many characters   What with the Windows reserved filenames and issues with dots  the safest answer to the question    how do I normalise a valid filename from arbitrary user input     is    don t even bother try     if you can find any other way to avoid it  eg  using integer primary keys from a database as filenames   do that   If you must  and you really need to allow spaces and         for file extensions as part of the name  try something like   import re badchars  re compile r   A-Za-z0-9                         badnames  re compile r  aux com 1-9  con lpt 1-9  prn           def makeName s       name  badchars sub      s      if badnames match name           name      name     return name   Even this can t be guaranteed right especially on unexpected OSs     for example RISC OS hates spaces and uses         as a directory separator

User · Answer

This whitelist approach (i.e. allowing only the chars present in valid_chars) will work if there aren't limits on the formatting of the files or combinations of valid chars that are illegal (like ".."). For example, what you say would allow a filename named " . txt", which I think is not valid on Windows. As this is the simplest approach, I'd try to remove whitespace from valid_chars and prepend a known valid string in case of error. Any other approach will have to know what is allowed where, to cope with Windows file naming limitations, and will thus be a lot more complex.

    >>> import string
    >>> valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
    >>> valid_chars
    '-_.() abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
    >>> filename = "This Is a (valid) - filename%$&$ .txt"
    >>> ''.join(c for c in filename if c in valid_chars)
    'This Is a (valid) - filename .txt'

User · Answer

Why not just wrap the os.open call in a try/except and let the underlying OS sort out whether the file name is valid?

This seems like much less work, and it is valid no matter which OS you use.
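
A minimal sketch of that probe, assuming it is acceptable to briefly create and delete a file in the target directory (the function name is_creatable is made up for illustration):

```python
import os


def is_creatable(directory: str, filename: str) -> bool:
    """Ask the running OS whether a new file with this name can be created in directory."""
    path = os.path.join(directory, filename)
    try:
        # O_EXCL makes the call fail instead of touching an existing file.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except (OSError, ValueError):
        # OSError: rejected by the OS; ValueError: embedded NUL byte.
        return False
    os.close(fd)
    os.remove(path)
    return True
```

Note that this only reports what the current OS and file system accept. For a library shared between Windows, Linux and Mac OS, a name that passes on one machine can still fail on another, so the probe complements a strict whitelist rather than replacing it.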

User · Answer

You could use the re.sub() method to replace anything not "filelike". But in effect, every character could be valid, so there are no prebuilt functions (I believe) to get it done.

    import os
    import re

    name = "File!name?.txt"
    # Drop every character outside the whitelist, then open the file for writing.
    f = open(os.path.join("/tmp", re.sub(r"[^-a-zA-Z0-9_.() ]+", "", name)), "w")

Would result in a file handle to /tmp/filename.txt.

User · Answer

Answer modified for Python 3.6:

    import string
    import unicodedata

    validFilenameChars = "-_.() %s%s" % (string.ascii_letters, string.digits)

    def removeDisallowedFilenameChars(filename):
        cleanedFilename = unicodedata.normalize('NFKD', filename).encode('ASCII', 'ignore')
        return ''.join(chr(c) for c in cleanedFilename if chr(c) in validFilenameChars)

User · Answer

Keep in mind, there are actually no restrictions on filenames on Unix systems other than:

    It may not contain \0
    It may not contain /

Everything else is fair game.

    $ touch "
    > even multiline
    > haha
    > ^[[31m red ^[[0m
    > evil"
    $ ls -la
    -rw-r--r--       0 Nov 17 23:39 ?even multiline?haha??[31m red ?[0m?evil
    $ ls -lab
    -rw-r--r--       0 Nov 17 23:39 \neven\ multiline\nhaha\n\033[31m\ red\ \033[0m\nevil
    $ perl -e 'for my $i ( glob(q{./*even*}) ){ print $i; } '
    ./
    even multiline
    haha
     red
    evil

Yes, I just stored ANSI colour codes in a file name and had them take effect.

For entertainment, put a BEL character in a directory name and watch the fun that ensues when you cd into it.
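
Those two rules are easy to express as a check. The sketch below adds three things that are not in the list above and should be read as assumptions: it rejects the empty string and the special entries "." and "..", and it applies a 255-byte limit, which is common but file-system specific. The function name is hypothetical.

```python
def is_valid_posix_component(name: str, max_bytes: int = 255) -> bool:
    """Check a single path component against the minimal Unix rules."""
    if name in ("", ".", ".."):
        return False  # empty, or reserved directory entries
    if "\0" in name or "/" in name:
        return False  # the only two forbidden characters
    # Assumed limit: many file systems cap a component at 255 bytes.
    return len(name.encode("utf-8")) <= max_bytes
```

Passing this check only means the name is legal, not that it is a good idea: newlines and escape sequences pass, exactly as in the session above.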

User · Answer

There is a nice project on GitHub called python-slugify.

Install:

    pip install python-slugify

Then use:

    >>> from slugify import slugify
    >>> txt = "This\ is/ a%#$ test ---"
    >>> slugify(txt)
    'this-is-a-test'

User · Answer

Just like S.Lott answered, you can look at the Django framework for how they convert a string to a valid filename.

The most recent and updated version is found in utils/text.py, and defines get_valid_filename, which is as follows:

    import re

    def get_valid_filename(s):
        # Trim, turn spaces into underscores, then drop everything
        # that is not a word character, a dash, or a dot.
        s = str(s).strip().replace(' ', '_')
        return re.sub(r'(?u)[^-\w.]', '', s)

See https://github.com/django/django/blob/master/django/utils/text.py

User · Answer

You have to be careful, though. Your question does not say clearly whether you are only dealing with Latin-script languages. Some words can become meaningless, or take on another meaning, if you sanitize them down to ASCII characters only.

Imagine you have "forêt poésie" (forest poetry): your sanitization might give "fort-posie" ("strong" plus something meaningless).

It gets worse if you have to deal with Chinese characters: your system might end up producing "---", which is doomed to fail after a while and is not very helpful. So if you deal only with files, I would encourage you either to name them with a generic string that you control, or to keep the characters as they are. For URIs, about the same applies.
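
The difference is easy to see by comparing three strategies side by side. This is an illustrative sketch; the function names are made up, and the character class kept by keep_unicode is an assumption, not a complete rule for any file system.

```python
import re
import unicodedata


def ascii_ignore(text: str) -> str:
    """Drop every non-ASCII character outright."""
    return text.encode("ascii", "ignore").decode("ascii")


def ascii_fold(text: str) -> str:
    """Decompose accented letters first, so the base letter survives."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def keep_unicode(text: str) -> str:
    """Keep letters and digits from any script; drop other punctuation."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"[^\w\s.-]", "", text).strip()
```

With "forêt poésie", ascii_ignore loses the accented letters entirely, ascii_fold keeps their base letters, and keep_unicode leaves the words intact. For text in a script with no ASCII decomposition, both ASCII strategies return an empty string, which is the failure mode described above.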

User · Answer

Just to further complicate things, you are not guaranteed to get a valid filename just by removing invalid characters. Since allowed characters differ between file systems, a conservative approach could end up turning a valid name into an invalid one. You may want to add special handling for the cases where:

    The string is all invalid characters (leaving you with an empty string).
    You end up with a string with a special meaning, e.g. "." or "..".
    On Windows, certain device names are reserved. For instance, you can't create a file named "nul", "nul.txt" (or nul.anything, in fact). The reserved names are: CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.

You can probably work around these issues by prepending some string to the filenames that can never result in one of these cases, and stripping invalid characters.

User · Answer

If you don't mind installing a package, this should be useful: https://pypi.org/project/pathvalidate/

From https://pypi.org/project/pathvalidate/#sanitize-a-filename:

    from pathvalidate import sanitize_filename

    fname = "fi:l*e/p\"a?t>h|.t<xt"
    print(f"{fname} -> {sanitize_filename(fname)}\n")

    fname = "\0_a*b:c<d>e%f/(g)h+i_0.txt"
    print(f"{fname} -> {sanitize_filename(fname)}\n")

Output:

    fi:l*e/p"a?t>h|.t<xt -> filepath.txt
    _a*b:c<d>e%f/(g)h+i_0.txt -> _abcde%f(g)h+i_0.txt

User · Answer

I liked the python-slugify approach here, but it was also stripping dots away, which was not desired. So I adapted it for uploading a clean filename to S3 this way:

    pip install python-slugify

Example code:

    import os
    from slugify import slugify

    s = 'Very / Unsafe / file\nname hähä \n\r .txt'
    # Slugify the base name and the extension separately so the dot survives.
    clean_basename = slugify(os.path.splitext(s)[0])
    clean_extension = slugify(os.path.splitext(s)[1][1:])
    if clean_extension:
        clean_filename = '{}.{}'.format(clean_basename, clean_extension)
    elif clean_basename:
        clean_filename = clean_basename
    else:
        clean_filename = 'none'  # only unclean characters

Output:

    >>> clean_filename
    'very-unsafe-file-name-haha.txt'

This is quite failsafe: it works with filenames without an extension, and it even works for file names made up only of unsafe characters (the result is 'none' in that case).

User · Answer

Not exactly what the OP was asking for, but this is what I use, because I need unique and reversible conversions:

    # p3 code
    def safePath(url):
        return ''.join(map(lambda ch: chr(ch) if ch in safePath.chars else '%%%02x' % ch, url.encode('utf-8')))
    safePath.chars = set(map(lambda x: ord(x), '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+-_ .'))

The result is "somewhat" readable, at least from a sysadmin point of view.
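
Because the escapes use the same %xx form as URL percent-encoding, the reverse direction can be handled by the standard library. A sketch under the assumption that the safe set is the one shown here (alphanumerics plus "+-_ ."); the names safe_path and original_url are illustrative:

```python
from urllib.parse import unquote

# Assumed safe set: ASCII letters, digits and a few punctuation characters.
SAFE_BYTES = set(b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+-_ .")


def safe_path(url: str) -> str:
    """Escape every byte outside SAFE_BYTES as %xx (lowercase hex)."""
    return "".join(
        chr(b) if b in SAFE_BYTES else "%%%02x" % b
        for b in url.encode("utf-8")
    )


def original_url(name: str) -> str:
    """Undo safe_path; unquote decodes %xx sequences as UTF-8 and leaves '+' alone."""
    return unquote(name)
```

The conversion stays reversible because "%" itself is not in the safe set, so a literal percent sign is always escaped as %25.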

User · Answer

I realise there are many answers, but they mostly rely on regular expressions or external modules, so I'd like to throw in my own answer: a pure Python function, no external module needed, no regular expression used. My approach is not to clean invalid chars, but to only allow valid ones.

    def normalizefilename(fn):
        validchars = "-_.() "
        out = ""
        for c in fn:
            if str.isalpha(c) or str.isdigit(c) or (c in validchars):
                out += c
            else:
                out += "_"
        return out

If you like, you can add your own valid chars to the validchars variable at the beginning, such as your national letters that don't exist in the English alphabet. This is something you may or may not want: some file systems that don't run on UTF-8 might still have problems with non-ASCII chars.

This function is meant to test a single file name for validity, so it will replace path separators with _, considering them invalid chars. If you want to allow them, it is trivial to modify the if to include the OS path separator.

User · Answer

Here, this should cover all the bases. It handles all types of issues for you, including (but not limited to) character substitution.

Works in Windows, *nix, and almost every other file system. Allows printable characters only.

    import re

    def txt2filename(txt, chr_set='normal'):
        """Converts txt to a valid Windows/*nix filename with printable characters only.

        args:
            txt: The str to convert.
            chr_set: 'normal', 'universal', or 'extended'.
                'universal':    ' -.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
                'normal':       Every printable character except those disallowed on Windows/*nix.
                'extended':     All 'normal' characters plus the extended character ASCII codes 128-255.
        """
        FILLER = '-'

        # Step 1: Remove excluded characters.
        if chr_set == 'universal':
            # Lookups in a set are O(1) vs O(n) for a str.
            printables = set(' -.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz')
        else:
            if chr_set == 'normal':
                max_chr = 127
            elif chr_set == 'extended':
                max_chr = 256
            else:
                raise ValueError(f'The chr_set argument may be normal, extended or universal; not {chr_set}')
            EXCLUDED_CHRS = set(r'<>:"/\|?*')   # Illegal characters in Windows filenames.
            EXCLUDED_CHRS.update(chr(127))      # DEL (non-printable).
            printables = set(chr(x)
                             for x in range(32, max_chr)
                             if chr(x) not in EXCLUDED_CHRS)
        result = ''.join(x if x in printables else FILLER   # Allow printable characters only.
                         for x in txt)

        # Step 2: Device names, '.', and '..' are invalid filenames in Windows.
        DEVICE_NAMES = ('CON PRN AUX NUL COM1 COM2 COM3 COM4 '
                        'COM5 COM6 COM7 COM8 COM9 LPT1 LPT2 '
                        'LPT3 LPT4 LPT5 LPT6 LPT7 LPT8 LPT9 '
                        'CONIN$ CONOUT$ .. .').split()   # Searching this list is an O(n) operation.
        if result.upper() in DEVICE_NAMES:      # Windows treats device names case-insensitively.
            result = f'-{result}-'

        # Step 3: Maximum length of filename is 255 bytes in Windows and Linux (other *nix flavors may allow longer names).
        result = result[:255]

        # Step 4: Windows does not allow filenames to end with '.' or ' ' or begin with ' '.
        result = re.sub(r'[. ]$', FILLER, result)
        result = re.sub(r'^ ', FILLER, result)

        return result

This solution needs no external libraries. It substitutes non-printable characters too, because they are not always simple to deal with.
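
One detail about Step 3: slicing a str to 255 counts characters, not bytes, so a name containing multi-byte characters can still exceed a 255-byte limit once encoded. A small sketch of a byte-aware truncation, assuming the file system stores names as UTF-8 (the helper name truncate_utf8 is made up):

```python
def truncate_utf8(name: str, max_bytes: int = 255) -> str:
    """Shorten name so its UTF-8 encoding fits in max_bytes, without splitting a character."""
    encoded = name.encode("utf-8")
    if len(encoded) <= max_bytes:
        return name
    # Cutting may leave a partial multi-byte sequence at the end;
    # errors="ignore" drops that incomplete tail.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

This cuts from the end, so a long title would lose its extension. If the extension matters, it is probably better to split it off first (for example with os.path.splitext), truncate the base name, and join the two again.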

User · Answer

In one line:

    valid_file_name = re.sub(r'[^\w_.)( -]', '', any_string)

You can also substitute a '_' character instead of the empty string to make the result more readable (when replacing slashes, for example).

User · Answer

You can use a generator expression together with the string methods:

    >>> s = 'foo-bar#baz?qux@127/\\9]'
    >>> "".join(x for x in s if x.isalnum())
    'foobarbazqux1279'
