What is the best way to remove accents (normalize) in a Python unicode string?

Question

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the web an elegant way to do this (in Java):

1. convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
2. remove all the characters whose Unicode type is "diacritic".

Do I need to install a library such as pyICU, or is this possible with just the Python standard library? And what about Python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterparts.

User · Accepted Answer

Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ascii text.

Example:

accented_string = u'Málaga'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga' and is of type 'str'

User · Answer

gensim.utils.deaccent(text) from Gensim (topic modelling for humans):

>>> from gensim.utils import deaccent
>>> deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
'Sef chomutovskych komunistu dostal postou bily prasek'

Another solution is unidecode.

Note that the suggested solution with unicodedata typically removes accents only in some characters (e.g. it turns 'ł' into '', rather than into 'l').
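The 'ł' behaviour noted above can be checked with the standard library alone. This is a minimal sketch; the helper names `strip_combining` and `ascii_only` are made up for illustration and are not part of gensim, unidecode, or unicodedata.

```python
import unicodedata


def strip_combining(text):
    """Decompose, then drop combining marks (hypothetical helper name)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def ascii_only(text):
    """Decompose, then drop every non-ASCII character (hypothetical helper name)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# 'é' decomposes into 'e' + COMBINING ACUTE ACCENT, so both helpers give 'e'.
print(repr(strip_combining("\u00e9")), repr(ascii_only("\u00e9")))

# 'ł' (U+0142) has no decomposition: the first helper leaves it untouched,
# the second one drops it entirely. Neither produces a plain 'l'.
print(repr(strip_combining("\u0142")), repr(ascii_only("\u0142")))
```

Neither standard-library variant maps 'ł' to 'l', which is where a transliteration library such as unidecode or gensim's deaccent may be the more practical choice.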

User · Answer

In response to @MiniQuark's answer:

I was trying to read in a csv file that was half-French (containing accents) and also some strings which would eventually become integers and floats. As a test, I created a test.txt file that looked like this:

Montréal, über, 12.89, Mère, Françoise, noël, 889

I had to include lines 2 and 3 to get it to work (which I found in a Python ticket), as well as incorporate @Jabba's comment:

import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import csv
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)

The result:

Montreal
uber
12.89
Mere
Francoise
noel
889

(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)

User · Answer

Some languages have combining diacritics as language letters and accent diacritics to specify accent.

I think it is safer to specify explicitly which diacritics you want to strip:

import unicodedata

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    # Resolve the Unicode character names to the actual combining characters
    accents = set(map(unicodedata.lookup, accents))
    # Decompose, drop only the listed marks, then recompose what is left
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))
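To illustrate the selective approach, here is a small self-contained sketch. The function name `strip_selected` and the sample strings are chosen for illustration only; the default set of marks (acute and grave) is an assumption, not a recommendation.

```python
import unicodedata


def strip_selected(text, mark_names=("COMBINING ACUTE ACCENT",
                                     "COMBINING GRAVE ACCENT")):
    """Remove only the named combining marks (hypothetical helper name)."""
    marks = {unicodedata.lookup(name) for name in mark_names}
    kept = [c for c in unicodedata.normalize("NFD", text) if c not in marks]
    # Recompose so that untouched letters such as 'ü' come back as one code point.
    return unicodedata.normalize("NFC", "".join(kept))


# Acute and grave accents are removed; diaeresis and tilde are kept.
print(strip_selected("\u00e9 \u00e8 \u00fc \u00f1"))  # e e ü ñ

# Cyrillic 'й' is 'и' + COMBINING BREVE; the breve is not listed, so it survives.
print(strip_selected("\u0439") == "\u0439")  # True
```

Passing a different tuple of Unicode character names as `mark_names` changes which marks are stripped.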

User · Answer

I work on a project that must be compatible with Python 2.6, 2.7 and 3.4, and I have to create IDs from free user entries.

Thanks to you, I have created this function that works wonders.

import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError):  # unicode is a default on python 3
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text

Result:

text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'

User · Answer

How about this:

import unicodedata
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

This works on Greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>

The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, umlauts etc. are not "decoration".
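The two filters mentioned here (category "Mn" and unicodedata.combining) can be compared side by side. The sketch below uses made-up function names and only checks the sample string from this answer; it does not show that the two filters agree for every character in Unicode.

```python
import unicodedata


def strip_by_category(s):
    """Drop characters whose general category is Mn (Nonspacing_Mark)."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")


def strip_by_combining(s):
    """Drop characters with a nonzero canonical combining class."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(c))


sample = "A \u00c0 \u0394 \u038e"
print(strip_by_category(sample) == strip_by_combining(sample))  # True

# COMBINING ACUTE ACCENT is matched by both tests: category Mn, combining class 230.
print(unicodedata.category("\u0301"), unicodedata.combining("\u0301"))  # Mn 230
```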

User · Answer

This handles not only accents, but also "strokes" (as in ø etc.):

import unicodedata as ud

def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass  # removing "WITH ..." produced an invalid name
    return char

This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I don't think it is very elegant indeed. In fact, it's more of a hack, as pointed out in comments, since Unicode names are really just names; they give no guarantee to be consistent or anything.

There are still special letters that are not handled by this, such as turned and inverted letters, since their Unicode name does not contain 'WITH'. It depends on what you want to do anyway. I sometimes needed accent stripping for achieving dictionary sort order.

EDIT NOTE:

Incorporated suggestions from the comments (handling lookup errors, Python 3 code).
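The function above works on a single character. A possible way to apply the same name-based trick to a whole string is sketched below. The names `base_char` and `strip_by_name` are hypothetical, and two choices are assumptions of this sketch rather than part of the answer: passing a default to `unicodedata.name` so that unnamed characters (such as newlines) do not raise, and normalizing to NFC first so that letters are precomposed.

```python
import unicodedata as ud


def base_char(char):
    """Return the character named by the part of char's name before ' WITH '."""
    name = ud.name(char, "")  # "" for characters that have no name
    cutoff = name.find(" WITH ")
    if cutoff == -1:
        return char
    try:
        return ud.lookup(name[:cutoff])
    except KeyError:
        return char  # the truncated name is not a valid character name


def strip_by_name(text):
    """Apply base_char to every character of the NFC-normalized text."""
    return "".join(base_char(c) for c in ud.normalize("NFC", text))


print(strip_by_name("\u00d8resund, \u0141\u00f3d\u017a"))  # Oresund, Lodz
```

As the answer says, this relies on naming conventions only, so it is best treated as a heuristic.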

User · Answer

I just found this answer on the Web:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the Unicode characters that are tagged as being diacritics.

Edit: this does the trick:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

unicodedata.combining(c) will return true if the character c can be combined with the preceding character, that is mainly if it's a diacritic.

Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:

encoding = "utf-8"  # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"caf\xc3\xa9"  # UTF-8 bytes for "café"; before Python 3, simply "café"
unicode_string = byte_string.decode(encoding)
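For Python 3, the decoding step from Edit 2 can be folded into the function itself. The following is only a sketch under that assumption: the wrapper name `remove_accents_any` and its `encoding` parameter are invented for illustration.

```python
import unicodedata


def remove_accents_any(value, encoding="utf-8"):
    """Accept str or bytes and return a str without combining marks.

    Hypothetical wrapper: bytes are decoded with the given encoding first.
    """
    if isinstance(value, bytes):
        value = value.decode(encoding)
    nfkd_form = unicodedata.normalize("NFKD", value)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))


print(remove_accents_any("caf\u00e9"))                          # cafe
print(remove_accents_any(b"caf\xc3\xa9"))                       # cafe (UTF-8 input)
print(remove_accents_any(b"caf\xe9", encoding="iso-8859-15"))   # cafe

# Greek letters keep their base form instead of being dropped.
print(remove_accents_any("\u038e") == "\u03a5")                 # True
```

The caller still has to know the encoding of any byte input; a wrong encoding will either raise UnicodeDecodeError or silently produce the wrong characters.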

User · Answer

import unicodedata
from random import choice

import perfplot
import regex
import text_unidecode


def remove_accent_chars_regex(x: str):
    return regex.sub(r'\p{Mn}', '', unicodedata.normalize('NFKD', x))


def remove_accent_chars_join(x: str):
    # answer by MiniQuark
    # https://stackoverflow.com/a/517974/7966259
    return u"".join([c for c in unicodedata.normalize('NFKD', x) if not unicodedata.combining(c)])


perfplot.show(
    setup=lambda n: ''.join([choice('Málaga François Phút Hơn 中文') for i in range(n)]),
    kernels=[
        remove_accent_chars_regex,
        remove_accent_chars_join,
        text_unidecode.unidecode,
    ],
    labels=['regex', 'join', 'unidecode'],
    n_range=[2 ** k for k in range(22)],
    equality_check=None, relative_to=0, xlabel='str len'
)
