[python] python encoding utf-8

You don't need to encode data that is already encoded. When you try to do that, Python will first try to decode it to unicode before it can encode it back to UTF-8. That is what is failing here:

>>> data = u'\u00c3'            # Unicode data
>>> data = data.encode('utf8')  # encoded to UTF-8
>>> data
'\xc3\x83'
>>> data.encode('utf8')         # Try to *re*-encode it
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Just write your data directly to the file, there is no need to encode already-encoded data.

If you instead build up unicode values instead, you would indeed have to encode those to be writable to a file. You'd want to use codecs.open() instead, which returns a file object that will encode unicode values to UTF-8 for you.

You also really don't want to write out the UTF-8 BOM, unless you have to support Microsoft tools that cannot read UTF-8 otherwise (such as MS Notepad).

For your MySQL insert problem, you need to do two things:

  • Add charset='utf8' to your MySQLdb.connect() call.

  • Use unicode objects, not str objects when querying or inserting, but use sql parameters so the MySQL connector can do the right thing for you:

    artiste = artiste.decode('utf8')  # it is already UTF8, decode to unicode
    
    c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))
    
    # ...
    
    c.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
    

It may actually work better if you used codecs.open() to decode the contents automatically instead:

import codecs

sql = mdb.connect('localhost','admin','ugo&(-@F','music_vibration', charset='utf8')

with codecs.open('config/index/'+index, 'r', 'utf8') as findex:
    for line in findex:
        if u'#artiste' not in line:
            continue

        artiste=line.split(u'[:::]')[1].strip()

    cursor = sql.cursor()
    cursor.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))
    if not cursor.fetchone()[0]:
        cursor = sql.cursor()
        cursor.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
        artists_inserted += 1

You may want to brush up on Unicode and UTF-8 and encodings. I can recommend the following articles:

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to unicode

How to resolve TypeError: can only concatenate str (not "int") to str (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape UnicodeEncodeError: 'ascii' codec can't encode character at special name Python NLTK: SyntaxError: Non-ASCII character '\xc3' in file (Sentiment Analysis -NLP) HTML for the Pause symbol in audio and video control Javascript: Unicode string to hex Concrete Javascript Regex for Accented Characters (Diacritics) Replace non-ASCII characters with a single space UTF-8 in Windows 7 CMD NameError: global name 'unicode' is not defined - in Python 3

Examples related to encoding

How to check encoding of a CSV file UnicodeEncodeError: 'ascii' codec can't encode character at special name Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? The character encoding of the plain text document was not declared - mootool script UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) How to encode text to base64 in python UTF-8 output from PowerShell Set Encoding of File to UTF8 With BOM in Sublime Text 3 Replace non-ASCII characters with a single space

Examples related to utf-8

error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte Changing PowerShell's default output encoding to UTF-8 'Malformed UTF-8 characters, possibly incorrectly encoded' in Laravel Encoding Error in Panda read_csv Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? what is <meta charset="utf-8">? Pandas df.to_csv("file.csv" encode="utf-8") still gives trash characters for minus sign UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) Android Studio : unmappable character for encoding UTF-8