A bunch of the tweets I am importing are having this issue where they read
b'I posted a new photo to Facebook'
I gather the b prefix indicates that it is a bytes object. But this is proving problematic because in the CSV files that I end up writing, the b prefix doesn't go away and interferes with later code.
Is there a simple way to remove this b prefix from my lines of text?
Keep in mind, I seem to need to have the text encoded in utf-8 or tweepy has trouble pulling them from the web.
Here's the link content I'm analyzing:
https://www.dropbox.com/s/sjmsbuhrghj7abt/new_tweets.txt?dl=0
new_tweets = 'content in the link'
outtweets = [[tweet.text.encode("utf-8").decode("utf-8")] for tweet in new_tweets]
print(outtweets)
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-21-6019064596bf> in <module>()
1 for screen_name in user_list:
----> 2 get_all_tweets(screen_name,"instance file")
<ipython-input-19-e473b4771186> in get_all_tweets(screen_name, mode)
99 with open(os.path.join(save_location,'%s.instance' % screen_name), 'w') as f:
100 writer = csv.writer(f)
--> 101 writer.writerows(outtweets)
102 else:
103 with open(os.path.join(save_location,'%s.csv' % screen_name), 'w') as f:
C:\Users\Stan Shunpike\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
17 class IncrementalEncoder(codecs.IncrementalEncoder):
18 def encode(self, input, final=False):
---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
20
21 class IncrementalDecoder(codecs.IncrementalDecoder):
UnicodeEncodeError: 'charmap' codec can't encode characters in position 64-65: character maps to <undefined>
Assuming you don't want to immediately decode it again as others are suggesting here, you can convert it to a string and then just strip the leading b' and the trailing '.
>>> x = "Hi there \ufffd"
>>> x = x.encode("utf-8")
>>> x
b'Hi there \xef\xbf\xbd'
>>> str(x)[2:-1]
'Hi there \\xef\\xbf\\xbd'
You need to decode it to convert it to a string. See the answers about bytes literals in Python 3.
In [1]: b'I posted a new photo to Facebook'.decode('utf-8')
Out[1]: 'I posted a new photo to Facebook'
How to remove the b' ' characters by decoding to a string in Python:
import base64
a='cm9vdA=='
b=base64.b64decode(a).decode('utf-8')
print(b)
It is just letting you know that the object you are printing is not a string but a bytes object, displayed as a bytes literal. People explain this in incomplete ways, so here is my take.
Consider creating a bytes object by typing a bytes literal (that is, defining a bytes object directly in the source code by typing b'') and converting it into a string object by decoding it as UTF-8. (Note that converting here means decoding.)
byte_object= b"test" # byte object by literally typing characters
print(byte_object) # Prints b'test'
print(byte_object.decode('utf8')) # Prints "test" without quotations
You see that we simply apply the .decode('utf8') method.
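To make the difference between str() and .decode() concrete, here is a minimal sketch. The helper name to_text is made up for illustration and is not part of any library; it assumes the bytes are UTF-8 and replaces anything undecodable rather than raising.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is a bytes object."""
    if isinstance(value, (bytes, bytearray)):
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
        return value.decode(encoding, errors="replace")
    return str(value)


raw = b"I posted a new photo to Facebook"
as_repr = str(raw)      # the repr of the bytes object, b prefix and quotes included
as_text = to_text(raw)  # the decoded text, no prefix
print(as_repr)
print(as_text)
```

Calling str() on a bytes object returns its repr, which is where a stray b'...' in a CSV file usually comes from; decoding returns the text itself.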
https://docs.python.org/3.3/library/stdtypes.html#bytes
https://docs.python.org/3.3/reference/lexical_analysis.html#string-and-bytes-literals
stringliteral ::= [stringprefix](shortstring | longstring)
stringprefix ::= "r" | "u" | "R" | "U"
shortstring ::= "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring ::= "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem ::= shortstringchar | stringescapeseq
longstringitem ::= longstringchar | stringescapeseq
shortstringchar ::= <any source character except "\" or newline or the quote>
longstringchar ::= <any source character except "\">
stringescapeseq ::= "\" <any source character>
bytesliteral ::= bytesprefix(shortbytes | longbytes)
bytesprefix ::= "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB"
shortbytes ::= "'" shortbytesitem* "'" | '"' shortbytesitem* '"'
longbytes ::= "'''" longbytesitem* "'''" | '"""' longbytesitem* '"""'
shortbytesitem ::= shortbyteschar | bytesescapeseq
longbytesitem ::= longbyteschar | bytesescapeseq
shortbyteschar ::= <any ASCII character except "\" or newline or the quote>
longbyteschar ::= <any ASCII character except "\">
bytesescapeseq ::= "\" <any ASCII character>
Although the question is very old, I think this may be helpful to anyone facing the same problem. Here the text is a string like the one below:
text= "b'I posted a new photo to Facebook'"
Thus you cannot remove the b by decoding it, because it is not a bytes object. I did the following to remove it.
cleaned_text = text.split("b'", 1)[1][:-1]
which will give "I posted a new photo to Facebook"
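If the stored string is really the repr of a bytes object, another option is to parse it back into bytes and decode it, which also restores escaped characters such as \xc3\xa9. This is only a sketch: the helper name unwrap_bytes_repr is hypothetical, and it assumes the original bytes were UTF-8.

```python
import ast


def unwrap_bytes_repr(text, encoding="utf-8"):
    """Turn a string such as "b'abc'" back into 'abc'; leave other strings alone."""
    if text[:2] in ("b'", 'b"'):
        try:
            # literal_eval only parses Python literals, it does not run code
            value = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return text
        if isinstance(value, bytes):
            return value.decode(encoding)
    return text


text = "b'I posted a new photo to Facebook'"
cleaned_text = unwrap_bytes_repr(text)
print(cleaned_text)
```

Strings that do not look like a bytes repr, or that fail to parse, are returned unchanged.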
I got it done by encoding only the output as UTF-8. Here is the code example:
import csv

new_tweets = api.GetUserTimeline(screen_name=user, count=200)
result = new_tweets[0]
try:
    text = result.text
except AttributeError:
    text = ''
# newline='' lets the csv module control the line endings
with open(file_name, 'a', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([text])  # one row holding one field
i.e., do not encode when collecting data from the API; encode only the output (print or write).
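The same idea can be sketched without any Twitter client. The traceback in the question comes from the cp1252 codec, which suggests the file was opened with the platform default encoding; passing encoding="utf-8" to open() avoids that while keeping the rows as ordinary str values. The sample tweets and the temporary file path below are made up for illustration.

```python
import csv
import os
import tempfile

# Sample data standing in for tweet.text values (already str, not bytes)
tweets = ["I posted a new photo to Facebook", "caf\u00e9 \U0001F600"]
outtweets = [[tweet] for tweet in tweets]

path = os.path.join(tempfile.mkdtemp(), "tweets.csv")

# Encode once, at the file boundary, instead of calling .encode() on each tweet
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(outtweets)

# Reading back with the same encoding returns the original text
with open(path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))

print(rows == outtweets)
```

Because nothing is ever converted to bytes by hand, no b prefix can end up in the file.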
On Python 3.6 with Django 2.0, decoding a bytes literal does not work as expected. I do get the right result when I print the decoded value, but the b'value' is still there in the rendered link.
This is what I'm encoding:
'uid': urlsafe_base64_encode(force_bytes(user.pk)),
This is what I'm decoding:
uid = force_text(urlsafe_base64_decode(uidb64))
This is what the Django 2.0 documentation says:
urlsafe_base64_encode(s)
Encodes a bytestring in base64 for use in URLs, stripping any trailing equal signs.
urlsafe_base64_decode(s)
Decodes a base64 encoded string, adding back any trailing equal signs that might have been stripped.
This is my account_activation_email_test.html file
{% autoescape off %}
Hi {{ user.username }},
Please click on the link below to confirm your registration:
http://{{ domain }}{% url 'accounts:activate' uidb64=uid token=token %}
{% endautoescape %}
This is my console response:
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: Activate Your MySite Account
From: webmaster@localhost
To: [email protected]
Date: Fri, 20 Apr 2018 06:26:46 -0000
Message-ID: <152420560682.16725.4597194169307598579@Dash-U>
Hi testuser,
Please click on the link below to confirm your registration:
http://127.0.0.1:8000/activate/b'MjU'/4vi-fasdtRf2db2989413ba/
as you can see uid = b'MjU'
expected uid = MjU
test in console:
$ python
Python 3.6.4 (default, Apr 7 2018, 00:45:33)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from django.utils.http import urlsafe_base64_encode, urlsafe_base64_decode
>>> from django.utils.encoding import force_bytes, force_text
>>> var1=urlsafe_base64_encode(force_bytes(3))
>>> print(var1)
b'Mw'
>>> print(var1.decode())
Mw
>>>
After investigating, it seems to be related to Python 3. My workaround was quite simple:
'uid': user.pk,
I receive it as uidb64 in my activate function:
user = User.objects.get(pk=uidb64)
and voilà:
Content-Transfer-Encoding: 7bit
Subject: Activate Your MySite Account
From: webmaster@localhost
To: [email protected]
Date: Fri, 20 Apr 2018 20:44:46 -0000
Message-ID: <152425708646.11228.13738465662759110946@Dash-U>
Hi testuser,
Please click on the link below to confirm your registration:
http://127.0.0.1:8000/activate/45/4vi-3895fbb6b74016ad1882/
now it works fine. :)
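An alternative that keeps the base64 step is to decode the encoded value to str before it reaches the template, since the console test above shows the encoder returning bytes. The sketch below uses only the standard library to mimic that behaviour; encode_uid and decode_uid are hypothetical names, not Django functions.

```python
import base64


def encode_uid(pk):
    """Return a URL-safe base64 str for pk, without trailing '=' padding."""
    raw = base64.urlsafe_b64encode(str(pk).encode("utf-8"))  # bytes
    return raw.rstrip(b"=").decode("ascii")                   # str, no b prefix


def decode_uid(uidb64):
    """Reverse encode_uid, restoring the stripped padding first."""
    padded = uidb64 + "=" * (-len(uidb64) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


uid = encode_uid(25)
print(uid)
print(decode_uid(uid))
```

With Django's own helper, the equivalent change would presumably be to call .decode() on the result of urlsafe_base64_encode before putting it in the template context, as the console test does with var1.decode().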
Source: Stackoverflow.com