This will surely be an easy one but it is really bugging me.
I have a script that reads in a webpage and uses Beautiful Soup to parse it. From the soup I extract all the links as my final goal is to print out the link.contents.
All of the text that I am parsing is ASCII. I know that Python treats strings as unicode, and I am sure this is very handy, just of no use in my wee script.
Every time I go to print out a variable that holds 'String' I get [u'String'] printed to the screen. Is there a simple way of getting this back into just ASCII, or should I write a regex to strip it?
You probably have a list containing one unicode string. The repr of this is [u'String'].
You can convert this to a list of byte strings using any variation of the following:
# Functional style.
print map(lambda x: x.encode('ascii'), my_list)
# List comprehension.
print [x.encode('ascii') for x in my_list]
# Interesting if my_list may be a tuple or a string.
print type(my_list)(x.encode('ascii') for x in my_list)
# What do I care about the brackets anyway?
print ', '.join(repr(x.encode('ascii')) for x in my_list)
# That's actually not a good way of doing it.
print ' '.join(repr(x).lstrip('u')[1:-1] for x in my_list)
[u'String'] is a text representation of a list that contains a Unicode string on Python 2.
If you run print(some_list) then it is equivalent to
print '[%s]' % ', '.join(map(repr, some_list))
i.e., to create a text representation of a Python object with the type list, the repr() function is called for each item.
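A small sketch of that equivalence, written so it runs on either Python 2 or Python 3 (some_list is just the sample list from this question; on Python 3 the item is shown without the u prefix, but the two strings still match):

```python
some_list = [u'String']

# The text that print(some_list) would show: the list's own representation.
list_text = str(some_list)

# The same text built by hand: repr() of every item, joined with ', '.
manual_text = '[%s]' % ', '.join(map(repr, some_list))

print(list_text)
print(manual_text)
print(list_text == manual_text)
```

The quotes (and, on Python 2, the u) come from repr() of each item, not from the string itself.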
Don't confuse a Python object and its text representation: repr('a') != 'a', and even the text representation of the text representation differs: repr(repr('a')) != repr('a').
repr(obj) returns a string that contains a printable representation of an object. Its purpose is to be an unambiguous representation of an object that can be useful for debugging, in a REPL. Often eval(repr(obj)) == obj.
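To make the difference between an object and its representation concrete, here is a minimal sketch. It uses ast.literal_eval as a safer stand-in for eval(), which is my substitution and only works for simple literals:

```python
import ast

text = 'a'
once = repr(text)    # the text representation of 'a', quotes included
twice = repr(once)   # the text representation of that representation

# Each repr() call wraps the previous result in another pair of quotes,
# so the three strings have different lengths.
print(len(text), len(once), len(twice))

# For a simple literal, evaluating the representation gives the object back.
print(ast.literal_eval(once) == text)
```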
To avoid calling repr(), you could print list items directly (if they are all Unicode strings), e.g. print ",".join(some_list), which prints a comma-separated list of the strings: String
Do not encode a Unicode string to bytes using a hardcoded character encoding; print Unicode directly instead. Otherwise, the code may fail because the encoding can't represent all the characters, e.g., if you try to use the 'ascii' encoding with non-ASCII characters. Or the code may silently produce mojibake (corrupted data is passed further down a pipeline) if the environment uses an encoding that is incompatible with the hardcoded one.
Maybe I don't understand, but why can't you just get the element's text and then convert it before using it? For instance (I don't know why you would do this, but...), find all label elements on the web page and iterate over them until you find one whose text is MyText:
avail = driver.find_elements_by_class_name("label")
for i in avail:
    if i.text == "MyText":
        # Convert the string from i.text and do whatever you wanted to do.
        break
Maybe I'm missing something in the original message? Or was this what you were looking for?
If accessing/printing single element lists (e.g., sequentially or filtered):
my_list = [u'String'] # sample element
my_list = [str(my_list[0])]
encode("latin-1") helped me in my case:
facultyname[0].encode("latin-1")
Do you really mean u'String'?
In any event, can't you just do str(string) to get a string rather than a unicode-string? (This should be different for Python 3, for which all strings are unicode.)
Pass the output to the str() function and it will convert the unicode output. Printing the output will also remove the u'' marker from it.
import json, ast
r = {u'name': u'A', u'primary_key': 1}
ast.literal_eval(json.dumps(r))
will print
{'name': 'A', 'primary_key': 1}
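One caveat with this round trip, as far as I can tell: JSON spells booleans and None as true/false/null, and ast.literal_eval does not accept those. A short sketch (the 'active' key is a made-up example, not from the question):

```python
import ast
import json

r = {u'name': u'A', u'primary_key': 1}
round_tripped = ast.literal_eval(json.dumps(r))

# json.dumps writes True as the bare word true, which is not a Python
# literal, so literal_eval rejects it with a ValueError.
try:
    ast.literal_eval(json.dumps({u'active': True}))
    bool_ok = True
except ValueError:
    bool_ok = False

print(round_tripped == r)
print(bool_ok)
```

So the trick is fine for dictionaries of plain strings and numbers, but probably not something to rely on for arbitrary data.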
Use dir or type on the 'string' to find out what it is. I suspect that it's one of BeautifulSoup's tag objects, which prints like a string but really isn't one. Otherwise, it's inside a list and you need to convert each string separately.
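A quick way to check which case you are in is to look at the type of the value and, if it is a list, the type of each item. A rough sketch (describe is a hypothetical helper I made up for this, and the sample values stand in for whatever link.contents really holds):

```python
def describe(value):
    """Return the type name of value and, for a list or tuple, of each item."""
    outer = type(value).__name__
    if isinstance(value, (list, tuple)):
        inner = [type(item).__name__ for item in value]
        return outer, inner
    return outer, []

# A list holding one string versus the bare string itself.
print(describe([u'String']))
print(describe(u'String'))
```

If the outer type comes back as list, the brackets and quotes you see are the list's representation, and you want to print the items rather than the list.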
In any case, why are you objecting to using Unicode? Any specific reason?
Source: Stackoverflow.com