[python] How do I do a case-insensitive string comparison?

How can I do case insensitive string comparison in Python?

I would like to encapsulate comparison of a regular strings to a repository string using in a very simple and Pythonic way. I also would like to have ability to look up values in a dict hashed by strings using regular python strings.

This question is related to python comparison case-insensitive

The answer is


This is another regex which I have learned to love/hate over the last week so usually import as (in this case yes) something that reflects how im feeling! make a normal function.... ask for input, then use ....something = re.compile(r'foo*|spam*', yes.I)...... re.I (yes.I below) is the same as IGNORECASE but you cant make as many mistakes writing it!

You then search your message using regex's but honestly that should be a few pages in its own , but the point is that foo or spam are piped together and case is ignored. Then if either are found then lost_n_found would display one of them. if neither then lost_n_found is equal to None. If its not equal to none return the user_input in lower case using "return lost_n_found.lower()"

This allows you to much more easily match up anything thats going to be case sensitive. Lastly (NCS) stands for "no one cares seriously...!" or not case sensitive....whichever

if anyone has any questions get me on this..

    import re as yes

    def bar_or_spam():

        message = raw_input("\nEnter FoO for BaR or SpaM for EgGs (NCS): ") 

        message_in_coconut = yes.compile(r'foo*|spam*',  yes.I)

        lost_n_found = message_in_coconut.search(message).group()

        if lost_n_found != None:
            return lost_n_found.lower()
        else:
            print ("Make tea not love")
            return

    whatz_for_breakfast = bar_or_spam()

    if whatz_for_breakfast == foo:
        print ("BaR")

    elif whatz_for_breakfast == spam:
        print ("EgGs")

I saw this solution here using regex.

import re
if re.search('mandy', 'Mandy Pande', re.IGNORECASE):
# is True

It works well with accents

In [42]: if re.search("ê","ê", re.IGNORECASE):
....:        print(1)
....:
1

However, it doesn't work with unicode characters case-insensitive. Thank you @Rhymoid for pointing out that as my understanding was that it needs the exact symbol, for the case to be true. The output is as follows:

In [36]: "ß".lower()
Out[36]: 'ß'
In [37]: "ß".upper()
Out[37]: 'SS'
In [38]: "ß".upper().lower()
Out[38]: 'ss'
In [39]: if re.search("ß","ßß", re.IGNORECASE):
....:        print(1)
....:
1
In [40]: if re.search("SS","ßß", re.IGNORECASE):
....:        print(1)
....:
In [41]: if re.search("ß","SS", re.IGNORECASE):
....:        print(1)
....:

Using Python 2, calling .lower() on each string or Unicode object...

string1.lower() == string2.lower()

...will work most of the time, but indeed doesn't work in the situations @tchrist has described.

Assume we have a file called unicode.txt containing the two strings S?s?f?? and S?S?F?S. With Python 2:

>>> utf8_bytes = open("unicode.txt", 'r').read()
>>> print repr(utf8_bytes)
'\xce\xa3\xce\xaf\xcf\x83\xcf\x85\xcf\x86\xce\xbf\xcf\x82\n\xce\xa3\xce\x8a\xce\xa3\xce\xa5\xce\xa6\xce\x9f\xce\xa3\n'
>>> u = utf8_bytes.decode('utf8')
>>> print u
S?s?f??
S?S?F?S

>>> first, second = u.splitlines()
>>> print first.lower()
s?s?f??
>>> print second.lower()
s?s?f?s
>>> first.lower() == second.lower()
False
>>> first.upper() == second.upper()
True

The S character has two lowercase forms, ? and s, and .lower() won't help compare them case-insensitively.

However, as of Python 3, all three forms will resolve to ?, and calling lower() on both strings will work correctly:

>>> s = open('unicode.txt', encoding='utf8').read()
>>> print(s)
S?s?f??
S?S?F?S

>>> first, second = s.splitlines()
>>> print(first.lower())
s?s?f??
>>> print(second.lower())
s?s?f??
>>> first.lower() == second.lower()
True
>>> first.upper() == second.upper()
True

So if you care about edge-cases like the three sigmas in Greek, use Python 3.

(For reference, Python 2.7.3 and Python 3.3.0b1 are shown in the interpreter printouts above.)


Comparing strings in a case insensitive way seems trivial, but it's not. I will be using Python 3, since Python 2 is underdeveloped here.

The first thing to note is that case-removing conversions in Unicode aren't trivial. There is text for which text.lower() != text.upper().lower(), such as "ß":

"ß".lower()
#>>> 'ß'

"ß".upper().lower()
#>>> 'ss'

But let's say you wanted to caselessly compare "BUSSE" and "Buße". Heck, you probably also want to compare "BUSSE" and "BU?E" equal - that's the newer capital form. The recommended way is to use casefold:

str.casefold()

Return a casefolded copy of the string. Casefolded strings may be used for caseless matching.

Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. [...]

Do not just use lower. If casefold is not available, doing .upper().lower() helps (but only somewhat).

Then you should consider accents. If your font renderer is good, you probably think "ê" == "e^" - but it doesn't:

"ê" == "e^"
#>>> False

This is because the accent on the latter is a combining character.

import unicodedata

[unicodedata.name(char) for char in "ê"]
#>>> ['LATIN SMALL LETTER E WITH CIRCUMFLEX']

[unicodedata.name(char) for char in "e^"]
#>>> ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']

The simplest way to deal with this is unicodedata.normalize. You probably want to use NFKD normalization, but feel free to check the documentation. Then one does

unicodedata.normalize("NFKD", "ê") == unicodedata.normalize("NFKD", "e^")
#>>> True

To finish up, here this is expressed in functions:

import unicodedata

def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())

def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)

The usual approach is to uppercase the strings or lower case them for the lookups and comparisons. For example:

>>> "hello".upper() == "HELLO".upper()
True
>>> 

The usual approach is to uppercase the strings or lower case them for the lookups and comparisons. For example:

>>> "hello".upper() == "HELLO".upper()
True
>>> 

How about converting to lowercase first? you can use string.lower().


def insenStringCompare(s1, s2):
    """ Method that takes two strings and returns True or False, based
        on if they are equal, regardless of case."""
    try:
        return s1.lower() == s2.lower()
    except AttributeError:
        print "Please only pass strings into this method."
        print "You passed a %s and %s" % (s1.__class__, s2.__class__)

The usual approach is to uppercase the strings or lower case them for the lookups and comparisons. For example:

>>> "hello".upper() == "HELLO".upper()
True
>>> 

def insenStringCompare(s1, s2):
    """ Method that takes two strings and returns True or False, based
        on if they are equal, regardless of case."""
    try:
        return s1.lower() == s2.lower()
    except AttributeError:
        print "Please only pass strings into this method."
        print "You passed a %s and %s" % (s1.__class__, s2.__class__)

Using Python 2, calling .lower() on each string or Unicode object...

string1.lower() == string2.lower()

...will work most of the time, but indeed doesn't work in the situations @tchrist has described.

Assume we have a file called unicode.txt containing the two strings S?s?f?? and S?S?F?S. With Python 2:

>>> utf8_bytes = open("unicode.txt", 'r').read()
>>> print repr(utf8_bytes)
'\xce\xa3\xce\xaf\xcf\x83\xcf\x85\xcf\x86\xce\xbf\xcf\x82\n\xce\xa3\xce\x8a\xce\xa3\xce\xa5\xce\xa6\xce\x9f\xce\xa3\n'
>>> u = utf8_bytes.decode('utf8')
>>> print u
S?s?f??
S?S?F?S

>>> first, second = u.splitlines()
>>> print first.lower()
s?s?f??
>>> print second.lower()
s?s?f?s
>>> first.lower() == second.lower()
False
>>> first.upper() == second.upper()
True

The S character has two lowercase forms, ? and s, and .lower() won't help compare them case-insensitively.

However, as of Python 3, all three forms will resolve to ?, and calling lower() on both strings will work correctly:

>>> s = open('unicode.txt', encoding='utf8').read()
>>> print(s)
S?s?f??
S?S?F?S

>>> first, second = s.splitlines()
>>> print(first.lower())
s?s?f??
>>> print(second.lower())
s?s?f??
>>> first.lower() == second.lower()
True
>>> first.upper() == second.upper()
True

So if you care about edge-cases like the three sigmas in Greek, use Python 3.

(For reference, Python 2.7.3 and Python 3.3.0b1 are shown in the interpreter printouts above.)


How about converting to lowercase first? you can use string.lower().


Section 3.13 of the Unicode standard defines algorithms for caseless matching.

X.casefold() == Y.casefold() in Python 3 implements the "default caseless matching" (D144).

Casefolding does not preserve the normalization of strings in all instances and therefore the normalization needs to be done ('å' vs. 'a°'). D145 introduces "canonical caseless matching":

import unicodedata

def NFD(text):
    return unicodedata.normalize('NFD', text)

def canonical_caseless(text):
    return NFD(NFD(text).casefold())

NFD() is called twice for very infrequent edge cases involving U+0345 character.

Example:

>>> 'å'.casefold() == 'a°'.casefold()
False
>>> canonical_caseless('å') == canonical_caseless('a°')
True

There are also compatibility caseless matching (D146) for cases such as '?' (U+3392) and "identifier caseless matching" to simplify and optimize caseless matching of identifiers.


The usual approach is to uppercase the strings or lower case them for the lookups and comparisons. For example:

>>> "hello".upper() == "HELLO".upper()
True
>>> 

This is another regex which I have learned to love/hate over the last week so usually import as (in this case yes) something that reflects how im feeling! make a normal function.... ask for input, then use ....something = re.compile(r'foo*|spam*', yes.I)...... re.I (yes.I below) is the same as IGNORECASE but you cant make as many mistakes writing it!

You then search your message using regex's but honestly that should be a few pages in its own , but the point is that foo or spam are piped together and case is ignored. Then if either are found then lost_n_found would display one of them. if neither then lost_n_found is equal to None. If its not equal to none return the user_input in lower case using "return lost_n_found.lower()"

This allows you to much more easily match up anything thats going to be case sensitive. Lastly (NCS) stands for "no one cares seriously...!" or not case sensitive....whichever

if anyone has any questions get me on this..

    import re as yes

    def bar_or_spam():

        message = raw_input("\nEnter FoO for BaR or SpaM for EgGs (NCS): ") 

        message_in_coconut = yes.compile(r'foo*|spam*',  yes.I)

        lost_n_found = message_in_coconut.search(message).group()

        if lost_n_found != None:
            return lost_n_found.lower()
        else:
            print ("Make tea not love")
            return

    whatz_for_breakfast = bar_or_spam()

    if whatz_for_breakfast == foo:
        print ("BaR")

    elif whatz_for_breakfast == spam:
        print ("EgGs")

Section 3.13 of the Unicode standard defines algorithms for caseless matching.

X.casefold() == Y.casefold() in Python 3 implements the "default caseless matching" (D144).

Casefolding does not preserve the normalization of strings in all instances and therefore the normalization needs to be done ('å' vs. 'a°'). D145 introduces "canonical caseless matching":

import unicodedata

def NFD(text):
    return unicodedata.normalize('NFD', text)

def canonical_caseless(text):
    return NFD(NFD(text).casefold())

NFD() is called twice for very infrequent edge cases involving U+0345 character.

Example:

>>> 'å'.casefold() == 'a°'.casefold()
False
>>> canonical_caseless('å') == canonical_caseless('a°')
True

There are also compatibility caseless matching (D146) for cases such as '?' (U+3392) and "identifier caseless matching" to simplify and optimize caseless matching of identifiers.


How about converting to lowercase first? you can use string.lower().


I saw this solution here using regex.

import re
if re.search('mandy', 'Mandy Pande', re.IGNORECASE):
# is True

It works well with accents

In [42]: if re.search("ê","ê", re.IGNORECASE):
....:        print(1)
....:
1

However, it doesn't work with unicode characters case-insensitive. Thank you @Rhymoid for pointing out that as my understanding was that it needs the exact symbol, for the case to be true. The output is as follows:

In [36]: "ß".lower()
Out[36]: 'ß'
In [37]: "ß".upper()
Out[37]: 'SS'
In [38]: "ß".upper().lower()
Out[38]: 'ss'
In [39]: if re.search("ß","ßß", re.IGNORECASE):
....:        print(1)
....:
1
In [40]: if re.search("SS","ßß", re.IGNORECASE):
....:        print(1)
....:
In [41]: if re.search("ß","SS", re.IGNORECASE):
....:        print(1)
....:

Comparing strings in a case insensitive way seems trivial, but it's not. I will be using Python 3, since Python 2 is underdeveloped here.

The first thing to note is that case-removing conversions in Unicode aren't trivial. There is text for which text.lower() != text.upper().lower(), such as "ß":

"ß".lower()
#>>> 'ß'

"ß".upper().lower()
#>>> 'ss'

But let's say you wanted to caselessly compare "BUSSE" and "Buße". Heck, you probably also want to compare "BUSSE" and "BU?E" equal - that's the newer capital form. The recommended way is to use casefold:

str.casefold()

Return a casefolded copy of the string. Casefolded strings may be used for caseless matching.

Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. [...]

Do not just use lower. If casefold is not available, doing .upper().lower() helps (but only somewhat).

Then you should consider accents. If your font renderer is good, you probably think "ê" == "e^" - but it doesn't:

"ê" == "e^"
#>>> False

This is because the accent on the latter is a combining character.

import unicodedata

[unicodedata.name(char) for char in "ê"]
#>>> ['LATIN SMALL LETTER E WITH CIRCUMFLEX']

[unicodedata.name(char) for char in "e^"]
#>>> ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']

The simplest way to deal with this is unicodedata.normalize. You probably want to use NFKD normalization, but feel free to check the documentation. Then one does

unicodedata.normalize("NFKD", "ê") == unicodedata.normalize("NFKD", "e^")
#>>> True

To finish up, here this is expressed in functions:

import unicodedata

def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())

def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)

def insenStringCompare(s1, s2):
    """ Method that takes two strings and returns True or False, based
        on if they are equal, regardless of case."""
    try:
        return s1.lower() == s2.lower()
    except AttributeError:
        print "Please only pass strings into this method."
        print "You passed a %s and %s" % (s1.__class__, s2.__class__)

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to comparison

Wildcard string comparison in Javascript How to compare two JSON objects with the same elements in a different order equal? Comparing strings, c++ Char Comparison in C bash string compare to multiple correct values Comparing two hashmaps for equal values and same key sets? Comparing boxed Long values 127 and 128 Compare two files report difference in python How do I fix this "TypeError: 'str' object is not callable" error? Compare cell contents against string in Excel

Examples related to case-insensitive

Are PostgreSQL column names case-sensitive? How to make String.Contains case insensitive? SQL- Ignore case while searching for a string String contains - ignore case How to compare character ignoring case in primitive types How to perform case-insensitive sorting in JavaScript? Contains case insensitive Case insensitive string as HashMap key Case insensitive searching in Oracle How to replace case-insensitive literal substrings in Java