I would like to lay claim to the ONE and only solution posted so far that actually works. :-)
Three classes of problems that have to be dealt with.
Non-transitive matching rules for lower and uppercase. The Turkish I problem has been mentioned frequently in other replies. According to comments in Android source for String.regionMatches, the Georgian comparison rules requires additional conversion to lower-case when comparing for case-insensitive equality.
Cases where upper- and lower-case forms have a different number of letters. Pretty much all of the solutions posted so far fail, in these cases. Example: German STRASSE vs. Straße have case-insensitive equality, but have different lengths.
Binding strengths of accented characters. Locale AND context effect whether accents match or not. In French, the uppercase form of 'é' is 'E', although there is a movement toward using uppercase accents . In Canadian French, the upper-case form of 'é' is 'É', without exception. Users in both countries would expect "e" to match "é" when searching. Whether accented and unaccented characters match is locale-specific. Now consider: does "E" equal "É"? Yes. It does. In French locales, anyway.
I am currently using android.icu.text.StringSearch
to correctly implement previous implementations of case-insensitive indexOf operations.
Non-Android users can access the same functionality through the ICU4J package, using the com.ibm.icu.text.StringSearch
class.
Be careful to reference classes in the correct icu package (android.icu.text
or com.ibm.icu.text
) as Android and the JRE both have classes with the same name in other namespaces (e.g. Collator).
this.collator = (RuleBasedCollator)Collator.getInstance(locale);
this.collator.setStrength(Collator.PRIMARY);
....
StringSearch search = new StringSearch(
pattern,
new StringCharacterIterator(targetText),
collator);
int index = search.first();
if (index != SearchString.DONE)
{
// remember that the match length may NOT equal the pattern length.
length = search.getMatchLength();
....
}
Test Cases (Locale, pattern, target text, expectedResult):
testMatch(Locale.US,"AbCde","aBcDe",true);
testMatch(Locale.US,"éèê","EEE",true);
testMatch(Locale.GERMAN,"STRASSE","Straße",true);
testMatch(Locale.FRENCH,"éèê","EEE",true);
testMatch(Locale.FRENCH,"EEE","éèê",true);
testMatch(Locale.FRENCH,"éèê","ÉÈÊ",true);
testMatch(new Locale("tr-TR"),"TITLE","title",true); // Turkish dotless I/i
testMatch(new Locale("tr-TR"),"TITLE","title",true); // Turkish dotted I/i
testMatch(new Locale("tr-TR"),"TITLE","title",false); // Dotless-I != dotted i.
PS: As best as I can determine, the PRIMARY binding strength should do the right thing when locale-specific rules differentiate between accented and non-accented characters according to dictionary rules; but I don't which locale to use to test this premise. Donated test cases would be gratefully appreciated.
--
Copyright notice: because StackOverflow's CC-BY_SA copyrights as applied to code-fragments are unworkable for professional developers, these fragments are dual licensed under more appropriate licenses here: https://pastebin.com/1YhFWmnU