Comparing as lower or as upper case? (common enough issue)
Both functions below return 0 for strcicmpL("A", "a") and strcicmpU("A", "a").
Yet strcicmpL("A", "_") and strcicmpU("A", "_") can return results of different sign, as '_' often lies between the upper and lower case letters (in ASCII, 'Z' < '_' < 'a').
This affects the sort order when used with qsort(..., ..., ..., strcicmp). Non-standard library C functions like the commonly available stricmp() or strcasecmp() tend to be well defined and favor comparing via lowercase. Yet variations exist.
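int strcicmpL(char const *a, char const *b) {
  // Compare after folding both sides to lower case.
  while (*b) {
    int d = tolower(*a) - tolower(*b);
    if (d) {
      return d;
    }
    a++;
    b++;
  }
  return tolower(*a);  // b ended: 0 when a ended as well
}
int strcicmpU(char const *a, char const *b) {
  // Compare after folding both sides to upper case.
  while (*b) {
    int d = toupper(*a) - toupper(*b);
    if (d) {
      return d;
    }
    a++;
    b++;
  }
  return toupper(*a);  // b ended: 0 when a ended as well
}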
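To see the effect on sort order concretely, here is a minimal sketch, assuming an ASCII execution character set and the default "C" locale. The names cmp_folded, by_lower and by_upper are hypothetical, not part of any library. Bytes are read as unsigned char, for reasons covered in the next point.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper: compare two strings after folding each byte
   with fold(), which is expected to be tolower or toupper. */
int cmp_folded(const char *a, const char *b, int (*fold)(int)) {
    const unsigned char *ua = (const unsigned char *)a;
    const unsigned char *ub = (const unsigned char *)b;
    while (*ub) {
        int d = fold(*ua) - fold(*ub);
        if (d) {
            return d;
        }
        ua++;
        ub++;
    }
    return fold(*ua); /* b ended: 0 when a ended as well */
}

/* qsort() adapters for an array of `const char *`. */
int by_lower(const void *x, const void *y) {
    return cmp_folded(*(const char *const *)x, *(const char *const *)y, tolower);
}

int by_upper(const void *x, const void *y) {
    return cmp_folded(*(const char *const *)x, *(const char *const *)y, toupper);
}
```

Under those assumptions, sorting {"b", "_", "A"} with by_lower orders the strings as "_", "A", "b", while by_upper orders them as "A", "b", "_".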
char can have a negative value. (not rare)
toupper(int) and tolower(int) are specified for unsigned char values and the negative EOF. Further, strcmp() returns results as if each char was converted to unsigned char, regardless of whether char is signed or unsigned.
tolower(*a); // Potential UB
tolower((unsigned char) *a); // Correct (Almost - see following)
char can have a negative value and not be 2's complement. (rare)
The above does not handle -0 nor other negative values properly, as the bit pattern should be interpreted as unsigned char. To properly handle all integer encodings, change the pointer type first.
// tolower((unsigned char) *a);
tolower(*(const unsigned char *)a); // Correct
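Putting the pointer-type correction into a whole function, a minimal sketch might look like the following. The name strcicmp_uc is hypothetical, and the example assumes the default "C" locale.

```c
#include <ctype.h>

/* Hypothetical name: case-insensitive compare that reads every byte
   through an unsigned char pointer, so tolower() only ever receives
   values in the range 0..UCHAR_MAX. */
int strcicmp_uc(const char *a, const char *b) {
    const unsigned char *ua = (const unsigned char *)a;
    const unsigned char *ub = (const unsigned char *)b;
    for (;; ua++, ub++) {
        int d = tolower(*ua) - tolower(*ub);
        /* Stop on the first difference or when both strings end. */
        if (d != 0 || *ua == '\0') {
            return d;
        }
    }
}
```

In the "C" locale, strcicmp_uc("\xE4", "a") is positive, the same sign strcmp() gives for those bytes, whether char is signed or unsigned.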
Locale (less common)
Although character sets using ASCII code (0-127) are ubiquitous, the remaining codes tend to have locale-specific issues. So strcasecmp("\xE4", "a") might return 0 on one system and non-zero on another.
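When results must not depend on the current locale, one option is to fold only the ASCII letters and compare every other byte as-is. The sketch below assumes an ASCII-compatible character set in which 'A'..'Z' are contiguous. The names ascii_lower and ascii_strcicmp are hypothetical.

```c
/* Hypothetical helper: fold only 'A'..'Z'; all other bytes pass through.
   Assumes an ASCII-compatible execution character set. */
int ascii_lower(unsigned char c) {
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Hypothetical name: locale-independent, ASCII-only caseless compare. */
int ascii_strcicmp(const char *a, const char *b) {
    const unsigned char *ua = (const unsigned char *)a;
    const unsigned char *ub = (const unsigned char *)b;
    for (;; ua++, ub++) {
        int d = ascii_lower(*ua) - ascii_lower(*ub);
        /* Stop on the first difference or when both strings end. */
        if (d != 0 || *ua == '\0') {
            return d;
        }
    }
}
```

The trade-off is that bytes above 127 are never treated as letters, so "\xC4" and "\xE4" always compare as different, whatever the locale.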
Unicode (the way of the future)
If a solution needs to handle more than ASCII, consider a unicode_strcicmp(). As the C library does not provide such a function, a pre-coded function from some alternate library is recommended. Writing your own unicode_strcicmp() is a daunting task.
Do all letters map one lower to one upper? (pedantic)
[A-Z] maps one-to-one with [a-z], yet various locales map various lower case characters to one upper and vice versa. Further, some uppercase characters may lack a lower case equivalent and, again, vice versa.
This obliges code to convert through both toupper() and tolower().
int d = tolower(toupper(*a)) - tolower(toupper(*b));
Again, sorting results can potentially differ depending on whether code uses tolower(toupper(*a)) or toupper(tolower(*a)).
Portability
@B. Nadolson recommends avoiding rolling your own strcicmp(), and this is reasonable, except when code needs highly portable, equivalent functionality.
Below is an approach that even performed faster than some system-provided functions. It does a single compare per loop rather than two by using 2 different tables that differ only in their entry for '\0'. Your results may vary.
#include <limits.h>  // UCHAR_MAX

static unsigned char low1[UCHAR_MAX + 1] = {
  0, 1, 2, 3, ...
  '@', 'a', 'b', 'c', ... 'z', '[', ...  // @ABC... Z[...
  '`', 'a', 'b', 'c', ... 'z', '{', ...  // `abc... z{...
};
static unsigned char low2[UCHAR_MAX + 1] = {
// v--- Not zero, but A which matches none in `low1[]`
  'A', 1, 2, 3, ...
  '@', 'a', 'b', 'c', ... 'z', '[', ...
  '`', 'a', 'b', 'c', ... 'z', '{', ...
};
int strcicmp_ch(char const *a, char const *b) {
  // compare using tables that differ slightly.
  while (low1[*(const unsigned char *)a] == low2[*(const unsigned char *)b]) {
    a++;
    b++;
  }
  // Either strings differ or null character detected.
  // Perform subtraction using same table.
  return (low1[*(const unsigned char *)a] - low1[*(const unsigned char *)b]);
}
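The tables above are abbreviated. As an alternative to writing them out by hand, here is a minimal sketch that fills two such tables at run time from tolower(). This is an assumption about how one might build them, not the original tables, and the names fold1, fold2, ci_tables_init and strcicmp_tbl are hypothetical. It relies on tolower() never returning 'A' and never mapping a non-zero byte to 0, which holds in the "C" locale.

```c
#include <ctype.h>
#include <limits.h>

static unsigned char fold1[UCHAR_MAX + 1];
static unsigned char fold2[UCHAR_MAX + 1];

/* Must be called once before strcicmp_tbl(). */
void ci_tables_init(void) {
    for (int i = 0; i <= UCHAR_MAX; i++) {
        fold1[i] = (unsigned char)tolower(i);
        fold2[i] = fold1[i];
    }
    /* 'A' never appears in fold1[], so the loop below always stops
       when b reaches its null character. */
    fold2[0] = 'A';
}

int strcicmp_tbl(const char *a, const char *b) {
    const unsigned char *ua = (const unsigned char *)a;
    const unsigned char *ub = (const unsigned char *)b;
    /* One comparison per step: fold1[0] is 0 and fold2[] holds no 0,
       so the loop also stops when a reaches its null character. */
    while (fold1[*ua] == fold2[*ub]) {
        ua++;
        ub++;
    }
    /* Subtract using the same table for both sides. */
    return fold1[*ua] - fold1[*ub];
}
```

After ci_tables_init(), strcicmp_tbl("Hello", "hELLO") returns 0, and a string that is a proper prefix of another compares as less.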