I have faced the same problem with a character that I never managed to match with a where query - CHARINDEX, LIKE, REPLACE
, etc. did not work. Then I have used a brute force solution which is awful, heavy but works:
Step 1: make a copy of the complete data set - keep track of the original names with an source_id referencing the pk of the source table (and keep this source id in all the subsequent tables).
Step 2: LTRIM RTRIM
the data, and replace all double spaces, tab, etc (basically all the CHAR(1) to CHAR(32) by one space. Lowercase the whole set as well.
Step 3: replace all the special characters that you know (get the list of all the quotes, double quotes, etc.) by something from a-z (I suggest z). Basically replace everything that is not standard English characters by a z (using nested REPLACE of REPLACE in a loop).
Step 4: split by word into a second copy, where each word is in a separate row - the split is a SUBSTRING
based on the position of the space characters - at this point, we should miss the ones where there's a hidden space that we did not catche earlier.
Step 5: split each word into a third copy, where each letter is in a separate row (I know it makes a very large table) - keep track of the charindex of each letter in a separate column.
Step 6: Select everything in the above table which is not LIKE [a-z]. This is the list of the unidentified characters we want to exclude.
From the output of step 6 we have enough data to make a series of substring of the source to select everything but the unknown character we want to exclude.
Note 1: there are smart ways to optimize this, depending on the size of the original expression (steps 4, 5 and 6 can be made in one go).
Note 2: this is not very fast, but the fastest way to get this done for a large data set, because the split of lines into words and words into letters is made by substring, which slices all the table into one character slices. However, this is quite heavy to build. With a smaller set, it may be enough to parse each record one by one and search for character which is not in a list of all English characters plus all special characters.