Find non-ASCII characters in varchar columns using SQL Server

Question

How can rows with non-ASCII characters be returned using SQL Server? If you can show how to do it for one column, that would be great.

I am doing something like this now, but it is not working:

    select *
    from Staging.APARMRE1 as ar
    where ar.Line like '%[^!-~ ]%'

For extra credit, if it can span all varchar columns in a table, that would be outstanding! In this solution, it would be nice to return three columns:

- The identity field for that record. (This will allow the whole record to be reviewed with another query.)
- The column name
- The text with the invalid character

    Id   FieldName   InvalidText
    ---- ----------- -------------------
    25   LastName    Solís
    56   FirstName   François
    100  Address1    123 Ümlaut street

Invalid characters would be any outside the range of SPACE (32₁₀) through ~ (127₁₀).

User · Answer

I ve been running this bit of code with success  declare  UnicodeData table        data nvarchar 500    insert into       UnicodeData values       N Horse          N Dog         N Cat    select     data from      UnicodeData  where     data collate LATIN1 GENERAL BIN    cast data as varchar max     Which works well for known columns   For extra credit  I wrote this quick script to search all nvarchar columns in a given table for Unicode characters   declare       sql    varchar max                table sysname            mytable  -- enter your table here   with ColumnData as       select         RowId                 row number   over  order by c COLUMN NAME           c COLUMN NAME          ColumnName                 c COLUMN NAME                TableName                  c TABLE SCHEMA           c TABLE NAME            from         INFORMATION SCHEMA COLUMNS c     where         c DATA TYPE            nvarchar          and c TABLE NAME       table   select      sql    sql    select FieldName         c ColumnName                InvalidCharacter        c COLUMN NAME       from     c TableName     where     c ColumnName     collate LATIN1 GENERAL BIN    cast     c ColumnName     as varchar max         case when c RowId  lt  gt   select max RowId  from ColumnData  then    union all   else    end   char 13  from     ColumnData c  -- check -- print  sql exec   sql    I m not a fan of dynamic SQL but it does have its uses for exploratory queries like this
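The script above targets nvarchar columns, and its cast comparison cannot flag anything in a column that is already varchar. Below is a minimal sketch of the same dynamic-SQL idea for varchar and char columns, swapping the cast comparison for a PATINDEX test under a binary collation. The table name mytable is a placeholder, and treating everything outside space through ~ (code points 32 to 126) as invalid is an assumption to adjust to your own definition.

```sql
-- Sketch only: builds one SELECT per varchar/char column and runs them as a UNION ALL.
DECLARE @table sysname       = N'mytable',  -- placeholder: enter your table here
        @sql   nvarchar(max);               -- stays NULL if no column qualifies

SELECT @sql = COALESCE(@sql + N' UNION ALL' + NCHAR(13), N'')
    + N'SELECT FieldName = ' + QUOTENAME(c.COLUMN_NAME, '''')
    + N', InvalidText = CAST(' + QUOTENAME(c.COLUMN_NAME) + N' AS varchar(max))'
    + N' FROM ' + QUOTENAME(c.TABLE_SCHEMA) + N'.' + QUOTENAME(c.TABLE_NAME)
    + N' WHERE PATINDEX(''%[^ -~]%'' COLLATE Latin1_General_BIN, '
    + QUOTENAME(c.COLUMN_NAME) + N') > 0'
FROM INFORMATION_SCHEMA.COLUMNS AS c
WHERE c.TABLE_NAME = @table
  AND c.DATA_TYPE IN ('varchar', 'char');

-- PRINT @sql;  -- uncomment to inspect the generated statement before running it
IF @sql IS NOT NULL
    EXEC sys.sp_executesql @sql;
```

QUOTENAME is used instead of hand-written brackets so that unusual column names are escaped. If the columns use different collations, the UNION ALL may need an explicit COLLATE on InvalidText.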

User · Answer

There is a user-defined function available on the web, "Parse Alphanumeric". Google "UDF parse alphanumeric" and you should find the code for it. This user-defined function removes all characters that don't fit between 0-9, a-z, and A-Z.

    Select *
    from Staging.APARMRE1 ar
    where dbo.udf_parsealpha(ar.last_name) <> ar.last_name

That should bring back any records that have a last_name with invalid chars for you. Your bonus points question is a bit more of a challenge, but I think a case statement could handle it. This is a bit of pseudocode; I'm not entirely sure if it'd work.

    Select id,
        case when dbo.udf_parsealpha(ar.last_name) <> ar.last_name then 'last name'
             when dbo.udf_parsealpha(ar.first_name) <> ar.first_name then 'first name'
             when dbo.udf_parsealpha(ar.Address1) <> ar.Address1 then 'Address1'
        end,
        case when dbo.udf_parsealpha(ar.last_name) <> ar.last_name then ar.last_name
             when dbo.udf_parsealpha(ar.first_name) <> ar.first_name then ar.first_name
             when dbo.udf_parsealpha(ar.Address1) <> ar.Address1 then ar.Address1
        end
    from Staging.APARMRE1 ar
    where dbo.udf_parsealpha(ar.last_name) <> ar.last_name
       or dbo.udf_parsealpha(ar.first_name) <> ar.first_name
       or dbo.udf_parsealpha(ar.Address1) <> ar.Address1

I wrote this in the forum post box, so I'm not quite sure if that'll function as is, but it should be close. I'm not quite sure how it will behave if a single record has two fields with invalid chars either.

As an alternative, you should be able to change the from clause away from a single table and into a subquery that looks something like:

    select id, fieldname, value
    from (
        Select id, 'last_name' as fieldname, last_name as value
        from Staging.APARMRE1 ar
        Union
        Select id, 'first_name' as fieldname, first_name as value
        from Staging.APARMRE1 ar
        --- (and repeat unions for each field)
    ) t
    where dbo.udf_parsealpha(value) <> value

The benefit here is that for every column you only need to extend the union statement, while in the case statement version of this script you need to write the comparison three times for every column.

User · Answer

Here is a UDF I built to detect columns with extended ASCII characters. It is quick, and you can extend the character set you want to check. The second parameter allows you to switch between checking anything outside the standard character set or allowing an extended set.

    create function [dbo].[udf_ContainsNonASCIIChars]
    (
        @string nvarchar(4000),
        @checkExtendedCharset bit
    )
    returns bit
    as
    begin

        declare @pos int = 1;      -- substring() positions are 1-based
        declare @char varchar(1);
        declare @return bit = 0;

        while @pos <= len(@string)
        begin
            select @char = substring(@string, @pos, 1)
            if ascii(@char) < 32 or ascii(@char) > 126
                begin
                    if @checkExtendedCharset = 1
                        begin
                            if ascii(@char) not in (9,124,130,138,142,146,150,154,158,160,170,176,180,181,183,184,185,186,192,193,194,195,196,197,199,200,201,202,203,204,205,206,207,209,210,211,212,213,214,216,217,218,219,220,221,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,248,249,250,251,252,253,254,255)
                                begin
                                    -- not in the allowed extended set: flag it and leave the loop
                                    select @return = 1;
                                    select @pos = (len(@string) + 1)
                                end
                            else
                                begin
                                    select @pos = @pos + 1
                                end
                        end
                    else
                        begin
                            -- strict mode: flag it and leave the loop
                            select @return = 1;
                            select @pos = (len(@string) + 1)
                        end
                end
            else
                begin
                    select @pos = @pos + 1
                end
        end

        return @return;

    end

USAGE:

    select Address1
    from PropertyFile_English
    where dbo.udf_ContainsNonASCIIChars(Address1, 1) = 1

User · Answer

Running the various solutions on some real-world data (12M rows, varchar length ~30, around 9k dodgy rows, no full-text index in play), the patindex solution is the fastest, and it also selects the most rows.

(Pre-ran km to set the cache to a known state, ran the 3 processes, and finally ran km again; the last 2 runs of km gave times within 2 seconds.)

patindex solution by Gerhard Weiss -- Runtime 0:38, returns 9144 rows

    select dodgyColumn from myTable fcc
    WHERE patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, dodgyColumn) > 0

the substring-numbers solution by MT. -- Runtime 1:16, returned 8996 rows

    select dodgyColumn from myTable fcc
    INNER JOIN dbo.Numbers32k dn ON dn.number < (len(fcc.dodgyColumn))
    WHERE ASCII(SUBSTRING(fcc.dodgyColumn, dn.Number, 1)) < 32
        OR ASCII(SUBSTRING(fcc.dodgyColumn, dn.Number, 1)) > 127

udf solution by Deon Robertson -- Runtime 3:47, returns 7316 rows

    select dodgyColumn
    from myTable
    where dbo.udf_test_ContainsNonASCIIChars(dodgyColumn, 1) = 1

User · Answer

This script searches for non-ASCII characters in one column. It generates a string of all valid characters, here code points 32 to 127. Then it searches for rows that don't match the list:

    declare @str varchar(128)
    declare @i int
    set @str = ''
    set @i = 32
    while @i <= 127
        begin
        set @str = @str + '|' + char(@i)
        set @i = @i + 1
        end

    select  col1
    from    YourTable
    where   col1 like '%[^' + @str + ']%' escape '|'

User · Answer

To find which field has invalid characters:

    SELECT * FROM Staging.APARMRE1 FOR XML AUTO, TYPE

You can test it with this query:

    SELECT top 1 'char 31: ' + char(31) + ' (hex 0x1F)' field
    from sysobjects
    FOR XML AUTO, TYPE

The result will be:

    Msg 6841, Level 16, State 1, Line 3
    FOR XML could not serialize the data for node 'field' because it
    contains a character (0x001F) which is not allowed in XML. To retrieve
    this data using FOR XML, convert it to binary, varbinary or image data
    type and use the BINARY BASE64 directive.

This is very useful when you write XML files and get invalid-character errors when validating them.

User · Answer

Here is a solution for the single-column search using PATINDEX. It also displays the StartPosition, InvalidCharacter and ASCII code.

    select line,
      patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, Line) as [Position],
      substring(line, patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, Line), 1) as [InvalidCharacter],
      ascii(substring(line, patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, Line), 1)) as [ASCIICode]
    from staging.APARMRE1
    where patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, Line) > 0
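The COLLATE clause is what makes the character range dependable: under a non-binary collation, a range such as !-~ follows the collation's sort order rather than code point order, which is a likely reason the LIKE in the question did not behave as expected. As a minimal sketch, assuming the table and column names from the question, the same idea applied directly to a LIKE predicate might look like this:

```sql
-- Sketch: the question's LIKE predicate with a binary collation applied, so the
-- range from space to '~' is compared by code point (32 through 126).
SELECT ar.*
FROM Staging.APARMRE1 AS ar
WHERE ar.Line COLLATE Latin1_General_BIN LIKE '%[^ -~]%';
```

This pattern also flags tabs, carriage returns, line feeds and DEL (127), since they fall outside 32 through 126; widen the bracket expression if those should count as valid.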

User · Answer

Try something like this:

    DECLARE @YourTable table (PK int, col1 varchar(20), col2 varchar(20), col3 varchar(20))
    INSERT @YourTable VALUES (1, 'ok','ok','ok')
    INSERT @YourTable VALUES (2, 'BA'+char(182)+'D','ok','ok')
    INSERT @YourTable VALUES (3, 'ok',char(182)+'BAD','ok')
    INSERT @YourTable VALUES (4, 'ok','ok','B'+char(182)+'AD')
    INSERT @YourTable VALUES (5, char(182)+'BAD','ok',char(182)+'BAD')
    INSERT @YourTable VALUES (6, 'BAD'+char(182),'B'+char(182)+'AD','BAD'+char(182)+char(182)+char(182))

    --if you have a Numbers table use that, otherwise make one using a CTE
    ;WITH AllNumbers AS
    (   SELECT 1 AS Number
        UNION ALL
        SELECT Number+1
            FROM AllNumbers
            WHERE Number<1000
    )
    SELECT
        pk, 'Col1' BadValueColumn, CONVERT(varchar(20),col1) AS BadValue --make the XYZ in convert(varchar(XYZ), ...) the largest value of col1, col2, col3
        FROM @YourTable           y
            INNER JOIN AllNumbers n ON n.Number <= LEN(y.col1)
        WHERE ASCII(SUBSTRING(y.col1, n.Number, 1))<32 OR ASCII(SUBSTRING(y.col1, n.Number, 1))>127
    UNION
    SELECT
        pk, 'Col2' BadValueColumn, CONVERT(varchar(20),col2) AS BadValue --make the XYZ in convert(varchar(XYZ), ...) the largest value of col1, col2, col3
        FROM @YourTable           y
            INNER JOIN AllNumbers n ON n.Number <= LEN(y.col2)
        WHERE ASCII(SUBSTRING(y.col2, n.Number, 1))<32 OR ASCII(SUBSTRING(y.col2, n.Number, 1))>127
    UNION
    SELECT
        pk, 'Col3' BadValueColumn, CONVERT(varchar(20),col3) AS BadValue --make the XYZ in convert(varchar(XYZ), ...) the largest value of col1, col2, col3
        FROM @YourTable           y
            INNER JOIN AllNumbers n ON n.Number <= LEN(y.col3)
        WHERE ASCII(SUBSTRING(y.col3, n.Number, 1))<32 OR ASCII(SUBSTRING(y.col3, n.Number, 1))>127
    order by 1
    OPTION (MAXRECURSION 1000)

OUTPUT:

    pk          BadValueColumn BadValue
    ----------- -------------- --------------------
    2           Col1           BA¶D
    3           Col2           ¶BAD
    4           Col3           B¶AD
    5           Col1           ¶BAD
    5           Col3           ¶BAD
    6           Col1           BAD¶
    6           Col2           B¶AD
    6           Col3           BAD¶¶¶

    (8 row(s) affected)
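The UNION branches above repeat the same predicate once per column. A more compact way to get the same three-column shape is to unpivot the columns with CROSS APPLY and test each value once. The sketch below is not the numbers-table technique: it swaps in a PATINDEX test under a binary collation, and it assumes SQL Server 2008 or later for the VALUES row constructor. The table variable reuses a few of the sample rows so the block runs on its own.

```sql
-- Sample data: a subset of the rows from the example above.
DECLARE @YourTable table (PK int, col1 varchar(20), col2 varchar(20), col3 varchar(20));
INSERT @YourTable VALUES (1, 'ok', 'ok', 'ok');
INSERT @YourTable VALUES (2, 'BA' + CHAR(182) + 'D', 'ok', 'ok');
INSERT @YourTable VALUES (3, 'ok', CHAR(182) + 'BAD', 'ok');

-- Unpivot the three columns into (BadValueColumn, BadValue) pairs, then keep
-- only the values containing a character outside code points 32 through 126.
SELECT y.PK, v.BadValueColumn, v.BadValue
FROM @YourTable AS y
CROSS APPLY (VALUES
    ('Col1', y.col1),
    ('Col2', y.col2),
    ('Col3', y.col3)
) AS v (BadValueColumn, BadValue)
WHERE PATINDEX('%[^ -~]%' COLLATE Latin1_General_BIN, v.BadValue) > 0
ORDER BY y.PK, v.BadValueColumn;
```

Each offending column yields exactly one row, so no UNION is needed to remove the duplicates that a per-character join produces. Unlike the version above, this pattern treats DEL (127) as invalid.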
