[sql] Fastest way to remove non-numeric characters from a VARCHAR in SQL Server

I'm writing an import utility that is using phone numbers as a unique key within the import.

I need to check that the phone number does not already exist in my DB. The problem is that phone numbers in the DB could have things like dashes and parenthesis and possibly other things. I wrote a function to remove these things, the problem is that it is slow and with thousands of records in my DB and thousands of records to import at once, this process can be unacceptably slow. I've already made the phone number column an index.

I tried using the script from this post:
T-SQL trim &nbsp (and other non-alphanumeric characters)

But that didn't speed it up any.

Is there a faster way to remove non-numeric characters? Something that can perform well when 10,000 to 100,000 records have to be compared.

Whatever is done needs to perform fast.

Update
Given what people responded with, I think I'm going to have to clean the fields before I run the import utility.

To answer the question of what I'm writing the import utility in, it is a C# app. I'm comparing BIGINT to BIGINT now, with no need to alter DB data and I'm still taking a performance hit with a very small set of data (about 2000 records).

Could comparing BIGINT to BIGINT be slowing things down?

I've optimized the code side of my app as much as I can (removed regexes, removed unnecessary DB calls). Although I can't isolate SQL as the source of the problem anymore, I still feel like it is.

Tags: sql, sql-server, performance, optimization

Answers


I may be misunderstanding, but you've got two sets of data to strip the characters from: the current data in the database, and a new set each time you import.

For updating the existing records, I would just use SQL; that only has to happen once.
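For example, a one-off pass over the existing rows could be as simple as the sketch below (Contacts and phone are placeholder names; substitute your own table and column):

-- One-off cleanup sketch: strip the usual phone formatting characters.
-- Contacts / phone are placeholder names.
UPDATE Contacts
SET phone = REPLACE(REPLACE(REPLACE(REPLACE(phone, '(', ''), ')', ''), '-', ''), ' ', '')
WHERE phone LIKE '%[^0-9]%';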

However, SQL isn't optimized for this sort of operation. Since you said you are writing an import utility, I would do those updates in the context of the import utility itself, not in SQL. That would be much better performance-wise. What are you writing the utility in?

Also, I may be completely misunderstanding the process, so I apologize if I'm off base.

Edit:
For the initial update, if you are using SQL Server 2005, you could try a CLR function. Here's a quick one using regex. I'm not sure how the performance would compare; I've never used this myself except for a quick test just now.

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.Text.RegularExpressions;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlString StripNonNumeric(SqlString input)
    {
        // Pass NULL through rather than throwing on input.Value
        if (input.IsNull)
            return SqlString.Null;

        // Remove every non-digit character
        return Regex.Replace(input.Value, @"\D", "");
    }
}

After this is deployed, to update you could just use:

UPDATE table SET phoneNumber = dbo.StripNonNumeric(phoneNumber)
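For completeness, deploying it would look roughly like this (the assembly name and DLL path below are placeholders, and CLR integration must be enabled on the server with sp_configure 'clr enabled', 1):

-- Rough deployment sketch; assembly name and path are placeholders.
CREATE ASSEMBLY StringUtils FROM 'C:\assemblies\StringUtils.dll' WITH PERMISSION_SET = SAFE;
GO
CREATE FUNCTION dbo.StripNonNumeric(@input nvarchar(4000)) RETURNS nvarchar(4000)
AS EXTERNAL NAME StringUtils.UserDefinedFunctions.StripNonNumeric;
GO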

I know I'm late to the game, but here is a function I created for T-SQL that quickly removes non-numeric characters. Note that I have a "String" schema that I put utility functions for strings into...

CREATE FUNCTION String.ComparablePhone( @string nvarchar(32) ) RETURNS bigint AS
BEGIN

-- 1. table of unique characters to be kept
    DECLARE @keepers table ( chr nchar(1) not null primary key );
    INSERT INTO @keepers ( chr ) VALUES (N'0'),(N'1'),(N'2'),(N'3'),(N'4'),(N'5'),(N'6'),(N'7'),(N'8'),(N'9');

-- 2. Identify the characters in the string to remove
    WITH found ( id, position ) AS
    (
        SELECT 
            ROW_NUMBER() OVER (ORDER BY (n1+n10) DESC), -- since we are using stuff, for the position to continue to be accurate, start from the greatest position and work towards the smallest
            (n1+n10)
        FROM 
            (SELECT 0 AS n1 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) AS d1,
            (SELECT 0 AS n10 UNION SELECT 10 UNION SELECT 20 UNION SELECT 30) AS d10
        WHERE
            (n1+n10) BETWEEN 1 AND len(@string)
            AND substring(@string, (n1+n10), 1) NOT IN (SELECT chr FROM @keepers)
    )
-- 3. Use stuff to snuff out the identified characters
    SELECT 
        @string = stuff( @string, position, 1, '' )
    FROM 
        found
    ORDER BY
        id ASC; -- important to process the removals in order, see ROW_NUMBER() above

-- 4. Try and convert the results to a bigint   
    IF len(@string) = 0
        RETURN NULL; -- an empty string converts to 0

    RETURN convert(bigint,@string); 
END

Then to use it to compare for inserting, something like this:

INSERT INTO Contacts ( phone, first_name, last_name )
SELECT i.phone, i.first_name, i.last_name
FROM Imported AS i
LEFT JOIN Contacts AS c ON String.ComparablePhone(c.phone) = String.ComparablePhone(i.phone)
WHERE c.phone IS NULL -- Exclude those that already exist

Can you remove them in a nightly process, storing them in a separate field, then do an update on changed records right before you run the process?

Or on the insert/update, store the "numeric" format, to reference later. A trigger would be an easy way to do it.
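A rough sketch of that trigger idea (Contacts, phone, phone_clean, id, and dbo.StripNonNumeric are all placeholder names; swap in your own table, key column, and cleaning function):

-- Hypothetical trigger keeping a pre-cleaned column in sync on insert/update.
CREATE TRIGGER trg_Contacts_CleanPhone ON Contacts
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE c
    SET phone_clean = dbo.StripNonNumeric(i.phone)
    FROM Contacts AS c
    JOIN inserted AS i ON i.id = c.id;  -- id is a placeholder key
END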


create function dbo.RemoveNonNumericChar(@str varchar(500))
returns varchar(500)
begin
    declare @startingIndex int
    set @startingIndex = 0
    while 1 = 1
    begin
        set @startingIndex = patindex('%[^0-9]%', @str)
        if @startingIndex <> 0
        begin
            -- replace() removes every occurrence of the offending character at once
            set @str = replace(@str, substring(@str, @startingIndex, 1), '')
        end
        else
            break;
    end
    return @str
end

go  

select dbo.RemoveNonNumericChar('aisdfhoiqwei352345234@#$%^$@345345%^@#$^')  

I would try Scott's CLR function first but add a WHERE clause to reduce the number of records updated.

UPDATE table SET phoneNumber = dbo.StripNonNumeric(phoneNumber) 
WHERE phonenumber like '%[^0-9]%'

If you know that the great majority of your records have non-numeric characters it might not help though.


Simple function:

CREATE FUNCTION [dbo].[RemoveAlphaCharacters](@InputString VARCHAR(1000))
RETURNS VARCHAR(1000)
AS
BEGIN
  WHILE PATINDEX('%[^0-9]%',@InputString)>0
        SET @InputString = STUFF(@InputString,PATINDEX('%[^0-9]%',@InputString),1,'')     
  RETURN @InputString
END

GO
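A quick usage example (the result I would expect is shown in the comment):

-- Expected result: 555123456789
SELECT dbo.RemoveAlphaCharacters('(555) 123-4567 ext 89');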

Thousands of records against thousands of records is not normally a problem. I've used SSIS to import millions of records with de-duping like this.

I would clean up the database to remove the non-numeric characters in the first place and keep them out.


replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(string,'a',''),'b',''),'c',''),'d',''),'e',''),'f',''),'g',''),'h',''),'i',''),'j',''),'k',''),'l',''),'m',''),'n',''),'o',''),'p',''),'q',''),'r',''),'s',''),'t',''),'u',''),'v',''),'w',''),'x',''),'y',''),'z',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G',''),'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P',''),'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')*1 AS string,

:)


From a performance perspective I'd use an inline table-valued function; see below. Note that symbols like '+', '-', etc. will not be removed (ISNUMERIC treats them as numeric).

CREATE FUNCTION [dbo].[UDF_RemoveNumericStringsFromString]
 (
 @str varchar(100)
 )
 RETURNS TABLE AS RETURN
 WITH Tally (n) as 
  (
  -- 100 rows
   SELECT TOP (Len(@Str)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
   FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
   CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
  )

  SELECT OutStr =  STUFF(
       (SELECT SUBSTRING(@Str, n,1) st
        FROM Tally
        WHERE ISNUMERIC(SUBSTRING(@Str, n,1)) = 1
        FOR XML PATH(''),type).value('.', 'varchar(100)'),1,0,'')
  GO

  /*Use it*/
  SELECT OutStr
  FROM dbo.UDF_RemoveNumericStringsFromString('fjkfhk759734977fwe9794t23')
  /*Result set
   759734977979423 */

You can define it with more than 100 characters...
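Since it is an inline table-valued function, applying it to a whole table is usually done with CROSS APPLY; a sketch, with Imported and phone as placeholder names:

-- Set-based usage sketch of the inline TVF.
SELECT i.phone, f.OutStr AS cleaned_phone
FROM Imported AS i
CROSS APPLY dbo.UDF_RemoveNumericStringsFromString(i.phone) AS f;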


I would recommend enforcing a strict format for phone numbers in the database. I use the following format (assuming US phone numbers):

Database: 5555555555x555

Display: (555) 555-5555 ext 555

Input: 10 or more digits embedded in any string (a regex replace strips all non-numeric characters).
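If you go this route, the stored format can also be enforced with a CHECK constraint; a rough sketch, assuming a Contacts table with a phone column:

-- Rough sketch: 10 digits, optionally followed by 'x' and an all-digit extension.
-- Contacts / phone are placeholder names.
ALTER TABLE Contacts ADD CONSTRAINT CK_Contacts_PhoneFormat CHECK (
    phone LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
    OR (phone LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]x[0-9]%'
        AND phone NOT LIKE '%x%[^0-9]%')
);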


Working with varchars is fundamentally slower and less efficient than working with numerics, for obvious reasons. The functions you link to in the original post will indeed be quite slow, as they loop through each character in the string to determine whether or not it's a number. Do that for thousands of records and the process is bound to be slow. This is the perfect job for regular expressions, but they're not natively supported in SQL Server. You can add support using a CLR function; it's hard to say how fast that will be without trying it, but I would definitely expect it to be significantly faster than looping through each character of each phone number.

Once you get the phone numbers formatted in your database so that they're only numbers, you could switch to a numeric type in SQL which would yield lightning-fast comparisons against other numeric types. You might find that, depending on how fast your new data is coming in, doing the trimming and conversion to numeric on the database side is plenty fast enough once what you're comparing to is properly formatted, but if possible, you would be better off writing an import utility in a .NET language that would take care of these formatting issues before hitting the database.
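As a rough sketch of that idea (placeholder names again, reusing one of the cleaning functions shown earlier on this page; TRY_CONVERT needs SQL Server 2012 or later, use CONVERT on older versions):

-- Hypothetical: persist the cleaned number as BIGINT and index it for fast lookups.
ALTER TABLE Contacts ADD phone_numeric bigint NULL;

UPDATE Contacts
SET phone_numeric = TRY_CONVERT(bigint, dbo.RemoveAlphaCharacters(phone));

CREATE INDEX IX_Contacts_phone_numeric ON Contacts (phone_numeric);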

Either way though, you're going to have a big problem regarding optional formatting. Even if your numbers are guaranteed to be only North American in origin, some people will put the 1 in front of a fully area-code qualified phone number and others will not, which will cause the potential for multiple entries of the same phone number. Furthermore, depending on what your data represents, some people will be using their home phone number which might have several people living there, so a unique constraint on it would only allow one database member per household. Some would use their work number and have the same problem, and some would or wouldn't include the extension which would cause artificial uniqueness potential again.

All of that may or may not impact you, depending on your particular data and usages, but it's important to keep in mind!


"Although I can't isolate SQL as the source of the problem anymore, I still feel like it is."

Fire up SQL Profiler and take a look. Take the resulting queries and check their execution plans to make sure that index is being used.
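From SSMS, a quick way to sanity-check a single lookup is to turn on the IO and timing statistics and look at the actual execution plan (table and column names below are placeholders):

-- Quick sanity check: is the lookup doing an index seek, and how many reads?
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT *
FROM Contacts               -- placeholder table
WHERE phone = '5551234567'; -- placeholder column/value

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;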


In case you didn't want to create a function, or you needed just a single inline call in T-SQL, you could try:

set @Phone = REPLACE(REPLACE(REPLACE(REPLACE(@Phone,'(',''),' ',''),'-',''),')','')

Of course this is specific to removing phone number formatting, not a generic remove all special characters from string function.
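If all you need is the duplicate check, the same expression can be used inline on both sides of the comparison; a sketch with placeholder table names (Imported, Contacts):

-- Inline dedupe sketch using the same nested REPLACE on both sides.
SELECT i.*
FROM Imported AS i
WHERE NOT EXISTS (
    SELECT 1
    FROM Contacts AS c
    WHERE REPLACE(REPLACE(REPLACE(REPLACE(c.phone, '(', ''), ' ', ''), '-', ''), ')', '')
        = REPLACE(REPLACE(REPLACE(REPLACE(i.phone, '(', ''), ' ', ''), '-', ''), ')', '')
);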


Looking for a super simple solution (note that this assumes the stored value is formatted like (555) 555-5555):

SUBSTRING([Phone], CHARINDEX('(', [Phone], 1)+1, 3)
       + SUBSTRING([Phone], CHARINDEX(')', [Phone], 1)+1, 3)
       + SUBSTRING([Phone], CHARINDEX('-', [Phone], 1)+1, 4) AS Phone

I saw this solution with T-SQL code and PATINDEX. I like it :-)

CREATE Function [fnRemoveNonNumericCharacters](@strText VARCHAR(1000))
RETURNS VARCHAR(1000)
AS
BEGIN
    WHILE PATINDEX('%[^0-9]%', @strText) > 0
    BEGIN
        SET @strText = STUFF(@strText, PATINDEX('%[^0-9]%', @strText), 1, '')
    END
    RETURN @strText
END
