[sql] Fastest way to remove non-numeric characters from a VARCHAR in SQL Server

I'm writing an import utility that is using phone numbers as a unique key within the import.

I need to check that the phone number does not already exist in my DB. The problem is that phone numbers in the DB could have things like dashes and parenthesis and possibly other things. I wrote a function to remove these things, the problem is that it is slow and with thousands of records in my DB and thousands of records to import at once, this process can be unacceptably slow. I've already made the phone number column an index.

I tried using the script from this post:
T-SQL trim &nbsp (and other non-alphanumeric characters)

But that didn't speed it up any.

Is there a faster way to remove non-numeric characters? Something that can perform well when 10,000 to 100,000 records have to be compared.

Whatever is done needs to perform fast.

Update
Given what people responded with, I think I'm going to have to clean the fields before I run the import utility.

To answer the question of what I'm writing the import utility in, it is a C# app. I'm comparing BIGINT to BIGINT now, with no need to alter DB data and I'm still taking a performance hit with a very small set of data (about 2000 records).

Could comparing BIGINT to BIGINT be slowing things down?

I've optimized the code side of my app as much as I can (removed regexes, removed unnecessary DB calls). Although I can't isolate SQL as the source of the problem anymore, I still feel like it is.

Tags: sql, sql-server, performance, optimization

Answers


I may be misunderstanding, but you've got two sets of data to strip the characters from: the current data in the database, and a new set each time you import.

For updating the existing records, I would just use SQL; that only has to happen once.
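For example, a one-off pass over the existing rows could be as simple as the sketch below (Contacts and phone are placeholder names; substitute your own table and column):

-- One-off cleanup sketch: strip the usual phone formatting characters.
-- Contacts / phone are placeholder names.
UPDATE Contacts
SET phone = REPLACE(REPLACE(REPLACE(REPLACE(phone, '(', ''), ')', ''), '-', ''), ' ', '')
WHERE phone LIKE '%[^0-9]%';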

However, SQL isn't optimized for this sort of operation. Since you said you are writing an import utility, I would do those updates in the context of the import utility itself, not in SQL. That would be much better performance-wise. What are you writing the utility in?

Also, I may be completely misunderstanding the process, so I apologize if I'm off base.

Edit:
For the initial update, if you are using SQL Server 2005, you could try a CLR function. Here's a quick one using regex. I'm not sure how the performance would compare; I've never used this myself except for a quick test just now.

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.Text.RegularExpressions;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlString StripNonNumeric(SqlString input)
    {
        // Pass NULL through rather than throwing on input.Value
        if (input.IsNull)
            return SqlString.Null;

        // Remove every non-digit character
        return Regex.Replace(input.Value, @"\D", "");
    }
}

After this is deployed, to update you could just use:

UPDATE table SET phoneNumber = dbo.StripNonNumeric(phoneNumber)
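For completeness, deploying it would look roughly like this (the assembly name and DLL path below are placeholders, and CLR integration must be enabled on the server with sp_configure 'clr enabled', 1):

-- Rough deployment sketch; assembly name and path are placeholders.
CREATE ASSEMBLY StringUtils FROM 'C:\assemblies\StringUtils.dll' WITH PERMISSION_SET = SAFE;
GO
CREATE FUNCTION dbo.StripNonNumeric(@input nvarchar(4000)) RETURNS nvarchar(4000)
AS EXTERNAL NAME StringUtils.UserDefinedFunctions.StripNonNumeric;
GO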

I know I'm late to the game, but here is a function I created for T-SQL that quickly removes non-numeric characters. Note that I have a "String" schema that I put utility functions for strings into...

CREATE FUNCTION String.ComparablePhone( @string nvarchar(32) ) RETURNS bigint AS
BEGIN

-- 1. table of unique characters to be kept
    DECLARE @keepers table ( chr nchar(1) not null primary key );
    INSERT INTO @keepers ( chr ) VALUES (N'0'),(N'1'),(N'2'),(N'3'),(N'4'),(N'5'),(N'6'),(N'7'),(N'8'),(N'9');

-- 2. Identify the characters in the string to remove
    WITH found ( id, position ) AS
    (
        SELECT 
            ROW_NUMBER() OVER (ORDER BY (n1+n10) DESC), -- since we are using stuff, for the position to continue to be accurate, start from the greatest position and work towards the smallest
            (n1+n10)
        FROM 
            (SELECT 0 AS n1 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) AS d1,
            (SELECT 0 AS n10 UNION SELECT 10 UNION SELECT 20 UNION SELECT 30) AS d10
        WHERE
            (n1+n10) BETWEEN 1 AND len(@string)
            AND substring(@string, (n1+n10), 1) NOT IN (SELECT chr FROM @keepers)
    )
-- 3. Use stuff to snuff out the identified characters
    SELECT 
        @string = stuff( @string, position, 1, '' )
    FROM 
        found
    ORDER BY
        id ASC; -- important to process the removals in order, see ROW_NUMBER() above

-- 4. Try and convert the results to a bigint   
    IF len(@string) = 0
        RETURN NULL; -- an empty string converts to 0

    RETURN convert(bigint,@string); 
END

Then to use it to compare for inserting, something like this:

INSERT INTO Contacts ( phone, first_name, last_name )
SELECT i.phone, i.first_name, i.last_name
FROM Imported AS i
LEFT JOIN Contacts AS c ON String.ComparablePhone(c.phone) = String.ComparablePhone(i.phone)
WHERE c.phone IS NULL -- Exclude those that already exist

Can you remove them in a nightly process, storing them in a separate field, then do an update on changed records right before you run the process?

Or on the insert/update, store the "numeric" format, to reference later. A trigger would be an easy way to do it.
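A rough sketch of that trigger idea (Contacts, phone, phone_clean, id, and dbo.StripNonNumeric are all placeholder names; swap in your own table, key column, and cleaning function):

-- Hypothetical trigger keeping a pre-cleaned column in sync on insert/update.
CREATE TRIGGER trg_Contacts_CleanPhone ON Contacts
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE c
    SET phone_clean = dbo.StripNonNumeric(i.phone)
    FROM Contacts AS c
    JOIN inserted AS i ON i.id = c.id;  -- id is a placeholder key
END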


create function dbo.RemoveNonNumericChar(@str varchar(500))
returns varchar(500)
begin
    declare @startingIndex int
    set @startingIndex = 0
    while 1 = 1
    begin
        set @startingIndex = patindex('%[^0-9]%', @str)
        if @startingIndex <> 0
        begin
            -- replace() removes every occurrence of the offending character at once
            set @str = replace(@str, substring(@str, @startingIndex, 1), '')
        end
        else
            break;
    end
    return @str
end

go  

select dbo.RemoveNonNumericChar('aisdfhoiqwei352345234@#$%^$@345345%^@#$^')  

I would try Scott's CLR function first but add a WHERE clause to reduce the number of records updated.

UPDATE table SET phoneNumber = dbo.StripNonNumeric(phoneNumber) 
WHERE phonenumber like '%[^0-9]%'

If you know that the great majority of your records have non-numeric characters it might not help though.


Simple function:

CREATE FUNCTION [dbo].[RemoveAlphaCharacters](@InputString VARCHAR(1000))
RETURNS VARCHAR(1000)
AS
BEGIN
  WHILE PATINDEX('%[^0-9]%',@InputString)>0
        SET @InputString = STUFF(@InputString,PATINDEX('%[^0-9]%',@InputString),1,'')     
  RETURN @InputString
END

GO
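A quick usage example (the result I would expect is shown in the comment):

-- Expected result: 555123456789
SELECT dbo.RemoveAlphaCharacters('(555) 123-4567 ext 89');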

Thousands of records against thousands of records is not normally a problem. I've used SSIS to import millions of records with de-duping like this.

I would clean up the database to remove the non-numeric characters in the first place and keep them out.


replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(string,'a',''),'b',''),'c',''),'d',''),'e',''),'f',''),'g',''),'h',''),'i',''),'j',''),'k',''),'l',''),'m',''),'n',''),'o',''),'p',''),'q',''),'r',''),'s',''),'t',''),'u',''),'v',''),'w',''),'x',''),'y',''),'z',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G',''),'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P',''),'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')*1 AS string,

:)


From a performance perspective I'd use an inline table-valued function; see below. Note that symbols like '+', '-', etc. will not be removed (ISNUMERIC treats them as numeric).

CREATE FUNCTION [dbo].[UDF_RemoveNumericStringsFromString]
 (
 @str varchar(100)
 )
 RETURNS TABLE AS RETURN
 WITH Tally (n) as 
  (
  -- 100 rows
   SELECT TOP (Len(@Str)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
   FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
   CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
  )

  SELECT OutStr =  STUFF(
       (SELECT SUBSTRING(@Str, n,1) st
        FROM Tally
        WHERE ISNUMERIC(SUBSTRING(@Str, n,1)) = 1
        FOR XML PATH(''),type).value('.', 'varchar(100)'),1,0,'')
  GO

  /*Use it*/
  SELECT OutStr
  FROM dbo.UDF_RemoveNumericStringsFromString('fjkfhk759734977fwe9794t23')
  /*Result set
   759734977979423 */

You can define it with more than 100 characters...
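Since it is an inline table-valued function, applying it to a whole table is usually done with CROSS APPLY; a sketch, with Imported and phone as placeholder names:

-- Set-based usage sketch of the inline TVF.
SELECT i.phone, f.OutStr AS cleaned_phone
FROM Imported AS i
CROSS APPLY dbo.UDF_RemoveNumericStringsFromString(i.phone) AS f;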


I would recommend enforcing a strict format for phone numbers in the database. I use the following format (assuming US phone numbers):

Database: 5555555555x555

Display: (555) 555-5555 ext 555

Input: 10 or more digits embedded in any string (a regex replace strips all non-numeric characters).
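If you go this route, the stored format can also be enforced with a CHECK constraint; a rough sketch, assuming a Contacts table with a phone column:

-- Rough sketch: 10 digits, optionally followed by 'x' and an all-digit extension.
-- Contacts / phone are placeholder names.
ALTER TABLE Contacts ADD CONSTRAINT CK_Contacts_PhoneFormat CHECK (
    phone LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
    OR (phone LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]x[0-9]%'
        AND phone NOT LIKE '%x%[^0-9]%')
);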


Working with varchars is fundamentally slower and less efficient than working with numerics, for obvious reasons. The functions you link to in the original post will indeed be quite slow, as they loop through each character in the string to determine whether or not it's a number. Do that for thousands of records and the process is bound to be slow. This is the perfect job for regular expressions, but they're not natively supported in SQL Server. You can add support using a CLR function; it's hard to say how fast that will be without trying it, but I would definitely expect it to be significantly faster than looping through each character of each phone number.

Once you get the phone numbers formatted in your database so that they're only numbers, you could switch to a numeric type in SQL which would yield lightning-fast comparisons against other numeric types. You might find that, depending on how fast your new data is coming in, doing the trimming and conversion to numeric on the database side is plenty fast enough once what you're comparing to is properly formatted, but if possible, you would be better off writing an import utility in a .NET language that would take care of these formatting issues before hitting the database.
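As a rough sketch of that idea (placeholder names again, reusing one of the cleaning functions shown earlier on this page; TRY_CONVERT needs SQL Server 2012 or later, use CONVERT on older versions):

-- Hypothetical: persist the cleaned number as BIGINT and index it for fast lookups.
ALTER TABLE Contacts ADD phone_numeric bigint NULL;

UPDATE Contacts
SET phone_numeric = TRY_CONVERT(bigint, dbo.RemoveAlphaCharacters(phone));

CREATE INDEX IX_Contacts_phone_numeric ON Contacts (phone_numeric);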

Either way though, you're going to have a big problem regarding optional formatting. Even if your numbers are guaranteed to be only North American in origin, some people will put the 1 in front of a fully area-code qualified phone number and others will not, which will cause the potential for multiple entries of the same phone number. Furthermore, depending on what your data represents, some people will be using their home phone number which might have several people living there, so a unique constraint on it would only allow one database member per household. Some would use their work number and have the same problem, and some would or wouldn't include the extension which would cause artificial uniqueness potential again.

All of that may or may not impact you, depending on your particular data and usages, but it's important to keep in mind!


"Although I can't isolate SQL as the source of the problem anymore, I still feel like it is."

Fire up SQL Profiler and take a look. Take the resulting queries and check their execution plans to make sure that index is being used.
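From SSMS, a quick way to sanity-check a single lookup is to turn on the IO and timing statistics and look at the actual execution plan (table and column names below are placeholders):

-- Quick sanity check: is the lookup doing an index seek, and how many reads?
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT *
FROM Contacts               -- placeholder table
WHERE phone = '5551234567'; -- placeholder column/value

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;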


In case you didn't want to create a function, or you needed just a single inline call in T-SQL, you could try:

set @Phone = REPLACE(REPLACE(REPLACE(REPLACE(@Phone,'(',''),' ',''),'-',''),')','')

Of course this is specific to removing phone number formatting, not a generic remove all special characters from string function.
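If all you need is the duplicate check, the same expression can be used inline on both sides of the comparison; a sketch with placeholder table names (Imported, Contacts):

-- Inline dedupe sketch using the same nested REPLACE on both sides.
SELECT i.*
FROM Imported AS i
WHERE NOT EXISTS (
    SELECT 1
    FROM Contacts AS c
    WHERE REPLACE(REPLACE(REPLACE(REPLACE(c.phone, '(', ''), ' ', ''), '-', ''), ')', '')
        = REPLACE(REPLACE(REPLACE(REPLACE(i.phone, '(', ''), ' ', ''), '-', ''), ')', '')
);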


Looking for a super simple solution (note that this assumes the stored value is formatted like (555) 555-5555):

SUBSTRING([Phone], CHARINDEX('(', [Phone], 1)+1, 3)
       + SUBSTRING([Phone], CHARINDEX(')', [Phone], 1)+1, 3)
       + SUBSTRING([Phone], CHARINDEX('-', [Phone], 1)+1, 4) AS Phone

I saw this solution with T-SQL code and PATINDEX. I like it :-)

CREATE Function [fnRemoveNonNumericCharacters](@strText VARCHAR(1000))
RETURNS VARCHAR(1000)
AS
BEGIN
    WHILE PATINDEX('%[^0-9]%', @strText) > 0
    BEGIN
        SET @strText = STUFF(@strText, PATINDEX('%[^0-9]%', @strText), 1, '')
    END
    RETURN @strText
END
