[mysql] How to remove all non-alpha numeric characters from a string in MySQL?

I'm working on a routine that compares strings, but for better efficiency I need to remove all characters that are not letters or numbers.

I'm using multiple REPLACE functions now, but maybe there is a faster and nicer solution ?

This question is related to mysql regex string alphanumeric

The answer is


So far, the only alternative approach less complicated than the other answers here is to determine the full set of special characters of the column, i.e. all the special characters that are in use in that column at the moment, and then do a sequential replace of all those characters, e.g.

update pages set slug = lower(replace(replace(replace(replace(name, ' ', ''), '-', ''), '.', ''), '&', '')); # replacing just space, -, ., & only

.

This is only advisable on a known set of data, otherwise it's trivial for some special characters to slip past with a blacklist approach instead of a whitelist approach.

Obviously, the simplest way is to pre-validate the data outside of sql due to the lack of robust built-in whitelisting (e.g. via a regex replace).


Probably a silly suggestion compared to others:

if(!preg_match("/^[a-zA-Z0-9]$/",$string)){
    $sortedString=preg_replace("/^[a-zA-Z0-9]+$/","",$string);
}

Based on the answer by Ryan Shillington, modified to work with strings longer than 255 characters and preserving spaces from the original string.

FYI there is lower(str) in the end.

I used this to compare strings:

DROP FUNCTION IF EXISTS spacealphanum;
DELIMITER $$
CREATE FUNCTION `spacealphanum`( str TEXT ) RETURNS TEXT CHARSET utf8
BEGIN 
  DECLARE i, len SMALLINT DEFAULT 1; 
  DECLARE ret TEXT DEFAULT ''; 
  DECLARE c CHAR(1); 
  SET len = CHAR_LENGTH( str ); 
  REPEAT 
    BEGIN 
      SET c = MID( str, i, 1 ); 
      IF c REGEXP '[[:alnum:]]' THEN 
        SET ret=CONCAT(ret,c); 
      ELSEIF  c = ' ' THEN
          SET ret=CONCAT(ret," ");
      END IF; 
      SET i = i + 1; 
    END; 
  UNTIL i > len END REPEAT; 
  SET ret = lower(ret);
  RETURN ret; 
  END $$
  DELIMITER ;

I needed to get only alphabetic characters of a string in a procedure, and did:

SET @source = "whatever you want";
SET @target = '';
SET @i = 1;
SET @len = LENGTH(@source);
WHILE @i <= @len DO
    SET @char = SUBSTRING(@source, @i, 1);
    IF ((ORD(@char) >= 65 && ORD(@char) <= 90) || (ORD(@char) >= 97 && ORD(@char) <= 122)) THEN
        SET @target = CONCAT(@target, @char);
    END IF;
    SET @i = @i + 1;
END WHILE;

Needed to replace non-alphanumeric characters rather than remove non-alphanumeric characters so I have created this based on Ryan Shillington's alphanum. Works for strings up to 255 characters in length

DROP FUNCTION IF EXISTS alphanumreplace; 
DELIMITER | 
CREATE FUNCTION alphanumreplace( str CHAR(255), d CHAR(32) ) RETURNS CHAR(255) 
BEGIN 
  DECLARE i, len SMALLINT DEFAULT 1; 
  DECLARE ret CHAR(32) DEFAULT ''; 
  DECLARE c CHAR(1); 
  SET len = CHAR_LENGTH( str ); 
  REPEAT 
    BEGIN 
      SET c = MID( str, i, 1 ); 
      IF c REGEXP '[[:alnum:]]' THEN SET ret=CONCAT(ret,c); 
      ELSE SET ret=CONCAT(ret,d);
      END IF; 
      SET i = i + 1; 
    END; 
  UNTIL i > len END REPEAT; 
  RETURN ret; 
END | 
DELIMITER ; 

Example:

select 'hello world!',alphanum('hello world!'),alphanumreplace('hello world!','-');
+--------------+--------------------------+-------------------------------------+
| hello world! | alphanum('hello world!') | alphanumreplace('hello world!','-') |
+--------------+--------------------------+-------------------------------------+
| hello world! | helloworld               | hello-world-                        |
+--------------+--------------------------+-------------------------------------+

You'll need to add the alphanum function seperately if you want that, I just have it here for the example.


The fastest way I was able to find (and using ) is with convert().

from Doc. CONVERT() with USING is used to convert data between different character sets.

Example:

convert(string USING ascii)

In your case the right character set will be self defined

NOTE from Doc. The USING form of CONVERT() is available as of 4.1.0.


This can be done with a regular expression replacer function I posted in another answer and have blogged about here. It may not be the most efficient solution possible and might look overkill for the job in hand - but like a Swiss army knife, it may come in useful for other reasons.

It can be seen in action removing all non-alphanumeric characters in this Rextester online demo.

SQL (excluding the function code for brevity):

SELECT txt,
       reg_replace(txt,
                   '[^a-zA-Z0-9]+',
                   '',
                   TRUE,
                   0,
                   0
                   ) AS `reg_replaced`
FROM test;

I had a similar problem with trying to match last names in our database that were slightly different. For example, sometimes people entered the same person's name as "McDonald" and also as "Mc Donald", or "St John" and "St. John".

Instead of trying to convert the Mysql data, I solved the problem by creating a function (in PHP) that would take a string and create an alpha-only regular expression:

function alpha_only_regex($str) {
    $alpha_only = str_split(preg_replace('/[^A-Z]/i', '', $str));
    return '^[^a-zA-Z]*'.implode('[^a-zA-Z]*', $alpha_only).'[^a-zA-Z]*$';
}

Now I can search the database with a query like this:

$lastname_regex = alpha_only_regex($lastname);
$query = "SELECT * FROM my_table WHERE lastname REGEXP '$lastname_regex';

Since MySQL 8.0 you can use regular expression to remove non alphanumeric characters from a string. There is method REGEXP_REPLACE

Here is the code to remove non-alphanumeric characters:

UPDATE {table} SET {column} = REGEXP_REPLACE({column}, '[^0-9a-zA-Z ]', '')

I tried a few solutions but at the end used replace. My data set is part numbers and I fairly know what to expect. But just for sanity, I used PHP to build the long query:

$dirty = array(' ', '-', '.', ',', ':', '?', '/', '!', '&', '@');
$query = 'part_no';
foreach ($dirty as $dirt) {
    $query = "replace($query,'$dirt','')";
}
echo $query;

This outputs something I used to get a headache from:

replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(part_no,' ',''),'-',''),'.',''),',',''),':',''),'?',''),'/',''),'!',''),'&',''),'@','')

I have written this UDF. However, it only trims special characters at the beginning of the string. It also converts the string to lower case. You can update this function if desired.

DELIMITER //

DROP FUNCTION IF EXISTS DELETE_DOUBLE_SPACES//

CREATE FUNCTION DELETE_DOUBLE_SPACES ( title VARCHAR(250) )
RETURNS VARCHAR(250) DETERMINISTIC
BEGIN
    DECLARE result VARCHAR(250);
    SET result = REPLACE( title, '  ', ' ' );
    WHILE (result <> title) DO 
        SET title = result;
        SET result = REPLACE( title, '  ', ' ' );
    END WHILE;
    RETURN result;
END//

DROP FUNCTION IF EXISTS LFILTER//

CREATE FUNCTION LFILTER ( title VARCHAR(250) )
RETURNS VARCHAR(250) DETERMINISTIC
BEGIN
    WHILE (1=1) DO
        IF(  ASCII(title) BETWEEN ASCII('a') AND ASCII('z')
            OR ASCII(title) BETWEEN ASCII('A') AND ASCII('Z')
            OR ASCII(title) BETWEEN ASCII('0') AND ASCII('9')
        ) THEN
            SET title = LOWER( title );
            SET title = REPLACE(
                REPLACE(
                    REPLACE(
                        title,
                        CHAR(10), ' '
                    ),
                    CHAR(13), ' '
                ) ,
                CHAR(9), ' '
            );
            SET title = DELETE_DOUBLE_SPACES( title );
            RETURN title;
        ELSE
            SET title = SUBSTRING( title, 2 );          
        END IF;
    END WHILE;
END//
DELIMITER ;

SELECT LFILTER(' !@#$%^&*()_+1a    b');

Also, you could use regular expressions but this requires installing a MySql extension.


Straight and battletested solution for latin and cyrillic characters:

DELIMITER //

CREATE FUNCTION `remove_non_numeric_and_letters`(input TEXT)
  RETURNS TEXT
  BEGIN
    DECLARE output TEXT DEFAULT '';
    DECLARE iterator INT DEFAULT 1;
    WHILE iterator < (LENGTH(input) + 1) DO
      IF SUBSTRING(input, iterator, 1) IN
         ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?', '?')
      THEN
        SET output = CONCAT(output, SUBSTRING(input, iterator, 1));
      END IF;
      SET iterator = iterator + 1;
    END WHILE;
    RETURN output;
  END //

DELIMITER ;

Usage:

-- outputs "hello12356"
SELECT remove_non_numeric_and_letters('hello - 12356-?????? ""]')

the alphanum function (self answered) have a bug, but I don't know why. For text "cas synt ls 75W140 1L" return "cassyntls75W1401", "L" from the end is missing some how.

Now I use

delimiter //
DROP FUNCTION IF EXISTS alphanum //
CREATE FUNCTION alphanum(prm_strInput varchar(255))
RETURNS VARCHAR(255)
DETERMINISTIC
BEGIN
  DECLARE i INT DEFAULT 1;
  DECLARE v_char VARCHAR(1);
  DECLARE v_parseStr VARCHAR(255) DEFAULT ' ';
WHILE (i <= LENGTH(prm_strInput) )  DO
  SET v_char = SUBSTR(prm_strInput,i,1);
  IF v_char REGEXP  '^[A-Za-z0-9]+$' THEN 
        SET v_parseStr = CONCAT(v_parseStr,v_char);  
  END IF;
  SET i = i + 1;
END WHILE;
RETURN trim(v_parseStr);
END
//

(found on google)


SELECT teststring REGEXP '[[:alnum:]]+';

SELECT * FROM testtable WHERE test REGEXP '[[:alnum:]]+'; 

See: http://dev.mysql.com/doc/refman/5.1/en/regexp.html
Scroll down to the section that says: [:character_class:]

If you want to manipulate strings the fastest way will be to use a str_udf, see:
https://github.com/hholzgra/mysql-udf-regexp


None of these answers worked for me. I had to create my own function called alphanum which stripped the chars for me:

DROP FUNCTION IF EXISTS alphanum; 
DELIMITER | 
CREATE FUNCTION alphanum( str CHAR(255) ) RETURNS CHAR(255) DETERMINISTIC
BEGIN 
  DECLARE i, len SMALLINT DEFAULT 1; 
  DECLARE ret CHAR(255) DEFAULT ''; 
  DECLARE c CHAR(1);
  IF str IS NOT NULL THEN 
    SET len = CHAR_LENGTH( str ); 
    REPEAT 
      BEGIN 
        SET c = MID( str, i, 1 ); 
        IF c REGEXP '[[:alnum:]]' THEN 
          SET ret=CONCAT(ret,c); 
        END IF; 
        SET i = i + 1; 
      END; 
    UNTIL i > len END REPEAT; 
  ELSE
    SET ret='';
  END IF;
  RETURN ret; 
END | 
DELIMITER ; 

Now I can do:

select 'This works finally!', alphanum('This works finally!');

and I get:

+---------------------+---------------------------------+
| This works finally! | alphanum('This works finally!') |
+---------------------+---------------------------------+
| This works finally! | Thisworksfinally                |
+---------------------+---------------------------------+
1 row in set (0.00 sec)

Hurray!


if you are using php then....

try{
$con = new PDO ("mysql:host=localhost;dbname=dbasename","root","");
}
catch(PDOException $e){
echo "error".$e-getMessage();   
}

$select = $con->prepare("SELECT * FROM table");
$select->setFetchMode(PDO::FETCH_ASSOC);
$select->execute();

while($data=$select->fetch()){ 

$id = $data['id'];
$column = $data['column'];
$column = preg_replace("/[^a-zA-Z0-9]+/", " ", $column); //remove all special characters

$update = $con->prepare("UPDATE table SET column=:column WHERE id='$id'");
$update->bindParam(':column', $column );
$update->execute();

// echo $column."<br>";
} 

Be careful, characters like ’ or » are considered as alpha by MySQL. It better to use something like :

IF c BETWEEN 'a' AND 'z' OR c BETWEEN 'A' AND 'Z' OR c BETWEEN '0' AND '9' OR c = '-' THEN


From a performance point of view, (and on the assumption that you read more than you write)

I think the best way would be to pre calculate and store a stripped version of the column, This way you do the transform less.

You can then put an index on the new column and get the database to do the work for you.


Examples related to mysql

Implement specialization in ER diagram How to post query parameters with Axios? PHP with MySQL 8.0+ error: The server requested authentication method unknown to the client Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver' phpMyAdmin - Error > Incorrect format parameter? Authentication plugin 'caching_sha2_password' is not supported How to resolve Unable to load authentication plugin 'caching_sha2_password' issue Connection Java-MySql : Public Key Retrieval is not allowed How to grant all privileges to root user in MySQL 8.0 MySQL 8.0 - Client does not support authentication protocol requested by server; consider upgrading MySQL client

Examples related to regex

Why my regexp for hyphenated words doesn't work? grep's at sign caught as whitespace Preg_match backtrack error regex match any single character (one character only) re.sub erroring with "Expected string or bytes-like object" Only numbers. Input number in React Visual Studio Code Search and Replace with Regular Expressions Strip / trim all strings of a dataframe return string with first match Regex How to capture multiple repeated groups?

Examples related to string

How to split a string in two and store it in a field String method cannot be found in a main class method Kotlin - How to correctly concatenate a String Replacing a character from a certain index Remove quotes from String in Python Detect whether a Python string is a number or a letter How does String substring work in Swift How does String.Index work in Swift swift 3.0 Data to String? How to parse JSON string in Typescript

Examples related to alphanumeric

Replace all non Alpha Numeric characters, New Lines, and multiple White Space with one Space HTML5 form validation pattern alphanumeric with spaces? Regular expression to allow spaces between words Regex for checking if a string is strictly alphanumeric Determine if char is a num or letter How to determine if a String has non-alphanumeric characters? How to remove all non-alpha numeric characters from a string in MySQL? How do I sort strings alphabetically while accounting for value when a string is numeric? How to remove leading zeros from alphanumeric text? How to strip all non-alphabetic characters from string in SQL Server?