[php] Replacing accented characters php

I am trying to replace accented characters with the normal replacements. Below is what I am currently doing.

    $string = "Éric Cantona";
    $strict = strtolower($string);

    echo "After Lower: ".$strict;

    $patterns[0] = '/[á|â|à|å|ä]/';
    $patterns[1] = '/[ð|é|ê|è|ë]/';
    $patterns[2] = '/[í|î|ì|ï]/';
    $patterns[3] = '/[ó|ô|ò|ø|õ|ö]/';
    $patterns[4] = '/[ú|û|ù|ü]/';
    $patterns[5] = '/æ/';
    $patterns[6] = '/ç/';
    $patterns[7] = '/ß/';
    $replacements[0] = 'a';
    $replacements[1] = 'e';
    $replacements[2] = 'i';
    $replacements[3] = 'o';
    $replacements[4] = 'u';
    $replacements[5] = 'ae';
    $replacements[6] = 'c';
    $replacements[7] = 'ss';

    $strict = preg_replace($patterns, $replacements, $strict);
    echo "Final: ".$strict;

This gives me:

    After Lower: éric cantona
    Final: ric cantona

The above gives me ric cantona I want the output to be eric cantona.

can anyone help me with where I am going wrong?

This question is related to php string preg-replace non-ascii-characters

The answer is


You can try this one

class Diacritic
{
    public function replaceDiacritic($input)
    {
        $input = iconv('UTF-8','ASCII//TRANSLIT',$input);
        $input = preg_replace("/['|^|`|~|]/","",$input);
        $input = preg_replace('/["]/','',$input);
        return preg_replace('/[" "]/','_',$input);
    }
}

Vietnamese characters for those who need them

'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
                            'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
                            'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
                            'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
                            'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );
$str = strtr( $str, $unwanted_array );

protected $_convertTable = array(
    '&' => 'and',   '@' => 'at',    '©' => 'c', '®' => 'r', 'À' => 'a',
    'Á' => 'a', 'Â' => 'a', 'Ä' => 'a', 'Å' => 'a', 'Æ' => 'ae','Ç' => 'c',
    'È' => 'e', 'É' => 'e', 'Ë' => 'e', 'Ì' => 'i', 'Í' => 'i', 'Î' => 'i',
    'Ï' => 'i', 'Ò' => 'o', 'Ó' => 'o', 'Ô' => 'o', 'Õ' => 'o', 'Ö' => 'o',
    'Ø' => 'o', 'Ù' => 'u', 'Ú' => 'u', 'Û' => 'u', 'Ü' => 'u', 'Ý' => 'y',
    'ß' => 'ss','à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a', 'å' => 'a',
    'æ' => 'ae','ç' => 'c', 'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
    'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'ò' => 'o', 'ó' => 'o',
    'ô' => 'o', 'õ' => 'o', 'ö' => 'o', 'ø' => 'o', 'ù' => 'u', 'ú' => 'u',
    'û' => 'u', 'ü' => 'u', 'ý' => 'y', 'þ' => 'p', 'ÿ' => 'y', 'A' => 'a',
    'a' => 'a', 'A' => 'a', 'a' => 'a', 'A' => 'a', 'a' => 'a', 'C' => 'c',
    'c' => 'c', 'C' => 'c', 'c' => 'c', 'C' => 'c', 'c' => 'c', 'C' => 'c',
    'c' => 'c', 'D' => 'd', 'd' => 'd', 'Ð' => 'd', 'd' => 'd', 'E' => 'e',
    'e' => 'e', 'E' => 'e', 'e' => 'e', 'E' => 'e', 'e' => 'e', 'E' => 'e',
    'e' => 'e', 'E' => 'e', 'e' => 'e', 'G' => 'g', 'g' => 'g', 'G' => 'g',
    'g' => 'g', 'G' => 'g', 'g' => 'g', 'G' => 'g', 'g' => 'g', 'H' => 'h',
    'h' => 'h', 'H' => 'h', 'h' => 'h', 'I' => 'i', 'i' => 'i', 'I' => 'i',
    'i' => 'i', 'I' => 'i', 'i' => 'i', 'I' => 'i', 'i' => 'i', 'I' => 'i',
    'i' => 'i', '?' => 'ij','?' => 'ij','J' => 'j', 'j' => 'j', 'K' => 'k',
    'k' => 'k', '?' => 'k', 'L' => 'l', 'l' => 'l', 'L' => 'l', 'l' => 'l',
    'L' => 'l', 'l' => 'l', '?' => 'l', '?' => 'l', 'L' => 'l', 'l' => 'l',
    'N' => 'n', 'n' => 'n', 'N' => 'n', 'n' => 'n', 'N' => 'n', 'n' => 'n',
    '?' => 'n', '?' => 'n', '?' => 'n', 'O' => 'o', 'o' => 'o', 'O' => 'o',
    'o' => 'o', 'O' => 'o', 'o' => 'o', 'Œ' => 'oe','œ' => 'oe','R' => 'r',
    'r' => 'r', 'R' => 'r', 'r' => 'r', 'R' => 'r', 'r' => 'r', 'S' => 's',
    's' => 's', 'S' => 's', 's' => 's', 'S' => 's', 's' => 's', 'Š' => 's',
    'š' => 's', 'T' => 't', 't' => 't', 'T' => 't', 't' => 't', 'T' => 't',
    't' => 't', 'U' => 'u', 'u' => 'u', 'U' => 'u', 'u' => 'u', 'U' => 'u',
    'u' => 'u', 'U' => 'u', 'u' => 'u', 'U' => 'u', 'u' => 'u', 'U' => 'u',
    'u' => 'u', 'W' => 'w', 'w' => 'w', 'Y' => 'y', 'y' => 'y', 'Ÿ' => 'y',
    'Z' => 'z', 'z' => 'z', 'Z' => 'z', 'z' => 'z', 'Ž' => 'z', 'ž' => 'z',
    '?' => 'z', '?' => 'e', 'ƒ' => 'f', 'O' => 'o', 'o' => 'o', 'U' => 'u',
    'u' => 'u', 'A' => 'a', 'a' => 'a', 'I' => 'i', 'i' => 'i', 'O' => 'o',
    'o' => 'o', 'U' => 'u', 'u' => 'u', 'U' => 'u', 'u' => 'u', 'U' => 'u',
    'u' => 'u', 'U' => 'u', 'u' => 'u', 'U' => 'u', 'u' => 'u', '?' => 'a',
    '?' => 'a', '?' => 'ae','?' => 'ae','?' => 'o', '?' => 'o', '?' => 'e',
    '?' => 'jo','?' => 'e', '?' => 'i', '?' => 'i', '?' => 'a', '?' => 'b',
    '?' => 'v', '?' => 'g', '?' => 'd', '?' => 'e', '?' => 'zh','?' => 'z',
    '?' => 'i', '?' => 'j', '?' => 'k', '?' => 'l', '?' => 'm', '?' => 'n',
    '?' => 'o', '?' => 'p', '?' => 'r', '?' => 's', '?' => 't', '?' => 'u',
    '?' => 'f', '?' => 'h', '?' => 'c', '?' => 'ch','?' => 'sh','?' => 'sch',
    '?' => '-', '?' => 'y', '?' => '-', '?' => 'je','?' => 'ju','?' => 'ja',
    '?' => 'a', '?' => 'b', '?' => 'v', '?' => 'g', '?' => 'd', '?' => 'e',
    '?' => 'zh','?' => 'z', '?' => 'i', '?' => 'j', '?' => 'k', '?' => 'l',
    '?' => 'm', '?' => 'n', '?' => 'o', '?' => 'p', '?' => 'r', '?' => 's',
    '?' => 't', '?' => 'u', '?' => 'f', '?' => 'h', '?' => 'c', '?' => 'ch',
    '?' => 'sh','?' => 'sch','?' => '-','?' => 'y', '?' => '-', '?' => 'je',
    '?' => 'ju','?' => 'ja','?' => 'jo','?' => 'e', '?' => 'i', '?' => 'i',
    '?' => 'g', '?' => 'g', '?' => 'a', '?' => 'b', '?' => 'g', '?' => 'd',
    '?' => 'h', '?' => 'v', '?' => 'z', '?' => 'h', '?' => 't', '?' => 'i',
    '?' => 'k', '?' => 'k', '?' => 'l', '?' => 'm', '?' => 'm', '?' => 'n',
    '?' => 'n', '?' => 's', '?' => 'e', '?' => 'p', '?' => 'p', '?' => 'C',
    '?' => 'c', '?' => 'q', '?' => 'r', '?' => 'w', '?' => 't', '™' => 'tm',
);

From magento, im using it for basically everything


You can use PHP strtr() function to get rid of accented characters :

$string = "Éric Cantona";
$accented_array = array('Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E','Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U','Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c','è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o','ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );

$required_str = strtr( $string, $accented_array );

So I found this on php.net page for preg_replace function

// replace accented chars

$string = "Zacarías Ferreíra"; // my definition for string variable
$accents = '/&([A-Za-z]{1,2})(grave|acute|circ|cedil|uml|lig);/';

$string_encoded = htmlentities($string,ENT_NOQUOTES,'UTF-8');

$string = preg_replace($accents,'$1',$string_encoded);

If you have encoding issues you may get someting like this "Zacarías Ferreíra", just decode the string and use said code above

$string = utf8_decode("Zacarías Ferreíra");

Disclaimer: I'm not supporting this answer anymore (I was blind at that time). But thanks for the up-votes =P

You can take this as basis. From WordPress, used to generate pretty urls (the entry point is the slugify() function):

/**
 * Converts all accent characters to ASCII characters.
 *
 * If there are no accent characters, then the string given is just returned.
 *
 * @param string $string Text that might have accent characters
 * @return string Filtered string with replaced "nice" characters.
 */

function remove_accents($string) {
 if (!preg_match('/[\x80-\xff]/', $string))
  return $string;
 if (seems_utf8($string)) {
  $chars = array(
  // Decompositions for Latin-1 Supplement
  chr(195).chr(128) => 'A', chr(195).chr(129) => 'A',
  chr(195).chr(130) => 'A', chr(195).chr(131) => 'A',
  chr(195).chr(132) => 'A', chr(195).chr(133) => 'A',
  chr(195).chr(135) => 'C', chr(195).chr(136) => 'E',
  chr(195).chr(137) => 'E', chr(195).chr(138) => 'E',
  chr(195).chr(139) => 'E', chr(195).chr(140) => 'I',
  chr(195).chr(141) => 'I', chr(195).chr(142) => 'I',
  chr(195).chr(143) => 'I', chr(195).chr(145) => 'N',
  chr(195).chr(146) => 'O', chr(195).chr(147) => 'O',
  chr(195).chr(148) => 'O', chr(195).chr(149) => 'O',
  chr(195).chr(150) => 'O', chr(195).chr(153) => 'U',
  chr(195).chr(154) => 'U', chr(195).chr(155) => 'U',
  chr(195).chr(156) => 'U', chr(195).chr(157) => 'Y',
  chr(195).chr(159) => 's', chr(195).chr(160) => 'a',
  chr(195).chr(161) => 'a', chr(195).chr(162) => 'a',
  chr(195).chr(163) => 'a', chr(195).chr(164) => 'a',
  chr(195).chr(165) => 'a', chr(195).chr(167) => 'c',
  chr(195).chr(168) => 'e', chr(195).chr(169) => 'e',
  chr(195).chr(170) => 'e', chr(195).chr(171) => 'e',
  chr(195).chr(172) => 'i', chr(195).chr(173) => 'i',
  chr(195).chr(174) => 'i', chr(195).chr(175) => 'i',
  chr(195).chr(177) => 'n', chr(195).chr(178) => 'o',
  chr(195).chr(179) => 'o', chr(195).chr(180) => 'o',
  chr(195).chr(181) => 'o', chr(195).chr(182) => 'o',
  chr(195).chr(182) => 'o', chr(195).chr(185) => 'u',
  chr(195).chr(186) => 'u', chr(195).chr(187) => 'u',
  chr(195).chr(188) => 'u', chr(195).chr(189) => 'y',
  chr(195).chr(191) => 'y',
  // Decompositions for Latin Extended-A
  chr(196).chr(128) => 'A', chr(196).chr(129) => 'a',
  chr(196).chr(130) => 'A', chr(196).chr(131) => 'a',
  chr(196).chr(132) => 'A', chr(196).chr(133) => 'a',
  chr(196).chr(134) => 'C', chr(196).chr(135) => 'c',
  chr(196).chr(136) => 'C', chr(196).chr(137) => 'c',
  chr(196).chr(138) => 'C', chr(196).chr(139) => 'c',
  chr(196).chr(140) => 'C', chr(196).chr(141) => 'c',
  chr(196).chr(142) => 'D', chr(196).chr(143) => 'd',
  chr(196).chr(144) => 'D', chr(196).chr(145) => 'd',
  chr(196).chr(146) => 'E', chr(196).chr(147) => 'e',
  chr(196).chr(148) => 'E', chr(196).chr(149) => 'e',
  chr(196).chr(150) => 'E', chr(196).chr(151) => 'e',
  chr(196).chr(152) => 'E', chr(196).chr(153) => 'e',
  chr(196).chr(154) => 'E', chr(196).chr(155) => 'e',
  chr(196).chr(156) => 'G', chr(196).chr(157) => 'g',
  chr(196).chr(158) => 'G', chr(196).chr(159) => 'g',
  chr(196).chr(160) => 'G', chr(196).chr(161) => 'g',
  chr(196).chr(162) => 'G', chr(196).chr(163) => 'g',
  chr(196).chr(164) => 'H', chr(196).chr(165) => 'h',
  chr(196).chr(166) => 'H', chr(196).chr(167) => 'h',
  chr(196).chr(168) => 'I', chr(196).chr(169) => 'i',
  chr(196).chr(170) => 'I', chr(196).chr(171) => 'i',
  chr(196).chr(172) => 'I', chr(196).chr(173) => 'i',
  chr(196).chr(174) => 'I', chr(196).chr(175) => 'i',
  chr(196).chr(176) => 'I', chr(196).chr(177) => 'i',
  chr(196).chr(178) => 'IJ',chr(196).chr(179) => 'ij',
  chr(196).chr(180) => 'J', chr(196).chr(181) => 'j',
  chr(196).chr(182) => 'K', chr(196).chr(183) => 'k',
  chr(196).chr(184) => 'k', chr(196).chr(185) => 'L',
  chr(196).chr(186) => 'l', chr(196).chr(187) => 'L',
  chr(196).chr(188) => 'l', chr(196).chr(189) => 'L',
  chr(196).chr(190) => 'l', chr(196).chr(191) => 'L',
  chr(197).chr(128) => 'l', chr(197).chr(129) => 'L',
  chr(197).chr(130) => 'l', chr(197).chr(131) => 'N',
  chr(197).chr(132) => 'n', chr(197).chr(133) => 'N',
  chr(197).chr(134) => 'n', chr(197).chr(135) => 'N',
  chr(197).chr(136) => 'n', chr(197).chr(137) => 'N',
  chr(197).chr(138) => 'n', chr(197).chr(139) => 'N',
  chr(197).chr(140) => 'O', chr(197).chr(141) => 'o',
  chr(197).chr(142) => 'O', chr(197).chr(143) => 'o',
  chr(197).chr(144) => 'O', chr(197).chr(145) => 'o',
  chr(197).chr(146) => 'OE',chr(197).chr(147) => 'oe',
  chr(197).chr(148) => 'R',chr(197).chr(149) => 'r',
  chr(197).chr(150) => 'R',chr(197).chr(151) => 'r',
  chr(197).chr(152) => 'R',chr(197).chr(153) => 'r',
  chr(197).chr(154) => 'S',chr(197).chr(155) => 's',
  chr(197).chr(156) => 'S',chr(197).chr(157) => 's',
  chr(197).chr(158) => 'S',chr(197).chr(159) => 's',
  chr(197).chr(160) => 'S', chr(197).chr(161) => 's',
  chr(197).chr(162) => 'T', chr(197).chr(163) => 't',
  chr(197).chr(164) => 'T', chr(197).chr(165) => 't',
  chr(197).chr(166) => 'T', chr(197).chr(167) => 't',
  chr(197).chr(168) => 'U', chr(197).chr(169) => 'u',
  chr(197).chr(170) => 'U', chr(197).chr(171) => 'u',
  chr(197).chr(172) => 'U', chr(197).chr(173) => 'u',
  chr(197).chr(174) => 'U', chr(197).chr(175) => 'u',
  chr(197).chr(176) => 'U', chr(197).chr(177) => 'u',
  chr(197).chr(178) => 'U', chr(197).chr(179) => 'u',
  chr(197).chr(180) => 'W', chr(197).chr(181) => 'w',
  chr(197).chr(182) => 'Y', chr(197).chr(183) => 'y',
  chr(197).chr(184) => 'Y', chr(197).chr(185) => 'Z',
  chr(197).chr(186) => 'z', chr(197).chr(187) => 'Z',
  chr(197).chr(188) => 'z', chr(197).chr(189) => 'Z',
  chr(197).chr(190) => 'z', chr(197).chr(191) => 's',
  // Euro Sign
  chr(226).chr(130).chr(172) => 'E',
  // GBP (Pound) Sign
  chr(194).chr(163) => '');
  $string = strtr($string, $chars);
 } else {
  // Assume ISO-8859-1 if not UTF-8
  $chars['in'] = chr(128).chr(131).chr(138).chr(142).chr(154).chr(158)
   .chr(159).chr(162).chr(165).chr(181).chr(192).chr(193).chr(194)
   .chr(195).chr(196).chr(197).chr(199).chr(200).chr(201).chr(202)
   .chr(203).chr(204).chr(205).chr(206).chr(207).chr(209).chr(210)
   .chr(211).chr(212).chr(213).chr(214).chr(216).chr(217).chr(218)
   .chr(219).chr(220).chr(221).chr(224).chr(225).chr(226).chr(227)
   .chr(228).chr(229).chr(231).chr(232).chr(233).chr(234).chr(235)
   .chr(236).chr(237).chr(238).chr(239).chr(241).chr(242).chr(243)
   .chr(244).chr(245).chr(246).chr(248).chr(249).chr(250).chr(251)
   .chr(252).chr(253).chr(255);
  $chars['out'] = "EfSZszYcYuAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy";
  $string = strtr($string, $chars['in'], $chars['out']);
  $double_chars['in'] = array(chr(140), chr(156), chr(198), chr(208), chr(222), chr(223), chr(230), chr(240), chr(254));
  $double_chars['out'] = array('OE', 'oe', 'AE', 'DH', 'TH', 'ss', 'ae', 'dh', 'th');
  $string = str_replace($double_chars['in'], $double_chars['out'], $string);
 }
 return $string;
}

/**
 * Checks to see if a string is utf8 encoded.
 *
 * @author bmorel at ssi dot fr
 *
 * @param string $Str The string to be checked
 * @return bool True if $Str fits a UTF-8 model, false otherwise.
 */
function seems_utf8($Str) { # by bmorel at ssi dot fr
 $length = strlen($Str);
 for ($i = 0; $i < $length; $i++) {
  if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
  elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n = 1; # 110bbbbb
  elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n = 2; # 1110bbbb
  elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n = 3; # 11110bbb
  elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n = 4; # 111110bb
  elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n = 5; # 1111110b
  else return false; # Does not match any model
  for ($j = 0; $j < $n; $j++) { # n bytes matching 10bbbbbb follow ?
   if ((++$i == $length) || ((ord($Str[$i]) & 0xC0) != 0x80))
   return false;
  }
 }
 return true;
}

function utf8_uri_encode($utf8_string, $length = 0) {
 $unicode = '';
 $values = array();
 $num_octets = 1;
 $unicode_length = 0;
 $string_length = strlen($utf8_string);
 for ($i = 0; $i < $string_length; $i++) {
  $value = ord($utf8_string[$i]);
  if ($value < 128) {
   if ($length && ($unicode_length >= $length))
    break;
   $unicode .= chr($value);
   $unicode_length++;
  } else {
   if (count($values) == 0) $num_octets = ($value < 224) ? 2 : 3;
   $values[] = $value;
   if ($length && ($unicode_length + ($num_octets * 3)) > $length)
    break;
   if (count( $values ) == $num_octets) {
    if ($num_octets == 3) {
     $unicode .= '%' . dechex($values[0]) . '%' . dechex($values[1]) . '%' . dechex($values[2]);
     $unicode_length += 9;
    } else {
     $unicode .= '%' . dechex($values[0]) . '%' . dechex($values[1]);
     $unicode_length += 6;
    }
    $values = array();
    $num_octets = 1;
   }
  }
 }
 return $unicode;
}

/**
 * Sanitizes title, replacing whitespace with dashes.
 *
 * Limits the output to alphanumeric characters, underscore (_) and dash (-).
 * Whitespace becomes a dash.
 *
 * @param string $title The title to be sanitized.
 * @return string The sanitized title.
 */
function slugify($title) {
 $title = strip_tags($title);
 // Preserve escaped octets.
 $title = preg_replace('|%([a-fA-F0-9][a-fA-F0-9])|', '---$1---', $title);
 // Remove percent signs that are not part of an octet.
 $title = str_replace('%', '', $title);
 // Restore octets.
 $title = preg_replace('|---([a-fA-F0-9][a-fA-F0-9])---|', '%$1', $title);
 $title = remove_accents($title);
 if (seems_utf8($title)) {
  if (function_exists('mb_strtolower')) {
   $title = mb_strtolower($title, 'UTF-8');
  }
  $title = utf8_uri_encode($title, 200);
 }
 $title = strtolower($title);
 $title = preg_replace('/&.+?;/', '', $title); // kill entities
 $title = preg_replace('/[^%a-z0-9 _-]/', '', $title);
 $title = preg_replace('/\s+/', '-', $title);
 $title = preg_replace('|-+|', '-', $title);
 $title = trim($title, '-');
 return $title;
}

This worked for me:

<?php
setlocale(LC_ALL, "en_US.utf8"); 
$val = iconv('UTF-8','ASCII//TRANSLIT',$val);
?>

It's worked like magically, i have used only array, this pattern is worked for me. check this pattern


Adding a little bit to what Lizard said, it worked to display correctly on web page, but added some other codes to complete what I was looking for replacing my tags to search correctly into my database with special characters. Thanks in advance.

$unwanted_array = array(    'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
                            'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
                            'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
                            'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
                            'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y', 
                            '&#225;'=>'a', '&#233;'=>'e', '&#237;'=>'i', '&#243;'=>'o', '&#250;'=>'u',  
                            '&#193;'=>'A', '&#201;'=>'E', '&#205;'=>'I', '&#211;'=>'O', '&#218;'=>'U',
                            '&#209;'=>'N', '&#241;'=>'n' );
$newtag = strtr( $newtag, $unwanted_array );

if you have http://php.net/manual/en/book.intl.php available, this will solve your problem:

$string = "Éric Cantona";
$transliterator = Transliterator::createFromRules(':: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
echo $normalized = $transliterator->transliterate($string);

EDIT

To install the php extension in ubuntu:

apt-get install php-intl

Don't forget, in composer, to require the extension ext-intl to ensure it properly fits into deployed systems.


strtolower only works on iso-8859-1 encoded strings. You could try with mb_strtolower.

Or, if you have to mangle with multibyte-extensions, you might as well use iconv's transliteration support:

iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text);

Edit:

It seems I was a bit fast. You appear to use iso-8859-1, so your current strategy will work. You just need to write the regexp's properly. Eg.:

'/(ð|é|ê|è|ë)/'

not:

'/[ð|é|ê|è|ë]/'

As an alternative (a bit more complex in nature through), have a look at how wordpress does accent removal. Made some changes below to make it run independently without referencing wordpress functions.

     function mbstring_binary_safe_encoding($reset = false)
{
    static $encodings  = array();
    static $overloaded = null;

    if (is_null($overloaded)) {
        $overloaded = function_exists('mb_internal_encoding') && (ini_get('mbstring.func_overload') & 2);
    }

    if (false === $overloaded) {
        return;
    }

    if (!$reset) {
        $encoding = mb_internal_encoding();
        array_push($encodings, $encoding);
        mb_internal_encoding('ISO-8859-1');
    }

    if ($reset && $encodings) {
        $encoding = array_pop($encodings);
        mb_internal_encoding($encoding);
    }
}

function seems_utf8($str)
{
    mbstring_binary_safe_encoding();
    $length = strlen($str);
    mbstring_binary_safe_encoding(true);
    for ($i = 0; $i < $length; $i++) {
        $c = ord($str[$i]);
        if ($c < 0x80) {
            $n = 0;
        }
        // 0bbbbbbb
        elseif (($c & 0xE0) == 0xC0) {
            $n = 1;
        }
        // 110bbbbb
        elseif (($c & 0xF0) == 0xE0) {
            $n = 2;
        }
        // 1110bbbb
        elseif (($c & 0xF8) == 0xF0) {
            $n = 3;
        }
        // 11110bbb
        elseif (($c & 0xFC) == 0xF8) {
            $n = 4;
        }
        // 111110bb
        elseif (($c & 0xFE) == 0xFC) {
            $n = 5;
        }
        // 1111110b
        else {
                return false;
            }
            // Does not match any model
            for ($j = 0; $j < $n; $j++) {
                // n bytes matching 10bbbbbb follow ?
                if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80)) {
                    return false;
                }

            }
        }
        return true;
    }

    function remove_accents($string)
{
        if (!preg_match('/[\x80-\xff]/', $string)) {
            return $string;
        }

        if (seems_utf8($string)) {
            $chars = array(
                // Decompositions for Latin-1 Supplement
                'ª' => 'a', 'º'  => 'o',
                'À' => 'A', 'Á'  => 'A',
                'Â' => 'A', 'Ã'  => 'A',
                'Ä' => 'A', 'Å'  => 'A',
                'Æ' => 'AE', 'Ç' => 'C',
                'È' => 'E', 'É'  => 'E',
                'Ê' => 'E', 'Ë'  => 'E',
                'Ì' => 'I', 'Í'  => 'I',
                'Î' => 'I', 'Ï'  => 'I',
                'Ð' => 'D', 'Ñ'  => 'N',
                'Ò' => 'O', 'Ó'  => 'O',
                'Ô' => 'O', 'Õ'  => 'O',
                'Ö' => 'O', 'Ù'  => 'U',
                'Ú' => 'U', 'Û'  => 'U',
                'Ü' => 'U', 'Ý'  => 'Y',
                'Þ' => 'TH', 'ß' => 's',
                'à' => 'a', 'á'  => 'a',
                'â' => 'a', 'ã'  => 'a',
                'ä' => 'a', 'å'  => 'a',
                'æ' => 'ae', 'ç' => 'c',
                'è' => 'e', 'é'  => 'e',
                'ê' => 'e', 'ë'  => 'e',
                'ì' => 'i', 'í'  => 'i',
                'î' => 'i', 'ï'  => 'i',
                'ð' => 'd', 'ñ'  => 'n',
                'ò' => 'o', 'ó'  => 'o',
                'ô' => 'o', 'õ'  => 'o',
                'ö' => 'o', 'ø'  => 'o',
                'ù' => 'u', 'ú'  => 'u',
                'û' => 'u', 'ü'  => 'u',
                'ý' => 'y', 'þ'  => 'th',
                'ÿ' => 'y', 'Ø'  => 'O',
                // Decompositions for Latin Extended-A
                'A' => 'A', 'a'  => 'a',
                'A' => 'A', 'a'  => 'a',
                'A' => 'A', 'a'  => 'a',
                'C' => 'C', 'c'  => 'c',
                'C' => 'C', 'c'  => 'c',
                'C' => 'C', 'c'  => 'c',
                'C' => 'C', 'c'  => 'c',
                'D' => 'D', 'd'  => 'd',
                'Ð' => 'D', 'd'  => 'd',
                'E' => 'E', 'e'  => 'e',
                'E' => 'E', 'e'  => 'e',
                'E' => 'E', 'e'  => 'e',
                'E' => 'E', 'e'  => 'e',
                'E' => 'E', 'e'  => 'e',
                'G' => 'G', 'g'  => 'g',
                'G' => 'G', 'g'  => 'g',
                'G' => 'G', 'g'  => 'g',
                'G' => 'G', 'g'  => 'g',
                'H' => 'H', 'h'  => 'h',
                'H' => 'H', 'h'  => 'h',
                'I' => 'I', 'i'  => 'i',
                'I' => 'I', 'i'  => 'i',
                'I' => 'I', 'i'  => 'i',
                'I' => 'I', 'i'  => 'i',
                'I' => 'I', 'i'  => 'i',
                '?' => 'IJ', '?' => 'ij',
                'J' => 'J', 'j'  => 'j',
                'K' => 'K', 'k'  => 'k',
                '?' => 'k', 'L'  => 'L',
                'l' => 'l', 'L'  => 'L',
                'l' => 'l', 'L'  => 'L',
                'l' => 'l', '?'  => 'L',
                '?' => 'l', 'L'  => 'L',
                'l' => 'l', 'N'  => 'N',
                'n' => 'n', 'N'  => 'N',
                'n' => 'n', 'N'  => 'N',
                'n' => 'n', '?'  => 'n',
                '?' => 'N', '?'  => 'n',
                'O' => 'O', 'o'  => 'o',
                'O' => 'O', 'o'  => 'o',
                'O' => 'O', 'o'  => 'o',
                'Œ' => 'OE', 'œ' => 'oe',
                'R' => 'R', 'r'  => 'r',
                'R' => 'R', 'r'  => 'r',
                'R' => 'R', 'r'  => 'r',
                'S' => 'S', 's'  => 's',
                'S' => 'S', 's'  => 's',
                'S' => 'S', 's'  => 's',
                'Š' => 'S', 'š'  => 's',
                'T' => 'T', 't'  => 't',
                'T' => 'T', 't'  => 't',
                'T' => 'T', 't'  => 't',
                'U' => 'U', 'u'  => 'u',
                'U' => 'U', 'u'  => 'u',
                'U' => 'U', 'u'  => 'u',
                'U' => 'U', 'u'  => 'u',
                'U' => 'U', 'u'  => 'u',
                'U' => 'U', 'u'  => 'u',
                'W' => 'W', 'w'  => 'w',
                'Y' => 'Y', 'y'  => 'y',
                'Ÿ' => 'Y', 'Z'  => 'Z',
                'z' => 'z', 'Z'  => 'Z',
                'z' => 'z', 'Ž'  => 'Z',
                'ž' => 'z', '?'  => 's',
                // Decompositions for Latin Extended-B
                '?' => 'S', '?'  => 's',
                '?' => 'T', '?'  => 't',
                // Euro Sign
                '€' => 'E',
                // GBP (Pound) Sign
                '£' => '',
                // Vowels with diacritic (Vietnamese)
                // unmarked
                'O' => 'O', 'o'  => 'o',
                'U' => 'U', 'u'  => 'u',
                // grave accent
                '?' => 'A', '?'  => 'a',
                '?' => 'A', '?'  => 'a',
                '?' => 'E', '?'  => 'e',
                '?' => 'O', '?'  => 'o',
                '?' => 'O', '?'  => 'o',
                '?' => 'U', '?'  => 'u',
                '?' => 'Y', '?'  => 'y',
                // hook
                '?' => 'A', '?'  => 'a',
                '?' => 'A', '?'  => 'a',
                '?' => 'A', '?'  => 'a',
                '?' => 'E', '?'  => 'e',
                '?' => 'E', '?'  => 'e',
                '?' => 'I', '?'  => 'i',
                '?' => 'O', '?'  => 'o',
                '?' => 'O', '?'  => 'o',
                '?' => 'O', '?'  => 'o',
                '?' => 'U', '?'  => 'u',
                '?' => 'U', '?'  => 'u',
                '?' => 'Y', '?'  => 'y',
                // tilde
                '?' => 'A', '?'  => 'a',
                '?' => 'A', '?'  => 'a',
                '?' => 'E', '?'  => 'e',
                '?' => 'E', '?'  => 'e',
                '?' => 'O', '?'  => 'o',
                '?' => 'O', '?'  => 'o',
                '?' => 'U', '?'  => 'u',
                '?' => 'Y', '?'  => 'y',
                // acute accent
                '?' => 'A', '?'  => 'a',
                '?' => 'A', '?'  => 'a',
                '?' => 'E', '?'  => 'e',
                '?' => 'O', '?'  => 'o',
                '?' => 'O', '?'  => 'o',
                '?' => 'U', '?'  => 'u',
                // dot below
                '?' => 'A', '?'  => 'a',
                '?' => 'A', '?'  => 'a',
                '?' => 'A', '?'  => 'a',
                '?' => 'E', '?'  => 'e',
                '?' => 'E', '?'  => 'e',
                '?' => 'I', '?'  => 'i',
                '?' => 'O', '?'  => 'o',
                '?' => 'O', '?'  => 'o',
                '?' => 'O', '?'  => 'o',
                '?' => 'U', '?'  => 'u',
                '?' => 'U', '?'  => 'u',
                '?' => 'Y', '?'  => 'y',
                // Vowels with diacritic (Chinese, Hanyu Pinyin)
                '?' => 'a',
                // macron
                'U' => 'U', 'u'  => 'u',
                // acute accent
                'U' => 'U', 'u'  => 'u',
                // caron
                'A' => 'A', 'a'  => 'a',
                'I' => 'I', 'i'  => 'i',
                'O' => 'O', 'o'  => 'o',
                'U' => 'U', 'u'  => 'u',
                'U' => 'U', 'u'  => 'u',
                // grave accent
                'U' => 'U', 'u'  => 'u',
            );

            $string = strtr($string, $chars);
        } else {
            $chars = array();
            // Assume ISO-8859-1 if not UTF-8
            $chars['in'] = "\x80\x83\x8a\x8e\x9a\x9e"
                . "\x9f\xa2\xa5\xb5\xc0\xc1\xc2"
                . "\xc3\xc4\xc5\xc7\xc8\xc9\xca"
                . "\xcb\xcc\xcd\xce\xcf\xd1\xd2"
                . "\xd3\xd4\xd5\xd6\xd8\xd9\xda"
                . "\xdb\xdc\xdd\xe0\xe1\xe2\xe3"
                . "\xe4\xe5\xe7\xe8\xe9\xea\xeb"
                . "\xec\xed\xee\xef\xf1\xf2\xf3"
                . "\xf4\xf5\xf6\xf8\xf9\xfa\xfb"
                . "\xfc\xfd\xff";

            $chars['out'] = "EfSZszYcYuAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy";

            $string              = strtr($string, $chars['in'], $chars['out']);
            $double_chars        = array();
            $double_chars['in']  = array("\x8c", "\x9c", "\xc6", "\xd0", "\xde", "\xdf", "\xe6", "\xf0", "\xfe");
            $double_chars['out'] = array('OE', 'oe', 'AE', 'DH', 'TH', 'ss', 'ae', 'dh', 'th');
            $string              = str_replace($double_chars['in'], $double_chars['out'], $string);
        }

        return $string;
    }

In PHP 5.4 the intl extension provides a new class named Transliterator.

I believe that's the best way to remove diacritics for two reasons:

  1. Transliterator is based on ICU, so you're using the tables of the ICU library. ICU is a great project, developed over the year to provide comprehensive tables and functionalities. Whatever table you want to write yourself, it will never be as complete as the one from ICU.

  2. In UTF-8, characters could be represented differently. For example, the character ñ could be saved as a single (multi-byte) character, or as the combination of characters ˜ (multibyte) and n. In addition to this, some characters in Unicode are homograph: they look the same while having different codepoints. For this reason it's also important to normalize the string.

Here's a sample code, taken from an old answer of mine:

<?php
$transliterator = Transliterator::createFromRules(':: NFD; :: [:Nonspacing Mark:] Remove; :: NFC;', Transliterator::FORWARD);
$test = ['abcd', 'èe', '€', 'àòùìéëü', 'àòùìéëü', 'tiësto'];
foreach($test as $e) {
    $normalized = $transliterator->transliterate($e);
    echo $e. ' --> '.$normalized."\n";
}
?>

Result:

abcd --> abcd
èe --> ee
€ --> €
àòùìéëü --> aouieeu
àòùìéëü --> aouieeu
tiësto --> tiesto

The first argument for the Transliterator class performs the removal of diacritics as well as the normalization of the string.


I just came accross the answer from Lizard which is extremely helpful - especially when you do some sorting. Isn't is beautiful how many chars we need to say mostly the same ;)

If anyone else if looking for a all-in solution (as far as the comments above tell), here's the copy&paste:

/**
 * Replace language-specific characters by ASCII-equivalents.
 * @param string $s
 * @return string
 */
public static function normalizeChars($s) {
    $replace = array(
        '?'=>'-', '?'=>'-', '?'=>'-', '?'=>'-',
        'A'=>'A', 'A'=>'A', 'À'=>'A', 'Ã'=>'A', 'Á'=>'A', 'Æ'=>'A', 'Â'=>'A', 'Å'=>'A', 'Ä'=>'Ae',
        'Þ'=>'B',
        'C'=>'C', '?'=>'C', 'Ç'=>'C',
        'È'=>'E', 'E'=>'E', 'É'=>'E', 'Ë'=>'E', 'Ê'=>'E',
        'G'=>'G',
        'I'=>'I', 'Ï'=>'I', 'Î'=>'I', 'Í'=>'I', 'Ì'=>'I',
        'L'=>'L',
        'Ñ'=>'N', 'N'=>'N',
        'Ø'=>'O', 'Ó'=>'O', 'Ò'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'Oe',
        'S'=>'S', 'S'=>'S', '?'=>'S', 'Š'=>'S',
        '?'=>'T',
        'Ù'=>'U', 'Û'=>'U', 'Ú'=>'U', 'Ü'=>'Ue',
        'Ý'=>'Y',
        'Z'=>'Z', 'Ž'=>'Z', 'Z'=>'Z',
        'â'=>'a', 'a'=>'a', 'a'=>'a', 'á'=>'a', 'a'=>'a', 'ã'=>'a', 'A'=>'a', '?'=>'a', '?'=>'a', 'å'=>'a', 'à'=>'a', '?'=>'a', '?'=>'a', 'A'=>'a', '?'=>'a', 'a'=>'a', 'ä'=>'ae', 'æ'=>'ae', '?'=>'ae', '?'=>'ae',
        '?'=>'b', '?'=>'b', '?'=>'b', 'þ'=>'b',
        'c'=>'c', 'C'=>'c', 'C'=>'c', 'c'=>'c', 'ç'=>'c', '?'=>'c', '?'=>'c', 'c'=>'c', '?'=>'c', 'C'=>'c', 'c'=>'c', '?'=>'ch', '?'=>'ch',
        '?'=>'d', 'd'=>'d', 'Ð'=>'d', 'D'=>'d', 'd'=>'d', '?'=>'d', '?'=>'D', 'ð'=>'d',
        '?'=>'e', '?'=>'e', '?'=>'e', '?'=>'e', '?'=>'e', 'e'=>'e', 'e'=>'e', 'e'=>'e', 'E'=>'e', 'E'=>'e', 'e'=>'e', 'e'=>'e', 'E'=>'e', '?'=>'e', 'E'=>'e', 'ê'=>'e', '?'=>'e', 'è'=>'e', 'ë'=>'e', 'é'=>'e',
        '?'=>'f', 'ƒ'=>'f', '?'=>'f',
        'g'=>'g', 'G'=>'g', 'G'=>'g', 'G'=>'g', '?'=>'g', '?'=>'g', 'g'=>'g', 'g'=>'g', '?'=>'g', '?'=>'g', '?'=>'g', 'g'=>'g',
        '?'=>'h', 'h'=>'h', '?'=>'h', 'H'=>'h', 'H'=>'h', 'h'=>'h', '?'=>'h', '?'=>'h',
        'î'=>'i', 'ï'=>'i', 'í'=>'i', 'ì'=>'i', 'i'=>'i', 'i'=>'i', 'i'=>'i', 'I'=>'i', '?'=>'i', 'i'=>'i', 'i'=>'i', 'I'=>'i', 'I'=>'i', '?'=>'i', 'I'=>'i', '?'=>'i', '?'=>'i', 'I'=>'i', '?'=>'i', '?'=>'i', '?'=>'i', 'i'=>'i', '?'=>'ij', '?'=>'ij',
        '?'=>'j', '?'=>'j', 'J'=>'j', 'j'=>'j', '?'=>'ja', '?'=>'ja', '?'=>'je', '?'=>'je', '?'=>'jo', '?'=>'jo', '?'=>'ju', '?'=>'ju',
        '?'=>'k', '?'=>'k', 'K'=>'k', '?'=>'k', '?'=>'k', 'k'=>'k', '?'=>'k',
        '?'=>'l', '?'=>'l', '?'=>'l', 'l'=>'l', 'l'=>'l', 'l'=>'l', 'L'=>'l', 'L'=>'l', '?'=>'l', 'L'=>'l', 'l'=>'l', '?'=>'l',
        '?'=>'m', '?'=>'m', '?'=>'m', '?'=>'m',
        'ñ'=>'n', '?'=>'n', 'N'=>'n', '?'=>'n', '?'=>'n', '?'=>'n', '?'=>'n', 'n'=>'n', '?'=>'n', 'n'=>'n', '?'=>'n', 'N'=>'n', 'n'=>'n',
        '?'=>'o', '?'=>'o', 'o'=>'o', 'õ'=>'o', 'ô'=>'o', 'O'=>'o', 'o'=>'o', 'O'=>'o', 'O'=>'o', 'o'=>'o', 'ø'=>'o', '?'=>'o', 'o'=>'o', 'ò'=>'o', '?'=>'o', 'O'=>'o', 'o'=>'o', 'ó'=>'o', 'O'=>'o', 'œ'=>'oe', 'Œ'=>'oe', 'ö'=>'oe',
        '?'=>'p', '?'=>'p', '?'=>'p', '?'=>'p',
        '?'=>'q',
        'r'=>'r', 'r'=>'r', 'R'=>'r', 'r'=>'r', 'R'=>'r', '?'=>'r', 'R'=>'r', '?'=>'r', '?'=>'r',
        '?'=>'s', '?'=>'s', 'S'=>'s', 'š'=>'s', 's'=>'s', '?'=>'s', 's'=>'s', '?'=>'s', 's'=>'s', '?'=>'sch', '?'=>'sch', '?'=>'sh', '?'=>'sh', 'ß'=>'ss',
        '?'=>'t', '?'=>'t', 't'=>'t', '?'=>'t', 't'=>'t', 't'=>'t', 'T'=>'t', '?'=>'t', '?'=>'t', 'T'=>'t', 'T'=>'t', '™'=>'tm',
        'u'=>'u', '?'=>'u', 'U'=>'u', 'u'=>'u', 'U'=>'u', 'u'=>'u', 'U'=>'u', 'U'=>'u', 'u'=>'u', 'U'=>'u', 'u'=>'u', 'U'=>'u', 'U'=>'u', 'u'=>'u', 'u'=>'u', 'U'=>'u', 'U'=>'u', 'u'=>'u', 'U'=>'u', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', '?'=>'u', 'u'=>'u', 'u'=>'u', 'U'=>'u', 'U'=>'u', 'u'=>'u', 'u'=>'u', 'ü'=>'ue',
        '?'=>'v', '?'=>'v', '?'=>'v',
        '?'=>'w', 'w'=>'w', 'W'=>'w',
        '?'=>'y', 'y'=>'y', 'ý'=>'y', 'ÿ'=>'y', 'Ÿ'=>'y', 'Y'=>'y',
        '?'=>'y', 'ž'=>'z', '?'=>'z', '?'=>'z', 'z'=>'z', '?'=>'z', 'z'=>'z', '?'=>'z', '?'=>'zh', '?'=>'zh'
    );
    return strtr($s, $replace);
}

Note some slight changes regarding the German umlauts (ä => ae)

Edit: Included more characters based on the posting from user3682119 (except for the copyright symbol) and the comment from daker.


To remove the diacritics, use iconv:

$val = iconv('ISO-8859-1','ASCII//TRANSLIT',$val);

or

$val = iconv('UTF-8','ASCII//TRANSLIT',$val);

note that php has some weird bug in that it (sometimes?) needs to have a locale set to make these conversions work, using setlocale().

edit tested, it gets all of your diacritics out of the box:

$val = "á|â|à|å|ä ð|é|ê|è|ë í|î|ì|ï ó|ô|ò|ø|õ|ö ú|û|ù|ü æ ç ß abc ABC 123";
echo iconv('UTF-8','ASCII//TRANSLIT',$val); 

output (updated 2019-12-30)

a|a|a|a|a d|e|e|e|e i|i|i|i o|o|o|o|o|o u|u|u|u ae c ss abc ABC 123

Note that ð is correctly transliterated to d instead of o, as in the accepted answer.


I found this way to be a good one, without having to worry too much about charsets and arrays, or iconv:

function replace_accents($str) {
   $str = htmlentities($str, ENT_COMPAT, "UTF-8");
   $str = preg_replace('/&([a-zA-Z])(uml|acute|grave|circ|tilde|ring);/','$1',$str);
   return html_entity_decode($str);
}

I've searched and your idea for accent striping is quite awesome and cost-effective but your regex is wrongly done and misses 2 extra params. Long story short the regex must be:

$patterns[0] = '/[áâàåä]/ui';
$patterns[1] = '/[ðéêèë]/ui';
$patterns[2] = '/[íîìï]/ui';
$patterns[3] = '/[óôòøõö]/ui';
$patterns[4] = '/[úûùü]/ui';
$patterns[5] = '/æ/ui';
$patterns[6] = '/ç/ui';
$patterns[7] = '/ß/ui';
$replacements[0] = 'a';
$replacements[1] = 'e';
$replacements[2] = 'i';
$replacements[3] = 'o';
$replacements[4] = 'u';
$replacements[5] = 'ae';
$replacements[6] = 'c';
$replacements[7] = 'ss';

As you can see is quite similar but the most important thing is the paramas after the second slash of the regular expression. When a regualr expression is like this /[someCoolRegex]/ui the u specifies that it must use unicode and the i specifies that is case insensitive, I've tested my own and with the ansewer in this forum I must say is more cost efective than using strtr.

Hope someone reads this answer.


An updated answer based on @BurninLeo's answer

function replace_spec_char($subject) {
    $char_map = array(
        "?" => "-", "?" => "-", "?" => "-", "?" => "-",
        "?" => "A", "A" => "A", "A" => "A", "A" => "A", "À" => "A", "Ã" => "A", "Á" => "A", "Æ" => "A", "Â" => "A", "Å" => "A", "?" => "A", "A" => "A", "?" => "A",
        "?" => "B", "?" => "B", "Þ" => "B",
        "C" => "C", "C" => "C", "Ç" => "C", "?" => "C", "?" => "C", "C" => "C", "C" => "C", "©" => "C", "?" => "C",
        "?" => "D", "D" => "D", "Ð" => "D", "?" => "D", "Ð" => "D",
        "È" => "E", "E" => "E", "É" => "E", "Ë" => "E", "Ê" => "E", "?" => "E", "E" => "E", "E" => "E", "E" => "E", "E" => "E", "?" => "E", "?" => "E", "?" => "E",
        "?" => "F", "ƒ" => "F",
        "G" => "G", "G" => "G", "G" => "G", "G" => "G", "?" => "G", "?" => "G", "?" => "G",
        "?" => "H", "H" => "H", "?" => "H", "H" => "H", "?" => "H",
        "I" => "I", "Ï" => "I", "Î" => "I", "Í" => "I", "Ì" => "I", "I" => "I", "I" => "I", "I" => "I", "?" => "I", "I" => "I", "I" => "I", "?" => "I", "?" => "I", "I" => "I", "?" => "I",
        "?" => "J", "J" => "J",
        "?" => "K", "?" => "K", "K" => "K", "?" => "K", "?" => "K",
        "L" => "L", "?" => "L", "?" => "L", "L" => "L", "L" => "L", "L" => "L", "?" => "L",
        "?" => "M", "?" => "M", "?" => "M",
        "Ñ" => "N", "N" => "N", "?" => "N", "N" => "N", "?" => "N", "?" => "N", "?" => "N", "?" => "N", "N" => "N",
        "Ø" => "O", "Ó" => "O", "Ò" => "O", "Ô" => "O", "Õ" => "O", "?" => "O", "O" => "O", "O" => "O", "O" => "O", "?" => "O", "O" => "O", "O" => "O",
        "?" => "P", "?" => "P", "?" => "P",
        "?" => "Q",
        "R" => "R", "R" => "R", "R" => "R", "?" => "R", "?" => "R", "®" => "R",
        "S" => "S", "S" => "S", "?" => "S", "Š" => "S", "?" => "S", "S" => "S", "?" => "S",
        "?" => "T", "?" => "T", "?" => "T", "T" => "T", "?" => "T", "T" => "T", "T" => "T",
        "Ù" => "U", "Û" => "U", "Ú" => "U", "U" => "U", "?" => "U", "U" => "U", "U" => "U", "U" => "U", "U" => "U", "U" => "U", "U" => "U", "U" => "U", "U" => "U", "U" => "U", "U" => "U", "U" => "U",
        "?" => "V", "?" => "V",
        "Ý" => "Y", "?" => "Y", "Y" => "Y", "Ÿ" => "Y",
        "Z" => "Z", "Ž" => "Z", "Z" => "Z", "?" => "Z", "?" => "Z",
        "?" => "a", "a" => "a", "a" => "a", "a" => "a", "à" => "a", "ã" => "a", "á" => "a", "æ" => "a", "â" => "a", "å" => "a", "?" => "a", "a" => "a", "?" => "a",
        "?" => "b", "?" => "b", "þ" => "b",
        "c" => "c", "c" => "c", "ç" => "c", "?" => "c", "?" => "c", "c" => "c", "c" => "c", "©" => "c", "?" => "c",
        "?" => "ch", "?" => "ch",
        "?" => "d", "d" => "d", "d" => "d", "?" => "d", "ð" => "d",
        "è" => "e", "e" => "e", "é" => "e", "ë" => "e", "ê" => "e", "?" => "e", "e" => "e", "e" => "e", "e" => "e", "e" => "e", "?" => "e", "?" => "e", "?" => "e",
        "?" => "f", "ƒ" => "f",
        "g" => "g", "g" => "g", "g" => "g", "g" => "g", "?" => "g", "?" => "g", "?" => "g",
        "?" => "h", "h" => "h", "?" => "h", "h" => "h", "?" => "h",
        "i" => "i", "ï" => "i", "î" => "i", "í" => "i", "ì" => "i", "i" => "i", "i" => "i", "i" => "i", "?" => "i", "i" => "i", "i" => "i", "?" => "i", "?" => "i", "i" => "i", "?" => "i",
        "?" => "j", "?" => "j", "J" => "j", "j" => "j",
        "?" => "k", "?" => "k", "k" => "k", "?" => "k", "?" => "k",
        "l" => "l", "?" => "l", "?" => "l", "l" => "l", "l" => "l", "l" => "l", "?" => "l",
        "?" => "m", "?" => "m", "?" => "m",
        "ñ" => "n", "n" => "n", "?" => "n", "n" => "n", "?" => "n", "?" => "n", "?" => "n", "?" => "n", "n" => "n",
        "ø" => "o", "ó" => "o", "ò" => "o", "ô" => "o", "õ" => "o", "?" => "o", "o" => "o", "o" => "o", "o" => "o", "?" => "o", "o" => "o", "o" => "o",
        "?" => "p", "?" => "p", "?" => "p",
        "?" => "q",
        "r" => "r", "r" => "r", "r" => "r", "?" => "r", "?" => "r", "®" => "r",
        "s" => "s", "s" => "s", "?" => "s", "š" => "s", "?" => "s", "s" => "s", "?" => "s",
        "?" => "t", "?" => "t", "?" => "t", "t" => "t", "?" => "t", "t" => "t", "t" => "t",
        "ù" => "u", "û" => "u", "ú" => "u", "u" => "u", "?" => "u", "u" => "u", "u" => "u", "u" => "u", "u" => "u", "u" => "u", "u" => "u", "u" => "u", "u" => "u", "u" => "u", "u" => "u", "u" => "u",
        "?" => "v", "?" => "v",
        "ý" => "y", "?" => "y", "y" => "y", "ÿ" => "y",
        "z" => "z", "ž" => "z", "z" => "z", "?" => "z", "?" => "z", "?" => "z",
        "™" => "tm",
        "@" => "at",
        "Ä" => "ae", "?" => "ae", "ä" => "ae", "æ" => "ae", "?" => "ae",
        "?" => "ij", "?" => "ij",
        "?" => "ja", "?" => "ja",
        "?" => "je", "?" => "je",
        "?" => "jo", "?" => "jo",
        "?" => "ju", "?" => "ju",
        "œ" => "oe", "Œ" => "oe", "ö" => "oe", "Ö" => "oe",
        "?" => "sch", "?" => "sch",
        "?" => "sh", "?" => "sh",
        "ß" => "ss",
        "Ü" => "ue",
        "?" => "zh", "?" => "zh",
    );
    return strtr($subject, $char_map);
}

$string = "Hí th?®ë, ?ßt å test!";
echo replace_spec_char($string);

Hí th?®ë, ?ßt å test! => Hi there, jusst a test!

This does not mix up upper and lower case chars except for longer chars (eg: ss,ch, sch) , added @ ® ©

Also if you want to build regex matching regardless to special chars :

rss => '[rrrRrR?R??](?:[s??Sšs?s?s][s??Sšs?s?s]|[ß])'

A vala implementation of this : https://code.launchpad.net/~jeremy-munsch/synapse-project/ascii-smart/+merge/277477

Here is the base list you could work with, with regex replacing (in sublime text) or small script you can build anything from this array to fill your needs.

"-" => "????",
"A" => "?AAAÀÃÁÆÂÅ?A?",
"B" => "??Þ",
"C" => "CCÇ??CC©?",
"D" => "?DÐ?Ð",
"E" => "ÈEÉËÊ?EEEE???",
"F" => "?ƒ",
"G" => "GGGG???",
"H" => "?H?H?",
"I" => "IÏÎÍÌIII?II??I?",
"J" => "?J",
"K" => "??K??",
"L" => "L??LLL?",
"M" => "???",
"N" => "ÑN?N????N",
"O" => "ØÓÒÔÕ?OOO?OO",
"P" => "???",
"Q" => "?",
"R" => "RRR??®",
"S" => "SS?Š?S?",
"T" => "???T?TT",
"U" => "ÙÛÚU?UUUUUUUUUUU",
"V" => "??",
"Y" => "Ý?YŸ",
"Z" => "ZŽZ??",
"a" => "?aaaàãáæâå?a?",
"b" => "??þ",
"c" => "ccç??cc©?",
"ch" => "?",
"d" => "?dd?ð",
"e" => "èeéëê?eeee???",
"f" => "?ƒ",
"g" => "gggg???",
"h" => "?h?h?",
"i" => "iïîíìiii?ii??i?",
"j" => "?j",
"k" => "??k??",
"l" => "l??lll?",
"m" => "???",
"n" => "ñn?n????n",
"o" => "øóòôõ?ooo?oo",
"p" => "???",
"q" => "?",
"r" => "rrr??®",
"s" => "ss?š?s?",
"t" => "???t?tt",
"u" => "ùûúu?uuuuuuuuuuu",
"v" => "??",
"y" => "ý?yÿ",
"z" => "zžz???",
"tm" => "™",
"at" => "@",
"ae" => "Ä?äæ?",
"ch" => "??",
"ij" => "??",
"j" => "??Jj",
"ja" => "??",
"je" => "??",
"jo" => "??",
"ju" => "??",
"oe" => "œŒöÖ",
"sch" => "??",
"sh" => "??",
"ss" => "ß",
"tm" => "™",
"ue" => "Ü",
"zh" => "??"

I know, that question has been asked a long long time ago...

I was looking for a short and elegant solution, but couldn't find satisfaction for two reasons:

First, most of the existing solutions replace a list of characters by a list of other characters. Unfortunately, it require to use a specific encoding for the php script file itself which might be unwanted.

Second, using iconv seems to be a good way, but it's not enough as the result of a converted character could be one or two characters, or a Fatal Exception.

So I wrote that small function which does the job :

function replaceAccent($string, $replacement = '_')
{
    $alnumPattern = '/^[a-zA-Z0-9 ]+$/';

    if (preg_match($alnumPattern, $string)) {
        return $string;
    }

    $ret = array_map(
        function ($chr) use ($alnumPattern, $replacement) {
            if (preg_match($alnumPattern, $chr)) {
                return $chr;
            } else {
                $chr = @iconv('ISO-8859-1', 'ASCII//TRANSLIT', $chr);
                if (strlen($chr) == 1) {
                    return $chr;
                } elseif (strlen($chr) > 1) {
                    $ret = '';
                    foreach (str_split($chr) as $char2) {
                        if (preg_match($alnumPattern, $char2)) {
                            $ret .= $char2;
                        }
                    }
                    return $ret;
                } else {
                    // replace whatever iconv fail to convert by something else
                    return $replacement;
                }
            }
        },
        str_split($string)
    );

    return implode($ret);
}

Examples related to php

I am receiving warning in Facebook Application using PHP SDK Pass PDO prepared statement to variables Parse error: syntax error, unexpected [ Preg_match backtrack error Removing "http://" from a string How do I hide the PHP explode delimiter from submitted form results? Problems with installation of Google App Engine SDK for php in OS X Laravel 4 with Sentry 2 add user to a group on Registration php & mysql query not echoing in html with tags? How do I show a message in the foreach loop?

Examples related to string

How to split a string in two and store it in a field String method cannot be found in a main class method Kotlin - How to correctly concatenate a String Replacing a character from a certain index Remove quotes from String in Python Detect whether a Python string is a number or a letter How does String substring work in Swift How does String.Index work in Swift swift 3.0 Data to String? How to parse JSON string in Typescript

Examples related to preg-replace

Replace deprecated preg_replace /e with preg_replace_callback Replace preg_replace() e modifier with preg_replace_callback PHP replacing special characters like à->a, è->e PHP preg replace only allow numbers PHP remove special character from string Replacing accented characters php How to replace <span style="font-weight: bold;">foo</span> by <strong>foo</strong> using PHP and regex? Remove multiple whitespaces

Examples related to non-ascii-characters

How do I remove all non-ASCII characters with regex and Notepad++? Remove non-ascii character in string SyntaxError of Non-ASCII character Find non-ASCII characters in varchar columns using SQL Server Replacing accented characters php (grep) Regex to match non-ASCII characters?