I want to have a function to create slugs from Unicode strings, e.g. gen_slug('Andrés Cortez')
should return andres-cortez
. How should I do that?
This question is related to
php
internationalization
slug
Since gTLDs and IDNs are becoming more and more used I cannot see why URL shouldn't contain Andrés.
Just rawurlencode $URL you want instead. Most browsers show UTF-8 characters in URLs (not some ancient IE6 maybe) and bit.ly / goo.gl can be used to make it short in cases like Russian and Arabic if need may be for ad purposes or just write them in ads like user would write them on browser URL.
Only difference is spaces " " it might be good idea to replace them with "-" and "/" if you don't want to allow those.
<?php
function slugify($url)
{
$url = trim($url);
$url = str_replace(" ","-",$url);
$url = str_replace("/","-slash-",$url);
$url = rawurlencode($url);
}
?>
Url as encoded http://www.hurtta.com/RU/%D0%9F%D1%80%D0%BE%D0%B4%D1%83%D0%BA%D1%82%D1%8B/
Url as written http://www.hurtta.com/RU/????????/
If you have intl extension installed, you can use Transliterator::transliterate function to create a slug easily.
<?php
$string = 'Namnet på bildtävlingen';
$slug = \Transliterator::createFromRules(
':: Any-Latin;'
. ':: NFD;'
. ':: [:Nonspacing Mark:] Remove;'
. ':: NFC;'
. ':: [:Punctuation:] Remove;'
. ':: Lower();'
. '[:Separator:] > \'-\''
)
->transliterate( $string );
echo $slug; // namnet-pa-bildtavlingen
?>
There's a good solution here that deals with special characters as well.
Texto Fantástico => texto-fantastico
function slugify( $string, $separator = '-' ) {
$accents_regex = '~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i';
$special_cases = array( '&' => 'and', "'" => '');
$string = mb_strtolower( trim( $string ), 'UTF-8' );
$string = str_replace( array_keys($special_cases), array_values( $special_cases), $string );
$string = preg_replace( $accents_regex, '$1', htmlentities( $string, ENT_QUOTES, 'UTF-8' ) );
$string = preg_replace("/[^a-z0-9]/u", "$separator", $string);
$string = preg_replace("/[$separator]+/u", "$separator", $string);
return $string;
}
Author: Natxet
I didn't know which one to use so I made a quick bench on phptester.net
<?php
// First test
// https://stackoverflow.com/a/42740874/10232729
function slugify(STRING $string, STRING $separator = '-'){
$accents_regex = '~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i';
$special_cases = [ '&' => 'and', "'" => ''];
$string = mb_strtolower( trim( $string ), 'UTF-8' );
$string = str_replace( array_keys($special_cases), array_values( $special_cases), $string );
$string = preg_replace( $accents_regex, '$1', htmlentities( $string, ENT_QUOTES, 'UTF-8' ) );
$string = preg_replace('/[^a-z0-9]/u', $separator, $string);
return preg_replace('/['.$separator.']+/u', $separator, $string);
}
// Second test
// https://stackoverflow.com/a/13331948/10232729
function slug(STRING $string, STRING $separator = '-'){
$string = transliterator_transliterate('Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC; [:Punctuation:] Remove; Lower();', $string);
return str_replace(' ', $separator, $string);;
}
// Third test - My choice
// https://stackoverflow.com/a/38066136/10232729
function slugbis($text){
$replace = [
'<' => '', '>' => '', '-' => ' ', '&' => '',
'"' => '', 'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä'=> 'Ae',
'Ä' => 'A', 'Å' => 'A', 'A' => 'A', 'A' => 'A', 'A' => 'A', 'Æ' => 'Ae',
'Ç' => 'C', 'C' => 'C', 'C' => 'C', 'C' => 'C', 'C' => 'C', 'D' => 'D', 'Ð' => 'D',
'Ð' => 'D', 'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E', 'E' => 'E',
'E' => 'E', 'E' => 'E', 'E' => 'E', 'E' => 'E', 'G' => 'G', 'G' => 'G',
'G' => 'G', 'G' => 'G', 'H' => 'H', 'H' => 'H', 'Ì' => 'I', 'Í' => 'I',
'Î' => 'I', 'Ï' => 'I', 'I' => 'I', 'I' => 'I', 'I' => 'I', 'I' => 'I',
'I' => 'I', '?' => 'IJ', 'J' => 'J', 'K' => 'K', 'L' => 'K', 'L' => 'K',
'L' => 'K', 'L' => 'K', '?' => 'K', 'Ñ' => 'N', 'N' => 'N', 'N' => 'N',
'N' => 'N', '?' => 'N', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O',
'Ö' => 'Oe', 'Ö' => 'Oe', 'Ø' => 'O', 'O' => 'O', 'O' => 'O', 'O' => 'O',
'Œ' => 'OE', 'R' => 'R', 'R' => 'R', 'R' => 'R', 'S' => 'S', 'Š' => 'S',
'S' => 'S', 'S' => 'S', '?' => 'S', 'T' => 'T', 'T' => 'T', 'T' => 'T',
'?' => 'T', 'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'Ue', 'U' => 'U',
'Ü' => 'Ue', 'U' => 'U', 'U' => 'U', 'U' => 'U', 'U' => 'U', 'U' => 'U',
'W' => 'W', 'Ý' => 'Y', 'Y' => 'Y', 'Ÿ' => 'Y', 'Z' => 'Z', 'Ž' => 'Z',
'Z' => 'Z', 'Þ' => 'T', 'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a',
'ä' => 'ae', 'ä' => 'ae', 'å' => 'a', 'a' => 'a', 'a' => 'a', 'a' => 'a',
'æ' => 'ae', 'ç' => 'c', 'c' => 'c', 'c' => 'c', 'c' => 'c', 'c' => 'c',
'd' => 'd', 'd' => 'd', 'ð' => 'd', 'è' => 'e', 'é' => 'e', 'ê' => 'e',
'ë' => 'e', 'e' => 'e', 'e' => 'e', 'e' => 'e', 'e' => 'e', 'e' => 'e',
'ƒ' => 'f', 'g' => 'g', 'g' => 'g', 'g' => 'g', 'g' => 'g', 'h' => 'h',
'h' => 'h', 'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'i' => 'i',
'i' => 'i', 'i' => 'i', 'i' => 'i', 'i' => 'i', '?' => 'ij', 'j' => 'j',
'k' => 'k', '?' => 'k', 'l' => 'l', 'l' => 'l', 'l' => 'l', 'l' => 'l',
'?' => 'l', 'ñ' => 'n', 'n' => 'n', 'n' => 'n', 'n' => 'n', '?' => 'n',
'?' => 'n', 'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'oe',
'ö' => 'oe', 'ø' => 'o', 'o' => 'o', 'o' => 'o', 'o' => 'o', 'œ' => 'oe',
'r' => 'r', 'r' => 'r', 'r' => 'r', 'š' => 's', 'ù' => 'u', 'ú' => 'u',
'û' => 'u', 'ü' => 'ue', 'u' => 'u', 'ü' => 'ue', 'u' => 'u', 'u' => 'u',
'u' => 'u', 'u' => 'u', 'u' => 'u', 'w' => 'w', 'ý' => 'y', 'ÿ' => 'y',
'y' => 'y', 'ž' => 'z', 'z' => 'z', 'z' => 'z', 'þ' => 't', 'ß' => 'ss',
'?' => 'ss', '??' => 'iy', '?' => 'A', '?' => 'B', '?' => 'V', '?' => 'G',
'?' => 'D', '?' => 'E', '?' => 'YO', '?' => 'ZH', '?' => 'Z', '?' => 'I',
'?' => 'Y', '?' => 'K', '?' => 'L', '?' => 'M', '?' => 'N', '?' => 'O',
'?' => 'P', '?' => 'R', '?' => 'S', '?' => 'T', '?' => 'U', '?' => 'F',
'?' => 'H', '?' => 'C', '?' => 'CH', '?' => 'SH', '?' => 'SCH', '?' => '',
'?' => 'Y', '?' => '', '?' => 'E', '?' => 'YU', '?' => 'YA', '?' => 'a',
'?' => 'b', '?' => 'v', '?' => 'g', '?' => 'd', '?' => 'e', '?' => 'yo',
'?' => 'zh', '?' => 'z', '?' => 'i', '?' => 'y', '?' => 'k', '?' => 'l',
'?' => 'm', '?' => 'n', '?' => 'o', '?' => 'p', '?' => 'r', '?' => 's',
'?' => 't', '?' => 'u', '?' => 'f', '?' => 'h', '?' => 'c', '?' => 'ch',
'?' => 'sh', '?' => 'sch', '?' => '', '?' => 'y', '?' => '', '?' => 'e',
'?' => 'yu', '?' => 'ya'
];
// make a human readable string
$text = strtr($text, $replace);
// replace non letter or digits by -
$text = preg_replace('~[^\pL\d.]+~u', '-', $text);
// trim
$text = trim($text, '-');
// remove unwanted characters
$text = preg_replace('~[^-\w.]+~', '', $text);
return strtolower($text);
}
// Fourth test
// https://stackoverflow.com/a/2955521/10232729
function slugagain($string){
$table = [
'Š'=>'S', 'š'=>'s', 'Ð'=>'Dj', 'd'=>'dj', 'Ž'=>'Z', 'ž'=>'z', 'C'=>'C', 'c'=>'c', 'C'=>'C', 'c'=>'c',
'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss',
'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e',
'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o',
'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'ý'=>'y', 'þ'=>'b',
'ÿ'=>'y', 'R'=>'R', 'r'=>'r', ' '=>'-'
];
return strtr($string, $table);
}
// Fifth test
// https://stackoverflow.com/a/27396804/10232729
function slugifybis($url){
$url = trim($url);
$url = str_replace(' ', '-', $url);
$url = str_replace('/', '-slash-', $url);
return rawurlencode($url);
}
// Sixth and last test
// https://stackoverflow.com/a/39442034/10232729
setlocale( LC_ALL, "en_US.UTF8" );
function slugifyagain($string){
$string = iconv('utf-8', 'us-ascii//translit//ignore', $string); // transliterate
$string = str_replace("'", '', $string);
$string = preg_replace('~[^\pL\d]+~u', '-', $string); // replace non letter or non digits by "-"
$string = preg_replace('~[^-\w]+~', '', $string); // remove unwanted characters
$string = preg_replace('~-+~', '-', $string); // remove duplicate "-"
$string = trim($string, '-'); // trim "-"
$string = trim($string); // trim
$string = mb_strtolower($string, 'utf-8'); // lowercase
return urlencode($string); // safe;
};
$string = $newString = "¿ Àñdréß l'affreux garçon & nøël en forêt !";
$max = 10000;
echo '<pre>';
echo 'Beginning :';
echo '<br />';
echo '<br />';
echo '> Slugging '.$max.' iterations of following :';
echo '<br />';
echo '>> ' . $string;
echo '<br />';
echo '<br />';
echo 'Output results :';
echo '<br />';
echo '<br />';
$start = microtime(true);
for($i = 0 ; $i < $max ; $i++){
$newString = slugify($string);
}
$time = (microtime(true) - $start) * 1000;
echo '> First test passed in **' . round($time, 2) . 'ms**';
echo '<br />';
echo '>> Result : ' . $newString;
echo '<br />';
echo '<br />';
$start = microtime(true);
for($i = 0 ; $i < $max ; $i++){
$newString = slug($string);
}
$time = (microtime(true) - $start) * 1000;
echo '> Second test passed in **' . round($time, 2) . 'ms**';
echo '<br />';
echo '>> Result : ' . $newString;
echo '<br />';
echo '<br />';
$start = microtime(true);
for($i = 0 ; $i < $max ; $i++){
$newString = slugbis($string);
}
$time = (microtime(true) - $start) * 1000;
echo '> Third test passed in **' . round($time, 2) . 'ms**';
echo '<br />';
echo '>> Result : ' . $newString;
echo '<br />';
echo '<br />';
$start = microtime(true);
for($i = 0 ; $i < $max ; $i++){
$newString = slugagain($string);
}
$time = (microtime(true) - $start) * 1000;
echo '> Fourth test passed in **' . round($time, 2) . 'ms**';
echo '<br />';
echo '>> Result : ' . $newString;
echo '<br />';
echo '<br />';
$start = microtime(true);
for($i = 0 ; $i < $max ; $i++){
$newString = slugifybis($string);
}
$time = (microtime(true) - $start) * 1000;
echo '> Fifth test passed in **' . round($time, 2) . 'ms**';
echo '<br />';
echo '>> Result : ' . $newString;
echo '<br />';
echo '<br />';
$start = microtime(true);
for($i = 0 ; $i < $max ; $i++){
$newString = slugifyagain($string);
}
$time = (microtime(true) - $start) * 1000;
echo '> Sixth test passed in **' . round($time, 2) . 'ms**';
echo '<br />';
echo '>> Result : ' . $newString;
echo '</pre>';
Beginning :
Slugging 10000 iterations of following :
¿ Àñdréß l'affreux garçon & nøël en forêt !
Output results :
First test passed in 120.78ms
Result : -iquest-andresz-laffreux-arcon-and-noel-en-foret-
Second test passed in 3883.82ms
Result : -andreß-laffreux-garcon--nøel-en-foret-
Third test passed in 56.83ms
Result : andress-l-affreux-garcon-noel-en-foret
Fourth test passed in 18.93ms
Result : ¿-AndreSs-l'affreux-garcon-&-noel-en-foret-!
Fifth test passed in 6.45ms
Result : %C2%BF-%C3%80%C3%B1dr%C3%A9%C3%9F-l%27affreux-%C4%9Far%C3%A7on-%26-n%C3%B8%C3%ABl-en-for%C3%AAt-%21
Sixth test passed in 112.42ms
Result : andress-laffreux-garcon-n-el-en-foret
Further tests needed.
Edit : less iterations test
Beginning :
Slugging 100 iterations of following :
¿ Àñdréß l'affreux garçon & nøël en forêt !
Output results :
First test passed in 1.72ms
Result : -iquest-andresz-laffreux-arcon-and-noel-en-foret-
Second test passed in 48.59ms
Result : -andreß-laffreux-garcon--nøel-en-foret-
Third test passed in 0.91ms
Result : andress-l-affreux-garcon-noel-en-foret
Fourth test passed in 0.3ms
Result : ¿-AndreSs-l'affreux-garcon-&-noel-en-foret-!
Fifth test passed in 0.14ms
Result : %C2%BF-%C3%80%C3%B1dr%C3%A9%C3%9F-l%27affreux-%C4%9Far%C3%A7on-%26-n%C3%B8%C3%ABl-en-for%C3%AAt-%21
Sixth test passed in 1.4ms
Result : andress-laffreux-garcon-n-el-en-foret
This may be a way to do it too. Inspired from these links Experts-exchange and alinalexander
function slugifier($txt){
/* Get rid of accented characters */
$search = explode(",","ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,e,i,ø,u");
$replace = explode(",","c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,e,i,o,u");
$txt = str_replace($search, $replace, $txt);
/* Lowercase all the characters */
$txt = strtolower($txt);
/* Avoid whitespace at the beginning and the ending */
$txt = trim($txt);
/* Replace all the characters that are not in a-z or 0-9 by a hyphen */
$txt = preg_replace("/[^a-z0-9]/", "-", $txt);
/* Remove hyphen anywhere it's more than one */
$txt = preg_replace("/[\-]+/", '-', $txt);
return $txt;
}
You could have a look at Normalizer::normalize()
, see here. It just needs to load the intl module for PHP
For me this variant is perfect, also it change &
to and
. Here is code:
function dSlug($string) {
return strtolower(trim(preg_replace('~[^0-9a-z]+~i', '-', html_entity_decode(preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1',htmlentities(preg_replace('/[&]/', ' and ', $title), ENT_QUOTES, 'UTF-8')), ENT_QUOTES, 'UTF-8')), '-'));
}`
An updated version of @Imran Omar Bukhsh code (from the latest Wordpress (4.0) branch):
<?php
// Add methods to slugify taken from Wordpress:
// - https://github.com/WordPress/WordPress/blob/master/wp-includes/formatting.php
// - https://github.com/WordPress/WordPress/blob/master/wp-includes/functions.php
/**
* Set the mbstring internal encoding to a binary safe encoding when func_overload
* is enabled.
*
* When mbstring.func_overload is in use for multi-byte encodings, the results from
* strlen() and similar functions respect the utf8 characters, causing binary data
* to return incorrect lengths.
*
* This function overrides the mbstring encoding to a binary-safe encoding, and
* resets it to the users expected encoding afterwards through the
* `reset_mbstring_encoding` function.
*
* It is safe to recursively call this function, however each
* `mbstring_binary_safe_encoding()` call must be followed up with an equal number
* of `reset_mbstring_encoding()` calls.
*
* @since 3.7.0
*
* @see reset_mbstring_encoding()
*
* @param bool $reset Optional. Whether to reset the encoding back to a previously-set encoding.
* Default false.
*/
function mbstring_binary_safe_encoding( $reset = false ) {
static $encodings = array();
static $overloaded = null;
if ( is_null( $overloaded ) )
$overloaded = function_exists( 'mb_internal_encoding' ) && ( ini_get( 'mbstring.func_overload' ) & 2 );
if ( false === $overloaded )
return;
if ( ! $reset ) {
$encoding = mb_internal_encoding();
array_push( $encodings, $encoding );
mb_internal_encoding( 'ISO-8859-1' );
}
if ( $reset && $encodings ) {
$encoding = array_pop( $encodings );
mb_internal_encoding( $encoding );
}
}
/**
* Reset the mbstring internal encoding to a users previously set encoding.
*
* @see mbstring_binary_safe_encoding()
*
* @since 3.7.0
*/
function reset_mbstring_encoding() {
mbstring_binary_safe_encoding( true );
}
/**
* Checks to see if a string is utf8 encoded.
*
* NOTE: This function checks for 5-Byte sequences, UTF8
* has Bytes Sequences with a maximum length of 4.
*
* @author bmorel at ssi dot fr (modified)
* @since 1.2.1
*
* @param string $str The string to be checked
* @return bool True if $str fits a UTF-8 model, false otherwise.
*/
function seems_utf8($str) {
mbstring_binary_safe_encoding();
$length = strlen($str);
reset_mbstring_encoding();
for ($i=0; $i < $length; $i++) {
$c = ord($str[$i]);
if ($c < 0x80) $n = 0; # 0bbbbbbb
elseif (($c & 0xE0) == 0xC0) $n=1; # 110bbbbb
elseif (($c & 0xF0) == 0xE0) $n=2; # 1110bbbb
elseif (($c & 0xF8) == 0xF0) $n=3; # 11110bbb
elseif (($c & 0xFC) == 0xF8) $n=4; # 111110bb
elseif (($c & 0xFE) == 0xFC) $n=5; # 1111110b
else return false; # Does not match any model
for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
return false;
}
}
return true;
}
/**
* Encode the Unicode values to be used in the URI.
*
* @since 1.5.0
*
* @param string $utf8_string
* @param int $length Max length of the string
* @return string String with Unicode encoded for URI.
*/
function utf8_uri_encode( $utf8_string, $length = 0 ) {
$unicode = '';
$values = array();
$num_octets = 1;
$unicode_length = 0;
mbstring_binary_safe_encoding();
$string_length = strlen( $utf8_string );
reset_mbstring_encoding();
for ($i = 0; $i < $string_length; $i++ ) {
$value = ord( $utf8_string[ $i ] );
if ( $value < 128 ) {
if ( $length && ( $unicode_length >= $length ) )
break;
$unicode .= chr($value);
$unicode_length++;
} else {
if ( count( $values ) == 0 ) $num_octets = ( $value < 224 ) ? 2 : 3;
$values[] = $value;
if ( $length && ( $unicode_length + ($num_octets * 3) ) > $length )
break;
if ( count( $values ) == $num_octets ) {
if ($num_octets == 3) {
$unicode .= '%' . dechex($values[0]) . '%' . dechex($values[1]) . '%' . dechex($values[2]);
$unicode_length += 9;
} else {
$unicode .= '%' . dechex($values[0]) . '%' . dechex($values[1]);
$unicode_length += 6;
}
$values = array();
$num_octets = 1;
}
}
}
return $unicode;
}
/**
* Sanitizes a title, replacing whitespace and a few other characters with dashes.
*
* Limits the output to alphanumeric characters, underscore (_) and dash (-).
* Whitespace becomes a dash.
*
* @since 1.2.0
*
* @param string $title The title to be sanitized.
* @param string $raw_title Optional. Not used.
* @param string $context Optional. The operation for which the string is sanitized.
* @return string The sanitized title.
*/
function sanitize_title_with_dashes( $title, $raw_title = '', $context = 'display' ) {
$title = strip_tags($title);
// Preserve escaped octets.
$title = preg_replace('|%([a-fA-F0-9][a-fA-F0-9])|', '---$1---', $title);
// Remove percent signs that are not part of an octet.
$title = str_replace('%', '', $title);
// Restore octets.
$title = preg_replace('|---([a-fA-F0-9][a-fA-F0-9])---|', '%$1', $title);
if (seems_utf8($title)) {
if (function_exists('mb_strtolower')) {
$title = mb_strtolower($title, 'UTF-8');
}
$title = utf8_uri_encode($title, 200);
}
$title = strtolower($title);
$title = preg_replace('/&.+?;/', '', $title); // kill entities
$title = str_replace('.', '-', $title);
if ( 'save' == $context ) {
// Convert nbsp, ndash and mdash to hyphens
$title = str_replace( array( '%c2%a0', '%e2%80%93', '%e2%80%94' ), '-', $title );
// Strip these characters entirely
$title = str_replace( array(
// iexcl and iquest
'%c2%a1', '%c2%bf',
// angle quotes
'%c2%ab', '%c2%bb', '%e2%80%b9', '%e2%80%ba',
// curly quotes
'%e2%80%98', '%e2%80%99', '%e2%80%9c', '%e2%80%9d',
'%e2%80%9a', '%e2%80%9b', '%e2%80%9e', '%e2%80%9f',
// copy, reg, deg, hellip and trade
'%c2%a9', '%c2%ae', '%c2%b0', '%e2%80%a6', '%e2%84%a2',
// acute accents
'%c2%b4', '%cb%8a', '%cc%81', '%cd%81',
// grave accent, macron, caron
'%cc%80', '%cc%84', '%cc%8c',
), '', $title );
// Convert times to x
$title = str_replace( '%c3%97', 'x', $title );
}
$title = preg_replace('/[^%a-z0-9 _-]/', '', $title);
$title = preg_replace('/\s+/', '-', $title);
$title = preg_replace('|-+|', '-', $title);
$title = trim($title, '-');
return $title;
}
$title = '#PFW Alexander McQueen Spring/Summer 2015';
echo "title -> slug: \n". $title ." -> ". sanitize_title_with_dashes($title);
echo "\n\n";
$title = '«GQ»: Elyas M\'Barek gehört zu Männern des Jahres';
echo "title -> slug: \n". $title ." -> ". sanitize_title_with_dashes($title);
View online example.
The most elegant way I think is using a Behat\Transliterator\Transliterator.
I need to extends this class by your class because it is an Abstract, some like this:
<?php
use Behat\Transliterator\Transliterator;
class Urlizer extends Transliterator
{
}
And then, just use it:
$text = "Master Ápiu";
$urlizer = new Urlizer();
$slug = $urlizer->transliterate($slug, "-");
echo $slug; // master-apiu
Of course you should put this things in your composer as well.
composer require behat/transliterator
More info here https://github.com/Behat/Transliterator
Here is an other one, for example " Title with strange characters ééé A X Z" becomes "title-with-strange-characters-eee-a-x-z".
/**
* Function used to create a slug associated to an "ugly" string.
*
* @param string $string the string to transform.
*
* @return string the resulting slug.
*/
public static function createSlug($string) {
$table = array(
'Š'=>'S', 'š'=>'s', 'Ð'=>'Dj', 'd'=>'dj', 'Ž'=>'Z', 'ž'=>'z', 'C'=>'C', 'c'=>'c', 'C'=>'C', 'c'=>'c',
'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss',
'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e',
'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o',
'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'ý'=>'y', 'þ'=>'b',
'ÿ'=>'y', 'R'=>'R', 'r'=>'r', '/' => '-', ' ' => '-'
);
// -- Remove duplicated spaces
$stripped = preg_replace(array('/\s{2,}/', '/[\t\n]/'), ' ', $string);
// -- Returns the slug
return strtolower(strtr($string, $table));
}
It is always a good idea to use existing solutions that are being supported by a lot of high-level developers. The most popular one is https://github.com/cocur/slugify. First of all, it supports more than one language, and it is being updated.
If you do not want to use the whole package, you can copy the part that you need.
if your slug contain only A-Za-z0-9- then it is ok for you
function sanitize_slug($text)
{
$text = preg_replace('/[^A-Za-z0-9-]+/', '-', $text);
$text = trim($text, '-');
$text = preg_replace('~-+~', '-', $text);
return $text;
}
Don't use preg_replace for this. There's a php function built just for the task: strtr() http://php.net/manual/en/function.strtr.php
Taken from the comments in the above link (and I tested it myself; it works:
function normalize ($string) {
$table = array(
'Š'=>'S', 'š'=>'s', 'Ð'=>'Dj', 'd'=>'dj', 'Ž'=>'Z', 'ž'=>'z', 'C'=>'C', 'c'=>'c', 'C'=>'C', 'c'=>'c',
'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss',
'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e',
'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o',
'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'ý'=>'y', 'þ'=>'b',
'ÿ'=>'y', 'R'=>'R', 'r'=>'r',
);
return strtr($string, $table);
}
public static function slugify ($text) {
$replace = [
'<' => '', '>' => '', ''' => '', '&' => '',
'"' => '', 'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä'=> 'Ae',
'Ä' => 'A', 'Å' => 'A', 'A' => 'A', 'A' => 'A', 'A' => 'A', 'Æ' => 'Ae',
'Ç' => 'C', 'C' => 'C', 'C' => 'C', 'C' => 'C', 'C' => 'C', 'D' => 'D', 'Ð' => 'D',
'Ð' => 'D', 'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E', 'E' => 'E',
'E' => 'E', 'E' => 'E', 'E' => 'E', 'E' => 'E', 'G' => 'G', 'G' => 'G',
'G' => 'G', 'G' => 'G', 'H' => 'H', 'H' => 'H', 'Ì' => 'I', 'Í' => 'I',
'Î' => 'I', 'Ï' => 'I', 'I' => 'I', 'I' => 'I', 'I' => 'I', 'I' => 'I',
'I' => 'I', '?' => 'IJ', 'J' => 'J', 'K' => 'K', 'L' => 'K', 'L' => 'K',
'L' => 'K', 'L' => 'K', '?' => 'K', 'Ñ' => 'N', 'N' => 'N', 'N' => 'N',
'N' => 'N', '?' => 'N', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O',
'Ö' => 'Oe', 'Ö' => 'Oe', 'Ø' => 'O', 'O' => 'O', 'O' => 'O', 'O' => 'O',
'Œ' => 'OE', 'R' => 'R', 'R' => 'R', 'R' => 'R', 'S' => 'S', 'Š' => 'S',
'S' => 'S', 'S' => 'S', '?' => 'S', 'T' => 'T', 'T' => 'T', 'T' => 'T',
'?' => 'T', 'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'Ue', 'U' => 'U',
'Ü' => 'Ue', 'U' => 'U', 'U' => 'U', 'U' => 'U', 'U' => 'U', 'U' => 'U',
'W' => 'W', 'Ý' => 'Y', 'Y' => 'Y', 'Ÿ' => 'Y', 'Z' => 'Z', 'Ž' => 'Z',
'Z' => 'Z', 'Þ' => 'T', 'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a',
'ä' => 'ae', 'ä' => 'ae', 'å' => 'a', 'a' => 'a', 'a' => 'a', 'a' => 'a',
'æ' => 'ae', 'ç' => 'c', 'c' => 'c', 'c' => 'c', 'c' => 'c', 'c' => 'c',
'd' => 'd', 'd' => 'd', 'ð' => 'd', 'è' => 'e', 'é' => 'e', 'ê' => 'e',
'ë' => 'e', 'e' => 'e', 'e' => 'e', 'e' => 'e', 'e' => 'e', 'e' => 'e',
'ƒ' => 'f', 'g' => 'g', 'g' => 'g', 'g' => 'g', 'g' => 'g', 'h' => 'h',
'h' => 'h', 'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'i' => 'i',
'i' => 'i', 'i' => 'i', 'i' => 'i', 'i' => 'i', '?' => 'ij', 'j' => 'j',
'k' => 'k', '?' => 'k', 'l' => 'l', 'l' => 'l', 'l' => 'l', 'l' => 'l',
'?' => 'l', 'ñ' => 'n', 'n' => 'n', 'n' => 'n', 'n' => 'n', '?' => 'n',
'?' => 'n', 'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'oe',
'ö' => 'oe', 'ø' => 'o', 'o' => 'o', 'o' => 'o', 'o' => 'o', 'œ' => 'oe',
'r' => 'r', 'r' => 'r', 'r' => 'r', 'š' => 's', 'ù' => 'u', 'ú' => 'u',
'û' => 'u', 'ü' => 'ue', 'u' => 'u', 'ü' => 'ue', 'u' => 'u', 'u' => 'u',
'u' => 'u', 'u' => 'u', 'u' => 'u', 'w' => 'w', 'ý' => 'y', 'ÿ' => 'y',
'y' => 'y', 'ž' => 'z', 'z' => 'z', 'z' => 'z', 'þ' => 't', 'ß' => 'ss',
'?' => 'ss', '??' => 'iy', '?' => 'A', '?' => 'B', '?' => 'V', '?' => 'G',
'?' => 'D', '?' => 'E', '?' => 'YO', '?' => 'ZH', '?' => 'Z', '?' => 'I',
'?' => 'Y', '?' => 'K', '?' => 'L', '?' => 'M', '?' => 'N', '?' => 'O',
'?' => 'P', '?' => 'R', '?' => 'S', '?' => 'T', '?' => 'U', '?' => 'F',
'?' => 'H', '?' => 'C', '?' => 'CH', '?' => 'SH', '?' => 'SCH', '?' => '',
'?' => 'Y', '?' => '', '?' => 'E', '?' => 'YU', '?' => 'YA', '?' => 'a',
'?' => 'b', '?' => 'v', '?' => 'g', '?' => 'd', '?' => 'e', '?' => 'yo',
'?' => 'zh', '?' => 'z', '?' => 'i', '?' => 'y', '?' => 'k', '?' => 'l',
'?' => 'm', '?' => 'n', '?' => 'o', '?' => 'p', '?' => 'r', '?' => 's',
'?' => 't', '?' => 'u', '?' => 'f', '?' => 'h', '?' => 'c', '?' => 'ch',
'?' => 'sh', '?' => 'sch', '?' => '', '?' => 'y', '?' => '', '?' => 'e',
'?' => 'yu', '?' => 'ya'
];
// make a human readable string
$text = strtr($text, $replace);
// replace non letter or digits by -
$text = preg_replace('~[^\\pL\d.]+~u', '-', $text);
// trim
$text = trim($text, '-');
// remove unwanted characters
$text = preg_replace('~[^-\w.]+~', '', $text);
$text = strtolower($text);
return $text;
}
What about using something that is already implemented in Core?
//Clean non UTF-8 characters
Mage::getHelper('core/string')->cleanString($str)
Or one of the core url/ url rewrite methods..
Note: I have taken this from wordpress and it works!!
Use it like this:
echo sanitize('testing this link');
Code
//taken from wordpress
function utf8_uri_encode( $utf8_string, $length = 0 ) {
$unicode = '';
$values = array();
$num_octets = 1;
$unicode_length = 0;
$string_length = strlen( $utf8_string );
for ($i = 0; $i < $string_length; $i++ ) {
$value = ord( $utf8_string[ $i ] );
if ( $value < 128 ) {
if ( $length && ( $unicode_length >= $length ) )
break;
$unicode .= chr($value);
$unicode_length++;
} else {
if ( count( $values ) == 0 ) $num_octets = ( $value < 224 ) ? 2 : 3;
$values[] = $value;
if ( $length && ( $unicode_length + ($num_octets * 3) ) > $length )
break;
if ( count( $values ) == $num_octets ) {
if ($num_octets == 3) {
$unicode .= '%' . dechex($values[0]) . '%' . dechex($values[1]) . '%' . dechex($values[2]);
$unicode_length += 9;
} else {
$unicode .= '%' . dechex($values[0]) . '%' . dechex($values[1]);
$unicode_length += 6;
}
$values = array();
$num_octets = 1;
}
}
}
return $unicode;
}
//taken from wordpress
function seems_utf8($str) {
$length = strlen($str);
for ($i=0; $i < $length; $i++) {
$c = ord($str[$i]);
if ($c < 0x80) $n = 0; # 0bbbbbbb
elseif (($c & 0xE0) == 0xC0) $n=1; # 110bbbbb
elseif (($c & 0xF0) == 0xE0) $n=2; # 1110bbbb
elseif (($c & 0xF8) == 0xF0) $n=3; # 11110bbb
elseif (($c & 0xFC) == 0xF8) $n=4; # 111110bb
elseif (($c & 0xFE) == 0xFC) $n=5; # 1111110b
else return false; # Does not match any model
for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
return false;
}
}
return true;
}
//function sanitize_title_with_dashes taken from wordpress
function sanitize($title) {
$title = strip_tags($title);
// Preserve escaped octets.
$title = preg_replace('|%([a-fA-F0-9][a-fA-F0-9])|', '---$1---', $title);
// Remove percent signs that are not part of an octet.
$title = str_replace('%', '', $title);
// Restore octets.
$title = preg_replace('|---([a-fA-F0-9][a-fA-F0-9])---|', '%$1', $title);
if (seems_utf8($title)) {
if (function_exists('mb_strtolower')) {
$title = mb_strtolower($title, 'UTF-8');
}
$title = utf8_uri_encode($title, 200);
}
$title = strtolower($title);
$title = preg_replace('/&.+?;/', '', $title); // kill entities
$title = str_replace('.', '-', $title);
$title = preg_replace('/[^%a-z0-9 _-]/', '', $title);
$title = preg_replace('/\s+/', '-', $title);
$title = preg_replace('|-+|', '-', $title);
$title = trim($title, '-');
return $title;
}
Since this answer is getting some attention, I'm adding some explanation.
The solution provided will essentially replace everything except A-Z, a-z, 0-9, & - (hyphen) with - (hyphen). So, it won't work properly with other unicode characters (which are valid characters for a URL slug/string). A common scenario is when the input string contains non-English characters.
Only use this solution if you're confident that the input string won't have unicode characters which you might want to be a part of output/slug.
Eg. "???? ?????" will become "----------" (all hyphens) instead of "????-?????" (valid URL slug).
How about...
$slug = strtolower(trim(preg_replace('/[^A-Za-z0-9-]+/', '-', $string)));
?
On my localhost everything was ok, but on server it helped me “set_locale” and “utf-8” at “mb_strtolower”.
<?
setlocale( LC_ALL, "en_US.UTF8" );
function slug( $str, $char = "-", $tf = "lowercase" )
{
$str = iconv( "utf-8", "us-ascii//translit//ignore", $str ); // transliterate
$str = str_replace( "'", "", $str ); // remove “'” generated by iconv
$str = preg_replace( "~[^a-z0-9]+~ui", $char, $str ); // replace unwanted by single “-”
$str = trim( $str, $char ); // trim “-”
if( $tf == "lowercase" ) $str = mb_strtolower( $str, "utf-8" ); // lowercase
elseif( $tf == "uppercase" ) $str = mb_strtoupper( $str, "utf-8" );
return $str;
}
?>
Test
$string = "--+ešcržýá091354--––—_-6íé???ß?ac … elnszzAC ELNS Z///+++||||..Z";
echo slug( $string );
echo slug( $string, "?", "uppercase" );
// ? escrzya091354-6iessac-elnszzac-elns-z-z
// ? ESCRZYA091354?6IESSAC?ELNSZZAC?ELNS?Z?Z
I am using:
function slugify($text)
{
$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
return strtolower(preg_replace('/[^A-Za-z0-9-]+/', '-', $text));
}
Only fallback is that Cyrillic characters will not be converted, and I am searching now for solution that is not long str_replace for every single Cyrillic character.
Since I've Seen a lot of methods here but I've found a simplest method for myself.Maybe it will help someone.
$slug = strtolower(preg_replace('/[^a-zA-Z0-9\-]/', '',preg_replace('/\s+/', '-', $string) ));
I wrote this based on Maerlyn's response. This function will work regardless of the character encoding on the page. It also won't turn single quotes in to dashes :)
function slugify ($string) {
$string = utf8_encode($string);
$string = iconv('UTF-8', 'ASCII//TRANSLIT', $string);
$string = preg_replace('/[^a-z0-9- ]/i', '', $string);
$string = str_replace(' ', '-', $string);
$string = trim($string, '-');
$string = strtolower($string);
if (empty($string)) {
return 'n-a';
}
return $string;
}
Source: Stackoverflow.com