I had a similar problem. I wanted to have pretty URLs and reached the conclusion that I have to allow only letters, digits, - and _ in URLs.
That is fine, but then I wrote a nice regex and realized that it treats all non-ASCII (Unicode) letters as non-letters in .NET, which broke my URLs. This appears to be a known pitfall with ASCII character ranges in the .NET regex engine. So I arrived at this solution:
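    // Requires: using System.Text.RegularExpressions;
    private static string GetTitleForUrlDisplay(string title)
    {
        if (string.IsNullOrEmpty(title))
        {
            return string.Empty;
        }

        // Pass every character outside [A-Za-z0-9_-] to CharacterTester, which keeps
        // Unicode letters and digits and turns everything else (spaces included) into '-'.
        string slug = Regex.Replace(title, @"[^A-Za-z0-9_-]", new MatchEvaluator(CharacterTester));

        // Trim leading and trailing hyphens, collapse runs of hyphens, then lowercase.
        return Regex.Replace(slug.Trim('-'), "-+", "-").ToLower();
    }

    /// <summary>
    /// Every character that does not match the pattern is passed to this method. This is useful
    /// for Unicode characters, because the ASCII character class in the pattern does not recognize
    /// them as letters. char.IsLetterOrDigit() handles them nicely: characters we approve are
    /// returned (lowercased) and everything else becomes "-".
    /// </summary>
    /// <param name="m">The match for a single character outside the allowed ASCII set.</param>
    /// <returns>The lowercased character if it is a letter or digit; otherwise "-".</returns>
    private static string CharacterTester(Match m)
    {
        string x = m.Value;
        if (x.Length > 0 && char.IsLetterOrDigit(x[0]))
        {
            return x.ToLower();
        }
        return "-";
    }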
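
For comparison, here is a minimal sketch of the same idea without the callback. It assumes that the Unicode category classes \p{L} (letters) and \p{Nd} (decimal digits) are acceptable for your URLs. The names SlugDemo and ToSlug are hypothetical and not part of the code above, and the use of ToLowerInvariant() instead of ToLower() is my own choice to avoid culture-specific casing.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SlugDemo
{
    // Hypothetical helper, shown only as an alternative sketch.
    public static string ToSlug(string title)
    {
        if (string.IsNullOrEmpty(title))
        {
            return string.Empty;
        }

        // \p{L} matches any Unicode letter, \p{Nd} any decimal digit.
        // Each run of other characters (except _ and -) becomes a single hyphen.
        string slug = Regex.Replace(title, @"[^\p{L}\p{Nd}_-]+", "-");

        // Collapse runs of hyphens, which can come from hyphens already in the
        // title sitting next to an inserted one.
        slug = Regex.Replace(slug, "-{2,}", "-");

        return slug.Trim('-').ToLowerInvariant();
    }

    public static void Main()
    {
        Console.WriteLine(ToSlug("Hello, World!"));      // hello-world
        Console.WriteLine(ToSlug("Crème brûlée & co"));  // crème-brûlée-co
    }
}
```

Both versions keep non-ASCII letters in the URL. Whether that is what you want depends on how your routing and your users' browsers deal with such characters.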