[regex] Regular expression for validating names and surnames?

Although this seems like a trivial question, I am quite sure it is not :)

I need to validate names and surnames of people from all over the world. Imagine a huge list of millions of names and surnames from which I need to remove, as well as possible, any cruft I identify. How can I do that with a regular expression? If it were only English names, I think this would cut it:

^[a-z '-]+$

However, I also need to support these cases:

  • other punctuation symbols as they might be used in different countries (no idea which, but maybe you do!)
  • different Unicode letter sets (accented letters, Greek, Japanese, Chinese, and so on)
  • no numbers or symbols or unnecessary punctuation or runes, etc.
  • titles, middle initials, suffixes are not part of this data
  • given names are already separated from surnames.
  • we are prepared to force ultra rare names to be simplified (there's a person named '@' in existence, but it doesn't make sense to allow that character everywhere. Use pragmatism and good sense.)
  • note that many countries have laws about names so there are standards to follow

Is there a standard way of validating these fields I can implement to make sure that our website users have a great experience and can actually use their name when registering in the list?

I am looking for something similar to the many "email address" regexes that you can find on Google.

This question is related to regex c# globalization

The answer is


It's very difficult to validate something like a name because of all the possible corner cases.

Corner Cases

Sanitize the input and let users enter whatever they want for a name. Deciding what is and is not a valid name is probably way outside the scope of whatever you're doing, given that the range of strange but legal names is nearly infinite.

If they want to call themselves Tricyclopltz^2-Glockenschpiel, that's their problem, not yours.
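As a rough illustration of "sanitize, don't judge", here is a minimal JavaScript sketch. The helper name cleanName and the exact clean-up steps are my own assumptions, not something prescribed above; it only tidies the text and never decides whether a name is "real".

```javascript
// Hypothetical helper: tidy a submitted name without judging its validity.
function cleanName(raw) {
    return raw
        .normalize('NFC')                                       // one canonical form for accented letters
        .replace(/[\u0000-\u0008\u000E-\u001F\u007F-\u009F]/g, '') // drop non-whitespace control characters
        .replace(/\s+/g, ' ')                                   // collapse tabs, newlines and runs of spaces
        .trim();                                                // remove leading and trailing whitespace
}

console.log(cleanName('  Tricyclopltz^2-Glockenschpiel  ')); // the name itself is left alone
```

Whatever comes out is stored as the user typed it, apart from whitespace and control characters. Escaping for SQL or HTML is a separate concern that belongs where the value is used.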


Steps:

  1. First, remove all accents.
  2. Then apply the regular expression to the result.

To strip the accents:

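using System.Globalization;
using System.Text;

// Decompose each character into base letter + combining marks (FormD),
// then keep every character that is not a non-spacing (combining) mark.
private static string RemoveAccents(string s)
{
    s = s.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder(s.Length);
    foreach (char c in s)
    {
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        {
            sb.Append(c);
        }
    }
    // Recompose whatever is left into the usual composed form.
    return sb.ToString().Normalize(NormalizationForm.FormC);
}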
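For comparison, the same two steps can be sketched in JavaScript. The pattern in step 2 is only a placeholder I made up for illustration, since no specific expression is given here; substitute your own.

```javascript
// Step 1: decompose (NFD) and drop non-spacing marks, mirroring the C# routine.
function removeAccents(s) {
    return s.normalize('NFD').replace(/\p{Mn}/gu, '');
}

// Step 2: apply a regular expression to the stripped text.
// ASSUMPTION: this ASCII pattern is a placeholder, not a recommended rule.
const PLACEHOLDER_PATTERN = /^[A-Za-z][A-Za-z '-]*$/;

function looksLikeName(s) {
    return PLACEHOLDER_PATTERN.test(removeAccents(s));
}

console.log(removeAccents('José Ángel'));
console.log(looksLikeName("Zoë O'Brien"));
console.log(looksLikeName('Søren'));
```

Note that letters such as ø or ß have no decomposition, so they survive step 1 and are still rejected by an ASCII-only pattern. Non-Latin scripts are not helped by this approach at all.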

I would just allow everything (except an empty string) and assume users know what their own names are.

There are 2 common cases:

  1. You care that the name is accurate and are validating against a real paper passport or other identity document, or against a credit card.
  2. You don't care that much and the user will be able to register as "Fred Smith" (or "Jane Doe") anyway.

In case (1), you can allow all characters because you're checking against a paper document.

In case (2), you may as well allow all characters because "123 456" is really no worse a pseudonym than "Abc Def".


I seem to have stumbled upon a very contentious subject here. However, sometimes it's nice to head dear Little Bobby Tables off at the pass and send little Robert to the headmaster's office along with his semicolons and SQL comment markers (--).

This regex in VB.NET accepts regular alphabetic characters and various accented European characters (in fact, \p{L} matches letters in any script). However, poor old James Mc'Tristan-Smythe the 3rd will have to enter his pedigree as Jim the Third.

<asp:RegularExpressionValidator ID="RegExValid1" Runat="server"
                    ErrorMessage="ERROR: Please enter a valid surname<br/>" SetFocusOnError="true" Display="Dynamic"
                    ControlToValidate="txtSurname" ValidationGroup="MandatoryContent"
                    ValidationExpression="^[\p{L}\p{Zs}'\-]+$" />
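To see what a Unicode-category pattern of this kind accepts and rejects, here is a small JavaScript sketch. It assumes an engine with Unicode property escapes (the u flag), and the function name is made up for the example.

```javascript
// Letters from any script, space separators, apostrophe and hyphen.
const NAME_PATTERN = /^[\p{L}\p{Zs}'-]+$/u;

// Hypothetical wrapper, only for illustration.
function matchesNamePattern(name) {
    return NAME_PATTERN.test(name);
}

console.log(matchesNamePattern("Mc'Tristan-Smythe")); // true
console.log(matchesNamePattern('山田'));               // true
console.log(matchesNamePattern('James the 3rd'));      // false: digits are not letters
// Devanagari "Priya": the vowel signs and virama are combining marks, not \p{L}.
console.log(matchesNamePattern('\u092A\u094D\u0930\u093F\u092F\u093E')); // false
```

The last case is worth noticing: a letters-only class rejects perfectly ordinary names in scripts that rely on combining marks, so you would need to add \p{M} (and test with real data) before trusting it.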

BTW, do you plan to only permit the Latin alphabet, or do you also plan to try to validate Chinese, Arabic, Hindi, etc.?

As others have said, don't even try to do this. Step back and ask yourself what you are actually trying to accomplish. Then try to accomplish it without making any assumptions about what people's names are, or what they mean.


This one worked for me in JavaScript for plain ASCII names of up to three parts separated by a space or hyphen: ^[a-zA-Z]+[\s-]?[a-zA-Z]+[\s-]?[a-zA-Z]+$

Here is the method:

function isValidName(name) {
    // Letters only, with at most two single whitespace or hyphen separators.
    return /^[a-zA-Z]+[\s-]?[a-zA-Z]+[\s-]?[a-zA-Z]+$/.test(name);
}
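Before relying on a pattern of this shape, it may help to run it over a few awkward inputs. The sample names below are my own; the sketch uses [\s-] as the separator class.

```javascript
// ASCII letters with up to two single separators (whitespace or hyphen).
const ASCII_NAME = /^[a-zA-Z]+[\s-]?[a-zA-Z]+[\s-]?[a-zA-Z]+$/;

// Illustrative inputs only.
const samples = ['Mary-Jane', 'Jean Luc Picard', 'Al', "O'Brien", 'José'];

for (const name of samples) {
    console.log(name, ASCII_NAME.test(name));
}
```

The first two match. "Al" fails because the three letter groups need at least three letters in total, "O'Brien" fails on the apostrophe, and "José" fails on the accented letter.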

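You could use the following regex to validate two names separated by a space:

^[A-Za-zÀ-ÖØ-öø-ÿ]+ [A-Za-zÀ-ÖØ-öø-ÿ]+$

The accented ranges skip × (U+00D7) and ÷ (U+00F7), which a single À-ú range would let through.

Or, if your regex engine supports POSIX character classes (.NET and JavaScript do not), just use those. For Western European text they are roughly:

[[:lower:]] = [a-zß-öø-ÿ]

[[:upper:]] = [A-ZÀ-ÖØ-Þ]

[[:alpha:]] = [A-Za-zÀ-ÖØ-öø-ÿ]

[[:alnum:]] = [A-Za-zÀ-ÖØ-öø-ÿ0-9]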


I sympathize with the need to constrain input in this situation, but I don't believe it is possible. Unicode is vast and still expanding, and so is the subset used in names throughout the world.

Unlike email, there's no universally agreed-upon standard for the names people may use, or even which representations they may register as official with their respective governments. I suspect that any regex will eventually fail to pass a name considered valid by someone, somewhere in the world.

Of course, you do need to sanitize or escape input, to avoid the Little Bobby Tables problem. And there may be other constraints on which input you allow as well, such as the underlying systems used to store, render or manipulate names. As such, I recommend that you determine first the restrictions necessitated by the system your validation belongs to, and create a validation expression based on those alone. This may still cause inconvenience in some scenarios, but they should be rare.
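A minimal JavaScript sketch of validating against system constraints only might look like this. The 100-code-point limit and the function name are assumptions for the example; the real limits come from your own storage and rendering layers.

```javascript
// ASSUMPTION: the storage column holds at most 100 code points; adjust to your schema.
const MAX_NAME_LENGTH = 100;

// Hypothetical check: returns a list of problems, empty when the name is storable.
function checkSystemConstraints(name) {
    const problems = [];
    if (name.trim().length === 0) problems.push('empty');
    // Spread iterates by code point, so astral characters count once.
    if ([...name].length > MAX_NAME_LENGTH) problems.push('too long');
    // Control characters tend to break storage, logs and rendering.
    if (/\p{Cc}/u.test(name)) problems.push('control character');
    return problems;
}

console.log(checkSystemConstraints('Nguyễn Thị Minh'));
console.log(checkSystemConstraints('Bob\u0000'));
```

Nothing here says what a name should look like. It only rejects input that the system itself cannot handle.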


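This helps somewhat for ASCII names. It allows an optional apostrophe after the first letter, followed by letters, periods, spaces and hyphens: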

^[a-zA-Z]'?([a-zA-Z]|\.| |-)+$


I would think you would be better off using a regex to exclude the characters you don't want. Trying to allow every umlaut, accented e, hyphen, etc. would be pretty insane. Just exclude digits (but then what about a guy named "George Foreman the 4th"?) and symbols you know you don't want, like @#$%^ or what have you. But even then, a regex only guarantees that the input matches the regex; it will not tell you that it is a valid name.
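The exclusion approach could be sketched like this in JavaScript. The symbol list is just the illustrative one from the paragraph above, and the function name is made up.

```javascript
// Reject only what is known to be unwanted: any decimal digit plus a short symbol list.
// ASSUMPTION: the symbol list is illustrative; extend it to suit your data.
const UNWANTED = /[\p{Nd}@#$%^]/u;

function hasUnwantedChars(name) {
    return UNWANTED.test(name);
}

console.log(hasUnwantedChars('Zoë Müller'));             // false
console.log(hasUnwantedChars('George Foreman the 4th')); // true
console.log(hasUnwantedChars('John Doe'));               // false, although it may be a made-up name
```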

EDIT after clarifying that this is trying to prevent XSS: A regex on a name field is obviously not going to stop XSS on its own. However, this article has a section on filtering that is a starting point if you want to go that route.

http://tldp.org/HOWTO/Secure-Programs-HOWTO/cross-site-malicious-content.html

s/[\<\>\"\'\%\;\(\)\&\+]//g;
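Keep in mind that stripping characters this way also removes the apostrophe from names like O'Brien. A different technique, not the one from the linked article, is to store the name unchanged and escape it when writing it into HTML. A minimal JavaScript sketch, with a made-up function name:

```javascript
// Map of characters that are significant in HTML to their entity forms.
const HTML_ESCAPES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };

// Hypothetical helper: escape a value for use in HTML text or quoted attributes.
function escapeHtml(text) {
    return text.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

console.log(escapeHtml("O'Brien <script>"));
```

In practice you would normally rely on the escaping built into your template engine or framework rather than a hand-rolled function.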

This one should work for names where each word starts with an uppercase ASCII letter: ^([A-Z][a-z\-.']*\s?)+$ Add some special characters if you need them.


I don’t think that’s a good idea. Even if you find an appropriate regular expression (maybe using Unicode character properties), this wouldn’t prevent users from entering pseudo-names like John Doe, Max Mustermann (there even is a person with that name), Abcde Fghijk or Ababa Bebebe.