[java] What does the term "canonical form" or "canonical representation" in Java mean?

I have often heard this term being used, but I have never really understood it.

What does it mean, and can anyone give some examples/point me to some links?

EDIT: Thanks to everyone for the replies. Can you also tell me how the canonical representation is useful in equals() performance, as stated in Effective Java?

This question is related to java

The answer is


Wikipedia points to the term Canonicalization.

A process for converting data that has more than one possible representation into a "standard" canonical representation. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.

The Unicode example made the most sense to me:

Variable-length encodings in the Unicode standard, in particular UTF-8, have more than one possible encoding for most common characters. This makes string validation more complicated, since every possible encoding of each string character must be considered. A software implementation which does not consider all character encodings runs the risk of accepting strings considered invalid in the application design, which could cause bugs or allow attacks. The solution is to allow a single encoding for each character. Canonicalization is then the process of translating every string character to its single allowed encoding. An alternative is for software to determine whether a string is canonicalized, and then reject it if it is not. In this case, in a client/server context, the canonicalization would be the responsibility of the client.

In summary, a standard form of representation for data. From this form you can then convert to any representation you may need.


Another good example might be: you have a class that supports the use of cartesian (x, y, z), spherical (r, theta, phi) and cylindrical coordinates (r, phi, z). For purposes of establishing equality (equals method), you would probably want to convert all representations to one "canonical" representation of your choosing, e.g. spherical coordinates. (Or maybe you would want to do this in general - i.e. use one internal representation.) I am not an expert, but this did occur to me as maybe a good concrete example.


The OP's questions about canonical form and how it can improve performance of the equals method can both be answered by extending the example provided in Effective Java.

Consider the following class:

public final class CaseInsensitiveString {

  private final String s;

  public CaseInsensitiveString(String s) {
    this.s = Objects.requireNonNull(s);
  }

  @Override 
  public boolean equals(Object o) {
    return o instanceof CaseInsensitiveString && ((CaseInsensitiveString) o).s.equalsIgnoreCase(s);
  }
}

The equals method in this example has added cost by using String's equalsIgnoreCase method. As mentioned in the text

you may want to store a canonical form of the field so the equals method can do a cheap exact comparison on canonical forms rather than a more costly nonstandard comparison.

What does Joshua Bloch mean when he says canonical form? Well, I think Dónal's concise answer is very appropriate. We can store the underlying String field in the CaseInsensitiveString example in a standard way, perhaps the uppercase form of the String. Now, you can reference this canonical form of the CaseInsensitiveString, its uppercase variant, and perform cheap evaluations in your equals and hashcode methods.


Another good example might be: you have a class that supports the use of cartesian (x, y, z), spherical (r, theta, phi) and cylindrical coordinates (r, phi, z). For purposes of establishing equality (equals method), you would probably want to convert all representations to one "canonical" representation of your choosing, e.g. spherical coordinates. (Or maybe you would want to do this in general - i.e. use one internal representation.) I am not an expert, but this did occur to me as maybe a good concrete example.


I believe there are two related uses of canonical: forms and instances.

A canonical form means that values of a particular type of resource can be described or represented in multiple ways, and one of those ways is chosen as the favored canonical form. (That form is canonized, like books that made it into the bible, and the other forms are not.) A classic example of a canonical form is paths in a hierarchical file system, where a single file can be referenced in a number of ways:

myFile.txt                                   # in current working dir
../conf/myFile.txt                           # relative to the CWD
/apps/tomcat/conf/myFile.txt                 # absolute path using symbolic links
/u1/local/apps/tomcat-5.5.1/conf/myFile.txt  # absolute path with no symlinks

The classic definition of the canonical representation of that file would be the last path. With local or relative paths you cannot globally identify the resource without contextual information. With absolute paths you can identify the resource, but cannot tell if two paths refer to the same entity. With two or more paths converted to their canonical forms, you can do all the above, plus determine if two resources are the same or not, if that is important to your application (solve the aliasing problem).

Note that the canonical form of a resource is not a quality of that particular form itself; there can be multiple possible canonical forms for a given type like file paths (say, lexicographically first of all possible absolute paths). One form is just selected as the canonical form for a particular application reason, or maybe arbitrarily so that everyone speaks the same language.

Forcing objects into their canonical instances is the same basic idea, but instead of determining one "best" representation of a resource, it arbitrarily chooses one instance of a class of instances with the same "content" as the canonical reference, then converts all references to equivalent objects to use the one canonical instance.

This can be used as a technique for optimizing both time and space. If there are multiple instances of equivalent objects in an application, then by forcing them all to be resolved as the single canonical instance of a particular value, you can eliminate all but one of each value, saving space and possibly time since you can now compare those values with reference identity (==) as opposed to object equivalence (equals() method).

A classic example of optimizing performance with canonical instances is collapsing strings with the same content. Calling String.intern() on two strings with the same character sequence is guaranteed to return the same canonical String object for that text. If you pass all your strings through that canonicalizer, you know equivalent strings are actually identical object references, i.e., aliases

The enum types in Java 5.0+ force all instances of a particular enum value to use the same canonical instance within a VM, even if the value is serialized and deserialized. That is why you can use if (day == Days.SUNDAY) with impunity in java if Days is an enum type. Doing this for your own classes is certainly possible, but takes care. Read Effective Java by Josh Bloch for details and advice.


The word "canonical" is just a synonym for "standard" or "usual". It doesn`t have any Java-specific meaning.


A canonical form means a naturally unique representation of the element


Canonical Data in RDBMS, Graph Data;
Think as "Normalization" or "Normal form" of a data in a RDBMS. Same data exists in different tables, represented with a unique identifier and mapped it in different tables.
or
Think a single form of a data in Graph Database that represented in many triples.

Major benefit of it is to make Dml (Data manipulation) more efficient since you can upsert (insert/update) only one value instead of many.


reduced to the simplest and most significant form without losing generality


The word "canonical" is just a synonym for "standard" or "usual". It doesn`t have any Java-specific meaning.


canonical representation means view the character in different style for example if I write a letter A means another person may write the letter A in different style:)

This is according to OPTICAL CHARACTER RECOGNITION FIELD


canonical representation means view the character in different style for example if I write a letter A means another person may write the letter A in different style:)

This is according to OPTICAL CHARACTER RECOGNITION FIELD


An easy way to remember it is the way "canonical" is used in theological circles, canonical truth is the real truth so if two people find it they have found the same truth. Same with canonical instance. If you think you have found two of them (i.e. a.equals(b)) you really only have one (i.e. a == b). So equality implies identity in the case of canonical object.

Now for the comparison. You now have the choice of using a==b or a.equals(b), since they will produce the same answer in the case of canonical instance but a==b is comparison of the reference (the JVM can compare two numbers extremely rapidly as they are just two 32 bit patterns compared to a.equals(b) which is a method call and involves more overhead.


A good example for understanding "canonical form/representation" is to look at the XML schema datatype definition of "boolean":

  • the "lexical representation" of boolean can be one of: {true, false, 1, 0} whereas
  • the "canonical representation" can only be one of {true, false}

This, in essence, means that

  • "true" and "1" get mapped to the canonical repr. "true" and
  • "false" and "0" get mapped to the canoncial repr. "false"

see the w3 XML schema datatype definition for boolean


I believe there are two related uses of canonical: forms and instances.

A canonical form means that values of a particular type of resource can be described or represented in multiple ways, and one of those ways is chosen as the favored canonical form. (That form is canonized, like books that made it into the bible, and the other forms are not.) A classic example of a canonical form is paths in a hierarchical file system, where a single file can be referenced in a number of ways:

myFile.txt                                   # in current working dir
../conf/myFile.txt                           # relative to the CWD
/apps/tomcat/conf/myFile.txt                 # absolute path using symbolic links
/u1/local/apps/tomcat-5.5.1/conf/myFile.txt  # absolute path with no symlinks

The classic definition of the canonical representation of that file would be the last path. With local or relative paths you cannot globally identify the resource without contextual information. With absolute paths you can identify the resource, but cannot tell if two paths refer to the same entity. With two or more paths converted to their canonical forms, you can do all the above, plus determine if two resources are the same or not, if that is important to your application (solve the aliasing problem).

Note that the canonical form of a resource is not a quality of that particular form itself; there can be multiple possible canonical forms for a given type like file paths (say, lexicographically first of all possible absolute paths). One form is just selected as the canonical form for a particular application reason, or maybe arbitrarily so that everyone speaks the same language.

Forcing objects into their canonical instances is the same basic idea, but instead of determining one "best" representation of a resource, it arbitrarily chooses one instance of a class of instances with the same "content" as the canonical reference, then converts all references to equivalent objects to use the one canonical instance.

This can be used as a technique for optimizing both time and space. If there are multiple instances of equivalent objects in an application, then by forcing them all to be resolved as the single canonical instance of a particular value, you can eliminate all but one of each value, saving space and possibly time since you can now compare those values with reference identity (==) as opposed to object equivalence (equals() method).

A classic example of optimizing performance with canonical instances is collapsing strings with the same content. Calling String.intern() on two strings with the same character sequence is guaranteed to return the same canonical String object for that text. If you pass all your strings through that canonicalizer, you know equivalent strings are actually identical object references, i.e., aliases

The enum types in Java 5.0+ force all instances of a particular enum value to use the same canonical instance within a VM, even if the value is serialized and deserialized. That is why you can use if (day == Days.SUNDAY) with impunity in java if Days is an enum type. Doing this for your own classes is certainly possible, but takes care. Read Effective Java by Josh Bloch for details and advice.


The word "canonical" is just a synonym for "standard" or "usual". It doesn`t have any Java-specific meaning.


The OP's questions about canonical form and how it can improve performance of the equals method can both be answered by extending the example provided in Effective Java.

Consider the following class:

public final class CaseInsensitiveString {

  private final String s;

  public CaseInsensitiveString(String s) {
    this.s = Objects.requireNonNull(s);
  }

  @Override 
  public boolean equals(Object o) {
    return o instanceof CaseInsensitiveString && ((CaseInsensitiveString) o).s.equalsIgnoreCase(s);
  }
}

The equals method in this example has added cost by using String's equalsIgnoreCase method. As mentioned in the text

you may want to store a canonical form of the field so the equals method can do a cheap exact comparison on canonical forms rather than a more costly nonstandard comparison.

What does Joshua Bloch mean when he says canonical form? Well, I think Dónal's concise answer is very appropriate. We can store the underlying String field in the CaseInsensitiveString example in a standard way, perhaps the uppercase form of the String. Now, you can reference this canonical form of the CaseInsensitiveString, its uppercase variant, and perform cheap evaluations in your equals and hashcode methods.


A canonical form means a naturally unique representation of the element


An easy way to remember it is the way "canonical" is used in theological circles, canonical truth is the real truth so if two people find it they have found the same truth. Same with canonical instance. If you think you have found two of them (i.e. a.equals(b)) you really only have one (i.e. a == b). So equality implies identity in the case of canonical object.

Now for the comparison. You now have the choice of using a==b or a.equals(b), since they will produce the same answer in the case of canonical instance but a==b is comparison of the reference (the JVM can compare two numbers extremely rapidly as they are just two 32 bit patterns compared to a.equals(b) which is a method call and involves more overhead.


reduced to the simplest and most significant form without losing generality


Canonical Data in RDBMS, Graph Data;
Think as "Normalization" or "Normal form" of a data in a RDBMS. Same data exists in different tables, represented with a unique identifier and mapped it in different tables.
or
Think a single form of a data in Graph Database that represented in many triples.

Major benefit of it is to make Dml (Data manipulation) more efficient since you can upsert (insert/update) only one value instead of many.


I believe there are two related uses of canonical: forms and instances.

A canonical form means that values of a particular type of resource can be described or represented in multiple ways, and one of those ways is chosen as the favored canonical form. (That form is canonized, like books that made it into the bible, and the other forms are not.) A classic example of a canonical form is paths in a hierarchical file system, where a single file can be referenced in a number of ways:

myFile.txt                                   # in current working dir
../conf/myFile.txt                           # relative to the CWD
/apps/tomcat/conf/myFile.txt                 # absolute path using symbolic links
/u1/local/apps/tomcat-5.5.1/conf/myFile.txt  # absolute path with no symlinks

The classic definition of the canonical representation of that file would be the last path. With local or relative paths you cannot globally identify the resource without contextual information. With absolute paths you can identify the resource, but cannot tell if two paths refer to the same entity. With two or more paths converted to their canonical forms, you can do all the above, plus determine if two resources are the same or not, if that is important to your application (solve the aliasing problem).

Note that the canonical form of a resource is not a quality of that particular form itself; there can be multiple possible canonical forms for a given type like file paths (say, lexicographically first of all possible absolute paths). One form is just selected as the canonical form for a particular application reason, or maybe arbitrarily so that everyone speaks the same language.

Forcing objects into their canonical instances is the same basic idea, but instead of determining one "best" representation of a resource, it arbitrarily chooses one instance of a class of instances with the same "content" as the canonical reference, then converts all references to equivalent objects to use the one canonical instance.

This can be used as a technique for optimizing both time and space. If there are multiple instances of equivalent objects in an application, then by forcing them all to be resolved as the single canonical instance of a particular value, you can eliminate all but one of each value, saving space and possibly time since you can now compare those values with reference identity (==) as opposed to object equivalence (equals() method).

A classic example of optimizing performance with canonical instances is collapsing strings with the same content. Calling String.intern() on two strings with the same character sequence is guaranteed to return the same canonical String object for that text. If you pass all your strings through that canonicalizer, you know equivalent strings are actually identical object references, i.e., aliases

The enum types in Java 5.0+ force all instances of a particular enum value to use the same canonical instance within a VM, even if the value is serialized and deserialized. That is why you can use if (day == Days.SUNDAY) with impunity in java if Days is an enum type. Doing this for your own classes is certainly possible, but takes care. Read Effective Java by Josh Bloch for details and advice.


The word "canonical" is just a synonym for "standard" or "usual". It doesn`t have any Java-specific meaning.


A good example for understanding "canonical form/representation" is to look at the XML schema datatype definition of "boolean":

  • the "lexical representation" of boolean can be one of: {true, false, 1, 0} whereas
  • the "canonical representation" can only be one of {true, false}

This, in essence, means that

  • "true" and "1" get mapped to the canonical repr. "true" and
  • "false" and "0" get mapped to the canoncial repr. "false"

see the w3 XML schema datatype definition for boolean