[machine-learning] Difference between classification and clustering in data mining?

Machine learning, or AI more broadly, is often understood in terms of the task it performs.

In my opinion, thinking about clustering and classification in terms of the task each one achieves really helps in understanding the difference between the two.

Clustering groups things; classification, roughly speaking, labels things.

Let's assume you are at a party where all the men are in suits and all the women are in gowns.

Now, you ask your friend a few questions:

Q1: Hey, can you help me group people?

Possible answers that your friend can give are:

1: He can group people by gender, male or female

2: He can group people by their clothes, one group wearing suits and the other wearing gowns

3: He can group people by hair color

4: He can group people by age group, and so on.

There are numerous ways your friend can complete this task.

Of course, you can influence his decision-making process by providing extra input, such as:

Can you help me group these people based on gender (or age group, hair color, dress, etc.)?

Q2:

Before Q2, you need to do some pre-work.

You have to teach or inform your friend so that he can make an informed decision. So, let's say you tell your friend that:

  • People with long hair are Women.

  • People with short hair are Men.

Q2. Now, you point to a person with long hair and ask your friend: is it a man or a woman?

The only answer that you can expect is: Woman.

Of course, there can be men with long hair and women with short hair at the party. But the answer is correct based on what you taught your friend. You can further improve the process by teaching your friend more about how to differentiate between the two.

In the above example,

Q1 represents the task that clustering achieves.

In clustering, you provide the data (people) to the algorithm (your friend) and ask it to group the data.

Now it's up to the algorithm to decide what the best way to group is (gender, hair color, or age group).

Again, you can definitely influence the decision made by the algorithm by providing extra inputs.
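To make Q1 concrete, here is a minimal sketch of clustering in plain Python (3.8+ for `math.dist`). It uses k-means, which is just one common clustering algorithm and my own pick for illustration. The `guests` data, the feature names, and the `k_means` helper are all made up for this example. Notice that no labels are given anywhere: the only "extra input" is the number of groups and which feature the algorithm gets to look at.

```python
from math import dist

# Hypothetical guests as (age, hair_length_cm); the values are made up.
guests = [(22, 5), (25, 40), (24, 6), (61, 4), (58, 45), (63, 38)]

def k_means(points, k, iterations=10):
    """Plain k-means: returns a cluster index for each point."""
    # Deterministic start (an assumption for reproducibility):
    # use the first k points as the initial centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return labels

# Steering the grouping by choosing which feature the algorithm sees.
by_age = k_means([(age,) for age, _ in guests], k=2)
by_hair = k_means([(hair,) for _, hair in guests], k=2)

print(by_age)   # [0, 0, 0, 1, 1, 1]
print(by_hair)  # [0, 1, 0, 0, 1, 1]
```

The same six people end up in different groups depending on the feature you hand over, just like your friend grouping by age versus hair. The cluster numbers 0 and 1 carry no meaning by themselves; the algorithm never says "young" or "long hair".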

Q2 represents the task that classification achieves.

There, you give your algorithm (your friend) some data (people), called training data, and make it learn which data corresponds to which label (male or female). Then you point your algorithm at other data, called test data, and ask it to determine whether each item is male or female. The better your teaching, the better its predictions.

And the pre-work in Q2, or classification, is nothing but training your model so that it can learn how to differentiate. In clustering, or Q1, there is no separate pre-work: it is part of the grouping itself.
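Q2 can be sketched the same way. This is a toy nearest-average classifier in plain Python, chosen only because it is short; the training examples, the hair lengths, and the `train`/`predict` names are hypothetical. The point is the two separate phases: first you teach with labeled data, then you ask about a new person.

```python
# Hypothetical training data: (hair_length_cm, label). Values are made up.
training_data = [
    (5, "Man"), (8, "Man"), (3, "Man"),
    (40, "Woman"), (35, "Woman"), (50, "Woman"),
]

def train(examples):
    """The pre-work: learn the average hair length for each label."""
    totals = {}
    for hair_cm, label in examples:
        total, count = totals.get(label, (0.0, 0))
        totals[label] = (total + hair_cm, count + 1)
    return {label: total / count for label, (total, count) in totals.items()}

def predict(model, hair_cm):
    """Answer with the label whose learned average is closest."""
    return min(model, key=lambda label: abs(model[label] - hair_cm))

model = train(training_data)

print(predict(model, 45))  # Woman
print(predict(model, 4))   # Man
```

A man with 45 cm hair would still be called "Woman" here. That is the same situation as in the party example: the answer is consistent with what was taught, and it only gets better if you teach more (more examples, more features).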

Hope this helps someone.

Thanks
