[python] Convert array of indices to 1-hot encoded numpy array

Let's say I have a 1d numpy array

a = array([1,0,3])

I would like to encode this as a 2D one-hot array

b = array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])

Is there a quick way to do this? Quicker than just looping over a to set elements of b, that is.

The answer is


I recently ran into a problem of same kind and found said solution which turned out to be only satisfying if you have numbers that go within a certain formation. For example if you want to one-hot encode following list:

all_good_list = [0,1,2,3,4]

go ahead, the posted solutions are already mentioned above. But what if considering this data:

problematic_list = [0,23,12,89,10]

If you do it with methods mentioned above, you will likely end up with 90 one-hot columns. This is because all answers include something like n = np.max(a)+1. I found a more generic solution that worked out for me and wanted to share with you:

import numpy as np
import sklearn
sklb = sklearn.preprocessing.LabelBinarizer()
a = np.asarray([1,2,44,3,2])
n = np.unique(a)
sklb.fit(n)
b = sklb.transform(a)

I hope someone encountered same restrictions on above solutions and this might come in handy


In case you are using keras, there is a built in utility for that:

from keras.utils.np_utils import to_categorical   

categorical_labels = to_categorical(int_labels, num_classes=3)

And it does pretty much the same as @YXD's answer (see source-code).


Here's a dimensionality-independent standalone solution.

This will convert any N-dimensional array arr of nonnegative integers to a one-hot N+1-dimensional array one_hot, where one_hot[i_1,...,i_N,c] = 1 means arr[i_1,...,i_N] = c. You can recover the input via np.argmax(one_hot, -1)

def expand_integer_grid(arr, n_classes):
    """

    :param arr: N dim array of size i_1, ..., i_N
    :param n_classes: C
    :returns: one-hot N+1 dim array of size i_1, ..., i_N, C
    :rtype: ndarray

    """
    one_hot = np.zeros(arr.shape + (n_classes,))
    axes_ranges = [range(arr.shape[i]) for i in range(arr.ndim)]
    flat_grids = [_.ravel() for _ in np.meshgrid(*axes_ranges, indexing='ij')]
    one_hot[flat_grids + [arr.ravel()]] = 1
    assert((one_hot.sum(-1) == 1).all())
    assert(np.allclose(np.argmax(one_hot, -1), arr))
    return one_hot

Using a Neuraxle pipeline step:

  1. Set up your example
import numpy as np
a = np.array([1,0,3])
b = np.array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])
  1. Do the actual conversion
from neuraxle.steps.numpy import OneHotEncoder
encoder = OneHotEncoder(nb_columns=4)
b_pred = encoder.transform(a)
  1. Assert it works
assert b_pred == b

Link to documentation: neuraxle.steps.numpy.OneHotEncoder


Here is what I find useful:

def one_hot(a, num_classes):
  return np.squeeze(np.eye(num_classes)[a.reshape(-1)])

Here num_classes stands for number of classes you have. So if you have a vector with shape of (10000,) this function transforms it to (10000,C). Note that a is zero-indexed, i.e. one_hot(np.array([0, 1]), 2) will give [[1, 0], [0, 1]].

Exactly what you wanted to have I believe.

PS: the source is Sequence models - deeplearning.ai


I think the short answer is no. For a more generic case in n dimensions, I came up with this:

# For 2-dimensional data, 4 values
a = np.array([[0, 1, 2], [3, 2, 1]])
z = np.zeros(list(a.shape) + [4])
z[list(np.indices(z.shape[:-1])) + [a]] = 1

I am wondering if there is a better solution -- I don't like that I have to create those lists in the last two lines. Anyway, I did some measurements with timeit and it seems that the numpy-based (indices/arange) and the iterative versions perform about the same.


  • p will be a 2d ndarray.
  • We want to know which value is the highest in a row, to put there 1 and everywhere else 0.

clean and easy solution:

max_elements_i = np.expand_dims(np.argmax(p, axis=1), axis=1)
one_hot = np.zeros(p.shape)
np.put_along_axis(one_hot, max_elements_i, 1, axis=1)

You can use the following code for converting into a one-hot vector:

let x is the normal class vector having a single column with classes 0 to some number:

import numpy as np
np.eye(x.max()+1)[x]

if 0 is not a class; then remove +1.


For 1-hot-encoding

   one_hot_encode=pandas.get_dummies(array)

For Example

ENJOY CODING


I am adding for completion a simple function, using only numpy operators:

   def probs_to_onehot(output_probabilities):
        argmax_indices_array = np.argmax(output_probabilities, axis=1)
        onehot_output_array = np.eye(np.unique(argmax_indices_array).shape[0])[argmax_indices_array.reshape(-1)]
        return onehot_output_array

It takes as input a probability matrix: e.g.:

[[0.03038822 0.65810204 0.16549407 0.3797123 ] ... [0.02771272 0.2760752 0.3280924 0.33458805]]

And it will return

[[0 1 0 0] ... [0 0 0 1]]


Just to elaborate on the excellent answer from K3---rnc, here is a more generic version:

def onehottify(x, n=None, dtype=float):
    """1-hot encode x with the max value n (computed from data if n is None)."""
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    return np.eye(n, dtype=dtype)[x]

Also, here is a quick-and-dirty benchmark of this method and a method from the currently accepted answer by YXD (slightly changed, so that they offer the same API except that the latter works only with 1D ndarrays):

def onehottify_only_1d(x, n=None, dtype=float):
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    b = np.zeros((len(x), n), dtype=dtype)
    b[np.arange(len(x)), x] = 1
    return b

The latter method is ~35% faster (MacBook Pro 13 2015), but the former is more general:

>>> import numpy as np
>>> np.random.seed(42)
>>> a = np.random.randint(0, 9, size=(10_000,))
>>> a
array([6, 3, 7, ..., 5, 8, 6])
>>> %timeit onehottify(a, 10)
188 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit onehottify_only_1d(a, 10)
139 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

You can use sklearn.preprocessing.LabelBinarizer:

Example:

import sklearn.preprocessing
a = [1,0,3]
label_binarizer = sklearn.preprocessing.LabelBinarizer()
label_binarizer.fit(range(max(a)+1))
b = label_binarizer.transform(a)
print('{0}'.format(b))

output:

[[0 1 0 0]
 [1 0 0 0]
 [0 0 0 1]]

Amongst other things, you may initialize sklearn.preprocessing.LabelBinarizer() so that the output of transform is sparse.


Such type of encoding are usually part of numpy array. If you are using a numpy array like this :

a = np.array([1,0,3])

then there is very simple way to convert that to 1-hot encoding

out = (np.arange(4) == a[:,None]).astype(np.float32)

That's it.


Use the following code. It works best.

def one_hot_encode(x):
"""
    argument
        - x: a list of labels
    return
        - one hot encoding matrix (number of labels, number of class)
"""
encoded = np.zeros((len(x), 10))

for idx, val in enumerate(x):
    encoded[idx][val] = 1

return encoded

Found it here P.S You don't need to go into the link.


Here is a function that converts a 1-D vector to a 2-D one-hot array.

#!/usr/bin/env python
import numpy as np

def convertToOneHot(vector, num_classes=None):
    """
    Converts an input 1-D vector of integers into an output
    2-D array of one-hot vectors, where an i'th input value
    of j will set a '1' in the i'th row, j'th column of the
    output array.

    Example:
        v = np.array((1, 0, 4))
        one_hot_v = convertToOneHot(v)
        print one_hot_v

        [[0 1 0 0 0]
         [1 0 0 0 0]
         [0 0 0 0 1]]
    """

    assert isinstance(vector, np.ndarray)
    assert len(vector) > 0

    if num_classes is None:
        num_classes = np.max(vector)+1
    else:
        assert num_classes > 0
        assert num_classes >= np.max(vector)

    result = np.zeros(shape=(len(vector), num_classes))
    result[np.arange(len(vector)), vector] = 1
    return result.astype(int)

Below is some example usage:

>>> a = np.array([1, 0, 3])

>>> convertToOneHot(a)
array([[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1]])

>>> convertToOneHot(a, num_classes=10)
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])

Here is an example function that I wrote to do this based upon the answers above and my own use case:

def label_vector_to_one_hot_vector(vector, one_hot_size=10):
    """
    Use to convert a column vector to a 'one-hot' matrix

    Example:
        vector: [[2], [0], [1]]
        one_hot_size: 3
        returns:
            [[ 0.,  0.,  1.],
             [ 1.,  0.,  0.],
             [ 0.,  1.,  0.]]

    Parameters:
        vector (np.array): of size (n, 1) to be converted
        one_hot_size (int) optional: size of 'one-hot' row vector

    Returns:
        np.array size (vector.size, one_hot_size): converted to a 'one-hot' matrix
    """
    squeezed_vector = np.squeeze(vector, axis=-1)

    one_hot = np.zeros((squeezed_vector.size, one_hot_size))

    one_hot[np.arange(squeezed_vector.size), squeezed_vector] = 1

    return one_hot

label_vector_to_one_hot_vector(vector=[[2], [0], [1]], one_hot_size=3)

If using tensorflow, there is one_hot():

import tensorflow as tf
import numpy as np

a = np.array([1, 0, 3])
depth = 4
b = tf.one_hot(a, depth)
# <tf.Tensor: shape=(3, 3), dtype=float32, numpy=
# array([[0., 1., 0.],
#        [1., 0., 0.],
#        [0., 0., 0.]], dtype=float32)>

>>> values = [1, 0, 3]
>>> n_values = np.max(values) + 1
>>> np.eye(n_values)[values]
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])

You can also use eye function of numpy:

numpy.eye(number of classes)[vector containing the labels]


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to numpy

Unable to allocate array with shape and data type How to fix 'Object arrays cannot be loaded when allow_pickle=False' for imdb.load_data() function? Numpy, multiply array with scalar TypeError: only integer scalar arrays can be converted to a scalar index with 1D numpy indices array Could not install packages due to a "Environment error :[error 13]: permission denied : 'usr/local/bin/f2py'" Pytorch tensor to numpy array Numpy Resize/Rescale Image what does numpy ndarray shape do? How to round a numpy array? numpy array TypeError: only integer scalar arrays can be converted to a scalar index

Examples related to machine-learning

Error in Python script "Expected 2D array, got 1D array instead:"? How to predict input image using trained model in Keras? What is the role of "Flatten" in Keras? How to concatenate two layers in keras? How to save final model using keras? scikit-learn random state in splitting dataset Why binary_crossentropy and categorical_crossentropy give different performances for the same problem? What is the meaning of the word logits in TensorFlow? Can anyone explain me StandardScaler? Can Keras with Tensorflow backend be forced to use CPU or GPU at will?

Examples related to numpy-ndarray

what does numpy ndarray shape do? What is the purpose of meshgrid in Python / NumPy? Convert array of indices to 1-hot encoded numpy array How does numpy.newaxis work and when to use it? How to create a numpy array of all True or all False? What does -1 mean in numpy reshape? What is the difference between ndarray and array in numpy? Paritition array into N chunks with Numpy Concatenating two one-dimensional NumPy arrays How do I get indices of N maximum values in a NumPy array?

Examples related to one-hot-encoding

How can I one hot encode in Python? Convert array of indices to 1-hot encoded numpy array