# [python] How do I use itertools.groupby()?

I haven't been able to find an understandable explanation of how to actually use Python's `itertools.groupby()` function. What I'm trying to do is this:

• Take a list - in this case, the children of an objectified `lxml` element
• Divide it into groups based on some criteria
• Then later iterate over each of these groups separately.

I've reviewed the documentation, but I've had trouble applying it to anything beyond a simple list of numbers.

So, how do I use `itertools.groupby()`? Is there another technique I should be using? Pointers to good "prerequisite" reading would also be appreciated.


## The answer is

The example on the Python docs is quite straightforward:

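```python
from itertools import groupby

# data and keyfunc are placeholders for your iterable and key function
groups = []
uniquekeys = []
for k, g in groupby(data, keyfunc):
    groups.append(list(g))      # Store group iterator as a list
    uniquekeys.append(k)
```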

So in your case, data is a list of nodes, `keyfunc` is where the logic of your criteria function goes and then `groupby()` groups the data.

You must sort the data by the same criteria before you call `groupby`, or items with the same key that are not adjacent will end up in separate groups. `groupby` simply iterates through the data and starts a new group whenever the key changes.
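
As a minimal sketch of that sort-then-group pattern (the `words` list, the helper name `group_by_key`, and the choice of `len` as the key are made up for illustration):

```python
from itertools import groupby

# Hypothetical data: group words by their length.
words = ["kiwi", "fig", "plum", "pear", "apple", "yam"]

def group_by_key(items, keyfunc):
    """Sort by keyfunc, then collect each group into a list."""
    ordered = sorted(items, key=keyfunc)  # same key for sorting and grouping
    return [(k, list(g)) for k, g in groupby(ordered, keyfunc)]

by_length = group_by_key(words, len)
print(by_length)
# [(3, ['fig', 'yam']), (4, ['kiwi', 'plum', 'pear']), (5, ['apple'])]
```

Because `sorted` is stable, items inside each group keep their original relative order.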

@CaptSolo, I tried your example, but it didn't work.

```python
from itertools import groupby
[(c, len(list(cs))) for c, cs in groupby('Pedro Manoel')]
```

Output:

```
[('P', 1), ('e', 1), ('d', 1), ('r', 1), ('o', 1), (' ', 1), ('M', 1), ('a', 1), ('n', 1), ('o', 1), ('e', 1), ('l', 1)]
```

As you can see, there are two o's and two e's, but they got into separate groups. That's when I realized you need to sort the list passed to the groupby function. So, the correct usage would be:

```python
name = list('Pedro Manoel')
name.sort()
[(c, len(list(cs))) for c, cs in groupby(name)]
```

Output:

```
[(' ', 1), ('M', 1), ('P', 1), ('a', 1), ('d', 1), ('e', 2), ('l', 1), ('n', 1), ('o', 2), ('r', 1)]
```

Just remember: if the list is not sorted, `groupby` will not put all equal items into a single group!

Another example:

```python
import itertools

for key, igroup in itertools.groupby(range(12), lambda x: x // 5):
    print(key, list(igroup))
```

results in

```
0 [0, 1, 2, 3, 4]
1 [5, 6, 7, 8, 9]
2 [10, 11]
```

Note that igroup is an iterator (a sub-iterator as the documentation calls it).

This is useful for chunking a generator:

```python
import itertools

def chunker(items, chunk_size):
    '''Group items in chunks of chunk_size'''
    for _key, group in itertools.groupby(enumerate(items), lambda x: x[0] // chunk_size):
        yield (g[1] for g in group)

# process() stands for whatever you do with each chunk; 100 is an arbitrary size
with open('file.txt') as fobj:
    for chunk in chunker(fobj, 100):
        process(chunk)
```
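
A variation on the chunker above, sketched under the assumption that you would rather receive each chunk as a list than as a lazy generator. Materializing the chunk immediately makes it safe to collect all chunks with `list()`; the sample string is only for illustration:

```python
from itertools import groupby

def chunker(items, chunk_size):
    """Yield successive chunks of chunk_size items as lists."""
    indexed = enumerate(items)
    for _key, group in groupby(indexed, lambda pair: pair[0] // chunk_size):
        # Build the list now: the group iterator is only usable until
        # groupby moves on to the next group.
        yield [item for _index, item in group]

chunks = list(chunker("abcdefg", 3))
print(chunks)  # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]
```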

Another example of groupby - when the keys are not sorted. In the following example, items in xx are grouped by values in yy. In this case, one set of zeros is output first, followed by a set of ones, followed again by a set of zeros.

```python
import itertools

xx = range(10)
yy = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
for group in itertools.groupby(iter(xx), lambda x: yy[x]):
    print(group[0], list(group[1]))
```

Produces:

```
0 [0, 1, 2]
1 [3, 4, 5]
0 [6, 7, 8, 9]
```

One useful example that I came across may be helpful:

```python
from itertools import groupby

# user input
myinput = input()

# creating empty list to store output
myoutput = []

for k, g in groupby(myinput):
    myoutput.append((len(list(g)), int(k)))

print(*myoutput)
```

Sample input: 14445221

Sample output: (1, 1) (3, 4) (1, 5) (2, 2) (1, 1)

Sorting and groupby

```python
from itertools import groupby

val = [{'name': 'satyajit', 'address': 'btm', 'pin': 560076},
       {'name': 'Mukul', 'address': 'Silk board', 'pin': 560078},
       {'name': 'Preetam', 'address': 'btm', 'pin': 560076}]

for pin, list_data in groupby(sorted(val, key=lambda k: k['pin']), lambda x: x['pin']):
    print(pin)
    for rec in list_data:
        print(rec)
```

Output:

```
560076
{'name': 'satyajit', 'address': 'btm', 'pin': 560076}
{'name': 'Preetam', 'address': 'btm', 'pin': 560076}
560078
{'name': 'Mukul', 'address': 'Silk board', 'pin': 560078}
```

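You can write your own groupby function. Unlike `itertools.groupby`, this dictionary-based version collects every value for a key, even when the input is not sorted:

```python
def groupby(data):
    kv = {}
    for k, v in data:
        if k not in kv:
            kv[k] = [v]
        else:
            kv[k].append(v)
    return kv
```

Run in IPython:

```
In [10]: data = [('a', 1), ('b',2),('a',2)]

In [11]: groupby(data)
Out[11]: {'a': [1, 2], 'b': [2]}
```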
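
The same idea can be sketched with `collections.defaultdict` and an explicit key function, which is closer to what the question asks for (grouping arbitrary objects by some criterion). The name `group_all` and the sample data are assumptions for illustration, not part of any library:

```python
from collections import defaultdict

def group_all(items, keyfunc):
    """Group every item by keyfunc(item), regardless of input order."""
    groups = defaultdict(list)
    for item in items:
        groups[keyfunc(item)].append(item)
    return dict(groups)

data = [('a', 1), ('b', 2), ('a', 2)]
result = group_all(data, lambda pair: pair[0])
print(result)  # {'a': [('a', 1), ('a', 2)], 'b': [('b', 2)]}
```

This avoids sorting altogether, at the cost of holding all groups in memory at once.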

This basic implementation helped me understand this function. Hope it helps others as well:

```python
from itertools import groupby

arr = [(1, "A"), (1, "B"), (1, "C"), (2, "D"), (2, "E"), (3, "F")]

for k, g in groupby(arr, lambda x: x[0]):
    print("--", k, "--")
    for tup in g:
        print(tup[1])  # tup[0] == k
```

```
-- 1 --
A
B
C
-- 2 --
D
E
-- 3 --
F
```

`itertools.groupby` is a tool for grouping items.

From the docs, we glean further what it might do:

`# [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B`

`# [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D`

`groupby` objects yield key-group pairs where the group is an iterator.
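
A small sketch of what that means in practice: each group can be read only once, so a second pass over the same group finds nothing. The input string is arbitrary:

```python
from itertools import groupby

first_pass = []
second_pass = []
for key, group in groupby("AAB"):
    first_pass.append((key, list(group)))   # consumes the group
    second_pass.append((key, list(group)))  # nothing left to read

print(first_pass)   # [('A', ['A', 'A']), ('B', ['B'])]
print(second_pass)  # [('A', []), ('B', [])]
```

If the items are needed more than once, store `list(group)` in a variable first.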

Features

• A. Group consecutive items together
• B. Group all occurrences of an item, given a sorted iterable
• C. Specify how to group items with a key function *

Comparisons

```python
# Define a printer for comparing outputs
>>> import itertools as it
>>> def print_groupby(iterable, keyfunc=None):
...    for k, g in it.groupby(iterable, keyfunc):
...        print("key: '{}'--> group: {}".format(k, list(g)))
```
```python
# Feature A: group consecutive occurrences
>>> print_groupby("BCAACACAADBBB")
key: 'B'--> group: ['B']
key: 'C'--> group: ['C']
key: 'A'--> group: ['A', 'A']
key: 'C'--> group: ['C']
key: 'A'--> group: ['A']
key: 'C'--> group: ['C']
key: 'A'--> group: ['A', 'A']
key: 'D'--> group: ['D']
key: 'B'--> group: ['B', 'B', 'B']

# Feature B: group all occurrences
>>> print_groupby(sorted("BCAACACAADBBB"))
key: 'A'--> group: ['A', 'A', 'A', 'A', 'A']
key: 'B'--> group: ['B', 'B', 'B', 'B']
key: 'C'--> group: ['C', 'C', 'C']
key: 'D'--> group: ['D']

# Feature C: group by a key function
>>> # islower = lambda s: s.islower()                      # equivalent
>>> def islower(s):
...     """Return True if a string is lowercase, else False."""
...     return s.islower()
>>> print_groupby(sorted("bCAaCacAADBbB"), keyfunc=islower)
key: 'False'--> group: ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D']
key: 'True'--> group: ['a', 'a', 'b', 'b', 'c']
``````

Uses

Note: Several of the latter examples derive from Víctor Terrón's PyCon talk (in Spanish), "Kung Fu at Dawn with Itertools". See also the `groupby` source code written in C.

* A function where all items are passed through and compared, influencing the result. Other objects with key functions include `sorted()`, `max()` and `min()`.

Response

```python
# OP: Yes, you can use `groupby`, e.g.
[do_something(list(g)) for _, g in groupby(lxml_elements, criteria_func)]
```

A neat trick with `groupby` is run-length encoding in one line:

```python
[(c, len(list(cgen))) for c, cgen in groupby(some_string)]
```

will give you a list of 2-tuples where the first element is the char and the 2nd is the number of repetitions.
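
Wrapped into a pair of helpers, the one-liner might look like the sketch below. The names `rle_encode` and `rle_decode` are invented here, and the decoder is added only to show that the encoding round-trips:

```python
from itertools import groupby

def rle_encode(text):
    """Return (char, run_length) pairs for consecutive runs in text."""
    return [(char, len(list(run))) for char, run in groupby(text)]

def rle_decode(pairs):
    """Rebuild the original string from (char, run_length) pairs."""
    return "".join(char * count for char, count in pairs)

encoded = rle_encode("aaabccdd")
print(encoded)              # [('a', 3), ('b', 1), ('c', 2), ('d', 2)]
print(rle_decode(encoded))  # aaabccdd
```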

Edit: Note that this is what separates `itertools.groupby` from the SQL `GROUP BY` semantics: itertools doesn't (and in general can't) sort the iterator in advance, so groups with the same "key" aren't merged.

I would like to give another example where `groupby` without sorting does not give the expected result. It is adapted from an example by James Sulak:

```python
from itertools import groupby

things = [("vehicle", "bear"), ("animal", "duck"), ("animal", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]

for key, group in groupby(things, lambda x: x[0]):
    for thing in group:
        print("A %s is a %s." % (thing[1], key))
    print(" ")
```

The output is:

```
A bear is a vehicle.

A duck is a animal.
A cactus is a animal.

A speed boat is a vehicle.
A school bus is a vehicle.
```

There are two groups with the key "vehicle", whereas one might expect only one.
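
One way to get a single group per key, sketched here with `operator.itemgetter` as the key function, is to sort by that same key first. Collecting the result into a dictionary is a choice made for this illustration:

```python
from itertools import groupby
from operator import itemgetter

things = [("vehicle", "bear"), ("animal", "duck"), ("animal", "cactus"),
          ("vehicle", "speed boat"), ("vehicle", "school bus")]

category = itemgetter(0)  # key function: the first element of each tuple
grouped = {
    key: [name for _category, name in group]
    for key, group in groupby(sorted(things, key=category), key=category)
}
print(grouped)
# {'animal': ['duck', 'cactus'], 'vehicle': ['bear', 'speed boat', 'school bus']}
```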

How do I use Python's itertools.groupby()?

You can use `groupby` to group things to iterate over. You give `groupby` an iterable and an optional key function/callable by which to check the items as they come out of the iterable. It returns an iterator that yields two-tuples: the result of the key callable, and the actual items in another iterable. From the help:

```
groupby(iterable[, keyfunc]) -> create an iterator which returns
(key, sub-iterator) grouped by each value of key(value).
```

Here's an example of `groupby` using a coroutine to group by a count. It uses a key callable (in this case, the coroutine's `send` method) that returns the same count for `n` consecutive items, so each count comes out together with a grouped sub-iterator of elements:

```python
import itertools

def grouper(iterable, n):
    def coroutine(n):
        yield  # queue up coroutine
        for i in itertools.count():
            for j in range(n):
                yield i
    groups = coroutine(n)
    next(groups)  # queue up coroutine

    for c, objs in itertools.groupby(iterable, groups.send):
        yield c, list(objs)
    # or instead of materializing a list of objs, just:
    # return itertools.groupby(iterable, groups.send)

print(list(grouper(range(10), 3)))
```

prints

```
[(0, [0, 1, 2]), (1, [3, 4, 5]), (2, [6, 7, 8]), (3, [9])]
```

WARNING:

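The syntax `list(groupby(...))` won't work the way that you intend. Each group iterator shares the underlying iterable with the `groupby` object, so advancing to the next group invalidates the previous one. Using

```python
from itertools import groupby

for x in list(groupby(range(10))):
    print(list(x[1]))
```

will produce the following (this is what I observed; on recent CPython releases the last list may come out empty as well):

```
[]
[]
[]
[]
[]
[]
[]
[]
[]
[9]
```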

Instead of `list(groupby(...))`, try `[(k, list(g)) for k, g in groupby(...)]`, or if you use that syntax often,

```python
from itertools import groupby

def groupbylist(*args, **kwargs):
    return [(k, list(g)) for k, g in groupby(*args, **kwargs)]
```

and get access to the `groupby` functionality while avoiding those pesky (for small data) iterators altogether.
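
A quick usage sketch of that helper, repeated here so the snippet runs on its own. The key function `n // 2` is an arbitrary choice for the demonstration:

```python
from itertools import groupby

def groupbylist(*args, **kwargs):
    """Like groupby, but with every group materialized as a list."""
    return [(k, list(g)) for k, g in groupby(*args, **kwargs)]

pairs = groupbylist(range(6), key=lambda n: n // 2)
for key, members in pairs:
    print(key, members)
# 0 [0, 1]
# 1 [2, 3]
# 2 [4, 5]
```

Since every group is already a list, `pairs` can be iterated, indexed, or stored as many times as needed.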