I'm looking for an equivalent to sscanf() in Python. I want to parse /proc/net/* files. In C I could do something like this:
int matches = sscanf(
buffer,
"%*d: %64[0-9A-Fa-f]:%X %64[0-9A-Fa-f]:%X %*X %*X:%*X %*X:%*X %*X %*d %*d %ld %*512s\n",
local_addr, &local_port, rem_addr, &rem_port, &inode);
At first I thought of using str.split, but it doesn't split on each of the given characters; it treats the sep string as a single delimiter:
>>> import string
>>> lines = open("/proc/net/dev").readlines()
>>> for l in lines[2:]:
...     cols = l.split(string.whitespace + ":")
...     print(len(cols))
...
1
I expected this to print 17 rather than 1; it doesn't, for the reason explained above.
Is there a Python equivalent to sscanf (not RE), or a standard-library string-splitting function I'm not aware of that splits on any of a range of characters?
You could install pandas and use pandas.read_fwf for fixed-width format files. Example using /proc/net/arp:
In [230]: df = pandas.read_fwf("/proc/net/arp")
In [231]: print(df)
       IP address  HW type  Flags         HW address  Mask  Device
0   141.38.28.115      0x1    0x2  84:2b:2b:ad:e1:f4     *    eth0
1   141.38.28.203      0x1    0x2  c4:34:6b:5b:e4:7d     *    eth0
2   141.38.28.140      0x1    0x2  00:19:99:ce:00:19     *    eth0
3   141.38.28.202      0x1    0x2  90:1b:0e:14:a1:e3     *    eth0
4    141.38.28.17      0x1    0x2  90:1b:0e:1a:4b:41     *    eth0
5    141.38.28.60      0x1    0x2  00:19:99:cc:aa:58     *    eth0
6   141.38.28.233      0x1    0x2  90:1b:0e:8d:7a:c9     *    eth0
7    141.38.28.55      0x1    0x2  00:19:99:cc:ab:00     *    eth0
8   141.38.28.224      0x1    0x2  90:1b:0e:8d:7a:e2     *    eth0
9   141.38.28.148      0x1    0x0  4c:52:62:a8:08:2c     *    eth0
10  141.38.28.179      0x1    0x2  90:1b:0e:1a:4b:50     *    eth0
In [232]: df["HW address"]
Out[232]:
0 84:2b:2b:ad:e1:f4
1 c4:34:6b:5b:e4:7d
2 00:19:99:ce:00:19
3 90:1b:0e:14:a1:e3
4 90:1b:0e:1a:4b:41
5 00:19:99:cc:aa:58
6 90:1b:0e:8d:7a:c9
7 00:19:99:cc:ab:00
8 90:1b:0e:8d:7a:e2
9 4c:52:62:a8:08:2c
10 90:1b:0e:1a:4b:50
In [233]: df["HW address"][5]
Out[233]: '00:19:99:cc:aa:58'
By default it tries to figure out the format automagically, but there are options you can give for more explicit instructions (see documentation). There are also other IO routines in pandas that are powerful for other file formats.
If the separators are ':', you can split on ':', and then use x.strip() on the strings to get rid of any leading or trailing whitespace. int() will ignore the spaces.
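A minimal sketch of that idea, assuming a hypothetical address:port field in the hexadecimal style used by /proc/net/tcp (the sample string and variable names are illustrative, not taken from a real file):

```python
# Hypothetical sample field; real files may use different layouts.
field = " 0100007F : 1F90 "

# Split on the separator, then strip surrounding whitespace from each piece.
addr, port = [part.strip() for part in field.split(":")]

# int() tolerates leading and trailing whitespace, so strip() is optional
# for numeric pieces; base 16 is passed because this field is hexadecimal.
port_number = int(" 1F90 ", 16)

print(addr, port, port_number)
```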
There is an ActiveState recipe which implements a basic scanf http://code.activestate.com/recipes/502213-simple-scanf-implementation/
Upvoted orip's answer. I think it is sound advice to use re module. The Kodos application is helpful when approaching a complex regexp task with Python.
There is a Python 2 implementation by odiak.
When I'm in a C mood, I usually use zip and list comprehensions for scanf-like behavior. Like this:
input = '1 3.0 false hello'
(a, b, c, d) = [t(s) for t,s in zip((int,float,strtobool,str),input.split())]
print (a, b, c, d)
Note that for more complex format strings, you do need to use regular expressions:
import re
input = '1:3.0 false,hello'
(a, b, c, d) = [t(s) for t,s in zip((int,float,strtobool,str),re.search(r'^(\d+):([\d.]+) (\w+),(\w+)$',input).groups())]
print (a, b, c, d)
Note also that you need conversion functions for all types you want to convert. For example, above I used something like:
strtobool = lambda s: {'true': True, 'false': False}[s]
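The zip approach can be wrapped in a small helper. The following is only a sketch: scan_fields is a hypothetical name, and the length check is an added assumption, included because zip silently stops at the shorter of its inputs.

```python
def strtobool(s):
    """Convert the literal strings 'true'/'false' to a bool (KeyError otherwise)."""
    return {'true': True, 'false': False}[s]

def scan_fields(converters, text):
    """Split text on whitespace and apply one converter per field."""
    fields = text.split()
    # zip() would silently drop extra fields or converters, so check first.
    if len(fields) != len(converters):
        raise ValueError("expected %d fields, got %d" % (len(converters), len(fields)))
    return tuple(convert(field) for convert, field in zip(converters, fields))

a, b, c, d = scan_fields((int, float, strtobool, str), '1 3.0 false hello')
print(a, b, c, d)
```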
You can replace the ":" with a space and then split, e.g.:
>>> f = open("/proc/net/dev")
>>> for line in f:
...     line = line.replace(":", " ").split()
...     print(len(line))
No regex needed (for this case).
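To try this without reading /proc, here is a small sketch that uses a made-up line in the /proc/net/dev layout (an interface name, a colon, then 16 counters). The sample values are invented for illustration.

```python
# Hypothetical sample line in the layout of /proc/net/dev:
# interface name, a colon, then 16 counters.
sample = "  eth0: 1500 10 0 0 0 0 0 0 2500 20 0 0 0 0 0 0\n"

# Turn the colon into whitespace, then split on runs of whitespace.
cols = sample.replace(":", " ").split()
name = cols[0]
counters = [int(c) for c in cols[1:]]

print(len(cols), name, counters[0])
```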
There is also the parse module. parse() is designed to be the opposite of format() (the newer string formatting function in Python 2.6 and higher).
>>> from parse import parse
>>> parse('{} fish', '1')
>>> parse('{} fish', '1 fish')
<Result ('1',) {}>
>>> parse('{} fish', '2 fish')
<Result ('2',) {}>
>>> parse('{} fish', 'red fish')
<Result ('red',) {}>
>>> parse('{} fish', 'blue fish')
<Result ('blue',) {}>
You can parse with the re module using named groups. It won't convert the substrings to their actual data types (e.g. int), but it's very convenient when parsing strings.
Given this sample line from /proc/net/tcp:
line=" 0: 00000000:0203 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 335 1 c1674320 300 0 0 0"
An example that mimics your sscanf call, applied to that variable, could be:
import re
hex_digit_pattern = r"[\dA-Fa-f]"
pat = r"\d+: " + \
r"(?P<local_addr>HEX+):(?P<local_port>HEX+) " + \
r"(?P<rem_addr>HEX+):(?P<rem_port>HEX+) " + \
r"HEX+ HEX+:HEX+ HEX+:HEX+ HEX+ +\d+ +\d+ " + \
r"(?P<inode>\d+)"
pat = pat.replace("HEX", hex_digit_pattern)
values = re.search(pat, line).groupdict()
from pprint import pprint
pprint(values)
# prints:
# {'inode': '335',
# 'local_addr': '00000000',
# 'local_port': '0203',
# 'rem_addr': '00000000',
# 'rem_port': '0000'}
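Since the groups come back as strings, a conversion step can follow the match. This sketch copies the printed dictionary literally so that it runs on its own; the converters table is a hypothetical helper, and treating every field except the inode as hexadecimal is an assumption based on the sscanf format in the question.

```python
# The string values produced by the regex example above, copied literally.
values = {'inode': '335', 'local_addr': '00000000', 'local_port': '0203',
          'rem_addr': '00000000', 'rem_port': '0000'}

# Hypothetical converter table: the inode is decimal, everything else is
# assumed to be hexadecimal (as %X in the original sscanf format suggests).
converters = {'inode': int}
typed = {key: converters.get(key, lambda s: int(s, 16))(text)
         for key, text in values.items()}

print(typed)
```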
Update: The Python documentation for its regex module, re, includes a section on simulating scanf, which I found more useful than any of the answers above.
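As a rough illustration of that approach, the sketch below maps a few scanf tokens to regular expressions in the spirit of that documentation section. simple_sscanf and TOKENS are hypothetical names, only %d, %x, %f and %s are handled, and whitespace handling is simplified, so treat it as a starting point rather than a faithful sscanf.

```python
import re

# Illustrative subset of scanf tokens, each paired with a capturing regex
# and a converter for the captured text.
TOKENS = {
    "%d": (r"([-+]?\d+)", int),
    "%x": (r"([-+]?(?:0[xX])?[\dA-Fa-f]+)", lambda s: int(s, 16)),
    "%f": (r"([-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?)", float),
    "%s": (r"(\S+)", str),
}

def simple_sscanf(text, fmt):
    """Return a tuple of converted fields, or None if text does not match fmt."""
    pattern, converters = "", []
    for piece in re.split(r"(%[dxfs])", fmt):
        if piece in TOKENS:
            regex, convert = TOKENS[piece]
            pattern += regex
            converters.append(convert)
        elif piece.isspace():
            pattern += r"\s+"   # simplification: some whitespace is required
        else:
            pattern += re.escape(piece)   # literal text (may be empty)
    match = re.match(pattern, text)
    if match is None:
        return None
    return tuple(c(g) for c, g in zip(converters, match.groups()))

print(simple_sscanf("1 3.14 Hello", "%d %f %s"))
print(simple_sscanf("0100007F:1F90", "%x:%x"))
```

Unlike C's sscanf, this returns None on any mismatch instead of a count of the fields that were converted.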
There is an example in the official Python docs about how to use sscanf from libc:
# import libc
import os
from ctypes import CDLL, byref, c_float, c_int, cdll, create_string_buffer
if os.name == "nt":
    libc = cdll.msvcrt
else:
    # assuming a Linux-like environment where the C library is libc.so.6
    libc = CDLL("libc.so.6")  # equivalent: cdll.LoadLibrary("libc.so.6")
# allocate vars
i = c_int()
f = c_float()
s = create_string_buffer(b'\000' * 32)
# parse with sscanf; both the input and the format string must be bytes
libc.sscanf(b"1 3.14 Hello", b"%d %f %s", byref(i), byref(f), s)
# read the parsed values
i.value # 1
f.value # approximately 3.14 (single-precision C float)
s.value # b'Hello'
You can split on a range of characters using the re module.
>>> import re
>>> r = re.compile('[ \t\n\r:]+')
>>> r.split("abc:def ghi")
['abc', 'def', 'ghi']
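One thing to watch for with this pattern: if the line starts or ends with separator characters, re.split returns empty strings at the ends. A small sketch with a made-up input line shows two ways to deal with that.

```python
import re

splitter = re.compile(r"[ \t\n\r:]+")

# Hypothetical line with leading whitespace and a trailing newline.
line = "  eth0: 1500 10\n"

raw = splitter.split(line)        # keeps '' at both ends
fields = [f for f in raw if f]    # drop the empty strings
# Alternative: strip the separator characters first, then split.
fields_alt = splitter.split(line.strip(" \t\n\r:"))

print(raw)
print(fields)
```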
Source: Stackoverflow.com