You can extract the different parts out quite neatly using a regex pattern:
In [11]: df.row.str.extract('(?P<fips>\d{5})((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))')
Out[11]:
fips 1 state county state_code
0 00000 UNITED STATES UNITED STATES NaN NaN
1 01000 ALABAMA ALABAMA NaN NaN
2 01001 Autauga County, AL NaN Autauga County AL
3 01003 Baldwin County, AL NaN Baldwin County AL
4 01005 Barbour County, AL NaN Barbour County AL
[5 rows x 5 columns]
To explain the somewhat long regex:
(?P<fips>\d{5})
\d
) and names them "fips"
.The next part:
((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))
Does either (|
) one of two things:
(?P<state>[A-Z ]*$)
*
) of capital letters or spaces ([A-Z ]
) and names this "state"
before the end of the string ($
),or
(?P<county>.*?), (?P<state_code>[A-Z]{2}$))
.*
) thenstate_code
before the end of the string ($
).In the example:
Note that the first two rows hit the "state" (leaving NaN in the county and state_code columns), whilst the last three hit the county, state_code (leaving NaN in the state column).