[c++] Using strtok with a std::string

I have a string that I would like to tokenize. But the C strtok() function requires my string to be a char*. How can I do this simply?

I tried:

token = strtok(str.c_str(), " "); 

which fails because it turns it into a const char*, not a char*

This question is related to c++ strtok

The answer is


Assuming that by "string" you're talking about std::string in C++, you might have a look at the Tokenizer package in Boost.


It fails because str.c_str() returns constant string but char * strtok (char * str, const char * delimiters ) requires volatile string. So you need to use *const_cast< char > inorder to make it voletile. I am giving you a complete but small program to tokenize the string using C strtok() function.

   #include <iostream>
   #include <string>
   #include <string.h> 
   using namespace std;
   int main() {
       string s="20#6 5, 3";
       // strtok requires volatile string as it modifies the supplied string in order to tokenize it 
       char *str=const_cast< char *>(s.c_str());    
       char *tok;
       tok=strtok(str, "#, " );     
       int arr[4], i=0;    
       while(tok!=NULL){
           arr[i++]=stoi(tok);
           tok=strtok(NULL, "#, " );
       }     
       for(int i=0; i<4; i++) cout<<arr[i]<<endl;


       return 0;
   }

NOTE: strtok may not be suitable in all situation as the string passed to function gets modified by being broken into smaller strings. Pls., ref to get better understanding of strtok functionality.

How strtok works

Added few print statement to better understand the changes happning to string in each call to strtok and how it returns token.

#include <iostream>
#include <string>
#include <string.h> 
using namespace std;
int main() {
    string s="20#6 5, 3";
    char *str=const_cast< char *>(s.c_str());    
    char *tok;
    cout<<"string: "<<s<<endl;
    tok=strtok(str, "#, " );     
    cout<<"String: "<<s<<"\tToken: "<<tok<<endl;   
    while(tok!=NULL){
        tok=strtok(NULL, "#, " );
        cout<<"String: "<<s<<"\t\tToken: "<<tok<<endl;
    }
    return 0;
}

Output:

string: 20#6 5, 3

String: 206 5, 3    Token: 20
String: 2065, 3     Token: 6
String: 2065 3      Token: 5
String: 2065 3      Token: 3
String: 2065 3      Token: 

strtok iterate over the string first call find the non delemetor character (2 in this case) and marked it as token start then continues scan for a delimeter and replace it with null charater (# gets replaced in actual string) and return start which points to token start character( i.e., it return token 20 which is terminated by null). In subsequent call it start scaning from the next character and returns token if found else null. subsecuntly it returns token 6, 5, 3.


First off I would say use boost tokenizer.
Alternatively if your data is space separated then the string stream library is very useful.

But both the above have already been covered.
So as a third C-Like alternative I propose copying the std::string into a buffer for modification.

std::string   data("The data I want to tokenize");

// Create a buffer of the correct length:
std::vector<char>  buffer(data.size()+1);

// copy the string into the buffer
strcpy(&buffer[0],data.c_str());

// Tokenize
strtok(&buffer[0]," ");

EDIT: usage of const cast is only used to demonstrate the effect of strtok() when applied to a pointer returned by string::c_str().

You should not use strtok() since it modifies the tokenized string which may lead to undesired, if not undefined, behaviour as the C string "belongs" to the string instance.

#include <string>
#include <iostream>

int main(int ac, char **av)
{
    std::string theString("hello world");
    std::cout << theString << " - " << theString.size() << std::endl;

    //--- this cast *only* to illustrate the effect of strtok() on std::string 
    char *token = strtok(const_cast<char  *>(theString.c_str()), " ");

    std::cout << theString << " - " << theString.size() << std::endl;

    return 0;
}

After the call to strtok(), the space was "removed" from the string, or turned down to a non-printable character, but the length remains unchanged.

>./a.out
hello world - 11
helloworld - 11

Therefore you have to resort to native mechanism, duplication of the string or an third party library as previously mentioned.


  1. If boost is available on your system (I think it's standard on most Linux distros these days), it has a Tokenizer class you can use.

  2. If not, then a quick Google turns up a hand-rolled tokenizer for std::string that you can probably just copy and paste. It's very short.

  3. And, if you don't like either of those, then here's a split() function I wrote to make my life easier. It'll break a string into pieces using any of the chars in "delim" as separators. Pieces are appended to the "parts" vector:

    void split(const string& str, const string& delim, vector<string>& parts) {
      size_t start, end = 0;
      while (end < str.size()) {
        start = end;
        while (start < str.size() && (delim.find(str[start]) != string::npos)) {
          start++;  // skip initial whitespace
        }
        end = start;
        while (end < str.size() && (delim.find(str[end]) == string::npos)) {
          end++; // skip to end of word
        }
        if (end-start != 0) {  // just ignore zero-length strings.
          parts.push_back(string(str, start, end-start));
        }
      }
    }
    

First off I would say use boost tokenizer.
Alternatively if your data is space separated then the string stream library is very useful.

But both the above have already been covered.
So as a third C-Like alternative I propose copying the std::string into a buffer for modification.

std::string   data("The data I want to tokenize");

// Create a buffer of the correct length:
std::vector<char>  buffer(data.size()+1);

// copy the string into the buffer
strcpy(&buffer[0],data.c_str());

// Tokenize
strtok(&buffer[0]," ");

#include <iostream>
#include <string>
#include <sstream>
int main(){
    std::string myText("some-text-to-tokenize");
    std::istringstream iss(myText);
    std::string token;
    while (std::getline(iss, token, '-'))
    {
        std::cout << token << std::endl;
    }
    return 0;
}

Or, as mentioned, use boost for more flexibility.


If you don't mind open source, you could use the subbuffer and subparser classes from https://github.com/EdgeCast/json_parser. The original string is left intact, there is no allocation and no copying of data. I have not compiled the following so there may be errors.

std::string input_string("hello world");
subbuffer input(input_string);
subparser flds(input, ' ', subparser::SKIP_EMPTY);
while (!flds.empty())
{
    subbuffer fld = flds.next();
    // do something with fld
}

// or if you know it is only two fields
subbuffer fld1 = input.before(' ');
subbuffer fld2 = input.sub(fld1.length() + 1).ltrim(' ');

  1. If boost is available on your system (I think it's standard on most Linux distros these days), it has a Tokenizer class you can use.

  2. If not, then a quick Google turns up a hand-rolled tokenizer for std::string that you can probably just copy and paste. It's very short.

  3. And, if you don't like either of those, then here's a split() function I wrote to make my life easier. It'll break a string into pieces using any of the chars in "delim" as separators. Pieces are appended to the "parts" vector:

    void split(const string& str, const string& delim, vector<string>& parts) {
      size_t start, end = 0;
      while (end < str.size()) {
        start = end;
        while (start < str.size() && (delim.find(str[start]) != string::npos)) {
          start++;  // skip initial whitespace
        }
        end = start;
        while (end < str.size() && (delim.find(str[end]) == string::npos)) {
          end++; // skip to end of word
        }
        if (end-start != 0) {  // just ignore zero-length strings.
          parts.push_back(string(str, start, end-start));
        }
      }
    }
    

#include <iostream>
#include <string>
#include <sstream>
int main(){
    std::string myText("some-text-to-tokenize");
    std::istringstream iss(myText);
    std::string token;
    while (std::getline(iss, token, '-'))
    {
        std::cout << token << std::endl;
    }
    return 0;
}

Or, as mentioned, use boost for more flexibility.


If you don't mind open source, you could use the subbuffer and subparser classes from https://github.com/EdgeCast/json_parser. The original string is left intact, there is no allocation and no copying of data. I have not compiled the following so there may be errors.

std::string input_string("hello world");
subbuffer input(input_string);
subparser flds(input, ' ', subparser::SKIP_EMPTY);
while (!flds.empty())
{
    subbuffer fld = flds.next();
    // do something with fld
}

// or if you know it is only two fields
subbuffer fld1 = input.before(' ');
subbuffer fld2 = input.sub(fld1.length() + 1).ltrim(' ');

Duplicate the string, tokenize it, then free it.

char *dup = strdup(str.c_str());
token = strtok(dup, " ");
free(dup);

Assuming that by "string" you're talking about std::string in C++, you might have a look at the Tokenizer package in Boost.


There is a more elegant solution.

With std::string you can use resize() to allocate a suitably large buffer, and &s[0] to get a pointer to the internal buffer.

At this point many fine folks will jump and yell at the screen. But this is the fact. About 2 years ago

the library working group decided (meeting at Lillehammer) that just like for std::vector, std::string should also formally, not just in practice, have a guaranteed contiguous buffer.

The other concern is does strtok() increases the size of the string. The MSDN documentation says:

Each call to strtok modifies strToken by inserting a null character after the token returned by that call.

But this is not correct. Actually the function replaces the first occurrence of a separator character with \0. No change in the size of the string. If we have this string:

one-two---three--four

we will end up with

one\0two\0--three\0-four

So my solution is very simple:


std::string str("some-text-to-split");
char seps[] = "-";
char *token;

token = strtok( &str[0], seps );
while( token != NULL )
{
   /* Do your thing */
   token = strtok( NULL, seps );
}

Read the discussion on http://www.archivum.info/comp.lang.c++/2008-05/02889/does_std::string_have_something_like_CString::GetBuffer


I suppose the language is C, or C++...

strtok, IIRC, replace separators with \0. That's what it cannot use a const string. To workaround that "quickly", if the string isn't huge, you can just strdup() it. Which is wise if you need to keep the string unaltered (what the const suggest...).

On the other hand, you might want to use another tokenizer, perhaps hand rolled, less violent on the given argument.


Assuming that by "string" you're talking about std::string in C++, you might have a look at the Tokenizer package in Boost.


  1. If boost is available on your system (I think it's standard on most Linux distros these days), it has a Tokenizer class you can use.

  2. If not, then a quick Google turns up a hand-rolled tokenizer for std::string that you can probably just copy and paste. It's very short.

  3. And, if you don't like either of those, then here's a split() function I wrote to make my life easier. It'll break a string into pieces using any of the chars in "delim" as separators. Pieces are appended to the "parts" vector:

    void split(const string& str, const string& delim, vector<string>& parts) {
      size_t start, end = 0;
      while (end < str.size()) {
        start = end;
        while (start < str.size() && (delim.find(str[start]) != string::npos)) {
          start++;  // skip initial whitespace
        }
        end = start;
        while (end < str.size() && (delim.find(str[end]) == string::npos)) {
          end++; // skip to end of word
        }
        if (end-start != 0) {  // just ignore zero-length strings.
          parts.push_back(string(str, start, end-start));
        }
      }
    }
    

#include <iostream>
#include <string>
#include <sstream>
int main(){
    std::string myText("some-text-to-tokenize");
    std::istringstream iss(myText);
    std::string token;
    while (std::getline(iss, token, '-'))
    {
        std::cout << token << std::endl;
    }
    return 0;
}

Or, as mentioned, use boost for more flexibility.


First off I would say use boost tokenizer.
Alternatively if your data is space separated then the string stream library is very useful.

But both the above have already been covered.
So as a third C-Like alternative I propose copying the std::string into a buffer for modification.

std::string   data("The data I want to tokenize");

// Create a buffer of the correct length:
std::vector<char>  buffer(data.size()+1);

// copy the string into the buffer
strcpy(&buffer[0],data.c_str());

// Tokenize
strtok(&buffer[0]," ");

EDIT: usage of const cast is only used to demonstrate the effect of strtok() when applied to a pointer returned by string::c_str().

You should not use strtok() since it modifies the tokenized string which may lead to undesired, if not undefined, behaviour as the C string "belongs" to the string instance.

#include <string>
#include <iostream>

int main(int ac, char **av)
{
    std::string theString("hello world");
    std::cout << theString << " - " << theString.size() << std::endl;

    //--- this cast *only* to illustrate the effect of strtok() on std::string 
    char *token = strtok(const_cast<char  *>(theString.c_str()), " ");

    std::cout << theString << " - " << theString.size() << std::endl;

    return 0;
}

After the call to strtok(), the space was "removed" from the string, or turned down to a non-printable character, but the length remains unchanged.

>./a.out
hello world - 11
helloworld - 11

Therefore you have to resort to native mechanism, duplication of the string or an third party library as previously mentioned.


With C++17 str::string receives data() overload that returns a pointer to modifieable buffer so string can be used in strtok directly without any hacks:

#include <string>
#include <iostream>
#include <cstring>
#include <cstdlib>

int main()
{
    ::std::string text{"pop dop rop"};
    char const * const psz_delimiter{" "};
    char * psz_token{::std::strtok(text.data(), psz_delimiter)};
    while(nullptr != psz_token)
    {
        ::std::cout << psz_token << ::std::endl;
        psz_token = std::strtok(nullptr, psz_delimiter);
    }
    return EXIT_SUCCESS;
}

output

pop
dop
rop


Duplicate the string, tokenize it, then free it.

char *dup = strdup(str.c_str());
token = strtok(dup, " ");
free(dup);

I suppose the language is C, or C++...

strtok, IIRC, replace separators with \0. That's what it cannot use a const string. To workaround that "quickly", if the string isn't huge, you can just strdup() it. Which is wise if you need to keep the string unaltered (what the const suggest...).

On the other hand, you might want to use another tokenizer, perhaps hand rolled, less violent on the given argument.


There is a more elegant solution.

With std::string you can use resize() to allocate a suitably large buffer, and &s[0] to get a pointer to the internal buffer.

At this point many fine folks will jump and yell at the screen. But this is the fact. About 2 years ago

the library working group decided (meeting at Lillehammer) that just like for std::vector, std::string should also formally, not just in practice, have a guaranteed contiguous buffer.

The other concern is does strtok() increases the size of the string. The MSDN documentation says:

Each call to strtok modifies strToken by inserting a null character after the token returned by that call.

But this is not correct. Actually the function replaces the first occurrence of a separator character with \0. No change in the size of the string. If we have this string:

one-two---three--four

we will end up with

one\0two\0--three\0-four

So my solution is very simple:


std::string str("some-text-to-split");
char seps[] = "-";
char *token;

token = strtok( &str[0], seps );
while( token != NULL )
{
   /* Do your thing */
   token = strtok( NULL, seps );
}

Read the discussion on http://www.archivum.info/comp.lang.c++/2008-05/02889/does_std::string_have_something_like_CString::GetBuffer


EDIT: usage of const cast is only used to demonstrate the effect of strtok() when applied to a pointer returned by string::c_str().

You should not use strtok() since it modifies the tokenized string which may lead to undesired, if not undefined, behaviour as the C string "belongs" to the string instance.

#include <string>
#include <iostream>

int main(int ac, char **av)
{
    std::string theString("hello world");
    std::cout << theString << " - " << theString.size() << std::endl;

    //--- this cast *only* to illustrate the effect of strtok() on std::string 
    char *token = strtok(const_cast<char  *>(theString.c_str()), " ");

    std::cout << theString << " - " << theString.size() << std::endl;

    return 0;
}

After the call to strtok(), the space was "removed" from the string, or turned down to a non-printable character, but the length remains unchanged.

>./a.out
hello world - 11
helloworld - 11

Therefore you have to resort to native mechanism, duplication of the string or an third party library as previously mentioned.


Assuming that by "string" you're talking about std::string in C++, you might have a look at the Tokenizer package in Boost.


First off I would say use boost tokenizer.
Alternatively if your data is space separated then the string stream library is very useful.

But both the above have already been covered.
So as a third C-Like alternative I propose copying the std::string into a buffer for modification.

std::string   data("The data I want to tokenize");

// Create a buffer of the correct length:
std::vector<char>  buffer(data.size()+1);

// copy the string into the buffer
strcpy(&buffer[0],data.c_str());

// Tokenize
strtok(&buffer[0]," ");

With C++17 str::string receives data() overload that returns a pointer to modifieable buffer so string can be used in strtok directly without any hacks:

#include <string>
#include <iostream>
#include <cstring>
#include <cstdlib>

int main()
{
    ::std::string text{"pop dop rop"};
    char const * const psz_delimiter{" "};
    char * psz_token{::std::strtok(text.data(), psz_delimiter)};
    while(nullptr != psz_token)
    {
        ::std::cout << psz_token << ::std::endl;
        psz_token = std::strtok(nullptr, psz_delimiter);
    }
    return EXIT_SUCCESS;
}

output

pop
dop
rop


Typecasting to (char*) got it working for me!

token = strtok((char *)str.c_str(), " "); 

Duplicate the string, tokenize it, then free it.

char *dup = strdup(str.c_str());
token = strtok(dup, " ");
free(dup);

I suppose the language is C, or C++...

strtok, IIRC, replace separators with \0. That's what it cannot use a const string. To workaround that "quickly", if the string isn't huge, you can just strdup() it. Which is wise if you need to keep the string unaltered (what the const suggest...).

On the other hand, you might want to use another tokenizer, perhaps hand rolled, less violent on the given argument.


Duplicate the string, tokenize it, then free it.

char *dup = strdup(str.c_str());
token = strtok(dup, " ");
free(dup);

It fails because str.c_str() returns constant string but char * strtok (char * str, const char * delimiters ) requires volatile string. So you need to use *const_cast< char > inorder to make it voletile. I am giving you a complete but small program to tokenize the string using C strtok() function.

   #include <iostream>
   #include <string>
   #include <string.h> 
   using namespace std;
   int main() {
       string s="20#6 5, 3";
       // strtok requires volatile string as it modifies the supplied string in order to tokenize it 
       char *str=const_cast< char *>(s.c_str());    
       char *tok;
       tok=strtok(str, "#, " );     
       int arr[4], i=0;    
       while(tok!=NULL){
           arr[i++]=stoi(tok);
           tok=strtok(NULL, "#, " );
       }     
       for(int i=0; i<4; i++) cout<<arr[i]<<endl;


       return 0;
   }

NOTE: strtok may not be suitable in all situation as the string passed to function gets modified by being broken into smaller strings. Pls., ref to get better understanding of strtok functionality.

How strtok works

Added few print statement to better understand the changes happning to string in each call to strtok and how it returns token.

#include <iostream>
#include <string>
#include <string.h> 
using namespace std;
int main() {
    string s="20#6 5, 3";
    char *str=const_cast< char *>(s.c_str());    
    char *tok;
    cout<<"string: "<<s<<endl;
    tok=strtok(str, "#, " );     
    cout<<"String: "<<s<<"\tToken: "<<tok<<endl;   
    while(tok!=NULL){
        tok=strtok(NULL, "#, " );
        cout<<"String: "<<s<<"\t\tToken: "<<tok<<endl;
    }
    return 0;
}

Output:

string: 20#6 5, 3

String: 206 5, 3    Token: 20
String: 2065, 3     Token: 6
String: 2065 3      Token: 5
String: 2065 3      Token: 3
String: 2065 3      Token: 

strtok iterate over the string first call find the non delemetor character (2 in this case) and marked it as token start then continues scan for a delimeter and replace it with null charater (# gets replaced in actual string) and return start which points to token start character( i.e., it return token 20 which is terminated by null). In subsequent call it start scaning from the next character and returns token if found else null. subsecuntly it returns token 6, 5, 3.


EDIT: usage of const cast is only used to demonstrate the effect of strtok() when applied to a pointer returned by string::c_str().

You should not use strtok() since it modifies the tokenized string which may lead to undesired, if not undefined, behaviour as the C string "belongs" to the string instance.

#include <string>
#include <iostream>

int main(int ac, char **av)
{
    std::string theString("hello world");
    std::cout << theString << " - " << theString.size() << std::endl;

    //--- this cast *only* to illustrate the effect of strtok() on std::string 
    char *token = strtok(const_cast<char  *>(theString.c_str()), " ");

    std::cout << theString << " - " << theString.size() << std::endl;

    return 0;
}

After the call to strtok(), the space was "removed" from the string, or turned down to a non-printable character, but the length remains unchanged.

>./a.out
hello world - 11
helloworld - 11

Therefore you have to resort to native mechanism, duplication of the string or an third party library as previously mentioned.