[c++] Parse (split) a string in C++ using string delimiter (standard C++)

I am parsing a string in C++ using the following:

using namespace std;

string parsed,input="text to be parsed";
stringstream input_stringstream(input);

if (getline(input_stringstream,parsed,' '))
{
     // do some processing.
}

Parsing with a single char delimiter works fine. But what if I want to use a string as the delimiter?

Example: I want to split:

scott>=tiger

with >= as delimiter so that I can get scott and tiger.

This question is related to c++ parsing split token tokenize

The answer is


I would use boost::tokenizer. Here's documentation explaining how to make an appropriate tokenizer function: http://www.boost.org/doc/libs/1_52_0/libs/tokenizer/tokenizerfunction.htm

Here's one that works for your case.

struct my_tokenizer_func
{
    template<typename It>
    bool operator()(It& next, It end, std::string & tok)
    {
        if (next == end)
            return false;
        char const * del = ">=";
        auto pos = std::search(next, end, del, del + 2);
        tok.assign(next, pos);
        next = pos;
        if (next != end)
            std::advance(next, 2);
        return true;
    }

    void reset() {}
};

int main()
{
    std::string to_be_parsed = "1) one>=2) two>=3) three>=4) four";
    for (auto i : boost::tokenizer<my_tokenizer_func>(to_be_parsed))
        std::cout << i << '\n';
}

This works for string (or single-character) delimiters, provided the first character of the delimiter never appears inside a token. Don't forget to #include <sstream>.

std::string input = "Alfa=,+Bravo=,+Charlie=,+Delta";
std::string delimiter = "=,+"; 
std::istringstream ss(input);
std::string token;
std::string::iterator it;

while(std::getline(ss, token, *(it = delimiter.begin()))) {
    std::cout << token << " " << '\n'; // Token is extracted using '='
    while(++it != delimiter.end()) ss.get(); // Skip the rest of the delimiter, if any: ",+"
}

The first while loop extracts a token using the first character of the string delimiter. The second while loop skips the rest of the delimiter and stops at the beginning of the next token.


If you do not want to modify the string (as in the answer by Vincenzo Pii) and want to output the last token as well, you may want to use this approach:

inline std::vector<std::string> splitString( const std::string &s, const std::string &delimiter ){
    std::vector<std::string> ret;
    size_t start = 0;
    size_t end = 0;
    size_t len = 0;
    std::string token;
    do{ end = s.find(delimiter,start); 
        len = end - start;
        token = s.substr(start, len);
        ret.emplace_back( token );
        start += len + delimiter.length();
        std::cout << token << std::endl;
    }while ( end != std::string::npos );
    return ret;
}

A very simple/naive approach:

vector<string> words_seperate(string s){
    vector<string> ans;
    string w="";
    for(auto i:s){
        if(i==' '){
           ans.push_back(w);
           w="";
        }
        else{
           w+=i;
        }
    }
    ans.push_back(w);
    return ans;
}

Or you can use boost library split function:

vector<string> result; 
boost::split(result, input, boost::is_any_of("\t"));

Or you can try strtok:

char str[] = "DELIMIT-ME-C++"; 
char *token = strtok(str, "-"); 
while (token) 
{ 
    cout<<token; 
    token = strtok(NULL, "-"); 
} 

Or you can do this:

char split_with=' ';
vector<string> words;
string token; 
stringstream ss(our_string);
while(getline(ss , token , split_with)) words.push_back(token);

Container splitR(const std::string& input, const std::string& delims) {
    Container out;
    size_t delims_len = delims.size();
    size_t begIdx = 0; // size_t rather than int, so positions returned by find are not narrowed
    auto endIdx = input.find(delims, begIdx);
    if (endIdx == std::string::npos && input.size() != 0) {
        insert_in_container(out, input);
    }
    while (endIdx != std::string::npos) {
        insert_in_container(out, input.substr(begIdx, endIdx - begIdx));
        begIdx = endIdx + delims_len;
        endIdx = input.find(delims, begIdx);
        if (endIdx == std::string::npos) {
            insert_in_container(out, input.substr(begIdx, input.length() - begIdx));
        }
    }
    return out;
}

This method uses std::string::find without mutating the original string by remembering the beginning and end of the previous substring token.

#include <iostream>
#include <string>

int main()
{
    std::string s = "scott>=tiger";
    std::string delim = ">=";

    std::string::size_type start = 0;
    auto end = s.find(delim);
    while (end != std::string::npos)
    {
        std::cout << s.substr(start, end - start) << std::endl;
        start = end + delim.length();
        end = s.find(delim, start);
    }

    std::cout << s.substr(start, end);
}
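
If the input is known to contain the delimiter at most once, as in the question's scott>=tiger example, a single find call is enough. The following is a minimal sketch under that assumption; split_once is a hypothetical helper name, not something from the answers above.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: splits s at the first occurrence of delim.
// If delim does not occur, the whole string is returned as the first
// element and the second element is empty.
std::pair<std::string, std::string> split_once(const std::string& s,
                                               const std::string& delim)
{
    const std::string::size_type pos = s.find(delim);
    if (pos == std::string::npos)
        return {s, std::string()};
    // Left part ends before the delimiter, right part starts after it.
    return {s.substr(0, pos), s.substr(pos + delim.length())};
}
```

Calling split_once("scott>=tiger", ">=") yields the pair ("scott", "tiger"). Any further occurrences of the delimiter stay in the second element.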

std::vector<std::string> parse(std::string str,std::string delim){
    std::vector<std::string> tokens;
    char *str_c = strdup(str.c_str()); 
    char* token = NULL;

    token = strtok(str_c, delim.c_str()); 
    while (token != NULL) { 
        tokens.push_back(std::string(token));  
        token = strtok(NULL, delim.c_str()); 
    }

    free(str_c); // strdup allocates with malloc, so release with free (declared in <cstdlib>), not delete[]

    return tokens;
}

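#include<iostream>
#include<algorithm>
#include<string>
#include<vector>
using namespace std;

// Number of delimiter characters in str; the token count is this plus one.
int split_count(string str,char delimit){
    return count(str.begin(),str.end(),delimit);
}

// Writes the tokens of str into res, which must have room for split_count(str,delimit)+1 entries.
void split(string str,char delimit,string res[]){
    size_t a=0;
    int i=0;
    while(a<str.size()){
        size_t pos=str.find(delimit,a);       // search from the current offset, not from the start
        if(pos==string::npos) pos=str.size(); // the last token runs to the end of the string
        res[i]=str.substr(a,pos-a);           // substr takes (offset, length)
        a=pos+1;
        i++;
    }
}

int main(){
    string a="abc.xyz.mno.def";
    int x=split_count(a,'.')+1;
    vector<string> res(x);                    // a variable-length array (string res[x]) is not standard C++
    split(a,'.',res.data());

    for(int i=0;i<x;i++)
        cout<<res[i]<<endl;
    return 0;
}

P.S: The search has to start at the current offset (str.find(delimit,a)). Searching from the start of the string gives correct results only if all the strings after splitting have equal length.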


Function:

std::vector<std::string> WSJCppCore::split(const std::string& sWhat, const std::string& sDelim) {
    std::vector<std::string> vRet;
    size_t nPos = 0;
    size_t nLen = sWhat.length();
    size_t nDelimLen = sDelim.length();
    while (nPos < nLen) {
        std::size_t nFoundPos = sWhat.find(sDelim, nPos);
        if (nFoundPos != std::string::npos) {
            std::string sToken = sWhat.substr(nPos, nFoundPos - nPos);
            vRet.push_back(sToken);
            nPos = nFoundPos + nDelimLen;
            if (nFoundPos + nDelimLen == nLen) { // last delimiter
                vRet.push_back("");
            }
        } else {
            std::string sToken = sWhat.substr(nPos, nLen - nPos);
            vRet.push_back(sToken);
            break;
        }
    }
    return vRet;
}

Unit-tests:

bool UnitTestSplit::run() {
bool bTestSuccess = true;

    struct LTest {
        LTest(
            const std::string &sStr,
            const std::string &sDelim,
            const std::vector<std::string> &vExpectedVector
        ) {
            this->sStr = sStr;
            this->sDelim = sDelim;
            this->vExpectedVector = vExpectedVector;
        };
        std::string sStr;
        std::string sDelim;
        std::vector<std::string> vExpectedVector;
    };
    std::vector<LTest> tests;
    tests.push_back(LTest("1 2 3 4 5", " ", {"1", "2", "3", "4", "5"}));
    tests.push_back(LTest("|1f|2?|3%^|44354|5kdasjfdre|2", "|", {"", "1f", "2?", "3%^", "44354", "5kdasjfdre", "2"}));
    tests.push_back(LTest("|1f|2?|3%^|44354|5kdasjfdre|", "|", {"", "1f", "2?", "3%^", "44354", "5kdasjfdre", ""}));
    tests.push_back(LTest("some1 => some2 => some3", "=>", {"some1 ", " some2 ", " some3"}));
    tests.push_back(LTest("some1 => some2 => some3 =>", "=>", {"some1 ", " some2 ", " some3 ", ""}));

    for (int i = 0; i < tests.size(); i++) {
        LTest test = tests[i];
        std::string sPrefix = "test" + std::to_string(i) + "(\"" + test.sStr + "\")";
        std::vector<std::string> vSplitted = WSJCppCore::split(test.sStr, test.sDelim);
        compareN(bTestSuccess, sPrefix + ": size", vSplitted.size(), test.vExpectedVector.size());
        int nMin = std::min(vSplitted.size(), test.vExpectedVector.size());
        for (int n = 0; n < nMin; n++) {
            compareS(bTestSuccess, sPrefix + ", element: " + std::to_string(n), vSplitted[n], test.vExpectedVector[n]);
        }
    }

    return bTestSuccess;
}

An answer is already there, but the selected answer uses the erase function, which is very costly for a very big string (several MB). Therefore I use the function below.

vector<string> split(const string& i_str, const string& i_delim)
{
    vector<string> result;
    
    size_t found = i_str.find(i_delim);
    size_t startIndex = 0;

    while(found != string::npos)
    {
        result.push_back(string(i_str.begin()+startIndex, i_str.begin()+found));
        startIndex = found + i_delim.size();
        found = i_str.find(i_delim, startIndex);
    }
    if(startIndex != i_str.size())
        result.push_back(string(i_str.begin()+startIndex, i_str.end()));
    return result;      
}
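
To push the cost argument one step further, the tokens themselves can be returned as views instead of copies. The sketch below assumes a C++17 compiler (for std::string_view); split_views is a hypothetical name. Unlike the function above, it keeps a trailing empty token.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Sketch (requires C++17): each token is a view into `input`, so no
// characters are copied. The views are valid only while the underlying
// string stays alive and unmodified.
std::vector<std::string_view> split_views(std::string_view input,
                                          std::string_view delim)
{
    std::vector<std::string_view> result;
    if (delim.empty()) {             // guard: an empty delimiter would never advance
        result.push_back(input);
        return result;
    }
    std::size_t start = 0;
    std::size_t found = input.find(delim);
    while (found != std::string_view::npos) {
        result.push_back(input.substr(start, found - start));
        start = found + delim.size();
        found = input.find(delim, start);
    }
    result.push_back(input.substr(start));   // last token, possibly empty
    return result;
}
```

The caller has to keep the original string alive for as long as the returned views are used. If that cannot be guaranteed, copying into std::string as in the answer above is the safer choice.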

This code splits lines from a text and adds each one to a vector.

vector<string> split(char *phrase, string delimiter){
    vector<string> list;
    string s = string(phrase);
    size_t pos = 0;
    string token;
    while ((pos = s.find(delimiter)) != string::npos) {
        token = s.substr(0, pos);
        list.push_back(token);
        s.erase(0, pos + delimiter.length());
    }
    list.push_back(s);
    return list;
}

Called by:

vector<string> listFilesMax = split(buffer, "\n");

As a bonus, here is a code example of a split function and macro that is easy to use and lets you choose the container type:

#include <iostream>
#include <vector>
#include <string>

#define split(str, delim, type) (split_fn<type<std::string>>(str, delim))
 
template <typename Container>
Container split_fn(const std::string& str, char delim = ' ') {
    Container cont{};
    std::size_t current, previous = 0;
    current = str.find(delim);
    while (current != std::string::npos) {
        cont.push_back(str.substr(previous, current - previous));
        previous = current + 1;
        current = str.find(delim, previous);
    }
    cont.push_back(str.substr(previous, current - previous));
    
    return cont;
}

int main() {
    
    auto test = std::string{"This is a great test"};
    auto res = split(test, ' ', std::vector);
    
    for(auto &i : res) {
        std::cout << i << ", "; // prints: This, is, a, great, test, 
    }
    
    
    return 0;
}

Since C++11 it can be done like this:

std::vector<std::string> splitString(const std::string& str,
                                     const std::regex& regex)
{
  return {std::sregex_token_iterator{str.begin(), str.end(), regex, -1}, 
          std::sregex_token_iterator() };
} 

// usually we have a predefined set of regular expressions: then
// let's build those only once and re-use them multiple times
static const std::regex regex1(R"(some-reg-exp1)", std::regex::optimize);
static const std::regex regex2(R"(some-reg-exp2)", std::regex::optimize);
static const std::regex regex3(R"(some-reg-exp3)", std::regex::optimize);

std::string str = "some string to split";
std::vector<std::string> tokens( splitString(str, regex1) ); 



You can also use regex for this:

std::vector<std::string> split(const std::string str, const std::string regex_str)
{
    std::regex regexz(regex_str);
    std::vector<std::string> list(std::sregex_token_iterator(str.begin(), str.end(), regexz, -1),
                                  std::sregex_token_iterator());
    return list;
}

which is equivalent to :

std::vector<std::string> split(const std::string str, const std::string regex_str)
{
    std::regex regexz(regex_str);
    std::sregex_token_iterator token_iter(str.begin(), str.end(), regexz, -1);
    std::sregex_token_iterator end;
    std::vector<std::string> list;
    while (token_iter != end)
    {
        list.emplace_back(*token_iter++);
    }
    return list;
}

and use it like this :

#include <iostream>
#include <string>
#include <vector>
#include <regex>

std::vector<std::string> split(const std::string str, const std::string regex_str)
{   // a yet more concise form!
    std::regex re(regex_str); // must be a named object: the iterators refer to it, and passing a temporary regex is rejected since C++14
    return { std::sregex_token_iterator(str.begin(), str.end(), re, -1), std::sregex_token_iterator() };
}

int main()
{
    std::string input_str = "lets split this";
    std::string regex_str = " "; 
    auto tokens = split(input_str, regex_str);
    for (auto& item: tokens)
    {
        std::cout<<item <<std::endl;
    }
}

play with it online! http://cpp.sh/9sumb

You can split on plain substrings or characters, or use actual regular expressions. Characters that are special in a regex (such as + or .) must be escaped when they are meant literally.
It's also concise and needs only C++11!
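
Because the delimiter is interpreted as a regular expression, a literal delimiter that contains metacharacters (for example "+" or "|") has to be escaped first. The following is a minimal sketch of one way to do that; escape_regex and split_literal are hypothetical helper names, and a non-empty delimiter is assumed.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: backslash-escapes the ECMAScript regex
// metacharacters so that `text` is matched literally.
std::string escape_regex(const std::string& text)
{
    static const std::regex special(R"([.^$|()\[\]{}*+?\\])");
    // In the replacement format, $& stands for the matched character.
    return std::regex_replace(text, special, R"(\$&)");
}

// Splits `str` on the literal (non-empty) delimiter `delim`.
std::vector<std::string> split_literal(const std::string& str,
                                       const std::string& delim)
{
    const std::regex re(escape_regex(delim));   // named object, outlives the iterators
    return std::vector<std::string>(
        std::sregex_token_iterator(str.begin(), str.end(), re, -1),
        std::sregex_token_iterator());
}
```

With this, split_literal("a+b+c", "+") returns the three tokens a, b and c, whereas passing "+" directly as a regex would throw std::regex_error.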


You can use the following function to split a string (note that it discards empty tokens):

vector<string> split(const string& str, const string& delim)
{
    vector<string> tokens;
    size_t prev = 0, pos = 0;
    do
    {
        pos = str.find(delim, prev);
        if (pos == string::npos) pos = str.length();
        string token = str.substr(prev, pos-prev);
        if (!token.empty()) tokens.push_back(token);
        prev = pos + delim.length();
    }
    while (pos < str.length() && prev < str.length());
    return tokens;
}

For string delimiter

Split a string based on a string delimiter. For example, splitting the string "adsf-+qwret-+nvfkbdsj-+orthdfjgh-+dfjrleih" on the string delimiter "-+" gives {"adsf", "qwret", "nvfkbdsj", "orthdfjgh", "dfjrleih"}.

#include <iostream>
#include <sstream>
#include <vector>

using namespace std;

// for string delimiter
vector<string> split (string s, string delimiter) {
    size_t pos_start = 0, pos_end, delim_len = delimiter.length();
    string token;
    vector<string> res;

    while ((pos_end = s.find (delimiter, pos_start)) != string::npos) {
        token = s.substr (pos_start, pos_end - pos_start);
        pos_start = pos_end + delim_len;
        res.push_back (token);
    }

    res.push_back (s.substr (pos_start));
    return res;
}

int main() {
    string str = "adsf-+qwret-+nvfkbdsj-+orthdfjgh-+dfjrleih";
    string delimiter = "-+";
    vector<string> v = split (str, delimiter);

    for (auto i : v) cout << i << endl;

    return 0;
}


Output

adsf
qwret
nvfkbdsj
orthdfjgh
dfjrleih




For single character delimiter

Split a string based on a character delimiter. For example, splitting the string "adsf+qwer+poui+fdgh" on the delimiter "+" gives {"adsf", "qwer", "poui", "fdgh"}.

#include <iostream>
#include <sstream>
#include <vector>

using namespace std;

vector<string> split (const string &s, char delim) {
    vector<string> result;
    stringstream ss (s);
    string item;

    while (getline (ss, item, delim)) {
        result.push_back (item);
    }

    return result;
}

int main() {
    string str = "adsf+qwer+poui+fdgh";
    vector<string> v = split (str, '+');

    for (auto i : v) cout << i << endl;

    return 0;
}


Output

adsf
qwer
poui
fdgh

Here's my take on this. It handles the edge cases and takes an optional parameter to remove empty entries from the results.

bool endsWith(const std::string& s, const std::string& suffix)
{
    return s.size() >= suffix.size() &&
           s.substr(s.size() - suffix.size()) == suffix;
}

std::vector<std::string> split(const std::string& s, const std::string& delimiter, const bool& removeEmptyEntries = false)
{
    std::vector<std::string> tokens;

    for (size_t start = 0, end; start < s.length(); start = end + delimiter.length())
    {
         size_t position = s.find(delimiter, start);
         end = position != std::string::npos ? position : s.length();

         std::string token = s.substr(start, end - start);
         if (!removeEmptyEntries || !token.empty())
         {
             tokens.push_back(token);
         }
    }

    if (!removeEmptyEntries &&
        (s.empty() || endsWith(s, delimiter)))
    {
        tokens.push_back("");
    }

    return tokens;
}

Examples

split("a-b-c", "-"); // [3]("a","b","c")

split("a--c", "-"); // [3]("a","","c")

split("-b-", "-"); // [3]("","b","")

split("--c--", "-"); // [5]("","","c","","")

split("--c--", "-", true); // [1]("c")

split("a", "-"); // [1]("a")

split("", "-"); // [1]("")

split("", "-", true); // [0]()

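This is a complete method that splits the string on a delimiter and returns a vector of the chopped-up strings. Because it uses find_first_of, every character of delimiter counts as a separate one-character delimiter, so a multi-character delimiter such as ">=" produces an empty token between the two characters.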

It is an adaptation from the answer from ryanbwork. However, his check for: if(token != mystring) gives wrong results if you have repeating elements in your string. This is my solution to that problem.

vector<string> Split(string mystring, string delimiter)
{
    vector<string> subStringList;
    string token;
    while (true)
    {
        size_t findfirst = mystring.find_first_of(delimiter);
        if (findfirst == string::npos) //find_first_of returns npos if it couldn't find the delimiter anymore
        {
            subStringList.push_back(mystring); //push back the final piece of mystring
            return subStringList;
        }
        token = mystring.substr(0, findfirst);
        mystring = mystring.substr(findfirst + 1);
        subStringList.push_back(token);
    }
    return subStringList;
}

strtok allows you to pass in multiple chars as delimiters. If you pass in ">=", your example string is split correctly (even though the > and = are counted as individual delimiters).

EDIT: If you don't want to copy the string into a writable char buffer for strtok, you can use substr and find_first_of to tokenize.

string token, mystring("scott>=tiger");
while(token != mystring){
  token = mystring.substr(0,mystring.find_first_of(">="));
  mystring = mystring.substr(mystring.find_first_of(">=") + 1);
  printf("%s ",token.c_str());
}
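
The strtok behaviour described above, where each character of the delimiter argument is its own delimiter, can be checked with a small wrapper. This is only a sketch; strtok_split is a hypothetical name, and since std::strtok keeps internal state the function is not thread-safe.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Sketch: tokenizes with std::strtok, which treats `delims` as a set of
// single-character delimiters, not as one multi-character delimiter.
std::vector<std::string> strtok_split(const std::string& input, const char* delims)
{
    // strtok writes '\0' into its argument, so work on a private copy.
    std::vector<char> buffer(input.begin(), input.end());
    buffer.push_back('\0');

    std::vector<std::string> tokens;
    for (char* tok = std::strtok(buffer.data(), delims); tok != nullptr;
         tok = std::strtok(nullptr, delims)) {
        tokens.emplace_back(tok);
    }
    return tokens;
}
```

strtok_split("scott>=tiger", ">=") gives scott and tiger, but strtok_split("a>b=c>=d", ">=") gives four tokens (a, b, c, d), because a lone > or = also separates tokens. Empty tokens are never returned.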

std::vector<std::string> split(const std::string& s, char c) {
  std::vector<std::string> v;
  std::string::size_type i = 0;
  std::string::size_type j = s.find(c);
  while (j != std::string::npos) {
    v.push_back(s.substr(i, j - i));
    i = ++j;
    j = s.find(c, j);
    if (j == std::string::npos) {
      v.push_back(s.substr(i, s.length())); // last token, after the final delimiter
      break;
    }
  }
  return v;
}

Since this is the top-rated Stack Overflow Google search result for C++ split string or similar, I'll post a complete, copy/paste runnable example that shows both methods.

splitString uses stringstream (probably the better and easier option in most cases)

splitString2 uses find and substr (a more manual approach)

// SplitString.cpp

#include <iostream>
#include <vector>
#include <string>
#include <sstream>

// function prototypes
std::vector<std::string> splitString(const std::string& str, char delim);
std::vector<std::string> splitString2(const std::string& str, char delim);
std::string getSubstring(const std::string& str, int leftIdx, int rightIdx);


int main(void)
{
  // Test cases - all will pass
  
  std::string str = "ab,cd,ef";
  //std::string str = "abcdef";
  //std::string str = "";
  //std::string str = ",cd,ef";
  //std::string str = "ab,cd,";   // behavior of splitString and splitString2 is different for this final case only, if this case matters to you choose which one you need as applicable
  
  
  std::vector<std::string> tokens = splitString(str, ',');
  
  std::cout << "tokens: " << "\n";
  
  if (tokens.empty())
  {
    std::cout << "(tokens is empty)" << "\n";
  }
  else
  {
    for (auto& token : tokens)
    {
      if (token == "") std::cout << "(empty string)" << "\n";
      else std::cout << token << "\n";
    }
  }
    
  return 0;
}

std::vector<std::string> splitString(const std::string& str, char delim)
{
  std::vector<std::string> tokens;
  
  if (str == "") return tokens;
  
  std::string currentToken;
  
  std::stringstream ss(str);
  
  while (std::getline(ss, currentToken, delim))
  {
    tokens.push_back(currentToken);
  }
  
  return tokens;
}

std::vector<std::string> splitString2(const std::string& str, char delim)
{
  std::vector<std::string> tokens;
  
  if (str == "") return tokens;
  
  int leftIdx = 0;
  
  int delimIdx = str.find(delim);
  
  int rightIdx;
  
  while (delimIdx != std::string::npos)
  {
    rightIdx = delimIdx - 1;
    
    std::string token = getSubstring(str, leftIdx, rightIdx);
    tokens.push_back(token);
    
    // prep for next time around
    leftIdx = delimIdx + 1;
    
    delimIdx = str.find(delim, delimIdx + 1);
  }
  
  rightIdx = str.size() - 1;
  
  std::string token = getSubstring(str, leftIdx, rightIdx);
  tokens.push_back(token);
  
  return tokens;
}

std::string getSubstring(const std::string& str, int leftIdx, int rightIdx)
{
  return str.substr(leftIdx, rightIdx - leftIdx + 1);
}
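
The trailing-delimiter difference mentioned in the test-case comment above can be reproduced with two stripped-down functions. They are simplified sketches of the two techniques, not the exact functions from this answer, and split_getline and split_find are hypothetical names.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Sketch of the getline technique: extraction stops at end of input, so
// nothing is emitted for the empty text after a trailing delimiter.
std::vector<std::string> split_getline(const std::string& str, char delim)
{
    std::vector<std::string> tokens;
    std::stringstream ss(str);
    std::string token;
    while (std::getline(ss, token, delim))
        tokens.push_back(token);
    return tokens;
}

// Sketch of the find/substr technique: the text after the last delimiter
// is always emitted, even when it is empty.
std::vector<std::string> split_find(const std::string& str, char delim)
{
    std::vector<std::string> tokens;
    std::string::size_type start = 0;
    std::string::size_type pos;
    while ((pos = str.find(delim, start)) != std::string::npos) {
        tokens.push_back(str.substr(start, pos - start));
        start = pos + 1;
    }
    tokens.push_back(str.substr(start));
    return tokens;
}
```

For the input "ab,cd," split_getline returns two tokens (ab, cd), while split_find returns three (ab, cd and an empty string). Which behaviour is wanted depends on whether a trailing delimiter should count as introducing an empty field.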
