Let's say I have a string such as:
"Hello how are you doing?"
I would like a function that turns multiple spaces into one space.
So I would get:
"Hello how are you doing?"
I know I could use regex or call
string s = "Hello how are you doing?".replace(" "," ");
But I would have to call it multiple times to make sure all sequential whitespaces are replaced with only one.
Is there already a built in method for this?
This question is related to
c#
string
whitespace
A fast extra whitespace remover by Felipe Machado. (Modified by RW for multi-space removal)
static string DuplicateWhiteSpaceRemover(string str)
{
var len = str.Length;
var src = str.ToCharArray();
int dstIdx = 0;
bool lastWasWS = false; //Added line
for (int i = 0; i < len; i++)
{
var ch = src[i];
switch (ch)
{
case '\u0020': //SPACE
case '\u00A0': //NO-BREAK SPACE
case '\u1680': //OGHAM SPACE MARK
case '\u2000': // EN QUAD
case '\u2001': //EM QUAD
case '\u2002': //EN SPACE
case '\u2003': //EM SPACE
case '\u2004': //THREE-PER-EM SPACE
case '\u2005': //FOUR-PER-EM SPACE
case '\u2006': //SIX-PER-EM SPACE
case '\u2007': //FIGURE SPACE
case '\u2008': //PUNCTUATION SPACE
case '\u2009': //THIN SPACE
case '\u200A': //HAIR SPACE
case '\u202F': //NARROW NO-BREAK SPACE
case '\u205F': //MEDIUM MATHEMATICAL SPACE
case '\u3000': //IDEOGRAPHIC SPACE
case '\u2028': //LINE SEPARATOR
case '\u2029': //PARAGRAPH SEPARATOR
case '\u0009': //[ASCII Tab]
case '\u000A': //[ASCII Line Feed]
case '\u000B': //[ASCII Vertical Tab]
case '\u000C': //[ASCII Form Feed]
case '\u000D': //[ASCII Carriage Return]
case '\u0085': //NEXT LINE
if (lastWasWS == false) //Added line
{
src[dstIdx++] = ' '; // Updated by Ryan
lastWasWS = true; //Added line
}
continue;
default:
lastWasWS = false; //Added line
src[dstIdx++] = ch;
break;
}
}
return new string(src, 0, dstIdx);
}
The benchmarks...
| | Time | TEST 1 | TEST 2 | TEST 3 | TEST 4 | TEST 5 |
| Function Name |(ticks)| dup. spaces | spaces+tabs | spaces+CR/LF| " " -> " " | " " -> " " |
|---------------------------|-------|-------------|-------------|-------------|-------------|-------------|
| SwitchStmtBuildSpaceOnly | 5.2 | PASS | FAIL | FAIL | PASS | PASS |
| InPlaceCharArraySpaceOnly | 5.6 | PASS | FAIL | FAIL | PASS | PASS |
| DuplicateWhiteSpaceRemover| 7.0 | PASS | PASS | PASS | PASS | PASS |
| SingleSpacedTrim | 11.8 | PASS | PASS | PASS | FAIL | FAIL |
| Fubo(StringBuilder) | 13 | PASS | FAIL | FAIL | PASS | PASS |
| User214147 | 19 | PASS | PASS | PASS | FAIL | FAIL |
| RegExWithCompile | 28 | PASS | FAIL | FAIL | PASS | PASS |
| SwitchStmtBuild | 34 | PASS | FAIL | FAIL | PASS | PASS |
| SplitAndJoinOnSpace | 55 | PASS | FAIL | FAIL | FAIL | FAIL |
| RegExNoCompile | 120 | PASS | PASS | PASS | PASS | PASS |
| RegExBrandon | 137 | PASS | FAIL | PASS | PASS | PASS |
Benchmark notes: Release Mode, no-debugger attached, i7 processor, avg of 4 runs, only short strings tested
SwitchStmtBuildSpaceOnly by Felipe Machado 2015 and modified by Sunsetquest
InPlaceCharArraySpaceOnly by Felipe Machado 2015 and modified by Sunsetquest
SwitchStmtBuild by Felipe Machado 2015 and modified by Sunsetquest
SwitchStmtBuild2 by Felipe Machado 2015 and modified by Sunsetquest
SingleSpacedTrim by David S 2013
Fubo(StringBuilder) by fubo 2014
SplitAndJoinOnSpace by Jon Skeet 2009
RegExWithCompile by Jon Skeet 2009
User214147 by user214147
RegExBrandon by Brandon
RegExNoCompile by Tim Hoolihan
Replacement groups provide impler approach resolving replacement of multiple white space characters with same single one:
public static void WhiteSpaceReduce()
{
string t1 = "a b c d";
string t2 = "a b\n\nc\nd";
Regex whiteReduce = new Regex(@"(?<firstWS>\s)(?<repeatedWS>\k<firstWS>+)");
Console.WriteLine("{0}", t1);
//Console.WriteLine("{0}", whiteReduce.Replace(t1, x => x.Value.Substring(0, 1)));
Console.WriteLine("{0}", whiteReduce.Replace(t1, @"${firstWS}"));
Console.WriteLine("\nNext example ---------");
Console.WriteLine("{0}", t2);
Console.WriteLine("{0}", whiteReduce.Replace(t2, @"${firstWS}"));
Console.WriteLine();
}
Please notice the second example keeps single \n
while accepted answer would replace end of line with space.
If you need to replace any combination of white space characters with the first one, just remove the back-reference \k
from the pattern.
There is no way built in to do this. You can try this:
private static readonly char[] whitespace = new char[] { ' ', '\n', '\t', '\r', '\f', '\v' };
public static string Normalize(string source)
{
return String.Join(" ", source.Split(whitespace, StringSplitOptions.RemoveEmptyEntries));
}
This will remove leading and trailing whitespce as well as collapse any internal whitespace to a single whitespace character. If you really only want to collapse spaces, then the solutions using a regular expression are better; otherwise this solution is better. (See the analysis done by Jon Skeet.)
Smallest solution:
var regExp=/\s+/g, newString=oldString.replace(regExp,' ');
A regular expressoin would be the easiest way. If you write the regex the correct way, you wont need multiple calls.
Change it to this:
string s = System.Text.RegularExpressions.Regex.Replace(s, @"\s{2,}", " ");
I'm sharing what I use, because it appears I've come up with something different. I've been using this for a while and it is fast enough for me. I'm not sure how it stacks up against the others. I uses it in a delimited file writer and run large datatables one field at a time through it.
public static string NormalizeWhiteSpace(string S)
{
string s = S.Trim();
bool iswhite = false;
int iwhite;
int sLength = s.Length;
StringBuilder sb = new StringBuilder(sLength);
foreach(char c in s.ToCharArray())
{
if(Char.IsWhiteSpace(c))
{
if (iswhite)
{
//Continuing whitespace ignore it.
continue;
}
else
{
//New WhiteSpace
//Replace whitespace with a single space.
sb.Append(" ");
//Set iswhite to True and any following whitespace will be ignored
iswhite = true;
}
}
else
{
sb.Append(c.ToString());
//reset iswhitespace to false
iswhite = false;
}
}
return sb.ToString();
}
Regex regex = new Regex(@"\W+");
string outputString = regex.Replace(inputString, " ");
Using the test program that Jon Skeet posted, I tried to see if I could get a hand written loop to run faster.
I can beat NormalizeWithSplitAndJoin every time, but only beat NormalizeWithRegex with inputs of 1000, 5.
static string NormalizeWithLoop(string input)
{
StringBuilder output = new StringBuilder(input.Length);
char lastChar = '*'; // anything other then space
for (int i = 0; i < input.Length; i++)
{
char thisChar = input[i];
if (!(lastChar == ' ' && thisChar == ' '))
output.Append(thisChar);
lastChar = thisChar;
}
return output.ToString();
}
I have not looked at the machine code the jitter produces, however I expect the problem is the time taken by the call to StringBuilder.Append() and to do much better would need the use of unsafe code.
So Regex.Replace() is very fast and hard to beat!!
While the existing answers are fine, I'd like to point out one approach which doesn't work:
public static string DontUseThisToCollapseSpaces(string text)
{
while (text.IndexOf(" ") != -1)
{
text = text.Replace(" ", " ");
}
return text;
}
This can loop forever. Anyone care to guess why? (I only came across this when it was asked as a newsgroup question a few years ago... someone actually ran into it as a problem.)
You can try this:
/// <summary>
/// Remove all extra spaces and tabs between words in the specified string!
/// </summary>
/// <param name="str">The specified string.</param>
public static string RemoveExtraSpaces(string str)
{
str = str.Trim();
StringBuilder sb = new StringBuilder();
bool space = false;
foreach (char c in str)
{
if (char.IsWhiteSpace(c) || c == (char)9) { space = true; }
else { if (space) { sb.Append(' '); }; sb.Append(c); space = false; };
}
return sb.ToString();
}
VB.NET
Linha.Split(" ").ToList().Where(Function(x) x <> " ").ToArray
C#
Linha.Split(" ").ToList().Where(x => x != " ").ToArray();
Enjoy the power of LINQ =D
As already pointed out, this is easily done by a regular expression. I'll just add that you might want to add a .trim() to that to get rid of leading/trailing whitespace.
This question isn't as simple as other posters have made it out to be (and as I originally believed it to be) - because the question isn't quite precise as it needs to be.
There's a difference between "space" and "whitespace". If you only mean spaces, then you should use a regex of " {2,}"
. If you mean any whitespace, that's a different matter. Should all whitespace be converted to spaces? What should happen to space at the start and end?
For the benchmark below, I've assumed that you only care about spaces, and you don't want to do anything to single spaces, even at the start and end.
Note that correctness is almost always more important than performance. The fact that the Split/Join solution removes any leading/trailing whitespace (even just single spaces) is incorrect as far as your specified requirements (which may be incomplete, of course).
The benchmark uses MiniBench.
using System;
using System.Text.RegularExpressions;
using MiniBench;
internal class Program
{
public static void Main(string[] args)
{
int size = int.Parse(args[0]);
int gapBetweenExtraSpaces = int.Parse(args[1]);
char[] chars = new char[size];
for (int i=0; i < size/2; i += 2)
{
// Make sure there actually *is* something to do
chars[i*2] = (i % gapBetweenExtraSpaces == 1) ? ' ' : 'x';
chars[i*2 + 1] = ' ';
}
// Just to make sure we don't have a \0 at the end
// for odd sizes
chars[chars.Length-1] = 'y';
string bigString = new string(chars);
// Assume that one form works :)
string normalized = NormalizeWithSplitAndJoin(bigString);
var suite = new TestSuite<string, string>("Normalize")
.Plus(NormalizeWithSplitAndJoin)
.Plus(NormalizeWithRegex)
.RunTests(bigString, normalized);
suite.Display(ResultColumns.All, suite.FindBest());
}
private static readonly Regex MultipleSpaces =
new Regex(@" {2,}", RegexOptions.Compiled);
static string NormalizeWithRegex(string input)
{
return MultipleSpaces.Replace(input, " ");
}
// Guessing as the post doesn't specify what to use
private static readonly char[] Whitespace =
new char[] { ' ' };
static string NormalizeWithSplitAndJoin(string input)
{
string[] split = input.Split
(Whitespace, StringSplitOptions.RemoveEmptyEntries);
return string.Join(" ", split);
}
}
A few test runs:
c:\Users\Jon\Test>test 1000 50
============ Normalize ============
NormalizeWithSplitAndJoin 1159091 0:30.258 22.93
NormalizeWithRegex 26378882 0:30.025 1.00
c:\Users\Jon\Test>test 1000 5
============ Normalize ============
NormalizeWithSplitAndJoin 947540 0:30.013 1.07
NormalizeWithRegex 1003862 0:29.610 1.00
c:\Users\Jon\Test>test 1000 1001
============ Normalize ============
NormalizeWithSplitAndJoin 1156299 0:29.898 21.99
NormalizeWithRegex 23243802 0:27.335 1.00
Here the first number is the number of iterations, the second is the time taken, and the third is a scaled score with 1.0 being the best.
That shows that in at least some cases (including this one) a regular expression can outperform the Split/Join solution, sometimes by a very significant margin.
However, if you change to an "all whitespace" requirement, then Split/Join does appear to win. As is so often the case, the devil is in the detail...
Here is the Solution i work with. Without RegEx and String.Split.
public static string TrimWhiteSpace(this string Value)
{
StringBuilder sbOut = new StringBuilder();
if (!string.IsNullOrEmpty(Value))
{
bool IsWhiteSpace = false;
for (int i = 0; i < Value.Length; i++)
{
if (char.IsWhiteSpace(Value[i])) //Comparion with WhiteSpace
{
if (!IsWhiteSpace) //Comparison with previous Char
{
sbOut.Append(Value[i]);
IsWhiteSpace = true;
}
}
else
{
IsWhiteSpace = false;
sbOut.Append(Value[i]);
}
}
}
return sbOut.ToString();
}
so you can:
string cleanedString = dirtyString.TrimWhiteSpace();
Source: Stackoverflow.com