It's tricky to handle mixed line endings properly. As we know, the line termination characters can be "Line Feed" (ASCII 10, \n, \x0A, \u000A), "Carriage Return" (ASCII 13, \r, \x0D, \u000D), or some combination of them. Going back to DOS, Windows uses the two-character sequence CR-LF (\u000D\u000A), so this combination should only emit a single line. Unix uses a single \u000A, and very old Macs used a single \u000D character. The standard way to treat arbitrary mixtures of these characters within a single text file is as follows:
- Each CR or LF character skips to the next line, except that if a CR is immediately followed by an LF (\u000D\u000A) then these two together skip just one line.
- String.Empty is the only input that returns no lines (any character entails at least one line).

These rules describe the behavior of StringReader.ReadLine and related functions, and the function shown below produces identical results. It is an efficient C# line-breaking function that implements these rules to handle any sequence or combination of CR and LF characters. The enumerated lines do not contain any CR or LF characters. Empty lines are preserved and returned as String.Empty.
/// <summary>
/// Enumerates the text lines from the string.
///   - Mixed CR-LF scenarios are handled correctly
///   - String.Empty is returned for each empty line
///   - No returned string ever contains CR or LF
/// </summary>
public static IEnumerable<String> Lines(this String s)
{
    int j = 0, c, i;    // j: scan position, c: input length, i: start of the current line
    char ch;
    if ((c = s.Length) > 0)
        do
        {
            // Advance j to the next CR or LF, or to the end of the string.
            for (i = j; (ch = s[j]) != '\r' && ch != '\n' && ++j < c;)
                ;
            yield return s.Substring(i, j - i);
        }
        // Step past the terminator; if it was a CR directly followed by an LF, step past the LF as well.
        while (++j < c && (ch != '\r' || s[j] != '\n' || ++j < c));
}
Note: If you don't mind the overhead of creating a StringReader instance on each call, you can use the following C# 7 code instead. As noted, while the example above may be slightly more efficient, both of these functions produce exactly the same results.
public static IEnumerable<String> Lines(this String s)
{
    using (var tr = new StringReader(s))
        // ReadLine returns null at the end of the input, which fails the type pattern and ends the loop.
        while (tr.ReadLine() is String L)
            yield return L;
}
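
As a quick check of the rules described earlier, the sketch below runs a few mixed-terminator inputs through StringReader.ReadLine, the reference behavior this answer names. The class name LineRulesDemo and the helpers Split and Show are hypothetical names introduced only for this illustration; they are not part of either function above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical demo class, introduced only for illustration.
public static class LineRulesDemo
{
    // Collects every line that StringReader.ReadLine yields for the input.
    public static List<string> Split(string s)
    {
        var lines = new List<string>();
        using (var reader = new StringReader(s))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                lines.Add(line);
        }
        return lines;
    }

    // Prints the input with CR and LF escaped, followed by the resulting lines.
    public static void Show(string s)
    {
        string shown = s.Replace("\r", "\\r").Replace("\n", "\\n");
        List<string> quoted = Split(s).ConvertAll(l => "\"" + l + "\"");
        Console.WriteLine("\"" + shown + "\" -> [" + string.Join(", ", quoted) + "]");
    }

    public static void Main()
    {
        Show("");           // no lines: []
        Show("a");          // any character entails one line: ["a"]
        Show("a\r\nb");     // CR-LF pair counts once: ["a", "b"]
        Show("a\n\rb");     // LF then CR are two separate breaks: ["a", "", "b"]
        Show("a\r\r\nb");   // lone CR, then a CR-LF pair: ["a", "", "b"]
        Show("a\n");        // a trailing terminator adds no extra line: ["a"]
        Show("\n");         // a lone terminator is one empty line: [""]
    }
}
```

Note that LF followed by CR is not treated as a pair, so "a\n\rb" produces an empty line in the middle, while "a\r\nb" does not.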