[java] Counting number of words in a file

I'm having a problem counting the number of words in a file. The approach that I am taking is when I see a space or a newLine then I know to count a word.

The problem is that if I have multiple lines between paragraphs then I ended up counting them as words also. If you look at the readFile() method you can see what I am doing.

Could you help me out and guide me in the right direction on how to fix this?

Example input file (including a blank line):

word word word
word word

word word word

This question is related to java algorithm loops io

The answer is


I would change your approach a bit. First, I would use a BufferedReader to read the file file in line-by-line using readLine(). Then split each line on whitespace using String.split("\\s") and use the size of the resulting array to see how many words are on that line. To get the number of characters you could either look at the size of each line or of each split word (depending of if you want to count whitespace as characters).


Take a look at my solution here, it should work. The idea is to remove all the unwanted symbols from the words, then separate those words and store them in some other variable, i was using ArrayList. By adjusting the "excludedSymbols" variable you can add more symbols which you would like to be excluded from the words.

public static void countWords () {
    String textFileLocation ="c:\\yourFileLocation";
    String readWords ="";
    ArrayList<String> extractOnlyWordsFromTextFile = new ArrayList<>();
    // excludedSymbols can be extended to whatever you want to exclude from the file 
    String[] excludedSymbols = {" ", "," , "." , "/" , ":" , ";" , "<" , ">", "\n"};
    String readByteCharByChar = "";
    boolean testIfWord = false;


    try {
        InputStream inputStream = new FileInputStream(textFileLocation);
        byte byte1 = (byte) inputStream.read();
        while (byte1 != -1) {

            readByteCharByChar +=String.valueOf((char)byte1);
            for(int i=0;i<excludedSymbols.length;i++) {
            if(readByteCharByChar.equals(excludedSymbols[i])) {
                if(!readWords.equals("")) {
                extractOnlyWordsFromTextFile.add(readWords);
                }
                readWords ="";
                testIfWord = true;
                break;
            }
            }
            if(!testIfWord) {
                readWords+=(char)byte1;
            }
            readByteCharByChar = "";
            testIfWord = false;
            byte1 = (byte)inputStream.read();
            if(byte1 == -1 && !readWords.equals("")) {
                extractOnlyWordsFromTextFile.add(readWords);
            }
        }
        inputStream.close();
        System.out.println(extractOnlyWordsFromTextFile);
        System.out.println("The number of words in the choosen text file are: " + extractOnlyWordsFromTextFile.size());
    } catch (IOException ioException) {

        ioException.printStackTrace();
    }
}

I think a correct approach would be by means of Regex:

String fileContent = <text from file>;    
String[] words = Pattern.compile("\\s+").split(fileContent);
System.out.println("File has " + words.length + " words");

Hope it helps. The "\s+" meaning is in Pattern javadoc


File Word-Count

If in between words having some symbols then you can split and count the number of Words.

Scanner sc = new Scanner(new FileInputStream(new File("Input.txt")));
        int count = 0;
        while (sc.hasNext()) {

            String[] s = sc.next().split("d*[.@:=#-]"); 

            for (int i = 0; i < s.length; i++) {
                if (!s[i].isEmpty()){
                    System.out.println(s[i]);
                    count++;
                }   
            }           
        }
        System.out.println("Word-Count : "+count);

BufferedReader bf= new BufferedReader(new FileReader("G://Sample.txt"));
        String line=bf.readLine();
        while(line!=null)
        {
            String[] words=line.split(" ");
            System.out.println("this line contains " +words.length+ " words");
            line=bf.readLine();
        }

The below code supports in Java 8

//Read file into String

String fileContent=new String(Files.readAlBytes(Paths.get("MyFile.txt")),StandardCharacters.UFT_8);

//Keeping these into list of strings by splitting with a delimiter

List<String> words = Arrays.asList(contents.split("\\PL+"));

int count=0;
for(String x: words){
 if(x.length()>1) count++;
}

sop(x);

This can be done in a very way using Java 8:

Files.lines(Paths.get(file))
    .flatMap(str->Stream.of(str.split("[ ,.!?\r\n]")))
    .filter(s->s.length()>0).count();

Just keep a boolean flag around that lets you know if the previous character was whitespace or not (pseudocode follows):

boolean prevWhitespace = false;
int wordCount = 0;
while (char ch = getNextChar(input)) {
  if (isWhitespace(ch)) {
    if (!prevWhitespace) {
      prevWhitespace = true;
      wordCount++;
    }
  } else {
    prevWhitespace = false;
  }
}

Hack solution

You can read the text file into a String var. Then split the String into an array using a single whitespace as the delimiter StringVar.Split(" ").

The Array count would equal the number of "Words" in the file. Of course this wouldnt give you a count of line numbers.


So easy we can get the String from files by method: getText();

public class Main {

    static int countOfWords(String str) {
        if (str.equals("") || str == null) {
            return 0;
        }else{
            int numberWords = 0;
            for (char c : str.toCharArray()) {
                if (c == ' ') {
                    numberWords++;
                }
            }

            return ++numberWordss;
        }
    }
}

You can use a Scanner with a FileInputStream instead of BufferedReader with a FileReader. For example:-

File file = new File("sample.txt");
try(Scanner sc = new Scanner(new FileInputStream(file))){
    int count=0;
    while(sc.hasNext()){
        sc.next();
        count++;
    }
System.out.println("Number of words: " + count);
}

3 steps: Consume all the white spaces, check if is a line, consume all the nonwhitespace.3

while(true){
    c = inFile.read();                
    // consume whitespaces
    while(isspace(c)){ inFile.read() }
    if (c == '\n'){ numberLines++; continue; }
    while (!isspace(c)){
         numberChars++;
         c = inFile.read();
    }
    numberWords++;
}

import java.io.BufferedReader;
import java.io.FileReader;

public class CountWords {

    public static void main (String args[]) throws Exception {

       System.out.println ("Counting Words");       
       FileReader fr = new FileReader ("c:\\Customer1.txt");        
       BufferedReader br = new BufferedReader (fr);     
       String line = br.readLin ();
       int count = 0;
       while (line != null) {
          String []parts = line.split(" ");
          for( String w : parts)
          {
            count++;        
          }
          line = br.readLine();
       }         
       System.out.println(count);
    }
}

This is just a thought. There is one very easy way to do it. If you just need number of words and not actual words then just use Apache WordUtils

import org.apache.commons.lang.WordUtils;

public class CountWord {

public static void main(String[] args) {    
String str = "Just keep a boolean flag around that lets you know if the previous character was whitespace or not pseudocode follows";

    String initials = WordUtils.initials(str);

    System.out.println(initials);
    //so number of words in your file will be
    System.out.println(initials.length());    
  }
}

Examples related to java

Under what circumstances can I call findViewById with an Options Menu / Action Bar item? How much should a function trust another function How to implement a simple scenario the OO way Two constructors How do I get some variable from another class in Java? this in equals method How to split a string in two and store it in a field How to do perspective fixing? String index out of range: 4 My eclipse won't open, i download the bundle pack it keeps saying error log

Examples related to algorithm

How can I tell if an algorithm is efficient? Find the smallest positive integer that does not occur in a given sequence Efficiently getting all divisors of a given number Peak signal detection in realtime timeseries data What is the optimal algorithm for the game 2048? How can I sort a std::map first by value, then by key? Finding square root without using sqrt function? Fastest way to flatten / un-flatten nested JSON objects Mergesort with Python Find common substring between two strings

Examples related to loops

How to increment a letter N times per iteration and store in an array? Angular 2 Cannot find control with unspecified name attribute on formArrays What is the difference between i = i + 1 and i += 1 in a 'for' loop? Prime numbers between 1 to 100 in C Programming Language Python Loop: List Index Out of Range JavaScript: Difference between .forEach() and .map() Why does using from __future__ import print_function breaks Python2-style print? Creating an array from a text file in Bash Iterate through dictionary values? C# Wait until condition is true

Examples related to io

Reading file using relative path in python project How to write to a CSV line by line? Getting "java.nio.file.AccessDeniedException" when trying to write to a folder Exception: Unexpected end of ZLIB input stream How to get File Created Date and Modified Date Printing Mongo query output to a file while in the mongo shell Load data from txt with pandas Writing File to Temp Folder How to get resources directory path programmatically ValueError : I/O operation on closed file