[java] How to list all files in a directory and its subdirectories in hadoop hdfs

I have a folder in HDFS which has two subfolders, each of which has about 30 subfolders which, finally, contain XML files. I want to list all XML files, giving only the main folder's path. Locally I can do this with Apache commons-io's FileUtils.listFiles(). I have tried this

FileStatus[] status = fs.listStatus( new Path( args[ 0 ] ) );

but it only lists the first two subfolders and doesn't go any further. Is there any way to do this in Hadoop?
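
For reference, the local commons-io call mentioned above looks roughly like this (the directory path is just a placeholder):

    // local listing with commons-io: all *.xml files under the folder, recursively
    Collection<File> xmlFiles = FileUtils.listFiles(
            new File("/local/main/folder"),   // placeholder path
            new String[] { "xml" },           // extensions without the dot
            true);                            // recurse into subdirectories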

This question is related to java hadoop hdfs

The answer is


If you are using the Hadoop 2.* API, there are more elegant solutions:

    Configuration conf = getConf();
    Job job = Job.getInstance(conf);
    FileSystem fs = FileSystem.get(conf);

    //the second boolean parameter here sets the recursion to true
    RemoteIterator<LocatedFileStatus> fileStatusListIterator = fs.listFiles(
            new Path("path/to/lib"), true);
    while(fileStatusListIterator.hasNext()){
        LocatedFileStatus fileStatus = fileStatusListIterator.next();
        //do stuff with the file like ...
        job.addFileToClassPath(fileStatus.getPath());
    }
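
Applied to the question, the same iterator can be used to collect only the XML files. A minimal sketch (the root path here is a placeholder):

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    List<String> xmlFiles = new ArrayList<String>();
    //listFiles(..., true) walks the whole tree below the given path
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/path/to/main/folder"), true);
    while (it.hasNext()) {
        LocatedFileStatus status = it.next();
        if (status.isFile() && status.getPath().getName().endsWith(".xml")) {
            xmlFiles.add(status.getPath().toString());
        }
    }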

Code snippet for both recursive and non-recursive approaches:

//helper method to get the list of files from the HDFS path
public static List<String>
    listFilesFromHDFSPath(Configuration hadoopConfiguration,
                          String hdfsPath,
                          boolean recursive) throws IOException,
                                        IllegalArgumentException
{
    //resulting list of files
    List<String> filePaths = new ArrayList<String>();

    //get path from string and then the filesystem
    Path path = new Path(hdfsPath);  //throws IllegalArgumentException
    FileSystem fs = path.getFileSystem(hadoopConfiguration);

    //if recursive approach is requested
    if(recursive)
    {
        //(heap issues with recursive approach) => using a queue
        Queue<Path> fileQueue = new LinkedList<Path>();

        //add the obtained path to the queue
        fileQueue.add(path);

        //while the fileQueue is not empty
        while (!fileQueue.isEmpty())
        {
            //get the file path from queue
            Path filePath = fileQueue.remove();

            //filePath refers to a file
            if (fs.isFile(filePath))
            {
                filePaths.add(filePath.toString());
            }
            else   //else filePath refers to a directory
            {
                //list paths in the directory and add to the queue
                FileStatus[] fileStatuses = fs.listStatus(filePath);
                for (FileStatus fileStatus : fileStatuses)
                {
                    fileQueue.add(fileStatus.getPath());
                } // for
            } // else

        } // while

    } // if
    else        //non-recursive approach => no heap overhead
    {
        //if the given hdfsPath is actually directory
        if(fs.isDirectory(path))
        {
            FileStatus[] fileStatuses = fs.listStatus(path);

            //loop all file statuses
            for(FileStatus fileStatus : fileStatuses)
            {
                //if the given status is a file, then update the resulting list
                if(fileStatus.isFile())
                    filePaths.add(fileStatus.getPath().toString());
            } // for
        } // if
        else        //it is a file then
        {
            //return the one and only file path to the resulting list
            filePaths.add(path.toString());
        } // else

    } // else

    //close filesystem; no more operations
    fs.close();

    //return the resulting list
    return filePaths;
} // listFilesFromHDFSPath
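
A possible call site for the helper above (the HDFS URI and path are just examples):

    Configuration conf = new Configuration();
    //collect every file path under the directory, walking all subdirectories
    List<String> filePaths = listFilesFromHDFSPath(conf, "hdfs://namenode:8020/user/test/in", true);
    for (String filePath : filePaths)
        System.out.println(filePath);

Note that the helper closes the FileSystem it obtained; since FileSystem instances are cached and shared by default, other code holding the same instance would then see it as closed, so you may prefer to drop the fs.close() call in long-running applications.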

Now, one can use Spark to do the same, and it's way faster than other approaches (such as Hadoop MR). Here is the code snippet.

def traverseDirectory(filePath: String, recursiveTraverse: Boolean, filePaths: ListBuffer[String]) {
    val files = FileSystem.get(sparkContext.hadoopConfiguration).listStatus(new Path(filePath))
    files.foreach { fileStatus =>
        if (!fileStatus.isDirectory() && fileStatus.getPath().getName().endsWith(".xml")) {
            filePaths += fileStatus.getPath().toString()
        }
        else if (fileStatus.isDirectory()) {
            traverseDirectory(fileStatus.getPath().toString(), recursiveTraverse, filePaths)
        }
    }
}

Thanks to Radu Adrian Moldovan for the suggestion.

Here is an implementation using a queue:

private static List<String> listAllFilePath(Path hdfsFilePath, FileSystem fs)
throws FileNotFoundException, IOException {
  List<String> filePathList = new ArrayList<String>();
  Queue<Path> fileQueue = new LinkedList<Path>();
  fileQueue.add(hdfsFilePath);
  while (!fileQueue.isEmpty()) {
    Path filePath = fileQueue.remove();
    if (fs.isFile(filePath)) {
      filePathList.add(filePath.toString());
    } else {
      FileStatus[] fileStatus = fs.listStatus(filePath);
      for (FileStatus fileStat : fileStatus) {
        fileQueue.add(fileStat.getPath());
      }
    }
  }
  return filePathList;
}
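
A short usage sketch for this version (the path is a placeholder):

    FileSystem fs = FileSystem.get(new Configuration());
    List<String> filePaths = listAllFilePath(new Path("/user/test/in"), fs);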

/**
 * @param filePath
 * @param fs
 * @return list of absolute file path present in given path
 * @throws FileNotFoundException
 * @throws IOException
 */
public static List<String> getAllFilePath(Path filePath, FileSystem fs) throws FileNotFoundException, IOException {
    List<String> fileList = new ArrayList<String>();
    FileStatus[] fileStatus = fs.listStatus(filePath);
    for (FileStatus fileStat : fileStatus) {
        if (fileStat.isDirectory()) {
            fileList.addAll(getAllFilePath(fileStat.getPath(), fs));
        } else {
            fileList.add(fileStat.getPath().toString());
        }
    }
    return fileList;
}

Quick example: suppose you have the following file structure:

a  ->  b
   ->  c  -> d
          -> e 
   ->  d  -> f

Using the code above, you get:

a/b
a/c/d
a/c/e
a/d/f

If you want only the leaves (i.e. the file names), use the following code in the else block:

 ...
    } else {
        String fileName = fileStat.getPath().toString(); 
        fileList.add(fileName.substring(fileName.lastIndexOf("/") + 1));
    }

This will give:

b
d
e
f
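
A small variation (my suggestion, not part of the original answer): Path#getName() already returns the last component of the path, so the substring call can be dropped:

 ...
    } else {
        //getName() returns just the file name, e.g. "b" instead of ".../a/b"
        fileList.add(fileStat.getPath().getName());
    }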

Have you tried this:

import java.io.*;
import java.util.*;
import java.net.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class cat{
    public static void main (String [] args) throws Exception{
        try{
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus[] status = fs.listStatus(new Path("hdfs://test.com:9000/user/test/in"));  // you need to pass in your hdfs path

            for (int i=0;i<status.length;i++){
                BufferedReader br=new BufferedReader(new InputStreamReader(fs.open(status[i].getPath())));
                String line;
                line=br.readLine();
                while (line != null){
                    System.out.println(line);
                    line=br.readLine();
                }
            }
        }catch(Exception e){
            System.out.println("File not found");
        }
    }
}

Don't use a recursive approach (heap issues) :) use a queue instead:

queue.add(param_dir)
while (queue is not empty) {
    directory = queue.pop
    - get the items from the current directory
    - if an item is a file, add it to the final list
    - if an item is a directory, queue.push it
}

That was easy, enjoy!


Here is a code snippet that counts the number of files in a particular HDFS directory (I used this to determine how many reducers to use in a particular ETL job). You can easily modify it to suit your needs.

private int calculateNumberOfReducers(String input) throws IOException {
    int numberOfReducers = 0;
    Path inputPath = new Path(input);
    FileSystem fs = inputPath.getFileSystem(getConf());
    FileStatus[] statuses = fs.globStatus(inputPath);
    for(FileStatus status: statuses) {
        if(status.isDirectory()) {
            numberOfReducers += getNumberOfInputFiles(status, fs);
        } else if(status.isFile()) {
            numberOfReducers ++;
        }
    }
    return numberOfReducers;
}

/**
 * Recursively determines number of input files in an HDFS directory
 *
 * @param status instance of FileStatus
 * @param fs instance of FileSystem
 * @return number of input files within particular HDFS directory
 * @throws IOException
 */
private int getNumberOfInputFiles(FileStatus status, FileSystem fs) throws IOException  {
    int inputFileCount = 0;
    if(status.isDirectory()) {
        FileStatus[] files = fs.listStatus(status.getPath());
        for(FileStatus file: files) {
            inputFileCount += getNumberOfInputFiles(file, fs);
        }
    } else {
        inputFileCount ++;
    }

    return inputFileCount;
}
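
These two methods rely on getConf(), so they presumably live in a class extending Configured (for example a Tool implementation). A hypothetical call from such a run() method could look like this:

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf());
        // args[0] may be a plain directory or a glob such as /data/input/2020-*
        job.setNumReduceTasks(calculateNumberOfReducers(args[0]));
        // ... remaining job setup ...
        return job.waitForCompletion(true) ? 0 : 1;
    }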
