[java] How to list all files in a directory and its subdirectories in hadoop hdfs

I have a folder in HDFS which has two subfolders, each of which has about 30 subfolders which, finally, contain XML files. I want to list all XML files, giving only the main folder's path. Locally I can do this with Apache commons-io's FileUtils.listFiles(). I have tried this

FileStatus[] status = fs.listStatus( new Path( args[ 0 ] ) );

but it only lists the first two subfolders and doesn't go any further. Is there any way to do this in Hadoop?
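
For reference, the local commons-io call mentioned above looks roughly like this (the directory path is just a placeholder):

    // local listing with commons-io: all *.xml files under the folder, recursively
    Collection<File> xmlFiles = FileUtils.listFiles(
            new File("/local/main/folder"),   // placeholder path
            new String[] { "xml" },           // extensions without the dot
            true);                            // recurse into subdirectories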

This question is related to java hadoop hdfs

The answer is


If you are using the Hadoop 2.* API, there are more elegant solutions:

    Configuration conf = getConf();
    Job job = Job.getInstance(conf);
    FileSystem fs = FileSystem.get(conf);

    //the second boolean parameter here sets the recursion to true
    RemoteIterator<LocatedFileStatus> fileStatusListIterator = fs.listFiles(
            new Path("path/to/lib"), true);
    while(fileStatusListIterator.hasNext()){
        LocatedFileStatus fileStatus = fileStatusListIterator.next();
        //do stuff with the file like ...
        job.addFileToClassPath(fileStatus.getPath());
    }
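
Applied to the question, the same iterator can be used to collect only the XML files. A minimal sketch (the root path here is a placeholder):

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    List<String> xmlFiles = new ArrayList<String>();
    //listFiles(..., true) walks the whole tree below the given path
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/path/to/main/folder"), true);
    while (it.hasNext()) {
        LocatedFileStatus status = it.next();
        if (status.isFile() && status.getPath().getName().endsWith(".xml")) {
            xmlFiles.add(status.getPath().toString());
        }
    }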

Code snippet for both recursive and non-recursive approaches:

//helper method to get the list of files from the HDFS path
public static List<String>
    listFilesFromHDFSPath(Configuration hadoopConfiguration,
                          String hdfsPath,
                          boolean recursive) throws IOException,
                                        IllegalArgumentException
{
    //resulting list of files
    List<String> filePaths = new ArrayList<String>();

    //get path from string and then the filesystem
    Path path = new Path(hdfsPath);  //throws IllegalArgumentException
    FileSystem fs = path.getFileSystem(hadoopConfiguration);

    //if recursive approach is requested
    if(recursive)
    {
        //(heap issues with recursive approach) => using a queue
        Queue<Path> fileQueue = new LinkedList<Path>();

        //add the obtained path to the queue
        fileQueue.add(path);

        //while the fileQueue is not empty
        while (!fileQueue.isEmpty())
        {
            //get the file path from queue
            Path filePath = fileQueue.remove();

            //filePath refers to a file
            if (fs.isFile(filePath))
            {
                filePaths.add(filePath.toString());
            }
            else   //else filePath refers to a directory
            {
                //list paths in the directory and add to the queue
                FileStatus[] fileStatuses = fs.listStatus(filePath);
                for (FileStatus fileStatus : fileStatuses)
                {
                    fileQueue.add(fileStatus.getPath());
                } // for
            } // else

        } // while

    } // if
    else        //non-recursive approach => no heap overhead
    {
        //if the given hdfsPath is actually directory
        if(fs.isDirectory(path))
        {
            FileStatus[] fileStatuses = fs.listStatus(path);

            //loop all file statuses
            for(FileStatus fileStatus : fileStatuses)
            {
                //if the given status is a file, then update the resulting list
                if(fileStatus.isFile())
                    filePaths.add(fileStatus.getPath().toString());
            } // for
        } // if
        else        //it is a file then
        {
            //return the one and only file path to the resulting list
            filePaths.add(path.toString());
        } // else

    } // else

    //close filesystem; no more operations
    fs.close();

    //return the resulting list
    return filePaths;
} // listFilesFromHDFSPath
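
A possible call site for the helper above (the HDFS URI and path are just examples):

    Configuration conf = new Configuration();
    //collect every file path under the directory, walking all subdirectories
    List<String> filePaths = listFilesFromHDFSPath(conf, "hdfs://namenode:8020/user/test/in", true);
    for (String filePath : filePaths)
        System.out.println(filePath);

Note that the helper closes the FileSystem it obtained; since FileSystem instances are cached and shared by default, other code holding the same instance would then see it as closed, so you may prefer to drop the fs.close() call in long-running applications.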

Now, one can use Spark to do the same, and it's way faster than other approaches (such as Hadoop MR). Here is the code snippet.

def traverseDirectory(filePath: String, recursiveTraverse: Boolean, filePaths: ListBuffer[String]) {
    val files = FileSystem.get(sparkContext.hadoopConfiguration).listStatus(new Path(filePath))
    files.foreach { fileStatus =>
        if (!fileStatus.isDirectory() && fileStatus.getPath().getName().endsWith(".xml")) {
            filePaths += fileStatus.getPath().toString()
        }
        else if (fileStatus.isDirectory()) {
            traverseDirectory(fileStatus.getPath().toString(), recursiveTraverse, filePaths)
        }
    }
}

Thanks to Radu Adrian Moldovan for the suggestion.

Here is an implementation using a queue:

private static List<String> listAllFilePath(Path hdfsFilePath, FileSystem fs)
throws FileNotFoundException, IOException {
  List<String> filePathList = new ArrayList<String>();
  Queue<Path> fileQueue = new LinkedList<Path>();
  fileQueue.add(hdfsFilePath);
  while (!fileQueue.isEmpty()) {
    Path filePath = fileQueue.remove();
    if (fs.isFile(filePath)) {
      filePathList.add(filePath.toString());
    } else {
      FileStatus[] fileStatus = fs.listStatus(filePath);
      for (FileStatus fileStat : fileStatus) {
        fileQueue.add(fileStat.getPath());
      }
    }
  }
  return filePathList;
}
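
A short usage sketch for this version (the path is a placeholder):

    FileSystem fs = FileSystem.get(new Configuration());
    List<String> filePaths = listAllFilePath(new Path("/user/test/in"), fs);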

/**
 * @param filePath
 * @param fs
 * @return list of absolute file path present in given path
 * @throws FileNotFoundException
 * @throws IOException
 */
public static List<String> getAllFilePath(Path filePath, FileSystem fs) throws FileNotFoundException, IOException {
    List<String> fileList = new ArrayList<String>();
    FileStatus[] fileStatus = fs.listStatus(filePath);
    for (FileStatus fileStat : fileStatus) {
        if (fileStat.isDirectory()) {
            fileList.addAll(getAllFilePath(fileStat.getPath(), fs));
        } else {
            fileList.add(fileStat.getPath().toString());
        }
    }
    return fileList;
}

Quick example: suppose you have the following file structure:

a  ->  b
   ->  c  -> d
          -> e 
   ->  d  -> f

Using the code above, you get:

a/b
a/c/d
a/c/e
a/d/f

If you want only the leaves (i.e. the file names), use the following code in the else block:

 ...
    } else {
        String fileName = fileStat.getPath().toString(); 
        fileList.add(fileName.substring(fileName.lastIndexOf("/") + 1));
    }

This will give:

b
d
e
f
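
A small variation (my suggestion, not part of the original answer): Path#getName() already returns the last component of the path, so the substring call can be dropped:

 ...
    } else {
        //getName() returns just the file name, e.g. "b" instead of ".../a/b"
        fileList.add(fileStat.getPath().getName());
    }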

Have you tried this:

import java.io.*;
import java.util.*;
import java.net.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class cat{
    public static void main (String [] args) throws Exception{
        try{
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus[] status = fs.listStatus(new Path("hdfs://test.com:9000/user/test/in"));  // you need to pass in your hdfs path

            for (int i=0;i<status.length;i++){
                BufferedReader br=new BufferedReader(new InputStreamReader(fs.open(status[i].getPath())));
                String line;
                line=br.readLine();
                while (line != null){
                    System.out.println(line);
                    line=br.readLine();
                }
            }
        }catch(Exception e){
            System.out.println("File not found");
        }
    }
}

Don't use a recursive approach (heap issues) :) use a queue instead:

queue.add(param_dir)
while (queue is not empty) {
    directory = queue.pop
    - get the items from the current directory
    - if an item is a file, add it to the final list
    - if an item is a directory, queue.push it
}

That was easy, enjoy!


Here is a code snippet that counts the number of files in a particular HDFS directory (I used this to determine how many reducers to use in a particular ETL job). You can easily modify it to suit your needs.

private int calculateNumberOfReducers(String input) throws IOException {
    int numberOfReducers = 0;
    Path inputPath = new Path(input);
    FileSystem fs = inputPath.getFileSystem(getConf());
    FileStatus[] statuses = fs.globStatus(inputPath);
    for(FileStatus status: statuses) {
        if(status.isDirectory()) {
            numberOfReducers += getNumberOfInputFiles(status, fs);
        } else if(status.isFile()) {
            numberOfReducers ++;
        }
    }
    return numberOfReducers;
}

/**
 * Recursively determines number of input files in an HDFS directory
 *
 * @param status instance of FileStatus
 * @param fs instance of FileSystem
 * @return number of input files within particular HDFS directory
 * @throws IOException
 */
private int getNumberOfInputFiles(FileStatus status, FileSystem fs) throws IOException  {
    int inputFileCount = 0;
    if(status.isDirectory()) {
        FileStatus[] files = fs.listStatus(status.getPath());
        for(FileStatus file: files) {
            inputFileCount += getNumberOfInputFiles(file, fs);
        }
    } else {
        inputFileCount ++;
    }

    return inputFileCount;
}
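
These two methods rely on getConf(), so they presumably live in a class extending Configured (for example a Tool implementation). A hypothetical call from such a run() method could look like this:

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf());
        // args[0] may be a plain directory or a glob such as /data/input/2020-*
        job.setNumReduceTasks(calculateNumberOfReducers(args[0]));
        // ... remaining job setup ...
        return job.waitForCompletion(true) ? 0 : 1;
    }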
