[hadoop] How to check an HDFS directory's size?

I know du -sh from common Linux filesystems. But how do I do that on HDFS?

This question is related to: hadoop, command-line, directory, hdfs

The answers are:


With the old -dus form (deprecated since Hadoop 2.x in favor of -du -s):

hadoop fs -dus /path/to/dir | awk '{print $2/1024**3 " G"}'
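On current releases the equivalent uses -du -s, whose first column is the size in bytes, so the same conversion would look like:

hdfs dfs -du -s /path/to/dir | awk '{print $1/1024**3 " G"}'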



To get the size of a directory, hdfs dfs -du -s -h /$yourDirectoryName can be used. hdfs dfsadmin -report gives a quick cluster-level storage report.
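For illustration, the report's header looks roughly like this (abridged, with made-up numbers):

$ hdfs dfsadmin -report
Configured Capacity: 52844687360 (49.22 GB)
Present Capacity: 48507510784 (45.18 GB)
DFS Remaining: 48507109376 (45.18 GB)
DFS Used: 401408 (392 KB)
DFS Used%: 0.00%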


When trying to total a particular group of files within a directory, the -s option does not aggregate across a glob (at least in Hadoop 2.7.1). For example:

Directory structure:

some_dir
+abc.txt    
+count1.txt 
+count2.txt 
+def.txt    

Assume each file is 1 KB in size. You can summarize the entire directory with:

hdfs dfs -du -s some_dir
4096 some_dir

However, if I want the sum of all files whose names contain "count", the command falls short:

hdfs dfs -du -s some_dir/count*
1024 some_dir/count1.txt
1024 some_dir/count2.txt

To get around this I usually pass the output through awk.

hdfs dfs -du some_dir/count* | awk '{ total+=$1 } END { print total }'
2048 
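If you do this often, a small wrapper keeps it readable. hdfs_total is a hypothetical helper name; a minimal sketch:

# sum the first du column (bytes) across all given paths/globs
hdfs_total() {
  hdfs dfs -du "$@" | awk '{ total += $1 } END { printf "%.0f\n", total }'
}

# usage
hdfs_total some_dir/count*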

hadoop fs -du -s -h /path/to/dir displays a directory's size in human-readable form.


Percentage of used space on the Hadoop cluster:
sudo -u hdfs hadoop fs -df

Capacity under a specific folder:
sudo -u hdfs hadoop fs -du -h /user
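-df also accepts -h for readable units; the output looks something like this (made-up numbers):

$ hdfs dfs -df -h /
Filesystem            Size   Used  Available  Use%
hdfs://nn1:8020     98.3 T 67.1 T     31.2 T   68%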


hdfs dfs -count <dir>

Info from the man page:

-count [-q] [-h] [-v] [-t [<storage type>]] [-u] <path> ... :
  Count the number of directories, files and bytes under the paths
  that match the specified file pattern.  The output columns are:
  DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
  or, with the -q option:
  QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA
        DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
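For example, with -h for readable sizes (made-up numbers):

$ hdfs dfs -count -h /user/hadoop
          12          345              1.2 G /user/hadoop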

The command should be hadoop fs -du -s -h /dirPath

  • -du [-s] [-h] ... : Show the amount of space, in bytes, used by the files that match the specified file pattern.

  • -s : Rather than showing the size of each individual file that matches the
    pattern, shows the total (summary) size.

  • -h : Formats the sizes of files in a human-readable fashion rather than as a number of bytes (e.g. MB/GB/TB).

    Note that, even without the -s option, this only shows size summaries one level deep into a directory (see the example after this list).

    The output is of the form: size name(full path)
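To make the one-level behavior concrete, an illustrative run (hypothetical paths and sizes): without -s, each immediate child is listed, and subdirectories appear as a single summarized line rather than being expanded.

$ hadoop fs -du -h /data
4.5 M  /data/config.xml
1.2 G  /data/logs

$ hadoop fs -du -s -h /data
1.2 G  /data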


Extending Matt D's and other answers: as of Apache Hadoop 3.0.0, the command is

hadoop fs -du [-s] [-h] [-v] [-x] URI [URI ...]

It displays the sizes of files and directories contained in the given directory, or the length of the file in case it's just a file.

Options:

  • The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files. Without the -s option, the calculation is done by going 1-level deep from the given path.
  • The -h option will format file sizes in a human-readable fashion (e.g. 64.0m instead of 67108864).
  • The -v option will display the names of columns as a header line.
  • The -x option will exclude snapshots from the result calculation. Without the -x option (default), the result is always calculated from all INodes, including all snapshots under the given path.

du returns three columns with the following format:

 +-------------------------------------------------------------------+ 
 | size  |  disk_space_consumed_with_all_replicas  |  full_path_name | 
 +-------------------------------------------------------------------+ 
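Since the second column is the space consumed with all replicas, dividing it by the first gives the effective replication factor per path. A minimal awk sketch (skipping zero-byte entries to avoid dividing by zero):

hadoop fs -du /user/hadoop | awk '$1 > 0 { printf "%.1fx  %s\n", $2/$1, $3 }'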

Example command:

hadoop fs -du /user/hadoop/dir1 \
    /user/hadoop/file1 \
    hdfs://nn.example.com/user/hadoop/dir1 

Exit Code: Returns 0 on success and -1 on error.

source: Apache doc


With this you will get the size in GB:

hdfs dfs -du PATHTODIRECTORY | awk '/^[0-9]+/ { print int($1/(1024**3)) " [GB]\t" $2 }'
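A related trick when hunting for what is eating space: sort the per-child sizes numerically and keep the largest entries, e.g.:

hdfs dfs -du /path/to/dir | sort -rn | head -n 10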
