Spark read file from S3 using sc.textFile("s3n://...")

Trying to read a file located in S3 using spark-shell:

scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
myRdd: org.apache.spark.rdd.RDD[String] = s3n://myBucket/myFile1.log MappedRDD[55] at textFile at <console>:12

scala> myRdd.count
java.io.IOException: No FileSystem for scheme: s3n
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2614)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    ... etc ...

The IOException: No FileSystem for scheme: s3n error occurred with:

  • Spark 1.3.1 or 1.4.0 on a dev machine (no Hadoop libraries)
  • Running from the Hortonworks Sandbox HDP v2.2.4 (Hadoop 2.6.0), which integrates Spark 1.2.1 out of the box
  • Using the s3:// or s3n:// scheme

What is the cause of this error? A missing dependency, missing configuration, or misuse of sc.textFile()?

Or maybe this is due to a bug that affects the Spark build specific to Hadoop 2.6.0, as this post seems to suggest. I am going to try Spark for Hadoop 2.4.0 to see if this solves the issue.

Tags: java, scala, apache-spark, rdd, hortonworks-data-platform

Answers:


Use s3a instead of s3n. I had a similar issue on a Hadoop job; after switching from s3n to s3a it worked.

e.g.

s3a://myBucket/myFile1.log
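
A minimal spark-shell sketch of the same read, assuming the s3a connector and its credentials are already configured:

// same file as in the question, read through the s3a connector instead of s3n
val myRdd = sc.textFile("s3a://myBucket/myFile1.log")
myRdd.count()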


There is a Spark JIRA, SPARK-7481, open as of today, October 20, 2016, to add a spark-cloud module which includes transitive dependencies on everything s3a and azure wasb need, along with tests.

And a Spark PR to match. That is how I get s3a support into my Spark builds.

If you do it by hand, you must get the hadoop-aws JAR of the exact version the rest of your Hadoop JARs have, and a version of the AWS JARs 100% in sync with what hadoop-aws was compiled against. For Hadoop 2.7.{1, 2, 3, ...} that means:

hadoop-aws-2.7.x.jar 
aws-java-sdk-1.7.4.jar
joda-time-2.9.3.jar
+ jackson-*-2.6.5.jar

Stick all of these into $SPARK_HOME/jars. Run Spark with your credentials set up in environment variables or in spark-defaults.conf.
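
If you prefer to set the credentials programmatically instead, here is a minimal sketch; fs.s3a.access.key and fs.s3a.secret.key are the s3a connector's property names, and the environment-variable lookup is just one way to avoid hard-coding secrets:

// set s3a credentials on the shared Hadoop configuration before any s3a:// read
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))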

The simplest test is: can you do a line count of a CSV file?

val landsatCSV = "s3a://landsat-pds/scene_list.gz"
val lines = sc.textFile(landsatCSV)
val lineCount = lines.count()

Get a number: all is well. Get a stack trace. Bad news.


For Spark 1.4.x "Pre-built for Hadoop 2.6 and later":

I just copied the needed S3 and S3native packages from hadoop-aws-2.6.0.jar into spark-assembly-1.4.1-hadoop2.6.0.jar.

After that I restarted the Spark cluster and it works. Do not forget to check the owner and mode of the assembly jar.


This is sample Spark code which can read files present on S3:

// route the s3:// scheme through the native S3 filesystem and supply credentials
val hadoopConf = sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", s3Key)
hadoopConf.set("fs.s3.awsSecretAccessKey", s3Secret)
// read the S3 object(s) as an RDD[String]
val jobInput = sparkContext.textFile("s3://" + s3_location)

You can add the --packages parameter with the appropriate jars to your submission:

bin/spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 code.py

You probably have to use the s3a:// scheme instead of s3:// or s3n://. However, it is not working out of the box (for me) in the Spark shell; I see the following stack trace:

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1781)
        at org.apache.spark.rdd.RDD.count(RDD.scala:1099)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
        at $iwC$$iwC$$iwC.<init>(<console>:37)
        at $iwC$$iwC.<init>(<console>:39)
        at $iwC.<init>(<console>:41)
        at <init>(<console>:43)
        at .<init>(<console>:47)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
        at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
        ... 68 more

What I think: you have to add the hadoop-aws dependency manually (http://search.maven.org/#artifactdetails|org.apache.hadoop|hadoop-aws|2.7.1|jar), but I have no idea how to add it to spark-shell properly.


Ran into the same problem in Spark 2.0.2. Resolved it by feeding it the jars. Here's what I ran:

$ spark-shell --jars aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.3.jar,jackson-annotations-2.7.0.jar,jackson-core-2.7.0.jar,jackson-databind-2.7.0.jar,joda-time-2.9.6.jar

scala> val hadoopConf = sc.hadoopConfiguration
scala> hadoopConf.set("fs.s3.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
scala> hadoopConf.set("fs.s3.awsAccessKeyId",awsAccessKeyId)
scala> hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretAccessKey)
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> sqlContext.read.parquet("s3://your-s3-bucket/")

Obviously, you need to have the jars in the path you're running spark-shell from.


Even though this question already has an accepted answer, I think the exact details of why this is happening are still missing, so there might be room for one more answer.

If you add the required hadoop-aws dependency, your code should work.

Starting with Hadoop 2.6.0, the S3 FS connector has been moved to a separate library called hadoop-aws. There is also a JIRA for that: Move s3-related FS connector code to hadoop-aws.

This means that any version of Spark that has been built against Hadoop 2.6.0 or newer will have to use an additional external dependency to be able to connect to the S3 file system.
Here is an sbt example that I have tried and that works as expected using Apache Spark 1.6.2 built against Hadoop 2.6.0:

libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0"

In my case, I encountered some dependency issues, so I resolved them by adding exclusions:

libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0" exclude("tomcat", "jasper-compiler") excludeAll ExclusionRule(organization = "javax.servlet")

On a related note, I have yet to try it, but it is recommended to use the "s3a" and not the "s3n" filesystem starting with Hadoop 2.6.0:

The third generation, s3a: filesystem. Designed to be a switch in replacement for s3n:, this filesystem binding supports larger files and promises higher performance.
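
With the hadoop-aws dependency from the sbt snippet above on the classpath, a minimal standalone sketch looks roughly like the following; the bucket name, file path and credential lookup are hypothetical placeholders, not part of the original answer:

import org.apache.spark.{SparkConf, SparkContext}

object S3ReadExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("S3ReadExample").setMaster("local[*]"))

    // s3n credential properties; for s3a use fs.s3a.access.key / fs.s3a.secret.key instead
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    val lines = sc.textFile("s3n://my-bucket/my-file.log") // hypothetical bucket and key
    println(s"line count: ${lines.count()}")

    sc.stop()
  }
}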


S3N is not a default filesystem. You need to build your version of Spark with a version of Hadoop that has the additional libraries used for AWS compatibility. Additional info can be found here: https://www.hakkalabs.co/articles/making-your-local-hadoop-more-like-aws-elastic-mapreduce


I had to copy the jar files from a Hadoop download into the $SPARK_HOME/jars directory. Using the --jars flag or the --packages flag for spark-submit didn't work; a quick classpath check is sketched after the details below.

Details:

  • Spark 2.3.0
  • Hadoop downloaded was 2.7.6
  • Two jar files copied were from (hadoop dir)/share/hadoop/tools/lib/
    • aws-java-sdk-1.7.4.jar
    • hadoop-aws-2.7.6.jar
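
A quick way to verify that the copied jars are actually visible to the driver (run this from spark-shell; it only checks the classpath, not credentials):

// resolves only if the hadoop-aws jar is on the classpath; otherwise you get the same
// ClassNotFoundException shown in the stack traces above (the AWS SDK jars are still
// needed at read time)
Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")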

I was facing the same issue. It worked fine after setting the value for fs.s3n.impl and adding the hadoop-aws dependency.

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

  1. Download the hadoop-aws jar from the Maven repository matching your Hadoop version.
  2. Copy the jar to the $SPARK_HOME/jars location.

Now, in your PySpark script, set up the AWS access key and secret access key.

spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "ACCESS_KEY")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESSS_KEY")

# where spark is the SparkSession instance

For Spark with Scala:

spark.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "ACCESS_KEY")
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESSS_KEY")
