How to load local file in sc.textFile, instead of HDFS

Question

I'm following the great Spark tutorial, so at 46m:00s I'm trying to load the README.md, but it fails. What I'm doing is this:

    $ sudo docker run -i -t -h sandbox sequenceiq/spark:1.1.0 /etc/bootstrap.sh -bash
    bash-4.1# cd /usr/local/spark-1.1.0-bin-hadoop2.4
    bash-4.1# ls README.md
    README.md
    bash-4.1# ./bin/spark-shell
    scala> val f = sc.textFile("README.md")
    14/12/04 12:11:14 INFO storage.MemoryStore: ensureFreeSpace(164073) called with curMem=0, maxMem=278302556
    14/12/04 12:11:14 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 160.2 KB, free 265.3 MB)
    f: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[1] at textFile at <console>:12
    scala> val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox:9000/user/root/README.md
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)

How can I load that README.md?

User · Accepted Answer

Try explicitly specifying sc.textFile("file:///path to the file/"). The error occurs when the Hadoop environment is set.

SparkContext.textFile internally calls org.apache.hadoop.mapred.FileInputFormat.getSplits, which in turn uses org.apache.hadoop.fs.FileSystem.getDefaultUri if the scheme is absent. This method reads the "fs.defaultFS" parameter of the Hadoop conf. If you set the HADOOP_CONF_DIR environment variable, the parameter is usually set as "hdfs://..."; otherwise "file://".
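To make the effect of the default filesystem concrete, here is a minimal sketch that mimics the resolution with plain java.net.URI. It is only an illustration, not Hadoop's actual code path: the qualify helper and the working directory argument are assumptions made for the example, while the host, port and paths come from the error message in the question.

```scala
import java.net.URI

// Illustration only: mimics how a path without a scheme picks up the
// default filesystem, using java.net.URI instead of Hadoop's own classes.
// `qualify` is a hypothetical helper, not a Spark or Hadoop API.
def qualify(path: String, defaultFs: URI, workingDir: String): URI = {
  val uri = new URI(path)
  if (uri.getScheme != null) uri                    // an explicit scheme such as file:// wins
  else defaultFs.resolve(workingDir).resolve(path)  // otherwise inherit scheme, host and port
}

// Default filesystem and home directory as they appear in the question's error.
val hdfsDefault = new URI("hdfs://sandbox:9000/")

// "README.md" has no scheme, so it ends up under HDFS.
val fromRelative = qualify("README.md", hdfsDefault, "/user/root/")

// A file:/// path keeps pointing at the local disk.
val fromExplicit = qualify("file:///usr/local/spark-1.1.0-bin-hadoop2.4/README.md", hdfsDefault, "/user/root/")
```

Under these assumptions the first value is the hdfs://sandbox:9000/user/root/README.md path from the exception, and the second is left untouched, which is why adding file:/// avoids the error.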

User · Answer

This has been discussed on the Spark mailing list; please refer to this mail.

You should use hadoop fs -put <localsrc> ... <dst> to copy the file into HDFS:

    ${HADOOP_COMMON_HOME}/bin/hadoop fs -put /path/to/README.md README.md

User · Answer

This is the solution for this error, which I was getting on a Spark cluster hosted in Azure on a Windows cluster.

Load the raw HVAC.csv file and parse it using the function:

    data = sc.textFile("wasb:///HdiSamples/SensorSampleData/hvac/HVAC.csv")

We use (wasb:///) to allow Hadoop to access the Azure blob storage file, and the three slashes are a relative reference to the running node's container folder.

For example, if the path for your file in File Explorer in the Spark cluster dashboard is:

    sflcc1\sflccspark1\HdiSamples\SensorSampleData\hvac

then the path breaks down as follows: sflcc1 is the name of the storage account, and sflccspark is the cluster node name.

So we refer to the current cluster node name with the relative three slashes.

Hope this helps.

User · Answer

While Spark supports loading files from the local filesystem, it requires that the files are available at the same path on all nodes in your cluster.

Some network filesystems, like NFS, AFS, and MapR's NFS layer, are exposed to the user as a regular filesystem.

If your data is already in one of these systems, then you can use it as an input by just specifying a file:// path; Spark will handle it as long as the filesystem is mounted at the same path on each node. Every node needs to have the same path.

    rdd = sc.textFile("file:///path/to/file")

If your file isn't already on all nodes in the cluster, you can load it locally on the driver without going through Spark and then call parallelize to distribute the contents to workers.

Take care to put file:// in front, and to use "/" or "\" according to the OS.
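The "load it locally on the driver, then call parallelize" route could look roughly like the sketch below. It covers only the driver-side half with plain Scala I/O, so it runs without Spark; readLocalLines is a hypothetical helper name, and the approach assumes the file is small enough to fit in driver memory.

```scala
import scala.io.Source

// Reads a whole text file into memory on the machine running this code
// (the driver). `readLocalLines` is a hypothetical helper, not a Spark API.
def readLocalLines(path: String): List[String] = {
  val source = Source.fromFile(path)
  try source.getLines().toList   // materialize the lines before the handle is closed
  finally source.close()
}
```

In spark-shell, where sc is already defined, the result can then be handed to the workers with something like val rdd = sc.parallelize(readLocalLines("/path/to/file")), the path being a placeholder.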

User · Answer

If you're trying to read a file from HDFS, try setting the path in SparkConf:

    val conf = new SparkConf().setMaster("local[*]").setAppName("HDFSFileReader")
    conf.set("fs.defaultFS", "hdfs://hostname:9000")

User · Answer

This has happened to me with Spark 2.3, with Hadoop also installed under the common "hadoop" user home directory. Since both Spark and Hadoop were installed under the same common directory, Spark by default considers the scheme to be hdfs and starts looking for the input files under HDFS, as specified by fs.defaultFS in Hadoop's core-site.xml. In such cases, we need to explicitly specify the scheme as file:///<absolute path to file>.

User · Answer

I have a file called NewsArticle.txt on my Desktop.

In Spark, I typed:

    val textFile = sc.textFile("file:///C:/Users/582767/Desktop/NewsArticle.txt")

I needed to change all the \ to / characters in the file path.

To test if it worked, I typed:

    textFile.foreach(println)

I'm running Windows 7 and I don't have Hadoop installed.
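If you would rather not rewrite the slashes by hand, the same conversion can be sketched in a line of Scala. The toFileUri helper and the example path are made up for illustration; this is plain string replacement, with no escaping of spaces or other special characters.

```scala
// Turns a Windows path into a file:/// URI string by swapping every
// backslash for a forward slash. `toFileUri` is a hypothetical helper.
def toFileUri(windowsPath: String): String =
  "file:///" + windowsPath.replace('\\', '/')

// Hypothetical path, for illustration only.
val uri = toFileUri("C:\\Users\\me\\Desktop\\NewsArticle.txt")
// uri == "file:///C:/Users/me/Desktop/NewsArticle.txt"
```

The resulting string can be passed straight to sc.textFile(uri) in spark-shell.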

User · Answer

I tried the following and it worked from my local file system. Basically, Spark can read from local, HDFS and AWS S3 paths:

    listrdd = sc.textFile("file:///home/cloudera/Downloads/master-data/retail_db/products")

User · Answer

Attention:

Make sure that you run Spark in local mode when you load data from the local filesystem (sc.textFile("file:///path to the file/")), or you will get an error like this:

    Caused by: java.io.FileNotFoundException: File file:/data/sparkjob/config2.properties does not exist

This is because executors running on different workers will not find the file at that path on their own local disks.

User · Answer

gonbe's answer is excellent. But still I want to mention that file:/// = ~/../../, not $SPARK_HOME. Hope this could save some time for newbs like me.

User · Answer

You just need to specify the path of the file as "file:///directory/file".

Example:

    val textFile = sc.textFile("file:///usr/local/spark/README.md")

User · Answer

Try:

    val f = sc.textFile("./README.md")

User · Answer

If the file is located on your Spark master node (e.g., in case of using AWS EMR), then launch the spark-shell in local mode first.

    $ spark-shell --master local
    scala> val df = spark.read.json("file:///usr/lib/spark/examples/src/main/resources/people.json")
    df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

    scala> df.show()
    +----+-------+
    | age|   name|
    +----+-------+
    |null|Michael|
    |  30|   Andy|
    |  19| Justin|
    +----+-------+

Alternatively, you can first copy the file to HDFS from the local file system and then launch Spark in its default mode (e.g., YARN in case of using AWS EMR) to read the file directly.

    $ hdfs dfs -mkdir -p /hdfs/spark/examples
    $ hadoop fs -put /usr/lib/spark/examples/src/main/resources/people.json /hdfs/spark/examples
    $ hadoop fs -ls /hdfs/spark/examples
    Found 1 items
    -rw-r--r--   1 hadoop hadoop         73 2017-05-01 00:49 /hdfs/spark/examples/people.json

    $ spark-shell
    scala> val df = spark.read.json("/hdfs/spark/examples/people.json")
    df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

    scala> df.show()
    +----+-------+
    | age|   name|
    +----+-------+
    |null|Michael|
    |  30|   Andy|
    |  19| Justin|
    +----+-------+

User · Answer

You do not have to use sc.textFile(...) to convert local files into dataframes. One option is to read a local file line by line and then transform it into a Spark Dataset. Here is an example for a Windows machine in Java:

    // createStructType, createStructField and StringType come from DataTypes
    import static org.apache.spark.sql.types.DataTypes.*;

    StructType schemata = createStructType(
            new StructField[]{
                    createStructField("COL1", StringType, false),
                    createStructField("COL2", StringType, false)
            });

    String separator = ";";
    String filePath = "C:\\work\\myProj\\myFile.csv";
    SparkContext sparkContext = new SparkContext(new SparkConf().setAppName("MyApp").setMaster("local"));
    JavaSparkContext jsc = new JavaSparkContext(sparkContext);
    SQLContext sqlContext = SQLContext.getOrCreate(sparkContext);

    // Read the file on the driver, one split line per list entry
    List<String[]> result = new ArrayList<>();
    try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
        String line;
        while ((line = br.readLine()) != null) {
            String[] vals = line.split(separator);
            result.add(vals);
        }
    } catch (Exception ex) {
        System.out.println(ex.getMessage());
        throw new RuntimeException(ex);
    }

    // Distribute the rows and attach the schema
    JavaRDD<String[]> jRdd = jsc.parallelize(result);
    JavaRDD<Row> jRowRdd = jRdd.map(RowFactory::create);
    Dataset<Row> data = sqlContext.createDataFrame(jRowRdd, schemata);

Now you can use the dataframe data in your code.
