Cache vs persist method in Spark

The difference between the cache and persist operations is that cache is persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY, whereas persist() can take any storage level: MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2. cache has the following definition: cache(): this.type = persist(). In other words, cache is a synonym of persist with the MEMORY_ONLY storage level, e.g. rdd.persist(StorageLevel.MEMORY_ONLY). persist has […]
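
A minimal sketch of the difference described above, assuming an existing SparkContext named sc (a hypothetical setup, not taken from the original post):

import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val cached = sc.parallelize(1 to 100).cache()

// persist() lets you choose the storage level explicitly,
// e.g. spill partitions to disk when they do not fit in memory
val persisted = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_AND_DISK)

cached.count()     // the first action materialises and stores the partitions
persisted.count()

cached.unpersist()     // free the stored blocks when no longer needed
persisted.unpersist()

Note that a storage level can only be assigned once per RDD; to change it, call unpersist() first.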

Continue reading


Simple way to analyse your data using Spark

If you want to perform SQL operations on the data to be analysed, the simplest way in Spark is as follows. Read the data from your file and create a DataFrame for it: val userDF = spark.read.format("json").load("user/Test/userdata.json") userDF.select("name", "userid").write.format("parquet").save("namesAndUid.parquet") Instead of using the read API to load a file into a DataFrame and query it, you can also […]
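
A self-contained sketch of this pattern, assuming a SparkSession named spark; the local master setting and the users view name are hypothetical additions:

import org.apache.spark.sql.SparkSession

// hypothetical local setup; in a cluster the session usually already exists
val spark = SparkSession.builder()
  .appName("SimpleAnalysis")
  .master("local[*]")
  .getOrCreate()

// load JSON into a DataFrame, as in the excerpt
val userDF = spark.read.format("json").load("user/Test/userdata.json")

// write a column subset out as Parquet
userDF.select("name", "userid").write.format("parquet").save("namesAndUid.parquet")

// or register a temporary view and query it with plain SQL
userDF.createOrReplaceTempView("users")
spark.sql("SELECT name, userid FROM users").show()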

Continue reading


How to get a JSON attribute and use it in your Spark program?

Here is the Scala code to fetch each attribute from JSON and use it in your code: val jsonFile = s"hdfs://${jsonFilefolder}" // if the file is in HDFS; otherwise provide a local directory path val strRdd = sc.parallelize(List(sc.textFile(jsonFile).collect().mkString(" "))) // create a JSON RDD val jsonDF = ssc.jsonRDD(strRdd) jsonDF.printSchema() // print the schema of the JSON data val root = jsonDF.select("RootAttribute").collect().mkString("").replaceAll("\\[", "").replaceAll("\\]", "") // replace all [, comma, ] to […]
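
For comparison, a sketch of the same task with the DataFrame reader available in Spark 2.x and later; the input path and the attribute name RootAttribute are hypothetical placeholders:

import org.apache.spark.sql.SparkSession

// sketch assuming a SparkSession `spark`; path and attribute name are placeholders
val spark = SparkSession.builder()
  .appName("JsonAttribute")
  .master("local[*]")
  .getOrCreate()

// multiLine = true lets Spark parse one JSON document spread over
// several lines (the default expects one JSON record per line)
val jsonDF = spark.read.option("multiLine", "true").json("hdfs:///path/to/data.json")

jsonDF.printSchema() // inspect the inferred schema

// read the attribute's value out as a plain Scala string (assumes it is a
// string column), avoiding the collect().mkString + replaceAll bracket-stripping
val root = jsonDF.select("RootAttribute").first().getString(0)
println(root)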

Continue reading