How to do SQL aggregation on dataframes

Here is an example of how to perform sum(), count(), and groupBy() operations on DataFrames in Spark:

//Read the files from HDFS and create a DataFrame by applying a schema to them
val filePath = "hdfs://user/Test/*.csv"
val User_Dataframe = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .option("mode", "permissive")
  .load(filePath)

//Define the schema
val User_schema = StructType(Array(
  StructField("USER_ID", StringType, true),
  StructField("APP_ID", StringType, true),
  StructField("TIMESTAMP", StringType, true), […]
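The excerpt cuts off before the aggregation itself, so here is a minimal, self-contained sketch of groupBy() with count() and sum() on a DataFrame. It uses the Spark 2.x SparkSession API rather than the sqlContext shown above, and the sample rows, object name, and DURATION column are placeholders, not taken from the original post.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, sum}

object UserAggregationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("UserAggregationSketch").getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows standing in for the CSV data loaded above
    val userDF = Seq(
      ("u1", "app1", 10L),
      ("u1", "app2", 20L),
      ("u2", "app1", 5L)
    ).toDF("USER_ID", "APP_ID", "DURATION")

    // groupBy with count() and sum() aggregations per user
    val aggDF = userDF
      .groupBy($"USER_ID")
      .agg(count($"APP_ID").alias("app_count"), sum($"DURATION").alias("total_duration"))

    aggDF.show()
    spark.stop()
  }
}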

Continue reading


Cache vs persist method in Spark

The difference between the cache and persist operations is that cache is persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY, whereas persist() can take any of the other storage levels: MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2.

cache has the following definition:

cache(): this.type = persist()

So cache is a synonym of persist with the MEMORY_ONLY storage level, e.g. rdd.cache() is equivalent to rdd.persist(StorageLevel.MEMORY_ONLY). persist has […]
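As a quick sketch of the two calls on an RDD (the object name and sample data are illustrative only, not from the original post):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheVsPersistSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CacheVsPersistSketch").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 1000000)

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
    val cached = rdd.map(_ * 2).cache()

    // persist() accepts an explicit storage level, e.g. spill to disk
    // when the partitions do not fit in memory
    val persisted = rdd.map(_ * 3).persist(StorageLevel.MEMORY_AND_DISK)

    // Actions trigger computation; later actions reuse the stored partitions
    println(cached.count())
    println(persisted.count())

    spark.stop()
  }
}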

Continue reading


How to set hive configuration in spark code

If you want to use Hive for data loading or querying in Spark, and want to optimize your queries to fetch partitioned data, set the configurations mentioned below in your Spark code.

hiveTest.scala

object hiveTest {
  def main(args: Array[String]) {
    //Creation of the Spark context
    System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val conf = new SparkConf().setAppName("Load_Hive_Data")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc […]
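Since the excerpt stops at the context creation, here is a sketch of the same idea using the Spark 2.x SparkSession API; the Hive settings shown and the partitioned table user_events are illustrative assumptions, not taken from the full post.

import org.apache.spark.sql.SparkSession

object HiveConfigSketch {
  def main(args: Array[String]): Unit = {
    // Hive-enabled session; the Kryo serializer setting mirrors the excerpt above
    val spark = SparkSession.builder()
      .appName("Load_Hive_Data")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .enableHiveSupport()
      .getOrCreate()

    // Hive settings commonly used when working with partitioned tables (illustrative)
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // Filtering on the partition column lets Hive prune partitions (hypothetical table)
    val df = spark.sql("SELECT * FROM user_events WHERE event_date = '2020-01-01'")
    df.show()

    spark.stop()
  }
}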

Continue reading


Simple way to analyse your data using Spark

If you want to perform SQL operations on the data to be analysed, the simplest way in Spark is as follows. Read the data from your file and create a DataFrame from it:

val userDF = spark.read.format("json").load("user/Test/userdata.json")
userDF.select("name", "userid").write.format("parquet").save("namesAndUid.parquet")

Instead of using the read API to load a file into a DataFrame and query it, you can also […]
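As a rough sketch of how that analysis might continue, assuming a SparkSession named spark and the JSON path from the excerpt (the view name users and the filter are illustrative; the elided continuation of the post is not reproduced here):

import org.apache.spark.sql.SparkSession

object SimpleAnalysisSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SimpleAnalysisSketch").getOrCreate()

    // Load the JSON file into a DataFrame, as in the excerpt above
    val userDF = spark.read.format("json").load("user/Test/userdata.json")

    // Register a temporary view so the data can be queried with plain SQL
    userDF.createOrReplaceTempView("users")
    spark.sql("SELECT name, userid FROM users WHERE userid IS NOT NULL").show()

    // Spark SQL can also query the file directly, without the read API
    spark.sql("SELECT name, userid FROM json.`user/Test/userdata.json`").show()

    spark.stop()
  }
}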

Continue reading