How to do SQL aggregation on DataFrames

Here is an example of how to perform sum(), count(), and groupBy() operations on DataFrames in Spark.

// Read the files from HDFS and create a DataFrame by applying a schema
val filePath = "hdfs://user/Test/*.csv"
val User_Dataframe = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("mode", "permissive")
  .load(filePath)

// Define the schema
val User_schema = StructType(Array(
  StructField("USER_ID", StringType, true),
  StructField("APP_ID", StringType, true),
  StructField("TIMESTAMP", StringType, true),
  […]
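The excerpt cuts off before the aggregation itself. A minimal sketch of how the groupBy/sum/count step might look once the schema is applied — the DataFrame and column names (USER_ID, APP_ID, TIMESTAMP) come from the snippet above, while the aggregate aliases and the cast of TIMESTAMP to long are assumptions for illustration:

```scala
import org.apache.spark.sql.functions.{col, count, sum}

// Re-apply the declared schema to the raw rows (column names are positional)
val typedDF = sqlContext.createDataFrame(User_Dataframe.rdd, User_schema)

// groupBy per user, then count() and sum() within each group
val perUser = typedDF
  .groupBy("USER_ID")
  .agg(
    count("APP_ID").alias("app_events"),                  // rows per user
    sum(col("TIMESTAMP").cast("long")).alias("ts_total")  // numeric sum; cast is an assumption
  )

perUser.show()
```

With `spark-csv`, the schema can also be supplied directly via `.schema(User_schema)` on the reader, which avoids the separate `createDataFrame` step.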



Simple way to analyse your data using Spark

If you want to perform SQL operations on data to be analysed, the simplest way in Spark is as follows. Read the data from your file and create a DataFrame:

val userDF = spark.read.format("json").load("user/Test/userdata.json")
userDF.select("name", "userid").write.format("parquet").save("namesAndUid.parquet")

Instead of using the read API to load a file into a DataFrame and query it, you can also […]
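One common way to run plain SQL against a DataFrame like the one above (not necessarily the alternative the truncated sentence had in mind) is to register it as a temporary view and query it through the SparkSession. The view name `users` and the query are illustrative assumptions; `userDF` and its columns come from the snippet above:

```scala
// Register the DataFrame as a SQL-queryable view (Spark 2.x+ SparkSession API)
userDF.createOrReplaceTempView("users")

// Run an ordinary SQL query against the view
val result = spark.sql("SELECT name, userid FROM users WHERE userid IS NOT NULL")
result.show()
```

The returned `result` is itself a DataFrame, so it can be written out with the same `write.format(...).save(...)` pattern shown above.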
