Big Data Analysis with Hadoop and RHadoop

Organised by the VSC Research Center (TU Wien) in cooperation with EuroCC Austria, EuroCC Slovakia and EuroCC Slovenia, and co-hosted by Giovanna Roda (EuroCC Austria, BOKU, and TU Wien, Austria).

“This training course will focus on the foundations of “Big Data” processing by introducing the Hadoop distributed computing architecture and providing an introductory-level tutorial for Big Data analysis using Hadoop, RHadoop, and the R libraries parallel, doParallel, foreach and Rmpi. Although online, the course will be hands-on, allowing participants to work interactively with real data in the High Performance Computing environment of the University of Ljubljana and on the Vienna Scientific Cluster.

The training event will consist of two 4-hour sessions on two consecutive days. The first day will focus on big data management and data analysis with Hadoop. Participants will learn how to (i) move big data efficiently to a cluster and to the Hadoop Distributed File System (HDFS), and (ii) perform simple big data analysis with Python scripts using MapReduce and Hadoop. The second day will focus on big data management and analysis using R and RHadoop. We will first work within RStudio and write all scripts in R using several state-of-the-art libraries for parallel computation, such as parallel, doParallel, foreach and Rmpi, and libraries for working with Hadoop, such as rmr, rhdfs and rhbase. Finally, we will show how to run parallel Slurm jobs with R scripts.”
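
For readers unfamiliar with the MapReduce model mentioned for the first day, the following is a minimal local sketch in Python of a word count, the usual introductory example. It is not taken from the course material: the function names (mapper, reducer, word_count) and the sample input are hypothetical, and the local sort only stands in for the shuffle phase that Hadoop would perform between the map and reduce steps. On a real cluster the two steps would typically be separate scripts reading from standard input.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(pairs):
    """Reduce step: sum the counts for each word.

    Expects pairs sorted by key, as they would arrive after the shuffle phase.
    """
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, a local sort standing in for the shuffle, then reduce."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


# Hypothetical sample input, for illustration only.
sample = ["big data on Hadoop", "big data with R"]
counts = word_count(sample)
print(counts)
```

The map and reduce steps are kept as separate generator functions so that each could, under the assumptions above, be moved into its own script with little change.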