Cloudera Apache Hadoop: Big Data, the revolution of data
Speakers:
Ramon de la Rosa Falguera – IT Manager at PUE
Description:
Social networks, Industry 4.0, digitalization, IoT and instant messaging have increased the volume of data that organizations can and need to work with. Storing, processing and analysing these data allows us to know our customers better, improve our products and streamline distribution, as well as predict diseases, among many other use cases.
Apache Hadoop is a free software project involving Twitter, LinkedIn, Uber, Facebook, Intel and Cloudera, among other organizations. The project integrates dozens of tools and solutions for storing, processing and analysing large volumes of data on commodity (non-specialized) servers.
In this practical workshop we will explore the problems that arise when working with large volumes of data and how Big Data solutions address them. To do so, we will deploy a virtualized Cloudera CDH cluster, one of the Apache Hadoop distributions most widely used in the business world.
We will use 5 virtual servers to assemble a Big Data cluster and see how the capacity of these servers is aggregated, so that tasks can be launched on the cluster as if it were a single machine.
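As a rough illustration of what the deployed cluster looks like from the management side, the Python sketch below queries the Cloudera Manager REST API for the hosts it manages. The host name, port, API version and credentials are assumptions for a lab environment, not part of the workshop material.

```python
# A minimal sketch, assuming Cloudera Manager is reachable at cm-host:7180,
# that API version v19 is available and that default admin credentials are in
# use; all of these are hypothetical placeholders for a lab setup.
import requests

CM_URL = "http://cm-host:7180/api/v19"  # hypothetical Cloudera Manager endpoint
AUTH = ("admin", "admin")               # default lab credentials

# Ask Cloudera Manager for the hosts it manages; with 5 virtual servers joined
# to the cluster, 5 entries should be listed here.
resp = requests.get(f"{CM_URL}/hosts", auth=AUTH, timeout=10)
resp.raise_for_status()
for host in resp.json().get("items", []):
    print(host.get("hostname"), host.get("healthSummary"))
```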
Once the cluster is deployed, we will check its operation and explore the main components included in CDH:
- Cloudera Manager: The management tool for hosts and services.
- HDFS: The distributed storage system. It provides redundant, fault-tolerant storage.
- YARN: The cluster's resource manager, which allows us to execute MapReduce and Spark tasks (a short sketch follows this list).
- Hive / Impala: Two tools that let you explore the data stored in the cluster using an SQL-like syntax.
- Hue: The web interface for interacting with the cluster; it allows us to run queries on the data, browse the storage space and execute tasks on the cluster, among other things.
- Flume: Facilitates data ingestion into the cluster.
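To give a feel for how these pieces fit together, here is a small PySpark sketch that reads a file from HDFS and runs a word count on YARN. The HDFS path and application name are hypothetical, and Spark on YARN is assumed to be available in the CDH cluster.

```python
# A minimal sketch, assuming Spark on YARN is available in the CDH cluster and
# that a sample file has already been uploaded to HDFS at the (hypothetical)
# path /user/demo/sample.txt.
from pyspark.sql import SparkSession

# With master="yarn" the job is scheduled by YARN across the cluster's nodes,
# so the 5 servers behave as a single pool of resources.
spark = (SparkSession.builder
         .appName("cdh-workshop-sketch")   # hypothetical application name
         .master("yarn")
         .getOrCreate())

# Read a text file from the distributed storage system (HDFS).
lines = spark.read.text("hdfs:///user/demo/sample.txt")

# A classic word count, executed in parallel across the cluster.
counts = (lines.rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

for word, n in counts.take(10):
    print(word, n)

spark.stop()
```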
In the final part of the workshop, we will implement a Big Data solution on the cluster to store and analyse the logs generated by different systems: firewalls, web servers and proxies. During this part of the workshop we will address:
- The ingestion of the logs into HDFS, using Flume.
- The ETL (Extract, Transform, Load) phase that prepares the logs for later use (a small sketch follows this list).
- The creation of metadata that allows the data to be accessed as if it were a table.
- Visualization of the data in Hue.
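As an illustration of the ETL and metadata steps, the following Python sketch parses web-server access logs into tab-separated records and shows, as a plain string, the kind of Hive DDL that would expose the cleaned files as a table. The log format (Apache "combined"), paths and table name are hypothetical placeholders, not the workshop's actual pipeline.

```python
# A minimal sketch of the ETL idea, assuming web server logs in Apache
# "combined" format that have already been ingested into HDFS by Flume;
# the paths and the table name below are hypothetical placeholders.
import re

# Regular expression for one line of an Apache combined-format access log.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Turn one raw log line into a tab-separated record, or None if malformed."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    size = m.group("size")
    return "\t".join([
        m.group("host"),
        m.group("time"),
        m.group("request"),
        m.group("status"),
        size if size.isdigit() else "0",
    ])

# Hive DDL that exposes the transformed files as a queryable table
# (the "metadata" step); shown here only as an illustrative string.
CREATE_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  host STRING, time STRING, request STRING, status INT, size BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION '/user/demo/logs/clean';
"""

if __name__ == "__main__":
    raw = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 2326'
    print(parse_line(raw))
```

In the workshop itself, the equivalent transformation runs on the cluster, and the resulting table is then queried from Hue with Hive or Impala.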
This activity is part of the Cloudera Academic Program (CAP) initiative. It has been designed using the tools, software and teaching resources that Cloudera makes available to educational institutions that wish to train their students, in an official and recognized way, in Apache Hadoop Big Data technologies.
Organizations of all types are experiencing the Big Data revolution, which increases the need to incorporate professionals whose skills generate value from the analysis and exploitation of all types of data.
As a leader in the open-source Big Data market, Cloudera focuses on helping to develop the next generation of data professionals.
The Cloudera Academic Program (CAP) brings Hadoop training to academic institutions, providing students, teachers and centres with benefits in acquiring qualified and recognized knowledge in Big Data.
PUE is Cloudera's partner for the exclusive management of the Cloudera Academic Program initiative in Spain.
More information:
https://www.pue.es/educacion/cloudera-academic-program