Live on 3rd March, 2019 - Introduction to Big Data for Machine Learning and Deep Learning

Hadoop Overview

Hadoop is a distributed computing framework created around 2005 by Doug Cutting and Mike Cafarella. It is designed to process large amounts of data across a cluster of commodity machines, making it a natural fit for big data workloads. Hadoop's main goal is to store and process large datasets by splitting them into smaller chunks, stored as HDFS blocks and handed to jobs as input splits, which are then processed in parallel across the cluster.

Hadoop consists of two main components: HDFS (Hadoop Distributed File System) and MapReduce. HDFS is a distributed file system that stores data across multiple nodes and replicates each block, providing availability and scalability even when individual nodes fail. MapReduce is a programming model for processing that data in parallel across the cluster.

MapReduce works in two phases: a map phase that processes each chunk independently and emits key-value pairs, and a reduce phase that groups those pairs by key and aggregates them into the final result. Because both phases run in parallel across the cluster, Hadoop scales horizontally: adding nodes adds capacity.
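
To make the model concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are ordinary Python scripts that read from stdin and write to stdout. The file name and local test command are illustrative; on a real cluster, Hadoop Streaming would split the input, sort between the two phases, and run many copies of each script in parallel.

```python
#!/usr/bin/env python3
# Word count in the Hadoop Streaming style, mapper and reducer in one file.
# Local test:  cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce
import sys
from itertools import groupby

def mapper():
    # Map phase: emit one (word, 1) pair per word, tab-separated.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: input arrives sorted by key, so consecutive lines
    # with the same word can be summed with groupby.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        print(f"{word}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```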

Tools like Pig and Hive were developed on top of Hadoop to provide a more convenient interface than hand-written MapReduce jobs. Pig offers a high-level dataflow language (Pig Latin) for expressing data transformations, while Hive is a data warehouse system that exposes a SQL-like query language (HiveQL) over data stored in Hadoop.
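
As a sketch of what querying Hive from Python can look like, the snippet below uses the PyHive client. The host, credentials, table, and column names are hypothetical, and it assumes a running HiveServer2 instance; Hive compiles such queries into batch jobs on the cluster, so it suits analytical workloads rather than low-latency lookups.

```python
# Minimal sketch of querying Hive from Python with PyHive.
# Assumes HiveServer2 is reachable on localhost:10000; the table and
# columns (web_logs, status, ts) are hypothetical.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="analyst")
cursor = conn.cursor()
cursor.execute(
    "SELECT status, COUNT(*) AS hits "
    "FROM web_logs "
    "WHERE ts >= '2019-03-01' "
    "GROUP BY status"
)
for status, hits in cursor.fetchall():
    print(status, hits)
```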

Hadoop's main advantages are horizontal scalability and cost-effectiveness: because it runs on clusters of commodity machines, capacity can be grown by adding nodes rather than through expensive hardware upgrades.

In recent years, Spark has become a popular alternative to Hadoop MapReduce for big data processing. Spark is an open-source framework that provides high-performance, largely in-memory processing and is designed to work with the existing Hadoop ecosystem, including HDFS and Hive.

Spark's main advantage is speed: it keeps intermediate data in memory and processes it in parallel across multiple nodes, which makes it well suited to iterative workloads, interactive analytics, and near-real-time reporting. It also offers a more convenient interface than raw MapReduce, with high-level APIs in Python, Scala, Java, and R, plus libraries such as Spark SQL and Spark MLlib.
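
A short PySpark sketch of the kind of workflow Spark SQL enables: load data, register it as a table, and query it with SQL. The file path, column names, and memory setting are hypothetical and would need to be adapted to a real cluster.

```python
# Minimal PySpark sketch: load a CSV, run a SQL aggregation, show results.
# The path and column names (events.csv, user_id, amount) are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("intro-to-big-data")
    .config("spark.executor.memory", "4g")  # Spark is memory-hungry; tune per cluster
    .getOrCreate()
)

df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("events")

top_users = spark.sql(
    "SELECT user_id, SUM(amount) AS total "
    "FROM events GROUP BY user_id ORDER BY total DESC LIMIT 10"
)
top_users.show()

spark.stop()
```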

However, Spark has some limitations. Keeping data in memory means it typically needs more RAM than MapReduce, which can be a problem on clusters with limited resources. Its APIs have also changed noticeably between major versions, so code and examples written against one release may need adjusting for another.

NoSQL Databases

NoSQL databases are designed to handle large volumes of data efficiently. They offer flexible schemas, so records in the same collection do not all need identical fields, which lets users store and manage data far more dynamically than in a traditional relational database.

This makes them particularly useful in big data applications: they scale horizontally across commodity hardware and stay fast for simple reads and writes at high volume, which is also why they are often used to serve data and predictions when machine learning models are put into production.
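
To illustrate the flexible-schema idea, here is a small sketch using MongoDB's Python driver, pymongo. The database, collection, and document fields are hypothetical, and a MongoDB server is assumed to be running locally.

```python
# Sketch of schema-flexible storage with MongoDB via pymongo.
# Assumes a local mongod instance; database and collection names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["demo"]["events"]

# Documents in the same collection need not share the same fields.
events.insert_many([
    {"user_id": 1, "action": "click", "page": "/home"},
    {"user_id": 2, "action": "purchase", "amount": 19.99, "items": ["book"]},
])

# Query without a predefined schema; missing fields are simply absent.
for doc in events.find({"action": "purchase"}):
    print(doc["user_id"], doc.get("amount"))
```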

Machine Learning and Deep Learning

Machine learning and deep learning are rapidly growing fields that use algorithms to learn from data. Machine learning involves training models to make predictions or decisions based on data rather than hand-written rules, while deep learning relies on neural networks to learn patterns in complex datasets.
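
A minimal sketch of the learn-from-data-then-predict loop, using scikit-learn on synthetic data (the dataset and choice of model are purely illustrative):

```python
# Minimal supervised-learning sketch with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1_000)
model.fit(X_train, y_train)                            # learn parameters from data
print("test accuracy:", model.score(X_test, y_test))   # evaluate on held-out data
```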

Deep learning is a subset of machine learning that uses neural networks with multiple layers to analyze and interpret data. It has been applied successfully in various domains, including computer vision, natural language processing, and speech recognition.
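
For deep learning, the same kind of task might look like the following with a small multi-layer network in tf.keras, again on synthetic data; the architecture is arbitrary and only meant to show the stack-of-layers idea.

```python
# Minimal deep-learning sketch: a small multi-layer perceptron in tf.keras.
import numpy as np
import tensorflow as tf

X = np.random.rand(1_000, 20).astype("float32")
y = (X.sum(axis=1) > 10).astype("float32")   # synthetic binary labels

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print("training accuracy:", model.evaluate(X, y, verbose=0)[1])
```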

Distributed Deep Learning

Distributed deep learning uses multiple GPUs or machines to train a deep neural network at the same time, typically by giving each worker a different slice of the data and keeping the model parameters synchronized between them. This shortens training times and scales to datasets that a single machine could not handle.
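
A hedged sketch of what data-parallel training looks like with TensorFlow's tf.distribute API. MirroredStrategy replicates the model across the GPUs of a single machine; scaling out to many machines would use MultiWorkerMirroredStrategy instead, which needs extra cluster configuration not shown here.

```python
# Data-parallel training sketch with tf.distribute.MirroredStrategy
# (single machine, all visible GPUs; falls back to CPU if none are present).
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)

# Model and optimizer must be created inside the strategy's scope so that
# their variables are mirrored across replicas.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Synthetic data stands in for a real (sharded) dataset.
X = np.random.rand(10_000, 20).astype("float32")
y = (X.sum(axis=1) > 10).astype("float32")

# Keras splits each batch across replicas and averages the gradients.
model.fit(X, y, epochs=3, batch_size=256, verbose=0)
```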

Cloud Computing

Cloud computing is a model of delivering computing services over the internet. It provides users with on-demand access to compute resources, storage, and applications, allowing them to scale up or down as needed.

The main players in cloud computing include Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, and IBM Cloud. Each provider offers a unique set of services and tools for processing big data, including Spark, Hadoop, and NoSQL databases.

Leveraging Cloud Computing Tools with Big Data

Cloud computing provides an ideal platform for leveraging big data tools like Spark, Hadoop, and NoSQL databases. It allows users to scale up or down as needed, making it easy to handle large amounts of data.

The cloud also provides access to pre-configured clusters and managed services (for example, hosted Spark and Hadoop offerings such as Amazon EMR, Google Cloud Dataproc, and Azure HDInsight) that can process big data in parallel. This reduces the need for expensive hardware purchases and lets users focus on their applications instead of managing infrastructure.
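
As one concrete illustration with AWS, the snippet below uses boto3 to stage a local dataset in S3, where a managed Spark or Hadoop cluster (for example, Amazon EMR) could then read it. The bucket and file names are hypothetical, and valid AWS credentials are assumed to be configured.

```python
# Sketch: stage a local dataset in S3 so a cloud cluster can process it.
# Bucket and file names are hypothetical; AWS credentials must already be
# configured (environment variables, ~/.aws/credentials, or an IAM role).
import boto3

s3 = boto3.client("s3")

s3.upload_file("events.csv", "my-bigdata-bucket", "raw/events.csv")

# Confirm the object landed where the cluster expects to find it.
response = s3.list_objects_v2(Bucket="my-bigdata-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```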

Design Choices

When choosing a tool or platform for processing big data, there are several design choices to consider. These include:

* Data schema: Does the dataset have a fixed schema or is it highly dynamic?

* Data size and complexity: How large and complex is the dataset?

* Performance requirements: What level of performance is required?

Based on these considerations, users can choose from a range of tools and platforms, including Spark, Hadoop, NoSQL databases, and cloud computing services.
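
Purely as an illustration of how these considerations combine, the toy function below maps the three questions above onto the tool families discussed in this session. Real decisions involve many more factors (team skills, cost, existing infrastructure) than this sketch captures.

```python
# Toy rule-of-thumb mapping the design questions above to tool families.
# Illustrative only; real selection depends on many more factors.
def suggest_stack(fixed_schema: bool, data_fits_one_machine: bool, needs_low_latency: bool) -> list[str]:
    suggestions = []
    if data_fits_one_machine:
        # If the data fits comfortably on one machine, a big data stack may be unnecessary.
        suggestions.append("a single-node stack (e.g. pandas plus a relational database) may be enough")
        return suggestions
    suggestions.append("relational / warehouse storage (e.g. Hive)" if fixed_schema else "NoSQL database")
    suggestions.append("Spark" if needs_low_latency else "Hadoop MapReduce or Spark batch jobs")
    suggestions.append("consider a managed cloud cluster rather than running your own hardware")
    return suggestions

print(suggest_stack(fixed_schema=False, data_fits_one_machine=False, needs_low_latency=True))
```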

Cloud Computing Overview

Cloud computing evolved as a response to the increasing demand for scalable and flexible IT infrastructure: instead of buying and maintaining their own servers, organizations rent compute, storage, and applications over the internet and scale them up or down as needed.

As noted above, the main providers are AWS, GCP, Microsoft Azure, and IBM Cloud, and each offers managed services for the big data stack discussed here, including hosted Spark and Hadoop clusters and managed NoSQL databases.

Cloud computing provides several benefits, including:

* Scalability: resources can be scaled up or down to meet changing demand.

* Cost-effectiveness: Users only pay for the resources they use.

* Flexibility: services can be provisioned quickly and accessed from anywhere.

However, cloud computing also has some limitations. It requires a stable internet connection and can be affected by outages or security breaches.

Conclusion

Hadoop, Spark, and NoSQL databases are essential tools for processing big data. Hadoop and Spark provide scalable storage and parallel computation, while NoSQL databases add flexible schemas that allow data to be stored and managed more dynamically than in a traditional relational database.

Cloud computing provides an ideal platform for leveraging these tools. It allows users to scale up or down as needed, making it easy to handle large amounts of data.

When choosing a tool or platform, users must consider several design choices, including the data schema, size and complexity, and performance requirements.

Ultimately, the choice between Hadoop, Spark, NoSQL databases, and cloud computing services will depend on the specific needs and goals of the project.

"WEBVTTKind: captionsLanguage: enhi friends the next life session that we have is on 3rd of March it's on 3rd of March which is a Sunday from 10 a.m. to 12 p.m. the regular AAR's that we have our live sessions in and the topic this time is very exciting and very close to my heart so this is a question that a lot of our students have asked which is what is Big Data why do we need big data and how big data is how all the techniques in Big Data or all the methods and all the tools in Big Data are related to machine learning and deep learning so big data is a very hyped up term it's a very hyped up term people hear it every time and people do not know what it means exactly on what tools fall under Big Data they don't know why we need it and how some of these big data tools are actually used for applications in machine learning and deep learning there are many applications of big data tools but what we will do here in this live session is we will give you an introduction to Big Data for machine learning and deep learning applications so we limit our applications mostly to machine learning and deep learning and we'll cover the breadth of big data tools and techniques and why we need them we will go a very foundational concepts we will go over the fundamentals we will go about the fundamentals of why we need big data and things like that so few of the topics that we would like to cover is what is Big Data to start with what is this term big data which is because this is this is a big umbrella term right there is a big umbrella term set a bunch of tools and techniques so we will explain what the data is why do we need these new tools in Big Data especially currently and how did these tools evolve how did many of the techniques and tools that are currently referred to as Big Data evolve over time in the last 20 to 30 years right why is the need for them and how did they evolve so we'll go over some historical perspective of all the tools will also give you an idea of the landscape what all tools are available and what are the choices that you have and when to pick which choice right of course we will touch upon some of the basic and the most widely used data tools like Hadoop we give an overview of Hadoop again this is a two hour session so it will be impossible to go into any of these techniques or tools in full depth right so we will give you a ten thousand feet view think of it like this we will give you a ten thousand feet view so that you know what tool is useful and why a tool is needed and where to use which tool we will not go into each of these tools in depth because that's not possible in to our session probably we could do a future live sessions diving deep into some of these techniques for example when we grow or Hadoop I will explain you a little bit about what is hive what is pig what is HDFS if you don't know these terms don't worry I'll explain all of them in very simple terms during the live session so again what what was the need for SPARC why is everybody moving from Hadoop's to Spock what advantages the SPARC give similarly an overview have spark an overview of SPARC SQL and spark ml spark ml is probably one of the best tools here for big data based machine learning today right again we will also touch upon distributed deep learning so how do you perform deep learning where you have when you have extremely large datasets when you have extremely large datasets and we have lots of computational lead right how do you do it when you have thousands or even hundreds of GPUs right so 
we touch upon all these topics and after this discussion we'll also for design choices which means which told to pick event so a design choice basically helps you understand which tool to pick when and why which tool to pick and wire if I should pick a tool here why do you even have to pick a big data tool why can't you just do with not using a big data tool right we will give a brief overview of something called as a no sequel databases again I'll show you how no sequel databases are used for machine learning and deep learning applications especially when your production izing these models especially when you are production izing these models right of course will also touch upon how no sequel databases are powerful for data pre-processing they are very powerful for data pre process right so we touch upon some of these things we will also conclude the discussion with an overview of cloud computing because this is a term that is extensively used so I will give you a brief overview of what is cloud computing how did this whole a whole area of cloud computing evolved what are the major choices that we have like Amazon's web services Google's compute platform Microsoft Azure and how how we can actually leverage some of these tools or some of these cloud computing platforms along with big data tools so so we I will try to mix all of them so how do you leverage cloud computing tools plus big data tools plus machine learning and deep learning applications how all of them fit together so I'll give you an overview of how everything fits together because there is a lot of confusion among students on how these three things fit together how cloud how cloud and big data and machine learning and deep learning fit together a lot of people don't understand how they fit how each of the needs everything else right so we'll go over that in at the end of the discussion the prerequisites to understand this would be we have already covered this we have covered SQL and basics of databases in our course so knowledge of SQL and databases would certainly help you understand what is happening in this live session better you also I would also assume that you have some basic understanding if not a very deep understanding some basic understanding of machine learning and deep learning algorithms at least some of them if not all so this live session will be limited only to a registered student so this live session will be limited only to a registered students and this will be available via our desktop app right so this will be available via our desktop app if any of you have not set up your desktop app please reach out to us before the live session get the pin that you need to access the classes the whole live session via the nest or app so just repeating itself please ensure that your desktop app is all set and ready to go before the live session starts so that you do not miss any part of the live session it's so yeah we are looking forward to the 3rd of March which is a Sunday to dive into this is a very interesting topic and afters but let's let's be very frank and clear here I will not be able to dive deep I will not be able to dive deep into any topic we will give you overviews here we will give you overviews and basic design choices that we have right so yeah looking forward to it I hope most of you try and join this discussion and hope you have some great fa Q's alsohi friends the next life session that we have is on 3rd of March it's on 3rd of March which is a Sunday from 10 a.m. to 12 p.m. 
the regular AAR's that we have our live sessions in and the topic this time is very exciting and very close to my heart so this is a question that a lot of our students have asked which is what is Big Data why do we need big data and how big data is how all the techniques in Big Data or all the methods and all the tools in Big Data are related to machine learning and deep learning so big data is a very hyped up term it's a very hyped up term people hear it every time and people do not know what it means exactly on what tools fall under Big Data they don't know why we need it and how some of these big data tools are actually used for applications in machine learning and deep learning there are many applications of big data tools but what we will do here in this live session is we will give you an introduction to Big Data for machine learning and deep learning applications so we limit our applications mostly to machine learning and deep learning and we'll cover the breadth of big data tools and techniques and why we need them we will go a very foundational concepts we will go over the fundamentals we will go about the fundamentals of why we need big data and things like that so few of the topics that we would like to cover is what is Big Data to start with what is this term big data which is because this is this is a big umbrella term right there is a big umbrella term set a bunch of tools and techniques so we will explain what the data is why do we need these new tools in Big Data especially currently and how did these tools evolve how did many of the techniques and tools that are currently referred to as Big Data evolve over time in the last 20 to 30 years right why is the need for them and how did they evolve so we'll go over some historical perspective of all the tools will also give you an idea of the landscape what all tools are available and what are the choices that you have and when to pick which choice right of course we will touch upon some of the basic and the most widely used data tools like Hadoop we give an overview of Hadoop again this is a two hour session so it will be impossible to go into any of these techniques or tools in full depth right so we will give you a ten thousand feet view think of it like this we will give you a ten thousand feet view so that you know what tool is useful and why a tool is needed and where to use which tool we will not go into each of these tools in depth because that's not possible in to our session probably we could do a future live sessions diving deep into some of these techniques for example when we grow or Hadoop I will explain you a little bit about what is hive what is pig what is HDFS if you don't know these terms don't worry I'll explain all of them in very simple terms during the live session so again what what was the need for SPARC why is everybody moving from Hadoop's to Spock what advantages the SPARC give similarly an overview have spark an overview of SPARC SQL and spark ml spark ml is probably one of the best tools here for big data based machine learning today right again we will also touch upon distributed deep learning so how do you perform deep learning where you have when you have extremely large datasets when you have extremely large datasets and we have lots of computational lead right how do you do it when you have thousands or even hundreds of GPUs right so we touch upon all these topics and after this discussion we'll also for design choices which means which told to pick event so a design choice basically helps you 
understand which tool to pick when and why which tool to pick and wire if I should pick a tool here why do you even have to pick a big data tool why can't you just do with not using a big data tool right we will give a brief overview of something called as a no sequel databases again I'll show you how no sequel databases are used for machine learning and deep learning applications especially when your production izing these models especially when you are production izing these models right of course will also touch upon how no sequel databases are powerful for data pre-processing they are very powerful for data pre process right so we touch upon some of these things we will also conclude the discussion with an overview of cloud computing because this is a term that is extensively used so I will give you a brief overview of what is cloud computing how did this whole a whole area of cloud computing evolved what are the major choices that we have like Amazon's web services Google's compute platform Microsoft Azure and how how we can actually leverage some of these tools or some of these cloud computing platforms along with big data tools so so we I will try to mix all of them so how do you leverage cloud computing tools plus big data tools plus machine learning and deep learning applications how all of them fit together so I'll give you an overview of how everything fits together because there is a lot of confusion among students on how these three things fit together how cloud how cloud and big data and machine learning and deep learning fit together a lot of people don't understand how they fit how each of the needs everything else right so we'll go over that in at the end of the discussion the prerequisites to understand this would be we have already covered this we have covered SQL and basics of databases in our course so knowledge of SQL and databases would certainly help you understand what is happening in this live session better you also I would also assume that you have some basic understanding if not a very deep understanding some basic understanding of machine learning and deep learning algorithms at least some of them if not all so this live session will be limited only to a registered student so this live session will be limited only to a registered students and this will be available via our desktop app right so this will be available via our desktop app if any of you have not set up your desktop app please reach out to us before the live session get the pin that you need to access the classes the whole live session via the nest or app so just repeating itself please ensure that your desktop app is all set and ready to go before the live session starts so that you do not miss any part of the live session it's so yeah we are looking forward to the 3rd of March which is a Sunday to dive into this is a very interesting topic and afters but let's let's be very frank and clear here I will not be able to dive deep I will not be able to dive deep into any topic we will give you overviews here we will give you overviews and basic design choices that we have right so yeah looking forward to it I hope most of you try and join this discussion and hope you have some great fa Q's also\n"