Connecting to Spark: A Review of the Process
The connection with Spark is established by the driver, which can be written in Java, Scala, Python, or R. Each of these languages has its own advantages and disadvantages. Java is relatively verbose, requiring a lot of code to accomplish even simple tasks. By contrast, Scala, Python, and R are high-level languages that can accomplish much with only a small amount of code. They also offer a REPL (read-evaluate-print loop), which is crucial for interactive development. We'll be using Python.
In this lesson, we will review the process of connecting to Spark. Python doesn't talk natively to Spark, so the first step is to import the pyspark module, which makes Spark functionality available in the Python interpreter. Spark is under vigorous development, and because the interface is evolving it's important to know what version you're working with. We'll be using version 2.4.1, which was released in March 2019.
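As a quick check (assuming pyspark is already installed in your Python environment), you can import the module and print its version attribute:

```python
# Import the pyspark module (installed, for example, with `pip install pyspark`)
import pyspark

# Confirm which version of Spark the module is built against
print(pyspark.__version__)   # e.g. '2.4.1'
```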
In addition to the main pyspark module, there are a few submodules that implement different aspects of the Spark interface. There are two versions of Spark machine learning: mllib, which uses an unstructured representation of data in RDDs and has been deprecated, and ml, which is based on a structured, tabular representation of data in DataFrames. We'll be using the latter.
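For reference, here is a minimal sketch of where the two APIs live; the specific classes imported (LabeledPoint and LinearRegression) are just illustrative examples:

```python
# Deprecated RDD-based machine learning API lives under pyspark.mllib
from pyspark.mllib.regression import LabeledPoint

# DataFrame-based machine learning API lives under pyspark.ml (used in this course)
from pyspark.ml.regression import LinearRegression
```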
With the pyspark module loaded, you're able to connect to Spark. The next thing you need to do is tell Spark where the cluster is located. There are two options: connect to a remote cluster or create a local cluster on a single computer. If you connect to a remote cluster, you need to specify a Spark URL that gives the network location of the cluster's master node. The URL is composed of an IP address or DNS name and a port number; the default port for Spark is 7077, although it must still be specified explicitly in the URL.
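A remote master URL therefore takes the following form; the host name below is a placeholder, not a real cluster:

```python
# Spark URL for a remote cluster: spark://<IP address or DNS name>:<port>
# Replace the placeholder host with your own master node.
remote_master = "spark://master.example.com:7077"
```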
When you're figuring out how Spark works, the infrastructure of a distributed cluster can get in the way, so it's useful to create a local cluster where everything happens on a single computer. This is the setup we'll use throughout this course. For a local cluster, you need only specify "local" and, optionally, the number of cores to use; by default, a local cluster will run on a single core. Alternatively, you can give a specific number of cores or use a wildcard to choose all available cores, as shown in the master strings below.
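These are the forms the local master string can take (the core count of 4 is just an example):

```python
# Local cluster on a single core (the default)
local_one_core = "local"

# Local cluster on a specific number of cores, e.g. 4
local_four_cores = "local[4]"

# Local cluster using all available cores (wildcard)
local_all_cores = "local[*]"
```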
You connect to Spark by creating a SparkSession object; the SparkSession class is found in the pyspark.sql submodule. You specify the location of the cluster using the master method and, optionally, assign a name to the application using the appName method, which is good practice. Finally, you call the getOrCreate method, which will either create a new session object or return an existing one.
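Putting this together, a session can be created as follows; the application name 'spark_review' is an arbitrary example:

```python
from pyspark.sql import SparkSession

# Build a session against a local cluster using all available cores.
spark = SparkSession.builder \
                    .master('local[*]') \
                    .appName('spark_review') \
                    .getOrCreate()
```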
Once the Spark session has been created, you're able to interact with Spark. Although it's possible for multiple Spark sessions to co-exist, it's good practice to stop the Spark session when you're done.
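For instance, you might confirm the connection and then shut it down like this:

```python
# Confirm the session is working by checking the Spark version
print(spark.version)

# Terminate the session and release its resources when you're done
spark.stop()
```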
"WEBVTTKind: captionsLanguage: enthe previous lesson was high-level overview of machine learning and spark in this lesson you'll review the process of connecting to spark the connection with spark is established by the driver which can be written in either Java Scala Python or are each of these languages has advantages and disadvantages Java is relatively verbose requiring a lot of code to accomplish even simple tasks by contrast Scala Python anar are high-level languages which can accomplish much with only a small amount of code they also offer a ripple or read evaluate print loop which is crucial for interactive development you'll be using Python Python doesn't talk natively to spark so we'll kick off by importing the past Bach module which makes Bach functionality available in the Python interpreter SPARC is under vigorous development because the interface is evolving it's important to know what version you're working with will be using version 2.4 point 1 which was released in March 2019 in addition to the main past Bach module there are a few sub modules which implement different aspects of the SPOC interface there are two versions of Spock machine learning ml lib which uses an unstructured representation of data and rdd's and has been deprecated an ml which is based on a structured tabular representation of data and data frames we'll be using the letter with the passport module loaded you're able to connect to Spock the next thing you need to do is tell Spock where the cluster is located here there are two options you can either connect to remote cluster in which case you need to specify a spark URL which gives the network location of the clusters master node the URL is composed of an IP address or DNS name and a port number the default port for spark is seven zero seven seven but this must still be explicitly specified when you're figuring out how spark works the infrastructure of a distributed Kuster can get in the way that's why it's useful to create a local cluster where everything happens on a single computer this is the sitter that you're going to use throughout this course for a local cluster you need only specify local and optionally the number of cores use by default a local cluster will run on a single call alternatively you can give a specific number of cores or simply use the wild card to choose all available cause you connect a spark by creating a spark session object the spark session class is found in the pass Bach but sequel sub-module you specify the location of the cluster using the master method optionally you can assign a name to the application using the F name method finally you call the get or create method which will either create a new session object or return an existing object once the session has been created you're able to interact with spark finally although it's possible for multiple spark sessions to co-exist it's good practice to stop the spark session when you're done great letthe previous lesson was high-level overview of machine learning and spark in this lesson you'll review the process of connecting to spark the connection with spark is established by the driver which can be written in either Java Scala Python or are each of these languages has advantages and disadvantages Java is relatively verbose requiring a lot of code to accomplish even simple tasks by contrast Scala Python anar are high-level languages which can accomplish much with only a small amount of code they also offer a ripple or read evaluate print loop which is crucial for 
interactive development you'll be using Python Python doesn't talk natively to spark so we'll kick off by importing the past Bach module which makes Bach functionality available in the Python interpreter SPARC is under vigorous development because the interface is evolving it's important to know what version you're working with will be using version 2.4 point 1 which was released in March 2019 in addition to the main past Bach module there are a few sub modules which implement different aspects of the SPOC interface there are two versions of Spock machine learning ml lib which uses an unstructured representation of data and rdd's and has been deprecated an ml which is based on a structured tabular representation of data and data frames we'll be using the letter with the passport module loaded you're able to connect to Spock the next thing you need to do is tell Spock where the cluster is located here there are two options you can either connect to remote cluster in which case you need to specify a spark URL which gives the network location of the clusters master node the URL is composed of an IP address or DNS name and a port number the default port for spark is seven zero seven seven but this must still be explicitly specified when you're figuring out how spark works the infrastructure of a distributed Kuster can get in the way that's why it's useful to create a local cluster where everything happens on a single computer this is the sitter that you're going to use throughout this course for a local cluster you need only specify local and optionally the number of cores use by default a local cluster will run on a single call alternatively you can give a specific number of cores or simply use the wild card to choose all available cause you connect a spark by creating a spark session object the spark session class is found in the pass Bach but sequel sub-module you specify the location of the cluster using the master method optionally you can assign a name to the application using the F name method finally you call the get or create method which will either create a new session object or return an existing object once the session has been created you're able to interact with spark finally although it's possible for multiple spark sessions to co-exist it's good practice to stop the spark session when you're done great let\n"