How to Install PySpark on Mac


Installing PySpark on macOS allows users to experience the power of Apache Spark, a distributed computing framework, for big data processing and analysis using Python. PySpark seamlessly integrates Spark’s capabilities with Python’s simplicity and flexibility, making it an ideal choice for data engineers and data scientists working on large-scale data projects.

To install PySpark on macOS, users typically follow a series of steps that involve setting up the Java Development Kit (JDK), installing Apache Spark, configuring Python, and setting environment variables. Additionally, installing the findspark package can streamline the process by facilitating the location of the Spark installation within Python scripts.
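
For example, a minimal sketch of how findspark is typically used inside a Python script (assuming you have installed it with pip install findspark; passing an explicit Spark home path is optional and shown only as an illustration):

# pip install findspark    <- install once from the terminal
import findspark
findspark.init()  # locate Spark; you may also pass the Spark home path explicitly

from pyspark.sql import SparkSession

# Create a SparkSession to confirm PySpark can be imported and started
spark = SparkSession.builder.appName("findspark-check").getOrCreate()
print(spark.version)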

PySpark installation steps for macOS using Homebrew

  • Step 1 – Install Homebrew
  • Step 2 – Install Java Development Kit (JDK)
  • Step 3 – Install Python
  • Step 4 – Install Apache Spark (PySpark)
  • Step 5 – Set Environment Variables
  • Step 6 – Start PySpark shell and Validate Installation
  • Step 7 – Initiate DataFrame

1. Install PySpark on Mac using Homebrew

Homebrew is a package manager for macOS and Linux systems. It allows users to easily install, update, and manage software packages from the command line. With Homebrew, users can install a wide range of software packages and utilities, including development tools, programming languages, libraries, and applications, directly from the terminal.

To use Homebrew, you first need to install it by running the official install script from the terminal.

maxwellpan@maxwellpans-MacBook-Pro Downloads % /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
==> Checking for `sudo` access (which may request your password)...
Password:
==> This script will install:
/opt/homebrew/bin/brew
/opt/homebrew/share/doc/homebrew
/opt/homebrew/share/man/man1/brew.1
/opt/homebrew/share/zsh/site-functions/_brew
/opt/homebrew/etc/bash_completion.d/brew
/opt/homebrew

Press RETURN/ENTER to continue or any other key to abort:
==> /usr/bin/sudo /usr/sbin/chown -R maxwellpan:admin /opt/homebrew
==> Downloading and installing Homebrew...
remote: Enumerating objects: 4908, done.
remote: Counting objects: 100% (4078/4078), done.
remote: Compressing objects: 100% (1629/1629), done.
remote: Total 4908 (delta 2588), reused 3699 (delta 2306), pack-reused 830
Receiving objects: 100% (4908/4908), 3.15 MiB | 3.43 MiB/s, done.
Resolving deltas: 100% (2809/2809), completed with 212 local objects.
From https://github.com/Homebrew/brew
 * [new branch]            bundle-install-euid                                    -> origin/bundle-install-euid
 + 36d8a3478e...a1cc3c54bf dependabot/bundler/Library/Homebrew/json_schemer-2.2.1 -> origin/dependabot/bundler/Library/Homebrew/json_schemer-2.2.1  (forced update)
 * [new branch]            deps-filters                                           -> origin/deps-filters
 * [new branch]            github_actions_opoo_odie                               -> origin/github_actions_opoo_odie
 * [new branch]            intel-runner-tag                                       -> origin/intel-runner-tag
 * [new branch]            long-build-queue                                       -> origin/long-build-queue
   bf4039e120..9d58b797d4  master                                                 -> origin/master
 * [new branch]            sbom_tweaks                                            -> origin/sbom_tweaks
 * [new branch]            tap-shard-fonts                                        -> origin/tap-shard-fonts
 * [new branch]            tapioca-patch                                          -> origin/tapioca-patch
 * [new tag]               4.2.17                                                 -> 4.2.17
 * [new tag]               4.2.18                                                 -> 4.2.18
 * [new tag]               4.2.19                                                 -> 4.2.19
 * [new tag]               4.2.20                                                 -> 4.2.20
 * [new tag]               4.2.21                                                 -> 4.2.21
Reset branch 'stable'
==> Updating Homebrew...
Updated 2 taps (homebrew/core and homebrew/cask).
==> Installation successful!

==> Homebrew has enabled anonymous aggregate formulae and cask analytics.
Read the analytics documentation (and how to opt-out) here:
  https://docs.brew.sh/Analytics
No analytics data has been sent yet (nor will any be during this install run).

==> Homebrew is run entirely by unpaid volunteers. Please consider donating:
  https://github.com/Homebrew/brew#donations

==> Next steps:
- Run brew help to get started
- Further documentation:
    https://docs.brew.sh

maxwellpan@maxwellpans-MacBook-Pro Downloads %

Once the installation is done, add Homebrew to your $PATH environment variable using the commands below. Note in the transcript that the first attempt fails because the .zprofile path points to /Users/admin rather than the actual home directory, /Users/maxwellpan.

# Add brew to your PATH (the .zprofile must live in your own home directory)
echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zprofile
eval "$(/opt/homebrew/bin/brew shellenv)"
maxwellpan@maxwellpans-MacBook-Pro Downloads % echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> /Users/admin/.zprofile
zsh: no such file or directory: /Users/admin/.zprofile
maxwellpan@maxwellpans-MacBook-Pro Downloads % echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> /Users/maxwellpan/.zprofile
maxwellpan@maxwellpans-MacBook-Pro Downloads % eval "$(/opt/homebrew/bin/brew shellenv)"
maxwellpan@maxwellpans-MacBook-Pro Downloads %

Note: When users interact with Homebrew from the terminal, they typically use commands like brew install, brew update, or brew upgrade to manage software installations and updates. These commands are part of the Homebrew package manager.
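
For example, to confirm Homebrew is on your PATH and to refresh its package definitions:

# Verify Homebrew is installed and reachable
brew --version

# Update Homebrew itself and its formula/cask definitions
brew update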

2. Install Java Development Kit (JDK)

Java is a prerequisite for running PySpark as it provides the runtime environment necessary for executing Spark applications. When PySpark is initialized, it starts a JVM (Java Virtual Machine) process to run the Spark runtime, which includes the Spark Core, SQL, Streaming, MLlib, and GraphX libraries. This JVM process executes the Spark code.

Java from Oracle is not open source; hence, I will use Java from OpenJDK and use brew to install it. The following command installs JDK 11 from OpenJDK.

maxwellpan@maxwellpans-MacBook-Pro Downloads % brew install openjdk@11
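
Because openjdk@11 is keg-only, Homebrew usually prints caveats after the install. A sketch of the typical follow-up steps on Apple Silicon (the exact paths come from brew's own output, so prefer what your terminal shows):

# Symlink the JDK so macOS's system Java wrappers can find it (path per brew's caveat)
sudo ln -sfn /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-11.jdk

# Optionally set JAVA_HOME in ~/.zprofile so Spark picks up JDK 11
export JAVA_HOME="$(/usr/libexec/java_home -v 11)"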

3. Install Python

PySpark is a Python library; hence, you need Python to run it.

macOS, by default, comes with a Python version, and it is recommended not to touch that version, as several macOS applications depend on it. Hence, I will install a separate Python version and create a virtual environment for it.

3.1 With Virtual Environment

brew install pyenv # Install pyenv
pyenv install 3.11.5 # Install Python version
brew install pyenv-virtualenv # Required to create a virtual environment
pyenv virtualenv 3.11.5 devenv # Create virtual environment devenv with python version 3.11.5
pyenv shell devenv # Initialize virtualenv for your shell
maxwellpan@maxwellpans-MacBook-Pro Downloads % brew install pyenv
pyenv 2.3.36 is already installed but outdated (so it will be upgraded).
==> Downloading https://ghcr.io/v2/homebrew/core/pyenv/manifests/2.4.0
############################################################################################################################################################################################################### 100.0%
==> Fetching pyenv
==> Downloading https://ghcr.io/v2/homebrew/core/pyenv/blobs/sha256:423e0c467f4c0e07f093132d036790679752c6ec3e150966f448284e8666870d
############################################################################################################################################################################################################### 100.0%

==> Upgrading pyenv
  2.3.36 -> 2.4.0
==> Pouring pyenv--2.4.0.arm64_sonoma.bottle.tar.gz
🍺  /opt/homebrew/Cellar/pyenv/2.4.0: 1,192 files, 3.5MB
==> Running `brew cleanup pyenv`...
Disable this behaviour by setting HOMEBREW_NO_INSTALL_CLEANUP.
Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).
Removing: /opt/homebrew/Cellar/pyenv/2.3.36... (1,158 files, 3.4MB)
Removing: /Users/maxwellpan/Library/Caches/Homebrew/pyenv_bottle_manifest--2.3.36... (26KB)
Removing: /Users/maxwellpan/Library/Caches/Homebrew/pyenv--2.3.36... (771.7KB)
maxwellpan@maxwellpans-MacBook-Pro Downloads % pyenv install 3.11.5
python-build: use openssl@3 from homebrew
python-build: use readline from homebrew
Downloading Python-3.11.5.tar.xz...
-> https://www.python.org/ftp/python/3.11.5/Python-3.11.5.tar.xz
Installing Python-3.11.5...
python-build: use readline from homebrew
python-build: use ncurses from homebrew
python-build: use zlib from xcode sdk
Installed Python-3.11.5 to /Users/maxwellpan/.pyenv/versions/3.11.5
maxwellpan@maxwellpans-MacBook-Pro Downloads % brew install pyenv-virtualenv
==> Downloading https://formulae.brew.sh/api/formula.jws.json
#=O#-   #       #
==> Downloading https://formulae.brew.sh/api/cask.jws.json
##O=- #      #
==> Downloading https://ghcr.io/v2/homebrew/core/pyenv-virtualenv/manifests/1.2.3
############################################################################################################################################################################################################### 100.0%
==> Fetching pyenv-virtualenv
==> Downloading https://ghcr.io/v2/homebrew/core/pyenv-virtualenv/blobs/sha256:9e4afc272034aa96f61df27d68296542cd21a92f8abdf7c2126684fb78fe93e2
############################################################################################################################################################################################################### 100.0%
==> Pouring pyenv-virtualenv--1.2.3.arm64_sonoma.bottle.tar.gz
🍺  /opt/homebrew/Cellar/pyenv-virtualenv/1.2.3: 22 files, 75KB
==> Running `brew cleanup pyenv-virtualenv`...
Disable this behaviour by setting HOMEBREW_NO_INSTALL_CLEANUP.
Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).
maxwellpan@maxwellpans-MacBook-Pro Downloads % pyenv virtualenv 3.11.5 devenv
maxwellpan@maxwellpans-MacBook-Pro Downloads % pyenv shell devenv
pyenv: shell integration not enabled. Run `pyenv init' for instructions.
maxwellpan@maxwellpans-MacBook-Pro Downloads % pyenv init
# Load pyenv automatically by appending
# the following to
# ~/.zprofile (for login shells)
# and ~/.zshrc (for interactive shells) :

export PYENV_ROOT="$HOME/.pyenv"
[[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"

# Restart your shell for the changes to take effect.

maxwellpan@maxwellpans-MacBook-Pro Downloads % pyenv shell devenv
pyenv: shell integration not enabled. Run `pyenv init' for instructions.
maxwellpan@maxwellpans-MacBook-Pro Downloads % vi ~/.zprofile
maxwellpan@maxwellpans-MacBook-Pro Downloads % vi ~/.zprofile
maxwellpan@maxwellpans-MacBook-Pro Downloads %

To activate and use the devenv virtual environment, you need to run the following command every time you open a new terminal.

# Activate devenv virtual environment

maxwellpan@maxwellpans-mbp ~ % pyenv shell devenv
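
If you prefer not to run this in every new terminal, pyenv can also remember a default interpreter; a small optional sketch (my suggestion, not part of the original steps, and the project folder name is hypothetical):

# Make devenv the default version for your user account
pyenv global devenv

# Or make it the default only inside a specific project directory
cd ~/my-spark-project
pyenv local devenv   # writes a .python-version file in that directory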

3.2 Without Virtual Environment

Using the brew command, install Python without a virtual environment.


maxwellpan@maxwellpans-mbp ~ % brew install python
==> Downloading https://formulae.brew.sh/api/formula.jws.json
############################################################################################################################################################################################################### 100.0%
==> Downloading https://formulae.brew.sh/api/cask.jws.json
############################################################################################################################################################################################################### 100.0%
Warning: python@3.12 3.12.3 is already installed and up-to-date.
To reinstall 3.12.3, run:
  brew reinstall python@3.12
maxwellpan@maxwellpans-mbp ~ %

Note: You need to install a Python version that is compatible with the Apache Spark/PySpark version you are going to install.

4. Install PySpark Latest Version on Mac

PySpark is available on PyPI, so it is easy to install from there. Installing PySpark via pip (the Python package installer) is straightforward and can be done with a single command, eliminating the need for manual downloads and configuration.

pip also resolves dependencies automatically, ensuring that all required packages are installed correctly, saving time and effort.

To install PySpark from PyPI, you should use the pip command.

maxwellpan@maxwellpans-mbp ~ % pip install pyspark
Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 1.8 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting py4j==0.10.9.7 (from pyspark)
  Obtaining dependency information for py4j==0.10.9.7 from https://files.pythonhosted.org/packages/10/30/a58b32568f1623aaad7db22aa9eafc4c6c194b429ff35bdc55ca2726da47/py4j-0.10.9.7-py2.py3-none-any.whl.metadata
  Downloading py4j-0.10.9.7-py2.py3-none-any.whl.metadata (1.5 kB)
Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.5/200.5 kB 1.8 MB/s eta 0:00:00
Building wheels for collected packages: pyspark
  Building wheel for pyspark (pyproject.toml) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=8f00360ae9fd05b36b2a69f61795fc9e7ee294e2038748842da37f476207b469
  Stored in directory: /Users/maxwellpan/Library/Caches/pip/wheels/95/13/41/f7f135ee114175605fb4f0a89e7389f3742aa6c1e1a5bcb657
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.7 pyspark-3.5.1

[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: python3.11 -m pip install --upgrade pip
maxwellpan@maxwellpans-mbp ~ %
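
Before moving on, a quick sanity check (my addition, not part of the original transcript) confirms the package is importable from the same interpreter:

# Print the installed PySpark version
python -c "import pyspark; print(pyspark.__version__)"

# Or inspect the metadata pip recorded for the package
pip show pyspark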

Alternatively, you can also install Apache Spark using the brew command.

maxwellpan@maxwellpans-mbp ~ %
maxwellpan@maxwellpans-mbp ~ % brew install apache-spark
==> Downloading https://formulae.brew.sh/api/formula.jws.json
############################################################################################################################################################################################################### 100.0%
==> Downloading https://formulae.brew.sh/api/cask.jws.json
############################################################################################################################################################################################################### 100.0%
==> Downloading https://ghcr.io/v2/homebrew/core/apache-spark/manifests/3.5.1
############################################################################################################################################################################################################### 100.0%
==> Fetching dependencies for apache-spark: openjdk@17
==> Downloading https://ghcr.io/v2/homebrew/core/openjdk/17/manifests/17.0.11
############################################################################################################################################################################################################### 100.0%
==> Fetching openjdk@17
==> Downloading https://ghcr.io/v2/homebrew/core/openjdk/17/blobs/sha256:863da25f8214b887d286a382e9af611d9ae1c056139fa4ce2ad97769c981433c
############################################################################################################################################################################################################### 100.0%
==> Fetching apache-spark
==> Downloading https://ghcr.io/v2/homebrew/core/apache-spark/blobs/sha256:79ec442b9aaf93f302007436b8c5194b67e2cdfbcf57f89a7cffb3bf90a3cf54
############################################################################################################################################################################################################### 100.0%
==> Installing dependencies for apache-spark: openjdk@17
==> Installing apache-spark dependency: openjdk@17
==> Downloading https://ghcr.io/v2/homebrew/core/openjdk/17/manifests/17.0.11
Already downloaded: /Users/maxwellpan/Library/Caches/Homebrew/downloads/204ac00846a17a30f891394d0dc1928f4e2339a1141b95858d883257de5b2a16--openjdk@17-17.0.11.bottle_manifest.json
==> Pouring openjdk@17--17.0.11.arm64_sonoma.bottle.tar.gz
🍺  /opt/homebrew/Cellar/openjdk@17/17.0.11: 635 files, 305.0MB
==> Installing apache-spark
==> Pouring apache-spark--3.5.1.all.bottle.tar.gz
🍺  /opt/homebrew/Cellar/apache-spark/3.5.1: 1,823 files, 423.3MB
==> Running `brew cleanup apache-spark`...
Disable this behaviour by setting HOMEBREW_NO_INSTALL_CLEANUP.
Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).
maxwellpan@maxwellpans-mbp ~ %

5. Set Environment Variables

If you installed Apache Spark instead of PySpark, you need to set the SPARK_HOME environment variable to point to the directory where Apache Spark is installed.

You also need to set the PYSPARK_PYTHON environment variable to point to your Python executable, for example /usr/local/bin/python3 or the interpreter inside the pyenv virtual environment created earlier.

Setting the PYSPARK_PYTHON environment variable is important when working with PySpark because it allows users to specify which Python executable should be used by PySpark. This is particularly useful in environments where multiple versions of Python are installed or when PySpark needs to run with a specific Python interpreter.
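
A minimal sketch of what these settings might look like in ~/.zprofile, assuming Apache Spark was installed with Homebrew (adjust the paths to your machine and Python setup):

# Point SPARK_HOME at the Homebrew-managed Spark distribution (its libexec folder)
export SPARK_HOME="$(brew --prefix apache-spark)/libexec"

# Put spark-submit, spark-shell, and pyspark on the PATH
export PATH="$SPARK_HOME/bin:$PATH"

# Tell PySpark which Python interpreter to use
export PYSPARK_PYTHON="$(which python3)"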

6. Validate PySpark Installation from Shell

Once the PySpark or Apache Spark installation is done, start the PySpark shell from the command line by issuing the pyspark command.

The PySpark shell refers to the interactive Python shell provided by PySpark, which allows users to interactively run PySpark code and execute Spark operations in real-time. It provides an interactive environment for exploring and analyzing data using PySpark without the need to write full Python scripts or Spark applications.

maxwellpan@maxwellpans-mbp ~ % pyspark
Python 3.11.5 (main, May  9 2024, 14:20:14) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/09 15:54:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.1
      /_/

Using Python version 3.11.5 (main, May  9 2024 14:20:14)
Spark context Web UI available at http://maxwellpans-mbp.cn.ibm.com:4040
Spark context available as 'sc' (master = local[*], app id = local-1715241246352).
SparkSession available as 'spark'.
>>>
>>>

7. Initiate DataFrame

Finally, let’s create a DataFrame to confirm the installation completed successfully.

# Create DataFrame in PySpark Shell

data = [("Java","20000"),("Python","100000"),("Scala","3000")]

df = spark.createDataFrame(data)

df.show()
maxwellpan@maxwellpans-mbp ~ % pyspark
Python 3.11.5 (main, May  9 2024, 14:20:14) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/09 15:54:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.1
      /_/

Using Python version 3.11.5 (main, May  9 2024 14:20:14)
Spark context Web UI available at http://maxwellpans-mbp.cn.ibm.com:4040
Spark context available as 'sc' (master = local[*], app id = local-1715241246352).
SparkSession available as 'spark'.
>>>
>>> data = [("Java","20000"),("Python","100000"),("Scala","3000")]
>>> df = spark.createDataFrame(data)
>>> df.show()
+------+------+
|    _1|    _2|
+------+------+
|  Java| 20000|
|Python|100000|
| Scala|  3000|
+------+------+

>>>
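
The default column names _1 and _2 come from the tuples. If you want meaningful names, you can pass a list of column names as the schema; a short follow-up example (the column names here are just illustrative):

# Recreate the DataFrame with explicit column names
df2 = spark.createDataFrame(data, ["language", "users_count"])
df2.show()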

If you found this write-up helpful, please give me a follow; your small encouragement brings greater motivation. Thanks.

