Distributed Gold Price Forecasting with PySpark

Abstract

A scalable time-series forecasting pipeline built with Apache Spark. By leveraging PySpark Window functions for distributed lag-feature engineering and MLlib's Linear Regression, this model accurately predicts gold prices based on a 15-year dataset, achieving an R² of 0.9995.

Figure 1

Full Document

Overview

Predicting financial markets is notoriously difficult, and the engineering challenge grows with the size of the historical dataset. This project forecasts daily gold prices using a distributed machine learning pipeline built entirely on Apache Spark (PySpark).

Instead of relying on single-node libraries like scikit-learn or pandas, I engineered this solution to run natively in a distributed cluster environment, showcasing my ability to handle Big Data constraints and scale computations horizontally.

The Challenge: Time-Series in a Distributed Paradigm

Time-series forecasting typically requires sequential data access, which fundamentally clashes with the distributed, partitioned nature of Spark DataFrames. To predict the gold price on day $t$, the model needs the prices from the 10 preceding days ($t-10$ to $t-1$).
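As a sketch, and assuming the plain linear form that MLlib's LinearRegression fits, the forecast for day $t$ can be written as below. The symbols $p_t$ (price on day $t$), $\beta_0$ (intercept), and $\beta_k$ (weight of the $k$-th lag) are notation introduced here for illustration, not taken from the repository.

$$
\hat{p}_t = \beta_0 + \sum_{k=1}^{10} \beta_k \, p_{t-k}
$$

Each training row therefore needs the target $p_t$ alongside its 10 lagged prices, which is why row order matters.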

If this were a pandas DataFrame, one .shift() call per lag would suffice. In Spark, however, rows are scattered across partitions on multiple nodes and carry no inherent order, so any sequential operation has to declare its ordering explicitly.
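To make the target shape concrete, here is a minimal single-machine sketch in plain Python of what lagging produces. The function name `make_lag_rows` and the toy prices are made up for illustration, and 3 lags are used instead of 10 to keep the output short.

```python
def make_lag_rows(prices, n_lags):
    """Return (lags, target) pairs; lags[0] is the most recent prior price."""
    rows = []
    # Start at index n_lags so every row has a full history behind it.
    for t in range(n_lags, len(prices)):
        lags = [prices[t - k] for k in range(1, n_lags + 1)]
        rows.append((lags, prices[t]))
    return rows


# Toy series (made-up numbers), ordered oldest to newest.
for lags, target in make_lag_rows([10.0, 11.0, 12.0, 13.0, 14.0], n_lags=3):
    print(lags, "->", target)
# [12.0, 11.0, 10.0] -> 13.0
# [13.0, 12.0, 11.0] -> 14.0
```

The loop relies on the list already being in chronological order, which is exactly the guarantee a Spark DataFrame does not give by default.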

The Solution: PySpark Windowing & MLlib Pipeline

To solve this within Spark, I built the following pipeline:

  • Distributed Lag Features: I ordered the data chronologically and applied lag() over a PySpark Window, once per offset from 1 to 10. This flattened the temporal dependencies into 10 distinct feature columns per row.
  • Vector Assembly: I used VectorAssembler to merge these 10 lag features into a single DenseVector per row, conforming to Spark MLlib's required input format.
  • Standardization: To help the solver converge smoothly, I fitted a StandardScaler to standardize the features (mean=0, std=1) before passing them to the estimator. Centering requires withMean=True, because by default Spark's StandardScaler only scales to unit standard deviation.

Results & Impact

The resulting LinearRegression model was trained on historical data spanning from August 2009 to January 2025. It achieved an outstanding $R^2$ score of 0.9995 on the unseen test set, with a minimal RMSE of 0.3567.
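For reference, the two metrics can be computed as in the plain-Python sketch below. In Spark they are available through `RegressionEvaluator` with `metricName="rmse"` or `metricName="r2"`; which evaluator produced the figures above is not stated here, so treat this only as the standard definitions. The toy values are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares); unitless."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy example: three exact predictions and one that is off by 1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(rmse(y_true, y_pred))      # 0.5
print(r2_score(y_true, y_pred))  # 0.8
```

RMSE is expressed in the units of the price column, while $R^2$ compares the model's squared error against always predicting the mean of the test targets.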

Figure: Gold Price Prediction

This project not only yielded a highly accurate forecasting model but also solidified my expertise in constructing end-to-end, scalable machine learning workflows using PySpark MLlib.

Source Code

For details on the evaluation metrics and the full implementation, see the repository: https://github.com/LPH1110/gold_price_pyspark