Distributed Gold Price Forecasting with PySpark
A scalable time-series forecasting pipeline built with Apache Spark. By leveraging PySpark Window functions for distributed lag-feature engineering and MLlib's Linear Regression, this model accurately predicts gold prices based on a 15-year dataset, achieving an R² of 0.9995.
Overview
Predicting financial markets is notoriously difficult, and the engineering challenge grows with the size of the historical dataset. This project forecasts daily gold prices using a distributed machine learning pipeline built entirely on Apache Spark (PySpark).
Instead of relying on single-node libraries like scikit-learn or pandas, I engineered this solution to run natively in a distributed cluster environment, showcasing my ability to handle Big Data constraints and scale computations horizontally.
The Challenge: Time-Series in a Distributed Paradigm
Time-series forecasting typically requires sequential data access, which fundamentally clashes with the distributed, partitioned nature of Spark DataFrames. To predict the gold price on day t, the model needs the prices from the 10 preceding days (t-10 to t-1).
If this were a single pandas DataFrame, a basic .shift() operation would suffice. In Spark, however, rows are scattered across partitions on multiple nodes, so order-dependent operations need an explicitly defined ordering.
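For comparison, the single-node version takes only a few lines. The following is a minimal sketch with made-up prices and hypothetical column names (`price`, `lag_1`, `lag_2`), using two lags instead of ten to keep it short:

```python
import pandas as pd

# Made-up prices in chronological order (illustrative values only).
prices = pd.DataFrame({"price": [10.0, 11.0, 12.0, 13.0, 14.0]})

# shift(k) moves the column down by k rows, so each row sees the price k days earlier.
for k in (1, 2):
    prices[f"lag_{k}"] = prices["price"].shift(k)

# The first k rows of lag_k are NaN because no earlier price exists for them.
print(prices)
```

This works because a pandas DataFrame lives in one process and has a well-defined row order; Spark offers no such implicit order.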
The Solution: PySpark Windowing & MLlib Pipeline
To solve this efficiently without shuffling massive amounts of data across the network, I built a robust pipeline:
- Distributed Lag Features: I partitioned the data chronologically and applied lag() operations over a PySpark sliding Window. This flattened the temporal dependencies into 10 distinct feature columns per row in a distributed manner.
- Vector Assembly: I used VectorAssembler to merge these 10 lag features into a single DenseVector per row, conforming to Spark MLlib's required input format.
- Standardization: To help the optimizer converge smoothly, I fitted a StandardScaler to standardize the features (mean=0, std=1) before passing them to the estimator.
Results & Impact
The resulting LinearRegression model was trained on historical data spanning from August 2009 to January 2025. It achieved an R² score of 0.9995 on the unseen test set, with an RMSE of 0.3567.
[Figure: Gold Price Prediction]
This project not only yielded a highly accurate forecasting model but also solidified my expertise in constructing end-to-end, scalable machine learning workflows using PySpark MLlib.
Source Code
For more information about the evaluation metrics and the full implementation, please visit the repository: https://github.com/LPH1110/gold_price_pyspark.