📖 Overview
This project implements a movie recommendation system using the Alternating Least Squares (ALS) algorithm in PySpark. The system is based on the MovieLens dataset, which includes 100,000 ratings provided by 1,000 users on 1,700 movies. The ALS algorithm is a matrix factorization technique widely used in collaborative filtering for recommendation systems.
Key Highlights:
- Achieved an RMSE (Root-Mean-Square Error) of 0.938 on the test dataset using Spark’s ALS model.
- Trained on 80,000 samples and tested on 20,000 samples from the dataset.
- Generated top 5 movie recommendations for all users, optimizing the personalized experience.
📂 Project Structure
The project follows these key steps:
- Data Loading and Preprocessing
- The MovieLens dataset is split into a training set (80%) and a test set (20%).
- Movie data is mapped to corresponding movie titles using the
items
file.
2. Model Development
- The ALS algorithm is applied to build the recommendation model.
- Regularization (
regParam
) and other hyperparameters are fine-tuned for optimal performance.
3. Evaluation
The model’s performance is evaluated using Root Mean Squared Error (RMSE) on the test set.
4. Recommendation Generation
Top-5 personalized movie recommendations are generated for each user based on their past ratings and preferences.
📊 Dataset
- Source: MovieLens 100k Dataset
- Content:
- 100,000 ratings from 1,000 users on 1,700 movies.
- Each rating is on a scale of 1 to 5, representing the user’s preference for the movie.
⚙️ Python Code Implementation
1. Installing Required Packages and Setting Up Environment:
!pip install pyspark
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema_ratings = StructType([
StructField("user_id", IntegerType(), False),
StructField("item_id", IntegerType(), False),
StructField("rating", IntegerType(), False),
StructField("timestamp", IntegerType(), False)
])
schema_items = StructType([
StructField("item_id", IntegerType(), False),
StructField("movie", StringType(), False)
])
# Load data into Spark DataFrames
training = spark.read.option("sep", "\t").csv("MovieLens.training", header=False, schema=schema_ratings)
test = spark.read.option("sep", "\t").csv("MovieLens.test", header=False, schema=schema_ratings)
items = spark.read.option("sep", "|").csv("MovieLens.item", header=False, schema=schema_items)"
from pyspark.ml.recommendation import ALS
# Set ALS parameters
als = ALS(maxIter=20, regParam=0.09, userCol="user_id", itemCol="item_id", ratingCol="rating", coldStartStrategy="drop")
# Train the ALS model
model = als.fit(training)
])
# Load data into Spark DataFrames
training = spark.read.option("sep", "\t").csv("MovieLens.training", header=False, schema=schema_ratings)
test = spark.read.option("sep", "\t").csv("MovieLens.test", header=False, schema=schema_ratings)
items = spark.read.option("sep", "|").csv("MovieLens.item", header=False, schema=schema_items)
from pyspark.ml.recommendation import ALS
# Set ALS parameters
als = ALS(maxIter=20, regParam=0.09, userCol="user_id", itemCol="item_id", ratingCol="rating", coldStartStrategy="drop")
# Train the ALS model
model = als.fit(training)
])
# Load data into Spark DataFrames
training = spark.read.option("sep", "\t").csv("MovieLens.training", header=False, schema=schema_ratings)
test = spark.read.option("sep", "\t").csv("MovieLens.test", header=False, schema=schema_ratings)
items = spark.read.option("sep", "|").csv("MovieLens.item", header=False, schema=schema_items)
top_K = 5
userRecs = model.recommendForAllUsers(top_K)
# Display top-5 recommendations for 5 users
userRecs.show(5, False)
💻 How to Use
Training the Model
Run the notebook to load and preprocess the dataset, then fit the ALS model on the training data.Evaluating the Model
After training, the model is evaluated using RMSE on the test dataset. You can visualize the RMSE results as shown above.Generating Recommendations
The model will generate the top-5 movie recommendations for each user based on their past ratings and preferences.
🎯 Conclusion
The system achieved an RMSE of 0.938 on the test dataset.
Successfully generated top-5 personalized movie recommendations for all users.
The ALS algorithm proves to be an effective method for collaborative filtering in recommendation systems.