In this article, we look at what a recommendation system is and how it works. After that, we build a movie recommendation model. Let's begin.
What is a Recommender System?
Recommender systems are techniques that suggest items likely to be of use to a user. The suggestions are aimed at supporting users in various decision-making processes, such as what product to buy, which book to read, or which movie to watch. These systems have proven to be a valuable tool for helping online users cope with information overload, and they have become one of the most popular and powerful tools in electronic commerce. Various techniques for generating recommendations have been proposed during the last decade, and many of them are successfully deployed in commercial environments.
Recommendation Techniques
There are three main techniques for building a recommendation system:
Content based method
Collaborative filtering methods
Hybrid methods
Content based method
A content-based system uses characteristic information: information about the items (keywords, categories, etc.) and about the users (preferences, profile, etc.). The system learns to recommend items that are similar to the ones the user liked in the past. The similarity of items is calculated from the features associated with the compared items. For example, if a user has positively rated a movie that belongs to the action or thriller genre, the system can learn to recommend other movies from that genre.
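To make the idea of feature-based similarity concrete, here is a minimal sketch. The movie names and genre vectors are made up for illustration, and cosine similarity is only one of several similarity measures that could be used here.

```python
import math

# Hypothetical genre features per movie (1 = the movie belongs to the genre).
# Feature order: action, thriller, comedy, romance
movie_features = {
    "Movie A": [1, 1, 0, 0],
    "Movie B": [1, 0, 0, 0],
    "Movie C": [0, 0, 1, 1],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(liked, features):
    """Rank every other movie by its similarity to the movie the user liked."""
    scores = {
        title: cosine_similarity(features[liked], vec)
        for title, vec in features.items()
        if title != liked
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = most_similar("Movie A", movie_features)
print(ranking)
```

A user who liked the action thriller "Movie A" would be shown "Movie B" (another action movie) before "Movie C", which shares no genre with it.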
Collaborative Filtering methods
This method recommends items based on similarity measures between users and items. The items recommended to a user are those preferred by similar users. For example, if one user likes product A and another user likes product A as well as product B, the first user may also be interested in product B. The aim is to predict new interactions based on historical ones. There are two types of collaborative filtering methods:
Memory based
Model based
A memory-based method works in one of two ways. The first identifies clusters of similar users and uses the interactions of one specific user to predict the interactions of other, similar users. The second identifies clusters of items that have been rated by user A and uses them to predict user A's interaction with a different but similar item B.
A model-based method uses data mining and machine learning techniques. The aim is to train a model that can then make the predictions.
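The user-based flavour of the memory-based approach can be sketched in a few lines. The users, items, and ratings below are hypothetical, and the similarity-weighted average is just one simple way to turn neighbours' ratings into a prediction.

```python
import math

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}
ratings = {
    "u1": {"A": 5, "C": 1},
    "u2": {"A": 5, "B": 5, "C": 1},
    "u3": {"A": 1, "B": 1, "C": 5},
}

def user_similarity(r1, r2):
    """Cosine similarity computed over the items both users have rated."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    n1 = math.sqrt(sum(r1[i] ** 2 for i in common))
    n2 = math.sqrt(sum(r2[i] ** 2 for i in common))
    return dot / (n1 * n2)

def predict(user, item, ratings):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = user_similarity(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    return num / den if den else None

# u1 has not rated B; u2 (very similar taste) loved it, u3 (different taste) did not.
prediction = predict("u1", "B", ratings)
print(round(prediction, 2))
```

The prediction lands closer to u2's rating than to u3's, because u2's past ratings match u1's more closely. Raw cosine similarity on positive ratings is never negative, so practical systems often subtract each user's mean rating first.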
Hybrid Methods
A hybrid method combines the collaborative filtering and content-based methods.
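There are many ways to combine the two approaches. One simple option, shown here only as an illustrative sketch, is a weighted blend of the two scores. The weight alpha, the movie names, and the scores are all hypothetical.

```python
def hybrid_score(cf_score, content_score, alpha=0.7):
    """Weighted blend; alpha is a hypothetical weight on the collaborative score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * cf_score + (1.0 - alpha) * content_score

# Hypothetical scores for one user, both already scaled to the range [0, 1].
# "Movie C" is new and has no ratings yet, so collaborative filtering scores it 0.
cf_scores = {"Movie A": 0.9, "Movie B": 0.4, "Movie C": 0.0}
content_scores = {"Movie A": 0.2, "Movie B": 0.5, "Movie C": 0.95}

blended = {m: hybrid_score(cf_scores[m], content_scores[m]) for m in cf_scores}
ranked = sorted(blended, key=blended.get, reverse=True)
print(ranked)
```

With the blend, the unrated "Movie C" still receives a non-zero score from its content features, which is one reason hybrid methods are used for new items.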
Benefits of a Recommender System
Increase the number of items sold
There are very few techniques that increase the number of items sold without increasing the marketing effort. Once you build an automated recommendation system, it can generate recurring additional sales with little further effort.
Increase user satisfaction
A well-developed recommendation model can also improve the user's experience. With a properly designed human-computer interaction, the user will find the recommendations interesting and relevant and will also enjoy using the system.
Better understand what the user wants
The system holds a description of the user's preferences, either collected explicitly or predicted by the system. The service provider may then decide to reuse this knowledge for a number of other goals, such as improving the management of the item's stock or production.
Increase user fidelity
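A recommender system can make users more loyal. Many recommendation models compute recommendations by leveraging the information acquired from the user in previous interactions, for example their ratings of items. The longer a user interacts with the site, the more refined that information becomes.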
Applications of Recommendation Systems
Product recommendation: The most important use of recommender systems is at online retailers. Many e-commerce websites and online vendors try to present each returning user with suggestions of products they might like to buy.
Movie recommendation: Netflix offers its customers recommendations of movies they might like. These recommendations are based on ratings provided by users.
Book recommendation: Services such as Kindle offer their customers recommendations of books they might like. These recommendations are based on ratings provided by users.
Now we are going to build the movie recommendation model, which recommends movies similar to the ones a user has watched in the past. The dataset I use comes from Netflix, and I import it from Kaggle. It contains four text files, each with over 20 million rows. To keep the processing time manageable, I use only one text file to build the recommendation model.
First, I import the packages needed to build the model.
Code snippet :
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
In this step I have defined a read_data function for reading the data and then stored data in a variable.
Code snippet :
# define function for reading data
def read_data(data_loc):
    df = pd.read_csv(data_loc, header=None, names=['Customer_Id', 'Ratings'], usecols=[0, 1])
    return df
# data path
path_1 = "/content/combined_data_1.txt"
path_2 = "/content/combined_data_2.txt"
path_3 = "/content/combined_data_3.txt"
path_4 = "/content/combined_data_4.txt"
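data_1 = read_data(path_1)
data_2 = read_data(path_2)
data_3 = read_data(path_3)
data_4 = read_data(path_4)
# print the shape of each file
for name, data in [('data_1', data_1), ('data_2', data_2), ('data_3', data_3), ('data_4', data_4)]:
    print(name, data.shape)
# work with a single file from here on (the first one is assumed)
all_data = data_1
Here we can see the shape of each text file. Each file contains over 20 million records.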
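It helps to see how the file layout turns into the two columns used in the rest of the article. The sketch below runs the same read_csv call on a tiny hand-made sample in the layout of the combined_data files: a "MovieID:" header line followed by "CustomerID,Rating,Date" rows. The sample rows are illustrative only.

```python
import io
import pandas as pd

# Tiny hand-made sample: a "MovieID:" header line, then "CustomerID,Rating,Date" rows.
sample = io.StringIO(
    "1:\n"
    "1488844,3,2005-09-06\n"
    "822109,5,2005-05-13\n"
    "2:\n"
    "2059652,4,2005-09-05\n"
)

# Same arguments as read_data: keep only the first two fields of every line.
df = pd.read_csv(sample, header=None, names=["Customer_Id", "Ratings"], usecols=[0, 1])
print(df)

# Header rows such as "1:" have no second field, so their rating is NaN.
n_movies = int(df["Ratings"].isna().sum())
n_ratings = int(df["Ratings"].notna().sum())
print(n_movies, n_ratings)
```

Because the movie header rows land in the Customer_Id column with a missing rating, counting the NaN values in Ratings gives the number of movies, which is the trick the later steps rely on.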
In this step, we count the total records, total movies, total customers, and total ratings.
Code snippet :
# total movies: each movie header row (e.g. "1:") has no rating, so count the missing ratings
total_movies = all_data['Ratings'].isnull().sum()
# total customers
total_customer = all_data['Customer_Id'].nunique() - total_movies
# total ratings
total_ratings = all_data['Customer_Id'].count() - total_movies
print("Total Records ",all_data.shape[0])
print("Total Movies ",total_movies)
print("Total Customers ",total_customer)
print("Total ratings ",total_ratings)
Output :
In this step, we count how often each rating value occurs, compute each rating's share of the total as a percentage, and visualize the result.
Code snippet :
erc = all_data.groupby('Ratings')['Ratings'].agg(['count'])
## Plotting the graph
ax = erc.plot(kind = 'barh', legend = False, figsize = (15,10))
plt.title('Total : {:,} Movies, {:,} customers, {:,} ratings given'.format(total_movies, total_customer, total_ratings), fontsize=20)
plt.axis('off')
# label each bar with its rating value and its share of all ratings
total_count = erc['count'].sum()
for i in range(1,6):
    count_i = erc['count'].iloc[i-1]
    ax.text(count_i/4, i-1, 'Rating {}: {:.0f}%'.format(i, count_i*100 / total_count), color = 'white', weight = 'bold')
Next, I extract the rows whose rating is NaN into a new dataframe. These rows are the movie headers, so their positions show where the ratings for each movie start.
Then I build a NumPy array, arr_movie, that holds the movie ID for every rating row.
The last movie has no following header row, so its block is handled separately, using the length of the dataset.
Finally, I remove the NaN rows from the dataset and attach the array as a new Movie_Id column. After that, the Movie_Id and Customer_Id columns are converted to integers.
Code Snippet :
# Find all the rows whose 'Ratings' value is NaN (the movie header rows)
data_nan = pd.DataFrame(pd.isna(all_data.Ratings))
data_nan = data_nan[data_nan['Ratings'] == True]
# turn the row positions into an 'index' column so the loop below can read them
data_nan = data_nan.reset_index()
arr_movie = []
movie_id = 1
# consecutive header positions j and i enclose i-j-1 rating rows for the current movie
for i,j in zip(data_nan['index'][1:],data_nan['index'][:-1]):
    temp = np.full((1,i-j-1), movie_id)
    arr_movie = np.append(arr_movie, temp)
    movie_id += 1
# the last movie runs from its header row to the end of the dataset
final_rec = np.full((1,len(all_data) - data_nan.iloc[-1, 0] - 1),movie_id)
arr_movie = np.append(arr_movie, final_rec)
print('Movie numpy',arr_movie)
print('Length',(len(arr_movie)))
# .copy() avoids a SettingWithCopyWarning when the new column is added
all_data = all_data[pd.notnull(all_data['Ratings'])].copy()
all_data['Movie_Id'] = arr_movie.astype(int)
all_data['Customer_Id'] = all_data['Customer_Id'].astype(int)
print('Data')
print(all_data.iloc[::5000, :])
Output :
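The same movie ID assignment can also be sketched without a Python loop, which tends to be much faster on millions of rows. This is an alternative to the loop, not the code used in this article. It assumes, as the loop does, that movie IDs start at 1 and increase by 1 with no gaps. The toy frame and its values are made up.

```python
import pandas as pd

# Toy frame in the same shape as all_data before cleaning (values are made up).
toy = pd.DataFrame({
    "Customer_Id": ["1:", "11", "12", "2:", "13", "3:", "11", "14"],
    "Ratings":     [None, 3.0, 5.0, None, 4.0, None, 2.0, 1.0],
})

is_header = toy["Ratings"].isna()
# Running count of header rows seen so far = ID of the movie each row belongs to,
# assuming IDs start at 1 and are consecutive.
toy["Movie_Id"] = is_header.cumsum()

# Drop the header rows and convert the customer IDs to integers.
clean = toy[~is_header].copy()
clean["Customer_Id"] = clean["Customer_Id"].astype(int)
print(clean)
```

If the IDs were not consecutive, the ID could instead be parsed from the header text itself and forward-filled down the rows.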
Next, I build a list of the movies that are rated less often, so that only roughly the top 30% most-rated movies are kept. To do this, I compute the count and mean of the ratings for each movie ID. The code prints the benchmark: the minimum number of ratings a movie needs in order to be kept.
Code Snippet :
f = ['count','mean']
movie_gb_mi = all_data.groupby('Movie_Id')['Ratings'].agg(f)
movie_gb_mi.index = movie_gb_mi.index.map(int)
movie_benchmark = round(movie_gb_mi['count'].quantile(0.7),0)
drop_movie_list = movie_gb_mi[movie_gb_mi['count'] < movie_benchmark].index
print('Movie minimum times of review:',(movie_benchmark))
Similarly, I build a list of the inactive customers who rate less often, by counting the ratings for each customer ID. The code prints the minimum number of ratings a customer needs in order to be kept, so that roughly the 30% most active customers remain.
Code Snippet :
cust_gb_ci = all_data.groupby('Customer_Id')['Ratings'].agg(f)
cust_gb_ci.index = cust_gb_ci.index.map(int)
cust_benchmark = round(cust_gb_ci['count'].quantile(0.7),0)
drop_cust_list = cust_gb_ci[cust_gb_ci['count'] < cust_benchmark].index
print('Customer minimum times of review:',(cust_benchmark))
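To see what the 0.7 quantile cut-off does, here is a small sketch on made-up rating counts. The ten counts are hypothetical. The threshold logic mirrors the benchmark code used for movies and customers.

```python
import pandas as pd

# Hypothetical number of ratings received by ten movies (index = movie ID).
counts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], index=range(1, 11), name="count")

benchmark = round(counts.quantile(0.7), 0)    # 70th percentile of the counts, rounded
drop_ids = counts[counts < benchmark].index   # below the cut-off: dropped
keep_ids = counts[counts >= benchmark].index  # at or above the cut-off: kept

print("benchmark:", benchmark)
print("dropped:", list(drop_ids))
print("kept:", list(keep_ids))
```

Because of the rounding and the interpolation used by quantile, the share that is kept is only approximately 30%. In this toy example four of the ten movies survive.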
Now I drop all the movies that are rated less often, and also all the inactive customers who rate less often.
Code Snippet :
print('Original Shape: ',all_data.shape)
all_data = all_data[~all_data['Movie_Id'].isin(drop_movie_list)]
all_data = all_data[~all_data['Customer_Id'].isin(drop_cust_list)]
print('After dropping, the shape is: ', all_data.shape)
print('Data')
all_data
Output :
Next, I create the ratings matrix, with the customer IDs as the index, the movie IDs as the columns, and the ratings as the values. We need it for our recommendation system.
Code Snippet :
data_pivot = pd.pivot_table(all_data,values='Ratings',index='Customer_Id',columns='Movie_Id')
print(data_pivot.shape)
data_pivot
Output :
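The shape of the ratings matrix is easier to see on a toy example. The sketch below uses made-up customers, movies, and ratings. The sparsity calculation is an extra illustration and not part of the original steps.

```python
import pandas as pd

# Toy ratings in the same long format as all_data (values are made up).
toy = pd.DataFrame({
    "Customer_Id": [11, 11, 12, 13, 13],
    "Movie_Id":    [1, 2, 1, 2, 3],
    "Ratings":     [3.0, 5.0, 4.0, 2.0, 1.0],
})

# One row per customer, one column per movie; unrated pairs become NaN.
matrix = pd.pivot_table(toy, values="Ratings", index="Customer_Id", columns="Movie_Id")
print(matrix)

# Fraction of customer/movie pairs that have no rating.
sparsity = matrix.isna().to_numpy().mean()
print(round(sparsity, 2))
```

Most cells of a real ratings matrix are empty, since each customer rates only a small fraction of the movies. Predicting those missing cells is exactly the job of the recommendation model.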
We have one more dataset, which contains the movie titles. After loading it, I apply the SVD algorithm to the ratings dataset created above, taking only the first 200,000 (2 lakh) records for faster processing.
Code Snippet :
movie_titles = pd.read_csv('/content/movie_titles.csv', encoding = "ISO-8859-1", header = None, names = ['Movie_Id', 'Year', 'Name'])
movie_titles.set_index('Movie_Id', inplace = True)
movie_titles.head()
# reader
reader = Reader()
# take just the first 200,000 rows for a faster run time
data = Dataset.load_from_df(all_data[['Customer_Id', 'Movie_Id', 'Ratings']][:200000], reader)
# Use the SVD algorithm.
svd = SVD()
# Compute the RMSE of the SVD algorithm
cross_validate(svd, data, measures=['RMSE', 'MAE'])
Here we can see the cross-validation result of the SVD algorithm.
Output :
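For intuition about what the SVD model learns, here is a minimal NumPy sketch of the same kind of prediction rule: global mean plus user bias plus item bias plus the dot product of user and item factors, trained with stochastic gradient descent. It is not the Surprise implementation. The function names, hyperparameters, and toy ratings are assumptions chosen for illustration, and the RMSE is measured on the training data rather than by cross-validation.

```python
import numpy as np

def train_mf(ratings, n_users, n_items, n_factors=2, n_epochs=200,
             lr=0.01, reg=0.02, seed=0):
    """Fit biases and latent factors by SGD on (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])          # global mean rating
    bu = np.zeros(n_users)                            # user biases
    bi = np.zeros(n_items)                            # item biases
    P = rng.normal(0, 0.1, (n_users, n_factors))      # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))      # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu = P[u].copy()                          # keep old value for Q's update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return mu, bu, bi, P, Q

def predict(model, u, i):
    """Estimated rating of user u for item i."""
    mu, bu, bi, P, Q = model
    return mu + bu[u] + bi[i] + P[u] @ Q[i]

def rmse(model, ratings):
    """Root mean squared error of the model on the given triples."""
    errs = [r - predict(model, u, i) for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errs))))

# Hypothetical ratings for 4 users and 3 items, using 0-based indices.
toy_ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1),
               (2, 1, 2), (2, 2, 5), (3, 0, 5), (3, 2, 2)]

model = train_mf(toy_ratings, n_users=4, n_items=3)
baseline_rmse = float(np.std([r for _, _, r in toy_ratings]))  # always predict the mean
print("baseline RMSE:", baseline_rmse)
print("training RMSE:", rmse(model, toy_ratings))
```

The trained model should fit the toy ratings better than always predicting the mean. RMSE and MAE in the cross-validation output measure the same kind of error, but on held-out folds.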
Taking one customer ID, I extract the movies to which this user has given a rating of five.
Code Snippet :
cust_1493615 = all_data[(all_data['Customer_Id'] == 1493615) & (all_data['Ratings'] == 5)]
cust_1493615 = cust_1493615.set_index('Movie_Id')
cust_1493615 = cust_1493615.join(movie_titles)['Name']
cust_1493615
Output :
Now I predict which movies this user, who gave the movies above a rating of 5, would love to watch.
First, I drop the less often rated movies from the movie titles dataset.
Then I take the full dataset (customer ID, movie ID, and ratings) and store it in the variable data1.
After that, I fit the SVD model on the training set and predict the ratings, which serve as the estimated scores for this user.
Code snippet :
customer_1493615 = movie_titles.copy()
customer_1493615 = customer_1493615.reset_index()
customer_1493615 = customer_1493615[~customer_1493615['Movie_Id'].isin(drop_movie_list)]
# getting full dataset
data1 = Dataset.load_from_df(all_data[['Customer_Id', 'Movie_Id', 'Ratings']], reader)
trainset = data1.build_full_trainset()
svd.fit(trainset)
customer_1493615['Estimate_Score'] = customer_1493615['Movie_Id'].apply(lambda x: svd.predict(1493615, x).est)
customer_1493615 = customer_1493615.drop('Movie_Id', axis = 1)
customer_1493615 = customer_1493615.sort_values('Estimate_Score', ascending=False)
customer_1493615
Our Final Prediction
Output :
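A natural follow-up, sketched here under assumptions, is to turn the estimated scores into a short list that leaves out movies the user has already rated. The helper top_n_unseen, the movie names, and the scores below are hypothetical. Only the column names match the ones used in this article.

```python
import pandas as pd

# Hypothetical estimated scores in the same layout as the prediction table.
estimates = pd.DataFrame({
    "Movie_Id": [1, 2, 3, 4],
    "Name": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Estimate_Score": [4.8, 4.5, 3.9, 4.6],
})
already_rated = {1, 4}  # movie IDs this user has already rated (made up)

def top_n_unseen(estimates, already_rated, n=10):
    """Return the n highest-scored movies the user has not rated yet."""
    unseen = estimates[~estimates["Movie_Id"].isin(already_rated)]
    return unseen.sort_values("Estimate_Score", ascending=False).head(n)

top = top_n_unseen(estimates, already_rated, n=2)
print(top)
```

Filtering out rated movies keeps the list focused on titles the user has not seen, which is usually what a recommendation is meant to surface.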
Thank You,