Ramgopal Prajapat:

Learnings and Views

Movie Recommendation - Content-Based Filtering

By: Ram on Oct 30, 2020

In this blog, we will discuss a few approaches to recommend a movie. We will use the dataset made by GroupLens. GroupLens Research has collected and made available rating data sets from the MovieLens web site.

We will use the small version of the dataset because our focus is on learning the steps of movie recommendation. It has about 9,000 movies rated by about 600 users.

Content-Based Filtering: Overview

We spend more and more time exploring things on the web, be it browsing products, interacting with friends and family on social media platforms, watching movies on Netflix, or listening to music.

These platforms have a huge number of options to offer their users. They aim to deliver personalized and relevant options to improve customer engagement and experience, and they use many AI and ML algorithms to achieve this. We will discuss one of these techniques in this blog.

In the age of intelligence, we are heading toward a scenario where a system probably knows a bit more about a user's preferences than the user does. But it makes things easier for the user, so why complain?

Let's learn about a simple, interesting, and frequently used algorithm that Amazon and many others use to show relevant products (that is, recommend them) to customers on their web platforms. It is called Content-Based Collaborative Filtering or Item-Item collaborative filtering.


» The content-based filtering approach focuses on the features of the item (or the description of the item). For example, in movie recommendation, the attributes of the movies are what the similarity is computed on.
» In this approach, items similar to those a user liked or purchased earlier are recommended. For example, if a user liked or watched comedy movies earlier, the next recommendations will be comedy movies.
» Features for similarity can be extracted by different methods, such as TF-IDF or decision trees (or other machine learning methods).
» For example, for a new article recommendation, keywords or topics can be identified using NLP and topic modeling.
» When limited information about user preferences is available, this method can match on the features of products that users liked or purchased earlier.
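
As a minimal sketch of the idea in the points above, the snippet below ranks unseen movies by how many genres they share with the movies a user already liked. The titles, the genre sets, and the helper name `recommend_by_features` are made up for illustration; a real system would likely use richer features such as TF-IDF keywords.

```python
# Toy catalogue: movie title -> set of descriptive features (here, genres).
catalogue = {
    "Movie A": {"comedy", "romance"},
    "Movie B": {"comedy"},
    "Movie C": {"horror", "thriller"},
    "Movie D": {"comedy", "children"},
}


def recommend_by_features(liked_titles, catalogue, top_n=2):
    """Rank unseen items by the number of features shared with liked items."""
    # Build the user's profile as the union of features of the liked items.
    profile = set()
    for title in liked_titles:
        profile |= catalogue[title]
    # Score each unseen item by its overlap with the profile.
    scores = {
        title: len(features & profile)
        for title, features in catalogue.items()
        if title not in liked_titles
    }
    # Highest overlap first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    # Items with no shared feature are never recommended.
    return [title for title, score in ranked[:top_n] if score > 0]


recommendations = recommend_by_features(["Movie A"], catalogue)
print(recommendations)
```

A user who liked only "Movie A" gets the two other comedies, while the horror title is left out because it shares no feature with the profile.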

Read Data

There are three data files:

  • Movie
  • Rating
  • Tags

We will explore these files and understand the data first. Now, get the data in.


# Read Data

import pandas as pd

ratings = pd.read_csv("/content/drive/My Drive/MICA/Recommendation/ml-latest-small/ratings.csv")

movies = pd.read_csv("/content/drive/My Drive/MICA/Recommendation/ml-latest-small/movies.csv")

tags = pd.read_csv("/content/drive/My Drive/MICA/Recommendation/ml-latest-small/tags.csv")

Explore Data

Now we have three data files, and we need to explore their contents.

The ratings file captures the rating each user gave to a movie. It also has a timestamp, which may not be relevant for our analysis.

We will do a basic frequency analysis: on average, how many movies does a user review, and what is the distribution of the number of movies reviewed per user?

Also, which are the top and bottom rated movies, and which movies are the most rated (rated by the largest number of users)?
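
The frequency analysis described above could look roughly like the sketch below. It runs on a tiny made-up frame, `toy_ratings`, that only mimics the column names of the ratings file; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Tiny stand-in for the ratings file (same column names, invented values).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 10],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.0, 5.0],
})

# Number of movies rated by each user, and the average across users.
ratings_per_user = toy_ratings.groupby("userId")["rating"].count()
avg_movies_per_user = ratings_per_user.mean()

# Per-movie rating count and mean rating, best mean rating first.
movie_stats = (
    toy_ratings.groupby("movieId")["rating"]
    .agg(["count", "mean"])
    .sort_values("mean", ascending=False)
)

print(ratings_per_user.describe())  # distribution of ratings per user
print(avg_movies_per_user)
print(movie_stats)
```

On the real data, `movie_stats.head()` and `movie_stats.tail()` would give the top and bottom rated movies, although movies with very few ratings can dominate both ends.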

import seaborn as sns

import matplotlib.pyplot as plt

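# Popular movies - rated by the most users

popular_movies = pd.DataFrame(ratings.groupby('movieId')['rating'].count())

most_popular_movies = popular_movies.sort_values('rating', ascending=False)[:10]

# most_popular_movies.head(10)

# Move movieId out of the index so the frame has the two columns named below

most_popular_movies = most_popular_movies.reset_index()

most_popular_movies.columns = ['movieId', 'rating_count']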

We need to match movie names to these IDs. First, let's explore the movies data.

This file has information about movies, such as titles and genres.


# Join and get the movie title

most_popular_movies = pd.merge(most_popular_movies, movies, on ='movieId', how='left')


import matplotlib.pyplot as plt


plt.bar(most_popular_movies.title, most_popular_movies.rating_count, color='green')

plt.xlabel("Most Popular Movie")

plt.ylabel("# of Users Rated")

plt.title("Rating Counts by Movie")

plt.xticks(rotation=90) # change orientation of X axis tick label




We can also see which users have rated the most movies.

# Frequent reviewers - users with the most ratings

frequent_movies_reviewers = pd.DataFrame(ratings.groupby('userId')['rating'].count())

most_frequent_movies_reviewers = frequent_movies_reviewers.sort_values('rating', ascending=False)[:10]

# Move userId out of the index so the frame has the two columns named below

most_frequent_movies_reviewers = most_frequent_movies_reviewers.reset_index()

most_frequent_movies_reviewers.columns = ['userId', 'review_count']


The third data file, tags, has information about the movie tags. We may want to understand the different types of tags and the count of movies related to each tag.

Tags do not overlap much across movies. We will see how they can be used for finding similar movies.
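
One way to check how much tags overlap is to count the distinct movies carrying each tag. The sketch below does this on a small invented frame, `toy_tags`, that borrows the `userId`, `movieId`, and `tag` column names of the tags file; lower-casing the tags first is an assumption about how one might want to normalise them.

```python
import pandas as pd

# Tiny stand-in for the tags file (invented values).
toy_tags = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 30],
    "tag":     ["funny", "funny", "Funny", "dark", "pixar"],
})

# Normalise case so that "Funny" and "funny" count as the same tag.
toy_tags["tag"] = toy_tags["tag"].str.lower()

# Number of distinct movies carrying each tag, most common first.
movies_per_tag = (
    toy_tags.groupby("tag")["movieId"]
    .nunique()
    .sort_values(ascending=False)
)

print(movies_per_tag)
```

A long tail of tags attached to a single movie would confirm that tags alone give little overlap for a similarity measure.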

Next, we look at the genres of each movie.

We can split the genre values by "|" to get a list of genres per movie.

moview_genres = movies['genres'].str.split("|", n = -1, expand = False)


Now, we need a distinct list of genres (a dictionary of genres) and the count of movies associated with each genre.

# Assign an integer to each genre

# Note: this import path may not be available in newer Keras releases

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()

# Build the genre vocabulary from the split genre lists

tokenizer.fit_on_texts(moview_genres)

moview_genres_encoded = tokenizer.texts_to_sequences(moview_genres)

# Number of Genre

print("Number of distinct words: ",len(tokenizer.word_index))

# List of Genre

print("Genre", tokenizer.word_index)
Number of distinct words: 20
Genre {'drama': 1, 'comedy': 2, 'thriller': 3, 'action': 4, 'romance': 5, 'adventure': 6, 'crime': 7, 'sci-fi': 8, 'horror': 9, 'fantasy': 10, 'children': 11, 'animation': 12, 'mystery': 13, 'documentary': 14, 'war': 15, 'musical': 16, 'western': 17, 'imax': 18, 'film-noir': 19, '(no genres listed)': 20}

Based on this data sample, there are 20 distinct movie genres (including the "(no genres listed)" placeholder). To count the number of movies under each genre, we can use the step below.

# Prepare genre dictionary

genres = {}

for row in moview_genres:

  for g in row:

    if g in genres:

      genres[g] += 1

    else:

      genres[g] = 1
# Plot the movies under each of the genre

import matplotlib.pyplot as plt

import numpy as np


plt.bar(list(genres.keys()), height=list(genres.values()), color='gray')

plt.xlabel("Movie Genre")

plt.ylabel("# of Movies")

plt.title("# of Movies by Genre")

plt.xticks(rotation=90) # change orientation of X axis tick label




A lot of movies are classified as Drama and Comedy in this sample.

We have built a good understanding of the data. Now, we need to build a recommendation engine.

We will use the Content-Based Collaborative Filtering algorithm.

Based on the current movie, the next movie or movies will be selected by their similarity to the current movie.

The similarity metric will be calculated in two different ways.

  • User Rating Based Movie Similarity: movies whose user ratings follow a pattern similar to the current movie's are treated as similar, and the most similar ones are recommended.
  • Genres Based Movie Similarity: similar movies are defined by their shared genres and recommended on that basis.

User Rating Based Movie Similarity

We restructure the data frame so that each column represents a movie and each row holds the ratings given by one user.

Pandas has a very useful function, pivot_table(), to help us restructure the data.

  • index: userId becomes the row index, so each user has one row.
  • columns: we keep one column for each movie.
  • values: the value for each combination of user and movie is the rating.


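pivot_user_movie = pd.pivot_table(ratings, index=["userId"], values=["rating"],

                                  columns=["movieId"])

# Drop level of column index

pivot_user_movie.columns = pivot_user_movie.columns.droplevel(0)

# Correlation between movies (Pearson, computed over the users who rated both)

corr_matrix = pivot_user_movie.corr(method='pearson')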

We now have a similarity value for each pair of movies.

We have used the Pearson Correlation Coefficient (PCC) as the similarity measure.

The value varies between -1 and +1. A value close to 0 means no relationship (or similarity). A negative value means the two movies tend to be rated in opposite ways. Ideally, we want to recommend movies whose PCC is as close to +1 as possible.
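
To make the measure concrete, here is a small plain-Python sketch of the Pearson coefficient computed only over the users who rated both movies. The helper `pearson` and the three rating dictionaries are invented for illustration; pandas performs an equivalent calculation when it correlates two columns.

```python
from math import sqrt


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the users who rated both movies.

    ratings_a and ratings_b map a user id to that user's rating.
    Returns None when fewer than two users overlap or a variance is zero.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return None
    xs = [ratings_a[u] for u in common]
    ys = [ratings_b[u] for u in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


# Made-up ratings: user id -> rating.
movie_x = {1: 5.0, 2: 3.0, 3: 1.0, 4: 4.0}
movie_y = {1: 4.0, 2: 3.0, 3: 2.0}  # rated in the same direction as movie_x
movie_z = {1: 1.0, 2: 3.0, 3: 5.0}  # rated in the opposite direction

print(pearson(movie_x, movie_y))
print(pearson(movie_x, movie_z))
```

The first pair moves together and lands at the +1 end of the scale, the second pair moves in opposite directions and lands at the -1 end. Note that only three users overlap in each case, which is exactly the situation where a high coefficient should be read with care.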

We can validate the outcome for a few movies to be confident of the result and process.

Movie 1 has around a 0.97 correlation coefficient with movie 8. Let's explore these two movies.


# Ratings given to movie 1

movie_1 = ratings.loc[ratings['movieId']==1]

print("Movie 1 Rating \n", movie_1)

# Ratings given to movie 8

movie_8 = ratings.loc[ratings['movieId']==8]

print("Movie 8 Rating \n", movie_8)


We want to combine the two to see whether users who have rated both movies 1 and 8 show a similar pattern.

Five users have rated both movies, and their ratings are fairly close, which explains the high correlation coefficient between the two. We can check a few more examples in the same way.
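
A sketch of this kind of check is shown below on two small invented frames, `movie_a` and `movie_b`, standing in for the filtered ratings of two movies. An inner join on `userId` keeps only the co-rating users, and the two rating columns can then be compared directly.

```python
import pandas as pd

# Made-up ratings for two movies; the notebook would filter `ratings` instead.
movie_a = pd.DataFrame({"userId": [1, 2, 3, 4], "rating": [4.0, 3.0, 5.0, 2.0]})
movie_b = pd.DataFrame({"userId": [2, 3, 4, 5], "rating": [3.5, 4.5, 2.0, 1.0]})

# Inner join keeps only the users who rated both movies.
co_rated = pd.merge(movie_a, movie_b, on="userId", how="inner",
                    suffixes=("_a", "_b"))

# Pearson correlation between the two rating columns for those users.
pcc = co_rated["rating_a"].corr(co_rated["rating_b"])

print(co_rated)
print("Co-rating users:", len(co_rated), "PCC:", round(pcc, 3))
```

Printing the number of co-rating users next to the coefficient makes it easy to spot correlations that rest on very few observations.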

Some points to note:

The number of co-ratings is not considered in the correlation calculation. We can apply weights to give more importance to pairs that more users have rated.

We can write a function that recommends the top 5 movies for a given movie based on the calculated correlation matrix.

We can also apply cut-offs on the minimum number of co-ratings and the minimum correlation value for a recommendation.
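
The cut-offs mentioned above can be sketched with the `min_periods` argument of pandas' `DataFrame.corr`, which leaves a pair as NaN when too few users rated both movies. The toy matrix and the threshold values (`min_co_ratings`, `min_corr`) below are arbitrary choices for illustration, not tuned settings.

```python
import numpy as np
import pandas as pd

# Toy user x movie matrix; NaN means the user did not rate the movie.
toy_pivot = pd.DataFrame(
    {
        101: [5.0, 4.0, 1.0, 2.0],
        102: [4.0, 5.0, 2.0, 1.0],
        103: [np.nan, np.nan, 1.0, 5.0],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

# Require at least 3 co-rating users; pairs with fewer become NaN.
min_co_ratings = 3
toy_corr = toy_pivot.corr(method="pearson", min_periods=min_co_ratings)

# Number of users who rated both movies, for every pair of movies.
rated = toy_pivot.notna().astype(int)
co_rating_counts = rated.T.dot(rated)

# Keep only candidates for movie 101 above a minimum correlation.
min_corr = 0.5
candidates = toy_corr[101].drop(101).dropna()
candidates = candidates[candidates >= min_corr]

print(co_rating_counts)
print(candidates)
```

Movie 103 shares only two raters with movie 101, so it is excluded before the correlation threshold is even applied. The `co_rating_counts` matrix could also serve as a weight if a softer rule than a hard cut-off is preferred.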

Top 5 Movie Recommendation

The overall approach is shown in the flow chart below.


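# Function with steps to generate the top 5 recommendations

def Top5MoviesRecommend(curr_movieID, corr_matrix):

  import pandas as pd

  try:

    # Correlation of every movie with the current movie, highest first

    corr = pd.DataFrame(corr_matrix[curr_movieID].sort_values(ascending=False))

    corr.columns = ['correlation']

    # Drop the current movie and movies without co-ratings (NaN), then expose movieId as a column

    corr = corr.drop(index=curr_movieID).dropna().rename_axis('movieId').reset_index()

    # Join to get movie names/details

    corr = pd.merge(corr, movies, on='movieId', how='inner')

    # Print Input Movie

    print("Current Movie \n", movies.loc[movies['movieId']==curr_movieID])

    # Print Recommended Movies

    print("Top 5 Recommended Movies \n", corr.head(5)[['title','genres']])

  except KeyError:

    print('Movie ID or Correlation Matrix is not valid')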


The similarity between movies is calculated from the rating pattern. Rating-based similarity has some challenges:

  • New movies may have no ratings, or only a few users may have rated them
  • The Pearson Correlation Coefficient (PCC) can be biased by a few observations that happen to have similar user ratings
  • Some users might consistently rate higher than other users
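
For the last point, one common adjustment (not something applied in the analysis above) is to subtract each user's own average rating before comparing movies, so that generous and harsh raters end up on a comparable scale. The sketch below uses a made-up two-user matrix.

```python
import numpy as np
import pandas as pd

# Toy user x movie matrix: user 1 rates generously, user 2 rates harshly.
toy_pivot = pd.DataFrame(
    {101: [5.0, 3.0], 102: [4.0, 2.0], 103: [np.nan, 1.0]},
    index=pd.Index([1, 2], name="userId"),
)

# Each user's mean rating (NaN entries are skipped by mean()).
user_means = toy_pivot.mean(axis=1)

# Subtract the user's mean from every rating that user gave.
centered = toy_pivot.sub(user_means, axis=0)

print(user_means)
print(centered)
```

After centering, a value above zero means "better than this user's usual rating", regardless of how high the raw rating was.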

Now, we can check the recommendations based on the similarity between the movies based on their "genres".

Genres Based Movie Similarity

We have already tokenized the genres. Now we count the matching genres, and based on that the top N movies can be recommended.

A movie with a single genre can match another single-genre movie on that genre, or it can match a movie that has two genres, only one of which matches. Do we need to give different priorities to these two scenarios?

We can use the Jaccard similarity measure, which divides the number of shared genres by the number of distinct genres across both movies, so the first scenario scores higher than the second.


def jaccard_similarity(list1, list2):

    # Shared elements divided by all distinct elements across both lists

    s1 = set(list1)

    s2 = set(list2)

    return float(len(s1.intersection(s2)) / len(s1.union(s2)))
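
As a quick worked example of the measure, the snippet below repeats the function so that it runs on its own and applies it to hand-written genre lists covering the two scenarios raised earlier. The variable names are illustrative.

```python
def jaccard_similarity(list1, list2):
    """Size of the intersection divided by the size of the union."""
    s1 = set(list1)
    s2 = set(list2)
    return len(s1.intersection(s2)) / len(s1.union(s2))


# Genre lists written by hand for illustration.
single_vs_single = jaccard_similarity(["Comedy"], ["Comedy"])
single_vs_double = jaccard_similarity(["Comedy"], ["Comedy", "Romance"])
no_overlap = jaccard_similarity(["Comedy"], ["Horror", "Thriller"])

print(single_vs_single, single_vs_double, no_overlap)
```

A single-genre movie matching another single-genre movie scores 1.0, while the same movie matching a two-genre movie on one genre scores 0.5, so the measure already ranks the exact match higher.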

We have encoded the genres of each movie as a list of integers. We can add that list as a column in the movies data frame.
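
A minimal sketch of that step follows, on a small invented frame. The column name `genres_encoded` is a hypothetical choice, and a plain dictionary stands in for the Keras Tokenizer so that the example runs without extra dependencies.

```python
import pandas as pd

# Tiny stand-in for the movies file (invented titles).
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Adventure|Comedy", "Comedy", "Drama|Comedy"],
})

# Assign an integer to each genre in order of first appearance
# (a plain-dict stand-in for the tokenizer used earlier).
genre_index = {}
encoded = []
for genre_string in toy_movies["genres"]:
    codes = []
    for genre in genre_string.lower().split("|"):
        if genre not in genre_index:
            genre_index[genre] = len(genre_index) + 1
        codes.append(genre_index[genre])
    encoded.append(codes)

# Attach the encoded lists as a new (fourth) column.
toy_movies["genres_encoded"] = pd.Series(encoded, index=toy_movies.index)

print(genre_index)
print(toy_movies)
```

With the encoded lists stored next to each movie, a similarity function can read both movies' genres straight from the data frame.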



Here are the steps to find recommendations using similarity based on the movies' genres.

  • Get the input movie ID (current movie)
  • Get its encoded genres list
  • For each of the other movies in the list, calculate the similarity value
  • Sort the movie list by the similarity measure
  • Select the top movies by similarity and recommend them


# Function with steps to generate the top 5 genre-based recommendations

# Assumes the encoded genre list was added to movies as its fourth column (position 3)

def Top5MoviesGenresRecommend(curr_movieID):

  import pandas as pd

  genres_similarity = {}

  topRec_movies = None

  try:

    # Look the movie up by its ID (movieId is not the same as the row position)

    curr_movie = movies.loc[movies['movieId'] == curr_movieID].iloc[0]

    print("Current Movie \n", curr_movie)

    # Jaccard similarity of the current movie's genres with every other movie

    for _, row in movies.iterrows():

      if row.iloc[0] != curr_movie.iloc[0]:

        genres_similarity[row.iloc[0]] = jaccard_similarity(curr_movie.iloc[3], row.iloc[3])

    # Change to Data Frame

    g1_df = pd.DataFrame({'movieId': list(genres_similarity.keys()),

                          'similarity': list(genres_similarity.values())})

    # Sort the dataframe based on similarity

    topRec_movies = g1_df.sort_values('similarity', ascending=False)[:5]

    # Get the movie details

    topRec_movies = pd.merge(topRec_movies, movies, on='movieId', how='inner')

    # Print Recommended Movies

    print("Top 5 Recommended Movies \n", topRec_movies[['title','genres']])

  except IndexError:

    print('Movie ID is not valid')

  return topRec_movies



We can now use either of these approaches, or both together, to recommend movies to the user.
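
One simple way to use both together is a weighted average of the two similarity scores, sketched below. The helper `blend_scores`, the equal weighting, and the scores themselves are assumptions for illustration. Keep in mind that the correlation ranges from -1 to +1 while Jaccard similarity ranges from 0 to 1, so some rescaling may be needed in practice.

```python
def blend_scores(rating_sim, genre_sim, weight=0.5):
    """Weighted average of two similarity scores per candidate movie.

    Movies missing from one source get 0.0 for that source.
    """
    blended = {}
    for movie_id in set(rating_sim) | set(genre_sim):
        r = rating_sim.get(movie_id, 0.0)
        g = genre_sim.get(movie_id, 0.0)
        blended[movie_id] = weight * r + (1 - weight) * g
    # Highest blended score first; ties broken by movie id.
    return sorted(blended.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up similarity scores to the current movie, keyed by movieId.
rating_sim = {10: 0.9, 20: 0.4}
genre_sim = {10: 0.5, 20: 0.6, 30: 0.8}

ranked = blend_scores(rating_sim, genre_sim, weight=0.5)
print(ranked)
```

Movie 30 has no rating-based score (for example, a new movie with no ratings yet) but can still be ranked through its genre similarity, which is one reason to combine the two signals.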

As a next step, we can also consider other movie attributes, such as cast and director, which may lead to better recommendations.

