Understanding Recommendation Systems and kNN with a project — Book Recommendation System

Aman Makwana
8 min read · Dec 26, 2020


Recommender systems are systems designed to recommend things to users based on many different factors. They predict the products a user is most likely to purchase or be interested in. Companies like Netflix and Amazon use recommender systems to help their users identify the right products or movies for them.

A recommender system deals with a large volume of information by filtering out the most important pieces based on the data a user provides and other factors that capture the user’s preferences and interests. It finds the match between users and items and infers similarities between them to make recommendations.

Both users and service providers benefit from these systems, which also improve the quality of decision-making.

Why use a Recommendation System?

It benefits users by helping them find items of their interest. A recommendation engine is a system that suggests products, services, and information to users based on an analysis of data.

The recommendation can derive from a variety of factors, such as the history of the user and the behavior of similar users.

Recommendation systems are quickly becoming the primary way for users to be exposed to the whole digital world through the lens of their experiences, behaviors, preferences, and interests.

And in a world of information density and product overload, a recommendation engine provides an efficient way for companies to provide consumers with personalized information and solutions.

It helps item providers deliver their items to the right users, and it identifies the products that are most relevant to each user.

It also increases engagement, because users interact with the recommendations it provides.

What can be Recommended?

There are many different things that can be recommended by the system like movies, books, news, articles, jobs, advertisements, etc. Netflix uses a recommender system to recommend movies & web-series to its users. Similarly, YouTube recommends different videos. There are many examples of recommender systems that are widely used today.

Source: https://images.app.goo.gl/mAE17us5r7uJZXGE6

Types of Recommendation System

1. Popularity-Based Recommendation System

2. Classification Model

3. Content-Based Recommendation System

4. Collaborative Filtering

Let’s understand each of them in some detail.

1. Popularity-Based Recommendation System

As the name suggests, a popularity-based recommendation system works with the trend: it recommends the items that are trending right now. For example, if a product is usually bought by every new user, there is a good chance the system will suggest that item to a user who has just signed up.

The popularity-based recommender system solves some problems, but it also has problems of its own.

The main problem with a popularity-based recommendation system is that it offers no personalization: even if you know a user’s behavior, you cannot recommend items accordingly.
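To make this concrete, here is a minimal sketch of a popularity-based recommender in pandas. It assumes a ratings DataFrame with ISBN and bookRating columns, like the one we load later in this post; the function name is just for illustration.

import pandas as pd

# Rank books by how many ratings they have received; every user gets the same list,
# which is exactly the "no personalization" limitation described above.
def top_n_popular_books(ratings, n=10):
    popularity = (ratings.groupby('ISBN')['bookRating']
                  .agg(ratingCount='count', meanRating='mean')
                  .reset_index())
    return popularity.sort_values('ratingCount', ascending=False).head(n)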

2. Classification Model

A classification model uses features of both the products and the users to predict whether a user will like a product or not.

The output can be either 0 or 1: 1 if the user likes the product, 0 otherwise.

It is a rigorous task to collect a high volume of information about different users and also products.

Also, even if that data is collected, it can still be difficult to classify reliably.

It also has flexibility issues.
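As a rough sketch of the idea (not part of this project’s code), here is a tiny classifier trained on made-up pair features; in a real system the columns would be actual user and product attributes.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data: each row is one (user, product) pair described by invented numeric features
# (think user age, publication year, author popularity); the label is 1 if the user liked it.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X @ np.array([0.8, -0.5, 1.2]) + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))  # fraction of held-out pairs classified correctly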

3. Content-Based Recommendation System

It is another type of recommendation system, which works on the principle of similar content. If a user is watching a movie, the system will look for other movies with similar content or from the same genre as the movie the user is watching. There are various fundamental attributes used to compute similarity when checking for similar content.

There are different scenarios where we need to measure similarity, so different metrics are used: for numeric data, Euclidean distance; for textual data, cosine similarity; and for categorical data, Jaccard similarity.
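Here is a small illustration of the three measures on toy data:

import numpy as np

a = np.array([5.0, 3.0, 0.0, 1.0])
b = np.array([4.0, 0.0, 0.0, 1.0])

euclidean = np.linalg.norm(a - b)                          # numeric data: smaller = more similar
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # vector/text data: closer to 1 = more similar

genres_a = {'fantasy', 'adventure'}
genres_b = {'fantasy', 'romance'}
jaccard = len(genres_a & genres_b) / len(genres_a | genres_b)  # categorical data: intersection over union

print(euclidean, cosine, jaccard)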

In this project/blog we are using collaborative filtering, so let’s dive into that.

Collaborative Filtering

It is considered one of the smartest recommender approaches. It works on the similarity between different users and between items, and it is widely used on e-commerce and online movie websites. It looks at the taste of similar users and makes recommendations accordingly.

The similarity is not restricted to the taste of users; similarity between different items can also be considered. The system gives more efficient recommendations when we have a large volume of information about users and items.

Types of collaborative filtering :

a) User-based nearest-neighbor collaborative filtering

User-based collaborative filtering predicts the items a user might like on the basis of the ratings given to those items by other users who have a similar taste to the target user. Many websites use collaborative filtering to build their recommendation systems.

Source: https://images.app.goo.gl/18TqS2HobYpAqseaA
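A minimal sketch of the user-based idea on a made-up rating matrix (the numbers are invented for illustration):

import numpy as np
import pandas as pd

# Toy user-item rating matrix, 0 = not rated.
R = pd.DataFrame([[5, 4, 0, 1],
                  [4, 5, 1, 0],
                  [1, 0, 5, 4]],
                 index=['userA', 'userB', 'userC'],
                 columns=['book1', 'book2', 'book3', 'book4'])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Compare userA's rating vector with every other user's.
target = R.loc['userA'].values
sims = {u: cosine(target, R.loc[u].values) for u in R.index if u != 'userA'}
print(sims)  # userB rates most like userA, so books userB liked but userA has not read get recommended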

b) Item-based nearest-neighbor collaborative filtering

Here, we explore the relationship between pairs of items (a user who bought Y also bought Z). We predict a missing rating for an item with the help of the ratings the user has given to other, similar items.

Source: https://images.app.goo.gl/yiTq55AAHQ7WKKzq7
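The item-based view uses the same kind of matrix but compares item columns instead of user rows (again, a made-up example):

import numpy as np
import pandas as pd

R = pd.DataFrame([[5, 4, 0, 1],
                  [4, 5, 1, 0],
                  [1, 0, 5, 4]],
                 index=['userA', 'userB', 'userC'],
                 columns=['book1', 'book2', 'book3', 'book4'])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Similarity of book3 to every other book; items rated similarly by the same users come out closest.
sims = {b: cosine(R['book3'].values, R[b].values) for b in R.columns if b != 'book3'}
print(sims)  # book4 is closest to book3, mirroring "who bought Y, also bought Z"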

c) Singular value decomposition and matrix-factorization

Singular value decomposition, also known as the SVD algorithm, is used as a collaborative filtering method in recommendation systems. SVD is a matrix factorization method used to reduce the features in the data by reducing the dimensions from N to K (where K < N).

For the recommendation part, what matters is the matrix factorization performed on the user-item rating matrix. Matrix factorization is about finding two matrices whose product is the original matrix. Vectors are used to represent each item ‘qi’ and each user ‘pu’ such that their dot product is the expected rating.

‘qi’ and ‘pu’ are calculated so that the squared error between the dot product of the user and item vectors and the original ratings in the user-item matrix is as small as possible.
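In symbols, writing r_ui for a known rating of user u on item i, this is the standard matrix-factorization objective:

\min_{q, p} \sum_{(u,i) \in \kappa} \left( r_{ui} - q_i^{\top} p_u \right)^2

where \kappa is the set of (user, item) pairs for which a rating is known.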

Regularization: avoiding overfitting is an important aspect of any machine learning model, because an overfitted model performs poorly on new data. Regularization reduces the risk of the model being overfitted.

For this purpose, regularization introduces a penalty term into the above minimization equation. λ is the regularization factor, which multiplies the sum of the squared magnitudes of the user and item vectors.
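With the penalty term added, the regularized objective takes the standard form:

\min_{q, p} \sum_{(u,i) \in \kappa} \left( r_{ui} - q_i^{\top} p_u \right)^2 + \lambda \left( \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right)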

Source: https://images.app.goo.gl/gKpjmbwjQWKGVQWq7

Let’s start with the project: the Book Recommendation System.

Dataset url: http://www2.informatik.uni-freiburg.de/~cziegler/BX/

Read the dataset with the necessary features

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

books = pd.read_csv('/content/drive/My Drive/BX-Books.csv', sep=';', error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']
users = pd.read_csv('/content/drive/My Drive/BX-Users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']
ratings = pd.read_csv('/content/drive/My Drive/BX-Book-Ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

Print the shape and columns of each dataset, and plot the rating and age distributions.

print(ratings.shape)
print(list(ratings.columns))

plt.rc("font", size=15)
ratings.bookRating.value_counts(sort=False).plot(kind='bar')
plt.title('Rating Distribution\n')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.savefig('system1.png', bbox_inches='tight')
plt.show()

print(books.shape)
print(list(books.columns))

print(users.shape)
print(list(users.columns))

users.Age.hist(bins=[0, 10, 20, 30, 40, 50, 100])
plt.title('Age Distribution\n')
plt.xlabel('Age')
plt.ylabel('Count')
plt.savefig('system2.png', bbox_inches='tight')
plt.show()

To ensure statistical significance, users with fewer than 200 ratings and books with fewer than 100 ratings are excluded.

# Keep only users with at least 200 ratings.
counts1 = ratings['userID'].value_counts()
ratings = ratings[ratings['userID'].isin(counts1[counts1 >= 200].index)]

# Keep only books (ISBNs) with at least 100 ratings.
counts = ratings['ISBN'].value_counts()
ratings = ratings[ratings['ISBN'].isin(counts[counts >= 100].index)]

Collaborative Filtering Using k-Nearest Neighbors (kNN)

kNN is a machine learning algorithm that finds the nearest neighbors of a given item or user based on common book ratings, and makes predictions using the average rating of the top-k nearest neighbors. For example, we first present the ratings in a matrix, with one row for each item (book) and one column for each user.

combine_book_rating = pd.merge(ratings, books, on='ISBN')
columns = ['yearOfPublication', 'publisher', 'bookAuthor', 'imageUrlS', 'imageUrlM', 'imageUrlL']
combine_book_rating = combine_book_rating.drop(columns, axis=1)
combine_book_rating.head()

We then group by book titles and create a new column for total rating count.

combine_book_rating = combine_book_rating.dropna(axis=0, subset=['bookTitle'])

book_ratingCount = (combine_book_rating.
     groupby(by=['bookTitle'])['bookRating'].
     count().
     reset_index().
     rename(columns={'bookRating': 'totalRatingCount'})
     [['bookTitle', 'totalRatingCount']]
    )
book_ratingCount.head()

We combine the rating data with the total rating count data; this gives us exactly what we need to find out which books are popular and to filter out lesser-known books.

rating_with_totalRatingCount = combine_book_rating.merge(book_ratingCount, left_on='bookTitle', right_on='bookTitle', how='left')
rating_with_totalRatingCount.head()

pd.set_option('display.float_format', lambda x: '%.3f' % x)
print(book_ratingCount['totalRatingCount'].describe())

The median book has been rated only once. Let’s look at the top of the distribution.

print(book_ratingCount['totalRatingCount'].quantile(np.arange(.9, 1, .01)))

popularity_threshold = 50
rating_popular_book = rating_with_totalRatingCount.query('totalRatingCount >= @popularity_threshold')
rating_popular_book.head()

Now we print the shape of rating_popular_book, which is our filtered dataset.

rating_popular_book.shape

Filter to users in the US and Canada only.

combined = rating_popular_book.merge(users, left_on='userID', right_on='userID', how='left')

us_canada_user_rating = combined[combined['Location'].str.contains("usa|canada")]
us_canada_user_rating = us_canada_user_rating.drop('Age', axis=1)
us_canada_user_rating.head()

Implementing kNN

We convert our table to a 2D matrix and fill the missing values with zeros (since we will calculate distances between rating vectors). We then transform the values (ratings) of the matrix dataframe into a scipy sparse matrix for more efficient calculations.

Finding the Nearest Neighbors

We use unsupervised algorithms from sklearn.neighbors. The algorithm used to compute the nearest neighbors is “brute”, and we specify metric='cosine' so that the algorithm calculates the cosine similarity between rating vectors. Finally, we fit the model.

from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

us_canada_user_rating = us_canada_user_rating.drop_duplicates(['userID', 'bookTitle'])
us_canada_user_rating_pivot = us_canada_user_rating.pivot(index='bookTitle', columns='userID', values='bookRating').fillna(0)
us_canada_user_rating_matrix = csr_matrix(us_canada_user_rating_pivot.values)

model_knn = NearestNeighbors(metric='cosine', algorithm='brute')
model_knn.fit(us_canada_user_rating_matrix)

Now we randomly select the index of one book and give recommendations based on it.

query_index = np.random.choice(us_canada_user_rating_pivot.shape[0])
print(query_index)
distances, indices = model_knn.kneighbors(us_canada_user_rating_pivot.iloc[query_index, :].values.reshape(1, -1), n_neighbors=6)

Based on the randomly chosen book index, we print 5 recommended books.

for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(us_canada_user_rating_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, us_canada_user_rating_pivot.index[indices.flatten()[i]], distances.flatten()[i]))
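If you want recommendations for a specific title rather than a random one, the same logic can be wrapped in a small helper. This is just a convenience sketch on top of the model_knn and us_canada_user_rating_pivot objects built above; get_recommendations is not part of the original notebook.

def get_recommendations(book_title, n_neighbors=6):
    # Look up the row for the requested title in the pivot table and print its nearest neighbours.
    if book_title not in us_canada_user_rating_pivot.index:
        print('Book not found in the filtered dataset:', book_title)
        return
    query = us_canada_user_rating_pivot.loc[book_title].values.reshape(1, -1)
    distances, indices = model_knn.kneighbors(query, n_neighbors=n_neighbors)
    print('Recommendations for {0}:\n'.format(book_title))
    for i in range(1, len(distances.flatten())):
        print('{0}: {1}, with distance of {2}'.format(
            i, us_canada_user_rating_pivot.index[indices.flatten()[i]],
            distances.flatten()[i]))

get_recommendations(us_canada_user_rating_pivot.index[0])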

Final results


Here’s my Colab link; you can find the whole code there: https://colab.research.google.com/drive/1T6lTfaqyw-7ZroSSmkis1otIXH4OXtS-
