# How to find similarity between items in your data

(*Article 1 of 2*)

A lot of data mining tasks involve finding similar items within a dataset. By “data”, I mean real-valued data points, categorical data, or textual data (as in NLP tasks). There are multiple ways to handle this similarity task, and the right choice depends on your application and the data you are dealing with.

# Table of Contents

- Dealing with the question of similarity
- Problem statement
- The set intersection problem
- Jaccard Similarity
- How to convert our text data into sets
- Applications of set intersection method

# 1. Dealing with the question of similarity

You may have encountered the problem of finding similar items in your data as part of the data mining process. The same problem appears when handling text data in natural language processing applications. With text, we also need to be clear about which kind of similarity we are looking for: *semantic similarity* between two words (based on their meaning) or *character-level similarity* (usually used in plagiarism checks).

The problem can be handled in mainly two ways: breaking it down into a **set intersection problem** or using **distance metrics**. We will be looking at the first method in detail.

# 2. Problem Statement

Find the similarity between documents (containing text data) by converting each document into an *object* in an *abstract space* and measuring the distance between these objects. An object in this abstract space can be represented by an *unordered set* {s₁, s₂, s₃, …}. We will therefore extract objects (shingles) from text documents and store them in sets.

# 3. The set intersection problem

In simple terms, this problem is finding the intersection of two sets A and B. Here A and B are sets of items, and we need to measure how similar the two sets are. What the sets contain depends on the use case.

Let's say we work for a website, and each set represents one customer (the set elements are that customer's purchases). We want to build a recommendation system to suggest new items to customers. One way to solve this is to find customers with similar tastes and recommend based on their activity (this is called collaborative filtering). We can break the problem down into a set intersection problem and calculate similarity accordingly.

# 4. Jaccard Similarity

But how do we find similarity from the sets themselves? That's where Jaccard similarity comes into play. Jaccard similarity is a similarity metric for the set intersection problem: it is the number of elements in the intersection of the two sets divided by the number of elements in their union, J(A, B) = |A ∩ B| / |A ∪ B|.
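As a minimal sketch, the definition above translates directly into Python's built-in set operations (the function name and the empty-set convention are my own choices, not from a standard library):

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention: two empty sets are considered identical
    return len(a & b) / len(a | b)

print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 2 shared / 4 total = 0.5
```

The metric is always between 0 (disjoint sets) and 1 (identical sets).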

# 5. How to convert our text data into sets

Now that we know how to find similarity between two sets, the next task is to convert our data into set form. Usually the data will be text, which means we can use the techniques below for the conversion.

## 5.1 Shingling

The next question is how to find the elements of a text document. Shingling is the method of converting text data into **k-shingles** (groups of k characters) taken in their order of occurrence. The value of k depends on the type of documents you are going to handle.

For example, the sentence “The sky is blue” generates 2-shingles such as Th, he, e<space>, <space>s, sk, ky, and so on.

The example above is shingling done at the *character level*. We can also shingle at the *word level* (where the shingles are also known as k-grams or n-grams).

**NB**: Shingling is usually combined with preprocessing steps such as whitespace normalization, lowercasing, and punctuation removal.
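Both levels of shingling can be sketched in a few lines of Python with a sliding window (the function names are illustrative, and preprocessing is left out to match the example above):

```python
def char_shingles(text: str, k: int) -> set:
    """Character-level k-shingles: every window of k consecutive characters."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def word_shingles(text: str, k: int) -> set:
    """Word-level k-shingles (k-grams), stored as tuples of words."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

print(sorted(char_shingles("The sky is blue", 2)))
print(sorted(word_shingles("The sky is blue", 2)))
```

Note that storing shingles in a set automatically discards duplicates, which is exactly what the set intersection formulation needs.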

## 5.2 Minhashing

Real-world documents won't be limited to a couple of shingles per set; the number depends on the size of the document and the language used. Since it is computationally expensive to compare sets with large numbers of elements, we need to compress the data. *Minhashing* is a method that compresses the sets without losing much information.

We first create a *characteristic matrix*, which represents all of our sets in a single matrix. Each column of the matrix corresponds to one set, and each row to one shingle. If a shingle is present in a set, the corresponding entry is 1 (otherwise it is 0).
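As a toy sketch (the two sets and their shingles are made up for illustration), the characteristic matrix can be built like this:

```python
# Build a characteristic matrix: rows = shingles, columns = sets.
sets = {
    "S1": {"ab", "bc", "cd"},
    "S2": {"bc", "cd", "de"},
}
universe = sorted(set().union(*sets.values()))  # all shingles, fixed row order
matrix = [[1 if shingle in s else 0 for s in sets.values()]
          for shingle in universe]

for shingle, row in zip(universe, matrix):
    print(shingle, row)
```

In practice the matrix is extremely sparse, so it is stored as sets or sparse structures rather than a dense grid.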

After creating this matrix, we apply a family of *minhash functions* to compress the data. The method can be hard to grasp at first; I suggest reading the references for a fuller treatment.

For now, just understand the key property: the probability that a minhash function produces the same value for two sets equals the Jaccard similarity between them.

**NB**: Minhashing is a data compression and approximation technique: it does not compute the exact Jaccard similarity, but it approximates it to such an extent that the difference is negligible.
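Here is a minimal sketch of minhashing, assuming the common construction with random linear hash functions h(x) = (a·x + b) mod p; the function names and parameter choices are my own, not a library API:

```python
import hashlib
import random

P = 2**31 - 1  # a large prime for the hash functions

def _elem_id(e: str) -> int:
    """Deterministically map an element to an integer via md5."""
    return int(hashlib.md5(e.encode()).hexdigest(), 16) % P

def minhash_signature(s: set, num_hashes: int = 200, seed: int = 0) -> list:
    """Compress a set into num_hashes minhash values.

    Each hash function h(x) = (a*x + b) % P simulates a random row
    permutation of the characteristic matrix; the minhash is the
    smallest hashed value over the set's elements.
    """
    rng = random.Random(seed)
    params = [(rng.randrange(1, P), rng.randrange(0, P))
              for _ in range(num_hashes)]
    ids = [_elem_id(e) for e in s]
    return [min((a * x + b) % P for x in ids) for a, b in params]

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of positions where the signatures agree ≈ Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)

a = {"ab", "bc", "cd", "de"}
b = {"bc", "cd", "de", "ef"}
sa, sb = minhash_signature(a), minhash_signature(b)
print(estimate_jaccard(sa, sb))  # close to the true Jaccard value of 3/5
```

With 200 hash functions the estimate is typically within a few percent of the true similarity; more hash functions tighten the approximation at the cost of longer signatures.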

## 5.3 Locality Sensitive Hashing

The problem we will encounter when implementing the above techniques in practice is the huge computation time needed to compare every pair of documents. With N documents, we must compute the Jaccard similarity of C(N, 2) = N(N − 1)/2 pairs, which is approximately N²/2. So even though a single comparison is cheap, when N is very large the time to compare all pairs becomes prohibitive.

*Locality Sensitive Hashing* (LSH), commonly used for approximate *nearest neighbor search*, sidesteps this by repeatedly hashing the data into buckets in such a way that similar items are likely to land in the same bucket, so only items sharing a bucket need to be compared.
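A common way to apply LSH to minhash signatures is the banding technique: split each signature into bands of several rows and bucket documents whose band values match exactly. This sketch assumes that construction; the function names and the tiny hand-made signatures are illustrative only:

```python
def lsh_buckets(signatures: dict, bands: int, rows: int) -> dict:
    """Hash each signature band by band; docs sharing a bucket become candidates.

    signatures maps doc id -> minhash signature of length bands * rows.
    """
    buckets = {}
    for doc, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets.setdefault((b, band), set()).add(doc)
    return buckets

def candidate_pairs(buckets: dict) -> set:
    """All pairs of documents that fell into at least one common bucket."""
    pairs = set()
    for docs in buckets.values():
        docs = sorted(docs)
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                pairs.add((docs[i], docs[j]))
    return pairs

sigs = {
    "d1": [1, 2, 3, 4, 5, 6],
    "d2": [1, 2, 3, 9, 9, 9],  # shares the first band with d1
    "d3": [7, 7, 7, 7, 7, 7],
}
print(candidate_pairs(lsh_buckets(sigs, bands=2, rows=3)))  # {('d1', 'd2')}
```

Only the candidate pairs are then checked with the full Jaccard computation, which is what brings the cost down from comparing all N²/2 pairs.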

# 6. Applications of set intersection method

- Assume you want to crawl the web and find relevant web pages; crawling itself is resource-consuming, so you don't want to waste time on pages that are identical in content.
- Plagiarism checking in journals, answer sheets, and text documents. People often rearrange the text in these files to evade detection; since these methods work at the character level, they remain effective.
- Recommendation engines on popular sites use collaborative filtering to suggest new content or products to users. Each user can be represented as a set, and similar users can then be found with the methods above.

References:

- Jure Leskovec, Anand Rajaraman, Jeff Ullman. *Mining of Massive Datasets*. 2016.
- L4-Jaccard+Shingle.pdf (utah.edu)
- https://mccormickml.com/2015/06/12/minhash-tutorial-with-python-code/