How to find similarity between items in your data

(Article 1 of 2)

Table of Contents

  1. Dealing with the question of similarity
  2. Problem statement
  3. The set intersection problem
  4. Jaccard Similarity
  5. How to convert our text data into sets
  6. Applications of set intersection method

1. Dealing with the question of similarity

2. Problem Statement

Find the similarity between documents (that contain text data) by converting it into objects in abstract space and measure the distance between them. The object in abstract space can be represented by an unordered set {s₁, s₂, s₃, …..}. Thus we will extract objects(shingles) from text documents and store them in sets.

3. The set intersection problem

Source: xoax

4. Jaccard Similarity

But the question is, how do we find similarity from the sets itself. That's when Jaccard similarity comes into play. Jaccard similarity is a similarity metric used for set intersection problem. It is the ratio of number of set intersection elements divided by the number of elements in their union.

Fig: Output of the code

5. How to convert our text data into sets

Now that we know how to find similarity between to sets, the next task will be to convert our data at hand to set format. Usually the data will be in text form which means we can use the below techniques for conversion.

5.1 Shingling

Next question is how to find the elements in text documents. Shingling is the method of converting text data into k-shingles (group of k characters) generated through their order of occurrence. The value of k depends on the what type of documents you are going to handle.

Fig: Character level shingling with k=3
Fig: Word level shingling with k=3

5.2 Minhasing

Real world documents won’t be limited to a couple of shingles in each set. It depends on the size of the document and language used. Since it is computationally expensive to find similarity between sets that have large number of elements, we need to compress the data. Minhashing is a method which can help in compression without much loss of data.

Fig: Underlying concept in Minhashing method

5.3 Locality Sensitive Hashing

The problem that we will encounter when we try to implement the above techniques in a practical use case will be the huge computation time needed to find similarity of each pair of documents. Assuming that we have N documents, we need to compute Jaccard similarity between ᴺC₂ number of pairs. This can be approximated to N²/2 value. This means that, when N is very large, the time taken to compute all pairs will be very high even though computing a single pair is next to easy.

6. Applications of set intersection method

  • Assume that you want to crawl web and find relevant web pages, the steps itself is resource consuming. So you don’t want to waste time in crawling pages that are identical in content.
  • Plagiarism checking in journals, answer sheets, text documents. Usually people will rearrange the text within these files to prevent detection of plagiarism. Since we are looking at character level detection, these methods prove effective.
  • Recommendation engines in popular sites use a technique called collaborative filtering to suggest new content/products to users. Each user can be represented by a set and similar users can be found out.
  1. You can find the code for this article here
  2. The distance metric method for finding similarity will be published as another blog.

MSc CS (Data Analytics), Data science enthusiast