Simple Similarity Rankings

In this post we will build a simple similarity ranking system: given a set of items, it finds the items most similar to one another. Let's start with some names:

Top 100 names for girls

raw_documents = ["Sophia", "Emma", "Olivia", "Ava", "Isabella", "Mia", "Zoe", "Lily", "Emily", "Madelyn", "Madison", "Chloe", "Charlotte", "Aubrey", "Avery", "Abigail", "Kaylee", "Layla", "Harper", "Ella", "Amelia", "Arianna", "Riley", "Aria", "Hailey", "Hannah", "Aaliyah", "Evelyn", "Addison", "Mackenzie", "Adalyn", "Ellie", "Brooklyn", "Nora", "Scarlett", "Grace", "Anna", "Isabelle", "Natalie", "Kaitlyn", "Lillian", "Sarah", "Audrey", "Elizabeth", "Leah", "Annabelle", "Kylie", "Mila", "Claire", "Victoria", "Maya", "Lila", "Elena", "Lucy", "Savannah", "Gabriella", "Callie", "Alaina", "Sophie", "Makayla", "Kennedy", "Sadie", "Skyler", "Allison", "Caroline", "Charlie", "Penelope", "Alyssa", "Peyton", "Samantha", "Liliana", "Bailey", "Maria", "Reagan", "Violet", "Eliana", "Adeline", "Eva", "Stella", "Keira", "Katherine", "Vivian", "Alice", "Alexandra", "Camilla", "Kayla", "Alexis", "Sydney", "Kaelyn", "Jasmine", "Julia", "Cora", "Lauren", "Piper", "Gianna", "Paisley", "Bella", "London", "Clara", "Cadence"]  

Each of the names in the list is called a document. Each document contains a set of features. To decide whether two documents are similar, we count the features they have in common.

Our first task is to extract the features from each document. We will use a simple method: taking letters and letter combinations from each word. We will call these ngrams.

# build a list of character ngram features from a document
ngrams = (raw_document, size, min=1) ->  
  results = []
  for s in [min..size]
    results = results.concat (raw_document[i..i + s - 1] for i in [0..raw_document.length - s])
  results

The ngrams function decomposes a word into a list of letter combinations. It takes as parameters a string to decompose, the maximum length of a feature, and an optional minimum length (defaulting to 1). It is easiest to see with an example.

ngrams 'layla', 3  

will return a list of features:

[ 'l', 'a', 'y', 'l', 'a', 'la', 'ay', 'yl', 'la', 'lay', 'ayl', 'yla' ]
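For readers not using CoffeeScript, here is a plain-JavaScript sketch of the same function (an equivalent, not the original post's code):

```javascript
// Collect all character ngrams of length min..size from a string.
function ngrams(rawDocument, size, min = 1) {
  const results = [];
  for (let s = min; s <= size; s++) {
    // slide a window of width s across the string
    for (let i = 0; i <= rawDocument.length - s; i++) {
      results.push(rawDocument.slice(i, i + s));
    }
  }
  return results;
}

ngrams('layla', 3);
// → [ 'l', 'a', 'y', 'l', 'a', 'la', 'ay', 'yl', 'la', 'lay', 'ayl', 'yla' ]
```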

The next step is to count how many times a feature appears within a document.

# counts the number of times a feature appears in a document
feature_count = (features) ->  
  fc = {}
  for feature in features
    fc[feature] = (fc[feature] or 0) + 1
  fc

Calling feature_count with the feature list above yields:

{ "l": 2, "a": 2, "y": 1, "la": 2, "ay": 1, "yl": 1, "lay": 1, "ayl": 1, "yla": 1 }
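The same counting step in plain JavaScript (again a sketch, not the original code) is a single pass over the feature list:

```javascript
// Tally how many times each feature appears in a document's feature list.
function featureCount(features) {
  const fc = {};
  for (const feature of features) {
    fc[feature] = (fc[feature] || 0) + 1;
  }
  return fc;
}

featureCount(['l', 'a', 'y', 'l', 'a', 'la', 'ay', 'yl', 'la', 'lay', 'ayl', 'yla']);
// → { l: 2, a: 2, y: 1, la: 2, ay: 1, yl: 1, lay: 1, ayl: 1, yla: 1 }
```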

Each document in our list will be passed through the ngrams and feature_count functions yielding a list of documents containing feature counts. At this point we will need a complete list of features encountered so far.

# return the unique set of features
all_features = (feature_counts) ->  
  all = {}
  for document in feature_counts
    for feature, count of document
      all[feature] = true
  (feature for feature, _ of all)

Using these functions we build a list of documents and features.

size = 3  # maximum ngram length, matching the example above  
documents = (ngrams(word.toLowerCase(), size) for word in raw_documents)  
documents = (feature_count(document) for document in documents)  
all = all_features documents  
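Putting the steps together, a JavaScript sketch of the pipeline so far looks like this (the two-name input and the size of 3 are chosen here just for illustration):

```javascript
// Collect all character ngrams of length min..size from a string.
function ngrams(rawDocument, size, min = 1) {
  const results = [];
  for (let s = min; s <= size; s++) {
    for (let i = 0; i <= rawDocument.length - s; i++) {
      results.push(rawDocument.slice(i, i + s));
    }
  }
  return results;
}

// Tally how many times each feature appears.
function featureCount(features) {
  const fc = {};
  for (const feature of features) fc[feature] = (fc[feature] || 0) + 1;
  return fc;
}

// Union of every feature seen in any document.
function allFeatures(featureCounts) {
  const all = new Set();
  for (const document of featureCounts) {
    for (const feature of Object.keys(document)) all.add(feature);
  }
  return [...all];
}

const rawDocuments = ['Ava', 'Eva'];
const size = 3;
let documents = rawDocuments.map(word => ngrams(word.toLowerCase(), size));
documents = documents.map(featureCount);
const all = allFeatures(documents);
// all → [ 'a', 'v', 'av', 'va', 'ava', 'e', 'ev', 'eva' ]
```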

Our next task is to structure each document so that it can be compared against every other document. For this purpose we build a list with one row per document. Each row holds the number of occurrences of every feature in the global feature list, with a 0 for features that do not occur in that document.

Again, this is easiest to see in code:

sparse_list = (documents, all) ->  
  lst = []
  for document in documents
    lst.push (document[feature] or 0 for feature in all)
  lst
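A small worked example in plain JavaScript shows what the rows look like (the feature list and documents below are hypothetical, purely for illustration):

```javascript
// Build one row per document: the count of each global feature, 0 if absent.
function sparseList(documents, all) {
  return documents.map(document => all.map(feature => document[feature] || 0));
}

const all = ['a', 'v', 'e'];                       // global feature list
const documents = [{ a: 2, v: 1 }, { a: 1, v: 1, e: 1 }];
const rows = sparseList(documents, all);
// rows → [ [2, 1, 0], [1, 1, 1] ]
```

Each row now has the same length, so any two documents can be compared position by position.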

To be continued...