Matching in MeOwns
0
The Relation And The Related
While working on meOwns.com, we introduced several features that are concerned with how objects are related to each other. In the user profile page, you get a list of other users that are similar to yourself and you are faced with the chemistry meter which tells you how much you are similar to that user (if you are logged in of course). If you happen to be viweing your own profile page, you will get a list of recommended items that you may be interested in. Last, but not least, when you view an item, you get a list of similar items.
In this article I will be talking about the features from an implementor's point of view. Naturally, the first hurdle was to define the problem. What we needed was to find a way to match items and users, so first we needed to represent them in a way that can be matched. We started from the items first and looked at how to match two items together.
What is an item? In meOwns an item is simply a name, a description, a type and some tags. We decided to ignore photos and comments from the item. Each of those fields gets a weight which affects the value of terms found in it.
Here is a sample product (encoded in Yaml):
Name : Fiat Sienna
Type : Car
Description : 1.6 HP, not bad for a sedan, relatively good performance for the price, best sedan i've bought
Tags : Cars Fiat Silver
The above fields are then processed to extract relevant terms from them; this is done in the following manner:
1 - Remove punctuation and non alphanumeric characters (replace them with spaces)
2 - Collapse spaces and split the text on them
3 - Match the generated list of terms to a stop word list to remove them (words like "on", "the" should not be considered in the index)
4 - The remaining terms are converted to lower case and then converted to their stem representation (we are using snowball stemmer for now)
The above can be represented as follows:
Attribute : Value
===================
Name : fiat sienna
Type : car
Description : hp sedan performance price sedan buy
Tags : car fiat silver
After doing so, we create a list of terms with their frequencies; each field has a frequency multiplier according to its significance.
Assumming a multiplier value of 1 for all the fields:
Term : Frequency
=======================
fiat : 2
car : 2
sedan : 2
sienna : 1
performance : 1
silver : 1
buy : 1
hp : 1
This Term-Frequency vector is the basis of doing item matching in meOwns. Several different approaches can be implemented to reach an item representation. Even the details of a given approach can vary significantly. I have chosen to stick to the easiest approach in this article. Those willing to dig further are free to lookup more into document representation and indexing strategies.
User representation is just an aggregation of their item (and wished item) representations. This way users and items share the same term vector structure and, hence, we can match users to items as much as we can match items to items and users to users.
The matching process involves further encoding of the term frequency vector. Those in the academia refer to this as TF-IDF (Term Frequency, Inverse Document Frequency) representation. In lay man's terms, this is a representation of how significant a term is to a certain document. It is composed of the product of two parts: the first, which is TF (Term Frequency), is simply the frequency of the term in the document divided by the total sum of frequency of terms. The second, (IDF), is the total number of documents in the corpus (the document store) divided by the number of documents which contain this term. If the term is found in all documents, then the IDF will be equal to 1 and, hence, won't have an effect on the final product of the two parts. On the other hand, if we have a corpus of 1,000,000 documents and only one with the given term, then it will multiply the TF value by 1,000,000, which is significant. A slight widely used variation is to multiply the TF with the Logarithmic value of the IDF.
After we are done calculating TF-IDF values for all the terms in term vectors, matching can be done as follows
Term Vector A vs. Term Vector B =
cos a = (A . B) / (|A|.|B|)
A.B = dot product for the two vectors of TF-IDF values
|A|.|B| = scalar product of the magnitudes of the two vectors
The returned value is called the cosine similarity between the two items. It ranges between 0 and 1 where zero means no correlation and one means exact match.
Written By:
Muhammad A. Ali (oldmoe.blogspot.com)
Post a Comment
eSpace podcast Prodcast
Archive
- September 2011
- April 2011
- March 2011
- December 2010
- November 2010
- September 2010
- August 2010
- July 2010
- June 2010
- April 2010
- March 2010
- November 2009
- October 2009
- September 2009
- July 2009
- June 2009
- May 2009
- April 2009
- March 2009
- February 2009
- January 2009
- November 2008
- October 2008
- September 2008
- July 2008
- June 2008
- May 2008
- April 2008
- March 2008
- January 2008
- April 2007
- March 2007
Latest Comments
- SpectraMind Commented on Egypt Wins UK's National Outsourcing Association Award
- Rofaida Awad Commented on Go Egypt Go!
- Different Mike Commented on Only idiots change their iPhone root password!
- Mike Commented on Only idiots change their iPhone root password!
- smile Commented on Only idiots change their iPhone root password!

