My Node.js project is finally complete, so I can turn back to Sitecore and look for interesting questions and solutions to them. Fortunately, I recently got one such request.

We have an e-commerce solution integrated with Sitecore. We also have DMS in place, gathering statistics about users and their behavior on the site. We could add personalization to the site to engage more clients, but it would be quite static, as only a limited set of prepared content blocks would be shown based on defined rules. This is cool when you want to show a promotion, but what if you need to show relevant products or pages?

APPROACH

To calculate relevant products, we need to match the profile of the current user against the attributes of a product. We would use the aggregated DMS values in the visitor profile as the target pattern and product page tags as weighted categorization info. To match the visitor profile with products, we would use the Solr relevancy engine.

Firstly, showing relevant products is, obviously, a form of content personalization. Sitecore’s built-in personalization engine uses weighted profile keys to tag content and patterns to match them at runtime, so it would be wise to reuse parts of this engine in our solution.

Secondly, if you have already defined profile categories in DMS, you would want to reuse this data rather than create categories or tags for products all over again. (Thanks to Martin Davies and his video for the insights.)

The last part is the pattern matching itself. As we are talking about e-commerce and relevancy to a visitor, you might guess that the number of queries we need to make could be very significant, and if you have thousands of products the problem is even worse. Let’s see how Solr can help us with it.

Profiling and patterns

Let’s take a look at profiling. A Sitecore profile can have several profile keys, each with a weight scale. For the examples that follow, assume a profile “theme” with two keys, “entertainment” and “education”, each on a scale from 1 to 10. A product can be scored against this profile: the book “The Mysterious Island”, for example, might get 8 for entertainment and 4 for education. While surfing the site, a visitor hits pages with profiles and gathers points in the same categories (e.g. 7 for entertainment and 7 for education). Both the product and the visitor now have a profile, but how do we compare them and decide whether 8 & 4 is closer to 8 & 1 or to 7 & 7?

If you think about these values, they can be represented as a point in 2-dimensional space, which gives you some math to compare them, at least by distance. Or they can be represented as a vector starting at the origin of the coordinate system, which provides a whole set of operations: merging (vector addition), calculating distance (subtraction, length), comparing directions (analyzing the angle), and aligning scales (normalization). In our case, comparing the direction of vectors is the most interesting, and it is measured by cosine similarity.

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1].
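To make the comparison concrete, here is a minimal Python sketch that applies both measures to the example scores from the profiling section. The function names are my own, and treating an all-zero profile as "matches nothing" is an assumption, not something Sitecore or Solr prescribes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty profile matches nothing
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# (entertainment, education) scores from the example
product = (8, 4)
candidates = {"8 & 1": (8, 1), "7 & 7": (7, 7)}

for label, profile in candidates.items():
    print(
        label,
        "cosine:", round(cosine_similarity(product, profile), 3),
        "distance:", round(euclidean_distance(product, profile), 3),
    )
```

For these particular numbers the two measures disagree: by distance 8 & 1 is nearer to the product, while by angle 7 & 7 is marginally nearer. That is worth keeping in mind when deciding whether direction alone is the right notion of "closer" for your catalog.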

Solr and pattern matching

Solr should provide you with the most relevant results for your query. The first part of query execution is Boolean logic (does a document contain a term or not), but after that comes the more interesting part: weighing and judging which result would be the most useful. This part uses cosine similarity as its logical basis but adds other elements to the equation, as it needs to deal with boosting and term frequency.

I will not reproduce the whole description of Solr Default Similarity here (see link), but it is important to highlight the parts of the scoring function where individual term scores are calculated:

  • tf(t in d) correlates to the term’s frequency, which means that the more often a token occurs in a field, the higher the score for this field;
  • **t.getBoost()** represents the boost of a term in a query;
  • idf(t) correlates to the inverse of docFreq (the number of documents in which the term t appears).
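As a rough illustration of how these three factors interact, here is a simplified Python sketch of a single term's contribution. It follows the classic Lucene TF-IDF formulas as I understand them (tf as the square root of the raw frequency, idf as 1 + ln(numDocs / (docFreq + 1)), idf squared in the product) and deliberately leaves out coord, queryNorm and field norms. The `use_idf` flag is a hypothetical switch added only to show what the score looks like when idf is ignored.

```python
import math


def tf(freq):
    """Term frequency factor: square root of the raw count."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms score higher."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def term_score(freq, boost, doc_freq, num_docs, use_idf=True):
    """Simplified contribution of one query term to a document's score."""
    idf_part = idf(doc_freq, num_docs) ** 2 if use_idf else 1.0
    return tf(freq) * idf_part * boost


# A key repeated 9 times vs. 4 times, same boost, idf ignored
print(term_score(9, 1.0, 50, 1000, use_idf=False))
print(term_score(4, 1.0, 50, 1000, use_idf=False))
```

Note that under this formula repeating a token 9 times yields a tf of 3, not 9, so the effect of the profile key value on the score is dampened rather than linear.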

Combining all together

The first two parameters could be used to implement pattern matching, while the third might negatively affect relevancy (you do not care how often a category appears across the whole product catalog), but it could easily be disabled.

To use term frequency, we need to **create a computed field for each profile** in the Sitecore Tracking field and, **during product indexing, duplicate the profile key name** (or ID) n times, where n is the profile key value. E.g. if you are searching for products with an “entertainment” profile key, products where this key is repeated more times would rank higher, which is obviously part of the goal.
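In Sitecore this logic would live in a computed index field written in C#. The Python sketch below only illustrates the value such a field might produce; the function name and the sample scores are assumptions for illustration.

```python
def profile_field_value(key_scores):
    """Repeat each profile key name as many times as its score.

    key_scores maps a profile key name to its integer value,
    e.g. {"entertainment": 8, "education": 4}.
    """
    tokens = []
    for key, score in sorted(key_scores.items()):
        # Negative or zero scores contribute no tokens.
        tokens.extend([key] * max(0, int(score)))
    return " ".join(tokens)


# "The Mysterious Island" from the profiling example
book = {"entertainment": 8, "education": 4}
print(profile_field_value(book))
```

The resulting string would be indexed into a tokenized text field, so that the term frequency of each key name reflects its profile value.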

The other part is to give a higher score to terms that are more relevant to the visitor’s profile. This is where boosting comes in: we can add individual boost values to query terms according to the profile key values in the visitor’s profile.
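A query of this kind could be assembled roughly as follows. This is a sketch using the standard Solr query parser's `term^boost` syntax; the field name `profile_theme` and the fallback to `*:*` for an empty profile are my assumptions, and key names are assumed to contain no characters that need escaping.

```python
def boosted_profile_query(field, visitor_scores):
    """Build a Solr query that boosts each profile key by the visitor's score."""
    clauses = [
        f"{key}^{score}"
        for key, score in sorted(visitor_scores.items())
        if score > 0
    ]
    if not clauses:
        return "*:*"  # assumption: no profile yet, so match everything
    return f"{field}:({' '.join(clauses)})"


visitor = {"entertainment": 7, "education": 2}
print(boosted_profile_query("profile_theme", visitor))
# profile_theme:(education^2 entertainment^7)
```

Combined with the repeated key names in the index, a product that is strong in "entertainment" gets both a higher term frequency and a higher query boost for this visitor.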

Strengths of the approach:

  • This approach would work well with product promotion in search implemented via document boosts at index time.
  • It should also be quite scalable, as we are using standard Solr query and indexing options.
  • It puts less computing load on the Sitecore instance, which means relevant products could be used heavily.
  • Products or any other pages can be matched against the visitor’s profile, the current page profile, or even a merged value.

Cautions:

  • Keep the scale of profile keys reasonable, up to 10 or 20, as you need to duplicate the key name that many times during indexing.
  • The scale should also be the same for all profile keys, as otherwise one key could become more important just because of its larger scale.
  • It might require advanced updates of the Solr configuration.
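If some profile keys already use a different scale, one option is to rescale them to a common maximum before indexing. A minimal sketch, assuming a hypothetical target scale of 10 and simple rounding:

```python
def rescale(score, source_max, target_max=10):
    """Map a score from a 0..source_max scale onto 0..target_max."""
    if source_max <= 0:
        raise ValueError("source_max must be positive")
    return round(score * target_max / source_max)


# A key defined on a 0..100 scale brought down to 0..10
print(rescale(70, 100))
```

Rounding loses some precision, which is the price of keeping the number of repeated tokens per document small.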

And a disclaimer: as you can probably tell, this is a solution design rather than a finished implementation, but I hope it won’t take too long to write the real code for it =)


Follow me on Twitter @true_shoorik. Would be glad to discuss the ideas above in the comments.