Document Data Stores - Dimensionality Reduction, Hierarchical Modeling using Materialized Paths, Nested Sets and Proximity Queries

By:   |   Updated: 2022-09-06   |   Comments   |   Related: More > Big Data


Problem

In this second part of this series, we discuss additional things to consider when building a data model for document data stores.

Solution

In the first part of this article, we discussed data modeling techniques using Embedded, Composite Key, Inverted Index, and Single Table. In this second part, we look at other things to consider.

Dimensionality Reduction

In Machine Learning, a dataset's number of attributes and features is called dimensionality. The more dimensions there are, the bigger the dataset and the more complex the dataset's geometrical representation. Dimensionality reduction is the process of reducing the number of attributes while keeping as much variation as possible in the original dataset. There are numerous dimensionality reduction algorithms. The below example makes the dataset smaller and reduces noise in data. Removing attributes that are noise is one of the goals in data cleaning because noise hinders many types of data analysis:

  1. Attribute "buyer_year_of_birth" is redundant because there is a "buyer_age" attribute, which is smaller, and the year can be re-calculated if needed. So, "buyer_year_of_birth" can be dropped.
  2. Shampoo will always be under the Cosmetics category. We can combine those correlated attributes into a single attribute and call it "cosmetics-shampoo." It can be used for exact and similarity queries.
  3. If 99% of our sales are done in the same country, like "US," for example, then we can remove the "sales_location_country." It's a noise attribute. Even if it is used in a filter, it will not filter out many documents. By removing attributes that don't vary widely, we are making our dataset smaller and more manageable. Of course, if we know that data analysts must use this field as a filter, we can keep it regardless of its low cardinality.
  4. Sometimes, we will also want to replace "product_price" with "price_category" if we use this field in range searches.

As with many other modeling techniques, the dimensionality reduction data model's downside is that sometimes we will need to maintain both the raw data set and the reduced dataset.

Dimensionality reduction data model example

If the data can be represented as a graph, it is best to store it inside the graph datastore, not in the document store. However, sometimes you must store the hierarchical dataset in the document store. For instance, when the team's expertise or company budget does not allow the introduction of an additional data store, or the scale is too heavy for the graph store. I once was calculating the Neo4j cluster for my POC and the calculation suggested that we would need to set up a cluster with almost 300 nodes. Such a solution would be extremely expensive.

Using a document store in similar situations would be much more cost-effective, and you can consider the following two techniques to model the hierarchical data.

Hierarchical Modeling: Materialized Path

Well-known by relational database DBAs, we can use a hierarchical modeling technique called "materialized path." It's an easy, straightforward technique to model trees. If we imagine a File Store structure, our materialized path is a full "file path" for each child. For instance, materialized path to shampoo would be "pharmacy/hair_care/shampoo." We can use this technique for trees that are not too deeply nested. The longer the path string, the less efficient the read operations are on it. We can also use node ids instead of values, but the path cannot be used for similarity queries or queries that will require additional API calls to translate parent ids.

However, materialized path is less complicated than the next technique, nested sets. The Nested Sets and inserts are easy, and there is no need to change any additional document besides the one planned to be inserted. If we need to store properties of the connection between parent and child using materialized path model, connection attributes can be added to the child node.

Materialized path data model example

Hierarchical Modeling: Nested Sets

The "nested sets" is another technique for storing trees or hierarchies in two-dimensional data stores. All children of each node can be found by searching within the right and left boundary of the parent node. Sometimes, a direct parent id is stored as a child attribute to make direct children searches easier, but it's optional. Each parent set "size" (right minus left) is wider than the children it contains. The root parent will have the "largest" size. "All descendants" or "all ancestors" or "depth" queries are very easily done using nested sets.

Nested sets data model example

Proximity Queries

Proximity queries help search for the keywords where one is "near" another. Distance between keywords in the document approximates how related the terms are in the text. Those queries usually focus the search rather than providing exact results. For example, a search may be for a movie where two actors are mentioned and get many results. Some of the results will be less relevant because one of the actors is a producer and not an actor. The proximity query helps to rank the results.

For proximity queries, we see all our documents as a collection of keywords related by a distance between them. For each type of entity, the distance function can be different. For hierarchies, it can be the level of relationship between people or how the employees are related in terms of teams or departments. We can either base our proximity search on a single document or on the distance between documents. For searches based on a single document, we can record the position of each word and search for the shortest distance between the words.

In the movie database example below, all document keywords have been added as an additional set of properties ranking each of them with their position in a text. We can search for the movie produced by Clint Eastwood. The distance between the keywords "clint," "eastwood," and "produced" is very short, which means that the below document is highly relevant to our search.

Example of proximity index
Next Steps

Check out these tips for more information:



sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Maria Zakourdaev Maria Zakourdaev has been working with SQL Server for more than 20 years. She is also managing other database technologies such as MySQL, PostgreSQL, Redis, RedShift, CouchBase and ElasticSearch.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2022-09-06

Comments For This Article

















get free sql tips
agree to terms