MongoDB schema design for SQL developers

Wojciech Marusarz
May 7, 2019

We should already know when to use relational and Non-Relational DBS. If we decide to select Non-Relational MongoDB, let’s use it properly. For SQL developers this requires changing the way of thinking about data and connections between them. Let’s dive into the MongoDB world.

First of all, some terms translated for further reading

SQL MongoDB
Database Database
Table Collection
Index Index
Row Document
Column Field
Joins Embed in document or link via DBRef

Even if such a mapping exists, it doesn’t mean that data persists in a similar way. For Relational DBS, the data structure is defined for Table, for Document DBS, each document has its structure defined, and documents in collections may differ.

No Schema Design ≠ No Database Design

In Relational DB, we need to design a database structure carefully. We need to create an appropriate table with predefined rows and to create foreign keys to join them. In MongoDB, we do not need to predefine collections, but instead, we have to decide how we are going to use our data and design a suitable document schema up front so that it allows to access data efficiently.

Let’s get through all mappings between tables for RDBS, and let’s look, how to implement it in MongoDB way using an online bookstore as an example.

One-To-One (1:1)

Relationship: Book Author has a single Address

Data model: Embedding: This first approach is to embed the Address document as an embedded document in the User document.

{
  name: "Peter Wilkinson",
  age: 27,
  address: {
    street: "100 some road",
    city: "Nevermore"
  }
}

Data model: Linking: The second approach is to link the Address and User document using a foreign key.

{
  _id: 1,
  name: "Peter Wilkinson",
  age: 27
}
{
 _id: 1,
  user_id: 1,
  street: "100 some road",
  city: "Nevermore"
}

In the one to one relationship, Embedding is the preferred way to model the relationship as it’s more efficient to retrieve the document if querying about author address. If asking about all addresses, more efficient would be linking, but it is a rare case.

One-To-Many (1:N)

Relationship: A Book might have a many Comments with user ratings, but a Comment is only related to a single Book

Data model: Embedding: The embedding of the Comments in the Book means we can easily retrieve all the comments belong to a particular Book. Adding new comments is as simple as appending the new comment document to the end of the comments array.

{
  title: "Great Expectations",
  description: "This is an awesome book of Charles Dickens",
  comments: [{
    name: "Peter Critic",
    created_on: ISODate("2019-01-01T10:01:22Z"),
    comment: "Not so awesome book"
  }, {
    name: "John Lenient",
    created_on: ISODate("2018-12-01T11:01:22Z"),
    comment: "Awesome book"
  }]
}

Data model: Linking: allows returning paginated comments as the application can slice and dice the comments more easily. On the downside, if we have 1000 comments on a book, we would need to retrieve all 1000 documents causing way too many reads from the database.

{
  _id: 1,
  title: "Great Expectations",
  description: "This is an awesome book of Charles Dickens"
}
{
  book_id: 1,
  name: "Peter Critic",
  created_on: ISODate("2019-01-01T10:01:22Z"),
  comment: "Not so awesome book"
}
{
  book_id: 1,
  name: "John Lenient",
  created_on: ISODate("2018-12-01T11:01:22Z"),
  comment: "Awesome book"
}

Data model: Bucketing: The third approach is a hybrid of the two above. It tries to balance the rigidity of the embedding strategy with the flexibility of the linking strategy. For this example, we split the comments into buckets with a maximum of 50 comments in each bucket.

{
  _id: 1,
 title: "Great Expectations",
 description: "This is an awesome book of Charles Dickens"
}
{
  blog_entry_id: 1,
  page: 1,
  count: 50,
  comments: [{
    name: "Peter Critic",
    created_on: ISODate("2019-01-01T10:01:22Z"),
    comment: "Not so awesome book"
  }, ...]
}
{
  blog_entry_id: 1,
  page: 2,
  count: 1,
  comments: [{
    name: "John Lenient",
    created_on: ISODate("2018-12-01T11:01:22Z"),
    comment: "Awesome book"
  }]
}

The main benefit of using buckets, in this case, is that we can perform a single read to fetch 50 comments at a time, allowing for efficient pagination.

Many-To-Many (N:M)

Relationship: Let’s think about a Book that was written by many Authors. At the same time, an Author might have written many Books

Data model: Two Way Embedding

{
  _id: 1,
  name: "Peter Standford",
  books: [1, 2]
}
{
  _id: 2,
  name: "Georg Peterson",
  books: [2]
}
{
  _id: 1,
  title: "A tale of two people",
  categories: ["drama"],
  authors: [1, 2]
}
{
  _id: 2,
  title: "A tale of two space ships",
  categories: ["scifi"],
  authors: [1]
}

Queries Fetch books by a specific author

var db = db.getSisterDB("library");
var booksCollection = db.books;
var authorsCollection = db.authors;

var author = authorsCollection.findOne({name: "Peter Standford"});
var books = booksCollection.find({_id: {$in: author.books}}).toArray();

Fetch authors by a specific book

var db = db.getSisterDB("library");
var booksCollection = db.books;
var authorsCollection = db.authors;

var book = booksCollection.findOne({title: "A tale of two space ships"});
var authors = authorsCollection.find({_id: {$in: book.authors}}).toArray();

Data model: One Way Embedding for uneven N:M relationships Let’s take category drama that might have thousands of books in it and with many new books consistently being added and removed. It would be impracticable to embed all the books in a category document. Each book, however, can easily have categories embedded within it, as the rate of categories change for a specific book might be very low. In this case, we should consider One Way Embedding as a strategy.

The One Way Embedding strategy chooses to optimize the read performance of an N:M relationship by embedding the references in one side of the relationship. An example might be where several books belong to a few categories, but a couple of categories have many books. Let’s look at an example, pulling the categories out into a separate document.

{
 _id: 1,
 name: "drama"
}
{
  _id: 1,
  title: "A tale of two people",
  categories: [1],
  authors: [1, 2]
}

The reason we choose to embed all the references to categories in the books is due to there being a lot more books in the drama category than categories in a book.

Queries Fetch categories for a specific book

var db = db.getSisterDB("library");
var booksCol = db.books;
var categoriesCol = db.categories;

var book = booksCol.findOne({title: "A tale of two space ships"});
var categories = categoriesCol.find({_id: {$in: book.categories}}).toArray();

Fetch books for a specific category

var db = db.getSisterDB("library");
var booksCollection = db.books;
var categoriesCollection = db.categories;

var category = categoriesCollection.findOne({name: "drama"});
var books = booksCollection.find({categories: category.id}).toArray();

Aggregation framework - SQL Statements for MongoDB

When we designed documents structure and modeled relationships between them, we may need to process data. In SQL databases we use SQL Statements that allow to search, filter, sort and many many more. For this purposes in MongoDB shell, Aggregation Pipeline is used.

The drawback of Aggregation Pipeline is lack of support for joining documents similar to JOIN operations in RDBS. For the processing of unstructured data, it is often required to modify documents structure.

The first MongoDB method that provided data processing paradigm for condensing large volumes of data into useful aggregated results is Map-Reduce. MapReduce is useful when used to implement a data pipeline. Multiple MapReduce commands can be chained to produce different results. Its main advantage is that it is inherently parallelizable as evidenced by frameworks such as Hadoop. Using MapReduce, MongoDB applies a map to each input document, emitting key-value pairs at the end of the map phase. Then each reducer gets key-value pairs with the same key as input, processing all multiple values. The reducer’s output is a single key-value pair for each key.

Aggregation framework

Since version 2.2, MongoDB has provided a better way to work with aggregation. The aggregation framework is modeled as a data processing pipelines.

In data processing pipelines there are two primary operations: filters that operate like queries, filtering documents, and document transformations that transform documents to get them ready for the next stage. In the following table, we can see how SQL commands map to the aggregation framework operators:

SQL Aggregation Framework
WHERE / HAVING $match
GROUP BY $group
SELECT $project
ORDER BY $sort
LIMIT $limit
sum() / count() $sum
join $lookup

If you need more details about migration from RDB to NoSQL, please read the official
MongoDB Documentation

Summary

As we could see, migration from RDB to NoSQL can be pretty straightforward. The most important thing is that we need to change our mindset to do it properly. First of all, we need to take care of proper relation assignment, to embrace MongoDB capabilities. In MongoDB, those relations seem to be more intuitive, but for bigger datasets, our assumptions need to change.

Now, let's talk about your project!

We don't have one standard offer.
Each project is unique, rest assured that we will approach the next one full of energy and engagement.

LET'S CONNECT