Mongo String Indexing part 1

misc1Overview

We have a Mongo Database and in that database we have an ‘items’ collection which has five string fields and a number of ID fields.  We’ve added 250k records to this collection and started testing the time it takes to perform various queries on these fields.  Simply adding indexes on the ID fields makes those types of queries lightning fast, however, we had a number of issues dealing with various string queries.  This is a record of our findings.

Querying Strings Sucks!

 

Mongo can do some wonderful things with its schema-less system, however, querying strings can be reaaaaalllly slooooooow.  We had to make a number of changes to the schema to optimize string queries and I will get into this later.

Initial Setup

 

In our queries, we always include ‘CurrentOfficeId’ (which is a long int) when performing them, so most of the indexes are compound and contain an office id.  Here are the indexes for our 5 troublesome string fields:

{ "CurrentOfficeId" : 1 , "Make" : 1}

{ "CurrentOfficeId" : 1 , "Model" : 1}

{ "CurrentOfficeId" : 1 , "SerialNumber" : 1}

{ "CurrentOfficeId" : 1 , "RecoveryLocation" : 1}

{ "CurrentOfficeId" : 1 , "Description" : 1}

NOTE: To create an index in the mongo command prompt, you would type:


db.collection.createIndex({“Fieldname”, sort order});

e.g.:  db.items.createIndex({ "CurrentOfficeId" : 1 , "Make" : 1});

On the UI, the user has 3 options for string searches: ‘Equals’, ‘Contains’, and ‘Not Equals’.  Furthermore, the strings are stored with case (case-sensitive) and we wanted the searches to be case insensitive, so our mongo searches were regular expressions:

 

Sample ‘Equals’ Query:


db.items.find({ "Model" :  /^Iphone$/i, "CurrentOfficeId" : NumberLong(1) });

Sample ‘Contains’ Query:


db.items.find({"Model" : /Iphone/i,"CurrentOfficeId" : NumberLong(1)});

Sample ‘Not Equals’ Query:


db.items.find({ "Model" : { "$not" : /^Iphone$/i }, "CurrentOfficeId" : NumberLong(1) });

Therese queries were running really slow, and with only 250k records!!

Sample Equals Query: 21 seconds

Sample Contains Query: 21 seconds

Sample Not-Equals Query: 1 second (WTH???)

 

Analysis

 

Ok we have our issue, now what?  After a bit of searching, it turns out that mongo has a wonderful function, called ‘explain’ (https://docs.mongodb.org/manual/reference/method/cursor.explain/) which will tell you how it came up with its solution (what indexes it used, total time it took, etc).  If I tack “.explain()” to the end of my queries, I can get loads of useful data to help us.  So in the mongo command line, we started testing:


db.items.find({ "Model" :  /^Iphone$/i, "CurrentOfficeId" : NumberLong(1) }).explain();

Outputs:


{

"cursor" : "BtreeCursor CurrentOfficeId_1_SubmittedById_1",

"isMultiKey" : false,

"n" : 4,

"nscannedObjects" : 133119,

"nscanned" : 133119,

"nscannedObjectsAllPlans" : 2187347,

"nscannedAllPlans" : 2280721,

"scanAndOrder" : false,

"indexOnly" : false,

"nYields" : 17818,

"nChunkSkips" : 0,

"millis" : 21028,

"server" : "h016470.mongolab.com:43802",

"filterSet" : false

}

At the top, you can see what indexes it used, (this is under ‘Cursor’).  So this query used BtreeCursor which is a standard Mongo cursor and CurrentOfficeId_1_SubmittedById_1 which is an index but the *wrong* index for us!  Under this, it displays a number of interesting fields, particularly “millis” which gives the time it took in milliseconds.  So this query took 21.028 seconds.

One way to force Mongo into using an index, is using the ‘hint’ function with the index name inside of it.  Let’s try to force this query to use our “Model” index:


db.items.find({ "Model" :  /^Iphone$/i, "CurrentOfficeId" : NumberLong(1) }).hint("CurrentOfficeId_1_Model_1").explain();

Outputs:


{

"cursor" : "BtreeCursor CurrentOfficeId_1_Model_1",

"isMultiKey" : false,

"n" : 4,

"nscannedObjects" : 4,

"nscanned" : 133084,

"nscannedObjectsAllPlans" : 4,

"nscannedAllPlans" : 133084,

"scanAndOrder" : false,

"indexOnly" : false,

"nYields" : 1039,

"nChunkSkips" : 0,

"millis" : 484,

"server" : "h016470.mongolab.com:43802",

"filterSet" : false

}

Under “cursor” we can see that the query used the index that we wanted and you can see that the search only took 484ms.  That’s a heck of a time saver!

In the end, it seems that the MongoDb Optimizer has trouble selecting the correct index to use when the query contains regular expressions (and justly so).

 

A Possible Solution

 

Since regular expressions cause problems, our solution was to create alternate string fields with all characters being lowercase and query on them instead.  This removes the regular expressions for ‘Equals’ and ‘Not Equals’ searches, but ‘Contains’ searches still need it.  First, we must re-create the indexes, using the new field names:


{ "CurrentOfficeId" : 1 , "MakeLC" : 1}

{ "CurrentOfficeId" : 1 , "ModelLC" : 1}

{ "CurrentOfficeId" : 1 , "SerialNumberLC" : 1}

{ "CurrentOfficeId" : 1 , "RecoveryLocationLC" : 1}

{ "CurrentOfficeId" : 1 , "DescriptionLC" : 1}

Now let’s test the ‘Equals’ query using our newly created lower case fields and without using a regular expression:


db.items.find({ "ModelLC" :  "iphone", "CurrentOfficeId" : NumberLong(1) }).explain();

Outputs:


{

"cursor" : "BtreeCursor CurrentOfficeId_1_ModelLC_1",

"isMultiKey" : false,

"n" : 4,

"nscannedObjects" : 4,

"nscanned" : 4,

"nscannedObjectsAllPlans" : 119,

"nscannedAllPlans" : 119,

"scanAndOrder" : false,

"indexOnly" : false,

"nYields" : 0,

"nChunkSkips" : 0,

"millis" : 4,

"server" : "h016470.mongolab.com:43802",

"filterSet" : false

}

This is nice!  As you can see, Mongo has selected the correct index, “CurrentOfficeId_1_ModelLC_1″ and the entire query only took 4ms!  But what about ‘Contains’ and ‘Not Equals’?  Let’s try ‘Not Equals’:


db.items.find({ "ModelLC" : { "$ne" : "iphone" }, "CurrentOfficeId" : NumberLong(1) }).explain();

Outputs:


{

"cursor" : "BtreeCursor LocationCount",

"isMultiKey" : false,

"n" : 133115,

"nscannedObjects" : 133119,

"nscanned" : 133119,

"nscannedObjectsAllPlans" : 135442,

"nscannedAllPlans" : 135442,

"scanAndOrder" : false,

"indexOnly" : false,

"nYields" : 1076,

"nChunkSkips" : 0,

"millis" : 1041,

"server" : "h016470.mongolab.com:43802",

"filterSet" : false

}

Even though it didn’t use the index we wanted, the speed is only 1 second which is not too terrible but we want this to be faster.  Let’s try forcing the system to use the ModelLC index:


db.items.find({ "ModelLC" : { "$ne" : "iphone" }, "CurrentOfficeId" : NumberLong(1) }).hint("CurrentOfficeId_1_ModelLC_1").explain();

Outputs:


{

"cursor" : "BtreeCursor CurrentOfficeId_1_ModelLC_1",

"isMultiKey" : false,

"n" : 133115,

"nscannedObjects" : 133115,

"nscanned" : 133116,

"nscannedObjectsAllPlans" : 133115,

"nscannedAllPlans" : 133116,

"scanAndOrder" : false,

"indexOnly" : false,

"nYields" : 1039,

"nChunkSkips" : 0,

"millis" : 430,

"server" : "h016470.mongolab.com:43802",

"filterSet" : false

}

Hmmm… It only took 400ms that time.  This leaves me stumped as to why the optimizer doesn’t select the fastest this as an index.  For now, we’re going to require this index when performing Model ‘Not Equals’ queries.

 

Our ‘Contains’ query worked similar to our ‘Not Equals’ query:


db.items.find({"ModelLC" : /iphone/,"CurrentOfficeId" : NumberLong(1)}).explain();

Outputs:


{

"cursor" : "BtreeCursor CurrentOfficeId_1_SubmittedById_1",

"isMultiKey" : false,

"n" : 4,

"nscannedObjects" : 133119,

"nscanned" : 133119,

"nscannedObjectsAllPlans" : 2187347,

"nscannedAllPlans" : 2280721,

"scanAndOrder" : false,

"indexOnly" : false,

"nYields" : 17818,

"nChunkSkips" : 0,

"millis" : 26513,

"server" : "h016470.mongolab.com:43802",

"filterSet" : false

}

This uses the incorrect index and takes a whopping 26 seconds to complete!!! That’s terrible!  Let’s try forcing the index again:


db.items.find({"ModelLC" : /iphone/,"CurrentOfficeId" : NumberLong(1) } ).hint("CurrentOfficeId_1_ModelLC_1").explain();

Outputs:


{

"cursor" : "BtreeCursor CurrentOfficeId_1_ModelLC_1",

"isMultiKey" : false,

"n" : 4,

"nscannedObjects" : 4,

"nscanned" : 133084,

"nscannedObjectsAllPlans" : 4,

"nscannedAllPlans" : 133084,

"scanAndOrder" : false,

"indexOnly" : false,

"nYields" : 1039,

"nChunkSkips" : 0,

"millis" : 409,

"server" : "h016470.mongolab.com:43802",

"filterSet" : false

}

Now it’s using the index we think it should and it only took 400ms.

 

Conclusion

 

I believe that more research is needed, but for now we require certain indexes (via ‘hint) on ‘Not Equals’ and ‘Contains’ style queries.  This currently provides quite a speed boost, but may not be enough when we move to 1 million and then 10 million records.  As we make changes, I will post more of our findings.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s