What's the difference between two indexes when querying with `$in` and an array?

Question:

A MongoDB collection with documents like these:

{a: 1, b: 1}
{a: 2, b: 2}
{a: 3, b: 3}
{a: 3, b: 2}
{a: 2, b: 1}

with a unique index, either a_1_b_1 or b_1_a_1.

Query: {a: x, b: { $in: [....] } }

Which index is better, or are they the same?

How does query matching against an array work?


Update: Does the shard key affect which index the query uses?

shard key: a_1_c_1
extra index: b_1_a_1
query: {a: x, b: y}

  1. Is the query routed to the shard by a=x using the shard key a_1_c_1, and then run within the shard using index b_1_a_1?
  2. Or is it routed by the shard key, and must the query then also use the shard key index?

Answer 1:

From the MongoDB manual's section on compound indexes:

db.products.createIndex( { "item": 1, "stock": 1 } )

The order of the fields listed in a compound index is important. The index will contain references to documents sorted first by the values of the item field and, within each value of the item field, sorted by values of the stock field.

Given the above, we can see that a_1_b_1 will segment first by a and then by b, whereas b_1_a_1 will segment first by b and then by a.
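To make that ordering concrete, here is a minimal sketch in plain JavaScript that models each index as a sorted list of keys built from the question's sample documents. The helper `indexKeys` is hypothetical and only imitates the sort order of an ascending compound index; it is not how MongoDB stores or builds an index.

```javascript
// Sample documents from the question.
const docs = [
  { a: 1, b: 1 },
  { a: 2, b: 2 },
  { a: 3, b: 3 },
  { a: 3, b: 2 },
  { a: 2, b: 1 },
];

// Hypothetical helper: ordered key list of an ascending compound index
// on the given fields (a toy model, not MongoDB's real index structure).
function indexKeys(documents, fields) {
  return documents
    .map((doc) => fields.map((f) => doc[f]))
    .sort((x, y) => {
      for (let i = 0; i < x.length; i++) {
        if (x[i] !== y[i]) return x[i] - y[i];
      }
      return 0;
    });
}

const keysAB = indexKeys(docs, ["a", "b"]); // models a_1_b_1
const keysBA = indexKeys(docs, ["b", "a"]); // models b_1_a_1

console.log(JSON.stringify(keysAB)); // [[1,1],[2,1],[2,2],[3,2],[3,3]]
console.log(JSON.stringify(keysBA)); // [[1,1],[1,2],[2,2],[2,3],[3,3]]
```

In the first list, keys that share an a value sit next to each other; in the second, keys that share a b value do.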

Now let's examine your query: {a: x, b: { $in: [....] } }
Note that this query matches a specific a value and a set of possible b values. With index a_1_b_1, the index scan is confined to the single matching a block, and the requested b values are looked up within it; with index b_1_a_1, the index scan must "jump" between the different b blocks and search each of them for the matching a value.

It's typically far more efficient to access data that is "close" together, so you will want to select the index in which your matching documents are more likely to be closely located. In this case, having all of your documents in the same a block will likely be a much better choice as there should be less "jumping" going on, so you should go with index a_1_b_1.

This is grossly over-simplified, however. The actual performance impact may be negligible, particularly when the number of distinct a and b values is small.
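The "close together" argument can be illustrated with a toy model. The sketch below uses made-up data (every (a, b) pair with a and b from 1 to 4) and two hypothetical helpers, `sortedKeys` and `matchPositions`, to show where the keys matching {a: 2, b: {$in: [1, 3]}} fall in each ordering. It only models key order; it says nothing about how the real query planner seeks or what a scan actually costs.

```javascript
// Made-up data: every (a, b) pair with a and b in 1..4.
const grid = [];
for (let a = 1; a <= 4; a++) {
  for (let b = 1; b <= 4; b++) grid.push({ a, b });
}

// Hypothetical helper: ordered keys of an ascending two-field compound index.
function sortedKeys(documents, fields) {
  return documents
    .map((doc) => fields.map((f) => doc[f]))
    .sort((x, y) => x[0] - y[0] || x[1] - y[1]);
}

// Hypothetical helper: positions, within the ordered key list, of the keys
// that match {a: x, b: {$in: bs}}.
function matchPositions(keys, fields, x, bs) {
  const ai = fields.indexOf("a");
  const bi = fields.indexOf("b");
  const positions = [];
  keys.forEach((key, pos) => {
    if (key[ai] === x && bs.includes(key[bi])) positions.push(pos);
  });
  return positions;
}

const posAB = matchPositions(sortedKeys(grid, ["a", "b"]), ["a", "b"], 2, [1, 3]);
const posBA = matchPositions(sortedKeys(grid, ["b", "a"]), ["b", "a"], 2, [1, 3]);

console.log(JSON.stringify(posAB)); // [4,6]
console.log(JSON.stringify(posBA)); // [1,9]
```

Under a_1_b_1 the two matching keys are two positions apart, inside the a = 2 block; under b_1_a_1 they are eight positions apart, in separate b blocks.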

There is also an extra consideration you should make: index prefixes. If you sometimes perform queries with only an a value, then you should select index a_1_b_1. Likewise, if you sometimes perform queries with only a b value, then you should probably select b_1_a_1.

This is because if your query doesn't completely match an index but matches a prefix of that index, the index will still apply. Thus, in index a_1_b_1 you can perform efficient queries on {a: x, b: {$in: [....]}} as well as {a: x}, but you cannot perform an efficient query on {b: {$in: [....]}}.
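The prefix rule can be sketched as a small check. The helper `usablePrefixLength` below is hypothetical and deliberately simplified: it only counts how many leading index fields a query constrains, whereas the real planner weighs much more than that.

```javascript
// Hypothetical helper: number of leading index fields that the query constrains.
// 0 means the query does not match any prefix of the index.
function usablePrefixLength(indexFields, query) {
  let n = 0;
  for (const field of indexFields) {
    if (!(field in query)) break; // stop at the first unconstrained field
    n++;
  }
  return n;
}

const a1b1 = ["a", "b"]; // models index a_1_b_1

console.log(usablePrefixLength(a1b1, { a: 1, b: { $in: [1, 2] } })); // 2
console.log(usablePrefixLength(a1b1, { a: 1 }));                     // 1
console.log(usablePrefixLength(a1b1, { b: { $in: [1, 2] } }));       // 0
```

The last query constrains only b, which is not a prefix of a_1_b_1, so that index cannot narrow the scan for it.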

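Finally, it may also be possible to take advantage of index intersection with two separate indexes, a_1 and b_1, giving you a middle ground between performance and flexibility. Keep in mind that the query planner does not always choose an intersection plan, and that two single-field indexes cannot enforce uniqueness on the (a, b) pair.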


With everything above in mind, I wouldn't recommend concerning yourself too much with index performance until the size of your data starts to necessitate it. After all, you can drop old indexes and build new ones as needed. Use what works for now, monitor the performance over time, and reassess when it looks like you might be outgrowing what you're currently using.

  • Posted on 2019-05-19 21:38