@Kami commented Dec 19, 2019

Background, Details

There is a known issue with some mongoengine operations (especially retrievals and, to a lesser extent, insertions) being very slow when working with large documents (MongoEngine/mongoengine#1230).

The primary reason for that is the massive overhead mongoengine adds when converting the native pymongo representation to a document class instance and vice versa.

There are multiple approaches to tackling that, including switching to something like pymodm, which is much more efficient than mongoengine.

Most of those libraries claim an API that is very similar and largely compatible with mongoengine, but that doesn't help us at all.

Our DB layer abstraction is based heavily around mongoengine and we utilize a lot of low level mongoengine functionality and primitives (and also extend them in various places).

This means that switching the ORM library we use is pretty much impossible at the moment. It would simply require too much time and resources and it would likely also result in a lot of hard-to-catch edge case bugs.

Proposed Solution

While profiling and measuring hot code paths and trying to see which places we could optimize (switch directly to pymongo and avoid the mongoengine layer altogether), I found out that calling as_pymongo() on a mongoengine QuerySet object results in massive performance improvements.

The nice thing is that it's much easier to make that change compatible with our existing code and DB layer compared to fully switching the ORM layer, which would require rewriting very big chunks of the code.

The reason is that this method avoids the very expensive mongoengine conversion and dereferencing layer which we don't need anyway. It simply returns the raw pymongo result, aka a dictionary.

In this proposed implementation, I updated our get() and query() methods to utilize the as_pymongo() method and manually convert those raw dictionaries to our database model class instances.

It turns out that approach is much more efficient.
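
For illustration, here is a minimal sketch of the pattern - not the actual st2 DB access layer code. The example model, the helper name, and the _id -> id handling are assumptions made purely for this example:

```python
# Minimal sketch of the as_pymongo() based approach, not the actual st2 DB
# access layer code. The example model and the _id -> id handling are
# assumptions made purely for illustration.
import mongoengine as me


class ActionExecutionDB(me.Document):
    status = me.StringField()
    result = me.DictField()


def query_fast(model_cls, **filters):
    # as_pymongo() bypasses mongoengine's expensive SON -> Document conversion
    # and dereferencing layer and yields plain dictionaries instead.
    raw_documents = model_cls.objects(**filters).as_pymongo()

    instances = []
    for raw_doc in raw_documents:
        raw_doc = dict(raw_doc)
        # pymongo exposes the primary key as "_id", the model expects "id".
        raw_doc["id"] = raw_doc.pop("_id", None)
        # Manually instantiate the model class from the raw dictionary - this
        # is much cheaper than letting mongoengine build the Document itself.
        instances.append(model_cls(**raw_doc))

    return instances


# Example usage: executions = query_fast(ActionExecutionDB, status="succeeded")
```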

As part of those changes, I also need to update various code which calls query() so it expects that method to return actual DB model class instances and not a QuerySet object.

In fact, we already did that for a lot of the code in the past. If you check, 90% of the code which calls query() already expects it to return actual DB model objects and not a query set, so the change shouldn't be that big.

In places where that is not possible and where we need access to the raw QuerySet (there should be very few of those), we can use the raw_query() method directly.

Keep in mind that this approach is still not 100% ideal. In an ideal world, we would avoid instantiating DB model classes altogether and work directly with raw pymongo dictionaries (at least in critical code paths), but that change would be much harder and more involved.

Profiling Data

Here is some profiling data when retrieving a single large object with and without utilizing as_pymongo.

Retrieve large execution with as_pymongo=False
99.9 percentile: 0.011759551286697398s
99 percentile: 0.011098940372467038s
95 percentile: 0.009706759452819822s
90 percentile: 0.011098940372467038s

Retrieve large execution with as_pymongo=True 
99.9 percentile: 0.005786781072616579s
99 percentile: 0.005684897899627685s
95 percentile: 0.005177915096282959s
90 percentile: 0.005684897899627685s

As you can see, the difference is substantial (as_pymongo + manual model object instantiation is around 50% faster).

TODO

Overall, I'm quite optimistic I can get this whole change finished and all the tests passing, compared to the other much more involved and complex approaches (I already have a huge chunk of tests passing locally).

  • Update affected code and make sure all the tests pass

Kami added 4 commits December 19, 2019 16:30
internally and manually convert raw pymongo result (dict) to our
database model instances.

This approach is much more efficient compared to letting mongoengine do
the SON -> document conversion (that conversion is slow and has a lot of
overhead, especially for large documents).
@Kami Kami added the mongodb label Dec 19, 2019
@Kami Kami added this to the 3.2.0 milestone Dec 19, 2019
@pull-request-size pull-request-size bot added the size/M PR that changes 30-99 lines. Good size to review. label Dec 19, 2019
instance so it utilizes it in a fashion so it expects actual DB model
instance(s).

Also fix some of broken tests.
@Kami commented Dec 19, 2019

I'm also looking into whether I can get rid of the automatic conversion / dereferencing in the base document DB class.

This only needs to happen where the EmbeddedDocumentField field type is used. We only use that field type in a couple of places, yet dereferencing adds huge overhead.

There are multiple ways to get rid of auto conversion - probably the easiest / best way is to change those field types to simple Dict / EscapedDynamicField fields.

Another option is to explicitly try to de-reference only those fields which use that field type in a custom class constructor.

I will look into both approaches; the second one would probably require fewer changes.
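
To make the second option more concrete, here is a rough sketch of what selective dereferencing could look like. The model, field names, and helper are illustrative assumptions, not the actual code from this branch:

```python
# Rough sketch of selectively dereferencing only EmbeddedDocumentField fields
# instead of calling to_python() on every field. The model and field names
# are illustrative assumptions.
import mongoengine as me


class NotificationSchema(me.EmbeddedDocument):
    on_success = me.DictField()
    on_failure = me.DictField()


class ActionDB(me.Document):
    name = me.StringField()
    # The only field type which actually needs conversion on retrieval.
    notify = me.EmbeddedDocumentField(NotificationSchema)


def to_model_instance(model_cls, raw_doc):
    raw_doc = dict(raw_doc)
    raw_doc["id"] = raw_doc.pop("_id", None)

    for name, field in model_cls._fields.items():
        # Only embedded document fields pay the to_python() conversion cost;
        # everything else is passed through as-is.
        if isinstance(field, me.EmbeddedDocumentField) and raw_doc.get(name) is not None:
            raw_doc[name] = field.to_python(raw_doc[name])

    return model_cls(**raw_doc)
```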

inside the base mongoengine document class.

This process is very slow and adds a ton of overhead since it calls
"to_python()" on every model field even if the field doesn't need to be
referenced.

With this approach we manually de-reference only the fields which need
to be dereferenced / converted.

The ideal and fastest approach would be to work directly with the
dictionaries as returned by pymongo (that way we could avoid the whole
overhead), but this is a compromise.
@Kami commented Dec 19, 2019

And here is the WIP approach with manual de-referencing I'm testing - a7582ae.

It speeds those two methods up by 20-30% so it's, imo, well worth it (and we don't really have any better option anyway).


Another problematic and very slow part is using EscapedDictField and EscapedDynamicField.

Those field types mean we need to unescape the values when we retrieve those models from the database, and that's a very expensive and slow process (especially for large and nested dicts - I tested it with and without escaping, and without it retrieval is around 5x faster, which in absolute numbers is around 0.2s vs 0.02s).

It seems like that method of escaping is not recommended anymore anyway - https://jira.mongodb.org/browse/DOCS-9311.

And:

Field names cannot contain the null character.

Top-level field names cannot start with the dollar sign ($) character.

Otherwise, starting in MongoDB 3.6, the server permits storage of field names that contain dots (i.e. .) and dollar signs (i.e. $).

https://docs.mongodb.com/manual/reference/limits/#Restrictions-on-Field-Names

Sadly we don't have much choice here, we still need to support escaping (at least for a while) for backward compatibility reasons.
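
To illustrate why the unescaping is so expensive: every key of every nested dictionary has to be visited and rewritten on each save and each retrieval. A simplified sketch of that kind of escaping follows - the replacement characters are placeholders, and the actual escaping rules used by st2 may differ:

```python
# Simplified sketch of recursive key escaping / unescaping. The replacement
# characters below are placeholders; the actual rules used by st2 may differ.
ESCAPE_RULES = {".": "\uff0e", "$": "\uff04"}


def _transform_keys(value, rules):
    if isinstance(value, dict):
        transformed = {}
        for key, item in value.items():
            for char, replacement in rules.items():
                key = key.replace(char, replacement)
            transformed[key] = _transform_keys(item, rules)
        return transformed
    if isinstance(value, list):
        return [_transform_keys(item, rules) for item in value]
    return value


def escape_chars(value):
    # Called on every save - walks the whole (potentially very large) dict.
    return _transform_keys(value, ESCAPE_RULES)


def unescape_chars(value):
    # Called on every retrieval - walks the whole dict again.
    reverse_rules = {v: k for k, v in ESCAPE_RULES.items()}
    return _transform_keys(value, reverse_rules)
```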

@Kami commented Dec 19, 2019

The first step in optimizing those escaped fields would be changing fields which are controlled by us and can't contain $ and . characters to regular dict fields.

@m4dcoder Can all those fields contain special characters ($, .) - https://gist.github.com/Kami/04718da61cab79b02e2a2a19c3f4f397?

As per my comment above, using that field type adds massive overhead to object retrieval + insertion time.

Another alternative for fields such as result would be to store the value as a JSON serialized string. I still need to do some more benchmarking, but I believe that should be faster than escaping and unescaping very large result dictionaries.
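
As a very rough illustration of the idea, such a field could look something like this - the class name and implementation details are assumptions, not the field type that would eventually be implemented:

```python
# Very rough sketch of storing a dict as a JSON serialized string. The class
# name and implementation details are assumptions made for illustration only.
import json

from mongoengine.base import BaseField


class JSONDictField(BaseField):
    """Dict field persisted as a JSON string - no recursive key escaping needed."""

    def to_mongo(self, value):
        # Serialize the whole dict in one call on save.
        return json.dumps(value)

    def to_python(self, value):
        # Deserialize on retrieval; much cheaper than walking and unescaping
        # every key of a large nested dictionary.
        if isinstance(value, (str, bytes)):
            return json.loads(value)
        return value
```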

@Kami commented Dec 19, 2019

I did some retrieval benchmarking to see if we could indeed utilize a special field type which stores large dictionaries / values as JSON serialized strings.

EscapedDynamicField with character unescaping on retrieval:

99.9 percentile: 0.11631401681900028s
99 percentile: 0.1144586706161499s
95 percentile: 0.0900427460670471s
90 percentile: 0.1144586706161499s

Using a JSON field type where unescaping is not necessary:

99.9 percentile: 0.06011564707756043s
99 percentile: 0.059931752681732176s
95 percentile: 0.0596500039100647s
90 percentile: 0.059931752681732176s

Only "downside" of using a JSONDictField is that we can't query on actual dictionary values, but that's not something we are doing anyway and for all purposes, result field an opaque string / dictionary for us.

Now the main problem is how to implement this change in a backward compatible manner since we don't really have object / model versioning in place.

One approach is to do something like this and handle the conversion from the old format to the new one "lazily" on initial retrieval / write of existing objects:

  1. Add a new model_converted_to_new_json_field_type field (or similar) to all the affected models and default it to True for new models on insertion and False for existing models on retrieval.

  2. When retrieving an existing object, if that field is not set, assume we are working with the old data format, unescape the dictionary and, on save, store it in MongoDB in the new format and set the model_converted_to_new_json_field_type field to True.

  3. When writing a new object, we can just assume and use the new format (and the same applies when those objects are later retrieved from the database).

This applies to existing installations which upgrade to a new version, aka the backward compatibility layer. If we didn't need to support that (but we don't really have a choice), the change would obviously be much more straightforward.
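
Roughly what the lazy conversion could look like on retrieval - only the marker field name comes from the list above; the model and helpers are illustrative assumptions (unescape_chars and JSONDictField refer to the sketches in the earlier comments):

```python
# Rough sketch of the "lazy" conversion idea. Only the marker field name comes
# from the list above; the model and helpers are illustrative assumptions
# (unescape_chars and JSONDictField are from the earlier sketches).
import mongoengine as me


class ActionExecutionDB(me.Document):
    result = JSONDictField()  # new JSON string based field
    model_converted_to_new_json_field_type = me.BooleanField(default=False)


def get_execution(execution_id):
    raw_doc = dict(ActionExecutionDB.objects(id=execution_id).as_pymongo().first())

    if not raw_doc.get("model_converted_to_new_json_field_type", False):
        # Old format: "result" was stored as an escaped dict. Unescape it now
        # and flip the marker so the object is written back in the new format
        # on the next save.
        raw_doc["result"] = unescape_chars(raw_doc.get("result") or {})
        raw_doc["model_converted_to_new_json_field_type"] = True

    raw_doc["id"] = raw_doc.pop("_id", None)
    return ActionExecutionDB(**raw_doc)
```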

Another option would be to add a data migration script, but this could potentially cause long downtime for installations with a lot of existing database models, so I think the "lazy" approach is the way to go.

I believe that with that change + other changes in this PR we would get retrieval performance down to a reasonable level for the foreseeable future.

@Kami commented Dec 20, 2019

As part of my benchmarking and quest to improve performance with the JSON field idea proposed above, here are some more numbers.

In this benchmark, I used randomly generated data, but in the future, I also plan to benchmark it with various real-world data from our CI/CD server.

Here are the results.


EscapedDict/DynamicField

Small execution

insert

Mean duration: 0.005912823677062988s
99.9 percentile: 0.012965323686599837s
99 percentile: 0.007483551502227781s
95 percentile: 0.006837272644042968s
90 percentile: 0.007483551502227781s

retrieve

Mean duration: 0.0017433881759643555s
99.9 percentile: 0.002846957683563236s
99 percentile: 0.00250579357147217s
95 percentile: 0.0021214723587036134s

Medium size execution (50 000 bytes)

insert

Mean duration: 0.006792712211608887s
99.9 percentile: 0.0159519746303559s
99 percentile: 0.009502100944519013s
95 percentile: 0.007520365715026855s
90 percentile: 0.009502100944519013s

retrieve

Mean duration: 0.0025107145309448244s
99.9 percentile: 0.002986632108688355s
99 percentile: 0.002956254482269287s
95 percentile: 0.002732992172241211s
90 percentile: 0.002956254482269287s

Large execution (single large dict field)

insert

Mean duration: 0.024022660255432128s
99.9 percentile: 0.027110637426376365s
99 percentile: 0.026071195602416993s
95 percentile: 0.025422227382659913s
90 percentile: 0.026071195602416993s

retrieve

Mean duration: 0.022088223695755006s
99.9 percentile: 0.026436156034469655s
99 percentile: 0.023751575946807862s
95 percentile: 0.02322307825088501s
90 percentile: 0.023751575946807862s

Large execution (mixed dict with some large fields)

insert

Mean duration: 0.11646514534950256s
99.9 percentile: 0.12528950285911566s
99 percentile: 0.12221453905105592s
95 percentile: 0.12034807205200196s
90 percentile: 0.12221453905105592s

retrieve

Mean duration: 0.10380237460136414s
99.9 percentile: 0.11098703455924988s
99 percentile: 0.11004779577255248s
95 percentile: 0.10878726243972778s
90 percentile: 0.11004779577255248s

JSONDictEscapedFieldCompatibilityField

Small execution

insert

Mean duration: 0.0057206833362579345s
99.9 percentile: 0.0076095359325408975s
99 percentile: 0.007337703704833984s
95 percentile: 0.006766223907470702s
90 percentile: 0.007337703704833984s

retrieve

Mean duration: 0.0017057633399963378s
99.9 percentile: 0.0022321674823760993s
99 percentile: 0.0021707749366760257s
95 percentile: 0.002055680751800537s
90 percentile: 0.0021707749366760257s

Medium size execution

insert

Mean duration: 0.006151213645935059s
99.9 percentile: 0.010596619129180942s
99 percentile: 0.008227782249450678s
95 percentile: 0.0069027066230773915s
90 percentile: 0.008227782249450678s

retrieve

Mean duration: 0.0023874282836914063s
99.9 percentile: 0.0028212878704071053s
99 percentile: 0.002770516872406006s
95 percentile: 0.0026262402534484863s
90 percentile: 0.002770516872406006s

Large execution (single large dict field)

insert

Mean duration: 0.012752714157104493s
99.9 percentile: 0.014751040458679202s
99 percentile: 0.014529399871826172s
95 percentile: 0.014111042022705078s
90 percentile: 0.014529399871826172s

retrieve

Mean duration: 0.020717523097991943s
99.9 percentile: 0.024162772893905657s
99 percentile: 0.022991216182708746s
95 percentile: 0.021985554695129392s
90 percentile: 0.022991216182708746s

Large execution (mixed dict with some large fields)

insert

Mean duration: 0.037574822902679446s
99.9 percentile: 0.04342061400413514s
99 percentile: 0.04313043117523194s
95 percentile: 0.04008703231811523s
90 percentile: 0.04313043117523194s

retrieve

Mean duration: 0.09272492170333863s
99.9 percentile: 0.0965105586051941s
99 percentile: 0.09633594036102294s
95 percentile: 0.09582105875015258s
90 percentile: 0.09633594036102294s

As you can see, for small and medium sized fields / executions without many dictionary items, the difference is quite small (almost a rounding error).

The bigger difference starts to show up on large fields.

That's especially true for dictionaries with many keys because with the escaped field, we need to recursively process the whole dictionary to escape / unescape special characters and this adds a lot of overhead on serialization / deserialization.

Another thing worth noting is that I measured those changes on top of other changes in this branch.

Without other changes in this branch, performance for EscapedDictField (aka current approach) would be substantially worse.

The plan is to implement those changes in two incremental steps:

  1. First the new JSONDictEscapedFieldCompatibilityField field change
  2. Then the changes from this branch

Functionality will also be shipped behind a feature flag.

Those changes go hand in hand and offer 30-50% (and more, depending on the field / dictionary size) overall performance improvements in benchmarks with synthetic data.

@arm4b commented Apr 7, 2020

@StackStorm/maintainers I assume we have to remove this item from the v3.2 release plan.

@Kami Do you plan to work on this in the future?

@arm4b arm4b removed this from the 3.2.0 milestone Apr 7, 2020
stale bot commented Jul 6, 2020

Thanks for contributing to this issue. As it has been 90 days since the last activity, we are automatically marking it as stale. If this issue is not relevant or applicable anymore (the problem has been fixed in a new version or similar), please close the issue or let us know so we can close it. On the contrary, if the issue is still relevant, there is nothing you need to do, but if you have any additional details or context which would help us when working on this issue, please include it as a comment to this issue.

@Kami commented Feb 19, 2021

Closing in favor of #5159.

@stale stale bot removed the stale label Feb 19, 2021
@Kami Kami closed this Feb 19, 2021
@arm4b arm4b deleted the as_pymongo_optimization branch November 29, 2023 13:24