[WIP] [EXPERIMENT] [DONT MERGE] Speed up "get one" and "get all" / query MongoDB (mongoengine) operations #4838
Conversation
Update `get()` and `query()` to use `as_pymongo()` internally and manually convert the raw pymongo result (dict) to our database model instances. This approach is much more efficient compared to letting mongoengine do the SON -> document conversion (that conversion is slow and has a lot of overhead, especially for large documents).

integer and we can avoid the conversion.

Update code which calls `query()` so it utilizes the result as actual DB model instance(s) rather than a QuerySet. Also fix some of the broken tests.
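For illustration, here is a minimal sketch of what "manually convert the raw pymongo result (dict) to our database model instances" can look like. The model class and field names below are made up for the example and this is not the actual patch; it only shows the general shape of the conversion:

```python
from mongoengine import Document, StringField, DictField

# Hypothetical model, used only for this sketch
class ExampleDB(Document):
    status = StringField()
    result = DictField()

def query_as_models(model_cls, **filters):
    """Fetch raw pymongo dicts via as_pymongo() and build model instances
    manually, skipping mongoengine's expensive SON -> Document conversion."""
    instances = []
    for raw in model_cls.objects(**filters).as_pymongo():
        raw = dict(raw)
        raw["id"] = raw.pop("_id", None)  # pymongo uses "_id", the model uses "id"
        raw.pop("_cls", None)             # drop mongoengine's bookkeeping field if present
        instances.append(model_cls(**raw))
    return instances

# Usage (requires an active mongoengine connection):
# executions = query_as_models(ExampleDB, status="succeeded")
```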
I'm also looking at whether I can get rid of the auto conversion / dereferencing in the base document DB class. This only needs to happen for the fields which actually require it. There are multiple ways to get rid of the auto conversion - probably the easiest / best way is to change those field types to simple Dict / EscapedDynamicField fields. Another option is to explicitly de-reference only those fields which use that field type in a custom class constructor. I will look into both approaches; the second one would probably require fewer changes.
That auto conversion / dereferencing happens inside the base mongoengine document class. This process is very slow and adds a ton of overhead, since it calls `to_python()` on every model field even if the field doesn't need to be dereferenced. With this approach we manually de-reference only the fields which need to be dereferenced / converted. The ideal and fastest approach would be to work directly with the dictionaries as returned by pymongo (this way we could avoid the whole overhead), but this is a compromise.
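To make the "de-reference / convert only the fields which need it" idea concrete, here is a hedged sketch. The field names and the conversion table are purely illustrative, not the actual change:

```python
import datetime

# Illustrative map of field name -> conversion function for the few fields
# which actually need converting; everything else is used as returned by pymongo
FIELD_CONVERTERS = {
    # e.g. a timestamp stored as microseconds since the epoch
    "start_timestamp": lambda usec: datetime.datetime.fromtimestamp(
        usec / 1e6, tz=datetime.timezone.utc
    ),
}

def convert_raw_document(raw_doc):
    """Selectively convert fields of a raw pymongo dict instead of letting
    mongoengine run to_python() on every single field."""
    converted = dict(raw_doc)
    for field_name, convert in FIELD_CONVERTERS.items():
        if converted.get(field_name) is not None:
            converted[field_name] = convert(converted[field_name])
    return converted
```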
And here is the WIP approach with manual de-referencing I'm testing - a7582ae. It speeds those two methods up by 20-30%, so it's, imo, well worth it (and we don't really have any better option anyway).

Another problematic and very slow part is using the escaped field types (EscapedDictField / EscapedDynamicField). It means we need to unescape the values when we retrieve those models from the database, and that's a very expensive and slow process (especially on large and nested dicts - I tested it with and without escaping and without it, it's around 5x faster, which in absolute numbers is around 0.2s vs 0.02s). It seems like that method of escaping is not recommended anymore anyway - https://jira.mongodb.org/browse/DOCS-9311. And:

https://docs.mongodb.com/manual/reference/limits/#Restrictions-on-Field-Names

Sadly we don't have much choice here; we still need to support escaping (at least for a while) for backward compatibility reasons.
The first step in optimizing those escaped fields would be changing fields which are controlled by us and can't contain special characters to a plain field type. @m4dcoder Can all those fields contain special characters ($, .) - https://gist.github.com/Kami/04718da61cab79b02e2a2a19c3f4f397? As per my comment above, using that field type adds massive overhead on object retrieval + insertion time. Another alternative for fields such as those would be storing the values as JSON serialized strings (see below).
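To illustrate the JSON-serialized-string alternative mentioned above, a custom field could look roughly like the sketch below. This is an assumption about how such a field might be implemented, not the actual implementation:

```python
import json

from mongoengine.base import BaseField

class JSONDictField(BaseField):
    """Sketch of a field which stores a dict as a JSON serialized string, so
    dictionary keys never need to be escaped for MongoDB's restrictions on
    "$" and "." in key names."""

    def to_mongo(self, value):
        # Serialize once on write instead of recursively escaping every key
        return json.dumps(value)

    def to_python(self, value):
        # Values written by this field are JSON strings; pass anything else
        # (e.g. a default dict value) through unchanged
        if isinstance(value, str):
            return json.loads(value)
        return value
```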
I did some retrieval benchmarking to see if we could indeed utilize a special field type which stores large dictionaries / values as JSON serialized strings.

EscapedDynamicField with character unescaping on retrieval:
- 99.9 percentile: 0.11631401681900028s
- 99 percentile: 0.1144586706161499s
- 95 percentile: 0.0900427460670471s
- 90 percentile: 0.1144586706161499s

Using a JSON field type where unescaping is not necessary:
- 99.9 percentile: 0.06011564707756043s
- 99 percentile: 0.059931752681732176s
- 95 percentile: 0.0596500039100647s
- 90 percentile: 0.059931752681732176s

The only "downside" of using a JSONDictField is that we can't query on actual dictionary values, but that's not something we are doing anyway, so for all practical purposes it doesn't matter.

Now the main problem is how to implement this change in a backward compatible manner, since we don't really have object / model versioning in place. One approach is to handle conversion from the old format to the new one "lazily" on initial retrieval / write of the existing objects (see the sketch below).

This applies to existing installations which upgrade to a new version, aka a backward compatibility layer. If we didn't need to support that (but we don't really have a choice), the change would obviously be much more straightforward. Another option would be to add a data migration script, but this could potentially cause long downtime for installations with a lot of existing database models, so I think the "lazy" approach is the way to go. I believe that with that change + the other changes in this PR we would get retrieval performance down to a reasonable level for the foreseeable future.
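Here is a hedged sketch of what that lazy, backward compatible conversion could look like. The class name mirrors the JSONDictEscapedFieldCompatibilityField used in the benchmarks below, but the body is an assumption, not the actual implementation: on read it accepts both the old escaped-dict format and the new JSON string format, and on write it always persists the new format, so existing documents converge lazily as they are saved.

```python
import json

from mongoengine.base import BaseField

def unescape_chars(value):
    # Placeholder for the existing recursive key unescaping logic used by the
    # current escaped field types
    return value

class JSONDictEscapedFieldCompatibilityField(BaseField):
    """Reads both formats: old documents contain an escaped dict, new documents
    contain a JSON serialized string. Writes always use the new format, so
    existing objects are converted lazily the next time they are saved."""

    def to_mongo(self, value):
        # Always write the new JSON string format
        return json.dumps(value)

    def to_python(self, value):
        if isinstance(value, str):
            # New format - plain JSON string, no unescaping needed
            return json.loads(value)
        if isinstance(value, dict):
            # Old format - escaped dict written by the previous field type
            return unescape_chars(value)
        return value
```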
As part of my benchmarking and my quest to improve performance with the JSON field idea proposed above, here are some more numbers. In this benchmark I used randomly generated data, but in the future I also plan to benchmark it with various real-world data from our CI/CD server. Here are the results.

**EscapedDict/DynamicField**

*Small execution*

insert:
- Mean duration: 0.005912823677062988s
- 99.9 percentile: 0.012965323686599837s
- 99 percentile: 0.007483551502227781s
- 95 percentile: 0.006837272644042968s
- 90 percentile: 0.007483551502227781s

retrieve:
- Mean duration: 0.0017433881759643555s
- 99.9 percentile: 0.002846957683563236s
- 99 percentile: 0.00250579357147217s
- 95 percentile: 0.0021214723587036134s

*Medium size execution (50 000 bytes)*

insert:
- Mean duration: 0.006792712211608887s
- 99.9 percentile: 0.0159519746303559s
- 99 percentile: 0.009502100944519013s
- 95 percentile: 0.007520365715026855s
- 90 percentile: 0.009502100944519013s

retrieve:
- Mean duration: 0.0025107145309448244s
- 99.9 percentile: 0.002986632108688355s
- 99 percentile: 0.002956254482269287s
- 95 percentile: 0.002732992172241211s
- 90 percentile: 0.002956254482269287s

*Large execution (single large dict field)*

insert:
- Mean duration: 0.024022660255432128s
- 99.9 percentile: 0.027110637426376365s
- 99 percentile: 0.026071195602416993s
- 95 percentile: 0.025422227382659913s
- 90 percentile: 0.026071195602416993s

retrieve:
- Mean duration: 0.022088223695755006s
- 99.9 percentile: 0.026436156034469655s
- 99 percentile: 0.023751575946807862s
- 95 percentile: 0.02322307825088501s
- 90 percentile: 0.023751575946807862s

*Large execution (mixed dict with some large fields)*

insert:
- Mean duration: 0.11646514534950256s
- 99.9 percentile: 0.12528950285911566s
- 99 percentile: 0.12221453905105592s
- 95 percentile: 0.12034807205200196s
- 90 percentile: 0.12221453905105592s

retrieve:
- Mean duration: 0.10380237460136414s
- 99.9 percentile: 0.11098703455924988s
- 99 percentile: 0.11004779577255248s
- 95 percentile: 0.10878726243972778s
- 90 percentile: 0.11004779577255248s

**JSONDictEscapedFieldCompatibilityField**

*Small execution*

insert:
- Mean duration: 0.0057206833362579345s
- 99.9 percentile: 0.0076095359325408975s
- 99 percentile: 0.007337703704833984s
- 95 percentile: 0.006766223907470702s
- 90 percentile: 0.007337703704833984s

retrieve:
- Mean duration: 0.0017057633399963378s
- 99.9 percentile: 0.0022321674823760993s
- 99 percentile: 0.0021707749366760257s
- 95 percentile: 0.002055680751800537s
- 90 percentile: 0.0021707749366760257s

*Medium size execution*

insert:
- Mean duration: 0.006151213645935059s
- 99.9 percentile: 0.010596619129180942s
- 99 percentile: 0.008227782249450678s
- 95 percentile: 0.0069027066230773915s
- 90 percentile: 0.008227782249450678s

retrieve:
- Mean duration: 0.0023874282836914063s
- 99.9 percentile: 0.0028212878704071053s
- 99 percentile: 0.002770516872406006s
- 95 percentile: 0.0026262402534484863s
- 90 percentile: 0.002770516872406006s

*Large execution (single large dict field)*

insert:
- Mean duration: 0.012752714157104493s
- 99.9 percentile: 0.014751040458679202s
- 99 percentile: 0.014529399871826172s
- 95 percentile: 0.014111042022705078s
- 90 percentile: 0.014529399871826172s

retrieve:
- Mean duration: 0.020717523097991943s
- 99.9 percentile: 0.024162772893905657s
- 99 percentile: 0.022991216182708746s
- 95 percentile: 0.021985554695129392s
- 90 percentile: 0.022991216182708746s

*Large execution (mixed dict with some large fields)*

insert:
- Mean duration: 0.037574822902679446s
- 99.9 percentile: 0.04342061400413514s
- 99 percentile: 0.04313043117523194s
- 95 percentile: 0.04008703231811523s
- 90 percentile: 0.04313043117523194s

retrieve:
- Mean duration: 0.09272492170333863s
- 99.9 percentile: 0.0965105586051941s
- 99 percentile: 0.09633594036102294s
- 95 percentile: 0.09582105875015258s
- 90 percentile: 0.09633594036102294s

As you can see, for small and medium sized fields / executions with a small number of dictionary items, the difference is quite small, aka almost a rounding error. The bigger difference starts to show up on large fields. That's especially true for dictionaries with many keys, because with the escaped field we need to recursively process the whole dictionary to escape / unescape special characters, and this adds a lot of overhead on serialization / deserialization.

Another thing worth noting is that I measured those changes on top of the other changes in this branch. Without the other changes in this branch, performance for EscapedDictField (aka the current approach) would be substantially worse.

The plan is to implement those changes in two incremental steps:
Functionality will also be shipped behind a feature flag. Those changes go together hand in hand and offer 30-50% (and more, depending on the field / dictionary size) overall performance improvements in benchmarks with synthetic data.
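On the feature flag point: StackStorm configuration is oslo.config based, so gating the new behaviour could look roughly like the sketch below. The option name and group are made up for illustration and are not part of this PR:

```python
from oslo_config import cfg

# Hypothetical feature flag - the option name and group are illustrative only
db_opts = [
    cfg.BoolOpt(
        "use_json_dict_fields",
        default=False,
        help="Store large dict fields as JSON serialized strings instead of "
        "escaped dicts.",
    ),
]
cfg.CONF.register_opts(db_opts, group="database")

# Call sites can then branch on cfg.CONF.database.use_json_dict_fields to pick
# which field implementation / serialization path to use.
```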
@StackStorm/maintainers I assume we have to remove this item from the

@Kami Do you plan to work on this at any point in the future?
Thanks for contributing to this issue. As it has been 90 days since the last activity, we are automatically marking it as stale. If this issue is not relevant or applicable anymore (the problem has been fixed in a new version or similar), please close the issue or let us know so we can close it. On the contrary, if the issue is still relevant, there is nothing you need to do, but if you have any additional details or context which would help us when working on this issue, please include it as a comment on this issue.
Closing in favor of #5159.
Background, Details
There is a known issue with some mongoengine operations (especially retrievals and, to a lesser extent, also insertions) being very slow when working with large documents (MongoEngine/mongoengine#1230).
The primary reason for that is the massive overhead mongoengine adds when converting the native pymongo representation to the document class instance and vice versa.
There are multiple approaches on how to tackle that, including switching to something like pymodm which is much more efficient than mongoengine.
Most of those libraries claim a very similar and compatible API to mongoengine, but this doesn't help us at all.
Our DB layer abstraction is based heavily around mongoengine and we utilize a lot of low level mongoengine functionality and primitives (and also extend it in various places).
This means that switching the ORM library we use is pretty much impossible at the moment. It would simply require too much time and resources and it would likely also result in a lot of hard to catch / edge case bugs.
Proposed Solution
While profiling and measuring hot code paths and trying to see which places we could optimize (switch directly to pymongo and avoid the mongoengine layer altogether), I found out that calling `as_pymongo()` on a mongoengine QuerySet object results in massive performance improvements. The nice thing is that it's much easier to make that change compatible with our existing code and DB layer compared to fully switching the ORM layer, which would require rewriting very big chunks of the code.

The reason for that is that this method avoids the very expensive mongoengine conversion and dereferencing layer which we don't need anyway. It simply returns the raw pymongo result, aka a dictionary.

In this proposed implementation, I updated our `get()` and `query()` methods to utilize the `as_pymongo()` method and manually convert those raw dictionaries to our database model class instances. It turns out that approach is much more efficient.
As part of those changes, I also need to update various code which utilizes `query()` so that it expects that method to return actual DB model class instances and not a `QuerySet` object. In fact, we already did that for a lot of the code in the past. If you check, 90% of the code which calls `query()` already expects it to return actual DB model objects and not a query set, so the change shouldn't be that big.

In places where that is not possible and where we need access to the raw QuerySet (there should be very few of those), we can use the `raw_query()` method directly.

Keep in mind that this approach is still not 100% ideal. In an ideal world, we would avoid instantiating the DB class altogether and work directly with raw pymongo dictionaries (at least in critical code paths), but that change would be much harder and more involved.
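To make the proposed shape of the access layer concrete, here is a hedged sketch. The method names (`get()`, `query()`, `raw_query()`) follow the ones discussed above, but the bodies are illustrative and simplified, not the actual diff:

```python
class MongoDBAccess(object):
    """Simplified sketch of a persistence access layer built around as_pymongo()."""

    def __init__(self, model_cls):
        self.model_cls = model_cls

    def raw_query(self, **filters):
        # Escape hatch for the few call sites which really need a QuerySet
        return self.model_cls.objects(**filters)

    def query(self, **filters):
        # Fetch raw dicts and convert them to model instances ourselves,
        # bypassing mongoengine's expensive SON -> Document conversion
        raw_results = self.model_cls.objects(**filters).as_pymongo()
        return [self._to_model(raw) for raw in raw_results]

    def get(self, **filters):
        results = self.query(**filters)
        return results[0] if results else None

    def _to_model(self, raw):
        raw = dict(raw)
        raw["id"] = raw.pop("_id", None)  # map pymongo's "_id" to the model's "id"
        raw.pop("_cls", None)
        return self.model_cls(**raw)
```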
Profiling Data
Here is some profiling data when retrieving a single large object with and without utilizing `as_pymongo()`.
As you can see, the difference is substantial (aka `as_pymongo()` + manual model object instantiation is around 50% faster).
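For reference, a comparison along those lines can be reproduced with a small snippet like the one below. The model import path is hypothetical and this is not the exact script used to produce the numbers above:

```python
import timeit

from mongoengine import connect

# Hypothetical import path - substitute a real model with a large dict field
from myapp.models import ExampleDB

connect("st2-benchmark")

def retrieve_via_mongoengine():
    # Full mongoengine SON -> Document conversion path
    return ExampleDB.objects.limit(1).first()

def retrieve_via_as_pymongo():
    # Raw pymongo dict, no document conversion
    return list(ExampleDB.objects.limit(1).as_pymongo())

for func in (retrieve_via_mongoengine, retrieve_via_as_pymongo):
    duration = timeit.timeit(func, number=100)
    print("%s: %.4f seconds for 100 retrievals" % (func.__name__, duration))
```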
TODO
Overall, I'm quite optimistic I can get this change finished and all the tests passing, unlike many of the other, much more involved and complex approaches (I already have a huge chunk of the tests passing locally).