Skip to content

Conversation

dokshina
Copy link
Contributor

@dokshina dokshina commented Jan 22, 2021

The problem is that if schema is changed on storages, router still uses
outdated schema from vshard connection. It leads to errors like
"non-existent space" when space is already created.
The solution is quite simple - all such errors leads to conn:reload_schema()
call and then crud retries. In case of error on the storage-side (e.g. in *_object
functions when object was flattened using outdated space format) space schema
hash is returned. Router compares this hash with hash of schema that was used
on router-side, and reloads schema if hashes are different.
conn:reload_schema() is called by one fiber at time - if reloading is in progress,
other fibers just wait for it to end.

Closes #98

local space = replicaset.master.conn.space[space_name]

local master = replicaset.master
if not master.conn:ping({timeout = ping_timeout}) then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, we'll do an extra PING call on each crud operation?
Although it is a working solution, we need to measure the performance impact. What about pinging once, saving the schema internally and update it when the schema version changes, as it is done inside the connection?
Also, it is possible to fetch schema from replicas too. Probably it may be faster and save time for the master a bit.

@knazarov
Copy link
Contributor

I don't believe that issuing ping requests on every operation is a good measure. Instead, we can do the same thing regular box operations do: send our schema ID along with the call and if they diverge with what is currently on the storage, return a special error and the router will then retry.

@olegrok
Copy link
Contributor

olegrok commented Jan 22, 2021

send our schema ID along with the call and if they diverge with what is currently on the storage, return a special error and the router will then retry.

That's still not a general purpose solution. There are some examples when schema could be dynamic and schema will be changed quite often - this will lead to continuous retries.
Moreover you shouldn't send schema_id manually because in case of schema mismatch Tarantool fetches it automatically. Seems your suggestion could be simplified to

while true do
    local schema_id = conn.schema_version
    local data, err = conn:call(...)
    if err ~= nil and conn.schema_version == schema_id then
        return nil, err
    elseif data ~= nil then
        return data
    end
end

I don't like it. I could update some spaces that don't related to my queries but it will be cause some retries.
May be it's rare cases but anyway we shouldn't neglect them.
Seems that the most appropriate solution is support "DDL". I suggest to support it optionally I don't want to impose this on the user.
I don't see any reliable solutions until tarantool/tarantool#802 won't be implemented.

@Totktonada
Copy link
Member

Totktonada commented Jan 23, 2021

Ugh… I was sure that net.box sends schema_id in a request, performs retrying in case of E_WRONG_SCHEMA_VERSION error and so guarantees that a request is made against a schema that is not older than one that were actual at the moment when a net.box method was called. That is not so. I'll file an issue against net.box later (I briefly described the problem in Russian here). [Update 2021-12-23: The issue against tarantool is here.]

So calling <connection>:reload_schema() (it is just wrapper for :ping()) when a space does not exist is not enough. And, in any circumstances, we should not issue a ping request before any other request: it doubles latency.

However we still can send a request without the client side checking of a space presence and it'll be the correct solution. You can even use the local net.box's schema after receiving a response (to form a response metadata), because the response is returned only after the schema fetching (if it appears to be obsoleted). Okay, the schema is not guaranteed to correspond the same schema_id, but is at least not older that one against which the request was made. The only way to achieve the guarantee of the exactly same schema_id of a response and a local schema (and use current net.box implementation) is to send the metadata from the server (which is costly).

Sure, there are cases, when you need to use some schema information to produce a request, but I don't see such cases here.

@dokshina dokshina force-pushed the fix-using-outdated-space-schema branch from b3b067f to fddb05e Compare January 26, 2021 14:35
@dokshina dokshina marked this pull request as draft January 26, 2021 14:36
@dokshina dokshina force-pushed the fix-using-outdated-space-schema branch 7 times, most recently from 4bf1b1b to a54215a Compare January 29, 2021 10:36
@dokshina dokshina force-pushed the fix-using-outdated-space-schema branch 2 times, most recently from 19bf303 to a80f6c2 Compare February 1, 2021 09:12
@dokshina dokshina marked this pull request as ready for review February 1, 2021 09:12
return digest.md5(msgpack.encode(space_info))
end

function schema.wrap_box_space_func_result(add_space_schema_hash, space, func, ...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we need to describe why every function in this module are needed in comment. Otherwise it's your personal secret knowledge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what do you mean =(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently you don't understand me. But without good comments you will not understand you in several weeks/months. And nobody won't understand it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I didn't understand that you mean adding comments for functions, sorry, that's my fault.

Anyway, I've added comments.

crud/delete.lua Outdated
@@ -21,14 +22,55 @@ local function delete_on_storage(space_name, key)
return nil, DeleteError:new("Space %q doesn't exist", space_name)
end

local tuple = space:delete(key)
return tuple
return schema.wrap_box_space_func_result(false, space, space.delete, space, key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why "false"? Please describe for all places where you use it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

false? What?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What? Describe why did you pass "false" as the first argument here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it looks bad now (I mean passing this flag as a first argument).
I can add schema.wrap_box_space_func_result_no_schema_hash for such places, but it will require a comment anyway (to describe a logic "why it don't need space hash").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments to all calls.

I've also changed schema.wrap_box_space_func_result signature to avoid passing space, space:<func>, space.

@dokshina dokshina force-pushed the fix-using-outdated-space-schema branch from a80f6c2 to 9d96220 Compare February 1, 2021 13:59
local const = {}

const.RELOAD_RETRIES_NUM = 1
const.RELOAD_SCHEMA_TIMEOUT = 30 -- half of minute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with 1k rps you will create 30k fibers. It's too huge amount. I think you need to decrease to 3-5 seconds timeout.

if result == nil then
result = nil
if tuple == nil then
tuple = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_tuple -> tuple

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, indentation was wrong, but now it's OK.

@dokshina dokshina force-pushed the fix-using-outdated-space-schema branch 2 times, most recently from 46bbc14 to ffa237f Compare February 2, 2021 18:56

local ok, reload_schema_err = reload_schema(vshard.router.routeall())
if not ok then
log.debug("Failed to reload schema: %s", reload_schema_err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe verbose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest warn.

Copy link
Contributor

@olegrok olegrok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"In CRUD we trust" say them.

Currently it's not so.

@dokshina dokshina force-pushed the fix-using-outdated-space-schema branch from ffa237f to a257bd8 Compare February 3, 2021 14:50
@dokshina dokshina force-pushed the fix-using-outdated-space-schema branch 2 times, most recently from 310f820 to cf454cb Compare February 3, 2021 16:05
@dokshina dokshina merged commit 5fa62ae into master Feb 4, 2021
@dokshina dokshina deleted the fix-using-outdated-space-schema branch February 4, 2021 08:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[4pt] CRUD uses outdated schema from vshard connections
5 participants