Description
I set sessionTimeoutMs to 1 day (86400000 ms), but the value that actually takes effect is 500654 ms.
Testing Details
server conf:
tickTime=2000
initLimit=10
syncLimit=5
minSessionTimeout=7200000
maxSessionTimeout=86400000
curator client conf:
CuratorFrameworkFactory.builder()
.connectString(zkQuorum)
.sessionTimeoutMs(86400000)
.connectionTimeoutMs(15000)
.simulatedSessionExpirationPercent(100)
.retryPolicy(new ExponentialBackoffRetry(5000, 24))
.namespace("xxx")
.aclProvider(aclProvider);
There are 3 ZooKeeper servers. Kill 2 of them to simulate a long-term ZooKeeper outage.
The Curator client enters the SUSPENDED state once the leader is unavailable and is expected to enter the LOST state after 1 day, but it actually enters LOST after about 8 minutes.
Related logs:
2025-02-21 18:55:12,181 [main-EventThread] DEBUG org.apache.flink.shaded.curator5.org.apache.curator.ConnectionState - Negotiated session timeout: 86400000
2025-02-21 19:03:33,443 [Curator-ConnectionStateManager-0] WARN org.apache.flink.shaded.curator5.org.apache.curator.framework.state.ConnectionStateManager - Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 500654. Adjusted session timeout ms: 500654
Root cause
The multiplication (useSessionTimeoutMs * sessionExpirationPercent) overflows int, so the adjusted session timeout ends up far smaller than the negotiated one and the CuratorZookeeperClient is reset unexpectedly:
Line 284 in f0646f9
int useSessionTimeoutMs = getUseSessionTimeoutMs();
Lines 320 to 328 in f0646f9
private int getUseSessionTimeoutMs() {
    int lastNegotiatedSessionTimeoutMs = client.getZookeeperClient().getLastNegotiatedSessionTimeoutMs();
    int useSessionTimeoutMs =
            (lastNegotiatedSessionTimeoutMs > 0) ? lastNegotiatedSessionTimeoutMs : sessionTimeoutMs;
    useSessionTimeoutMs = sessionExpirationPercent > 0 && startOfSuspendedEpoch != 0
            ? (useSessionTimeoutMs * sessionExpirationPercent) / 100
            : useSessionTimeoutMs;
    return useSessionTimeoutMs;
}
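To make the arithmetic concrete, here is a minimal standalone sketch that reproduces the wrapped value from the log and shows one possible way to avoid it. The class and method names are hypothetical and are not part of Curator. The long-based variant is only a suggestion, not the project's actual fix, and it assumes the percent is small enough that the final result fits back into an int (true for percent <= 100).

```java
public class SessionTimeoutOverflowDemo {

    // Mirrors the int arithmetic in getUseSessionTimeoutMs():
    // the product wraps around 2^32 before the division by 100 happens.
    static int adjustedWithIntMath(int sessionTimeoutMs, int percent) {
        return (sessionTimeoutMs * percent) / 100;
    }

    // Hypothetical alternative: widen to long before multiplying so the
    // intermediate product cannot wrap. Math.toIntExact throws
    // ArithmeticException if the final result does not fit in an int.
    static int adjustedWithLongMath(int sessionTimeoutMs, int percent) {
        return Math.toIntExact(((long) sessionTimeoutMs * percent) / 100);
    }

    public static void main(String[] args) {
        int negotiatedMs = 86_400_000; // 1 day, as negotiated in the log
        int percent = 100;             // simulatedSessionExpirationPercent(100)

        // 86400000 * 100 = 8640000000, which wraps to 50065408 as an int;
        // 50065408 / 100 = 500654, the "Adjusted session timeout ms" in the log.
        System.out.println(adjustedWithIntMath(negotiatedMs, percent));  // 500654

        System.out.println(adjustedWithLongMath(negotiatedMs, percent)); // 86400000
    }
}
```

With these inputs, any negotiated timeout above Integer.MAX_VALUE / 100 (21474836 ms, roughly 6 hours) overflows when the percent is 100, which is why the problem only shows up with very long session timeouts.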