Skip to content

Conversation

patterniha
Copy link
Collaborator

@patterniha patterniha commented Jul 7, 2025

when a portal-worker is draining, heartbeat-reader/writer is immediately closed, this causes some problems, let explain why:

suppose we only have one active portal-worker and it starts draining.

also support the only sub-connection(session) of this worker is heartbeat-sub-connection and it has no client-sub-connection.

after heartbeat-sub-connection is closed, the number of sub-connection(session) of the worker reaches 0,

worker check every 16 seconds if there is no session(sub-connection), It closes itself.

So there is a chance that the worker will be closed after heartbeat-sub-connection is closed and before the new worker arrives from bridge-side.

So during this time, until a new worker is created and arrives from bridge-side, our connection with bridge-side will be completely broken, so if new sub-connection is received from the client-side, we can't connect it to the bridge-side.

so we need to check that there is at least one other active worker before closing heartbeat-sub-connection.

///

also, apart from the fact that our connection to the bridge-side can be interrupted for a moment, there is another reason for this change:

during Iran-Israel 12 days war, some Iran-servers become "Iran-Access", means they only accept connection from Iran-IP, but if the connection is established before becoming "Iran-Access" from foreign-IP, the connection wouldn't disconnect after becoming "Iran-Access".

so Xray-core should not disconnect a connection(worker), before making sure there is at least one other active connection (worker).

///

also, when worker is not draining, heartbeat period increase to 10 seconds.

@Fangliding
Copy link
Member

Fangliding commented Jul 7, 2025

如果只是为了至少有一个活连接 正确的方法应该是把monitor()函数里的timer放入ClientWorker 然后在heartbeat()中发送drain消息前把timer重设为16秒 给客户端时间打开新的连接 而不是还丢个parentPicker进去

@patterniha
Copy link
Collaborator Author

patterniha commented Jul 7, 2025

If you just want to have at least one active connection, the correct way should be to put the timer in the monitor() function into the ClientWorker and then reset the timer to 16 seconds before sending the drain message in heartbeat() to give the client time to open a new connection instead of throwing a parentPicker in.

@Fangliding

But, this doesn't help the "Iran-Access" problem (read the second reason)

After becoming "Iran-Access", the new connection cannot be established from bridge-server to Iran-server.

So after 16 seconds the only active worker is disconnected and new connection cannot be established.

If this change had been made sooner, my Iran-server wouldn't have been disconnected from the foreign-bridge-server during the war

@Fangliding
Copy link
Member

Fangliding commented Jul 7, 2025

我不觉得核心应该为只会发生一次两次的Internet shutdown的特殊行为做准备 持有父对象的引用是不太好的行为 这么实现还额外绕开内部draining机制 难懂又难看 而且即便是这样这样tcp连接不见得可以活太久 以及mux本来就有的硬上限65535摆在那 发送65535个请求后mux就没有多余的流ID了 总的来说可用面很小 还不如在外面开个mux或者用grpc和xhttp3之类的当底层传输

@patterniha
Copy link
Collaborator Author

OK, I implemented your timer-reset idea instead.

@patterniha
Copy link
Collaborator Author

patterniha commented Jul 10, 2025

randomLength := dice.Roll(64)

another problem is that, If 0 is selected for randomLength here, the heartbeat-packet will not be sent at all.

and if the packet is drain-notify packet, bridge-server doesn't notice draining and finally, after filling up 65535-mux-limit, our connection is completely cut off.

so randomLength should be at least 1.

UPDATE:

it only affect keepAlive-hearbeat, not drain-notify-heartbeat

@Fangliding
Copy link
Member

那要不你试试drain数据包能不能发出去

@patterniha
Copy link
Collaborator Author

I tested, I always test after making a change.

set the randomLength to 0, and you see heartbeat-packet is not sent.

@Fangliding
Copy link
Member

你有试过这会导致drain数据包不会被发送导致超过256吗

@patterniha
Copy link
Collaborator Author

patterniha commented Jul 10, 2025

if msg.State = Control_DRAIN the packet is sent, so it doesn't affect drain-packet.

but heartbeat-packet has two roles drain-notify and keepAlive.

and it is still cause some keepAlive packets is not sent.(If there are several consecutive fail-keepAlive, the connection may be disconnected)

so randomLength still needs to be set to at least 1.

@Fangliding
Copy link
Member

我没有说不行 我只是说你臆想问题不是一回两回了。。

@RPRX RPRX merged commit b065595 into XTLS:main Jul 23, 2025
39 checks passed
@patterniha patterniha deleted the rev-fix branch July 23, 2025 15:28
maoxikun added a commit to maoxikun/Xray-core that referenced this pull request Aug 22, 2025
…e there is at least one other active worker (XTLS#4869)"

This reverts commit b065595.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants