To totally speculate: let's say RIFT is served by blade servers in a hosting center. TRION's blades talk to network switches in the blade chassis, which talk to a higher-level switch for the whole cage/cluster/installation, which talks to the one line they have to the router to the internet.
That means if there is a burst of traffic from the servers, those switches need enough buffer space to hold all the packets until they can drain into the router at the rate of that one line. If there isn't enough space, packets get dropped. Every time a packet gets dropped, TCP has to notice the loss: either the client's duplicate ACKs tell the server something is missing, or the server's retransmission timer runs out, which can take anywhere from a fraction of a second to a few seconds depending on the implementation and the measured round-trip time. The server then retransmits (and cuts its sending rate for that particular connection). If the retransmission gets lost too, the timer doubles before the next try. A storm of such retransmissions can further overload the network, which can get rather ugly. Ugly, like, say, 15 seconds of traffic from server to client waiting and then all being delivered at the same time.
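To put rough numbers on how a multi-second stall could build up, here is a minimal sketch of a retransmission timer that doubles after every lost retry. The 1 second starting timeout and the 60 second cap are assumptions for illustration (real stacks derive the timeout from measured round-trip time and differ in their minimums), and `cumulative_stall` is a made-up helper, not anything from a real TCP stack.

```python
def cumulative_stall(initial_rto_s: float, consecutive_losses: int,
                     max_rto_s: float = 60.0) -> float:
    """Total seconds spent waiting if the same segment is lost
    `consecutive_losses` times in a row, with the timer doubling each time."""
    total = 0.0
    rto = initial_rto_s  # assumed starting timeout, not a measured value
    for _ in range(consecutive_losses):
        total += rto                    # wait out the current timer
        rto = min(rto * 2, max_rto_s)   # exponential backoff, capped
    return total


if __name__ == "__main__":
    for losses in range(1, 5):
        print(losses, "losses in a row ->", cumulative_stall(1.0, losses), "s stalled")
```

Under those assumptions, four timeouts in a row on the same segment add up to 1 + 2 + 4 + 8 = 15 seconds, which is at least in the same ballpark as the freeze described above. That is an illustration of the mechanism, not a diagnosis.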
If there's enough network bandwidth on average for what the servers are trying to do, the fix is putting a deep-buffered switch between the blade enclosures and the uplink to the router to the internet.
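A toy model of why buffer depth matters when the average load fits the uplink but the bursts don't. Everything here is made up for illustration: the tick-based model, the `simulate_queue` helper, and all the packet counts. It only shows the shape of the problem, not anything about TRION's actual setup.

```python
def simulate_queue(arrivals, drain_per_tick, buffer_packets):
    """Tail-drop queue: each tick, accept what fits, drop the rest,
    then drain up to `drain_per_tick` packets to the uplink.
    Returns (packets dropped, peak queue depth)."""
    queued = 0
    dropped = 0
    peak = 0
    for arriving in arrivals:
        space = buffer_packets - queued
        accepted = min(arriving, space)
        dropped += arriving - accepted   # no room left: packet is lost
        queued += accepted
        peak = max(peak, queued)
        queued -= min(queued, drain_per_tick)
    return dropped, peak


if __name__ == "__main__":
    # Hypothetical burst: 50 packets/tick for 4 ticks, then quiet for 20 ticks.
    # Average is about 8.3 packets/tick, below the uplink's 10 packets/tick.
    burst = [50] * 4 + [0] * 20
    for depth in (40, 200):
        dropped, peak = simulate_queue(burst, drain_per_tick=10, buffer_packets=depth)
        print(f"buffer={depth}: dropped={dropped}, peak queue={peak}")
```

With these invented numbers, the 40-packet buffer drops 130 of the 200 packets, while the 200-packet buffer drops none and simply delivers the burst a bit late. The uplink is the same in both runs; only the buffer depth changes.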
If there isn't enough network bandwidth, the fix is either tuning the application to send less data, or writing a check for a bigger internet connection.
But again, this is all just speculation...

