Don't be overly aggressive on stream failures and closing #6525
zwoop merged 1 commit into apache:master
@masaori335 Now that we're looking at this code, why is there a 2x on the threshold?
```diff
 Warning("HTTP/2 session error client_ip=%s session_id=%" PRId64
-        " closing a connection, because its stream error rate (%f) is too high",
-        client_ip, connection_id(), this->connection_state.get_stream_error_rate());
+        " closing a connection, because its stream error rate (%f) exceeded the threshold (%f)",
```
This was only done to make the Warning() consistent with the other two places we do this check.
Adding the configured threshold value is fine, but removing "too high" would make it hard to tell whether the close was graceful or immediate.
```diff
 int total = get_stream_requests();
-if (total > 0) {
+if (total >= (1 / Http2::stream_error_rate_threshold)) {
```
The point of this is to require a minimum number of samples before we can calculate a reasonably trustworthy failure rate. The smaller the threshold, the more samples are needed. It's probably not statistically sound, but it at least avoids the problem where, with the default configs, an error on any one of the first 10 streams is enough to close the connection.
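A minimal sketch of the gated rate calculation, modelled on the diff above. The function names and parameters are hypothetical, not the Traffic Server API. It assumes the rate falls back to 0 when the gate is not met (as it presumably did under the old `total > 0` check), that the caller compares with a strict `>`, and that the threshold lies in (0, 1].

```cpp
#include <cassert>

// Hypothetical stand-in for the gated stream error rate; not the Traffic
// Server API. Assumes 0 < threshold <= 1.
double gated_stream_error_rate(int stream_errors, int total_requests, double threshold)
{
  // Only trust the rate once there are at least 1/threshold requests:
  // 10 requests for a threshold of 0.1, 20 for 0.05, and so on.
  if (total_requests >= 1 / threshold) {
    return static_cast<double>(stream_errors) / total_requests;
  }
  // Too few samples: report 0 so the caller never closes on this basis.
  return 0.0;
}

// The caller's check, assumed here to be a strict comparison.
bool should_close(int stream_errors, int total_requests, double threshold)
{
  return gated_stream_error_rate(stream_errors, total_requests, threshold) > threshold;
}
```

With a threshold of 0.1, `should_close(1, 1, 0.1)` is now false. Under the old `total > 0` check, that single failed first stream would have counted as a 100% error rate and closed the connection.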
This fixes #5195
It could be 1.5, 3, or whatever, but 5 is probably too big. If the stream error rate exceeds the threshold, the connection is closed gracefully. If it exceeds 2x the threshold, the connection is closed immediately. This is why the two error messages are slightly different, and the one for 2x says "too high". I'm OK with this change, but the reason I didn't check the number of requests was that I thought we wanted to close stupid clients that cause a stream error on the very first request.
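The two-tier rule in this comment can be sketched as a small decision function. The enum and function names are hypothetical. Only the rule itself (above the threshold: graceful close; above 2x the threshold: immediate close) comes from the comment. The sketch does not try to model how Traffic Server actually carries out either kind of close.

```cpp
#include <cassert>

// Hypothetical names for illustration only; not Traffic Server identifiers.
enum class CloseAction { None, Graceful, Immediate };

CloseAction close_action(double error_rate, double threshold)
{
  // Check the larger bound first so a rate above 2x never falls into the
  // graceful branch. This is the case whose warning says "too high".
  if (error_rate > 2.0 * threshold) {
    return CloseAction::Immediate;
  }
  // Above the configured threshold, but not by 2x: close gracefully.
  if (error_rate > threshold) {
    return CloseAction::Graceful;
  }
  return CloseAction::None;
}
```

Keeping a distinct outcome per tier is also why a distinct warning text per tier is useful. It lets someone reading the logs tell a graceful close from an immediate one.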
maskit left a comment
Personally, I'm fine with this change, and I hope others are as well. When the original code was merged, nobody raised these questions (or nobody read the code).
Cherry-picked to 8.1.x
Randall found this in our production environment. I think 1/threshold is a good limit here, so that you can at least calculate a reasonable percentage from a large enough sample.
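A short check of why 1/threshold is a natural minimum. It assumes the close condition is the strict comparison rate > threshold, with $n$ requests sampled and threshold $t$:

$$
n \ge \frac{1}{t} \quad\Longrightarrow\quad \frac{1}{n} \le t
$$

Once the gate is met, a single failed stream cannot push the rate strictly above $t$ on its own. Closing requires $k > nt \ge 1$ failures, so at least two (for example, 2 of 10 at $t = 0.1$). With $t = 0.1$ the gate is $n \ge 10$, which matches the "first 10 streams" case described earlier in this thread.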