Skip to content

Adjust connection timeout for TLS#4903

Merged
shinrich merged 1 commit intoapache:masterfrom
shinrich:fix-tls-connect-timeout
Feb 4, 2019
Merged

Adjust connection timeout for TLS#4903
shinrich merged 1 commit intoapache:masterfrom
shinrich:fix-tls-connect-timeout

Conversation

@shinrich
Copy link
Member

Found the following crash.

[ 0 ] traffic_server      write_to_net_io                    ( UnixNetVConnection.cc:445 )
[ 1 ] traffic_server      NetHandler::waitForActivity(long)  ( UnixNet.cc:514 )
[ 2 ] traffic_server      EThread::execute_regular()         ( UnixEThread.cc:277 )
[ 3 ] traffic_server      spawn_thread_internal              ( Thread.cc:85 )
[ 4 ] libpthread-2.17.so  start_thread                       

Line numbers on our branch are off from open source master, but top of stack from write_to_net_io is in code below

 if (!vc->getSSLHandShakeComplete()) {
    if (vc->trackFirstHandshake()) {
      // Send the write ready on up to the state machine
      write_signal_and_update(VC_EVENT_WRITE_READY, vc);
      vc->write.triggered = 0;
      nh->write_ready_list.remove(vc);
    }

    int err, ret;

    if (vc->get_context() == NET_VCONNECTION_OUT) {
      ret = vc->sslStartHandShake(SSL_EVENT_CLIENT, err);
    } else {
      ret = vc->sslStartHandShake(SSL_EVENT_SERVER, err);
    }

Specifically the crash is at the "ret = vc->sslStartHandShake(SSL_EVENT_CLIENT, err);" and looking in the core shows that "vc" has been freed (vtable pointer is bogus).

I think the issue is how we are handling notifying the state machine that the socket is in a write-ready state (SYN exchange has completed) with the call to write_signal_and_update. We assume that the vc is in a good state after this call, but it is quite possible that the HttpSM has determined it is in an error state and closed the vc.

I think we should reconsider how the connect timeout should apply to the TLS connection. Rather than just covering the SYN exchange, I would argue that it should cover the entire TLS handshake. If the TLS handshake stalls out, that should be covered by the connection timeout not the data no-activity timeout. Making that change would remove the write_signal_and_update here. Instead we don't notify the state machine until the read_complete signal sent at the end of the TLS handshake.

This PR makes that change.

We haven't see this crash very often, but the original TTFB timeout fix was backed out of the 8.0.x branch I assume due to instability issues.

@shinrich shinrich added this to the 9.0.0 milestone Jan 31, 2019
@shinrich shinrich self-assigned this Jan 31, 2019
@shinrich shinrich merged commit 33818ba into apache:master Feb 4, 2019
@zwoop
Copy link
Contributor

zwoop commented Feb 11, 2019

@shinrich So with this fix, we don't need to backout the TTFB PR from 8.0.x? If so, we should put that back in for 8.1.x as well IMO.

@bryancall
Copy link
Contributor

This is a fix for #4028

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants