I am experiencing a reasonably frequent deadlock between a thread initiating garbage collection and the CurlHandler worker thread which is part of the HttpClient implementation on Linux.
This is running in docker on CentOS 7 with base image microsoft/dotnet:2.2.1-runtime. Also occurs with the 2.0.0-runtime image version.
When the deadlock occurs the whole application freezes and requires killing and restarting.
When attaching to the process there are about 20 threads in the application and the majority are inside libcoreclr.so!Thread::RareDisablePreemptiveGC() which I assume (perhaps incorrectly?) means they are suspended for GC. However, I always see the same pattern in the following two stacks; the thread which initiated the GC and the CurlHandler thread. My comments inline.
Thread 1
TEB information is not available so a stack size of 0xFFFF is assumed
Current frame: libpthread.so.0!__pthread_cond_timedwait + 0x128
Child-SP RetAddr Caller, Callee
00007FFE21B6E440 00007fd33f8e75f5 libcoreclr.so!CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 0x125, calling libcoreclr.so!pthread_cond_timedwait
00007FFE21B6E4A0 00007fd33f8e7254 libcoreclr.so!CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 0x194, calling libcoreclr.so!CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*)
00007FFE21B6E500 00007fd33f8ec59c libcoreclr.so!SleepEx + 0x9c
00007FFE21B6E510 00007fd34108b0f5 ld-linux-x86-64.so.2 + 0x5f5, calling ld-linux-x86-64.so.2 + 0xffffffff
00007FFE21B6E550 00007fd33f5b1c5a libcoreclr.so!__SwitchToThread(unsigned int, unsigned int) + 0x2a, calling libcoreclr.so!SleepEx
// ReaderLock held by Thread 28 so cannot acquire WriterLock.
00007FFE21B6E560 00007fd33f4b04b9 libcoreclr.so!ExecutionManager::WriterLockHolder::WriterLockHolder() + 0x79, calling libcoreclr.so!__SwitchToThread(unsigned int, unsigned int)
00007FFE21B6E590 00007fd33f4b3b8e libcoreclr.so!ExecutionManager::DeleteRange(unsigned long) + 0x1e, calling libcoreclr.so!ExecutionManager::WriterLockHolder::WriterLockHolder()
00007FFE21B6E5C0 00007fd33f4b3ad9 libcoreclr.so!EEJitManager::CleanupCodeHeaps() + 0x99, calling libcoreclr.so!ExecutionManager::DeleteRange(unsigned long)
00007FFE21B6E5F0 00007fd33f5ac1c5 libcoreclr.so!GCToEEInterface::GcStartWork(int, int) + 0x25, calling libcoreclr.so!ExecutionManager::CleanupCodeHeaps()
00007FFE21B6E610 00007fd33f78969e libcoreclr.so!WKS::gc_heap::garbage_collect(int) + 0x4de, calling libcoreclr.so!GCToEEInterface::GcStartWork(int, int)
00007FFE21B6E620 00007fd33f8e2999 libcoreclr.so!ResetEvent + 0xb9
00007FFE21B6E650 00007fd33f926ece libcoreclr.so!EventPipeWriteEventGCTriggered + 0x2e, calling libcoreclr.so!EventPipeEvent::IsEnabled() const
00007FFE21B6E660 00007fd33f7786ad libcoreclr.so!WKS::gc_heap::reset_gc_done() + 0xad, calling libcoreclr.so!GCEvent::Reset()
00007FFE21B6E6A0 00007fd33f77d268 libcoreclr.so!WKS::GCHeap::GarbageCollectGeneration(unsigned int, gc_reason) + 0x2e8, calling libcoreclr.so!WKS::gc_heap::garbage_collect(int)
00007FFE21B6E700 00007fd33f77eceb libcoreclr.so!WKS::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, int) + 0x21b, calling libcoreclr.so!WKS::GCHeap::GarbageCollectGeneration(unsigned int, gc_reason)
Thread 28
TEB information is not available so a stack size of 0xFFFF is assumed
Current frame: libpthread.so.0!__pthread_cond_wait + 0xbf
Child-SP RetAddr Caller, Callee
00007FD32792E970 00007fd33f8e7612 libcoreclr.so!CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*) + 0x142, calling libcoreclr.so!pthread_cond_wait
00007FD32792E9D0 00007fd33f8e7254 libcoreclr.so!CorUnix::CPalSynchronizationManager::BlockThread(CorUnix::CPalThread*, unsigned int, bool, bool, CorUnix::ThreadWakeupReason*, unsigned int*) + 0x194, calling libcoreclr.so!CorUnix::CPalSynchronizationManager::ThreadNativeWait(CorUnix::_ThreadNativeWaitData*, unsigned int, CorUnix::ThreadWakeupReason*, unsigned int*)
00007FD32792EA30 00007fd33f8ec071 libcoreclr.so!CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int) + 0x751
00007FD32792EBD0 00007fd33f8ec292 libcoreclr.so!WaitForSingleObjectEx + 0x52, calling libcoreclr.so!CorUnix::InternalWaitForMultipleObjectsEx(CorUnix::CPalThread*, unsigned int, void* const*, int, unsigned int, int)
00007FD32792EC00 00007fd33f62c962 libcoreclr.so!CLREventBase::WaitEx(unsigned int, WaitMode, PendingSync*) + 0xe2, calling libcoreclr.so!WaitForSingleObjectEx
00007FD32792ECA0 00007fd33f631229 libcoreclr.so!Thread::RareDisablePreemptiveGC() + 0x1a9
00007FD32792ECF0 00007fd33f4da0a5 libcoreclr.so!HashMap::LookupValue(unsigned long, unsigned long) + 0xc5, calling libcoreclr.so!Thread::RareDisablePreemptiveGC()
00007FD32792ED70 00007fd33f56d9dc libcoreclr.so!ReadyToRunInfo::GetMethodDescForEntryPoint(unsigned long) + 0x2c, calling libcoreclr.so!HashMap::LookupValue(unsigned long, unsigned long)
00007FD32792ED80 00007fd33f4b70ba libcoreclr.so!ReadyToRunJitManager::JitCodeToMethodInfo(RangeSection*, unsigned long, MethodDesc**, EECodeInfo*) + 0xea, calling libcoreclr.so!ReadyToRunInfo::GetMethodDescForEntryPoint(unsigned long)
00007FD32792EDC8 00007fd2c90ede6f (MethodDesc 00007fd2c927c808 + 0x1f System.Net.SystemWebProxy.IsBypassed(System.Uri)), calling 00007fd2c90c3067
00007FD32792EDE0 00007fd33f4b4797 libcoreclr.so!ExecutionManager::IsManagedCodeWorker(unsigned long) + 0x137
00007FD32792EDE8 00007fd2c90ede6f (MethodDesc 00007fd2c927c808 + 0x1f System.Net.SystemWebProxy.IsBypassed(System.Uri)), calling 00007fd2c90c3067
00007FD32792EDF8 00007fd2c90ede6f (MethodDesc 00007fd2c927c808 + 0x1f System.Net.SystemWebProxy.IsBypassed(System.Uri)), calling 00007fd2c90c3067
// IsManagedCodeWithLock acquires the reader lock which prevents the GC thread acquiring the writer lock.
00007FD32792EE10 00007fd33f4b4600 libcoreclr.so!ExecutionManager::IsManagedCodeWithLock(unsigned long) + 0x80, calling libcoreclr.so!ExecutionManager::IsManagedCodeWorker(unsigned long)
00007FD32792EE28 00007fd2c90ede6f (MethodDesc 00007fd2c927c808 + 0x1f System.Net.SystemWebProxy.IsBypassed(System.Uri)), calling 00007fd2c90c3067
00007FD32792EE50 00007fd33f530bd8 libcoreclr.so!Thread::VirtualUnwindToFirstManagedCallFrame(_CONTEXT*) + 0x28, calling libcoreclr.so!ExecutionManager::IsManagedCode(unsigned long)
00007FD32792EE70 00007fd33f658875 libcoreclr.so!DispatchManagedException(PAL_SEHException&, bool) + 0xc5, calling libcoreclr.so!Thread::VirtualUnwindToFirstManagedCallFrame(_CONTEXT*)
00007FD32792EE90 00007fd340456ff3 libgcc_s.so.1 + 0xffffffff
00007FD32792EF70 00007fd2c90ede6f (MethodDesc 00007fd2c927c808 + 0x1f System.Net.SystemWebProxy.IsBypassed(System.Uri)), calling 00007fd2c90c3067
00007FD32792F270 00007fd33f5c8ea7 libcoreclr.so!IL_Throw(Object*) + 0x297, calling libcoreclr.so!RaiseTheExceptionInternalOnly(Object*, int, int)
00007FD32792F360 00007fd33f5c8f2a libcoreclr.so!IL_Throw(Object*) + 0x31a, calling libcoreclr.so!DispatchManagedException(PAL_SEHException&, bool)
00007FD32792F370 00007fd2c72554f1 (MethodDesc 00007fd2c69dae70 + 0x71 System.Resources.ResourceManager.GetString(System.String, System.Globalization.CultureInfo))
00007FD32792F490 00007fd2c7253135 (MethodDesc 00007fd2c6484378 + 0x25 System.SR.GetResourceString(System.String, System.String)), calling (MethodDesc 00007fd2c6484388 + 0 System.SR.InternalGetResourceString(System.String))
00007FD32792F4A0 00007fd33f5c8c6a libcoreclr.so!IL_Throw(Object*) + 0x5a, calling libcoreclr.so!LazyMachStateCaptureState
00007FD32792F4E0 00007fd2c90ede6f (MethodDesc 00007fd2c927c808 + 0x1f System.Net.SystemWebProxy.IsBypassed(System.Uri)), calling 00007fd2c90c3067
00007FD32792F4F0 00007fd2c777545a (MethodDesc 00007fd2c789a9c0 + 0x15a System.Net.Http.CurlHandler+EasyRequest.SetProxyOptions(System.Uri))
00007FD32792F560 00007fd2c7773dba (MethodDesc 00007fd2c789a8b0 + 0x17a System.Net.Http.CurlHandler+EasyRequest.InitializeCurl()), calling (MethodDesc 00007fd2c789a9c0 + 0 System.Net.Http.CurlHandler+EasyRequest.SetProxyOptions(System.Uri))
00007FD32792F5D0 00007fd2c77781af (MethodDesc 00007fd2c789a488 + 0x14f System.Net.Http.CurlHandler+MultiAgent.ActivateNewRequest(EasyRequest)), calling (MethodDesc 00007fd2c789a8b0 + 0 System.Net.Http.CurlHandler+EasyRequest.InitializeCurl())
00007FD32792F5E0 00007fd33f7a43dc libcoreclr.so!HndLogSetEvent(OBJECTHANDLE__*, Object*) + 0x4c, calling libcoreclr.so!EventPipe::Enabled()
00007FD32792F6B0 00007fd2c7777993 (MethodDesc 00007fd2c789a438 + 0x173 System.Net.Http.CurlHandler+MultiAgent.HandleIncomingRequests()), calling (MethodDesc 00007fd2c789a488 + 0 System.Net.Http.CurlHandler+MultiAgent.ActivateNewRequest(EasyRequest))
00007FD32792F720 00007fd2c77776ab (MethodDesc 00007fd2c789a428 + 0x4b System.Net.Http.CurlHandler+MultiAgent.WorkerBodyLoop()), calling (MethodDesc 00007fd2c789a438 + 0 System.Net.Http.CurlHandler+MultiAgent.HandleIncomingRequests())
00007FD32792F770 00007fd2c7777413 (MethodDesc 00007fd2c789a418 + 0x73 System.Net.Http.CurlHandler+MultiAgent.WorkerBody()), calling (MethodDesc 00007fd2c789a428 + 0 System.Net.Http.CurlHandler+MultiAgent.WorkerBodyLoop())
00007FD32792F7D0 00007fd2c777a229 (MethodDesc 00007fd2c91c95c0 + 0x19 System.Net.Http.CurlHandler+MultiAgent+<>c.<EnsureWorkerIsRunning>b__20_0(System.Object)), calling (MethodDesc 00007fd2c789a418 + 0 System.Net.Http.CurlHandler+MultiAgent.WorkerBody())
00007FD32792F7E0 00007fd2c868ff31 (MethodDesc 00007fd2c69df590 + 0x71 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object))
00007FD32792F7F0 00007fd2c940c247 (MethodDesc 00007fd2c64ab9b0 + 0x47 System.Threading.Tasks.Task.FinishSlow(Boolean)), calling (MethodDesc 00007fd2c64ab9c0 + 0 System.Threading.Tasks.Task.FinishStageTwo())
00007FD32792F830 00007fd2c8697479 (MethodDesc 00007fd2c64aba80 + 0x199 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)), calling (MethodDesc 00007fd2c69df590 + 0 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object))
00007FD32792F8D0 00007fd2c868ff31 (MethodDesc 00007fd2c69df590 + 0x71 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object))
00007FD32792F8E0 00007fd33f528881 libcoreclr.so!MetaSig::SkipArg() + 0x31, calling libcoreclr.so!SigParser::SkipExactlyOne()
00007FD32792F920 00007fd33f664f07 libcoreclr.so!CallDescrWorkerInternal + 0x7c
00007FD32792F940 00007fd33f575ce0 libcoreclr.so!MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) + 0x4f0, calling libcoreclr.so!CallDescrWorkerInternal
00007FD32792FA30 00007fd33f52f1b8 libcoreclr.so!MetaSig::GetReturnTypeNormalized(TypeHandle*) const + 0x58, calling libcoreclr.so!SigPointer::PeekElemTypeClosed(Module*, SigTypeContext const*) const
00007FD32792FB30 00007fd33f580f1e libcoreclr.so!ThreadNative::KickOffThread_Worker(void*) + 0x22e, calling libcoreclr.so!MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int)
00007FD32792FCD0 00007fd33f545bd2 libcoreclr.so!ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) + 0x192
00007FD32792FCE0 00007fd33f4bc264 libcoreclr.so!CExecutionEngine::GetTlsData() + 0x14, calling libcoreclr.so!__tls_get_addr
00007FD32792FCF0 00007fd33f4be699 libcoreclr.so!CrstBase::Leave() + 0x29
00007FD32792FD80 00007fd33f53c99d libcoreclr.so!Thread::HasStarted(int) + 0x3ed, calling libcoreclr.so!EEToProfInterfaceImpl::ThreadAssignedToOSThread(unsigned long, unsigned int)
00007FD32792FDF0 00007fd33f546356 libcoreclr.so!ManagedThreadBase::KickOff(ADID, void (*)(void*), void*) + 0x46, calling libcoreclr.so!ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)
00007FD32792FE40 00007fd33f58114a libcoreclr.so!ThreadNative::KickOffThread(void*) + 0x15a, calling libcoreclr.so!ManagedThreadBase::KickOff(ADID, void (*)(void*), void*)
00007FD32792FEF0 00007fd33f8f2342 libcoreclr.so!CorUnix::CPalThread::ThreadEntry(void*) + 0x132
00007FD32792FFB0 00007fd34018462d libc.so.6!clone + 0x6d
Having looked at the stacks and the coreclr code I have come to the conclusion that the CurlHandler thread has acquired a read lock here inside libcoreclr.so!ExecutionManager::IsManagedCodeWithLock. Then GC is initiated in another thread which causes the CurlHandler thread to be suspended while it holds the read lock. The GC thread then proceeds to attempt to acquire the write lock on the same resource and we have a deadlock.
This is causing quite a headache in our production application so would appreciate some guidance as to whether this is a genuine bug or if there is something I should be doing to avoid this scenario.
Thanks.
I am experiencing a reasonably frequent deadlock between a thread initiating garbage collection and the CurlHandler worker thread which is part of the HttpClient implementation on Linux.
This is running in docker on CentOS 7 with base image microsoft/dotnet:2.2.1-runtime. Also occurs with the 2.0.0-runtime image version.
When the deadlock occurs the whole application freezes and requires killing and restarting.
When attaching to the process there are about 20 threads in the application and the majority are inside
libcoreclr.so!Thread::RareDisablePreemptiveGC()which I assume (perhaps incorrectly?) means they are suspended for GC. However, I always see the same pattern in the following two stacks; the thread which initiated the GC and the CurlHandler thread. My comments inline.Having looked at the stacks and the coreclr code I have come to the conclusion that the CurlHandler thread has acquired a read lock here inside
libcoreclr.so!ExecutionManager::IsManagedCodeWithLock. Then GC is initiated in another thread which causes the CurlHandler thread to be suspended while it holds the read lock. The GC thread then proceeds to attempt to acquire the write lock on the same resource and we have a deadlock.This is causing quite a headache in our production application so would appreciate some guidance as to whether this is a genuine bug or if there is something I should be doing to avoid this scenario.
Thanks.