Foxmail无法响应

Foxmail是一款不错的邮件客户端软件,小巧实用,我在公司就是用的它。早上一如既往的先打开Foxmail,然后去倒杯水,回来发现Foxmail还在收取邮件的状态,“这也太慢了”,我心想。用鼠标点了点,出现了程序挂起的特征,标题栏上显示无法响应,程序界面变白。经验告诉我挂起的问题70%都还是比较容易调的,好吧,就让我看看这是怎么回事。

打开Windbg,Attach到Foxmail上,习惯做的第一件事情就是保存Full dump

1
.dump /ma /u e:\foxmail.dmp

由于程序的主界面出现了挂起的现象,而一般情况下主线程就是程序的界面线程,所以此时根本没必要去查看所有线程的情况,直接看看主线程的栈信息吧。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
0:000> k
ChildEBP RetAddr
0018f454 770fd993 ntdll!ZwWaitForSingleObject+0x15
0018f4b8 770fd877 ntdll!RtlpWaitOnCriticalSection+0x13e
0018f4e0 770f84ca ntdll!RtlEnterCriticalSection+0x150
0018f514 76734e8c ntdll!RtlLockHeap+0x3d
0018f558 004091af KERNELBASE!GlobalAlloc+0x92
WARNING: Stack unwind information not available. Following frames may be wrong.
0018f578 0042bf97 Foxmail+0x91af
0018f588 0042c115 Foxmail+0x2bf97
0018f5a4 007c5f25 Foxmail+0x2c115
0018f9fc 007c50ce Foxmail+0x3c5f25
0018fa34 007c516a Foxmail+0x3c50ce
0018fa44 007c9e57 Foxmail+0x3c516a
0018faac 007ca28b Foxmail+0x3c9e57
0018fae0 007c9370 Foxmail+0x3ca28b
0018fb10 007ca3fb Foxmail+0x3c9370
0018fb1c 00dd8a96 Foxmail+0x3ca3fb
0018fb44 007c93bb Foxmail+0x9d8a96
0018fb64 007c9479 Foxmail+0x3c93bb
0018fba8 00dd0492 Foxmail+0x3c9479
0018fcd4 00dd019d Foxmail+0x9d0492
0018fd00 00dd454a Foxmail+0x9d019d
0018fd44 00dcf20d Foxmail+0x9d454a
0018fd60 007ca59d Foxmail+0x9cf20d
0018fd84 0044e9d6 Foxmail+0x3ca59d
0018fd94 0044e844 Foxmail+0x4e9d6
0018fdbc 00434a96 Foxmail+0x4e844
0018fdd4 74aa62fa Foxmail+0x34a96
0018fe00 74aa6d3a user32!InternalCallWinProc+0x23
0018fe78 74aa77c4 user32!UserCallWinProcCheckWow+0x109
0018fed8 74aa7bca user32!DispatchMessageWorker+0x3bc
0018fee8 00dc5e7a user32!DispatchMessageA+0xf
0018ff04 00dc5ee4 Foxmail+0x9c5e7a
0018ff2c 00dc61bf Foxmail+0x9c5ee4
0018ff5c 00ded3a0 Foxmail+0x9c61bf
0018ff88 75b9336a Foxmail+0x9ed3a0
0018ff94 770f9882 kernel32!BaseThreadInitThunk+0xe
0018ffd4 770f9855 ntdll!__RtlUserThreadStart+0x70
0018ffec 00000000 ntdll!_RtlUserThreadStart+0x1b

可以看到界面线程调用GlobalAlloc的时候在等HeapLock被释放。

1
2
3
4
5
6
7
0:000> kb L5
ChildEBP RetAddr Args to Child
0018f454 770fd993 00000698 00000000 00000000 ntdll!ZwWaitForSingleObject+0x15
0018f4b8 770fd877 00000000 00000000 00000000 ntdll!RtlpWaitOnCriticalSection+0x13e
0018f4e0 770f84ca 014d0138 0044e9b8 0018f4ac ntdll!RtlEnterCriticalSection+0x150
0018f514 76734e8c 014d0000 4d5402d1 00000000 ntdll!RtlLockHeap+0x3d
0018f558 004091af 00000002 00002000 0042c07c KERNELBASE!GlobalAlloc+0x92

那么014d0138比如是默认堆的Critical Section了。来看看这个cs的数据

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
0:000> !cs 014d0138 
*************************************************************************
*** ***
*** ***
*** Either you specified an unqualified symbol, or your debugger ***
*** doesn't have full symbol information. Unqualified symbol ***
*** resolution is turned off by default. Please either specify a ***
*** fully qualified symbol module!symbolname, or enable resolution ***
*** of unqualified symbols by typing ".symopt- 100". Note that ***
*** enabling unqualified symbol resolution with network symbol ***
*** server shares in the symbol path may cause the debugger to ***
*** appear to hang for long periods of time when an incorrect ***
*** symbol name is typed or the network symbol server is down. ***
*** ***
*** For some commands to work properly, your symbol path ***
*** must point to .pdb files that have full type information. ***
*** ***
*** Certain .pdb files (such as the public OS symbols) do not ***
*** contain the required information. Contact the group that ***
*** provided you with these symbols if you need this command to ***
*** work. ***
*** ***
*** Type referenced: ntdll!_RTL_CRITICAL_SECTION ***
*** ***
*************************************************************************
Bad symbols for NTDLL (error 3). Aborting.

嗯,ntdll的符号文件的结构体信息没有了!这个问题发生在2015年7月份的,安装KB3071756和KB3060716补丁后产生的。详情可以查看
http://www.osronline.com/showthread.cfm?link=269221

既然新的符号不让用,那就只有用老的了

1
2
3
4
5
6
7
8
9
10
11
12
13
0:000> .reload /f /i E:\WorkSpace\MySymbols\ntdll.dll\49900AFA96000\ntdll.dll=77ffe000   
0:000> lm m ntdll*
start end module name
770c0000 77240000 ntdll (pdb symbols) e:\workspace\mysymbols\wntdll.pdb\FA9C48F9C11D4E0894B8970DECD92C972\wntdll.pdb
77ffe000 78094000 ntdll_77ffe000 (pdb symbols) e:\workspace\mysymbols\ntdll.pdb\6992F4DAF4B144068D78669D6CB5D2072\ntdll.pdb
0:000> dt ntdll_77ffe000!_RTL_CRITICAL_SECTION 014d0138
+0x000 DebugInfo : 0x771c4960 _RTL_CRITICAL_SECTION_DEBUG
+0x004 LockCount : 0n-30
+0x008 RecursionCount : 0n1
+0x00c OwningThread : 0x00001730 Void
+0x010 LockSemaphore : 0x00000698 Void
+0x014 SpinCount : 0xfa0

好了,这样就够用了。可以看到TID=1730的线程正在占用这个cs,马上去看看这个线程在干什么。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

0:041> k
ChildEBP RetAddr
06e1f214 770fd993 ntdll!ZwWaitForSingleObject+0x15
06e1f278 770fd877 ntdll!RtlpWaitOnCriticalSection+0x13e
06e1f2a0 770f84ca ntdll!RtlEnterCriticalSection+0x150
06e1f2d4 7717fd7d ntdll!RtlLockHeap+0x3d
06e1f3bc 7714de8e ntdll!RtlpQueryExtendedHeapInformation+0xbd
06e1f3fc 7716476b ntdll!RtlQueryHeapInformation+0x4a
06e1f4a0 77143320 ntdll!RtlQueryProcessHeapInformation+0x288
06e1f51c 75c15f4b ntdll!RtlQueryProcessDebugInformation+0x28a
Unable to load image D:\Program Files\Foxmail 7.2\libeay32.dll, Win32 error 0n2
*** ERROR: Symbol file could not be found. Defaulted to export symbols for libeay32.dll -
06e1f54c 0606ad6a kernel32!Heap32Next+0x4d
WARNING: Stack unwind information not available. Following frames may be wrong.
06e1fab0 06069c77 libeay32!RAND_poll+0x5fa
06e1facc 06023db7 libeay32!RAND_SSLeay+0x447
06e1fb00 06069f71 libeay32!CRYPTO_set_ex_data_implementation+0x387
00000000 00000000 libeay32!RAND_SSLeay+0x741

从栈的信息看来,这个线程是OpenSSL的一个线程,正在做随机数处理,而枚举Heap的信息应该也是随机数的一个组成部分。在枚举Heap的时候也处于等待一个HeapLock的情况,来具体看看

1
2
3
4
5
6
7
0:041> kb L5
ChildEBP RetAddr Args to Child
06e1f214 770fd993 00000b10 00000000 00000000 ntdll!ZwWaitForSingleObject+0x15
06e1f278 770fd877 00000000 00000000 06e1f428 ntdll!RtlpWaitOnCriticalSection+0x13e
06e1f2a0 770f84ca 04560138 00000000 00000000 ntdll!RtlEnterCriticalSection+0x150
06e1f2d4 7717fd7d 04560000 71edefe5 06e1f468 ntdll!RtlLockHeap+0x3d
06e1f3bc 7714de8e 06e1f428 771640d3 00000000 ntdll!RtlpQueryExtendedHeapInformation+0xbd

看来正在等一个基地址是04560000的Heap的cs。顺藤摸瓜看看这个cs又被谁占用了。

1
2
3
4
5
6
7
0:041> dt ntdll_77ffe000!_RTL_CRITICAL_SECTION 04560138 
+0x000 DebugInfo : 0x0151df40 _RTL_CRITICAL_SECTION_DEBUG
+0x004 LockCount : 0n-6
+0x008 RecursionCount : 0n1
+0x00c OwningThread : 0x00001994 Void
+0x010 LockSemaphore : 0x00000b10 Void
+0x014 SpinCount : 0xfa0

继续看看0x00001994这个线程在做什么事情

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
0:002> kb
ChildEBP RetAddr Args to Child
0438d2c4 770fd993 00000698 00000000 00000000 ntdll!ZwWaitForSingleObject+0x15
0438d328 770fd877 00000000 00000000 014d0000 ntdll!RtlpWaitOnCriticalSection+0x13e
0438d350 770fdaf8 014d0138 7334c875 00078000 ntdll!RtlEnterCriticalSection+0x150
0438d42c 770f2fe3 000001f8 00000200 00000000 ntdll!RtlpAllocateHeap+0x159
0438d4b0 770f7bfb 014d0000 00800000 000001f8 ntdll!RtlAllocateHeap+0x23a
0438d4fc 770f70f1 00000028 7334c9d5 07240048 ntdll!RtlpAllocateUserBlock+0xae
0438d58c 770ee0e2 07240048 07240048 00000000 ntdll!RtlpLowFragHeapAllocFromContext+0x802
0438d600 770f8129 014d0000 00000000 00000020 ntdll!RtlAllocateHeap+0x206
0438d610 770f3f21 04560000 07240048 00000000 ntdll!RtlpAllocateDebugInfo+0x28
0438d64c 770f4262 07240048 00000000 00000000 ntdll!RtlInitializeCriticalSectionEx+0x93
0438d660 771031e2 07240048 07240048 04560000 ntdll!RtlInitializeCriticalSection+0x12
0438d674 771031b0 00000000 00000800 04560000 ntdll!RtlpInitializeLowFragHeap+0x28
0438d684 77102f5a 7334cae5 045600cc 04560000 ntdll!RtlpCreateLowFragHeap+0x28
0438d6bc 77103002 04560000 04560194 0438d7a8 ntdll!RtlpActivateLowFragmentationHeap+0xc9
0438d6cc 77102fce 04560000 7334cbf1 00000000 ntdll!RtlpPerformHeapMaintenance+0x2a
0438d7a8 770f2fe3 00000008 00000010 04560194 ntdll!RtlpAllocateHeap+0x172
0438d82c 6fef1e62 04560000 00000008 00000008 ntdll!RtlAllocateHeap+0x23a
0438d844 6fef20f9 0438d858 04560984 00000007 nlaapi!NlapAllocNlaDataSet+0x18
0438d85c 6fef2096 00000007 00000003 045608f0 nlaapi!QueryCtxtDeserializeTuplesToTrieMap+0x20
0438d888 6fef23a7 045608f0 000003c8 04560958 nlaapi!QueryCtxtDeserializeEntityToTrieMap+0x38
0438d924 6fef58e0 7972514e 00000001 00000000 nlaapi!NLA_QUERY_HANDLE_completion_internal+0x21f
0438d944 6fef5cd7 00000000 00000000 0c7507d0 nlaapi!NlaRegisterQuery+0x7d
0438d99c 6fef5bac 044907d0 4e3f415a 014587b0 nlaapi!GetNlaV2Handle+0x7e
0438d9d0 769d59b1 01435244 0438db88 044907d0 nlaapi!WSM_NSPLookupServiceBegin_v2+0xc9
0438d9ec 769d598d 01435208 0438db88 00000000 ws2_32!NSPROVIDER::NSPLookupServiceBegin+0x1b
0438da08 769d591c 01458960 0438db88 00000000 ws2_32!NSPROVIDERSTATE::LookupServiceBegin+0x1d
0438da6c 769d57cc 014587b0 0438db88 00000ff0 ws2_32!NSQUERY::LookupServiceBegin+0x18d
0438dabc 0fd7fca2 0438db88 00000ff0 0438dbd0 ws2_32!WSALookupServiceBeginW+0x7f
WARNING: Stack unwind information not available. Following frames may be wrong.
0438dd18 0fd0c6a1 0fa9df49 03ac9c20 03b5da80 libcef!cef_time_delta+0x3ddc22
0438e2f0 0fa9ec71 0438e698 03a91e00 0438e310 libcef!cef_time_delta+0x36a621
0438e37c 0fa8d58f 0438e698 03ac9c00 03b5da80 libcef!cef_time_delta+0xfcbf1
0438eb88 0fa9915f 0438ebc8 fffffffe ffffffff libcef!cef_time_delta+0xeb50f
0438ebe8 0fb3605d 03b5da80 fffffffe 03ac9d20 libcef!cef_time_delta+0xf70df
0438ec04 0fb3744d 03b5da80 fffffffe fffffffe libcef!cef_time_delta+0x193fdd
0438ecd8 0fb37ad2 00000003 fffffffe 00000000 libcef!cef_time_delta+0x1953cd
0438edc0 0fb37f5d 03ac9400 11c9bee1 03ac9400 libcef!cef_time_delta+0x195a52
0438edfc 0fa9ab19 03ac9400 03ac9c38 03ac9c00 libcef!cef_time_delta+0x195edd
0438f20c 0fa9b466 03ac9400 00000000 03ac9c38 libcef!cef_time_delta+0xf8a99
0438f220 0fb3c815 00000000 03ac9400 0fb3aee5 libcef!cef_time_delta+0xf93e6
0438f2e8 0fb3d2e3 00000000 03b53384 03ac9c38 libcef!cef_time_delta+0x19a795
0438f690 0fb3ad7b 0438f600 03b53384 03b2ddc0 libcef!cef_time_delta+0x19b263
0438f8ac 0f985613 0438fb64 03b53384 00000001 libcef!cef_time_delta+0x198cfb
0438fbec 0f98016a ffffffff ffffffff 03b53368 libcef!cef_string_multimap_free+0x18b63
0438fc28 0fa208a4 03b53340 03ac0dd0 03a87b00 libcef!cef_string_multimap_free+0x136ba
0438fd70 0fa223ba 0438fdd0 03a878c0 03ac0d00 libcef!cef_time_delta+0x7e824
0438fe0c 0fa4edfc 00000000 03a878c0 03ac0dd0 libcef!cef_time_delta+0x8033a
0438fe3c 0fa4ec2e 00000000 03ac0dd0 03ac0dd0 libcef!cef_time_delta+0xacd7c
0438fe5c 0fa2177f 03ac0dd0 0438ff34 03ac0dd0 libcef!cef_time_delta+0xacbae
0438ff24 0fa3a323 03a8d7d0 0fa1fc56 03ac0dd0 libcef!cef_time_delta+0x7f6ff
0438ff50 0fa39b9b 0438ff74 0fa39c6a 03ac0dd0 libcef!cef_time_delta+0x982a3
0438ff58 0fa39c6a 03ac0dd0 03a80f40 03a8d7d0 libcef!cef_time_delta+0x97b1b
0438ff74 0fa22ad8 00000000 00000000 03a80f40 libcef!cef_time_delta+0x97bea
0438ff88 75b9336a 000001a8 0438ffd4 770f9882 libcef!cef_time_delta+0x80a58
0438ff94 770f9882 03a80f40 7334e38d 00000000 kernel32!BaseThreadInitThunk+0xe
0438ffd4 770f9855 0fa22a80 03a80f40 00000000 ntdll!__RtlUserThreadStart+0x70
0438ffec 00000000 0fa22a80 03a80f40 00000000 ntdll!_RtlUserThreadStart+0x1b

这个线程正在为04560000这个heap创建LowFragHeap,但是在获取014d0138的cs的时候被卡住了。这里就发现问题所在了!014d0138不正是我们主线程在等待的cs么,这个线程的10f0和1994都在等待1730的cs=014d0138,而1730却在等待1994的cs=04560138。

20150908002438

反汇编RtlpQueryExtendedHeapInformation就能知道这里造成死锁的真正原因。实际上Heap32Next函数枚举所有Heap信息的时候,先统一锁住进程里面所有的Heap,然后做枚举工作,最后再统一释放锁。所以它在正在锁住所有HeapLock过程中的时候,例如锁住了一半,正在尝试锁住另一半,另外一个线程正好拥有他没锁住的HeapLock,但是不巧的时候他在给这个Heap创建LowFragHeap,而创建LowFragHeap需要初始化HeapLock,其中HeapLock的DebugInfo又是用默认Heap去分配内存的,默认堆的锁正好在第一个线程被锁住的那一半里,这就是事情的真相!

那么微软为什么要提供一个会造成死锁的API,而且不去修复呢?按照Raymond的说法,这个系列的函数目的只是诊断,性能很低,不应该用于普通程序中。
http://blogs.msdn.com/b/oldnewthing/archive/2012/03/23/10286665.aspx
但是,我觉得既然是文档化的接口,而且文档里面没有提到会造成死锁,那么它就应该是安全的。

debugging