GeXiangDong

精通Java、SQL、Spring的拼写,擅长Linux、Windows的开关机

0%

一次java.lang.OutOfMemoryError异常的排查

服务器上抛出了异常,java.lang.OutOfMemoryError,这种异常比较难解决,因为有可能大量占用内存的地方并不是抛出异常的位置,抛异常处可能是正常使用内存,只是内存没了。

java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
        at java.base/java.lang.Thread.start0(Native Method)
        at java.base/java.lang.Thread.start(Thread.java:798)
        at me.chanjar.weixin.common.api.WxMessageInMemoryDuplicateChecker.checkBackgroundProcessStarted(WxMessageInMemoryDuplicateChecker.java:81)
        at me.chanjar.weixin.common.api.WxMessageInMemoryDuplicateChecker.isDuplicate(WxMessageInMemoryDuplicateChecker.java:89)
        at me.chanjar.weixin.mp.api.WxMpMessageRouter.isMsgDuplicated(WxMpMessageRouter.java:257)
        at me.chanjar.weixin.mp.api.WxMpMessageRouter.route(WxMpMessageRouter.java:172)
        at me.chanjar.weixin.open.api.impl.WxOpenMessageRouter.route(WxOpenMessageRouter.java:24)
        at me.chanjar.weixin.open.api.impl.WxOpenMessageRouter.route(WxOpenMessageRouter.java:20)
        at cn.devmgr.mall.wechatopen.WechatNotifyController.callback(WechatNotifyController.java:269)
        at jdk.internal.reflect.GeneratedMethodAccessor433.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
          ... ...

首先在抛出这种异常后,我先查看了下服务器的内存(之前已经写好的查询实例内存的接口),返回信息如下:

1
2
3
4
5
"memory": {
"total": 161480704,
"max": 536870912,
"free": 50741880
}

这个结果说明了不是内存不足,实际上内存不足的exception一般是 java.lang.OutOfMemoryError: Java heap space,而这次的是 java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
所以判断 process/resource limits reached 引发的。

于是写了段代码查询当前实例的线程数:

1
2
3
for (Thread t : Thread.getAllStackTraces().keySet()) {
// 统计每个线程的信息
}

发现抛出异常时,waiting状态的线程甚至达到了6000多个,这肯定是有些线程没有结束导致,查询自己的代码,没有搜到new Thread(),应该是调用其他类库不当导致,要找出原因还得写监控程序。

于是开始统计实例那所有的线程堆栈信息(这个异常虽然不容易重现,但可较容易观察到线程数量增常,这给查找原因带来了方便)

1
2
3
4
for (Thread t : Thread.getAllStackTraces().keySet()) {
StackTraceElement[] elements = t.getStackTrace();
// 按照堆栈分组统计各个线程的数量
}

按照堆栈分组,我采用的办法是把每个StackTraceElement的className, methodName, lineNumber拼接成一个字符串,用这个字符串作为key,判断是否属于一组线程。

发现了如下这个堆栈对应的线程数很多,且呈现出随时间增长不减退的趋势:

jdk.internal.misc.Unsafepark(Unsafe.java:-2)
java.util.concurrent.locks.LockSupportpark(LockSupport.java:194)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObjectawait(AbstractQueuedSynchronizer.java:2081)
java.util.concurrent.LinkedBlockingQueuetakeLinked(BlockingQueue.java:433)
java.util.concurrent.ThreadPoolExecutorgetTask(ThreadPoolExecutor.java:1054)
java.util.concurrent.ThreadPoolExecutorrunWorker(ThreadPoolExecutor.java:1114)
java.util.concurrent.ThreadPoolExecutor$Workerrun(ThreadPoolExecutor.java:628)
java.lang.Threadrun(Thread.java:829)

很遗憾的是,从这个堆栈里只能看出被创建了ThreadPoolExecutor,这个堆栈信息里所有类都是JDK的,没有任何第三方代码,还是没法知道是哪的程序。于是想找到线程的调用方,网上搜索到stackoverflow内有一个有趣的问答

https://stackoverflow.com/questions/18999444/how-to-find-out-who-create-a-thread-in-java

在此贴 Aaron Digulla给出了7种方法来查找线程的调用方,7种,引用如下:

Here is the list of approaches, sorted from quickest / most reliable to slowest / hardest:

1. If you have the source of the class, create an exception in the constructor (without actually throwing it). You can simply examine or print it when you need to know when the thread was created.
2. If you don't have the sources, the thread name can be a good hint who created it.
3. If the name hints to a generic service (like java.util.Timer), then you can create a conditional breakpoint in your IDE in the constructor. The condition should be the thread name; the debugger will then stop when someone creates a thread with this name.
4. If you don't have too many threads, set a breakpoint in the constructors of Thread.
5. If you have many threads, attach a debugger to the app and freeze it. Then examine the stack traces.
6. If everything else fails, get the source code for the Java runtime and add logging code in the classes you want to observe, compile a new rt.jar and replace the original one with your version. Don't try this in production, please.
7. If money isn't an issue, you can use dynamic tracing tools like Compuware APM or, if you're on Linux or Solaris, you can try SystemTap and dtrace, respectively.

我采用了第2种,修改了查询所有线程堆栈的地方,增加了线程名字,然后继续监控,发现了线程名就是一个第三方类库的类名,找到它源码,发现里面有new ExecutorService,而我代码多次new了这个类。修改代码,不再多次new它;部署后看监控程序 thread数目不再增长,解决。

附录:此次排查中使用的完整的监控代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149

import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.HikariPoolMXBean;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

@RestController
@RequestMapping("/service-status")
public class ServiceStatusController {
private static final Logger logger = LoggerFactory.getLogger(ServiceStatusController.class);

@Autowired(required = false)
private DataSource dataSource;

@Value("${spring.application.name:no_name}")
private String appName;

@GetMapping
public Map<String, Object> getStatus() {
Map<String, Object> result = new HashMap<>();
result.put("applicationName", appName);
Map<String, Object> memory = new HashMap<>();
Runtime rt = Runtime.getRuntime();
memory.put("total", rt.totalMemory());
memory.put("free", rt.freeMemory());
memory.put("max", rt.maxMemory());
result.put("memory", memory);
result.put("availableProcessors", rt.availableProcessors());

if (dataSource != null) {
if (dataSource instanceof HikariDataSource) {
HikariDataSource hds = (HikariDataSource) dataSource;
if (hds.getHikariPoolMXBean() == null) {
// 先获取一次才能取得 poolMXBean, 如果程序中有其他请求已经使用过数据库,则不需要这里的获取conn
try {
Connection conn = hds.getConnection();
conn.close();
} catch (SQLException sqlException) {
logger.error("cannot get conn", sqlException);
}
}
HikariPoolMXBean pool = hds.getHikariPoolMXBean();
if (pool != null) {
Map<String, Object> poolStatus = new HashMap<>();
poolStatus.put("active", pool.getActiveConnections());
poolStatus.put("idle", pool.getIdleConnections());
poolStatus.put("total", pool.getTotalConnections());
poolStatus.put("awaiting", pool.getThreadsAwaitingConnection());
poolStatus.put("maximumPoolSize", hds.getMaximumPoolSize());
poolStatus.put("minimumIdle", hds.getMinimumIdle());
poolStatus.put("idleTimeout", hds.getIdleTimeout());
result.put("dataSourcePool", poolStatus);
}
} else {
logger.info("datasource is not hikari");
}
} else {
logger.info("no datasource");
}

{
Map<String, Integer> threadsStatus = new HashMap<>();
for (Thread t : Thread.getAllStackTraces().keySet()) {
String key = t.getState().toString().toLowerCase();
Integer cnt = threadsStatus.get(key);
if (cnt == null) {
threadsStatus.put(key, 1);
} else {
threadsStatus.put(key, cnt + 1);
}
}
threadsStatus.put("total", Thread.getAllStackTraces().size());
result.put("threads", threadsStatus);
}

return result;
}

@GetMapping("/threads")
public Map<String, ?> getThreadStackTraces() {
logger.trace("getThreadStackTraces");
Map<String, List<Map<String, Object>>> result = new HashMap<>();
Map<String, Map<String, Object>> threadInfoByKey = new HashMap<>();
for (Thread t : Thread.getAllStackTraces().keySet()) {
StackTraceElement[] elements = t.getStackTrace();
List<Map<String, Object>> stackList = new ArrayList<>();
StringBuilder builder = new StringBuilder();
int i = 0;
for (StackTraceElement ele : elements) {
i++;
Map<String, Object> map = new HashMap<>();
map.put("className", ele.getClassName());
map.put("methodName", ele.getMethodName());
map.put("lineNum", ele.getLineNumber());
map.put("fileName", ele.getFileName());
map.put("module", ele.getModuleName());
stackList.add(map);

if (i < 30) {
builder.append(ele.getClassName());
builder.append(ele.getMethodName());
}
}
String status = t.getState().toString().toLowerCase();
String key = builder.toString();
Map<String, Object> threadInfo = threadInfoByKey.get(key);
if (threadInfo == null) {
threadInfo = new HashMap<>();
threadInfo.put("stack", stackList);
threadInfoByKey.put(key, threadInfo);
List<Map<String, Object>> list = result.get(status);
if (list == null) {
list = new ArrayList<>();
result.put(status, list);
}
list.add(threadInfo);
}
if (threadInfo.containsKey("counter")) {
threadInfo.put("counter", (Integer) threadInfo.get("counter") + 1);
} else {
threadInfo.put("counter", 1);
}
if (threadInfo.containsKey("names")) {
List<String> names = (List<String>) threadInfo.get("names");
names.add(t.getName());
} else {
List<String> names = new ArrayList<>();
names.add(t.getName());
threadInfo.put("names", names);
}
}
return result;
}

}