URL
date
AI summary
slug
status
tags
summary
type

背景

上午突然收到大量告警:开始是服务调用超时,然后是OOM,最后在排查过程中自行恢复了。虽然恢复了,但是还是必须要定位一下原因,不然下次可能就没这么“幸运”了。

现象

当时的调用超时异常如下:
请关注!复发告警,上一轮异常工单完成时间:2024-03-04 19:01:25 错误时间: 2024-06-20 11:25:21 近几秒 统计时间窗: [2024-06-20 11:20:21->当前] 期间发生同样的异常次数: 36次, 超过阈值: 10次(300秒) 本轮首次发生时间: 2024-03-05 13:54:04, 该问题已持续[106.9]天 环境: 生产环境 出错应用(服务提供方): course-service 服务调用方:notify-service 链路信息: notify-service[10.20.1.158]>>course-service[10.20.1.158],traceId:66739feab97b037ede3e08d2 排查提示: 可以在服务:notify-service[10.20.1.158]查询相关日志 触发规则: 通用rpc超时异常, 300秒10次忽略 累计异常次数: 267次, 累计报警次数: 43次 异常类型: Dubbo#InvokeRemoteMethodTimeout 原始异常类型: com.alibaba.dubbo.rpc.RpcException 业务信息: 课前5小时消息发送失败
OOM异常如下:
请关注!本次为本轮首次告警 错误时间: 2024-06-20 11:25:30 近几秒 本轮首次发生时间: 2024-06-20 11:21:55 环境: 生产环境 应用名: notify-service IP: 10.20.1.158 排查提示: 可以在服务:notify-service[10.20.1.158]查询相关日志 触发规则: 系统兜底规则, 60秒0次忽略 累计异常次数: 2次 异常类型: java.lang.OutOfMemoryError 签名: 29cb0899eaae3333212ff0a626131184 业务信息: postProcess occurs error 异常信息: Java heap space 堆栈信息: java.lang.OutOfMemoryError: Java heap space

分析

我们发现,这个notify-service的两台实例里,只有一台实例有服务调用超时以及OOM的现象,而另外一台并没有,从cpu使用率监控也能看出来:
notion image
另外,我们还观察了调用超时的相关提供者服务,它们在对应时间点并没有波动。
基于以上两点,基本可以断定是由于调用端GC,产生了STW,导致了调用超时。我们来看看notify-service 出问题那台实例的gc日志:
[root@prod-app0008 notify-service]# grep "stopped" gc.log.0.current | egrep 'stopped: 0.5[0-9]|stopped: [1-9].' 2024-06-20T11:19:05.664+0800: 57142.089: Total time for which application threads were stopped: 1.0198794 seconds, Stopping threads took: 0.0004496 seconds 2024-06-20T11:19:14.554+0800: 57150.978: Total time for which application threads were stopped: 0.5095942 seconds, Stopping threads took: 0.0002744 seconds 2024-06-20T11:19:19.558+0800: 57155.982: Total time for which application threads were stopped: 1.0982840 seconds, Stopping threads took: 0.0003070 seconds 2024-06-20T11:19:22.678+0800: 57159.102: Total time for which application threads were stopped: 1.0363128 seconds, Stopping threads took: 0.0003212 seconds 2024-06-20T11:19:25.818+0800: 57162.243: Total time for which application threads were stopped: 1.1529022 seconds, Stopping threads took: 0.0002791 seconds 2024-06-20T11:19:27.469+0800: 57163.894: Total time for which application threads were stopped: 1.0593043 seconds, Stopping threads took: 0.0003465 seconds 2024-06-20T11:19:29.345+0800: 57165.770: Total time for which application threads were stopped: 1.0799812 seconds, Stopping threads took: 0.0002881 seconds 2024-06-20T11:19:35.724+0800: 57172.149: Total time for which application threads were stopped: 1.0062648 seconds, Stopping threads took: 0.0002635 seconds 2024-06-20T11:19:46.998+0800: 57183.422: Total time for which application threads were stopped: 1.0420298 seconds, Stopping threads took: 0.0002761 seconds 2024-06-20T11:19:57.190+0800: 57193.614: Total time for which application threads were stopped: 1.0497680 seconds, Stopping threads took: 0.0002731 seconds 2024-06-20T11:20:00.621+0800: 57197.046: Total time for which application threads were stopped: 1.2272433 seconds, Stopping threads took: 0.0002479 seconds 2024-06-20T11:20:02.564+0800: 57198.989: Total time for which application threads were stopped: 1.0313140 seconds, Stopping threads took: 0.0002728 seconds 2024-06-20T11:20:04.048+0800: 57200.472: Total time for which application threads were stopped: 1.1869651 seconds, Stopping threads took: 0.0108022 seconds 2024-06-20T11:20:07.195+0800: 57203.619: Total time for which application threads were stopped: 1.0043289 seconds, Stopping threads took: 0.0003086 seconds 2024-06-20T11:20:08.867+0800: 57205.292: Total time for which application threads were stopped: 1.0173528 seconds, Stopping threads took: 0.0002874 seconds 2024-06-20T11:20:10.555+0800: 57206.980: Total time for which application threads were stopped: 1.0686381 seconds, Stopping threads took: 0.0002457 seconds 2024-06-20T11:20:14.349+0800: 57210.773: Total time for which application threads were stopped: 1.1319231 seconds, Stopping threads took: 0.0003060 seconds 2024-06-20T11:20:17.488+0800: 57213.912: Total time for which application threads were stopped: 1.1327066 seconds, Stopping threads took: 0.0002958 seconds 2024-06-20T11:20:22.596+0800: 57219.021: Total time for which application threads were stopped: 1.0645579 seconds, Stopping threads took: 0.0002824 seconds 2024-06-20T11:20:23.904+0800: 57220.328: Total time for which application threads were stopped: 1.0300699 seconds, Stopping threads took: 0.0061323 seconds 2024-06-20T11:20:26.921+0800: 57223.346: Total time for which application threads were stopped: 1.1136983 seconds, Stopping threads took: 0.0003034 seconds 2024-06-20T11:20:29.460+0800: 57225.884: Total time for which application threads were stopped: 1.0556971 seconds, Stopping threads took: 0.0002954 seconds 2024-06-20T11:20:32.278+0800: 57228.703: Total time for which application threads were stopped: 1.1123451 seconds, Stopping threads took: 0.0002661 seconds 2024-06-20T11:20:35.281+0800: 57231.705: Total time for which application threads were stopped: 1.1453540 seconds, Stopping threads took: 0.0002780 seconds 2024-06-20T11:20:38.283+0800: 57234.708: Total time for which application threads were stopped: 1.0493181 seconds, Stopping threads took: 0.0002568 seconds 2024-06-20T11:20:41.217+0800: 57237.641: Total time for which application threads were stopped: 1.1429915 seconds, Stopping threads took: 0.0002701 seconds 2024-06-20T11:20:42.507+0800: 57238.931: Total time for which application threads were stopped: 1.0083866 seconds, Stopping threads took: 0.0002895 seconds 2024-06-20T11:20:44.283+0800: 57240.708: Total time for which application threads were stopped: 1.2382664 seconds, Stopping threads took: 0.0002691 seconds 2024-06-20T11:20:45.652+0800: 57242.076: Total time for which application threads were stopped: 1.2686167 seconds, Stopping threads took: 0.0003115 seconds 2024-06-20T11:20:46.952+0800: 57243.377: Total time for which application threads were stopped: 1.1135155 seconds, Stopping threads took: 0.0002694 seconds 2024-06-20T11:20:48.152+0800: 57244.577: Total time for which application threads were stopped: 1.0037656 seconds, Stopping threads took: 0.0002797 seconds 2024-06-20T11:20:49.563+0800: 57245.987: Total time for which application threads were stopped: 1.2421865 seconds, Stopping threads took: 0.0002949 seconds 2024-06-20T11:20:51.203+0800: 57247.628: Total time for which application threads were stopped: 1.1623475 seconds, Stopping threads took: 0.0003028 seconds 2024-06-20T11:20:57.550+0800: 57253.975: Total time for which application threads were stopped: 1.1394110 seconds, Stopping threads took: 0.0002766 seconds 2024-06-20T11:21:00.082+0800: 57256.507: Total time for which application threads were stopped: 1.0713535 seconds, Stopping threads took: 0.0002802 seconds 2024-06-20T11:21:03.132+0800: 57259.557: Total time for which application threads were stopped: 1.2952534 seconds, Stopping threads took: 0.0002870 seconds 2024-06-20T11:21:08.516+0800: 57264.941: Total time for which application threads were stopped: 1.1847655 seconds, Stopping threads took: 0.0002749 seconds 2024-06-20T11:21:10.462+0800: 57266.887: Total time for which application threads were stopped: 1.9430063 seconds, Stopping threads took: 0.0003942 seconds 2024-06-20T11:21:15.624+0800: 57272.049: Total time for which application threads were stopped: 5.1616084 seconds, Stopping threads took: 0.0003223 seconds 2024-06-20T11:21:17.330+0800: 57273.754: Total time for which application threads were stopped: 1.2004076 seconds, Stopping threads took: 0.0003194 seconds 2024-06-20T11:21:20.697+0800: 57277.122: Total time for which application threads were stopped: 1.2746570 seconds, Stopping threads took: 0.0002516 seconds 2024-06-20T11:21:21.904+0800: 57278.329: Total time for which application threads were stopped: 1.0841685 seconds, Stopping threads took: 0.0002752 seconds 2024-06-20T11:21:23.225+0800: 57279.650: Total time for which application threads were stopped: 1.2049377 seconds, Stopping threads took: 0.0002604 seconds 2024-06-20T11:21:24.461+0800: 57280.886: Total time for which application threads were stopped: 1.0055006 seconds, Stopping threads took: 0.0002935 seconds 2024-06-20T11:21:26.077+0800: 57282.502: Total time for which application threads were stopped: 1.2420148 seconds, Stopping threads took: 0.0025064 seconds 2024-06-20T11:21:28.791+0800: 57285.216: Total time for which application threads were stopped: 1.0491318 seconds, Stopping threads took: 0.0003077 seconds 2024-06-20T11:21:32.080+0800: 57288.505: Total time for which application threads were stopped: 1.2203584 seconds, Stopping threads took: 0.0003023 seconds 2024-06-20T11:21:35.407+0800: 57291.832: Total time for which application threads were stopped: 2.1166195 seconds, Stopping threads took: 0.0002547 seconds 2024-06-20T11:21:36.720+0800: 57293.145: Total time for which application threads were stopped: 1.0504355 seconds, Stopping threads took: 0.0002693 seconds 2024-06-20T11:21:40.872+0800: 57297.296: Total time for which application threads were stopped: 1.2162478 seconds, Stopping threads took: 0.0004517 seconds 2024-06-20T11:21:45.432+0800: 57301.856: Total time for which application threads were stopped: 1.1285019 seconds, Stopping threads took: 0.0002740 seconds 2024-06-20T11:21:47.014+0800: 57303.439: Total time for which application threads were stopped: 1.2504247 seconds, Stopping threads took: 0.0002528 seconds 2024-06-20T11:21:50.525+0800: 57306.950: Total time for which application threads were stopped: 2.0244848 seconds, Stopping threads took: 0.0002497 seconds 2024-06-20T11:21:51.746+0800: 57308.170: Total time for which application threads were stopped: 1.0581819 seconds, Stopping threads took: 0.0002736 seconds 2024-06-20T11:21:54.101+0800: 57310.526: Total time for which application threads were stopped: 1.1833733 seconds, Stopping threads took: 0.0002662 seconds 2024-06-20T11:21:55.960+0800: 57312.385: Total time for which application threads were stopped: 1.8477239 seconds, Stopping threads took: 0.0003198 seconds 2024-06-20T11:22:01.577+0800: 57318.002: Total time for which application threads were stopped: 1.8006945 seconds, Stopping threads took: 0.0018363 seconds 2024-06-20T11:22:03.031+0800: 57319.456: Total time for which application threads were stopped: 1.2457566 seconds, Stopping threads took: 0.0002639 seconds 2024-06-20T11:22:05.823+0800: 57322.248: Total time for which application threads were stopped: 1.0134792 seconds, Stopping threads took: 0.0002990 seconds 2024-06-20T11:22:08.803+0800: 57325.228: Total time for which application threads were stopped: 1.9530978 seconds, Stopping threads took: 0.0002685 seconds 2024-06-20T11:22:14.429+0800: 57330.854: Total time for which application threads were stopped: 1.1132713 seconds, Stopping threads took: 0.0002751 seconds 2024-06-20T11:22:16.350+0800: 57332.774: Total time for which application threads were stopped: 1.1262424 seconds, Stopping threads took: 0.0003263 seconds 2024-06-20T11:22:17.745+0800: 57334.170: Total time for which application threads were stopped: 1.2842468 seconds, Stopping threads took: 0.0002043 seconds 2024-06-20T11:22:18.785+0800: 57335.210: Total time for which application threads were stopped: 1.0202137 seconds, Stopping threads took: 0.0005065 seconds 2024-06-20T11:22:20.070+0800: 57336.494: Total time for which application threads were stopped: 1.0800059 seconds, Stopping threads took: 0.0003179 seconds 2024-06-20T11:22:21.741+0800: 57338.165: Total time for which application threads were stopped: 1.6706469 seconds, Stopping threads took: 0.0006118 seconds 2024-06-20T11:22:23.574+0800: 57339.998: Total time for which application threads were stopped: 1.2481612 seconds, Stopping threads took: 0.0002577 seconds 2024-06-20T11:22:26.088+0800: 57342.512: Total time for which application threads were stopped: 1.1813863 seconds, Stopping threads took: 0.0002931 seconds 2024-06-20T11:22:29.065+0800: 57345.490: Total time for which application threads were stopped: 1.0935825 seconds, Stopping threads took: 0.0002724 seconds 2024-06-20T11:22:31.563+0800: 57347.988: Total time for which application threads were stopped: 1.4210116 seconds, Stopping threads took: 0.0002914 seconds 2024-06-20T11:22:32.800+0800: 57349.224: Total time for which application threads were stopped: 1.1355810 seconds, Stopping threads took: 0.0002750 seconds 2024-06-20T11:22:33.980+0800: 57350.404: Total time for which application threads were stopped: 1.1238661 seconds, Stopping threads took: 0.0002382 seconds 2024-06-20T11:22:36.178+0800: 57352.602: Total time for which application threads were stopped: 1.1642895 seconds, Stopping threads took: 0.0002627 seconds 2024-06-20T11:22:38.362+0800: 57354.787: Total time for which application threads were stopped: 1.1125251 seconds, Stopping threads took: 0.0002554 seconds 2024-06-20T11:22:40.664+0800: 57357.089: Total time for which application threads were stopped: 1.2757475 seconds, Stopping threads took: 0.0002533 seconds 2024-06-20T11:22:42.794+0800: 57359.219: Total time for which application threads were stopped: 1.0889745 seconds, Stopping threads took: 0.0002555 seconds 2024-06-20T11:22:45.501+0800: 57361.926: Total time for which application threads were stopped: 1.6776208 seconds, Stopping threads took: 0.0002886 seconds 2024-06-20T11:22:46.521+0800: 57362.946: Total time for which application threads were stopped: 1.0137361 seconds, Stopping threads took: 0.0003444 seconds 2024-06-20T11:22:48.007+0800: 57364.432: Total time for which application threads were stopped: 1.2778170 seconds, Stopping threads took: 0.0003751 seconds 2024-06-20T11:22:49.341+0800: 57365.766: Total time for which application threads were stopped: 1.1653803 seconds, Stopping threads took: 0.0002861 seconds 2024-06-20T11:22:50.724+0800: 57367.149: Total time for which application threads were stopped: 1.0273746 seconds, Stopping threads took: 0.0002875 seconds 2024-06-20T11:22:53.362+0800: 57369.787: Total time for which application threads were stopped: 1.1120561 seconds, Stopping threads took: 0.0002630 seconds 2024-06-20T11:22:58.576+0800: 57375.001: Total time for which application threads were stopped: 1.1583615 seconds, Stopping threads took: 0.0002566 seconds 2024-06-20T11:23:01.085+0800: 57377.510: Total time for which application threads were stopped: 1.3160743 seconds, Stopping threads took: 0.0002920 seconds 2024-06-20T11:23:02.822+0800: 57379.247: Total time for which application threads were stopped: 1.7247018 seconds, Stopping threads took: 0.0003089 seconds 2024-06-20T11:23:04.204+0800: 57380.628: Total time for which application threads were stopped: 1.1764910 seconds, Stopping threads took: 0.0002866 seconds 2024-06-20T11:23:06.975+0800: 57383.399: Total time for which application threads were stopped: 1.1608671 seconds, Stopping threads took: 0.0002510 seconds 2024-06-20T11:23:09.624+0800: 57386.049: Total time for which application threads were stopped: 1.0841561 seconds, Stopping threads took: 0.0003361 seconds 2024-06-20T11:23:13.357+0800: 57389.781: Total time for which application threads were stopped: 1.0631893 seconds, Stopping threads took: 0.0002576 seconds 2024-06-20T11:23:15.782+0800: 57392.207: Total time for which application threads were stopped: 1.0516367 seconds, Stopping threads took: 0.0002681 seconds 2024-06-20T11:23:18.384+0800: 57394.809: Total time for which application threads were stopped: 1.0675893 seconds, Stopping threads took: 0.0003043 seconds 2024-06-20T11:23:21.114+0800: 57397.538: Total time for which application threads were stopped: 1.1662032 seconds, Stopping threads took: 0.0002417 seconds 2024-06-20T11:23:23.574+0800: 57399.998: Total time for which application threads were stopped: 1.2160463 seconds, Stopping threads took: 0.0002526 seconds 2024-06-20T11:23:26.546+0800: 57402.970: Total time for which application threads were stopped: 1.1728652 seconds, Stopping threads took: 0.0002705 seconds 2024-06-20T11:23:28.787+0800: 57405.212: Total time for which application threads were stopped: 1.2705120 seconds, Stopping threads took: 0.0002410 seconds 2024-06-20T11:23:30.802+0800: 57407.227: Total time for which application threads were stopped: 1.1166253 seconds, Stopping threads took: 0.0002239 seconds 2024-06-20T11:23:31.827+0800: 57408.252: Total time for which application threads were stopped: 1.0020937 seconds, Stopping threads took: 0.0002421 seconds 2024-06-20T11:23:33.048+0800: 57409.472: Total time for which application threads were stopped: 1.1586475 seconds, Stopping threads took: 0.0002442 seconds 2024-06-20T11:23:35.202+0800: 57411.627: Total time for which application threads were stopped: 1.1799467 seconds, Stopping threads took: 0.0002662 seconds 2024-06-20T11:23:37.895+0800: 57414.320: Total time for which application threads were stopped: 1.7572378 seconds, Stopping threads took: 0.0002860 seconds 2024-06-20T11:23:39.159+0800: 57415.583: Total time for which application threads were stopped: 1.2199407 seconds, Stopping threads took: 0.0002006 seconds 2024-06-20T11:23:41.614+0800: 57418.039: Total time for which application threads were stopped: 1.1029387 seconds, Stopping threads took: 0.0003017 seconds 2024-06-20T11:23:44.612+0800: 57421.037: Total time for which application threads were stopped: 1.0957247 seconds, Stopping threads took: 0.0002734 seconds 2024-06-20T11:23:46.766+0800: 57423.191: Total time for which application threads were stopped: 1.1080531 seconds, Stopping threads took: 0.0002511 seconds 2024-06-20T11:23:49.059+0800: 57425.484: Total time for which application threads were stopped: 1.1267202 seconds, Stopping threads took: 0.0003399 seconds 2024-06-20T11:23:50.656+0800: 57427.081: Total time for which application threads were stopped: 1.5950660 seconds, Stopping threads took: 0.0011346 seconds 2024-06-20T11:23:57.512+0800: 57433.937: Total time for which application threads were stopped: 1.1131228 seconds, Stopping threads took: 0.0002698 seconds 2024-06-20T11:23:59.027+0800: 57435.451: Total time for which application threads were stopped: 1.5121620 seconds, Stopping threads took: 0.0003023 seconds 2024-06-20T11:24:01.860+0800: 57438.285: Total time for which application threads were stopped: 1.1480734 seconds, Stopping threads took: 0.0002536 seconds 2024-06-20T11:24:04.223+0800: 57440.647: Total time for which application threads were stopped: 1.1503839 seconds, Stopping threads took: 0.0002909 seconds 2024-06-20T11:24:06.501+0800: 57442.926: Total time for which application threads were stopped: 1.0422870 seconds, Stopping threads took: 0.0002825 seconds 2024-06-20T11:24:08.761+0800: 57445.186: Total time for which application threads were stopped: 1.0324164 seconds, Stopping threads took: 0.0002632 seconds 2024-06-20T11:24:11.136+0800: 57447.560: Total time for which application threads were stopped: 1.1635659 seconds, Stopping threads took: 0.0002910 seconds 2024-06-20T11:24:13.445+0800: 57449.870: Total time for which application threads were stopped: 1.0147621 seconds, Stopping threads took: 0.0002714 seconds 2024-06-20T11:24:15.081+0800: 57451.506: Total time for which application threads were stopped: 1.6329977 seconds, Stopping threads took: 0.0003979 seconds 2024-06-20T11:24:17.891+0800: 57454.316: Total time for which application threads were stopped: 1.0154202 seconds, Stopping threads took: 0.0002755 seconds 2024-06-20T11:24:20.054+0800: 57456.478: Total time for which application threads were stopped: 1.0265122 seconds, Stopping threads took: 0.0002623 seconds 2024-06-20T11:24:22.455+0800: 57458.879: Total time for which application threads were stopped: 1.1790529 seconds, Stopping threads took: 0.0002623 seconds 2024-06-20T11:24:25.489+0800: 57461.913: Total time for which application threads were stopped: 2.0495565 seconds, Stopping threads took: 0.0002253 seconds 2024-06-20T11:24:29.038+0800: 57465.463: Total time for which application threads were stopped: 1.1613540 seconds, Stopping threads took: 0.0002677 seconds 2024-06-20T11:24:30.508+0800: 57466.933: Total time for which application threads were stopped: 1.0936724 seconds, Stopping threads took: 0.0003030 seconds 2024-06-20T11:24:32.052+0800: 57468.477: Total time for which application threads were stopped: 1.1821966 seconds, Stopping threads took: 0.0002721 seconds 2024-06-20T11:24:34.442+0800: 57470.866: Total time for which application threads were stopped: 1.0867626 seconds, Stopping threads took: 0.0002698 seconds 2024-06-20T11:24:36.855+0800: 57473.280: Total time for which application threads were stopped: 1.0436898 seconds, Stopping threads took: 0.0002515 seconds 2024-06-20T11:24:41.210+0800: 57477.634: Total time for which application threads were stopped: 1.0955193 seconds, Stopping threads took: 0.0002892 seconds 2024-06-20T11:24:43.070+0800: 57479.495: Total time for which application threads were stopped: 1.7234845 seconds, Stopping threads took: 0.0002700 seconds 2024-06-20T11:24:44.464+0800: 57480.888: Total time for which application threads were stopped: 1.2058412 seconds, Stopping threads took: 0.0002787 seconds 2024-06-20T11:24:46.723+0800: 57483.148: Total time for which application threads were stopped: 1.1227343 seconds, Stopping threads took: 0.0015248 seconds 2024-06-20T11:24:48.965+0800: 57485.390: Total time for which application threads were stopped: 1.0005273 seconds, Stopping threads took: 0.0003569 seconds 2024-06-20T11:24:51.536+0800: 57487.961: Total time for which application threads were stopped: 1.6485981 seconds, Stopping threads took: 0.0007895 seconds 2024-06-20T11:24:52.762+0800: 57489.187: Total time for which application threads were stopped: 1.1361445 seconds, Stopping threads took: 0.0002859 seconds 2024-06-20T11:24:57.417+0800: 57493.841: Total time for which application threads were stopped: 1.0539808 seconds, Stopping threads took: 0.0002488 seconds 2024-06-20T11:25:02.322+0800: 57498.747: Total time for which application threads were stopped: 1.1493219 seconds, Stopping threads took: 0.0002539 seconds 2024-06-20T11:25:04.814+0800: 57501.239: Total time for which application threads were stopped: 1.0240951 seconds, Stopping threads took: 0.0002551 seconds 2024-06-20T11:25:07.868+0800: 57504.292: Total time for which application threads were stopped: 1.2747084 seconds, Stopping threads took: 0.0002786 seconds 2024-06-20T11:25:10.732+0800: 57507.157: Total time for which application threads were stopped: 1.1918052 seconds, Stopping threads took: 0.0002907 seconds 2024-06-20T11:25:13.509+0800: 57509.934: Total time for which application threads were stopped: 1.0047051 seconds, Stopping threads took: 0.0002578 seconds 2024-06-20T11:25:16.209+0800: 57512.634: Total time for which application threads were stopped: 1.1562507 seconds, Stopping threads took: 0.0005148 seconds 2024-06-20T11:25:18.890+0800: 57515.314: Total time for which application threads were stopped: 1.0835729 seconds, Stopping threads took: 0.0002538 seconds 2024-06-20T11:25:21.452+0800: 57517.876: Total time for which application threads were stopped: 1.0060095 seconds, Stopping threads took: 0.0002780 seconds 2024-06-20T11:25:24.337+0800: 57520.762: Total time for which application threads were stopped: 1.0036266 seconds, Stopping threads took: 0.0002834 seconds 2024-06-20T11:25:33.535+0800: 57529.959: Total time for which application threads were stopped: 0.5083679 seconds, Stopping threads took: 0.0012707 seconds 2024-06-20T11:25:41.813+0800: 57538.238: Total time for which application threads were stopped: 1.0136821 seconds, Stopping threads took: 0.0040605 seconds
可以看到,11:19:05~11:25:41 有大量超过0.5秒的gc,gc导致了长时间的STW。
那是什么导致了内存占用突然升高并且GC不掉呢?接下来我们就来分析hprof文件
notion image
如上图所示,内存占用高的几乎都是notifyTaskThreadPool这个自定义的线程池里的线程,单个线程占用的内存最高的竟然有60M+。我们来看看都是什么对象:
notion image
这里的4个字符数组就是内存占用的“罪魁祸首”,并且可以看到里面存储的都是具体待执行的SQL(有微小的区别,可以忽略)。我们再结合线程的堆栈信息一起分析:
notion image
可以看到这几个“大对象”都是MyBatis内部使用的。这里我们先不去进一步分析这些大对象具体都是干嘛的,先来看看具体的业务场景:
notifyTaskThreadPool这个线程池是用来做消息拆分的,所谓消息拆分就是把一条消息发送任务拆分到每个发送对象维度。比如我创建了一个给杭州所有的用户发一条“恭喜发财”通知的任务,这个时候,notifyTaskThreadPool就会去拆分这个任务,给每个杭州的用户都创建一条notify_record记录,显然这里的拆分数量可能会很大,所以我们采用了每500条记录分批插入的方式,而500条记录的插入SQL拼接起来竟然达到了10M+的大小,主要是MyBatis拼的SQL里包含了大量的空格和换行符,如下所示:
insert into `notify_record` ( id , notify_id , merchant_id ) values ( #{__frch_item_31.id,jdbcType=BIGINT} , #{__frch_item_31.notifyId,jdbcType=BIGINT} , #{__frch_item_31.merchantId,jdbcType=BIGINT} );
大量的空格和换行极大地增加了字符串的体积。并且,出问题的时候,这样的大型任务同一时间创建了50个,而我们的线程池的最大线程数量是64。
下面我们再看看MyBatis内部为什么会基于SQL字符串衍生出那么多“大对象”。除了DynamicContext,其他的变量都可以在下面这个类里找到线索:
public class GenericTokenParser { private final String openToken; private final String closeToken; private final TokenHandler handler; public GenericTokenParser(String openToken, String closeToken, TokenHandler handler) { this.openToken = openToken; this.closeToken = closeToken; this.handler = handler; } // text 是含有 #{} 占位符的原始SQL public String parse(String text) { if (text == null || text.isEmpty()) { return ""; } // search open token int start = text.indexOf(openToken); if (start == -1) { return text; } // 这里会copy出一个字符数组 char[] src = text.toCharArray(); int offset = 0; // 下面这个StringBuilder是用来存储替换占位符后的SQL,占位符会被替换成 ? final StringBuilder builder = new StringBuilder(); StringBuilder expression = null; while (start > -1) { if (start > 0 && src[start - 1] == '\\\\') { // this open token is escaped. remove the backslash and continue. builder.append(src, offset, start - offset - 1).append(openToken); offset = start + openToken.length(); } else { // found open token. let's search close token. if (expression == null) { expression = new StringBuilder(); } else { expression.setLength(0); } builder.append(src, offset, start - offset); offset = start + openToken.length(); int end = text.indexOf(closeToken, offset); while (end > -1) { if (end > offset && src[end - 1] == '\\\\') { // this close token is escaped. remove the backslash and continue. expression.append(src, offset, end - offset - 1).append(closeToken); offset = end + closeToken.length(); end = text.indexOf(closeToken, offset); } else { expression.append(src, offset, end - offset); break; } } if (end == -1) { // close token was not found. builder.append(src, start, src.length - start); offset = src.length; } else { builder.append(handler.handleToken(expression.toString())); offset = end + closeToken.length(); } } start = text.indexOf(openToken, offset); } if (offset < src.length) { builder.append(src, offset, src.length - offset); } return builder.toString(); } }

总结

至此,这个问题已经非常清晰了,并且,其实batchInsert场景下的OOM我们也不是第一次碰到了,只是之前没有深入的去分析问题,基本上把每批次的数量降下来就可以了,甚至有一些在写的时候并没有考虑批次,一次性查出来一次性插入。
针对于这个特定的业务场景,我们做了以下优化:
  1. 减少每批次插入的数量(500->100),因为每一次循环处理完之后内存就可以回收了
  1. 减少notifyTaskThreadPool线程池的线程数量(64->10),因为减少了并发的同时占用,并且发送大批量消息的实时性要求也不高
  1. 可以适当增加JVM堆内存
  1. 优化MyBatis拼SQL的逻辑,减少不必要的空格和换行,这块看了一下可能相对麻烦一下,以后再说了
消除MyBatis不必要的空格和换行MySQL JSON字段部分更新实验