URL
date
AI summary
slug
status
tags
summary
type

现象

在做线上巡检的时候,发现某个服务checking-service的4台实例里其中有一台的cpu使用率比其他三台明显要高很多
notion image
结合其他指标观察,发现进出带宽也明显比其他实例要高
notion image

排查

排查思路有两种:
  1. 从带宽入手
  1. 从cpu使用率入手

从带宽入手

我们使用iftop命令来看看带宽使用情况
iftop -i eth0
notion image
可以看到,主要的流量来自于和10.10.1.99的通信。10.10.1.99是我们的mysql数据库,所以可能是查询数据库出现了死循环。
我们再通过tcpdump命令抓包看看具体的SQL大概是什么
tcpdump -i eth0 -A 'tcp && port 3306 && ip 10.10.1.99'
从中找到一些关键的sql
11:58:36.779473 IP ace629bc436e.43712 > 10.10.1.99.mysql: Flags [P.], seq 90024:90884, ack 1582272, win 1471, length 860 E...:!@.@.E..... .c.....R.M#7.=P.......X....select student_id , lesson_start_date , lesson_end_date , lesson_id , class_id , status from `checking_lesson_student_271` WHERE ( school_id = 173327 and semester_id = 230662438013917 and active = 1 and id > 0 and lesson_start_date between '2024-09-01 00:00:00.0' and '2024-11-06 23:59:59.0' and status = 0 ) limit 500
通过这个SQL大致也能定位到问题代码。另外由于我们的阿里云的rds开了审计日志,所以也可以从数据库的审计日志上筛选出对应机器ip对应的SQL。

从cpu使用率入手

传统top+jstack方法定位

我们还可以从cpu使用率入手,通过命令top -H -p1查看cpu使用率高的线程
notion image
发现pid=2754这个线程的使用率较高,从COMMAND看应该是个RocketMQ的消费线程。根据pid转换成16进制:
[root@prod-app0004 ~]# printf '%x\\n' 2754 ac2
根据pid在jstack里定位到对应的线程(nid=0xac2):
"ConsumeMessageThread_2" #2745 prio=5 os_prio=0 tid=0x00007f2ae00b4e20 nid=0xac2 runnable [0x00007f29cbaf8000] java.lang.Thread.State: RUNNABLE at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174) at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566) at com.haoduo.checking.util.LockBillHelper.validBatchUpdateStudentCheckingStatus(LockBillHelper.java:299) at com.haoduo.checking.util.LockBillHelper.validBatchUpdateStudentCheckingStatus(LockBillHelper.java:270) at com.haoduo.checking.util.LockBillHelper.validBatchUpdateStudentApply(LockBillHelper.java:392) at com.haoduo.checking.mq.consumer.CheckingBatchUpdateApplyConsumer.checkingStudentBatchUpdateApply(CheckingBatchUpdateApplyConsumer.java:258) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:282) at com.haoduo.rocketmq.consumer.AbstractConsumer.doOnMessage(AbstractConsumer.java:110) at com.haoduo.rocketmq.consumer.AbstractConsumer.onMessage(AbstractConsumer.java:65) at com.haoduo.rocketmq.consumer.AbstractConsumer.onMessage(AbstractConsumer.java:46) at org.apache.rocketmq.spring.support.DefaultRocketMQListenerContainer.handleMessage(DefaultRocketMQListenerContainer.java:394) at org.apache.rocketmq.spring.support.DefaultRocketMQListenerContainer.access$100(DefaultRocketMQListenerContainer.java:70) at org.apache.rocketmq.spring.support.DefaultRocketMQListenerContainer$DefaultMessageListenerConcurrently.consumeMessage(DefaultRocketMQListenerContainer.java:354) at org.apache.rocketmq.client.impl.consumer.ConsumeMessageConcurrentlyService$ConsumeRequest.run(ConsumeMessageConcurrentlyService.java:411) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)

arthas定位

另外也可以使用arthas来定位,thread命令默认就是按照cpu使用率从高到底排序的
notion image
thread -n 1可以直接打印出最耗cpu的线程堆栈
[arthas@1]$ thread -n 1 "ConsumeMessageThread_2" Id=2745 cpuUsage=55.38% deltaTime=116ms time=37201167ms RUNNABLE at org.apache.commons.collections4.map.AbstractHashedMap.get(AbstractHashedMap.java:187) at org.apache.shardingsphere.driver.jdbc.core.resultset.ShardingSphereResultSet.getIndexFromColumnLabelAndIndexMap(ShardingSphereResultSet.java:422) at org.apache.shardingsphere.driver.jdbc.core.resultset.ShardingSphereResultSet.getLong(ShardingSphereResultSet.java:167) at org.apache.ibatis.type.LongTypeHandler.getNullableResult(LongTypeHandler.java:37) at org.apache.ibatis.type.LongTypeHandler.getNullableResult(LongTypeHandler.java:26) at org.apache.ibatis.type.BaseTypeHandler.getResult(BaseTypeHandler.java:81) at org.apache.ibatis.executor.resultset.DefaultResultSetHandler.getPropertyMappingValue(DefaultResultSetHandler.java:472) at org.apache.ibatis.executor.resultset.DefaultResultSetHandler.applyPropertyMappings(DefaultResultSetHandler.java:441) at org.apache.ibatis.executor.resultset.DefaultResultSetHandler.getRowValue(DefaultResultSetHandler.java:404) at org.apache.ibatis.executor.resultset.DefaultResultSetHandler.handleRowValuesForSimpleResultMap(DefaultResultSetHandler.java:354) at org.apache.ibatis.executor.resultset.DefaultResultSetHandler.handleRowValues(DefaultResultSetHandler.java:328) at org.apache.ibatis.executor.resultset.DefaultResultSetHandler.handleResultSet(DefaultResultSetHandler.java:301) at org.apache.ibatis.executor.resultset.DefaultResultSetHandler.handleResultSets(DefaultResultSetHandler.java:194) at org.apache.ibatis.executor.statement.PreparedStatementHandler.query(PreparedStatementHandler.java:65) at org.apache.ibatis.executor.statement.RoutingStatementHandler.query(RoutingStatementHandler.java:79) at org.apache.ibatis.executor.SimpleExecutor.doQuery(SimpleExecutor.java:63) at org.apache.ibatis.executor.BaseExecutor.queryFromDatabase(BaseExecutor.java:324) at org.apache.ibatis.executor.BaseExecutor.query(BaseExecutor.java:156) at org.apache.ibatis.executor.CachingExecutor.query(CachingExecutor.java:109) at com.github.pagehelper.PageInterceptor.intercept(PageInterceptor.java:108) at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:61) at com.sun.proxy.$Proxy388.query(Unknown Source) at sun.reflect.GeneratedMethodAccessor282.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:63) at com.sun.proxy.$Proxy388.query(Unknown Source) at sun.reflect.GeneratedMethodAccessor282.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.ibatis.plugin.Invocation.proceed(Invocation.java:49) at com.haoduo.localized.deploy.support.mybatis.LocalLocalIdConditionPlugin.intercept(LocalLocalIdConditionPlugin.java:56) at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:61) at com.sun.proxy.$Proxy388.query(Unknown Source) at sun.reflect.GeneratedMethodAccessor282.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.ibatis.plugin.Invocation.proceed(Invocation.java:49) at com.haoduo.mask.support.mybatis.plugins.DecryptQueryPlugin.intercept(DecryptQueryPlugin.java:92) at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:61) at com.sun.proxy.$Proxy388.query(Unknown Source) at sun.reflect.GeneratedMethodAccessor282.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:63) at com.sun.proxy.$Proxy388.query(Unknown Source) at sun.reflect.GeneratedMethodAccessor282.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.ibatis.plugin.Invocation.proceed(Invocation.java:49) at com.haoduo.sharding.biz.support.BaseShardingConditionPlugin.intercept(BaseShardingConditionPlugin.java:88) at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:61) at com.sun.proxy.$Proxy388.query(Unknown Source) at sun.reflect.GeneratedMethodAccessor282.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:63) at com.sun.proxy.$Proxy388.query(Unknown Source) at sun.reflect.GeneratedMethodAccessor282.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:63) at com.sun.proxy.$Proxy388.query(Unknown Source) at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:147) at org.apache.ibatis.session.defaults.DefaultSqlSession.selectList(DefaultSqlSession.java:140) at sun.reflect.GeneratedMethodAccessor532.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:433) at com.sun.proxy.$Proxy295.selectList(Unknown Source) at org.mybatis.spring.SqlSessionTemplate.selectList(SqlSessionTemplate.java:230) at org.apache.ibatis.binding.MapperMethod.executeForMany(MapperMethod.java:147) at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:80) at org.apache.ibatis.binding.MapperProxy$PlainMethodInvoker.invoke(MapperProxy.java:144) at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:85) at com.sun.proxy.$Proxy296.selectByExampleSelective(Unknown Source) at com.haoduo.checking.manager.CheckingStudentManager.find4ValidCanRefund4BatchUpdateApply(CheckingStudentManager.java:1225) at com.haoduo.checking.manager.CheckingStudentManager$$FastClassBySpringCGLIB$$856a8b48.invoke(<generated>) at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:687) at com.haoduo.checking.manager.CheckingStudentManager$$EnhancerBySpringCGLIB$$9e7af085.find4ValidCanRefund4BatchUpdateApply(<generated>) at com.haoduo.checking.util.LockBillHelper.validBatchUpdateStudentApply(LockBillHelper.java:386) at com.haoduo.checking.mq.consumer.CheckingBatchUpdateApplyConsumer.checkingStudentBatchUpdateApply(CheckingBatchUpdateApplyConsumer.java:258) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:282) at com.haoduo.rocketmq.consumer.AbstractConsumer.doOnMessage(AbstractConsumer.java:110) at com.haoduo.rocketmq.consumer.AbstractConsumer.onMessage(AbstractConsumer.java:65) at com.haoduo.rocketmq.consumer.AbstractConsumer.onMessage(AbstractConsumer.java:46) at org.apache.rocketmq.spring.support.DefaultRocketMQListenerContainer.handleMessage(DefaultRocketMQListenerContainer.java:394) at org.apache.rocketmq.spring.support.DefaultRocketMQListenerContainer.access$100(DefaultRocketMQListenerContainer.java:70) at org.apache.rocketmq.spring.support.DefaultRocketMQListenerContainer$DefaultMessageListenerConcurrently.consumeMessage(DefaultRocketMQListenerContainer.java:354) at org.apache.rocketmq.client.impl.consumer.ConsumeMessageConcurrentlyService$ConsumeRequest.run(ConsumeMessageConcurrentlyService.java:411) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)

修复

找到对应的问题代码,是在基于最小id分页遍历取数的时候,没有更新每次查询里的最小id
notion image
这里有2个问题:
  1. maxId更新
  1. 查询语句未按照id升序
此类分页查询已经是线上的第N次问题了,我们自己写了一个分页框架来规避此类问题。

参考

  1. Linux实时监控网卡入流量和出流量(iftop、dstat、nethogs、nload)
  1. 使用 tcpdump 抓包
  1. linux中用top、ps命令查看进程中的线程
关于怎么搭建一个这样的blogMySQL异步复制和半同步复制