
The Symptom

On October 24th — 1024, the one day of the year programmers get to feel good about themselves — an alert came in anyway:
[image: the OutOfMemory alert]
Years of development experience told me that an OutOfMemory error is never a small problem and had to be investigated immediately. I first located the request in the logs of the two layers in front of the service, the SLB and spring-cloud-gateway:
[image: the SLB access log for the request]
[image: the spring-cloud-gateway access log for the request]
apply-merchant-settle-of-flow2 is the merchant-onboarding application endpoint, which accepts image/video uploads. It was previously implemented by letting the web container receive the files directly. The two request logs above make it clear that a user uploaded a large file. But the sizes recorded on the two sides differ wildly: the SLB logged 100+ MB, while the Content-Length request header says 700+ MB. My suspicion is that the SLB's request_length is not taken straight from the Content-Length header but reflects the number of bytes actually read. In other words, the user was uploading a 700+ MB file, and after roughly 100 MB had been transferred, spring-cloud-gateway ran out of memory and failed.
The official SLB documentation describes the request_length field as: the length of the request message, including the start line, the HTTP headers, and the HTTP body.

Stopping the Bleeding

A 700+ MB file. That made my blood run cold: uploads or downloads of files that size can straight-up take a system down. This couldn't wait — I had no intention of being the sacrifice on 1024, especially after Yuque's hours-long outage the day before had already drawn endless criticism. Move, move!
First, stop the bleeding. The obvious move was to limit the request size, and spring-cloud-gateway ships a built-in filter for exactly that: org.springframework.cloud.gateway.filter.factory.RequestSizeGatewayFilterFactory. I happily configured it and tested with a small limit first: the response code did become 413 — request too large. To be thorough, though, I also tried a larger file, and it threw the very same exception as production. Now I began to doubt the filter's implementation: could it be that it doesn't decide based on Content-Length at all?
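For reference, the check boils down to comparing the declared Content-Length against the configured limit. A minimal plain-Java sketch of that decision (the class and method names are mine, not the gateway's, and the pass-through on a missing header is my assumption about the real filter's behavior):

```java
import java.util.Map;

public class RequestSizeCheck {
    /**
     * Mirrors the idea behind RequestSizeGatewayFilterFactory: reject a
     * request up front when its declared Content-Length exceeds the limit.
     */
    public static int decide(Map<String, String> headers, long maxBytes) {
        String raw = headers.get("Content-Length");
        if (raw == null) {
            // No declared length: let it through (my assumption about the real filter).
            return 200;
        }
        long declared = Long.parseLong(raw);
        return declared > maxBytes ? 413 : 200;
    }
}
```

Note that this only inspects the header — it never touches the body, which is why the filter by itself is cheap.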
[image: RequestSizeGatewayFilterFactory source code]
The source code pasted above quickly put that "silly" guess to rest. Then a new hypothesis surfaced: is there a filter that reads the body data and runs before the RequestSize filter?
I tried raising the priority of RequestSizeGatewayFilterFactory:
[image: the configuration raising the filter's order]
💡
You can wrap the filter in an OrderedGatewayFilter to assign it an explicit order. In spring-cloud-gateway, GlobalFilters and GatewayFilters are merged into one chain and sorted together by order, so this alone is enough to raise the priority of the RequestSizeGatewayFilter.
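The merge-and-sort behavior described in the tip can be simulated in a few lines of plain Java (the filter names below are illustrative; in the real gateway, a lower order value runs earlier):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class FilterOrderDemo {
    /** A stand-in for a filter with an Ordered-style order value. */
    public record NamedFilter(String name, int order) {}

    /**
     * spring-cloud-gateway merges GlobalFilters and GatewayFilters into one
     * list and sorts it by order, lower first. Giving the size filter a small
     * order therefore moves it ahead of any body-reading filter.
     */
    public static List<String> chain(List<NamedFilter> filters) {
        List<NamedFilter> sorted = new ArrayList<>(filters);
        sorted.sort(Comparator.comparingInt(NamedFilter::order));
        return sorted.stream().map(NamedFilter::name).toList();
    }
}
```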
After the change, oversized requests were intercepted smoothly, with none of the earlier problems.

Locating the Cause

Next came the actual diagnosis: which of the preceding filters, or what other logic, was causing the memory overflow?
The first suspect was our custom filter, RequestBodyAddCacheGlobalFilter, which reads and caches the entire body so that later filters can verify the request signature against it. However, it already special-cases file-upload requests:
[image: the existing special-case handling for file uploads]
So that filter was off the hook. How, then, to find the problematic one? I tried arthas:
watch org.springframework.cloud.gateway.filter.GlobalFilter filter '{params}' -n 100 -x 3
Unfortunately, no filter stood out as particularly slow. One odd thing did show up, though: the chain contains 26 filters in total, and in the large-file failure case the execution log stops right after AdaptCachedBodyGlobalFilter, whereas in the normal case every filter's execution gets logged. So I zeroed in on AdaptCachedBodyGlobalFilter.
method=com.xxx.gateway.filter.global.TraceFilter.filter location=AtExit ts=2023-11-09 17:27:21; [cost=2.206118ms] result=@ArrayList[ @Object[][ @DefaultServerWebExchange[ SAFE_METHODS=@ArrayList[isEmpty=false;size=2], FORM_DATA_TYPE=@ResolvableType[org.springframework.util.MultiValueMap<java.lang.String, java.lang.String>], MULTIPART_DATA_TYPE=@ResolvableType[org.springframework.util.MultiValueMap<java.lang.String, org.springframework.http.codec.multipart.Part>], EMPTY_FORM_DATA=@MonoOnAssembly[Mono.cache ⇢ at org.springframework.web.server.adapter.DefaultServerWebExchange.<clinit>(DefaultServerWebExchange.java:73)], EMPTY_MULTIPART_DATA=@MonoOnAssembly[Mono.cache ⇢ at org.springframework.web.server.adapter.DefaultServerWebExchange.<clinit>(DefaultServerWebExchange.java:77)], request=@ReactorServerHttpRequest[org.springframework.http.server.reactive.ReactorServerHttpRequest@48d512d2], response=@ReactorServerHttpResponse[org.springframework.http.server.reactive.ReactorServerHttpResponse@24988115], attributes=@ConcurrentHashMap[isEmpty=false;size=5], sessionMono=@MonoOnAssembly[Mono.cache ⇢ at org.springframework.web.server.adapter.DefaultServerWebExchange.<init>(DefaultServerWebExchange.java:129)], localeContextResolver=@AcceptHeaderLocaleContextResolver[org.springframework.web.server.i18n.AcceptHeaderLocaleContextResolver@3e5de990], formDataMono=@MonoOnAssembly[Mono.cache ⇢ at org.springframework.web.server.adapter.DefaultServerWebExchange.<clinit>(DefaultServerWebExchange.java:73)], multipartDataMono=@MonoOnAssembly[Mono.cache ⇢ at org.springframework.web.server.adapter.DefaultServerWebExchange.initMultipartData(DefaultServerWebExchange.java:171)], applicationContext=@AnnotationConfigReactiveWebServerApplicationContext[org.springframework.boot.web.reactive.context.AnnotationConfigReactiveWebServerApplicationContext@553a3d88, started on Thu Nov 09 15:44:09 CST 2023], notModified=@Boolean[false], 
urlTransformer=@DefaultServerWebExchange$$Lambda$890/171650895[org.springframework.web.server.adapter.DefaultServerWebExchange$$Lambda$890/171650895@37a7a735], logId=@String[27e3d254-13], logPrefix=@String[[27e3d254-13] ], ], @DefaultGatewayFilterChain[ index=@Integer[1], filters=@ArrayList[isEmpty=false;size=26], ], ], ] method=com.xxx.gateway.filter.global.LogFilter.filter location=AtExit ts=2023-11-09 17:27:21; [cost=0.899755ms] result=@ArrayList[ @Object[][ @DefaultServerWebExchange[ SAFE_METHODS=@ArrayList[isEmpty=false;size=2], FORM_DATA_TYPE=@ResolvableType[org.springframework.util.MultiValueMap<java.lang.String, java.lang.String>], MULTIPART_DATA_TYPE=@ResolvableType[org.springframework.util.MultiValueMap<java.lang.String, org.springframework.http.codec.multipart.Part>], EMPTY_FORM_DATA=@MonoOnAssembly[Mono.cache ⇢ at org.springframework.web.server.adapter.DefaultServerWebExchange.<clinit>(DefaultServerWebExchange.java:73)], EMPTY_MULTIPART_DATA=@MonoOnAssembly[Mono.cache ⇢ at org.springframework.web.server.adapter.DefaultServerWebExchange.<clinit>(DefaultServerWebExchange.java:77)], request=@ReactorServerHttpRequest[org.springframework.http.server.reactive.ReactorServerHttpRequest@48d512d2], response=@ReactorServerHttpResponse[org.springframework.http.server.reactive.ReactorServerHttpResponse@24988115], attributes=@ConcurrentHashMap[isEmpty=false;size=5], sessionMono=@MonoOnAssembly[Mono.cache ⇢ at org.springframework.web.server.adapter.DefaultServerWebExchange.<init>(DefaultServerWebExchange.java:129)], localeContextResolver=@AcceptHeaderLocaleContextResolver[org.springframework.web.server.i18n.AcceptHeaderLocaleContextResolver@3e5de990], formDataMono=@MonoOnAssembly[Mono.cache ⇢ at org.springframework.web.server.adapter.DefaultServerWebExchange.<clinit>(DefaultServerWebExchange.java:73)], multipartDataMono=@MonoOnAssembly[Mono.cache ⇢ at org.springframework.web.server.adapter.DefaultServerWebExchange.initMultipartData(DefaultServerWebExchange.java:171)], 
applicationContext=@AnnotationConfigReactiveWebServerApplicationContext[org.springframework.boot.web.reactive.context.AnnotationConfigReactiveWebServerApplicationContext@553a3d88, started on Thu Nov 09 15:44:09 CST 2023], notModified=@Boolean[false], urlTransformer=@DefaultServerWebExchange$$Lambda$890/171650895[org.springframework.web.server.adapter.DefaultServerWebExchange$$Lambda$890/171650895@37a7a735], logId=@String[27e3d254-13], logPrefix=@String[[27e3d254-13] ], ], @DefaultGatewayFilterChain[ index=@Integer[2], filters=@ArrayList[isEmpty=false;size=26], ], ], ] method=com.alibaba.csp.sentinel.adapter.gateway.sc.SentinelGatewayFilter.filter location=AtExit ts=2023-11-09 17:27:21; [cost=2.375161ms] result=@ArrayList[ @Object[][ @MutativeDecorator[ request=@ReactorServerHttpRequest[org.springframework.http.server.reactive.ReactorServerHttpRequest@48d512d2], response=null, principalMono=null, delegate=@DefaultServerWebExchange[org.springframework.web.server.adapter.DefaultServerWebExchange@19be6afc], ], @DefaultGatewayFilterChain[ index=@Integer[3], filters=@ArrayList[isEmpty=false;size=26], ], ], ] method=org.springframework.cloud.gateway.filter.RemoveCachedBodyFilter.filter location=AtExit ts=2023-11-09 17:27:21; [cost=0.424192ms] result=@ArrayList[ @Object[][ @MutativeDecorator[ request=@ReactorServerHttpRequest[org.springframework.http.server.reactive.ReactorServerHttpRequest@48d512d2], response=null, principalMono=null, delegate=@DefaultServerWebExchange[org.springframework.web.server.adapter.DefaultServerWebExchange@19be6afc], ], @DefaultGatewayFilterChain[ index=@Integer[4], filters=@ArrayList[isEmpty=false;size=26], ], ], ] method=com.xxx.gateway.filter.global.GrayFilter.filter location=AtExit ts=2023-11-09 17:27:21; [cost=0.131167ms] result=@ArrayList[ @Object[][ @MutativeDecorator[ request=@ReactorServerHttpRequest[org.springframework.http.server.reactive.ReactorServerHttpRequest@48d512d2], response=null, principalMono=null, 
delegate=@DefaultServerWebExchange[org.springframework.web.server.adapter.DefaultServerWebExchange@19be6afc], ], @DefaultGatewayFilterChain[ index=@Integer[5], filters=@ArrayList[isEmpty=false;size=26], ], ], ] method=com.xxx.gateway.filter.global.BaseFilter.filter location=AtExit ts=2023-11-09 17:27:21; [cost=0.127632ms] result=@ArrayList[ @Object[][ @MutativeDecorator[ request=@ReactorServerHttpRequest[org.springframework.http.server.reactive.ReactorServerHttpRequest@48d512d2], response=null, principalMono=null, delegate=@DefaultServerWebExchange[org.springframework.web.server.adapter.DefaultServerWebExchange@19be6afc], ], @DefaultGatewayFilterChain[ index=@Integer[6], filters=@ArrayList[isEmpty=false;size=26], ], ], ] method=com.xxx.gateway.filter.global.BaseFilter.filter location=AtExit ts=2023-11-09 17:27:21; [cost=0.069537ms] result=@ArrayList[ @Object[][ @MutativeDecorator[ request=@ReactorServerHttpRequest[org.springframework.http.server.reactive.ReactorServerHttpRequest@48d512d2], response=null, principalMono=null, delegate=@DefaultServerWebExchange[org.springframework.web.server.adapter.DefaultServerWebExchange@19be6afc], ], @DefaultGatewayFilterChain[ index=@Integer[7], filters=@ArrayList[isEmpty=false;size=26], ], ], ] method=com.xxx.gateway.filter.global.RewritePathFilter.filter location=AtExit ts=2023-11-09 17:27:21; [cost=0.129382ms] result=@ArrayList[ @Object[][ @MutativeDecorator[ request=@ReactorServerHttpRequest[org.springframework.http.server.reactive.ReactorServerHttpRequest@48d512d2], response=null, principalMono=null, delegate=@DefaultServerWebExchange[org.springframework.web.server.adapter.DefaultServerWebExchange@19be6afc], ], @DefaultGatewayFilterChain[ index=@Integer[8], filters=@ArrayList[isEmpty=false;size=26], ], ], ] method=com.xxx.gateway.filter.global.BaseFilter.filter location=AtExit ts=2023-11-09 17:27:21; [cost=0.466282ms] result=@ArrayList[ @Object[][ @MutativeDecorator[ 
request=@ReactorServerHttpRequest[org.springframework.http.server.reactive.ReactorServerHttpRequest@48d512d2], response=null, principalMono=null, delegate=@DefaultServerWebExchange[org.springframework.web.server.adapter.DefaultServerWebExchange@19be6afc], ], @DefaultGatewayFilterChain[ index=@Integer[9], filters=@ArrayList[isEmpty=false;size=26], ], ], ] method=com.xxx.gateway.filter.global.BaseFilter.filter location=AtExit ts=2023-11-09 17:27:21; [cost=0.148898ms] result=@ArrayList[ @Object[][ @MutativeDecorator[ request=@ReactorServerHttpRequest[org.springframework.http.server.reactive.ReactorServerHttpRequest@48d512d2], response=null, principalMono=null, delegate=@DefaultServerWebExchange[org.springframework.web.server.adapter.DefaultServerWebExchange@19be6afc], ], @DefaultGatewayFilterChain[ index=@Integer[10], filters=@ArrayList[isEmpty=false;size=26], ], ], ] method=com.xxx.gateway.filter.global.BaseFilter.filter location=AtExit ts=2023-11-09 17:27:21; [cost=20.178851ms] result=@ArrayList[ @Object[][ @MutativeDecorator[ request=@ReactorServerHttpRequest[org.springframework.http.server.reactive.ReactorServerHttpRequest@48d512d2], response=null, principalMono=null, delegate=@DefaultServerWebExchange[org.springframework.web.server.adapter.DefaultServerWebExchange@19be6afc], ], @DefaultGatewayFilterChain[ index=@Integer[11], filters=@ArrayList[isEmpty=false;size=26], ], ], ] method=com.xxx.gateway.filter.global.BaseFilter.filter location=AtExit ts=2023-11-09 17:27:21; [cost=3.161773ms] result=@ArrayList[ @Object[][ @MutativeDecorator[ request=@ReactorServerHttpRequest[org.springframework.http.server.reactive.ReactorServerHttpRequest@48d512d2], response=null, principalMono=null, delegate=@DefaultServerWebExchange[org.springframework.web.server.adapter.DefaultServerWebExchange@19be6afc], ], @DefaultGatewayFilterChain[ index=@Integer[12], filters=@ArrayList[isEmpty=false;size=26], ], ], ] 
method=org.springframework.cloud.gateway.filter.AdaptCachedBodyGlobalFilter.filter location=AtExit ts=2023-11-09 17:27:21; [cost=0.981795ms] result=@ArrayList[ @Object[][ @MutativeDecorator[ request=@ReactorServerHttpRequest[org.springframework.http.server.reactive.ReactorServerHttpRequest@48d512d2], response=null, principalMono=null, delegate=@DefaultServerWebExchange[org.springframework.web.server.adapter.DefaultServerWebExchange@19be6afc], ], @DefaultGatewayFilterChain[ index=@Integer[13], filters=@ArrayList[isEmpty=false;size=26], ], ], ]
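The "log stops after AdaptCachedBodyGlobalFilter" observation matches how DefaultGatewayFilterChain works: each filter must invoke the chain for the next one to run, so a filter that goes off to collect the whole body (and fails) never reaches the rest. A toy chain (all names mine) makes the short-circuit visible:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ToyFilterChain {
    /**
     * Each entry pairs a filter name with whether it invokes the rest of the
     * chain. A filter that stalls — say, waiting to cache a huge body that
     * never fully arrives — leaves everything after it unexecuted, which is
     * exactly the pattern the arthas watch output showed.
     */
    public static List<String> run(List<Map.Entry<String, Boolean>> filters) {
        List<String> executed = new ArrayList<>();
        for (Map.Entry<String, Boolean> f : filters) {
            executed.add(f.getKey());
            if (!f.getValue()) {
                break; // this filter never called chain.filter(exchange)
            }
        }
        return executed;
    }
}
```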
In addition, with Hooks.onOperatorDebug() enabled, the same class can be located in the assembly trace of the failed reactive chain:
2023-10-30 17:56:12,899 ERROR [reactor-http-epoll-2|90] reactor.core.publisher.Operators:: Operator called default onErrorDropped reactor.netty.ReactorNetty$InternalNettyException: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 251658247, max: 255066112) Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException: Assembly trace from producer [reactor.core.publisher.FluxLift] : reactor.core.publisher.Flux.map(Flux.java:5998) reactor.netty.ReactorNetty.publisherOrScalarMap(ReactorNetty.java:468) Error has been observed at the following site(s): |_ Flux.map ⇢ at reactor.netty.ReactorNetty.publisherOrScalarMap(ReactorNetty.java:468) |_ Flux.doOnNext ⇢ at reactor.netty.ByteBufFlux.retain(ByteBufFlux.java:308) |_ Flux.map ⇢ at org.springframework.http.server.reactive.ReactorServerHttpRequest.getBody(ReactorServerHttpRequest.java:184) |_ Flux.collect ⇢ at org.springframework.core.io.buffer.DataBufferUtils.join(DataBufferUtils.java:563) |_ Mono.filter ⇢ at org.springframework.core.io.buffer.DataBufferUtils.join(DataBufferUtils.java:564) |_ Mono.map ⇢ at org.springframework.core.io.buffer.DataBufferUtils.join(DataBufferUtils.java:565) |_ Mono.defaultIfEmpty ⇢ at org.springframework.cloud.gateway.support.ServerWebExchangeUtils.cacheRequestBody(ServerWebExchangeUtils.java:365) |_ Mono.map ⇢ at org.springframework.cloud.gateway.support.ServerWebExchangeUtils.cacheRequestBody(ServerWebExchangeUtils.java:367) |_ Mono.switchIfEmpty ⇢ at org.springframework.cloud.gateway.support.ServerWebExchangeUtils.cacheRequestBody(ServerWebExchangeUtils.java:368) |_ Mono.flatMap ⇢ at org.springframework.cloud.gateway.support.ServerWebExchangeUtils.cacheRequestBody(ServerWebExchangeUtils.java:368) |_ Mono.defer ⇢ at org.springframework.cloud.gateway.handler.FilteringWebHandler$DefaultGatewayFilterChain.filter(FilteringWebHandler.java:113) |_ Mono.defer ⇢ at 
org.springframework.cloud.gateway.handler.FilteringWebHandler$DefaultGatewayFilterChain.filter(FilteringWebHandler.java:113) |_ Mono.defer ⇢ at org.springframework.cloud.gateway.handler.FilteringWebHandler$DefaultGatewayFilterChain.filter(FilteringWebHandler.java:113) |_ Mono.defer ⇢ at org.springframework.cloud.gateway.handler.FilteringWebHandler$DefaultGatewayFilterChain.filter(FilteringWebHandler.java:113) |_ Mono.defer ⇢ at org.springframework.cloud.gateway.handler.FilteringWebHandler$DefaultGatewayFilterChain.filter(FilteringWebHandler.java:113) |_ Mono.defer ⇢ at org.springframework.cloud.gateway.handler.FilteringWebHandler$DefaultGatewayFilterChain.filter(FilteringWebHandler.java:113) |_ Mono.defer ⇢ at org.springframework.cloud.gateway.handler.FilteringWebHandler$DefaultGatewayFilterChain.filter(FilteringWebHandler.java:113) |_ Mono.defer ⇢ at org.springframework.cloud.gateway.handler.FilteringWebHandler$DefaultGatewayFilterChain.filter(FilteringWebHandler.java:113) |_ Mono.defer ⇢ at org.springframework.cloud.gateway.handler.FilteringWebHandler$DefaultGatewayFilterChain.filter(FilteringWebHandler.java:113) |_ Mono.doFinally ⇢ at org.springframework.cloud.gateway.filter.RemoveCachedBodyFilter.filter(RemoveCachedBodyFilter.java:35) |_ Mono.defer ⇢ at org.springframework.cloud.gateway.handler.FilteringWebHandler$DefaultGatewayFilterChain.filter(FilteringWebHandler.java:113) |_ Mono.transform ⇢ at com.alibaba.csp.sentinel.adapter.gateway.sc.SentinelGatewayFilter.filter(SentinelGatewayFilter.java:83) |_ Mono.transform ⇢ at com.alibaba.csp.sentinel.adapter.gateway.sc.SentinelGatewayFilter.filter(SentinelGatewayFilter.java:93) |_ Mono.defer ⇢ at org.springframework.cloud.gateway.handler.FilteringWebHandler$DefaultGatewayFilterChain.filter(FilteringWebHandler.java:113) |_ Mono.defer ⇢ at org.springframework.cloud.gateway.handler.FilteringWebHandler$DefaultGatewayFilterChain.filter(FilteringWebHandler.java:113) |_ Mono.doFinally ⇢ at 
com.xxx.gateway.filter.global.LogFilter.filter(LogFilter.java:45) |_ Mono.defer ⇢ at org.springframework.cloud.gateway.handler.FilteringWebHandler$DefaultGatewayFilterChain.filter(FilteringWebHandler.java:113) |_ Mono.then ⇢ at org.springframework.web.reactive.result.SimpleHandlerAdapter.handle(SimpleHandlerAdapter.java:46) |_ Mono.flatMap ⇢ at org.springframework.web.reactive.DispatcherHandler.handle(DispatcherHandler.java:146) |_ Mono.flatMap ⇢ at org.springframework.web.reactive.DispatcherHandler.handle(DispatcherHandler.java:147) |_ Mono.defer ⇢ at org.springframework.web.server.handler.DefaultWebFilterChain.filter(DefaultWebFilterChain.java:119) |_ checkpoint ⇢ org.springframework.cloud.gateway.filter.WeightCalculatorWebFilter [DefaultWebFilterChain] |_ Mono.defer ⇢ at org.springframework.web.server.handler.DefaultWebFilterChain.filter(DefaultWebFilterChain.java:119) |_ Mono.doOnSuccess ⇢ at org.springframework.boot.actuate.metrics.web.reactive.server.MetricsWebFilter.filter(MetricsWebFilter.java:78) |_ Mono.doOnError ⇢ at org.springframework.boot.actuate.metrics.web.reactive.server.MetricsWebFilter.filter(MetricsWebFilter.java:79) |_ Mono.transformDeferred ⇢ at org.springframework.boot.actuate.metrics.web.reactive.server.MetricsWebFilter.filter(MetricsWebFilter.java:73) |_ checkpoint ⇢ org.springframework.boot.actuate.metrics.web.reactive.server.MetricsWebFilter [DefaultWebFilterChain] |_ Mono.defer ⇢ at org.springframework.web.server.handler.DefaultWebFilterChain.filter(DefaultWebFilterChain.java:119) |_ Mono.error ⇢ at org.springframework.web.server.handler.ExceptionHandlingWebHandler$CheckpointInsertingHandler.handle(ExceptionHandlingWebHandler.java:98) |_ checkpoint ⇢ HTTP POST "/api/v1/merchant-service/business/merchant/settle/apply-merchant-settle-of-flow2?sign=1ae8d289500bf920861caee9cfbde8ec" [ExceptionHandlingWebHandler] |_ Mono.onErrorResume ⇢ at 
org.springframework.web.server.handler.ExceptionHandlingWebHandler.handle(ExceptionHandlingWebHandler.java:77) |_ Mono.error ⇢ at com.alibaba.csp.sentinel.adapter.gateway.sc.exception.SentinelGatewayBlockExceptionHandler.handle(SentinelGatewayBlockExceptionHandler.java:57) |_ Mono.onErrorResume ⇢ at org.springframework.web.server.handler.ExceptionHandlingWebHandler.handle(ExceptionHandlingWebHandler.java:77)

AdaptCachedBodyGlobalFilter

AdaptCachedBodyGlobalFilter is spring-cloud-gateway's mechanism for caching the request body. An HTTP request body can only be read once, which breaks features like the gateway's retry mechanism; the body must first be cached so that every subsequent read is served from the cache. In fact, the body-caching logic in AdaptCachedBodyGlobalFilter only takes effect on routes that have RetryGatewayFilter enabled:
[image: the relevant AdaptCachedBodyGlobalFilter source code]
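The read-once constraint and the caching fix can be sketched with a plain InputStream — the gateway's actual mechanics use reactive DataBuffers, but the principle is the same, and the class below is mine, not the framework's:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class BodyCacheDemo {
    /** An HTTP body behaves like a one-shot stream: a second read sees nothing. */
    public static byte[] readOnce(InputStream body) {
        try {
            return body.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Caching the bytes first lets every later consumer (a retry, a signature
     * check, ...) get a fresh copy. The catch: the whole body must fit in
     * memory, which is exactly what went wrong with a 700 MB upload.
     */
    public static InputStream replayable(byte[] cached) {
        return new ByteArrayInputStream(cached);
    }
}
```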
So would removing RetryGatewayFilter from the affected route fix the problem? We tried it, and the out-of-memory exception was indeed gone, exactly as expected.
But which step inside this filter is actually the expensive one? The durations arthas reported were tiny, which presumably comes down to reactor's asynchronous, non-blocking execution model.
trace org.springframework.cloud.gateway.filter.AdaptCachedBodyGlobalFilter filter -n 5 --skipJDKMethod false
`---ts=2023-11-09 17:33:25;thread_name=reactor-http-epoll-2;id=5a;is_daemon=true;priority=5;TCCL=org.springframework.boot.loader.LaunchedURLClassLoader@2aae9190
    `---[1.83534ms] org.springframework.cloud.gateway.filter.AdaptCachedBodyGlobalFilter:filter()
        +---[1.17% 0.021501ms ] org.springframework.web.server.ServerWebExchange:getAttributeOrDefault() #59
        +---[0.25% 0.004516ms ] org.springframework.web.server.ServerWebExchange:getAttributeOrDefault() #66
        +---[0.34% 0.006174ms ] org.springframework.web.server.ServerWebExchange:getAttribute() #67
        +---[0.36% 0.006646ms ] org.springframework.cloud.gateway.route.Route:getId() #69
        +---[0.74% 0.013551ms ] java.util.concurrent.ConcurrentMap:containsKey() #69
        `---[12.78% 0.234511ms ] org.springframework.cloud.gateway.support.ServerWebExchangeUtils:cacheRequestBody() #73
Digging further into the code, we located the body-caching logic:
[image: the body-caching logic in the source code]
The section highlighted in the screenshot reads every buffer of the body and joins them into one big DataBuffer to serve as the cache. Let's use arthas's monitor command to watch those calls in action:
timestamp           | class                                                    | method | total | success | fail | avg-rt(ms) | fail-%
--------------------+----------------------------------------------------------+--------+-------+---------+------+------------+-------
2023-11-09 17:40:41 | org.springframework.core.io.buffer.LimitedDataBufferList | add    | 3247  | 3247    | 0    | 0.00       | 0.00
2023-11-09 17:40:51 | org.springframework.core.io.buffer.LimitedDataBufferList | add    | 5293  | 5293    | 0    | 0.00       | 0.00
2023-11-09 17:41:01 | org.springframework.core.io.buffer.LimitedDataBufferList | add    | 5339  | 5339    | 0    | 0.00       | 0.00
2023-11-09 17:41:02 | org.springframework.core.io.buffer.LimitedDataBufferList | add    | 576   | 576     | 0    | 0.02       | 0.00
2023-11-09 17:41:21 | org.springframework.core.io.buffer.LimitedDataBufferList | add    | 0     | 0       | 0    | 0.00       | 0.00
Seeing that many invocations, it felt like we had finally found the expensive code. Because the request body is huge and each read yields only a limited amount of data, consuming it takes a great many reads. Moreover, the pace here largely depends on the client's upload speed: only once data is available can the epoll thread step in and read it. That also explains why, when I reproduced this locally, responses came back quickly whether or not the exception was thrown.
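The behavior the monitor captured can be reproduced in miniature: chunks arrive one at a time, each appended via add, and the call count is roughly body size divided by chunk size. A plain-Java analogue of the accumulate-with-limit behavior (my own class, loosely modeled on LimitedDataBufferList, which raises DataBufferLimitException in the real framework):

```java
import java.util.ArrayList;
import java.util.List;

public class LimitedChunkList {
    private final List<byte[]> chunks = new ArrayList<>();
    private final long maxBytes;
    private long total = 0;

    public LimitedChunkList(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Appends one chunk, failing once the accumulated size exceeds the limit. */
    public void add(byte[] chunk) {
        total += chunk.length;
        if (maxBytes >= 0 && total > maxBytes) {
            throw new IllegalStateException("Exceeded limit on max bytes: " + maxBytes);
        }
        chunks.add(chunk);
    }

    public int chunkCount() {
        return chunks.size();
    }
}
```

With a 700 MB body and small network-sized chunks, add gets called thousands of times — matching the counts in the monitor output.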
At this point, the picture was clear:
  1. Enabling RetryGatewayFilter indirectly enables the body-caching logic of AdaptCachedBodyGlobalFilter, and an oversized body can then overflow memory (direct buffers, by default). I don't yet fully understand the allocation mechanics here, but the usage doesn't appear to map one-to-one to the file size: judging by the SLB log, only 100+ MB had been read when the error occurred.
  2. The total request latency is largely determined by the client's upload speed; but thanks to reactor's asynchronous non-blocking I/O, no thread is blocked, and the epoll threads can interleave other requests in between.
    [image]
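On the first point: the "used" figure in the error grows in allocator-chunk increments rather than byte-for-byte with the upload, which is one reason it need not match the file size exactly. A simplified model of the arithmetic, using netty's default 16 MiB chunk size (the real pooled allocator is more involved than this rounding rule):

```java
public class DirectMemoryMath {
    // netty's default pooled-arena chunk size: 16 MiB
    static final long CHUNK = 16L * 1024 * 1024;

    /**
     * Simplified model: direct memory is reserved a whole chunk at a time,
     * so the counter reflects the number of chunks touched, rounded up.
     */
    public static long reservedFor(long bytes) {
        return ((bytes + CHUNK - 1) / CHUNK) * CHUNK;
    }
}
```

This is also why the "failed to allocate 16777216 byte(s)" messages in the traces are always asking for exactly one 16 MiB chunk.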

Another Failure Scenario

That fully explains the problem this article opened with. On closer inspection, though, we found a second class of exception: its message contains no checkpoint for the requested endpoint, and the endpoint involved has no file-upload logic at all. Instead, the overflow happens after spring-cloud-gateway has called the downstream service and the response has come back, at the moment the gateway reads the response data.
[id: 0x0fa1ae24, L:/172.17.0.13:40588 - R:10.20.1.159/10.20.1.159:24051] Error was received while reading the incoming data. The connection will be closed. io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 1006632967, max: 1020067840) at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:754) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:709) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:645) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:621) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:204) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:188) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.buffer.PoolArena.allocate(PoolArena.java:138) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.buffer.PoolArena.allocate(PoolArena.java:128) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:378) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:115) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at org.springframework.core.io.buffer.NettyDataBufferFactory.allocateBuffer(NettyDataBufferFactory.java:71) ~[spring-core-5.2.9.RELEASE.jar!/:5.2.9.RELEASE] at 
org.springframework.core.io.buffer.NettyDataBufferFactory.allocateBuffer(NettyDataBufferFactory.java:39) ~[spring-core-5.2.9.RELEASE.jar!/:5.2.9.RELEASE] at org.springframework.core.codec.CharSequenceEncoder.encodeValue(CharSequenceEncoder.java:91) ~[spring-core-5.2.9.RELEASE.jar!/:5.2.9.RELEASE] at org.springframework.core.codec.CharSequenceEncoder.lambda$encode$0(CharSequenceEncoder.java:75) ~[spring-core-5.2.9.RELEASE.jar!/:5.2.9.RELEASE] at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:100) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at com.xxx.gateway.config.MdcContextLifter.onNext(MdcContextLifterConfiguration.java:62) ~[classes!/:1.0.0-SNAPSHOT] at com.xxx.gateway.config.MdcContextLifter.onNext(MdcContextLifterConfiguration.java:62) ~[classes!/:1.0.0-SNAPSHOT] at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:67) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at com.xxx.gateway.config.MdcContextLifter.onNext(MdcContextLifterConfiguration.java:62) ~[classes!/:1.0.0-SNAPSHOT] at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1782) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:241) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at com.xxx.gateway.config.MdcContextLifter.onNext(MdcContextLifterConfiguration.java:62) ~[classes!/:1.0.0-SNAPSHOT] at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2344) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at reactor.core.publisher.MonoFlatMap$FlatMapInner.onSubscribe(MonoFlatMap.java:230) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at com.xxx.gateway.config.MdcContextLifter.onSubscribe(MdcContextLifterConfiguration.java:50) ~[classes!/:1.0.0-SNAPSHOT] at reactor.core.publisher.MonoJust.subscribe(MonoJust.java:54) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at 
reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:150) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at reactor.core.publisher.FluxOnAssembly$OnAssemblySubscriber.onNext(FluxOnAssembly.java:385) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at com.xxx.gateway.config.MdcContextLifter.onNext(MdcContextLifterConfiguration.java:62) ~[classes!/:1.0.0-SNAPSHOT] at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:127) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:121) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at reactor.core.publisher.FluxContextStart$ContextStartSubscriber.onNext(FluxContextStart.java:96) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at com.xxx.gateway.config.MdcContextLifter.onNext(MdcContextLifterConfiguration.java:62) ~[classes!/:1.0.0-SNAPSHOT] at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:127) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:121) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at com.xxx.gateway.config.MdcContextLifter.onNext(MdcContextLifterConfiguration.java:62) ~[classes!/:1.0.0-SNAPSHOT] at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:127) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at reactor.core.publisher.FluxFilterFuseable$FilterFuseableSubscriber.onNext(FluxFilterFuseable.java:112) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at com.xxx.gateway.config.MdcContextLifter.onNext(MdcContextLifterConfiguration.java:62) ~[classes!/:1.0.0-SNAPSHOT] at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:127) 
~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1782) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at reactor.core.publisher.MonoCollect$CollectSubscriber.onComplete(MonoCollect.java:152) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at com.xxx.gateway.config.MdcContextLifter.onComplete(MdcContextLifterConfiguration.java:72) ~[classes!/:1.0.0-SNAPSHOT] at reactor.core.publisher.FluxMap$MapSubscriber.onComplete(FluxMap.java:136) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at com.xxx.gateway.config.MdcContextLifter.onComplete(MdcContextLifterConfiguration.java:72) ~[classes!/:1.0.0-SNAPSHOT] at reactor.core.publisher.FluxPeek$PeekSubscriber.onComplete(FluxPeek.java:252) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at com.xxx.gateway.config.MdcContextLifter.onComplete(MdcContextLifterConfiguration.java:72) ~[classes!/:1.0.0-SNAPSHOT] at reactor.core.publisher.FluxMap$MapSubscriber.onComplete(FluxMap.java:136) ~[reactor-core-3.3.10.RELEASE.jar!/:3.3.10.RELEASE] at reactor.netty.channel.FluxReceive.onInboundComplete(FluxReceive.java:374) ~[reactor-netty-0.9.12.RELEASE.jar!/:0.9.12.RELEASE] at reactor.netty.channel.ChannelOperations.onInboundComplete(ChannelOperations.java:373) ~[reactor-netty-0.9.12.RELEASE.jar!/:0.9.12.RELEASE] at reactor.netty.channel.ChannelOperations.terminate(ChannelOperations.java:429) ~[reactor-netty-0.9.12.RELEASE.jar!/:0.9.12.RELEASE] at reactor.netty.http.client.HttpClientOperations.onInboundNext(HttpClientOperations.java:655) ~[reactor-netty-0.9.12.RELEASE.jar!/:0.9.12.RELEASE] at reactor.netty.channel.ChannelOperationsHandler.channelRead(ChannelOperationsHandler.java:96) ~[reactor-netty-0.9.12.RELEASE.jar!/:0.9.12.RELEASE] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at 
io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Using the traceId on this exception, we found the corresponding request log:
notion image
The request spent close to 60s inside spring-cloud-gateway. It looks like once the out-of-memory error was thrown, the request flow simply stopped, and the response was only produced when the slb's 60s timeout kicked in. (Every such request we sampled ended at exactly 60s, which supports this.)
notion image
notion image
notion image
One additional note: spring-cloud-gateway normally does not read and buffer the downstream response either — it streams it straight back to the client. In our case, however, we may need to rewrite the response, which is why the response body gets read into memory.

Out of memory?

This made me wonder: was the memory consumed by the earlier file uploads never released? The stack trace confirms that suspicion:
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 1006632967, max: 1020067840)
    at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:754) ~[netty-all-4.1.52.Final.jar!/:4.1.52.Final] at
Let's look at the source code behind this part of the stack trace:
notion image
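To make the mechanism concrete, here is a simplified re-implementation of that accounting logic — a sketch for illustration only, not netty's exact source (the real `PlatformDependent` throws `io.netty.util.internal.OutOfDirectMemoryError` and handles cases such as a disabled limit). Every direct allocation CAS-increments a global counter, and the allocation fails once the limit would be exceeded, which is exactly the error we hit:

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified sketch of netty's direct-memory accounting
// (io.netty.util.internal.PlatformDependent in 4.1.x works along these lines).
public class DirectMemoryCounterSketch {

    static final AtomicLong DIRECT_MEMORY_COUNTER = new AtomicLong();
    // The limit from our error message (~972 MB); netty derives the real value
    // from the configured max direct memory.
    static final long DIRECT_MEMORY_LIMIT = 1020067840L;

    static void incrementMemoryCounter(int capacity) {
        for (;;) {
            long used = DIRECT_MEMORY_COUNTER.get();
            long newUsed = used + capacity;
            if (newUsed > DIRECT_MEMORY_LIMIT) {
                // netty throws OutOfDirectMemoryError here; plain OutOfMemoryError in this sketch
                throw new OutOfMemoryError("failed to allocate " + capacity
                        + " byte(s) of direct memory (used: " + used
                        + ", max: " + DIRECT_MEMORY_LIMIT + ")");
            }
            // CAS so that concurrent allocations never over-count
            if (DIRECT_MEMORY_COUNTER.compareAndSet(used, newUsed)) {
                return;
            }
        }
    }

    static void decrementMemoryCounter(int capacity) {
        DIRECT_MEMORY_COUNTER.addAndGet(-capacity);
    }
}
```

The important consequence: if allocations go through `incrementMemoryCounter` but the matching `decrementMemoryCounter` never runs — which is what we observed under the pooled allocator below — the counter only ever grows, until every further allocation fails.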

Monitoring off-heap memory usage

netty maintains an internal counter of the direct memory currently allocated and throws once the limit is exceeded. So why not monitor that counter and watch how it changes?
import java.lang.reflect.Field;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import javax.annotation.PostConstruct;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import io.netty.util.internal.PlatformDependent;

public class NettyNativeMemoryMonitor {

    private static final Logger log = LoggerFactory.getLogger(NettyNativeMemoryMonitor.class);

    private static final int KB = 1024;

    private static final ScheduledExecutorService memoryTimer = Executors.newSingleThreadScheduledExecutor();

    @PostConstruct
    public void init() {
        memoryTimer.scheduleAtFixedRate(() -> {
            try {
                // Read netty's internal direct-memory counter and limit via reflection
                Field field = PlatformDependent.class.getDeclaredField("DIRECT_MEMORY_COUNTER");
                field.setAccessible(true);
                AtomicLong counter = (AtomicLong) field.get(null);
                field = PlatformDependent.class.getDeclaredField("DIRECT_MEMORY_LIMIT");
                field.setAccessible(true);
                long limit = (long) field.get(null);
                log.info("Used {} KB native memory, limit is {} KB", counter.get() / KB, limit / KB);
            } catch (Exception e) {
                log.error("Error on monitoring netty native memory", e);
            }
        }, 0, 1, TimeUnit.SECONDS);
    }
}
We first ran this monitor under the default pooled allocator. In this mode the memory was never released: method monitoring showed no calls to io.netty.util.internal.PlatformDependent#decrementMemoryCounter at all, so the counter just kept growing:
2023-11-09 21:21:03,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 0 KB native memory, limit is 249088 KB
2023-11-09 21:21:04,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 16384 KB native memory, limit is 249088 KB
2023-11-09 21:21:05,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 16384 KB native memory, limit is 249088 KB
2023-11-09 21:21:06,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 16384 KB native memory, limit is 249088 KB
2023-11-09 21:21:07,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 32768 KB native memory, limit is 249088 KB
2023-11-09 21:21:08,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 49152 KB native memory, limit is 249088 KB
2023-11-09 21:21:09,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 49152 KB native memory, limit is 249088 KB
2023-11-09 21:21:10,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 65536 KB native memory, limit is 249088 KB
2023-11-09 21:21:11,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 65536 KB native memory, limit is 249088 KB
2023-11-09 21:21:12,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 81920 KB native memory, limit is 249088 KB
2023-11-09 21:21:13,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 98304 KB native memory, limit is 249088 KB
2023-11-09 21:21:14,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 98304 KB native memory, limit is 249088 KB
2023-11-09 21:21:15,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 114688 KB native memory, limit is 249088 KB
2023-11-09 21:21:16,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 114688 KB native memory, limit is 249088 KB
2023-11-09 21:21:17,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 131072 KB native memory, limit is 249088 KB
2023-11-09 21:21:18,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 147456 KB native memory, limit is 249088 KB
2023-11-09 21:21:19,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 147456 KB native memory, limit is 249088 KB
2023-11-09 21:21:20,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 147456 KB native memory, limit is 249088 KB
2023-11-09 21:21:21,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 163840 KB native memory, limit is 249088 KB
2023-11-09 21:21:22,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 180224 KB native memory, limit is 249088 KB
2023-11-09 21:21:23,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 180224 KB native memory, limit is 249088 KB
2023-11-09 21:21:24,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 196608 KB native memory, limit is 249088 KB
2023-11-09 21:21:25,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 212992 KB native memory, limit is 249088 KB
2023-11-09 21:21:26,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 212992 KB native memory, limit is 249088 KB
2023-11-09 21:21:27,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 229376 KB native memory, limit is 249088 KB
2023-11-09 21:21:28,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 229376 KB native memory, limit is 249088 KB
2023-11-09 21:21:29,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 245760 KB native memory, limit is 249088 KB
2023-11-09 21:21:30,960 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 245760 KB native memory, limit is 249088 KB
2023-11-09 21:21:31,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 245760 KB native memory, limit is 249088 KB
2023-11-09 21:21:32,959 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 245760 KB native memory, limit is 249088 KB

Unpooled memory allocation

We then tried the unpooled allocator: -Dio.netty.allocator.type=unpooled
monitor -E io.netty.util.internal.PlatformDependent incrementMemoryCounter|decrementMemoryCounter -n 10 --cycle 10
notion image
This time there are both increment and decrement calls, in roughly equal numbers:
2023-11-09 21:40:13,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 0 KB native memory, limit is 249088 KB
2023-11-09 21:40:14,266 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 1 KB native memory, limit is 249088 KB
2023-11-09 21:40:15,296 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 4953 KB native memory, limit is 249088 KB
2023-11-09 21:40:16,295 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 9051 KB native memory, limit is 249088 KB
2023-11-09 21:40:17,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 12723 KB native memory, limit is 249088 KB
2023-11-09 21:40:18,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 23605 KB native memory, limit is 249088 KB
2023-11-09 21:40:19,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 36697 KB native memory, limit is 249088 KB
2023-11-09 21:40:20,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 45051 KB native memory, limit is 249088 KB
2023-11-09 21:40:21,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 57127 KB native memory, limit is 249088 KB
2023-11-09 21:40:22,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 61125 KB native memory, limit is 249088 KB
2023-11-09 21:40:23,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 62433 KB native memory, limit is 249088 KB
2023-11-09 21:40:24,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 70039 KB native memory, limit is 249088 KB
2023-11-09 21:40:25,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 82069 KB native memory, limit is 249088 KB
2023-11-09 21:40:26,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 96875 KB native memory, limit is 249088 KB
2023-11-09 21:40:27,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 106189 KB native memory, limit is 249088 KB
2023-11-09 21:40:28,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 116443 KB native memory, limit is 249088 KB
2023-11-09 21:40:29,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 126797 KB native memory, limit is 249088 KB
2023-11-09 21:40:30,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 137917 KB native memory, limit is 249088 KB
2023-11-09 21:40:31,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 146989 KB native memory, limit is 249088 KB
2023-11-09 21:40:32,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 157161 KB native memory, limit is 249088 KB
2023-11-09 21:40:33,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 167721 KB native memory, limit is 249088 KB
2023-11-09 21:40:34,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 178769 KB native memory, limit is 249088 KB
2023-11-09 21:40:35,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 188941 KB native memory, limit is 249088 KB
2023-11-09 21:40:36,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 195613 KB native memory, limit is 249088 KB
2023-11-09 21:40:37,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 200203 KB native memory, limit is 249088 KB
2023-11-09 21:40:38,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 203459 KB native memory, limit is 249088 KB
2023-11-09 21:40:39,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 213707 KB native memory, limit is 249088 KB
2023-11-09 21:40:40,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 226007 KB native memory, limit is 249088 KB
2023-11-09 21:40:41,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 241151 KB native memory, limit is 249088 KB
2023-11-09 21:40:42,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 0 KB native memory, limit is 249088 KB
2023-11-09 21:40:43,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 0 KB native memory, limit is 249088 KB
2023-11-09 21:40:44,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 0 KB native memory, limit is 249088 KB
2023-11-09 21:40:45,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 0 KB native memory, limit is 249088 KB
2023-11-09 21:40:46,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 0 KB native memory, limit is 249088 KB
2023-11-09 21:40:47,265 INFO [pool-9-thread-1|62] com.xxx.gateway.config.NettyNativeMemoryMonitor:: Used 0 KB native memory, limit is 249088 KB
The same out-of-memory error was still thrown during the upload (that part seems unavoidable), but once the request ended all the memory was reclaimed. Others have reported this problem on GitHub, and it has not been resolved yet; the common workaround everyone uses is the unpooled allocation mode:
Netty Native memory increases continuously when using AdaptCachedBodyGlobalFilter
Updated Apr 13, 2023
Alibaba's EDAS enhancements for spring-cloud-gateway mention the same issue:
Memory leak: this issue comes from CSB's production experience. Spring Cloud Gateway relies on netty for IO. Anyone familiar with netty knows its read/write buffer design: small payloads generally hit the chunked buffer, while large ones — file uploads, for example — trigger fresh memory allocation. Spring Cloud Gateway has a logic flaw in how it integrates with netty that prevents this newly allocated pooled memory from being fully recycled, causing an off-heap memory leak. Worse, this off-heap memory is allocated by netty itself via unsafe, so it cannot be observed with the usual JVM tools — very sneaky.
EDAS recommends adding the startup flag -Dio.netty.allocator.type=unpooled to Spring Cloud Gateway applications, so that the temporary memory allocated when a request misses the chunked buffer is not pooled, sidestepping the leak. The flag does not degrade performance: only large payloads trigger this allocation path, and the best practice for a gateway is not to accept file uploads anyway — the flag is a safety net for a non-mainstream scenario.
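Following that advice, the flag simply goes into the gateway's JVM startup arguments; a sketch of the launch command (the jar name and heap sizes here are placeholders, not our real deployment):

```shell
java -Xms2g -Xmx2g \
     -Dio.netty.allocator.type=unpooled \
     -jar gateway.jar
```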

Optimizing how we use the gateway

The real root cause this time is that we had enabled retry in spring-cloud-gateway, which forces every request body to be cached so it can be replayed — and with a large enough body that can exhaust memory. Disabling retry would be one fix, but gateway-level retries are generally worth keeping (for example, to ride out downstream service restarts), so the usage optimizations below all assume retry stays on.
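For reference, the retry in question is spring-cloud-gateway's built-in Retry filter; enabling it on a route looks roughly like this (the route id, URI, and path are made-up placeholders, not our real config). It is this filter that requires the request body to be cached — via AdaptCachedBodyGlobalFilter — so the body can be replayed on a retry:

```yaml
spring:
  cloud:
    gateway:
      routes:
        - id: merchant-route          # hypothetical route
          uri: lb://merchant-service  # hypothetical downstream service
          predicates:
            - Path=/merchant/**
          filters:
            - name: Retry
              args:
                retries: 2
                series: SERVER_ERROR  # retry only on 5xx responses
```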
So, since the trouble only starts when the body is large, can we special-case that scenario? Two solutions have been proposed on GitHub:
  1. Add a configurable threshold for how much body may be cached in memory, spilling to disk beyond it
  2. Simply refuse to retry requests whose body is too large
Conditional Caching in ServerWebExchangeUtils.cacheRequestBody
Updated Oct 11, 2021
Option 1 looks worrying performance-wise; option 2 seems more sensible. But we decided to be more aggressive still: we added a global request-size-limit filter, currently set to 1 MB based on our production traffic — outside of file-upload requests, normal request bodies never come anywhere near that.
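The heart of such a global size-limit filter is a Content-Length check that runs before any body bytes are buffered — the same check RequestSizeGatewayFilterFactory performs, just applied globally and early. A minimal sketch of that decision logic in plain Java (the surrounding Spring wiring is omitted; the 1 MB threshold matches what we chose):

```java
// Decides whether a request should be rejected up front, based solely on the
// declared Content-Length header -- before any body bytes are buffered.
public class RequestSizeCheck {

    static final long MAX_BODY_BYTES = 1024 * 1024; // 1 MB global limit

    /**
     * @param contentLengthHeader raw Content-Length header value, or null if absent
     * @return true if the request should be rejected with 413 Payload Too Large
     */
    static boolean shouldReject(String contentLengthHeader) {
        if (contentLengthHeader == null) {
            return false; // no declared length (e.g. chunked); cannot pre-check
        }
        try {
            return Long.parseLong(contentLengthHeader.trim()) > MAX_BODY_BYTES;
        } catch (NumberFormatException e) {
            return false; // malformed header: let later stages deal with it
        }
    }
}
```

Note that requests without a Content-Length (e.g. chunked transfer encoding) slip past this check, so it is a first line of defense, not a complete guarantee.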
To sum up, this incident led us to the following changes in how we use spring-cloud-gateway:
  1. Add the startup flag -Dio.netty.allocator.type=unpooled
  2. Add a global request-size-limit filter
  3. Stop accepting file uploads directly through the web server

About the earlier arthas trace showing very short timings

As explained earlier, this is caused by reactor's asynchronous, non-blocking model. In reactor, everything actually happens at the moment of subscribe; everything before that is just pipeline definition, not execution. Here is a small example to drive the point home:
import java.io.IOException;

import reactor.core.publisher.Flux;

public class ArthasTest {

    public static void main(String[] args) throws IOException {
        System.in.read(); // wait for arthas to attach before running
        new ArthasTest().test();
        try {
            Thread.sleep(1000000);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public void test() {
        Flux.range(0, 100).map(a -> {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            System.out.println(a);
            return a;
        }).subscribe();
    }
}
[arthas@96585]$ trace ArthasTest test -n 5 --skipJDKMethod false
Press Q or Ctrl+C to abort.
Affect(class count: 1 , method count: 1) cost in 182 ms, listenerId: 1
`---ts=2023-11-24 15:14:30;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@18b4aac2
    `---[100638.632167ms] ArthasTest:test()
        +---[0.19% 187.387167ms ] reactor.core.publisher.Flux:range() #19
        +---[0.00% 0.165208ms ] reactor.core.publisher.Flux:map() #19
        `---[99.81% 100449.607458ms ] reactor.core.publisher.Flux:subscribe() #27

Assorted notes from the process

  • Remote JVM monitoring with JVisualVM/JConsole
  • The reactor programming model
  • Some analysis of JVM memory usage
Each of these took quite some time to dig into; they may become standalone articles later.

References

  1. Netty之有效规避内存泄漏 https://www.cnblogs.com/549294286/p/5168454.html
  2. Native Memory Tracking 详解(1):基础介绍 https://mp.weixin.qq.com/s?__biz=MzkyNTMwMjI2Mw==&mid=2247489204&idx=1&sn=c3a9ac4f291543b3504cd1bc1adf0bdb
  3. 聊聊HotSpot VM的Native Memory Tracking https://cloud.tencent.com/developer/article/1406522
  4. 全网最硬核 JVM 内存解析 - 1.从 Native Memory Tracking 说起 https://juejin.cn/post/7225871227743043644
  5. Netty堆外内存泄露排查盛宴 https://tech.meituan.com/2018/10/18/netty-direct-memory-screening.html
  6. Spring Boot引起的“堆外内存泄漏”排查及经验总结 https://tech.meituan.com/2019/01/03/spring-boot-native-memory-leak.html
  7. Spring Cloud Gateway之踩坑日记 https://blog.csdn.net/manzhizhen/article/details/115386684
  8. 今咱们来聊聊JVM 堆外内存泄露的BUG是如何查找的 https://cloud.tencent.com/developer/article/1129904
  9. 一次完整的JVM堆外内存泄漏故障排查记录 https://heapdump.cn/article/1821465
  10. 一次压缩引发堆外内存过高的教训 https://cloud.tencent.com/developer/article/1700083
  11. 分析 Java 内存泄漏 https://chenhm.com/post/2018-12-05-debuging-java-memory-leak
  12. jvisualvm远程连接的三种方式 https://blog.csdn.net/BushQiang/article/details/114709682
  13. EDAS 让 Spring Cloud Gateway 生产可用的二三策 https://www.cnkirito.moe/edas-scg-agent/