
Graceful Shutdown (Part 1)

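Graceful shutdown means terminating an application in a controlled way so that it can finish its in-flight work, release resources, and preserve data integrity. In-flight work includes requests currently being processed, scheduled jobs that are running, and messages being consumed. A restart or a release should not be noticeable from the outside every time it happens; the goal of graceful shutdown is that the outside world (the callers of your service) notices nothing.
Because of its length, the article is split into two parts:
  • Part 1 lays out the background knowledge and covers the graceful shutdown implementation of Spring Boot's embedded web server
  • Part 2 covers the graceful shutdown implementations of several common middleware components (Dubbo, RocketMQ)
PS: unless stated otherwise, all code in this article is based on spring-boot 2.3.4.RELEASE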

Shutdown Hook

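The shutdown hook is the cornerstone of graceful shutdown: it is the callback mechanism through which the JVM reacts to shutdown/kill signals. We implement graceful shutdown by registering application-level shutdown hooks.
A shutdown hook covers three shutdown scenarios:
  1. The code exits on its own via Runtime.exit() (which System.exit() delegates to)
  2. The process is terminated with the kill command (SIGTERM by default)
  3. All user (non-daemon) threads have finished
The one everyday case it cannot handle is the forced kill -9 (SIGKILL), so avoid kill -9 whenever possible.
Once the shutdown signal is received, the application-level shutdown hooks run concurrently and their order cannot be controlled. If several hooks depend on one another, the only option is to put them into a single shutdown hook and enforce the order inside it.
Applications built on a Spring container, for example, usually do this through the extension points inside Spring's own shutdown hook (destroy callbacks, ContextClosedEvent listeners, and so on).
If you want a deeper understanding of the shutdown hook mechanism, I recommend the ShutdownHook article listed in the references.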
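To make the "one hook, ordered internally" idea concrete, here is a minimal JDK-only sketch. The class name OrderedShutdown and its addStep/register/runAll methods are hypothetical names invented for this illustration; only Runtime.getRuntime().addShutdownHook is a real JDK API. It is a sketch of the approach, not how Spring implements its hook.

```java
import java.util.ArrayList;
import java.util.List;

/** Runs shutdown steps sequentially inside ONE JVM shutdown hook (illustrative sketch). */
final class OrderedShutdown {

    /** A named step; the name is only used when reporting a failure. */
    private static final class Step {
        final String name;
        final Runnable action;

        Step(String name, Runnable action) {
            this.name = name;
            this.action = action;
        }
    }

    private final List<Step> steps = new ArrayList<>();

    /** Steps run in the order in which they were added. */
    synchronized OrderedShutdown addStep(String name, Runnable action) {
        steps.add(new Step(name, action));
        return this;
    }

    /** Registers a single hook thread that executes every step in order. */
    void register() {
        Runtime.getRuntime().addShutdownHook(new Thread(this::runAll, "ordered-shutdown"));
    }

    /** Executes the steps; a failing step does not prevent the later ones from running. */
    synchronized void runAll() {
        for (Step step : steps) {
            try {
                step.action.run();
            } catch (RuntimeException ex) {
                System.err.println("shutdown step '" + step.name + "' failed: " + ex);
            }
        }
    }

    public static void main(String[] args) {
        new OrderedShutdown()
                .addStep("deregister", () -> System.out.println("leave the registry"))
                .addStep("close-port", () -> System.out.println("stop accepting requests"))
                .addStep("release", () -> System.out.println("release resources"))
                .register();
        System.out.println("main finished, the JVM begins its shutdown sequence");
    }
}
```

Because all steps live in one hook thread, their relative order is deterministic, whereas separately registered hooks would be started in an unspecified order.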

SpringShutdownHook

Spring Boot registers an application-level shutdown hook during startup. The method that does this is defined on the ConfigurableApplicationContext interface and can also be called explicitly, but a Spring context registers at most one shutdown hook.
// org.springframework.context.support.AbstractApplicationContext#registerShutdownHook
public void registerShutdownHook() {
    if (this.shutdownHook == null) {
        // No shutdown hook registered yet.
        this.shutdownHook = new Thread(SHUTDOWN_HOOK_THREAD_NAME) {
            @Override
            public void run() {
                synchronized (startupShutdownMonitor) {
                    doClose();
                }
            }
        };
        Runtime.getRuntime().addShutdownHook(this.shutdownHook);
    }
}
The core flow lives in the doClose() method:
protected void doClose() {
    // Check whether an actual close attempt is necessary...
    if (this.active.get() && this.closed.compareAndSet(false, true)) {
        if (logger.isDebugEnabled()) {
            logger.debug("Closing " + this);
        }

        LiveBeansView.unregisterApplicationContext(this);

        // Step 1: beans listening for ContextClosedEvent are called back first
        try {
            publishEvent(new ContextClosedEvent(this));
        } catch (Throwable ex) {
            logger.warn("Exception thrown from ApplicationListener handling ContextClosedEvent", ex);
        }

        // Step 2: stop the beans that implement the Lifecycle interface
        if (this.lifecycleProcessor != null) {
            try {
                this.lifecycleProcessor.onClose();
            } catch (Throwable ex) {
                logger.warn("Exception thrown from LifecycleProcessor on context close", ex);
            }
        }

        // Step 3: invoke the beans' destroy callbacks (@PreDestroy, DisposableBean, etc.)
        destroyBeans();

        // Step 4: close the beanFactory itself
        closeBeanFactory();

        // Step 5: callback left for subclasses to implement
        onClose();

        // Reset local application listeners to pre-refresh state.
        if (this.earlyApplicationListeners != null) {
            this.applicationListeners.clear();
            this.applicationListeners.addAll(this.earlyApplicationListeners);
        }

        // Switch to inactive.
        this.active.set(false);
    }
}
The shutdown flow is fairly clear. To summarize:
  1. First, ContextClosedEvent is broadcast, and every bean listening for ContextClosedEvent receives its callback
  2. Second, the beans that implement the Lifecycle interface are stopped. Lifecycle beans can also declare different phases; the Lifecycle shutdown flow is analyzed in depth below
  3. Third, the singleton beans are destroyed. The scope may be wider than you expect: besides DisposableBean it also covers beans that implement AutoCloseable, as well as beans whose destroy method is left to be inferred (no explicit destroy-method) and that have a close or shutdown method
     class DisposableBeanAdapter implements DisposableBean, Runnable, Serializable {

         public static boolean hasDestroyMethod(Object bean, RootBeanDefinition beanDefinition) {
             if (bean instanceof DisposableBean || bean instanceof AutoCloseable) {
                 return true;
             }
             String destroyMethodName = beanDefinition.getDestroyMethodName();
             if (AbstractBeanDefinition.INFER_METHOD.equals(destroyMethodName)) {
                 return (ClassUtils.hasMethod(bean.getClass(), CLOSE_METHOD_NAME) ||
                         ClassUtils.hasMethod(bean.getClass(), SHUTDOWN_METHOD_NAME));
             }
             return StringUtils.hasLength(destroyMethodName);
         }
     }
  4. The container (the bean factory) itself is closed
  5. An extension point (onClose) is offered to subclasses

SmartLifecycle

This section focuses on the second step of Spring's shutdown flow, the stopping of Lifecycle beans. The graceful shutdown of Spring Boot's embedded WebServer, covered later, is built on SmartLifecycle. If you would rather skip the dry code walkthrough, jump to the written summary at the end of this section.
public class DefaultLifecycleProcessor implements LifecycleProcessor, BeanFactoryAware {

    // Stop all Lifecycle beans
    private void stopBeans() {
        // Collect every Lifecycle bean in the container
        Map<String, Lifecycle> lifecycleBeans = getLifecycleBeans();
        // Group the beans by phase; each phase maps to one LifecycleGroup (a set of Lifecycle beans)
        Map<Integer, LifecycleGroup> phases = new HashMap<>();
        lifecycleBeans.forEach((beanName, bean) -> {
            int shutdownPhase = getPhase(bean);
            LifecycleGroup group = phases.get(shutdownPhase);
            if (group == null) {
                // A LifecycleGroup holds the phase, the await timeout for the whole phase,
                // and the shared lifecycleBeans map
                group = new LifecycleGroup(shutdownPhase, this.timeoutPerShutdownPhase, lifecycleBeans, false);
                phases.put(shutdownPhase, group);
            }
            group.add(beanName, bean);
        });
        if (!phases.isEmpty()) {
            List<Integer> keys = new ArrayList<>(phases.keySet());
            keys.sort(Collections.reverseOrder());
            for (Integer key : keys) {
                // Stop the groups from the highest phase to the lowest
                phases.get(key).stop();
            }
        }
    }

    private class LifecycleGroup {

        // Stop this LifecycleGroup
        public void stop() {
            if (this.members.isEmpty()) {
                return;
            }
            // Sort the members in reverse order so that the larger value stops first
            // (LifecycleGroupMember compares by phase)
            this.members.sort(Collections.reverseOrder());
            // The latch count equals the number of SmartLifecycle beans in this group
            CountDownLatch latch = new CountDownLatch(this.smartMemberCount);
            Set<String> countDownBeanNames = Collections.synchronizedSet(new LinkedHashSet<>());
            // this.lifecycleBeans is the map of beans still to be stopped, shared by all groups;
            // a bean is removed from it once it has been stopped. So when a group is stopped,
            // lifecycleBeanNames holds exactly the names that have not been stopped yet.
            Set<String> lifecycleBeanNames = new HashSet<>(this.lifecycleBeans.keySet());
            for (LifecycleGroupMember member : this.members) {
                if (lifecycleBeanNames.contains(member.name)) {
                    // Still present, so it has not been stopped yet
                    doStop(this.lifecycleBeans, member.name, latch, countDownBeanNames);
                } else if (member.bean instanceof SmartLifecycle) {
                    // This SmartLifecycle bean was already stopped earlier, as a dependent of another bean
                    latch.countDown();
                }
            }
            try {
                // Wait until every SmartLifecycle bean of the group has stopped, or until the timeout
                latch.await(this.timeout, TimeUnit.MILLISECONDS);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        }

        // lifecycleBeans holds all beans still to be stopped; the call is recursive, so a bean is
        // removed once stopped to avoid stopping it more than once.
        // latch counts the SmartLifecycle beans of the current group.
        // SmartLifecycle beans in the same phase therefore hang off one CountDownLatch, can stop in
        // parallel, and share a total wait of `spring.lifecycle.timeout-per-shutdown-phase`.
        private void doStop(Map<String, ? extends Lifecycle> lifecycleBeans, final String beanName,
                final CountDownLatch latch, final Set<String> countDownBeanNames) {
            Lifecycle bean = lifecycleBeans.remove(beanName);
            if (bean != null) {
                // First stop the beans that depend on this bean
                String[] dependentBeans = getBeanFactory().getDependentBeans(beanName);
                for (String dependentBean : dependentBeans) {
                    doStop(lifecycleBeans, dependentBean, latch, countDownBeanNames);
                }
                try {
                    if (bean.isRunning()) {
                        // A SmartLifecycle bean counts down through its callback
                        if (bean instanceof SmartLifecycle) {
                            countDownBeanNames.add(beanName);
                            ((SmartLifecycle) bean).stop(() -> {
                                latch.countDown();
                                countDownBeanNames.remove(beanName);
                            });
                        } else {
                            // A plain Lifecycle bean is stopped directly
                            bean.stop();
                        }
                    } else if (bean instanceof SmartLifecycle) {
                        // A SmartLifecycle bean that is not running only needs the countDown
                        latch.countDown();
                    }
                } catch (Throwable ex) {
                    // error handling omitted in this excerpt
                }
            }
        }
    }
}
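To summarize the shutdown flow of Lifecycle and SmartLifecycle beans:
  1. All beans implementing the Lifecycle interface are grouped by phase into a phase -> LifecycleGroup map (beans that do not implement the Phased interface are treated as phase 0)
  2. The LifecycleGroups are stopped one after another, from the highest phase to the lowest
  3. Within a LifecycleGroup the members are stopped in reverse sort order. Each Lifecycle bean is stopped only once, and where dependencies exist, the beans that depend on a bean are always stopped before that bean itself
  4. The SmartLifecycle beans of a LifecycleGroup can stop in parallel on multiple threads. A CountDownLatch waits until every SmartLifecycle bean of the phase has stopped, or until the duration configured by spring.lifecycle.timeout-per-shutdown-phase has elapsed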
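The grouping and waiting logic summarized above can be reproduced in a few lines of plain JDK code. The following is a simplified simulation, not Spring's implementation: Stoppable and PhasedStopper are hypothetical names made up for this sketch, dependency handling is left out, and every component is treated like a SmartLifecycle bean that reports completion through a callback.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Simplified stand-in for a SmartLifecycle bean (hypothetical interface). */
interface Stoppable {
    int phase();

    /** Implementations must invoke done.run() once they have finished stopping. */
    void stop(Runnable done);
}

/** Stops components phase by phase, highest phase first, with a timeout per phase. */
final class PhasedStopper {
    // Reverse ordering so that iteration goes from the highest phase to the lowest
    private final Map<Integer, List<Stoppable>> groups = new TreeMap<>(Comparator.reverseOrder());
    private final long timeoutPerPhaseMillis;

    PhasedStopper(long timeoutPerPhaseMillis) {
        this.timeoutPerPhaseMillis = timeoutPerPhaseMillis;
    }

    void add(Stoppable component) {
        groups.computeIfAbsent(component.phase(), p -> new ArrayList<>()).add(component);
    }

    /** Returns the phases that did not finish within the timeout. */
    List<Integer> stopAll() throws InterruptedException {
        List<Integer> timedOut = new ArrayList<>();
        for (Map.Entry<Integer, List<Stoppable>> entry : groups.entrySet()) {
            List<Stoppable> members = entry.getValue();
            // One latch per phase: members may complete on any thread
            CountDownLatch latch = new CountDownLatch(members.size());
            for (Stoppable member : members) {
                member.stop(latch::countDown);
            }
            // On timeout we simply move on to the next phase
            if (!latch.await(timeoutPerPhaseMillis, TimeUnit.MILLISECONDS)) {
                timedOut.add(entry.getKey());
            }
        }
        return timedOut;
    }

    /** Demo component; asynchronous ones finish on their own thread. */
    static Stoppable component(String name, int phase, boolean async) {
        return new Stoppable() {
            @Override
            public int phase() {
                return phase;
            }

            @Override
            public void stop(Runnable done) {
                Runnable work = () -> {
                    System.out.println("stopped " + name + " (phase " + phase + ")");
                    done.run();
                };
                if (async) {
                    new Thread(work, name + "-stop").start();
                } else {
                    work.run();
                }
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        PhasedStopper stopper = new PhasedStopper(1000);
        stopper.add(component("startStop", 1, false));
        stopper.add(component("gracefulShutdown", 2, true));
        System.out.println("timed-out phases: " + stopper.stopAll());
    }
}
```

A component that never calls its callback only delays the shutdown by one timeout; the later phases still run, which is the same trade-off the per-phase timeout makes in Spring.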

Scenarios that need graceful shutdown

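With the background above, you should now have a rough understanding of ShutdownHook, SpringShutdownHook, and Spring's Lifecycle beans. Next we look at how they handle graceful shutdown in real scenarios.
Generally speaking, ungraceful shutdowns fall into two categories:
  1. Service registration/discovery: when registration/discovery is involved, the instance is not removed from the registry in time on shutdown and the application is destroyed first, so some requests are routed to an instance that is shutting down or already down (service unavailable)
  2. Dependencies inside the Spring container: caused by the order in which different kinds of beans are destroyed. For example, a bean destroyed early may still be depended on by a bean destroyed later, so requests that arrive during destruction can fail.
The solutions to both categories are fairly simple:
  1. For registration/discovery problems, proceed in the following order
    1. Deregister from the registry
    2. Close the service port
    3. Destroy the related beans and reclaim their resources
  2. For dependencies inside the Spring container, sort out the dependency relationships and, based on the execution order of the different callbacks during container shutdown, configure the destroy order of the beans accordingly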

Graceful shutdown of the embedded WebServer in Spring Boot 2.3

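Spring Boot supports graceful shutdown of the embedded WebServer starting with version 2.3. With server.shutdown=graceful configured, the WebServer stops accepting new requests when the application shuts down and waits for a period (configured via spring.lifecycle.timeout-per-shutdown-phase) so that requests already received can be processed. This graceful shutdown is implemented through the stop callback of Lifecycle beans described earlier, so I strongly recommend reading the Lifecycle shutdown flow above carefully.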
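As a configuration fragment, the two properties named above would look roughly like this in application.properties. The 20s value is only an example chosen for this sketch; when the timeout property is not set, Spring Boot uses 30s.

```properties
# Let the embedded web server drain in-flight requests on shutdown (default: immediate)
server.shutdown=graceful
# Upper bound for each lifecycle phase during shutdown (example value)
spring.lifecycle.timeout-per-shutdown-phase=20s
```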
In the source code that starts the webServer, two beans are registered at this point: webServerGracefulShutdown and webServerStartStop. Their names make it easy to see that both relate to shutdown, and both implement the SmartLifecycle interface.
// org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext#createWebServer
private void createWebServer() {
    WebServer webServer = this.webServer;
    ServletContext servletContext = getServletContext();
    if (webServer == null && servletContext == null) {
        ServletWebServerFactory factory = getWebServerFactory();
        this.webServer = factory.getWebServer(getSelfInitializer());
        getBeanFactory().registerSingleton("webServerGracefulShutdown",
                new WebServerGracefulShutdownLifecycle(this.webServer));
        getBeanFactory().registerSingleton("webServerStartStop",
                new WebServerStartStopLifecycle(this, this.webServer));
    } else if (servletContext != null) {
        try {
            getSelfInitializer().onStartup(servletContext);
        } catch (ServletException ex) {
            throw new ApplicationContextException("Cannot initialize servlet context", ex);
        }
    }
    initPropertySources();
}
  • WebServerStartStopLifecycle starts and stops the webServer: it starts the webServer during container startup and destroys it during container shutdown. Its phase is Integer.MAX_VALUE - 1
  • WebServerGracefulShutdownLifecycle handles graceful shutdown. Its phase is Integer.MAX_VALUE
On shutdown, the phases are stopped from the highest to the lowest, so WebServerGracefulShutdownLifecycle is stopped before WebServerStartStopLifecycle: graceful shutdown first, then destruction of the webServer. Let's look closely at the graceful shutdown logic:
class WebServerGracefulShutdownLifecycle implements SmartLifecycle {

    private final WebServer webServer;

    private volatile boolean running;

    WebServerGracefulShutdownLifecycle(WebServer webServer) {
        this.webServer = webServer;
    }

    @Override
    public void start() {
        this.running = true;
    }

    @Override
    public void stop() {
        throw new UnsupportedOperationException("Stop must not be invoked directly");
    }

    @Override
    public void stop(Runnable callback) {
        this.running = false;
        this.webServer.shutDownGracefully((result) -> callback.run());
    }

    @Override
    public boolean isRunning() {
        return this.running;
    }
}
The actual logic lives in the WebServer implementation classes, and the two methods that matter are shutDownGracefully and stop. We take TomcatWebServer as the example:
public class TomcatWebServer implements WebServer {

    // The callback here mainly performs CountDownLatch.countDown()
    @Override
    public void shutDownGracefully(GracefulShutdownCallback callback) {
        if (this.gracefulShutdown == null) {
            callback.shutdownComplete(GracefulShutdownResult.IMMEDIATE);
            return;
        }
        // Graceful shutdown is configured, so run the graceful shutdown logic
        this.gracefulShutdown.shutDownGracefully(callback);
    }

    @Override
    public void stop() throws WebServerException {
        synchronized (this.monitor) {
            boolean wasStarted = this.started;
            try {
                this.started = false;
                try {
                    if (this.gracefulShutdown != null) {
                        this.gracefulShutdown.abort();
                    }
                    stopTomcat();
                    this.tomcat.destroy();
                } catch (LifecycleException ex) {
                    // swallow and continue
                }
            } catch (Exception ex) {
                throw new WebServerException("Unable to stop embedded Tomcat", ex);
            } finally {
                if (wasStarted) {
                    containerCounter.decrementAndGet();
                }
            }
        }
    }

    public TomcatWebServer(Tomcat tomcat, boolean autoStart, Shutdown shutdown) {
        Assert.notNull(tomcat, "Tomcat Server must not be null");
        this.tomcat = tomcat;
        this.autoStart = autoStart;
        // The default is Shutdown.IMMEDIATE, in which case gracefulShutdown is null
        this.gracefulShutdown = (shutdown == Shutdown.GRACEFUL) ? new GracefulShutdown(tomcat) : null;
        initialize();
    }
}
Let's see what happens when graceful shutdown is configured:
final class GracefulShutdown {

    void shutDownGracefully(GracefulShutdownCallback callback) {
        logger.info("Commencing graceful shutdown. Waiting for active requests to complete");
        new Thread(() -> doShutdown(callback), "tomcat-shutdown").start();
    }

    private void doShutdown(GracefulShutdownCallback callback) {
        List<Connector> connectors = getConnectors();
        connectors.forEach(this::close);
        try {
            for (Container host : this.tomcat.getEngine().findChildren()) {
                for (Container context : host.findChildren()) {
                    while (isActive(context)) {
                        if (this.aborted) {
                            callback.shutdownComplete(GracefulShutdownResult.REQUESTS_ACTIVE);
                            return;
                        }
                        Thread.sleep(50);
                    }
                }
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        logger.info("Graceful shutdown complete");
        callback.shutdownComplete(GracefulShutdownResult.IDLE);
    }

    void abort() {
        this.aborted = true;
    }
}

public class TomcatWebServer implements WebServer {

    // WebServerStartStopLifecycle calls the webServer's stop method when it is stopped
    public void stop() throws WebServerException {
        synchronized (this.monitor) {
            boolean wasStarted = this.started;
            try {
                this.started = false;
                try {
                    if (this.gracefulShutdown != null) {
                        // This is where the aborted field gets updated
                        this.gracefulShutdown.abort();
                    }
                    stopTomcat();
                    this.tomcat.destroy();
                } catch (LifecycleException ex) {
                    // swallow and continue
                }
            } catch (Exception ex) {
                throw new WebServerException("Unable to stop embedded Tomcat", ex);
            } finally {
                if (wasStarted) {
                    containerCounter.decrementAndGet();
                }
            }
        }
    }
}
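A new thread is started here, and it does the following:
  1. It first closes the connectors, so no new requests are accepted
  2. It waits for all requests already received to finish, for at most spring.lifecycle.timeout-per-shutdown-phase. If the phase of WebServerGracefulShutdownLifecycle has not finished stopping when that limit is exceeded, Spring's shutdown flow moves on to the phase of WebServerStartStopLifecycle. Its stop calls the webServer's stop method, which sets the aborted field and thereby tells the graceful shutdown thread to give up waiting immediately
  3. If the requests already received are all processed within the timeout, the phase closes normally and the flow enters the stop of the WebServerStartStopLifecycle phase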
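The polling-with-abort pattern described in these steps can be isolated in a small JDK-only sketch. GracefulDrain, its Result enum, and the AtomicInteger request counter are assumptions made for this illustration; the real GracefulShutdown class inspects Tomcat's containers instead of a counter.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Polls an in-flight request counter until it is idle or the wait is aborted (sketch). */
final class GracefulDrain {

    enum Result { IDLE, REQUESTS_ACTIVE }

    private final AtomicInteger activeRequests;
    // Written by the aborting thread, read by the polling thread
    private volatile boolean aborted;

    GracefulDrain(AtomicInteger activeRequests) {
        this.activeRequests = activeRequests;
    }

    /** Starts the polling thread and returns it so that callers can join it. */
    Thread shutDownGracefully(Consumer<Result> callback) {
        Thread poller = new Thread(() -> doShutdown(callback), "drain");
        poller.start();
        return poller;
    }

    private void doShutdown(Consumer<Result> callback) {
        try {
            while (activeRequests.get() > 0) {
                if (aborted) {
                    // A later shutdown phase gave up waiting for us
                    callback.accept(Result.REQUESTS_ACTIVE);
                    return;
                }
                Thread.sleep(50);
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            // Report whatever state we were interrupted in
            callback.accept(activeRequests.get() > 0 ? Result.REQUESTS_ACTIVE : Result.IDLE);
            return;
        }
        callback.accept(Result.IDLE);
    }

    /** Called by the phase that runs after the timeout has expired. */
    void abort() {
        this.aborted = true;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger active = new AtomicInteger(1);
        GracefulDrain drain = new GracefulDrain(active);
        Thread poller = drain.shutDownGracefully(result -> System.out.println("result: " + result));
        Thread.sleep(120);        // the single in-flight request is still running
        active.decrementAndGet(); // ...and now completes
        poller.join();
    }
}
```

The volatile flag is what lets a different thread (here, whoever calls abort) cut the wait short without interrupting the poller.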

A home-grown graceful shutdown before Spring Boot 2.3

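Before version 2.3 was released, we also tried implementing graceful shutdown ourselves. The overall idea is the same:
  1. Pause the connector first, so that no new requests are accepted
  2. Replace the Tomcat thread pool. This mainly deals with requests that have already arrived but have not yet been submitted to the thread pool; these requests are simply dropped
  3. Wait for the tasks in the original Tomcat thread pool to finish, with a timeout of 5s
public class GracefulShutdownCustomizer
        implements TomcatConnectorCustomizer, ApplicationListener<ContextClosedEvent>, Ordered {

    private volatile Connector connector;

    @Override
    public void customize(Connector connector) {
        this.connector = connector;
    }

    @Override
    public void onApplicationEvent(ContextClosedEvent event) {
        log.debug("graceful shutdown embed tomcat, pause connector first");
        this.connector.pause();
        log.debug("then shutdown the threadPool");
        // TomcatUtils is a project helper that is not shown in this article; judging from its use,
        // it swaps in the given executor and returns the connector's previous thread pool
        ThreadPoolExecutor threadPool = TomcatUtils
                .setThreadPool(connector, command -> log.info("drop request"));
        if (threadPool != null) {
            threadPool.shutdown();
            try {
                // wait 5 seconds
                threadPool.awaitTermination(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                log.warn("Interrupted", e);
                // restore the interrupt flag for the rest of the shutdown sequence
                Thread.currentThread().interrupt();
            }
            if (!threadPool.isTerminated()) {
                log.warn("shutdown timeout, and force to shutdown now");
                threadPool.shutdownNow();
            }
        }
    }

    @Override
    public int getOrder() {
        return PriorityOrdered.HIGHEST_PRECEDENCE + 1000;
    }
}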
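I implemented the pre-2.3 graceful shutdown logic above in 2019; as far as I remember we were still on spring-boot 1.5.x at the time. The application kept printing ERROR logs during shutdown, so our original motivation was simply to stop "harassing" ourselves for no reason on every shutdown. With the approach above, however, the outside world still notices: every "drop request" that is logged is one external request that was affected.
One more point worth thinking about: is the job really done once the server has finished processing? What if the client has not finished reading the response body yet?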

References

  1. How ShutdownHook works (ShutdownHook原理)
  2. Allow the embedded web server to be shut down gracefully #4657
  3. Some thoughts from studying graceful shutdown (研究优雅停机时的一点思考)