前提
查看了应用日志,是凌晨就已经 OOM 了,但是直到早上八点多客服反馈登录不了才发现问题,当时在路上,于是让运维 dump 了堆栈和线程信息,重启了服务器。
jvm 配置:
-Xmx2000M -Xms2000M -Xmn500M -XX:PermSize=250M -XX:MaxPermSize=250M -Xss256K -XX:-UseGCOverheadLimit -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Addresses -XX:+DisableExplicitGC -XX:SurvivorRatio=1 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSClassUnloadingEnabled -XX:LargePageSizeInBytes=128M -XX:+UseFastAccessorMethods -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=60 -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -Xloggc:/data/apps/apache-tomcat-6.0.36/logs/gc.log
jvm.heap
2020-05-07 08:46: using parallel threads in the new generation. using thread-local object allocation. Concurrent Mark-Sweep GC Heap Configuration: MinHeapFreeRatio = 40 MaxHeapFreeRatio = 70 MaxHeapSize = 2097152000 (2000.0MB) NewSize = 524288000 (500.0MB) MaxNewSize = 524288000 (500.0MB) OldSize = 5439488 (5.1875MB) NewRatio = 2 SurvivorRatio = 1 PermSize = 262144000 (250.0MB) MaxPermSize = 262144000 (250.0MB) Heap Usage: New Generation (Eden + 1 Survivor Space): capacity = 349569024 (333.375MB) used = 137905504 (131.51693725585938MB) free = 211663520 (201.85806274414062MB) 39.45014990801931% used Eden Space: capacity = 174850048 (166.75MB) used = 129558040 (123.5561752319336MB) free = 45292008 (43.193824768066406MB) 74.09665681075478% used From Space: capacity = 174718976 (166.625MB) used = 8347464 (7.960762023925781MB) free = 166371512 (158.66423797607422MB) 4.777651627262284% used To Space: capacity = 174718976 (166.625MB) used = 0 (0.0MB) free = 174718976 (166.625MB) 0.0% used concurrent mark-sweep generation: capacity = 1572864000 (1500.0MB) used = 532712344 (508.0340805053711MB) free = 1040151656 (991.9659194946289MB) 33.86893870035807% used Perm Generation: capacity = 262144000 (250.0MB) used = 149215056 (142.3025665283203MB) free = 112928944 (107.69743347167969MB) 56.921026611328124% used 2020-05-07 08:46: using thread-local object allocation. Parallel GC with 4 thread(s) Heap Configuration: MinHeapFreeRatio = 40 MaxHeapFreeRatio = 70 MaxHeapSize = 1566572544 (1494.0MB) NewSize = 1310720 (1.25MB) MaxNewSize = 17592186044415 MB OldSize = 5439488 (5.1875MB) NewRatio = 2 SurvivorRatio = 8 PermSize = 21757952 (20.75MB) MaxPermSize = 174063616 (166.0MB) Heap Usage: PS Young Generation Eden Space: capacity = 1966080 (1.875MB) used = 628456 (0.5993423461914062MB) free = 1337624 (1.2756576538085938MB) 31.964925130208332% used From Space: capacity = 196608 (0.1875MB) used = 98304 (0.09375MB) free = 98304 (0.09375MB) 50.0% used To Space: capacity = 196608 (0.1875MB) used = 0 (0.0MB) free = 196608 (0.1875MB) 0.0% used PS Old Generation capacity = 65273856 (62.25MB) used = 4938552 (4.709770202636719MB) free = 60335304 (57.54022979736328MB) 7.565895907850151% used PS Perm Generation capacity = 25296896 (24.125MB) used = 25255776 (24.085784912109375MB) free = 41120 (0.039215087890625MB) 99.8374504128886% used
应用日志
jstack 信息
2020-05-07 08:47 cpu_busy: 1 "scheduler-9" prio=10 tid=0x00007f962c6e6000 nid=0x3a0d waiting on condition [0x00007f962437f000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0000000796111028> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2081) at java.util.concurrent.DelayQueue.take(DelayQueue.java:193) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:688) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:681) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:679) "scheduler-8" prio=10 tid=0x00007f962cb95800 nid=0x3a0c waiting on condition [0x00007f96248d4000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0000000796111028> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2081) at java.util.concurrent.DelayQueue.take(DelayQueue.java:193) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:688) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:681) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069) at java.util.concurrent.ThreadPoolExecutor.runWorkr(ThreadPoolExecutor.java:1131) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:679) "scheduler-7" prio=10 tid=0x00007f962c7c6000 nid=0x3a0b waiting on condition [0x00007f962468b000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) 2020-05-07 08:47 cpu_busy: 1 "VM Thread" prio=10 tid=0x00007f962c0fe800 nid=0x38e1 runnable "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x00007f962c010800 nid=0x38dc runnable "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x00007f962c012800 nid=0x38dd runnable "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x00007f962c014800 nid=0x38de runnable "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x00007f962c016000 nid=0x38df runnable "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x00007f962c0a5000 nid=0x38e0 runnable "VM Periodic Task Thread" prio=10 tid=0x00007f962c258000 nid=0x39b4 waiting on condition JNI global references: 1952 2020-05-07 08:47 cpu_busy: 1 "DefaultQuartzScheduler_QuartzSchedulerThread" prio=10 tid=0x00007f1c8c39c000 nid=0x6687 waiting on condition [0x00007f1c4b9f8000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:275) "DefaultQuartzScheduler_Worker-10" prio=10 tid=0x00007f1c8c3a8000 nid=0x6686 in Object.wait() [0x00007f1c4baf9000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523) - locked <0x00000000a2a06000> (a org.quartz.simpl.SimpleThreadPool$WorkerThread) "DefaultQuartzScheduler_Worker-9" prio=10 tid=0x00007f1c8c3d1000 nid=0x6685 in Object.wait() [0x00007f1c4bbfa000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523) - locked <0x00000000a2a001a8> (a org.quartz.simpl.SimpleThreadPool$WorkerThread) "DefaultQuartzScheduler_Worker-8" prio=10 tid=0x00007f1c8c1e7800 nid=0x6684 in Object.wait() [0x00007f1c4bcfb000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523) - locked <0x00000000a2a04020> (a org.quartz.simpl.SimpleThreadPool$WorkerThread) "DefaultQuartzScheduler_Worker-7" prio=10 tid=0x00007f1c8c3a5000 nid=0x6683 in Object.wait() [0x00007f1c4bdfc000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523) - locked <0x00000000a2a00228> (a org.quartz.simpl.SimpleThreadPool$WorkerThread) "DefaultQuartzScheduler_Worker-6" prio=10 tid=0x00007f1c8c241000 nid=0x6682 in Object.wait() [0x00007f1c3bffe000] java.lang.Thread.State: TIMED_WAITING (on object monitor) 2020-05-07 08:47 cpu_busy: 1 "Attach Listener" daemon prio=10 tid=0x00007f1c68001000 nid=0x30d waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "DestroyJavaVM" prio=10 tid=0x00007f1c8c008000 nid=0x665b waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "DefaultQuartzScheduler_QuartzSchedulerThread" prio=10 tid=0x00007f1c8c39c000 nid=0x6687 waiting on condition [0x00007f1c4b9f8000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:275) "DefaultQuartzScheduler_Worker-10" prio=10 tid=0x00007f1c8c3a8000 nid=0x6686 in Object.wait() [0x00007f1c4baf9000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523) - locked <0x00000000a2a06000> (a org.quartz.simpl.SimpleThreadPool$WorkerThread) "DefaultQuartzScheduler_Worker-9" prio=10 tid=0x00007f1c8c3d1000 nid=0x6685 in Object.wait() [0x00007f1c4bbfa000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523) - locked <0x00000000a2a001a8> (a org.quartz.simpl.SimpleThreadPool$WorkerThread) "DefaultQuartzScheduler_Worker-8" prio=10 tid=0x00007f1c8c1e7800 nid=0x6684 in Object.wait() [0x00007f1c4bcfb000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523) - locked <0x00000000a2a04020> (a org.quartz.simpl.SimpleThreadPool$WorkerThread) "DefaultQuartzScheduler_Worker-7" prio=10 tid=0x00007f1c8c3a5000 nid=0x6683 in Object.wait() [0x00007f1c4bdfc000] java.lang.Thread.State: TIMED_WAITING (on object monitor)
用 jhat -J-Xmx50m jvm.dump 查看所有类的实例数
这是我第一次碰到 OOM,我不理解的是为什么我看 jvm.heap 里面,堆内存使用率不是百分百,但请求依然是 OOM 呢?这些堆栈文件是运维用脚本生成的,必要的话我可以把脚本贴上来。
我会继续排查问题,希望哪位看到了也能帮我瞅瞅,非常感谢。
ps:
发现日志的打印时间顺序有问题,不知道这个会不会可以找到什么线索?
pps
哪位大佬如果有好的建议,赶紧提一下赶在故障前我准备下。
![]() | 1 alexfarm 2020-05-07 12:46:47 +08:00 以前用过个工具 eclipse memory analyzer 去看 dump 的内容,稍微直观详细点,主要看看堆内存占用高的类 |
2 Lax 2020-05-07 12:47:57 +08:00 不太懂 java,不过看你输出里有个 PS Perm Generation 已经满了。 |
3 aircjm 2020-05-07 12:48:17 +08:00 这个应该是对象太大导致的吧?? 你可以用 eclipse 的那个分析工具分析下你的 dump 文件找下 |
![]() | 5 Finest 2020-05-07 13:04:54 +08:00 你这里是两个进程? jmap 输出的第二段,明显 jvm 的内存配置都跟你贴的不一样,Perm Gen 才 24m |
![]() | 7 lovedebug 2020-05-07 13:17:17 +08:00 我跟楼主说一个自己碰到过的。线上爆发的及其容易忽略的导致 oom 的点,所有 debug 和 info 级别 log 中不要打印整个 List/Set/Map,这些在运行中实际也是会计算的。 |
![]() | 8 lovedebug &nsp; 2020-05-07 13:18:10 +08:00 另外楼主应该为这个服务设置 health check api,可以今早发现问题。 |
![]() | 9 lovedebug 2020-05-07 13:19:35 +08:00 PPS:先升级机器内存起一台新机器线上泡泡看看,这台出问题的机器可以从集群中移除仔细分析 |
![]() | 10 pmispig 2020-05-07 13:34:49 +08:00 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/logs/gc.hprof 加上这个 |
![]() | 12 Solace202 OP @hand515 这是使用了两个 GC ( Mark-Sweep GC 和 Parallel GC ),jvm 里面配置是给第一个 GC 配的  |
13 yamasa 2020-05-07 13:39:41 +08:00 jmap -histo <pid>会按 mem size 对堆内对象排序,先看看前几名分别是啥。但是这种方式看不出对象树,所以比较大可能还是需要用 jmap dump 一个 bin 或者 hprof 下来,用 eclipse mem analyzer 直接分析。这个 analyzer 可以对一个 dump 出具 mem leak 的报表,要仔细看看,找到了高内存占用的类看看对象树,是不是你们哪里的业务代码造成了内存泄漏。 |
![]() | 14 Solace202 OP @hand515 贴个配置和 jmap 的对应 |
17 yamasa 2020-05-07 13:42:59 +08:00 话说你们的 GC 是 CMS,可以用这个工具看看 https://github.com/DarLiner/vjtools/tree/master/vjmap 。 如果是 G1 的话反而用不了。这是可以分代的 jmap,如果 jmap 给出的分布看不出来问题就上这个,一个一个代区来分析。 |
![]() | 18 Solace202 OP @pmispig 已经让运维加了,我们有 GC 的日志,但是没有日期,这个能否定位问题呢?可以的话我让在 JVM 把 gc 的日期和时间也加上? |
![]() | 19 limbo0 2020-05-07 14:11:53 +08:00 除了堆内存, 也得考虑下堆外内存,是不是有大量的网络请求 |
![]() | 20 coolcfan 2020-05-07 14:13:14 +08:00 via Android 如果是 PermGen 满,结合日志内容,可以关注下应用里的反射操作或者是否存在动态的类生成…… |
22 25OHd2qObJmJ6P10 2020-05-07 14:17:38 +08:00 从应用日志来看是 spring 注入一个类的时候报了堆空间不足,spring 注入属性会去加载相应的类到永久区,根据你的日志分析,你的 jdk 应该是 1.8 以下,而且你的 jvm 内存空间也显示你的永久区只有 24M,使用达到 99%,所以初步分析就可以增大永久区 -XX:PermSize=75M -XX:MaxPermSize=100M 尝试,然后再根据 jstat 和 jmap 分析内存变化,确定原因以及调整最合适的 jvm 参数 |
![]() | 23 Solace202 OP @rayduan 我是有个地方不理解,两个 GC,为什么注入的类会放到下面这个 GC 的 Perm Generation 而不是上面的那个 Perm Generation ? |
![]() | 24 zwx327634 2020-05-07 14:31:50 +08:00 如果是 jdk1.7,升到 1.8 就没问题了,永久代变成元空间,放在堆里了。快速的话就是像楼上,增加永久代空间 |
![]() | 26 1ffree 2020-05-07 15:05:50 +08:00 好奇应用运行期间 还有 autowire 是啥使用场景呢 |
27 pangleon 2020-05-07 15:07:44 +08:00 你这个提示的是 heap space 而不是 permgen space,后者是永久代满了,那么基本上可以定位到你们的程序创建了大量的临时对象 如果真是凌晨 OOM,凌晨也不是用户高峰那么基本可以确定是定时任务导致的。 你可以查询下有哪些定时任务在运行,他们是否创建了大量的 LIST,MAP 这种集合类对象或者拼接了大量字符串对象。 |
![]() | 28 pmispig 2020-05-07 15:15:31 +08:00 @Solace202 -XX:HeapDumpPath=/logs/gc.hprof 看看能不能重现一次,拿到这个 dump 出来分析就比较好定位 |
29 liukzz 2020-05-07 15:16:37 +08:00 https://gceasy.io/gc-index.jsp 这个网站下个离线版的分析工具,把 gc.log 导进去分析下 |
31 gadsavesme 2020-05-07 15:45:01 +08:00 ![]() Perm Generation 满了,这个 jdk1.8 以下的时候 fullgc 的时候不会清理空间,满了就 oom 十分蛋疼。 |
32 bonke 2020-05-07 16:00:18 +08:00 hashMap 的 key 是重写了 hashCode 方法么 看下有没有重写 equals 方法 |
![]() | 33 admol 2020-05-07 17:05:15 +08:00 推荐一下这个工具 https://console.perfma.com/ |
![]() | 34 chibupang 2020-05-07 17:31:03 +08:00 via iPhone 我看 OOM 日志错误是在“work”反射调用的时候,是不是创建了大内存对象。 |
35 kingwang 2020-05-07 17:57:40 +08:00 应该是出报表的时候跪了,看一下这些报表里会不会象楼上们所说的产生大量集合对象 |
![]() | 36 iFlicker 2020-05-07 17:59:47 +08:00 ![]() 如果是 macOS 我推荐 Jprofiler ,平时 Android dump 堆内存查问题就是用这个看 比 as 强大 |
![]() | 37 EmdeBoas 2020-05-07 18:04:08 +08:00 机器内存多大? |
![]() | 38 cheng6563 2020-05-07 18:15:46 +08:00 via Android 强烈推荐 Jprofiler 来观测 jvm 内存转存。轻松找到哪有问题 |
39 Huozy 2020-05-07 19:40:30 +08:00 用的 was 吗?看看 was 控制台最大内存和最小内存是多少,如果操作的数据多了 超出一定范围会报内存溢出的 |
![]() | 41 clino 2020-05-07 20:49:20 +08:00 via Android 不一定内存哦,什么用户最大进程数都可能会出的 所以也要去看一下 syslog kernel log 之类的 |
42 luozic 2020-05-07 23:42:05 +08:00 jdk 6 ? 可以试试 jprofiler, 免费的有阿里的 https://github.com/alibaba/arthas/blob/master/README_CN.md ,还有个唯品会的工具,这两个工具也可以出图来分析。 并且现在新版本的 jdk 6 也支持 gc1 了 |
![]() | 43 zinan 2020-05-08 01:14:36 +08:00 就是永久代满了吧, 1.8 以下永久代就是放在 heap 中管理的, 永久代在 full gc 中不会清理. 反射调用有可能会触发类加载的行为, 永久代放不下就会报 oom 了 |
![]() | 44 heyhumor 2020-05-08 08:13:22 +08:00 V 站摸鱼的大佬还是不少啊[滑稽] |