Not Tripped by the Mountain: Tracking Down a Memory Leak

One does not stumble over a mountain, but trips over an anthill. (A pre-Qin proverb)

A memory leak was recently found in Greenplum 6X: Memory Leak DynamicBitmap Index Scan in Nestloop’s inner plan #14122. The query and query plan are almost identical to ones we fixed before, so a new leak has clearly crept in. This time the problem can be reproduced, and a reproducible bug is only a matter of time to fix; my colleague has already nailed it and opened a Pull Request (see the issue page). This post is my after-the-fact exercise: can some tooling pin down the problem quickly?

Background

Knowing the earlier history and some prerequisites helps a lot here: the basics of the Greenplum (Postgres) optimizer, the executor, and MemoryContext are all needed. This post skips them; please refer to issue Memory leak of DynamicBitmapHeapscan #12533. VMware employees can also refer to the internal talk "诊断2: 一个让人疲惫的OOM Issue (What Goes Around Comes Around)_Zhenghua Lv, Sep 24" (Diagnosis 2: An Exhausting OOM Issue).

An Outline for Diagnosing OOM Issues

A few more words on how OOM problems in Greenplum should be diagnosed. An OOM is always triggered by query execution, so the query that reported the OOM, together with its query plan and minirepro, is the most important direct clue. Next, when Greenplum detects runaway memory usage it calls MemoryContextStats(), which prints the usage of every MemoryContext into the log; this is a crucial lead. The engineer needs to read that log carefully, find the abnormal MemoryContext, and diagnose it against the query plan. The log may contain other useful information as well. In short, handling a Greenplum OOM usually requires collecting the following after the fact (the first two are the most critical):

  • Query, Plan, Minirepro
  • Log
    • Master log
    • Segment log where the OOM happened
  • Linux System Monitor Data (SAR …)

The Problem

First, the query plan that leaks memory:

 Aggregate
   ->  Gather Motion 3:1  (slice2; segments: 3)
         ->  Nested Loop
               Join Filter: true
               ->  Broadcast Motion 3:3  (slice1; segments: 3)
                     ->  Result
                           ->  Sequence
                                 ->  Partition Selector for t2_12533 (dynamic scan id: 2)
                                       Partitions selected: 1000 (out of 1000)
                                 ->  Dynamic Bitmap Heap Scan on t2_12533 (dynamic scan id: 2)
                                       Recheck Cond: (b < 600)
                                       ->  Dynamic Bitmap Index Scan on idx2_12533
                                             Index Cond: (b < 600)
               ->  Sequence
                     ->  Partition Selector for t1_12533 (dynamic scan id: 1)
                           Partitions selected: 1000 (out of 1000)
                     ->  Dynamic Bitmap Heap Scan on t1_12533 (dynamic scan id: 1)
                           Recheck Cond: (b < 600)
                           Filter: ((b < 600) AND (c = t2_12533.c))
                           ->  Dynamic Bitmap Index Scan on idx1_12533
                                 Index Cond: (b < 600)
 Optimizer: Pivotal Optimizer (GPORCA)

With the plan above we can reproduce the OOM, and the log shows that a lot of memory leaked in the executor context and the cache context. The NestLoop + DynamicScan combination in this plan must be the crux: the inner (right) subtree of the Nested Loop is rescanned once in full for every outer tuple, and if any memory allocated during that process is not released promptly, it accumulates into a leak and eventually an OOM.

So the leak we want to catch must happen in the window after one outerTuple has been fetched and before the next outerTuple is requested.

The Approach

From experience, this has to be memory that was palloc'd and never pfree'd. Greenplum's (Postgres's) region-based memory management seems to make the usual leak-detection tools less effective (if any reader uses them well for this, please do share). So my plan is as follows:

  1. Build with the macro CLOBBER_FREED_MEMORY enabled, so that freed memory gets overwritten with 0x7f7f7f7f.
  2. Use GDB to stop the program right after it has fetched an outerTuple.
  3. Turn on some system-level tracing that records the returned address and the call stack of every palloc / palloc0.
  4. Use GDB to continue until just before the next outerTuple is requested.
  5. Stop the tracing and use GDB to generate a coredump file.
  6. Any address recorded by the tracing whose content in the coredump is not 0x7f7f7f7f is suspicious.
  7. Take the corresponding call stacks, read the code by hand, and that should settle it.

So if we have an automated tool that performs the steps above and narrows down the amount of code to read by hand, the problem is as good as solved.
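
Step 6 is the heart of the idea, so here is a minimal Python sketch of that check. It assumes we already have allocs, the (address, call stack) pairs recorded by the tracer, and clobbered, the set of heap addresses whose content in the core dump is 0x7f7f7f7f; both names are placeholders for illustration, and the real script appears later in this post.

from collections import defaultdict

def suspicious_stacks(allocs, clobbered):
    """Group allocations whose memory was never clobbered (never freed) by call stack."""
    by_stack = defaultdict(list)
    for addr, stack in allocs:
        if addr in clobbered:
            continue  # overwritten with 0x7f7f7f7f, so it was pfree'd; not a leak
        by_stack[stack].append(addr)
    return by_stack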

Tooling

What I need is a tool that can conveniently record a function's return value and call stack. eBPF and bcc-tools keep coming up lately, and a quick search shows that trace from bcc-tools is exactly what I need. See the project documentation for details: https://github.com/iovisor/bcc/blob/master/tools/trace_example.txt.

The command below is what I used to trace (-U prints the user-space stack; the probes fire on the return values of the allocation functions):

trace-bpfcc -p <pid> 'r:/home/gpadmin/local/gpdb/bin/postgres:palloc "%x", retval' 'r:/home/gpadmin/local/gpdb/bin/postgres:palloc0 "%x", retval'  -U > alloc.<pid>

The output looks roughly like this:

PID     TID     COMM            FUNC             -
57465   57465   postgres        palloc           655c54a0
        b'bms_copy+0x3e [postgres]'
        b'bms_intersect+0x48 [postgres]'
        b'UpdateChangedParamSet+0x2f [postgres]'
        b'ExecReScanSequence+0x5c [postgres]'
        b'ExecReScan+0x33b [postgres]'
        b'ExecNestLoop_guts+0x3d3 [postgres]'
        b'ExecNestLoop+0x1c [postgres]'
        b'ExecProcNode+0x400 [postgres]'

Find the heap range of the Postgres process:

cat /proc/57240/maps | grep heap
5596638ad000-559663937000 rw-p 00000000 00:00 0                          [heap]

Prepending the leading 4 hex digits of the heap mapping (5596 here) turns a traced address into the address as it appears in the coredump.
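
For example, with the heap mapping and the trace output shown above (just an illustration of the arithmetic, not part of the tool):

heap_head = "5596"      # leading hex digits of the [heap] mapping
traced = "655c54a0"     # address printed by trace-bpfcc's "%x" for palloc's return value
# "%x" does not zero-pad, so pad to 8 hex digits before gluing the parts together.
full_addr = heap_head + traced.zfill(8)
print(full_addr)        # 5596655c54a0, the address to look up in the core dump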

In GDB, generate-core-file produces a core file, and objdump -s core > xxx turns it into text, so we can check whether a suspicious address holds 0x7f bytes. To make the lookup convenient, we write a small program that correlates the trace log with the addresses in the core.
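
For reference, objdump -s prints each 16-byte row as an address followed by four 4-byte words and an ASCII column, and that is the shape the script below relies on. A rough illustration of how one such row maps to per-word addresses (the row itself is made up, not taken from a real dump):

row = "5596638ad010 7f7f7f7f 00000000 7f7f7f7f 7f7f7f7f  ............"
fields = row.split()
base = int(fields[0], 16)
for i, word in enumerate(fields[1:5]):
    if word == "7f7f7f7f":
        # this 4-byte word was wiped by pfree under CLOBBER_FREED_MEMORY
        print(hex(base + i * 4))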

The following Python script does exactly this analysis:

  • Build the set of all addresses within the heap range whose content is 0x7f7f7f7f
  • Build the (address, call stack) pairs for all allocations

# -*- coding: utf-8 -*-

import re
from collections import defaultdict
import argparse


class AllocInfo:
    """Parses the trace-bpfcc dump: one allocation entry plus its call stack."""

    def __init__(self, alloc_dump_file):
        self.alloc_dump_file = alloc_dump_file

    def parse_single(self, f, stack_depth=7):
        # One entry is a "PID TID COMM FUNC addr" line followed by stack
        # frames, terminated by a blank line (or EOF).
        lines = []
        while True:
            line = f.readline()
            if line.strip() == '':
                break
            lines.append(line)

        if not lines:
            return None

        # Drop the one-line column header that trace-bpfcc prints before
        # the very first entry.
        if lines[0].split()[0] == 'PID':
            lines = lines[1:]
            if not lines:
                return None

        _pid, _tid, _comm, alloc_func, addr = lines[0].strip().split()
        # Stack frames look like b'bms_copy+0x3e [postgres]'; keep only the function name.
        pt = re.compile(r'([a-zA-Z0-9_]+)[+]0x')
        bt = []
        for i in range(1, min(stack_depth + 1, len(lines))):
            func = pt.findall(lines[i].strip())
            if func:
                bt.append(func[0])
            else:
                break

        if bt:
            bt.insert(0, alloc_func)
            return (addr, tuple(bt))
        else:
            return (addr, None)

    def parse(self):
        # Returns a list of (addr, stack) pairs, skipping entries whose
        # stack could not be resolved.
        addr_stack = []
        with open(self.alloc_dump_file) as f:
            while True:
                p = self.parse_single(f)
                if p is None:
                    return addr_stack
                if p[1] is None:
                    continue
                addr_stack.append(p)


class CoreDumpAddr:
    """Collects the heap addresses whose content in the core dump is 0x7f7f7f7f,
    i.e. memory already wiped by pfree under CLOBBER_FREED_MEMORY."""

    def __init__(self, core, heap_addr_head):
        self.core = core
        self.heap_addr_head = heap_addr_head

    def create_set(self):
        s = set()
        with open(self.core) as f:
            for line in f:
                # An `objdump -s` row: <addr> <word1> <word2> <word3> <word4> <ascii>
                l = line.strip().split()
                if len(l) < 5:
                    continue
                addr = l[0]
                if not (addr.startswith(self.heap_addr_head) and "7f7f7f7f" in line):
                    continue
                # Each row covers 16 bytes; record the address of every clobbered word.
                for i in range(4):
                    if l[i+1] == "7f7f7f7f":
                        s.add(hex(int(addr, 16) + i*4)[2:])
        return s


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Analyze memory leak')
    parser.add_argument('--core', type=str, help='core dump file after objdump', required=True)
    parser.add_argument('--heap_head', type=str, help='heap address head string', required=True)
    parser.add_argument('--alloc_dump', type=str, help='alloc dump result', required=True)

    args = parser.parse_args()

    core = args.core
    heap_head = args.heap_head
    alloc_dump = args.alloc_dump

    cda = CoreDumpAddr(core, heap_head)
    ai = AllocInfo(alloc_dump)
    addrs = ai.parse()
    clobbered = cda.create_set()

    # Group the suspicious (allocated but never clobbered, i.e. never freed)
    # addresses by their call stack, then print each unique stack.
    rev_index = defaultdict(list)
    for addr, bt in addrs:
        # "%x" does not zero-pad, so pad the traced address to 8 hex digits
        # before prepending the heap head.
        if (heap_head + addr.zfill(8)) in clobbered:
            continue
        rev_index[bt].append(addr)

    for k in rev_index:
        print(k)

Usage:

usage: check.py [-h] --core CORE --heap_head HEAP_HEAD --alloc_dump ALLOC_DUMP

Analyze memory leak

optional arguments:
  -h, --help            show this help message and exit
  --core CORE           core dump file after objdump
  --heap_head HEAP_HEAD
                        heap address head string
  --alloc_dump ALLOC_DUMP
                        alloc dump result

A Complete Example

First build a cluster; for convenience, a single-segment Greenplum cluster is enough. Then use the repro SQL from the issue to create the tables, load the data, and run the query.

gpadmin    60745 48.4  1.6 539264 135920 ?       Ssl  16:47   0:03 postgres:  6000, gpadmin gpadmin [local] con10 cmd1 SELECT
gpadmin    60763 60.0  1.9 557280 160860 ?       RNsl 16:47   0:01 postgres:  6002, gpadmin gpadmin 127.0.0.1(63670) con10 seg0 cmd2 slice2 MPPEXEC SELECT
gpadmin    60765 75.0  1.8 541208 151964 ?       SNsl 16:47   0:01 postgres:  6002, gpadmin gpadmin 127.0.0.1(63680) con10 seg0 idle

PID 60763 is the process in which we want to chase the memory leak. Attach GDB to it and check its heap range; the heap addresses start with 5596.

cat /proc/60763/maps | grep heap
5596638ad000-559663937000 rw-p 00000000 00:00 0                          [heap]
559663937000-559663daf000 rw-p 00000000 00:00 0                          [heap]

Then set breakpoints in GDB and continue to them. First, let the program stop here:

(gdb) info b
Num     Type           Disp Enb Address            What
3       breakpoint     keep y   0x000055965c7ffaf0 in ExecNestLoop_guts at nodeNestloop.c:246

This means a fresh, full execution of the inner plan is about to begin. Next, set a breakpoint right before the next outerTuple is fetched:

(gdb) b 200
Breakpoint 4 at 0x55965c7ff914: file nodeNestloop.c, line 200.
(gdb) info b
Num     Type           Disp Enb Address            What
3       breakpoint     keep y   0x000055965c7ffaf0 in ExecNestLoop_guts at nodeNestloop.c:246
4       breakpoint     keep y   0x000055965c7ff914 in ExecNestLoop_guts at nodeNestloop.c:200

At this point, start the tracing:

trace-bpfcc -p 60763 'r:/home/gpadmin/local/gpdb/bin/postgres:palloc "%x", retval' 'r:/home/gpadmin/local/gpdb/bin/postgres:palloc0 "%x", retval'  -U > alloc.60763

Then hit continue in GDB; it stops at nodeNestloop.c:200, at which point the tracer has captured everything we need, so stop the tracing. Generate a core in GDB, quit everything, and analyze with the Python script.

gpadmin@zlyu:~/xxx$ objdump -s core.60763 > coredump
gpadmin@zlyu:~/xxx$ pypy3 check.py --core coredump --heap_head 5596 --alloc_dump alloc.60763 > analysis
gpadmin@zlyu:~/xxx$ wc -l analysis
119 analysis

That leaves 119 call stacks to analyze in total, well within what a human can go through by hand. The results:

gpadmin@zlyu:~/xxx$ head analysis
('palloc', 'RelationGetIndexScan', 'btbeginscan', 'FunctionCall3Coll', 'index_beginscan_internal', 'index_beginscan', 'systable_beginscan', 'ScanPgRelation')
('palloc', 'btbeginscan', 'FunctionCall3Coll', 'index_beginscan_internal', 'index_beginscan', 'systable_beginscan', 'ScanPgRelation', 'RelationBuildDesc')
('palloc', '_bt_search', '_bt_first', 'btgettuple', 'FunctionCall2Coll', 'index_getnext_tid', 'index_getnext', 'systable_getnext')
('palloc', 'heaptuple_copy_to', 'heap_copytuple', 'ScanPgRelation', 'RelationBuildDesc', 'RelationIdGetRelation', 'relation_open', 'heap_open')
('palloc', 'AllocateRelationDesc', 'RelationBuildDesc', 'RelationIdGetRelation', 'relation_open', 'heap_open', 'initNextTableToScan', 'ExecDynamicBitmapHeapScan')
('palloc', 'CreateTemplateTupleDesc', 'AllocateRelationDesc', 'RelationBuildDesc', 'RelationIdGetRelation', 'relation_open', 'heap_open', 'initNextTableToScan')
('palloc', 'systable_beginscan', 'RelationBuildTupleDesc', 'RelationBuildDesc', 'RelationIdGetRelation', 'relation_open', 'heap_open', 'initNextTableToScan')
('palloc', 'RelationGetIndexScan', 'btbeginscan', 'FunctionCall3Coll', 'index_beginscan_internal', 'index_beginscan', 'systable_beginscan', 'RelationBuildTupleDesc')
('palloc', 'btbeginscan', 'FunctionCall3Coll', 'index_beginscan_internal', 'index_beginscan', 'systable_beginscan', 'RelationBuildTupleDesc', 'RelationBuildDesc')
('palloc', 'systable_beginscan', 'ScanPgRelation', 'RelationBuildDesc', 'RelationIdGetRelation', 'relation_open', 'heap_open', 'RelationInitAppendOnlyInfo')

Among them, the tool captured several of the key spots fixed in the code:

('palloc', 'parseRelOptions', 'default_reloptions', 'heap_reloptions', 'RelationGetAttributeOptions', 'aocs_fetch_init', 'initFetchDesc', 'BitmapAppendOnlyNext')

('palloc0', 'initFetchDesc', 'BitmapAppendOnlyNext', 'ExecScanFetch...
