peering耗时优化方案:跳过wait_up_thru阶段

需求

减少日常运维操作导致的peering的耗时,如停止osd、启动osd、调整osd权重、迁移pool等,从而减少对客户端IO造成的影响。

现状

当前waitupthru阶段是整个peering过程中耗时最长的阶段,该阶段的耗时与monitor的Paxos决议间隔强相关(当前配置的间隔是1s),也跟monitor服务的繁忙程度有关。之前把monitor所用的存储盘更换为ssd盘之后,已经大幅降低了waitupthru阶段的耗时,从而很大程度上降低了peering总耗时,对客户端IO的影响也大大降低。

但通过分析多次线上日常运维对打桩卷IO的影响情况,仍然发现有部分osd的peering耗时达到5s甚至8s,其中最耗时的阶段仍然是waitupthru,可达7s左右。另外观察影响打桩卷IO较小的场景,其peering阶段耗时均较低,一般为1s多(绝大部分仍然为waitupthru占据),因此仍然需要进一步优化waitupthru耗时。

方案

本方案的总体流程变动

相关名词解释

peering相关流程请参考:Ceph peering相关问题

WaitUpThru是peering的最后一个阶段,其作用是等待osd通知monitor把它的up_thru字段更新到osdmap中。up_thru字段用来表明该osd何时(在哪个epoch)完成了peering;一旦更新完成,就表示该osd上的pg已经可以接受客户端的IO请求,后续生成past_intervals时该interval就不能被跳过(可能有IO写入,如果跳过则可能导致数据丢失)。

如果没有这个字段,则无法区分特定场景下的interval是否有IO写入,官方举例如下:

在上述场景下,epoch 3这个阶段,B所处的状态可能有2个,1)B正常运行并且可以处理IO;2)B已经down,只是mon还没发现或者没有更新到osdmap;如果是情况1,那么在peering阶段就不能跳过2这个interval,如果是情况2,则可以安全跳过,osd的up_thru就是用来区分情况2的,即:

在这种情况下,如果B在epoch 3这个interval的up_thru没有更新成功,说明它其实没有完成peering,因此肯定没有IO写入,可以在后面的peering阶段跳过该interval。

而如果B在epoch 3这个interval的up_thru成功更新成了3,则表示它正常运行并且完成了peering,有IO写入,后续peering不能跳过。

past_intervals在发生变化后(新加入或老的interval被清理),都会把pg的dirty_big_info字段设置为true,然后把更新后的past_intervals存盘(leveldb),在osd启动时会重新加载past_intervals信息。因此我们只需要考虑的是配置项修改后新生成的interval的maybe_went_rw的值是否符合预期即可。

因此如果要跳过WaitUpThru阶段,就必须要做到将每个interval都看作接收过客户端IO请求(写请求),而不能跳过。

方案设计

计划实现一个开关osd_wait_up_thru,来控制OSD在peering过程中是否需要等待up_thru字段更新到osdmap并返回给osd,并且该开关可以随时打开关闭而不影响OSD的运行和数据可靠性、一致性。false表示不等待up_thru字段更新到osdmap,true表示等待。

在peering跳转到WaitUpThru阶段的位置(通过发送NeedUpThru事件给pg状态机实现跳转),加上这个配置项条件的判断,如果为false则不进入waitupthru阶段。
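这一改动位置的示意如下(以GetMissing阶段结束时的判断为例,代码中发送NeedUpThru事件的位置不止一处,均需类似处理;osd_wait_up_thru为本方案新增的配置项,以下仅为示意写法,并非最终补丁):

// PG::RecoveryState::GetMissing 中决定是否进入 WaitUpThru 的位置(示意)
if (peer_missing_requested.empty()) {
  if (pg->need_up_thru && pg->cct->_conf->osd_wait_up_thru) {
    // 原有逻辑:需要等待 up_thru 更新到 osdmap,进入 WaitUpThru 阶段
    post_event(NeedUpThru());
  } else {
    // 开关关闭时不再等待,直接发送 Activate 事件进入 Active 状态
    post_event(Activate(pg->get_osdmap()->get_epoch()));
  }
}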

在GetInfo阶段,会首先生成一系列的interval也即past_intervals,然后把这些interval中的osd列表都放入一个set中(prior_set),之后给他们发送pg info查询请求,找出哪个或者哪些osd的pg信息比较全,然后用来在GetLog阶段获取pg log,生成权威日志,供数据恢复使用。

生成interval过程中会根据up_thru字段检查该interval是否曾经接收过客户端写IO:如果没有,则可以不考虑这个interval(该interval的osd不放入pg info查询的osd集合);而如果我们之前跳过了WaitUpThru阶段,则可能无法区分该interval是否有写IO,只能将其加入pg info查询的osd集合。带来的影响是多查询了一些osd,并且这些osd可能已经无法启动。但更常见的情况是,3副本存储池一般都有至少2个副本在运行,因此每个interval一般都会有写IO,本来就很少能跳过;并且3副本对应的osd一般不会发生变化,如pg 1.0从创建后一般都固定在3个osd上(如osd.1、2、3),除非我们对其做过调整权重、踢osd之类的运维操作。因此并不会导致pg info或pg log需要多发送给很多无效osd而造成耗时增加。
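PriorSet构造时maybe_went_rw的作用可以用下面的伪代码概括(对PG::PriorSet实际实现做了大量简化,仅为示意):

// 先把当前 up/acting 的 osd 加入查询集合
for (auto o : up)     probe.insert(o);
for (auto o : acting) probe.insert(o);

// 再从新到旧遍历 past_intervals
for (auto p = past_intervals.rbegin(); p != past_intervals.rend(); ++p) {
  const pg_interval_t &interval = p->second;
  if (!interval.maybe_went_rw)
    continue;                    // 该 interval 没有写入过,整段跳过,不查询其中的 osd
  for (auto o : interval.acting) {
    if (osdmap.is_up(o))
      probe.insert(o);           // 需要向这些 osd 发送 pg info(及后续的 log)查询
    else
      down.insert(o);            // osd 已 down,可能导致 pg 卡在 down+peering
  }
}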

需要考虑的异常场景如下:

  • 开关临界场景:即配置项开关从开到关或者从关到开,不能造成数据丢失或其他问题
  • OSD频繁up、down场景(正常或误报):不能导致数据丢失或其他问题
  • 某个interval单副本运行,但坏盘导致其无法启动,如何把pg恢复正常,尤其是当该interval实际上可能没有接收客户端IO请求的场景,跳过WaitUpThru阶段是否会引入新的问题?

针对配置项开关临界场景的设计如下:

  • 在线修改配置项(从true到false):根据上面的整体流程图可以看出,这一过程实际上是把interval的maybe_went_rw=true的场景变得更加宽泛,也即只会把原本为false的变为true,让pg在peering时给更多的osd发送查询pg info+log请求,在我们场景下都是ok的,唯一需要考虑的是异常场景3的情况下(单副本运行期间坏盘),如果单副本所在的osd故障无法启动,如何让pg完成peering恢复业务?这个问题在下面的异常场景3的设计讨论时进行解释。
  • 在线修改配置项(从false到true):这个场景与上一条(从开到关)相反,interval的maybe_went_rw=false的场景变得更加宽泛,也即把原来为true的场景变成了false,带来的问题是这个interval可能有IO写入,但peering过程中却被跳过了,存在数据丢失风险。导致这一问题的根本原因是我们跳过了WaitUpThru阶段,也即判断maybe_went_rw=true的条件是不准确的(根据主osd的up_thru或者pg的info.history.last_epoch_clean版本判断,但由于peering转到active之前没有等待新的osdmap到来,所以这两个值有可能是不准确的)。因此我们需要在修改配置项之前,检查osd的up_thru值是否更新完毕,并且pg的状态是否为active,只有满足这两个条件才可以进行配置项更改。为了统一修改条件以简化代码逻辑,从true到false(从开到关)的在线修改也采用同样的条件限制(判断逻辑可参考本列表后的示意代码)。补充:修改配置项与peering触发流程不能并发,需加锁控制。
  • 针对离线配置文件中配置项的修改:可参考下面的非功能性设计相关内容
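上面提到的在线修改配置项的前置条件,可以用下面的独立小例子来概括(结构体与函数名均为虚构,仅示意判断逻辑,实际实现还需要与peering触发流程加锁互斥):

#include <vector>

// 虚构的简化状态描述,仅用于示意
struct PgState  { bool is_active; };
struct OsdState { bool up_thru_updated; std::vector<PgState> pgs; };

// 只有 up_thru 已更新完毕、且本 osd 上所有 pg 都处于 active 状态时,
// 才允许在线切换 osd_wait_up_thru(两个方向统一使用同样的条件)
bool can_toggle_wait_up_thru(const OsdState &s) {
  if (!s.up_thru_updated)
    return false;                 // osdmap 中本 osd 的 up_thru 尚未更新到位
  for (const auto &pg : s.pgs) {
    if (!pg.is_active)
      return false;               // 还有 pg 未完成 peering,不允许切换
  }
  return true;
}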

针对频繁的OSD up、down场景设计如下:

  • 首先,在配置项为false场景,由于基本上每个interval都被我们认为是有IO写入的,因此会导致某些没有IO写入的interval的osd也需要被查询(pg的info和log),因此某些单副本运行的interval虽然没有IO写入,也需要被查询,导致无法跳过,pg状态可能变成down+peering,但此时只要把该osd启动起来即可恢复,如果无法启动,则需要进行手工的恢复,恢复流程见:pg down+peering状态处理方案,由于实际上这个interval并没有IO写入,因此手工恢复也不会导致数据丢失。如果单副本运行的interval有IO写入,那这种场景跟官方场景是一样的,都可能导致数据丢失,这种场景下的数据丢失并不是本次改动引入的。

针对单副本运行过程中坏盘场景设计如下:

  • 如果只是一个副本坏盘,其他一个或两个副本正常运行(min_size=1),那么这个场景是可以正常完成peering的。如果是单副本运行过程中坏盘,这个场景又分为单副本运行的interval有无IO写入,这个问题与上面的osd频繁up、down中的类似,可以参考上面的说明。

非功能性设计

升级

升级过程比较简单,只要把代码打包,然后安装、启动即可(代码中osd_wait_up_thru配置项默认为false,也即让interval的maybe_went_rw=true的条件变宽泛,让更少的interval被跳过,以保证数据可靠性)。ceph.conf配置文件中的osd_wait_up_thru,也配置为false即可。

配置项修改

配置项修改分为在线和离线两种,在线修改已经在代码中进行相应的设计和处理,只有在条件满足时才能修改成功。

离线的配置项修改,需要先完成在线修改,然后再修改离线的ceph.conf配置文件,这么做的原因是,一旦离线的ceph.conf修改完毕,尤其是从false改为true的场景,此时如果在线配置项没有修改而osd异常down掉并重启(当然我们当前的运维场景下不会发生),那么有些interval可能被错误的标记为没有IO写入而跳过,导致数据丢失。如果我们先修改了进程内存中的配置,并且判断已经成功,那么之后无论是在ceph.conf是否修改时发生osd重启,均不会导致interval错误的标记为没有IO写入。

补充:生成prior_set过程中会首先把当前的acting和up的osd列表加入进去,在我们的场景下,这两个列表里的osd已经有所有需要的pg info和log,因此即使错误地跳过一些interval(不给这个interval里的osd列表发送查询pg info和log的请求),也不会导致pg信息获取不足(进而导致数据丢失)。
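下面是pg_interval_t::check_new_interval()(位于osd_types.cc)中判断maybe_went_rw的代码片段,注释说明了跳过WaitUpThru后这些判断条件为什么可能不准确: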

if (lastmap->get_up_thru(i.primary) >= i.first && // 不等up_thru,意味着本地的osdmap可能是旧的,所以这个判断条件可能是不正确的
    lastmap->get_up_from(i.primary) <= i.first) { // interval可能与自己无关,这里主要关心主是否可写
        i.maybe_went_rw = true;
} else if (last_epoch_clean >= i.first && // 不等up_thru,意味着本地的osdmap可能是旧的,所以这个判断条件可能是不正确的
           last_epoch_clean <= i.last) {  // 因为last_epoch_clean也是在mark_clean的时候用本地最新的osdmap的epoch设置的
        // If the last_epoch_clean is included in this interval, then
        // the pg must have been rw (for recovery to have completed).
        // This is important because we won't know the _real_
        // first_epoch because we stop at last_epoch_clean, and we
        // don't want the oldest interval to randomly have
        // maybe_went_rw false depending on the relative up_thru vs
        // last_epoch_clean timing.
        i.maybe_went_rw = true;
} else {
        i.maybe_went_rw = false;
}

回退

  • 通过配置项回退,把osd_wait_up_thru配置项的值从false改为true即可(包括在线和离线)
  • 通过重装到老版本回退,在active状态下修改osd_wait_up_thru配置项为true,成功后停掉osd,重装版本,之后启动即可,按副本域顺序对osd操作。

风险

  • 单副本运行interval坏盘场景下,如果该interval并没有IO写入,但在osd_wait_up_thru=false的情况下,这个interval无法被跳过,可能导致pg错误的变成down+peering,需要手工修复。

ceph merged commits – 0301~0318

后面尽量每两周抽时间看下ceph社区merge的pr,水平有限,很多也不理解是做什么的,如有错误请见谅。

看了这半个月大部分提交都是在搞cephadm相关的,也有很多mgr相关的。
我这边更多的是关注核心项目(rbd、rados、cephfs、osd、mon)

mon/MgrMonitor: make ‘mgr fail’ work with no arguments

https://github.com/ceph/ceph/pull/33997

这个commit比较简单,就是原来要fail一个mgr,要传入mgr的名称,现在不需要了,免去了查询active mgr的步骤。

librbd: optimize image copy state machine to use fast-diff

https://github.com/ceph/ceph/pull/33867

这个提交是在rbd-mirror同步image时,充分利用fast-diff这个特性,基于object-map来检查两个snapshot之间的dirty object,只把dirty的object同步过去,对对象数较多、size较大的image来说,会大大提升同步速度。

crimson/heartbeat: report to monitors about osd failure

https://github.com/ceph/ceph/pull/33836

这个是基于crimson实现的新功能,增加了osd故障上报流程。

crimson/osd: add tell command support

https://github.com/ceph/ceph/pull/33847

也是增加新功能,支持tell命令。

nautilus: mon: fix/improve mon sync over small keys

https://github.com/ceph/ceph/pull/33765

这个是backport的提交,用来改进mon启动时同步数据的速度,原来只限制payload size,没有限制每次同步的key的数量,如果数量很多并且size很小,就要花费很多时间。这个提交就是增加限制每次同步的key的数量,进而提高速度。

mon/MonClient: send logs to mon on separate schedule than pings

https://github.com/ceph/ceph/pull/33732

支持单独设置提交cluster log到mon的时间间隔,原来是跟ping消息共用同一个间隔。默认仍是1s,把该间隔调大应该可以降低mon(Paxos)的负载。

nautilus: common/blkdev: compilation of telemetry and device backports

https://github.com/ceph/ceph/pull/33726

这个是backport的提交,获取设备metadata的时候支持/dev/disk/by-path这种链接的设备路径。具体是什么用处还没太了解,应该是一堆提交中的一个。

CephFS用户IO流程

基于L版本代码(v12.2.12)分析。本人还在CephFs学习入门阶段,分析过程仅供参考,如有错误请谅解!

用户IO发送客户端

当前有3种方式可以与cephfs集群进行数据交互:
– libcephfs:提供与cephfs集群交互的C语言API,需要自己开发客户端,与ceph-fuse使用的下层接口相同(最小使用示例见本列表后的代码)
– ceph-fuse:sudo ceph-fuse -m 192.168.0.1:6789 /mnt/mycephfs
– kernel client:sudo mount -t ceph 192.168.0.1:6789:/ /mnt/mycephfs
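
上面第一种方式(libcephfs)的最小使用示意如下(C API,可直接在C/C++中调用;配置文件走默认路径,目录名等均为本例假设),其内部与ceph-fuse一样走的是Client这一层:

// libcephfs 最小使用示意
#include <cephfs/libcephfs.h>
#include <cstdio>

int main() {
  struct ceph_mount_info *cmount = NULL;
  if (ceph_create(&cmount, NULL) != 0)      // 创建句柄,NULL 表示使用默认的 client.admin
    return 1;
  ceph_conf_read_file(cmount, NULL);        // 读取默认路径下的 ceph.conf(mon 地址、keyring 等)
  if (ceph_mount(cmount, "/") != 0) {       // 挂载 cephfs 根目录,内部会完成 monitor 认证、订阅 mdsmap 等流程
    fprintf(stderr, "ceph_mount failed\n");
    ceph_release(cmount);
    return 1;
  }
  ceph_mkdir(cmount, "/mydir", 0755);       // 通过 mds 创建目录,对应下文的元数据 IO 流程
  ceph_unmount(cmount);
  ceph_release(cmount);
  return 0;
}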

ceph-fuse客户端启动流程

ceph-fuse依赖libfuse项目,很多都是调用的libfuse提供的接口,所以要对libfuse的API比较熟悉才能更好的理解启动过程。

// sudo ceph-fuse -m 192.168.0.1:6789 /mnt/mycephfs
ceph_fuse.cc:
    -> main()
        -> global_init()  // 生成ceph context
        -> fuse_parse_cmdline()  // 解析命令行参数
        -> forker.prefork(err)  // daemonize,成为守护进程
        -> new MonClient(g_ceph_context); mc->build_initial_monmap();  // 初始化monitor client及monmap
        -> Messenger::create_client_messenger()  // 创建client messenger
        -> client = new StandaloneClient(messenger, mc)  // 创建client,用来收发用户IO请求
        -> cfuse = new CephFuse(); cfuse->init()  // 创建CephFuse对象并初始化
            -> _handle->init()  // 初始化CephFuse::Handle对象,fuse_parse_cmdline是libfuse的api,用来解析libfuse需要的参数
        -> messenger->start()   // 启动messenger线程,开始接收消息
        -> init_async_signal_handler(); register_async_signal_handler(SIGHUP, sighup_handler)  // 注册SIGHUP信号处理函数
        -> client->init()  // 初始化client的定时器,启动objectcacher(对象缓存管理),初始化objecter并启动,objecter是跟osd打交道的client,添加dispatcher到messenger
        -> client->mount()  // 执行挂载流程,与mds交互检查挂载目录的权限
            -> authenticate()  // 通过monitor完成认证流程
            -> monclient->sub_want(want, 0, 0); monclient->renew_subs()  // 订阅mdsmap,并请求更新
            -> tick()  // 启动定时任务
            -> if (require_mds)  // 等待mds可用
            -> make_request()  // 发送请求给mds,循环检查mount的目录权限
        -> cfuse->start()
            -> fuse_mount(mountpoint, &args)  // 调用libfuse接口完成目录挂载
            -> fuse_lowlevel_new(&args, &fuse_ll_oper, sizeof(fuse_ll_oper), this)  // 创建lowlevel fuse session,其中fuse_ll_oper是定义好的各种posix接口的用户态实现,这里涉及到libfuse的两种用法,参考:https://www.lijiaocn.com/%E6%8A%80%E5%B7%A7/2019/01/21/linux-fuse-filesystem-in-userspace-usage.html
            -> fuse_set_signal_handlers(se)  // Exit session on HUP, TERM and INT signals and ignore PIPE signal
            -> fuse_session_add_chan(se, ch)  // Assign a channel to a session
            -> client->ll_register_callbacks(&args)  // 给client注册回调,包括inode invalidate callback、remount callback、dentry invalidate callback等
        -> tester.init(cfuse, client);  tester.create("tester");   // 初始化并启动remount的test线程(执行RemountTest::entry函数检查是否支持invalidate dentry,如果内核版本大于3.18并且配置项里设置了client_try_dentry_invalidate=true,则检查是否注册了dentry invalidate callback;反之则需要通过remount操作来强制invalidate dentry,执行的命令是"mount -i -o remount $mountpoint",如果remount失败并且配置项client_die_on_failed_dentry_invalidate=true则执行"fusermount -u -z $mountpoint"命令umount掉,下面的loop()就会失败,以达到退出进程的目的)
        -> cfuse->loop()
            -> fuse_session_loop_mt(se)  // Enter a multi-threaded event loop,开始处理IO请求
        -> tester.join(&tester_rp)  // loop()结束后,检查tester线程返回值

ceph-fuse用户IO流程

按数据类型可以分为两种IO,一种是操作metadata的,一种是操作文件内容的。

元数据IO流程

以mkdir操作为例进行说明。

client端

-> 用户在cephfs挂载目录下执行mkdir命令
    -> ceph-fuse进程调用fuse_lowlevel_new函数注册的用户态mkdir实现:fuse_ll_mkdir(ll应该是lowlevel的缩写,因为调用的是libfuse的lowlevel api)
        -> fuse_ll_req_prepare(req)
            -> fuse_req_userdata(req)  // 提取用户数据
        -> fuse_req_ctx(req)    // Get the context from the request
        -> UserPerm perm(ctx->uid, ctx->gid)  // 初始化用户权限
        -> get_fuse_groups(perm, req)  // 从req中获取用户组信息
            -> getgroups()
                -> fuse_req_getgroups(req)  // Get the current supplementary group IDs for the specified request, Similar to the getgroups(2) system call, except the return value is always the total number of group IDs, even if it is larger than the specified size.
            -> perms.init_gids(gids, count)  // 把用户组信息设置到perms
        -> i1 = cfuse->iget(parent);  // 获取父目录的inode
            -> client->get_root() 或 client->ll_get_inode(vino)   // 增加相关inode引用计数
        -> cfuse->client->ll_mkdir(i1, name, mode, &fe.attr, &i2, perm)
            -> _mkdir(parent, name, mode, perm, &in)
                -> is_quota_files_exceeded()  // 检查文件数量配额
                -> _posix_acl_create(dir, &mode, xattrs_bl, perm)   // 创建acl
                -> get_or_create(dir, name, &de)  // 创建新建目录的dentry
                -> make_request(req, perm, inp)  // 发送请求给mds执行创建目录操作,应该是同步请求
                -> trim_cache()  // 清理lru缓存中的dentry
            -> fill_stat(in, attr);    _ll_get(in.get());  // 填充stat信息,增加inode引用计数
        -> fuse_reply_entry(req, &fe)  // Reply with a directory entry

server端

void Server::dispatch_client_request(MDRequestRef& mdr)
{
  ...
    case CEPH_MDS_OP_MKDIR:
      handle_client_mkdir(mdr);
      break;
  ...
}

// MKDIR
/* This function takes responsibility for the passed mdr*/
-> Server::handle_client_mkdir(MDRequestRef& mdr)
    -> is_last_dot_or_dotdot()  // 检查末级目录是否为.或..
    -> rdlock_path_xlock_dentry()  // 遍历并创建新文件夹的dentry,获取非新建目录的rdlock,以及新目录的xlock
    -> check_access()  // 检查目录操作权限
    -> check_fragment_space()  // 检查mds上的目录分片大小是否超出限制
    -> prepare_new_inode()  // 创建新inode,填充信息后存入mdcache
    -> push_projected_linkage()  // 没看懂,猜测是链接新目录到父目录链表?
    -> mdlog->start_entry(le)   // 准备mdlog相关操作,新增一条log
    -> mds->locker->issue_new_caps()  // 新建目录的cap,并且清空它(初始化操作)
    -> journal_and_reply()  // 写入mdlog
        -> early_reply()  // 在提交mdlog之前先返回给client结果,应该是为了加速请求返回
        -> submit_mdlog_entry()
        -> mdlog->flush()

文件数据IO流程

以write操作为例进行说明。

client端

-> 用户在cephfs挂载目录下写入文件
    -> fuse_ll_write(fuse_req_t req, fuse_ino_t ino, const char *buf, size_t size, off_t off, struct fuse_file_info *fi)
        -> fuse_ll_req_prepare(req)  // 提取handle
        -> cfuse->client->ll_write(fh, off, size, buf)
            -> Client::_write()
                -> mdsmap->get_max_filesize()  // 检查文件大小是否越界
                -> objecter->osdmap_pool_full(in->layout.pool_id)  // 检查后端存储池是否已满
                -> f->mode & CEPH_FILE_MODE_WR  // 检查句柄是否可写
                -> is_quota_bytes_exceeded()  // 检查配额
                -> bl.append()  // 把写入的内容转存到bufferlist
                -> get_caps()  // 获取操作文件的cap,这个流程比较长,可以参考上面的元数据client端操作流程
                -> 根据写入的字节数判断是否可以执行inline写入,inline应该是指写入inode里面,也就是写入元数据池
                -> cct->_conf->client_oc  // 是否可以执行buffered write
                    -> objectcacher->file_write()  // async, caching, non-blocking
                    -> if O_SYNC||O_DSYNC; _flush_range()  // 刷数据
                -> 否则执行
                    -> if O_DIRECT; _flush_range   // 刷之前的数据
                    -> filer->write_trunc()
                        -> Striper::file_to_extents()   // 根据文件的偏移量和写入的长度找到对应的后端对象
                        -> objecter->sg_write_trunc()  // 把数据写入后端对象
                            -> write_trunc()
                                -> o = new Op(oid, oloc, ops, flags | global_op_flags | CEPH_OSD_FLAG_WRITE, oncommit, objver)
                                -> op_submit(o, &tid)
                                    -> _send_op()
                                        -> send_message()  // 发送对象写消息给osd
        -> fuse_reply_write(req, r)  // Reply with number of bytes written
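
上面filer->write_trunc()里的Striper::file_to_extents()负责把文件内偏移映射到后端rados对象。下面用一个独立小例子示意最简单的layout(stripe_count=1、stripe_unit等于object_size)下的映射关系,对象大小等参数为本例假设:

#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t object_size = 4ull << 20;   // 假设 object_size 为 4MB
  uint64_t off = 10ull * 1024 * 1024 + 123;  // 文件内写入偏移
  uint64_t len = 4096;                       // 写入长度

  uint64_t objectno = off / object_size;     // 落在第几个对象上
  uint64_t obj_off  = off % object_size;     // 对象内偏移
  printf("object no = %llu, offset in object = %llu, len = %llu\n",
         (unsigned long long)objectno, (unsigned long long)obj_off,
         (unsigned long long)len);
  // cephfs 的数据对象名大致形如 <inode号的十六进制>.<objectno的十六进制>,
  // 具体格式以 Striper/file_layout 的实现为准;跨对象的写会被拆成多个 extent
  return 0;
}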

server端

与rbd场景下osd端处理IO写请求的流程相同,不再分析。

Ceph mds启动及切换流程

基于L版本代码(v12.2.12)分析。本人还在CephFs学习入门阶段,分析过程仅供参考,如有错误请谅解!

mds启动过程

单纯的启动过程比较简单,比较复杂的地方是如何加入mds集群,尤其是多主模式,这部分暂时没有分析。

因此只讨论单主模式的启动,可以分为主启动和备启动两种情况。

通用流程

-> ceph_mds.cc:main()
    -> global_init()  // 创建cct
    -> ceph_argparse_witharg  // 解析进程启动参数hot-standby
    -> pick_addresses  // 解析监听地址
    -> Messenger::create  // 创建messenger,之后就是设置一堆参数,最终绑定ip和端口:msgr->bind(g_conf->public_addr)
    -> global_init_daemonize  // daemon化,守护进程
    -> mc.build_initial_monmap()  // 创建monitor client并且根据配置项里的mon地址初始化monmap
    -> msgr->start()  // 启动messenger线程
    -> mds = new MDSDaemon()  // 创建MDSDaemon实例,后面初始化参数mds->init()
    -> init_async_signal_handler、*_signal_handler  // 注册信号处理函数
    -> msgr->wait()   // 阻塞,等待stop信号


-> MDSDaemon::init()
    -> messenger->add_dispatcher_tail(&beacon);  messenger->add_dispatcher_tail(this);  // 添加dispatcher,一个是心跳、一个是mds自己
    -> monc->set_messenger(messenger)  // mon client也用同一个messenger收发消息
    -> monc->init()  // monitor client初始化,添加dispatcher到messenger,准备与monitor的认证参数,下面monc->authenticate()是完成认证
    -> mgrc.init(); messenger->add_dispatcher_head(&mgrc)  // mgr client的初始化
    -> monc->sub_want("mdsmap", 0, 0);  monc->sub_want("mgrmap", 0, 0);  monc->renew_subs();  // 设置订阅map(mdsmap、mgrmap),并且发送订阅更新请求
    -> set_up_admin_socket()  // 设置admin socket
    -> timer.init()  // mds定时器,主要调用了MDSRankDispatcher::tick()
    -> beacon.init(mdsmap)  // 初始化心跳服务
    -> reset_tick()   // 启动mdsdaemon的定时任务(定期调用MDSRankDispatcher::tick())
// 这个函数比较重要,这个函数是用来定期检查mds运行状态的
-> MDSRankDispatcher::tick()
    -> heartbeat_reset()  // 重置心跳超时时间,一般在执行耗时较长的任务前调用
    -> check_ops_in_flight()  // 检查正在执行的op,slow request就是这里发现的
    -> mdlog->flush()  // mdlog落盘,也就是写入到osd,会调用到Journaler::flush()
    -> is_active() || is_stopping()状态下,执行:
        -> server->recall_client_state(nullptr, Server::RecallFlags::ENFORCE_MAX)  // 发消息给client清理cap,并且unpin一些mdcache里的inode,主要是释放内存,ENFORCE_MAX表示释放超过最大caps数量的sessions的cap?
        -> mdcache->trim();  // 清理mdcache
        -> mdcache->trim_client_leases();  // 清理无用client链接
        -> mdcache->check_memory_usage();  // 检查是否需要释放内存,如需要则调用recall_client_state进行释放(不指定ENFORCE_MAX,应该是尽量释放?),如果用的是tcmalloc,还会调用ceph_heap_release_free_memory进一步释放内存。
        -> mdlog->trim();  // 清理mdlog
    -> is_clientreplay() || is_active() || is_stopping()状态下,执行:
        -> server->find_idle_sessions()  // 清理空闲的session,g_conf->mds_session_blacklist_on_timeout为true会调用mds->evict_client(),否则kill_session(),前者是加入黑名单,后者只是清理session
        -> server->evict_cap_revoke_non_responders()  // 也是调用evict_client
        -> locker->tick()  // 没看明白
    -> is_reconnect()状态会执行server->reconnect_tick(),reconnect_tick会在mds_reconnect_timeout之后检查session重连情况,对重连超时的session执行evict_client或者kill_session(检查的配置项同上)
    -> is_active()状态会执行(这部分应该都是多mds相关的操作):
        -> balancer->tick()  // 多mds负载均衡
        -> mdcache->find_stale_fragment_freeze()  // 没看明白,大概是说把freeze超时的目录分片unfreeze掉
        -> mdcache->migrator->find_stale_export_freeze()  // 没看明白,大概是执行迁移故障目录
    -> is_active() || is_stopping()状态下,执行update_targets(),这个没看明白,看注释是更新当前mds的mdsmap里的export_targets,主要是这个export_targets不了解是啥,看类型应该是mdsrank的计数集合,计数越大表示mds越重要?
    -> beacon.notify_health(this)  // 将mds的内部状态信息同步给心跳服务,上报给monitor

备mds启动

class Beacon : public Dispatcher
{
  ...
  MDSMap::DaemonState want_state = MDSMap::STATE_BOOT; // 默认state就是BOOT
  ...
}

启动后Beacon会上报mds状态给monitor,默认是BOOT:
/**
 * Call periodically, or when you have updated the desired state
 */
bool Beacon::_send()
{
  ...
    MMDSBeacon *beacon = new MMDSBeacon(
      monc->get_fsid(), mds_gid_t(monc->get_global_id()),
      name,
      epoch,
      want_state,   // 默认就是STATE_BOOT
      last_seq,
      CEPH_FEATURES_SUPPORTED_DEFAULT);

  beacon->set_standby_for_rank(standby_for_rank);
  beacon->set_standby_for_name(standby_for_name);
  beacon->set_standby_for_fscid(standby_for_fscid);
  beacon->set_standby_replay(standby_replay);
  beacon->set_health(health);
  beacon->set_compat(compat);
  // piggyback the sys info on beacon msg
  if (want_state == MDSMap::STATE_BOOT) {
    map<string, string> sys_info;
    collect_sys_info(&sys_info, cct);
    sys_info["addr"] = stringify(monc->get_myaddr());
    beacon->set_sys_info(sys_info);
  }
  monc->send_mon_message(beacon);
  last_send = now;
  return true;
}

mds启动后,第一次通过beacon上报mds信息给monitor的时候,monitor如果发现mds不在已有的mds_roles列表,则分配给mds的角色就是STATE_STANDBY。

bool MDSMonitor::prepare_beacon(MonOpRequestRef op)
{
  ...
  // boot?
  if (state == MDSMap::STATE_BOOT) {
    // zap previous instance of this name?
    if (g_conf->mds_enforce_unique_name) {
      bool failed_mds = false;
      while (mds_gid_t existing = pending.find_mds_gid_by_name(m->get_name())) {
        if (!mon->osdmon()->is_writeable()) {
          mon->osdmon()->wait_for_writeable(op, new C_RetryMessage(this, op));
          return false;
        }
        const MDSMap::mds_info_t &existing_info =
          pending.get_info_gid(existing);
        mon->clog->info() << existing_info.human_name() << " restarted";
        fail_mds_gid(pending, existing);
        failed_mds = true;
      }
      if (failed_mds) {
        assert(mon->osdmon()->is_writeable());
        request_proposal(mon->osdmon());
      }
    }

    // Add this daemon to the map
    if (pending.mds_roles.count(gid) == 0) {
      MDSMap::mds_info_t new_info;
      new_info.global_id = gid;
      new_info.name = m->get_name();
      new_info.addr = addr;
      new_info.mds_features = m->get_mds_features();
      new_info.state = MDSMap::STATE_STANDBY;
      new_info.state_seq = seq;
      new_info.standby_for_rank = m->get_standby_for_rank();
      new_info.standby_for_name = m->get_standby_for_name();
      new_info.standby_for_fscid = m->get_standby_for_fscid();
      new_info.standby_replay = m->get_standby_replay();
      pending.insert(new_info);
    }
  ...
}

// 后面monitor会进行Paxos决议,把pending状态的mdsmap持久化,然后分发给mds,mds收到mdsmap后,在MDSDaemon::handle_core_message()里根据消息类型调用MDSDaemon::handle_mds_map()进一步处理。

void MDSDaemon::handle_mds_map(MMDSMap *m)
{
  ...
  if (whoami == MDS_RANK_NONE) {
    if (mds_rank != NULL) {
      const auto myid = monc->get_global_id();
      // We have entered a rank-holding state, we shouldn't be back
      // here!
      if (g_conf->mds_enforce_unique_name) {
        if (mds_gid_t existing = mdsmap->find_mds_gid_by_name(name)) {
          const MDSMap::mds_info_t& i = mdsmap->get_info_gid(existing);
          if (i.global_id > myid) {
            ...
            return;
          }
        }
      }

      dout(1) << "Map removed me (mds." << whoami << " gid:"
              << myid << ") from cluster due to lost contact; respawning" << dendl;
      respawn();
    }
    // MDSRank not active: process the map here to see if we have
    // been assigned a rank.
    dout(10) <<  __func__ << ": handling map in rankless mode" << dendl;
    _handle_mds_map(oldmap);
  } 
  ...
}

void MDSDaemon::_handle_mds_map(MDSMap *oldmap)
{
  MDSMap::DaemonState new_state = mdsmap->get_state_gid(mds_gid_t(monc->get_global_id()));

  // Normal rankless case, we're marked as standby
  if (new_state == MDSMap::STATE_STANDBY) {
    beacon.set_want_state(mdsmap, new_state);
    dout(1) << "Map has assigned me to become a standby" << dendl;

    return;
  }
  ...
}

从上面可以看出STANDBY状态mds启动比较简单,基本啥都不用做。

主mds启动

monitor这边主mds启动有两种情况,一种是扩容mds,比如文件系统的第一个mds启动,会从BOOT进入CREATING(新建rank)或STARTING(启动之前stopped的rank)状态:

// Beacon上报的mds状态之后,MDSMonitor会通过tick函数进行定期的检查,根据mds集群状态决定下一步的动作,比如是否需要扩容mds集群,是否需要替换mds,是否需要进行故障恢复等。
void MDSMonitor::tick()
{
  ...
    // expand mds cluster (add new nodes to @in)?
  for (auto &p : pending.filesystems) {
    do_propose |= maybe_expand_cluster(pending, p.second->fscid);  // 扩容mds,单mds下一般为第一个mds,多mds则可能是其他新mds
  }
  ...
  if (since_last.count() >= g_conf->mds_beacon_grace) {
      auto &info = pending.get_info_gid(gid);
      dout(1) << "no beacon from mds." << info.rank << "." << info.inc
              << " (gid: " << gid << " addr: " << info.addr
              << " state: " << ceph_mds_state_name(info.state) << ")"
              << " since " << since_last.count() << "s" << dendl;
      // If the OSDMap is writeable, we can blacklist things, so we can
      // try failing any laggy MDS daemons.  Consider each one for failure.
      if (osdmap_writeable) {
        maybe_replace_gid(pending, gid, info, &do_propose, &propose_osdmap);   // 主mds超时,进行主备转换,用备替换主
      }
    }
  ...
  for (auto &p : pending.filesystems) {
    auto &fs = p.second;
    if (!fs->mds_map.test_flag(CEPH_MDSMAP_DOWN)) {
      do_propose |= maybe_promote_standby(pending, fs);  // 这个场景没太看明白,也是把备变成主,但是不知道是什么场景,跟上面的maybe_replace_gid有啥区别?
    }
  }
  ...
  if (do_propose) {
    propose_pending();  // 触发Paxos决议mdsmap
  }
}

void MDSMonitor::maybe_replace_gid(FSMap &fsmap, mds_gid_t gid,
    const MDSMap::mds_info_t& info, bool *mds_propose, bool *osd_propose)
{
  ...
  mds_gid_t sgid;
  if (info.rank >= 0 &&
      info.state != MDSMap::STATE_STANDBY &&
      info.state != MDSMap::STATE_STANDBY_REPLAY &&
      may_replace &&
      !fsmap.get_filesystem(fscid)->mds_map.test_flag(CEPH_MDSMAP_DOWN) &&
      (sgid = fsmap.find_replacement_for({fscid, info.rank}, info.name,
                g_conf->mon_force_standby_active)) != MDS_GID_NONE)
  {
    ...
    // Promote the replacement
    auto fs = fsmap.filesystems.at(fscid);
    fsmap.promote(sgid, fs, info.rank);

    *mds_propose = true;
  }
  ...
}


void FSMap::promote(
    mds_gid_t standby_gid,
    const std::shared_ptr<Filesystem> &filesystem,
    mds_rank_t assigned_rank)
{
  ...
  if (mds_map.stopped.erase(assigned_rank)) {
    // The cluster is being expanded with a stopped rank  // 重新启动一个之前被停掉(stopped)的rank的情况
    info.state = MDSMap::STATE_STARTING;
  } else if (!mds_map.is_in(assigned_rank)) {
    // The cluster is being expanded with a new rank  // 新建rank的情况,如文件系统创建后的第一个mds,或扩容mds集群(多主mds)
    info.state = MDSMap::STATE_CREATING;
  } else {
    // An existing rank is being assigned to a replacement  // 这个是主备转换(备转主)
    info.state = MDSMap::STATE_REPLAY;    // 注意这个状态,mds进程那边会用到
    mds_map.failed.erase(assigned_rank);
  }
  ...
}

MDS进程这边,主和备启动的差别是在MDSDaemon::handle_mds_map(MMDSMap *m)里:

void MDSDaemon::handle_mds_map(MMDSMap *m)
{
  ...
  if (whoami == MDS_RANK_NONE) {
    ...
  } else {
    ...
    // Did I previously not hold a rank?  Initialize!
    if (mds_rank == NULL) {
      mds_rank = new MDSRankDispatcher(whoami, mds_lock, clog,
          timer, beacon, mdsmap, messenger, monc,
          new FunctionContext([this](int r){respawn();}),
          new FunctionContext([this](int r){suicide();}));
      dout(10) <<  __func__ << ": initializing MDS rank "
               << mds_rank->get_nodeid() << dendl;
      mds_rank->init();
    }

    // MDSRank is active: let him process the map, we have no say.
    dout(10) <<  __func__ << ": handling map as rank "
             << mds_rank->get_nodeid() << dendl;
    mds_rank->handle_mds_map(m, oldmap);
  }
  ...
}
void MDSRankDispatcher::handle_mds_map(
    MMDSMap *m,
    MDSMap *oldmap)
{
  ...
  // did it change?
  if (oldstate != state) {
    dout(1) << "handle_mds_map state change "
        << ceph_mds_state_name(oldstate) << " --> "
        << ceph_mds_state_name(state) << dendl;
    beacon.set_want_state(mdsmap, state);

    if (oldstate == MDSMap::STATE_STANDBY_REPLAY) {
        dout(10) << "Monitor activated us! Deactivating replay loop" << dendl;
        assert (state == MDSMap::STATE_REPLAY);
    } else {
      // did i just recover?
      if ((is_active() || is_clientreplay()) &&
          (oldstate == MDSMap::STATE_CREATING ||
       oldstate == MDSMap::STATE_REJOIN ||
       oldstate == MDSMap::STATE_RECONNECT))
        recovery_done(oldstate);

      if (is_active()) {
        active_start();
      } else if (is_any_replay()) {   // 从这个状态开始,FSMap::promote里面设置的STATE_REPLAY状态
        replay_start();   // 经过状态申请消息及回调过程,完成boot的MDS_BOOT_INITIAL、MDS_BOOT_OPEN_ROOT、MDS_BOOT_PREPARE_LOG、MDS_BOOT_REPLAY_DONE阶段,进入MDSRank::replay_done(),这个函数里面最终会调用request_state(MDSMap::STATE_RECONNECT)(单主mds)或request_state(MDSMap::STATE_RESOLVE)(多主mds)
      } else if (is_resolve()) {  // mdsmap里返回STATE_RESOLVE状态后,进入resolve_start
        resolve_start();
      } else if (is_reconnect()) {  // mdsmap里返回STATE_RECONNECT状态后,进入reconnect_start,后面其他状态也类似
        reconnect_start();
      } else if (is_rejoin()) {
        rejoin_start();
      } else if (is_clientreplay()) {
        clientreplay_start();
      } else if (is_creating()) {
        boot_create();
      } else if (is_starting()) {
        boot_start();
      } else if (is_stopping()) {
        assert(oldstate == MDSMap::STATE_ACTIVE);
        stopping_start();
      }
    }
  }
  ...
}

通过上述分析可以发现,mds的状态转化是要经过多个mdsmap变化来完成的,每个阶段或状态都请求一个mdsmap,被决议通过后才会继续进行下一个阶段。

mds状态列表

可以参考: https://github.com/ceph/ceph/blob/master/doc/cephfs/mds-states.rst#mds-states

Ceph异步恢复实现分析

20191219更新:
H版本合入async recovery功能,会引入多个bug,teuthology可以测试出来,主要是delete、snap/clone等操作和异步恢复功能的冲突(这部分依赖另外一个新版本合入的async delete功能)。因此不建议backport到H版本。可以参考xsky之前的异步恢复代码 https://github.com/ceph/ceph/pull/11918 (未merge但修改的版本跟H版本类似,不依赖async delete功能)。

最近在把ceph社区实现的异步恢复(async recovery)功能backport到H版本,目前已经合入完毕并且基本功能验证通过,这里是相关流程的分析文档。

恢复期间对io影响比较

测试环境

  • pubt2-ceph0.dg.163.org、pubt2-ceph1.dg.163.org、pubt2-ceph2.dg.163.org
  • 3节点,各19个ssd osd,共57个osd
  • Xeon(R) CPU E5-2660 v4 @ 2.00GHz,256G内存
  • ssd有两个节点是三星960,一个是Intel 3520

测试场景

  • 配置文件:完全相同
  • 集群总能力:4k随机写iops为3.6W左右
  • 背景压力:2W iops 4k随机写 (fio+librbd)
  • 观察卷压力:100 iops 4k随机读写 (krbd挂载)
  • iostat统计:每秒1个点,记录观察卷所在rbd设备的io util值

测试用例

  1. 停止一个副本域所有osd
  2. 等待15分钟
  3. 启动所有osd:启停命令:/etc/init.d/ceph stop/start osd

测试期间背景压力和观察卷压力持续进行。

影响对比

说明:X轴为时间,单位是s,间隔1s;Y轴是观察卷的io util值,每秒1个点。
恢复期间对IO影响对比图

整体实现思路

可参考官方说明文档:https://docs.ceph.com/docs/master/dev/osd_internals/async_recovery/

官方PR代码:https://github.com/ceph/ceph/pull/19811

总体思路是参考backfill的实现逻辑:backfill在恢复过程中会生成一个各副本均为最新版本、可完全对外服务的acting列表,保证IO读写完全不受影响;同时还会生成一个actingbackfill列表,其中包括acting和需要backfill的osd,用来实现写入和恢复;backfill操作在后台执行,不会被客户端的写入操作触发。

异步恢复的操作也类似backfill,只是把actingbackfill列表改成了acting_recovery_backfill列表,IO写入时,主osd会用这个列表作为目标副本发送repop,但是对于需要recover的副本,则只发送log,不发送写入的数据(跟backfill一致)。peering结束后,根据peer_missing列表和missing_loc异步进行数据恢复,恢复也是根据pglog选择需要恢复的对象,不是整个pg无脑的全量恢复。之前同步恢复过程中,如果IO读写操作遇到副本(包括主)上该对象缺失,就要主动进行对象的恢复操作,然后把读写操作放入wait_for_unreadable_object或wait_for_degraded_object队列,等待对象恢复完毕之后重新enqueue,之后再进行正常的读写操作。
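
结合上面的描述,写入时主osd对acting_recovery_backfill中不同成员的处理可以用下面的伪代码概括(send_repop_*为虚构的函数名,仅示意逻辑):

// 伪代码:主 osd 发送 repop 时对不同目标副本的处理(示意)
for (const pg_shard_t &shard : acting_recovery_backfill) {
  if (shard == pg_whoami)
    continue;                            // 跳过主自己
  if (async_recovery_targets.count(shard) || backfill_targets.count(shard))
    send_repop_log_only(shard, op);      // 需要recover/backfill的副本:只发pg log,不发数据
                                         // (backfill目标实际还会根据last_backfill区分,这里简化)
  else
    send_repop_with_data(shard, op);     // 正常副本:日志和数据一起发
}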

能否执行异步恢复也有一些附加条件:
1. 可用副本数大于min_size(注意是大于,不是大于等于)
2. 副本pglog与权威pglog的差异条目数超过100条(由osd_async_recovery_min_pg_log_entries配置,与下文choose_async_recovery_replicated的描述一致)
不满足条件的副本不会被放入async_recovery_targets,仍保留在acting中,走原有的同步恢复流程。
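
下面用一段独立的小例子概括上述判断逻辑(非Ceph源码,函数名和参数均为示意,阈值对应osd_async_recovery_min_pg_log_entries,默认100):

#include <cstddef>
#include <cstdint>
#include <cstdio>

// 示意:判断某个副本能否作为异步恢复目标(can_async_recover 为虚构的函数名)
bool can_async_recover(size_t usable_replicas, size_t min_size,
                       uint64_t pglog_diff, uint64_t min_log_entries) {
  if (usable_replicas <= min_size)        // 条件1:可用副本数必须大于(而非大于等于)min_size
    return false;
  if (pglog_diff <= min_log_entries)      // 条件2:日志差异太小,直接留在acting里同步恢复即可
    return false;
  return true;                            // 两个条件都满足,才放入async_recovery_targets
}

int main() {
  printf("%d\n", can_async_recover(3, 2, 150, 100));  // 1:可以异步恢复
  printf("%d\n", can_async_recover(2, 2, 150, 100));  // 0:副本数没有超过min_size,只能同步恢复
  return 0;
}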

同步recover过程中IO流程示意


  1. 同步recovery读IO流程

注:图里第一个框client read应该是client write

同步recovery写IO流程

异步recover过程中IO流程示意(与backfill类似)


  1. 异步recovery读IO流程
  • 注:图里第一个框client read应该是client write

    异步recovery写IO流程

    recover流程(含backfill)

    关键数据结构

    PG类:
      vector<int> up, acting, want_acting;
      // acting_recovery_backfill contains shards that are acting,
      // async recovery targets, or backfill targets.
      set<pg_shard_t> acting_recovery_backfill, actingset;
    
      set<pg_shard_t> backfill_targets,  async_recovery_targets;
    
      map<pg_shard_t, pg_missing_t> peer_missing;
    
      MissingLoc missing_loc;
    
    PGLog类:
      pg_missing_t     missing;
    

    主要流程

    1. Peering

    GetLog

    首先choose_acting:构造acting,acting_recovery_backfill,async_recovery_targets,并选出拥有权威日志auth_log_shard的osd,如果osd是自己,则发送GotLog事件。如果不是则发生pg_query_t::LOG消息给权威日志osd获取pglog,等权威osd返回log后,发送GotLog事件,之后调用proc_master_log处理接收到的权威日志。

    proc_master_log:merge_log,保存peer_info、peer_missing、might_have_unfound(意思是这个osd对象比较全,可以从它这里获取unfound对象)。

    处理完权威日志后,转到GetMissing状态。

    GetMissing

    遍历acting_recovery_backfill,根据peer_info和权威日志构建peer_missing(我理解这些都是日志差异太大,根据pg info就能断定副本osd需要backfill),如果这两个信息不足以确定peer_missing(我理解这里就是要根据pg日志来对比,找出需要recovery的osd),则需要发送pg_query_t::LOG或pg_query_t::FULLLOG给osd。

    接收到副本osd返回的pg log之后,调用PG::proc_replica_log进行处理,并保存peer_info、peer_missing、might_have_unfound。

    最后发送NeedUpThru或Activate事件,NeedUpThru是进入WaitUpThru状态,等待新的map更新up_thru值之后,再发送Activate事件进入Active状态。

    2. Active

    调用pg->activate,根据peer_missing构造missing_loc(这部分没看懂),如果是主,要发送MOSDPGLog给副本osd,副本osd发送Activate事件后,从Stray状态进入ReplicaActive状态,处理Activate事件时同样调用pg->activate函数,在回调C_PG_ActivateCommitted里调用pg->_activate_committed,回复MOSDPGInfo消息给主osd,表示activate完毕。pg进入Activating状态,主osd等待所有副本activate完毕,调用pg->all_activated_and_committed,发送AllReplicasActivated事件,处理该事件时设置pg状态为PG_STATE_ACTIVE,重新入队等待peering结束的op(pg->requeue_ops(pg->waiting_for_peered)),最后调用pg->on_activate(),发送事件DoRecovery,pg进入WaitLocalRecoveryReserved状态。

    3. WaitLocalRecoveryReserved

    事件:LocalRecoveryReserved

    4. WaitRemoteRecoveryReserved

    给所有副本osd发送MRecoveryReserve::REQUEST消息,之后等待接收事件RemoteRecoveryReserved,
    等待所有副本osd返回reserve成功消息后,发送事件:AllRemotesReserved,主osd进入Recovering状态,

    5. Recovering
      pg->state_clear(PG_STATE_RECOVERY_WAIT);
      pg->state_set(PG_STATE_RECOVERING);
      pg->osd->queue_for_recovery(pg);
    

    切换到recovery线程,OSD::do_recovery->ReplicatedPG::start_recovery_ops恢复所有对象,每次恢复一个之后检查是否还在恢复,如果结束了,则继续检查是否需要backfill,如需要则发送事件RequestBackfill,调用PG::RecoveryState::Recovering::release_reservations(),释放reservations,replica osd进入RepNotRecovering状态,主osd跳转到WaitRemoteBackfillReserved状态。

    否则直接发送AllReplicasRecovered,收到后调用PG::RecoveryState::Recovering::release_reservations(),释放reservations,跳转到Recovered状态。

    6. Recovered

    再次choose_acting构造acting,acting_recovery_backfill,async_recovery_targets,此时acting,acting_recovery_backfill以及up,应该一致,async_recovery_targets会被清空。

    之后发送事件:GoClean,转到Clean状态。

    7. WaitRemoteBackfillReserved

    给所有副本osd发送MBackfillReserve::REQUEST消息,之后等待接收事件RemoteBackfillReserved,所有副本收到消息后进入RepWaitBackfillReserved状态,之后返回reserve成功消息后,主osd等待所有副本reserve完毕后,发送AllBackfillsReserved事件,主osd进入Backfilling状态.

    8. Backfilling
      PG *pg = context< RecoveryMachine >().pg;
      pg->backfill_reserved = true;
      pg->osd->queue_for_recovery(pg);
    

    仍然切换到recovery线程,之后的流程与上面的Recovering类似,recovery和backfill都结束后进入Clean状态。

    9. Clean

    修改pg状态为clean

    recover过程调用的主要函数

    1. choose_acting
    map<pg_shard_t, pg_info_t>::const_iterator auth_log_shard =
        find_best_info(all_info);  // 查找权威日志osd,如果找不到就返回false,之后GetLog那边直接return,也就意味着peering流程无法正常执行(进入incomplete状态?),直到osdmap发生新的变化再进行重试
    
    if ((up.size() &&
          !all_info.find(up_primary)->second.is_incomplete() &&
          all_info.find(up_primary)->second.last_update >=
           auth_log_shard->second.log_tail) &&
          auth_log_shard->second.is_incomplete()) { // 如果权威osd为incomplete(没有backfill完?),就尝试再选一个(从complete的osd里面选)
    
    calc_replicated_acting  // 计算期望的acting列表(want)、acting_recovery_backfill、backfill_targets,首先选出primary(首选up_primary,如果是incomplete或者日志不全,就选权威osd),之后依次遍历所有的up、acting、以及所有返回pg info的osd,根据pg info信息判断osd应该加入哪个列表(需要backfill的加入acting_recovery_backfill和backfill_targets,否则加入want_acting和acting_recovery_backfill)
    
    recoverable_and_ge_min_size  // 根据want计算pg是否可以恢复,want_acting不为空即表示可以恢复(osd_allow_recovery_below_min_size默认值为true,所以非EC pool场景下,want的大小不需要大于等于min_size)
    
    choose_async_recovery_replicated  // 挑出want_acting中可以执行异步恢复的osd(可以异步恢复的osd会从want中移除),保存到async_recovery_targets中。挑选条件是目标osd的last_update.version与权威osd的last_update.version差异大于100(osd_async_recovery_min_pg_log_entries配置项可以配置),差异越少的排的越靠前(set的自动排序特性),cost就是差异的值
    
    queue_want_pg_temp  // 如果上面的流程选出来的want与之前的acting不一致,则需要通知monitor生成pg_temp,注册临时的acting,并返回false
    
    如果want等于acting,表示不需要变更acting列表,want_acting就可以清空了。这个时候就把calc_replicated_acting计算出来的几个列表赋给PG对象里对应的成员,保存起来以备后用。
    
    2. do_recovery

    OSDService的recovery_wq是处理recovery的队列,并关联了线程池recovery_tp(线程数量由配置项osd_recovery_threads控制,默认值为1),pg->osd->queue_for_recovery(pg)就是把pg放到这个队列里了。

    do_recovery就是recovery_tp线程池的处理函数(ThreadPool::worker->_void_process->RecoveryWQ::_process->do_recovery),ThreadPool::worker会首先dequeue一个pg,然后交给do_recovery去处理。

    do_recovery先检查还有没有recovery的配额(并发数有没有到限制),然后再看是否需要sleep一定时间(进一步降低恢复占用的带宽),如果需要则注册timer event,回调里把pg重新放入recovery_wq,等待下一次被取出来处理。

    如果不需要sleep等待,则调用pg->start_recovery_ops开始恢复,优先恢复主osd(recover_primary),之后恢复从osd(recover_replicas),recovery完成之后进行backfill(recover_backfill)。

    recover_primary // 根据pg log找出需要恢复的对象及版本(根据log类型分为CLONE、LOST_REVERT场景,revert又分为本地有目标对象版本和没有两种情况,本地没有就要找到对象所在的osd,并加入到missing_loc,以便后面进行恢复)。最后调用recover_missing进行恢复,这里都是准备工作,只是把要push(恢复从)或者pull(恢复主)的op放到相应的队列中。真正执行recover操作的是run_recovery_op函数里的send_pushs(发送MOSDPGPush消息给目标osd)和send_pulls(发送MOSDPGPull给目标osd)。发送之后就是handle和reply流程了,这里就不细讲了。

    recover_replicas跟recover_primary流程类似,也不细讲了。

    Ceph H版本性能调优总结

    现状

    sas 场景

    sas盘场景下,由于有raid卡的缓存加成,以及journal的顺序写特性,因此性能比sas裸盘要好很多,单osd单副本场景下数据(未调优):

    job1: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=64
      write: io=316136KB, bw=2589.8KB/s, iops=647, runt=122107msec
        clat (msec): min=2, max=7502, avg=98.86, stdev=280.79
    

    经过配置项调整后,同样的测试用例和单osd节点,iops从600多提升到1010左右。调整的配置项是:

    osd_client_message_cap = 50000 #线上配置是100
    filestore_queue_max_ops = 50000 #线上配置是50
    

    但此时内存占用会暴涨,原因是后端data落盘跟不上journal落盘速度,大量op保存在内存中排队。

    而裸盘的fio数据:

    fio -filename=/var/lib/ceph/osd/ceph-34/fio.data -rw=randwrite -bs=4k -iodepth=128 -numjobs=2 -ioengine=libaio -direct=1 -runtime=60 -group_reporting  -size=1000G -name=sdb1
    write: io=78768KB, bw=1303.1KB/s, iops=325, runt= 60447msec
    clat (msec): min=8, max=4049, avg=392.82, stdev=281.27
    

    因此sas盘场景没有仔细调优,主要是在ssd场景下进行的各种对比测试。

    需要注意的是,throttle放开的参数在sas场景下不能放开,否则会导致内存暴涨(sas盘性能跟不上,不需要设置太高的throttle)。

    ssd 场景

    9台物理机(近期上线的机器,cpu应该稍好),每台20个ssd盘(Intel S4500),4k随机写fio+rbd模式下,9个卷(每个500G)可以并发跑到72000个IOPS左右。

      write: io=3856.9MB, bw=32852KB/s, iops=8212, runt=120219msec
        slat (usec): min=0, max=15124, avg= 5.45, stdev=51.21
        clat (msec): min=1, max=703, avg=15.54, stdev=25.71
    

    fio测试过程中性能稳定无波动,osd后端磁盘IO压力均衡,每个osd均在2000~4000iops范围波动。

    同样的9台物理机,每台3个ssd盘(Intel S4500),跑4k随机写裸盘fio+libaio,每个节点可以达到15W iops。

    而在线下环境的pubt2-ceph0/1/2 3台节点上(E5-2660 v4 2.0GHz,两个节点ssd盘是SAMSUNG MZ7LM960HMJP,PM863a,官方标称随机写24K IOPS;另外一台是INTEL SSDSC2BB016T7,S3520,官方标称随机写17K IOPS),用线上环境试用的1.2-0版本进行对比测试,配置项使用线上的,无绑核,fio跑在独立的节点上。4k随机写总体iops性能为30652,平均单卷3405:

      write: io=825868KB, bw=13623KB/s, iops=3405, runt= 60624msec
        clat (msec): min=1, max=1356, avg=37.57, stdev=84.53
    

    分析过程及优化手段

    注意:下面的分析及优化都是针对的ssd盘场景。

    整个分析过程主要依赖的手段包括:
    1. perf counter,增加自定义counter
    2. 日志,添加耗时统计
    3. kill -USR1,获取调用栈
    4. Linux perf工具,查看cpu资源占用情况
    5. 修改代码,增加io落盘开关,做到无io落盘情况下分析osd层性能,排除磁盘对性能数据的影响

    其他期望尝试的工具是lttng+blkin,但H版本这个感觉做的不完善,代码里很多地方都没有打点,后续L版本可以尝试。

    单osd性能分析

    首先要分析的是单osd、单副本场景下的rbd卷4k随机写性能,分析的是有io落盘和无io落盘两种情况下的性能差距,并找出无io落盘情况下osd层的性能瓶颈点。

    9个卷,(4k随机写,128iodepth),有IO落盘(未mock FileStore和journal),单卷性能:

      write: io=461228KB, bw=3840.1KB/s, iops=960, runt=120082msec
        clat (msec): min=1, max=1468, avg=133.29, stdev=117.17
    

    9个卷,(4k随机写,128iodepth),无IO落盘(mock FileStore和journal),单卷性能:

      write: io=717212KB, bw=5974.2KB/s, iops=1493, runt=120052msec
        clat (msec): min=1, max=141, avg=85.69, stdev= 9.74
    

    可以看出无io落盘场景下,平均时延过高,因此通过dump_historic_ops分析时延较长的位置,

     {
                "description": "osd_op(client.8984.0:11873 rbd_data.21f66b8b4567.000000000001bb3d [] 13.f568b271 ack+ondisk+write+known_if_redirecte
    d e970)",
                "initiated_at": "2019-08-22 17:35:39.837373",
                "age": 0.992912,
                "duration": 0.982060,
    ......
    
                        {
                            "time": "2019-08-22 17:35:39.837512",
                            "event": "started"
                        },
                        {
                            "time": "2019-08-22 17:35:39.837613",
                            "event": "commit_queued_for_journal_write"
                        },
                        {
                            "time": "2019-08-22 17:35:39.840339",
                            "event": "write_thread_in_journal_buffer"  ##### 这里
                        },
                       {
                            "time": "2019-08-22 17:35:39.888499",
                            "event": "journaled_completion_queued"   ##### 到这里
                        },
                        {
                            "time": "2019-08-22 17:35:39.892560",
                            "event": "filestore_queued_op"
                        },
                        {
                            "time": "2019-08-22 17:35:39.892565",
                            "event": "filestore_do_op"
                        },
                        {
                            "time": "2019-08-22 17:35:39.892603",
                            "event": "filestore_do_op_end"
                        },
                        {
                            "time": "2019-08-22 17:35:39.924675",  ##### 到这里
                            "event": "op_commit"
                        },
                        {
                            "time": "2019-08-22 17:35:39.924698",
                            "event": "commit_sent"
                        },
    
    

    Filestore相关event影响排序,因此修改代码去掉了之后,

     {
                "description": "osd_op(client.9820.0:135 rbd_data.21f26b8b4567.000000000000ac1a [] 13.f87a5531 ack+ondisk+write+known_if_redirected e994)",
                "initiated_at": "2019-08-23 16:26:38.362347",
                "age": 30.219829,
                "duration": 0.136654,
    ......
                        {
                            "time": "2019-08-23 16:26:38.402545",
                            "event": "started"
                        },
                        {
                            "time": "2019-08-23 16:26:38.402654",
                            "event": "commit_queued_for_journal_write"
                        },
                        {
                            "time": "2019-08-23 16:26:38.403273",
                            "event": "write_thread_in_journal_buffer"   ## 这里
                        },
                        {
                            "time": "2019-08-23 16:26:38.444155",
                            "event": "journaled_completion_queued"   ## 到这里40ms左右
                        },
                        {
                            "time": "2019-08-23 16:26:38.498969",      ## 到这里又50ms左右
                            "event": "op_commit"
                        },
                        {
                            "time": "2019-08-23 16:26:38.499001",
                            "event": "commit_sent"
                        }
    
    

    前半段在等待journal落盘,后半段是journal落盘之后等待回调。

    但需要注意的是,history ops并不能完全表现真实性能瓶颈,只能作为参考,只有耗时最长的20个op才会列出来,是最大值,而非平均值。

    官方代码已有的perf平均值数据如下:

    "journal_latency": {
                "avgcount": 972945,
                "sum": 258.177185807
            },
    
    "op_latency": {
                "avgcount": 973143,
                "sum": 68849.737076609
            },
    "op_process_latency": {
                "avgcount": 973143,
                "sum": 68685.614715542
            },
    

    可以看出journal_latency平均值很低,只有约0.26ms(258.18s/972945次),也就是说绝大部分情况下commit_queued_for_journal_write到journaled_completion_queued阶段并不怎么耗时;而op_process_latency平均约70ms(68685s/973143次),明显偏高。因此需要通过增加perf counter来进一步对比验证瓶颈点。

    通过增加perf counter,journal commit可以看出回调耗时较长:

            "repop_all_commited_latency": { ## ReplicatedPG::repop_all_committed的时间点减去ReplicatedPG::eval_repop首次进入的时间点的差值,这段平均耗时统计值与fio的平均时延比较接近
                "avgcount": 335625,
                "sum": 67348.398432313
    
            "op_commit_lock_latency": {  ## 已排除是BlessedContext::finish里面的pg->lock加锁导致的耗时长
                "avgcount": 671250,
                "sum": 44.412470067
            }
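    上面这类自定义counter的添加方式大致如下(以H版本的PerfCountersBuilder为例,枚举名等均为示意,不同版本接口略有差异):

    // 在 OSD/ReplicatedPG 初始化处注册一个时间平均型 counter(示意)
    enum { l_myosd_first = 2000000, l_myosd_repop_commit_lat, l_myosd_last };

    PerfCountersBuilder pcb(g_ceph_context, "osd_custom", l_myosd_first, l_myosd_last);
    pcb.add_time_avg(l_myosd_repop_commit_lat, "repop_all_commited_latency");
    PerfCounters *perf_custom = pcb.create_perf_counters();
    g_ceph_context->get_perfcounters_collection()->add(perf_custom);

    // 在要统计的代码路径上打点,perf dump 时即可看到对应的 avgcount/sum
    utime_t start = ceph_clock_now(g_ceph_context);
    // ... 被统计的代码路径 ...
    perf_custom->tinc(l_myosd_repop_commit_lat, ceph_clock_now(g_ceph_context) - start);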
    

    当前finisher线程是1个,不能修改,之前正好看到社区代码有修改这部分,增加finisher回调线程数量,于是尝试合入进行对比(https://github.com/ceph/ceph/pull/6486/commits )。

    增加finisher线程前(默认都是1个线程):

            "journaled_ahead_to_op_commit_latency": {
                "avgcount": 325212,
                "sum": 67107.237113916
            },  ## fio的平均时延正好跟这个回调耗时匹配
    
        "finisher-journal": {
            "queue_len": 0
        },
        "finisher-onapply": {
            "queue_len": 21
        },
        "finisher-ondisk": {
            "queue_len": 1133
        },
    

    增加finisher线程后:

    "journaled_ahead_to_op_commit_latency": {
                "avgcount": 988516,
                "sum": 891.208668089
            },
    
        "finisher-filestore-apply-0": {
            "queue_len": 0
        },
        ......
        "finisher-filestore-apply-4": {
            "queue_len": 2
        },
        "finisher-filestore-ondisk-0": {
            "queue_len": 0
        },
        ......
        "finisher-filestore-ondisk-9": {
            "queue_len": 0
        },
        "finisher-journal": {
            "queue_len": 0
        },
    

    增加finisher线程数量后,队列中等待的op数量已经没有堆积,fio性能偶尔可以从原来的每个卷1200左右增加到1800左右,但更多时候是下降的,到800多,需要进一步分析原因。
    说明还有其他地方有性能瓶颈。

    统计perf counter耗时,发现提交journal之后到op_commit耗时仍然比较长(上面有之前的数据),当时没想到具体的瓶颈点,翻了社区提交的性能优化代码找到一个相关的改动,合入之后发现提升挺明显(https://github.com/ceph/ceph/commit/eb020b6a6a75c69f71dc29fd607a4626bed28366 ),单osd单副本9个卷无IO落盘,每个卷性能从1490提高到2200:

      write: io=529944KB, bw=8826.1KB/s, iops=2206, runt= 60037msec
        clat (msec): min=3, max=125, avg=57.99, stdev= 5.59
    

    有IO落盘(从960提升到1490):

      write: io=162348KB, bw=5982.1KB/s, iops=1495, runt= 27139msec
        clat (msec): min=1, max=150, avg=85.56, stdev= 8.77
    

    根据晨会上宏松的建议,看了下所有的pr,涉及到osd层、pg层、journal层、FileStore层的performance优化都过滤下,看看有没有可以拿过来的。找到一个https://github.com/ceph/ceph/commit/a75a9a8178f42ba7363de4929af5da140b64c668 ,合入后无IO落盘场景(其他条件同上),每个卷性能从2206提升到2320:

      write: io=557512KB, bw=9285.4KB/s, iops=2321, runt= 60042msec
      write: io=557496KB, bw=9285.3KB/s, iops=2321, runt= 60041msec
      write: io=557620KB, bw=9287.2KB/s, iops=2321, runt= 60043msec
        clat (msec): min=2, max=141, avg=55.13, stdev= 4.90
        clat (msec): min=3, max=133, avg=55.13, stdev= 4.92
    

    有IO落盘场景,从1490提升到1527:

      write: io=366880KB, bw=6109.2KB/s, iops=1527, runt= 60054msec
        clat (msec): min=1, max=205, avg=83.80, stdev=12.87
    

    其他的可以合入的patch也都合入测试了,发现并没有提升。

    分析无IO落盘场景下的fio结果,发现时延还是比较高,平均55ms,于是通过kill -USR1看了下top里cpu占用最高的线程的调用栈(多kill几次),发现每次调用栈都是在打印日志,于是尝试把日志级别都改成0(主要是内存级别,因为文件级别已经是0了),

    for d in `sudo ceph  daemon osd.37  config show| grep '"debug_' | awk -F: '{print $1}' | sed 's/"//g'`; do sudo ceph daemon osd.37 config set $d 0; done
    

    使用上面的脚本,关掉所有模块日志(包括内存日志)之后,性能有很大提升。

    无IO落盘提升非常大(延时敏感,每个卷从2320提升到5718,翻1倍还要多):

      write: io=1340.1MB, bw=22875KB/s, iops=5718, runt= 60028msec
        clat (usec): min=1320, max=48165, avg=22373.50, stdev=809.35
    

    有IO落盘,从1527提升到1770:

      write: io=425060KB, bw=7082.5KB/s, iops=1770, runt= 60016msec
        clat (usec): min=783, max=2614.1K, avg=72280.76, stdev=122010.96
    

    上述单osd场景下,也对比过64个pg和256个pg的性能,256 pg性能比64个要好一些,因此上述优化都是在256 pg场景下测试的。

    单节点3 osd 3副本性能分析

    使用上面修改过代码及配置项的版本部署3副本场景(共3个osd位于同一个节点上,排除网络影响),进行性能测试。

    3副本(3个osd,在同一节点上),256 pg,无io落盘场景下,版本为未合入CLDNBS-1383相关pr的版本,也即CLDNBS-1358相关commit合入的版本,配置项与线上环境相同,9个客户端卷,每个卷的iops情况(单osd cpu利用率1000%左右):

      write: io=308464KB, bw=5130.7KB/s, iops=1282, runt= 60122msec
        clat (msec): min=1, max=381, avg=99.78, stdev=113.20
    

    有io落盘iops情况(单osd cpu利用率400%~800%之间波动):

      write: io=93144KB, bw=1542.7KB/s, iops=385, runt= 60379msec
        clat (msec): min=2, max=3905, avg=331.87, stdev=487.11
    

    3副本(3个osd,在同一节点上),256 pg,无io落盘场景下,版本为未合入CLDNBS-1383相关pr的版本,也即CLDNBS-1358相关commit合入的版本,配置项与CLDNBS-1358测试时相同,9个客户端卷,每个卷的iops情况(单osd cpu利用率1000%左右):

      write: io=284264KB, bw=4728.4KB/s, iops=1182, runt= 60119msec
        clat (msec): min=1, max=267, avg=108.27, stdev=66.00
    

    有io落盘iops情况(单osd cpu利用率1300%左右):

      write: io=230180KB, bw=3830.5KB/s, iops=957, runt= 60093msec
        clat (msec): min=2, max=319, avg=133.65, stdev=69.22
    

    主pg数量不均衡,导致一个osd处理的osd_op数量比其他两个少40W(180W~220W),通过下面的命令手工调整primary-affinity并用ceph pg ls-by-primary确认后,iops有一些提升(无io落盘场景,其他同上):
    hzwangpan@pubt2-ceph2 ceph $ ceph osd primary-affinity 35 0.95
    set osd.35 primary-affinity to 0.95 (8622592)
    hzwangpan@pubt2-ceph2 ceph $ ceph pg ls-by-primary 35 | wc

      write: io=306644KB, bw=5102.2KB/s, iops=1275, runt= 60101msec
        clat (msec): min=1, max=247, avg=100.34, stdev=62.83
    

    问题:无io落盘3个osd场景下,为啥性能仍然较差?

    首先通过增加perf counter分析到是osd之间的repop发送和接收过程较慢导致的,并进一步通过debug_ms=10日志确认。之后在Pipe::reader()里面增加耗时和队列长度统计并分析日志:9个卷无IO落盘场景下,sent队列长度为1100+,repop相关dispatch耗时在0.1ms左右,排队总时延(约1100×0.1ms≈110ms)与客户端时延(120+ms)接近;而单卷无IO落盘场景下,sent队列长度为100+,dispatch耗时不变,客户端时延14ms左右,也即与9个卷场景相差9倍左右(但单卷和9个卷的总体iops相差不大)。由此可见,是服务端来不及处理高并发的IO、请求在服务端排队导致时延较高。另外还在Pipe::writer()里增加代码进行测试,发现out_q队列长度均非常小,基本都在1或0,没有排队。

    查看cpu利用率最高的线程,均为Pipe::reader()线程,达到90%以上,kill -USR1看到是在mark_event,尝试关掉optracker功能之后(osd_enable_op_tracker = false),9个卷128iodepth无IO落盘,iops从1050左右提升到1680+,时延从122ms降低到75ms。每个osd的cpu利用率也从1200%降低到750%左右。

    另外尝试把Pipe::reader线程绑定到独立的cpu上,把osd的其他线程绑定到其他cpu上,iops有一点提升,从1680+提升到1700+。
    把日志都改成0级别(包括内存),iops提升到1950+。以上均为无io落盘,9个卷128iodepth。

    3节点57 osd 3副本性能分析

    9个卷,57个osd(3节点,每节点19个osd),有io落盘(关闭optracker、关闭日志),4k随机写,128iodepth,平均每个卷的性能:

      write: io=995372KB, bw=16585KB/s, iops=4146, runt= 60017msec
        clat (msec): min=1, max=161, avg=30.73, stdev=16.42
    

    用老的版本,老的配置项,同样的测试用例,同样的环境,iops性能为:

      write: io=760796KB, bw=12671KB/s, iops=3167, runt= 60042msec
        clat (msec): min=1, max=1272, avg=40.38, stdev=77.00
    

    两种场景下,前4个核都已经跑满,经过对比发现前4个核跑满是因为被osd和fio线程占用了。

    绑核之后(fio、osd线程绑定到4-55,fio分散到3个节点执行,每个节点跑3个卷),老的版本加老的配置,同样的测试场景下,性能提升到:

      write: io=884548KB, bw=14609KB/s, iops=3652, runt= 60550msec
        clat (msec): min=1, max=1572, avg=35.00, stdev=87.48
    

    同样的绑核策略,新版本+新配置项,性能提升:

      write: io=1178.4MB, bw=20102KB/s, iops=4960, runt= 60027msec
        clat (msec): min=1, max=442, avg=25.33, stdev=13.11
    

    此时发现cpu是瓶颈,于是尝试采用async messager替换simple messager来对比测试(需要backport高版本的async功能),async比simple的线程数要少很多,cpu占用应该也会少一些。发现async比simple有提升,可以跑到5270左右了(simple是4960左右)。

    之后才做了其他对比测试,以下数据均为async messager:
    1. 尝试不绑定osd线程到4-55 cpu,iops下降到5260左右(fio线程是否绑定4-55 cpu对性能几乎无影响)。
    2. 绑核,减少线程数,可以跑到5500左右:
    filestore_op_threads = 16
    osd_op_num_threads_per_shard = 1
    osd_op_num_shards = 16
    filestore_ondisk_finisher_threads = 8
    filestore_apply_finisher_threads = 8
    osd_enable_op_tracker = true
    其他配置不变

    3. 关闭osd_enable_op_tracker,可以跑到5570左右,其他配置不变
    4. 尝试把fio跑在其他节点(与osd节点分开),其他不变,可以跑到5819;把async改为simple messager,同样的场景,可以跑到5106

    线上环境使用的1.2-0版本进行对比测试,配置项使用线上的,无绑核,fio跑在独立的节点上。4k随机写总体iops性能为30652,平均单卷3405,

      write: io=825868KB, bw=13623KB/s, iops=3405, runt= 60624msec
        clat (msec): min=1, max=1356, avg=37.57, stdev=84.53
    

    绑核后性能提升到33145,平均单卷3682:

      write: io=893156KB, bw=14749KB/s, iops=3687, runt= 60556msec
        clat (msec): min=1, max=1596, avg=34.70, stdev=97.10
    

    尝试绑核并且修改配置文件,1.2-0版本可以跑到总iops 47750,平均单卷5305:

      write: io=1246.5MB, bw=21257KB/s, iops=5314, runt= 60044msec
        clat (usec): min=873, max=2790.6K, avg=24075.40, stdev=126415.06
    

    尝试每个节点只用6个osd,性能也比19个osd高,单卷iops平均6415(总57735),时延在19~20ms。单osd cpu利用率在700~800%左右,内存最大到12G左右,原因是数据落盘比journal延迟10几秒,此时ssd盘才达到瓶颈。

    总结:

    1. cpu是瓶颈(测试过程中除了前4个只跑软中断的cpu还有20~30%的idle之外,其他cpu的idle基本都接近0),ssd盘性能仍有较大余量
    2. async messager性能最好,线程数量少很多,cpu利用率比较低一些,性能可以提升一些,总iops可以跑到52374
    3. 优化代码合入对多osd场景的最终性能影响有限,原因是cpu利用率是瓶颈,ssd性能发挥不出来,对单osd性能提升比较明显
    4. 绑定cpu对性能有提升
    5. 内存日志级别也要调整到0,对性能有提升
    6. 打开op tracker在无io落盘场景下对性能影响较大,但在有io落盘场景下不明显,可以接受
    7. 通过perf top分析可以看出,cpu利用率占用较多的是内存分配释放操作,其次是bufferlist、字符串等操作

    配置项改动:

    --- online.conf 2019-10-23 14:24:52.738763518 +0800
    +++ ceph.conf   2019-10-23 14:27:08.326466021 +0800
    @@ -1,4 +1,6 @@
     [global]
    +enable experimental unrecoverable data corrupting features = ms-type-async
    +ms_type = async
    
    @@ -32,11 +27,71 @@
    +osd_pg_object_context_cache_count = 1024
    +filestore_queue_max_ops = 5000
    +filestore_queue_max_bytes = 1048576000
    +filestore_ondisk_finisher_threads = 4
    +filestore_apply_finisher_threads = 4
    +osd_client_message_cap = 5000
    +osd_client_message_size_cap = 1048576000
    +journal_queue_max_ops = 5000
    +journal_queue_max_bytes = 524288000
    +journal_max_write_entries = 5000
    +journal_max_write_bytes = 20971520
    +
    +debug_none = 0
    +debug_lockdep = 0
    +debug_context = 0
    +debug_crush = 0
    +debug_mds = 0
    +debug_mds_balancer = 0
    +debug_mds_locker = 0
    +debug_mds_log = 0
    +debug_mds_log_expire = 0
    +debug_mds_migrator = 0
    +debug_buffer = 0
    +debug_timer = 0
    +debug_filer = 0
    +debug_striper = 0
    +debug_objecter = 0
    +debug_rados = 0
    +debug_rbd = 0
    +debug_rbd_replay = 0
    +debug_journaler = 0
    +debug_objectcacher = 0
    +debug_client = 0
    +debug_osd = 0
    +debug_optracker = 0
    +debug_objclass = 0
    +debug_filestore = 0
    +debug_keyvaluestore = 0
    +debug_journal = 0
    +debug_ms = 0
    +debug_mon = 0
    +debug_monc = 0
    +debug_paxos = 0
    +debug_tp = 0
    +debug_auth = 0
    +debug_crypto = 0
    +debug_finisher = 0
    +debug_heartbeatmap = 0
    +debug_perfcounter = 0
    +debug_rgw = 0
    +debug_civetweb = 0
    +debug_javaclient = 0
    +debug_asok = 0
    +debug_throttle = 0
    +debug_refs = 0
    +debug_xio = 0
    

    其他改进手段

    1. 优化osd的cpu利用率,进一步发挥ssd盘的性能
    2. 可以考虑混合部署sas和ssd,降低ssd盘的密度,更好的发挥ssd盘的性能

    通过rados命令删除rbd卷操作流程

    参数写错导致卷创建得太大了(16384T),H版本没有object-map特性,导致rbd rm删除卷非常慢;实际上后端rados里几乎没有数据对象(还没有写入多少数据),因此考虑用rados命令进行rbd卷的清理,步骤如下(批量删除数据对象的部分,步骤后面附了一个librados的示意程序):

    1. 首先查看卷信息:rbd info rbd/rbd_sys_disk
      rbd image 'rbd_sys_disk':
      size 16384 TB in 4294967296 objects
      order 22 (4096 kB objects)
      block_name_prefix: rbd_data.6a74656327b23c6
      format: 2
      features: layering
      flags:
    2. 之后开始清理rados中的数据对象和卷信息记录
       $ rados ls -p rbd | grep rbd_data.6a74656327b23c6
       $ rados rm -p rbd rbd_data.6a74656327b23c6.0000000000000214 # 逐个对象删除即可,可以写一个脚本来做
       $ rados rm -p rbd rbd_header.6a74656327b23c6  # 清理rbd_header对象
       $ rados listomapvals -p rbd rbd_directory  # 查看rbd_directory的omap属性列表
          id_6a74656327b23c6   # 删除这个key
          value: (16 bytes) :
          0000 : 0c 00 00 00 72 62 64 5f 73 79 73 5f 64 69 73 6b : ....rbd_sys_disk
      
          id_6a95c8a327b23c6
          value: (17 bytes) :
          0000 : 0d 00 00 00 72 62 64 5f 73 79 73 5f 64 69 73 6b : ....rbd_sys_disk   # 注意换行下面还有一个1
          0010 : 31                                              : 1
      
          id_6a9e49a6b8b4567
          value: (12 bytes) :
          0000 : 08 00 00 00 76 6f 6c 75 6d 65 30 32             : ....volume02
      
          id_6aa42d36b8b4567
          value: (19 bytes) :
          0000 : 0f 00 00 00 72 62 64 5f 6c 6f 67 69 63 5f 64 69 : ....rbd_logic_di
          0010 : 73 6b 31                                        : sk1
      
          name_rbd_logic_disk1
          value: (19 bytes) :
          0000 : 0f 00 00 00 36 61 61 34 32 64 33 36 62 38 62 34 : ....6aa42d36b8b4
          0010 : 35 36 37                                        : 567
      
          name_rbd_sys_disk# 删除这个key
          value: (19 bytes) :
          0000 : 0f 00 00 00 36 61 37 34 36 35 36 33 32 37 62 32 : ....6a74656327b2
          0010 : 33 63 36                                        : 3c6
      
          name_rbd_sys_disk1
          value: (19 bytes) :
          0000 : 0f 00 00 00 36 61 39 35 63 38 61 33 32 37 62 32 : ....6a95c8a327b2
          0010 : 33 63 36                                        : 3c6
      
          name_volume02
          value: (19 bytes) :
          0000 : 0f 00 00 00 36 61 39 65 34 39 61 36 62 38 62 34 : ....6a9e49a6b8b4
          0010 : 35 36 37                                        : 567
      $ rados rmomapkey -p rbd rbd_directory id_6a74656327b23c6
      $ rados rmomapkey -p rbd rbd_directory name_rbd_sys_disk
      $ rbd ls -p rbd 就看不到rbd_sys_disk这个卷了
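    上面第2步逐个rados rm的过程,如果对象较多,也可以用librados写个小程序批量处理,示意如下(池名、前缀取自上面的例子;删除不可逆,执行前务必先确认对象列表):

    // 示意:用 librados C++ API 批量删除指定前缀的数据对象
    #include <rados/librados.hpp>
    #include <iostream>
    #include <string>

    int main() {
      const std::string pool = "rbd";
      const std::string prefix = "rbd_data.6a74656327b23c6.";  // 待清理卷的 block_name_prefix

      librados::Rados cluster;
      if (cluster.init(NULL) < 0 || cluster.conf_read_file(NULL) < 0 || cluster.connect() < 0) {
        std::cerr << "connect cluster failed" << std::endl;
        return 1;
      }
      librados::IoCtx io;
      if (cluster.ioctx_create(pool.c_str(), io) < 0) {
        std::cerr << "open pool failed" << std::endl;
        return 1;
      }
      for (librados::NObjectIterator it = io.nobjects_begin(); it != io.nobjects_end(); ++it) {
        const std::string oid = it->get_oid();
        if (oid.compare(0, prefix.size(), prefix) == 0) {
          io.remove(oid);                      // 逐个删除数据对象
          std::cout << "removed " << oid << std::endl;
        }
      }
      cluster.shutdown();
      return 0;
    }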
      

    mock osd Journal and FileStore

    最近在分析ceph rbd性能瓶颈,想排除掉磁盘对性能的影响,单独分析osd层软件栈的耗时,因此想mock掉所有落盘的代码,测试了下可以实现。

    代码是基于H版本0.94.6修改的,仅供参考。patch如下,代码不多,不做解释了:

    diff --git a/src/common/config_opts.h b/src/common/config_opts.h
    index 0748acf..7255f53 100644
    --- a/src/common/config_opts.h
    +++ b/src/common/config_opts.h
    @@ -858,6 +858,7 @@ OPTION(filestore_fd_cache_shards, OPT_INT, 16)   // FD number of shards
     OPTION(filestore_dump_file, OPT_STR, "")         // file onto which store transaction dumps
     OPTION(filestore_kill_at, OPT_INT, 0)            // inject a failure at the n'th opportunity
     OPTION(filestore_inject_stall, OPT_INT, 0)       // artificially stall for N seconds in op queue thread
    +OPTION(filestore_inject_blackhole, OPT_BOOL, false)
     OPTION(filestore_fail_eio, OPT_BOOL, true)       // fail/crash on EIO
     OPTION(filestore_debug_verify_split, OPT_BOOL, false)
     OPTION(journal_dio, OPT_BOOL, true)
    diff --git a/src/os/FileJournal.cc b/src/os/FileJournal.cc
    index 14cecc2..ce8da41 100644
    --- a/src/os/FileJournal.cc
    +++ b/src/os/FileJournal.cc
    @@ -1370,7 +1370,10 @@ int FileJournal::write_aio_bl(off64_t& pos, bufferlist& bl, uint64_t seq)
         iocb *piocb = &aio.iocb;
         int attempts = 10;
         do {
    -      int r = io_submit(aio_ctx, 1, &piocb);
    +      int r = 1;
    +      if (!g_conf->filestore_inject_blackhole) {
    +          r = io_submit(aio_ctx, 1, &piocb);
    +      }
           if (r < 0) {
            derr << "io_submit to " << aio.off << "~" << aio.len
                 << " got " << cpp_strerror(r) << dendl;
    @@ -1406,7 +1409,10 @@ void FileJournal::write_finish_thread_entry()
    
         dout(20) << "write_finish_thread_entry waiting for aio(s)" << dendl;
         io_event event[16];
    -    int r = io_getevents(aio_ctx, 1, 16, event, NULL);
    +    int r = 1;
    +    if (!g_conf->filestore_inject_blackhole) {
    +        r = io_getevents(aio_ctx, 1, 16, event, NULL);
    +    }
         if (r < 0) {
           if (r == -EINTR) {
            dout(0) << "io_getevents got " << cpp_strerror(r) << dendl;
    @@ -1416,6 +1422,7 @@ void FileJournal::write_finish_thread_entry()
           assert(0 == "got unexpected error from io_getevents");
         }
    
    +    if (!g_conf->filestore_inject_blackhole)
         {
           Mutex::Locker locker(aio_lock);
           for (int i=0; i<r ; i++) {
    @@ -1449,6 +1456,9 @@ void FileJournal::write_finish_thread_entry()
            ai->done = true;
           }
           check_aio_completion();
    +    } else {
    +        Mutex::Locker locker(aio_lock);
    +        check_aio_completion();
         }
       }
       dout(10) << "write_finish_thread_entry exit" << dendl;
    @@ -1468,7 +1478,7 @@ void FileJournal::check_aio_completion()
       uint64_t new_journaled_seq = 0;
    
       list<aio_info>::iterator p = aio_queue.begin();
    -  while (p != aio_queue.end() && p->done) {
    +  while (p != aio_queue.end() && (p->done || g_conf->filestore_inject_blackhole)) {
         dout(20) << "check_aio_completion completed seq " << p->seq << " "
                 << p->off << "~" << p->len << dendl;
         if (p->seq) {
    diff --git a/src/os/FileStore.cc b/src/os/FileStore.cc
    index 3e7e8fb..4c59040 100644
    --- a/src/os/FileStore.cc
    +++ b/src/os/FileStore.cc
    @@ -2286,7 +2286,8 @@ unsigned FileStore::_do_transaction(
    
         Transaction::Op *op = i.decode_op();
         int r = 0;
    -
    +    if (g_conf->filestore_inject_blackhole)
    +        return 0;
         _inject_failure();
    
         switch (op->op) {
    

    qemu_rbd_open流程分析

    qemu进程启动参数:
    ./qemu-system-x86_64 -m 512 -smp 1 -drive file=rbd:rbd/vol1:auth_supported=none:mon_host=192.168.0.2\\:6789,cache=none,if=none,format=raw

    // qemu\block\rbd.c
    static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
                             Error **errp)
    {
        BDRVRBDState *s = bs->opaque;
        char pool[RBD_MAX_POOL_NAME_SIZE];
        char snap_buf[RBD_MAX_SNAP_NAME_SIZE];
        char conf[RBD_MAX_CONF_SIZE];
        char clientname_buf[RBD_MAX_CONF_SIZE];
        char *clientname;
        const char *secretid;
        QemuOpts *opts;
        Error *local_err = NULL;
        const char *filename;
        int r;
    
        opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort); // 初始化opts结构体
        qemu_opts_absorb_qdict(opts, options, &local_err);  // 把options转成opts
        if (local_err) {
            error_propagate(errp, local_err);
            qemu_opts_del(opts);
            return -EINVAL;
        }
    
        filename = qemu_opt_get(opts, "filename");  // "rbd:rbd/vol1:auth_supported=none:mon_host=192.168.0.2\\:6789"
        secretid = qemu_opt_get(opts, "password-secret"); // 0x0
    
        if (qemu_rbd_parsename(filename, pool, sizeof(pool), // 从filename中解析pool、snap、name、conf,其中pool是存储池,snap是快照名,name是卷名,conf是配置文件路径
                               snap_buf, sizeof(snap_buf),
                               s->name, sizeof(s->name),
                               conf, sizeof(conf), errp) < 0) {
            r = -EINVAL;
            goto failed_opts;
        }
    
        clientname = qemu_rbd_parse_clientname(conf, clientname_buf); // clientname=0x0,qemu没有传这个参数,为空
        r = rados_create(&s->cluster, clientname);
        if (r < 0) {
            error_setg_errno(errp, -r, "error initializing");
            goto failed_opts;
        }
    
        s->snap = NULL;
        if (snap_buf[0] != '\0') {
            s->snap = g_strdup(snap_buf);
        }
    
        if (strstr(conf, "conf=") == NULL) {
            /* try default location, but ignore failure */
            rados_conf_read_file(s->cluster, NULL);  // 走这里,命令行没有传“conf=”参数,librados有默认路径
        } else if (conf[0] != '\0') {
            r = qemu_rbd_set_conf(s->cluster, conf, true, errp);
            if (r < 0) {
                goto failed_shutdown;
            }
        }
    
        if (conf[0] != '\0') {
            r = qemu_rbd_set_conf(s->cluster, conf, false, errp); // 把conf中的参数传给librados的_conf对象,这里会覆盖上面从配置文件中读取的配置项,如mon_host
            if (r < 0) {
                goto failed_shutdown;
            }
        }
    
        if (qemu_rbd_set_auth(s->cluster, secretid, errp) < 0) {  // secretid为空
            r = -EIO;
            goto failed_shutdown;
        }
    
        /*
         * Fallback to more conservative semantics if setting cache
         * options fails. Ignore errors from setting rbd_cache because the
         * only possible error is that the option does not exist, and
         * librbd defaults to no caching. If write through caching cannot
         * be set up, fall back to no caching.
         */
        if (flags & BDRV_O_NOCACHE) {
            rados_conf_set(s->cluster, "rbd_cache", "false"); // 这里,这个flags估计要传递给part2
        } else {
            rados_conf_set(s->cluster, "rbd_cache", "true");
        }
    
        r = rados_connect(s->cluster);
        if (r < 0) {
            error_setg_errno(errp, -r, "error connecting");
            goto failed_shutdown;
        }
    
        r = rados_ioctx_create(s->cluster, pool, &s->io_ctx);
        if (r < 0) {
            error_setg_errno(errp, -r, "error opening pool %s", pool);
            goto failed_shutdown;
        }
    
        r = rbd_open(s->io_ctx, s->name, &s->image, s->snap);
        if (r < 0) {
            error_setg_errno(errp, -r, "error reading header from %s", s->name);
            goto failed_open;
        }
    
        bs->read_only = (s->snap != NULL);
    
        qemu_opts_del(opts);
        return 0;
    
    failed_open:
        rados_ioctx_destroy(s->io_ctx);
    failed_shutdown:
        rados_shutdown(s->cluster);
        g_free(s->snap);
    failed_opts:
        qemu_opts_del(opts);
        return r;
    }
    
    // src\librados\librados.cc
    // -- config --
    extern "C" int rados_conf_read_file(rados_t cluster, const char *path_list)
    {
      tracepoint(librados, rados_conf_read_file_enter, cluster, path_list);
      librados::RadosClient *client = (librados::RadosClient *)cluster;
      md_config_t *conf = client->cct->_conf;
      std::deque<std::string> parse_errors;
      int ret = conf->parse_config_files(path_list, &parse_errors, NULL, 0);
      if (ret) {
        tracepoint(librados, rados_conf_read_file_exit, ret);
        return ret;
      }
      conf->parse_env(); // environment variables override
    
      conf->apply_changes(NULL);
      complain_about_parse_errors(client->cct, &parse_errors);
      tracepoint(librados, rados_conf_read_file_exit, 0);
      return 0;
    }
    

    // src\common\config.cc
    int md_config_t::parse_config_files(const char *conf_files,
                    std::deque<std::string> *parse_errors,
                    std::ostream *warnings, int flags)
    {
      Mutex::Locker l(lock);
    
      if (internal_safe_to_start_threads)
        return -ENOSYS;
    
      if (!conf_files) {
        const char *c = getenv("CEPH_CONF");
        if (c) {
          conf_files = c;
        }
        else {
          if (flags & CINIT_FLAG_NO_DEFAULT_CONFIG_FILE)
            return 0;
          conf_files = CEPH_CONF_FILE_DEFAULT; // const char *CEPH_CONF_FILE_DEFAULT = "/etc/ceph/$cluster.conf, ~/.ceph/$cluster.conf, $cluster.conf";
        }
      }
      std::list<std::string> cfl;
      get_str_list(conf_files, cfl); // cfl = {[0] = "/etc/ceph/$cluster.conf", [1] = "~/.ceph/$cluster.conf", [2] = "$cluster.conf"}
      return parse_config_files_impl(cfl, parse_errors, warnings);
    }
    
    int md_config_t::parse_config_files_impl(const std::list<std::string> &conf_files,
                     std::deque<std::string> *parse_errors,
                     std::ostream *warnings)
    {
      assert(lock.is_locked());
    
      // open new conf
      list<string>::const_iterator c;
      for (c = conf_files.begin(); c != conf_files.end(); ++c) {
        cf.clear();
        string fn = *c;
        expand_meta(fn, warnings);
        int ret = cf.parse_file(fn.c_str(), parse_errors, warnings);
        if (ret == 0)
          break; // 读取到第一个配置文件就不再读取后面的,/etc/ceph/$cluster.conf,其中$cluster会被替换为默认值ceph
        else if (ret != -ENOENT)
          return ret;
      }
      if (c == conf_files.end())
        return -EINVAL;
    
      std::vector<std::string> my_sections;
      _get_my_sections(my_sections); // my_sections = {"client.admin", "client", "global"},正常会读[client]和[global]两个段下的配置项
      for (int i = 0; i < NUM_CONFIG_OPTIONS; i++) {
        config_option *opt = &config_optionsp[i];
        std::string val;
        int ret = _get_val_from_conf_file(my_sections, opt->name, val, false); // 从配置文件中读取配置项
        if (ret == 0) {
          set_val_impl(val.c_str(), opt); // 设置配置项的值,比如mon_host = 192.168.0.2
        }
      }
    
      // subsystems? 日志配置项
      for (int o = 0; o < subsys.get_num(); o++) {
        std::string as_option("debug_");
        as_option += subsys.get_name(o);
        std::string val;
        int ret = _get_val_from_conf_file(my_sections, as_option.c_str(), val, false);
        if (ret == 0) {
          int log, gather;
          int r = sscanf(val.c_str(), "%d/%d", &log, &gather);
          if (r >= 1) {
            if (r < 2)
              gather = log;
            // cout << "config subsys " << subsys.get_name(o) << " log " << log << " gather " << gather << std::endl;
            subsys.set_log_level(o, log);
            subsys.set_gather_level(o, gather);
          }
        }
      }
    
      // Warn about section names that look like old-style section names
      std::deque<std::string> old_style_section_names;
      for (ConfFile::const_section_iter_t s = cf.sections_begin();
           s != cf.sections_end(); ++s) {
        const string &str(s->first);
        if (((str.find("mds") == 0) || (str.find("mon") == 0) || (str.find("osd") == 0))
            && (str.size() > 3) && (str[3] != '.')) {
          old_style_section_names.push_back(str);
        }
      }
      if (!old_style_section_names.empty()) {
        ostringstream oss;
        oss << "ERROR! old-style section name(s) found: ";
        string sep;
        for (std::deque<std::string>::const_iterator os = old_style_section_names.begin();
             os != old_style_section_names.end(); ++os) {
          oss << sep << *os;
          sep = ", ";
        }
        oss << ". Please use the new style section names that include a period.";
        parse_errors->push_back(oss.str());
      }
      return 0;
    }


    Ceph内建辅助工具实践

    admin socket

    client端及server端(osd、mon)进程都会创建对应的admin socket(asok)文件,例如:

    # /var/run/ceph
    srwxr-xr-x 1 root root 0 Jul 17 14:01 ceph-client.admin.21723.24796272.asok
    srwxr-xr-x 1 root root 0 Jul 29 10:30 ceph-client.admin.2583.54700240.asok
    srwxr-xr-x 1 root root 0 Jul 29 10:30 ceph-client.admin.2583.55071360.asok
    srwxr-xr-x 1 root root 0 Jul 29 16:45 ceph-mon.ceph-l.asok
    srwxr-xr-x 1 root root 0 Jul 29 17:09 ceph-osd.0.asok
    srwxr-xr-x 1 root root 0 Jul 29 16:44 ceph-osd.1.asok
    srwxr-xr-x 1 root root 0 Jul 29 16:44 ceph-osd.2.asok
    srwxr-xr-x 1 root root 0 Jul 29 16:44 ceph-osd.3.asok
    

    创建流程

    注册命令hook函数

    CephContext::CephContext(uint32_t module_type_)
      ...
    {
      ...
      _admin_socket = new AdminSocket(this);
      _admin_hook = new CephContextHook(this);
      _admin_socket->register_command("perf dump", "perf dump name=logger,type=CephString,req=false name=counter,type=CephString,req=false", _admin_hook, "dump perfcounters value");
      ...
      _admin_socket->register_command("log reopen", "log reopen", _admin_hook, "reopen log file");
    }
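    
    如果想给自己的模块增加一条自定义的admin socket命令,可以仿照上面的方式实现一个AdminSocketHook子类并注册。下面是一个简化示意(MyHook、"my command"等名字均为举例假设,call()的参数形式沿用上面CephContextHook的写法):
    
    // 示意代码:自定义admin socket命令的hook(MyHook、"my command"均为假设的名字)
    #include "common/admin_socket.h"
    #include "common/cmdparse.h"
    
    class MyHook : public AdminSocketHook {
    public:
      explicit MyHook(CephContext *cct) : m_cct(cct) {}
      bool call(std::string command, cmdmap_t& cmdmap, std::string format,
                bufferlist& out) {
        if (command == "my command") {
          out.append("hello from my module\n"); // 写入out的内容会通过asok返回给ceph daemon命令
        }
        return true;
      }
    private:
      CephContext *m_cct;
    };
    
    // 注册(一般在模块初始化时执行一次):
    // cct->get_admin_socket()->register_command("my command", "my command",
    //                                           new MyHook(cct), "my test command");
    
    之后即可通过 ceph daemon <xxx>.asok "my command" 调用到call()。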
    

    线程创建及启动

    // ceph_osd.cc:main->common_init_finish->start_service_thread->AdminSocket::init
    
    class AdminSocket : public Thread
    {
    public:
      AdminSocket(CephContext *cct);
      virtual ~AdminSocket();
      ...
    }
    
    bool AdminSocket::init(const std::string &path)
    {
      ...
      err = create_shutdown_pipe(&pipe_rd, &pipe_wr);
      ...
      int sock_fd;
      err = bind_and_listen(path, &sock_fd);
      ...
    
      /* Create new thread */
      m_sock_fd = sock_fd;
      m_shutdown_rd_fd = pipe_rd;
      m_shutdown_wr_fd = pipe_wr;
      m_path = path;
    
      m_version_hook = new VersionHook;
      register_command("0", "0", m_version_hook, "");
      register_command("version", "version", m_version_hook, "get ceph version");
      register_command("git_version", "git_version", m_version_hook, "get git sha1");
      m_help_hook = new HelpHook(this);
      register_command("help", "help", m_help_hook, "list available commands");
      m_getdescs_hook = new GetdescsHook(this);
      register_command("get_command_descriptions", "get_command_descriptions",
               m_getdescs_hook, "list available commands");
    
      create(); // Thread::create
      add_cleanup_file(m_path.c_str());
      return true;
    }
    
    
    

    命令处理

    // perfcounter hooks
    
    class CephContextHook : public AdminSocketHook {
      CephContext *m_cct;
    
    public:
      CephContextHook(CephContext *cct) : m_cct(cct) {}
    
      bool call(std::string command, cmdmap_t& cmdmap, std::string format,
            bufferlist& out) {
        m_cct->do_command(command, cmdmap, format, &out);
        return true;
      }
    };
    
    
    void CephContext::do_command(std::string command, cmdmap_t& cmdmap,
                     std::string format, bufferlist *out)
    {
      Formatter *f = Formatter::create(format, "json-pretty", "json-pretty");
      stringstream ss;
      for (cmdmap_t::iterator it = cmdmap.begin(); it != cmdmap.end(); ++it) {
        if (it->first != "prefix") {
        ss << it->first  << ":" << cmd_vartype_stringify(it->second) << " ";
        }
      }
      lgeneric_dout(this, 1) << "do_command '" << command << "' '"
                 << ss.str() << dendl;
      if (command == "perfcounters_dump" || command == "1" ||
          command == "perf dump") {
        std::string logger;
        std::string counter;
        cmd_getval(this, cmdmap, "logger", logger);
        cmd_getval(this, cmdmap, "counter", counter);
        _perf_counters_collection->dump_formatted(f, false, logger, counter);
      }
      else if (command == "perfcounters_schema" || command == "2" ||
        command == "perf schema") {
        _perf_counters_collection->dump_formatted(f, true);
      }
      ...
    }
    

    功能用途

    查询版本/配置项信息/限流/性能计数器数据/reopen log/dump log/dump_mempools/dump_objectstore_kv_stats/dump_historic_ops等

    使用场景举例:

    Paxos耗时监控

    hzwangpan@pubt2-ceph27:/usr/local/nagent/libexec$ sudo ceph daemon mon.pubt2-ceph27  perf dump paxos
    {
        "paxos": {
            ...
            "begin_latency": {
                "avgcount": 0,
                "sum": 0.000000000
            },
            "commit": 0,
            "commit_keys": {
                "avgcount": 0,
                "sum": 0
            },
            "commit_bytes": {
                "avgcount": 0,
                "sum": 0
            },
            "commit_latency": {
                "avgcount": 0,
                "sum": 0.000000000
            },
            ...
        }
    }
    
    

    peering耗时监控

    hzwangpan@pubt2-ceph27:/usr/local/nagent/libexec$ sudo ceph daemon osd.1  perf dump recoverystate_perf | egrep "peering|waitup" -A3
            "peering_latency": {
                "avgcount": 803,
                "sum": 88.156743622
            },
    --
            "waitupthru_latency": {
                "avgcount": 0,
                "sum": 0.000000000
            }
    

    overflow巡检脚本

    for mon in `ls /var/run/ceph/ceph-mon.*.asok`; do
       echo config set debug_ms=0/15 for $mon
       sudo ceph --admin-daemon $mon config set debug_ms 0/15
    done
    
    sleep 5
    
    for mon in `ls /var/run/ceph/ceph-mon.*.asok`; do
       echo dump log for $mon
       sudo ceph --admin-daemon $mon log dump
    done
    
    sleep 2
    
    for mon in `ls /var/run/ceph/ceph-mon.*.asok`; do
       echo config set debug_ms=0/5 for $mon
       sudo ceph --admin-daemon $mon config set debug_ms 0/5
    done
    
    echo ==========WARNING MON start==============
    grep -H "got ack seq" /data/log/ceph/ceph-mon.*.log | grep " 15 -- " | awk '$NF>1073741824 {print $0}' | awk -F'ceph-|.log:' '{print $2}' | sort | uniq
    echo ==========WARNING MON end==============
    

    修改配置项

    单实例配置项修改

    ceph daemon osd.0 config set debug_ms 15
    

    实现:参考上面的admin socket创建注册流程,会有针对config set命令的处理函数。

    多实例配置项修改

    ceph tell osd.* injectargs '--debug_ms=15'
    

    实现:
    在/usr/bin/ceph中检测tell命令的后续参数:如果指定的是具体target(如osd.0),则直接发送injectargs message给osd.0,由osd进程的tp_osd_cmd线程调用do_command函数处理(L版本代码流程);之后调用md_config_t::injectargs完成实际的配置项修改,最后由handle_conf_change执行配置项变化的side effect相关操作。

    如果指定的是osd.*,则先通过mon获取所有osd id列表,然后对每个osd循环执行上面的操作。
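    
    补充说明:上面提到的handle_conf_change这类side effect,是通过md_config_t的观察者机制实现的:模块实现md_config_obs_t接口(见src/common/config_obs.h)并注册后,injectargs/config set修改配置项时会在apply_changes中回调handle_conf_change。下面是一个简化示意(MyConfObserver为假设的名字):
    
    // 示意代码:关注debug_ms变化的配置项观察者(MyConfObserver为假设的名字)
    #include <set>
    #include <string>
    #include "common/config_obs.h"
    
    struct MyConfObserver : public md_config_obs_t {
      const char **get_tracked_conf_keys() const {
        static const char *keys[] = { "debug_ms", NULL };
        return keys;
      }
      void handle_conf_change(const struct md_config_t *conf,
                              const std::set<std::string> &changed) {
        if (changed.count("debug_ms")) {
          // 在这里执行配置项变化后的处理逻辑(side effect)
        }
      }
    };
    
    // 注册:cct->_conf->add_observer(new MyConfObserver());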

    性能分析相关

    性能概览

    ceph osd perf

    查看osd的FileStore时延信息

    root@ceph-l ceph $ ceph osd perf
    osd fs_commit_latency(ms) fs_apply_latency(ms) 
      0                    65                   72 
      1                    27                   36 
      2                    22                   31 
      3                     0                    0 
    

    ceph -s

    hzwangpan@prit1-ceph19:~$ ceph -s
        cluster ee7b0372-f77e-4504-a4cf-4f5a3f503857
         health HEALTH_WARN
                pool bigceph_jiande2_sas_vol01 has too few pgs
         monmap e1: 3 mons at {prit1-ceph19=10.198.136.19:6789/0,prit1-ceph20=10.198.136.20:6789/0,prit1-ceph21=10.198.136.21:6789/0}
                election epoch 36, quorum 0,1,2 prit1-ceph19,prit1-ceph20,prit1-ceph21
         osdmap e12259: 267 osds: 267 up, 267 in
          pgmap v10782798: 8192 pgs, 9 pools, 12860 GB data, 3235 kobjects
                38661 GB used, 417 TB / 454 TB avail
                    8182 active+clean
                      10 active+clean+scrubbing+deep
      client io 57582 kB/s rd, 14104 kB/s wr, 2831 op/s
    

    ceph osd pool stat

    // 没有物理池级别统计
    hzwangpan@prit1-ceph19:~$ ceph osd pool stats
    pool bigceph_jiande2_sas_vol01 id 4
      client io 56782 kB/s rd, 15450 kB/s wr, 2498 op/s
    
    pool bigceph_jiande2_ssd_vol01 id 5
      client io 77283 B/s rd, 6136 kB/s wr, 1846 op/s
    
    pool data02_bigdata_ssd_vol01 id 7
      nothing is going on
    

    这些数据是由osd定期上报给mon(OSD::tick->OSD::do_mon_report->OSD::send_pg_stats:m->osd_stat),然后mon直接dump出来即可,不是特别实时。

    perf counter

    perf counter的初始化过程

    代码流程:ceph_osd.cc:main->OSD::init->OSD::create_recoverystate_perf/OSD::create_logger

    // OSD.h
    enum {
      l_osd_first = 10000,
      l_osd_op_wip,
      l_osd_op,
      l_osd_op_inb,
      l_osd_op_outb,
      l_osd_op_lat,
      ...
    };
    
    
    void OSD::create_logger()
    {
      dout(10) << "create_logger" << dendl;
    
      PerfCountersBuilder osd_plb(cct, "osd", l_osd_first, l_osd_last);
    
      osd_plb.add_u64(l_osd_op_wip, "op_wip");   // rep ops currently being processed (primary)
    
      osd_plb.add_u64_counter(l_osd_op,       "op");           // client ops
      ...
      logger = osd_plb.create_perf_counters();
      cct->get_perfcounters_collection()->add(logger);
    }
    
    
    void OSD::create_recoverystate_perf()
    {
      dout(10) << "create_recoverystate_perf" << dendl;
    
      PerfCountersBuilder rs_perf(cct, "recoverystate_perf", rs_first, rs_last);
      ...
      rs_perf.add_time_avg(rs_peering_latency, "peering_latency");
      ...
      rs_perf.add_time_avg(rs_getmissing_latency, "getmissing_latency");
      rs_perf.add_time_avg(rs_waitupthru_latency, "waitupthru_latency");
    
      recoverystate_perf = rs_perf.create_perf_counters();
      cct->get_perfcounters_collection()->add(recoverystate_perf);
    }
    
    void PerfCountersBuilder::add_time_avg(int idx, const char *name)
    {
      add_impl(idx, name, PERFCOUNTER_TIME | PERFCOUNTER_LONGRUNAVG);
    }
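    
    按照上面OSD的做法,给自己的模块增加perf counter的基本步骤是:定义一段枚举区间、用PerfCountersBuilder逐个添加计数器、create_perf_counters()后加入collection,之后即可inc/tinc。简化示意如下(l_mymod_*、"mymod"等名字均为举例假设):
    
    // 示意代码:为一个自定义模块创建perf counter(枚举名、"mymod"均为假设)
    #include "common/ceph_context.h"
    #include "common/perf_counters.h"
    
    enum {
      l_mymod_first = 90000,
      l_mymod_ops,      // 累加计数
      l_mymod_op_lat,   // 平均时延
      l_mymod_last,
    };
    
    PerfCounters *create_mymod_perf(CephContext *cct)
    {
      PerfCountersBuilder plb(cct, "mymod", l_mymod_first, l_mymod_last);
      plb.add_u64_counter(l_mymod_ops, "ops");
      plb.add_time_avg(l_mymod_op_lat, "op_latency");
      PerfCounters *logger = plb.create_perf_counters();
      cct->get_perfcounters_collection()->add(logger);
      return logger;
    }
    
    // 使用:logger->inc(l_mymod_ops); logger->tinc(l_mymod_op_lat, latency);
    // 之后即可通过ceph daemon <xxx>.asok perf dump mymod查看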
    
    
    • add_u64/add_u64_counter:累加值
    root@ceph-l ceph $ ceph daemon osd.0 perf dump | grep op_wip -A3
            "op_wip": 0,
            "op": 679522,
    
    • add_u64_avg:总/平均值
    root@ceph-l ceph $ ceph daemon osd.0 perf dump | grep journal_wr_bytes -A3
            "journal_wr_bytes": {
                "avgcount": 414183,
                "sum": 11194507264
            },
    
    • add_time:累加值
        } else if (d->type & PERFCOUNTER_TIME) {
          f->dump_format_unquoted(d->name, "%" PRId64 ".%09" PRId64,
                      v / 1000000000ull,  // 秒
                      v % 1000000000ull); // 纳秒部分,作为小数补齐9位
        }
        // 结果类似"sum": 123.450000000
    
    • add_time_avg:平均值
    root@ceph-l ceph $ ceph daemon osd.0 perf dump | grep op_r_latency -A3    
            "op_r_latency": {
                "avgcount": 1234,
                "sum": 10.300000000
            },
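    
    add_time_avg类型的平均时延可以用sum除以avgcount得到,例如上面op_r_latency的平均读时延约为10.3s / 1234 ≈ 8.3ms(后面L版本的perf dump输出里还会直接给出avgtime字段)。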
    

    perf counter使用方法

    void ReplicatedPG::log_op_stats(OpContext *ctx)
    {
      OpRequestRef op = ctx->op;
      MOSDOp *m = static_cast<MOSDOp*>(op->get_req());
    
      utime_t now = ceph_clock_now(cct);
      utime_t latency = now;
      latency -= ctx->op->get_req()->get_recv_stamp();
      utime_t process_latency = now;
      process_latency -= ctx->op->get_dequeued_time();
    
      utime_t rlatency;
      if (ctx->readable_stamp != utime_t()) {
        rlatency = ctx->readable_stamp;
        rlatency -= ctx->op->get_req()->get_recv_stamp();
      }
    
      uint64_t inb = ctx->bytes_written;
      uint64_t outb = ctx->bytes_read;
    
      osd->logger->inc(l_osd_op);
    
      osd->logger->inc(l_osd_op_outb, outb);
      ...
      osd->logger->tinc(l_osd_op_process_lat, process_latency);
    
      if (op->may_read() && op->may_write()) {
        ...
      } else if (op->may_read()) {
        osd->logger->inc(l_osd_op_r);
        ...
        osd->logger->tinc(l_osd_op_r_process_lat, process_latency);
      }
      ...
    }
    
    
    void PerfCounters::tinc(int idx, utime_t amt)
    {
      if (!m_cct->_conf->perf)
        return;
    
      assert(idx > m_lower_bound);
      assert(idx < m_upper_bound);
      perf_counter_data_any_d& data(m_data[idx - m_lower_bound - 1]);
      if (!(data.type & PERFCOUNTER_TIME))
        return;
      if (data.type & PERFCOUNTER_LONGRUNAVG) {
        data.avgcount.inc();
        data.u64.add(amt.to_nsec());
        data.avgcount2.inc();
      } else {
        data.u64.add(amt.to_nsec());
      }
    }
    

    perf counter结果获取

    void PerfCounters::dump_formatted(Formatter *f, bool schema,
        const std::string &counter)
    {
      f->open_object_section(m_name.c_str());
    
      for (perf_counter_data_vec_t::const_iterator d = m_data.begin();
           d != m_data.end(); ++d) {
        ...
        if (schema) {
          f->open_object_section(d->name);
          f->dump_int("type", d->type);
          f->close_section();
        } else {
          if (d->type & PERFCOUNTER_LONGRUNAVG) {
        f->open_object_section(d->name);
        pair<uint64_t ,uint64_t> a = d->read_avg();
        if (d->type & PERFCOUNTER_U64) {
          f->dump_unsigned("avgcount", a.second);
          f->dump_unsigned("sum", a.first);
        } else if (d->type & PERFCOUNTER_TIME) {
          f->dump_unsigned("avgcount", a.second);
          f->dump_format_unquoted("sum", "%" PRId64 ".%09" PRId64,
                      a.first / 1000000000ull,
                      a.first % 1000000000ull);
        ...
    }
    

    举例:peering各阶段耗时、写journal耗时、慢盘检查等

    hzwangpan@prit1-ceph20:~$ sudo ceph daemon osd.10 perf dump | egrep "peering|getinfo|getlog|getmiss|waitup"  -A3 
            "peering_latency": {
                "avgcount": 240,
                "sum": 12.630200202
            },
    --
            "getinfo_latency": {
                "avgcount": 240,
                "sum": 7.091297764
            },
            "getlog_latency": {
                "avgcount": 187,
                "sum": 4.670226353
            },
    --
            "getmissing_latency": {
                "avgcount": 187,
                "sum": 0.861913789
            },
            "waitupthru_latency": {
                "avgcount": 0,
                "sum": 0.000000000
            }
    
    hzwangpan@prit1-ceph20:~$ sudo ceph daemon osd.10 perf dump | egrep "journal_latency"  -A3
            "journal_latency": {
                "avgcount": 500990,
                "sum": 128.696573903
            },
    
    hzwangpan@prit1-ceph20:~$ sudo ceph daemon osd.10 perf dump | egrep "aio_"  -A3
            "aio_usual_lat": {
                "avgcount": 479092,
                "sum": 72.261508449
            },
            "aio_unusual_lat": {
                "avgcount": 0,
                "sum": 0.000000000
            },
            "aio_slow_lat": {
                "avgcount": 0,
                "sum": 0.000000000
            },
    

    op tracker

    op tracker初始化

    OSD::OSD(CephContext *cct_, ObjectStore *store_,
      ...
      op_tracker(cct, cct->_conf->osd_enable_op_tracker,  // 默认true
                      cct->_conf->osd_num_op_tracker_shard), // 默认32
      ...
      {
        ...
        op_tracker.set_complaint_and_threshold(cct->_conf->osd_op_complaint_time,
                                             cct->_conf->osd_op_log_threshold);
        op_tracker.set_history_size_and_duration(cct->_conf->osd_op_history_size,
                                               cct->_conf->osd_op_history_duration);
        ...
      }
    
    OpTracker(CephContext *cct_, bool tracking, uint32_t num_shards) : seq(0), 
                                         num_optracker_shards(num_shards),
                         complaint_time(0), log_threshold(0),
                         tracking_enabled(tracking), cct(cct_) { // tracking_enabled不能动态修改,需要重启osd进程
    
        for (uint32_t i = 0; i < num_optracker_shards; i++) {
          char lock_name[32] = {0};
          snprintf(lock_name, sizeof(lock_name), "%s:%d", "OpTracker::ShardedLock", i);
          ShardedTrackingData* one_shard = new ShardedTrackingData(lock_name);
          sharded_in_flight_list.push_back(one_shard);
        }
    

    op事件记录

    void OSD::ms_fast_dispatch(Message *m)
    {
      if (service.is_stopping()) {
        m->put();
        return;
      }
      OpRequestRef op = op_tracker.create_request<OpRequest>(m); // 创建op
      ...
    }
    
    struct OpRequest : public TrackedOp {
      friend class OpTracker;
      ...
      void mark_queued_for_pg() {
        mark_flag_point(flag_queued_for_pg, "queued_for_pg");
      }
      void mark_reached_pg() {
        mark_flag_point(flag_reached_pg, "reached_pg");
      }
      void mark_delayed(const string& s) {
        mark_flag_point(flag_delayed, s);
      }
      void mark_started() {
        mark_flag_point(flag_started, "started");
      }
      void mark_sub_op_sent(const string& s) {
        mark_flag_point(flag_sub_op_sent, s);
      }
      void mark_commit_sent() {
        mark_flag_point(flag_commit_sent, "commit_sent");
      }
      ...
    }
    
    
    void OpRequest::mark_flag_point(uint8_t flag, const string& s) {
      ...
      mark_event(s);
      ...
    }
    
    
    void ReplicatedBackend::op_commit(
      InProgressOp *op)
    {
      dout(10) << __func__ << ": " << op->tid << dendl;
      if (op->op)
        op->op->mark_event("op_commit");
      ...
    }
    

    op事件查看

    root@ceph1 ~ $ ceph daemon osd.0 dump_ops_in_flight
    {
        "ops": [
            {
                "description": "osd_op(client.115363.0:416 6.3c 6:3faa4df8:::benchmark_data_ceph1_589633_object415:head [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] snapc 0=[] ondisk+write+known_if_redirected e337)",
                "initiated_at": "2019-07-30 21:31:38.924985",
                "age": 0.080449,
                "duration": 0.080484,
                "type_data": {
                    "flag_point": "waiting for sub ops",
                    "client_info": {
                        "client": "client.115363",
                        "client_addr": "192.168.0.2:0/176428539",
                        "tid": 416
                    },
                    "events": [
                        {
                            "time": "2019-07-30 21:31:38.924985",
                            "event": "initiated"
                        },
                        {
                            "time": "2019-07-30 21:31:38.930980",
                            "event": "queued_for_pg"
                        },
                        {
                            "time": "2019-07-30 21:31:38.931220",
                            "event": "reached_pg"
                        },
                        {
                            "time": "2019-07-30 21:31:38.931922",
                            "event": "started"
                        },
                        {
                            "time": "2019-07-30 21:31:38.932345",
                            "event": "waiting for subops from 1"
                        },
                        {
                            "time": "2019-07-30 21:31:38.970683",
                            "event": "sub_op_commit_rec from 1"
                        }
                    ]
                }
            },
            ...
    
    root@ceph1 ~ $ ceph daemon osd.0 dump_historic_ops
    {
        "size": 20,
        "duration": 600,
        "ops": [
            {
                "description": "osd_op(client.115364.0:18 6.11 6:89bb7b70:::benchmark_data_ceph1_589885_object17:head [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] snapc 0=[] ondisk+write+known_if_redirected e337)",
                "initiated_at": "2019-07-30 21:58:08.250964",
                "age": 1.486169,
                "duration": 0.111498,
                "type_data": {
                    "flag_point": "commit sent; apply or cleanup",
                    "client_info": {
                        "client": "client.115364",
                        "client_addr": "192.168.0.2:0/3144333915",
                        "tid": 18
                    },
                    "events": [
                        {
                            "time": "2019-07-30 21:58:08.250964",
                            "event": "initiated"
                        },
                        {
                            "time": "2019-07-30 21:58:08.253853",
                            "event": "queued_for_pg"
                        },
                        {
                            "time": "2019-07-30 21:58:08.253941",
                            "event": "reached_pg"
                        },
                        {
                            "time": "2019-07-30 21:58:08.254432",
                            "event": "started"
                        },
                        {
                            "time": "2019-07-30 21:58:08.256617",
                            "event": "waiting for subops from 1"
                        },
                        {
                            "time": "2019-07-30 21:58:08.317819",
                            "event": "sub_op_commit_rec from 1"
                        },
                        {
                            "time": "2019-07-30 21:58:08.361995",
                            "event": "op_commit"
                        },
                        {
                            "time": "2019-07-30 21:58:08.362093",
                            "event": "commit_sent"
                        },
                        {
                            "time": "2019-07-30 21:58:08.362126",
                            "event": "op_applied"
                        },
                        {
                            "time": "2019-07-30 21:58:08.362462",
                            "event": "done"
                        }
                    ]
                }
            },
            ...
    

    slow request达到阈值也会打印到日志里。

    基于op tracker的slow op日志改进

    痛点:op history无法长期保存

    解决方法:修改代码,把耗时较长的op history信息持久化到日志文件

    2019-04-26 06:25:05.454228 7f0ac2c17700 0 — op tracker — slow op: osd_op(client.544152.0:3674314 rbd_data.2af5f6b8b4567.0000000000000065 [write 961608~4096] 0.8c5714c5 snapc 5=[] ondisk+write+known_if_redirected e57939), start at: 2019-04-26 06:25:05.343005, duration: 0.111196, history: (0ms: initiated) (0ms: queued_for_pg) (0ms: reached_pg) (0ms: started) (0ms: waiting for subops from 1,2) (23ms: commit_queued_for_journal_write) (43ms: write_thread_in_journal_buffer) (38ms: sub_op_commit_rec) (0ms: filestore_queued_op) (0ms: journaled_completion_queued) (0ms: op_commit) (0ms: sub_op_commit_rec) (4ms: commit_sent) (0ms: filestore_do_op) (0ms: filestore_do_op_end) (0ms: op_applied) (0ms: done)
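
    上面日志行里每个"(Nms: event)"是相邻两个事件的时间差。下面用一段独立的代码示意这种格式化逻辑(events的类型为举例假设,实际实现需要挂接到OpTracker把op移入history的代码路径上):

    // 示意代码:把op的事件列表(时间戳+事件名)格式化成"(Nms: event)"串,时间戳单位为秒
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    std::string format_op_history(const std::vector<std::pair<double, std::string> > &events)
    {
      std::ostringstream oss;
      double prev = events.empty() ? 0.0 : events.front().first;
      for (size_t i = 0; i < events.size(); ++i) {
        oss << "(" << (int)((events[i].first - prev) * 1000) << "ms: " << events[i].second << ") ";
        prev = events[i].first;
      }
      return oss.str();
    }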

    举例:线上IO抖动问题分析

    萧山A云主机系统盘util频繁100%性能问题分析过程记录

    blkin+lttng

    举例:rbd卷读写全链路时延分析

    Ceph-blkin+lttng+zipkin性能追踪工具

    tcmalloc支持

    ceph tell osd.0 heap start_profiler/dump/stats/release

    举例:L版本EC OSD内存占用分析

    Ceph CPU&MEMORY profiling

    mempool(L版本)

    root@ceph1 ~ $ ceph daemon osd.0 dump_mempools                  
    {
        ...
        "bluestore_alloc": {
            "items": 512,
            "bytes": 512
        },
        "bluestore_cache_data": {
            "items": 3,
            "bytes": 12288
        },
        "bluestore_cache_onode": {
            "items": 879,
            "bytes": 590688
        },
        "bluestore_cache_other": {
            "items": 96422,
            "bytes": 2844381
        },
        ...
        "bluestore_writing": {
            "items": 2,
            "bytes": 8388608
        },
        "bluefs": {
            "items": 90,
            "bytes": 5496
        },
        ...
        "osd": {
            "items": 64,
            "bytes": 773632
        },
        "osd_mapbl": {
            "items": 0,
            "bytes": 0
        },
        "osd_pglog": {
            "items": 643642,
            "bytes": 308785360
        },
        "osdmap": {
            "items": 245,
            "bytes": 18752
        },
        "osdmap_mapping": {
            "items": 0,
            "bytes": 0
        },
        "pgmap": {
            "items": 0,
            "bytes": 0
        },
        ...
        "total": {
            "items": 741904,
            "bytes": 323610213
        }
    }
    

    打开debug开关可以dump更详细的信息:

    root@ceph1 ~ $ ceph daemon osd.0 config get mempool_debug
    {
        "mempool_debug": "false"
    }
    root@ceph1 ~ $ ceph daemon osd.0 config set mempool_debug true
    {
        "success": "mempool_debug = 'true' osd_objectstore = 'bluestore' (not observed, change may require restart) rocksdb_separate_wal_dir = 'false' (not observed, change may require restart) "
    }
    root@ceph1 ~ $ ceph daemon osd.0 dump_mempools                
    {
        "bloom_filter": {
            "items": 0,
            "bytes": 0,
            "by_type": {
                "unsigned char": {
                    "items": 0,
                    "bytes": 0
                }
            }
        },
        "bluestore_alloc": {
            "items": 512,
            "bytes": 512,
            "by_type": {
                "BitAllocator": {
                    "items": 0,
                    "bytes": 0
                },
                "BitMapArea": {
                    "items": 0,
                    "bytes": 0
                },
                "BitMapAreaIN": {
                    "items": 0,
                    "bytes": 0
                },
                "BitMapAreaLeaf": {
                    "items": 0,
                    "bytes": 0
                },
                "BitMapZone": {
                    "items": 0,
                    "bytes": 0
                },
                "BmapEntry": {
                    "items": 0,
                    "bytes": 0
                }
            }
        },
        "bluestore_cache_data": {
            "items": 3,
            "bytes": 12288
        },
        "bluestore_cache_onode": {
            "items": 879,
            "bytes": 590688,
            "by_type": {
                "BlueStore::Onode": {
                    "items": 879,
                    "bytes": 590688
                }
            }
        },
        ...
    

    mempool介绍

    在src\include\mempool.h头文件里有详细的使用方法介绍。引入mempool的主要目的是统计内存占用情况,绝大部分占用内存的组件或对象都已经加入mempool的管理中。

    // src\include\mempool.h
    #define DEFINE_MEMORY_POOLS_HELPER(f) \
      f(bloom_filter)             \
      f(bluestore_alloc)              \
      f(bluestore_cache_data)         \
      f(bluestore_cache_onode)        \
      f(bluestore_cache_other)        \
      f(bluestore_fsck)           \
      f(bluestore_txc)            \
      f(bluestore_writing_deferred)       \
      f(bluestore_writing)            \
      f(bluefs)               \
      f(buffer_anon)              \
      f(buffer_meta)              \
      f(osd)                  \
      f(osd_mapbl)                \
      f(osd_pglog)                \
      f(osdmap)               \
      f(osdmap_mapping)           \
      f(pgmap)                \
      f(mds_co)               \
      f(unittest_1)               \
      f(unittest_2)
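
    把自己的类型或容器纳入mempool统计的方式在src/include/mempool.h的注释中有说明,大致有两种:给类加MEMPOOL_CLASS_HELPERS()并在.cc中用MEMPOOL_DEFINE_OBJECT_FACTORY把它关联到某个pool;或者直接使用带pool allocator的容器类型。以下为一个简化示意(MyObject为假设的类型名,宏名以实际头文件为准):

    // 示意代码:让MyObject的new/delete分配计入osd这个pool(MyObject为假设的类型)
    #include "include/mempool.h"

    struct MyObject {
      MEMPOOL_CLASS_HELPERS();   // 重载new/delete,分配统计到关联的pool
      int foo;
    };
    MEMPOOL_DEFINE_OBJECT_FACTORY(MyObject, my_object, osd);  // 一般放在对应的.cc中

    // 直接使用带pool allocator的容器,内存会统计到osd_pglog pool
    mempool::osd_pglog::list<int> my_list;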
    

    rocksdb内存及其他统计信息获取(L版本)

    需要打开rocksdb_perf配置项才可以使用。

    root@ceph1 ~ $ ceph daemon osd.1 perf dump rocksdb
    {
        "rocksdb": {
            "get": 531,
            "submit_transaction": 352,
            "submit_transaction_sync": 195,
            "get_latency": {
                "avgcount": 531,
                "sum": 0.086399156,
                "avgtime": 0.000162710
            },
            ...
            "submit_sync_latency": {
                "avgcount": 195,
                "sum": 0.995746355,
                "avgtime": 0.005106391
            },
            ...
            "rocksdb_write_memtable_time": {
                "avgcount": 513,
                "sum": 0.008326307,
                "avgtime": 0.000016230
            },
            "rocksdb_write_delay_time": {
                "avgcount": 513,
                "sum": 0.000000000,
                "avgtime": 0.000000000
            },
            ...
    

    还有rocksdb本身的一些统计信息,都是从rocksdb自身的统计接口里获取的数据。这些信息需要打开rocksdb_collect_memory_stats、rocksdb_collect_extended_stats、rocksdb_collect_compaction_stats配置项。

    root@ceph1 ~ $ ceph daemon osd.0 dump_objectstore_kv_stats
    {
        "rocksdb_compaction_statistics": "",
        "": "",
        "": "** Compaction Stats [default] **",
        "": "Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
        "": "----------------------------------------------------------------------------------------------------------------------------------------------------------",
        "": "  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0      4.0         0         1    0.018       0      0",
        "": "  L1      3/0   132.66 MB   0.5      0.2     0.1      0.1       0.1      0.0       0.0   1.8     39.0     25.0         5         1    5.303   1432K   722K",
        "": " Sum      3/0   132.66 MB   0.0      0.2     0.1      0.1       0.1      0.0       0.0 1862.6     38.9     24.9         5         2    2.661   1432K   722K",
        "": " Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000       0      0",
        ...
        "": "** File Read Latency Histogram By Level [default] **",
        "": "** Level 0 read latency histogram (micros):",
        "": "Count: 2280 Average: 908.5671  StdDev: 1394.78",
        "": "Min: 455  Median: 789.0152  Max: 24077",
        "": "Percentiles: P50: 789.02 P75: 868.51 P99: 3150.00 P99.9: 23371.43 P99.99: 24077.00",
        "": "------------------------------------------------------",
        "": "[     450,     500 )        4   0.175%   0.175% ",
        "": "[     500,     600 )       89   3.904%   4.079% #",
        "": "[     600,     700 )      342  15.000%  19.079% ###",
        "": "[     700,     800 )      792  34.737%  53.816% #######",
        "": "[     800,     900 )      705  30.921%  84.737% ######",
        "": "[     900,    1000 )      253  11.096%  95.833% ##",
        ...
        "": "[   14000,   16000 )        2   0.088%  99.693% ",
        "": "[   20000,   25000 )        7   0.307% 100.000% ",
        "": "",
        "": "** Level 1 read latency histogram (micros):",
        "": "Count: 33743 Average: 849.7653  StdDev: 522.51",
        "": "Min: 434  Median: 825.1231  Max: 26385",
        "": "Percentiles: P50: 825.12 P75: 914.89 P99: 1195.47 P99.9: 5564.25 P99.99: 24141.06",
        "": "------------------------------------------------------",
        "": "[     400,     450 )        8   0.024%   0.024% ",
        "": "[     450,     500 )       66   0.196%   0.219% ",
        "": "[     500,     600 )     1067   3.162%   3.381% #",
        "": "[     600,     700 )     4231  12.539%  15.920% ###",
        "": "[     700,     800 )     8999  26.669%  42.590% #####",
        "": "[     800,     900 )     9953  29.496%  72.086% ######",
        "": "[     900,    1000 )     6605  19.574%  91.660% ####",
        "": "[    1000,    1200 )     2534   7.510%  99.170% ##",
        "": "[    1200,    1400 )       58   0.172%  99.342% ",
        "": "[    1400,    1600 )       25   0.074%  99.416% ",
        ...
        "": "[   10000,   12000 )        2   0.006%  99.947% ",
        "": "[   14000,   16000 )        2   0.006%  99.953% ",
        "": "[   16000,   18000 )        3   0.009%  99.961% ",
        "": "[   18000,   20000 )        3   0.009%  99.970% ",
        "": "[   20000,   25000 )        8   0.024%  99.994% ",
        "": "[   25000,   30000 )        2   0.006% 100.000% ",
        "": "",
        "": "",
        "": "** DB Stats **",
        "": "Uptime(secs): 1554.3 total, 938.9 interval",
        "": "Cumulative writes: 2292 writes, 5842 keys, 2292 commit groups, 1.0 writes per commit group, ingest: 0.00 GB, 0.00 MB/s",
        "": "Cumulative WAL: 2292 writes, 606 syncs, 3.78 writes per sync, written: 0.00 GB, 0.00 MB/s",
        "": "Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent",
        "": "Interval writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 MB, 0.00 MB/s",
        "": "Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s",
        "": "Interval stall: 00:00:0.000 H:M:S, 0.0 percent"
    }
    {
        "rocksdb_extended_statistics": "",
        ".": "rocksdb.block.cache.miss COUNT : 85828",
        ".": "rocksdb.block.cache.hit COUNT : 8066",
        ".": "rocksdb.block.cache.add COUNT : 36002",
        ...
        ".": "rocksdb.num.subcompactions.scheduled statistics Percentiles :=> 50 : 0.000000 95 : 0.000000 99 : 0.000000 100 : 0.000000",
        ".": "rocksdb.bytes.per.read statistics Percentiles :=> 50 : 0.610333 95 : 191.390625 99 : 871.390625 100 : 883.000000",
        ".": "rocksdb.bytes.per.write statistics Percentiles :=> 50 : 997.607656 95 : 1954.814815 99 : 1997.259259 100 : 30050.000000",
        ...
    }
    {
        "rocksdb": {
            "get": 1831,
            "submit_transaction": 1686,
            "submit_transaction_sync": 606,
            "get_latency": {
                "avgcount": 1831,
                "sum": 0.262964280,
                "avgtime": 0.000143617
            },
            ...
    {
        "block_cache_usage": "155874833",
        "block_cache_pinned_blocks_usage": "0",
        "rocksdb_memtable_usage": "2228896"
    }
    

    dump数据流程

    void RocksDBStore::get_statistics(Formatter *f)
    {
      if (!g_conf->rocksdb_perf)  {
        dout(20) << __func__ << "RocksDB perf is disabled, can't probe for stats"
             << dendl;
        return;
      }
    
      if (g_conf->rocksdb_collect_compaction_stats) {
        std::string stat_str;
        bool status = db->GetProperty("rocksdb.stats", &stat_str);
        ...
      }
      if (g_conf->rocksdb_collect_extended_stats) {
        if (dbstats) {
          f->open_object_section("rocksdb_extended_statistics");
          string stat_str = dbstats->ToString();
          ...
      }
      if (g_conf->rocksdb_collect_memory_stats) {
        f->open_object_section("rocksdb_memtable_statistics");
        std::string str(stringify(bbt_opts.block_cache->GetUsage()));
        f->dump_string("block_cache_usage", str.data());
        str.clear();
        str.append(stringify(bbt_opts.block_cache->GetPinnedUsage()));
        f->dump_string("block_cache_pinned_blocks_usage", str);
        str.clear();
        db->GetProperty("rocksdb.cur-size-all-mem-tables", &str);
        f->dump_string("rocksdb_memtable_usage", str);
        f->close_section();
      }
    }
    

    异常问题分析相关

    signal处理/assert打印调用栈

    线程健康状态监控及调用栈打印(当前及其他线程)

    Ceph中的信号处理、调用栈打印机制分析

    Ceph OSD IO线程健康状态检查机制

    valgrind

    ceph在teuthology集成了valgrind测试(并且有专门的配置文件:src/valgrind.sup),在跑用例过程中同时监控osd、mon的内存使用情况,发现内存泄露问题可以报出来。

    https://blog.dachary.org/2013/08/14/howto-valgrind-ceph-with-teuthology/

    crash dump或者正常退出可以打印保存在内存缓冲区的日志

    默认内存缓冲区日志级别比较高,可以打印更多日志。

    int OSD::shutdown()
    {
      ...
      if (cct->_conf->get_val<bool>("osd_debug_shutdown")) { // 默认关闭
        cct->_conf->set_val("debug_osd", "100");
        cct->_conf->set_val("debug_journal", "100");
        cct->_conf->set_val("debug_filestore", "100");
        cct->_conf->set_val("debug_bluestore", "100");
        cct->_conf->set_val("debug_ms", "100");
        cct->_conf->apply_changes(NULL);
      }
      ...
    
    // src\common\assert.cc
      void __ceph_assert_fail(const char *assertion, const char *file, int line,
                  const char *func)
      {
        ...
        if (g_assert_context) {
          lderr(g_assert_context) << buf << std::endl;
          *_dout << oss.str();
          *_dout << " NOTE: a copy of the executable, or `objdump -rdS <executable>` "
             < < "is needed to interpret this.\n" << dendl;
    
          g_assert_context->_log->dump_recent(); // 打印内存缓冲区日志,包括所有子模块
        }
    
        abort();
      }
    

    wireshark ceph协议支持

    • https://docs.ceph.com/docs/master/dev/wireshark/
    • https://www.wireshark.org/docs/dfref/c/ceph.html
    • http://aspirer.wang/?p=1312 (第三节:使用tcpdump+wireshark解析Ceph网络包)

    rbd客户端

    krbd

    使用

    modprobe rbd
    rbd map pool/volume
    rbd unmap /dev/rbd0
    rbd showmapped
    

    原理

    与rbd-nbd类似,但不经过用户态的nbd server和librbd,而是由内核的rbd/libceph模块直接与mon、osd通信,相当于在内核态实现了一个rbd客户端。

    注意事项

    • krbd方式需要卷的feature只有layering一个,如含有其他feature则无法映射卷到物理机的块设备
    • 对内核版本有要求,功能随内核版本更新,更新缓慢,新特性支持滞后
    • 调试不便

    rbd-nbd是为了解决上述问题而开发的。

    rbd-nbd

    使用

    // L版本
    modprobe nbd
    rbd-nbd map pool/volume
    rbd-nbd unmap /dev/nbd0
    rbd-nbd list-mapped
    

    原理

    rbd-nbd

    注意事项

    有一些内核bug:
    – nbd设备设置IO超时时间无效
    – 4.9.65内核nbd设备+ext4文件系统mount状态resize后端盘大小导致文件系统异常
    – 4.9.65内核umount一个断开连接的nbd设备挂载目录导致内核oops

    fio支持

    rbd

    root@ceph-l ~ $ cat fio.rbd
    [global]
    ioengine=rbd
    clientname=admin
    pool=rbd
    rbdname=vol1
    invalidate=0
    rw=randwrite
    bs=4k
    runtime=6000
    [rbd_iodepth128]
    iodepth=64
    
    root@ceph-l ~ $ fio --enghelp
    Available IO engines:
            cpuio
            mmap
            sync
            psync
            ...
            splice
            rbd   ###
            sg
            binject
    

    objectstore

    fio测试ceph的filestore

    对比

    fio+librbd与krbd性能接近,但librbd方式对用户态CPU资源占用较多;rbd-nbd性能稍差。

    死锁检查

    • 支持mutex、rwlock的死锁检查
    • 可以打印加锁、解锁操作和锁的调用栈
    • 可以记录等待锁的耗时信息

    配置项lockdep/lockdep_force_backtrace/mutex_perf_counter

    root@ceph-l ceph $ ceph daemon osd.0 perf dump | grep -i mutex -A3
        "mutex-FileJournal::completions_lock": {
            "wait": {
                "avgcount": 29,
                "sum": 0.002629507
    --
        "mutex-FileJournal::finisher_lock": {
            "wait": {
                "avgcount": 0,
                "sum": 0.000000000
    --
        "mutex-FileJournal::write_lock": {
            "wait": {
                "avgcount": 0,
                "sum": 0.000000000
    --
        "mutex-FileJournal::writeq_lock": {
            "wait": {
                "avgcount": 87,
                "sum": 0.006935580
    
    void Mutex::Lock(bool no_lockdep) {
      utime_t start;
      int r;
    
      if (lockdep && g_lockdep && !no_lockdep) _will_lock(); // 准备工作,死锁检查
    
      if (TryLock()) { // 记录加锁信息,锁id、线程id、调用栈等,为下次的死锁检查做准备
        goto out;
      }
    
      if (logger && cct && cct->_conf->mutex_perf_counter)
        start = ceph_clock_now(cct);
      r = pthread_mutex_lock(&_m);
      if (logger && cct && cct->_conf->mutex_perf_counter)
        logger->tinc(l_mutex_wait,
             ceph_clock_now(cct) - start);
      assert(r == 0);
      if (lockdep && g_lockdep) _locked();
      _post_lock();
    
    out:
      ;
    }
    
    void Mutex::Unlock() {
      _pre_unlock();
      if (lockdep && g_lockdep) _will_unlock(); // 调用lockdep_will_unlock,打印解锁信息并清理记录
      int r = pthread_mutex_unlock(&_m);
      assert(r == 0);
    }
    
    int lockdep_will_lock(const char *name, int id, bool force_backtrace)
    {
      pthread_t p = pthread_self();
      if (id < 0) id = lockdep_register(name);
    
      pthread_mutex_lock(&lockdep_mutex);
      lockdep_dout(20) << "_will_lock " << name << " (" << id << ")" << dendl;  // 打印加锁信息
    
      // check dependency graph
      map<int, BackTrace *> &m = held[p];
      for (map<int , BackTrace *>::iterator p = m.begin();
           p != m.end();
           ++p) {
        if (p->first == id) {  // 死锁条件1检查,二次加锁
          lockdep_dout(0) << "\n";
          *_dout << "recursive lock of " << name << " (" << id << ")\n";
          BackTrace *bt = new BackTrace(BACKTRACE_SKIP);
          bt->print(*_dout);
          if (p->second) {
        *_dout < < "\npreviously locked at\n";
        p->second->print(*_dout);
          }
          delete bt;
          *_dout << dendl;
          assert(0);
        }
        else if (!follows[p->first][id]) { // 死锁条件2检查,循环依赖
          // new dependency
    
          // did we just create a cycle?
          if (does_follow(id, p->first)) {
            BackTrace *bt = new BackTrace(BACKTRACE_SKIP);
        lockdep_dout(0) < < "new dependency " << lock_names[p->first]
            < < " (" << p->first < < ") -> " < < name << " (" << id << ")"
            << " creates a cycle at\n";
        bt->print(*_dout);
        *_dout << dendl;
    
        lockdep_dout(0) << "btw, i am holding these locks:" << dendl;
        for (map<int, BackTrace *>::iterator q = m.begin();
             q != m.end();
             ++q) {
          lockdep_dout(0) < < "  " << lock_names[q->first] < < " (" << q->first < < ")" << dendl;
          if (q->second) {
            lockdep_dout(0) < < " ";
            q->second->print(*_dout);
            *_dout << dendl;
          }
        }
    
        lockdep_dout(0) << "\n" << dendl;
    
        // don't add this dependency, or we'll get a cycle in the graph, and
        // does_follow() won't terminate.
    
        assert(0);  // actually, we should just die here.
          } else {
            BackTrace *bt = NULL;
            if (force_backtrace || lockdep_force_backtrace()) {
              bt = new BackTrace(BACKTRACE_SKIP);
            }
        follows[p->first][id] = bt;
        lockdep_dout(10) << lock_names[p->first] << " -> " << name << " at" << dendl;
        //bt->print(*_dout);
          }
        }
      }
    
      pthread_mutex_unlock(&lockdep_mutex);
      return id;
    }
    

    pg锁增加等待耗时统计

    PG::PG(OSDService *o, OSDMapRef curmap,
           const PGPool &_pool, spg_t p) :
        osd(o),
        cct(o->cct),
        ...
    -    _lock("PG::_lock"),
    +    _lock((std::string("PG::_lock") + _pool.name.c_str()).c_str(), false, true, false, o->cct), // cct传入才可以初始化perf counter
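    
    H版本src/common/Mutex.h中的构造函数声明大致如下(仅作参考,几个bool参数依次对应recursive、lockdep、backtrace):
    
    // Mutex(const std::string &n, bool r = false, bool ld = true, bool bt = false,
    //       CephContext *cct = 0);
    
    只有把cct传进去(并且mutex_perf_counter配置项打开),Mutex内部才会创建perf counter,之后才能像上面FileJournal的锁一样,在perf dump里看到"mutex-PG::_lock<pool名>"的wait统计。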
    
    

    throttle限流

    查看限流统计信息

    root@ceph-l ceph $ ceph daemon osd.0 perf dump throttle-osd_client_messages
    {
        "throttle-osd_client_messages": {
            "val": 19,
            "max": 500,
            "get": 811801,
            "get_sum": 811801,
            "get_or_fail_fail": 0,
            "get_or_fail_success": 0,
            "take": 0,
            "take_sum": 0,
            "put": 811782,
            "put_sum": 811782,
            "wait": {
                "avgcount": 0,
                "sum": 0.000000000
            }
        }
    }
    

    throttle使用

    // ceph_osd.cc
    int main(int argc, const char **argv) 
    {
      ...
      boost::scoped_ptr<Throttle> client_msg_throttler(
        new Throttle(g_ceph_context, "osd_client_messages",
             g_conf->osd_client_message_cap));
      ...
      ms_public->set_policy_throttlers(entity_name_t::TYPE_CLIENT,
                       client_byte_throttler.get(),
                       client_msg_throttler.get());
      ...
    }
    
    int Pipe::read_message(Message **pm, AuthSessionHandler* auth_handler)
    {
      ...
      if (policy.throttler_messages) {
        ldout(msgr->cct,10) < < "reader wants " << 1 << " message from policy throttler "
                << policy.throttler_messages->get_current() < < "/"
                << policy.throttler_messages->get_max() < < dendl;
        policy.throttler_messages->get();
      }
      ...
    out_dethrottle:
      // release bytes reserved from the throttlers on failure
      if (policy.throttler_messages) {
        ldout(msgr->cct,10) < < "reader releasing " << 1 << " message to policy throttler "
                << policy.throttler_messages->get_current() < < "/"
                << policy.throttler_messages->get_max() < < dendl;
        policy.throttler_messages->put();
      }
      ...
    

    缺陷

    目前只支持按op数量和字节数进行限流,并且只用在了正常IO处理流程中,异常恢复等流程尚未实现相应的限流。

    日志相关

    日志级别

    # 文件日志级别,内存日志级别
    root@ceph-l ceph $ ceph daemon osd.0 config show | grep debug_
        "debug_none": "0\/5",
        ...
        "debug_mon": "1\/5",
        "debug_monc": "0\/10",
        "debug_paxos": "1\/5",
        ...
    

    日志子模块

    // src\common\config_opts.h
    DEFAULT_SUBSYS(0, 5)
    SUBSYS(lockdep, 0, 1)
    SUBSYS(context, 0, 1)
    SUBSYS(crush, 1, 1)
    ...
    SUBSYS(throttle, 1, 1)
    SUBSYS(refs, 0, 0)
    SUBSYS(xio, 1, 5)
    
    // src\common\config.cc
    void md_config_t::init_subsys()
    {
    #define SUBSYS(name, log, gather) \
      subsys.add(ceph_subsys_##name, STRINGIFY(name), log, gather);
    #define DEFAULT_SUBSYS(log, gather) \
      subsys.add(ceph_subsys_, "none", log, gather);
    #define OPTION(a, b, c)
    #include "common/config_opts.h"
    #undef OPTION
    #undef SUBSYS
    #undef DEFAULT_SUBSYS
    }
    
    // src\osd\OSD.cc
    #define dout_subsys ceph_subsys_osd
    #undef dout_prefix
    #define dout_prefix _prefix(_dout, whoami, get_osdmap_epoch())
    
    static ostream& _prefix(std::ostream* _dout, int whoami, epoch_t epoch) {
      return *_dout << "osd." << whoami << " " << epoch << " ";
    }
    
    dout(0) << " have superblock" << dendl;  // 打印日志示例
    
    // src\common\debug.h
    #define dout(v) ldout((g_ceph_context), v)
    
    // src\common\dout.h
    #define dout_impl(cct, sub, v)                     \
      do {                                  \
      if (cct->_conf->subsys.should_gather(sub, v)) {           \
        if (0) {                                \
          char __array[((v >= -1) && (v <= 200)) ? 0 : -1] __attribute__((unused)); \
        }                                   \
        ceph::log::Entry *_dout_e = cct->_log->create_entry(v, sub);    \
        ostream _dout_os(&_dout_e->m_streambuf);                \
        CephContext *_dout_cct = cct;                   \
        std::ostream* _dout = &_dout_os;
    ...
    #define ldout(cct, v)  dout_impl(cct, dout_subsys, v) dout_prefix
    ...
    #define dendl std::flush;              \
      _ASSERT_H->_log->submit_entry(_dout_e);       \
        }                       \
      } while (0)
    ...
    

    集群日志

    void OSD::handle_osd_map(MOSDMap *m)
    {
        ...
        } else {
          clog->warn() << "map e" << osdmap->get_epoch()
                  << " wrongly marked me down";
        }
        ...
    // 由mon的LogMonitor::preprocess_log/prepare_log处理来自osd的集群日志打印请求,最终打印到ceph.log文件中。
    
    # /var/log/ceph/ceph.log
    2019-07-31 02:07:32.191716 osd.92 10.171.5.122:6818/2755462 1588 : cluster [WRN] map e250674 wrongly marked me down
    

    日志回滚

    kill -HUP/log reopen

    举例:logrotate(reload)

    # /etc/init.d/ceph
    ...
            force-reload | reload)
                signal_daemon $name ceph-$type $pid_file -1 "Reloading"  ## kill -HUP $pid
                ;;
    ...
    

    也可以直接使用ceph daemon osd.X log reopen

    全链路日志跟踪(优化目标)

    痛点:跨节点、跨进程的日志无法串联起来,分析比较困难,需要对IO流程代码非常熟悉

    解决方法:参考blkin流程,在日志里增加全局request id,从客户端到osd,以及osd内部的各子模块pg、journal、FileStore等统一使用request id来区分客户端io请求。
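    
    一个可行的做法示意(非现有实现):客户端为每个IO生成全局唯一的request id并随消息带到osd,osd各子模块打印日志时统一带上该id,这样grep这个id即可串联一次IO在各节点上的全部日志。下面用一段独立的小代码示意日志前缀的拼法(req_prefix、前缀格式均为假设):
    
    // 示意代码:所有模块统一用"[req:<id>]"前缀打印日志,便于跨进程grep串联
    #include <iostream>
    #include <sstream>
    #include <string>
    
    static std::string req_prefix(const std::string &reqid, const std::string &module)
    {
      std::ostringstream oss;
      oss << "[req:" << reqid << "][" << module << "] ";
      return oss.str();
    }
    
    int main()
    {
      // 例如直接复用osd_op的reqid(client.<id>.0:<tid>)作为全局request id
      std::string reqid = "client.115363.0:416";
      std::cout << req_prefix(reqid, "pg") << "do_op start" << std::endl;
      std::cout << req_prefix(reqid, "journal") << "write entry" << std::endl;
      return 0;
    }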

    数据编解码

    ceph-objectstore-tool

    Ceph高级工具介绍之ceph-objectstore-tool的使用

    ceph-dencoder

    Ceph高级工具介绍之ceph-dencoder的使用

    在ceph-dencoder中增加修改osd superblock中oldest_map字段功能

    root@ceph-l ~ $ ceph-dencoder -h
    usage: ceph-dencoder [commands ...]
      ...
      editsb <oldest_map> <input sb file> <output sb file>    edit OSDSuperblock oldest_map section
    
    root@ceph-l ~ $ ceph-dencoder editsb 10000 superblock.orig superblock.new
    {
        "cluster_fsid": "238c3b45-c884-4911-942b-a424910250bf",
        "osd_fsid": "ca873f75-8cd1-4fbf-91fc-7cd6784b7e2c",
        "whoami": 0,
        "current_epoch": 354773,
        "oldest_map": 10000,   ## superblock.new中的这个字段被改成10000
        "newest_map": 354773,
        "weight": 0.000000,
        ....
    }
    

    安装部署

    ceph-deploy

    ceph-deploy

    crush rule编辑工具

    sudo ceph osd getcrushmap -o map
    sudo crushtool -d map -o dmap
    sudo vi dmap 
    sudo crushtool -c dmap -o nmap
    sudo ceph osd setcrushmap -i nmap
    

    pg balance

    Mgr Balancer

    开发者工具

    vstart

    使用vstart搭建ceph开发环境

    blackhole switch

    int FileStore::queue_transactions(Sequencer *posr, list<Transaction*> &tls,
                      TrackedOpRef osd_op,
                      ThreadPool::TPHandle *handle)
    {
      Context *onreadable;
      Context *ondisk;
      Context *onreadable_sync;
      ObjectStore::Transaction::collect_contexts(
        tls, &onreadable, &ondisk, &onreadable_sync);
      if (g_conf->filestore_blackhole) {
        dout(0) << "queue_transactions filestore_blackhole = TRUE, dropping transaction" << dendl;
        delete ondisk;
        delete onreadable;
        delete onreadable_sync;
        return 0;
      }
      ...
    

    Fault (delay) injection

    kill / error / heartbeat faults / messenger (ms) faults / injectdataerr, etc.:

    root@ceph-l common [CLDNBS-1301] $ grep inject config_opts.h 
    OPTION(heartbeat_inject_failure, OPT_INT, 0)    // force an unhealthy heartbeat for N seconds
    OPTION(ms_inject_socket_failures, OPT_U64, 0)
    OPTION(ms_inject_delay_type, OPT_STR, "")          // "osd mds mon client" allowed
    OPTION(ms_inject_delay_msg_type, OPT_STR, "")      // the type of message to delay, as returned by Message::get_type_name(). This is an additional restriction on the general type filter ms_inject_delay_type.
    OPTION(ms_inject_delay_max, OPT_DOUBLE, 1)         // seconds
    OPTION(ms_inject_delay_probability, OPT_DOUBLE, 0) // range [0, 1]
    OPTION(ms_inject_internal_delays, OPT_DOUBLE, 0)   // seconds
    OPTION(inject_early_sigterm, OPT_BOOL, false)
    OPTION(mon_inject_sync_get_chunk_delay, OPT_DOUBLE, 0)  // inject N second delay on each get_chunk request
    OPTION(mon_inject_transaction_delay_max, OPT_DOUBLE, 10.0)      // seconds
    OPTION(mon_inject_transaction_delay_probability, OPT_DOUBLE, 0) // range [0, 1]
    OPTION(client_debug_inject_tick_delay, OPT_INT, 0) // delay the client tick for a number of seconds
    OPTION(client_inject_release_failure, OPT_BOOL, false)  // synthetic client bug for testing
    OPTION(objecter_inject_no_watch_ping, OPT_BOOL, false)   // suppress watch pings
    OPTION(mds_inject_traceless_reply_probability, OPT_DOUBLE, 0) /* percentage
    OPTION(osd_inject_bad_map_crc_probability, OPT_FLOAT, 0)
    OPTION(osd_inject_failure_on_pg_removal, OPT_BOOL, false)
    OPTION(osd_debug_inject_copyfrom_error, OPT_BOOL, false)  // inject failure during copyfrom completion
    // Allow object read error injection
    OPTION(filestore_debug_inject_read_err, OPT_BOOL, false)
    OPTION(filestore_kill_at, OPT_INT, 0)            // inject a failure at the n'th opportunity
    OPTION(filestore_inject_stall, OPT_INT, 0)       // artificially stall for N seconds in op queue thread
    
    bool HeartbeatMap::is_healthy()
    {
      m_rwlock.get_read();
      time_t now = time(NULL);
      if (m_cct->_conf->heartbeat_inject_failure) {
        ldout(m_cct, 0) << "is_healthy injecting failure for next " << m_cct->_conf->heartbeat_inject_failure << " seconds" << dendl;
        m_inject_unhealthy_until = now + m_cct->_conf->heartbeat_inject_failure;
        m_cct->_conf->set_val("heartbeat_inject_failure", "0");
      }
      ...
    
    // src\mon\MonitorDBStore.h
      struct C_DoTransaction : public Context {
        MonitorDBStore *store;
        MonitorDBStore::TransactionRef t;
        Context *oncommit;
        C_DoTransaction(MonitorDBStore *s, MonitorDBStore::TransactionRef t,
                Context *f)
          : store(s), t(t), oncommit(f)
        {}
        void finish(int r) {
          /* The store serializes writes.  Each transaction is handled
           * sequentially by the io_work Finisher.  If a transaction takes longer
           * to apply its state to permanent storage, then no other transaction
           * will be handled meanwhile.
           *
           * We will now randomly inject random delays.  We can safely sleep prior
           * to applying the transaction as it won't break the model.
           */
          double delay_prob = g_conf->mon_inject_transaction_delay_probability;
          if (delay_prob && (rand() % 10000 < delay_prob * 10000.0)) {
            utime_t delay;
            double delay_max = g_conf->mon_inject_transaction_delay_max;
            delay.set_from_double(delay_max * (double)(rand() % 10000) / 10000.0);
        lsubdout(g_ceph_context, mon, 1)
          << "apply_transaction will be delayed for " << delay
              << " seconds" << dendl;
            delay.sleep();
          }
          int ret = store->apply_transaction(t);
          oncommit->complete(ret);
        }