Ceph Async Recovery Implementation Analysis

Update 2019-12-19:
Merging the async recovery feature into the H version introduces multiple bugs that teuthology can catch, mainly conflicts between delete and snap/clone operations and async recovery (this part depends on the async delete feature merged in a newer release), so backporting it to the H version is not recommended. See xsky's earlier async recovery code at https://github.com/ceph/ceph/pull/11918 (never merged, but that version is close to the H version and does not depend on async delete).

I have recently been backporting the async recovery feature implemented by the Ceph community to the H version. The merge is complete and basic functionality has been verified; this document analyzes the related flows.

Comparison of IO Impact During Recovery

Test Environment

  • pubt2-ceph0.dg.163.org, pubt2-ceph1.dg.163.org, pubt2-ceph2.dg.163.org
  • 3 nodes, 19 SSD OSDs each, 57 OSDs in total
  • Xeon(R) CPU E5-2660 v4 @ 2.00GHz, 256 GB memory
  • SSDs: two nodes use Samsung 960, one uses Intel 3520

Test Scenario

  • Config files: identical across runs
  • Total cluster capability: about 36K IOPS of 4K random write
  • Background load: 20K IOPS of 4K random write (fio + librbd)
  • Observed volume load: 100 IOPS of 4K random read/write (krbd-mounted)
  • iostat sampling: one point per second, recording the IO util of the rbd device backing the observed volume

Test Case

  1. Stop all OSDs in one replica failure domain
  2. Wait 15 minutes
  3. Start all the OSDs (start/stop command: /etc/init.d/ceph stop/start osd)

Both the background load and the observed-volume load run continuously throughout the test.

Impact Comparison

Note: the X axis is time in seconds at 1 s intervals; the Y axis is the IO util of the observed volume, one point per second.
(Figure: comparison of IO impact during recovery)

Overall Implementation Approach

See the official documentation: https://docs.ceph.com/docs/master/dev/osd_internals/async_recovery/

Upstream PR: https://github.com/ceph/ceph/pull/19811

The overall approach follows the backfill logic. With backfill, recovery builds an acting set whose replicas are all at the latest version and can fully serve IO, so reads and writes are completely unaffected; it also builds an actingbackfill set containing the acting OSDs plus the OSDs that need backfill, which is used for writes and recovery. Backfill runs in the background and is never triggered by client writes.

Async recovery works much the same way, except the actingbackfill set becomes acting_recovery_backfill. On a write, the primary OSD uses this set as the replication targets for the repop, but replicas that still need recovery receive only the log entry, not the written data (just like backfill). After peering finishes, data is recovered asynchronously from peer_missing and missing_loc, and recovery still uses the pg log to pick the objects that actually need it rather than blindly copying the whole PG. Under the old synchronous recovery, a read or write that hit an object missing on any replica (including the primary) had to trigger recovery of that object first: the operation was parked on the wait_for_unreadable_object or wait_for_degraded_object queue and re-enqueued only once the object had been recovered, after which it proceeded normally.
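To make the write-path behavior concrete, here is a self-contained toy model (not Ceph code; all names and types are simplified stand-ins) of the per-shard decision the primary makes when issuing a repop over acting_recovery_backfill: an async recovery target that is missing the object gets only the pg log entry, every other shard gets the data.

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    int main() {
      // shards participating in writes: acting + async recovery + backfill targets
      std::set<int> acting_recovery_backfill = {0, 1, 2};  // osd ids, 0 = primary
      std::set<int> async_recovery_targets = {2};          // osd.2 is still recovering
      std::map<int, std::set<std::string>> peer_missing = {{2, {"obj_a"}}};

      std::string soid = "obj_a";  // object being written
      for (int shard : acting_recovery_backfill) {
        if (shard == 0) continue;  // the primary applies the op locally
        bool log_only = async_recovery_targets.count(shard) != 0 &&
                        peer_missing[shard].count(soid) != 0;
        std::cout << "osd." << shard << ": "
                  << (log_only ? "pg log entry only; data arrives later via recovery"
                               : "full repop with data")
                  << "\n";
      }
      return 0;
    }

Run as-is it prints that osd.1 receives the full repop while osd.2, still missing obj_a, receives only the log entry.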

Whether async recovery can be used is subject to additional conditions:
1. The number of available replicas is greater than min_size (note: strictly greater, not greater-or-equal).
2. The replica's pg log diverges from the authoritative pg log by more than 100 entries (configurable).
Shards that fail these conditions cannot be recovered asynchronously; they fall back to the synchronous recovery path.

IO Flow During Synchronous Recovery


  1. Synchronous recovery read IO flow

Note: the first box in the figure, labeled client read, should read client write.

  2. Synchronous recovery write IO flow

IO Flow During Async Recovery (similar to backfill)


  1. Async recovery read IO flow
  • Note: the first box in the figure, labeled client read, should read client write.

  2. Async recovery write IO flow

  3. Recovery flow (including backfill)

Key Data Structures

    PG class:
      vector<int> up, acting, want_acting;
      // acting_recovery_backfill contains shards that are acting,
      // async recovery targets, or backfill targets.
      set<pg_shard_t> acting_recovery_backfill, actingset;
    
      set<pg_shard_t> backfill_targets,  async_recovery_targets;
    
      map<pg_shard_t, pg_missing_t> peer_missing;
    
      MissingLoc missing_loc;
    
    PGLog class:
      pg_missing_t     missing;
    

Main Flow

1. Peering

GetLog

First choose_acting: build acting, acting_recovery_backfill and async_recovery_targets, and select the OSD holding the authoritative log (auth_log_shard). If that OSD is this one, send the GotLog event; otherwise send a pg_query_t::LOG message to the authoritative OSD to fetch its pglog, send GotLog once it replies, and then call proc_master_log to process the received authoritative log.

proc_master_log: merge_log, then save peer_info, peer_missing and might_have_unfound (meaning that OSD has a relatively complete set of objects, so unfound objects may be retrieved from it).

After the authoritative log has been processed, transition to the GetMissing state.

GetMissing

Iterate over acting_recovery_backfill and build peer_missing from peer_info and the authoritative log (as I understand it, these are the cases where the log divergence is so large that the pg info alone proves the replica needs backfill). If those two pieces of information cannot determine peer_missing (as I understand it, the cases where the pg logs must actually be compared to find the OSDs needing recovery), send pg_query_t::LOG or pg_query_t::FULLLOG to the OSD.

After a replica OSD returns its pg log, PG::proc_replica_log processes it and saves peer_info, peer_missing and might_have_unfound.

Finally either NeedUpThru or Activate is sent. NeedUpThru enters the WaitUpThru state to wait for a new map that updates up_thru, after which Activate is sent; Activate moves the PG into the Active state.

2. Active

pg->activate is called, and missing_loc is built from peer_missing (I have not fully understood this part). The primary sends MOSDPGLog to the replica OSDs; a replica, after sending itself an Activate event, moves from Stray to ReplicaActive and also calls pg->activate while handling the event. In the C_PG_ActivateCommitted callback it calls pg->_activate_committed and replies to the primary with an MOSDPGInfo message meaning activation is complete. The PG enters Activating; the primary waits for all replicas to finish activating, calls pg->all_activated_and_committed, and sends AllReplicasActivated. Handling that event sets the PG state to PG_STATE_ACTIVE, requeues the ops that were waiting for peering to finish (pg->requeue_ops(pg->waiting_for_peered)), and finally calls pg->on_activate(), which sends DoRecovery and moves the PG into WaitLocalRecoveryReserved.

3. WaitLocalRecoveryReserved

Event: LocalRecoveryReserved

4. WaitRemoteRecoveryReserved

Send MRecoveryReserve::REQUEST to every replica OSD, then wait for RemoteRecoveryReserved events. Once every replica has replied that its reservation succeeded, AllRemotesReserved is sent and the primary enters the Recovering state.

5. Recovering
      pg->state_clear(PG_STATE_RECOVERY_WAIT);
      pg->state_set(PG_STATE_RECOVERING);
      pg->osd->queue_for_recovery(pg);
    

Execution switches to the recovery thread: OSD::do_recovery -> ReplicatedPG::start_recovery_ops recovers all objects, checking after each one whether recovery is still in progress. When it finishes, it checks whether backfill is needed; if so it sends RequestBackfill, PG::RecoveryState::Recovering::release_reservations() releases the reservations, the replica OSDs enter RepNotRecovering, and the primary jumps to WaitRemoteBackfillReserved.

Otherwise AllReplicasRecovered is sent directly; on receipt, PG::RecoveryState::Recovering::release_reservations() releases the reservations and the PG moves to the Recovered state.

6. Recovered

choose_acting runs again, rebuilding acting, acting_recovery_backfill and async_recovery_targets; by now acting, acting_recovery_backfill and up should all match, and async_recovery_targets is cleared.

Then the GoClean event is sent and the PG moves to the Clean state.

7. WaitRemoteBackfillReserved

Send MBackfillReserve::REQUEST to every replica OSD, then wait for RemoteBackfillReserved events. Each replica enters RepWaitBackfillReserved when it receives the message and replies once its reservation succeeds; after all replicas have reserved, AllBackfillsReserved is sent and the primary enters the Backfilling state.

8. Backfilling
      PG *pg = context< RecoveryMachine >().pg;
      pg->backfill_reserved = true;
      pg->osd->queue_for_recovery(pg);
    

Execution again switches to the recovery thread; the rest mirrors Recovering above. Once both recovery and backfill have finished, the PG enters the Clean state.

9. Clean

Set the PG state to clean.

Key Functions Called During Recovery

1. choose_acting
    map<pg_shard_t, pg_info_t>::const_iterator auth_log_shard =
        find_best_info(all_info);  // find the OSD holding the authoritative log; if none is found this returns false, GetLog then simply returns, peering cannot proceed (PG goes incomplete?) and is retried on the next osdmap change
    
    if ((up.size() &&
          !all_info.find(up_primary)->second.is_incomplete() &&
          all_info.find(up_primary)->second.last_update >=
           auth_log_shard->second.log_tail) &&
          auth_log_shard->second.is_incomplete()) { // if the authoritative OSD is incomplete (backfill not finished?), try to pick another one from among the complete OSDs
    
    calc_replicated_acting  // compute the desired acting set (want), acting_recovery_backfill and backfill_targets: first pick the primary (prefer up_primary; if it is incomplete or its log falls short, pick the authoritative OSD), then walk all up, acting and pg-info-reporting OSDs in turn and decide from their pg info which list each one joins (OSDs needing backfill go into acting_recovery_backfill and backfill_targets; the rest into want_acting and acting_recovery_backfill)
    
    recoverable_and_ge_min_size  // decide from want whether the pg is recoverable; a non-empty want_acting means recoverable (osd_allow_recovery_below_min_size defaults to true, so outside EC pools want does not need to reach min_size)
    
    choose_async_recovery_replicated  // pick the OSDs in want_acting eligible for async recovery (they are removed from want) and store them in async_recovery_targets; a shard qualifies when its last_update.version differs from the authoritative OSD's last_update.version by more than 100 (configurable via osd_async_recovery_min_pg_log_entries); smaller divergences sort first (sets sort automatically) and the cost is the divergence itself
    
    queue_want_pg_temp  // if the want computed above differs from the current acting set, ask the monitor to create a pg_temp registering a temporary acting set, and return false
    
If want equals acting, the acting set needs no change and want_acting can be cleared. The lists computed by calc_replicated_acting are then assigned to the corresponding PG members for later use. A toy model of the async-target selection is sketched below.
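Here is a self-contained toy version of the selection rule just described (a simplified stand-in, not the real choose_async_recovery_replicated; taking candidates from the high-divergence end of the sorted set first is an assumption of this sketch): divergence beyond the threshold makes a shard a candidate, and candidates leave want only while want stays strictly larger than min_size.

    #include <cstdint>
    #include <iostream>
    #include <set>
    #include <utility>
    #include <vector>

    int main() {
      const uint64_t auth_version = 1000;  // authoritative last_update.version
      const uint64_t threshold = 100;      // osd_async_recovery_min_pg_log_entries
      const std::size_t min_size = 2;

      // (osd id, last_update.version) for every shard in want
      std::vector<std::pair<int, uint64_t>> want = {{0, 1000}, {1, 995}, {2, 650}};

      // sets sort automatically, so (cost, osd) pairs order by divergence
      std::set<std::pair<uint64_t, int>> candidates;
      for (const auto &w : want)
        if (auth_version - w.second > threshold)
          candidates.insert({auth_version - w.second, w.first});

      std::set<int> async_targets;
      for (auto it = candidates.rbegin(); it != candidates.rend(); ++it) {
        // async recovery happens only while available replicas > min_size,
        // i.e. removing the shard must still leave at least min_size in want
        if (want.size() - async_targets.size() <= min_size)
          break;
        async_targets.insert(it->second);
      }
      for (int osd : async_targets)
        std::cout << "osd." << osd << " moved to async_recovery_targets\n";
      return 0;
    }

With these numbers only osd.2 (divergence 350 > 100) becomes an async recovery target, and the loop stops as soon as want would shrink to min_size.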
    
2. do_recovery

OSDService's recovery_wq is the queue that handles recovery, backed by the recovery_tp thread pool (thread count set by osd_recovery_threads, default 1); pg->osd->queue_for_recovery(pg) simply puts the PG on this queue.

do_recovery is the handler the recovery_tp pool runs (ThreadPool::worker -> _void_process -> RecoveryWQ::_process -> do_recovery); ThreadPool::worker first dequeues a PG and hands it to do_recovery.

do_recovery first checks whether any recovery quota is left (whether the concurrency limit has been reached), then whether it should sleep for a while (to further cap the bandwidth recovery consumes); if so it registers a timer event whose callback puts the PG back on recovery_wq, to be picked up on a later pass. A toy model of this gating follows.
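A minimal sketch of that gating (illustrative names loosely mirroring osd_recovery_max_active and osd_recovery_sleep; not the H-version code):

    #include <iostream>

    // toy stand-ins for the OSD's recovery bookkeeping
    static int recovery_ops_active = 0;
    static const int osd_recovery_max_active = 1;  // concurrency quota
    static const double osd_recovery_sleep = 0.1;  // seconds between batches

    static void queue_for_recovery(int pg) {
      std::cout << "pg " << pg << " requeued on recovery_wq\n";
    }

    static void do_recovery(int pg, bool timer_already_fired) {
      if (recovery_ops_active >= osd_recovery_max_active) {
        queue_for_recovery(pg);  // out of quota: try again on a later pass
        return;
      }
      if (osd_recovery_sleep > 0 && !timer_already_fired) {
        // register a timer event; its callback requeues the pg, so the next
        // dequeue happens only after the sleep interval has elapsed
        std::cout << "arming " << osd_recovery_sleep << "s timer for pg " << pg << "\n";
        return;
      }
      ++recovery_ops_active;
      std::cout << "start_recovery_ops(pg " << pg << ")\n";
      --recovery_ops_active;
    }

    int main() {
      do_recovery(1, false);  // first pass: only arms the timer
      do_recovery(1, true);   // after the timer fires: actually recovers
      return 0;
    }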

If no sleep is needed, pg->start_recovery_ops starts recovering: the primary first (recover_primary), then the replicas (recover_replicas), and once recovery is complete, backfill (recover_backfill).

recover_primary // use the pg log to find the objects and versions needing recovery (log entries split into CLONE and LOST_REVERT cases; revert further splits on whether the target object version exists locally; if not, the OSD holding it must be found and added to missing_loc for later recovery). Finally recover_missing performs the recovery. All of this is preparation: the ops to push (recovering a replica) or pull (recovering the primary) are merely placed on their queues. The actual transfer happens in run_recovery_op via send_pushs (MOSDPGPush to the target OSD) and send_pulls (MOSDPGPull to the target OSD); the subsequent handle/reply flow is not covered here.

recover_replicas follows the same pattern as recover_primary, so it is not detailed either.

Ceph H Version Performance Tuning Summary

Current Status

SAS scenario

With SAS disks, the RAID card cache plus the journal's sequential-write pattern make performance much better than the raw SAS disk. Single-OSD, single-replica numbers (untuned):

    job1: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=64
      write: io=316136KB, bw=2589.8KB/s, iops=647, runt=122107msec
        clat (msec): min=2, max=7502, avg=98.86, stdev=280.79
    

After tuning config options, the same test case on the same single-OSD node went from 600-odd IOPS to about 1010. The options changed:

    osd_client_message_cap = 50000 # production value is 100
    filestore_queue_max_ops = 50000 # production value is 50
    

But memory usage then balloons: backend data flushing cannot keep up with journal writes, so large numbers of ops queue up in memory.

Raw-disk fio numbers for comparison:

    fio -filename=/var/lib/ceph/osd/ceph-34/fio.data -rw=randwrite -bs=4k -iodepth=128 -numjobs=2 -ioengine=libaio -direct=1 -runtime=60 -group_reporting  -size=1000G -name=sdb1
    write: io=78768KB, bw=1303.1KB/s, iops=325, runt= 60447msec
    clat (msec): min=8, max=4049, avg=392.82, stdev=281.27
    

So the SAS scenario was not tuned further; the comparison tests below were all done in the SSD scenario.

Note that the relaxed throttle settings must not be opened up in the SAS scenario, or memory will balloon (the SAS disks cannot keep up, so such high throttles make no sense there).

SSD scenario

On 9 physical machines (recently deployed, presumably with slightly better CPUs), each with 20 SSDs (Intel S4500), 9 volumes (500 GB each) reach about 72000 IOPS in aggregate under 4K random write with fio in rbd mode.

      write: io=3856.9MB, bw=32852KB/s, iops=8212, runt=120219msec
        slat (usec): min=0, max=15124, avg= 5.45, stdev=51.21
        clat (msec): min=1, max=703, avg=15.54, stdev=25.71
    

fio performance was stable with no jitter; backend disk IO load was even across OSDs, each fluctuating between 2000 and 4000 IOPS.

On the same 9 machines with 3 SSDs each (Intel S4500), raw-disk 4K random write with fio+libaio reaches 150K IOPS per node.

On the offline nodes pubt2-ceph0/1/2 (3 nodes, E5-2660 v4 2.0GHz; two nodes carry SAMSUNG MZ7LM960HMJP PM863a SSDs, rated at 24K random-write IOPS, the third an INTEL SSDSC2BB016T7 S3520, rated at 17K), a comparison was run with the 1.2-0 build trialled in production, using the production config, no CPU pinning, and fio on a separate node. Total 4K random-write IOPS was 30652, averaging 3405 per volume:

      write: io=825868KB, bw=13623KB/s, iops=3405, runt= 60624msec
        clat (msec): min=1, max=1356, avg=37.57, stdev=84.53
    

Analysis Process and Optimizations

Note: all analysis and optimization below targets the SSD scenario.

The analysis relied mainly on:
1. perf counters, including custom ones added for this work
2. logs, with added latency statistics
3. kill -USR1, to capture call stacks
4. the Linux perf tool, to inspect CPU usage
5. code changes adding an IO-persistence switch, so OSD-layer performance can be measured with no IO hitting disk, taking the disks out of the numbers

Another tool worth trying is lttng+blkin, but in the H version it feels incomplete, with many code paths missing tracepoints; it may be worth revisiting on the L version.

Single-OSD Performance Analysis

The first analysis is 4K random-write performance of an rbd volume on a single OSD with a single replica: the gap between IO hitting the disks and IO mocked out, and where the OSD layer bottlenecks when no IO hits disk.

9 volumes (4K random write, iodepth 128), IO hitting disk (FileStore and journal not mocked), per-volume performance:

      write: io=461228KB, bw=3840.1KB/s, iops=960, runt=120082msec
        clat (msec): min=1, max=1468, avg=133.29, stdev=117.17
    

9 volumes (4K random write, iodepth 128), no disk IO (FileStore and journal mocked), per-volume performance:

      write: io=717212KB, bw=5974.2KB/s, iops=1493, runt=120052msec
        clat (msec): min=1, max=141, avg=85.69, stdev= 9.74
    

Even with no IO hitting disk the average latency is high, so dump_historic_ops was used to find where the slow ops spend their time:

     {
                "description": "osd_op(client.8984.0:11873 rbd_data.21f66b8b4567.000000000001bb3d [] 13.f568b271 ack+ondisk+write+known_if_redirecte
    d e970)",
                "initiated_at": "2019-08-22 17:35:39.837373",
                "age": 0.992912,
                "duration": 0.982060,
    ......
    
                        {
                            "time": "2019-08-22 17:35:39.837512",
                            "event": "started"
                        },
                        {
                            "time": "2019-08-22 17:35:39.837613",
                            "event": "commit_queued_for_journal_write"
                        },
                        {
                            "time": "2019-08-22 17:35:39.840339",
                            "event": "write_thread_in_journal_buffer"  ##### 这里
                        },
                       {
                            "time": "2019-08-22 17:35:39.888499",
                            "event": "journaled_completion_queued"   ##### 到这里
                        },
                        {
                            "time": "2019-08-22 17:35:39.892560",
                            "event": "filestore_queued_op"
                        },
                        {
                            "time": "2019-08-22 17:35:39.892565",
                            "event": "filestore_do_op"
                        },
                        {
                            "time": "2019-08-22 17:35:39.892603",
                            "event": "filestore_do_op_end"
                        },
                        {
                            "time": "2019-08-22 17:35:39.924675",  ##### 到这里
                            "event": "op_commit"
                        },
                        {
                            "time": "2019-08-22 17:35:39.924698",
                            "event": "commit_sent"
                        },
    
    

The FileStore-related events perturb the ordering, so after removing them in code:

     {
                "description": "osd_op(client.9820.0:135 rbd_data.21f26b8b4567.000000000000ac1a [] 13.f87a5531 ack+ondisk+write+known_if_redirected e994)",
                "initiated_at": "2019-08-23 16:26:38.362347",
                "age": 30.219829,
                "duration": 0.136654,
    ......
                        {
                            "time": "2019-08-23 16:26:38.402545",
                            "event": "started"
                        },
                        {
                            "time": "2019-08-23 16:26:38.402654",
                            "event": "commit_queued_for_journal_write"
                        },
                        {
                            "time": "2019-08-23 16:26:38.403273",
                            "event": "write_thread_in_journal_buffer"   ## 这里
                        },
                        {
                            "time": "2019-08-23 16:26:38.444155",
                            "event": "journaled_completion_queued"   ## 到这里40ms左右
                        },
                        {
                            "time": "2019-08-23 16:26:38.498969",      ## 到这里又50ms左右
                            "event": "op_commit"
                        },
                        {
                            "time": "2019-08-23 16:26:38.499001",
                            "event": "commit_sent"
                        }
    
    

The first half is waiting for the journal write to land; the second half is waiting for the callback after it lands.

Note that historic ops do not fully reflect the real bottleneck and are only a reference: only the 20 slowest ops are listed, i.e. maxima rather than averages.

The averages from the perf counters already present upstream:

    "journal_latency": {
                "avgcount": 972945,
                "sum": 258.177185807
            },
    
    "op_latency": {
                "avgcount": 973143,
                "sum": 68849.737076609
            },
    "op_process_latency": {
                "avgcount": 973143,
                "sum": 68685.614715542
            },
    

journal_latency is very low on average, about 0.26 ms (258.18 s over 972945 samples), meaning the commit_queued_for_journal_write to journaled_completion_queued span is rarely where time goes, while op_process_latency averages far higher (68685.6 s over 973143 samples, roughly 70.6 ms). Custom perf counters were therefore added to pin down the bottleneck; the arithmetic behind these averages is shown below.
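For reference, each LONGRUNAVG counter dumps a running sum and a count, and the averages quoted here are simply sum divided by avgcount:

    #include <cstdio>

    int main() {
      // numbers taken from the perf dump above
      double journal_sum = 258.177185807; long journal_cnt = 972945;
      double op_sum = 68849.737076609;    long op_cnt = 973143;
      std::printf("journal_latency avg = %.3f ms\n",
                  journal_sum / journal_cnt * 1000.0);  // ~0.265 ms
      std::printf("op_latency avg = %.1f ms\n",
                  op_sum / op_cnt * 1000.0);            // ~70.8 ms
      return 0;
    }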

The added perf counters showed that the post-journal-commit callback is what takes long:

            "repop_all_commited_latency": { ## ReplicatedPG::repop_all_committed的时间点减去ReplicatedPG::eval_repop首次进入的时间点的差值,这段平均耗时统计值与fio的平均时延比较接近
                "avgcount": 335625,
                "sum": 67348.398432313
    
            "op_commit_lock_latency": {  ## 已排除是BlessedContext::finish里面的pg->lock加锁导致的耗时长
                "avgcount": 671250,
                "sum": 44.412470067
            }
    

There is currently a single finisher thread and that cannot be changed directly; upstream happened to have a change increasing the number of finisher callback threads, so it was merged for comparison (https://github.com/ceph/ceph/pull/6486/commits).

Before adding finisher threads (one thread each by default):

            "journaled_ahead_to_op_commit_latency": {
                "avgcount": 325212,
                "sum": 67107.237113916
            },  ## fio's average latency matches this callback time exactly
    
        "finisher-journal": {
            "queue_len": 0
        },
        "finisher-onapply": {
            "queue_len": 21
        },
        "finisher-ondisk": {
            "queue_len": 1133
        },
    

After adding finisher threads:

    "journaled_ahead_to_op_commit_latency": {
                "avgcount": 988516,
                "sum": 891.208668089
            },
    
        "finisher-filestore-apply-0": {
            "queue_len": 0
        },
        ......
        "finisher-filestore-apply-4": {
            "queue_len": 2
        },
        "finisher-filestore-ondisk-0": {
            "queue_len": 0
        },
        ......
        "finisher-filestore-ondisk-9": {
            "queue_len": 0
        },
        "finisher-journal": {
            "queue_len": 0
        },
    

With more finisher threads there is no longer any backlog in the queues, and fio occasionally improves from about 1200 per volume to about 1800, but more often it drops to 800-odd; the cause needs further analysis. Something else is still the bottleneck.

Perf counter statistics still showed a long stretch between journal submission and op_commit (numbers above). With no specific suspect in mind, a search through upstream performance commits turned up a related change that helped markedly once merged (https://github.com/ceph/ceph/commit/eb020b6a6a75c69f71dc29fd607a4626bed28366): single OSD, single replica, 9 volumes, no disk IO, per-volume performance rose from 1490 to 2200:

      write: io=529944KB, bw=8826.1KB/s, iops=2206, runt= 60037msec
        clat (msec): min=3, max=125, avg=57.99, stdev= 5.59
    

With IO hitting disk (from 960 up to 1490):

      write: io=162348KB, bw=5982.1KB/s, iops=1495, runt= 27139msec
        clat (msec): min=1, max=150, avg=85.56, stdev= 8.77
    

Following Hongsong's suggestion at the morning stand-up, every upstream PR touching performance at the OSD, PG, journal or FileStore layer was screened for candidates. One more was found, https://github.com/ceph/ceph/commit/a75a9a8178f42ba7363de4929af5da140b64c668; after merging it, the no-disk-IO case (other conditions as above) rose from 2206 to 2320 per volume:

      write: io=557512KB, bw=9285.4KB/s, iops=2321, runt= 60042msec
      write: io=557496KB, bw=9285.3KB/s, iops=2321, runt= 60041msec
      write: io=557620KB, bw=9287.2KB/s, iops=2321, runt= 60043msec
        clat (msec): min=2, max=141, avg=55.13, stdev= 4.90
        clat (msec): min=3, max=133, avg=55.13, stdev= 4.92
    

With IO hitting disk, from 1490 to 1527:

      write: io=366880KB, bw=6109.2KB/s, iops=1527, runt= 60054msec
        clat (msec): min=1, max=205, avg=83.80, stdev=12.87
    

The other mergeable patches were merged and tested as well, with no further gains.

Analyzing the fio results with no disk IO, latency was still high, 55 ms on average. Sampling the call stack of the top CPU-consuming thread with kill -USR1 (several times) always landed in log printing, so all log levels were set to 0 (chiefly the in-memory levels; the file levels were already 0):

    for d in `sudo ceph  daemon osd.37  config show| grep '"debug_' | awk -F: '{print $1}' | sed 's/"//g'`; do sudo ceph daemon osd.37 config set $d 0; done
    

With the script above disabling logging for every module (including in-memory logging), performance improved a great deal.

Without disk IO the gain is huge (latency-sensitive; per volume from 2320 to 5718, more than doubling):

      write: io=1340.1MB, bw=22875KB/s, iops=5718, runt= 60028msec
        clat (usec): min=1320, max=48165, avg=22373.50, stdev=809.35
    

With IO hitting disk, from 1527 to 1770:

      write: io=425060KB, bw=7082.5KB/s, iops=1770, runt= 60016msec
        clat (usec): min=783, max=2614.1K, avg=72280.76, stdev=122010.96
    

In this single-OSD setup, 64 versus 256 PGs were also compared; 256 performed somewhat better, so all the optimizations above were tested with 256 PGs.

Single-Node 3-OSD 3-Replica Performance Analysis

A 3-replica cluster (3 OSDs, all on one node to take the network out of the picture) was deployed with the modified code and config, and tested.

3 replicas (3 OSDs on one node), 256 PGs, no disk IO, on the build without the CLDNBS-1383 PRs (i.e. the build with the CLDNBS-1358 commits merged), config identical to production, 9 client volumes; per-volume IOPS (single-OSD CPU around 1000%):

      write: io=308464KB, bw=5130.7KB/s, iops=1282, runt= 60122msec
        clat (msec): min=1, max=381, avg=99.78, stdev=113.20
    

With IO hitting disk (single-OSD CPU fluctuating between 400% and 800%):

      write: io=93144KB, bw=1542.7KB/s, iops=385, runt= 60379msec
        clat (msec): min=2, max=3905, avg=331.87, stdev=487.11
    

The same 3-replica, 256-PG, no-disk-IO setup on the same build (without the CLDNBS-1383 PRs, i.e. with the CLDNBS-1358 commits), but with the config used for the CLDNBS-1358 tests; per-volume IOPS across 9 client volumes (single-OSD CPU around 1000%):

      write: io=284264KB, bw=4728.4KB/s, iops=1182, runt= 60119msec
        clat (msec): min=1, max=267, avg=108.27, stdev=66.00
    

With IO hitting disk (single-OSD CPU around 1300%):

      write: io=230180KB, bw=3830.5KB/s, iops=957, runt= 60093msec
        clat (msec): min=2, max=319, avg=133.65, stdev=69.22
    

The primary PG count was unbalanced, leaving one OSD handling about 400K fewer osd_ops than the other two (1.8M vs 2.2M). After adjusting manually:

    hzwangpan@pubt2-ceph2 ceph $ ceph osd primary-affinity 35 0.95
    set osd.35 primary-affinity to 0.95 (8622592)
    hzwangpan@pubt2-ceph2 ceph $ ceph pg ls-by-primary 35 | wc

IOPS improved a little (no disk IO, other conditions as above):

      write: io=306644KB, bw=5102.2KB/s, iops=1275, runt= 60101msec
        clat (msec): min=1, max=247, avg=100.34, stdev=62.83
    

Question: with 3 OSDs and no disk IO, why is performance still this poor?

Custom perf counters first pointed at slow repop send/receive between OSDs, confirmed with debug_ms=10 logs. Latency and queue-length statistics were then added in Pipe::reader() and the logs analyzed: with 9 volumes and no disk IO, the sent queue length exceeds 1100 and repop-related dispatch takes about 0.1 ms, with client latency around 120+ ms; with a single volume, the sent queue length is around 100, dispatch time unchanged, and client latency about 14 ms, i.e. roughly a 9x gap versus the 9-volume case (although total IOPS differs little between 1 and 9 volumes). So the server side cannot keep up with highly concurrent IO, and server-side queueing drives the latency. Instrumentation added in Pipe::writer() showed out_q lengths of essentially 0 or 1, i.e. no queueing there.

The threads with the highest CPU usage were all Pipe::reader() threads, above 90%; kill -USR1 showed them inside mark_event. After disabling the op tracker (osd_enable_op_tracker = false), 9 volumes at iodepth 128 with no disk IO went from about 1050 IOPS to 1680+, latency fell from 122 ms to 75 ms, and per-OSD CPU dropped from 1200% to about 750%.

Additionally, pinning the Pipe::reader threads to dedicated CPUs and the remaining OSD threads to the others gained a little more, from 1680+ to 1700+. Setting all log levels to 0 (including in-memory) lifted IOPS to 1950+. All of the above is no-disk-IO, 9 volumes, iodepth 128.

3-Node 57-OSD 3-Replica Performance Analysis

9 volumes, 57 OSDs (3 nodes, 19 per node), IO hitting disk (op tracker off, logging off), 4K random write, iodepth 128; average per-volume performance:

      write: io=995372KB, bw=16585KB/s, iops=4146, runt= 60017msec
        clat (msec): min=1, max=161, avg=30.73, stdev=16.42
    

With the old build and the old config, same test case and same environment, the IOPS numbers were:

      write: io=760796KB, bw=12671KB/s, iops=3167, runt= 60042msec
        clat (msec): min=1, max=1272, avg=40.38, stdev=77.00
    

In both cases the first 4 cores were saturated; comparison showed they were occupied by OSD and fio threads.

After pinning (fio and OSD threads to CPUs 4-55, fio spread across the 3 nodes with 3 volumes per node), the old build with the old config improved to:

      write: io=884548KB, bw=14609KB/s, iops=3652, runt= 60550msec
        clat (msec): min=1, max=1572, avg=35.00, stdev=87.48
    

With the same pinning, the new build plus new config improved to:

      write: io=1178.4MB, bw=20102KB/s, iops=4960, runt= 60027msec
        clat (msec): min=1, max=442, avg=25.33, stdev=13.11
    

At this point CPU was clearly the bottleneck, so the async messenger was tried in place of the simple messenger (requiring a backport of the newer async code); async runs far fewer threads, so CPU usage should drop. Async did beat simple, reaching about 5270 (versus about 4960 for simple).

Further comparisons were then made, all with the async messenger:
1. Without pinning the OSD threads to CPUs 4-55, IOPS dropped to about 5260 (whether the fio threads are pinned to 4-55 makes almost no difference).
2. With pinning and fewer threads, about 5500:
filestore_op_threads = 16
osd_op_num_threads_per_shard = 1
osd_op_num_shards = 16
filestore_ondisk_finisher_threads = 8
filestore_apply_finisher_threads = 8
osd_enable_op_tracker = true
other settings unchanged

3. With osd_enable_op_tracker off, about 5570; other settings unchanged.
4. With fio moved to separate (non-OSD) nodes and everything else unchanged, 5819; switching async back to the simple messenger in the same scenario gives 5106.

For comparison, the production 1.2-0 build with the production config, no pinning, and fio on a separate node gives a total of 30652 4K random-write IOPS, 3405 per volume on average:

      write: io=825868KB, bw=13623KB/s, iops=3405, runt= 60624msec
        clat (msec): min=1, max=1356, avg=37.57, stdev=84.53
    

With pinning, performance rises to 33145 total, 3682 per volume:

      write: io=893156KB, bw=14749KB/s, iops=3687, runt= 60556msec
        clat (msec): min=1, max=1596, avg=34.70, stdev=97.10
    

With pinning plus config file changes, the 1.2-0 build reaches 47750 total, 5305 per volume:

      write: io=1246.5MB, bw=21257KB/s, iops=5314, runt= 60044msec
        clat (usec): min=873, max=2790.6K, avg=24075.40, stdev=126415.06
    

With only 6 OSDs per node, performance also beats the 19-OSD layout: per-volume average 6415 IOPS (57735 total) at 19-20 ms latency. Single-OSD CPU sat around 700-800% and memory peaked near 12 GB, because data flushing lagged the journal by 10-odd seconds; only in this setup did the SSDs become the bottleneck.

Summary:

1. CPU is the bottleneck (during testing, apart from the first 4 CPUs, which run only soft interrupts and keep 20-30% idle, the other CPUs are essentially at 0% idle); the SSDs still have plenty of headroom.
2. The async messenger performs best: far fewer threads and lower CPU usage buy some performance; total IOPS reaches 52374.
3. The merged code optimizations have limited effect on the final multi-OSD numbers, since CPU is the bottleneck and the SSDs cannot be fully exploited; the single-OSD gains are much more visible.
4. CPU pinning improves performance.
5. The in-memory log level must also be set to 0; that helps as well.
6. Having the op tracker on hurts a lot with no disk IO but little with disk IO; acceptable.
7. perf top shows most CPU goes to memory allocation and freeing, followed by bufferlist and string operations.

Config changes:

    --- online.conf 2019-10-23 14:24:52.738763518 +0800
    +++ ceph.conf   2019-10-23 14:27:08.326466021 +0800
    @@ -1,4 +1,6 @@
     [global]
    +enable experimental unrecoverable data corrupting features = ms-type-async
    +ms_type = async
    
    @@ -32,11 +27,71 @@
    +osd_pg_object_context_cache_count = 1024
    +filestore_queue_max_ops = 5000
    +filestore_queue_max_bytes = 1048576000
    +filestore_ondisk_finisher_threads = 4
    +filestore_apply_finisher_threads = 4
    +osd_client_message_cap = 5000
    +osd_client_message_size_cap = 1048576000
    +journal_queue_max_ops = 5000
    +journal_queue_max_bytes = 524288000
    +journal_max_write_entries = 5000
    +journal_max_write_bytes = 20971520
    +
    +debug_none = 0
    +debug_lockdep = 0
    +debug_context = 0
    +debug_crush = 0
    +debug_mds = 0
    +debug_mds_balancer = 0
    +debug_mds_locker = 0
    +debug_mds_log = 0
    +debug_mds_log_expire = 0
    +debug_mds_migrator = 0
    +debug_buffer = 0
    +debug_timer = 0
    +debug_filer = 0
    +debug_striper = 0
    +debug_objecter = 0
    +debug_rados = 0
    +debug_rbd = 0
    +debug_rbd_replay = 0
    +debug_journaler = 0
    +debug_objectcacher = 0
    +debug_client = 0
    +debug_osd = 0
    +debug_optracker = 0
    +debug_objclass = 0
    +debug_filestore = 0
    +debug_keyvaluestore = 0
    +debug_journal = 0
    +debug_ms = 0
    +debug_mon = 0
    +debug_monc = 0
    +debug_paxos = 0
    +debug_tp = 0
    +debug_auth = 0
    +debug_crypto = 0
    +debug_finisher = 0
    +debug_heartbeatmap = 0
    +debug_perfcounter = 0
    +debug_rgw = 0
    +debug_civetweb = 0
    +debug_javaclient = 0
    +debug_asok = 0
    +debug_throttle = 0
    +debug_refs = 0
    +debug_xio = 0
    

Other Improvement Ideas

1. Optimize OSD CPU usage to exploit more of the SSDs' performance.
2. Consider mixing SAS and SSD in deployments to lower the SSD density per node and utilize the SSDs better.

Deleting an RBD Volume with rados Commands

A mistyped parameter created a hugely oversized volume, 16384 TB. The H version lacks the object-map feature, so rbd rm was extremely slow, yet the volume had almost no rados objects behind it (little data had been written). So the volume was cleaned up with rados commands instead, as follows:

1. First inspect the volume: rbd info rbd/rbd_sys_disk
      rbd image 'rbd_sys_disk':
      size 16384 TB in 4294967296 objects
      order 22 (4096 kB objects)
      block_name_prefix: rbd_data.6a74656327b23c6
      format: 2
      features: layering
      flags:
2. Then clean up the data objects and volume records in rados (a librados sketch of the per-object deletion is given after this list):
   $ rados ls -p rbd | grep rbd_data.6a74656327b23c6
   $ rados rm -p rbd rbd_data.6a74656327b23c6.0000000000000214 # delete the objects one by one; easily scripted
   $ rados rm -p rbd rbd_header.6a74656327b23c6  # remove the rbd_header object
   $ rados listomapvals -p rbd rbd_directory  # list the omap entries of rbd_directory
      id_6a74656327b23c6   # delete this key
          value: (16 bytes) :
          0000 : 0c 00 00 00 72 62 64 5f 73 79 73 5f 64 69 73 6b : ....rbd_sys_disk
      
          id_6a95c8a327b23c6
          value: (17 bytes) :
      0000 : 0d 00 00 00 72 62 64 5f 73 79 73 5f 64 69 73 6b : ....rbd_sys_disk   # note: the name continues with a 1 on the next line
          0010 : 31                                              : 1
      
          id_6a9e49a6b8b4567
          value: (12 bytes) :
          0000 : 08 00 00 00 76 6f 6c 75 6d 65 30 32             : ....volume02
      
          id_6aa42d36b8b4567
          value: (19 bytes) :
          0000 : 0f 00 00 00 72 62 64 5f 6c 6f 67 69 63 5f 64 69 : ....rbd_logic_di
          0010 : 73 6b 31                                        : sk1
      
          name_rbd_logic_disk1
          value: (19 bytes) :
          0000 : 0f 00 00 00 36 61 61 34 32 64 33 36 62 38 62 34 : ....6aa42d36b8b4
          0010 : 35 36 37                                        : 567
      
      name_rbd_sys_disk   # delete this key
          value: (19 bytes) :
          0000 : 0f 00 00 00 36 61 37 34 36 35 36 33 32 37 62 32 : ....6a74656327b2
          0010 : 33 63 36                                        : 3c6
      
          name_rbd_sys_disk1
          value: (19 bytes) :
          0000 : 0f 00 00 00 36 61 39 35 63 38 61 33 32 37 62 32 : ....6a95c8a327b2
          0010 : 33 63 36                                        : 3c6
      
          name_volume02
          value: (19 bytes) :
          0000 : 0f 00 00 00 36 61 39 65 34 39 61 36 62 38 62 34 : ....6a9e49a6b8b4
          0010 : 35 36 37                                        : 567
      $ rados rmomapkey -p rbd rbd_directory id_6a74656327b23c6
      $ rados rmomapkey -p rbd rbd_directory name_rbd_sys_disk
  $ rbd ls -p rbd   # rbd_sys_disk no longer shows up
      
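For the delete-objects-one-by-one step, here is a minimal librados C++ sketch (assuming the hammer-era objects_begin/objects_end iterator API; the pool and prefix come from the example above, and error handling is kept minimal):

    #include <iostream>
    #include <string>
    #include <rados/librados.hpp>

    int main() {
      const std::string pool = "rbd";
      const std::string prefix = "rbd_data.6a74656327b23c6.";  // block_name_prefix + "."

      librados::Rados cluster;
      cluster.init(NULL);            // no client name, matching the CLI usage above
      cluster.conf_read_file(NULL);  // default /etc/ceph/ceph.conf
      if (cluster.connect() < 0) {
        std::cerr << "connect failed\n";
        return 1;
      }
      librados::IoCtx io_ctx;
      if (cluster.ioctx_create(pool.c_str(), io_ctx) < 0) {
        std::cerr << "pool open failed\n";
        return 1;
      }
      // walk every object in the pool and remove those carrying the data prefix
      for (librados::ObjectIterator it = io_ctx.objects_begin();
           it != io_ctx.objects_end(); ++it) {
        const std::string &oid = it->first;
        if (oid.compare(0, prefix.size(), prefix) == 0) {
          io_ctx.remove(oid);
          std::cout << "removed " << oid << "\n";
        }
      }
      cluster.shutdown();
      return 0;
    }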

mock osd Journal and FileStore

While analyzing ceph rbd performance bottlenecks I wanted to take the disks out of the picture and look only at the time spent in the OSD software stack, so I mocked out all the code paths that persist IO; testing shows this works.

The patch is based on the H version 0.94.6 and is for reference only. It is small, so it is given without further commentary:

    diff --git a/src/common/config_opts.h b/src/common/config_opts.h
    index 0748acf..7255f53 100644
    --- a/src/common/config_opts.h
    +++ b/src/common/config_opts.h
    @@ -858,6 +858,7 @@ OPTION(filestore_fd_cache_shards, OPT_INT, 16)   // FD number of shards
     OPTION(filestore_dump_file, OPT_STR, "")         // file onto which store transaction dumps
     OPTION(filestore_kill_at, OPT_INT, 0)            // inject a failure at the n'th opportunity
     OPTION(filestore_inject_stall, OPT_INT, 0)       // artificially stall for N seconds in op queue thread
    +OPTION(filestore_inject_blackhole, OPT_BOOL, false)
     OPTION(filestore_fail_eio, OPT_BOOL, true)       // fail/crash on EIO
     OPTION(filestore_debug_verify_split, OPT_BOOL, false)
     OPTION(journal_dio, OPT_BOOL, true)
    diff --git a/src/os/FileJournal.cc b/src/os/FileJournal.cc
    index 14cecc2..ce8da41 100644
    --- a/src/os/FileJournal.cc
    +++ b/src/os/FileJournal.cc
    @@ -1370,7 +1370,10 @@ int FileJournal::write_aio_bl(off64_t& pos, bufferlist& bl, uint64_t seq)
         iocb *piocb = &aio.iocb;
         int attempts = 10;
         do {
    -      int r = io_submit(aio_ctx, 1, &piocb);
    +      int r = 1;
    +      if (!g_conf->filestore_inject_blackhole) {
    +          r = io_submit(aio_ctx, 1, &piocb);
    +      }
           if (r < 0) {
            derr << "io_submit to " << aio.off << "~" << aio.len
                 << " got " << cpp_strerror(r) << dendl;
    @@ -1406,7 +1409,10 @@ void FileJournal::write_finish_thread_entry()
    
         dout(20) << "write_finish_thread_entry waiting for aio(s)" << dendl;
         io_event event[16];
    -    int r = io_getevents(aio_ctx, 1, 16, event, NULL);
    +    int r = 1;
    +    if (!g_conf->filestore_inject_blackhole) {
    +        r = io_getevents(aio_ctx, 1, 16, event, NULL);
    +    }
         if (r < 0) {
           if (r == -EINTR) {
            dout(0) << "io_getevents got " << cpp_strerror(r) << dendl;
    @@ -1416,6 +1422,7 @@ void FileJournal::write_finish_thread_entry()
           assert(0 == "got unexpected error from io_getevents");
         }
    
    +    if (!g_conf->filestore_inject_blackhole)
         {
           Mutex::Locker locker(aio_lock);
           for (int i=0; i<r ; i++) {
    @@ -1449,6 +1456,9 @@ void FileJournal::write_finish_thread_entry()
            ai->done = true;
           }
           check_aio_completion();
    +    } else {
    +        Mutex::Locker locker(aio_lock);
    +        check_aio_completion();
         }
       }
    dout(10) << "write_finish_thread_entry exit" << dendl;
    @@ -1468,7 +1478,7 @@ void FileJournal::check_aio_completion()
       uint64_t new_journaled_seq = 0;
    
       list<aio_info>::iterator p = aio_queue.begin();
    -  while (p != aio_queue.end() && p->done) {
    +  while (p != aio_queue.end() && (p->done || g_conf->filestore_inject_blackhole)) {
    dout(20) << "check_aio_completion completed seq " << p->seq << " "
            << p->off << "~" << p->len << dendl;
         if (p->seq) {
    diff --git a/src/os/FileStore.cc b/src/os/FileStore.cc
    index 3e7e8fb..4c59040 100644
    --- a/src/os/FileStore.cc
    +++ b/src/os/FileStore.cc
    @@ -2286,7 +2286,8 @@ unsigned FileStore::_do_transaction(
    
         Transaction::Op *op = i.decode_op();
         int r = 0;
    -
    +    if (g_conf->filestore_inject_blackhole)
    +        return 0;
         _inject_failure();
    
         switch (op->op) {
    
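With the patch applied, the blackhole can be toggled without rebuilding, either via ceph.conf or at runtime through the admin socket, e.g. `ceph daemon osd.0 config set filestore_inject_blackhole true`; since each injected site re-reads g_conf->filestore_inject_blackhole on every call, no config observer is needed.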

qemu_rbd_open Flow Analysis

QEMU process startup arguments:
./qemu-system-x86_64 -m 512 -smp 1 -drive file=rbd:rbd/vol1:auth_supported=none:mon_host=192.168.0.2\\:6789,cache=none,if=none,format=raw

    // qemu\block\rbd.c
    static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
                             Error **errp)
    {
        BDRVRBDState *s = bs->opaque;
        char pool[RBD_MAX_POOL_NAME_SIZE];
        char snap_buf[RBD_MAX_SNAP_NAME_SIZE];
        char conf[RBD_MAX_CONF_SIZE];
        char clientname_buf[RBD_MAX_CONF_SIZE];
        char *clientname;
        const char *secretid;
        QemuOpts *opts;
        Error *local_err = NULL;
        const char *filename;
        int r;
    
        opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort); // initialize the opts structure
        qemu_opts_absorb_qdict(opts, options, &local_err);  // absorb options into opts
        if (local_err) {
            error_propagate(errp, local_err);
            qemu_opts_del(opts);
            return -EINVAL;
        }
    
        filename = qemu_opt_get(opts, "filename");  // "rbd:rbd/vol1:auth_supported=none:mon_host=192.168.0.2\\:6789"
        secretid = qemu_opt_get(opts, "password-secret"); // 0x0
    
        if (qemu_rbd_parsename(filename, pool, sizeof(pool), // parse pool, snap, name and conf out of filename: pool is the storage pool, snap the snapshot name, name the volume name, conf the config file path
                               snap_buf, sizeof(snap_buf),
                               s->name, sizeof(s->name),
                               conf, sizeof(conf), errp) < 0) {
            r = -EINVAL;
            goto failed_opts;
        }
    
        clientname = qemu_rbd_parse_clientname(conf, clientname_buf); // clientname=0x0; qemu does not pass this parameter, so it is empty
        r = rados_create(&s->cluster, clientname);
        if (r < 0) {
            error_setg_errno(errp, -r, "error initializing");
            goto failed_opts;
        }
    
        s->snap = NULL;
        if (snap_buf[0] != '\0') {
            s->snap = g_strdup(snap_buf);
        }
    
        if (strstr(conf, "conf=") == NULL) {
            /* try default location, but ignore failure */
            rados_conf_read_file(s->cluster, NULL);  // taken here: no "conf=" on the command line; librados has default paths
        } else if (conf[0] != '\0') {
            r = qemu_rbd_set_conf(s->cluster, conf, true, errp);
            if (r < 0) {
                goto failed_shutdown;
            }
        }
    
        if (conf[0] != '\0') {
            r = qemu_rbd_set_conf(s->cluster, conf, false, errp); // pass the options in conf to librados's _conf object; this overrides options read from the config file above, e.g. mon_host
            if (r < 0) {
                goto failed_shutdown;
            }
        }
    
        if (qemu_rbd_set_auth(s->cluster, secretid, errp) < 0) {  // secretid is empty
            r = -EIO;
            goto failed_shutdown;
        }
    
        /*
         * Fallback to more conservative semantics if setting cache
         * options fails. Ignore errors from setting rbd_cache because the
         * only possible error is that the option does not exist, and
         * librbd defaults to no caching. If write through caching cannot
         * be set up, fall back to no caching.
         */
        if (flags & BDRV_O_NOCACHE) {
            rados_conf_set(s->cluster, "rbd_cache", "false"); // 这里,这个flags估计要传递给part2
        } else {
            rados_conf_set(s->cluster, "rbd_cache", "true");
        }
    
        r = rados_connect(s->cluster);
        if (r < 0) {
            error_setg_errno(errp, -r, "error connecting");
            goto failed_shutdown;
        }
    
        r = rados_ioctx_create(s->cluster, pool, &s->io_ctx);
        if (r < 0) {
            error_setg_errno(errp, -r, "error opening pool %s", pool);
            goto failed_shutdown;
        }
    
        r = rbd_open(s->io_ctx, s->name, &s->image, s->snap);
        if (r < 0) {
            error_setg_errno(errp, -r, "error reading header from %s", s->name);
            goto failed_open;
        }
    
        bs->read_only = (s->snap != NULL);
    
        qemu_opts_del(opts);
        return 0;
    
    failed_open:
        rados_ioctx_destroy(s->io_ctx);
    failed_shutdown:
        rados_shutdown(s->cluster);
        g_free(s->snap);
    failed_opts:
        qemu_opts_del(opts);
        return r;
    }
    
    // src\librados\librados.cc
    // -- config --
    extern "C" int rados_conf_read_file(rados_t cluster, const char *path_list)
    {
      tracepoint(librados, rados_conf_read_file_enter, cluster, path_list);
      librados::RadosClient *client = (librados::RadosClient *)cluster;
      md_config_t *conf = client->cct->_conf;
      std::deque<std::string> parse_errors;
      int ret = conf->parse_config_files(path_list, &parse_errors, NULL, 0);
      if (ret) {
        tracepoint(librados, rados_conf_read_file_exit, ret);
        return ret;
      }
      conf->parse_env(); // environment variables override
    
      conf->apply_changes(NULL);
      complain_about_parse_errors(client->cct, &parse_errors);
      tracepoint(librados, rados_conf_read_file_exit, 0);
      return 0;
    }
    

    // src\common\config.cc
    int md_config_t::parse_config_files(const char *conf_files,
                                        std::deque<std::string> *parse_errors,
                                        std::ostream *warnings, int flags)
    {
      Mutex::Locker l(lock);
      if (internal_safe_to_start_threads)
        return -ENOSYS;
      if (!conf_files) {
        const char *c = getenv("CEPH_CONF");
        if (c) {
          conf_files = c;
        } else {
          if (flags & CINIT_FLAG_NO_DEFAULT_CONFIG_FILE)
            return 0;
          conf_files = CEPH_CONF_FILE_DEFAULT; // const char *CEPH_CONF_FILE_DEFAULT = "/etc/ceph/$cluster.conf, ~/.ceph/$cluster.conf, $cluster.conf";
        }
      }
      std::list<std::string> cfl;
      get_str_list(conf_files, cfl); // cfl = {[0] = "/etc/ceph/$cluster.conf", [1] = "~/.ceph/$cluster.conf", [2] = "$cluster.conf"}
      return parse_config_files_impl(cfl, parse_errors, warnings);
    }

    int md_config_t::parse_config_files_impl(const std::list<std::string> &conf_files,
                                             std::deque<std::string> *parse_errors,
                                             std::ostream *warnings)
    {
      assert(lock.is_locked());

      // open new conf
      list<string>::const_iterator c;
      for (c = conf_files.begin(); c != conf_files.end(); ++c) {
        cf.clear();
        string fn = *c;
        expand_meta(fn, warnings);
        int ret = cf.parse_file(fn.c_str(), parse_errors, warnings);
        if (ret == 0)
          break; // stop after the first config file that parses: /etc/ceph/$cluster.conf, where $cluster expands to the default "ceph"
        else if (ret != -ENOENT)
          return ret;
      }
      if (c == conf_files.end())
        return -EINVAL;

      std::vector<std::string> my_sections;
      _get_my_sections(my_sections); // my_sections = {"client.admin", "client", "global"}; normally the [client] and [global] sections are read
      for (int i = 0; i < NUM_CONFIG_OPTIONS; i++) {
        config_option *opt = &config_optionsp[i];
        std::string val;
        int ret = _get_val_from_conf_file(my_sections, opt->name, val, false); // read the option from the config file
        if (ret == 0) {
          set_val_impl(val.c_str(), opt); // set the option's value, e.g. mon_host = 192.168.0.2
        }
      }

      // subsystems: the debug log options
      for (int o = 0; o < subsys.get_num(); o++) {
        std::string as_option("debug_");
        as_option += subsys.get_name(o);
        std::string val;
        int ret = _get_val_from_conf_file(my_sections, as_option.c_str(), val, false);
        if (ret == 0) {
          int log, gather;
          int r = sscanf(val.c_str(), "%d/%d", &log, &gather);
          if (r >= 1) {
            if (r < 2)
              gather = log;
            subsys.set_log_level(o, log);
            subsys.set_gather_level(o, gather);
          }
        }
      }

      // Warn about section names that look like old-style section names
      std::deque<std::string> old_style_section_names;
      for (ConfFile::const_section_iter_t s = cf.sections_begin();
           s != cf.sections_end(); ++s) {
        const string &str(s->first);
        if (((str.find("mds") == 0) || (str.find("mon") == 0) || (str.find("osd") == 0)) &&
            (str.size() > 3) && (str[3] != '.')) {
          old_style_section_names.push_back(str);
        }
      }
      if (!old_style_section_names.empty()) {
        ostringstream oss;
        oss << "ERROR! old-style section name(s) found: ";
        string sep;
        for (std::deque<std::string>::const_iterator os = old_style_section_names.begin();
             os != old_style_section_names.end(); ++os) {
          oss << sep << *os;
          sep = ", ";
        }
        oss << ". Please use the new style section names that include a period.";
        parse_errors->push_back(oss.str());
      }
      return 0;
    }


Ceph Built-in Tooling in Practice

admin socket

Admin sockets exist on both the client and server side, for OSDs and mons alike:

    # /var/run/ceph
    srwxr-xr-x 1 root root 0 Jul 17 14:01 ceph-client.admin.21723.24796272.asok
    srwxr-xr-x 1 root root 0 Jul 29 10:30 ceph-client.admin.2583.54700240.asok
    srwxr-xr-x 1 root root 0 Jul 29 10:30 ceph-client.admin.2583.55071360.asok
    srwxr-xr-x 1 root root 0 Jul 29 16:45 ceph-mon.ceph-l.asok
    srwxr-xr-x 1 root root 0 Jul 29 17:09 ceph-osd.0.asok
    srwxr-xr-x 1 root root 0 Jul 29 16:44 ceph-osd.1.asok
    srwxr-xr-x 1 root root 0 Jul 29 16:44 ceph-osd.2.asok
    srwxr-xr-x 1 root root 0 Jul 29 16:44 ceph-osd.3.asok
    

Creation Flow

Registering command hook functions

    CephContext::CephContext(uint32_t module_type_)
      ...
    {
      ...
      _admin_socket = new AdminSocket(this);
      _admin_hook = new CephContextHook(this);
      _admin_socket->register_command("perf dump", "perf dump name=logger,type=CephString,req=false name=counter,type=CephString,req=false", _admin_hook, "dump perfcounters value");
      ...
      _admin_socket->register_command("log reopen", "log reopen", _admin_hook, "reopen log file");
    }
    

Thread creation and startup

    // ceph_osd.cc:main->common_init_finish->start_service_thread->AdminSocket::init
    
    class AdminSocket : public Thread
    {
    public:
      AdminSocket(CephContext *cct);
      virtual ~AdminSocket();
      ...
    }
    
    bool AdminSocket::init(const std::string &path)
    {
      ...
      err = create_shutdown_pipe(&pipe_rd, &pipe_wr);
      ...
      int sock_fd;
      err = bind_and_listen(path, &sock_fd);
      ...
    
      /* Create new thread */
      m_sock_fd = sock_fd;
      m_shutdown_rd_fd = pipe_rd;
      m_shutdown_wr_fd = pipe_wr;
      m_path = path;
    
      m_version_hook = new VersionHook;
      register_command("0", "0", m_version_hook, "");
      register_command("version", "version", m_version_hook, "get ceph version");
      register_command("git_version", "git_version", m_version_hook, "get git sha1");
      m_help_hook = new HelpHook(this);
      register_command("help", "help", m_help_hook, "list available commands");
      m_getdescs_hook = new GetdescsHook(this);
      register_command("get_command_descriptions", "get_command_descriptions",
               m_getdescs_hook, "list available commands");
    
      create(); // Thread::create
      add_cleanup_file(m_path.c_str());
      return true;
    }
    
    
    

Command handling

    // perfcounter hooks
    
    class CephContextHook : public AdminSocketHook {
      CephContext *m_cct;
    
    public:
      CephContextHook(CephContext *cct) : m_cct(cct) {}
    
      bool call(std::string command, cmdmap_t& cmdmap, std::string format,
            bufferlist& out) {
        m_cct->do_command(command, cmdmap, format, &out);
        return true;
      }
    };
    
    
    void CephContext::do_command(std::string command, cmdmap_t& cmdmap,
                     std::string format, bufferlist *out)
    {
      Formatter *f = Formatter::create(format, "json-pretty", "json-pretty");
      stringstream ss;
      for (cmdmap_t::iterator it = cmdmap.begin(); it != cmdmap.end(); ++it) {
        if (it->first != "prefix") {
          ss << it->first << ":" << cmd_vartype_stringify(it->second) << " ";
        }
      }
      lgeneric_dout(this, 1) << "do_command '" << command << "' '"
                 << ss.str() << dendl;
      if (command == "perfcounters_dump" || command == "1" ||
          command == "perf dump") {
        std::string logger;
        std::string counter;
        cmd_getval(this, cmdmap, "logger", logger);
        cmd_getval(this, cmdmap, "counter", counter);
        _perf_counters_collection->dump_formatted(f, false, logger, counter);
      }
      else if (command == "perfcounters_schema" || command == "2" ||
        command == "perf schema") {
        _perf_counters_collection->dump_formatted(f, true);
      }
      ...
    }
    

Uses

Query version, config options, throttles, and perf counter data; reopen log; dump log; dump_mempools; dump_objectstore_kv_stats; dump_historic_ops; and more.

Example use cases:

Paxos latency monitoring

    hzwangpan@pubt2-ceph27:/usr/local/nagent/libexec$ sudo ceph daemon mon.pubt2-ceph27  perf dump paxos
    {
        "paxos": {
            ...
            "begin_latency": {
                "avgcount": 0,
                "sum": 0.000000000
            },
            "commit": 0,
            "commit_keys": {
                "avgcount": 0,
                "sum": 0
            },
            "commit_bytes": {
                "avgcount": 0,
                "sum": 0
            },
            "commit_latency": {
                "avgcount": 0,
                "sum": 0.000000000
            },
            ...
        }
    }
    
    

Peering latency monitoring

    hzwangpan@pubt2-ceph27:/usr/local/nagent/libexec$ sudo ceph daemon osd.1  perf dump recoverystate_perf | egrep "peering|waitup" -A3
            "peering_latency": {
                "avgcount": 803,
                "sum": 88.156743622
            },
    --
            "waitupthru_latency": {
                "avgcount": 0,
                "sum": 0.000000000
            }
    

Overflow inspection script

    for mon in `ls /var/run/ceph/ceph-mon.*.asok`; do
       echo config set debug_ms=0/15 for $mon
       sudo ceph --admin-daemon $mon config set debug_ms 0/15
    done
    
    sleep 5
    
    for mon in `ls /var/run/ceph/ceph-mon.*.asok`; do
       echo dump log for $mon
       sudo ceph --admin-daemon $mon log dump
    done
    
    sleep 2
    
    for mon in `ls /var/run/ceph/ceph-mon.*.asok`; do
       echo config set debug_ms=0/5 for $mon
       sudo ceph --admin-daemon $mon config set debug_ms 0/5
    done
    
    echo ==========WARNING MON start==============
    grep -H "got ack seq" /data/log/ceph/ceph-mon.*.log | grep " 15 -- " | awk '$NF>1073741824 {print $0}' | awk -F'ceph-|.log:' '{print $2}' | sort | uniq
    echo ==========WARNING MON end==============
    

Changing Config Options

Single-instance config change

    ceph daemon osd.0 config set debug_ms 15
    

Implementation: as in the admin socket creation/registration flow above, a handler is registered for the config set command.

Multi-instance config change

    ceph tell osd.* injectargs '--debug_ms=15'
    

Implementation:
/usr/bin/ceph inspects the arguments after tell; if a specific target (e.g. osd.0) is given, it sends an injectargs message straight to osd.0, where the OSD's tp_osd_cmd thread handles it in do_command (L-version code path), then md_config_t::injectargs applies the actual config changes and handle_conf_change finally runs each changed option's side effects.

If osd.* is specified, the list of all OSD ids is first fetched from the mon, and the same procedure is repeated for each OSD.

Performance Analysis

Performance overview

    ceph osd perf

Shows the OSDs' FileStore latencies:

    root@ceph-l ceph $ ceph osd perf
    osd fs_commit_latency(ms) fs_apply_latency(ms) 
      0                    65                   72 
      1                    27                   36 
      2                    22                   31 
      3                     0                    0 
    

    ceph -s

    hzwangpan@prit1-ceph19:~$ ceph -s
        cluster ee7b0372-f77e-4504-a4cf-4f5a3f503857
         health HEALTH_WARN
                pool bigceph_jiande2_sas_vol01 has too few pgs
         monmap e1: 3 mons at {prit1-ceph19=10.198.136.19:6789/0,prit1-ceph20=10.198.136.20:6789/0,prit1-ceph21=10.198.136.21:6789/0}
                election epoch 36, quorum 0,1,2 prit1-ceph19,prit1-ceph20,prit1-ceph21
         osdmap e12259: 267 osds: 267 up, 267 in
          pgmap v10782798: 8192 pgs, 9 pools, 12860 GB data, 3235 kobjects
                38661 GB used, 417 TB / 454 TB avail
                    8182 active+clean
                      10 active+clean+scrubbing+deep
      client io 57582 kB/s rd, 14104 kB/s wr, 2831 op/s
    

ceph osd pool stats

    // no physical-pool-level statistics
    hzwangpan@prit1-ceph19:~$ ceph osd pool stats
    pool bigceph_jiande2_sas_vol01 id 4
      client io 56782 kB/s rd, 15450 kB/s wr, 2498 op/s
    
    pool bigceph_jiande2_ssd_vol01 id 5
      client io 77283 B/s rd, 6136 kB/s wr, 1846 op/s
    
    pool data02_bigdata_ssd_vol01 id 7
      nothing is going on
    

These numbers are periodically reported by the OSDs to the mon (OSD::tick -> OSD::do_mon_report -> OSD::send_pg_stats: m->osd_stat), which simply dumps them, so they are not particularly real-time.

    perf counter

perf counter initialization

Code path: ceph_osd.cc:main -> OSD::init -> OSD::create_recoverystate_perf / OSD::create_logger

    // OSD.h
    enum {
      l_osd_first = 10000,
      l_osd_op_wip,
      l_osd_op,
      l_osd_op_inb,
      l_osd_op_outb,
      l_osd_op_lat,
      ...
    };
    
    
    void OSD::create_logger()
    {
      dout(10) < < "create_logger" << dendl;
    
      PerfCountersBuilder osd_plb(cct, "osd", l_osd_first, l_osd_last);
    
      osd_plb.add_u64(l_osd_op_wip, "op_wip");   // rep ops currently being processed (primary)
    
      osd_plb.add_u64_counter(l_osd_op,       "op");           // client ops
      ...
      logger = osd_plb.create_perf_counters();
      cct->get_perfcounters_collection()->add(logger);
    }
    
    
    void OSD::create_recoverystate_perf()
    {
      dout(10) < < "create_recoverystate_perf" << dendl;
    
      PerfCountersBuilder rs_perf(cct, "recoverystate_perf", rs_first, rs_last);
      ...
      rs_perf.add_time_avg(rs_peering_latency, "peering_latency");
      ...
      rs_perf.add_time_avg(rs_getmissing_latency, "getmissing_latency");
      rs_perf.add_time_avg(rs_waitupthru_latency, "waitupthru_latency");
    
      recoverystate_perf = rs_perf.create_perf_counters();
      cct->get_perfcounters_collection()->add(recoverystate_perf);
    }
    
    void PerfCountersBuilder::add_time_avg(int idx, const char *name)
    {
      add_impl(idx, name, PERFCOUNTER_TIME | PERFCOUNTER_LONGRUNAVG);
    }
    
    
• add_u64/add_u64_counter: cumulative values
    root@ceph-l ceph $ ceph daemon osd.0 perf dump | grep op_wip -A3
            "op_wip": 0,
            "op": 679522,
    
• add_u64_avg: total and average values
    root@ceph-l ceph $ ceph daemon osd.0 perf dump | grep journal_wr_bytes -A3
            "journal_wr_bytes": {
                "avgcount": 414183,
                "sum": 11194507264
            },
    
• add_time: cumulative time
        } else if (d->type & PERFCOUNTER_TIME) {
          f->dump_format_unquoted(d->name, "%" PRId64 ".%09" PRId64,
                      v / 1000000000ull,  // seconds
                      v % 1000000000ull); // nanosecond remainder, printed as the fractional part
        }
    // output looks like "sum": 123.450000000
    
• add_time_avg: averaged time
    root@ceph-l ceph $ ceph daemon osd.0 perf dump | grep op_r_latency -A3    
            "op_r_latency": {
                "avgcount": 1234,
                "sum": 10.300000000
            },
    

perf counter usage

    void ReplicatedPG::log_op_stats(OpContext *ctx)
    {
      OpRequestRef op = ctx->op;
      MOSDOp *m = static_cast<MOSDOp *>(op->get_req());
    
      utime_t now = ceph_clock_now(cct);
      utime_t latency = now;
      latency -= ctx->op->get_req()->get_recv_stamp();
      utime_t process_latency = now;
      process_latency -= ctx->op->get_dequeued_time();
    
      utime_t rlatency;
      if (ctx->readable_stamp != utime_t()) {
        rlatency = ctx->readable_stamp;
        rlatency -= ctx->op->get_req()->get_recv_stamp();
      }
    
      uint64_t inb = ctx->bytes_written;
      uint64_t outb = ctx->bytes_read;
    
      osd->logger->inc(l_osd_op);
    
      osd->logger->inc(l_osd_op_outb, outb);
      ...
      osd->logger->tinc(l_osd_op_process_lat, process_latency);
    
      if (op->may_read() && op->may_write()) {
        ...
      } else if (op->may_read()) {
        osd->logger->inc(l_osd_op_r);
        ...
        osd->logger->tinc(l_osd_op_r_process_lat, process_latency);
      }
      ...
    }
    
    
    void PerfCounters::tinc(int idx, utime_t amt)
    {
      if (!m_cct->_conf->perf)
        return;
    
      assert(idx > m_lower_bound);
      assert(idx < m_upper_bound);
      perf_counter_data_any_d& data(m_data[idx - m_lower_bound - 1]);
      if (!(data.type & PERFCOUNTER_TIME))
        return;
      if (data.type & PERFCOUNTER_LONGRUNAVG) {
        data.avgcount.inc();
        data.u64.add(amt.to_nsec());
        data.avgcount2.inc();
      } else {
        data.u64.add(amt.to_nsec());
      }
    }
    

Fetching perf counter results

    void PerfCounters::dump_formatted(Formatter *f, bool schema,
        const std::string &counter)
    {
      f->open_object_section(m_name.c_str());
    
      for (perf_counter_data_vec_t::const_iterator d = m_data.begin();
           d != m_data.end(); ++d) {
        ...
        if (schema) {
          f->open_object_section(d->name);
          f->dump_int("type", d->type);
          f->close_section();
        } else {
          if (d->type & PERFCOUNTER_LONGRUNAVG) {
        f->open_object_section(d->name);
        pair<uint64_t ,uint64_t> a = d->read_avg();
        if (d->type & PERFCOUNTER_U64) {
          f->dump_unsigned("avgcount", a.second);
          f->dump_unsigned("sum", a.first);
        } else if (d->type & PERFCOUNTER_TIME) {
          f->dump_unsigned("avgcount", a.second);
          f->dump_format_unquoted("sum", "%" PRId64 ".%09" PRId64,
                      a.first / 1000000000ull,
                      a.first % 1000000000ull);
        ...
    }
    

Examples: per-phase peering latency, journal write latency, slow-disk checks, etc.

    hzwangpan@prit1-ceph20:~$ sudo ceph daemon osd.10 perf dump | egrep "peering|getinfo|getlog|getmiss|waitup"  -A3 
            "peering_latency": {
                "avgcount": 240,
                "sum": 12.630200202
            },
    --
            "getinfo_latency": {
                "avgcount": 240,
                "sum": 7.091297764
            },
            "getlog_latency": {
                "avgcount": 187,
                "sum": 4.670226353
            },
    --
            "getmissing_latency": {
                "avgcount": 187,
                "sum": 0.861913789
            },
            "waitupthru_latency": {
                "avgcount": 0,
                "sum": 0.000000000
            }
    
    hzwangpan@prit1-ceph20:~$ sudo ceph daemon osd.10 perf dump | egrep "journal_latency"  -A3
            "journal_latency": {
                "avgcount": 500990,
                "sum": 128.696573903
            },
    
    hzwangpan@prit1-ceph20:~$ sudo ceph daemon osd.10 perf dump | egrep "aio_"  -A3
            "aio_usual_lat": {
                "avgcount": 479092,
                "sum": 72.261508449
            },
            "aio_unusual_lat": {
                "avgcount": 0,
                "sum": 0.000000000
            },
            "aio_slow_lat": {
                "avgcount": 0,
                "sum": 0.000000000
            },
    

    op tracker

op tracker initialization

    OSD::OSD(CephContext *cct_, ObjectStore *store_,
      ...
      op_tracker(cct, cct->_conf->osd_enable_op_tracker,  // default: true
                      cct->_conf->osd_num_op_tracker_shard), // default: 32
      ...
      {
        ...
        op_tracker.set_complaint_and_threshold(cct->_conf->osd_op_complaint_time,
                                             cct->_conf->osd_op_log_threshold);
        op_tracker.set_history_size_and_duration(cct->_conf->osd_op_history_size,
                                               cct->_conf->osd_op_history_duration);
        ...
      }
    
    OpTracker(CephContext *cct_, bool tracking, uint32_t num_shards) : seq(0), 
                                         num_optracker_shards(num_shards),
                         complaint_time(0), log_threshold(0),
                         tracking_enabled(tracking), cct(cct_) { // tracking_enabled cannot be changed at runtime; changing it requires an osd restart
    
        for (uint32_t i = 0; i < num_optracker_shards; i++) {
          char lock_name[32] = {0};
          snprintf(lock_name, sizeof(lock_name), "%s:%d", "OpTracker::ShardedLock", i);
          ShardedTrackingData* one_shard = new ShardedTrackingData(lock_name);
          sharded_in_flight_list.push_back(one_shard);
        }
    

op event recording

    void OSD::ms_fast_dispatch(Message *m)
    {
      if (service.is_stopping()) {
        m->put();
        return;
      }
      OpRequestRef op = op_tracker.create_request<OpRequest>(m); // create the op
      ...
    }
    
    struct OpRequest : public TrackedOp {
      friend class OpTracker;
      ...
      void mark_queued_for_pg() {
        mark_flag_point(flag_queued_for_pg, "queued_for_pg");
      }
      void mark_reached_pg() {
        mark_flag_point(flag_reached_pg, "reached_pg");
      }
      void mark_delayed(const string& s) {
        mark_flag_point(flag_delayed, s);
      }
      void mark_started() {
        mark_flag_point(flag_started, "started");
      }
      void mark_sub_op_sent(const string& s) {
        mark_flag_point(flag_sub_op_sent, s);
      }
      void mark_commit_sent() {
        mark_flag_point(flag_commit_sent, "commit_sent");
      }
      ...
    }
    
    
    void OpRequest::mark_flag_point(uint8_t flag, const string& s) {
      ...
      mark_event(s);
      ...
    }
    
    
    void ReplicatedBackend::op_commit(
      InProgressOp *op)
    {
      dout(10) << __func__ << ": " << op->tid << dendl;
      if (op->op)
        op->op->mark_event("op_commit");
      ...
    }
    

Viewing op events

    root@ceph1 ~ $ ceph daemon osd.0 dump_ops_in_flight
    {
        "ops": [
            {
                "description": "osd_op(client.115363.0:416 6.3c 6:3faa4df8:::benchmark_data_ceph1_589633_object415:head [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] snapc 0=[] ondisk+write+known_if_redirected e337)",
                "initiated_at": "2019-07-30 21:31:38.924985",
                "age": 0.080449,
                "duration": 0.080484,
                "type_data": {
                    "flag_point": "waiting for sub ops",
                    "client_info": {
                        "client": "client.115363",
                        "client_addr": "192.168.0.2:0/176428539",
                        "tid": 416
                    },
                    "events": [
                        {
                            "time": "2019-07-30 21:31:38.924985",
                            "event": "initiated"
                        },
                        {
                            "time": "2019-07-30 21:31:38.930980",
                            "event": "queued_for_pg"
                        },
                        {
                            "time": "2019-07-30 21:31:38.931220",
                            "event": "reached_pg"
                        },
                        {
                            "time": "2019-07-30 21:31:38.931922",
                            "event": "started"
                        },
                        {
                            "time": "2019-07-30 21:31:38.932345",
                            "event": "waiting for subops from 1"
                        },
                        {
                            "time": "2019-07-30 21:31:38.970683",
                            "event": "sub_op_commit_rec from 1"
                        }
                    ]
                }
            },
            ...
    
    root@ceph1 ~ $ ceph daemon osd.0 dump_historic_ops
    {
        "size": 20,
        "duration": 600,
        "ops": [
            {
                "description": "osd_op(client.115364.0:18 6.11 6:89bb7b70:::benchmark_data_ceph1_589885_object17:head [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] snapc 0=[] ondisk+write+known_if_redirected e337)",
                "initiated_at": "2019-07-30 21:58:08.250964",
                "age": 1.486169,
                "duration": 0.111498,
                "type_data": {
                    "flag_point": "commit sent; apply or cleanup",
                    "client_info": {
                        "client": "client.115364",
                        "client_addr": "192.168.0.2:0/3144333915",
                        "tid": 18
                    },
                    "events": [
                        {
                            "time": "2019-07-30 21:58:08.250964",
                            "event": "initiated"
                        },
                        {
                            "time": "2019-07-30 21:58:08.253853",
                            "event": "queued_for_pg"
                        },
                        {
                            "time": "2019-07-30 21:58:08.253941",
                            "event": "reached_pg"
                        },
                        {
                            "time": "2019-07-30 21:58:08.254432",
                            "event": "started"
                        },
                        {
                            "time": "2019-07-30 21:58:08.256617",
                            "event": "waiting for subops from 1"
                        },
                        {
                            "time": "2019-07-30 21:58:08.317819",
                            "event": "sub_op_commit_rec from 1"
                        },
                        {
                            "time": "2019-07-30 21:58:08.361995",
                            "event": "op_commit"
                        },
                        {
                            "time": "2019-07-30 21:58:08.362093",
                            "event": "commit_sent"
                        },
                        {
                            "time": "2019-07-30 21:58:08.362126",
                            "event": "op_applied"
                        },
                        {
                            "time": "2019-07-30 21:58:08.362462",
                            "event": "done"
                        }
                    ]
                }
            },
            ...
    

Slow requests that reach the complaint threshold are also printed to the log.

Slow-op logging improvements based on the op tracker

Pain point: op history cannot be preserved long-term

Solution: modify the code so that the history of long-running ops is persisted to the log file

    2019-04-26 06:25:05.454228 7f0ac2c17700 0 — op tracker — slow op: osd_op(client.544152.0:3674314 rbd_data.2af5f6b8b4567.0000000000000065 [write 961608~4096] 0.8c5714c5 snapc 5=[] ondisk+write+known_if_redirected e57939), start at: 2019-04-26 06:25:05.343005, duration: 0.111196, history: (0ms: initiated) (0ms: queued_for_pg) (0ms: reached_pg) (0ms: started) (0ms: waiting for subops from 1,2) (23ms: commit_queued_for_journal_write) (43ms: write_thread_in_journal_buffer) (38ms: sub_op_commit_rec) (0ms: filestore_queued_op) (0ms: journaled_completion_queued) (0ms: op_commit) (0ms: sub_op_commit_rec) (4ms: commit_sent) (0ms: filestore_do_op) (0ms: filestore_do_op_end) (0ms: op_applied) (0ms: done)
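
For reference, a minimal sketch of how the "(Xms: event)" history above can be rendered, assuming the op's events are available as (timestamp, name) pairs; the container and names here are illustrative, not the actual TrackedOp fields:

    #include <chrono>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    using Clock = std::chrono::steady_clock;
    using Event = std::pair<Clock::time_point, std::string>;

    // Render each event with its delta to the previous event, matching the
    // "(0ms: initiated) (23ms: commit_queued_for_journal_write)" format above.
    std::string format_history(const std::vector<Event> &events) {
      std::ostringstream os;
      for (size_t i = 0; i < events.size(); ++i) {
        auto prev = (i == 0) ? events[i].first : events[i - 1].first;
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            events[i].first - prev).count();
        os << "(" << ms << "ms: " << events[i].second << ") ";
      }
      return os.str();
    }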

Example: analyzing an online IO jitter issue

Analysis record: system-disk util frequently at 100% on a Xiaoshan-A cloud host

    blkin+lttng

Example: full-path latency analysis of rbd volume reads and writes

Ceph blkin+lttng+zipkin performance tracing tools

tcmalloc support

    ceph tell osd.0 heap start_profiler/dump/stats/release

Example: memory usage analysis of an EC OSD on the L release

    Ceph CPU&MEMORY profiling

mempool (L release)

    root@ceph1 ~ $ ceph daemon osd.0 dump_mempools                  
    {
        ...
        "bluestore_alloc": {
            "items": 512,
            "bytes": 512
        },
        "bluestore_cache_data": {
            "items": 3,
            "bytes": 12288
        },
        "bluestore_cache_onode": {
            "items": 879,
            "bytes": 590688
        },
        "bluestore_cache_other": {
            "items": 96422,
            "bytes": 2844381
        },
        ...
        "bluestore_writing": {
            "items": 2,
            "bytes": 8388608
        },
        "bluefs": {
            "items": 90,
            "bytes": 5496
        },
        ...
        "osd": {
            "items": 64,
            "bytes": 773632
        },
        "osd_mapbl": {
            "items": 0,
            "bytes": 0
        },
        "osd_pglog": {
            "items": 643642,
            "bytes": 308785360
        },
        "osdmap": {
            "items": 245,
            "bytes": 18752
        },
        "osdmap_mapping": {
            "items": 0,
            "bytes": 0
        },
        "pgmap": {
            "items": 0,
            "bytes": 0
        },
        ...
        "total": {
            "items": 741904,
            "bytes": 323610213
        }
    }
    

Turning on the debug switch dumps more detailed, per-type information:

    root@ceph1 ~ $ ceph daemon osd.0 config get mempool_debug
    {
        "mempool_debug": "false"
    }
    root@ceph1 ~ $ ceph daemon osd.0 config set mempool_debug true
    {
        "success": "mempool_debug = 'true' osd_objectstore = 'bluestore' (not observed, change may require restart) rocksdb_separate_wal_dir = 'false' (not observed, change may require restart) "
    }
    root@ceph1 ~ $ ceph daemon osd.0 dump_mempools                
    {
        "bloom_filter": {
            "items": 0,
            "bytes": 0,
            "by_type": {
                "unsigned char": {
                    "items": 0,
                    "bytes": 0
                }
            }
        },
        "bluestore_alloc": {
            "items": 512,
            "bytes": 512,
            "by_type": {
                "BitAllocator": {
                    "items": 0,
                    "bytes": 0
                },
                "BitMapArea": {
                    "items": 0,
                    "bytes": 0
                },
                "BitMapAreaIN": {
                    "items": 0,
                    "bytes": 0
                },
                "BitMapAreaLeaf": {
                    "items": 0,
                    "bytes": 0
                },
                "BitMapZone": {
                    "items": 0,
                    "bytes": 0
                },
                "BmapEntry": {
                    "items": 0,
                    "bytes": 0
                }
            }
        },
        "bluestore_cache_data": {
            "items": 3,
            "bytes": 12288
        },
        "bluestore_cache_onode": {
            "items": 879,
            "bytes": 590688,
            "by_type": {
                "BlueStore::Onode": {
                    "items": 879,
                    "bytes": 590688
                }
            }
        },
        ...
    

mempool overview

The src\include\mempool.h header documents the usage in detail. mempool exists mainly to account for memory usage, and most memory-hungry components and objects are already under mempool management. A usage sketch follows the pool list below.

    // src\include\mempool.h
    #define DEFINE_MEMORY_POOLS_HELPER(f) \
      f(bloom_filter)             \
      f(bluestore_alloc)              \
      f(bluestore_cache_data)         \
      f(bluestore_cache_onode)        \
      f(bluestore_cache_other)        \
      f(bluestore_fsck)           \
      f(bluestore_txc)            \
      f(bluestore_writing_deferred)       \
      f(bluestore_writing)            \
      f(bluefs)               \
      f(buffer_anon)              \
      f(buffer_meta)              \
      f(osd)                  \
      f(osd_mapbl)                \
      f(osd_pglog)                \
      f(osdmap)               \
      f(osdmap_mapping)           \
      f(pgmap)                \
      f(mds_co)               \
      f(unittest_1)               \
      f(unittest_2)
    
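A minimal usage sketch, following the notes in src\include\mempool.h (Foo and the choice of the osd pool are illustrative, not code from this document):

    #include "include/mempool.h"

    // Put a class into a mempool: the helper overrides operator new/delete so
    // every allocation of Foo is accounted to the chosen pool.
    struct Foo {
      MEMPOOL_CLASS_HELPERS();
      int a;
      int b;
    };

    // In exactly one .cc file, bind Foo to the osd pool:
    MEMPOOL_DEFINE_OBJECT_FACTORY(Foo, foo, osd);

    // Containers can also allocate from a pool directly; their items/bytes
    // then show up under "osd" in `ceph daemon osd.N dump_mempools`.
    mempool::osd::vector<int> v;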

Obtaining rocksdb memory and other statistics (L release)

The rocksdb_perf config option must be enabled before these are available.

    root@ceph1 ~ $ ceph daemon osd.1 perf dump rocksdb
    {
        "rocksdb": {
            "get": 531,
            "submit_transaction": 352,
            "submit_transaction_sync": 195,
            "get_latency": {
                "avgcount": 531,
                "sum": 0.086399156,
                "avgtime": 0.000162710
            },
            ...
            "submit_sync_latency": {
                "avgcount": 195,
                "sum": 0.995746355,
                "avgtime": 0.005106391
            },
            ...
            "rocksdb_write_memtable_time": {
                "avgcount": 513,
                "sum": 0.008326307,
                "avgtime": 0.000016230
            },
            "rocksdb_write_delay_time": {
                "avgcount": 513,
                "sum": 0.000000000,
                "avgtime": 0.000000000
            },
            ...
    

    还有rocksdb本身的一些统计信息,都是从rocksdb自身的统计接口里获取的数据。这些信息需要打开rocksdb_collect_memory_stats、rocksdb_collect_extended_stats、rocksdb_collect_compaction_stats配置项。

    root@ceph1 ~ $ ceph daemon osd.0 dump_objectstore_kv_stats
    {
        "rocksdb_compaction_statistics": "",
        "": "",
        "": "** Compaction Stats [default] **",
        "": "Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
        "": "----------------------------------------------------------------------------------------------------------------------------------------------------------",
        "": "  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0      4.0         0         1    0.018       0      0",
        "": "  L1      3/0   132.66 MB   0.5      0.2     0.1      0.1       0.1      0.0       0.0   1.8     39.0     25.0         5         1    5.303   1432K   722K",
        "": " Sum      3/0   132.66 MB   0.0      0.2     0.1      0.1       0.1      0.0       0.0 1862.6     38.9     24.9         5         2    2.661   1432K   722K",
        "": " Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000       0      0",
        ...
        "": "** File Read Latency Histogram By Level [default] **",
        "": "** Level 0 read latency histogram (micros):",
        "": "Count: 2280 Average: 908.5671  StdDev: 1394.78",
        "": "Min: 455  Median: 789.0152  Max: 24077",
        "": "Percentiles: P50: 789.02 P75: 868.51 P99: 3150.00 P99.9: 23371.43 P99.99: 24077.00",
        "": "------------------------------------------------------",
        "": "[     450,     500 )        4   0.175%   0.175% ",
        "": "[     500,     600 )       89   3.904%   4.079% #",
        "": "[     600,     700 )      342  15.000%  19.079% ###",
        "": "[     700,     800 )      792  34.737%  53.816% #######",
        "": "[     800,     900 )      705  30.921%  84.737% ######",
        "": "[     900,    1000 )      253  11.096%  95.833% ##",
        ...
        "": "[   14000,   16000 )        2   0.088%  99.693% ",
        "": "[   20000,   25000 )        7   0.307% 100.000% ",
        "": "",
        "": "** Level 1 read latency histogram (micros):",
        "": "Count: 33743 Average: 849.7653  StdDev: 522.51",
        "": "Min: 434  Median: 825.1231  Max: 26385",
        "": "Percentiles: P50: 825.12 P75: 914.89 P99: 1195.47 P99.9: 5564.25 P99.99: 24141.06",
        "": "------------------------------------------------------",
        "": "[     400,     450 )        8   0.024%   0.024% ",
        "": "[     450,     500 )       66   0.196%   0.219% ",
        "": "[     500,     600 )     1067   3.162%   3.381% #",
        "": "[     600,     700 )     4231  12.539%  15.920% ###",
        "": "[     700,     800 )     8999  26.669%  42.590% #####",
        "": "[     800,     900 )     9953  29.496%  72.086% ######",
        "": "[     900,    1000 )     6605  19.574%  91.660% ####",
        "": "[    1000,    1200 )     2534   7.510%  99.170% ##",
        "": "[    1200,    1400 )       58   0.172%  99.342% ",
        "": "[    1400,    1600 )       25   0.074%  99.416% ",
        ...
        "": "[   10000,   12000 )        2   0.006%  99.947% ",
        "": "[   14000,   16000 )        2   0.006%  99.953% ",
        "": "[   16000,   18000 )        3   0.009%  99.961% ",
        "": "[   18000,   20000 )        3   0.009%  99.970% ",
        "": "[   20000,   25000 )        8   0.024%  99.994% ",
        "": "[   25000,   30000 )        2   0.006% 100.000% ",
        "": "",
        "": "",
        "": "** DB Stats **",
        "": "Uptime(secs): 1554.3 total, 938.9 interval",
        "": "Cumulative writes: 2292 writes, 5842 keys, 2292 commit groups, 1.0 writes per commit group, ingest: 0.00 GB, 0.00 MB/s",
        "": "Cumulative WAL: 2292 writes, 606 syncs, 3.78 writes per sync, written: 0.00 GB, 0.00 MB/s",
        "": "Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent",
        "": "Interval writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 MB, 0.00 MB/s",
        "": "Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s",
        "": "Interval stall: 00:00:0.000 H:M:S, 0.0 percent"
    }
    {
        "rocksdb_extended_statistics": "",
        ".": "rocksdb.block.cache.miss COUNT : 85828",
        ".": "rocksdb.block.cache.hit COUNT : 8066",
        ".": "rocksdb.block.cache.add COUNT : 36002",
        ...
        ".": "rocksdb.num.subcompactions.scheduled statistics Percentiles :=> 50 : 0.000000 95 : 0.000000 99 : 0.000000 100 : 0.000000",
        ".": "rocksdb.bytes.per.read statistics Percentiles :=> 50 : 0.610333 95 : 191.390625 99 : 871.390625 100 : 883.000000",
        ".": "rocksdb.bytes.per.write statistics Percentiles :=> 50 : 997.607656 95 : 1954.814815 99 : 1997.259259 100 : 30050.000000",
        ...
    }
    {
        "rocksdb": {
            "get": 1831,
            "submit_transaction": 1686,
            "submit_transaction_sync": 606,
            "get_latency": {
                "avgcount": 1831,
                "sum": 0.262964280,
                "avgtime": 0.000143617
            },
            ...
    {
        "block_cache_usage": "155874833",
        "block_cache_pinned_blocks_usage": "0",
        "rocksdb_memtable_usage": "2228896"
    }
    

How the data is dumped

    void RocksDBStore::get_statistics(Formatter *f)
    {
      if (!g_conf->rocksdb_perf)  {
        dout(20) << __func__ << "RocksDB perf is disabled, can't probe for stats"
             << dendl;
        return;
      }
    
      if (g_conf->rocksdb_collect_compaction_stats) {
        std::string stat_str;
        bool status = db->GetProperty("rocksdb.stats", &stat_str);
        ...
      }
      if (g_conf->rocksdb_collect_extended_stats) {
        if (dbstats) {
          f->open_object_section("rocksdb_extended_statistics");
          string stat_str = dbstats->ToString();
          ...
      }
      if (g_conf->rocksdb_collect_memory_stats) {
        f->open_object_section("rocksdb_memtable_statistics");
        std::string str(stringify(bbt_opts.block_cache->GetUsage()));
        f->dump_string("block_cache_usage", str.data());
        str.clear();
        str.append(stringify(bbt_opts.block_cache->GetPinnedUsage()));
        f->dump_string("block_cache_pinned_blocks_usage", str);
        str.clear();
        db->GetProperty("rocksdb.cur-size-all-mem-tables", &str);
        f->dump_string("rocksdb_memtable_usage", str);
        f->close_section();
      }
    }
    

Debugging abnormal issues

Signal handling / printing call stacks on assert

Thread health monitoring and call-stack printing (for the current and other threads)

Analysis of Ceph's signal handling and call-stack printing mechanisms

The Ceph OSD IO-thread health check mechanism

    valgrind

Ceph integrates valgrind testing into teuthology (with a dedicated suppression file: src/valgrind.sup). While the test cases run, the memory usage of the osds and mons is monitored, and any memory leaks found are reported.

    https://blog.dachary.org/2013/08/14/howto-valgrind-ceph-with-teuthology/

Crash dumps (and normal exits) can print the log entries buffered in memory

The in-memory log buffer keeps entries at a fairly verbose level by default, so it can yield far more detail than the file log.

    int OSD::shutdown()
    {
      ...
      if (cct->_conf->get_val<bool>("osd_debug_shutdown")) { // disabled by default
        cct->_conf->set_val("debug_osd", "100");
        cct->_conf->set_val("debug_journal", "100");
        cct->_conf->set_val("debug_filestore", "100");
        cct->_conf->set_val("debug_bluestore", "100");
        cct->_conf->set_val("debug_ms", "100");
        cct->_conf->apply_changes(NULL);
      }
      ...
    
    // src\common\assert.cc
      void __ceph_assert_fail(const char *assertion, const char *file, int line,
                  const char *func)
      {
        ...
        if (g_assert_context) {
      lderr(g_assert_context) << buf << std::endl;
      *_dout << oss.str();
      *_dout << " NOTE: a copy of the executable, or `objdump -rdS <executable>` "
         << "is needed to interpret this.\n" << dendl;

      g_assert_context->_log->dump_recent(); // dump the in-memory log buffer, covering all submodules
        }
    
        abort();
      }
    

wireshark support for the ceph protocol

    • https://docs.ceph.com/docs/master/dev/wireshark/
    • https://www.wireshark.org/docs/dfref/c/ceph.html
• http://aspirer.wang/?p=1312 (section 3: parsing Ceph network packets with tcpdump+wireshark)

rbd clients

krbd

Usage

    modprobe rbd
    rbd map pool/volume
    rbd unmap /dev/rbd0
    rbd showmapped
    

How it works

Similar to rbd-nbd, but instead of a userspace nbd server forwarding requests to librbd, the rbd client logic runs directly in the kernel; in effect it is a kernel-mode librbd client.

Caveats

• krbd requires the volume to carry only the layering feature; with any other feature enabled, the volume cannot be mapped to a block device on the host
• it is tied to the kernel version, is only updated with kernel releases, evolves slowly, and lags on new features
• it is inconvenient to debug

rbd-nbd was developed to address the issues above.

    rbd-nbd

Usage

    // L release
    modprobe nbd
    rbd-nbd map pool/volume
    rbd-nbd unmap /dev/nbd0
    rbd-nbd list-mapped
    

How it works

    rbd-nbd

Caveats

There are some kernel bugs:
– setting an IO timeout on an nbd device has no effect
– on the 4.9.65 kernel, resizing the backing volume while an nbd device with an ext4 filesystem is mounted corrupts the filesystem
– on the 4.9.65 kernel, umounting the mount point of a disconnected nbd device causes a kernel oops

fio support

    rbd

    root@ceph-l ~ $ cat fio.rbd
    [global]
    ioengine=rbd
    clientname=admin
    pool=rbd
    rbdname=vol1
    invalidate=0
    rw=randwrite
    bs=4k
    runtime=6000
    [rbd_iodepth128]
    iodepth=64
    
    root@ceph-l ~ $ fio --enghelp
    Available IO engines:
            cpuio
            mmap
            sync
            psync
            ...
            splice
            rbd   ###
            sg
            binject
    

    objectstore

Testing Ceph's filestore with fio

Comparison

fio+librbd performs close to krbd, but the librbd path consumes noticeably more userspace CPU; rbd-nbd performs slightly worse than both.

Deadlock detection

• supports deadlock detection for mutexes and rwlocks
• can print lock/unlock operations and each lock's call stack
• can record how long threads wait on locks

Config options: lockdep / lockdep_force_backtrace / mutex_perf_counter

    root@ceph-l ceph $ ceph daemon osd.0 perf dump | grep -i mutex -A3
        "mutex-FileJournal::completions_lock": {
            "wait": {
                "avgcount": 29,
                "sum": 0.002629507
    --
        "mutex-FileJournal::finisher_lock": {
            "wait": {
                "avgcount": 0,
                "sum": 0.000000000
    --
        "mutex-FileJournal::write_lock": {
            "wait": {
                "avgcount": 0,
                "sum": 0.000000000
    --
        "mutex-FileJournal::writeq_lock": {
            "wait": {
                "avgcount": 87,
                "sum": 0.006935580
    
    void Mutex::Lock(bool no_lockdep) {
      utime_t start;
      int r;
    
      if (lockdep && g_lockdep && !no_lockdep) _will_lock(); // preparation: run the deadlock checks

      if (TryLock()) { // records lock info (lock id, thread id, call stack, ...) for later deadlock checks
        goto out;
      }
    
      if (logger && cct && cct->_conf->mutex_perf_counter)
        start = ceph_clock_now(cct);
      r = pthread_mutex_lock(&_m);
      if (logger && cct && cct->_conf->mutex_perf_counter)
        logger->tinc(l_mutex_wait,
             ceph_clock_now(cct) - start);
      assert(r == 0);
      if (lockdep && g_lockdep) _locked();
      _post_lock();
    
    out:
      ;
    }
    
    void Mutex::Unlock() {
      _pre_unlock();
      if (lockdep && g_lockdep) _will_unlock(); // calls lockdep_will_unlock to log the unlock and clean up the records
      int r = pthread_mutex_unlock(&_m);
      assert(r == 0);
    }
    
    int lockdep_will_lock(const char *name, int id, bool force_backtrace)
    {
      pthread_t p = pthread_self();
      if (id < 0) id = lockdep_register(name);
    
      pthread_mutex_lock(&lockdep_mutex);
      lockdep_dout(20) << "_will_lock " << name << " (" << id << ")" << dendl;  // log the lock acquisition
    
      // check dependency graph
      map<int, BackTrace *> &m = held[p];
      for (map<int , BackTrace *>::iterator p = m.begin();
           p != m.end();
           ++p) {
    if (p->first == id) {  // deadlock check 1: recursive locking of the same lock
      lockdep_dout(0) << "\n";
      *_dout << "recursive lock of " << name << " (" << id << ")\n";
      BackTrace *bt = new BackTrace(BACKTRACE_SKIP);
      bt->print(*_dout);
      if (p->second) {
    *_dout << "\npreviously locked at\n";
    p->second->print(*_dout);
      }
      delete bt;
      *_dout << dendl;
      assert(0);
    }
    else if (!follows[p->first][id]) { // deadlock check 2: a lock-order cycle
          // new dependency
    
          // did we just create a cycle?
          if (does_follow(id, p->first)) {
            BackTrace *bt = new BackTrace(BACKTRACE_SKIP);
        lockdep_dout(0) < < "new dependency " << lock_names[p->first]
            < < " (" << p->first < < ") -> " < < name << " (" << id << ")"
            << " creates a cycle at\n";
        bt->print(*_dout);
        *_dout < < dendl;
    
        lockdep_dout(0) << "btw, i am holding these locks:" << dendl;
        for (map<int, BackTrace *>::iterator q = m.begin();
             q != m.end();
             ++q) {
      lockdep_dout(0) << "  " << lock_names[q->first] << " (" << q->first << ")" << dendl;
      if (q->second) {
        lockdep_dout(0) << " ";
        q->second->print(*_dout);
        *_dout << dendl;
          }
        }
    
        lockdep_dout(0) << "\n" << dendl;
    
    // don't add this dependency, or we'll get a cycle in the graph, and
    // does_follow() won't terminate.
    
        assert(0);  // actually, we should just die here.
          } else {
            BackTrace *bt = NULL;
            if (force_backtrace || lockdep_force_backtrace()) {
              bt = new BackTrace(BACKTRACE_SKIP);
            }
        follows[p->first][id] = bt;
    lockdep_dout(10) << lock_names[p->first] << " -> " << name << " at" << dendl;
        //bt->print(*_dout);
          }
        }
      }
    
      pthread_mutex_unlock(&lockdep_mutex);
      return id;
    }
    
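To make check 2 concrete, here is a minimal plain-C++ illustration (not Ceph code) of an ABBA ordering inversion. With Ceph's Mutex and lockdep enabled, the second path would trip the cycle check and assert even when the two paths run sequentially, i.e. before any real deadlock happens:

    #include <mutex>

    std::mutex A, B;

    void path1() {
      std::lock_guard<std::mutex> a(A);  // lockdep: now holding A
      std::lock_guard<std::mutex> b(B);  // lockdep: records dependency A -> B
    }

    void path2() {
      std::lock_guard<std::mutex> b(B);  // lockdep: now holding B
      std::lock_guard<std::mutex> a(A);  // B -> A: does_follow() finds the cycle, assert(0)
    }

    int main() {
      // run sequentially: no actual deadlock here, but two threads running
      // path1 and path2 concurrently could deadlock for real
      path1();
      path2();
      return 0;
    }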

Adding wait-time statistics to the pg lock

    PG::PG(OSDService *o, OSDMapRef curmap,
           const PGPool &_pool, spg_t p) :
        osd(o),
        cct(o->cct),
        ...
    -    _lock("PG::_lock"),
+    _lock((std::string("PG::_lock") + _pool.name.c_str()).c_str(), false, true, false, o->cct), // the wait perf counter is only initialized when cct is passed in
    
    
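For reference, the constructor the diff above calls into should be the following (as I read the hammer-era src/common/Mutex.h; verify against your tree):

    Mutex(const std::string &n,   // lockdep / perf counter name
          bool r = false,         // recursive
          bool ld = true,         // lockdep enabled
          bool bt = false,        // force a backtrace on lockdep events
          CephContext *cct = 0);  // non-null: register the mutex wait perf counter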

throttle rate limiting

Viewing throttle statistics

    root@ceph-l ceph $ ceph daemon osd.0 perf dump throttle-osd_client_messages
    {
        "throttle-osd_client_messages": {
            "val": 19,
            "max": 500,
            "get": 811801,
            "get_sum": 811801,
            "get_or_fail_fail": 0,
            "get_or_fail_success": 0,
            "take": 0,
            "take_sum": 0,
            "put": 811782,
            "put_sum": 811782,
            "wait": {
                "avgcount": 0,
                "sum": 0.000000000
            }
        }
    }
    

Using throttle

    // ceph_osd.cc
    int main(int argc, const char **argv) 
    {
      ...
      boost::scoped_ptr<Throttle> client_msg_throttler(
        new Throttle(g_ceph_context, "osd_client_messages",
             g_conf->osd_client_message_cap));
      ...
      ms_public->set_policy_throttlers(entity_name_t::TYPE_CLIENT,
                       client_byte_throttler.get(),
                       client_msg_throttler.get());
      ...
    }
    
    int Pipe::read_message(Message **pm, AuthSessionHandler* auth_handler)
    {
      ...
      if (policy.throttler_messages) {
        ldout(msgr->cct,10) << "reader wants " << 1 << " message from policy throttler "
                << policy.throttler_messages->get_current() << "/"
                << policy.throttler_messages->get_max() << dendl;
        policy.throttler_messages->get();
      }
      ...
    out_dethrottle:
      // release bytes reserved from the throttlers on failure
      if (policy.throttler_messages) {
        ldout(msgr->cct,10) << "reader releasing " << 1 << " message to policy throttler "
                << policy.throttler_messages->get_current() << "/"
                << policy.throttler_messages->get_max() << dendl;
        policy.throttler_messages->put();
      }
      ...
    
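A minimal sketch of the same blocking get()/put() pattern, assuming the hammer-era Throttle API in src/common/Throttle.h; the function and names here are illustrative:

    #include "common/Throttle.h"

    void throttle_example(CephContext *cct) {
      Throttle throttle(cct, "example_throttle", 500 /* max */);

      throttle.get(1);   // blocks while current + 1 would exceed max
      // ... do the work covered by this slot ...
      throttle.put(1);   // release the slot, waking any blocked get()

      // non-blocking variant: returns false instead of waiting
      if (throttle.get_or_fail(1)) {
        // ...
        throttle.put(1);
      }
    }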

Limitations

Currently only op count and byte size can be limited, and throttling is only applied on the normal IO path; recovery and other background paths do not implement it yet.

Logging

Log levels

    # file log level, in-memory log level
    root@ceph-l ceph $ ceph daemon osd.0 config show | grep debug_
        "debug_none": "0\/5",
        ...
        "debug_mon": "1\/5",
        "debug_monc": "0\/10",
        "debug_paxos": "1\/5",
        ...
    

Log subsystems

    // src\common\config_opts.h
    DEFAULT_SUBSYS(0, 5)
    SUBSYS(lockdep, 0, 1)
    SUBSYS(context, 0, 1)
    SUBSYS(crush, 1, 1)
    ...
    SUBSYS(throttle, 1, 1)
    SUBSYS(refs, 0, 0)
    SUBSYS(xio, 1, 5)
    
    // src\common\config.cc
    void md_config_t::init_subsys()
    {
    #define SUBSYS(name, log, gather) \
      subsys.add(ceph_subsys_##name, STRINGIFY(name), log, gather);
    #define DEFAULT_SUBSYS(log, gather) \
      subsys.add(ceph_subsys_, "none", log, gather);
    #define OPTION(a, b, c)
    #include "common/config_opts.h"
    #undef OPTION
    #undef SUBSYS
    #undef DEFAULT_SUBSYS
    }
    
    // src\osd\OSD.cc
    #define dout_subsys ceph_subsys_osd
    #undef dout_prefix
    #define dout_prefix _prefix(_dout, whoami, get_osdmap_epoch())
    
    static ostream& _prefix(std::ostream* _dout, int whoami, epoch_t epoch) {
      return *_dout << "osd." << whoami << " " << epoch << " ";
    }

    dout(0) << " have superblock" << dendl;  // example log statement
    
    // src\common\debug.h
    #define dout(v) ldout((g_ceph_context), v)
    
    // src\common\dout.h
    #define dout_impl(cct, sub, v)                     \
      do {                                  \
      if (cct->_conf->subsys.should_gather(sub, v)) {           \
        if (0) {                                \
          char __array[((v >= -1) && (v <= 200)) ? 0 : -1] __attribute__((unused)); \
        }                                   \
        ceph::log::Entry *_dout_e = cct->_log->create_entry(v, sub);    \
        ostream _dout_os(&_dout_e->m_streambuf);                \
        CephContext *_dout_cct = cct;                   \
        std::ostream* _dout = &_dout_os;
    ...
    #define ldout(cct, v)  dout_impl(cct, dout_subsys, v) dout_prefix
    ...
    #define dendl std::flush;              \
      _ASSERT_H->_log->submit_entry(_dout_e);       \
        }                       \
      } while (0)
    ...
    

Cluster log

    void OSD::handle_osd_map(MOSDMap *m)
    {
        ...
        } else {
          clog->warn() << "map e" << osdmap->get_epoch()
                  << " wrongly marked me down";
        }
        ...
    // Cluster-log requests from the osds are handled by the mon's LogMonitor::preprocess_log/prepare_log and finally written to the ceph.log file.
    
    # /var/log/ceph/ceph.log
    2019-07-31 02:07:32.191716 osd.92 10.171.5.122:6818/2755462 1588 : cluster [WRN] map e250674 wrongly marked me down
    

Log rotation

kill -HUP / log reopen

Example: logrotate (reload)

    # /etc/init.d/ceph
    ...
            force-reload | reload)
                signal_daemon $name ceph-$type $pid_file -1 "Reloading"  ## kill -HUP $pid
                ;;
    ...
    

Alternatively, run ceph daemon osd.X log reopen directly.

Full-path log tracing (an optimization goal)

Pain point: logs that span nodes and processes cannot be tied together; analysis is difficult and requires deep familiarity with the IO-path code.

Solution: following the blkin approach, add a global request id to the logs, and use it consistently from the client down to the osd and across the osd's internal submodules (pg, journal, FileStore, ...) to identify each client IO request, as sketched below.
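
A hypothetical sketch of what this could look like (not existing Ceph code; TraceId and tdout are made-up names): carry one id per client IO, and have every submodule log through a helper that prepends it, so client, pg, journal and FileStore lines can be joined with a single grep:

    // made-up helper: a trace id carried along the IO path
    struct TraceId {
      std::string id;   // e.g. built from the client reqid "client.4123.0:589"
    };

    // made-up logging wrapper around ldout that always prepends the id
    #define tdout(cct, v, tid) ldout((cct), (v)) << "reqid=" << (tid).id << " "

    // usage at any point along the IO path:
    //   tdout(cct, 10, op_trace) << "queued for journal write" << dendl;
    // then: grep 'reqid=client.4123.0:589' across the logs of all nodes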

Data encoding and decoding

    ceph-objectstore-tool

Ceph advanced tools: using ceph-objectstore-tool

    ceph-dencoder

Ceph advanced tools: using ceph-dencoder

Adding the ability to edit the oldest_map field of the osd superblock to ceph-dencoder

    root@ceph-l ~ $ ceph-dencoder -h
    usage: ceph-dencoder [commands ...]
      ...
      editsb <oldest_map> <input sb file> <output sb file>    edit OSDSuperblock oldest_map section
    
    root@ceph-l ~ $ ceph-dencoder editsb 10000 superblock.orig superblock.new
    {
        "cluster_fsid": "238c3b45-c884-4911-942b-a424910250bf",
        "osd_fsid": "ca873f75-8cd1-4fbf-91fc-7cd6784b7e2c",
        "whoami": 0,
        "current_epoch": 354773,
        "oldest_map": 10000,   ## superblock.new中的这个字段被改成10000
        "newest_map": 354773,
        "weight": 0.000000,
        ....
    }
    

Installation and deployment

    ceph-deploy

    ceph-deploy

crush rule editing tools

    sudo ceph osd getcrushmap -o map
    sudo crushtool -d map -o dmap
    sudo vi dmap 
    sudo crushtool -c dmap -o nmap
    sudo ceph osd setcrushmap -i nmap
    

    pg balance

    Mgr Balancer

Developer tools

vstart

Setting up a Ceph development environment with vstart

blackhole switch

    int FileStore::queue_transactions(Sequencer *posr, list<Transaction*> &tls,
                      TrackedOpRef osd_op,
                      ThreadPool::TPHandle *handle)
    {
      Context *onreadable;
      Context *ondisk;
      Context *onreadable_sync;
      ObjectStore::Transaction::collect_contexts(
        tls, &onreadable, &ondisk, &onreadable_sync);
      if (g_conf->filestore_blackhole) {
        dout(0) << "queue_transactions filestore_blackhole = TRUE, dropping transaction" << dendl;
        delete ondisk;
        delete onreadable;
        delete onreadable_sync;
        return 0;
      }
      ...
    

Fault (and delay) injection

kill, error, heartbeat faults, messenger faults, injectdataerr, and so on:

    root@ceph-l common [CLDNBS-1301] $ grep inject config_opts.h 
    OPTION(heartbeat_inject_failure, OPT_INT, 0)    // force an unhealthy heartbeat for N seconds
    OPTION(ms_inject_socket_failures, OPT_U64, 0)
    OPTION(ms_inject_delay_type, OPT_STR, "")          // "osd mds mon client" allowed
    OPTION(ms_inject_delay_msg_type, OPT_STR, "")      // the type of message to delay, as returned by Message::get_type_name(). This is an additional restriction on the general type filter ms_inject_delay_type.
    OPTION(ms_inject_delay_max, OPT_DOUBLE, 1)         // seconds
    OPTION(ms_inject_delay_probability, OPT_DOUBLE, 0) // range [0, 1]
    OPTION(ms_inject_internal_delays, OPT_DOUBLE, 0)   // seconds
    OPTION(inject_early_sigterm, OPT_BOOL, false)
    OPTION(mon_inject_sync_get_chunk_delay, OPT_DOUBLE, 0)  // inject N second delay on each get_chunk request
    OPTION(mon_inject_transaction_delay_max, OPT_DOUBLE, 10.0)      // seconds
    OPTION(mon_inject_transaction_delay_probability, OPT_DOUBLE, 0) // range [0, 1]
    OPTION(client_debug_inject_tick_delay, OPT_INT, 0) // delay the client tick for a number of seconds
    OPTION(client_inject_release_failure, OPT_BOOL, false)  // synthetic client bug for testing
    OPTION(objecter_inject_no_watch_ping, OPT_BOOL, false)   // suppress watch pings
    OPTION(mds_inject_traceless_reply_probability, OPT_DOUBLE, 0) /* percentage
    OPTION(osd_inject_bad_map_crc_probability, OPT_FLOAT, 0)
    OPTION(osd_inject_failure_on_pg_removal, OPT_BOOL, false)
    OPTION(osd_debug_inject_copyfrom_error, OPT_BOOL, false)  // inject failure during copyfrom completion
    // Allow object read error injection
    OPTION(filestore_debug_inject_read_err, OPT_BOOL, false)
    OPTION(filestore_kill_at, OPT_INT, 0)            // inject a failure at the n'th opportunity
    OPTION(filestore_inject_stall, OPT_INT, 0)       // artificially stall for N seconds in op queue thread
    
    bool HeartbeatMap::is_healthy()
    {
      m_rwlock.get_read();
      time_t now = time(NULL);
      if (m_cct->_conf->heartbeat_inject_failure) {
        ldout(m_cct, 0) << "is_healthy injecting failure for next " << m_cct->_conf->heartbeat_inject_failure << " seconds" << dendl;
        m_inject_unhealthy_until = now + m_cct->_conf->heartbeat_inject_failure;
        m_cct->_conf->set_val("heartbeat_inject_failure", "0");
      }
      ...
    
    // src\mon\MonitorDBStore.h
      struct C_DoTransaction : public Context {
        MonitorDBStore *store;
        MonitorDBStore::TransactionRef t;
        Context *oncommit;
        C_DoTransaction(MonitorDBStore *s, MonitorDBStore::TransactionRef t,
                Context *f)
          : store(s), t(t), oncommit(f)
        {}
        void finish(int r) {
          /* The store serializes writes.  Each transaction is handled
           * sequentially by the io_work Finisher.  If a transaction takes longer
           * to apply its state to permanent storage, then no other transaction
           * will be handled meanwhile.
           *
           * We will now randomly inject random delays.  We can safely sleep prior
           * to applying the transaction as it won't break the model.
           */
          double delay_prob = g_conf->mon_inject_transaction_delay_probability;
          if (delay_prob && (rand() % 10000 < delay_prob * 10000.0)) {
            utime_t delay;
            double delay_max = g_conf->mon_inject_transaction_delay_max;
            delay.set_from_double(delay_max * (double)(rand() % 10000) / 10000.0);
        lsubdout(g_ceph_context, mon, 1)
          << "apply_transaction will be delayed for " << delay
          << " seconds" << dendl;
            delay.sleep();
          }
          int ret = store->apply_transaction(t);
          oncommit->complete(ret);
        }
    

Analysis of a teuthology data-inconsistency failure

Note: this bug was not introduced by upstream code, so Ceph clusters running the official code are unaffected.

Problem scenario

A teuthology test case failed, with data inconsistency as the error:

    2019-05-30T20:11:18.986 INFO:teuthology.run:Summary data:
    {description: 'rados:thrash/{0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/short_pg_log.yaml
    clusters/{fixed-2.yaml openstack.yaml} fs/xfs.yaml msgr-failures/few.yaml test.yaml thrashers/mapgap.yaml workloads/radosbench.yaml}', duration: 579.8200960159302,
    failure_reason: '"2019-05-30 20:05:01.881151 osd.5 10.18.192.117:6804/11992 77 :
    cluster [ERR] 1.1 shard 1: soid d7cf0021/benchmark_data_hammer004.netease-ceph.com_18678_object192/head//1 size 3145728 != known size 4194304" in cluster log', flavor: basic, owner: cyh, success: false}
    

This failure had never shown up before, either because this case had not been run, or because it is not deterministic.

Step one of fixing a bug: reproduce it

To fix a bug, the first step is always to find a deterministic, or at least high-probability, reproduction. Only a stably reproducible bug gives any hope of finding the cause and fixing it; otherwise you are left with inference and guesswork, and can never confirm the bug is really fixed (if it cannot be reproduced either with or without the fix, you cannot claim it is fixed).

So the work split into two directions:
1. analyze the ceph logs from the failed teuthology run
2. re-run the failing case repeatedly and see whether it reproduces

The two can proceed in parallel. First, a script on the existing teuthology setup re-ran the failing case in a loop (one run takes about 10 minutes) to measure the reproduction rate.

While the tests ran, analysis of the earlier failed run's ceph logs continued. Apart from the scrub inconsistency above, ceph.log showed no other anomalies. ceph-mon.log records all client commands received during the test, such as osd up/down/out/reweight and process stop/start, interleaved with rados bench writes; after the functional steps, a pg deep-scrub checks for data inconsistency. The relevant ceph config options were set up in advance, with a few overridden inside the test case.

In the osd logs, apart from normal IO writes there was the recovery IO from the injected failures. Tracing all the IO paths of the inconsistent object reported by scrub initially led nowhere: the writes spanned peering and remapping, were retried several times, and the primary osd switched twice, which was misleading. After a day this had only produced an understanding of the scenario (that is, of what the test case actually does), with no real progress on the root cause.

While the logs were being analyzed, the standalone teuthology runs showed the failure reproduces at a fairly high rate (within 10 runs). Analysis then continued on the freshly reproduced ceph logs, and the same case was also run against older code versions.

The second reproduction was much simpler than the first: the inconsistent object had been written normally, but the osd was interrupted while recovering it (the osd process was forcibly killed, and the osd was a replica); after it was started again the recovery never resumed, and scrub eventually found the object sizes inconsistent, failing the case.

Reproducing on the older version still failed, ruling out the latest changes; an even older (production) version also reproduced it at a similar rate, showing the problem predates both.

Based on the second reproduction, fault injection was used to reproduce the bug in the local dev environment. Since the bug is triggered when the replica osd is killed after receiving a recovery push op, adding an assert, guarded by a config option, in the corresponding function is enough:

    diff --git a/src/osd/ReplicatedBackend.cc b/src/osd/ReplicatedBackend.cc
    index ee37529..0b298c7 100644
    --- a/src/osd/ReplicatedBackend.cc
    +++ b/src/osd/ReplicatedBackend.cc
    @@ -1941,6 +1941,8 @@ void ReplicatedBackend::handle_push(
                       pop.omap_entries,
                       t);
    
    +  if (!complete && cct->_conf->osd_recover_inject_assert)
    +    assert(0 == "xxxxxxxxxxxxxxxxxxxxxxxxx");
       if (complete)
         get_parent()->on_local_recover(
           pop.recovery_info.soid,
    

With the code above added, dynamically setting osd_recover_inject_assert=true while the osd is recovering objects makes the osd assert and exit; starting it again then reproduces the bug almost 100% of the time. At that point a fix looked well within reach.

First, code analysis showed that after the osd starts again, the half-recovered object is not re-added to the missing set, so the primary never recovers it again.
But the root cause was still missing: why is it not added to the missing set? Because during load_pgs the osd finds the object present (readable) with a version matching the pg log. But why does the object exist, and why does the version match? It turns out that in chunked recovery, the first chunk (say the first 1M of 4M) already writes the object version into the object's attrs, and recovering the remaining data does not modify it again.

The key question became: why does the object exist at all after the restart? At the time this made no sense.

Trying to reproduce on the L release

Comparing against the L code showed the relevant paths unchanged (load_pgs, and the first chunk setting the object attrs), yet the same fault injection could not reproduce the bug there. The reason is that in L the pg's missing set is persisted to disk together with the pglog (in rocksdb?), and an entry is only removed once the object is fully recovered (this code was not analyzed in detail, but that must be the behavior). With the reproduction steps above, after the osd asserts and restarts, load_pgs reloads the missing set from rocksdb and recovery simply continues.

Trying to reproduce on the official release

At this point it looked like a bug in 0.94.6 itself, but when the problem first appeared a search had turned up nothing, neither on Google nor on tracker.ceph.com, and given how deterministic our teuthology reproduction was, an upstream bug would surely have been reported by someone. So the same fault injection was run against the official 0.94.6 code in the dev environment, and indeed it did not reproduce. That confirmed the bug came from our own modifications. Diffing the production version against official 0.94.6, essentially the only change is the partial-recovery feature, so that feature almost certainly introduced it. (We had suspected this code the moment the problem appeared; confirming it immediately would have saved many detours, but would also have meant learning much less.)

Analyzing the partial-recovery optimization code

In the partial-recovery code, submit_push_data changes which pg collection (head or TMP) the recovered data is written into:

    void ReplicatedBackend::submit_push_data(
      ObjectRecoveryInfo &recovery_info,
      bool first,
      bool complete,
      const interval_set<uint64_t> &intervals_included,
      bufferlist data_included,
      bufferlist omap_header,
      map<string, bufferlist> &attrs,
      map<string, bufferlist> &omap_entries,
      ObjectStore::Transaction *t)
    {
      coll_t target_coll;
      bool can_recover_partial = recovery_info.can_recover_partial;
  if (first && complete || can_recover_partial) {  // can_recover_partial is almost always true here, so target_coll is the pg's head collection rather than the tmp collection
        target_coll = coll;
      } else {
        dout(10) << __func__ << ": Creating oid "
             << recovery_info.soid << " in the temp collection" << dendl;
        add_temp_obj(recovery_info.soid);
    target_coll = get_temp_coll(t);  // official chunked recovery goes here: recover into the pg's tmp collection and move the object to head only after it is fully recovered
      }
    
      if (first) {
        if (!can_recover_partial) {
          get_parent()->on_local_recover_start(recovery_info.soid, t);
          t->remove(get_temp_coll(t), recovery_info.soid);
          t->touch(target_coll, recovery_info.soid);
          t->truncate(target_coll, recovery_info.soid, recovery_info.size);
        }
        t->omap_setheader(target_coll, recovery_info.soid, omap_header);
      }
      uint64_t off = 0;
      for (interval_set<uint64_t>::const_iterator p = intervals_included.begin();
           p != intervals_included.end();
           ++p) {
        bufferlist bit;
        bit.substr_of(data_included, off, p.get_len());
        t->write(target_coll, recovery_info.soid,
             p.get_start(), p.get_len(), bit);
        off += p.get_len();
      }
    
  t->omap_setkeys(target_coll, recovery_info.soid,
          omap_entries);
  // With the official logic, in the chunked case target_coll here would be
  // the pg's tmp collection. With our partial-recovery change it is the pg's
  // head collection, so the object attrs are modified in place. In the
  // reproduction above, after the second restart the version check reads the
  // object's attrs from the head collection and they match the pg log, so the
  // pg treats the object as intact and does not add it to its missing set
  // (which is sent to the primary during peering's GetMissing); the primary
  // never recovers the object, and pg deep scrub finally finds the size
  // mismatch.
  t->setattrs(target_coll, recovery_info.soid,
          attrs);
    
      if (complete) {
        if (!first && !can_recover_partial) {
          dout(10) << __func__ << ": Removing oid "
               << recovery_info.soid << " from the temp collection" << dendl;
          clear_temp_obj(recovery_info.soid);
          t->collection_move(coll, target_coll, recovery_info.soid);
        }
    
        submit_push_complete(recovery_info, t);
      }
    }
    

The fix

Extra logging confirmed the analysis. Once the cause was confirmed, a fix had to be chosen. The team lead suggested turning off chunked recovery (the official chunking, not our partial recovery: partial recovery recovers only part of each chunk, say 128K of a 1M chunk, to cut the amount of data actually transferred), that is, setting osd_recovery_max_chunk back to its default (8388608); both production and the teuthology tests were using 131072. With that,

      bool can_recover_partial = recovery_info.can_recover_partial;
  if (first && complete || can_recover_partial) {  // now first && complete is guaranteed true: target_coll is the pg's head collection, all data is recovered in one transaction, which guarantees all-or-nothing: either the object exists and is fully recovered, or it does not exist at all, with no in-between state
        target_coll = coll;
      } else {
        dout(10) << __func__ << ": Creating oid "
             << recovery_info.soid << " in the temp collection" << dendl;
        add_temp_obj(recovery_info.soid);
        target_coll = get_temp_coll(t);
      }
    

After changing the config option, neither the fault-injection runs nor the standalone teuthology runs of the failing case showed the inconsistency again; the problem was confirmed fixed.

As you can see, the analysis was rather winding, yet once the cause was understood it turned out to be very simple, and so did the fix. Generally, the simpler the fix, the more likely you have found the real problem.

Analysis record: Ceph cloud-disk util frequently at 100%

Symptoms

A user reported in a popo group that the machine hzabj-ysf-nginx-08.server.163.org constantly shows 100% IO util; the sentinel monitoring screenshot follows:

Initial analysis

First check the canary-volume monitoring. Its history log (added by my earlier librados modifications) shows that this physical pool frequently has IOs exceeding 3s (we generally assume anything over 3s drives io util to 100%):

    2019-06-01 17:25:19.993596 7fe9967fc700 0 client.824171177.objecter [slow op] critical(>3s), object name: rbd_data.10587c6b8b4567.0000000000000085, pool: 59
    2019-06-01 18:09:57.595608 7fe9967fc700 0 client.824171177.objecter [slow op] critical(>3s), object name: rbd_data.10587c6b8b4567.0000000000000069, pool: 59
    2019-06-01 18:16:20.702047 7fe9967fc700 0 client.824171177.objecter [slow op] critical(>3s), object name: rbd_data.10587c6b8b4567.00000000000000a7, pool: 59
    2019-06-01 19:54:51.239482 7fe9967fc700 0 client.824171177.objecter [slow op] critical(>3s), object name: rbd_data.10587c6b8b4567.0000000000000054, pool: 59
    2019-06-01 19:56:35.497817 7fe9967fc700 0 client.824171177.objecter [slow op] critical(>3s), object name: rbd_data.10587c6b8b4567.0000000000000085, pool: 59
    2019-06-01 19:57:02.986602 7fe9967fc700 0 client.824171177.objecter [slow op] critical(>3s), object name: rbd_data.10587c6b8b4567.00000000000000a7, pool: 59
    2019-06-01 19:57:54.665901 7fe9967fc700 0 client.824171177.objecter [slow op] critical(>3s), object name: rbd_data.10587c6b8b4567.00000000000000a6, pool: 59
    2019-06-01 20:00:59.491391 7fe9967fc700 0 client.824171177.objecter [slow op] critical(>3s), object name: rbd_data.10587c6b8b4567.00000000000000cb, pool: 59
    2019-06-01 22:05:23.637840 7fe9967fc700 0 client.824171177.objecter [slow op] critical(>3s), object name: rbd_data.10587c6b8b4567.00000000000000a6, pool: 59
    2019-06-01 22:46:59.814315 7fe9967fc700 0 client.824171177.objecter [slow op] critical(>3s), object name: rbd_data.10587c6b8b4567.000000000000002b, pool: 59
    2019-06-01 22:47:03.845807 7fe9967fc700 0 client.824171177.objecter [slow op] critical(>3s), object name: rbd_data.10587c6b8b4567.00000000000000cb, pool: 59
    2019-06-02 00:07:14.823514 7fe9967fc700 0 client.824171177.objecter [slow op] critical(>3s), object name: rbd_data.10587c6b8b4567.0000000000000085, pool: 59
    2019-06-02 00:09:12.176502 7fe9967fc700 0 client.824171177.objecter [slow op] critical(>3s), object name: rbd_data.10587c6b8b4567.0000000000000079, pool: 59
    2019-06-02 00:09:42.964805 7fe9967fc700 0 client.824171177.objecter [slow op] critical(>3s), object name: rbd_data.10587c6b8b4567.0000000000000076, pool: 59
    2019-06-02 00:10:05.854999 7fe9967fc700 0 client.824171177.objecter [slow op] critical(>3s), object name: rbd_data.10587c6b8b4567.00000000000000cb, pool: 59
    2019-06-02 00:11:20.025895 7fe9967fc700 0 client.824171177.objecter [slow op] critical(>3s), object name: rbd_data.10587c6b8b4567.00000000000000a5, pool: 59
    
    

This log essentially rules out the user's own busy disk IO as the cause of the 100% util.

Otherwise, you would first check how busy the user's own IO is, via sentinel monitoring (searching by hostname prefix usually works) or by asking the NVS team for the cloud-host disk metrics; if it is already near the volume's QoS limits (iops, bps), then the user's own workload is the first suspect.

Digging deeper

The canary volume of this pool had occasionally alerted before, which had been misattributed to a busy pool. So the first check was the load of the two logical pools on this physical pool; they did not look busy (for an ssd pool, under 5K iops should count as fairly idle).

    xxx@ceph146:~$ ceph osd pool stats | egrep 'switch12_ssd_vms|switch16_ssd_vms' -A1
    pool switch12_ssd_vms id 46
    client io 75986 B/s rd, 41106 kB/s wr, 2623 op/s
    --
    pool switch16_ssd_vms id 59
    client io 125 kB/s rd, 68879 kB/s wr, 1786 op/s
    
    

Analyzing the objects in the canary-volume log showed some regularity: the alerting objects repeat a lot, suggesting a specific pg or osd, although not conclusively, since ceph osd map showed the corresponding osds to be fairly spread out.

The next step, reading ceph.log and ceph-osd.log, turned up nothing useful.

After that, a slow disk was suspected. In the osds' perf dump output the journal-related latencies looked normal, but op_w_latency was high: the disk is fine, yet ops are simply processed slowly; the host's basic metrics also looked fine.

At this point things got tricky.

The only option left was to set osd_enable_op_tracker = true and restart an osd. Picking which osd to restart was simple: map the objects from the canary-volume alerts to osds and pick the osd that appears most often.

    for i in `cat xxx`; do ceph osd map switch16_ssd_vms $i >> yyy; done
    

Here xxx is the list of all object names that appeared in the canary-volume alerts.

Then extract the osd list from yyy, sort it, and find the osd that occurs most often.

The chosen osd was osd.1908. A change ticket was filed to restart it (after manually setting osd_enable_op_tracker = true in /etc/ceph/ceph.conf); the restart went smoothly and caused no jitter (with a short stop/start interval, restarts generally cause no visible jitter).

After the restart it was just a matter of waiting for the problem, tailing the canary-volume log; whenever an IO exceeded 3s, running sudo ceph daemon osd.1908 dump_historic_ops showed the op timings: many ops were waiting for other replicas to reply (the sub_op_commit_rec event was where the time went).

Unfortunately the H release does not print the osd id for this event (L does; it has since been picked into our H branch), so there was no way to tell which replica replied slowly. The only option was to pick the replica osds that showed up most often (two of them), enable osd_enable_op_tracker on them too, and restart them.

After those restarts, the repops were indeed slow as well, but in an odd phase (before reached_pg, which is the first stage of dequeue_op):

    {
    "description": "osd_repop(client.893650048.0:2152251 46.235 91c5c235\/rbd_data.25f087ae6b8b4567.0000000000000358\/head\/\/46 v 116247'52196234)",
    "initiated_at": "2019-06-03 13:53:22.152573",
    "age": 396.997140,
    "duration": 2.345648,
    "type_data": [
        "commit sent; apply or cleanup",
        [
            {
            "time": "2019-06-03 13:53:22.152573",
            "event": "initiated"
            },
            {
            "time": "2019-06-03 13:53:24.482251",
            "event": "reached_pg"
            },
    

It was still unclear whether the slowness was in the enqueue_op or the dequeue_op stage, so the osd log was turned up (debug_osd=15) to see exactly where the time went; the log confirmed the latency was already large at enqueue time:

    2019-06-03 16:23:00.097107 7fde12cfa700 10 osd.1854 116251 handle_replica_op osd_repop(client.896864302.0:1498545 59.17a f41cf17a/rbd_data.1906c95d6b8b4567.0000000000
    0038f0/head//59 v 116251'18143050) v1 epoch 116251
    2019-06-03 16:23:00.097115 7fde12cfa700 15 osd.1854 116251 enqueue_op 0x7fddf4813b00 prio 127 cost 3793664 latency 1.542532 osd_repop(client.896864302.0:1498545 59.17
    a f41cf17a/rbd_data.1906c95d6b8b4567.00000000000038f0/head//59 v 116251'18143050) v1
    2019-06-03 16:23:00.097131 7fde269e6700 10 osd.1854 116251 dequeue_op 0x7fddf4813b00 prio 127 cost 3793664 latency 1.542548 osd_repop(client.896864302.0:1498545 59.17
    a f41cf17a/rbd_data.1906c95d6b8b4567.00000000000038f0/head//59 v 116251'18143050) v1 pg pg[59.17a( v 116251'18143049 (116247'18140009,116251'18143049] local-les=11624
    7 n=1427 ec=10414 les/c 116247/116247 116246/116246/67102) [1909,1942,1854] r=2 lpr=116246 pi=65962-116245/7 luod=0'0 crt=116251'18143047 lcod 116251'18143048 active]
    2019-06-03 16:23:00.097165 7fde269e6700 10 osd.1854 pg_epoch: 116251 pg[59.17a( v 116251'18143049 (116247'18140009,116251'18143049] local-les=116247 n=1427 ec=10414 l
    es/c 116247/116247 116246/116246/67102) [1909,1942,1854] r=2 lpr=116246 pi=65962-116245/7 luod=0'0 crt=116251'18143047 lcod 116251'18143048 active] handle_message: 0x
    7fddf4813b00
    

The logs also confirmed another point: only osd_repop is slow to enqueue; normal osd_op is fine (both are put on the same queue).

gdb in the dev environment confirmed that the two are processed through the same path:

    (gdb) bt
    #0 OSD::enqueue_op (this=this@entry=0x4be6000, pg=0xbc74c00, op=std::tr1::shared_ptr (count 2, weak 0) 0x4ab5100) at osd/OSD.cc:8470
    #1 0x00000000006a08aa in OSD::handle_replica_op<MOSDRepOp, 112> (this=this@entry=0x4be6000, op=std::tr1::shared_ptr (count 2, weak 0) 0x4ab5100,    ### osd_op goes through handle_op here instead
    osdmap=std::tr1::shared_ptr (count 72, weak 1) 0x1149d680) at osd/OSD.cc:8448
    #2 0x000000000065704e in OSD::dispatch_op_fast (this=this@entry=0x4be6000, op=std::tr1::shared_ptr (count 2, weak 0) 0x4ab5100, 
    osdmap=std::tr1::shared_ptr (count 72, weak 1) 0x1149d680) at osd/OSD.cc:5839
    #3 0x00000000006571a8 in OSD::dispatch_session_waiting (this=this@entry=0x4be6000, session=session@entry=0x4b3a800, 
    osdmap=std::tr1::shared_ptr (count 72, weak 1) 0x1149d680) at osd/OSD.cc:5473
    #4 0x00000000006575c0 in OSD::ms_fast_dispatch (this=0x4be6000, m=<optimized out>) at osd/OSD.cc:5585
    #5 0x0000000000bddd9e in ms_fast_dispatch (m=0x11de2400, this=0x4ae4700) at ./msg/Messenger.h:553
    #6 DispatchQueue::fast_dispatch (this=0x4ae48c8, m=0x11de2400) at msg/simple/DispatchQueue.cc:71
    #7 0x0000000000c0b543 in Pipe::reader (this=0x126d0000) at msg/simple/Pipe.cc:1645
    #8 0x0000000000c1384d in Pipe::Reader::entry (this=<optimized out>) at msg/simple/Pipe.h:50
    #9 0x00007fc06b25d064 in start_thread (arg=0x7fc046880700) at pthread_create.c:309
    #10 0x00007fc0697b562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
    

This made it even stranger. enqueue_op does very little: it just drops the op onto the osd's op queue and returns; the cpu was not busy and plenty of memory was free, so there was no reason for it to be slow, yet the logs pinned the slowness somewhere inside the call stack above.

The pg lock was the first suspect, but that was doubtful too: the workload was light, each op should be processed quickly, and the pg lock should not be held for long, barring a bug.

The analysis was rather stuck at this point. The only option was to keep reading code for more logging that could narrow things down.

Fortunately there is the op tracker. While adding logging on top of it earlier, I had seen log points that show when an op was received and when it was dispatched:

    OpRequest::OpRequest(Message *req, OpTracker *tracker) :
    TrackedOp(tracker, req->get_recv_stamp()),
    rmw_flags(0), request(req),
    hit_flag_points(0), latest_flag_point(0),
    send_map_update(false), sent_epoch(0) {
        if (req->get_priority() < tracker->cct->_conf->osd_client_op_priority) {
          // don't warn as quickly for low priority ops
          warn_interval_multiplier = tracker->cct->_conf->osd_recovery_op_warn_multiple;
        }
        if (req->get_type() == CEPH_MSG_OSD_OP) {
      reqid = static_cast<MOSDOp*>(req)->get_reqid();
    } else if (req->get_type() == MSG_OSD_SUBOP) {
      reqid = static_cast<MOSDSubOp*>(req)->reqid;
    } else if (req->get_type() == MSG_OSD_REPOP) {
      reqid = static_cast<MOSDRepOp*>(req)->reqid;
    }
    tracker->mark_event(this, "header_read", request->get_recv_stamp());   // these few lines saved the day: they only log, and are not recorded into historic ops
        tracker->mark_event(this, "throttled", request->get_throttle_stamp());
        tracker->mark_event(this, "all_read", request->get_recv_complete_stamp());
        tracker->mark_event(this, "dispatched", request->get_dispatch_stamp());
    

So I raised the op tracker log level to 5 (debug_optracker=5), and the log lines appeared:

    2019-06-03 17:11:45.489452 7fde166fe700 5 -- op tracker -- seq: 2757941, time: 2019-06-03 17:11:43.689630, event: throttled, op: osd_repop(client.689719056.0:69702497 59.1c7 913ee9c7/rbd_data.2bd4c3681befd79f.0000000000000fbc/head//59 v 116251'17079367)
    2019-06-03 17:11:45.489457 7fde166fe700 5 -- op tracker -- seq: 2757941, time: 2019-06-03 17:11:45.489414, event: all_read, op: osd_repop(client.689719056.0:69702497 59.1c7 913ee9c7/rbd_data.2bd4c3681befd79f.0000000000000fbc/head//59 v 116251'17079367)
    

It is clearly the span from throttled to all_read that is slow; that code is in Pipe::read_message in src\msg\simple\Pipe.cc:

    // read front
    front_len = header.front_len;
    if (front_len) {
        bufferptr bp = buffer::create(front_len);
        if (tcp_read(bp.c_str(), front_len) < 0)
        goto out_dethrottle;
        front.push_back(bp);
        ldout(msgr->cct,20) << "reader got front " << front.length() << dendl;
    }
    
    // read middle
    middle_len = header.middle_len;
    if (middle_len) {
        bufferptr bp = buffer::create(middle_len);
        if (tcp_read(bp.c_str(), middle_len) < 0)
            goto out_dethrottle;
        middle.push_back(bp);
        ldout(msgr->cct,20) << "reader got middle " << middle.length() << dendl;
    }
    

These are all tcp_read operations, which pointed to a cluster-network problem. That would also explain why only repops were slow (they travel over the cluster network, while the ops handled by the primary arrive over the public network). The truth was getting close.

Testing the cluster network with ping 10.122.192.146 -s 4096 then showed anomalies:

    4104 bytes from 10.122.192.146: icmp_seq=88 ttl=64 time=0.082 ms
    From 10.122.192.146 icmp_seq=58 Frag reassembly time exceeded
    4104 bytes from 10.122.192.146: icmp_seq=89 ttl=64 time=0.075 ms
    4104 bytes from 10.122.192.146: icmp_seq=90 ttl=64 time=0.084 ms
    4104 bytes from 10.122.192.146: icmp_seq=91 ttl=64 time=0.080 ms
    4104 bytes from 10.122.192.146: icmp_seq=92 ttl=64 time=0.084 ms
    4104 bytes from 10.122.192.146: icmp_seq=93 ttl=64 time=0.079 ms
    4104 bytes from 10.122.192.146: icmp_seq=94 ttl=64 time=0.078 ms
    4104 bytes from 10.122.192.146: icmp_seq=95 ttl=64 time=0.081 ms
    From 10.122.192.146 icmp_seq=65 Frag reassembly time exceeded
    4104 bytes from 10.122.192.146: icmp_seq=96 ttl=64 time=0.088 ms
    

SA was asked to check the network; their guess was that one NIC of the bond was faulty, so eth0 was shut down as an experiment. After a night of observation the problem was gone.

The root cause was still not found, though. The next day eth0 was brought back up, and a full day ran fine as well. Where the problem actually lies is still unknown; the NIC metrics in sentinel monitoring all look normal.

The only anomaly is in the switch stats provided by the datacenter (a large number of input errors and 17 aborts), but those do not explain where the errors come from either:

Handling pgs stuck in the incomplete state

How it arises:

1. a pg runs on a single replica for a while, with IO being written
2. that single replica's osd disk fails and the osd cannot start
3. the osd is marked lost, after which some pgs may enter the incomplete state

The root cause is that no osd holding an authoritative log can be chosen. See: https://blog.csdn.net/a1454927420/article/details/79007144

Solution:

This is a last-resort measure, only to be used when no other repair works; it may lose data.

The repair works by clearing all of the pg's past_intervals and persisting the rewritten pg info to disk (it is these past_intervals, which contain the failed osd, that keep the pg from passing peering's GetLog phase).

Use a recent ceph-objectstore-tool to force the pg state to complete (run on the primary osd, with the osd process stopped first):

$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16/ --journal-path /var/lib/ceph/osd/ceph-16/journal --pgid $pg --op mark-complete

Our current 0.94.6 code lacks the mark-complete feature; the upstream change (merged in 0.94.7):

    https://github.com/ceph/ceph/pull/7702/commits/0fe3dfe8768b27743189add73cde9a44095f2a62#diff-6ace6eced75ec25c9d512cb71e03d384

Backport it, rebuild, and the tool can be used.

Newer releases support it by default, so the tool can be used directly there.

Handling pgs in the down+peering state

Reference: http://docs.ceph.com/docs/hammer/rados/troubleshooting/troubleshooting-pg/

Reproduction steps:

1. run fio writes against a volume in a pool with min_size=1
2. stop two replica osds, keep the pg running on a single replica, and keep writing data
3. stop the third replica osd, then start the first two

ceph -s then shows pgs in the down+peering state.

In this situation, first try to start the third replica osd; if it starts normally, the down+peering pgs return to the active state.

If the third osd cannot be started (disk or node failure), a salvage plan is needed; some object data may be lost (objects roll back to older versions). The steps to restore the pg state are:
with ceph pg xxx query, find the osd id under peering_blocked_by, run ceph osd lost $OSDID --yes-i-really-mean-it to mark that osd lost, and restart one surviving osd of pg xxx (forcing a fresh peering). If no data was written to the pg while it ran single-replica, i.e. the pg has no updates missing on the failed osd, no object data is lost. Afterwards just remove the failed osd.

If after the lost command ceph -s shows unfound objects, run ceph pg xxx mark_unfound_lost revert to revert the pg's lost objects to their previous versions (data written during single-replica operation is lost). The pg then returns to the normal active state; finally remove the failed osd.

Common commands:

$ ceph health detail   ## detailed cluster health, e.g. unavailable pgs, down osds
    pg 11.3c is stuck inactive for 5581.136026, current state down+peering, last acting [24,40]
    pg 11.17 is stuck inactive for 5582.046433, current state down+peering, last acting [24,40]
    pg 11.3a is stuck inactive for 5543.997257, current state down+peering, last acting [40,24]
    pg 11.2e is stuck inactive for 5582.079197, current state down+peering, last acting [24,40]
    pg 11.1 is stuck unclean for 5585.626778, current state down+peering, last acting [40,24]
    pg 11.3c is stuck unclean for 5585.230310, current state down+peering, last acting [24,40]
    pg 11.3 is stuck unclean for 5585.384133, current state down+peering, last acting [40,24]
    pg 11.17 is stuck unclean for 5586.867256, current state down+peering, last acting [24,40]
    pg 11.38 is stuck unclean for 5586.403706, current state down+peering, last acting [40,24]
    pg 11.3a is stuck unclean for 11991.965760, current state down+peering, last acting [40,24]
    pg 11.2e is stuck unclean for 5585.567934, current state down+peering, last acting [24,40]
    pg 11.6 is stuck unclean for 5585.452910, current state down+peering, last acting [40,24]
    pg 11.3c is down+peering, acting [24,40]
    pg 11.38 is down+peering, acting [40,24]
    pg 11.3a is down+peering, acting [40,24]
    pg 11.2e is down+peering, acting [24,40]
    pg 11.17 is down+peering, acting [24,40]
    pg 11.6 is down+peering, acting [40,24]
    pg 11.1 is down+peering, acting [40,24]
    pg 11.3 is down+peering, acting [40,24]
$ ceph pg 11.3c query  ## show detailed pg information
    ......
      "recovery_state": [
       ......
                "probing_osds": [
                    "24",
                    "40"
                ],
                "blocked": "peering is blocked due to down osds",
                "down_osds_we_would_probe": [
                    0
                ],
                "peering_blocked_by": [
                    {
                        "osd": 0,
                        "current_lost_at": 20891,
                        "comment": "starting or marking this osd lost may let us proceed"  ### 提示信息
                  }
                ]
            },
    ......