libvirt/qemu live snapshot code flow analysis




Last time I analyzed a data-loss issue in nova live snapshot, but I only half understood the underlying implementation. This article continues that work and digs deeper into the underlying technical implementation and principles (fair warning: this post starts strong and trails off, read at your own risk!).

libvirt

First, a question: what is libvirt actually for?

I wrote a short document about this a long time ago; it was fairly shallow and mainly explained how libvirt, qemu and KVM relate to each other. My take is that libvirt is the virtualization adaptation layer, an adapter over various underlying virtualization technologies. The officially supported backends are KVM, QEMU, Xen, Virtuozzo, VMware ESX, LXC, bhyve and more, but the primary one is qemu; support for the others, such as Xen and LXC, is less complete, with basic features working but advanced ones not, so the vast majority of users run qemu, and Xen keeps losing popularity. I used libvirt+LXC in the past and it was not great either; everyone uses docker now and LXC is yesterday's news. I have not used VMware ESX so I will not comment, but nova implements a dedicated ESX driver rather than going through the libvirt driver, so I suspect that support is not very complete either; VMware's own client and API are already quite good, so there is little point wrapping them in another libvirt layer. The libvirt API itself is C; for convenience, SDKs are provided for several languages such as Python and Java, with Python probably the most commonly used today.

libvirt is open source, so it is currently the first-choice virtualization adaptation layer for open-source IaaS platforms such as OpenStack, and even for closed-source, in-house cloud platforms. libvirt predates the OpenStack project by quite a few years and has always been mainly a Red Hat effort. There is more information on the official site: https://libvirt.org/

Code flow

The analysis is based on the master branch (HEAD commit: 07adbd4b1f82a9f09584dfa5fb6ca9063bd24bd0).

The previous article mentioned that nova calls dev.rebase(), which actually goes through the python-libvirt bindings (the Python wrapper around the libvirt API) to libvirt's virDomainBlockRebase API; the source is in src/libvirt-domain.c:

/**
 * virDomainBlockRebase:
 * @dom: pointer to domain object
 * @disk: path to the block device, or device shorthand
 * @base: path to backing file to keep, or device shorthand,
 *        or NULL for no backing file
 * @bandwidth: (optional) specify bandwidth limit; flags determine the unit
 * @flags: bitwise-OR of virDomainBlockRebaseFlags
 *
 * Populate a disk image with data from its backing image chain, and
 * setting the backing image to @base, or alternatively copy an entire
 * backing chain to a new file @base.
 *
 * When @flags is 0, this starts a pull, where @base must be the absolute
 * path of one of the backing images further up the chain, or NULL to
 * convert the disk image so that it has no backing image.  Once all
 * data from its backing image chain has been pulled, the disk no
 * longer depends on those intermediate backing images.  This function
 * pulls data for the entire device in the background.  Progress of
 * the operation can be checked with virDomainGetBlockJobInfo() with a
 * job type of VIR_DOMAIN_BLOCK_JOB_TYPE_PULL, and the operation can be
 * aborted with virDomainBlockJobAbort().  When finished, an asynchronous
 * event is raised to indicate the final status, and the job no longer
 * exists.  If the job is aborted, a new one can be started later to
 * resume from the same point.
 *
 * If @flags contains VIR_DOMAIN_BLOCK_REBASE_RELATIVE, the name recorded
 * into the active disk as the location for @base will be kept relative.
 * The operation will fail if libvirt can't infer the name.
 *
 * When @flags includes VIR_DOMAIN_BLOCK_REBASE_COPY, this starts a copy,
 * where @base must be the name of a new file to copy the chain to.  By
 * default, the copy will pull the entire source chain into the destination
 * file, but if @flags also contains VIR_DOMAIN_BLOCK_REBASE_SHALLOW, then
 * only the top of the source chain will be copied (the source and
 * destination have a common backing file).  By default, @base will be
 * created with the same file format as the source, but this can be altered
 * by adding VIR_DOMAIN_BLOCK_REBASE_COPY_RAW to force the copy to be raw
 * (does not make sense with the shallow flag unless the source is also raw),
 * or by using VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT to reuse an existing file
 * which was pre-created with the correct format and metadata and sufficient
 * size to hold the copy. In case the VIR_DOMAIN_BLOCK_REBASE_SHALLOW flag
 * is used the pre-created file has to exhibit the same guest visible contents
 * as the backing file of the original image. This allows a management app to
 * pre-create files with relative backing file names, rather than the default
 * of absolute backing file names; as a security precaution, you should
 * generally only use reuse_ext with the shallow flag and a non-raw
 * destination file.  By default, the copy destination will be treated as
 * type='file', but using VIR_DOMAIN_BLOCK_REBASE_COPY_DEV treats the
 * destination as type='block' (affecting how virDomainGetBlockInfo() will
 * report allocation after pivoting).
 *
 * A copy job has two parts; in the first phase, the @bandwidth parameter
 * affects how fast the source is pulled into the destination, and the job
 * can only be canceled by reverting to the source file; progress in this
 * phase can be tracked via the virDomainBlockJobInfo() command, with a
 * job type of VIR_DOMAIN_BLOCK_JOB_TYPE_COPY.  The job transitions to the
 * second phase when the job info states cur == end, and remains alive to
 * mirror all further changes to both source and destination.  The user
 * must call virDomainBlockJobAbort() to end the mirroring while choosing
 * whether to revert to source or pivot to the destination.  An event is
 * issued when the job ends, and depending on the hypervisor, an event may
 * also be issued when the job transitions from pulling to mirroring.  If
 * the job is aborted, a new job will have to start over from the beginning
 * of the first phase.
 *
 * Some hypervisors will restrict certain actions, such as virDomainSave()
 * or virDomainDetachDevice(), while a copy job is active; they may
 * also restrict a copy job to transient domains.
 *
 * The @disk parameter is either an unambiguous source name of the
 * block device (the <source file='...'/> sub-element, such as
 * "/path/to/image"), or the device target shorthand (the
 * <target dev='...'/> sub-element, such as "vda").  Valid names
 * can be found by calling virDomainGetXMLDesc() and inspecting
 * elements within //domain/devices/disk.
 *
 * The @base parameter can be either a path to a file within the backing
 * chain, or the device target shorthand (the <target dev='...'/>
 * sub-element, such as "vda") followed by an index to the backing chain
 * enclosed in square brackets. Backing chain indexes can be found by
 * inspecting //disk//backingStore/@index in the domain XML. Thus, for
 * example, "vda[3]" refers to the backing store with index equal to "3"
 * in the chain of disk "vda".
 *
 * The maximum bandwidth that will be used to do the copy can be
 * specified with the @bandwidth parameter.  If set to 0, there is no
 * limit.  If @flags includes VIR_DOMAIN_BLOCK_REBASE_BANDWIDTH_BYTES,
 * @bandwidth is in bytes/second; otherwise, it is in MiB/second.
 * Values larger than 2^52 bytes/sec may be rejected due to overflow
 * considerations based on the word size of both client and server,
 * and values larger than 2^31 bytes/sec may cause overflow problems
 * if later queried by virDomainGetBlockJobInfo() without scaling.
 * Hypervisors may further restrict the range of valid bandwidth
 * values.  Some hypervisors do not support this feature and will
 * return an error if bandwidth is not 0; in this case, it might still
 * be possible for a later call to virDomainBlockJobSetSpeed() to
 * succeed.  The actual speed can be determined with
 * virDomainGetBlockJobInfo().
 *
 * When @base is NULL and @flags is 0, this is identical to
 * virDomainBlockPull().  When @flags contains VIR_DOMAIN_BLOCK_REBASE_COPY,
 * this command is shorthand for virDomainBlockCopy() where the destination
 * XML encodes @base as a <disk type='file'>, @bandwidth is properly scaled
 * and passed as a typed parameter, the shallow and reuse external flags
 * are preserved, and remaining flags control whether the XML encodes a
 * destination format of raw instead of leaving the destination identical
 * to the source format or probed from the reused file.
 *
 * Returns 0 if the operation has started, -1 on failure.
 */
int
virDomainBlockRebase(virDomainPtr dom, const char *disk,
                     const char *base, unsigned long bandwidth,
                     unsigned int flags)
{
    virConnectPtr conn;

    VIR_DOMAIN_DEBUG(dom, "disk=%s, base=%s, bandwidth=%lu, flags=0x%x",
                     disk, NULLSTR(base), bandwidth, flags);

    virResetLastError();

    virCheckDomainReturn(dom, -1);
    conn = dom->conn;

    virCheckReadOnlyGoto(conn->flags, error);
    virCheckNonNullArgGoto(disk, error);

    if (flags & VIR_DOMAIN_BLOCK_REBASE_COPY) {
        virCheckNonNullArgGoto(base, error);
    } else if (flags & (VIR_DOMAIN_BLOCK_REBASE_SHALLOW |
                        VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT |
                        VIR_DOMAIN_BLOCK_REBASE_COPY_RAW |
                        VIR_DOMAIN_BLOCK_REBASE_COPY_DEV)) {
        virReportInvalidArg(flags, "%s",
                            _("use of flags requires a copy job"));
        goto error;
    }
    // conn->driver here is the qemu driver. I won't cover how each driver gets
    // registered (plenty of material online); the entry point is in
    // daemon/libvirtd.c: VIR_DAEMON_LOAD_MODULE(qemuRegister, "qemu");
    // Which driver is used is decided by the URI passed when the connection was opened,
    // e.g. virsh -c qemu:///system means talking to libvirtd's qemu driver over its unix domain socket.
    if (conn->driver->domainBlockRebase) {  // check whether the hypervisor driver backend implements this function
        int ret;
        ret = conn->driver->domainBlockRebase(dom, disk, base, bandwidth,
                                              flags); // call the rebase implementation in the driver
        if (ret < 0)
            goto error;
        return ret;
    }

    virReportUnsupportedError();

 error:
    virDispatchError(dom->conn);
    return -1;
}

The comment above is quite clear; in fact the API reference documentation is generated from exactly these comments: https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBlockRebase

In nova the call is dev.rebase(disk_delta, copy=True, reuse_ext=True, shallow=True), which ends up invoking the python-libvirt method self._guest._domain.blockRebase(self._disk, base, self.REBASE_DEFAULT_BANDWIDTH, flags=flags). The corresponding libvirt API flags are libvirt.VIR_DOMAIN_BLOCK_REBASE_SHALLOW, libvirt.VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT and libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY; self._disk is the file path of the system disk (vda), base is the snapshot file path, and then come the bandwidth limit (nova defaults to 0, meaning unlimited) and the flags.
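
To make that concrete, here is a minimal python-libvirt sketch of the same kind of call; the domain name, disk target and delta path are made-up placeholders and error handling is omitted:

import time
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("instance-00000001")    # hypothetical domain name

flags = (libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY |
         libvirt.VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT |
         libvirt.VIR_DOMAIN_BLOCK_REBASE_SHALLOW)

# bandwidth=0 means no limit, matching nova's REBASE_DEFAULT_BANDWIDTH
dom.blockRebase("vda", "/var/lib/nova/instances/snap/disk.delta", 0, flags)

# Poll the copy job via virDomainGetBlockJobInfo, roughly as nova does; the
# end != 0 guard works around the "job looks finished before it started" issue.
while True:
    info = dom.blockJobInfo("vda", 0)
    if info and info["end"] != 0 and info["cur"] == info["end"]:
        break
    time.sleep(0.5)

# Ending the job with flags=0 reverts to the source disk and leaves the copied
# delta file behind, which is exactly what a live snapshot wants.
dom.blockJobAbort("vda", 0)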

So roughly speaking, this API rebases the data of the vda system disk onto the snapshot file, and the progress of the rebase job can then be queried via virDomainGetBlockJobInfo. That is exactly what nova does, except that the old query API had a bug that could misreport the job progress and treat the job as finished before it had even started. The documentation says: "When @flags contains VIR_DOMAIN_BLOCK_REBASE_COPY, this command is shorthand for virDomainBlockCopy()", which means virDomainBlockCopy can achieve the same thing; no wonder my search for a virsh blockrebase subcommand came up empty and I only found blockcopy and blockpull. The documentation for that API is more detailed: https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBlockCopy

The next piece of code lives in the qemu driver directory, in src/qemu/qemu_driver.c (at the very end of that file there is a qemuRegister function, which is the entry point that registers the qemu driver; the qemuHypervisorDriver struct defines the qemu driver implementation of each libvirt API, worth a look if you are interested):

// conn->driver->domainBlockRebase is a function pointer registered when libvirtd starts; this is the function it points to
static int
qemuDomainBlockRebase(virDomainPtr dom, const char *path, const char *base,
                      unsigned long bandwidth, unsigned int flags)
{
    virQEMUDriverPtr driver = dom->conn->privateData;
    virDomainObjPtr vm;
    int ret = -1;
    unsigned long long speed = bandwidth;
    virStorageSourcePtr dest = NULL;

    virCheckFlags(VIR_DOMAIN_BLOCK_REBASE_SHALLOW |
                  VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT |
                  VIR_DOMAIN_BLOCK_REBASE_COPY |
                  VIR_DOMAIN_BLOCK_REBASE_COPY_RAW |
                  VIR_DOMAIN_BLOCK_REBASE_RELATIVE |
                  VIR_DOMAIN_BLOCK_REBASE_COPY_DEV |
                  VIR_DOMAIN_BLOCK_REBASE_BANDWIDTH_BYTES, -1);

    if (!(vm = qemuDomObjFromDomain(dom)))
        return -1;

    if (virDomainBlockRebaseEnsureACL(dom->conn, vm->def) < 0)
        goto cleanup;

    /* For normal rebase (enhanced blockpull), the common code handles
     * everything, including vm cleanup. */
    if (!(flags & VIR_DOMAIN_BLOCK_REBASE_COPY))
        return qemuDomainBlockPullCommon(driver, vm, path, base, bandwidth, flags);

    /* If we got here, we are doing a block copy rebase. */
    if (VIR_ALLOC(dest) < 0)
        goto cleanup;
    dest->type = (flags & VIR_DOMAIN_BLOCK_REBASE_COPY_DEV) ?
        VIR_STORAGE_TYPE_BLOCK : VIR_STORAGE_TYPE_FILE;
    if (VIR_STRDUP(dest->path, base) < 0)
        goto cleanup;
    if (flags & VIR_DOMAIN_BLOCK_REBASE_COPY_RAW)
        dest->format = VIR_STORAGE_FILE_RAW;

    /* Convert bandwidth MiB to bytes, if necessary */
    if (!(flags & VIR_DOMAIN_BLOCK_REBASE_BANDWIDTH_BYTES)) {
        if (speed > LLONG_MAX >> 20) {
            virReportError(VIR_ERR_OVERFLOW,
                           _("bandwidth must be less than %llu"),
                           LLONG_MAX >> 20);
            goto cleanup;
        }
        speed <<= 20;
    }

    /* XXX: If we are doing a shallow copy but not reusing an external
     * file, we should attempt to pre-create the destination with a
     * relative backing chain instead of qemu's default of absolute */
    if (flags & VIR_DOMAIN_BLOCK_REBASE_RELATIVE) {
        virReportError(VIR_ERR_ARGUMENT_UNSUPPORTED, "%s",
                       _("Relative backing during copy not supported yet"));
        goto cleanup;
    }

    /* We rely on the fact that VIR_DOMAIN_BLOCK_REBASE_SHALLOW
     * and VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT map to the same values
     * as for block copy. */
    flags &= (VIR_DOMAIN_BLOCK_REBASE_SHALLOW |
              VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT);
    // With the flags nova passes in, we end up here; it is the same function that virDomainBlockCopy calls
    ret = qemuDomainBlockCopyCommon(vm, dom->conn, path, dest,
                                    speed, 0, 0, flags, true);
    dest = NULL;

 cleanup:
    virDomainObjEndAPI(&vm);
    virStorageSourceFree(dest);
    return ret;
}

The code path from here on is fairly long, so I will not go through it step by step; in essence it builds the JSON message to be sent to the qemu monitor's unix domain socket. The rough call chain is:

src/qemu/qemu_driver.c:qemuDomainBlockCopyCommon -> src/qemu/qemu_monitor.c:qemuMonitorDriveMirror -> src/qemu/qemu_monitor_json.c:qemuMonitorJSONDriveMirror -> src/qemu/qemu_monitor_json.c:qemuMonitorJSONMakeCommand -> src/qemu/qemu_monitor_json.c:qemuMonitorJSONCommand -> src/qemu/qemu_monitor_json.c:qemuMonitorJSONCommandWithFd -> src/qemu/qemu_monitor.c:qemuMonitorSend. The body of qemuMonitorSend looks simple, but without debugging I could not see exactly where the JSON request is sent to the qemu monitor; there are only a few calls to choose from, and by elimination it seemed to be qemuMonitorUpdateWatch, yet that function does not do much either, so only setting breakpoints and stepping through would settle it. I did debug this part in another article and got a rough picture of the flow: the send happens in the qemuMonitorIO callback; when the monitor socket becomes writable, it calls qemuMonitorIOWrite to write the message to the qemu monitor socket, and reads follow a similar path. qemuMonitorUpdateWatch merely updates the set of events the monitor event callback is watching for.
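
If you would rather not attach a debugger, another way to see the exact JSON libvirt exchanges with qemu is to turn up the monitor logging; something like the following in /etc/libvirt/libvirtd.conf (followed by a libvirtd restart) should make the QMP traffic show up in the log file. The filter and output values are from memory, so double-check them against your libvirt version:

log_filters="1:qemu_monitor_json"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"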

From some searching online, the JSON string eventually sent to the qemu monitor should look something like the following (the command used was virsh blockcopy rhel7f vda --dest /var/lib/libvirt/images/f1.img; job progress can be checked with the blockjob command: virsh blockjob rhel7f vda):

    {
        "execute":"drive-mirror",
        "arguments":{
            "device":"drive-virtio-disk0",
            "target":"/var/lib/libvirt/images/f1.img",
            "speed":0,
            "sync":"full",
            "mode":"absolute-paths",
            "format":"raw"
        },
        "id":"libvirt-96"
    }

This looks a lot like the qemu-guest-agent commands, and that is because both follow the same QMP protocol.
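
For reference, speaking QMP by hand is not hard. Below is a rough sketch assuming a guest started with an extra -qmp unix:/tmp/qmp.sock,server,nowait monitor (the socket path is just an example); the naive recv() parsing ignores asynchronous events, so treat it as a toy:

import json
import socket

s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.connect("/tmp/qmp.sock")            # example path from the -qmp option above
print(s.recv(4096))                   # QMP greeting banner

def qmp(execute, **arguments):
    msg = {"execute": execute}
    if arguments:
        msg["arguments"] = arguments
    s.sendall(json.dumps(msg).encode())
    return json.loads(s.recv(65536))

qmp("qmp_capabilities")               # capability negotiation must come first
print(qmp("query-block-jobs"))        # roughly the info behind virsh blockjob

For a libvirt-managed domain, virsh qemu-monitor-command does the same thing without touching the socket directly.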

Verification with virsh:

# You need to virsh undefine c9cfbc44-1b4c-41bd-bdc0-1cc5f7acb659 first, otherwise it fails with:
# error: Requested operation is not valid: domain is not transient
[root@vs2-compute-84 ~]# virsh blockcopy c9cfbc44-1b4c-41bd-bdc0-1cc5f7acb659 vda --dest  xxx.qcow2
# If --dest is not given as an absolute path, on CentOS 7 the file lands in the / root directory by default; an absolute path is recommended
[root@vs2-compute-84 ~]# virsh 
Welcome to virsh, the virtualization interactive terminal.

Type:  'help' for help with commands
       'quit' to quit

virsh # blockjob c9cfbc44-1b4c-41bd-bdc0-1cc5f7acb659 vda
Block Copy: [ 13 %]

virsh # blockjob c9cfbc44-1b4c-41bd-bdc0-1cc5f7acb659 vda
Block Copy: [ 35 %]

virsh # blockjob c9cfbc44-1b4c-41bd-bdc0-1cc5f7acb659 vda
No current block job for vda   ### job finished
[root@vs2-compute-84 ~]# virsh dumpxml c9cfbc44-1b4c-41bd-bdc0-1cc5f7acb659
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source protocol='rbd' name='pool-8084ad417b504b1e837147273715a27a/c9cfbc44-1b4c-41bd-bdc0-1cc5f7acb659_disk'>
        <host name='192.168.66.81' port='6789'/>
        <host name='192.168.66.82' port='6789'/>
        <host name='192.168.66.83' port='6789'/>
      </source>
      <backingStore/>
      <mirror type='file' file='xxx.qcow2' format='raw' job='copy'>
      ## nova adds an extra check: after blockjob reports the progress as complete, disk.mirror.ready == 'yes' is used to double-check that the block job is really done
        <format type='raw'/>
        <source file='xxx.qcow2'/>   
      </mirror>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>
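
The check mentioned in the comment above is easy to reproduce; here is a rough sketch of what that double confirmation amounts to, reusing the dom object from the earlier python-libvirt snippet (the target device name is an assumption):

import xml.etree.ElementTree as ET

def mirror_ready(dom, target_dev="vda"):
    # Look for <mirror ... ready='yes'> on the disk whose <target dev=...> matches
    root = ET.fromstring(dom.XMLDesc(0))
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        if target is not None and target.get("dev") == target_dev:
            mirror = disk.find("mirror")
            return mirror is not None and mirror.get("ready") == "yes"
    return False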

 

Summary

libvirt is, after all, just an adaptation layer, acting more or less as a channel that passes commands along. Without libvirt you can still start a VM from the qemu command line and send it commands through the monitor, but that interface is unfriendly, hard to program against, and cumbersome for services built on top. libvirt solves this nicely: it wraps the various virtualization backends behind a unified API and describes VM configuration in XML, which is far more pleasant and makes for a good user experience, and that is why it caught on. Next up is the qemu code flow; this part is more complex and I only scratch the surface, but I will try to lay out the relevant code paths.

qemu

Again based on the master branch, HEAD commit: 2babfe0c9241c239272a03fec785165a50e8288c.

I will not say much about what qemu is for: working with KVM in the kernel, it emulates all kinds of hardware devices (CPU and memory aside), so you could say it implements the functionality of the motherboard; the BIOS is provided by separate components (such as SeaBIOS) that qemu simply uses. qemu's website is https://www.qemu.org/, and this project too is largely a Red Hat-driven effort.

How to build qemu: https://wiki.qemu.org/Hosts/Linux

The relationship between qapi, qmp and hmp

QAPI is the low-level C interface; QMP is the JSON protocol wrapped around it; and HMP is the human-friendly protocol provided on top of QMP. The call chain is hmp -> qmp -> qapi, and the official docs state that management services (such as libvirt) should use the QMP interface.

QAPI introduction: https://wiki.qemu.org/Features/QAPI

QMP introduction: https://wiki.qemu.org/QMP

HMP introduction (only a few short sentences, still in TODO state): https://wiki.qemu.org/ToDo/HMP

QAPI code generation: docs/devel/qapi-code-gen.txt under the docs directory of the source tree, or https://people.cs.clemson.edu/~ccorsi/kyouko/docs/qapi-code-gen.txt

I am actually quite interested in this auto-generated-code machinery; it feels like magic. I looked into it a long time ago, but my skills were not up to it and I came away completely confused; I will dig into it again when I have time.

The QAPI code is generated from a predefined schema (the JSON files under the qapi directory). In particular, the structs and enum types declared in various .h files are auto-generated (including the .h files themselves). The generator tools are the qapi*.py scripts under the scripts directory at the qemu source root; see the link above for how to use them.

If you do not want to generate every source file by hand, just follow the qemu build instructions above and the relevant source files will be generated during the build.

qemu live block operations

The relevant documentation is the plain-text file docs/interop/live-block-operations.rst in the source tree; there is an online version here: https://kashyapc.fedorapeople.org/virt/qemu/live-block-operations.html

I will not translate the whole document; please read the original. To give the gist, here is the abstract-like opening paragraph, with a short gloss below:

QEMU Block Layer currently (as of QEMU 2.9) supports four major kinds of live block device jobs — stream, commit, mirror, and backup. These can be used to manipulate disk image chains to accomplish certain tasks, namely: live copy data from backing files into overlays; shorten long disk image chains by merging data from overlays into backing files; live synchronize data from a disk image chain (including current active disk) to another target image; point-in-time (and incremental) backups of a block device. Below is a description of the said block (QMP) primitives, and some (non-exhaustive list of) examples to illustrate their use.

In short: as of QEMU 2.9 the block layer supports four major kinds of live block device jobs: stream, commit, mirror (the one nova live snapshot uses) and backup. They can be used to manipulate disk image chains for tasks such as live-copying data from backing files into overlays, shortening long image chains by merging overlays back into backing files, live-synchronizing data from an image chain (including the active disk) to another target image, and taking point-in-time (and incremental) backups of a block device. The document then describes these QMP primitives and gives a non-exhaustive set of examples of their use.
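
For completeness, the non-mirror primitives are also reachable from python-libvirt; below is a hedged sketch (the dom object and disk name are the same placeholders as before) of the calls that map onto stream and commit:

# stream: pull data from the backing chain into the active image
# (QMP block-stream, libvirt virDomainBlockPull)
dom.blockPull("vda", 0, 0)

# commit: merge the active overlay back down into its backing file
# (QMP block-commit, libvirt virDomainBlockCommit); an active commit is
# finished by pivoting with blockJobAbort once the job reports ready
dom.blockCommit("vda", None, None, 0,
                libvirt.VIR_DOMAIN_BLOCK_COMMIT_ACTIVE |
                libvirt.VIR_DOMAIN_BLOCK_COMMIT_SHALLOW)
dom.blockJobAbort("vda", libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)

# mirror is what nova live snapshot uses: blockRebase(..., COPY) as shown
# earlier, or virDomainBlockCopy directly.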

backing file: the base data of an image. As an analogy, think of it as the read-only system on a live CD: all new data is written to the overlay. Overlay blocks map one-to-one to backing file blocks, but the overlay is normally a sparse file, so blocks that were never written consume no real storage, saving physical disk space. When the user modifies data that lives in the backing file, the affected block is first copied into the overlay and then modified as requested; this is transparent to the user, and once a block has been copied into the overlay, all later reads and updates go through the overlay and no longer involve the backing file.

This copy-on-write behavior is the defining feature of the qcow2 format; qcow2 simply means "qemu copy on write", version 2, and the paragraph above sketches how it works. In nova, the backing file normally lives in the _base directory under the instances directory, while the overlay is the disk file inside the instance directory. The backing file is usually raw and the overlay qcow2; a qcow2 image can apparently also act as a backing file, but nova forcibly converts images to raw before using them as backing files (I have not figured out why). Image properties can be inspected with qemu-img info, for example:

# Create an overlay for a base image: qemu-img create -f qcow2 -b base.raw overlay.qcow2
[root@vs-controller b729b095-a302-4c0b-9a59-a8bd49ed393a]# qemu-img info disk
image: disk
file format: qcow2
virtual size: 200G (214748364800 bytes)
disk size: 246M
cluster_size: 65536
backing file: /var/lib/nova/instances/_base/4de41d75db172155181742a86136f749ab140bce
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false  ### see man qemu-img for the meaning of these fields

I will not go further into the principles or the related QMP commands; the link above and the docs in the source tree explain them better than I can, so read those. Now for the code flow analysis; I only know the what, not always the why, so bear with me. (Note the QAPI code generation step described above, otherwise you will not find the relevant sources; after ./configure && make the generated files look roughly like the following, for reference only and possibly incomplete.)

[root@linux qemu]# ll | grep "Dec 12" | egrep "\.h|\.c" | egrep "qapi|qmp"
-rw-r--r--  1 root root   33737 Dec 12 12:03 qapi-event.c
-rw-r--r--  1 root root    5580 Dec 12 12:03 qapi-event.h
-rw-r--r--  1 root root  108821 Dec 12 12:03 qapi-types.c
-rw-r--r--  1 root root  163180 Dec 12 12:03 qapi-types.h
-rw-r--r--  1 root root  471588 Dec 12 12:03 qapi-visit.c
-rw-r--r--  1 root root   78987 Dec 12 12:03 qapi-visit.h
-rw-r--r--  1 root root   25191 Dec 12 12:03 qmp-commands.h
-rw-r--r--  1 root root  126134 Dec 12 12:03 qmp-introspect.c
-rw-r--r--  1 root root     364 Dec 12 12:03 qmp-introspect.h
-rw-r--r--  1 root root  178626 Dec 12 12:03 qmp-marshal.c

qemu is a standalone process, so to build an executable binary there has to be a main function somewhere. If you do not know where the entry point is, the brute-force approach works:

[root@linux qemu]# grep -no  main\(  *.c
qemu-bridge-helper.c:215:main(
qemu-img.c:4684:main(
qemu-io.c:444:main(
qemu-keymap.c:145:main(
qemu-nbd.c:503:main(
vl.c:38:main(
vl.c:39:main(
vl.c:41:main(
vl.c:3091:main(   ### by process of elimination, the entry point is here

Of course, you can also build qemu in debug mode and step through it with gdb, which makes it easy to locate the entry point.

In fact I worked backwards from the "drive-mirror" command to reconstruct the code flow, because the qemu code base is so large that I did not know where else to start (so the analysis below was traced in reverse, but the result should be the same, and it is faster). C functions typically start and end with variable initialization, string copies, pointer and memory housekeeping, plus argument checks, so the main logic is usually in the middle of the function.

int main(int argc, char **argv, char **envp) 
{
    ......
    monitor_init_qmp_commands();  // initialize the QMP commands supported by the qemu monitor
    ......
}
void monitor_init_qmp_commands(void)
{
    /*
     * Two command lists:
     * - qmp_commands contains all QMP commands
     * - qmp_cap_negotiation_commands contains just
     *   "qmp_capabilities", to enforce capability negotiation
     */

    qmp_init_marshal(&qmp_commands); // here

    qmp_register_command(&qmp_commands, "query-qmp-schema",
                         qmp_query_qmp_schema,
                         QCO_NO_OPTIONS);
    ......
}
void qmp_init_marshal(QmpCommandList *cmds)
{
    QTAILQ_INIT(cmds);

    qmp_register_command(cmds, "add-fd",
                         qmp_marshal_add_fd, QCO_NO_OPTIONS);
    ......
    qmp_register_command(cmds, "drive-mirror",  // registered here; the callback is qmp_marshal_drive_mirror
                         qmp_marshal_drive_mirror, QCO_NO_OPTIONS);
    ......
}

void qmp_marshal_drive_mirror(QDict *args, QObject **ret, Error **errp)
{
    ......
    qmp_drive_mirror(&arg, &err);  // this is the main path
    ......
}
void qmp_drive_mirror(DriveMirror *arg, Error **errp)
{
    BlockDriverState *bs;
    BlockDriverState *source, *target_bs;
    AioContext *aio_context;
    ......
    bdrv_set_aio_context(target_bs, aio_context);

    blockdev_mirror_common(arg->has_job_id ? arg->job_id : NULL, bs, target_bs,
                           arg->has_replaces, arg->replaces, arg->sync,
                           backing_mode, arg->has_speed, arg->speed,
                           arg->has_granularity, arg->granularity,
                           arg->has_buf_size, arg->buf_size,
                           arg->has_on_source_error, arg->on_source_error,
                           arg->has_on_target_error, arg->on_target_error,
                           arg->has_unmap, arg->unmap,
                           false, NULL,
                           &local_err); // main path
    bdrv_unref(target_bs);
    error_propagate(errp, local_err);
out:
    aio_context_release(aio_context);
}
/* Parameter check and block job starting for drive mirroring.
 * Caller should hold @device and @target's aio context (must be the same).
 **/
static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
                                   BlockDriverState *target,
                                   bool has_replaces, const char *replaces,
                                   enum MirrorSyncMode sync,
                                   BlockMirrorBackingMode backing_mode,
                                   bool has_speed, int64_t speed,
                                   bool has_granularity, uint32_t granularity,
                                   bool has_buf_size, int64_t buf_size,
                                   bool has_on_source_error,
                                   BlockdevOnError on_source_error,
                                   bool has_on_target_error,
                                   BlockdevOnError on_target_error,
                                   bool has_unmap, bool unmap,
                                   bool has_filter_node_name,
                                   const char *filter_node_name,
                                   Error **errp)
{
    ......
    /* pass the node name to replace to mirror start since it's loose coupling
     * and will allow to check whether the node still exist at mirror completion
     */
    mirror_start(job_id, bs, target,
                 has_replaces ? replaces : NULL,
                 speed, granularity, buf_size, sync, backing_mode,
                 on_source_error, on_target_error, unmap, filter_node_name,
                 errp);   // main path
}
void mirror_start(const char *job_id, BlockDriverState *bs,
                  BlockDriverState *target, const char *replaces,
                  int64_t speed, uint32_t granularity, int64_t buf_size,
                  MirrorSyncMode mode, BlockMirrorBackingMode backing_mode,
                  BlockdevOnError on_source_error,
                  BlockdevOnError on_target_error,
                  bool unmap, const char *filter_node_name, Error **errp)
{
    bool is_none_mode;
    BlockDriverState *base;

    if (mode == MIRROR_SYNC_MODE_INCREMENTAL) {
        error_setg(errp, "Sync mode 'incremental' not supported");
        return;
    }
    is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
    base = mode == MIRROR_SYNC_MODE_TOP ? backing_bs(bs) : NULL;
    mirror_start_job(job_id, bs, BLOCK_JOB_DEFAULT, target, replaces,
                     speed, granularity, buf_size, backing_mode,
                     on_source_error, on_target_error, unmap, NULL, NULL,
                     &mirror_job_driver, is_none_mode, base, false,
                     filter_node_name, true, errp);
}
static void mirror_start_job(const char *job_id, BlockDriverState *bs,
                             int creation_flags, BlockDriverState *target,
                             const char *replaces, int64_t speed,
                             uint32_t granularity, int64_t buf_size,
                             BlockMirrorBackingMode backing_mode,
                             BlockdevOnError on_source_error,
                             BlockdevOnError on_target_error,
                             bool unmap,
                             BlockCompletionFunc *cb,
                             void *opaque,
                             const BlockJobDriver *driver,
                             bool is_none_mode, BlockDriverState *base,
                             bool auto_complete, const char *filter_node_name,
                             bool is_mirror,
                             Error **errp)
{
    ......
    // this function gets fairly involved; it does different preparation work depending on its arguments, but we still have not reached the key part
}

The rest of the flow is roughly as follows (why stop here? because I have never studied the core block data handling path and honestly do not understand it... I will leave that to the experts):

blockjob.c:void block_job_start -> block.c:void bdrv_coroutine_enter -> util/async.c:void aio_co_enter -> util/qemu-coroutine.c:void qemu_aio_coroutine_enter -> util/coroutine-sigaltstack.c:CoroutineAction qemu_coroutine_switch

Understanding this core code really requires a solid grasp of how block devices are implemented underneath, and my foundation there is basically zero; I need to make that up later (and not just this part, my knowledge of the whole kernel side is close to zero, since I have never done kernel work).

The three pillars of low-level virtualization, compute, storage and networking, are all areas I need to catch up on.

If you read all the way to the end, you are a truly devoted coder; here is a small bonus for you (references):