In the previous post I analyzed a data-loss problem in nova live snapshot, but only half-understood the underlying implementation. This article continues that work, digging deeper into the underlying mechanics and principles (fair warning: it starts strong and trails off).
libvirt
First, a question: what is libvirt actually for?
I wrote a short document about this long ago, covering the relationship between libvirt, qemu, and KVM. My take: libvirt is a virtualization adapter layer, an adapter over various hypervisor backends. Officially it supports KVM, QEMU, Xen, Virtuozzo, VMware ESX, LXC, BHyve and more, but qemu is the primary target; support for the others (Xen, LXC, and so on) covers the basics while advanced features lag, so the vast majority of users run qemu, and Xen keeps losing ground. I have used libvirt+LXC before and it was not great either; with everyone on docker now, LXC is past its prime. I have not used VMware ESX so I will not comment, though nova implements its own ESX driver rather than going through the libvirt driver, so I suspect libvirt's ESX support is also incomplete; VMware's own client and API are good enough that there is little reason to wrap them in libvirt. The libvirt API itself is C, but SDKs are provided for several languages (Python, Java, and others), with Python probably the most widely used today.
libvirt is open source, so it is currently the first-choice virtualization adapter layer for OpenStack and other open-source IaaS platforms, and even for closed-source in-house clouds. libvirt predates OpenStack by a long way and has always been Red Hat's home turf. More information at the official site: https://libvirt.org/
Code flow
The analysis is based on master (HEAD commit: 07adbd4b1f82a9f09584dfa5fb6ca9063bd24bd0).
The previous post noted that nova calls dev.rebase(), which goes through python-libvirt (the Python binding of the libvirt API) to libvirt's virDomainBlockRebase. The source lives in src/libvirt-domain.c:
```c
/**
 * virDomainBlockRebase:
 * @dom: pointer to domain object
 * @disk: path to the block device, or device shorthand
 * @base: path to backing file to keep, or device shorthand,
 *        or NULL for no backing file
 * @bandwidth: (optional) specify bandwidth limit; flags determine the unit
 * @flags: bitwise-OR of virDomainBlockRebaseFlags
 *
 * Populate a disk image with data from its backing image chain, and
 * setting the backing image to @base, or alternatively copy an entire
 * backing chain to a new file @base.
 *
 * When @flags is 0, this starts a pull, where @base must be the absolute
 * path of one of the backing images further up the chain, or NULL to
 * convert the disk image so that it has no backing image.  Once all
 * data from its backing image chain has been pulled, the disk no
 * longer depends on those intermediate backing images.  This function
 * pulls data for the entire device in the background.  Progress of
 * the operation can be checked with virDomainGetBlockJobInfo() with a
 * job type of VIR_DOMAIN_BLOCK_JOB_TYPE_PULL, and the operation can be
 * aborted with virDomainBlockJobAbort().  When finished, an asynchronous
 * event is raised to indicate the final status, and the job no longer
 * exists.  If the job is aborted, a new one can be started later to
 * resume from the same point.
 *
 * If @flags contains VIR_DOMAIN_BLOCK_REBASE_RELATIVE, the name recorded
 * into the active disk as the location for @base will be kept relative.
 * The operation will fail if libvirt can't infer the name.
 *
 * When @flags includes VIR_DOMAIN_BLOCK_REBASE_COPY, this starts a copy,
 * where @base must be the name of a new file to copy the chain to.  By
 * default, the copy will pull the entire source chain into the destination
 * file, but if @flags also contains VIR_DOMAIN_BLOCK_REBASE_SHALLOW, then
 * only the top of the source chain will be copied (the source and
 * destination have a common backing file).  By default, @base will be
 * created with the same file format as the source, but this can be altered
 * by adding VIR_DOMAIN_BLOCK_REBASE_COPY_RAW to force the copy to be raw
 * (does not make sense with the shallow flag unless the source is also raw),
 * or by using VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT to reuse an existing file
 * which was pre-created with the correct format and metadata and sufficient
 * size to hold the copy. In case the VIR_DOMAIN_BLOCK_REBASE_SHALLOW flag
 * is used the pre-created file has to exhibit the same guest visible contents
 * as the backing file of the original image. This allows a management app to
 * pre-create files with relative backing file names, rather than the default
 * of absolute backing file names; as a security precaution, you should
 * generally only use reuse_ext with the shallow flag and a non-raw
 * destination file.  By default, the copy destination will be treated as
 * type='file', but using VIR_DOMAIN_BLOCK_REBASE_COPY_DEV treats the
 * destination as type='block' (affecting how virDomainGetBlockInfo() will
 * report allocation after pivoting).
 *
 * A copy job has two parts; in the first phase, the @bandwidth parameter
 * affects how fast the source is pulled into the destination, and the job
 * can only be canceled by reverting to the source file; progress in this
 * phase can be tracked via the virDomainBlockJobInfo() command, with a
 * job type of VIR_DOMAIN_BLOCK_JOB_TYPE_COPY.  The job transitions to the
 * second phase when the job info states cur == end, and remains alive to
 * mirror all further changes to both source and destination.  The user
 * must call virDomainBlockJobAbort() to end the mirroring while choosing
 * whether to revert to source or pivot to the destination.  An event is
 * issued when the job ends, and depending on the hypervisor, an event may
 * also be issued when the job transitions from pulling to mirroring.  If
 * the job is aborted, a new job will have to start over from the beginning
 * of the first phase.
 *
 * Some hypervisors will restrict certain actions, such as virDomainSave()
 * or virDomainDetachDevice(), while a copy job is active; they may
 * also restrict a copy job to transient domains.
 *
 * The @disk parameter is either an unambiguous source name of the
 * block device (the <source file='...'/> sub-element, such as
 * "/path/to/image"), or the device target shorthand (the
 * <target dev='...'/> sub-element, such as "vda").  Valid names
 * can be found by calling virDomainGetXMLDesc() and inspecting
 * elements within //domain/devices/disk.
 *
 * The @base parameter can be either a path to a file within the backing
 * chain, or the device target shorthand (the <target dev='...'/>
 * sub-element, such as "vda") followed by an index to the backing chain
 * enclosed in square brackets. Backing chain indexes can be found by
 * inspecting //disk//backingStore/@index in the domain XML. Thus, for
 * example, "vda[3]" refers to the backing store with index equal to "3"
 * in the chain of disk "vda".
 *
 * The maximum bandwidth that will be used to do the copy can be
 * specified with the @bandwidth parameter.  If set to 0, there is no
 * limit.  If @flags includes VIR_DOMAIN_BLOCK_REBASE_BANDWIDTH_BYTES,
 * @bandwidth is in bytes/second; otherwise, it is in MiB/second.
 * Values larger than 2^52 bytes/sec may be rejected due to overflow
 * considerations based on the word size of both client and server,
 * and values larger than 2^31 bytes/sec may cause overflow problems
 * if later queried by virDomainGetBlockJobInfo() without scaling.
 * Hypervisors may further restrict the range of valid bandwidth
 * values.  Some hypervisors do not support this feature and will
 * return an error if bandwidth is not 0; in this case, it might still
 * be possible for a later call to virDomainBlockJobSetSpeed() to
 * succeed.  The actual speed can be determined with
 * virDomainGetBlockJobInfo().
 *
 * When @base is NULL and @flags is 0, this is identical to
 * virDomainBlockPull().  When @flags contains VIR_DOMAIN_BLOCK_REBASE_COPY,
 * this command is shorthand for virDomainBlockCopy() where the destination
 * XML encodes @base as a <disk type='file'>, @bandwidth is properly scaled
 * and passed as a typed parameter, the shallow and reuse external flags
 * are preserved, and remaining flags control whether the XML encodes a
 * destination format of raw instead of leaving the destination identical
 * to the source format or probed from the reused file.
 *
 * Returns 0 if the operation has started, -1 on failure.
 */
int
virDomainBlockRebase(virDomainPtr dom, const char *disk,
                     const char *base, unsigned long bandwidth,
                     unsigned int flags)
{
    virConnectPtr conn;

    VIR_DOMAIN_DEBUG(dom, "disk=%s, base=%s, bandwidth=%lu, flags=0x%x",
                     disk, NULLSTR(base), bandwidth, flags);

    virResetLastError();

    virCheckDomainReturn(dom, -1);
    conn = dom->conn;

    virCheckReadOnlyGoto(conn->flags, error);
    virCheckNonNullArgGoto(disk, error);

    if (flags & VIR_DOMAIN_BLOCK_REBASE_COPY) {
        virCheckNonNullArgGoto(base, error);
    } else if (flags & (VIR_DOMAIN_BLOCK_REBASE_SHALLOW |
                        VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT |
                        VIR_DOMAIN_BLOCK_REBASE_COPY_RAW |
                        VIR_DOMAIN_BLOCK_REBASE_COPY_DEV)) {
        virReportInvalidArg(flags, "%s",
                            _("use of flags requires a copy job"));
        goto error;
    }

    /* conn->driver is the qemu driver here.  I will skip how each driver
     * gets registered (plenty of write-ups online); the entry point is
     * daemon/libvirtd.c: VIR_DAEMON_LOAD_MODULE(qemuRegister, "qemu");
     * Which driver is used depends on the URI given when the connection
     * was opened, e.g. `virsh -c qemu:///system` means connecting to
     * libvirtd's qemu driver over its unix domain socket. */
    if (conn->driver->domainBlockRebase) {  /* does the backend implement it? */
        int ret;
        ret = conn->driver->domainBlockRebase(dom, disk, base,
                                              bandwidth, flags);  /* call the driver's rebase */
        if (ret < 0)
            goto error;
        return ret;
    }

    virReportUnsupportedError();

 error:
    virDispatchError(dom->conn);
    return -1;
}
```
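To isolate the flag sanity checks buried at the end of the listing above, here is a small Python re-statement of just that validation logic. The enum values mirror libvirt's virDomainBlockRebaseFlags; the helper name validate_rebase_args is my own:

```python
# Values mirror libvirt's virDomainBlockRebaseFlags enum.
VIR_DOMAIN_BLOCK_REBASE_SHALLOW         = 1 << 0
VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT       = 1 << 1
VIR_DOMAIN_BLOCK_REBASE_COPY_RAW        = 1 << 2
VIR_DOMAIN_BLOCK_REBASE_COPY            = 1 << 3
VIR_DOMAIN_BLOCK_REBASE_RELATIVE        = 1 << 4
VIR_DOMAIN_BLOCK_REBASE_COPY_DEV        = 1 << 5
VIR_DOMAIN_BLOCK_REBASE_BANDWIDTH_BYTES = 1 << 6

def validate_rebase_args(disk, base, flags):
    """Re-states the argument checks at the top of virDomainBlockRebase."""
    if disk is None:
        raise ValueError("disk must be non-NULL")
    if flags & VIR_DOMAIN_BLOCK_REBASE_COPY:
        # A copy job needs a destination file name.
        if base is None:
            raise ValueError("base must be non-NULL for a copy job")
    elif flags & (VIR_DOMAIN_BLOCK_REBASE_SHALLOW |
                  VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT |
                  VIR_DOMAIN_BLOCK_REBASE_COPY_RAW |
                  VIR_DOMAIN_BLOCK_REBASE_COPY_DEV):
        # These flags only make sense together with COPY.
        raise ValueError("use of flags requires a copy job")
```

Note that RELATIVE and BANDWIDTH_BYTES are intentionally absent from the second check: they are valid for both pull and copy jobs.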
The comment above is quite thorough; in fact the API reference documentation is generated from these very comments: https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBlockRebase
nova calls blockRebase as dev.rebase(disk_delta, copy=True, reuse_ext=True, shallow=True), which ends up as self._guest._domain.blockRebase(self._disk, base, self.REBASE_DEFAULT_BANDWIDTH, flags=flags). The corresponding libvirt API flags are libvirt.VIR_DOMAIN_BLOCK_REBASE_SHALLOW, libvirt.VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT and libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY; self._disk is the file path of the system disk (vda), base is the snapshot file path, followed by the bandwidth limit (nova defaults to 0, unlimited) and the flags.
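For reference, the bitmask nova ends up passing down is just the OR of those three constants; a quick sketch (constant values copied from libvirt's public header, names as exported by libvirt-python):

```python
# Constant values as defined in libvirt's virDomainBlockRebaseFlags enum.
VIR_DOMAIN_BLOCK_REBASE_SHALLOW   = 1  # copy only the top image of the chain
VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT = 2  # reuse a pre-created destination file
VIR_DOMAIN_BLOCK_REBASE_COPY      = 8  # start a copy job rather than a pull

# What dev.rebase(disk_delta, copy=True, reuse_ext=True, shallow=True)
# boils down to before calling domain.blockRebase():
flags = (VIR_DOMAIN_BLOCK_REBASE_SHALLOW |
         VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT |
         VIR_DOMAIN_BLOCK_REBASE_COPY)
print(hex(flags))  # 0xb
```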
So, roughly, this API rebases the vda system disk's data onto the snapshot file, after which virDomainGetBlockJobInfo can be used to poll the rebase job's progress. That is exactly what nova does, except an earlier version of that check had a bug that mis-reported progress: a job could be considered finished before it had even started. The docs say: "When @flags contains VIR_DOMAIN_BLOCK_REBASE_COPY, this command is shorthand for virDomainBlockCopy()", in other words virDomainBlockCopy achieves the same thing, which explains why my search for a virsh blockrebase subcommand came up empty; there are only blockcopy and blockpull. The virDomainBlockCopy documentation is therefore the more detailed reference: https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBlockCopy
The next stop is the qemu driver directory, src/qemu/qemu_driver.c (at the very end of that file sits qemuRegister, the entry point that registers the qemu driver; the qemuHypervisorDriver struct maps each libvirt API to its qemu implementation, worth browsing if you are curious):
```c
/* conn->driver->domainBlockRebase is a function pointer registered at
 * libvirtd startup; this is the function it points to. */
static int
qemuDomainBlockRebase(virDomainPtr dom, const char *path, const char *base,
                      unsigned long bandwidth, unsigned int flags)
{
    virQEMUDriverPtr driver = dom->conn->privateData;
    virDomainObjPtr vm;
    int ret = -1;
    unsigned long long speed = bandwidth;
    virStorageSourcePtr dest = NULL;

    virCheckFlags(VIR_DOMAIN_BLOCK_REBASE_SHALLOW |
                  VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT |
                  VIR_DOMAIN_BLOCK_REBASE_COPY |
                  VIR_DOMAIN_BLOCK_REBASE_COPY_RAW |
                  VIR_DOMAIN_BLOCK_REBASE_RELATIVE |
                  VIR_DOMAIN_BLOCK_REBASE_COPY_DEV |
                  VIR_DOMAIN_BLOCK_REBASE_BANDWIDTH_BYTES, -1);

    if (!(vm = qemuDomObjFromDomain(dom)))
        return -1;

    if (virDomainBlockRebaseEnsureACL(dom->conn, vm->def) < 0)
        goto cleanup;

    /* For normal rebase (enhanced blockpull), the common code handles
     * everything, including vm cleanup. */
    if (!(flags & VIR_DOMAIN_BLOCK_REBASE_COPY))
        return qemuDomainBlockPullCommon(driver, vm, path, base,
                                         bandwidth, flags);

    /* If we got here, we are doing a block copy rebase. */
    if (VIR_ALLOC(dest) < 0)
        goto cleanup;
    dest->type = (flags & VIR_DOMAIN_BLOCK_REBASE_COPY_DEV) ?
        VIR_STORAGE_TYPE_BLOCK : VIR_STORAGE_TYPE_FILE;
    if (VIR_STRDUP(dest->path, base) < 0)
        goto cleanup;
    if (flags & VIR_DOMAIN_BLOCK_REBASE_COPY_RAW)
        dest->format = VIR_STORAGE_FILE_RAW;

    /* Convert bandwidth MiB to bytes, if necessary */
    if (!(flags & VIR_DOMAIN_BLOCK_REBASE_BANDWIDTH_BYTES)) {
        if (speed > LLONG_MAX >> 20) {
            virReportError(VIR_ERR_OVERFLOW,
                           _("bandwidth must be less than %llu"),
                           LLONG_MAX >> 20);
            goto cleanup;
        }
        speed <<= 20;
    }

    /* XXX: If we are doing a shallow copy but not reusing an external
     * file, we should attempt to pre-create the destination with a
     * relative backing chain instead of qemu's default of absolute */
    if (flags & VIR_DOMAIN_BLOCK_REBASE_RELATIVE) {
        virReportError(VIR_ERR_ARGUMENT_UNSUPPORTED, "%s",
                       _("Relative backing during copy not supported yet"));
        goto cleanup;
    }

    /* We rely on the fact that VIR_DOMAIN_BLOCK_REBASE_SHALLOW
     * and VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT map to the same values
     * as for block copy. */
    flags &= (VIR_DOMAIN_BLOCK_REBASE_SHALLOW |
              VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT);

    /* With the flags nova passes in we end up here, the very same helper
     * that virDomainBlockCopy calls. */
    ret = qemuDomainBlockCopyCommon(vm, dom->conn, path, dest,
                                    speed, 0, 0, flags, true);
    dest = NULL;

 cleanup:
    virDomainObjEndAPI(&vm);
    virStorageSourceFree(dest);
    return ret;
}
```
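The MiB-to-bytes conversion with its overflow guard is a nice self-contained detail of the listing above; a Python transliteration of just that branch (the function name is mine; LLONG_MAX as in C's limits.h):

```python
LLONG_MAX = 2**63 - 1  # C's long long maximum on a 64-bit build

def scale_bandwidth(speed, bandwidth_in_bytes):
    """Mirror the MiB -> bytes conversion in qemuDomainBlockRebase."""
    if bandwidth_in_bytes:
        return speed                 # caller already passed bytes/s
    if speed > LLONG_MAX >> 20:      # would overflow after the shift
        raise OverflowError("bandwidth must be less than %d" % (LLONG_MAX >> 20))
    return speed << 20               # MiB/s -> bytes/s
```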
The code path from here is fairly long, so I will not trace it step by step; in essence it prepares the JSON message to be sent to the qemu monitor's unix domain socket. The rough flow:
src/qemu/qemu_driver.c:qemuDomainBlockCopyCommon -> src/qemu/qemu_monitor.c:qemuMonitorDriveMirror -> src/qemu/qemu_monitor_json.c:qemuMonitorJSONDriveMirror -> src/qemu/qemu_monitor_json.c:qemuMonitorJSONMakeCommand -> src/qemu/qemu_monitor_json.c:qemuMonitorJSONCommand -> src/qemu/qemu_monitor_json.c:qemuMonitorJSONCommandWithFd -> src/qemu/qemu_monitor.c:qemuMonitorSend. The body of qemuMonitorSend is simple, yet without a debugger I could not see at first where the JSON request is actually sent to the qemu monitor; by elimination it seemed to be qemuMonitorUpdateWatch, but that function hardly does anything, so only breakpoints and single-stepping could settle it. I later debugged this part in another post and got the picture: the work happens in the qemuMonitorIO event callback. Once the monitor socket becomes writable, qemuMonitorIOWrite writes the message out to the qemu monitor socket (reads follow the same pattern); qemuMonitorUpdateWatch merely updates the set of events the monitor callback is watching for.
From what I found online, the JSON string ultimately sent to the qemu monitor looks something like this (the command used was virsh blockcopy rhel7f vda --dest /var/lib/libvirt/images/f.img; job progress can be checked with virsh blockjob rhel7f vda):
```json
{
    "execute": "drive-mirror",
    "arguments": {
        "device": "drive-virtio-disk0",
        "target": "/var/lib/libvirt/images/f1.img",
        "speed": 0,
        "sync": "full",
        "mode": "absolute-paths",
        "format": "raw"
    },
    "id": "libvirt-96"
}
```
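qemuMonitorJSONMakeCommand essentially wraps a command name and its arguments into the envelope shown above and stamps it with a monotonically increasing "libvirt-N" id. A hypothetical Python sketch of that wrapping (the helper name and counter are mine, not libvirt code):

```python
import itertools
import json

_ids = itertools.count(96)  # monotonically increasing request ids

def make_qmp_command(execute, **arguments):
    """Build a QMP request like the drive-mirror message shown above."""
    cmd = {"execute": execute,
           "arguments": arguments,
           "id": "libvirt-%d" % next(_ids)}
    return json.dumps(cmd)

msg = make_qmp_command("drive-mirror",
                       device="drive-virtio-disk0",
                       target="/var/lib/libvirt/images/f1.img",
                       speed=0, sync="full",
                       mode="absolute-paths", format="raw")
```

The "id" lets libvirt match an asynchronous reply on the monitor socket back to the request that triggered it.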
This looks very much like the qemu-guest-agent commands, and for good reason: both follow the same QMP protocol.
Verifying with virsh:
```
# Run `virsh undefine c9cfbc44-1b4c-41bd-bdc0-1cc5f7acb659` first, otherwise this fails with:
# error: Requested operation is not valid: domain is not transient
[root@vs2-compute-84 ~]# virsh blockcopy c9cfbc44-1b4c-41bd-bdc0-1cc5f7acb659 vda --dest xxx.qcow2
# Without an absolute --dest path, CentOS 7 drops the file under /;
# better to always give an absolute path.
[root@vs2-compute-84 ~]# virsh
Welcome to virsh, the virtualization interactive terminal.

Type:  'help' for help with commands
       'quit' to quit

virsh # blockjob c9cfbc44-1b4c-41bd-bdc0-1cc5f7acb659 vda
Block Copy: [ 13 %]

virsh # blockjob c9cfbc44-1b4c-41bd-bdc0-1cc5f7acb659 vda
Block Copy: [ 35 %]

virsh # blockjob c9cfbc44-1b4c-41bd-bdc0-1cc5f7acb659 vda
No current block job for vda    ### job finished
```
```xml
[root@vs2-compute-84 ~]# virsh dumpxml c9cfbc44-1b4c-41bd-bdc0-1cc5f7acb659
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
  <source protocol='rbd' name='pool-8084ad417b504b1e837147273715a27a/c9cfbc44-1b4c-41bd-bdc0-1cc5f7acb659_disk'>
    <host name='192.168.66.81' port='6789'/>
    <host name='192.168.66.82' port='6789'/>
    <host name='192.168.66.83' port='6789'/>
  </source>
  <backingStore/>
  <!-- nova adds an extra check: once blockjob reports the job complete,
       it confirms with disk.mirror.ready == 'yes' before treating the
       block job as finished -->
  <mirror type='file' file='xxx.qcow2' format='raw' job='copy'>
    <format type='raw'/>
    <source file='xxx.qcow2'/>
  </mirror>
  <target dev='vda' bus='virtio'/>
  <alias name='virtio-disk0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</disk>
```
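nova's double-check mentioned in the comment above can be sketched as follows: a job counts as complete only when the job info reports cur == end (with end != 0, to dodge the not-yet-started false positive) and the disk's mirror element reports ready='yes'. The FakeDomain class is purely illustrative, not nova's real code:

```python
class FakeDomain:
    """Stand-in for a libvirt domain, for illustration only."""
    def __init__(self, cur, end, mirror_ready):
        self._cur, self._end, self._ready = cur, end, mirror_ready

    def block_job_info(self, disk):
        # What virDomainGetBlockJobInfo would report for this disk.
        return {"cur": self._cur, "end": self._end}

    def mirror_ready(self, disk):
        # Parsed from <mirror ... ready='yes'> in the dumpxml output.
        return self._ready

def is_job_complete(dom, disk):
    info = dom.block_job_info(disk)
    # Guard against "0 == 0 before the job even starts" false positives,
    # then require the mirror element to confirm readiness as well.
    return (info["end"] != 0 and info["cur"] == info["end"]
            and dom.mirror_ready(disk))
```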
Summary
libvirt is ultimately just an adapter layer, a conduit that relays commands. You can run a VM without it, starting qemu from the command line and passing commands through the monitor, but that interface is unfriendly and tedious for upper-layer services to program against. libvirt solves this nicely: it wraps the various hypervisor backends behind a single API and describes VM configuration in XML, which makes for a much better user experience, and that is why it caught on. Next comes the qemu side of the code flow; it is considerably more complex and I only scratch the surface, but I will try to lay out the relevant path.
qemu
This part is based on master (HEAD commit: 2babfe0c9241c239272a03fec785165a50e8288c).
qemu's purpose needs little introduction: together with kernel-side KVM it emulates all the hardware devices except CPU and memory; you could say it implements the motherboard. The BIOS is provided by a separate component (such as seabios) that qemu merely consumes. Official site: https://www.qemu.org/ (this project, too, is largely Red Hat territory).
qemu build instructions: https://wiki.qemu.org/Hosts/Linux
How qapi, qmp and hmp relate
qapi is the low-level C interface; qmp is the JSON-based protocol wrapped around it; hmp is the human-friendly layer on top of qmp. The call chain is hmp -> qmp -> qapi, and the official docs state that management services such as libvirt should use the QMP interface.
QAPI intro: https://wiki.qemu.org/Features/QAPI
QMP intro: https://wiki.qemu.org/QMP
HMP intro (only a few sentences, still marked TODO): https://wiki.qemu.org/ToDo/HMP
QAPI code generation: docs/devel/qapi-code-gen.txt in the source tree, or https://people.cs.clemson.edu/~ccorsi/kyouko/docs/qapi-code-gen.txt
I find this code-generation machinery genuinely fascinating; I looked into it long ago, was not skilled enough at the time to make sense of it, and plan to revisit it when I get the chance.
The QAPI code is generated from a schema (the JSON files under the qapi directory). In particular, the structs and enum types declared in various .h files are auto-generated (as are the .h files themselves). The generators are the qapi*.py scripts under scripts/ in the qemu source root; usage is covered by the link above.
If you would rather not generate each source file by hand, just follow the build instructions above and let the build produce them.
qemu live block operations
The relevant documentation is docs/interop/live-block-operations.rst in the source tree; there is an online copy here: https://kashyapc.fedorapeople.org/virt/qemu/live-block-operations.html
I will not translate the whole document (not up to the task); just read the original. Here is its opening paragraph, which serves as an abstract:
QEMU Block Layer currently (as of QEMU 2.9) supports four major kinds of live block device jobs — stream, commit, mirror, and backup. These can be used to manipulate disk image chains to accomplish certain tasks, namely: live copy data from backing files into overlays; shorten long disk image chains by merging data from overlays into backing files; live synchronize data from a disk image chain (including current active disk) to another target image; point-in-time (and incremental) backups of a block device. Below is a description of the said block (QMP) primitives, and some (non-exhaustive list of) examples to illustrate their use.
Of these four, mirror is the job type nova live snapshot relies on.
backing file: the base data of an image. By analogy, think of it as a read-only live-CD system, with all writes landing in the overlay. Overlay blocks map one-to-one to backing-file blocks, but the overlay is typically sparse, so zero blocks that were never written consume no real storage, saving physical disk space. When the user modifies data that lives in the backing file, the affected blocks are first copied up into the overlay and then modified there as the user's file operations dictate; this is transparent to the user, and once a block has been copied up, all subsequent reads and updates of it go to the overlay and never touch the backing file again.
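The copy-on-write behaviour just described can be modelled in a few lines: reads fall through to the backing file until a block exists in the overlay, and every write lands in the overlay, leaving the backing file untouched. This is a toy model only; real qcow2 manages clusters and metadata quite differently:

```python
class CowImage:
    """Toy overlay + backing-file pair with block-level copy-on-write."""
    def __init__(self, backing):
        self.backing = backing   # read-only base image: {block_no: data}
        self.overlay = {}        # sparse: only blocks ever written exist here

    def read(self, block_no):
        if block_no in self.overlay:              # block already copied up
            return self.overlay[block_no]
        return self.backing.get(block_no, b"\0")  # unallocated reads as zeros

    def write(self, block_no, data):
        # All writes land in the overlay; the backing file stays intact.
        self.overlay[block_no] = data
```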
This is the signature feature of the qcow2 image format: qcow2 stands for "qemu copy on write", version 2, and the preceding paragraph already outlines how copy-on-write works. In nova, the backing file normally lives in the _base directory under the instances directory, while the overlay is the disk file in each instance's own directory. The backing file is usually raw and the overlay qcow2; qcow2 apparently works as a backing file too, but nova force-converts backing files to raw (I have not figured out why). Image properties can be inspected with qemu-img info, for example:
```
# Create an overlay on top of a base image:
#   qemu-img create -f qcow2 -b base.raw overlay.qcow2
[root@vs-controller b729b095-a302-4c0b-9a59-a8bd49ed393a]# qemu-img info disk
image: disk
file format: qcow2
virtual size: 200G (214748364800 bytes)
disk size: 246M
cluster_size: 65536
backing file: /var/lib/nova/instances/_base/4de41d75db172155181742a86136f749ab140bce
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
### see man qemu-img for what these fields mean
```
I will skip the detailed theory and the individual QMP commands; the link above and the in-tree docs explain them better than I can, so read those. Below is the code-flow analysis, and here too I know the what more than the why, so bear with me. (Note the QAPI code-generation step described earlier, otherwise you will not find these source files; after ./configure && make, the generated files look roughly like this, possibly with omissions:)
```
[root@linux qemu]# ll | grep "Dec 12" | egrep "\.h|\.c" | egrep "qapi|qmp"
-rw-r--r-- 1 root root  33737 Dec 12 12:03 qapi-event.c
-rw-r--r-- 1 root root   5580 Dec 12 12:03 qapi-event.h
-rw-r--r-- 1 root root 108821 Dec 12 12:03 qapi-types.c
-rw-r--r-- 1 root root 163180 Dec 12 12:03 qapi-types.h
-rw-r--r-- 1 root root 471588 Dec 12 12:03 qapi-visit.c
-rw-r--r-- 1 root root  78987 Dec 12 12:03 qapi-visit.h
-rw-r--r-- 1 root root  25191 Dec 12 12:03 qmp-commands.h
-rw-r--r-- 1 root root 126134 Dec 12 12:03 qmp-introspect.c
-rw-r--r-- 1 root root    364 Dec 12 12:03 qmp-introspect.h
-rw-r--r-- 1 root root 178626 Dec 12 12:03 qmp-marshal.c
```
qemu is a standalone process compiled into executable binaries, so there has to be a main() somewhere. If you do not know where the entry point is, the brute-force approach works:
```
[root@linux qemu]# grep -no main\( *.c
qemu-bridge-helper.c:215:main(
qemu-img.c:4684:main(
qemu-io.c:444:main(
qemu-keymap.c:145:main(
qemu-nbd.c:503:main(
vl.c:38:main(
vl.c:39:main(
vl.c:41:main(
vl.c:3091:main(     ### by elimination, the entry point is here
```
Of course you can also build qemu in debug mode and find the entry point easily under gdb.
In fact I traced the code path backwards by grepping for the "drive-mirror" command, since qemu is such a large project that I did not know where else to start (so the flow below was reconstructed in reverse, which works just as well and is faster). In C, the beginning and end of a function are usually occupied by variable initialization, string copies, pointer and allocation housekeeping, and argument checks, so the main logic generally sits in the middle.
```c
int main(int argc, char **argv, char **envp)
{
    ......
    monitor_init_qmp_commands();  /* register the QMP commands the qemu monitor supports */
    ......
}
```
```c
void monitor_init_qmp_commands(void)
{
    /*
     * Two command lists:
     * - qmp_commands contains all QMP commands
     * - qmp_cap_negotiation_commands contains just
     *   "qmp_capabilities", to enforce capability negotiation
     */

    qmp_init_marshal(&qmp_commands);  /* here */

    qmp_register_command(&qmp_commands, "query-qmp-schema",
                         qmp_query_qmp_schema, QCO_NO_OPTIONS);
    ......
}
```
```c
void qmp_init_marshal(QmpCommandList *cmds)
{
    QTAILQ_INIT(cmds);
    qmp_register_command(cmds, "add-fd",
                         qmp_marshal_add_fd, QCO_NO_OPTIONS);
    ......
    /* "drive-mirror" is registered here; the callback is qmp_marshal_drive_mirror */
    qmp_register_command(cmds, "drive-mirror",
                         qmp_marshal_drive_mirror, QCO_NO_OPTIONS);
    ......
}
```
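qmp_register_command above just appends a (name, callback) pair to a list that the monitor later dispatches incoming "execute" requests against. The same idea in a few lines of Python (all names here are mine, for illustration only):

```python
qmp_commands = {}

def qmp_register_command(name, handler):
    """Analogue of qemu's qmp_register_command(): map name -> marshaller."""
    qmp_commands[name] = handler

def qmp_dispatch(request):
    """Look up the 'execute' member and invoke the registered handler."""
    handler = qmp_commands.get(request["execute"])
    if handler is None:
        raise KeyError("command '%s' has not been found" % request["execute"])
    return handler(request.get("arguments", {}))

# A trivial handler standing in for qmp_marshal_drive_mirror:
qmp_register_command("drive-mirror",
                     lambda args: {"started": args["device"]})
```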
```c
void qmp_marshal_drive_mirror(QDict *args, QObject **ret, Error **errp)
{
    ......
    qmp_drive_mirror(&arg, &err);  /* the main path */
    ......
}
```
```c
void qmp_drive_mirror(DriveMirror *arg, Error **errp)
{
    BlockDriverState *bs;
    BlockDriverState *source, *target_bs;
    AioContext *aio_context;
    ......
    bdrv_set_aio_context(target_bs, aio_context);

    blockdev_mirror_common(arg->has_job_id ? arg->job_id : NULL, bs, target_bs,
                           arg->has_replaces, arg->replaces,
                           arg->sync, backing_mode,
                           arg->has_speed, arg->speed,
                           arg->has_granularity, arg->granularity,
                           arg->has_buf_size, arg->buf_size,
                           arg->has_on_source_error, arg->on_source_error,
                           arg->has_on_target_error, arg->on_target_error,
                           arg->has_unmap, arg->unmap,
                           false, NULL,
                           &local_err);  /* the main path */

    bdrv_unref(target_bs);
    error_propagate(errp, local_err);
out:
    aio_context_release(aio_context);
}
```
```c
/* Parameter check and block job starting for drive mirroring.
 * Caller should hold @device and @target's aio context (must be the same).
 **/
static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs,
                                   BlockDriverState *target,
                                   bool has_replaces, const char *replaces,
                                   enum MirrorSyncMode sync,
                                   BlockMirrorBackingMode backing_mode,
                                   bool has_speed, int64_t speed,
                                   bool has_granularity, uint32_t granularity,
                                   bool has_buf_size, int64_t buf_size,
                                   bool has_on_source_error,
                                   BlockdevOnError on_source_error,
                                   bool has_on_target_error,
                                   BlockdevOnError on_target_error,
                                   bool has_unmap, bool unmap,
                                   bool has_filter_node_name,
                                   const char *filter_node_name,
                                   Error **errp)
{
    ......
    /* pass the node name to replace to mirror start since it's loose coupling
     * and will allow to check whether the node still exist at mirror completion
     */
    mirror_start(job_id, bs, target,
                 has_replaces ? replaces : NULL,
                 speed, granularity, buf_size, sync, backing_mode,
                 on_source_error, on_target_error, unmap, filter_node_name,
                 errp);  /* the main path */
}
```
```c
void mirror_start(const char *job_id, BlockDriverState *bs,
                  BlockDriverState *target, const char *replaces,
                  int64_t speed, uint32_t granularity, int64_t buf_size,
                  MirrorSyncMode mode, BlockMirrorBackingMode backing_mode,
                  BlockdevOnError on_source_error,
                  BlockdevOnError on_target_error,
                  bool unmap, const char *filter_node_name, Error **errp)
{
    bool is_none_mode;
    BlockDriverState *base;

    if (mode == MIRROR_SYNC_MODE_INCREMENTAL) {
        error_setg(errp, "Sync mode 'incremental' not supported");
        return;
    }
    is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
    base = mode == MIRROR_SYNC_MODE_TOP ? backing_bs(bs) : NULL;
    mirror_start_job(job_id, bs, BLOCK_JOB_DEFAULT, target, replaces,
                     speed, granularity, buf_size, backing_mode,
                     on_source_error, on_target_error, unmap, NULL, NULL,
                     &mirror_job_driver, is_none_mode, base, false,
                     filter_node_name, true, errp);
}
```
```c
static void mirror_start_job(const char *job_id, BlockDriverState *bs,
                             int creation_flags, BlockDriverState *target,
                             const char *replaces, int64_t speed,
                             uint32_t granularity, int64_t buf_size,
                             BlockMirrorBackingMode backing_mode,
                             BlockdevOnError on_source_error,
                             BlockdevOnError on_target_error,
                             bool unmap,
                             BlockCompletionFunc *cb,
                             void *opaque,
                             const BlockJobDriver *driver,
                             bool is_none_mode, BlockDriverState *base,
                             bool auto_complete, const char *filter_node_name,
                             bool is_mirror,
                             Error **errp)
{
    ......
    /* This one is fairly involved: it does different preparation work
     * depending on its arguments, and the real crux still lies further down. */
}
```
The flow from here on is roughly the following (why stop writing? because I have never studied the core block data-handling path and do not understand it; I will leave that analysis to the experts):
blockjob.c:void block_job_start -> block.c:void bdrv_coroutine_enter -> util/async.c:void aio_co_enter -> util/qemu-coroutine.c:void qemu_aio_coroutine_enter -> util/coroutine-sigaltstack.c:CoroutineAction qemu_coroutine_switch
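Without going into qemu's coroutine internals, the overall shape of the job that block_job_start launches can be imitated with a Python generator: each iteration copies one chunk and yields a (cur, end) pair, which is essentially what a blockjob progress query reads back. Purely illustrative, not qemu's actual algorithm:

```python
def mirror_job(source, chunk=4):
    """Generator imitating a background mirror job's copy loop."""
    target = []
    cur, end = 0, len(source)
    while cur < end:
        target.extend(source[cur:cur + chunk])  # copy one chunk to the target
        cur = min(cur + chunk, end)
        yield cur, end        # progress, as a blockjob query would report it

source = list(range(10))
progress = list(mirror_job(source))  # drive the job to completion
```

The real job runs inside a coroutine scheduled by the AioContext event loop, so "yielding" back to the caller is exactly what qemu_coroutine_switch does at the machine level.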
To follow this core code you really need a solid understanding of the block device layer, and my foundation there is basically zero; something to make up for later (and not only this part: my kernel-side knowledge is close to zero too, never having done kernel work).
The three pillars of low-level virtualization, compute, storage, and networking, are all things I need to catch up on.
If you have read this far you are a true devotee; here is a small reward, the reference material: