Analysis of the nova-cinder interaction flow

Posted on 2015-06-16 by aspirer

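Original post: http://aspirer2004.blog.163.com/blog/static/106764720134755131463/

This post examines how cinder and nova interact, and analyzes how an in-house block storage system (the cloud disk service) could be integrated with nova.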

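1. Existing Nova APIs

The block device APIs that nova already supports are described in the Volume Attachments and Volume Extension to Compute sections of http://api.openstack.org/api-ref.html.

Operations (all delete operations are asynchronous; the caller must confirm the result by calling a query API):

Create a block device (including restoring one from a snapshot; the AZ of the block device can be specified; requires the user ID)
Delete a block device (requires the user ID and block device ID)
Attach a block device (requires the user ID, instance ID and block device ID)
Detach a block device (requires the user ID, instance ID and block device ID)
Snapshot a block device (requires the user ID and block device ID)
Delete a snapshot (requires the user ID and snapshot ID)

Queries:

List the block devices attached to an instance (requires the user ID and instance ID)
Query attachment details by instance ID and the ID of a block device attached to it (requires the user ID, instance ID and block device ID)
List all of a user's block devices (requires the user ID)
Query the details of one of a user's block devices by block device ID (requires the user ID and block device ID)
List all of a user's block device snapshots (requires the user ID)
Query the details of one of a user's block device snapshots (requires the user ID and snapshot ID)

API that needs to be added:

Extend (resize) API (we have experience adding APIs, so this should be fairly easy to implement)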

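2. Nova-Cinder interaction flow

Only two representative interactions are analyzed here.

2.1 Creating a block device through cinder

Creating a block device also covers restoring a block device from a snapshot.

API URL: POST http://localhost:8774/v1.1/{tenant_id}/os-volumes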

Request parameters

Parameter Description

tenant_id The unique identifier of the tenant or account.

volume_id The unique identifier for a volume.

Volume A partial representation of a volume that is used to create a volume.

Create Volume Request: JSON

{
    "volume": {
        "display_name": "vol-001",
        "display_description": "Another volume.",
        "size": 30,
        "volume_type": "289da7f8-6440-407c-9fb4-7db01ec49164",
        "metadata": {"contents": "junk"},
        "availability_zone": "us-east1"
    }
}

Create Volume Response: JSON

{
    "volume": {
        "id": "521752a6-acf6-4b2d-bc7a-119f9148cd8c",
        "display_name": "vol-001",
        "display_description": "Another volume.",
        "size": 30,
        "volume_type": "289da7f8-6440-407c-9fb4-7db01ec49164",
        "metadata": {"contents": "junk"},
        "availability_zone": "us-east1",
        "snapshot_id": null,
        "attachments": [],
        "created_at": "2012-02-14T20:53:07Z"
    }
}

# nova\api\openstack\compute\contrib\volumes.py:VolumeController.create()

@wsgi.serializers(xml=VolumeTemplate)
@wsgi.deserializers(xml=CreateDeserializer)
def create(self, req, body):
    """Creates a new volume."""
    context = req.environ['nova.context']
    authorize(context)

    if not self.is_valid_body(body, 'volume'):
        raise exc.HTTPUnprocessableEntity()

    vol = body['volume']

    # Volume type: not supported for now, so simply omit this parameter
    vol_type = vol.get('volume_type', None)
    if vol_type:
        try:
            vol_type = volume_types.get_volume_type_by_name(context,
                                                            vol_type)
        except exception.NotFound:
            raise exc.HTTPNotFound()

    metadata = vol.get('metadata', None)

    # To restore a volume from a snapshot, pass in the ID of that snapshot
    snapshot_id = vol.get('snapshot_id')
    if snapshot_id is not None:
        # Restoring a cloud disk from a snapshot requires this method;
        # self.volume_api is explained below
        snapshot = self.volume_api.get_snapshot(context, snapshot_id)
    else:
        snapshot = None

    size = vol.get('size', None)
    if size is None and snapshot is not None:
        size = snapshot['volume_size']

    LOG.audit(_("Create volume of %s GB"), size, context=context)

    # Availability zone of the volume
    availability_zone = vol.get('availability_zone', None)

    # The cloud disk service must implement this method;
    # self.volume_api is explained below
    new_volume = self.volume_api.create(context,
                                        size,
                                        vol.get('display_name'),
                                        vol.get('display_description'),
                                        snapshot=snapshot,
                                        volume_type=vol_type,
                                        metadata=metadata,
                                        availability_zone=availability_zone
                                        )

    # TODO(vish): Instance should be None at db layer instead of
    #             trying to lazy load, but for now we turn it into
    #             a dict to avoid an error.
    retval = _translate_volume_detail_view(context, dict(new_volume))
    result = {'volume': retval}

    location = '%s/%s' % (req.url, new_volume['id'])

    return wsgi.ResponseObject(result, headers=dict(location=location))

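# About self.volume_api

self.volume_api = volume.API()

Here volume is imported with "from nova import volume".

# nova\volume\__init__.py:

def API():
    importutils = nova.openstack.common.importutils
    cls = importutils.import_class(nova.flags.FLAGS.volume_api_class)
    return cls()

So every method called on self.volume_api is determined by the volume_api_class option. Its default is the nova-volume API wrapper class:

cfg.StrOpt('volume_api_class',
           default='nova.volume.api.API',
           help='The full class name of the volume API class to use'),

It can also be switched to cinder's API wrapper class by setting volume_api_class=nova.volume.cinder.API. The cinder wrapper class reaches cinder's API through the cinder_client library, which wraps the volume creation call. The cloud disk service could provide a similar client library, or call its existing API directly to achieve the same thing (cinder_client is itself just a wrapper around cinder API calls). The cloud disk service can use nova\volume\cinder.py as a reference when developing its own API wrapper class for NVS. Because the API already exists and only needs to be wrapped, the workload should be small; the point that needs attention is authentication.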
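The configuration-driven lookup described above can be illustrated with a minimal sketch. It is not nova's importutils; it only assumes that a dotted path names a module followed by a class, and it uses a standard-library class as a stand-in for a real volume API wrapper.

```python
import importlib


def import_class(import_str):
    """Return the class named by a dotted path such as 'package.module.ClassName'."""
    module_name, _sep, class_name = import_str.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def load_volume_api(volume_api_class):
    """Instantiate whichever wrapper class the configuration value names."""
    return import_class(volume_api_class)()


# Stand-in only: a standard-library class plays the role of the wrapper class.
api = load_volume_api("collections.OrderedDict")
print(type(api).__name__)
```

With this pattern, switching the backend is a configuration change only, which is why a new wrapper class for the cloud disk service would not require changes to the callers of self.volume_api.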

Snapshot operations and queries follow the same flow as above and can be implemented by following the pattern of nova\volume\cinder.py.
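As a rough illustration of such a wrapper, the sketch below exposes a few of the method names that nova calls on self.volume_api (create, get, check_attach, reserve_volume, unreserve_volume). Everything else is an assumption made for the example: FakeCloudDiskClient is a hypothetical in-memory stand-in for the cloud disk service's client, the status strings are illustrative, and ValueError replaces nova's own exception types.

```python
class FakeCloudDiskClient(object):
    """Hypothetical in-memory stand-in for the cloud disk service client."""

    def __init__(self):
        self._volumes = {}
        self._next_id = 1

    def create_volume(self, size, name, snapshot_id=None):
        volume = {"id": "vol-%d" % self._next_id, "size": size,
                  "display_name": name, "snapshot_id": snapshot_id,
                  "status": "available"}
        self._next_id += 1
        self._volumes[volume["id"]] = volume
        return volume

    def get_volume(self, volume_id):
        return self._volumes[volume_id]

    def set_status(self, volume_id, status):
        self._volumes[volume_id]["status"] = status


class CloudDiskAPI(object):
    """Hypothetical wrapper exposing the method names nova calls."""

    def __init__(self, client=None):
        self._client = client or FakeCloudDiskClient()

    def create(self, context, size, name, description, snapshot=None,
               **kwargs):
        snapshot_id = snapshot["id"] if snapshot else None
        return self._client.create_volume(size, name, snapshot_id)

    def get(self, context, volume_id):
        return self._client.get_volume(volume_id)

    def check_attach(self, context, volume):
        # Assumption: only volumes in the 'available' state may be attached.
        if volume["status"] != "available":
            raise ValueError("volume %s is not available" % volume["id"])

    def reserve_volume(self, context, volume):
        self._client.set_status(volume["id"], "attaching")

    def unreserve_volume(self, context, volume):
        self._client.set_status(volume["id"], "available")
```

A real wrapper would replace the fake client with authenticated HTTP calls, which is where the authentication concern mentioned above comes in.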

2.2 Attaching a block device through cinder

API URL: POST http://localhost:8774/v2/{tenant_id}/servers/{server_id}/os-volume_attachments

Request parameters

Parameter Description

tenant_id The ID for the tenant or account in a multi-tenancy cloud.

server_id The UUID for the server of interest to you.

volumeId ID of the volume to attach.

device Name of the device e.g. /dev/vdb. Use “auto” for autoassign (if supported).

volumeAttachment A dictionary representation of a volume attachment.

Attach Volume to Server Request: JSON

{
    "volumeAttachment": {
        "volumeId": volume_id,
        "device": device
    }
}

Attach Volume to Server Response: JSON

{
    "volumeAttachment": {
        "device": "/dev/vdd",
        "serverId": "fd783058-0e27-48b0-b102-a6b4d4057cac",
        "id": "5f800cf0-324f-4234-bc6b-e12d5816e962",
        "volumeId": "5f800cf0-324f-4234-bc6b-e12d5816e962"
    }
}

Note that this API returns synchronously, but the volume is attached to the virtual machine asynchronously.
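Because the attach completes after the API has returned, a caller has to poll a query API to learn the outcome. The following is a minimal sketch of such polling, not part of nova. It assumes a caller-supplied get_volume function that returns a dict with a "status" key, and it assumes that "in-use" is the status reported once the volume is attached.

```python
import time


def wait_for_attachment(get_volume, volume_id, timeout=60.0, interval=2.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll get_volume(volume_id) until the volume reports 'in-use'.

    get_volume is a hypothetical callable supplied by the caller; sleep and
    clock are injectable so the loop can be exercised without real delays.
    """
    deadline = clock() + timeout
    while True:
        volume = get_volume(volume_id)
        if volume["status"] == "in-use":
            return volume
        if clock() >= deadline:
            raise RuntimeError("volume %s not attached after %s seconds"
                               % (volume_id, timeout))
        sleep(interval)
```

The same loop works for the asynchronous delete operations listed in section 1, with a different terminal condition.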

# nova\api\openstack\compute\contrib\volumes.py:VolumeAttachmentController.create()

@wsgi.serializers(xml=VolumeAttachmentTemplate)
def create(self, req, server_id, body):
    """Attach a volume to an instance."""
    context = req.environ['nova.context']
    authorize(context)

    if not self.is_valid_body(body, 'volumeAttachment'):
        raise exc.HTTPUnprocessableEntity()

    volume_id = body['volumeAttachment']['volumeId']
    device = body['volumeAttachment'].get('device')

    msg = _("Attach volume %(volume_id)s to instance %(server_id)s"
            " at %(device)s") % locals()
    LOG.audit(msg, context=context)

    try:
        instance = self.compute_api.get(context, server_id)
        # nova-compute is responsible for attaching the volume to the VM
        device = self.compute_api.attach_volume(context, instance,
                                                volume_id, device)
    except exception.NotFound:
        raise exc.HTTPNotFound()

    # The attach is async
    attachment = {}
    attachment['id'] = volume_id
    attachment['serverId'] = server_id
    attachment['volumeId'] = volume_id
    attachment['device'] = device

    # NOTE(justinsb): And now, we have a problem...
    # The attach is async, so there's a window in which we don't see
    # the attachment (until the attachment completes). We could also
    # get problems with concurrent requests. I think we need an
    # attachment state, and to write to the DB here, but that's a bigger
    # change.
    # For now, we'll probably have to rely on libraries being smart

    # TODO(justinsb): How do I return "accepted" here?
    return {'volumeAttachment': attachment}

# nova\compute\api.py:API.attach_volume()

@wrap_check_policy
@check_instance_lock
def attach_volume(self, context, instance, volume_id, device=None):
    """Attach an existing volume to an existing instance."""
    # NOTE(vish): Fail fast if the device is not going to pass. This
    #             will need to be removed along with the test if we
    #             change the logic in the manager for what constitutes
    #             a valid device.
    if device and not block_device.match_device(device):
        raise exception.InvalidDevicePath(path=device)
    # NOTE(vish): This is done on the compute host because we want
    #             to avoid a race where two devices are requested at
    #             the same time. When db access is removed from
    #             compute, the bdm will be created here and we will
    #             have to make sure that they are assigned atomically.
    device = self.compute_rpcapi.reserve_block_device_name(
        context, device=device, instance=instance)
    try:
        # Method the cloud disk service must implement;
        # nova\volume\cinder.py can again serve as a reference
        volume = self.volume_api.get(context, volume_id)
        # Check whether the volume can be attached
        self.volume_api.check_attach(context, volume)
        # Reserve the volume to be attached, to prevent concurrent attaches
        self.volume_api.reserve_volume(context, volume)
        # Asynchronous RPC cast to the nova-compute service on the host
        # that runs the VM, which performs the attach
        self.compute_rpcapi.attach_volume(context, instance=instance,
                volume_id=volume_id, mountpoint=device)
    except Exception:
        with excutils.save_and_reraise_exception():
            self.db.block_device_mapping_destroy_by_instance_and_device(
                    context, instance['uuid'], device)

    # The API returns here
    return device

# nova\compute\manager.py:ComputeManager.attach_volume()

@exception.wrap_exception(notifier=notifier, publisher_id=publisher_id())
@reverts_task_state
@wrap_instance_fault
def attach_volume(self, context, volume_id, mountpoint, instance):
    """Attach a volume to an instance."""
    try:
        return self._attach_volume(context, volume_id,
                                   mountpoint, instance)
    except Exception:
        with excutils.save_and_reraise_exception():
            self.db.block_device_mapping_destroy_by_instance_and_device(
                    context, instance.get('uuid'), mountpoint)

def _attach_volume(self, context, volume_id, mountpoint, instance):
    # Same volume_api.get method as above
    volume = self.volume_api.get(context, volume_id)
    context = context.elevated()
    LOG.audit(_('Attaching volume %(volume_id)s to %(mountpoint)s'),
              locals(), context=context, instance=instance)
    try:
        # This returns the initiator information, analyzed below
        connector = self.driver.get_volume_connector(instance)
        # Method the cloud disk service must implement;
        # cinder's implementation is shown below
        connection_info = self.volume_api.initialize_connection(context,
                                                                volume,
                                                                connector)
    except Exception:  # pylint: disable=W0702
        with excutils.save_and_reraise_exception():
            msg = _("Failed to connect to volume %(volume_id)s "
                    "while attaching at %(mountpoint)s")
            LOG.exception(msg % locals(), context=context,
                          instance=instance)
            # This method must be implemented as well
            self.volume_api.unreserve_volume(context, volume)

    if 'serial' not in connection_info:
        connection_info['serial'] = volume_id

    try:
        self.driver.attach_volume(connection_info,
                                  instance['name'],
                                  mountpoint)
    except Exception:  # pylint: disable=W0702
        with excutils.save_and_reraise_exception():
            msg = _("Failed to attach volume %(volume_id)s "
                    "at %(mountpoint)s")
            LOG.exception(msg % locals(), context=context,
                          instance=instance)
            self.volume_api.terminate_connection(context,
                                                 volume,
                                                 connector)

    # This method must be implemented as well; it updates the volume
    # status in the cinder database
    self.volume_api.attach(context,
                           volume,
                           instance['uuid'],
                           mountpoint)
    values = {
        'instance_uuid': instance['uuid'],
        'connection_info': jsonutils.dumps(connection_info),
        'device_name': mountpoint,
        'delete_on_termination': False,
        'virtual_name': None,
        'snapshot_id': None,
        'volume_id': volume_id,
        'volume_size': None,
        'no_device': None}
    self.db.block_device_mapping_update_or_create(context, values)
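The order of volume_api calls and their rollbacks in _attach_volume is the part a cloud disk wrapper has to honor. The condensed sketch below restates only that ordering; it is an illustration rather than the nova code, it omits logging and the block device mapping update, and the volume_api and driver objects are whatever the caller passes in.

```python
def attach_with_rollback(volume_api, driver, context, volume, instance,
                         mountpoint):
    """Condensed restatement of the call order in _attach_volume."""
    try:
        connector = driver.get_volume_connector(instance)
        connection_info = volume_api.initialize_connection(
            context, volume, connector)
    except Exception:
        # The volume was reserved by the API layer, so release it again.
        volume_api.unreserve_volume(context, volume)
        raise

    connection_info.setdefault("serial", volume["id"])

    try:
        driver.attach_volume(connection_info, instance["name"], mountpoint)
    except Exception:
        # The connection was already exported, so tear it down.
        volume_api.terminate_connection(context, volume, connector)
        raise

    # Only after the hypervisor attach succeeds is the volume marked attached.
    volume_api.attach(context, volume, instance["uuid"], mountpoint)
    return connection_info
```

Read this way, the wrapper needs six methods for the attach path: get, check_attach, reserve_volume, unreserve_volume, initialize_connection, terminate_connection and attach are all called somewhere between the API layer and the compute manager.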

# nova\virt\libvirt\driver.py:LibvirtDriver.get_volume_connector()

def get_volume_connector(self, instance):
    if not self._initiator:
        self._initiator = libvirt_utils.get_iscsi_initiator()
        if not self._initiator:
            LOG.warn(_('Could not determine iscsi initiator name'),
                     instance=instance)
    return {
        'ip': FLAGS.my_ip,  # IP address of the host
        'initiator': self._initiator,
        'host': FLAGS.host  # hostname of the host
    }

# nova\virt\libvirt\utils.py:get_iscsi_initiator()

def get_iscsi_initiator():
    """Get iscsi initiator name for this machine"""
    # NOTE(vish) openiscsi stores initiator name in a file that
    #            needs root permission to read.
    contents = utils.read_file_as_root('/etc/iscsi/initiatorname.iscsi')
    for l in contents.split('\n'):
        if l.startswith('InitiatorName='):
            return l[l.index('=') + 1:].strip()
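The parsing step in get_iscsi_initiator() can be tried in isolation. The sketch below takes the file contents as a string instead of reading /etc/iscsi/initiatorname.iscsi as root; the function name and the sample IQN are made up for the example.

```python
def parse_iscsi_initiator(contents):
    """Return the value of the InitiatorName= line, or None if absent."""
    for line in contents.split("\n"):
        if line.startswith("InitiatorName="):
            return line[line.index("=") + 1:].strip()
    return None


sample = (
    "## Comment lines are skipped because they do not start with the key\n"
    "InitiatorName=iqn.1993-08.org.debian:01:abcdef123456\n"
)
print(parse_iscsi_initiator(sample))
```

The returned IQN is what ends up in the 'initiator' field of the connector dictionary, and it is the value the storage side adds to the target ACL.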

The cinder API wrapper in nova:

# nova\volume\cinder.py:API.initialize_connection():

def initialize_connection(self, context, volume, connector):
    return cinderclient(context).\
             volumes.initialize_connection(volume['id'], connector)

This calls initialize_connection in cinder. The iSCSI driver implements it as follows:

# cinder\volume\iscsi.py:LioAdm.initialize_connection()

def initialize_connection(self, volume, connector):
    volume_iqn = volume['provider_location'].split(' ')[1]

    (auth_method, auth_user, auth_pass) = \
        volume['provider_auth'].split(' ', 3)

    # Add initiator iqns to target ACL
    try:
        self._execute('rtstool', 'add-initiator',
                      volume_iqn,
                      auth_user,
                      auth_pass,
                      connector['initiator'],
                      run_as_root=True)
    except exception.ProcessExecutionError as e:
        LOG.error(_("Failed to add initiator iqn %s to target") %
                  connector['initiator'])
        raise exception.ISCSITargetAttachFailed(volume_id=volume['id'])

# nova\virt\libvirt\driver.py:LibvirtDriver.attach_volume()

@exception.wrap_exception()
def attach_volume(self, connection_info, instance_name, mountpoint):
    virt_dom = self._lookup_by_name(instance_name)
    mount_device = mountpoint.rpartition("/")[2]
    # This may need changes; the method is analyzed below
    conf = self.volume_driver_method('connect_volume',
                                     connection_info,
                                     mount_device)

    if FLAGS.libvirt_type == 'lxc':
        self._attach_lxc_volume(conf.to_xml(), virt_dom, instance_name)
    else:
        try:
            # Attach the device to the virtual machine
            virt_dom.attachDevice(conf.to_xml())
        except Exception, ex:
            if isinstance(ex, libvirt.libvirtError):
                errcode = ex.get_error_code()
                if errcode == libvirt.VIR_ERR_OPERATION_FAILED:
                    self.volume_driver_method('disconnect_volume',
                                              connection_info,
                                              mount_device)
                    raise exception.DeviceIsBusy(device=mount_device)

            with excutils.save_and_reraise_exception():
                self.volume_driver_method('disconnect_volume',
                                          connection_info,
                                          mount_device)

    # TODO(danms) once libvirt has support for LXC hotplug,
    # replace this re-define with use of the
    # VIR_DOMAIN_AFFECT_LIVE & VIR_DOMAIN_AFFECT_CONFIG flags with
    # attachDevice()
    # Re-define the domain so that the attach is indirectly made persistent
    domxml = virt_dom.XMLDesc(libvirt.VIR_DOMAIN_XML_SECURE)
    self._conn.defineXML(domxml)

# nova\virt\libvirt\driver.py:LibvirtDriver.volume_driver_method()

def volume_driver_method(self, method_name, connection_info,
                         *args, **kwargs):
    driver_type = connection_info.get('driver_volume_type')
    if not driver_type in self.volume_drivers:
        raise exception.VolumeDriverNotFound(driver_type=driver_type)
    driver = self.volume_drivers[driver_type]
    method = getattr(driver, method_name)
    return method(connection_info, *args, **kwargs)

def __init__():
    ……
    self.volume_drivers = {}
    for driver_str in FLAGS.libvirt_volume_drivers:
        driver_type, _sep, driver = driver_str.partition('=')
        driver_class = importutils.import_class(driver)
        self.volume_drivers[driver_type] = driver_class(self)

volume_drivers is determined by the libvirt_volume_drivers option, whose default is:

cfg.ListOpt('libvirt_volume_drivers',
            default=[
                'iscsi=nova.virt.libvirt.volume.LibvirtISCSIVolumeDriver',
                'local=nova.virt.libvirt.volume.LibvirtVolumeDriver',
                'fake=nova.virt.libvirt.volume.LibvirtFakeVolumeDriver',
                'rbd=nova.virt.libvirt.volume.LibvirtNetVolumeDriver',
                'sheepdog=nova.virt.libvirt.volume.LibvirtNetVolumeDriver'
            ],
            help='Libvirt handlers for remote volumes.'),
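The registry built from "type=class" strings and the dispatch on driver_volume_type can be sketched on their own. This is an illustration, not the nova code: EchoDriver is a made-up driver, the class lookup is passed in as a plain function so the example has no dependencies, and KeyError stands in for VolumeDriverNotFound.

```python
class EchoDriver(object):
    """Made-up driver used only to demonstrate the dispatch."""

    def __init__(self, connection):
        self.connection = connection

    def connect_volume(self, connection_info, mount_device):
        return "connected %s" % mount_device


def build_volume_drivers(driver_strings, connection, import_class):
    """Map each 'type=dotted.path' entry to a driver instance."""
    drivers = {}
    for driver_str in driver_strings:
        driver_type, _sep, class_path = driver_str.partition("=")
        drivers[driver_type] = import_class(class_path)(connection)
    return drivers


def volume_driver_method(drivers, method_name, connection_info,
                         *args, **kwargs):
    """Call method_name on the driver selected by driver_volume_type."""
    driver_type = connection_info.get("driver_volume_type")
    if driver_type not in drivers:
        raise KeyError("no volume driver for type %r" % driver_type)
    method = getattr(drivers[driver_type], method_name)
    return method(connection_info, *args, **kwargs)


# A dictionary lookup replaces the real dotted-path import in this sketch.
known_classes = {"example.EchoDriver": EchoDriver}
drivers = build_volume_drivers(["clouddisk=example.EchoDriver"], None,
                               known_classes.__getitem__)
print(volume_driver_method(drivers, "connect_volume",
                           {"driver_volume_type": "clouddisk"}, "vdb"))
```

The "clouddisk" type name is hypothetical. The point is that the value of driver_volume_type returned by initialize_connection decides which driver handles the attach, so a custom backend needs both a matching type string and an entry in libvirt_volume_drivers.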

The cloud disk service can use the existing iscsi driver, or implement its own driver modeled on it. The iscsi driver looks like this:

# nova\virt\libvirt\volume.py:LibvirtISCSIVolumeDriver:

class LibvirtISCSIVolumeDriver(LibvirtVolumeDriver):
    """Driver to attach Network volumes to libvirt."""

    def _run_iscsiadm(self, iscsi_properties, iscsi_command, **kwargs):
        check_exit_code = kwargs.pop('check_exit_code', 0)
        (out, err) = utils.execute('iscsiadm', '-m', 'node', '-T',
                                   iscsi_properties['target_iqn'],
                                   '-p', iscsi_properties['target_portal'],
                                   *iscsi_command, run_as_root=True,
                                   check_exit_code=check_exit_code)
        LOG.debug("iscsiadm %s: stdout=%s stderr=%s" %
                  (iscsi_command, out, err))
        return (out, err)

    def _iscsiadm_update(self, iscsi_properties, property_key, property_value,
                         **kwargs):
        iscsi_command = ('--op', 'update', '-n', property_key,
                         '-v', property_value)
        return self._run_iscsiadm(iscsi_properties, iscsi_command, **kwargs)

    @utils.synchronized('connect_volume')
    def connect_volume(self, connection_info, mount_device):
        """Attach the volume to instance_name"""
        iscsi_properties = connection_info['data']
        # NOTE(vish): If we are on the same host as nova volume, the
        #             discovery makes the target so we don't need to
        #             run --op new. Therefore, we check to see if the
        #             target exists, and if we get 255 (Not Found), then
        #             we run --op new. This will also happen if another
        #             volume is using the same target.
        try:
            self._run_iscsiadm(iscsi_properties, ())
        except exception.ProcessExecutionError as exc:
            # iscsiadm returns 21 for "No records found" after version 2.0-871
            if exc.exit_code in [21, 255]:
                self._run_iscsiadm(iscsi_properties, ('--op', 'new'))
            else:
                raise

        if iscsi_properties.get('auth_method'):
            self._iscsiadm_update(iscsi_properties,
                                  "node.session.auth.authmethod",
                                  iscsi_properties['auth_method'])
            self._iscsiadm_update(iscsi_properties,
                                  "node.session.auth.username",
                                  iscsi_properties['auth_username'])
            self._iscsiadm_update(iscsi_properties,
                                  "node.session.auth.password",
                                  iscsi_properties['auth_password'])

        # NOTE(vish): If we have another lun on the same target, we may
        #             have a duplicate login
        self._run_iscsiadm(iscsi_properties, ("--login",),
                           check_exit_code=[0, 255])

        self._iscsiadm_update(iscsi_properties, "node.startup", "automatic")

        host_device = ("/dev/disk/by-path/ip-%s-iscsi-%s-lun-%s" %
                       (iscsi_properties['target_portal'],
                        iscsi_properties['target_iqn'],
                        iscsi_properties.get('target_lun', 0)))

        # The /dev/disk/by-path/... node is not always present immediately
        # TODO(justinsb): This retry-with-delay is a pattern, move to utils?
        tries = 0
        while not os.path.exists(host_device):
            if tries >= FLAGS.num_iscsi_scan_tries:
                raise exception.NovaException(_("iSCSI device not found at %s")
                                              % (host_device))

            LOG.warn(_("ISCSI volume not yet found at: %(mount_device)s. "
                       "Will rescan & retry. Try number: %(tries)s") %
                     locals())

            # The rescan isn't documented as being necessary(?), but it helps
            self._run_iscsiadm(iscsi_properties, ("--rescan",))

            tries = tries + 1
            if not os.path.exists(host_device):
                time.sleep(tries ** 2)

        if tries != 0:
            LOG.debug(_("Found iSCSI node %(mount_device)s "
                        "(after %(tries)s rescans)") %
                      locals())

        connection_info['data']['device_path'] = host_device
        sup = super(LibvirtISCSIVolumeDriver, self)
        return sup.connect_volume(connection_info, mount_device)

    @utils.synchronized('connect_volume')
    def disconnect_volume(self, connection_info, mount_device):
        """Detach the volume from instance_name"""
        sup = super(LibvirtISCSIVolumeDriver, self)
        sup.disconnect_volume(connection_info, mount_device)
        iscsi_properties = connection_info['data']
        # NOTE(vish): Only disconnect from the target if no luns from the
        #             target are in use.
        device_prefix = ("/dev/disk/by-path/ip-%s-iscsi-%s-lun-" %
                         (iscsi_properties['target_portal'],
                          iscsi_properties['target_iqn']))
        devices = self.connection.get_all_block_devices()
        devices = [dev for dev in devices if dev.startswith(device_prefix)]
        if not devices:
            self._iscsiadm_update(iscsi_properties, "node.startup", "manual",
                                  check_exit_code=[0, 255])
            self._run_iscsiadm(iscsi_properties, ("--logout",),
                               check_exit_code=[0, 255])
            self._run_iscsiadm(iscsi_properties, ('--op', 'delete'),
                               check_exit_code=[0, 21, 255])

In other words, the driver mainly implements two methods: attaching a volume to the host and detaching it from the host.
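Two pieces of connect_volume are worth isolating when writing a custom driver: how the host device path is derived from the connection data, and the rescan-and-wait loop with its quadratic back-off. The sketch below reproduces both without calling iscsiadm; the function names are made up, the rescan action is passed in as a callable, and RuntimeError stands in for NovaException.

```python
import os
import time


def iscsi_device_path(target_portal, target_iqn, target_lun=0):
    """Build the /dev/disk/by-path name that udev creates for an iSCSI LUN."""
    return "/dev/disk/by-path/ip-%s-iscsi-%s-lun-%s" % (
        target_portal, target_iqn, target_lun)


def wait_for_device(path, max_tries, rescan,
                    exists=os.path.exists, sleep=time.sleep):
    """Rescan until path appears, sleeping tries**2 seconds between tries.

    Returns the number of rescans performed; raises RuntimeError once
    max_tries rescans have not produced the device.
    """
    tries = 0
    while not exists(path):
        if tries >= max_tries:
            raise RuntimeError("iSCSI device not found at %s" % path)
        rescan()
        tries += 1
        if not exists(path):
            sleep(tries ** 2)
    return tries
```

With the back-off of tries squared, three failed rescans wait 1, 4 and 9 seconds in turn, so the total wait grows quickly with num_iscsi_scan_tries.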

2.3 Related source files

nova\volume\cinder.py source (the methods the cloud disk service needs to implement, or the APIs it needs to wrap, are all in this file): https://github.com/openstack/nova/blob/stable/folsom/nova/volume/cinder.py

nova\virt\libvirt\volume.py source (a reference for the driver the cloud disk service needs to implement): https://github.com/openstack/nova/blob/stable/folsom/nova/virt/libvirt/volume.py

# Default driver mapping; it shows that iscsi volumes use LibvirtISCSIVolumeDriver

cfg.ListOpt('libvirt_volume_drivers',
            default=[
                'iscsi=nova.virt.libvirt.volume.LibvirtISCSIVolumeDriver',
                'local=nova.virt.libvirt.volume.LibvirtVolumeDriver',
                'fake=nova.virt.libvirt.volume.LibvirtFakeVolumeDriver',
                'rbd=nova.virt.libvirt.volume.LibvirtNetVolumeDriver',
                'sheepdog=nova.virt.libvirt.volume.LibvirtNetVolumeDriver'
            ],
            help='Libvirt handlers for remote volumes.'),

Source of the abstract class in cinder that handles the various API requests: https://github.com/openstack/cinder/blob/master/cinder/volume/manager.py

This abstract class calls different drivers to carry out the actual work and complete the API request.

# The default volume driver is cinder.volume.drivers.lvm.LVMISCSIDriver

cfg.StrOpt('volume_driver',
           default='cinder.volume.drivers.lvm.LVMISCSIDriver',
           help='Driver to use for volume creation'),

The source of the iSCSI driver is: https://github.com/openstack/cinder/blob/master/cinder/volume/drivers/lvm.py#L304

It inherits from two classes, LVMVolumeDriver and driver.ISCSIDriver. The source of the latter is at https://github.com/openstack/cinder/blob/master/cinder/volume/driver.py#L199 . The self.tgtadm used at https://github.com/openstack/cinder/blob/master/cinder/volume/driver.py#L339 is initialized at https://github.com/openstack/cinder/blob/master/cinder/volume/drivers/lvm.py#L321 , and the methods it calls are those at https://github.com/openstack/cinder/blob/master/cinder/volume/iscsi.py#L460 .

iscsi_helper defaults to tgtadm:

cfg.StrOpt('iscsi_helper',
           default='tgtadm',
           help='iscsi target user-land tool to use'),

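3. APIs that need to be added

An API for extending a cloud disk. (The cloud disk service's existing API could be called directly, but adding one to nova is recommended: the cloud disk service then does not need to expose any API externally, because every request can be forwarded through nova.)

4. Issues to watch

Some of the error recovery and exception handling logic previously implemented in the cloud disk agent needs to be implemented in nova.
The mount point seen from outside the instance can differ from the one seen inside it. Because nova's attach is asynchronous, the mount point returned to the user is the one libvirt sees, not the actual one inside the virtual machine. The current plan is to return the final mount point through the volume query interface.
Users and authentication. The cloud disk service presumably used the management platform's user authentication logic; switching to the nova interface requires keystone authentication. It is unclear whether this can be converted at the management platform layer.

Overall, the changes required of the cloud disk service should be small. The main work is to wrap the existing API and provide a client (see https://github.com/openstack/nova/blob/stable/folsom/nova/volume/cinder.py). In addition, the driver (see https://github.com/openstack/nova/blob/stable/folsom/nova/virt/libvirt/volume.py) has to implement the extend logic, which should be able to reuse existing code from the agent.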


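Nova image create flow

Posted on 2015-06-15 by aspirer

Original post: http://aspirer2004.blog.163.com/blog/static/10676472013215111713232/

Full document download: nova image creation flow (nova镜像生成流程.docx)

This post discusses the flow around nova/virt/libvirt/driver.py:_create_image. Only file-backed disks are covered, not EBS disks (block devices).

Image copy optimization in the resize process

Before optimization

First, the virtual machine's configuration is obtained through libvirt's XMLDesc() method, and the information for every file-type disk (path, driver, qcow2 backing file) is read from it. If the resize is between different hosts, qemu-img convert then merges the base and the child image into a qcow2 image without a backing file, and the merged image is copied over rsync with ssh into the instance directory on the new host. After that, the check "if not os.path.exists(self.path) or not os.path.exists(base):" decides whether to create the image. During resize, self.path has already been copied over, so no image needs to be created and nothing is done.

After optimization (only the resize process was optimized; the creation process is the same as before)

Images are copied with rsync in daemon push mode. The base and the child image are not merged; only the child image is copied. The destination host then checks whether the base exists, downloads and extends it if it does not, and finally uses qemu-img rebase to rebase the child image onto the new base. At present the second disk (disk.local) and the swap disk (disk.swap) are not copied: copying only the child image would leave its base missing, and the code that prepares base images for disk.local and disk.swap has not been implemented. disk.local is therefore skipped when child images are copied. disk.swap is not configured at present, so the code does not skip it; if swap is enabled, resize will run into problems (the virtual machine cannot start because the base is missing).

Image creation flow after optimization: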

# nova/virt/libvirt/driver.py:LibvirtDriver._create_image()

if snapshot_optimization \
    and not self._volume_in_mapping(self.default_root_device,
                                    block_device_info):
    self._create_snapshot_image(context, instance,
                                disk_images['image_id'],
                                basepath('disk'), size)

# nova/virt/libvirt/driver.py:LibvirtDriver

# use optimized snapshot image
# Image creation flow for the optimized resize process
def _create_snapshot_image(self, context, instance, image_id,
                           target, size):
    # NOTE(hzwangpan): for resize operation, the 'disk' is copied from
    # source node before _create_image(), so if we fetch the 'disk' here,
    # it will cover the 'disk' copied from source
    # 'disk' is downloaded only if it does not exist. This is code left over
    # from the M3 restore-from-snapshot optimization: an M3 snapshot holds
    # only the COW part, i.e. 'disk', so creating a VM first downloads 'disk'
    # and then downloads its base from glance according to the name of its
    # backing file. This is the step that downloads 'disk'. The community
    # download code converts the image format, which we do not need, so a
    # new fetch_orig_image() method was added here.
    # During resize, 'disk' already exists in the instance directory because
    # it has been copied over from the source host.
    if not os.path.exists(target):
        libvirt_utils.fetch_orig_image(context=context, target=target,
                                       image_id=image_id,
                                       user_id=instance["user_id"],
                                       project_id=instance["project_id"])

    if not os.path.exists(target):
        LOG.error(_("fetch image failed, image id: %s"), image_id,
                  instance=instance, context=context)
        raise exception.CouldNotFetchImage(image_id)

    # Query the backing file of 'disk', i.e. find its base
    backing_file = libvirt_utils.get_disk_backing_file(target)
    if not backing_file:
        LOG.error(_("get backing file of image %s failed"), image_id,
                  instance=instance, context=context)
        raise exception.ImageUnacceptable(image_id=image_id,
                reason=_("%s doesn't has backing file") % target)

    virtual_size = libvirt_utils.get_disk_size(target)
    size = max(size, virtual_size)

    # get base image by backing file
    # Download the base image according to the backing file name.
    # If no incomplete M3-style snapshots exist, this step can be simplified
    # to downloading the image by image id: every VM is created from a
    # complete image/snapshot, so the image downloaded by the VM's image id
    # during resize is the base of 'disk'.
    base_dir = os.path.join(FLAGS.instances_path, '_base')
    if not os.path.exists(base_dir):
        utils.ensure_tree(base_dir)

    old_backing_file = os.path.join(base_dir, backing_file)
    old_size = 0
    if "_" in os.path.basename(old_backing_file):
        base_img = old_backing_file.rsplit("_", 1)[0]
        old_size = int(old_backing_file.rsplit("_", 1)[1]) * \
            (1024L * 1024L * 1024L)
    else:
        base_img = old_backing_file

    # First check whether the base without size information exists; if it
    # does, nothing needs to be downloaded from glance.
    # If it does not, the base has to be downloaded from glance.
    if not os.path.exists(base_img):
        self._get_base_image_by_backing_file(context, instance,
                                             image_id, base_img)

    lock_path = os.path.join(FLAGS.instances_path, 'locks')

    @utils.synchronized(base_img, external=True, lock_path=lock_path)
    def copy_and_extend(base_img, target_img, size):
        if not os.path.exists(target_img):
            libvirt_utils.copy_image(base_img, target_img)
            disk.extend(target_img, size)

    # NOTE(wangpan): qemu-img rebase 'Safe mode' need the old backing file,
    #                refer to qemu-img manual for more details.
    # Copy and extend the base without size information to recreate the old
    # backing file of 'disk'. qemu-img rebase uses "safe mode" by default,
    # which only works if both the old and the new backing file of the COW
    # part exist.
    if old_size:
        copy_and_extend(base_img, old_backing_file, old_size)

    # Copy and extend the base without size information to create the new
    # backing file of 'disk', i.e. the one with the size after resize
    new_backing_file = base_img
    if size:
        size_gb = size / (1024 * 1024 * 1024)
        new_backing_file += "_%d" % size_gb
        copy_and_extend(base_img, new_backing_file, size)

    # when old_backing_file != new_backing_file, rebase is needed
    # If the old and new backing files differ, 'disk' must be rebased
    if old_backing_file != new_backing_file:
        libvirt_utils.rebase_cow_image(new_backing_file, target)

def _get_base_image_by_backing_file(self, context, instance,
                                    image_id, backing_file):
    base_image_id_sha1 = os.path.basename(backing_file)
    LOG.debug(_("image id sha1 of backing file %(backing_file)s "
                "is: %(base_image_id_sha1)s") % locals(),
              instance=instance, context=context)

    (image_service, image_id) = glance.get_remote_image_service(
        context, image_id)

    # Query glance for the image/snapshot by the name of the base
    image_info = image_service.get_image_properties(context,
                                                    "image_id_sha1",
                                                    base_image_id_sha1)
    if not image_info:
        LOG.error(_("can't find base image by base_image_id_sha1 "
                    " %(base_image_id_sha1)s, snapshot image_id: %(image_id)s") %
                  locals(), instance=instance, context=context)
        raise exception.ImageNotFound(image_id=base_image_id_sha1)

    base_image_id = str(image_info[0].get("image_id"))

    lock_path = os.path.join(FLAGS.instances_path, 'locks')

    # Download the image/snapshot that was found
    @utils.synchronized(base_image_id_sha1,
                        external=True, lock_path=lock_path)
    def fetch_base_image(context, target, image_id, user_id, project_id):
        if not os.path.exists(target):
            # Use the original download method, which converts the image format
            libvirt_utils.fetch_image(context=context,
                                      target=target,
                                      image_id=image_id,
                                      user_id=user_id,
                                      project_id=project_id)

    fetch_base_image(context, backing_file, base_image_id,
                     instance["user_id"], instance["project_id"])

    if not os.path.exists(backing_file):
        LOG.error(_("fetch base image failed, image id: %s"),
                  base_image_id, instance=instance, context=context)
        raise exception.CouldNotFetchImage(base_image_id)
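The naming logic that drives the rebase decision in _create_snapshot_image can be checked separately from any image handling. The sketch below is an illustration under the conventions visible in the code above: a backing file is either "<base>" or "<base>_<size in GB>" inside the _base directory. The function names and the example path are invented, and no files are touched.

```python
import os

GB = 1024 ** 3


def split_backing_file(path):
    """Return (base_path, size_in_bytes); size is 0 without a _<GB> suffix."""
    directory, name = os.path.split(path)
    if "_" in name:
        base_name, size_gb = name.rsplit("_", 1)
        return os.path.join(directory, base_name), int(size_gb) * GB
    return path, 0


def plan_rebase(old_backing_file, new_size):
    """Return (base, old_size, new_backing_file, rebase_needed)."""
    base, old_size = split_backing_file(old_backing_file)
    new_backing_file = base
    if new_size:
        new_backing_file += "_%d" % (new_size // GB)
    return base, old_size, new_backing_file, old_backing_file != new_backing_file


print(plan_rebase("/var/lib/nova/instances/_base/abc123_10", 20 * GB))
```

Only the file name is inspected for the underscore, which matters because the directory name _base itself contains one.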

Common flow

The flow shared by creation and resize:

# nova/virt/libvirt/driver.py:LibvirtDriver._create_image()

# syntactic nicety (three inner helper functions are defined for readability)
def basepath(fname='', suffix=suffix):
    return os.path.join(FLAGS.instances_path,
                        instance['name'],
                        fname + suffix)

def image(fname, image_type=FLAGS.libvirt_images_type):
    return self.image_backend.image(instance['name'],
                                    fname + suffix, image_type)

def raw(fname):
    return image(fname, image_type='raw')

# ensure directories exist and are writable
# Create the instance directory, which holds the images and libvirt.xml
utils.ensure_tree(basepath(suffix=''))

# Write the libvirt.xml configuration file
libvirt_utils.write_to_file(basepath('libvirt.xml'), libvirt_xml)
# Write the console.log console output file
libvirt_utils.write_to_file(basepath('console.log', ''), '', 007)

# get image type (code added for the image flow optimization)
image_type = None
has_base_id_sha1 = False
(image_service, image_id) = glance.get_remote_image_service(
    context, disk_images['image_id'])
try:
    image_info = image_service.show(context, image_id)
    if image_info and 'properties' in image_info:
        if image_info['properties'].get('image_type') == "snapshot":
            image_type = "snapshot"
        else:  # anything that is not a snapshot is treated as a normal image
            image_type = "image"
        # base_image_id_sha1 exists for compatibility with M3 snapshots
        # (which upload only the COW part)
        if image_info['properties'].get('base_image_id_sha1'):
            has_base_id_sha1 = True
except Exception:
    image_type = None
    has_base_id_sha1 = None
    LOG.warn(_("get image type of %s faild") % image_id,
             context=context, instance=instance)
    pass

# Check whether the image has a backing file, i.e. whether it is only a COW part
backing_file = None
if os.path.exists(basepath('disk')):
    backing_file = libvirt_utils.get_disk_backing_file(
        basepath('disk'))

# The checks below decide whether our modified image flow should be used;
# snapshot_optimization is True when the modified flow is required
snapshot_optimization = False
# check use image snapshot optimization or not
use_qcow2 = ((FLAGS.libvirt_images_type == 'default' and
              FLAGS.use_cow_images) or
             FLAGS.libvirt_images_type == 'qcow2')

# only qcow2 image may be need to optimize, and images with
# 'kernel_id' or 'ramdisk_id' shouldn't be optimized
if FLAGS.allow_image_snapshot_optimization and use_qcow2 and \
    not disk_images['kernel_id'] and not disk_images['ramdisk_id']:
    # The if statements below work out which operation on which kind of
    # image is in progress, and from that whether the modified flow is
    # needed. Enumerating the cases by hand is laborious and will be
    # awkward to change later, but there is no better way at the moment.
    # normal image, when create instance
    if image_type == "image" and backing_file is None and \
        not has_base_id_sha1:
        snapshot_optimization = False

    # normal image, when resize
    if image_type == "image" and backing_file is not None and \
        not has_base_id_sha1:
        snapshot_optimization = True

    # unbroken snapshot, when create instance (a complete snapshot)
    if image_type == "snapshot" and backing_file is None and \
        not has_base_id_sha1:
        snapshot_optimization = False

    # unbroken snapshot, when resize (a complete snapshot)
    if image_type == "snapshot" and backing_file is not None and \
        not has_base_id_sha1:
        snapshot_optimization = True

    # only cow part snapshot, when create instance
    # (a snapshot holding only the COW part, as introduced in M3)
    if image_type == "snapshot" and backing_file is None and \
        has_base_id_sha1:
        snapshot_optimization = True

    # only cow part snapshot, when resize
    # (a snapshot holding only the COW part, as introduced in M3)
    if image_type == "snapshot" and backing_file is not None and \
        has_base_id_sha1:
        snapshot_optimization = True

# Generate the file name of the base
root_fname = hashlib.sha1(str(disk_images['image_id'])).hexdigest()
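The six hand-written cases above reduce to a small decision table. The function below is one possible condensation, offered as a sketch rather than as the code in use: the function name is invented, and the outer conditions (the allow_image_snapshot_optimization flag, the qcow2 check and the kernel/ramdisk check) are assumed to have passed already.

```python
def needs_snapshot_optimization(image_type, has_backing_file,
                                has_base_id_sha1):
    """Condensed form of the six cases, assuming the outer checks passed.

    has_backing_file is True when 'disk' already exists with a backing
    file, which is how the original code recognizes a resize.
    """
    if image_type == "snapshot":
        # COW-only (M3) snapshots always use the modified flow; complete
        # snapshots use it only on resize.
        return bool(has_backing_file or has_base_id_sha1)
    if image_type == "image":
        # Normal images use the modified flow only on resize.
        return bool(has_backing_file and not has_base_id_sha1)
    # Unknown image type (for example, the glance lookup failed).
    return False


for case in [("image", False, False), ("image", True, False),
             ("snapshot", False, False), ("snapshot", True, False),
             ("snapshot", False, True), ("snapshot", True, True)]:
    print(case, needs_snapshot_optimization(*case))
```

Combinations the original does not list, such as a normal image that carries base_image_id_sha1, fall through to False here, which matches the original's initial value of snapshot_optimization.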

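Creation process

Overview

For a qcow2 root disk, the original flow first downloads the image (in other words, creates the base) and then uses qemu-img create to generate the child image (disk). For the second, ephemeral disk and the third, swap disk in qcow2 format, the base is first created with mkfs/mkswap, and qemu-img create then generates the child image (disk.local/disk.swap).

Arguments passed in:

# nova/virt/libvirt/driver.py:LibvirtDriver.spawn()

self._create_image(context, instance, xml, network_info=network_info,
                   block_device_info=block_device_info,
                   files=injected_files,
                   admin_pass=admin_password)

Root disk

At present, unless the snapshot file comes from the M3 version, creation from a complete snapshot or an image follows the same flow as the community F release (Folsom). The image is first downloaded by image id, then converted, copied and extended to produce and cache the base image, and finally qemu-img create creates the COW part, disk.

# nova/virt/libvirt/driver.py:LibvirtDriver._create_image()

elif not self._volume_in_mapping(self.default_root_device,
                                 block_device_info):
    # image is one of the three inner helpers described above. It returns
    # an object whose class depends on the image format.
    # FLAGS.libvirt_images_type defaults to 'default', in which case
    # FLAGS.use_cow_images (default True) is checked next.
    # If it is True, image is a Qcow2 object; both options currently keep
    # their defaults. Otherwise it is a Raw object. LVM requires
    # libvirt_images_type='lvm'.
    # The 'disk' argument is the file name of the root disk; joined with
    # the instance directory it becomes image.path
    # cache is a method of the image class that caches the base
    # fetch_func is the function that downloads the image from glance if
    # the base does not exist
    # filename is the name of the base file
    image('disk').cache(fetch_func=libvirt_utils.fetch_image,
                        context=context,
                        filename=root_fname,
                        size=size,
                        image_id=disk_images['image_id'],
                        user_id=instance['user_id'],
                        project_id=instance['project_id'])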

# nova/virt/libvirt/imagebackend.py:Image.cache()

def cache(self, fetch_func, filename, size=None, *args, **kwargs):

“””Creates image from template.

Ensures that template and image not already exists.

Ensures that base directory exists.

Synchronizes on template fetching.

:fetch_func: Function that creates the base image

Should accept target argument.

:filename: Name of the file in the image directory

:size: Size of created image in bytes (optional)

“””

# 根据base的文件名加锁，防止两个创建过程同时下载导致的镜像损坏

@utils.synchronized(filename, external=True, lock_path=self.lock_path)

def call_if_not_exists(target, *args, **kwargs):

# 这里的判断必不可少，因为可能拿到锁的时候另外一个创建流程已经下载过了这个镜像

if not os.path.exists(target):

fetch_func(target=target, *args, **kwargs)

# If the 'disk' file already exists in the instance directory, do nothing; otherwise generate 'disk'

if not os.path.exists(self.path): # see the code below for how self.path is initialized

base_dir = os.path.join(FLAGS.instances_path, ‘_base’)

if not os.path.exists(base_dir):

utils.ensure_tree(base_dir)

base = os.path.join(base_dir, filename)

# Pass the image download method as an argument to the method that creates the disk

self.create_image(call_if_not_exists, base, size,

*args, **kwargs)
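The lock-then-recheck pattern used by call_if_not_exists can be illustrated on its own. The sketch below is not nova code: it swaps nova's external, file-based lock for an in-process threading.Lock, and fetch_once and fake_fetch are hypothetical names introduced only for this example.

```python
import os
import tempfile
import threading

# Assumption: a single in-process lock stands in for nova's external lock.
_base_lock = threading.Lock()


def fetch_once(target, fetch_func):
    """Run fetch_func(target) only if target does not exist yet."""
    with _base_lock:
        # Re-check after acquiring the lock: another caller may have
        # created the file while this one was waiting.
        if os.path.exists(target):
            return False
        fetch_func(target)
        return True


def fake_fetch(target):
    """Hypothetical stand-in for downloading an image from glance."""
    with open(target, "w") as f:
        f.write("base image")


if __name__ == "__main__":
    base = os.path.join(tempfile.mkdtemp(), "base")
    print(fetch_once(base, fake_fetch))  # first caller fetches
    print(fetch_once(base, fake_fetch))  # second caller finds the file
```

Without the existence check inside the lock, the second caller would download the same base again and could overwrite a file that is already in use.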

# nova/virt/libvirt/imagebackend.py:

class Qcow2(Image):

def __init__(self, instance, name):

super(Qcow2, self).__init__(“file”, “qcow2”, is_block_dev=False)

# instance=instance['name'], name='disk'

# self.path is the disk file in the instance directory

self.path = os.path.join(FLAGS.instances_path,

instance, name)

def create_image(self, prepare_template, base, size, *args, **kwargs):

# Lock so that the image cannot be deleted or modified while it is in use

@utils.synchronized(base, external=True, lock_path=self.lock_path)

def copy_qcow2_image(base, target, size):

qcow2_base = base

if size:

size_gb = size / (1024 * 1024 * 1024)

qcow2_base += ‘_%d’ % size_gb

if not os.path.exists(qcow2_base):

with utils.remove_path_on_error(qcow2_base):

# Copy the base, then extend the copy to the flavor size

libvirt_utils.copy_image(base, qcow2_base)

disk.extend(qcow2_base, size)

# Use the qemu-img create command to create the COW part, i.e. the disk file

libvirt_utils.create_cow_image(qcow2_base, target)

# Download the image with the method passed in, i.e. prepare the base

prepare_template(target=base, *args, **kwargs)

with utils.remove_path_on_error(self.path):

copy_qcow2_image(base, self.path, size)

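During download the image is first saved in the _base directory as root_fname.part and then converted to raw format; the conversion target is named root_fname.converted. When the conversion finishes, root_fname.part is deleted and root_fname.converted is renamed to root_fname. The extended copy has the size appended to its name, for example root_fname_10.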
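The naming scheme above can be summarized in a small helper. This is only a sketch: base_image_names is a hypothetical function, and the default base_dir is an assumed path rather than a value taken from the original configuration.

```python
import hashlib
import os


def base_image_names(image_id, root_gb,
                     base_dir="/var/lib/nova/instances/_base"):
    """Return the file names used while preparing a root-disk base.

    base_dir is an assumed location for the _base directory.
    """
    # The base name is the SHA-1 hex digest of the image id.
    root_fname = hashlib.sha1(str(image_id).encode("utf-8")).hexdigest()
    base = os.path.join(base_dir, root_fname)
    return {
        "download": base + ".part",               # file as downloaded
        "converting": base + ".converted",        # raw conversion target
        "base": base,                             # final cached base
        "sized_base": "%s_%d" % (base, root_gb),  # copy extended to flavor size
    }


if __name__ == "__main__":
    for role, path in sorted(base_image_names("example-image-id", 10).items()):
        print(role, path)
```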

The root disk configuration in the generated libvirt.xml is:

</disk>

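Ephemeral disk

The base is first created with qemu-img create and formatted with mkfs.ext3; qemu-img create then creates the COW part, disk.local. The configuration is the same as for the root disk, except for the file name (disk.local instead of disk) and the target dev (vdb instead of vda).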

# nova/virt/libvirt/driver.py:LibvirtDriver._create_image()

ephemeral_gb = instance[‘ephemeral_gb’]

if ephemeral_gb and not self._volume_in_mapping(

self.default_second_device, block_device_info):

# If there is a second disk 'disk.local', the swap disk becomes the third disk, vdc

swap_device = self.default_third_device

# Wrap _create_ephemeral, the method that creates the second disk

fn = functools.partial(self._create_ephemeral,

fs_label=‘ephemeral0’,

os_type=instance[“os_type”])

fname = “ephemeral_%s_%s_%s” % (“0”,

ephemeral_gb,

instance[“os_type”])

size = ephemeral_gb * 1024 * 1024 * 1024

# Similar to the root disk flow; the only difference is that the base is created with qemu-img instead of being downloaded from glance

image(‘disk.local’).cache(fetch_func=fn,

filename=fname,

size=size,

ephemeral_size=ephemeral_gb)

else:

swap_device = self.default_second_device

# nova/virt/libvirt/driver.py:LibvirtDriver

def _create_ephemeral(self, target, ephemeral_size, fs_label, os_type):

# Create an unformatted, empty disk file

self._create_local(target, ephemeral_size)

# Format it as ext3

disk.mkfs(os_type, fs_label, target)

@staticmethod

def _create_local(target, local_size, unit=‘G’,

fs_format=None, label=None):

“””Create a blank image of specified size”””

if not fs_format:

fs_format = FLAGS.default_ephemeral_format # defaults to None

# Create the raw-format base with the qemu-img create command

libvirt_utils.create_image(‘raw’, target,

‘%d%c’ % (local_size, unit))

if fs_format: # None here, so this branch is not executed

libvirt_utils.mkfs(fs_format, target, label)

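Swap disk

The flow is the same as for the ephemeral disk; only the base format differs. The base is first created with qemu-img and formatted with mkswap, and qemu-img create then creates the COW part, disk.swap. The configuration is the same as for the root disk, except for the file name (disk.swap instead of disk) and the target dev (vdb or vdc instead of vda, depending on whether an ephemeral disk exists).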
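How the two extra local disks end up on vdb and vdc can be sketched as follows. local_disk_layout is a hypothetical helper; the "swap_<size>" base name is an assumption, and volumes mapped onto these device names (the _volume_in_mapping checks) are ignored.

```python
def local_disk_layout(ephemeral_gb, swap_mb, os_type="linux"):
    """Return (file name, target dev, base name) for each extra local disk."""
    disks = []
    swap_dev = "vdb"  # swap is the second disk unless an ephemeral disk exists
    if ephemeral_gb:
        base = "ephemeral_%s_%s_%s" % ("0", ephemeral_gb, os_type)
        disks.append(("disk.local", "vdb", base))
        swap_dev = "vdc"  # the ephemeral disk takes vdb, so swap moves to vdc
    if swap_mb:
        # Assumed naming for the swap base file.
        disks.append(("disk.swap", swap_dev, "swap_%s" % swap_mb))
    return disks


if __name__ == "__main__":
    print(local_disk_layout(ephemeral_gb=20, swap_mb=512))
    print(local_disk_layout(ephemeral_gb=0, swap_mb=512))
```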

Resize/cold migration process

Root disk

Resize source side:

# nova/virt/libvirt/driver.py:LibvirtDriver

@exception.wrap_exception()

def migrate_disk_and_power_off(self, context, instance, dest,

instance_type, network_info,

block_device_info=None):

LOG.debug(_(“Starting migrate_disk_and_power_off”),

instance=instance)

# Get information about all disks of type='file' on the VM

disk_info_text = self.get_instance_disk_info(instance[‘name’])

disk_info = jsonutils.loads(disk_info_text)

# Power off

self.power_off(instance)

# Block device handling; we do not use cinder at present, so nothing is done here

block_device_mapping = driver.block_device_info_get_mapping(

block_device_info)

for vol in block_device_mapping:

connection_info = vol[‘connection_info’]

mount_device = vol[‘mount_device’].rpartition(“/”)[2]

self.volume_driver_method(‘disconnect_volume’,

connection_info,

mount_device)

# copy disks to destination

# rename instance dir to +_resize at first for using

# shared storage for instance dir (eg. NFS).

# Copy the disks to the destination host

same_host = (dest == self.get_host_ip_addr())

inst_base = “%s/%s” % (FLAGS.instances_path, instance[‘name’])

inst_base_resize = inst_base + “_resize”

clean_remote_dir = False

try:

# First rename the instance directory to instance-xxxxx_resize as a backup

utils.execute(‘mv’, inst_base, inst_base_resize)

if same_host:

dest = None

utils.execute(‘mkdir’, ‘-p’, inst_base)

else:

# Resize between different hosts

if not FLAGS.use_rsync:

# The pre-optimization flow creates the instance directory on the destination over ssh

utils.execute(‘ssh’, dest, ‘mkdir’, ‘-p’, inst_base)

else:

# The new flow creates the directory with rsync

libvirt_utils.make_remote_instance_dir(inst_base_resize,

dest, instance[‘name’])

clean_remote_dir = True

# Iterate over all disks

for info in disk_info:

# assume inst_base == dirname(info[‘path’])

img_path = info[‘path’]

fname = os.path.basename(img_path)

# FIXME(wangpan): when resize, we ignore the ephemeral disk

# The second disk 'disk.local' is ignored here and not copied to the destination

# The third disk 'disk.swap' should be ignored here as well, but it is not used for now

if fname == “disk.local”:

LOG.debug(_(“ignore disk.local when resize”),

instance=instance)

continue

from_path = os.path.join(inst_base_resize, fname)

remote_path = “%s/%s” % (instance[‘name’], fname)

if info[‘type’] == ‘qcow2’ and info[‘backing_file’]:

tmp_path = from_path + “_rbase”

# Note(hzzhoushaoyu): if allow optimization, just copy

# qcow2 to destination without merge.

# The optimized flow copies only the COW part, without merging COW and base

if FLAGS.allow_image_snapshot_optimization:

tmp_path = from_path

else:

# merge backing file

# The old flow merges COW and base first and then copies

utils.execute(‘qemu-img’, ‘convert’, ‘-f’, ‘qcow2’,

‘-O’, ‘qcow2’, from_path, tmp_path)

if same_host and \

not FLAGS.allow_image_snapshot_optimization:

utils.execute(‘mv’, tmp_path, img_path)

elif same_host and FLAGS.allow_image_snapshot_optimization:

utils.execute(‘cp’, tmp_path, img_path)

else:

if not FLAGS.use_rsync:

# The old flow copies the disk file with rsync in ssh mode

libvirt_utils.copy_image(tmp_path, img_path,

host=dest)

else:

# The optimized flow copies with rsync in daemon push mode

libvirt_utils.copy_image_to_remote(tmp_path,

remote_path, dest)

if not FLAGS.allow_image_snapshot_optimization:

utils.execute(‘rm’, ‘-f’, tmp_path)

else: # raw or qcow2 with no backing file

if not FLAGS.use_rsync or same_host:

libvirt_utils.copy_image(from_path, img_path,

host=dest)

else:

libvirt_utils.copy_image_to_remote(tmp_path,

remote_path, dest)

except Exception, e:

try:

# Exception handling: clean up leftover files

if os.path.exists(inst_base_resize):

utils.execute(‘rm’, ‘-rf’, inst_base)

if clean_remote_dir and FLAGS.use_rsync:

libvirt_utils.clean_remote_dir(instance[‘name’], dest)

utils.execute(‘mv’, inst_base_resize, inst_base)

if not FLAGS.use_rsync:

utils.execute(‘ssh’, dest, ‘rm’, ‘-rf’, inst_base)

except Exception:

pass

raise e

# Return the disk information for use by the destination side

return disk_info_text
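The branching in the copy loop above is easier to follow as a decision function. The sketch below is a simplified reading of that loop, not the driver code: copy_plan and the step labels are hypothetical, and the final removal of the temporary merged file is left out.

```python
def copy_plan(disk, same_host, optimize, use_rsync):
    """Return the ordered steps used to move one disk during resize.

    disk is a dict with 'fname', 'type' and 'backing_file' keys.
    optimize mirrors allow_image_snapshot_optimization.
    """
    if disk["fname"] == "disk.local":
        return ["skip"]  # the ephemeral disk is regenerated on the destination
    has_backing = disk["type"] == "qcow2" and bool(disk["backing_file"])
    merged = has_backing and not optimize
    steps = []
    if merged:
        steps.append("merge")  # qemu-img convert flattens COW + base
    if same_host:
        # A merged temp file is moved into place; anything else is copied.
        steps.append("mv" if merged else "cp")
    else:
        steps.append("rsync-daemon" if use_rsync else "rsync-ssh")
    return steps


if __name__ == "__main__":
    root = {"fname": "disk", "type": "qcow2", "backing_file": "/_base/abc_10"}
    print(copy_plan(root, same_host=False, optimize=True, use_rsync=True))
    print(copy_plan(root, same_host=True, optimize=False, use_rsync=False))
```

With the optimization enabled only the COW file travels, so the destination must be able to obtain the matching base on its own.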

Resize destination side:

# nova/virt/libvirt/driver.py:LibvirtDriver

@exception.wrap_exception()

def finish_migration(self, context, migration, instance, disk_info,

network_info, image_meta, resize_instance,

block_device_info=None):

LOG.debug(_(“Starting finish_migration”), instance=instance)

# Generate the libvirt.xml file

xml = self.to_xml(instance, network_info,

block_device_info=block_device_info)

# assume _create_image do nothing if a target file exists.

# TODO(oda): injecting files is not necessary

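# The image is generated here, but in the original community flow nothing is generated because 'disk' has already been copied over,

# so the cache method in imagebackend.py does nothing.

# This call mainly creates the instance directory and writes the libvirt.xml and console.log files.

# In our modified flow, however, the base is downloaded according to the backing file of 'disk',

# and the base of the second disk and 'disk.local' are regenerated here.

# The third disk 'disk.swap' was not skipped during the copy, so it is not regenerated here,

# which means disk.swap may fail to find its base and the VM may fail to start.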

self._create_image(context, instance, xml,

network_info=network_info,

block_device_info=None)

# resize disks. only “disk” and “disk.local” are necessary.

# Resize the disks; the third disk is ignored

disk_info = jsonutils.loads(disk_info)

for info in disk_info:

fname = os.path.basename(info[‘path’])

if fname == ‘disk’:

size = instance[‘root_gb’]

elif fname == ‘disk.local’:

size = instance[‘ephemeral_gb’]

else:

size = 0

size *= 1024 * 1024 * 1024

# If we have a non partitioned image that we can extend

# then ensure we’re in ‘raw’ format so we can extend file system.

fmt = info[‘type’]

# If the image is qcow2 and can be resized, convert it to raw first

if (size and fmt == ‘qcow2’ and

disk.can_resize_fs(info[‘path’], size, use_cow=True)):

path_raw = info[‘path’] + ‘_raw’

utils.execute(‘qemu-img’, ‘convert’, ‘-f’, ‘qcow2’,

‘-O’, ‘raw’, info[‘path’], path_raw)

utils.execute(‘mv’, path_raw, info[‘path’])

fmt = ‘raw’

# Resize the disk

if size:

disk.extend(info[‘path’], size)

if fmt == ‘raw’ and FLAGS.use_cow_images:

# back to qcow2 (no backing_file though) so that snapshot

# will be available

# If the disk is raw, or was just converted to raw, convert it to qcow2 again

path_qcow = info[‘path’] + ‘_qcow’

utils.execute(‘qemu-img’, ‘convert’, ‘-f’, ‘raw’,

‘-O’, ‘qcow2’, info[‘path’], path_qcow)

utils.execute(‘mv’, path_qcow, info[‘path’])

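### The two conversions above are very time-consuming, so doing this is not recommended

### Fortunately all our flavors currently use the same root_gb, so no resize is performed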

# Create the VM

self._create_domain_and_network(xml, instance, network_info,

block_device_info)

# Wait for the VM to start

timer = utils.LoopingCall(self._wait_for_running, instance)

timer.start(interval=0.5).wait()

Ephemeral disk

Same as the root disk.

Swap disk

Same as the root disk.

Live migration process (with block migration)

Root disk

Live migration destination side:

# nova/virt/libvirt/driver.py:LibvirtDriver

def pre_block_migration(self, ctxt, instance, disk_info_json):

“””Preparation block migration.

:params ctxt: security context

:params instance:

nova.db.sqlalchemy.models.Instance object

instance object that is migrated.

:params disk_info_json:

json strings specified in get_instance_disk_info

“””

# As with resize, disk_info_json lists all disks found with type='file'

disk_info = jsonutils.loads(disk_info_json)

# make instance directory

instance_dir = os.path.join(FLAGS.instances_path, instance[‘name’])

if os.path.exists(instance_dir):

raise exception.DestinationDiskExists(path=instance_dir)

os.mkdir(instance_dir)

# Iterate over all file disks

for info in disk_info:

base = os.path.basename(info[‘path’])

# Get image type and create empty disk image, and

# create backing file in case of qcow2.

instance_disk = os.path.join(instance_dir, base)

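# If the disk has no backing file (raw, or qcow2 without a backing file),

# create an empty disk directly with 'qemu-img create'; its contents are copied over during live migration,

# which is the whole point of block migration.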

if not info[‘backing_file’]:

libvirt_utils.create_image(info[‘type’], instance_disk,

info[‘disk_size’])

else:

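# Disk with a backing file

# Same image generation flow as when creating a VM: prepare the base first,

# then qemu-img create 'disk'. The 'disk' generated here is effectively empty as well

# and has to be filled by block migration.

# Note that this flow fails for an incomplete M3 snapshot,

# because the base is downloaded by image id here, and the id of an incomplete snapshot is its own id.

# What is needed is to find and download the base of the snapshot from the snapshot id.

# A complete M4 snapshot should behave like an ordinary image, so it does not have this problem.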

# Creating backing file follows same way as spawning instances.

cache_name = os.path.basename(info[‘backing_file’])

# Remove any size tags which the cache manages

cache_name = cache_name.split(‘_’)[0]

# The flow below is the same as the creation flow

image = self.image_backend.image(instance[‘name’],

instance_disk,

FLAGS.libvirt_images_type)

image.cache(fetch_func=libvirt_utils.fetch_image,

context=ctxt,

filename=cache_name,

image_id=instance[‘image_ref’],

user_id=instance[‘user_id’],

project_id=instance[‘project_id’],

size=info[‘virt_disk_size’])
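The size-tag stripping in pre_block_migration is a plain string operation, shown here in isolation. cache_name_from_backing is a hypothetical wrapper around the two lines from the listing, and the paths are made up for the example.

```python
import os


def cache_name_from_backing(backing_file):
    """Derive the cache (base) file name from a disk's backing file path."""
    cache_name = os.path.basename(backing_file)
    # Drop the "_<size>" tag that the cache appends to extended copies.
    return cache_name.split("_")[0]


if __name__ == "__main__":
    print(cache_name_from_backing("/instances/_base/3f2a9c_10"))
    print(cache_name_from_backing("/instances/_base/ephemeral_0_20_linux"))
```

Note that the same rule cuts any name at its first underscore, so a backing file such as ephemeral_0_20_linux is reduced to ephemeral.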

Ephemeral disk

Same as the root disk.

Swap disk

Same as the root disk.


Compiling libvirt-0.9.12 on Debian

Posted on 2015-06-15 by aspirer

Original post: http://aspirer2004.blog.163.com/blog/static/1067647201312311646747/

Dependencies that need to be installed:

apt-get install gcc make pkg-config libxml2-dev libgnutls-dev libdevmapper-dev python-dev libnl-dev libyajl-dev

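To install over the version from the deb package:

./configure --prefix=/usr --libdir=/usr/lib --localstatedir=/var --sysconfdir=/etc

make && make install

Alternatively, leave the existing libvirt in place and run ./configure with its default options; in that case watch out for library linking issues.

Problems encountered when compiling libvirt-0.9.12 on Debian: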

######error: failed to get the hypervisor version

######error: internal error Cannot find suitable emulator for x86_64

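Fix: install libyajl-dev, then rerun ./configure, make and make install.

With 0.9.12, ./configure does not warn that libyajl-dev is missing, but after building and installing, libvirt cannot connect to the qemu-kvm hypervisor. This was fixed in 0.9.13, so installing the package beforehand matters. The problem has caught me twice, so I am writing it down now.

Also, dget is a convenient way to download source code on Debian. Find the package on the Debian website; the source download links are on the right. Right-click and copy the link to the XXX.dsc file, install dget on the server (it is part of the devscripts package), then run dget with the copied link. This downloads three files: a .dsc file, the original upstream source tarball and the Debian patch archive. Running dpkg-source -x XXX.dsc then unpacks and merges the two archives into a complete source tree, where the code can be modified and then built.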


A brief analysis of nova-scheduler

Posted on 2015-06-15 by aspirer

Original post: http://aspirer2004.blog.163.com/blog/static/1067647201281925258417/

SkyDrive share link: https://skydrive.live.com/redir?resid=B71FFD238B71E836!511

GitHub share link: https://github.com/aspirer/docfiles/blob/master/nova-scheduler浅析.docx

This analysis is based on the Folsom release.

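Introduction

Nova-scheduler is one of the core components of nova, which is itself a core component of OpenStack, so it plays a central role in OpenStack.

As the name suggests, nova-scheduler is nova's scheduler. It currently finds a suitable compute node when a VM instance is created or started, and checks whether the destination host has enough physical resources when a VM is migrated. Its importance grows with the number of compute nodes.

The figure below shows the OpenStack architecture. It shows that nova-scheduler carries out its scheduling work by interacting with the message queue and the nova database.

The simplified workflow of nova-scheduler is shown in the figure below:

That is, it filters out unusable nodes, ranks the usable nodes by weight, and selects the best node according to the policy configured by the user.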
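The filter-then-rank workflow can be written down in a few lines. This is a minimal sketch, not the nova implementation: pick_host, the host dictionaries and the two example filters are all hypothetical.

```python
def pick_host(hosts, filters, cost_fn):
    """Drop hosts that fail any filter, then return the cheapest survivor."""
    candidates = [h for h in hosts if all(f(h) for f in filters)]
    if not candidates:
        return None  # no valid host found
    return min(candidates, key=cost_fn)


hosts = [
    {"name": "node1", "up": True, "free_ram_mb": 512},
    {"name": "node2", "up": True, "free_ram_mb": 4096},
    {"name": "node3", "up": False, "free_ram_mb": 8192},
]
# Hypothetical filters: the service must be up and have 1 GB of free RAM.
filters = [lambda h: h["up"], lambda h: h["free_ram_mb"] >= 1024]
# Hypothetical cost: more free RAM means lower cost.
best = pick_host(hosts, filters, cost_fn=lambda h: -h["free_ram_mb"])

if __name__ == "__main__":
    print(best["name"])
```

node3 has the most free RAM but is filtered out because it is down, and node1 fails the RAM filter, so node2 is the only candidate left.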

Below is the class diagram of nova-scheduler:

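Notification of available compute node resources

To pick the best node for a policy, the scheduler first needs to know the resources currently available on each node and the resources the request needs. The latter is passed in by the user through the API request, so it is not the focus of nova-scheduler.

The figure below describes how the nova-compute service updates its node resource information and publishes it to the message queue, and how nova-scheduler collects and stores this information for later use:

The figure can be divided into three related parts: information collection, information publication and information storage. Collection (the 1.X part of the figure) is done by the nova-compute service, which relies on the interfaces provided by the hypervisor adaptation layer, such as the libvirt or XenAPI interfaces (the function that actually performs the collection is nova/virt/libvirt/connection.py:HostState.update_status()); the collected information is then used by the publishing function. Publication (the 2.X part of the figure) is done by the nova-compute service through RPC. Because publication crosses services (nova-compute to nova-scheduler), using the message queue is the natural choice; if the collection function has not delivered fresh information, the most recently updated information is published instead. Storage (the part below the message queue in the figure) is done by the nova-scheduler service, which takes the information from the message queue and stores it as the basis for its decisions.

One problem here: when a new request arrives before the previous resource-consuming request has been processed, and the compute node chosen can only satisfy one of them, the second request fails.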

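From the API to the manager

The API here means the scheduler's API (the flow from API to manager is similar for the other services apart from nova-api; only nova-scheduler is discussed here). Messages are passed from the API to the manager through the message queue. The message queue in nova uses AMQP, and the connection to AMQP is established when the service starts, so we first look at the startup flow of the nova-scheduler service.

The nova-scheduler service script:

Creating the service instance and the final blocking wait need little explanation, so we go straight to the service startup flow: nova/service.py:serve -> launch_server -> eventlet starts the main thread as a greenthread to execute run_server -> Service.start():

Next, the consumer creation process:

This leads into the ProxyCallback class:

Continuing:

The method is obtained with getattr(). The Service class defines a __getattr__ method,

and SchedulerManager in nova-scheduler's manager.py also defines __getattr__ to handle methods that are called but not implemented in the manager; these calls are all redirected to the _schedule method.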
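The fallback described above relies on the fact that Python calls __getattr__ only when normal attribute lookup fails. The sketch below shows the mechanism with a hypothetical ToyManager class; it is not the SchedulerManager code, and the return values exist only to make the dispatch visible.

```python
import functools


class ToyManager(object):
    """Hypothetical manager: one real method, everything else falls back."""

    def prep_resize(self, **kwargs):
        return ("prep_resize", kwargs)

    def _schedule(self, method, **kwargs):
        # Generic path for any method the manager does not implement.
        return ("_schedule", method, kwargs)

    def __getattr__(self, name):
        # Reached only when normal lookup fails, so prep_resize and
        # _schedule themselves never come through here.
        return functools.partial(self._schedule, name)


if __name__ == "__main__":
    manager = ToyManager()
    print(manager.prep_resize(instance="i-1"))
    print(manager.run_instance(instance="i-2"))
```

Because RPC dispatch looks methods up by name with getattr(), an unimplemented method name reaches the generic path without any extra routing code.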

This concludes the flow from the API to the manager.

Many thanks to 田田高 for the generous help with this part!

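From the manager to the driver

This part of the flow is fairly simple. The driver is initialized when the SchedulerManager class is initialized:

For example, when a VM is resized, the resize method of nova's compute API calls the scheduler's prep_resize method through the message queue: self._cast_scheduler_message(context, {"method": "prep_resize", "args": args}). This invokes prep_resize in SchedulerManager, which in turn calls schedule_prep_resize in MultiScheduler,

Here we meet self.drivers, a dict, whose initialization can be found in MultiScheduler's __init__ method,

There are two sub-drivers, compute_driver and volume_driver, both of which can be set in the configuration file. Their defaults are:

volume_driver is used in few places; it appears in nova/volume/api.py:API.create(), where it selects a nova-volume node on which to create the volume.

compute_driver is used much more widely. Almost every action that starts a VM involves it, such as creating, starting, resizing or migrating a VM, because starting a VM raises the policy question of which node to start it on. Continuing with resize as the example, MultiScheduler's schedule_prep_resize calls schedule_prep_resize in FilterScheduler:

In this way the prep_resize action is directed to a specific nova-compute node, so that the resize destination node can make the necessary preparations.

Besides FilterScheduler there are other schedulers, such as ChanceScheduler and SimpleScheduler. ChanceScheduler picks a running host at random, while SimpleScheduler picks the host according to the number of cores already in use: the host with the fewest cores in use is selected. The focus here is the default FilterScheduler, which is responsible for selecting a host and reserving its resources once selected. Hosts are selected by running them through filters and computing a weighted cost for each; the host that passes all filters and has the lowest weighted cost is selected.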
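The two simpler policies can be expressed in a few lines each. This is an illustration of the selection rules as described above, not the ChanceScheduler or SimpleScheduler code; chance_pick, simple_pick and the sample data are hypothetical.

```python
import random


def chance_pick(running_hosts, rng=random):
    """Chance policy: any running host, chosen at random."""
    return rng.choice(list(running_hosts))


def simple_pick(used_cores_by_host):
    """Simple policy: the host with the fewest cores already in use."""
    return min(used_cores_by_host, key=used_cores_by_host.get)


if __name__ == "__main__":
    used = {"node1": 12, "node2": 3, "node3": 7}
    print(chance_pick(used))
    print(simple_pick(used))
```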

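From the driver to the filters

FilterScheduler (which currently only supports scheduling for the compute service) has many filters for selecting hosts. The filters that may be used are determined by the 'scheduler_available_filters' option, whose default value is 'nova.scheduler.filters.standard_filters'. According to its comment, the default is to walk through all available filters, so any *_filter.py under the nova/scheduler/filters directory can be used.

Another option, 'scheduler_default_filters', sets the filters used by default. In other words, many filters are available but not all of them are used; the default configuration has three filters: 'AvailabilityZoneFilter', 'RamFilter', 'ComputeFilter'.

We first look at the _schedule method, which filters the list of usable hosts. A decision needs data to back it up, otherwise it is just guesswork, so we start with the data needed to filter hosts:

The code shows that two inputs drive the decision: Host and filter_properties. Since hosts are being filtered, taking the hosts as input is natural. Host information is obtained as follows: first query the database for all compute service nodes, then look up the previously stored service_states attribute for the related information, and finally wrap the result in a HostState object and return it.

That leaves filter_properties. This parameter holds the VM flavor and any specific host selection requirements, all of which the user specifies when starting the VM. For example, the following parameters can be set in the data of the create-VM API:

A specific host can also be requested for the VM, or excluded:

'force_hosts': ['testhost'], 'ignore_hosts': ['testhost']

Both parameters take effect before the filters run. The code implementation:

Arbitrary key-value pairs can even be passed to the scheduler for special purposes.

filter_properties incorporates the request_spec parameter, which is the main basis on which filters select hosts. request_spec usually contains the following, where the items highlighted in yellow are the ones that matter most to the filters or cost functions:

The passes_filters method mentioned above is used by filter_hosts, the filtering step in _schedule. It belongs to the HostState class, so every host to be filtered goes through this method; it returns True for a host that passes and False otherwise.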
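The ordering described above, where ignore_hosts and force_hosts are evaluated before any filter runs, can be sketched as a standalone function. This is a simplified illustration under the assumption that hosts are plain strings; in nova the method lives on HostState.

```python
def passes_filters(host, filter_fns, filter_properties):
    """Return True if host survives the host hints and every filter."""
    if host in filter_properties.get("ignore_hosts", []):
        return False  # explicitly excluded by the user
    force_hosts = filter_properties.get("force_hosts", [])
    if force_hosts:
        # A forced host bypasses the filters; any other host is rejected.
        return host in force_hosts
    return all(fn(host, filter_properties) for fn in filter_fns)


if __name__ == "__main__":
    reject_all = lambda host, props: False
    print(passes_filters("testhost", [reject_all], {"force_hosts": ["testhost"]}))
    print(passes_filters("testhost", [], {"ignore_hosts": ["testhost"]}))
```

In this sketch a forced host is accepted even when a filter would reject it, which is why force_hosts is useful for pinning a VM to a node during testing.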

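Now we look at how _choose_host_filters selects the filters and what it returns:

The default filters are certainly usable, because they are picked from the available filters. The host_passes methods of the filters chosen by _choose_host_filters are passed as arguments to passes_filters and are then called to filter hosts. Next, the host_passes methods of these three filters: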
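As a rough illustration of what a host_passes check looks like, here is a toy RAM filter. It is a sketch rather than nova's RamFilter: the class name, the dictionary keys and the ram_allocation_ratio parameter (an overcommit factor) are assumptions made for the example.

```python
class ToyRamFilter(object):
    """Hypothetical filter: does the host have enough RAM for the flavor?"""

    def __init__(self, ram_allocation_ratio=1.0):
        # Values above 1.0 allow memory overcommit (assumed parameter).
        self.ram_allocation_ratio = ram_allocation_ratio

    def host_passes(self, host_state, filter_properties):
        requested = filter_properties["instance_type"]["memory_mb"]
        total = host_state["total_usable_ram_mb"]
        used = total - host_state["free_ram_mb"]
        return total * self.ram_allocation_ratio - used >= requested


if __name__ == "__main__":
    host = {"total_usable_ram_mb": 8192, "free_ram_mb": 1024}
    props = {"instance_type": {"memory_mb": 2048}}
    print(ToyRamFilter().host_passes(host, props))
    print(ToyRamFilter(1.5).host_passes(host, props))
```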

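From the filters to the cost functions

After the three predefined filters have run, the result is a list of all hosts that passed them. The list may contain more than one host, so the best node still has to be chosen from it in order to balance resource allocation. The best node is chosen by using cost functions to compute the "cost" of starting the VM on each host; the node with the lowest cost is selected to start the VM or receive the migrated VM.

As mentioned in the previous section, the cost functions are obtained in the _schedule method of the FilterScheduler class by calling get_cost_functions. The code is:

Next, the definition of nova.scheduler.least_cost.compute_fill_first_cost_fn:

Finally, the "cost" calculation itself (also called from the _schedule method of the FilterScheduler class):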
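The weighted-cost selection can be sketched as follows. This is not the least_cost module: weighted_sum and free_ram_cost are hypothetical names, and both the cost function (free RAM) and the weights used below are assumptions chosen to show how the sign of a weight changes the outcome.

```python
def free_ram_cost(host):
    """Assumed cost function: the raw amount of free RAM on the host."""
    return host["free_ram_mb"]


def weighted_sum(weighted_fns, hosts):
    """Return (host, score) for the host with the lowest weighted score."""
    scores = [sum(weight * fn(host) for weight, fn in weighted_fns)
              for host in hosts]
    best = min(range(len(hosts)), key=scores.__getitem__)
    return hosts[best], scores[best]


if __name__ == "__main__":
    hosts = [{"name": "node1", "free_ram_mb": 1024},
             {"name": "node2", "free_ram_mb": 4096}]
    # Negative weight: more free RAM lowers the score, so load spreads out.
    print(weighted_sum([(-1.0, free_ram_cost)], hosts)[0]["name"])
    # Positive weight: the fullest host wins, so hosts fill up first.
    print(weighted_sum([(1.0, free_ram_cost)], hosts)[0]["name"])
```

The same cost function therefore supports both a spread-out and a fill-first placement policy, depending only on the configured weight.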

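Resource reservation

Another issue: when several instances are started in one request, nova-scheduler selects a host for them one at a time and does not refresh the nodes' available resources in between, so the available resource figures become inaccurate. For example, after Host1 has been selected to start one instance, Host1's resources look unchanged to nova-scheduler, and the second instance may be placed on Host1 as well. This is an obvious problem that the community has taken into account, and its solution is resource reservation. Put simply, inside the nova-scheduler process the resources of the instances already placed on a selected host are subtracted from that host's available resources, which avoids the inaccurate figures.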
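The effect of reservation is easy to see in a small simulation. The sketch below is hypothetical (HostView, place and the pick-the-host-with-most-free-RAM rule are made up for illustration); it only shows the in-process bookkeeping idea, not the scheduler's actual code.

```python
class HostView(object):
    """The scheduler's in-memory view of one host (hypothetical)."""

    def __init__(self, name, free_ram_mb):
        self.name = name
        self.free_ram_mb = free_ram_mb

    def consume(self, ram_mb):
        # Reserve resources locally right after the host is selected.
        self.free_ram_mb -= ram_mb


def place(hosts, requests_mb):
    """Place each request in turn, reserving RAM after every selection."""
    placements = []
    for ram_mb in requests_mb:
        fits = [h for h in hosts if h.free_ram_mb >= ram_mb]
        if not fits:
            placements.append(None)
            continue
        best = max(fits, key=lambda h: h.free_ram_mb)
        best.consume(ram_mb)
        placements.append(best.name)
    return placements


if __name__ == "__main__":
    hosts = [HostView("host1", 3000), HostView("host2", 2500)]
    print(place(hosts, [2000, 2000, 2000]))
```

Without the consume() call all three requests would land on host1, which can hold only one of them.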

Other issues

Running nova-scheduler on multiple nodes? Not supported in Essex; under development in Folsom.

Reference question:

https://answers.launchpad.net/nova/+question/199303

Community blueprint:

https://blueprints.launchpad.net/nova/+spec/scheduler-resource-race

Work items:

Added scheduling retries when build errors occur: DONE

Added resource tracking in the compute host to more gracefully control resource usage and provide up-to-date information to the scheduler: INPROGRESS

The reschedule-on-failure code already implemented in F-3:

https://review.openstack.org/#/c/9540/
