[qemu+kvm]: IOMMU on/off switch code analysis

This article walks through the code path that runs before a PCIe/PCI device or a platform device joins an IOMMU group, to help readers understand this functionality. It also shows how a single guest OS can use IOMMU-backed PCIe devices and non-IOMMU PCIe devices at the same time.

1. Usage

QEMU now supports two root complexes (RCs):

- The first RC, pcie.0, carries the iommu attribute.
- The second RC, pcie.1, does not carry the iommu attribute.

Hands-on example

Pass a PCIe wired NIC through to the guest with IOMMU enabled, and at the same time pass a PCIe wireless NIC through to the guest without IOMMU:

```shell
qemu-system-aarch64 \
    -M virt,gic-version=3,accel=kvm,iommu=smmuv3 \
    -cpu host \
    -kernel Image_guest \
    -append "root=/dev/vda console=ttyAMA0 init=/sbin/init rootwait rw" \
    -nographic \
    -m 10240M \
    -drive file=1.img,if=none,id=drv1,format=raw \
    -device virtio-blk-pci,drive=drv1,bus=pcie.0 \
    -device vfio-pci,host=0001:01:00.0,bus=pcie.0 \
    -device vfio-pci,host=0002:01:00.0,bus=pcie.1 \
    -smp 8
```

Note: the first passthrough PCI device is attached to bus pcie.0, and the second passthrough PCI device is attached to bus pcie.1.

Inside the guest, check with lspci -vvv:

```
// First RC: pcie.0
0000:00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge
	Subsystem: Red Hat, Inc. QEMU PCIe Host bridge
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

0000:00:03.0 Ethernet controller: Realtek Semiconductor Co., Ltd. Device 8126 (rev 01)
	...
	Interrupt: pin A routed to IRQ 35
	IOMMU group: 2    // joined an IOMMU group
	Region 0: I/O ports at 10000 [size=256]
	Region 2: Memory at 8000000000 (64-bit, non-prefetchable) [size=64K]
	Region 4: Memory at 8000018000 (64-bit, non-prefetchable) [size=16K]
	...
	Kernel modules: r8126

// Second RC: pcie.1
0001:00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge
	Subsystem: Red Hat, Inc. QEMU PCIe Host bridge
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

0001:00:01.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8852CE PCIe 802.11ax Wireless Network Controller (rev 01)
	...
	Interrupt: pin A routed to IRQ 69
	// did not join an IOMMU group
	Region 0: I/O ports at 1000 [size=256]
	Region 2: Memory at 20000000 (64-bit, non-prefetchable) [size=1M]
	...
	Kernel driver in use: rtl8852ce
	Kernel modules: rtl8852ce
```
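The same check that `lspci -vvv` performs by eye can be done programmatically: on Linux, a PCI device that belongs to an IOMMU group exposes an `iommu_group` link under its sysfs device directory. The following is a minimal userspace sketch, not code from QEMU or the kernel; the helper names `iommu_group_path` and `pci_dev_has_iommu_group` are hypothetical, and the sysfs root is passed in as a parameter so the function can be pointed at `/sys` in the guest.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Build "<sysfs_root>/bus/pci/devices/<bdf>/iommu_group" into buf.
 * Returns 0 on success, -1 if buf is too small.
 * (Hypothetical helper, for illustration only.)
 */
int iommu_group_path(char *buf, size_t len,
                     const char *sysfs_root, const char *bdf)
{
    int n;

    assert(buf && sysfs_root && bdf);
    n = snprintf(buf, len, "%s/bus/pci/devices/%s/iommu_group",
                 sysfs_root, bdf);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Returns 1 if the device has an iommu_group entry in sysfs,
 * 0 if it does not, and -1 if the path does not fit the buffer.
 */
int pci_dev_has_iommu_group(const char *sysfs_root, const char *bdf)
{
    char path[512];
    struct stat st;

    if (iommu_group_path(path, sizeof(path), sysfs_root, bdf) != 0)
        return -1;
    /* stat() follows the symlink to the group directory if present. */
    return stat(path, &st) == 0 ? 1 : 0;
}
```

In the guest from the example above, one would expect `pci_dev_has_iommu_group("/sys", "0000:00:03.0")` to report a group for the wired NIC on pcie.0 and `pci_dev_has_iommu_group("/sys", "0001:00:01.0")` to report none for the wireless NIC on pcie.1, matching the lspci output.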
2. Code walkthrough

When a device and a driver match, the kernel executes:

device_driver_attach -> __driver_probe_device -> really_probe

```c
static int really_probe(struct device *dev, struct device_driver *drv)
{
	...
	if (dev->bus->dma_configure) {
		ret = dev->bus->dma_configure(dev);
		if (ret)
			goto pinctrl_bind_failed;
	}
	...
	ret = call_driver_probe(dev, drv);
}
```

Before the driver's probe runs, really_probe first calls dev->bus->dma_configure. The dma_configure implementation depends on the bus type.

pci bus

```c
struct bus_type pci_bus_type = {
	.name		= "pci",
	.match		= pci_bus_match,
	.uevent		= pci_uevent,
	.probe		= pci_device_probe,
	.remove		= pci_device_remove,
	.shutdown	= pci_device_shutdown,
	.dev_groups	= pci_dev_groups,
	.bus_groups	= pci_bus_groups,
	.drv_groups	= pci_drv_groups,
	.pm		= PCI_PM_OPS_PTR,
	.num_vf		= pci_bus_num_vf,
	.dma_configure	= pci_dma_configure,
	.dma_cleanup	= pci_dma_cleanup,
};
EXPORT_SYMBOL(pci_bus_type);
```

platform bus

```c
struct bus_type platform_bus_type = {
	.name		= "platform",
	.dev_groups	= platform_dev_groups,
	.match		= platform_match,
	.uevent		= platform_uevent,
	.probe		= platform_probe,
	.remove		= platform_remove,
	.shutdown	= platform_shutdown,
	.dma_configure	= platform_dma_configure,
	.dma_cleanup	= platform_dma_cleanup,
	.pm		= &platform_dev_pm_ops,
};
EXPORT_SYMBOL_GPL(platform_bus_type);
```

The platform bus is comparatively simple, so the rest of this section uses the pci bus as the example.

pci_dma_configure

```c
static int pci_dma_configure(struct device *dev)
{
	struct pci_driver *driver = to_pci_driver(dev->driver);
	struct device *bridge;
	int ret = 0;

	bridge = pci_get_host_bridge_device(to_pci_dev(dev));

	if (IS_ENABLED(CONFIG_OF) && bridge->parent &&
	    bridge->parent->of_node) {
		ret = of_dma_configure(dev, bridge->parent->of_node, true);
	} else if (has_acpi_companion(bridge)) {
		struct acpi_device *adev = to_acpi_device_node(bridge->fwnode);

		ret = acpi_dma_configure(dev, acpi_get_dma_attr(adev));
	}

	pci_put_host_bridge_device(bridge);

	if (!ret && !driver->driver_managed_dma) {
		ret = iommu_device_use_default_domain(dev);
		if (ret)
			arch_teardown_dma_ops(dev);
	}

	return ret;
}
```

Arm platforms mostly take the DTB-based path, while x86 mostly uses ACPI tables. Note: whether the DTB or the ACPI table is used is not determined by the architecture itself.

of_dma_configure -> of_dma_configure_id -> of_iommu_configure

```c
int of_iommu_configure(struct device *dev, struct device_node *master_np,
		       const u32 *id)
{
	......
	if (dev_is_pci(dev)) {
		struct of_pci_iommu_alias_info info = {
			.dev = dev,
			.np = master_np,
		};

		err = pci_for_each_dma_alias(to_pci_dev(dev),
					     of_pci_iommu_init, &info);
		of_pci_check_device_ats(dev, master_np);
	} else {
		err = of_iommu_configure_device(master_np, dev, id);
	}
	mutex_unlock(&iommu_probe_device_lock);

	if (err == -ENODEV || err == -EPROBE_DEFER)
		return err;
	if (err)
		goto err_log;

	err = iommu_probe_device(dev);
	......
}
```

If the device is a PCIe device, the code goes through of_pci_iommu_init. Here dev is the device itself, and np is the DTB node of the device's parent, the PCIe RC.

of_pci_iommu_init -> of_iommu_configure_dev_id

```c
static int of_iommu_configure_dev_id(struct device_node *master_np,
				     struct device *dev,
				     const u32 *id)
{
	struct of_phandle_args iommu_spec = { .args_count = 1 };
	int err;

	err = of_map_id(master_np, *id, "iommu-map",
			"iommu-map-mask", &iommu_spec.np,
			iommu_spec.args);
	if (err)
		return err;

	err = of_iommu_xlate(dev, &iommu_spec);
	of_node_put(iommu_spec.np);
	return err;
}
```

This function looks up the RC's iommu-map property to map the device's ID to a stream ID, then calls of_iommu_xlate to record that stream ID in the device's fwspec.

Back in of_iommu_configure, the next call is iommu_probe_device, which invokes the SMMU ops' attach_dev to set up the STE (stream table entry).

Conclusion

At this point the behavior is clear. For a PCIe device, as long as the DTB node of its RC has an iommu-map property, the devices under that RC use the IOMMU. If the RC's node has no iommu-map property, the RC has no IOMMU capability: of_map_id fails, of_iommu_configure returns before iommu_probe_device is reached, and none of the EPs under that RC can use the IOMMU.
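The iommu-map lookup described above can be sketched in plain userspace C. This is an illustrative model under stated assumptions, not the kernel's `of_map_id`: the names `iommu_map_entry`, `pci_rid` and `map_rid_to_sid` are hypothetical, each map entry follows the devicetree layout (rid-base, iommu phandle, iommu-base, length), and the phandle and base values used with it are made up. As a simplification, the sketch returns `-ENODEV` both when the map is absent and when no entry covers the requester ID; the kernel treats the unmatched case differently.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* One iommu-map entry: (rid-base, iommu phandle, iommu-base, length). */
struct iommu_map_entry {
    uint32_t rid_base;
    uint32_t iommu_phandle;
    uint32_t iommu_base;
    uint32_t length;
};

/* PCI requester ID: bus[15:8] | device[7:3] | function[2:0]. */
uint32_t pci_rid(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return ((uint32_t)bus << 8) | ((uint32_t)(dev & 0x1f) << 3) |
           (uint32_t)(fn & 0x7);
}

/*
 * Translate a requester ID to a stream ID using an iommu-map style table.
 * Returns 0 and fills the outputs on a match, -ENODEV when the RC has no
 * map (n == 0) or, in this simplified sketch, when no entry matches.
 */
int map_rid_to_sid(const struct iommu_map_entry *map, size_t n,
                   uint32_t mask, uint32_t rid,
                   uint32_t *phandle_out, uint32_t *sid_out)
{
    uint32_t masked = rid & mask;
    size_t i;

    assert(phandle_out && sid_out);

    if (!map || n == 0)
        return -ENODEV;   /* no iommu-map: the RC has no IOMMU behind it */

    for (i = 0; i < n; i++) {
        const struct iommu_map_entry *e = &map[i];

        /* Skip entries whose [rid_base, rid_base + length) misses the RID. */
        if (masked < e->rid_base || masked - e->rid_base >= e->length)
            continue;

        *phandle_out = e->iommu_phandle;
        *sid_out = masked - e->rid_base + e->iommu_base;
        return 0;
    }
    return -ENODEV;
}
```

With a single hypothetical entry `{ 0x0, 1, 0x100, 0x10000 }`, the device at 00:03.0 has requester ID 0x18 and maps to stream ID 0x118, while an RC with an empty map yields `-ENODEV`, which corresponds to the pcie.1 case where the device never reaches iommu_probe_device.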