The previous blog, Execution Framework and Serial Code Optimizations, described a series of improvements for C++ code that extracts a rectilinear view from an equirectangular image. This blog explores converting the code from serial to parallel processing to optimize the extraction further.
The Khronos Group created an industry-standard specification called SYCL (pronounced "sickle") that allows programmers to express data parallelism and reuse code across multiple hardware targets (such as a CPU, GPU, or FPGA). SYCL extends ISO standard C++ (version 17 and up), providing many developers with a familiar language and development environment. Compilers exist for multiple vendors' hardware platforms, so code written in SYCL reduces vendor lock-in. This work evolved into oneAPI, and the Unified Acceleration Foundation now hosts the effort to drive an open, standards-based accelerator software ecosystem.
Intel created the Intel® oneAPI Base Toolkit (Base Kit), which contains all the tools and libraries required to write in SYCL/oneAPI. Intel refers to the language as Data Parallel C++ (DPC++). The toolkits are available at no charge from https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html. The Intel® oneAPI Base Toolkit integrates into the Microsoft Visual Studio and Eclipse Integrated Development Environments (IDEs), allowing developers to continue using those familiar environments if desired. The Base Toolkit includes the DPC++ compiler, Intel® VTune™ Profiler, and Intel® Advisor discussed previously in this series, as well as highly optimized libraries such as the Math Kernel Library, Threading Building Blocks, and Video Processing Library, among many others.
oneAPI supports targeting one or more hardware platforms (e.g., CPU, GPU, or FPGA) to execute code. The code must select the specific target hardware, perhaps with input from the person running the program. Once the target hardware is known, the code sets up a queue for interacting with the selected hardware and dispatching code to be executed on that hardware.
The code in ConfigurableDeviceSelector.cpp enables a program to request work queues directed to specific hardware types for code execution. All platforms and devices on a particular system can be listed with a command line flag, as shown below. These results are for the Intel® Core™ i9-9900K that has been used for executing all the code in this blog series.
.\x64\Release\OneDevice.exe --platformName=list
Platform: Intel(R) OpenCL
Device information:
vendor: Intel(R) Corporation
name: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
type: CPU
version: 3.0
driver_version: 2023.16.6.0.28_042959
max_compute_units: 16
address_bits: 64
error_correction_support: 0
Platform: Intel(R) OpenCL HD Graphics
Device information:
vendor: Intel(R) Corporation
name: Intel(R) UHD Graphics 630
type: GPU
version: 3.0
driver_version: 31.0.101.2125
max_compute_units: 24
address_bits: 64
error_correction_support: 0
Platform: Intel(R) OpenCL
Device information:
vendor: Intel(R) Corporation
name: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
type: CPU
version: 3.0
driver_version: 2023.16.6.0.28_042959
max_compute_units: 16
address_bits: 64
error_correction_support: 0
Platform: Intel(R) FPGA Emulation Platform for OpenCL(TM)
Device information:
vendor: Intel(R) Corporation
name: Intel(R) FPGA Emulation Device
type: ACC
version: 1.2
driver_version: 2023.16.6.0.28_042959
max_compute_units: 16
address_bits: 64
error_correction_support: 0
Platform: Intel(R) Level-Zero
Device information:
vendor: Intel(R) Corporation
name: Intel(R) UHD Graphics 630
type: GPU
version: 1.3
driver_version: 1.3.0
max_compute_units: 24
address_bits: 64
error_correction_support: 0
The ConfigurableDeviceSelector class has a class function called set_search, which sets the search parameters for the device selector. The arguments are the device type preference, the platform name, the device name, and the driver version to match against.
The OneDevice.exe framework calls the ConfigurableDeviceSelector by passing the values from the --typePreference, --platformName, --deviceName, and --driverVersion flags directly to the set_search function, with two exceptions. First, if the platformName or deviceName flag is equal to "all", the code steps through all the platforms and/or devices and executes the code on each. Second, if the platformName or deviceName is "list", the code prints a list of all available platforms and devices, such as the listing above.
In SYCL, device queues represent the linkage between the running program and the target device to utilize when executing a specific chunk of code (also known as a kernel). Device queue creation involves a couple of lines of code that look like the following. In this case, set_search stores the command line arguments in the ConfigurableDeviceSelector, and then the SYCL runtime calls the class's device_selector with each platform and device to determine which to use. The device_selector returns an integer for each call indicating whether the provided device is desired (a positive value) or not (a negative value). The device that receives the highest positive value is selected. After locating the best match, a queue to that device is created, and any code submitted through that queue is compiled for and executed on the selected device.
ConfigurableDeviceSelector::set_search(m_typePreference, m_platformName, m_deviceName, m_driverVersion);
m_pQ = new sycl::queue(ConfigurableDeviceSelector::device_selector, ehandler);
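The actual scoring logic lives in ConfigurableDeviceSelector.cpp; purely to illustrate the contract described above, a selector could be sketched as follows. The s_typePreference and s_deviceName variables stand in for the values saved by set_search and are illustrative names, not the class's actual members.
#include <string>
#include <sycl/sycl.hpp>
// Illustrative stand-ins for the search terms saved by set_search.
static std::string s_typePreference;
static std::string s_deviceName;
// Scoring in the style SYCL expects: a negative value means "never pick this
// device," and the device receiving the highest positive value wins.
int score_device(const sycl::device& dev)
{
    std::string name = dev.get_info<sycl::info::device::name>();
    if (!s_deviceName.empty() && name.find(s_deviceName) == std::string::npos)
        return -1;                                   // name does not match: reject
    if (s_typePreference == "GPU" && dev.is_gpu())
        return 100;                                  // preferred device type
    if (s_typePreference == "CPU" && dev.is_cpu())
        return 100;
    return 1;                                        // usable, but a weaker match
}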
Another key SYCL construct used by the code is parallel_for. For instance, an example code loop from the original serial implementation looks like the following.
Point3D* pElement = &m_pXYZPoints[0];
for (int y = 0; y < m_parameters->m_heightOutput; y++)
{
for (int x = 0; x < m_parameters->m_widthOutput; x++)
{
pElement->m_x = x * invf + translatecx;
pElement->m_y = y * invf + translatecy;
pElement->m_z = 1.0f;
pElement++;
}
}
In the parallel processing code, that same loop looks like the following.
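The repository contains the exact kernel; the minimal sketch below captures the pattern, assuming the same variables as the serial loop above, an m_pXYZPoints allocation the device can access (e.g., USM), and the usual #include <sycl/sycl.hpp>.
int height = m_parameters->m_heightOutput;
int width = m_parameters->m_widthOutput;
Point3D* pPoints = m_pXYZPoints;                 // assumed to be device-accessible memory
m_pQ->submit([&](sycl::handler& cgh) {
    cgh.parallel_for(sycl::range<2>(height, width), [=](sycl::item<2> item) {
        int y = item[0];                         // index 0 is the y index
        int x = item[1];                         // index 1 is the x index
        Point3D* pElement = &pPoints[y * width + x];
        pElement->m_x = x * invf + translatecx;
        pElement->m_y = y * invf + translatecy;
        pElement->m_z = 1.0f;
    });
});
m_pQ->wait();                                    // block until every pixel's work item completes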
The submit function delivers the kernel (in this case, a lambda expression) to the appropriate work queue. The submitted kernel receives a sycl::handler object representing the command group handler, which groups all the device-specific data and commands together for use during kernel execution. The next line utilizes the parallel_for method to break the original 2D image into a series of work items, each of which processes a single pixel. The index of the pixel assigned to a particular kernel execution is passed in via the item variable, with index 0 being the y index and index 1 being the x index. The kernel takes these values and locates the proper element to compute. After the loop, the wait() function waits for all parallel processing of pixels to complete before moving on through the remainder of the algorithm. If there were other useful code to execute while the parallel processing is underway, the program could place it between the submit and the wait; that code would run in parallel with the code executing on the SYCL device.
The previous blog covered algorithms 0 through 4; this blog covers algorithms 5 through 11, and all are available in the GitHub repository OneDevice. Algorithm 5 started from Algorithm 1 (a C++ port of the Python algorithm) and changed it to utilize parallel processing and to select which hardware executes the code using the constructs described above. Once all the for loops were converted to parallel_for equivalents, running algorithm 5 results in the following frames per second.
src\DPC++-OneDevice\x64\Release\OneDevice.exe --algorithm=5 --iterations=1001 --yaw=10 --pitch=20 --roll=30 --deltaYaw=10 --typePreference=CPU --img0=images\IMG_20230629_082736_00_095.jpg --img1=images\ImageAndOverlay-equirectangular.jpg
...
All done! Summary of all runs:
DpcppRemapping: Computes a Remapping algorithm using oneAPI's DPC++ Universal Shared Memory on Intel(R) OpenCL Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz 2023.16.6.0.28_042959
warmup, 1, frame(s), 0.37628390,s, 376.28390,ms, 376283.900,us, FPS, 2.6575679
times averaging, 1000, frame(s), 0.00434346,s, 4.34346,ms, 4343.463,us, FPS, 230.2310604
total averaging, 1001, Total, 0.00471534,s, 4.71534,ms, 4715.338,us, FPS, 212.0738898
Note that the blog references algorithm numbers as provided on the command line via the --algorithm, --startAlgorithm, or --endAlgorithm flags. Internally, the algorithms utilize classes named in a (hopefully) descriptive fashion. Thus, --algorithm=5 maps to DpcppRemapping (essentially v1 of the Data Parallel C++ algorithms), --algorithm=6 maps to DpcppRemappingV2, and so on. As a result, the internal class name appears in the printouts shown in this blog, while the command line and blog text reference the external algorithm number. In hindsight, it would have been better if these were all consistent.
As a reminder, running algorithm 1 with the same parameters provides the following results. The serial code is faster for the warmup iteration since the parallel code incurs extra overhead to select the hardware (even though the selection ends up being the same CPU), construct the queue, and compile the kernel for the target hardware. However, for every subsequent run that overhead disappears, and the parallel nature of the code provides a nice speed improvement (230.2310604 / 67.4700178 = 3.41x). Thus, if the use case requires processing a single image each time the code is invoked, pick the serial algorithm. If many frames are likely to be processed each time, consider the parallel algorithm instead.
src\DPC++-OneDevice\x64\Release\OneDevice.exe --algorithm=1 --iterations=1001 --yaw=10 --pitch=20 --roll=30 --deltaYaw=10 --typePreference=CPU --img0=images\IMG_20230629_082736_00_095.jpg --img1=images\ImageAndOverlay-equirectangular.jpg
...
V1a Multiple loop serial point by point conversion from equirectangular to flat. Memory array of structure row/column layout.
warmup, 1, frame(s), 0.02807420,s, 28.07420,ms, 28074.200,us, FPS, 35.6198930
times averaging, 1000, frame(s), 0.01482140,s, 14.82140,ms, 14821.398,us, FPS, 67.4700178
total averaging, 1001, Total, 0.01483488,s, 14.83488,ms, 14834.884,us, FPS, 67.4086857
Algorithm 6 optimizes further by taking all the separate invocations of kernels and combining the code into a single kernel, much like Algorithm 4 did when combining the separate functions into a single function. Compared to Algorithm 5, this provides a 35% speedup (311.4885092 / 230.2310604 = 1.35).
Algorithm 6 has two variants. The faster one utilizes Universal Shared Memory (USM), which allows a programmer to allocate memory in a fashion very similar to the standard C malloc call. oneAPI then ensures the memory is accessible to the selected device and moves data to and from the device as it is read or written. This is done with special calls to malloc_shared and a variation of free; both take additional arguments indicating which queue (or device and context) is used for the allocation and free operations. Universal Shared Memory can take advantage of memory shared by a CPU and an integrated GPU to make execution faster. If the device is a discrete GPU, the GPU is informed of the memory allocation, and data is transferred to and from the discrete GPU.
The second variant uses malloc_device, which specifically allocates memory space on the device. Responsibility for moving data to/from the device rests with the programmer.
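As an illustration of the difference between the two variants, the following minimal sketch shows both styles; m_pQ is the device queue created earlier, and count and pHostData are illustrative stand-ins for the number of Point3D elements and the host copy of the data.
// Variant 1: Universal Shared Memory - the runtime migrates the data between
// host and device as it is read or written.
Point3D* pShared = sycl::malloc_shared<Point3D>(count, *m_pQ);
// ... host code and kernels may both access pShared directly ...
sycl::free(pShared, *m_pQ);
// Variant 2: device memory - only the device can access it, so the programmer
// copies data explicitly in each direction.
Point3D* pDevice = sycl::malloc_device<Point3D>(count, *m_pQ);
m_pQ->memcpy(pDevice, pHostData, count * sizeof(Point3D)).wait();   // host -> device
// ... kernels operate on pDevice ...
m_pQ->memcpy(pHostData, pDevice, count * sizeof(Point3D)).wait();   // device -> host
sycl::free(pDevice, *m_pQ);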
The results below show a limitation of using variants in the framework. Notice that the warmup time for the Device Memory variant appears shorter, resulting in a higher FPS. However, if the variants run in the opposite order, the USM warmup becomes the faster one. This is most likely because the oneAPI compilation of all the kernels happens at the beginning of the program and therefore only factors into the first warmup period. This affects the "warmup" and "total averaging" rows but does not impact the validity of the "times averaging" rows.
src\DPC++-OneDevice\x64\Release\OneDevice.exe --algorithm=6 --iterations=1001 --yaw=10 --pitch=20 --roll=30 --deltaYaw=10 --typePreference=CPU --img0=images\IMG_20230629_082736_00_095.jpg --img1=images\ImageAndOverlay-equirectangular.jpg
...
All done! Summary of all runs:
DpcppRemappingV2: Single kernel vs 3 kernels using oneAPI's DPC++ Universal Shared Memory on Intel(R) OpenCL Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz 2023.16.6.0.28_042959
warmup, 1, frame(s), 0.29612750,s, 296.12750,ms, 296127.500,us, FPS, 3.3769238
times averaging, 1000, frame(s), 0.00321039,s, 3.21039,ms, 3210.391,us, FPS, 311.4885092
total averaging, 1001, Total, 0.00350324,s, 3.50324,ms, 3503.243,us, FPS, 285.4497670
DpcppRemappingV2: Single kernel vs 3 kernels using oneAPI's DPC++ Device Memory on Intel(R) OpenCL Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz 2023.16.6.0.28_042959
warmup, 1, frame(s), 0.00687840,s, 6.87840,ms, 6878.400,us, FPS, 145.3826471
times averaging, 1000, frame(s), 0.00354152,s, 3.54152,ms, 3541.519,us, FPS, 282.3646822
total averaging, 1001, Total, 0.00354516,s, 3.54516,ms, 3545.158,us, FPS, 282.0749115
The optimizations attempted in algorithms 7 and 8 decreased the frames per second. They are included for completeness but not discussed in depth since no speed-up was achieved.
Algorithm 7 experimented with parallel_for_work_item and algorithm 8 used sub_groups. It is unclear if these constructs are not well suited to this particular use case or if a more optimal way exists to configure them. See the results from the two runs below.
src\DPC++-OneDevice\x64\Release\OneDevice.exe --algorithm=8 --iterations=1001 --yaw=10 --pitch=20 --roll=30 --deltaYaw=10 --typePreference=CPU --img0=images\IMG_20230629_082736_00_095.jpg --img1=images\ImageAndOverlay-equirectangular.jpg
...
All done! Summary of all runs:
DpcppRemappingV4: Computes a Remapping algorithm using oneAPI's DPC++ sub-groups to reduce scatter with Universal Shared Memory on Intel(R) OpenCL Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz 2023.16.6.0.28_042959
warmup, 1, frame(s), 0.31897010,s, 318.97010,ms, 318970.100,us, FPS, 3.1350901
times averaging, 1000, frame(s), 0.00628122,s, 6.28122,ms, 6281.221,us, FPS, 159.2047227
total averaging, 1001, Total, 0.00659384,s, 6.59384,ms, 6593.842,us, FPS, 151.6566504
DpcppRemappingV4: Computes a Remapping algorithm using oneAPI's DPC++ sub-groups to reduce scatter with Device Memory on Intel(R) OpenCL Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz 2023.16.6.0.28_042959
warmup, 1, frame(s), 0.00984590,s, 9.84590,ms, 9845.900,us, FPS, 101.5651185
times averaging, 1000, frame(s), 0.00662433,s, 6.62433,ms, 6624.329,us, FPS, 150.9586744
total averaging, 1001, Total, 0.00662780,s, 6.62780,ms, 6627.802,us, FPS, 150.8795759
Algorithm 9 enhances the ExtractFrameImage function relative to Algorithm 6. Many of the previous algorithms' uses of ExtractFrameImage relied on cv::remap to gather all pixel values into the rectilinear image. Algorithm 9 replaces the remap function with a DPC++ parallel_for and computes all the pixel values using parallel processing. Compared to Algorithm 6, this results in a 2x improvement (629.4278677 / 311.4885092 = 2.02).
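As a rough illustration of the replacement (not the repository's exact kernel), the sketch below gathers each output pixel with a parallel_for, assuming flat 3-channel 8-bit buffers (pEquirect, pFlat), per-pixel source coordinates in pMapX/pMapY, and the image dimensions; edge clamping is omitted for brevity.
m_pQ->submit([&](sycl::handler& cgh) {
    cgh.parallel_for(sycl::range<2>(flatHeight, flatWidth), [=](sycl::item<2> item) {
        int idx = item[0] * flatWidth + item[1];     // output pixel index
        float u = pMapX[idx];                        // source column (floating point)
        float v = pMapY[idx];                        // source row (floating point)
        int x0 = (int)u, y0 = (int)v;
        float fx = u - x0, fy = v - y0;
        for (int c = 0; c < 3; c++)                  // blend the four neighbors per channel
        {
            const unsigned char* p = pEquirect + (y0 * equirectWidth + x0) * 3 + c;
            float top = p[0] * (1 - fx) + p[3] * fx;
            float bottom = p[equirectWidth * 3] * (1 - fx) + p[equirectWidth * 3 + 3] * fx;
            pFlat[idx * 3 + c] = (unsigned char)(top * (1 - fy) + bottom * fy);
        }
    });
}).wait();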
src\DPC++-OneDevice\x64\Release\OneDevice.exe --algorithm=9 --iterations=1001 --yaw=10 --pitch=20 --roll=30 --deltaYaw=10 --typePreference=CPU --img0=images\IMG_20230629_082736_00_095.jpg --img1=images\ImageAndOverlay-equirectangular.jpg
...
All done! Summary of all runs:
DpcppRemappingV5: DpcppRemappingV2 and optimized ExtractFrame using DPC++ and USM Intel(R) OpenCL Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz 2023.16.6.0.28_042959
warmup, 1, frame(s), 0.47157100,s, 471.57100,ms, 471571.000,us, FPS, 2.1205715
times averaging, 1000, frame(s), 0.00158874,s, 1.58874,ms, 1588.744,us, FPS, 629.4278677
total averaging, 1001, Total, 0.00205839,s, 2.05839,ms, 2058.388,us, FPS, 485.8171719
DpcppRemappingV5: DpcppRemappingV2 and optimized ExtractFrame using DPC++ and Device Memory on Intel(R) OpenCL Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz 2023.16.6.0.28_042959
warmup, 1, frame(s), 0.03297480,s, 32.97480,ms, 32974.800,us, FPS, 30.3261885
times averaging, 1000, frame(s), 0.00211797,s, 2.11797,ms, 2117.966,us, FPS, 472.1511556
total averaging, 1001, Total, 0.00214898,s, 2.14898,ms, 2148.975,us, FPS, 465.3380560
Algorithm 10 experiments with possibly sacrificing image quality in favor of speed. If the frames come from a video stream, then perhaps the interpolation across multiple sampled pixels can be dropped. Thus, this algorithm simply takes the nearest top-left pixel to the computed pixel location. Note that the computed pixel locations are floating point, so they may not uniquely identify a particular pixel in the equirectangular image. By simply truncating the coordinates to integers and using those values as the pixel address, the calculations become nearly 50% faster (930.5499839 / 629.4278677 = 1.48), as shown below.
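Concretely, the per-pixel gather reduces to a direct copy of the truncated source pixel. A minimal sketch, reusing the illustrative buffer names from the Algorithm 9 sketch above (u and v are the computed floating point source coordinates and idx is the output pixel index):
int srcX = (int)u;                               // truncate toward the top-left neighbor
int srcY = (int)v;
const unsigned char* pSrc = pEquirect + (srcY * equirectWidth + srcX) * 3;
pFlat[idx * 3 + 0] = pSrc[0];                    // copy the channels with no interpolation
pFlat[idx * 3 + 1] = pSrc[1];
pFlat[idx * 3 + 2] = pSrc[2];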
src\DPC++-OneDevice\x64\Release\OneDevice.exe --algorithm=10 --iterations=1001 --yaw=10 --pitch=20 --roll=30 --deltaYaw=10 --typePreference=CPU --img0=images\IMG_20230629_082736_00_095.jpg --img1=images\ImageAndOverlay-equirectangular.jpg
...
All done! Summary of all runs:
DpcppRemappingV6: DpcppRemappingV5 USM but just taking the truncated pixel point Intel(R) OpenCL Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz 2023.16.6.0.28_042959
warmup, 1, frame(s), 0.46488320,s, 464.88320,ms, 464883.200,us, FPS, 2.1510779
times averaging, 1000, frame(s), 0.00107463,s, 1.07463,ms, 1074.633,us, FPS, 930.5499839
total averaging, 1001, Total, 0.00153812,s, 1.53812,ms, 1538.124,us, FPS, 650.1424403
DpcppRemappingV6: DpcppRemappingV5 Device Memory but just taking the truncated pixel point Intel(R) OpenCL Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz 2023.16.6.0.28_042959
warmup, 1, frame(s), 0.03010280,s, 30.10280,ms, 30102.800,us, FPS, 33.2195012
times averaging, 1000, frame(s), 0.00165322,s, 1.65322,ms, 1653.221,us, FPS, 604.8796610
total averaging, 1001, Total, 0.00168184,s, 1.68184,ms, 1681.843,us, FPS, 594.5858356
For a comparison of the output resulting from Algorithms 9 and 10, see Figures 1 and 2, respectively.
The final code optimization utilizes Universal Shared Memory, but when the target device is the CPU, the code does not make a memory copy and simply points at the original image buffer. This boosts the FPS by about 3% (954.9373652 / 930.5499839 = 1.03), as shown below.
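A minimal sketch of the idea (illustrative names, not the repository's exact code): when the selected device is the CPU, the host image buffer is already usable by the kernels, so the copy can be skipped; otherwise, fall back to a shared allocation. Here m_pQ is the device queue and image is assumed to be the cv::Mat holding the equirectangular frame.
unsigned char* pKernelData;
bool allocated = false;
if (m_pQ->get_device().is_cpu())
{
    pKernelData = image.data;                            // reuse the host buffer directly
}
else
{
    size_t bytes = image.total() * image.elemSize();
    pKernelData = sycl::malloc_shared<unsigned char>(bytes, *m_pQ);
    m_pQ->memcpy(pKernelData, image.data, bytes).wait(); // one copy for the device
    allocated = true;
}
// ... kernels read pKernelData ...
if (allocated)
    sycl::free(pKernelData, *m_pQ);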
src\DPC++-OneDevice\x64\Release\OneDevice.exe --algorithm=11 --iterations=1001 --yaw=10 --pitch=20 --roll=30 --deltaYaw=10 --typePreference=CPU --img0=images\IMG_20230629_082736_00_095.jpg --img1=images\ImageAndOverlay-equirectangular.jpg
...
All done! Summary of all runs:
DpcppRemappingV7: DpcppRemappingV6 USM but on CPU don't copy memory Intel(R) OpenCL Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz 2023.16.6.0.28_042959
warmup, 1, frame(s), 0.43129320,s, 431.29320,ms, 431293.200,us, FPS, 2.3186083
times averaging, 1000, frame(s), 0.00104719,s, 1.04719,ms, 1047.189,us, FPS, 954.9373652
total averaging, 1001, Total, 0.00147714,s, 1.47714,ms, 1477.137,us, FPS, 676.9853390
DpcppRemappingV7: DpcppRemappingV6 Device Memory but on CPU don't copy memory Intel(R) OpenCL Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz 2023.16.6.0.28_042959
warmup, 1, frame(s), 0.00729330,s, 7.29330,ms, 7293.300,us, FPS, 137.1121440
times averaging, 1000, frame(s), 0.00161683,s, 1.61683,ms, 1616.828,us, FPS, 618.4949408
total averaging, 1001, Total, 0.00162270,s, 1.62270,ms, 1622.702,us, FPS, 616.2559833
Since each frame is being processed within 1 millisecond, finding additional improvement becomes increasingly difficult, but please comment if there are further enhancements to consider. Algorithm 11 excels when the viewing direction stays static and the equirectangular frame changes. As shown below, replacing the --deltaYaw flag with --deltaImage results in 1607 FPS when extracting a rectilinear image.
src\DPC++-OneDevice\x64\Release\OneDevice.exe --algorithm=11 --iterations=1001 --yaw=10 --pitch=20 --roll=30 --deltaImage --typePreference=CPU --img0=images\IMG_20230629_082736_00_095.jpg --img1=images\ImageAndOverlay-equirectangular.jpg
...
All done! Summary of all runs:
DpcppRemappingV7: DpcppRemappingV6 USM but on CPU don't copy memory Intel(R) OpenCL Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz 2023.16.6.0.28_042959
warmup, 1, frame(s), 0.45048580,s, 450.48580,ms, 450485.800,us, FPS, 2.2198258
times averaging, 1000, frame(s), 0.00062221,s, 0.62221,ms, 622.212,us, FPS, 1607.1695189
total averaging, 1001, Total, 0.00107182,s, 1.07182,ms, 1071.817,us, FPS, 932.9951933
DpcppRemappingV7: DpcppRemappingV6 Device Memory but on CPU don't copy memory Intel(R) OpenCL Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz 2023.16.6.0.28_042959
warmup, 1, frame(s), 0.00696300,s, 6.96300,ms, 6963.000,us, FPS, 143.6162574
times averaging, 1000, frame(s), 0.00128732,s, 1.28732,ms, 1287.324,us, FPS, 776.8054590
total averaging, 1001, Total, 0.00129327,s, 1.29327,ms, 1293.271,us, FPS, 773.2334243
The code implements algorithms 12 through 18 as well. Algorithms 5 through 11 utilize 64-bit floating point (fp64) elements (double), but not all hardware supports fp64, so algorithms 12 through 18 are identical to 5 through 11 except that 32-bit floating point (fp32) elements (float) are used instead. Neither fp64 nor fp32 operated significantly faster or slower than the other (generally within 1 or 2% of each other); thus, algorithms 12 through 18 require no further discussion beyond pointing out that if the hardware being used does not support fp64, use the fp32 variant instead.
Important: If the hardware does not support fp64, the code may throw an exception like "Required aspect fp64 is not supported on the device". If this occurs, converting the code to fp32 requires changing all usage of the double data type in the kernels to float. However, that may not be enough. There is also an extremely subtle situation where numeric literals used in floating point computations (either hardcoded values such as 2 or 0.5 in an equation, or values in #defines) are treated as doubles by the compiler. In these cases, denote them as floats (e.g., use 0.5f instead of 0.5). Unfortunately, the compiler provides no help in locating where the code does this, so finding and fixing these locations can be tedious.
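A minimal illustration of the implicit promotion (radius is a hypothetical float variable):
float bad  = radius * 0.5;   // 0.5 is a double literal, so the multiply is performed in fp64
float good = radius * 0.5f;  // 0.5f keeps the computation entirely in fp32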
After experimenting with 12 algorithms, the code operates around 1700 times faster (1607.1695189 / 0.9273079 = 1,733.2) than the original C++ algorithm. Key speedups reported in this blog post include: 1) converting from serial to parallel processing, 2) reducing the number of kernels used to one, 3) replacing cv::remap with a DPC++ parallel algorithm, 4) simplifying the calculation of which pixel value to extract from the equirectangular image, and 5) keeping the viewing direction static while changing the frames to extract. If the viewing direction changes, the code remains over 1000 times faster (954.9373652 / 0.9273079 = 1,029.8) than the original C++ algorithm and 80 times faster than Python (954.9373652 / 11.821199533998858 = 80.78). Incidentally, the Python code can be modified to cache the viewing direction, in which case it executes at around 150 FPS. Algorithm 11 remains around 10 times faster than the Python code under these conditions (1607.1695189 / 150 = 10.7).
At 1607 FPS, the code could potentially handle up to 50 video streams at 30 FPS on a 9th Gen Intel® Core™ i9 class processor. Newer generations or more powerful computing classes, such as Intel® Xeon® processors, could handle even more streams. This is ideal for use cases such as telesitting in hospital environments or video security setups where a single computer monitors multiple cameras simultaneously.
The next blog in this series, GPU Exploration, explores using an integrated GPU when performing the algorithm.
Doug Bogia received his Ph.D. in computer science from the University of Illinois, Urbana-Champaign, and works at Intel Corporation. He enjoys photography, woodworking, programming, and optimizing solutions to run as fast as possible on a given piece of hardware.
© Intel Corporation. Intel, the Intel logo, VTune, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.