Optimization of the dense matrix multiplication procedure for shared memory systems
Abstract
Incoming article date: 11.03.2024

The study presents an extensive analysis of low-level optimization methods for the matrix multiplication algorithm on shared-memory computing systems. A comparison of several approaches, including block (tiled) optimization, parallel execution with OpenMP, vectorization with AVX, and the use of the Intel MKL library, reveals significant improvements in the performance of the resulting software implementations. In particular, block optimization reduces the number of cache misses, parallel execution makes effective use of multiple cores, and vectorization and Intel MKL deliver the greatest speedup owing to more efficient low-level optimizations. The results emphasize the importance of carefully selecting optimization methods and matching them to the architecture of the computing system in order to achieve the required performance of the designed software.
Keywords: low-level optimization, block optimization, parallel execution, OpenMP, vectorization, AVX, Intel MKL, performance, benchmarking, matrix multiplication
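To make the techniques summarized in the abstract concrete, the sketch below combines two of them: block (tiled) multiplication and OpenMP parallelization. It is a minimal illustration rather than the implementation evaluated in the study; the tile size BS, the function name matmul_blocked, and the row-major square-matrix layout are assumptions made for the example.

```c
#include <stddef.h>
#include <omp.h>   /* compile with -fopenmp (GCC/Clang) or /openmp (MSVC) */

/* Assumed tile size: chosen so that three BS x BS double tiles fit
 * comfortably in cache, which is what reduces cache misses. */
#define BS 64

/* Illustrative blocked matrix multiplication, C += A * B, for n x n
 * matrices stored in row-major order. The caller is expected to
 * zero-initialize C. */
static void matmul_blocked(const double *A, const double *B, double *C, size_t n)
{
    /* Each thread is assigned whole (ii, jj) tiles of C, so different
     * threads never write to the same elements of C. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t jj = 0; jj < n; jj += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                /* Multiply the (ii,kk) tile of A by the (kk,jj) tile of B
                 * and accumulate into the (ii,jj) tile of C. */
                for (size_t i = ii; i < ii + BS && i < n; ++i)
                    for (size_t k = kk; k < kk + BS && k < n; ++k) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + BS && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

For comparison, the Intel MKL variant mentioned in the abstract would typically reduce to a single cblas_dgemm call, with vectorization (AVX) and threading handled inside the library.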