More efficient movemask for aarch64 #1237

serge-sans-paille · 2025-12-27T15:32:06Z

No description provided.

onalante-ebay · 2025-12-27T20:42:34Z

include/xsimd/arch/xsimd_neon64.hpp

+         ********/
+
+        template <class A, class T, detail::enable_sized_t<T, 1> = 0>
+        XSIMD_INLINE uint64_t mask(batch_bool<T, A> const& self, requires_arch<neon64>) noexcept


It seems like this has lower block throughput than the non-NEON64 variant: https://godbolt.org/z/szPjEzPW7.

Aha, benchmarks on actual CPUs were faster with vaddv: DLTcollab/sse2neon@ed179d7.
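
For context on the `vaddv` approach mentioned above, here is a minimal sketch of the idea for 8-bit lanes. It is not the code from this PR or from sse2neon; `movemask_u8_sketch` is a hypothetical name. It assumes every lane is either `0x00` or `0xFF`, as produced by a NEON comparison. Each lane keeps one distinct bit, and a horizontal add then packs eight lanes into one byte. A scalar fallback is included only so the sketch also builds on non-AArch64 hosts.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not part of xsimd): packs 16 boolean byte lanes into
// the low 16 bits of the result, lane i going to bit i.
// Precondition (assumption): every lane is 0x00 or 0xFF.
inline std::uint64_t movemask_u8_sketch(const std::uint8_t (&lanes)[16])
{
#if defined(__aarch64__)
    // One distinct bit per lane within each 8-lane half.
    static const std::uint8_t bit_per_lane[16] = {
        1, 2, 4, 8, 16, 32, 64, 128,
        1, 2, 4, 8, 16, 32, 64, 128
    };
    const uint8x16_t bits = vandq_u8(vld1q_u8(lanes), vld1q_u8(bit_per_lane));
    // The bits are disjoint, so the horizontal add cannot carry.
    const std::uint64_t lo = vaddv_u8(vget_low_u8(bits));
    const std::uint64_t hi = vaddv_u8(vget_high_u8(bits));
    return lo | (hi << 8);
#else
    // Scalar fallback with the same result for boolean lanes.
    std::uint64_t mask = 0;
    for (unsigned i = 0; i < 16; ++i)
    {
        assert(lanes[i] == 0x00 || lanes[i] == 0xFF);
        if (lanes[i] != 0)
        {
            mask |= std::uint64_t { 1 } << i;
        }
    }
    return mask;
#endif
}
```

With lanes 0, 2, 8 and 15 set to `0xFF` and all others zero, the sketch returns `0x8105`.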

The results might need reevaluation for u{16,32}. I can put together a benchmark since I am using an M2-based device.
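
The same idea carries over to wider lanes, which is what the u{16,32} question is about. The sketch below is an illustration under assumptions, not the PR's implementation: the function names are hypothetical, and lanes are assumed to be all-zeros or all-ones. With 16-bit and 32-bit lanes a single full-width horizontal add (`vaddvq_u16`, `vaddvq_u32`) is enough, because all per-lane bits fit in one element.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper: 8 boolean 16-bit lanes -> low 8 bits of the result.
// Precondition (assumption): every lane is 0x0000 or 0xFFFF.
inline std::uint64_t movemask_u16_sketch(const std::uint16_t (&lanes)[8])
{
#if defined(__aarch64__)
    static const std::uint16_t bit_per_lane[8] = { 1, 2, 4, 8, 16, 32, 64, 128 };
    const uint16x8_t bits = vandq_u16(vld1q_u16(lanes), vld1q_u16(bit_per_lane));
    return vaddvq_u16(bits); // disjoint bits, so the sum is the packed mask
#else
    std::uint64_t mask = 0;
    for (unsigned i = 0; i < 8; ++i)
    {
        assert(lanes[i] == 0x0000 || lanes[i] == 0xFFFF);
        if (lanes[i] != 0)
        {
            mask |= std::uint64_t { 1 } << i;
        }
    }
    return mask;
#endif
}

// Hypothetical helper: 4 boolean 32-bit lanes -> low 4 bits of the result.
// Precondition (assumption): every lane is 0 or 0xFFFFFFFF.
inline std::uint64_t movemask_u32_sketch(const std::uint32_t (&lanes)[4])
{
#if defined(__aarch64__)
    static const std::uint32_t bit_per_lane[4] = { 1, 2, 4, 8 };
    const uint32x4_t bits = vandq_u32(vld1q_u32(lanes), vld1q_u32(bit_per_lane));
    return vaddvq_u32(bits);
#else
    std::uint64_t mask = 0;
    for (unsigned i = 0; i < 4; ++i)
    {
        assert(lanes[i] == 0u || lanes[i] == 0xFFFFFFFFu);
        if (lanes[i] != 0)
        {
            mask |= std::uint64_t { 1 } << i;
        }
    }
    return mask;
#endif
}
```

Whether this beats the generic NEON path for a given lane width is exactly what a benchmark on real hardware has to settle; the sketch says nothing about throughput.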

I can confirm faster execution for u{8,16,32} on this crude benchmark:

PATCH

diff --git a/benchmark/main.cpp b/benchmark/main.cpp
index 7a630e4..e921566 100644
--- a/benchmark/main.cpp
+++ b/benchmark/main.cpp
@@ -12,6 +12,15 @@
 #include "xsimd_benchmark.hpp"
 #include <map>
 
+void benchmark_mask()
+{
+    std::size_t size = 20000;
+    xsimd::run_mask_benchmark<uint8_t>(std::cout, size, 1000);
+    xsimd::run_mask_benchmark<uint16_t>(std::cout, size, 1000);
+    xsimd::run_mask_benchmark<uint32_t>(std::cout, size, 1000);
+    xsimd::run_mask_benchmark<uint64_t>(std::cout, size, 1000);
+}
+
 void benchmark_operation()
 {
     // std::size_t size = 9984;
@@ -112,6 +121,7 @@ void benchmark_basic_math()
 int main(int argc, char* argv[])
 {
     const std::map<std::string, std::pair<std::string, void (*)()>> fn_map = {
+        { "mask", { "mask", benchmark_mask } },
         { "op", { "arithmetic", benchmark_operation } },
         { "exp", { "exponential and logarithm", benchmark_exp_log } },
         { "trigo", { "trigonometric", benchmark_trigo } },
diff --git a/benchmark/xsimd_benchmark.hpp b/benchmark/xsimd_benchmark.hpp
index 6f6b91b..8b8447c 100644
--- a/benchmark/xsimd_benchmark.hpp
+++ b/benchmark/xsimd_benchmark.hpp
@@ -16,6 +16,7 @@
 #include "xsimd/xsimd.hpp"
 #include <chrono>
 #include <iostream>
+#include <random>
 #include <string>
 #include <vector>
 
@@ -310,6 +311,38 @@ namespace xsimd
         return t_res;
     }
 
+    template <class T, class OS, kernel::detail::enable_integral_t<T> = 0>
+    void run_mask_benchmark(OS& out, std::size_t size, std::size_t iter)
+    {
+        bench_vector<T> f_lhs;
+        // NOTE: This is a hack to match the signature of `benchmark_simd{,_unrolled}`.
+        bench_vector<T> f_res;
+
+        size = size / batch<T>::size * batch<T>::size;
+        f_lhs.resize(size);
+        f_res.resize(size);
+
+        std::minstd_rand rng(1337);
+        std::bernoulli_distribution dist;
+        for (std::size_t i = 0; i < size; ++i)
+        {
+            f_lhs[i] = static_cast<T>(dist(rng));
+        }
+
+        const auto mask_functor = [](batch<T> const& x)
+        {
+            return (x == batch<T>(0)).mask();
+        };
+        const auto time = benchmark_simd<batch<T>>(mask_functor, f_lhs, f_res, iter);
+        const auto time_unr = benchmark_simd_unrolled<batch<T>>(mask_functor, f_lhs, f_res, iter);
+
+        out << "============================" << std::endl;
+        out << "mask" << sizeof(T) * 8 << std::endl;
+        out << "vector     : " << time.count() << "ms" << std::endl;
+        out << "vector unr : " << time_unr.count() << "ms" << std::endl;
+        out << "============================" << std::endl;
+    }
+
     template <class F, class OS>
     void run_benchmark_1op(F f, OS& out, std::size_t size, std::size_t iter, init_method init = init_method::classic)
     {
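
As a reading aid for the benchmark above: the functor compares each lane against zero and packs the result into an integer. The scalar sketch below shows the value one batch is expected to produce, assuming `mask()` places lane `i` at bit `i`. `reference_zero_mask` is a hypothetical helper for illustration and is not part of xsimd or of the patch.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference: bit i of the result is set when lanes[i] == 0,
// mirroring `(x == batch<T>(0)).mask()` under the bit-i-for-lane-i assumption.
// Only the first 64 lanes are considered, since the result is 64 bits wide.
template <class T>
std::uint64_t reference_zero_mask(const T* lanes, std::size_t lane_count)
{
    std::uint64_t mask = 0;
    for (std::size_t i = 0; i < lane_count && i < 64; ++i)
    {
        if (lanes[i] == T(0))
        {
            mask |= std::uint64_t { 1 } << i;
        }
    }
    return mask;
}
```

Since the benchmark fills its input with 0/1 values drawn from a Bernoulli distribution, roughly half of the mask bits are expected to be set per batch.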

Thanks for the feedback. Let's merge that one then! (once CI is happy)

As a complement to #1236

serge-sans-paille mentioned this pull request Dec 27, 2025

Implement optimized movemasks for NEON #1236

Open

serge-sans-paille force-pushed the feature/aarch64-movemask branch 3 times, most recently from c5f067e to d85a523 on December 27, 2025 20:42

onalante-ebay reviewed Dec 27, 2025

More efficient batch_bool::mask() for aarch64

5fac2ad

As a complement to #1236

serge-sans-paille force-pushed the feature/aarch64-movemask branch from d85a523 to 5fac2ad on December 28, 2025 10:30
