Use SIMD intrinsics for reverseBits #71

Merged (4 commits) on Apr 11, 2023

konsumlamm
Contributor

Refs #66.

There are three C implementations:

  • an SSE version (only uses SSE2)
  • an AVX version (uses AVX2)
  • a pure C version

The SIMD versions are guarded by #ifdef __x86_64__ (a macro defined by both gcc and clang), and the specific version is selected at run time via __builtin_cpu_supports (a builtin that both gcc and clang also provide).
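To make the guard-and-dispatch scheme concrete, here is a minimal sketch, not the code from this PR. All function names (reverse_word, reverse_bits_portable, reverse_128, reverse_bits_sse2, reverse_bits) are hypothetical, and it assumes the operation is "reverse the bits of every 64-bit word and reverse the word order" with non-overlapping buffers. Only the portable and SSE2 paths are shown; the AVX2 path is omitted.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef __x86_64__
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Reverse the bits of one 64-bit word with the classic swap network. */
static uint64_t reverse_word(uint64_t x) {
    x = ((x >> 1)  & 0x5555555555555555ULL) | ((x & 0x5555555555555555ULL) << 1);
    x = ((x >> 2)  & 0x3333333333333333ULL) | ((x & 0x3333333333333333ULL) << 2);
    x = ((x >> 4)  & 0x0F0F0F0F0F0F0F0FULL) | ((x & 0x0F0F0F0F0F0F0F0FULL) << 4);
    x = ((x >> 8)  & 0x00FF00FF00FF00FFULL) | ((x & 0x00FF00FF00FF00FFULL) << 8);
    x = ((x >> 16) & 0x0000FFFF0000FFFFULL) | ((x & 0x0000FFFF0000FFFFULL) << 16);
    return (x >> 32) | (x << 32);
}

/* Portable fallback: dest[len-1-i] = bit-reversed src[i].
   Assumption of this sketch: dest and src do not overlap. */
static void reverse_bits_portable(uint64_t *dest, const uint64_t *src, size_t len) {
    for (size_t i = 0; i < len; i++)
        dest[len - 1 - i] = reverse_word(src[i]);
}

#ifdef __x86_64__
/* Reverse all 128 bits of a vector using SSE2 only (no pshufb). */
static __m128i reverse_128(__m128i x) {
    const __m128i m1 = _mm_set1_epi8(0x55);
    const __m128i m2 = _mm_set1_epi8(0x33);
    const __m128i m4 = _mm_set1_epi8(0x0F);
    /* Swap adjacent bits, then bit pairs, then nibbles inside each byte.
       The byte-wise masks discard anything shifted across a byte boundary. */
    x = _mm_or_si128(_mm_and_si128(_mm_srli_epi16(x, 1), m1),
                     _mm_slli_epi16(_mm_and_si128(x, m1), 1));
    x = _mm_or_si128(_mm_and_si128(_mm_srli_epi16(x, 2), m2),
                     _mm_slli_epi16(_mm_and_si128(x, m2), 2));
    x = _mm_or_si128(_mm_and_si128(_mm_srli_epi16(x, 4), m4),
                     _mm_slli_epi16(_mm_and_si128(x, m4), 4));
    /* Swap the two bytes of every 16-bit lane. */
    x = _mm_or_si128(_mm_srli_epi16(x, 8), _mm_slli_epi16(x, 8));
    /* Reverse the four 16-bit lanes inside each 64-bit half. */
    x = _mm_shufflelo_epi16(x, _MM_SHUFFLE(0, 1, 2, 3));
    x = _mm_shufflehi_epi16(x, _MM_SHUFFLE(0, 1, 2, 3));
    /* Swap the two 64-bit halves. */
    return _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
}

/* SSE2 path: handle two words per iteration, then a scalar tail. */
static void reverse_bits_sse2(uint64_t *dest, const uint64_t *src, size_t len) {
    size_t i = 0;
    for (; i + 2 <= len; i += 2) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dest + len - 2 - i), reverse_128(v));
    }
    if (i < len) /* odd length: one word left over, it lands at dest[0] */
        dest[0] = reverse_word(src[i]);
}
#endif

/* Entry point: compile-time guard plus run-time CPU feature check. */
void reverse_bits(uint64_t *dest, const uint64_t *src, size_t len) {
#ifdef __x86_64__
    if (__builtin_cpu_supports("sse2")) {
        reverse_bits_sse2(dest, src, len);
        return;
    }
#endif
    reverse_bits_portable(dest, src, len);
}
```

An AVX2 branch would slot in before the SSE2 check with __builtin_cpu_supports("avx2"), but the AVX2 function itself has to be compiled with AVX2 enabled (for example through a per-function target attribute), which this sketch leaves out.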

I increased the number of reverseBits test cases to 500 to make it more likely to catch errors in the C implementation: when I intentionally introduced bugs, some of them only showed up after a few runs of the test suite with the default of 100 tests.

Benchmark results (with #70)
All
  reverse
    32
      Bit: OK (0.66s)
        38.2 ns ± 2.7 ns
      C:   OK (0.67s)
        38.9 ns ± 718 ps, 1.02x
      SSE: OK (1.18s)
        34.1 ns ± 2.2 ns, 0.89x
      AVX: OK (0.30s)
        33.6 ns ± 1.7 ns, 0.88x
    64
      Bit: OK (0.22s)
        24.5 ns ± 1.4 ns
      C:   OK (0.31s)
        16.7 ns ± 692 ps, 0.68x
      SSE: OK (0.33s)
        18.5 ns ± 838 ps, 0.75x
      AVX: OK (0.34s)
        19.2 ns ± 1.0 ns, 0.78x
    128
      Bit: OK (0.28s)
        31.2 ns ± 2.0 ns
      C:   OK (0.35s)
        19.8 ns ± 968 ps, 0.64x
      SSE: OK (0.33s)
        18.7 ns ± 1.0 ns, 0.60x
      AVX: OK (0.41s)
        23.7 ns ± 960 ps, 0.76x
    256
      Bit: OK (0.43s)
        50.0 ns ± 2.2 ns
      C:   OK (0.24s)
        27.0 ns ± 2.3 ns, 0.54x
      SSE: OK (0.38s)
        21.4 ns ± 1.6 ns, 0.43x
      AVX: OK (0.36s)
        20.3 ns ± 1.7 ns, 0.41x
    512
      Bit: OK (0.35s)
        78.7 ns ± 3.9 ns
      C:   OK (0.36s)
        40.7 ns ± 2.2 ns, 0.52x
      SSE: OK (1.83s)
        26.9 ns ± 320 ps, 0.34x
      AVX: OK (0.21s)
        23.0 ns ± 1.8 ns, 0.29x
    1024
      Bit: OK (0.32s)
        145  ns ±  13 ns
      C:   OK (0.30s)
        68.0 ns ± 3.9 ns, 0.47x
      SSE: OK (0.35s)
        39.6 ns ± 1.8 ns, 0.27x
      AVX: OK (0.51s)
        28.9 ns ± 1.0 ns, 0.20x
    2048
      Bit: OK (0.31s)
        276  ns ±  19 ns
      C:   OK (0.27s)
        120  ns ± 9.0 ns, 0.44x
      SSE: OK (0.29s)
        64.4 ns ± 5.5 ns, 0.23x
      AVX: OK (0.38s)
        43.2 ns ± 3.6 ns, 0.16x
    4096
      Bit: OK (0.29s)
        519  ns ±  25 ns
      C:   OK (0.26s)
        234  ns ±  20 ns, 0.45x
      SSE: OK (0.26s)
        116  ns ± 8.4 ns, 0.22x
      AVX: OK (0.33s)
        73.3 ns ± 5.0 ns, 0.14x
    8192
      Bit: OK (0.57s)
        1.04 μs ±  41 ns
      C:   OK (0.25s)
        447  ns ±  27 ns, 0.43x
      SSE: OK (0.47s)
        214  ns ±  18 ns, 0.20x
      AVX: OK (0.55s)
        126  ns ± 6.1 ns, 0.12x
    16384
      Bit: OK (0.29s)
        2.06 μs ± 143 ns
      C:   OK (0.48s)
        873  ns ±  35 ns, 0.42x
      SSE: OK (0.46s)
        414  ns ±  16 ns, 0.20x
      AVX: OK (0.27s)
        242  ns ±  17 ns, 0.12x

@Bodigrim
Owner

I realized that none of the emulated jobs test the new C implementations, because UseSIMD remains undefined.

run: |
  ghc --version
  echo "#define BOUNDS_CHECK(f) (\_ _ _ -> id)" > src/vector.h
  echo "#define UNSAFE_CHECK(f) (\_ _ _ -> id)" >> src/vector.h
  ghc --make -Isrc:test -isrc:test -o Tests test/Main.hs +RTS -s
  ./Tests +RTS -s

Could you possibly fix this please?

@konsumlamm
Contributor Author

> I realized that none of the emulated jobs test the new C implementations, because UseSIMD remains undefined.
>
> run: |
>   ghc --version
>   echo "#define BOUNDS_CHECK(f) (\_ _ _ -> id)" > src/vector.h
>   echo "#define UNSAFE_CHECK(f) (\_ _ _ -> id)" >> src/vector.h
>   ghc --make -Isrc:test -isrc:test -o Tests test/Main.hs +RTS -s
>   ./Tests +RTS -s
>
> Could you possibly fix this please?

How would I do that? Would it suffice to add a line

ghc --make -Isrc:test -isrc:test -DUseSIMD -o Tests test/Main.hs +RTS -s

?

@Bodigrim
Owner

It requires some trial and error, I'm afraid. You definitely need -DUseSIMD, but you will probably also have to point ghc at cbits somehow.

@konsumlamm
Contributor Author

Is there no way to install GHCup or at least cabal on these architectures? That would make things a lot easier.

@Bodigrim
Owner

Emulation is terribly expensive: a single process with the Haskell RTS eats 7-8 GB of RAM, so Cabal running GHC almost certainly fails with OOM. I experimented with it a lot some time ago, and running bare-bones GHC proved to be the most reliable option. Unfortunately.

@Bodigrim
Owner

You can do most of the experiments locally or on a virtual machine with Ubuntu. It should not be too bad; e.g., text uses cbits in this configuration: https://github.com/haskell/text/blob/3f26df887b9dfd162c1ac9d315ea642e8c247d35/.github/workflows/emulated.yml#L37

@Bodigrim Bodigrim merged commit f3a40b2 into Bodigrim:master Apr 11, 2023
@Bodigrim
Owner

Great stuff!

@konsumlamm konsumlamm deleted the reverse-bits branch April 12, 2023 00:30
@konsumlamm konsumlamm mentioned this pull request May 29, 2023
8 tasks