I want to optimize a computationally intensive loop using SIMD instructions on an Intel Core i7 `12700K` processor with 32 GB of DDR4 `3200` memory. The goal is to boost the performance of a double-precision floating-point vector addition inside a larger scientific computation.
section .data
data_array: dq 1.0, 2.0, 3.0, 4.0, ..., 1000000.0 ; Array of 1 million double-precision values
section .text
global my_function
my_function:
mov rcx, 1000000 / 4 ; Loop counter: 4 doubles (32 bytes, i.e. two 128-bit chunks) per iteration
mov rsi, data_array
loop_start:
movups xmm0, [rsi] ; Load doubles 0-1 of the current 4-double block
movups xmm1, [rsi + 16] ; Load doubles 2-3 of the block
addpd xmm0, xmm1 ; Packed double-precision add (addps is the single-precision form)
movups [rsi], xmm0 ; Store the sums back in place
add rsi, 32 ; Advance to the next 32-byte block
dec rcx
jnz loop_start
ret
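Since the `12700K` supports AVX2, here is a 256-bit variant I have sketched out but not benchmarked yet. It keeps the same in-place, block-wise structure and assumes the element count stays a multiple of 8; the my_function_avx and avx_loop labels are just placeholders.
my_function_avx:
mov rcx, 1000000 / 8 ; 8 doubles (64 bytes) per iteration
mov rsi, data_array
avx_loop:
vmovupd ymm0, [rsi] ; Load 4 doubles
vmovupd ymm1, [rsi + 32] ; Load the next 4 doubles
vaddpd ymm0, ymm0, ymm1 ; 256-bit packed double-precision add
vmovupd [rsi], ymm0 ; Store the result in place
add rsi, 64 ; Advance to the next 64-byte block
dec rcx
jnz avx_loop
vzeroupper ; Avoid SSE/AVX transition penalties before returning
ret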
I’ve assembled this and built the surrounding program with GCC using the -O3 optimization flag. While there is some performance improvement compared to the scalar version, it’s significantly less than expected: I’ve measured a speedup of only about 1.5x on the `12700K`.
So I’m looking for suggestions on how to further optimize this code for maximum performance. Are there any specific SIMD instructions or techniques that could be beneficial? I’m also thinking of exploring memory optimization strategies such as prefetching and possibly cache blocking.
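For the prefetching idea, this is roughly what I have in mind: an untested sketch on top of the AVX2 loop above, where the 512-byte prefetch distance is a placeholder I would still need to tune.
prefetch_loop:
prefetcht0 [rsi + 512] ; Hint the hardware to pull data several iterations ahead into cache
vmovupd ymm0, [rsi]
vmovupd ymm1, [rsi + 32]
vaddpd ymm0, ymm0, ymm1
vmovupd [rsi], ymm0
add rsi, 64
dec rcx
jnz prefetch_loop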