Timing of array multiplication vs. SSE intrinsics multiplication?

I wrote the code below to test my understanding of SSE intrinsics. The code compiles and runs correctly, but the improvement from SSE is not very significant: using SSE intrinsics is only about 20% faster. Shouldn't it be roughly 4 times faster, or a 400% improvement? Is the compiler optimizing the scalar loop? If so, how can I disable that? Is there something wrong with the sse_mult() function I wrote?

#include <stdio.h> 
#include <stdlib.h> 
#include <time.h> 
#include <emmintrin.h> 
// gcc options -mfpmath=sse -mmmx -msse -msse2 \ Not sure if any are needed have been using -msse2 

/*-------------------------------------------------------------------------------------------------- 
* SIMD intrinsics header files 
* 
* <mmintrin.h> MMX 
* 
* <xmmintrin.h> SSE 
* 
* <emmintrin.h> SSE2 
* 
* <pmmintrin.h> SSE3 
* 
* <tmmintrin.h> SSSE3 
* 
* <smmintrin.h> SSE4.1 
* 
* <nmmintrin.h> SSE4.2 
* 
* <ammintrin.h> SSE4A 
* 
* <wmmintrin.h> AES 
* 
* <immintrin.h> AVX 
*------------------------------------------------------------------------------------------------*/ 

#define n 1000000 

// Global variables 
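__declspec(align(16)) float a[n]; // array to hold random numbers (16-byte aligned for _mm_load_ps) 
__declspec(align(16)) float b[n]; // array to hold random numbers (16-byte aligned for _mm_load_ps) 
float c[n]; // array to hold product a*b for scalar multiply 
__declspec(align(16)) float d[n]; // array to hold product a*b for sse multiply (aligned for _mm_store_ps) 
// __declspec(align(16)) is MSVC syntax; with GCC, write float d[n] __attribute__((aligned(16))); instead 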

// Multiply using loop 
void loop_mult() { 
    int i; // Loop index 

    clock_t begin_loop, end_loop; // clock_t is type returned by clock() 
    double time_spent_loop; 

    // Time multiply operation 
    begin_loop = clock(); 
     // Multiply two arrays of floats 
     for(i = 0; i < n; i++) { 
      c[i] = a[i] * b[i]; 
     } 
    end_loop = clock(); 

    // Calculate time it took to run loop. CLOCKS_PER_SEC is the number of clock ticks per second. 
    time_spent_loop = (double)(end_loop - begin_loop)/CLOCKS_PER_SEC; 
    printf("Time for scalar loop was %f seconds\n", time_spent_loop); 
} 

// Multiply using sse 
void sse_mult() { 
    int k,i; // Index 
    clock_t begin_sse, end_sse; // clock_t is type returned by clock() 
    double time_spent_sse; 

    // Time multiply operation 
    begin_sse = clock();  
     // Multiply two arrays of floats 
     __m128 x,y,result; // __m128 is a data type, can hold 4 32 bit floating point values 
     result = _mm_setzero_ps(); // set register to hold all zeros 
     for(k = 0; k <= (n-4); k += 4) { 
      x = _mm_load_ps(&a[k]); // Load chunk of 4 floats into register 
      y = _mm_load_ps(&b[k]); 
      result = _mm_mul_ps(x,y); // multiply 4 floats 
      _mm_store_ps(&d[k],result); // store result in array 
     } 
     int extra = n%4; // If array size isn't exactly a multiple of 4 use scalar ops for remainder 
     if(extra!=0) { 
      for(i = (n-extra); i < n; i++) { 
       d[i] = a[i] * b[i]; 
      } 
     } 
    end_sse = clock(); 

    // Calculate time it took to run loop. CLOCKS_PER_SEC is the number of clock ticks per second. 
    time_spent_sse = (double)(end_sse - begin_sse)/CLOCKS_PER_SEC; 
    printf("Time for sse was %f seconds\n", time_spent_sse); 
} 

int main() { 
    int i; // Loop index 

    srand((unsigned)time(NULL)); // initial value that rand uses, called the seed 
     // unsigned guarantees non-negative values 
     // time(NULL) uses the system clock as the seed so values will be different each time 

    for(i = 0; i < n; i++) { 
     // Fill arrays with random numbers 
     a[i] = ((float)rand()/RAND_MAX)*10; // rand() returns an integer value between 0 and RAND_MAX 
     b[i] = ((float)rand()/RAND_MAX)*20; 
    } 

    loop_mult(); 
    sse_mult(); 
    for(i=0; i<n; i++) { 
     // printf("a[%d] = %f\n", i, a[i]); // print values to check 
     // printf("b[%d] = %f\n", i, b[i]); 
     // printf("c[%d] = %f\n", i, c[i]); 
     // printf("d[%d] = %f\n", i, d[i]); 
     if(c[i]!=d[i]) { 
      printf("Error with sse multiply.\n"); 
      break; 
     } 
    } 


    return 0; 
}

Source

2014-10-21 biononic

Set 'n' to 2048 instead of 1000000. Turn on loop unrolling with '-funroll-loops'. Repeat the loop many times, then divide the time by the number of repetitions. –

See http://stackoverflow.com/questions/25774190/l1-memory-bandwidth-50-drop-in-fficiency-using-addresses-which-differ-by-4096 for an example that does 'z[i] = x[i] + y[i]'. –

I don't see '-O2' or any other optimization flag in the question? –

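Your program is memory-bound. SSE does not make a big difference because most of the time is spent reading those large arrays from RAM. Reduce the size of the arrays so that they fit in the cache, and increase the number of passes. Once all the data is already in the cache, the SSE version should run noticeably faster.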
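To put rough numbers on the memory-bound claim, here is a minimal sketch that computes how many bytes one pass of c[i] = a[i] * b[i] touches. The helper names and the 32 KiB cache size used in the note below are assumptions for illustration, not values measured on the asker's machine.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes touched by one pass of c[i] = a[i] * b[i]:
   two input arrays are read and one output array is written.
   (Hypothetical helper, not part of the question's code.) */
static size_t working_set_bytes(size_t n_elems)
{
    /* Guard against overflow of the multiplication below. */
    assert(n_elems <= (size_t)-1 / (3 * sizeof(float)));
    return 3 * n_elems * sizeof(float);
}

/* Returns 1 if the three arrays fit in a cache of cache_bytes bytes.
   cache_bytes is supplied by the caller; it is an assumption, not detected. */
static int fits_in_cache(size_t n_elems, size_t cache_bytes)
{
    return working_set_bytes(n_elems) <= cache_bytes;
}
```

With 4-byte floats, n = 1000000 gives a working set of about 12 MB, while n = 2048 gives 24 KiB, which would fit in an assumed 32 KiB L1 data cache.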

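Keep in mind that additional factors may be involved:

GCC can (to some extent) auto-vectorize loops. (I think that requires -O3.)
The first method you test will be slower because the cache is not filled yet. You may want to run both methods several times, alternating between them.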
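The suggestions above (small arrays, many passes, alternating the two methods) could be sketched roughly as follows. This is not the asker's code: the names SMALL_N, mul_scalar, mul_sse, seconds_per_pass and run_alternating are made up for illustration, the alignment attribute and the empty asm barrier assume GCC or Clang, and the array size 2048 is simply the value suggested in the comments.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <emmintrin.h>

#define SMALL_N 2048 /* value suggested in the comments; assumed to fit in cache */

/* GCC/Clang alignment syntax (assumption: not compiled with MSVC). */
static float sa[SMALL_N] __attribute__((aligned(16)));
static float sb[SMALL_N] __attribute__((aligned(16)));
static float sc[SMALL_N] __attribute__((aligned(16)));
static float sd[SMALL_N] __attribute__((aligned(16)));

typedef void (*mul_fn)(const float *, const float *, float *, int);

/* Plain element-wise multiply. */
static void mul_scalar(const float *x, const float *y, float *out, int len)
{
    for (int i = 0; i < len; i++)
        out[i] = x[i] * y[i];
}

/* Element-wise multiply, four floats at a time, scalar loop for the tail. */
static void mul_sse(const float *x, const float *y, float *out, int len)
{
    /* _mm_load_ps and _mm_store_ps require 16-byte aligned addresses. */
    assert((uintptr_t)x % 16 == 0 && (uintptr_t)y % 16 == 0 &&
           (uintptr_t)out % 16 == 0);
    int k = 0;
    for (; k + 4 <= len; k += 4) {
        __m128 vx = _mm_load_ps(&x[k]);
        __m128 vy = _mm_load_ps(&y[k]);
        _mm_store_ps(&out[k], _mm_mul_ps(vx, vy));
    }
    for (; k < len; k++)
        out[k] = x[k] * y[k];
}

/* Average CPU seconds per pass over `passes` repetitions of one kernel. */
static double seconds_per_pass(mul_fn fn, const float *x, const float *y,
                               float *out, int len, int passes)
{
    clock_t begin = clock();
    for (int p = 0; p < passes; p++) {
        fn(x, y, out, len);
        /* Compiler barrier (GCC/Clang extension) so that the repeated
           passes are not optimized away as redundant. */
        __asm__ volatile("" : : "r"(out) : "memory");
    }
    clock_t end = clock();
    return (double)(end - begin) / CLOCKS_PER_SEC / passes;
}

/* Alternate the two kernels so neither one always pays for the cold cache. */
static void run_alternating(int rounds, int passes)
{
    for (int r = 0; r < rounds; r++) {
        double t_scalar = seconds_per_pass(mul_scalar, sa, sb, sc, SMALL_N, passes);
        double t_sse = seconds_per_pass(mul_sse, sa, sb, sd, SMALL_N, passes);
        printf("round %d: scalar %.3e s/pass, sse %.3e s/pass\n",
               r, t_scalar, t_sse);
    }
}
```

After filling sa and sb, calling something like run_alternating(5, 100000) from main prints one line per round. If the scalar loop is being auto-vectorized at the chosen optimization level, the two timings may come out close; with GCC, -fno-tree-vectorize turns that off for comparison.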

Source

2014-10-21 20:41:35

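One more thing I would suggest is turning on loop unrolling with '-funroll-loops'. –

What do you mean by increasing the number of passes? Would it be better to write one program for the scalar loop and another for sse? – biononic

@biononic Instead of processing large arrays, process small ones, but repeat many times. This is just one way to reduce the impact of cache misses. Two separate programs would help isolate the two cases from each other, but that only helps with point #2. –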
