반복에서의 성능 (캐시 미스)

변수를 사용하여 std::vector<T>::iterator을 카운트하는 대신에 (i)를 사용하면 반복이 빠릅니다.반복에서의 성능 (캐시 미스)

몇 가지 의견을 보내 주시면 다음과 같은 추가 정보를 제공합니다. (1) Visual Studio C++ 컴파일러를 사용합니다.

std::vector<Data> vec(MAX_DATA); 
stopWatch.start(); 
for (unsigned i = 0U; i < MAX_DATA; ++i) { 
    vec[i].x = 0; 
    vec[i].y = 0; 
} 
stopWatch.stop(); 
stopWatch.printSpanAsMs("The data are stored in memory next to each other");

또는 5723ms : i가 증가되고, 변수는 상기 반복

5875ms 걸리는 경우 (2)가 분리 모드 및 최적화 -O2 :

Image of the console

컴파일 :

std::vector<Data*> vec2; 
for (unsigned i = 0U; i < MAX_DATA; ++i) 
    vec2.push_back(new Data()); 

stopWatch.start(); 
for (unsigned i = 0U; i < MAX_DATA; ++i) { 
    vec2[i]->x = 0; 
    vec2[i]->y = 0; 
} 
stopWatch.stop(); 
stopWatch.printSpanAsMs("The data is in memory at a random position");

std::vector<Data>::Iterator 반복이

29ms를 취할 것입니다, 반복 :

std::vector<Data> vec(MAX_DATA); 

stopWatch.start(); 
for (auto& it : vec) { 
    it.x = 0; 
    it.y = 0; 
} 
stopWatch.stop(); 
stopWatch.printSpanAsMs("The data are stored in memory next to each other");

또는 110ms를 :

std::vector<Data*> vec2; 
for (unsigned i = 0U; i < MAX_DATA; ++i) 
    vec2.push_back(new Data()); 

stopWatch.start(); 
for (auto& it : vec2) { 
    it->x = 0; 
    it->y = 0; 
} 
stopWatch.stop(); 
stopWatch.printSpanAsMs("The data is in memory at a random position");

왜 다른 반복이 훨씬 빠르다?

데이터가 메모리의 서로 다른 위치에있는 변수 i의 반복이 메모리에서 병치 된 변수 i의 반복만큼 빠르다는 것을 알고 있습니다. 데이터가 메모리에서 서로 나란히 있다는 사실은 캐시 누락을 줄여야하며 std::vector<Data>::Iterator의 반복과 함께 작동해야합니다. 다른 하나와 함께 사용하지 않는 이유는 무엇입니까? 아니면 29 ~ 110ms의 거리가 빗나간 캐시 미스가 아닌가?

전체 프로그램은 다음과 같습니다

#include <iostream> 
#include <chrono> 
#include <vector> 
#include <string> 

class StopWatch 
{ 
public: 
    void start() { 
     this->t1 = std::chrono::high_resolution_clock::now(); 
    } 

    void stop() { 
     this->t2 = std::chrono::high_resolution_clock::now(); 
     this->diff = t2 - t1; 
    } 

    void printSpanAsMs(std::string startText = "time span") { 
     long diffAsMs = std::chrono::duration_cast<std::chrono::milliseconds> 
     (diff).count(); 
     std::cout << startText << ": " << diffAsMs << "ms" << std::endl; 
    } 
private: 
    std::chrono::high_resolution_clock::time_point t1, t2; 
    std::chrono::high_resolution_clock::duration diff; 
} stopWatch; 

struct Data { 
    int x, y; 
}; 

const unsigned long MAX_DATA = 20000000; 

void test1() 
{ 
    std::cout << "1. Test \n Use i to iterate through the vector" << 
    std::endl; 

    std::vector<Data> vec(MAX_DATA); 
    stopWatch.start(); 
    for (unsigned i = 0U; i < MAX_DATA; ++i) { 
     vec[i].x = 0; 
     vec[i].y = 0; 
    } 
    stopWatch.stop(); 
    stopWatch.printSpanAsMs("The data are stored in memory next to each 
    other"); 

    ////////////////////////////////////////////////// 

    std::vector<Data*> vec2; 
    for (unsigned i = 0U; i < MAX_DATA; ++i) 
     vec2.push_back(new Data()); 

    stopWatch.start(); 
    for (unsigned i = 0U; i < MAX_DATA; ++i) { 
     vec2[i]->x = 0; 
     vec2[i]->y = 0; 
    } 
    stopWatch.stop(); 
    stopWatch.printSpanAsMs("The data is in memory at a random position"); 

    for (unsigned i = 0U; i < MAX_DATA; ++i) { 
     delete vec2[i]; 
     vec2[i] = nullptr; 
    } 
} 

void test2() 
{ 
    std::cout << "2. Test \n Use std::vector<T>::iteraror to iterate through 
    the vector" << std::endl; 

    std::vector<Data> vec(MAX_DATA); 

    stopWatch.start(); 
    for (auto& it : vec) { 
     it.x = 0; 
     it.y = 0; 
    } 
    stopWatch.stop(); 
    stopWatch.printSpanAsMs("The data are stored in memory next to each 
    other"); 

    ////////////////////////////////////////////////// 

    std::vector<Data*> vec2; 
    for (unsigned i = 0U; i < MAX_DATA; ++i) 
     vec2.push_back(new Data()); 

    stopWatch.start(); 
    for (auto& it : vec2) { 
     it->x = 0; 
     it->y = 0; 
    } 
    stopWatch.stop(); 
    stopWatch.printSpanAsMs("The data is in memory at a random position"); 

    for (auto& it : vec2) { 
     delete it; 
     it = nullptr; 
    } 
} 

int main() 
{ 
    test1(); 
    test2(); 

    system("PAUSE"); 
    return 0; 
}

출처

2017-10-30 SOUser

어떤 컴파일러를 사용하고 있습니까? 어떤 깃발을 사용하고 있습니까? –

저는 Visual Studio와 std :: chrono :: high_resolution_clock을 사용합니다. – SOUser

어떤 최적화 수준입니까? '-O2'를 적어도 시도하십시오 – user463035818

가 왜 그렇게 훨씬 더 빨리 다른 반복인가?

그 이유는 MSVC 2017이 제대로 최적화 할 수 없기 때문입니다.

for (unsigned i = 0U; i < MAX_DATA; ++i) { 
    vec[i].x = 0; 
    vec[i].y = 0; 
}

생성 된 코드 (live demo) :

 xor  r9d, r9d 
     mov  eax, r9d 
[email protected]: 
     mov  rdx, QWORD PTR [rcx] 
     lea  rax, QWORD PTR [rax+16] 
     mov  DWORD PTR [rax+rdx-16], r9d 
     mov  rdx, QWORD PTR [rcx] 
     mov  DWORD PTR [rax+rdx-12], r9d 
     mov  rdx, QWORD PTR [rcx] 
     mov  DWORD PTR [rax+rdx-8], r9d 
     mov  rdx, QWORD PTR [rcx] 
     mov  DWORD PTR [rax+rdx-4], r9d 
     sub  r8, 1 
     jne  SHORT [email protected]

size_t i으로 unsigned i 교체하거나 참조하지 않는로 인덱싱 액세스 리프팅 제 경우

는 완전히 루프를 최적화하지 도움 (demo).

이미 발견 한 것처럼 반복자를 사용하는 데 도움이 유일한 :

for (auto& it : vec) { 
    it.x = 0; 
    it.y = 0; 
}

생성 된 코드 (live demo) :

 xor  ecx, ecx 
     npad  2 
[email protected]: 
     mov  QWORD PTR [rax], rcx 
     add  rax, 8 
     cmp  rax, rdx 
     jne  SHORT [email protected]

그 소리는 단지 두 경우 모두 memset 호출합니다.

이야기의 도덕 : 성능에 신경 쓰면 의 코드는으로 보입니다. 공급 업체에 문제를보고합니다.

출처

2017-10-30 15:50:05 rustyx

답변 해 주셔서 감사합니다. 불행히도 어셈블러를 만들 수는 없지만 코드는 훨씬 작아 보입니다. "부호없는 i를 size_t로 대체하면 코드가 더 좋아 보인다"그래, 더 좋아 보이지만 성능 측면에서는 아무 것도 바뀌지 않습니다. – SOUser

결론은 MSVC가 반복자를 사용하지 않을 경우 비효율적 인 코드를 생성합니다. 당신은 그것을 "고칠 수 없다", 마이크로 소프트 만이 (아마 다음 버전에서) 할 수있다. – rustyx

답변

관련 문제