Eigen 3.3 Conjugate Gradient is slower when multi-threaded with GCC compiler optimization

I have been using the ConjugateGradient solver in Eigen 3.2 and decided to upgrade to Eigen 3.3.3 to benefit from the new multi-threading features.

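Sadly, the solver seems slower (~10%) when I enable -fopenmp with GCC 4.8.4. Looking at xosview, I see that all 8 CPUs are being used, yet performance is slower.

After some testing, I found that if compiler optimization is disabled (-O0 instead of -O3), then -fopenmp does speed up the solver by ~50%.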

Of course, it is not worth disabling optimization just to benefit from multi-threading, since that would be even slower overall.

Following the advice in https://stackoverflow.com/a/42135567/7974125, I am storing the full sparse matrix and passing Lower|Upper as the UpLo parameter.

I have tried all 3 preconditioners and also tried using RowMajor matrices, to no avail.
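A rough sketch of why full storage in row-major order lends itself to multi-threading: in a row-by-row sparse matrix-vector product, each row writes only its own output entry, so rows can be divided among threads without synchronization. The `CsrMatrix` struct and `multiply` function below are hypothetical names for illustration and are not Eigen's implementation. With only one triangle stored, each stored entry would contribute to two output entries, and threads could end up writing to the same element.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal CSR (row-major) container, for illustration only;
// this is not Eigen's SparseMatrix.
struct CsrMatrix {
    std::size_t rows = 0;
    std::vector<std::size_t> rowStart; // size rows + 1
    std::vector<std::size_t> col;      // column index of each stored value
    std::vector<double> val;           // stored values, row by row
};

// y = A * x. Each row writes only y[i], so rows can be split across threads
// without synchronization. The pragma is ignored when OpenMP is disabled.
std::vector<double> multiply(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.rows, 0.0);
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(A.rows); ++i) {
        double sum = 0.0;
        for (std::size_t k = A.rowStart[i]; k < A.rowStart[i + 1]; ++k)
            sum += A.val[k] * x[A.col[k]];
        y[i] = sum;
    }
    return y;
}
```

Compiled without -fopenmp the loop simply runs serially, which makes it easy to compare both builds on the same input.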

Is there anything else I can try to get the full benefits of both multi-threading and compiler optimization?

I cannot post my actual code, but here is a quick test using the Laplacian example from Eigen's documentation (except for some changes to use ConjugateGradient instead of SimplicialCholesky). (Both of these solvers work with SPD matrices.)

#include <Eigen/Sparse> 
#include <bench/BenchTimer.h> 
#include <iostream> 
#include <vector> 

using namespace Eigen; 
using namespace std; 

// Use RowMajor to make use of multi-threading 
typedef SparseMatrix<double, RowMajor> SpMat; 
typedef Triplet<double> T; 

// Assemble sparse matrix from 
// https://eigen.tuxfamily.org/dox/TutorialSparse_example_details.html 
void insertCoefficient(int id, int i, int j, double w, vector<T>& coeffs, 
         VectorXd& b, const VectorXd& boundary) 
{ 
    int n = int(boundary.size()); 
    int id1 = i+j*n; 
    if(i==-1 || i==n) b(id) -= w * boundary(j); // constrained coefficient
    else if(j==-1 || j==n) b(id) -= w * boundary(i); // constrained coefficient 
    else coeffs.push_back(T(id,id1,w));    // unknown coefficient 
} 

void buildProblem(vector<T>& coefficients, VectorXd& b, int n) 
{ 
    b.setZero(); 
    ArrayXd boundary = ArrayXd::LinSpaced(n, 0,M_PI).sin().pow(2); 
    for(int j=0; j<n; ++j)
    {
        for(int i=0; i<n; ++i)
        {
            int id = i+j*n;
            insertCoefficient(id, i-1,j, -1, coefficients, b, boundary);
            insertCoefficient(id, i+1,j, -1, coefficients, b, boundary);
            insertCoefficient(id, i,j-1, -1, coefficients, b, boundary);
            insertCoefficient(id, i,j+1, -1, coefficients, b, boundary);
            insertCoefficient(id, i,j, 4, coefficients, b, boundary);
        }
    }
} 

int main() 
{ 
    int n = 300; // size of the image 
    int m = n*n; // number of unknowns (=number of pixels) 
    // Assembly: 
    vector<T> coefficients;   // list of non-zeros coefficients 
    VectorXd b(m);     // the right hand side-vector resulting from the constraints 
    buildProblem(coefficients, b, n); 
    SpMat A(m,m); 
    A.setFromTriplets(coefficients.begin(), coefficients.end()); 
    // Solving: 
    // Use ConjugateGradient with Lower|Upper as the UpLo template parameter to make use of multi-threading 
    BenchTimer t; 
    t.reset(); t.start(); 
    ConjugateGradient<SpMat, Lower|Upper> solver(A); 
    VectorXd x = solver.solve(b);   // iteratively solve for the given right hand side
    t.stop(); 
    cout << "Real time: " << t.value(1) << endl; // 0=CPU_TIMER, 1=REAL_TIMER 
    return 0; 
}

Resulting output:

// No optimization, without OpenMP 
g++ cg.cpp -O0 -I./eigen -o cg 
./cg 
Real time: 23.9473 

// No optimization, with OpenMP 
g++ cg.cpp -O0 -I./eigen -fopenmp -o cg 
./cg 
Real time: 17.6621 

// -O3 optimization, without OpenMP 
g++ cg.cpp -O3 -I./eigen -o cg 
./cg 
Real time: 0.924272 

// -O3 optimization, with OpenMP 
g++ cg.cpp -O3 -I./eigen -fopenmp -o cg 
./cg 
Real time: 1.04809


2017-05-06 Leon

You should try OpenMP with a different number of threads, for example by calling omp_set_num_threads with 4. Memory may be the bottleneck: starting 8 threads that fight over memory access can degrade performance.

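Your problem is too small to expect any benefit from multi-threading. Sparse matrices are expected to be at least one order of magnitude larger. Eigen's code should be adjusted to reduce the number of threads in this case.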

Also, if you have only 4 physical cores, running OMP_NUM_THREADS=4 ./cg might help.


2017-05-07 06:54:12 ggael

If you mean that you have 4 cores with hyper-threading, use OMP_PLACES=cores (although it is not implemented in g++ on Windows) – tim18

To clarify, the example code in my question uses a 90000 * 90000 matrix. 'int n = 300; // size of the image' may have been misleading. The actual number of unknowns is n * n pixels. Is 90000 * 90000 still too small? – Leon
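For a sense of scale, the matrix dimension says little about the work per iteration; the number of stored non-zeros does. The sketch below counts the triplets that the question's buildProblem() would generate, using a hypothetical helper named `countNonZeros`. For n = 300 it gives 5*n*n - 4*n = 448800 non-zeros, so each matrix-vector product is roughly 450 thousand multiply-adds, and (assuming 8-byte values and 4-byte indices) the stored entries take on the order of 5 MB.

```cpp
#include <cstddef>

// Counts the triplets that the question's buildProblem() would push for an
// n x n grid: every unknown gets a diagonal entry, plus one entry per
// neighbour that lies inside the grid (boundary neighbours go into b instead).
std::size_t countNonZeros(int n) {
    std::size_t count = 0;
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            count += 1;               // diagonal coefficient (weight 4)
            if (i - 1 >= 0) ++count;  // left neighbour
            if (i + 1 < n)  ++count;  // right neighbour
            if (j - 1 >= 0) ++count;  // neighbour in the previous row
            if (j + 1 < n)  ++count;  // neighbour in the next row
        }
    }
    return count;
}
```

Calling countNonZeros(300) reproduces the figure above; whether that amount of work is enough to amortize thread start-up and synchronization depends on the machine.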

Given the sequential nature of CG, it is not surprising that a task that runs for only 1 second and has limited parallelism gains nothing from parallelization. – tim18
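The sequential structure mentioned here can be seen in a minimal, unpreconditioned conjugate gradient loop. This is a textbook-style sketch on a small dense matrix with hypothetical helper names (`dot`, `apply`, `conjugateGradient`), not Eigen's implementation. Within one iteration, the step length needs the finished matrix-vector product, the residual update needs the step length, and the next search direction needs the new residual norm, so only the work inside each step can run in parallel.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>; // small dense SPD matrix, for illustration only

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

static Vec apply(const Mat& A, const Vec& x) {
    Vec y(x.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i) y[i] = dot(A[i], x);
    return y;
}

// Unpreconditioned conjugate gradient starting from x = 0.
Vec conjugateGradient(const Mat& A, const Vec& b, int maxIter, double tol) {
    Vec x(b.size(), 0.0);
    Vec r = b;              // residual b - A*x, with x = 0
    Vec p = r;              // search direction
    double rr = dot(r, r);
    for (int k = 0; k < maxIter && std::sqrt(rr) > tol; ++k) {
        Vec Ap = apply(A, p);             // the only matrix-vector product
        double alpha = rr / dot(p, Ap);   // needs the finished Ap
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rrNew = dot(r, r);         // needs the updated r
        double beta = rrNew / rr;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rrNew;
    }
    return x;
}
```

Each dot product acts as a synchronization point in a threaded version, which is one reason short runs on modest problem sizes may see little or no speed-up.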
