Incorrect dot product with PyCUDA and pinned memory

I am currently working on a dot product with pinned memory using PyCUDA, and I have a problem with large arrays. I am working with:

NVIDIA GTX 1060
CUDA 9.1
PyCUDA 2017.1.1

The code is as follows:

#!/usr/bin/env python 

import numpy as np 
import argparse 
import math 
import pycuda.autoinit 
import pycuda.driver as drv 
from pycuda.compiler import SourceModule 

from time import time 

dot_mod = SourceModule(""" 
__global__ void full_dot(double* v1, double* v2, double* out, int N) { 
    __shared__ double cache[ 1024 ]; 
    int i = blockIdx.x * blockDim.x + threadIdx.x; 
    cache[ threadIdx.x ] = 0.f; 
    while(i < N) { 
     cache[ threadIdx.x ] += v1[ i ] * v2[ i ]; 
     i += gridDim.x * blockDim.x; 
    } 
    __syncthreads(); // required because later on the current thread is accessing 
        // data written by another thread  
    i = 1024/2; 
    while(i > 0) { 
     if(threadIdx.x < i) cache[ threadIdx.x ] += cache[ threadIdx.x + i ]; 
     __syncthreads(); 
     i /= 2; //not sure bitwise operations are actually faster 
    } 

#ifndef NO_SYNC // serialized access to shared data; 
    if(threadIdx.x == 0) atomicAdd(out, cache[ 0 ]); 
#else // no sync, what most likely happens is: 
     // 1) all threads read 0 
     // 2) all threads write concurrently 16 (local block dot product) 
    if(threadIdx.x == 0) *out += cache[ 0 ]; 
#endif     

} 
""") 


def main(args): 
    dot = dot_mod.get_function("full_dot") 
    N = args.number 
    BLOCK_SIZE = 1024 
    BLOCKS = int(math.ceil(N/BLOCK_SIZE)) 
    THREADS_PER_BLOCK = BLOCK_SIZE 

    # Time use of pinned host memory: 
    x = drv.aligned_empty((N), dtype=np.float64, order='C') 
    x = drv.register_host_memory(x, flags=drv.mem_host_register_flags.DEVICEMAP) 
    x_gpu_ptr = np.intp(x.base.get_device_pointer()) 

    # Time use of pinned host memory: 
    y = drv.aligned_empty((N), dtype=np.float64, order='C') 
    y = drv.register_host_memory(y, flags=drv.mem_host_register_flags.DEVICEMAP) 
    y_gpu_ptr = np.intp(y.base.get_device_pointer()) 

    # Time use of pinned host memory: 
    z = drv.aligned_empty((1), dtype=np.float64, order='C') 
    z = drv.register_host_memory(z, flags=drv.mem_host_register_flags.DEVICEMAP) 
    z_gpu_ptr = np.intp(z.base.get_device_pointer()) 

    z[:] = np.zeros(1) 
    x[:] = np.zeros(N) 
    y[:] = np.zeros(N) 

    x[:] = np.random.rand(N) 
    y[:] = x[:] 
    x_orig = x.copy() 
    y_orig = y.copy() 

    start = time() 
    dot(x_gpu_ptr, y_gpu_ptr, z_gpu_ptr, np.uint32(N), block=(THREADS_PER_BLOCK, 1, 1), grid=(BLOCKS,1)) 
    times = time()-start 
    print "Average kernel GPU dot product execution time with pinned memory: %3.7f" % np.mean(times) 

    start = time() 
    ydot=np.dot(x_orig,y_orig) 
    times = time()-start 
    print "Average numpy dot product execution time: %3.7f" % np.mean(times) 

    print N,ydot,z[0] 

if __name__ == "__main__": 

    parser = argparse.ArgumentParser(description=' ') 
    parser.add_argument('-n', dest='number', type=long, help="Number of samples ", required=True) 

    args = parser.parse_args() 

    main(args)

With this code the result is correct for sample arrays smaller than approximately 1024*12 elements, but larger array sizes, such as 1024*1024, give a wrong result.

➜ ./test_dot_pinned.py -n 16384 
Average kernel GPU dot product execution time with pinned memory: 0.0001669 
Average numpy dot product execution time: 0.0000119 
16384 5468.09590706 5468.09590706 
SIZE np.dot() GPU-dot-pinned 

➜ ./test_dot_pinned.py -n 1048576 
Average kernel GPU dot product execution time with pinned memory: 0.0002351 
Average numpy dot product execution time: 0.0010922 
1048576 349324.532564 258321.148593 
SIZE np.dot() GPU-dot-pinned

Thanks to everyone. I hope someone can help me.


2017-12-18 acancio

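PyCUDA does not enforce any synchronization after a kernel launch. Ordinarily, a device-to-host copy of the data after the kernel launch forces synchronization; that is, it forces the kernel to complete.

But your code has no such synchronization. Because you are using mapped pinned memory, the kernel writes its result directly into the host array z and no copy takes place. As the kernel execution time grows (a consequence of the larger work size), the kernel has eventually not finished by the time you print z[0], so you get only a partial result.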
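The race described above can be reproduced without a GPU. The following is only a CPU analogy written for Python 3, not PyCUDA code: `launch_async` is a hypothetical stand-in for an asynchronous kernel launch, the list `out` plays the role of the mapped buffer `z`, and the two `threading.Event` objects exist solely to make the early read deterministic for the demonstration.

```python
import threading


def launch_async(chunks, out, halfway, resume):
    """Hypothetical stand-in for an asynchronous kernel launch.

    Returns immediately. A background thread adds one partial sum per
    chunk into out[0], much as each block does with atomicAdd.
    """
    def worker():
        for k, chunk in enumerate(chunks):
            if k == len(chunks) // 2:
                halfway.set()   # signal that half of the "blocks" are done
                resume.wait()   # stall to mimic a long-running kernel
            out[0] += sum(chunk)

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


data = list(range(1, 9))                                   # 1..8, full sum is 36
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]   # four "blocks"
out = [0]                                                  # stands in for z
halfway, resume = threading.Event(), threading.Event()

handle = launch_async(chunks, out, halfway, resume)
halfway.wait()
early = out[0]      # read while the "kernel" is still running: partial result
resume.set()
handle.join()       # plays the role of drv.Context.synchronize()
final = out[0]      # read after completion: full result

print(early, final)  # prints "10 36"
```

Reading `out[0]` before `join()` sees only the chunks processed so far, which is the same effect as printing z[0] before the kernel has finished.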

A side effect of the missing synchronization is that your kernel timing measurement is not accurate. You can fix both problems by forcing the kernel to complete before you finish your timing measurement:

dot(x_gpu_ptr, y_gpu_ptr, z_gpu_ptr, np.uint32(N), block=(THREADS_PER_BLOCK, 1, 1), grid=(BLOCKS,1))
# add the next line of code:
drv.Context.synchronize()
times = time()-start

2017-12-18 21:26:52

Wow, thanks a lot for the quick answer, it was very helpful. Regards. – acancio
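The fix from the answer can be wrapped in a small helper so that the synchronization is not forgotten when timing. This is a minimal sketch for Python 3; `timed_launch` and its parameters are hypothetical names introduced here, and the helper takes the launch and the synchronization as plain callables so that it can be tried without a GPU.

```python
import time


def timed_launch(launch, synchronize, clock=time.perf_counter):
    """Time an asynchronous launch, including the wait for completion.

    launch      -- callable that enqueues the work and returns at once
    synchronize -- callable that blocks until the enqueued work is done
    clock       -- callable returning the current time in seconds
    """
    start = clock()
    launch()
    synchronize()   # without this call only the launch overhead is timed
    return clock() - start


# CPU-only check with a fake clock, so the result does not depend on timing.
calls = []
elapsed = timed_launch(lambda: calls.append("launch"),
                       lambda: calls.append("sync"),
                       clock=iter([1.0, 3.5]).__next__)
print(calls, elapsed)  # prints "['launch', 'sync'] 2.5"

# With the names from the question, the call might look like this:
# elapsed = timed_launch(
#     lambda: dot(x_gpu_ptr, y_gpu_ptr, z_gpu_ptr, np.uint32(N),
#                 block=(THREADS_PER_BLOCK, 1, 1), grid=(BLOCKS, 1)),
#     drv.Context.synchronize)
```

Because the helper returns only after `synchronize` has run, z[0] can be read safely once it returns.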
