TensorFlow's SVD is slower than NumPy's

I am observing that TensorFlow's SVD runs much slower than NumPy's on my machine. I have a GTX 1080 GPU and expected the SVD to be at least as fast as running the code on the CPU (NumPy).

Environment info

Operating system

lsb_release -a 
No LSB modules are available. 
Distributor ID: Ubuntu 
Description: Ubuntu 16.10 
Release: 16.10 
Codename: yakkety

Installed versions of CUDA and cuDNN:

ls -l /usr/local/cuda-8.0/lib64/libcud* 
-rw-r--r-- 1 root  root 556000 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudadevrt.a 
lrwxrwxrwx 1 root  root  16 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart.so -> libcudart.so.8.0 
lrwxrwxrwx 1 root  root  19 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart.so.8.0 -> libcudart.so.8.0.61 
-rwxr-xr-x 1 root  root 415432 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart.so.8.0.61 
-rw-r--r-- 1 root  root 775162 Feb 22 2017 /usr/local/cuda-8.0/lib64/libcudart_static.a 
lrwxrwxrwx 1 voldemaro users  13 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn.so -> libcudnn.so.5 
lrwxrwxrwx 1 voldemaro users  18 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn.so.5 -> libcudnn.so.5.1.10 
-rwxr-xr-x 1 voldemaro users 84163560 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn.so.5.1.10 
-rw-r--r-- 1 voldemaro users 70364814 Nov 6 2016 /usr/local/cuda-8.0/lib64/libcudnn_static.a

TensorFlow setup

python -c "import tensorflow; print(tensorflow.__version__)" 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally 
1.0.0

Code:

'''
Created on Sep 21, 2017

@author: voldemaro
'''
import time

import numpy as np
import numpy.linalg as NLA
import tensorflow as tf

N = 1534

# Random N x N matrix, cast to complex (complex128 in NumPy)
svd_array = np.random.random_sample((N, N))
svd_array = svd_array.astype(complex)

# TensorFlow copy of the same matrix, stored as complex64
specVar = tf.Variable(svd_array, dtype=tf.complex64)

# tf.svd returns the singular values first, then the two factor matrices
[D2, E1, E2] = tf.svd(specVar)

init_OP = tf.global_variables_initializer()

with tf.Session() as sess:
    # Initialize all tensorflow variables
    start = time.time()
    sess.run(init_OP)
    print('initializing variables: {} s'.format(time.time() - start))

    start_time = time.time()
    [d, e1, e2] = sess.run([D2, E1, E2])
    print("Tensorflow SVD ---: {} s".format(time.time() - start_time))


# Equivalent numpy
start = time.time()

u, s, v = NLA.svd(svd_array)
print('numpy SVD ---: {} s'.format(time.time() - start))

Output trace:

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX 1080 
major: 6 minor: 1 memoryClockRate (GHz) 1.7335 
pciBusID 0000:01:00.0 
Total memory: 7.92GiB 
Free memory: 7.11GiB 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0) 
initializing variables: 0.230546951294 s 
Tensorflow SVD ---: 6.56117296219 s 
numpy SVD ---: 4.41714000702 s

Source

2017-09-21 user2109066

It looks like the TensorFlow op implements gesvd, whereas MKL-enabled NumPy/SciPy (i.e. what you get if you use conda) defaults to the faster (but less numerically robust) gesdd.

You could try comparing against gesvd in SciPy:

from scipy import linalg
u0, s0, vt0 = linalg.svd(target0, lapack_driver="gesvd")

I have also had better results with the MKL version, so I have been using this helper class to switch transparently between the TensorFlow and NumPy versions of SVD, using tf.Variable to store the results.

You use it like this:

result = SvdWrapper(tensor)
result.update()
sess.run([result.u, result.s, result.v])

More details about the slowness are in this issue: https://github.com/tensorflow/tensorflow/issues/13222

When I profile the code, I see NumPy distributing the load across all 8 CPU cores (Intel i7).
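To make the gesdd versus gesvd comparison concrete, here is a minimal sketch that times both LAPACK drivers through scipy.linalg.svd on the same matrix. The helper name time_svd, the matrix size N = 200 and the fixed seed are assumptions chosen for illustration, not part of the original answer; raise N toward 1534 to mirror the question. Timings depend heavily on the BLAS/LAPACK build, so treat the numbers as indicative only.

```python
import time

import numpy as np
from scipy import linalg


def time_svd(matrix, driver):
    """Run scipy.linalg.svd with one LAPACK driver.

    Returns (elapsed seconds, singular values). Hypothetical helper,
    written for this illustration only.
    """
    start = time.perf_counter()
    _, s, _ = linalg.svd(matrix, lapack_driver=driver)
    return time.perf_counter() - start, s


# Illustrative size (assumption); the question uses N = 1534.
N = 200
rng = np.random.default_rng(0)
a = rng.random((N, N)).astype(np.complex64)

timings = {}
values = {}
for driver in ("gesdd", "gesvd"):
    timings[driver], values[driver] = time_svd(a, driver)
    print("{}: {:.4f} s".format(driver, timings[driver]))

# Both drivers should agree on the singular values up to single-precision round-off.
max_diff = float(np.max(np.abs(values["gesdd"] - values["gesvd"])))
print("max singular value difference: {}".format(max_diff))
```

If gesvd turns out to be noticeably slower than gesdd on your machine, part of the gap in the question may come from the choice of LAPACK routine rather than from the GPU itself.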

Source

2017-09-22 00:03:00

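GPU execution usually outperforms the CPU only for algorithms that parallelize well.

However, parallelizing SVD is still a subject of active research, which means that parallel implementations are not yet much better than serial ones.

The NumPy version is probably based on a very well optimized FORTRAN implementation, whereas TensorFlow, I believe, has its own C++ implementation, which apparently is not as optimized as the code NumPy calls.

EDIT: You may not be the first to observe poorer SVD performance from TensorFlow compared to FORTRAN implementations.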
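One way to check what NumPy is actually calling is to print its build configuration and time the CPU side on its own. The sketch below is an illustration under stated assumptions: the helper best_svd_time, the size N = 200 and the repeat count are made up for this example. It also times complex64 and complex128 separately, because the code in the question hands NumPy a complex128 array (astype(complex)) while the TensorFlow variable is complex64, so the two measurements in the question are not quite like for like.

```python
import time

import numpy as np


def best_svd_time(matrix, repeats=3):
    """Return the fastest of `repeats` np.linalg.svd runs, in seconds.

    Hypothetical helper for this illustration; taking the minimum
    reduces the influence of one-off noise.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.linalg.svd(matrix)
        best = min(best, time.perf_counter() - start)
    return best


# Print the BLAS/LAPACK libraries this NumPy build is linked against
# (for example MKL or OpenBLAS).
np.show_config()

N = 200  # illustrative size (assumption); the question uses 1534
rng = np.random.default_rng(0)
base = rng.random((N, N))

results = {}
for dtype in (np.complex64, np.complex128):
    results[np.dtype(dtype).name] = best_svd_time(base.astype(dtype))

for name, seconds in sorted(results.items()):
    print("{}: {:.4f} s".format(name, seconds))
```

Running this with the same N and dtype as the TensorFlow run gives a CPU baseline that is easier to compare against the GPU timing.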

Source

2017-09-21 23:53:09 norok2

So I was expecting to see a benefit from having many (2560) CUDA cores. – user2109066

There have been earlier efforts to leverage the GPU, with a 5x improvement over Intel MKL, like this one: https://s3.amazonaws.com/academia.edu.documents/30806706/Sheetal09Singular.pdf?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1506052362&Signature=gCpal%2Fk2dCnhAUXgYE4sgjqPNOo%3D&response-content-disposition=inline%3B%20filename%3DSingular_value_decomposition_on_GPU_usin.pdf – user2109066
