infiniband에서 MPI로 slurm SBATCH 작업 또는 SRUN 작업을 사용할 때 문제가 있습니다.SLURM에서 openMPI 작업을 사용하는 segmention 오류
OpenMPI가 설치되어 있고 mpirun -n 30 ./hello
과 함께 다음 테스트 프로그램 (hello)을 실행하면 작동합니다.
// compilation: mpicc -o helloMPI helloMPI.c
#include <mpi.h>
#include <stdio.h>
int main (int argc, char * argv [])
{
int myrank, nproc;
MPI_Init (&argc, &argv);
MPI_Comm_size (MPI_COMM_WORLD, &nproc);
MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
printf ("hello from rank %d of %d\n", myrank, nproc);
MPI_Barrier (MPI_COMM_WORLD);
MPI_Finalize();
return 0;
}
너무 :
[email protected]:~/hello$ mpicc -o hello hello.c
[email protected]:~/hello$ mpirun -n 30 ./hello
--------------------------------------------------------------------------
[[5627,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: usNIC
Host: master
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
hello from rank 25 of 30
hello from rank 1 of 30
hello from rank 6 of 30
[...]
hello from rank 17 of 30
나는이 같은 세그먼트 오류를 얻을 SLURM을 통해 그것을 실행하려고 :
[email protected]:~/hello$ srun -n 20 ./hello
[node05:01937] *** Process received signal ***
[node05:01937] Signal: Segmentation fault (11)
[node05:01937] Signal code: Address not mapped (1)
[node05:01937] Failing at address: 0x30
[node05:01937] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7fcf6bf7ecb0]
[node05:01937] [ 1] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x244c6)[0x7fcf679b64c6]
[node05:01937] [ 2] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x254cb)[0x7fcf679b74cb]
[node05:01937] [ 3] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0xb1)[0x7fcf679b2141]
[node05:01937] [ 4] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x10ad0)[0x7fcf679a2ad0]
[node05:01937] [ 5] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_btl_base_select+0x114)[0x7fcf6c209b34]
[node05:01937] [ 6] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x7fcf67bca652]
[node05:01937] [ 7] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_bml_base_init+0x69)[0x7fcf6c209359]
[node05:01937] [ 8] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_pml_ob1.so(+0x5975)[0x7fcf65d1b975]
[node05:01937] [ 9] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_pml_base_select+0x35c)[0x7fcf6c21a0bc]
[node05:01937] [10] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(ompi_mpi_init+0x4ed)[0x7fcf6c1cb89d]
[node05:01937] [11] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(MPI_Init+0x16b)[0x7fcf6c1eb56b]
[node05:01937] [12] /home/user/hello/./hello[0x400826]
[node05:01937] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fcf6bbd076d]
[node05:01937] [14] /home/user/hello/./hello[0x400749]
[node05:01937] *** End of error message ***
[node05:01938] *** Process received signal ***
[node05:01938] Signal: Segmentation fault (11)
[node05:01938] Signal code: Address not mapped (1)
[node05:01938] Failing at address: 0x30
[node05:01940] *** Process received signal ***
[node05:01940] Signal: Segmentation fault (11)
[node05:01940] Signal code: Address not mapped (1)
[node05:01940] Failing at address: 0x30
[node05:01938] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f68b2e10cb0]
[node05:01938] [ 1] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x244c6)[0x7f68ae8484c6]
[node05:01938] [ 2] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x254cb)[0x7f68ae8494cb]
[node05:01940] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f8af1d82cb0]
[node05:01940] [ 1] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x244c6)[0x7f8aed7ba4c6]
[node05:01940] [ 2] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x254cb)[0x7f8aed7bb4cb]
[node05:01940] [ 3] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0xb1)[0x7f8aed7b6141]
[node05:01940] [ 4] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x10ad0)[0x7f8aed7a6ad0]
[node05:01938] [ 3] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0xb1)[0x7f68ae844141]
[node05:01938] [ 4] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_btl_openib.so(+0x10ad0)[0x7f68ae834ad0]
[node05:01938] [ 5] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_btl_base_select+0x114)[0x7f68b309bb34]
[node05:01938] [ 6] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x7f68aea5c652]
[node05:01940] [ 5] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_btl_base_select+0x114)[0x7f8af200db34]
[node05:01940] [ 6] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x7f8aed9ce652]
[node05:01938] [ 7] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_bml_base_init+0x69)[0x7f68b309b359]
[node05:01938] [ 8] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_pml_ob1.so(+0x5975)[0x7f68acbad975]
[node05:01940] [ 7] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_bml_base_init+0x69)[0x7f8af200d359]
[node05:01940] [ 8] /opt/cluster/spool/openMPI/1.8/gcc/lib/openmpi/mca_pml_ob1.so(+0x5975)[0x7f8aebb1f975]
[node05:01940] [ 9] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_pml_base_select+0x35c)[0x7f8af201e0bc]
[node05:01938] [ 9] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(mca_pml_base_select+0x35c)[0x7f68b30ac0bc]
[node05:01938] [10] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(ompi_mpi_init+0x4ed)[0x7f68b305d89d]
[node05:01940] [10] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(ompi_mpi_init+0x4ed)[0x7f8af1fcf89d]
[node05:01938] [11] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(MPI_Init+0x16b)[0x7f68b307d56b]
[node05:01938] [12] /home/user/hello/./hello[0x400826]
[node05:01940] [11] /opt/cluster/spool/openMPI/1.8/gcc/lib/libmpi.so.1(MPI_Init+0x16b)[0x7f8af1fef56b]
[node05:01940] [12] /home/user/hello/./hello[0x400826]
[node05:01938] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f68b2a6276d]
[node05:01938] [14] /home/user/hello/./hello[0x400749]
[node05:01938] *** End of error message ***
[node05:01940] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f8af19d476d]
[node05:01940] [14] /home/user/hello/./hello[0x400749]
[node05:01940] *** End of error message ***
[node05:01939] *** Process received signal ***
[node05:01939] Signal: Segmentation fault (11)
[node05:01939] Signal code: Address not mapped (1)
[node05:01939] Failing at address: 0x30
[...]etc
문제가 무엇 사람이 알고 있나요? 나는 Slurm을 지원하는 openMPI를 구축했으며 동일한 버전의 컴파일러와 libs를 설치했다. 사실 모든 libs는 각 노드에 마운트 된 NFS 공유에있다.
말 :이 설치되어
이그것은, 인피니 밴드를 사용해야합니다. 내가 에 mpirun으로하는 openmpi 미사일 때 나는 "인피니 밴드를 통해 실행되지 않음"을 의미 추측
[[5627,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: usNIC
Host: cluster
을 알 수 있습니다. infiniband 드라이버를 설치하고 Infiniband를 통해 IP를 설정합니다. Slurm은 infiniband IP와 함께 실행되도록 구성되었습니다. 올바른 구성입니까? 사전에
감사 안부
편집 : 난 그냥 대신하는 openmpi의 MPICH2로 컴파일하고 SLURM 작업을 시도
. 따라서 문제는 아마도 Slurm 구성이 아닌 openMPI와 관련이 있습니까?
EDIT 2 : 실제로 SRUN 스크립트 대신 SBATCH 명령을 사용하여 1.8 대신 openMPI 1.6.5 (1.8 대신)를 사용하여 스레드 번호, 순위 및 호스트를 반환한다는 것을 보았습니다. openfabric 공급 업체 등록 메모리 할당에 관련된 :
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory. This typically can indicate that the
memlock limits are set too low. For most HPC installations, the
memlock limits should be set to "unlimited". The failure occured
here:
Local host: node05
OMPI source: btl_openib_component.c:1216
Function: ompi_free_list_init_ex_new()
Device: mlx4_0
Memlock limit: 65536
You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: node05
Local device: mlx4_0
--------------------------------------------------------------------------
Hello world from process 025 out of 048, processor name node06
Hello world from process 030 out of 048, processor name node06
Hello world from process 032 out of 048, processor name node06
Hello world from process 046 out of 048, processor name node07
Hello world from process 031 out of 048, processor name node06
Hello world from process 041 out of 048, processor name node07
Hello world from process 034 out of 048, processor name node06
Hello world from process 044 out of 048, processor name node07
Hello world from process 033 out of 048, processor name node06
Hello world from process 045 out of 048, processor name node07
Hello world from process 026 out of 048, processor name node06
Hello world from process 043 out of 048, processor name node07
Hello world from process 024 out of 048, processor name node06
Hello world from process 038 out of 048, processor name node07
Hello world from process 014 out of 048, processor name node05
Hello world from process 027 out of 048, processor name node06
Hello world from process 036 out of 048, processor name node07
Hello world from process 019 out of 048, processor name node05
Hello world from process 028 out of 048, processor name node06
Hello world from process 040 out of 048, processor name node07
Hello world from process 023 out of 048, processor name node05
Hello world from process 042 out of 048, processor name node07
Hello world from process 018 out of 048, processor name node05
Hello world from process 039 out of 048, processor name node07
Hello world from process 021 out of 048, processor name node05
Hello world from process 047 out of 048, processor name node07
Hello world from process 037 out of 048, processor name node07
Hello world from process 015 out of 048, processor name node05
Hello world from process 035 out of 048, processor name node06
Hello world from process 020 out of 048, processor name node05
Hello world from process 029 out of 048, processor name node06
Hello world from process 016 out of 048, processor name node05
Hello world from process 017 out of 048, processor name node05
Hello world from process 022 out of 048, processor name node05
Hello world from process 012 out of 048, processor name node05
Hello world from process 013 out of 048, processor name node05
Hello world from process 000 out of 048, processor name node04
Hello world from process 001 out of 048, processor name node04
Hello world from process 002 out of 048, processor name node04
Hello world from process 003 out of 048, processor name node04
Hello world from process 006 out of 048, processor name node04
Hello world from process 009 out of 048, processor name node04
Hello world from process 011 out of 048, processor name node04
Hello world from process 004 out of 048, processor name node04
Hello world from process 007 out of 048, processor name node04
Hello world from process 008 out of 048, processor name node04
Hello world from process 010 out of 048, processor name node04
Hello world from process 005 out of 048, processor name node04
[node04:04390] 47 more processes have sent help message help-mpi-btl-openib.txt/init-fail-no-mem
[node04:04390] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help/error messages
[node04:04390] 47 more processes have sent help message help-mpi-btl-openib.txt/error in device init
내가 그에서 이해하는 것은) v.1.6.5이 더 나은 오류 처리를 가지고와 b) 내가하는 openmpi 및/또는 인피니 밴드를 구성해야한다는 것입니다 더 많은 등록 된 메모리 크기를 가진 드라이버. 내가 this page보고 분명히 내가 openMPI 물건을 수정해야합니까? 나는 그것을 테스트해야한다 ...
답장을 보내 주셔서 감사합니다. 실제로 제가 제안한 페이지를 보았습니다.이 방법으로 사용합니다. 하지만 로그에서 slurm을 통해 사용할 때 infiniband 패브릭을 통해 메모리를 등록하는 데 문제가 있음을 확인했습니다. 나는이 페이지를 발견했다 : http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem 그리고 나는 그것을 풀려고 노력하고있다. 내가 잠시 이해할 수없는 것은 inifniband 매개 변수를 변경해야하는 것입니다 (추가 매개 변수로 커널 모듈을 다시 컴파일해야 할 수도 있습니다). 또는 일부 openMPI 설정을 변경해야하는 경우입니다 ... – Danduk82
아마 이것을보고 싶을 것입니다 FAQ 항목 : http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more SLURM 리소스 데몬이 올바른 잠긴 페이지 제한으로 시작하지 않는다고 생각합니다. 커널을 다시 컴파일 할 필요는 없습니다. SLURM이 모든 잠긴 메모리를 사용하도록 허용하십시오. MPICH에서 실행하면 InfiniBand를 사용하지 않습니다. 그래서 같은 종류의 경고/오류가 발생하지 않습니다. 당신이 관심이 있다면, 우리는 방금 본 "경고"를 수정했습니다. 그것은 1.8.2에 포함될 것입니다 : https://svn.open-mpi.org/trac/ompi/changeset/31490 –
흠, 왜 내가 mpich와 함께 inifinband를 사용하지 않는다고 생각합니까? 내가 뭘 봐야하니? 나는 그들이 infiniband ip가 slurm.conf에 설정된 것들이라고 생각한다. – Danduk82