2017-10-06

Reading a binary file using numpy fromfile together with lzma open

I have a binary file containing double values (64-bit floating point data).

Reading it with NumPy's fromfile

>>> data1 = numpy.fromfile(open('myfile', 'rb'))

gives the correct data:

>>> data1 
array([ 1.29000000e-07, 3.70000000e-08, 3.80000000e-08, 
    3.70000000e-08, 3.60000000e-08, 3.80000000e-08, 
    3.80000000e-08, 3.70000000e-08, 3.80000000e-08, 
    3.60000000e-08, 3.80000000e-08, 3.70000000e-08, 
    3.60000000e-08, 3.60000000e-08, 3.80000000e-08, 
    3.50000000e-08, 3.80000000e-08, 3.80000000e-08, 
    3.80000000e-08, 3.60000000e-08, 3.70000000e-08, 
    3.60000000e-08, 3.70000000e-08, 3.70000000e-08, 
    3.60000000e-08, 3.50000000e-08, 3.70000000e-08, 
    3.70000000e-08, 3.60000000e-08, 3.50000000e-08, 
    3.80000000e-08, 3.80000000e-08, 3.60000000e-08, 
    3.50000000e-08, 3.90000000e-08, 3.70000000e-08, 
    3.70000000e-08, 3.70000000e-08, 3.50000000e-08, 
    3.70000000e-08, 3.60000000e-08, 3.70000000e-08, 
    3.80000000e-08, 3.90000000e-08, 3.90000000e-08, 
    3.60000000e-08, 3.60000000e-08, 3.70000000e-08, 
    3.60000000e-08, 3.80000000e-08, 3.70000000e-08, 
    3.50000000e-08, 3.50000000e-08, 3.60000000e-08, 
    3.60000000e-08, 3.70000000e-08, 3.50000000e-08, 
    3.70000000e-08, 3.60000000e-08, 3.80000000e-08, 
    3.80000000e-08, 3.80000000e-08, 3.80000000e-08, 
    3.90000000e-08, 3.90000000e-08, 3.50000000e-08, 
    3.80000000e-08, 3.80000000e-08, 3.70000000e-08, 
    3.70000000e-08, 3.60000000e-08, 3.80000000e-08, 
    3.60000000e-08, 3.70000000e-08, 3.70000000e-08, 
    3.80000000e-08, 3.60000000e-08, 3.60000000e-08, 
    3.50000000e-08, 3.80000000e-08, 3.60000000e-08, 
    3.70000000e-08, 3.60000000e-08, 3.80000000e-08, 
    3.50000000e-08, 3.80000000e-08, 3.70000000e-08, 
    3.60000000e-08, 3.70000000e-08, 3.90000000e-08, 
    3.60000000e-08, 3.60000000e-08, 3.90000000e-08, 
    3.80000000e-08, 3.60000000e-08, 3.60000000e-08, 
    3.70000000e-08, 3.70000000e-08]) 

I then compressed this file with xz:

xz -k myfile

Reading the uncompressed file by name, data1 = numpy.fromfile('myfile'), still gives the same correct data. Afterwards I read the compressed file in Python using the lzma module:

>>> data2 = numpy.fromfile(lzma.open('myfile.xz')) 
>>> data2 
array([ 2.05244522e-289, 3.09873319e-303, -9.10852154e-136, 
    9.99900586e-150, -7.22647881e+061, -3.03508634e-168, 
    1.40409926e+097, -8.66961452e+219, 2.28992199e-308, 
    -7.28706929e+173, 1.41101250e+029, -2.94590886e-279, 
    7.21680144e+171, -4.62715868e+045, 3.05536517e-138, 
    -2.94268247e-043, -1.54563603e-295, 7.53024241e+102, 
    -1.22865109e+263, 2.62485731e+044, 4.52556260e-312, 
    1.18164036e-240, 3.56496646e-311, -2.82751232e+286, 
    1.69336097e+127]) 

Why does this happen? If I look at the contents of the two file objects via read(), I get:

>>> open('myfile', 'rb').read() 
b'B$\xf7\xffgP\x81>\x1c\xe8l\xc4=\xddc>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xb3z\xea\x05]\xcab>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\xb3z\xea\x05]\xcab>\x85U\xef\x82\x1e\xf0d>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>\xb3z\xea\x05]\xcab>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\xd1\x1e\xae#\xaefd>\x85U\xef\x82\x1e\xf0d>\x85U\xef\x82\x1e\xf0d>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>\xb3z\xea\x05]\xcab>\xb3z\xea\x05]\xcab>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\xb3z\xea\x05]\xcab>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\x85U\xef\x82\x1e\xf0d>\x85U\xef\x82\x1e\xf0d>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\x85U\xef\x82\x1e\xf0d>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\x85U\xef\x82\x1e\xf0d>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>' 
>>> lzma.open('myfile.xz').read() 
b'B$\xf7\xffgP\x81>\x1c\xe8l\xc4=\xddc>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xb3z\xea\x05]\xcab>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\xb3z\xea\x05]\xcab>\x85U\xef\x82\x1e\xf0d>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>\xb3z\xea\x05]\xcab>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\xd1\x1e\xae#\xaefd>\x85U\xef\x82\x1e\xf0d>\x85U\xef\x82\x1e\xf0d>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>\xb3z\xea\x05]\xcab>\xb3z\xea\x05]\xcab>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\xb3z\xea\x05]\xcab>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\x85U\xef\x82\x1e\xf0d>\x85U\xef\x82\x1e\xf0d>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\x85U\xef\x82\x1e\xf0d>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\x85U\xef\x82\x1e\xf0d>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>' 

The two byte strings look identical to me. The types are correct as well:

>>> type(data1) 
<class 'numpy.ndarray'> 
>>> type(data1[0]) 
<class 'numpy.float64'> 

>>> type(data2) 
<class 'numpy.ndarray'> 
>>> type(data2[0]) 
<class 'numpy.float64'> 

I would expect the contents of the arrays data1 and data2 to be identical.

Answers

1

So, I don't know the reason, but here is a solution, or at least one of them. I generated a test file with the tofile method.

I then read the compressed version with frombuffer:

import lzma
import numpy as np
data_xz = np.frombuffer(lzma.open('data.bin.xz', mode='rb').read())
data_bin = np.fromfile('data.bin')

and the two arrays are identical once read.
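A self-contained way to check this is sketched below. The file names and sample values are hypothetical, and the compressed file is written with Python's lzma module as a stand-in for the xz command-line tool.

```python
import lzma
import os
import tempfile

import numpy as np

# Hypothetical sample values resembling the data in the question.
original = np.array([1.29e-07, 3.7e-08, 3.8e-08, 3.6e-08], dtype=np.float64)

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "data.bin")  # hypothetical file name
    xz_path = raw_path + ".xz"

    # Write the raw float64 bytes, as ndarray.tofile does.
    original.tofile(raw_path)

    # Compress the file (stand-in for `xz -k data.bin`).
    with open(raw_path, "rb") as src, lzma.open(xz_path, "wb") as dst:
        dst.write(src.read())

    # Decompress fully in Python, then reinterpret the bytes as float64.
    with lzma.open(xz_path, "rb") as f:
        data_xz = np.frombuffer(f.read(), dtype=np.float64)

    # Read the uncompressed file directly by path.
    data_bin = np.fromfile(raw_path, dtype=np.float64)

print(np.array_equal(data_xz, data_bin))
```

Note that an array created by frombuffer from a bytes object is read-only; call .copy() on it if the data needs to be modified.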

My guess is that somewhere in the way np.fromfile reads bytes, it treats a regular file object differently from the file object returned by the lzma module.
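One plausible explanation (an assumption here, not something this answer confirms) is that numpy.fromfile reads through the operating-system file descriptor of the underlying file, so it sees the compressed bytes instead of the decompressed stream that read() returns. The sketch below does not call fromfile on an lzma object at all. It only shows that reinterpreting raw .xz bytes as float64 yields unrelated values and a different element count, which resembles the data2 output in the question. The sample data is hypothetical.

```python
import lzma

import numpy as np

# Hypothetical sample data: 100 float64 values near 3.7e-08.
original = np.full(100, 3.7e-08, dtype=np.float64)

# Compress in memory; lzma.compress writes the .xz container format by default.
compressed = lzma.compress(original.tobytes())

# What read() on an lzma file object hands back: the decompressed bytes.
decoded = np.frombuffer(lzma.decompress(compressed), dtype=np.float64)

# What results if the compressed bytes themselves are treated as float64.
# Trim to a multiple of 8 bytes, since frombuffer needs whole elements.
usable = len(compressed) - len(compressed) % 8
garbage = np.frombuffer(compressed[:usable], dtype=np.float64)

print(np.array_equal(decoded, original))
print(len(original), len(garbage))
```

If this explanation holds, decompressing with read() first and then calling frombuffer sidesteps the problem because NumPy never touches the file descriptor.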

In any case, it is best to store data in a consistent, well-defined format. For small data sets, plain text is fine. Otherwise there is joblib's persistence module or HDF5 for Python.
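For the plain-text option, a minimal sketch could look like the following. The choice of np.savetxt and np.loadtxt is an assumption (the answer does not name specific functions), and the file name and values are hypothetical.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample values resembling the data in the question.
values = np.array([1.29e-07, 3.7e-08, 3.8e-08, 3.6e-08])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")  # hypothetical file name
    np.savetxt(path, values)              # writes one value per line as text
    restored = np.loadtxt(path)           # parses the text back into float64

# Compare with a relative tolerance only: the values are tiny, so the
# default absolute tolerance of allclose would hide any real difference.
print(np.allclose(values, restored, rtol=1e-12, atol=0.0))
```

Text files are larger and slower to parse than raw binary, which is why this suits small data sets only.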

+0

Thanks for the solution/workaround. I would never have tried 'frombuffer'.

+0

Glad it helped. :-) I thought of it because the underlying data is bytes, but 'fromfile' should have handled that too. :-/