2010-12-08

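I have downloaded some .list files from the IMDb database and want to use them for social network analysis (including references) with SNA software, whose input must be XML or CSV. How can I create an XML file from the .list file format?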

I don't know what a ".list" file looks like. Please add a sample. – Tomalak

For example: movies.list, from this page: ftp://ftp.fu-berlin.de/pub/misc/movies/database/ – gaponte69

Sorry, I can't access FTP servers from where I work. I think it is best to add the relevant information to the question anyway, so that everything is in one place. – Tomalak

Answers

Here's something I cooked up, a sed file 'movies2xml.sed':

# ampersand etc .. 
s|&|\&amp;|g
s|<|\&lt;|g 
s|>|\&gt;|g 
# last field, if range 
s|\([12\?][0189\?][0-9\?][0-9\?]\)-\([12\?][0189\?][0-9\?][0-9\?]\)$|<when><f>\1</f><t>\2</t></when>| 
# last field, if single 
s|\([12?][0189?][0-9?][0-9?]\)$|<when><y>\1</y></when>| 
# made-for tv/vid/vidgame .. 
s|(\([TVG][TVG]*\)) *<when|<for>\1</for><when| 
# episode 
s|{\(.*\)} *|<ep>\1</ep>| 
# ep season, number 
s|<ep>\(.*\)(#\([0-9][0-9]*\)\.\([0-9][0-9]*\))</ep>|<ep s='\2' e='\3'>\1</ep>| 
# release year/Number (when titles are duplicated in a year) 
s| (\([12\?][0189\?][0-9\?][0-9\?]\)\/*\([IVX]*\)) <|<y N='\2'>\1</y><| 
s|<y N=''>|<y>| 
# TV titles 
s|^"\([^<]*\)"<y|<title type='tvseries'>\1</title><y| 
# titles 
s|^\(.[^<]*\)<y|<title type='film'>\1</title><y| 
# vid game 
s| type='film'\(.*<for>VG<\)| type='videogame'\1| 
# wrap tag 
s|^\(<.*>\)$|<entry>\1</entry>| 
# rm other text 
s|^\([^<].*\)$|<!-- \1 -->| 
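
As a quick sanity check of what these rules produce, here is a minimal sketch that runs a trimmed subset of them on one made-up input line. It uses four of the substitutions above, with the release-year rule simplified so it drops the N attribute. The sample line, the temporary rules file and the variable names are assumptions for illustration. Real movies.list lines are tab-separated and would go through the tr steps first.

```shell
# Hypothetical sample line, as it would look after tabs are turned into a single space.
line='Alien (1979) 1979'

# Trimmed subset of movies2xml.sed: trailing year, release year, film title, wrapper.
rules=$(mktemp)
cat > "$rules" <<'EOF'
s|\([12?][0189?][0-9?][0-9?]\)$|<when><y>\1</y></when>|
s| (\([12?][0189?][0-9?][0-9?]\)) <|<y>\1</y><|
s|^\(.[^<]*\)<y|<title type='film'>\1</title><y|
s|^\(<.*>\)$|<entry>\1</entry>|
EOF

out=$(printf '%s\n' "$line" | sed -f "$rules")
rm -f "$rules"

printf '%s\n' "$out"
# prints: <entry><title type='film'>Alien</title><y>1979</y><when><y>1979</y></when></entry>
```

The order matters: the trailing year is tagged first, so the later rules can anchor on the `<when` and `<y` tags that earlier rules have already inserted.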

The XML tag names are a bit terse, but there are 2,936,679 entries, making up 334 MB (as of June '14) ..

I process the gzipped IMDb file like this:

(F=movies.xml ; echo '<list>' > $F ; \
zcat movies.list.gz | \
    tr '\t' ' ' | tr -s ' -' | recode l9..u8..xml | \
    sed -f movies2xml.sed >> $F ; \
echo '</list>' >> $F ;) &

The resulting XML output validates against this XSD:

<?xml version="1.0" encoding="UTF-8"?> 
<!-- imdb_movies_list.xsd --> 
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"> 
    <xs:element name="list"> 
    <xs:complexType> 
     <xs:sequence> 
     <xs:element minOccurs="0" maxOccurs="unbounded" ref="entry"/> 
     </xs:sequence> 
    </xs:complexType> 
    </xs:element> 
    <xs:element name="entry"> 
    <xs:complexType> 
     <xs:sequence> 
     <xs:element minOccurs="1" maxOccurs="1" ref="title"/> 
     <xs:element minOccurs="1" maxOccurs="1" ref="y"/> 
     <xs:choice> 
      <xs:element minOccurs="0" maxOccurs="1" ref="for"/> 
      <xs:element minOccurs="0" maxOccurs="1" ref="ep"/> 
     </xs:choice> 
     <xs:element minOccurs="1" maxOccurs="1" ref="when"/> 
     </xs:sequence> 
    </xs:complexType> 
    </xs:element> 
    <xs:element name="title"> 
    <xs:complexType mixed="true"> 
     <xs:attribute name="type" use="required"> 
     <xs:simpleType> 
      <xs:restriction base="xs:token"> 
      <xs:enumeration value="tvseries"/> 
      <xs:enumeration value="videogame"/> 
      <xs:enumeration value="film"/> 
      </xs:restriction> 
     </xs:simpleType> 
     </xs:attribute> 
    </xs:complexType> 
    </xs:element> 
    <xs:element name="y"> 
    <xs:complexType> 
     <xs:simpleContent> 
     <xs:extension base="yeartype"> 
      <xs:attribute name="N" use="optional"> 
      <xs:simpleType> 
       <xs:restriction base="xs:token"> 
      <xs:enumeration value="I"/> 
      <xs:enumeration value="II"/> 
      <xs:enumeration value="III"/> 
      <xs:enumeration value="IV"/> 
      <xs:enumeration value="V"/> 
      <xs:enumeration value="VI"/> 
      <xs:enumeration value="VII"/> 
      <xs:enumeration value="VIII"/> 
      <xs:enumeration value="IX"/> 
      <xs:enumeration value="X"/> 
      <xs:enumeration value="XI"/> 
      <xs:enumeration value="XII"/> 
      <xs:enumeration value="XIII"/> 
      <xs:enumeration value="XIV"/> 
      <xs:enumeration value="XV"/> 
      <xs:enumeration value="XVI"/> 
      <xs:enumeration value="XVII"/> 
      <xs:enumeration value="XVIII"/> 
      <xs:enumeration value="XIX"/> 
      <xs:enumeration value="XX"/> 
      <xs:enumeration value="XXI"/> 
      <xs:enumeration value="XXII"/> 
      <xs:enumeration value="XXIII"/> 
      <xs:enumeration value="XXIV"/> 
      <xs:enumeration value="XXV"/> 
      <xs:enumeration value="XXVI"/> 
      <xs:enumeration value="XXVII"/> 
      <xs:enumeration value="XXVIII"/> 
      <xs:enumeration value="XXIX"/> 
       </xs:restriction> 
      </xs:simpleType> 
      </xs:attribute> 
     </xs:extension> 
     </xs:simpleContent> 
    </xs:complexType> 
    </xs:element> 
    <xs:element name="for"> 
    <xs:simpleType> 
     <xs:restriction base="xs:token"> 
     <xs:enumeration value="TV"/> 
     <xs:enumeration value="V"/> 
     <xs:enumeration value="VG"/> 
     </xs:restriction> 
    </xs:simpleType> 
    </xs:element> 
    <xs:element name="ep"> 
    <xs:complexType mixed="true"> 
     <xs:attribute name="s" type="xs:integer" use="optional"/> 
     <xs:attribute name="e" type="xs:integer" use="optional"/> 
    </xs:complexType> 
    </xs:element> 
    <xs:element name="when"> 
    <xs:complexType> 
     <xs:choice> 
     <xs:sequence> 
      <xs:element name="y" type="yeartype" minOccurs="1" maxOccurs="1"/> 
     </xs:sequence> 
     <xs:sequence> 
      <xs:element name="f" type="yeartype" minOccurs="1" maxOccurs="1"/> 
      <xs:element name="t" type="yeartype" minOccurs="1" maxOccurs="1"/> 
     </xs:sequence> 
     </xs:choice> 
    </xs:complexType> 
    </xs:element> 
    <xs:simpleType name="yeartype"> 
    <xs:restriction base="xs:string"> 
     <xs:pattern value="[12?][0189?][0-9?][0-9?]"/> 
    </xs:restriction> 
    </xs:simpleType> 
</xs:schema> 
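
One way to run that validation is sketched below. It assumes xmllint (from libxml2) is installed and that the schema is saved as imdb_movies_list.xsd, the name given in its header comment. The validate function name is hypothetical, and whether xmllint copes comfortably with a file of this size has not been checked here.

```shell
# Validate an XML file against an XSD with xmllint.
# $1 = schema file, $2 = XML file.
# Returns 2 if xmllint is missing, otherwise xmllint's own exit status.
validate() {
  if ! command -v xmllint >/dev/null 2>&1; then
    echo "xmllint not found; install libxml2's command-line tools" >&2
    return 2
  fi
  # --noout suppresses the re-serialised document, --schema selects the XSD.
  xmllint --noout --schema "$1" "$2"
}

# Usage, with the file names used in this answer:
#   validate imdb_movies_list.xsd movies.xml
```

A zero exit status means the document validated. xmllint reports any failures on standard error.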

For those who prefer it, I expect there is an xml-to-json converter out there somewhere.