Extracting metadata from genome gbff files

I have more than 1,000 .gbff.gz genome files. I would like to extract metadata from each one and have the metadata items in separate columns.

2017-09-28 user2861089

What programming language? – bfontaine

@bfontaine Matlab or R, if possible? – user2861089

You can use genbankread from the Bioinformatics Toolbox in Matlab. Here is an example of how to achieve what you want:

results = [];

% unzip data
gunzip('*.gbff.gz');

% process each file
files = dir('*.gbff');
for file = {files.name}
    data = genbankread(char(file));

    % process each file entry (one struct per GenBank record)
    for i = 1:size(data, 2)
        % defaults for items that may be missing from a record
        LocusName = '';
        Definition = '';
        Organism = '';
        GenesTotal = NaN;
        GenesCoding = NaN;
        RRNAs = '';
        TRNAs = NaN;
        IsolationSource = '';
        Country = '';

        % copy fields
        if isfield(data(i), 'LocusName')
            LocusName = data(i).LocusName;
        end
        if isfield(data(i), 'Definition')
            Definition = data(i).Definition;
        end
        if isfield(data(i), 'Source')
            Organism = data(i).Source;
        end

        % parse comments: look for lines of the form "key :: value"
        if isfield(data(i), 'Comment')
            for j = 1:size(data(i).Comment, 1)
                % lazy groups so that single-character values also match
                tokens = regexp(data(i).Comment(j, :), ...
                    '^\s*(\S.*?)\s*::\s*(\S.*?)\s*$', 'tokens');
                if ~isempty(tokens)
                    switch tokens{1}{1}
                        case 'Genes (total)'
                            GenesTotal = str2double(tokens{1}{2});
                        case 'Genes (coding)'
                            GenesCoding = str2double(tokens{1}{2});
                        case 'rRNAs'
                            RRNAs = tokens{1}{2};
                        case 'tRNAs'
                            TRNAs = str2double(tokens{1}{2});
                    end
                end
            end
        end

        % parse features
        if isfield(data(i), 'Features')
            Feature = '';
            for j = 1:size(data(i).Features, 1)
                % a line starting in column 1 names a new feature (e.g. source)
                tokens = regexp(data(i).Features(j, :), '^(\w+)', 'tokens');
                if isempty(tokens)
                    % otherwise try to read a /qualifier="value" line
                    tokens = regexp(data(i).Features(j, :), ...
                        '^\s+/(\w+)="([^"]+)"', 'tokens');
                    if ~isempty(tokens)
                        switch Feature
                            case 'source'
                                switch tokens{1}{1}
                                    case 'isolation_source'
                                        IsolationSource = tokens{1}{2};
                                    case 'country'
                                        Country = tokens{1}{2};
                                end
                        end
                    end
                else
                    Feature = tokens{1}{1};
                end
            end
        end

        % append entries to results
        results = [results; struct(...
            'File', char(file), 'LocusName', LocusName, 'Definition', Definition, ...
            'Organism', Organism, 'GenesTotal', GenesTotal, ...
            'GenesCoding', GenesCoding, 'RRNAs', RRNAs, 'TRNAs', TRNAs, ...
            'IsolationSource', IsolationSource, 'Country', Country)];
    end
end

% the extracted metadata is now in the struct array results
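
Since the goal is one column per metadata item, the struct array can be turned into a table and written to a delimited file. The following is only a minimal sketch, assuming a MATLAB release that provides struct2table and writetable. The two records are made-up stand-ins for results with a reduced set of fields, and the output path is a placeholder.

```matlab
% Stand-in for the results struct array (values are illustrative only)
results = [ ...
    struct('File', 'example1.gbff', 'LocusName', 'LOCUS_A', 'GenesTotal', 4321, ...
           'IsolationSource', 'Human', 'Country', 'Switzerland'); ...
    struct('File', 'example2.gbff', 'LocusName', 'LOCUS_B', 'GenesTotal', NaN, ...
           'IsolationSource', '', 'Country', '')];

% One row per record, one column per field;
% 'AsArray' keeps this layout even if results holds a single record
T = struct2table(results, 'AsArray', true);

% Write a comma-separated file; outFile is a placeholder path
outFile = fullfile(tempdir, 'metadata.csv');
writetable(T, outFile);
```

The resulting file can then be opened in a spreadsheet or read into R with read.csv.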

2017-10-01 14:47:58

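Awesome, thank you! – user2861089

I tried adding variables such as '/isolation_source="Human"' and '/country="Switzerland"' to the results, but I get an error. I think it is because of the "/" in front? Anyway, everything else works great. Thanks. – user2861089

@user2861089 I updated the code to include simple parsing and extraction of the features block. –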
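
On the "/" question: in a MATLAB regular expression the slash is an ordinary character, so a qualifier line can be matched literally. A small sketch using the same qualifier pattern as the answer's code, applied to a hand-written sample line (the sample line is an assumption about typical formatting, not taken from the asker's files):

```matlab
% Hand-written sample of a qualifier line from a FEATURES block
sampleLine = '                     /country="Switzerland"';

% Capture the qualifier name and its quoted value; "/" needs no escaping
tokens = regexp(sampleLine, '^\s+/(\w+)="([^"]+)"', 'tokens');

if ~isempty(tokens)
    name  = tokens{1}{1};   % qualifier name
    value = tokens{1}{2};   % qualifier value
    fprintf('%s -> %s\n', name, value);
end
```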
