C: How to avoid producing non-UTF-8 characters when using getc?

I am currently writing a C program that takes 3 arguments: two files (one input and one output) and an int (the maximum length of an output line, call it x). I want to read every line of the input file and write its first x characters to the output file (effectively "trimming" the file).

Here is my code:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {

    const char endOfLine = '\n';

    if (argc < 4) {
        printf("Program takes 4 params\n");
        exit(1);
    } else {
        // Convert character argument [3] (line length) to an int
        int maxLen = atoi(argv[3]);

        char str[maxLen];
        char *inputName;
        char *outputName;

        inputName = argv[1];
        outputName = argv[2];

        // Open files to be read and written to
        FILE *inFile = fopen(inputName, "r");
        FILE *outFile = fopen(outputName, "w");

        int count = 0;
        char ch = getc(inFile);
        while (ch != EOF) {
            if (ch == '\n') {
                str[count] = (char)ch;
                printf("Adding %s to output\n", str);
                fputs(str, outFile);
                count = 0;
            } else if (count < maxLen) {
                str[count] = ch;
                printf("Adding %c to str\n", ch);
                count++;
            } else if (count == maxLen) {
                str[count] = '\n';
            }
            ch = getc(inFile);
        }

    }

    return 0;
}

The only problem is that when the last character is an apostrophe, the program outputs non-UTF-8 characters, like this:

For Whom t 
John Donne 
No man is 
Entire of 
Each is a 
A part of 
If a clod 
Europe is 
As well as 
As well as 
Or of thin 
Each man�� 
For I am i 
Therefore, 
For whom t

Source: 2016-12-09, rafro4

You have undefined behavior in the 'else if (count == maxLen)' section, which overflows the array. – paddy

What is a non-UTF8 character in a data stream that contains single-byte characters? – bvj

@bvj An 8-bit 'char' outside the range 0-127 is not, on its own, a correctly encoded UTF8 code point. – chux
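To illustrate the point about bytes outside 0-127, here is a minimal sketch that classifies a byte by its leading bits. The name utf8_sequence_length is hypothetical, not something from the thread, and the function only inspects the bit pattern; it does not reject overlong or out-of-range lead bytes.

```c
/* Hypothetical helper (not from the thread): number of bytes in the UTF-8
   sequence that starts with `lead`, or 0 if `lead` cannot start a sequence.
   Only the leading bit pattern is inspected. */
int utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx: ASCII, 0-127 */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx: lead of 2-byte sequence */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx: lead of 3-byte sequence */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx: lead of 4-byte sequence */
    return 0;                             /* 10xxxxxx continuation, or invalid */
}
```

Assuming the apostrophe in the input is the typographic U+2019 (bytes E2 80 99), a cut after 10 bytes of that line would keep E2 80 and drop 99, which would be consistent with the two broken characters in the output shown in the question.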

You can check whether the last char output is a UTF-8 continuation byte, 10xxxxxx. If it is, keep outputting until the character is complete.

// bits match 10xxxxxx 
int is_utf_continue_byte(int ch){ 
    return ch & 0x80 && ~ch & 0x40; 
} 

//... 
while (is_utf_continue_byte(ch)) 
    putchar(ch), ch = getchar();
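As a rough sketch of how that idea could fit into the trimming loop from the question: the function below writes at most max_bytes bytes per line, then keeps writing only while the next byte is a continuation byte, and drops the rest of the line. The names trim_lines and is_utf8_continuation are hypothetical, the input is assumed to be well-formed UTF-8, and the limit is counted in bytes rather than characters. It writes straight to the output stream, so no fixed-size buffer is involved.

```c
#include <stdio.h>

/* Nonzero when ch has the bit pattern 10xxxxxx (a UTF-8 continuation byte). */
int is_utf8_continuation(int ch)
{
    return (ch & 0xC0) == 0x80;
}

/* Hypothetical helper: copy each line of `in` to `out`, keeping at most
   max_bytes bytes per line but never stopping in the middle of a
   multi-byte UTF-8 sequence. Assumes well-formed UTF-8 input. */
void trim_lines(FILE *in, FILE *out, int max_bytes)
{
    int count = 0;  /* bytes kept on the current line */
    int ch;         /* int, so that EOF is distinguishable from every byte */

    while ((ch = getc(in)) != EOF) {
        if (ch == '\n') {
            putc('\n', out);          /* end the (possibly trimmed) line */
            count = 0;
        } else if (count < max_bytes) {
            putc(ch, out);            /* still within the limit */
            count++;
        } else if (count == max_bytes && is_utf8_continuation(ch)) {
            putc(ch, out);            /* finish the character that was cut */
        } else {
            count = max_bytes + 1;    /* limit passed: drop the rest of the line */
        }
    }
}
```

A caller would open the two files with fopen, check both results for NULL, call trim_lines(inFile, outFile, maxLen), and close the files afterwards.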

Source: 2016-12-09 03:49:31

How would I do that? – rafro4

First make 'ch' an 'int' so that the comparison with 'EOF' is correct, then use 'while (ch & 0x80 && ~ch & 0x40) putchar(ch), ch = getchar();'. This checks that bit 7 is 1 ('ch & 0x80') and bit 6 is 0 ('~ch & 0x40'). In UTF-8, only continuation bytes fit this pattern. –
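The reason for the 'int' is that getc returns either a byte value as an unsigned char converted to int, or EOF, which is negative. Stored in a 'char', a byte such as 0xFF can compare equal to EOF where char is signed, and where char is unsigned the comparison with EOF never succeeds. A minimal sketch of the usual read loop, with a hypothetical helper name:

```c
#include <stdio.h>

/* Hypothetical helper: count the bytes remaining in a stream. */
long count_bytes(FILE *in)
{
    long n = 0;
    int ch;  /* int, not char: must hold every byte value 0..255 and EOF */

    while ((ch = getc(in)) != EOF)
        n++;
    return n;
}
```

With 'int ch', a 0xFF byte in the input is counted like any other byte instead of being mistaken for the end of the file.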

Why is it better to write '(ch & 0x80 && ~ch & 0x40)' rather than '(ch & 0xC0) == 0x80'? And why not use braces in the loop? – immibis
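For what it is worth, the two expressions appear to select exactly the same bytes, so the choice looks like a matter of readability. A small sketch that checks this for every byte value (all function names here are hypothetical):

```c
/* Form used in the answer: bit 7 set and bit 6 clear. */
int is_cont_two_tests(int ch)
{
    return ch & 0x80 && ~ch & 0x40;
}

/* Form suggested in the comment: mask the top two bits, compare with 10. */
int is_cont_mask(int ch)
{
    return (ch & 0xC0) == 0x80;
}

/* Returns 1 if both forms agree for every byte value 0..255, else 0. */
int forms_agree(void)
{
    for (int ch = 0; ch <= 255; ch++) {
        if (is_cont_two_tests(ch) != is_cont_mask(ch))
            return 0;
    }
    return 1;
}
```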
