What is the proper solution for code points and code units in C++ strings?

In Java, String has the methods length()/charAt() and codePointCount()/codePointAt().

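In C++11 I can write std::string a = u8"很烫烫的一锅汤";

But a.size() is the length of the underlying char array (the number of code units), so it cannot be used to index Unicode characters.

Is there a solution for Unicode in C++ strings?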

Source

2017-04-09 linrongbin

Did you check this answer? http://stackoverflow.com/a/31475700/58129 –

I usually convert UTF-8 to a UTF-32/UCS-2 std::wstring so that each code point is one character. This answer has code for the conversion: https://stackoverflow.com/questions/42791433/c-tolower-on-special-characters-such-as-%c3%bc/42793626#42793626 Otherwise, use a library. – Galik

UCS-2 does not have room for all Chinese characters. –

I generally convert the UTF-8 string to a UTF-32/UCS-2 string before doing character operations. C++ does actually give us functions to do that, but they are not very user friendly, so I have written some nicer conversion functions here:

#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <iostream>
#include <locale>    // std::wstring_convert (deprecated since C++17)
#include <stdexcept>
#include <string>

// This should convert to whatever the system wide character encoding
// is for the platform (UTF-32/Linux - UCS-2/Windows)
std::string ws_to_utf8(std::wstring const& s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::string utf8 = cnv.to_bytes(s);
    // converted() reports how many input characters were consumed
    if(cnv.converted() < s.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

std::wstring utf8_to_ws(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::wstring s = cnv.from_bytes(utf8);
    // converted() reports how many input bytes were consumed
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return s;
}

int main()
{
    // C++11 to C++17 only: in C++20 a u8 literal has type char8_t
    std::string s = u8"很烫烫的一锅汤";

    auto w = utf8_to_ws(s); // convert to wide (UTF-32/UCS-2)

    // now we can use code-point indexes on the wide string

    std::cout << s << " is " << w.size() << " characters long" << '\n';
}

Output:

很烫烫的一锅汤 is 7 characters long
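
If only the code-point count is needed, as in the output above, it can be computed directly on the UTF-8 bytes without any conversion. The following is a minimal sketch, not part of the original answer: count_code_points is a hypothetical helper name, and it assumes the input is already valid UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original answer).
// Counts code points in a UTF-8 string by counting every byte that is
// NOT a continuation byte (continuation bytes look like 10xxxxxx).
// Assumption: the input is valid UTF-8; malformed input is not detected.
std::size_t count_code_points(std::string const& utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For the two-character string "很烫", which is the six bytes E5 BE 88 E7 83 AB in UTF-8, size() reports 6 code units while count_code_points reports 2 code points. This only gives a count; random access by code-point index still needs a conversion or a linear scan.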

And if you want to convert to and from UTF-32 regardless of platform, you can use the following (not well tested) conversion routines:

std::string utf32_to_utf8(std::u32string const& utf32)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
    std::string utf8 = cnv.to_bytes(utf32);
    if(cnv.converted() < utf32.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

std::u32string utf8_to_utf32(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
    std::u32string utf32 = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return utf32;
}
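
Since std::wstring_convert and std::codecvt_utf8 are deprecated as of C++17, a hand-written decoder is one possible alternative to the routines above. This is a sketch under stated assumptions rather than a replacement endorsed by the answer: decode_utf8 is a hypothetical name, and the validation choices (rejecting overlong forms, surrogates, and values above U+10FFFF) are mine.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the original answer).
// Decodes UTF-8 into UTF-32 by hand, throwing on malformed input.
std::u32string decode_utf8(std::string const& utf8)
{
    // Smallest code point that legitimately needs a sequence of each length;
    // anything smaller is an overlong encoding.
    static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        std::size_t len;

        // The lead byte determines the sequence length and the first payload bits
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > utf8.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes six more bits
        for (std::size_t k = 1; k < len; ++k)
        {
            unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min_for_len[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

The result can be indexed by code point: for the bytes E5 BE 88 E7 83 AB, element 0 is U+5F88 and element 1 is U+70EB. Because std::u32string always uses 32-bit units, the length does not depend on the platform's wchar_t.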

Source

2017-04-09 02:37:59 Galik

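Cool, but in another discussion I saw that wchar_t may be uint16_t rather than uint32_t. That could cause errors when indexing characters in a Unicode string. – linrongbin

@zhaochenyou This should convert correctly for each platform. On Windows it produces 2-byte wchar_t characters encoded as UCS-2, and on Linux it produces 4-byte wchar_t characters encoded as UTF-32. – Galik

That works fine until someone goes and gives you a string with a '' character in it. Then you get different lengths on different platforms. –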
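
The platform difference raised in the comments can be made concrete. The sketch below is an illustration, not code from the thread: utf16_units is a hypothetical helper, and it assumes the problematic character is a code point above U+FFFF (the character in the comment above did not survive, so this is only a guess at the kind of input meant).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original thread).
// Returns how many 16-bit code units a string needs when stored as UTF-16.
// Code points above U+FFFF need a surrogate pair, that is, two units.
std::size_t utf16_units(std::u32string const& s)
{
    std::size_t n = 0;
    for (char32_t cp : s)
        n += (cp > 0xFFFF) ? 2 : 1;
    return n;
}
```

For the two code points U+5F88 and U+1F600, a UTF-32 string has length 2, but a 16-bit representation needs 3 units. UCS-2 cannot represent U+1F600 at all, which is why a 16-bit wchar_t and a 32-bit wchar_t can disagree about the length of the same text.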
