plaintextextractor: autodetect encoding for text (d42cbe9e) · Commits · Frameworks / KFileMetaData

Commit d42cbe9e authored Mar 18, 2024 by Sergey Katunin, committed by Christoph Cullmann Mar 18, 2024

plaintextextractor: autodetect encoding for text

#### Autodetect encoding feature

1. Add an encoding autodetection feature to `plaintextextractor`, inspired by a similar algorithm in `KTextEditor`.

2. Add test files: plain text files in the `win1251`, `gb18030`, and `euc-jp` encodings, and HTML files in the `UTF-16LE` and `win1251` encodings.

3. Remove the newline character `\n` manually AFTER decoding, not before. This is necessary for multi-byte encodings (those with little-endian and big-endian byte orders), which encode a newline character in two or more bytes, for example `000A` (2 bytes), where `0A` == `\n`; deleting the `0A` byte without processing the accompanying `00` breaks decoding. Take `UTF-16LE`: when reading with `readline`, the `\n` char ends up at the beginning of the next line rather than at the end of the previous one, because `QFile.readline` reads until it reaches `\n`, so the sequence `0A00` breaks on `0A` at the end of the first line, and on `00` in the...
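
The byte-level problem described in item 3 can be reproduced with a minimal sketch. This is not the commit's code: it uses Python's `bytes` and `str` as stand-ins for the Qt types, and `read_lines`, `strip_then_decode`, and `decode_then_strip` are hypothetical helpers written only for this illustration. `read_lines` assumes a byte-oriented reader that stops right after each `0A` byte.

```python
# Illustration only: Python bytes/str stand in for the Qt types used by the extractor.
# All helper names below are hypothetical and do not come from the commit.

def read_lines(data):
    """Emulate a byte-oriented readline: each line ends right after a 0x0A byte."""
    lines, start = [], 0
    while start < len(data):
        end = data.find(b"\n", start)
        end = len(data) if end == -1 else end + 1
        lines.append(data[start:end])
        start = end
    return lines

def strip_then_decode(data, encoding):
    """Broken order: drop the trailing 0x0A byte from each raw line, then decode."""
    out = []
    for line in read_lines(data):
        if line.endswith(b"\n"):
            line = line[:-1]
        # errors="replace" keeps the demo running; strict decoding would raise
        # on the odd-length leftovers.
        out.append(line.decode(encoding, errors="replace"))
    return out

def decode_then_strip(data, encoding):
    """Safe order: decode the whole buffer, then remove '\\n' from the decoded text."""
    text = data.decode(encoding)
    return [line.rstrip("\n") for line in text.splitlines(keepends=True)]

raw = "first\nsecond\n".encode("utf-16-le")  # '\n' is stored as the bytes 0A 00

broken = strip_then_decode(raw, "utf-16-le")
fixed = decode_then_strip(raw, "utf-16-le")

print(broken)  # the second entry is garbage: it starts with the orphaned 00 byte
print(fixed)   # ['first', 'second']
```

In the broken order, the second raw line begins with the `00` left over from the first newline, so every following byte pair is shifted by one and decodes to the wrong characters. Decoding first keeps the `0A00` pair together, so removing `\n` afterwards is safe.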

parent 198b3870

Pipeline #636689 passed with stage in 2 minutes and 20 seconds
