Commit d42cbe9e authored by Sergey Katunin, committed by Christoph Cullmann

plaintextextractor: autodetect encoding for text

#### Autodetect encoding feature

1. Add autodetect encoding feature to `plaintextextractor`. Inspired by a similar algorithm in `KTextEditor`.

2. Add test files: plain-text files in the `win1251`, `gb18030`, and `euc-jp` encodings, plus HTML test files in the `UTF-16LE` and `win1251` encodings.

3. Manually remove the newline character `\n` AFTER decoding, not before. This is necessary for multi-byte encodings (including those with little-endian and big-endian byte orders), which encode a newline in two or more bytes, for example `000A` (2 bytes), where `0A` == `\n`; deleting the `0A` byte without handling the accompanying `00` breaks decoding. With `UTF-16LE`, for example, the `\n` character ends up at the beginning of the next line returned by `readLine` rather than at the end of the previous line, because `QFile::readLine` reads until it reaches the byte `0A`, so the sequence `0A00` is split at `0A` at the end of the first line, and at `00` in the...
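
The effect described in item 3 can be reproduced outside Qt. The following is a minimal Python sketch, not the extractor's actual code: the helper names (`split_after_lf`, `strip_then_decode`, `decode_then_strip`) are hypothetical, and it assumes that lines are read by a byte-oriented reader that stops after each `0A` byte and that decoding uses a stateful (incremental) decoder that carries a dangling half code unit over to the next chunk.

```python
import codecs


def split_after_lf(data):
    """Split raw bytes after every 0x0A byte, as a byte-oriented line reader would."""
    chunks = []
    start = 0
    while True:
        idx = data.find(b"\n", start)
        if idx == -1:
            break
        chunks.append(data[start:idx + 1])
        start = idx + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def strip_then_decode(chunks, encoding):
    """Problematic order: drop the 0x0A byte first, then decode each chunk alone."""
    result = []
    for chunk in chunks:
        if chunk.endswith(b"\n"):
            chunk = chunk[:-1]  # leaves the companion 0x00 byte orphaned in UTF-16LE
        result.append(chunk.decode(encoding, errors="replace"))
    return result


def decode_then_strip(chunks, encoding):
    """Order described in item 3: decode first, then remove the decoded newline."""
    # Assumption: a stateful decoder keeps incomplete code units between chunks.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    result = []
    for chunk in chunks:
        text = decoder.decode(chunk)
        # In UTF-16LE the decoded "\n" shows up at the START of the next chunk.
        result.append(text.strip("\n"))
    return result


if __name__ == "__main__":
    raw = "ab\ncd\n".encode("utf-16-le")
    chunks = split_after_lf(raw)
    print(chunks)
    print(strip_then_decode(chunks, "utf-16-le"))
    print(decode_then_strip(chunks, "utf-16-le"))
```

In this sketch only the first line survives the strip-before-decode order; every later chunk starts with a stray `00` byte, so its code units are misaligned. Decoding first and stripping afterwards yields `ab` and `cd`, plus an empty trailing piece for the final half code unit.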
parent 198b3870
Pipeline #636689 passed with stage in 2 minutes and 20 seconds