Commit 492321e5, authored by Igor Poboiko, committed by Christoph Cullmann

[TermGenerator] Skip all unprintable characters

Some extractors can produce text that includes Unicode control
characters (e.g. Poppler can give us 0x0001 from some PDFs).
TermGenerator then generates valid (yet meaningless) terms out of those
characters, and they end up in the database. It should be safe to skip
all unprintable characters to avoid that (surrogates are fine, though,
as they are dealt with later via the QString::normalized call).
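
As an illustration of the idea only (this is not the patch itself), the
filtering step could be sketched as below. The sketch avoids Qt so that it
is self-contained: `isUnprintable` is a hypothetical, simplified stand-in
that only recognises the C0 and C1 control ranges, whereas Qt code would
more likely rely on a check such as QChar::isPrint(). `makeTerms` is
likewise a made-up helper, not TermGenerator's real interface.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified printability test: only the C0 and C1 control
// ranges count as unprintable here. A real implementation would use a
// full Unicode-aware check.
bool isUnprintable(char32_t c)
{
    return c < 0x20 || (c >= 0x7F && c <= 0x9F);
}

// Whitespace that separates terms in this sketch.
bool isSeparator(char32_t c)
{
    return c == U' ' || c == U'\t' || c == U'\n' || c == U'\r';
}

// Hypothetical helper: split text into terms on whitespace and drop
// unprintable characters, so that no term is built out of them.
std::vector<std::u32string> makeTerms(const std::u32string& text)
{
    std::vector<std::u32string> terms;
    std::u32string current;
    for (char32_t c : text) {
        if (isSeparator(c)) {
            if (!current.empty()) {
                terms.push_back(current);
                current.clear();
            }
        } else if (!isUnprintable(c)) {
            current.push_back(c);
        }
        // Unprintable characters fall through and are skipped.
    }
    if (!current.empty()) {
        terms.push_back(current);
    }
    return terms;
}
```

With this sketch, a word consisting only of 0x0001 produces no term at
all, and a 0x0001 embedded in a word is removed from it.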

Character 0x0001 is the worst, as it is used internally in DocTermsCodec
for compactification. Such a collision then leads to a corrupted database
(some terms from DocTermsDB are not present in PostingDB).
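
To see why such a collision is harmful, consider a toy codec that uses
byte 0x01 as an in-band marker. This is an assumption-laden sketch and
not the real DocTermsCodec format: here the marker simply terminates each
term, and `encodeTerms` / `decodeTerms` are hypothetical names.

```cpp
#include <string>
#include <vector>

// Byte reserved by this toy codec as an in-band marker (assumption: the
// real codec uses 0x01 differently, but equally reserves it).
constexpr char kMarker = '\x01';

// Concatenate the terms, terminating each one with the marker byte.
std::string encodeTerms(const std::vector<std::string>& terms)
{
    std::string out;
    for (const std::string& term : terms) {
        out += term;
        out += kMarker;
    }
    return out;
}

// Split the encoded data back into terms at every marker byte.
std::vector<std::string> decodeTerms(const std::string& data)
{
    std::vector<std::string> terms;
    std::string current;
    for (char c : data) {
        if (c == kMarker) {
            terms.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    return terms;
}
```

A term that itself contains 0x01 does not survive the round trip: it
decodes as two different terms, so the decoded list no longer matches
what was indexed elsewhere under the original term.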

The corruption is not hypothetical (although not critical): I've
encountered a bunch of broken DB entries for some PDF files on my machine.
parent c67b33ba
Pipeline #359066 passed with stage
in 1 minute and 54 seconds