[PopplerExtractorTest] Verify multicolumn PDF content (currently broken) (f9f483e6) · Commits · Frameworks / KFileMetaData

Commit f9f483e6 authored Mar 17, 2024 by Stefan Brüns

[PopplerExtractorTest] Verify multicolumn PDF content (currently broken)

The PDF content extraction currently uses the physical text layout
(compare poppler's `pdftotext -layout ...`), i.e. lines from adjacent
columns are interleaved in the extracted text.
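
To make the effect concrete, here is a minimal, self-contained sketch of why a row-by-row text layout scrambles a multicolumn page. It does not use poppler; `Column`, `physicalLayout` and `readingOrder` are hypothetical names for illustration only, and the four-space column gap is an arbitrary assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model: a page is a list of columns, each a list of text lines.
using Column = std::vector<std::string>;

// Row-by-row output: every output line holds the fragments of all columns
// at the same vertical position, so the columns end up interleaved.
std::string physicalLayout(const std::vector<Column> &columns)
{
    std::size_t rows = 0;
    for (const Column &column : columns) {
        rows = std::max(rows, column.size());
    }

    std::string out;
    for (std::size_t row = 0; row < rows; ++row) {
        bool first = true;
        for (const Column &column : columns) {
            if (row >= column.size()) {
                continue; // this column is shorter than the others
            }
            if (!first) {
                out += "    "; // assumed gap between columns
            }
            out += column[row];
            first = false;
        }
        out += '\n';
    }
    return out;
}

// Intended text flow: each column is emitted completely before the next one.
std::string readingOrder(const std::vector<Column> &columns)
{
    std::string out;
    for (const Column &column : columns) {
        for (const std::string &line : column) {
            out += line;
            out += '\n';
        }
    }
    return out;
}
```

For a page with the columns `{"A1", "A2"}` and `{"B1", "B2"}`, `physicalLayout` yields the lines `A1    B1` and `A2    B2`, whereas `readingOrder` yields `A1`, `A2`, `B1`, `B2`. The new test expects the second form.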

Add a PDF file which uses multiple columns and contains the required
structures to recreate the correct text flow.

Unfortunately, there is no simple way to fix this, as the
`RawOrderLayout` of `Poppler::Page::text(...)` creates even worse
output than the default `PhysicalLayout` (missing spaces between words,
or no output at all).

Also add the corresponding ODT source document.

parent 76f229da

Pipeline #645376 passed with stage in 5 minutes and 24 seconds
