Skip to content
Commit 59270583 authored by Igor Kushnir's avatar Igor Kushnir
Browse files

Optimize LanguageController's MimeTypeCache

Previously the filename of each url argument to
LanguageController::languagesForUrl() was matched against all cached
glob pattern suffixes (sometimes not "all" as suffixes of an already
matched/added language were skipped). Now the filename is only matched
against the suffixes that end with the same case-insensitive character.
This change should improve the performance of this frequently called
function, because case-insensitive string comparison looks like the most
expensive among the actually performed in practice operations in cases
when the language is found in mimeTypeCache.

Suffix matching performance could be further improved by storing reverse
suffixes in a trie (aka prefix tree). However I could not find a
suitable trie implementation in KDevelop or one of its dependencies.
QmlJS::PersistentTrie seems to be the closest match, but it has many
unneeded features and stores only strings with no ability to attach
user data to them (such as ILanguageSupport*). Implementing a trie data
structure specifically for this optimization would be an overkill.

For some reason, using std::lower_bound in place of std::equal_range in
MimeTypeCache::languagesForFileName() makes the most important benchmark
- benchLanguagesForUrlFilledCache() - somewhat slower on most data rows.

I have implemented and benchmarked 3 alternative optimizations of suffix
matching in MimeTypeCache::languagesForFileName():
  1. Binary searching for the last fileName's extension in the
     std::vector<StringLanguagePair> sorted by the entire last extension
     string case-insensitively.
  2. Binary searching for lower-cased last fileName's extension in the
     lower-cased std::vector<StringLanguagePair> sorted by the entire
     last extension string case-sensitively.
  3. Looking up lower-cased last fileName's extension in lower-cased
     QMultiHash<QString, ILanguageSupport*>.

The second alternative optimization turned out to be faster than the
first one; the third - faster than the second one.

All 3 alternative optimizations turned out to be:
a) Less general than this version: multiple extensions, such as tar.gz,
   were no longer optimized. This is only a theoretical downside because
   none of the maintained KDevelop plugins supports wildcard patterns
   with multiple extensions.
b) Slower than this version. Probably because of expensive temporary
   QString construction - the (lower-cased) last fileName's extension.

There is currently no need to optimize literal pattern matching, because
only one literal pattern is supported: "CMakeLists.txt". If more literal
patterns become supported in the future, the elements of
MimeTypeCache::m_literalPatterns can be easily lower-cased and sorted to
speed up matching against them.

Average BenchLanguageController results before and at this commit in
milliseconds per iteration:
        Data row                    Before      At
1. benchLanguagesForUrlNoCache()
    CMakeLists                      0.00046     0.00045
    cmakelists wrong case           0.00046     0.00045
    lower-case                      0.00058     0.00045
    upper-case                      0.00058     0.00045
    mixed-case                      0.00058     0.00046
    .C                              0.00050     0.00039
    .cl                             0.00070     0.00038
    existent C with extension       0.00053     0.00044
    .cc                             0.00058     0.00041
    .cmake                          0.00039     0.00041
    .diff                           0.00037     0.00038
    .qml                            0.00036     0.00037
    existent C w/o extension        0.16        0.16
    existent patch w/o extension    0.20        0.20
2. benchLanguagesForUrlFilledCache()
    CMakeLists                      0.0011      0.00050
    cmakelists wrong case           0.0011      0.00050
    lower-case                      0.00083     0.00048
    upper-case                      0.00083     0.00048
    mixed-case                      0.00085     0.00048
    .C                              0.00072     0.00040
    .cl                             0.00091     0.00042
    existent C with extension       0.00064     0.00046
    .cc                             0.00080     0.00043
    .cmake                          0.00090     0.00042
    .diff                           0.00083     0.00041
    .qml                            0.00093     0.00045
    existent C w/o extension        0.16        0.16
    existent patch w/o extension    0.20        0.20
3. benchLanguagesForUrlNoMatchNoCache()
    empty                           0.0016      0.000013
4. benchLanguagesForUrlNoMatchFilledCache()
    empty                           0.0018      0.000013

Note: benchLanguagesForUrlNoMatch*() benchmark results are practically
unaffected by this commit because the QMimeDatabase::mimeTypeForFile()
call is orders of magnitude slower than the
MimeTypeCache::languagesForFileName() call optimized here. The only
exception is the "empty" data row, which became more than 100 times
faster thanks to the added early return.

Most benchmarks run faster now. benchLanguagesForUrlFilledCache(), which
is closest to the real-life LanguageController usage in KDevelop, runs
about two times faster thanks to this commit.
parent 025e4fa9
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment