company-llama: llama.cpp backend for company-mode

I need to get back into the habit of announcing my projects, otherwise they just go into the void.

Anyway, company-llama is some glue between company-mode and llama.cpp.

I created it mainly because I wanted to replace the proprietary TabNine, which modern models running on a GPU outperform anyway.

The implementation would be trivial if not for details such as:

  • We want to use the streaming API, so that we start receiving data as soon as possible.
  • We want to interrupt the streaming connection as soon as we realize that we don’t want any more data, for example if the user canceled the completion request, or if we receive multiple equally-likely candidates (and therefore should probably stop and show a menu).
  • url.el supports neither of the above natively, so we have to mess with its internals to implement it ourselves.
  • If a response to an obsolete asynchronous completion request arrives after a newer request has become current, company-mode can get confused.
  • If the llama.cpp server reports that it is busy (which can happen because we didn’t close the previous connection quickly enough), we want to retry.
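To give a flavor of the plumbing involved, here is a rough sketch of streaming a llama.cpp /completion response through url.el with the ability to abort mid-stream. This is not company-llama's actual code: the function and variable names are hypothetical, and it assumes the server's "data: {...}" streaming line format. Attaching our own filter to url.el's connection process is exactly the kind of internals-poking mentioned above.

```elisp
;; Sketch: stream from a llama.cpp server via url.el, abortable mid-stream.
;; Hypothetical names; assumes llama.cpp's "data: {...}" streaming format.
(require 'url)
(require 'json)

(defvar my-llama-url "http://127.0.0.1:8080/completion")

(defun my-llama-stream (prompt on-chunk)
  "Send PROMPT to the server; call ON-CHUNK with each streamed fragment.
Return the connection buffer; kill its process to abort the stream."
  (let* ((url-request-method "POST")
         (url-request-extra-headers '(("Content-Type" . "application/json")))
         (url-request-data
          (json-encode `((prompt . ,prompt) (stream . t) (n_predict . 32))))
         (buffer (url-retrieve my-llama-url #'ignore nil t)))
    ;; url.el only invokes its callback once the whole response is in,
    ;; so we piggyback on the connection's process filter to see data
    ;; as it arrives -- the "messing with its internals" part.
    (let ((proc (get-buffer-process buffer)))
      (add-function
       :after (process-filter proc)
       (lambda (_proc string)
         (dolist (line (split-string string "\n" t))
           (when (string-prefix-p "data: " line)
             (let ((obj (json-parse-string (substring line 6)
                                           :object-type 'alist)))
               (funcall on-chunk (alist-get 'content obj))))))))
    buffer))

;; Aborting early (user cancelled, or we already have enough candidates):
;; (delete-process (get-buffer-process buffer))
```

A real implementation also has to buffer partial lines across filter calls and handle the retry-on-busy case; this sketch glosses over both.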