đź“– source and documentation

Pythia is a simple and modular concordance search engine, first designed in the context of Chiron to handle an overwhelming amount of metadata, even coming from different sources, and to deal with text structures larger than single tokens, even when they overlap. It was designed to be easy to integrate into other systems and fully customizable, with a new approach to text, which is “dematerialized” into a set of objects. It is currently adopted by another research project, Atti Chiari, as part of a wider workflow which also includes a couple of other software tools I created: one for a new type of pseudonymization of documents with sensitive data, and a very simple remote BLOB archiving system to collect them.

For a general introduction see:

Real-World Applications

While Pythia was designed for more complex texts, the first real-world application of its prototype is Minerva, an upcoming digital service from the AttiChiari project. You can follow a short, totally non-technical presentation about it in this video (in Italian):

Main Features

  • full-stack architecture: from the database to the business layer, web API, and web frontend, all deployed in a Docker Compose stack.

  • concordance-based: designed from the ground up with concordances in mind: word locations are not an afterthought or an additional payload attached to an existing location-less engine. The whole architecture is based on positions in documents, and these positions may also refer to text structures other than words. At this higher level of abstraction, a text is somewhat “dematerialized” into a set of token-based positions linked to an open set of metadata. Rather than a long sequence of characters, a text is viewed as an abstract set of entities, each carrying any metadata and, in most cases, its position in the original text. These entities may represent documents, groups of documents (corpora), words, and any other textual structure (e.g. a verse, a strophe, a sentence, a phrase, etc.), with no limits, even when multiple structures overlap. Searching for a verse, a sentence, or any other textual structure is just like searching for a word, and these different entity types can be freely mixed and combined in a query (see the first sketch after this list).

  • minimal dependencies: a simple implementation based on widely used, standard technologies: the engine relies on an RDBMS and is wrapped in a REST API. The only dependency is the database service. The index is just a standard RDBMS, so you can easily integrate it into your own project. You might even bypass the search engine and directly query or otherwise manipulate the index via SQL (see the second sketch after this list).

  • flexible, modular and open: designed to be totally configurable via external parameters: you decide every relevant aspect of the indexing pipeline (filtering, tokenization, etc.), and you can use any kind of input format (e.g. plain text, TEI, etc.) and source (e.g. file system, BLOB storage, web resources, etc.).

  • UDPipe plugins to incorporate any subset of POS tagging data into the index.
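To make the position-based model above more concrete, here is a minimal sketch of how “dematerialized” textual entities might be represented. All class, field, and value names here are hypothetical illustrations, not Pythia's actual API or database schema:

```python
# A minimal sketch of the "dematerialized" text model described above.
# All names are hypothetical illustrations, not Pythia's actual API or schema.
from dataclasses import dataclass, field

@dataclass
class TextSpan:
    """Any positional entity in a document: a word, sentence, verse, etc."""
    document_id: int
    type: str                 # e.g. "token", "sentence", "verse"
    p1: int                   # start position (token-based)
    p2: int                   # end position (token-based, inclusive)
    value: str = ""           # e.g. a word's filtered form; empty for larger structures
    attributes: dict = field(default_factory=dict)  # open metadata set

# A word and an overlapping verse are represented by the same kind of entity,
# so querying either one is just a matter of matching positions and attributes.
word = TextSpan(document_id=1, type="token", p1=12, p2=12, value="arma",
                attributes={"pos": "NOUN"})
verse = TextSpan(document_id=1, type="verse", p1=10, p2=17,
                 attributes={"n": "1"})

def overlaps(a: TextSpan, b: TextSpan) -> bool:
    """True if two spans of the same document share at least one position."""
    return a.document_id == b.document_id and a.p1 <= b.p2 and b.p1 <= a.p2

print(overlaps(word, verse))  # True: the word falls inside the verse
```

Because every entity type boils down to a document, a position range, and metadata, overlapping structures coexist naturally and can be combined in the same query.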
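As for bypassing the engine and working on the index directly, the sketch below shows the general idea of connecting to the underlying RDBMS and running plain SQL. The connection string, table, and column names are hypothetical placeholders: check the actual Pythia database schema before adapting anything like this:

```python
# A minimal sketch of querying the index database directly via SQL,
# assuming a PostgreSQL service and hypothetical table/column names.
import psycopg2

conn = psycopg2.connect("dbname=pythia user=postgres password=postgres host=localhost")
with conn, conn.cursor() as cur:
    # e.g. count how many indexed occurrences each document contains
    cur.execute(
        """
        SELECT d.id, d.title, COUNT(o.id) AS occurrences
        FROM document d
        JOIN occurrence o ON o.document_id = d.id
        GROUP BY d.id, d.title
        ORDER BY occurrences DESC
        LIMIT 10;
        """
    )
    for doc_id, title, occurrences in cur.fetchall():
        print(doc_id, title, occurrences)
conn.close()
```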

Architecture

Analysis

Tooling