All the samples refer to this TEI document:
<?xml version="1.0" encoding="utf-8"?>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<author>Catullus</author>
<title type="poetry">carmina</title>
<date when="-54">I a.C.</date>
</titleStmt>
<publicationStmt>
<p>test</p>
</publicationStmt>
<sourceDesc>
<p>web</p>
</sourceDesc>
</fileDesc>
</teiHeader>
<text>
<body>
<div type="poem" n="84">
<head>ad Arrium</head>
<lg type="eleg" n="1">
<l n="1" type="h"><quote>chommoda</quote> dicebat, si quando commoda vellet</l>
<l n="2" type="p">dicere, et insidias <persName>Arrius</persName> <quote>hinsidias</quote>,</l>
</lg>
<lg type="eleg" n="2">
<l n="3" type="h">et tum mirifice sperabat se esse locutum,</l>
<l n="4" type="p">cum quantum poterat dixerat <quote>hinsidias</quote>.</l>
</lg>
<lg type="eleg" n="3">
<l n="5" type="h">credo, sic mater, sic liber avunculus eius</l>
<l n="6" type="p">sic maternus avus dixerat atque avia.</l>
</lg>
<lg type="eleg" n="4">
<l n="7" type="h">hoc misso in <geogName>Syriam</geogName> requierant omnibus aures</l>
<l n="8" type="p">audibant eadem haec leniter et leviter,</l>
</lg>
<lg type="eleg" n="5">
<l n="9" type="h">nec sibi postilla metuebant talia verba,</l>
<l n="10" type="p">cum subito affertur nuntius horribilis,</l>
</lg>
<lg type="eleg" n="6">
<l n="11" type="h"><geogName>Ionios</geogName> fluctus, postquam illuc <persName>Arrius</persName> isset,</l>
<l n="12" type="p">iam non <geogName>Ionios</geogName> esse sed <quote><geogName>Hionios</geogName></quote>.</l>
</lg>
</div>
</body>
</text>
</TEI>
The corresponding profile is:
{
"SourceCollector": {
"Id": "source-collector.file",
"Options": {
"IsRecursive": false
}
},
"TextFilters": [
{
"Id": "text-filter.tei"
},
{
"Id": "text-filter.quotation-mark"
}
],
"AttributeParser": {
"Id": "attribute-parser.xml",
"Options": {
"Mappings": [
"author=/TEI/teiHeader/fileDesc/titleStmt/author",
"title=/TEI/teiHeader/fileDesc/titleStmt/title",
"category=/TEI/teiHeader/fileDesc/titleStmt/title/@type",
"date=/TEI/teiHeader/fileDesc/titleStmt/date",
"date-value=/TEI/teiHeader/fileDesc/titleStmt/date/@when [N]"
]
}
},
"DocSortKeyBuilder": {
"Id": "doc-sortkey-builder.standard"
},
"DocDateValueCalculator": {
"Id": "doc-datevalue-calculator.standard",
"Options": {
"Attribute": "date-value"
}
},
"Tokenizer": {
"Id": "tokenizer.standard",
"Options": {
"TokenFilters": [
{
"Id": "token-filter.ita"
},
{
"Id": "token-filter.len-supplier",
"Options": {
"LetterOnly": true
}
},
{
"Id": "token-filter.sylc-supplier-lat"
}
]
}
},
"StructureParsers": [
{
"Id": "structure-parser.xml",
"Options": {
"RootPath": "/TEI/text/body",
"Definitions": [
"div=/div @n head$",
"p=//p",
"lg=//lg @n$",
"l=//l @n$",
"quote:q:1=//quote",
"persName:pn=//persName",
"geogName:gn=//geogName"
]
},
"Filters": [
{
"Id": "struct-filter.standard"
}
]
},
{
"Id": "structure-parser.xml-sentence",
"Options": {
"RootPath": "TEI//body",
"StopTags": ["head"]
}
}
],
"TextRetriever": {
"Id": "text-retriever.ef"
},
"TextMapper": {
"Id": "text-mapper.xml",
"Options": {
"RootPath": "/TEI/text/body",
"MappedPaths": ["body/div /@type /@n /head"]
}
},
"TextPicker": {
"Id": "text-picker.xml",
"Options": {
"HitOpen": "<hi rend=\"hit\">",
"HitClose": "</hi>"
}
},
"TextRenderer": {
"Id": "text-renderer.liz-html"
}
}
You can use this query to look up token occurrences in this document:
select o."position", t.value
from occurrence o
inner join token t
on o.token_id = t.id
where o.document_id=1
order by position
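To see the shape of the rows this query returns, here is a minimal sketch that runs it against a toy in-memory SQLite database. Only the table and column names that appear in the query (occurrence, token, position, token_id, document_id, id, value) come from the text above; everything else (the use of SQLite, the column types, positions starting at 1) is an assumption made for illustration and says nothing about how the real index is stored.

```python
import sqlite3

# Toy stand-in for the index: only the tables and columns named in the query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table token (id integer primary key, value text);
create table occurrence (
    id integer primary key,
    token_id integer references token(id),
    document_id integer,
    position integer
);
""")

# First words of the sample document (title plus the start of verse 1).
# Numbering positions from 1 is an assumption of this sketch.
words = ["ad", "arrium", "chommoda", "dicebat", "si"]
for position, word in enumerate(words, start=1):
    cur = conn.execute("insert into token (value) values (?)", (word,))
    conn.execute(
        "insert into occurrence (token_id, document_id, position) values (?, 1, ?)",
        (cur.lastrowid, position),
    )

QUERY = """
select o."position", t.value
from occurrence o
inner join token t on o.token_id = t.id
where o.document_id = 1
order by o.position
"""
rows = conn.execute(QUERY).fetchall()
for position, value in rows:
    print(position, value)
```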
Single Token
- [value="chommoda"]: find the word chommoda. 1 result: chommoda.
- [value<>"sic"]: find any word different from sic. 71 results: ad, arrium, chommoda, dicebat, si, … etc.
- [value*="ommo"]: find any word containing ommo. 2 results: chommoda, commoda.
- [value^="ch"]: find any word starting with ch. 1 result: chommoda.
- [value$="ter"]: find any word ending with ter. 3 results: mater, leniter, leviter.
- [value?="ch*da"]: find any word beginning with ch and ending with da (wildcards). 1 result: chommoda.
- [value?="?um"]: find any word where a single character is followed by um. 3 results: tum, cum, cum.
- [value~="ch?ommoda"]: find the words chommoda or commoda (regular expression). 2 results: chommoda, commoda.
- [value%="chommoda:0.5"]: find words similar to chommoda using a similarity threshold of 0.5 (on a normalized scale from 0 to 1). 2 results: chommoda, commoda.
- [pn="arrius"]: non-privileged attribute (personal name): find any anthroponym equal to arrius. Here we are matching against the attribute pn, which represents the personal name and is derived from the TEI persName element in the input document. 2 results: arrius, arrius.
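The string operators above can be checked by hand against the sample text. The following sketch emulates them in plain Python over the lowercased words of the poem. It is not the engine's implementation: the mapping of each operator to a Python function (fnmatchcase for wildcards, re.fullmatch for regular expressions) is an assumption, and the similarity operator is left out because the matching algorithm is not described here.

```python
import re
from fnmatch import fnmatchcase

# Title and verses of the sample document, lowercased and without punctuation.
TEXT = """
ad arrium
chommoda dicebat si quando commoda vellet
dicere et insidias arrius hinsidias
et tum mirifice sperabat se esse locutum
cum quantum poterat dixerat hinsidias
credo sic mater sic liber avunculus eius
sic maternus avus dixerat atque avia
hoc misso in syriam requierant omnibus aures
audibant eadem haec leniter et leviter
nec sibi postilla metuebant talia verba
cum subito affertur nuntius horribilis
ionios fluctus postquam illuc arrius isset
iam non ionios esse sed hionios
"""
TOKENS = TEXT.split()


def find(predicate):
    """Return every token satisfying the predicate, in text order."""
    return [t for t in TOKENS if predicate(t)]


equals = find(lambda t: t == "chommoda")            # [value="chommoda"]
differs = find(lambda t: t != "sic")                # [value<>"sic"]
contains = find(lambda t: "ommo" in t)              # [value*="ommo"]
starts = find(lambda t: t.startswith("ch"))         # [value^="ch"]
ends = find(lambda t: t.endswith("ter"))            # [value$="ter"]
wild_star = find(lambda t: fnmatchcase(t, "ch*da"))  # [value?="ch*da"]
wild_one = find(lambda t: fnmatchcase(t, "?um"))     # [value?="?um"]
# Assumption: the regular expression must match the whole word.
regex = find(lambda t: re.fullmatch(r"ch?ommoda", t) is not None)

print(len(differs), contains, ends, wild_one, regex)
```

The counts produced by this sketch agree with the result counts listed above, which is a quick way to sanity-check a query before running it against a real index.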
Numeric operators:
- [len<"3"]
- [len<="2"]
- [len=="10"]
- [len!="2"]
- [len>"9"]
- [len>="5"]

Note that in this example the len attribute refers to the word’s value as filtered, excluding noise characters like punctuation, diacritics, etc. For instance, [len>"9"] returns 2 results: requierant, horribilis.

- [sylc="4"]: find any word counting 4 syllables. This attribute relies on the parser’s integration with the Chiron system, so that the data are not derived from the document, but from software analysis performed in real time during indexing. 8 results: insidias, hinsidias, mirifice, hinsidias, avunculus, metuebant, horribilis, hionios.
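As a rough illustration of the len attribute, the sketch below counts only the letters of each raw word from three verses of the sample and then applies a numeric comparison. The helper names are hypothetical, and counting letters with str.isalpha is an assumption about what "letters only" means, not the behavior of the actual token filter.

```python
# Raw words (with punctuation) from verses 4, 7 and 10 of the sample document.
RAW = (
    "cum quantum poterat dixerat hinsidias. "
    "hoc misso in Syriam requierant omnibus aures "
    "cum subito affertur nuntius horribilis,"
).split()


def letters_only(word: str) -> str:
    """Drop every non-letter character and lowercase the rest."""
    return "".join(ch for ch in word if ch.isalpha()).lower()


def letter_count(word: str) -> int:
    """Hypothetical equivalent of the len attribute: letters only."""
    return len(letters_only(word))


# [len>"9"]: the comparison is numeric, so 10 > 9 holds.
longer_than_9 = [letters_only(w) for w in RAW if letter_count(w) > 9]
# [len<="2"]
at_most_2 = [letters_only(w) for w in RAW if letter_count(w) <= 2]

print(longer_than_9)  # ['requierant', 'horribilis']
print(at_most_2)      # ['in']

# A plain string comparison would order the same values differently:
print("10" > "9")     # False
```

Note that "horribilis," has 11 characters but 10 letters, and "hinsidias." has 10 characters but 9 letters, so only the filtered value gives the expected result for [len>"9"].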
Single Structure
- [$name="lg"] or [$lg]: the shorter form is available only for non-privileged attributes. Find the strophe structure, here named lg. 72 results: chommoda, dicebat, si, quando, commoda, vellet, dicere, … etc. Note that the title’s words (“ad Arrium”) do not appear among the results, as they are not included inside a strophe.
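One way to picture why the title is excluded is to think of each structure as a range of token positions. The sketch below is a simplified model under that assumption (the tuple layout and the 1-based positions are hypothetical, not the index schema): it records one lg range per stanza and returns the tokens that fall inside any of them.

```python
TITLE = "ad arrium"
# One string per lg element (two verses each), lowercased, no punctuation.
STANZAS = [
    "chommoda dicebat si quando commoda vellet dicere et insidias arrius hinsidias",
    "et tum mirifice sperabat se esse locutum cum quantum poterat dixerat hinsidias",
    "credo sic mater sic liber avunculus eius sic maternus avus dixerat atque avia",
    "hoc misso in syriam requierant omnibus aures audibant eadem haec leniter et leviter",
    "nec sibi postilla metuebant talia verba cum subito affertur nuntius horribilis",
    "ionios fluctus postquam illuc arrius isset iam non ionios esse sed hionios",
]

tokens = TITLE.split()   # the title sits outside every lg
structures = []          # (name, n, first position, last position)
for n, stanza in enumerate(STANZAS, start=1):
    first = len(tokens) + 1
    tokens.extend(stanza.split())
    structures.append(("lg", n, first, len(tokens)))


def tokens_inside(name: str) -> list[str]:
    """Tokens whose position falls within any structure with this name."""
    return [
        word
        for pos, word in enumerate(tokens, start=1)
        if any(s[0] == name and s[2] <= pos <= s[3] for s in structures)
    ]


in_lg = tokens_inside("lg")
outside = [w for w in tokens if w not in in_lg]
print(len(in_lg), outside)  # 72 ['ad', 'arrium']
```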
Multiple Pairs
- OR: [value="chommoda"] OR [value="commoda"] (this is better accomplished by using a single pair with a regular expression).
- AND: [value="ionios"] AND [gn] (geographic name): find the word Ionios when it’s a toponym. Here we are matching against the attribute gn, which represents the geographic name and is derived from the TEI geogName element in the input document. 2 results: ionios, ionios.
- AND NOT: [value="ionios"] AND NOT [gn]
Collocations
- NEAR: [value="sic"] NEAR(m=0,s=l) [value="mater"]: find the word sic at either side of mater, with a maximum distance of 0, when both tokens are inside the same structure named l (verse). 2 results: sic, sic.
- NOT NEAR: [value="sic"] NOT NEAR(m=0) [value="mater"]: find the word sic not immediately next to the word mater. 1 result: sic at verse 6.
- BEFORE: [value="sic"] BEFORE(m=0,s=l) [value="mater"]: find the word sic at the left side of mater, with a maximum distance of 0, when both tokens are inside the same structure named l (verse). 1 result: sic.
- NOT BEFORE: [value="sic"] NOT BEFORE(m=0) [value="mater"]: find the word sic when it is not immediately followed by mater. 2 results: sic, sic (all 3 sic words in the sample text, except for the one before mater).
- AFTER: [value="sic"] AFTER(m=0,s=l) [value="mater"]: find the word sic immediately following (maximum distance=0) the word mater, only when both tokens are inside the same structure named l (verse). 1 result: sic.
- NOT AFTER: [value="sic"] NOT AFTER(m=0) [value="mater"]. 2 results.
- INSIDE: [value$="ter"] INSIDE(me=0) [$l]: find any word ending with ter at verse end, i.e. inside a structure named l, with a maximum distance of 0 from the end of that structure. 1 result: leviter.
- NOT INSIDE: [len="2"] NOT INSIDE() [$lg]: find any word consisting of 2 letters and not included in a stanza. 1 result: ad (from the title ad Arrium).
- OVERLAPS: [$gn] OVERLAPS(n=1) [$l]
- LALIGN: [$name="l"] LALIGN(m=0) [$name="sent"]: find any verse whose beginning coincides with the beginning of a sentence. 20 results: chommoda, dicebat, si, quando, commoda, vellet (verse 1), credo, sic, mater, sic, liber, avunculus, eius (verse 5), hoc, misso, in, syriam, requierant, omnibus, aures (verse 7).
- RALIGN: [$name="l"] RALIGN(m=0) [$name="sent"]: find any verse whose end coincides with a sentence end. 17 results: cum, quantum, poterat, dixerat, hinsidias (verse 4), sic, maternus, avus, dixerat, atque, avia (verse 6), iam, non, ionios, esse, sed, hionios (verse 12).
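The distance-based operators in this list can be traced on a few verses. The sketch below is an illustrative model, not the engine's algorithm: it assumes that the distance between two tokens is the number of tokens between them (so adjacent words have distance 0), that s=l means "same verse", and that me=0 means "no tokens between the word and the end of its verse". Positions are local to this toy excerpt of verses 5, 6 and 8.

```python
VERSES = {
    5: "credo sic mater sic liber avunculus eius",
    6: "sic maternus avus dixerat atque avia",
    8: "audibant eadem haec leniter et leviter",
}

# (position, verse number, value); positions are local to this excerpt.
tokens = []
for verse, line in VERSES.items():
    for word in line.split():
        tokens.append((len(tokens) + 1, verse, word))


def gap(a: int, b: int) -> int:
    """Number of tokens strictly between two positions."""
    return abs(a - b) - 1


def collocated(left, right, relation, m=0, same_verse=True):
    """Left-hand tokens standing in the given relation to some right-hand token."""
    hits = []
    for lp, lv, lw in tokens:
        if lw != left:
            continue
        for rp, rv, rw in tokens:
            if rw != right or (same_verse and lv != rv) or gap(lp, rp) > m:
                continue
            if (relation == "NEAR"
                    or (relation == "BEFORE" and lp < rp)
                    or (relation == "AFTER" and lp > rp)):
                hits.append((lp, lv, lw))
                break
    return hits


def negated(left, right, relation, m=0):
    """Left-hand tokens NOT in the given relation (no structure constraint)."""
    positive = set(collocated(left, right, relation, m, same_verse=False))
    return [t for t in tokens if t[2] == left and t not in positive]


near = collocated("sic", "mater", "NEAR")
before = collocated("sic", "mater", "BEFORE")
after = collocated("sic", "mater", "AFTER")
not_near = negated("sic", "mater", "NEAR")
not_before = negated("sic", "mater", "BEFORE")
not_after = negated("sic", "mater", "AFTER")

# INSIDE(me=0): words ending with "ter" that close their verse.
verse_end = {v: max(p for p, vv, _ in tokens if vv == v) for v in VERSES}
at_verse_end = [w for p, v, w in tokens if w.endswith("ter") and verse_end[v] - p <= 0]

print(len(near), len(before), len(after), [v for _, v, _ in not_near], at_verse_end)
```

Under these assumptions the sketch reproduces the counts given in the list: 2 for NEAR, 1 each for BEFORE and AFTER, 1 for NOT NEAR (the sic of verse 6), 2 each for NOT BEFORE and NOT AFTER, and leviter for INSIDE.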
Scopes
- corpus: @@[neoteroi rhetoric][value="chommoda"]: find the word chommoda only in the documents belonging to either of the corpora named neoteroi and rhetoric (assume that the sample document belongs to the former). 1 result: chommoda.
- document: @[author="Catullus" AND (date_value>="0" OR category="poetry")][value="chommoda"]: find the word chommoda only in those documents having Catullus as author and being either dated A.D. or included in the poetry category. Here author, date_value, and category are all document attributes. 1 result: chommoda.
- corpus and document: @@[neoteroi rhetoric]@[author="Catullus" AND (date_value>="0" OR category="poetry")][value="chommoda"]: find the word chommoda only in the documents belonging to either of the corpora named neoteroi and rhetoric (assume that the sample document belongs to the former), and only in those documents having Catullus as author and being either dated A.D. or included in the poetry category. 1 result: chommoda.
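The document scope above is a boolean expression over document attributes, and it is worth spelling out why the sample document passes it even though it is dated B.C. The sketch below models a document as a plain dictionary (a hypothetical representation; the corpora key in particular is an assumption of this sketch) and evaluates both scopes.

```python
def in_corpus_scope(doc: dict, corpus_ids: list[str]) -> bool:
    """@@[a b]: the document belongs to at least one of the listed corpora."""
    return bool(doc["corpora"] & set(corpus_ids))


def in_document_scope(doc: dict) -> bool:
    """@[author="Catullus" AND (date_value>="0" OR category="poetry")]"""
    return doc["author"] == "Catullus" and (
        doc["date_value"] >= 0 or doc["category"] == "poetry"
    )


# Attribute values taken from the sample TEI header (date when="-54").
sample = {
    "author": "Catullus",
    "category": "poetry",
    "date_value": -54,
    "corpora": {"neoteroi"},
}
# Hypothetical counterexample: same author and date, different category.
prose = dict(sample, category="prose")

print(in_corpus_scope(sample, ["neoteroi", "rhetoric"]))  # True
print(in_document_scope(sample))                          # True
print(in_document_scope(prose))                           # False
```

The sample document fails the date test (its date value is negative) but satisfies the OR through its poetry category, so the search for chommoda still runs inside it.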
Ghost Structures in Search
This sample refers to ghost structures, i.e. those structures defined only for the purpose of decorating the tokens they include with some attributes.
For instance, to find all the foreign Latin words in an Italian corpus one could write a pair like [frgn="lat"]. This implies a structure parser defined in the profile like this:
"StructureParsers": [
{
"Id": "structure-parser.xml",
"Options": {
"RootPath": "/TEI/text/body",
"Definitions": [
"div=/div @n head$",
"p=//p",
"lg=//lg",
"l=//l @n$",
"foreign:frgn=//foreign @lang"
]
}
}
],
Note the foreign:frgn=//foreign @lang definition, where foreign is the name of the structure (which is not stored but only used to add token attributes), frgn is the name of the token attribute to add, and //foreign @lang is the path to the structure value. The latter will be used as the value for the frgn token attribute. Should you want a fixed value (e.g. 1), you might do something like: foreign:frgn:1=//foreign @lang.
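The effect of a ghost structure can be sketched outside the indexer. The following example is only an illustration of the idea, not the structure parser itself: the sample paragraph, the function name and its parameters are all hypothetical. It walks a small XML fragment and copies the lang value of each foreign element onto the tokens it contains, as an frgn attribute, without recording the structure itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical Italian sentence containing a Latin expression.
SAMPLE = '<p>Il motto <foreign lang="lat">carpe diem</foreign> piace a tutti.</p>'


def tokenize_with_ghosts(xml_text, tag="foreign", attr_name="frgn", value_attr="lang"):
    """Return (token, attributes) pairs; tokens inside `tag` get `attr_name`."""
    tokens = []

    def add_words(text, attrs):
        for word in (text or "").split():
            tokens.append((word.strip(".,;:").lower(), dict(attrs)))

    def walk(elem, inherited):
        attrs = dict(inherited)
        if elem.tag == tag:
            # Fall back to a fixed value when the source attribute is missing.
            attrs[attr_name] = elem.get(value_attr, "1")
        add_words(elem.text, attrs)
        for child in elem:
            walk(child, attrs)
            # Text after a child belongs to the current element, not the child.
            add_words(child.tail, attrs)

    walk(ET.fromstring(xml_text), {})
    return tokens


tokens = tokenize_with_ghosts(SAMPLE)
latin = [word for word, attrs in tokens if attrs.get("frgn") == "lat"]
print(latin)  # ['carpe', 'diem']
```

A pair like [frgn="lat"] would then select exactly the tokens carrying that attribute, which is why no foreign structure needs to be stored.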