dictionary specification

One goal of this project is to develop a Sicilian dictionary. Along the way, I also hope to develop some software tools for processing natural language.

To seed the project, I used Arthur Dieli's vocabulary lists to create a basic dictionary. Dr. Dieli's work was one of the first Sicilian vocabulary lists on the internet. It contains over 12,000 Sicilian words and phrases, with their parts of speech and translations into English and Italian.

To build upon his work, I am creating a set of Perl hashes to store information about each word. We might include the part of speech, related words, examples, usage notes and dialectal differences.

Differences between Sicilian dialects are particularly important. Documenting those differences should enable us to (in future work) teach a computer how to recognize a speaker's origin by the words, conjugations and grammar that they use.

And, of course, if we can teach a computer the Sicilian language, we can teach the computer any language, so this project should also be useful to people with a general interest in linguistics.

This project is in its early stages, but I have already created a tool -- Cchiù dâ Palora -- that automatically conjugates Sicilian verbs and creates the singular and plural forms of nouns and adjectives. The tool is based on the grammar rules listed in Kirk Bonner's Introduction.

Below is a description of the information that I am collecting on each word and how I am storing that information. Following the description is a slightly more formal specification of the information collected.

Perl hashes

Some people learn a language by creating an index card for each word that they learn. The Perl hashes that we're creating here are similar to those index cards. For the preposition dintra, I created this "index card":

%{ $vnotes{"dintra_prep"} } = (
    display_as => "dintra",
    dieli => ["dintra"],
    dieli_en => ["inside","into","within",],
    dieli_it => ["dentro","dentro a","in",],
    notex => ["Dintra nu biccheri d'acqua t'anniasti. (pruverbiu sicilianu)",],
    part_speech => "prep",
);

For invariant words (like dintra), a simple index card like this -- with part of speech, translations and a Sicilian proverb -- may be sufficient for most learners.
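
As a minimal sketch of how such an entry might be read back, the code below formats one hash as a plain-text index card. The helper name index_card and the output layout are illustrative assumptions, not part of the project's tools.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ":encoding(UTF-8)");

my %vnotes;
%{ $vnotes{"dintra_prep"} } = (
    display_as  => "dintra",
    dieli       => ["dintra"],
    dieli_en    => ["inside", "into", "within"],
    dieli_it    => ["dentro", "dentro a", "in"],
    notex       => ["Dintra nu biccheri d'acqua t'anniasti. (pruverbiu sicilianu)"],
    part_speech => "prep",
);

# Hypothetical helper: format one entry as a plain-text "index card".
# Returns undef when the key is not in %vnotes.
sub index_card {
    my ($key) = @_;
    my $entry = $vnotes{$key} or return;
    # Fall back to the hash key when no display_as is given.
    my $word  = $entry->{display_as} // $key;
    my @lines = ("$word ($entry->{part_speech})");
    push @lines, "  en: " . join(", ", @{ $entry->{dieli_en} // [] });
    push @lines, "  it: " . join(", ", @{ $entry->{dieli_it} // [] });
    push @lines, map { "  ex: $_" } @{ $entry->{notex} // [] };
    return join("\n", @lines) . "\n";
}

print index_card("dintra_prep");
```

The hash key ("dintra_prep") only identifies the entry. The text shown to the reader comes from display_as.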

But other parts of speech are more complex. Verbs, in particular, can be quite complex, so I am also including information that enables the computer to automatically conjugate each verb.

For that task, we want to give the computer the least amount of information necessary to do the job properly.

Specifically, we do not want to tell the computer what the conjugation is. We want the computer to create the conjugations for us, so that (one day in the future) we can ask the computer to provide a conjugation for each dialect of the Sicilian language.

Fortunately, Sicilian has very few irregular verbs, and each of them has only a few irregular forms. For example, after accounting for boot and stem patterns, the verb jiri has only four irregular forms -- the infinitive and three in the present tense (the first-person singular, third-person singular and third-person plural) -- so we might create the following hash:

%{ $vnotes{"jiri"} } = (
    dieli => ["iri"],
    dieli_en => ["go",],
    dieli_it => ["andare",],
    notex => ["Vaju a accattu li scarpi.",
              "Jemu a accattari li scarpi.",],
    part_speech => "verb",
    verb => {
        conj => "xxiri",
        stem => "j",
        boot => "va",
        irrg => {
            inf => "jiri",
            pri => { us => "vaju", ts => "va", tp => "vannu" },
        },
    },
);

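To illustrate how little information the computer needs, here is a rough sketch of a present-tense conjugator driven by the stem, the boot and the irregular forms. It is not the code behind Cchiù dâ Palora. The person keys ds, up and dp, the table of endings and the function name present_indicative are assumptions made for this example (only us, ts and tp appear in the hashes).

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ":encoding(UTF-8)");

# Person keys: us, ts and tp appear in the hashes; ds, up and dp are
# assumed names for the remaining three persons.
my @persons = qw(us ds ts up dp tp);

# Assumed present indicative endings for the "xxiri" conjugation.
my %pri_endings = (
    xxiri => { us => "u",   ds => "i",   ts => "i",
               up => "emu", dp => "iti", tp => "inu" },
);

# Boot pattern: the three singular forms and the third-person plural
# use the boot; the first- and second-person plural use the stem.
my %uses_boot = map { $_ => 1 } qw(us ds ts tp);

sub present_indicative {
    my ($verb) = @_;    # the "verb" sub-hash of an entry
    my $endings = $pri_endings{ $verb->{conj} }
        or die "no endings for $verb->{conj}";
    my $irregular = ($verb->{irrg} // {})->{pri} // {};
    my %forms;
    for my $p (@persons) {
        my $base = $uses_boot{$p} ? $verb->{boot} : $verb->{stem};
        # An irregular form, when listed, overrides the generated one.
        $forms{$p} = $irregular->{$p} // ($base . $endings->{$p});
    }
    return \%forms;
}

my %jiri = (
    conj => "xxiri", stem => "j", boot => "va",
    irrg => { inf => "jiri",
              pri => { us => "vaju", ts => "va", tp => "vannu" } },
);
my $forms = present_indicative(\%jiri);
print join(", ", map { $forms->{$_} } @persons), "\n";
```

Under these assumptions the script prints "vaju, vai, va, jemu, jiti, vannu": three forms come from the irrg hash and the other three are generated from the boot and the stem.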
Similarly, the verb mèttiri only has a few irregular forms -- the past participle and four in the past tense:

%{ $vnotes{"mèttiri"} } = (
    dieli => ["mettiri"],
    dieli_en => ["place","put","start",],
    dieli_it => ["porre","mettere",],
    part_speech => "verb",
    verb => {
        conj => "xxiri",
        stem => "mitt",
        boot => "mètt",
        irrg => {
            pai => { quad => "mìs" },
            pap => "misu",
            adj => "misu",
        },
    },
);

But many verbs are built by adding a prefix to the verb mèttiri, so we can conjugate the reflexive verb intromèttirisi by creating a hidden hash of intromèttiri:

%{ $vnotes{"intromèttiri"} } = (
    hide => 1,
    part_speech => "verb",
    prepend => { prep => "intro", verb => "mèttiri", },
);

and then identifying the verb intromèttirisi as a reflexive form of intromèttiri:

%{ $vnotes{"intromèttirisi"} } = (
    dieli => ["intromettirisi"],
    dieli_en => ["interfere in",],
    dieli_it => ["intromettersi in",],
    part_speech => "verb",
    reflex => "intromèttiri",
);

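The chain from intromèttirisi to intromèttiri to mèttiri can be followed with a small recursive lookup. The sketch below is one possible way to do it, not the project's implementation: the function name resolve_verb and the shape of its return value are assumptions, and for brevity it only prefixes the stem and the boot (irregular forms such as the past participle would need the same treatment).

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ":encoding(UTF-8)");

my %vnotes = (
    "mèttiri" => {
        part_speech => "verb",
        verb => { conj => "xxiri", stem => "mitt", boot => "mètt",
                  irrg => { pai => { quad => "mìs" },
                            pap => "misu", adj => "misu" } },
    },
    "intromèttiri" => {
        hide => 1, part_speech => "verb",
        prepend => { prep => "intro", verb => "mèttiri" },
    },
    "intromèttirisi" => {
        part_speech => "verb", reflex => "intromèttiri",
    },
);

# Hypothetical resolver: returns conj, stem, boot and a reflexive flag.
sub resolve_verb {
    my ($key) = @_;
    my $entry = $vnotes{$key} or die "unknown verb: $key";

    # A reflexive entry points at its non-reflexive verb.
    if (defined $entry->{reflex}) {
        my %base = resolve_verb($entry->{reflex});
        return (%base, reflexive => 1);
    }
    # A prefixed entry borrows the stems of the verb it points at.
    if (my $pre = $entry->{prepend}) {
        my %base = resolve_verb($pre->{verb});
        $base{$_} = $pre->{prep} . $base{$_} for qw(stem boot);
        return %base;
    }
    my $v = $entry->{verb} or die "no conjugation data for $key";
    return (conj => $v->{conj}, stem => $v->{stem}, boot => $v->{boot},
            reflexive => 0);
}

my %r = resolve_verb("intromèttirisi");
print "$r{conj} $r{stem} $r{boot} reflexive=$r{reflexive}\n";
```

With the three entries above, the lookup yields the conjugation xxiri, the stem "intromitt" and the boot "intromètt", with the reflexive flag set.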

The tables below list the information that I am collecting in the hashes. The first lists information that may be included for all parts of speech. The tables below it list additional information required for verbs, nouns and adjectives.

all hashes
hash key      type    description
dieli         array   list of forms found in Dr. Dieli's dictionary
dieli_en      array   Dr. Dieli's translations into English
dieli_it      array   Dr. Dieli's translations into Italian
display_as    scalar  text to display when not using hash key
                      (e.g. vìviri and vìviri need different hash keys)
hide          scalar  indicator to not display the word in the main list
notex         array   notes and examples
part_speech   scalar  part of speech -- verb, noun, adj, adv, prep, pron, conj
noun          hash    information to decline the noun, see below
adj           hash    information to decline the adjective, see below
verb          hash    information to conjugate the verb, see below
prepend       hash    information to conjugate by adding a prefix to another verb,
                      where "verb" points to the hash key of the other verb
reflex        scalar  hash key of the non-reflexive verb

The additional information to include for verbs, nouns and adjectives is described in the tables below.


verb hashes
hash key      type    description
conj          scalar  which conjugation to use -- xxiri, sciri,
                      xxari, xcari, xgari, xiari, ciari, giari
stem          scalar  "stem" of the verb
boot          scalar  "boot" of the verb
irrg          hash    information on the irregular forms
  inf         scalar  irregular infinitive
  pri         hash    irregular present indicative forms
  pim         hash    irregular imperative forms
  pai         hash    irregular preterite forms,
                      when appropriate, use "quad" for convenience
  imi         hash    irregular imperfect indicative forms
  ims         hash    irregular imperfect subjunctive forms
  fti         hash    irregular future forms,
                      when appropriate, use "stem" for convenience
  coi         hash    irregular conditional forms,
                      when appropriate, use "stem" for convenience
  ger         scalar  irregular gerund
  pap         scalar  irregular past participle
  adj         scalar  irregular adjective

Sicilian has two verb conjugations ("-ari" and "-iri"), which I have split into eight subconjugations, so that the verb stems pair properly with the verb endings.

For example:

%{ $vnotes{"dari"} } = (
    dieli => ["dari"],
    dieli_en => ["award","give","pass",],
    dieli_it => ["aggiudicare","dare",],
    part_speech => "verb",
    verb => {
        conj => "xxari",
        stem => "d",
        boot => "dùn",
        irrg => {
            pri => { us => "dugnu", },
            pai => { quad => "dètt" },
            fti => { stem => "dar" },
            coi => { stem => "dar" },
        },
    },
);

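Because the verb table fixes the allowed values of conj and the keys allowed inside irrg, a verb hash can be checked mechanically. The sketch below shows one way this might look. The function name check_verb is hypothetical, and treating conj, stem and boot as mandatory is an assumption.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ":encoding(UTF-8)");

# Values and keys taken from the verb table.
my %known_conj = map { $_ => 1 }
    qw(xxiri sciri xxari xcari xgari xiari ciari giari);
my %irrg_type = (
    inf => "scalar", ger => "scalar", pap => "scalar", adj => "scalar",
    pri => "hash", pim => "hash", pai => "hash", imi => "hash",
    ims => "hash", fti => "hash", coi => "hash",
);

# Hypothetical checker: returns a list of problems found in a verb hash.
sub check_verb {
    my ($verb) = @_;
    my @problems;
    # Assumption: conj, stem and boot are always required.
    for my $key (qw(conj stem boot)) {
        push @problems, "missing $key" unless defined $verb->{$key};
    }
    push @problems, "unknown conj: $verb->{conj}"
        if defined $verb->{conj} && !$known_conj{ $verb->{conj} };
    my $irrg = $verb->{irrg} // {};
    for my $key (sort keys %$irrg) {
        my $want = $irrg_type{$key};
        if (!defined $want) {
            push @problems, "unknown irrg key: $key";
            next;
        }
        my $got = ref($irrg->{$key}) eq "HASH" ? "hash" : "scalar";
        push @problems, "irrg $key should be a $want" if $got ne $want;
    }
    return @problems;
}

my %dari = (
    conj => "xxari", stem => "d", boot => "dùn",
    irrg => { pri => { us => "dugnu" }, pai => { quad => "dètt" },
              fti => { stem => "dar" }, coi => { stem => "dar" } },
);
my @ok  = check_verb(\%dari);
my @bad = check_verb({ conj => "xxeri", stem => "d", boot => "d",
                       irrg => { pap => {}, xyz => "q" } });
print "dari: ", scalar(@ok), " problems\n";
print "bad:  $_\n" for @bad;
```

The dari hash passes without complaints, while the deliberately broken second hash is reported for its unknown conjugation, its unknown irrg key and its past participle of the wrong type.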

noun hashes
hash key      type    description
gender        scalar  gender of the noun -- mas, fem, both
plend         scalar  noun pattern -- xi, xixa, xa, xura, xx, eddu, aru, uni, uri
plural        scalar  irregular plural form

Most Sicilian nouns are either masculine or feminine, but some nouns (e.g. "atleta" and "dentista") are both masculine and feminine. Use the noun patterns below to form the plural.

noun patterns
plend         pattern
xi            plural in "-i"
xixa          plural in either "-i" or "-a"
xa            plural in "-a"
xura          plural in either "-ura" or "-i"
xx            no change (foreign word)
eddu          "-eddu" to "-edda"
aru           "-aru" to "-ara"
uni           "-uni" to "-una"
uri           "-uri" to "-ura"

For example:

%{ $vnotes{"prufissuri_noun"} } = (
    display_as => "prufissuri",
    dieli => ["prufissuri"],
    dieli_en => ["professor", "teacher",],
    dieli_it => ["professore",],
    part_speech => "noun",
    noun => {
        gender => "mas",
        plend => "uri",
    },
);

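The noun patterns translate fairly directly into code. The following is only a sketch: the function name plurals is hypothetical, and for the xi, xa, xixa and xura patterns it assumes that the plural ending simply replaces the final vowel of the singular, which ignores any shift of stress. The entry for "film" is an illustrative foreign word, not taken from the dictionary.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ":encoding(UTF-8)");

# Suffix swaps listed in the noun patterns table.
my %swap = (eddu => "edda", aru => "ara", uni => "una", uri => "ura");

# Plural endings for the remaining patterns (first one listed first).
my %endings = (xi => ["i"], xa => ["a"],
               xixa => ["i", "a"], xura => ["ura", "i"]);

# Hypothetical helper: returns the list of plural forms for a noun.
sub plurals {
    my ($singular, $noun) = @_;
    # An irregular plural, when given, wins over any pattern.
    return ($noun->{plural}) if defined $noun->{plural};
    my $plend = $noun->{plend} // die "no plend for $singular";
    return ($singular) if $plend eq "xx";    # no change
    if (my $to = $swap{$plend}) {
        (my $plural = $singular) =~ s/\Q$plend\E$/$to/
            or die "$singular does not end in -$plend";
        return ($plural);
    }
    my $list = $endings{$plend} or die "unknown plend: $plend";
    # Assumption: the ending replaces the final vowel of the singular.
    (my $root = $singular) =~ s/[aeiou]$//;
    return map { $root . $_ } @$list;
}

print join(", ", plurals("prufissuri", { gender => "mas", plend => "uri" })), "\n";
print join(", ", plurals("film",       { gender => "mas", plend => "xx"  })), "\n";
```

For prufissuri the "uri" pattern gives prufissura, and a noun marked "xx" comes back unchanged. Patterns with two possible plurals return both forms.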

adjective hashes
hash key      type    description
invariant     scalar  indicator that the adjective is invariant
femsi         scalar  feminine singular form

Most Sicilian adjectives must agree in gender and number with the noun that they are modifying, but some are invariant (e.g. "megghiu"). Others only change in the feminine singular form (e.g. "giùvini").

For example:

%{ $vnotes{"megghiu_adj"} } = (
    display_as => "megghiu",
    dieli => ["megghiu","u megghiu",],
    dieli_en => ["better","superior",],
    dieli_it => ["migliore","meglio","maggiore",],
    notex => ["La megghiu cosa è di lassari tuttu com'è.",],
    part_speech => "adj",
    adj => {
        invariant => 1,
    },
);

%{ $vnotes{"giùvini_adj"} } = (
    display_as => "giùvini",
    dieli => ["giuvini","giuvina",],
    dieli_en => ["young boy","young girl",],
    dieli_it => ["giovanotto","giovanotta",],
    part_speech => "adj",
    adj => {
        femsi => "giùvina",
    },
);
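
The two adjective keys can be read as exceptions to a regular rule. In the sketch below, the function name adj_forms and the form labels ms, fs and pl are hypothetical, and the regular rule (an adjective in "-u" takes "-a" in the feminine singular and "-i" in the plural) is an assumption made for the example.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ":encoding(UTF-8)");

# Hypothetical helper: masculine singular (ms), feminine singular (fs)
# and plural (pl) forms of an adjective.
sub adj_forms {
    my ($ms, $adj) = @_;
    $adj //= {};
    # Invariant adjectives never change.
    return (ms => $ms, fs => $ms, pl => $ms) if $adj->{invariant};
    # With femsi, only the feminine singular differs.
    return (ms => $ms, fs => $adj->{femsi}, pl => $ms)
        if defined $adj->{femsi};
    # Assumed regular rule: "-u" becomes "-a" (fs) and "-i" (pl).
    (my $root = $ms) =~ s/u$//
        or die "expected an adjective in -u: $ms";
    return (ms => $ms, fs => $root . "a", pl => $root . "i");
}

for my $case (["megghiu", { invariant => 1 }],
              ["giùvini", { femsi => "giùvina" }],
              ["misu",    {}]) {
    my %f = adj_forms(@$case);
    print "$f{ms}, $f{fs}, $f{pl}\n";
}
```

Here megghiu stays the same in all three forms, giùvini changes only to giùvina in the feminine singular, and misu (the adjective listed under mèttiri) follows the assumed regular rule.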