The Matsunoki Treebank – a parsed corpus of Tsugaru dialect folktales

The Matsunoki Treebank is a corpus of Tsugaru dialect folktales with hand-worked tree analyses. The Tsugaru dialect (Tsugaru-ben) is spoken in the Tsugaru region of Japan, in the western part of Aomori Prefecture, the northernmost prefecture on Japan's main island of Honshu. Highlights of the corpus include:

(Hepburn) romanisation
English glosses
full lemmatisation for Tsugaru-ben words linked to dictionary word sense definitions
labelled constituent structure
assignments of grammatical function
zero elements
information to resolve anaphoric dependencies

The name Matsunoki, meaning ‘pine tree’ in English, derives from the frequent appearance of pine trees in the folktales that make up the corpus data.

About the corpus data

The corpus data comes from audio recordings of native speakers of Tsugaru-ben. The recordings are readings of Mukashi-Banashi (Japanese folktales) related to the Tsugaru region, presented in the dialect. There are 26 readings in total, representing nearly four hours of spoken data.

The speakers are all members of the “Wa No Hanashiko” group. Tsuri Sato (佐藤ツリ), the representative of the group, started its activities in 2003 with the aim of handing down the folktales of the Tsugaru region. The group is based in the Hirosaki area and has approximately 300 members. Members tell stories in Tsugaru-ben during visits to retirement homes, community centers, and events held by the social welfare council. They also share folktales of the region on the local TV station and work with Hakuryu Shibutani (澁谷白龍), a researcher of dialect and of Senryu, a form of short poetry (Megumi 2021).

Using folktale data has strengths and weaknesses. One advantage is rich vocabulary: folktales contain distinctive and archaic words, which makes them a good resource for starting a Tsugaru-ben corpus and word database. However, because of the nature of folktales, some of the words and phrases that appear are exaggerated Tsugaru-ben or are no longer in daily use. For this reason, some examples in The Matsunoki Treebank may be unnatural and not fully representative of Tsugaru-ben as it is spoken today.

About the annotation

Hepburn is used as the system of romanisation because of its compatibility with the suite of programs and grammar files used to create the morphological analysis. Word segmentation and morphological analysis use the WAKACHI2002 (Miyata 2018) inventory of word class and morpheme codes.

The tree annotation is based on The Kusunoki Treebank (Kainoki 2022), a parsed corpus of contemporary Japanese. Syntactic structure is represented with labelled parentheses in the style of the Penn Treebank (Bies et al. 1995). More particularly, the Penn Historical Corpora scheme (Santorini 2010) has informed the ‘look’ of the annotation. This includes:

adoption of the CorpusSearch format (Randall 2009) as the underlying encoding,
not having any explicit verb phrase structure (although verb phrase structure is implicitly present when there are interpretive consequences),
the use of IP, ADVP, NP, and PP tag labels,
the presentation of phrase conjunction structure with CONJP layers, and
the marking of function for all clausal nodes and all clause level constituents.
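
As a rough illustration of what labelled parentheses look like in practice, the following Python sketch reads one bracketed tree into nested tuples and lists its terminal words. The example tree is invented for illustration and is not taken from the corpus; the word-level tags (N, P, VB), the function suffixes (-MAT, -SBJ), and the helper names parse_tree and leaves are assumptions, not part of the Treebank's documented tag set or tooling.

```python
import re


def parse_tree(text):
    """Parse one labelled-bracket tree into nested (label, children) tuples.

    Assumes every opening parenthesis is followed by a label, as in
    "(NP (N matsu))". Children are either sub-trees or terminal strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        if pos >= len(tokens) or tokens[pos] in "()":
            raise ValueError("expected a label at token %d" % pos)
        label = tokens[pos]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced parentheses")
        pos += 1  # consume the closing ')'
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def leaves(tree):
    """Return the terminal strings of a parsed tree, left to right."""
    _label, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


# Invented example (hypothetical tags and words, not corpus data).
EXAMPLE = "(IP-MAT (PP-SBJ (NP (N matsu)) (P ga)) (VB aru))"

if __name__ == "__main__":
    parsed = parse_tree(EXAMPLE)
    print(parsed[0])       # root label
    print(leaves(parsed))  # terminal words in order
```

Note that there is no verb phrase layer in the example: the verb is a direct daughter of the clause node, in line with the flat clause structure described above.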

Annotation practice strives for observational adequacy. The aim is to present a consistent linguistic analysis for each attestation of an identifiable linguistic relation or process. The annotation also offers syntactic analysis for the subsequent generation of meaning representations using the methods of Treebank Semantics (Butler 2015).

Search interface

The Matsunoki Treebank comes with a user interface that enables search on virtually any aspect of the annotation. Results of specific searches can be downloaded in the form of annotated data. The source data to which the search interface links is continually updated to reflect improvements in analysis.

About the dictionary

The search interface includes an integrated dictionary function that links the English glosses of the annotation to word sense definitions, which are disambiguated with numbers. For example, run1 links to the definition for a human running, while run2 links to the definition for driving a vehicle. Enabling “word” mode by clicking “word” at the top of the interface shows correspondences between Tsugaru-ben words and their word sense definitions for all of the texts.
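
For readers working with downloaded data, a sense-numbered gloss such as run1 can be separated into its gloss and sense number with a few lines of Python. This is only a sketch: the function name split_sense is hypothetical, and it assumes that a sense number is always a run of digits at the very end of the gloss, which the description above suggests but does not guarantee.

```python
import re


def split_sense(gloss):
    """Split a gloss like "run1" into ("run", 1).

    Returns (gloss, None) when the gloss has no trailing digits.
    Assumption: the sense number is the digit run at the end of the gloss.
    """
    match = re.fullmatch(r"(.+?)(\d+)", gloss)
    if match is None:
        return (gloss, None)
    return (match.group(1), int(match.group(2)))


if __name__ == "__main__":
    for gloss in ["run1", "run2", "run"]:
        print(gloss, split_sense(gloss))
```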

Attribution

Presentations of research results using The Matsunoki Treebank should include a citation taking the general form of the example below (with appropriate modifications depending on the date of access):

Gwidt, Vance, Mikoto Ono, Alastair Butler, et al. (2022). The Matsunoki Treebank – a parsed corpus of Tsugaru dialect folktales, Hirosaki University. Available at: tsugaruben.github.io (accessed 28 December 2023).

Terms of use

This work is licensed under a Creative Commons Attribution 4.0 International License.