Shahked Bleicher

May 02, 2019

A Guided Tour of Language and Culture

In current language documentation practices, there are three main pillars or genres of documentation: the dictionary, the grammar and the corpus. In this paper I present the idea of a new type of corpus that can take advantage of the technology and techniques available today. The current working name for this modified corpus is a “cultural corpus” due to its strong focus on language in cultural contexts. The idea was born after reading Erin Debenport’s Fixing the Books: Secrecy, Literacy, and Perfectibility in Indigenous New Mexico (2015), in which she works with a native community to create a dictionary for their language. It occurred to me that the traditional three pillars may be, if not lacking, then at least not used to their full potential.

In Fixing the Books, Erin Debenport (2015) describes her experiences working with the Keiwa language and its community of speakers to create a dictionary. However, the dictionary she describes differs from the traditional view and strays into being a new genre. The community she worked with, San Ramón, gave a high priority to passing on cultural knowledge and tried to create the dictionary with this goal in mind. I believe that the cultural transmission they were attempting is not suited to a dictionary, and that attempting to create a dictionary in such a way does a disservice to both the dictionary itself and their own goal. Therefore, I propose a new type of document (or software, in this case) that is designed specifically for passing on cultural knowledge through the language (or, depending on perspective, passing on the language through cultural knowledge). While it may contain definitions of words, the document will focus on dialogues and other forms of discourse that show language and culture by example. Rather than translations, entries will have glosses that explain why certain phrases are used, draw attention to certain topics, etc. Lastly, because this will be software, it can offer features that may not be feasible in a physical document (e.g. searching for dialogues that contain a certain phrase, or playing recordings of a dialogue to augment the text).

Part of developing a new genre is specifying not only what it is, but what it is not. Since the dictionary example provided by Debenport (2015) is the motivation for this concept, a comparison between the traditional dictionary and the proposed alternative is called for. The main difference from which all other differences stem is that while the dictionary is meant to pass on the structure and meaning of individual units of a language, the alternative cultural corpus is meant to pass on the nuances and cultural aspects of a community through a linguistic lens. The San Ramón community struggled to add a cultural context to each entry in their dictionary through example sentences, often spending some amount of time trying to think of an example that fit their goal. The alternative is not constrained by word choice and can use whatever words, phrases, idioms or other linguistic structures that best show a certain cultural focus. It can include sentences, dialogues, letters, descriptions or recipes of a dish. As long as it shows how the language is actually used by the community, it is a good fit. In addition, the organization of the dictionary genre is determined by whatever alphabet the language uses. Related words and example sentences may not be anywhere near each other. The alternative can be organized according to the content, not the lexical order.

The above are just some of the fundamental differences between the dictionary and the proposed alternative genre, which hopefully justifies the development of said genre. The next step is to actually create a specification of the features and properties that will be used, and to what end. At this point in time it is hard to picture what the end result might actually look like, so it is important to paint a picture of exactly what defines this genre. For example, I will be approaching the issue from the point of view of software, but does that mean that the genre is only possible as software? I will begin to narrow down what the defining characteristics are and what features might be possible to achieve those characteristics in a modern software application. Alternatives will also be considered in order to show what a physical document in this genre might look like. Note the keyword “might.” I believe it is important to leave the final form of any document aiming to be in this genre open ended so that it is easier for communities to suit it to their own needs. I mention the concept of a specification because I believe it is a good way to define a genre such as this. As long as a document or application has the required characteristics, how it accomplishes those characteristics is left up to the implementer.


To start off, I will list the characteristics that I think are important for this cultural corpus genre. Any software or document aiming to be in the genre should satisfy the following: the relationship between culture and linguistic structures should be emphasized; the organization of the content should be based upon cultural content and relationships; it should be possible to traverse the corpus in a focused way; direct translation of the target language entries should be avoided whenever possible; it should be accessible to the speaker community of the language in question; and the data must be stored in a portable and open format. These are deliberately kept fairly vague because culture is a vague term. I encourage discussing the specifics of how to accomplish these goals before beginning the document. That being said, I will provide examples that I think would be practical and effective.

Cultural and Linguistic Relationships

The most important property of the genre is its focus not only on the culture of a community, but on how language is used within that culture. In other words, the document should not be a large encyclopedia of cultural knowledge or a wiki about the community, but should contain insights into how language is used in different cultural contexts. For example, if an entry is a dialogue between two members of a community and it contains a non-obvious reference to some cultural event or practice, or even just a phrase, that reference could be explained in a gloss. In English, someone might say “That band is so last year,” most likely in a joking way. It is not immediately clear what this means or why it is funny, so it would be valuable to explain not just what the speaker means, but also where the phrase comes from. In this case, it refers to how certain cultural scenes obsess over the newest thing and quickly move from one item to the next. It is funny because it mocks those kinds of people and because the intonation of the phrase itself is amusing.

In addition, relationships between certain linguistic and semantic units should be documented. A linguistic unit may include structures such as words, phrases and names. On the other hand, a semantic unit may include more of the intent behind a linguistic unit. The phrase “Over my dead body” is a good example. The phrase itself can be considered a linguistic unit that is used in a few different contexts in English. The document could provide a relationship between entries that contain this phrase. This would be a linguistic unit relation and would allow users to see more instances of its use in real life. On the other hand, the semantic unit of “Over my dead body” could be a strong refusal or challenge, usually with a connotation of aggression. In this way, the document would relate this entry with other entries containing similar expressions so that the user can see how language is used in confrontations in the community. Also importantly, they would have the full context of when and how such language is used as well as glossed explanations of the subtler aspects of the interaction. In this way, a user might get a better understanding of how language is used within the community and about the culture at the same time.

In the previous example the entire phrase was one linguistic unit. I think it’s important to note that linguistic units can be as specific as individual words. The word “hell” is one that is rife with cultural history and meaning. It is used in many different situations, from anger to humor. A relationship between this word and other entries that contain or relate to it could be provided, such as the phrase “When hell freezes over” and other instances of the unit “hell” that occur in any entries (be it textual, audio, video, etc.).
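The two kinds of relation described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Entry` class, its field names and the sample entries are hypothetical, and a real corpus would likely derive the units from annotated transcriptions rather than hand-written sets.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    entry_id: str
    text: str
    # Words or phrases that literally occur in the entry (hypothetical field).
    linguistic_units: set = field(default_factory=set)
    # Intents behind the language, e.g. "strong refusal" (hypothetical field).
    semantic_units: set = field(default_factory=set)


def related(entries, target):
    """Return ids of entries sharing a linguistic or a semantic unit with target."""
    linguistic = [e.entry_id for e in entries
                  if e is not target and e.linguistic_units & target.linguistic_units]
    semantic = [e.entry_id for e in entries
                if e is not target and e.semantic_units & target.semantic_units]
    return linguistic, semantic


corpus = [
    Entry("e1", "Over my dead body!", {"over my dead body"}, {"strong refusal"}),
    Entry("e2", "When hell freezes over.", {"when hell freezes over", "hell"}, {"strong refusal"}),
    Entry("e3", "What the hell was that?", {"hell"}, {"surprise"}),
]

linguistic, semantic = related(corpus, corpus[1])
print(linguistic)  # entries that share the unit "hell"
print(semantic)    # entries that share the intent "strong refusal"
```

Keeping the two sets separate lets the software offer "more uses of this word" and "more ways to express this intent" as distinct links from the same entry.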


While the combination of culture and linguistics is fundamental to the proposed genre, how the content is organized is what makes it unique. I hope it makes it easy to use as well. What makes it especially useful is that it can be organized in many ways depending on what the user is looking for. Though this would only be possible for software (or at least easier), perhaps physical copies of the corpus could be printed in different organizations. For example, one version could be printed based on style of speech such as formal and informal. Another could be based on domain such as song, writing, letter, academic, religious and so on. Creating physical documents like this would be practical for communities that don’t have access to the software or the technology required.

The corpus could be organized in any way that the community sees fit. Some ways that immediately come to mind are by genre, phrase, topic and content type. Each of these structures could make it easier for the user to find certain information. The genre organization would allow the user to see examples of language in forms such as stories, news and comedy. Similarly, the content type organization could group entries by text, audio, video or any other format the community has managed to document.

If the user desires a more fine-grained way to browse the corpus, they can choose to organize it by phrase or topic. By “organized by phrase” I mean that the content is grouped by entries containing phrases of interest. The same entry might appear in several groups, but one of the advantages of software is that this takes up no extra space, because an entry only has to be stored once. The content can be grouped by topic in the same way. Perhaps there could be some sort of tagging system that allows the creator of the corpus to specify keywords and areas associated with each entry.

I would like to point out the fact that in each of these cases it is the user deciding how to organize the content to best suit their needs at any given moment. As the corpus is being developed and entries are being added the creator(s) will be adding all of the necessary metadata needed for multiple organization methods. This includes tags, phrases of interest, searchable transcriptions, content type, genre or any other relevant information. Upon completion and release of the corpus, a user would be able to interact with the provided software and specify which organization they want. The software would need to provide an interface for this as a setting.
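As a minimal sketch of this idea, the following Python groups the same stored entries by whichever metadata field the user chooses. The field names (`genre`, `content_type`, `tags`) and the sample entries are assumptions made for illustration, not a proposed standard.

```python
from collections import defaultdict

# Hypothetical entries; each is stored once, with metadata added by the creator.
entries = [
    {"id": "e1", "genre": "story", "content_type": "audio", "tags": ["family", "winter"]},
    {"id": "e2", "genre": "comedy", "content_type": "video", "tags": ["family"]},
    {"id": "e3", "genre": "story", "content_type": "text", "tags": ["agriculture"]},
]


def organize(entries, key):
    """Group entry ids by a metadata field chosen by the user.

    A field may hold one value (genre) or several (tags); in the second
    case the entry id is listed under every value it carries.
    """
    groups = defaultdict(list)
    for entry in entries:
        value = entry[key]
        values = value if isinstance(value, list) else [value]
        for v in values:
            groups[v].append(entry["id"])
    return dict(groups)


by_genre = organize(entries, "genre")
by_tag = organize(entries, "tags")
print(by_genre)
print(by_tag)
```

Only ids are copied into the groups, so an entry that appears under several tags is still stored a single time.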

Although I do not believe this is required for the cultural corpus, it would be invaluable to provide a sort of guided tour through the linguistic culture of the community. This guide would be mostly for learning purposes and meant to be consumed in order from start to finish. This means that the creator and community would have to work together to prioritize different aspects of their culture and manually organize them by importance and complexity. The tour should begin with the more fundamental and less complex cultural topics. As a simplified example for the English language in America, I might start with greetings, then move on to national holidays, then religion. Each of these areas is more complex and nuanced than the last, with much more cultural association. In addition, each “stage” of the tour could have its own smaller progression. Keeping with the example of greetings, I could begin with basic greetings in everyday interactions, then show entries with greetings between an employer and employee, and continue in that vein. Each step should build on the previous one so that the user on the tour gains a full understanding of English greetings in America, step by step. Even better would be to have many guided tours that each impart their own lesson or knowledge. Perhaps a general, broad tour could be provided in addition to tours through subcultures such as music or art. The problem of having to decide which entries go into a tour is lessened because the creator (or the user, if the software allows it) can simply create another guided tour that includes the entries they are unsure about.
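A guided tour of this kind could be represented as a list of stages, each holding its own ordered entries. The sketch below is one possible shape, with hypothetical names (`Stage`, `complexity`, the entry ids) and the assumption that the community assigns each stage a complexity rank by hand.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    title: str
    complexity: int   # lower numbers come earlier in the tour
    entry_ids: list   # already ordered within the stage by the creator


def tour_steps(stages):
    """Flatten stages into one start-to-finish sequence of (stage title, entry id)."""
    ordered = sorted(stages, key=lambda stage: stage.complexity)
    return [(stage.title, entry_id) for stage in ordered for entry_id in stage.entry_ids]


# Stages can be declared in any order; the complexity rank decides the sequence.
american_english_tour = [
    Stage("religion", 3, ["r1"]),
    Stage("greetings", 1, ["everyday-hello", "employer-employee"]),
    Stage("national holidays", 2, ["h1"]),
]

steps = tour_steps(american_english_tour)
for title, entry_id in steps:
    print(title, entry_id)
```

Because a tour only references entry ids, adding a second tour (for example, one about music) does not duplicate any content.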

Traversing the Corpus

Keeping with the theme of making the cultural corpus easy for the user, it should be easy for them to search for specific content or entries, with or without knowing exactly what they are looking for. This is similar to the organization requirements described above in that metadata is required about each of the content entries. It differs in that it is more of a search or filter requirement. Assuming the corpus is being used in software, there should be a way for a user to filter out entries that are not relevant to them. The user could use this to complete very focused sessions using the corpus.

An example would be a user who finds an interesting phrase while listening to an oral recording of a myth. They would like to know how it is used in other contexts, or the history of the phrase. As discussed in previous sections, the corpus creator should create metadata and relations between entries. The user could search for this phrase, and information directly about the phrase (history, explanation, etc.) would be provided in addition to “links” to related entries. A related entry may contain the specified phrase (a linguistic unit relation) or have a similar phrase or subject (a semantic unit relation). The term that I thought of when thinking about this type of functionality was “zooming in and out.” A user should be able to get both a broad idea of how language is used within the community and still be able to find specific details for something that interests them. They “zoom” in to obtain increasingly in-depth information. As a general rule, any metadata should be made easy to search or filter. The information that is included in the metadata must be made clear to the user in order to be effective.
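To make “zooming in and out” concrete, here is a small Python sketch of a combined search and filter. It assumes each entry carries a searchable `transcription` plus simple metadata fields; these names and the sample data are hypothetical.

```python
def search(entries, phrase=None, **filters):
    """Return ids of entries matching every given filter and, if given, the phrase."""
    matches = []
    for entry in entries:
        # Drop entries whose metadata does not match a requested filter.
        if any(entry.get(key) != value for key, value in filters.items()):
            continue
        # Case-insensitive substring match against the transcription.
        if phrase is not None and phrase.lower() not in entry["transcription"].lower():
            continue
        matches.append(entry["id"])
    return matches


entries = [
    {"id": "e1", "genre": "myth", "content_type": "audio",
     "transcription": "And that is when hell froze over."},
    {"id": "e2", "genre": "myth", "content_type": "text",
     "transcription": "Long ago the river spoke."},
    {"id": "e3", "genre": "comedy", "content_type": "video",
     "transcription": "Sure, when hell freezes over!"},
]

broad = search(entries, genre="myth")                  # zoomed out: every myth
narrow = search(entries, phrase="hell", genre="myth")  # zoomed in: one phrase within myths
everywhere = search(entries, phrase="hell")            # the phrase across all genres
print(broad, narrow, everywhere)
```

A plain substring match is the simplest possible choice; a real implementation would probably need tokenization suited to the language being documented.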


The characteristic of avoiding direct translation boils down to the idea that this corpus is meant to be a corpus of real language use within a community, with the intent of passing on cultural information and how it relates to the language. To further this goal, direct translation of entries in the target language into another should be avoided whenever possible. In its place, glosses and transcriptions should be provided in the language being documented. That being said, it may make sense to translate those glosses into another language if deemed necessary for your audience. At the very least, the target language glosses should be emphasized and completed first. Though it could be helpful for teaching, the main purpose of this genre is not necessarily to teach the language but to showcase and explain its cultural use and implications. The user is expected to already have the grammar, vocabulary and basic understanding of the language. Glosses in other languages could be completed if required, after careful consideration.

Each entry contains examples of the language being used within the community in “real” situations. Every single entry should have explanations about why certain language is used, what connotations it has, nuances that are not necessarily clear and other information that is deemed important. The user can use these glosses to gain knowledge and understanding of the language, possibly in combination with a more traditional language genre such as a dictionary. The ideal result would be that a user can be immersed within the language and community and be able to understand and contribute to the interactions they are exposed to. Any interaction that they are confused about (e.g. why did a sentence make the other person angry?) hopefully has an entry in the cultural corpus with a detailed gloss.


Lastly, the corpus or document should be accessible to the speaker community it documents. This should be true even if they do not have access to computers or the required software. It was mentioned earlier that there should be a way to print the corpus, or a subset of it, in any organization desired. In this way the community can make use of the physical documents in any way they need. For example, they could be used by children of the community to help revitalize a language’s use with a good understanding of its history and context. The ability of the community to access and use the cultural corpus should be prioritized above making it accessible to other communities, unless the community explicitly states otherwise.


In Seven Dimensions of Portability for Language Documentation and Description, Bird and Simons (2003) present challenges and propose solutions for storing language documentation and description on modern devices. They base their ideas on the fact that although computers and software allow linguists to store, record, and mass distribute their work in novel ways, the constantly changing nature of technology makes all of this documentation hard to maintain. Some formats, such as those used by Microsoft Office, are proprietary and lose support over time. Tools become obsolete or are replaced with new ones. Documentation that is created with such tools can become inaccessible after a few years due to these problems.

As part of their solution, they examine seven problem areas of portability: Content, Format, Discovery, Access, Citation, Preservation and Rights (2003). Each of these has its own more specific areas to focus upon. I believe that the cultural corpus can benefit greatly by following the standards proposed in this paper. All seven problem areas can and should be considered, but I believe there are a few important areas that are particularly suitable for the genre. Here I will explain how the software could implement a solution for these areas.

The first of these is a subtopic of the Content area, Terminology. Bird and Simons describe Terminology as using the same terms across different documentation resources to mean the same thing. The example given is the word ‘absolutive,’ which apparently refers to different things depending on the language (2003). In order to avoid this, an existing vocabulary of standard terms should be used whenever possible when creating entries in the cultural corpus. Since the corpus is focused more on the cultural aspects than the grammatical, a more appropriate example would be using such a standard vocabulary for the metadata. If two entries have a common theme, then they should use the same metadata tag or data. For instance, both could have the tag ‘agriculture’ instead of one having ‘agriculture’ and one having ‘farming.’
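One way software could enforce a shared vocabulary is to normalize every tag as it is entered. The following Python is a minimal sketch; the vocabulary and synonym table are invented for illustration and would in practice be agreed upon by the community or taken from an existing standard.

```python
# Hypothetical controlled vocabulary and synonym table.
CANONICAL_TAGS = {"agriculture", "greeting", "religion"}
SYNONYMS = {"farming": "agriculture", "salutation": "greeting"}


def normalize_tag(raw):
    """Map a creator-supplied tag onto the controlled vocabulary or reject it."""
    tag = raw.strip().lower()
    tag = SYNONYMS.get(tag, tag)
    if tag not in CANONICAL_TAGS:
        raise ValueError(f"{raw!r} is not in the controlled vocabulary")
    return tag


print(normalize_tag("Farming"))      # mapped to the canonical tag
print(normalize_tag("agriculture"))  # already canonical
```

Rejecting unknown tags, rather than silently accepting them, forces the decision about a new term to be made once and recorded in the vocabulary.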

Another extremely important part of portability is the format of the data. Specifically, the rendering and markup of the data should be completely separated. Rendering is the presentation of the data to the user, while markup is how the data is represented internally. An open and well-known markup language that could be used in this software is XML, which uses tags to indicate what the content is. In addition, characters should be encoded in UTF-8, an encoding of the Unicode standard that is supported by many technologies; Unicode itself is still growing. Below is an example entry which uses XML.

<entry id="123" genre="story" date="5/10/2017" version="1">
    <tag>...</tag>
    <audio src="recording.mp4" />
    <gloss>...</gloss>
</entry>

As you can see, the XML itself does not indicate how the data should be presented in the software or document (e.g. bold, color, size). It merely describes what the data is. In this case, Entry, Tag, Audio and Gloss were used as tags. However, it is important to note that even the choice of tags must be defined with an XML schema or similar. This schema defines the structure of the XML data, meaning what tags there are and how they fit together. One might state that a Tag XML tag can only be put inside an Entry tag, for example. This allows other tools and software that support XML to “understand” your data and use it correctly.
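As a rough illustration of what such structural rules buy you, the Python below parses an entry with the standard library's `xml.etree.ElementTree` and checks it against a few hand-written rules. Note the assumptions: this is not real schema validation (the standard library does not validate against an XML schema; a dedicated schema validator would be needed for that), the rules are invented, and the contents of the `tag` and `gloss` elements are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample entry; the text inside <tag> and <gloss> is made up for this sketch.
ENTRY_XML = """
<entry id="123" genre="story" date="5/10/2017" version="1">
    <tag>family</tag>
    <audio src="recording.mp4" />
    <gloss>Hypothetical explanation of a phrase in the recording.</gloss>
</entry>
"""

# Hand-written stand-ins for rules that a real schema would declare.
ALLOWED_CHILDREN = {"tag", "audio", "gloss"}
REQUIRED_ATTRIBUTES = {"id", "genre", "date", "version"}


def check_entry(xml_text):
    """Return a list of structural problems; an empty list means none were found."""
    root = ET.fromstring(xml_text.strip())
    problems = []
    if root.tag != "entry":
        problems.append(f"root element is <{root.tag}>, expected <entry>")
    missing = REQUIRED_ATTRIBUTES - set(root.attrib)
    if missing:
        problems.append(f"missing attributes: {sorted(missing)}")
    for child in root:
        if child.tag not in ALLOWED_CHILDREN:
            problems.append(f"unexpected element <{child.tag}>")
    return problems


print(check_entry(ENTRY_XML))
print(check_entry('<entry id="1"><video /></entry>'))
```

The second call reports the missing attributes and the unexpected `video` element, which is the kind of feedback a schema-aware tool could give before a malformed entry enters the corpus.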

If you are creating a cultural corpus that others can use, it is useful and relatively easy for software to provide a method of citing individual entries or groups of entries. Each entry should have a unique identification, date, source, etc. so that if a user or other tool references an entry, they can specifically cite the data they use. The identification information should never change and should be treated much as a DOI is for academic articles. Another point that Bird and Simons (2003) bring up is the issue of Immutability of the data. Because each entry can be uniquely identified and cited, the data that an identifier points to should not change over time. As a solution they recommend versioning each entry so that whenever its content is modified, its version is also changed. In that way, specific versions of an entry can be cited without having to worry that a future user will see different data than what was intended.
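The versioning idea can be sketched as a store that never overwrites content: every change appends a new version, and a citation names both the identifier and the version. The class name, method names and the `id@vN` citation format below are assumptions for illustration only.

```python
class EntryStore:
    """Append-only store: saving an entry never changes an earlier version."""

    def __init__(self):
        self._history = {}  # entry id -> list of contents; index 0 is version 1

    def save(self, entry_id, content):
        """Store content as a new version and return its version number."""
        versions = self._history.setdefault(entry_id, [])
        versions.append(content)
        return len(versions)

    def get(self, entry_id, version=None):
        """Return a specific version, or the latest one if no version is given."""
        versions = self._history[entry_id]
        if version is None:
            version = len(versions)
        if not 1 <= version <= len(versions):
            raise KeyError(f"{entry_id} has no version {version}")
        return versions[version - 1]

    def cite(self, entry_id, version):
        """Return a stable citation string for an existing version."""
        self.get(entry_id, version)  # raises KeyError if the version does not exist
        return f"{entry_id}@v{version}"


store = EntryStore()
first = store.save("123", "original gloss")
second = store.save("123", "corrected gloss")
print(store.cite("123", first), "->", store.get("123", first))
print(store.cite("123", second), "->", store.get("123", second))
```

A reader who follows the citation for version 1 still sees the original gloss, even after the entry has been corrected.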

Lastly, an issue that the use of computers and software makes much simpler is Rights, or permissions, of the documentation. This essentially means that you can restrict access to certain entries to people with the correct credentials. In the case of communities that do not want outsiders to have access to their language, this could allow only community members to view and reference the documentation contained in the software. Many methods of implementing such permissions are possible. Some examples are simple user accounts that users must log in to, hiding entries unless a certain password or link is provided, and allowing an administrator to approve certain users, to name a few.
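As one hedged example of the administrator-approval approach, the sketch below hides community-only entries from anyone who is not an approved user. The `community_only` and `approved` fields are hypothetical, and real software would also need authentication, which is out of scope here.

```python
def visible_entries(entries, user=None):
    """Return ids of the entries this user may see.

    Entries flagged community_only are shown only to users that an
    administrator has approved; anonymous visitors are treated as unapproved.
    """
    is_member = user is not None and user.get("approved", False)
    return [e["id"] for e in entries if is_member or not e["community_only"]]


entries = [
    {"id": "e1", "community_only": False},
    {"id": "e2", "community_only": True},
]

print(visible_entries(entries))                                         # anonymous visitor
print(visible_entries(entries, {"name": "guest", "approved": False}))   # not yet approved
print(visible_entries(entries, {"name": "member", "approved": True}))   # approved member
```

Defaulting to the restrictive case when no user is given keeps an entry private unless access has been granted explicitly.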

All of these problem areas can be handled fairly efficiently with well-designed software. I believe that the specific areas mentioned above are especially important for the cultural corpus genre, or at least worth considering if implementing such a corpus. To conclude the topic of data portability, it may be valuable to connect to one or more of the online repositories that are mentioned by Bird and Simons (2003). The repositories can be found at and searched with the form at If the software could upload the corpus entries to a repository and automatically update it when content is added or changed, it would go a long way towards providing portability, longevity and accessibility to the documentation.

Use Cases

Now that the scope of the proposed cultural corpus has been made clear, actual use cases can be discussed to show how traditional glossing methods can be improved with this kind of software or document. One great example is the prose in the preface of a Chiwere translation, which is meant to convey the cultural context of the type of story being told and who the speaker is. In this PDF document, the first page contains descriptions that are already well suited to being their own entries in a cultural corpus. The author explains the difference between Wórage and Wéikan, the former being accounts of more factual and recent events and the latter being more like myths (GoodTracks, 1998).

Rather than being inserted at the beginning of the document to inform the reader about this cultural and linguistic practice, these descriptions can each be an individual entry. The description provided can be used the same way, but the actual story, “My Grandmother,” can be a separate yet related entry. In fact, you could categorize any number of other entries as either Wórage or Wéikan and a user could browse for entries of one type or another. Or, if they are not looking for a certain type but are curious about which category an entry falls into, that information could be provided as part of the gloss or metadata. Another entry could be the phrase “Aré gahéda hagú ke” which is the ending of all Wéikan. As an entry, an explanation of the phrase, its meaning or history, or any other relevant information could be provided. Similar entries could be made for the clan of Julia Small, the speaker. Presumably, there would be a lot of cultural information attached to a clan, which would then relate to other entries.

Perhaps just as important as the ability to relate all of this cultural information to the main story is the ability of the user to view it easily at any point. With the physical document or PDF, the user must go back to the first page to view the information. With software, the experience can be drastically improved. One example could be to display the information when the user hovers over or clicks on a linguistic or semantic unit. Another option could be to simply display the information in a different part of the window that can be hidden if the user desires. Many other designs are possible. This type of improvement applies to many types of traditional annotation and glossing, including interlinear text that is limited by physical space.

Another case where a corpus such as this could have been useful is shown by Meek (2011) in the section about the Kaska language. In this chapter Meek gives an account of an elderly woman in the Dene community attempting to impart a story in Kaska. She provides a transcript of a dialogue where the children are not behaving and the adults are trying various strategies to maintain their attention. An entry such as this dialogue could have been glossed with many of the nuances of the dialogue. For instance, within the community it is discouraged for children to communicate with the adults, limiting their interaction. Many stories with children as an audience emphasize this practice, which is a perfect cultural and linguistic piece of information that could be part of the entry’s gloss (Meek, 2011). It may explain some parts of the interaction and provide context for some of what they say.


There are countless ways that language documentation, both old and new, could be used in a cultural corpus. The use of software to implement the genre opens up even more possibilities. As you can see, the specifications of the genre are meant to bring a focus to relating culture and language in a unique way. Usability should be paramount. These specifications include an emphasis on cultural and linguistic relationships, content organization, how to traverse the corpus as a user, glossing, accessibility and portability. The organization should be flexible so that the user can view the content according to their own needs, rather than being bound to what the corpus creator decided. In addition, the content should be traversable in an easy way, including both specific and broad search and filter functionality. As a user, you should be able to obtain a rough overview of a cultural and linguistic topic or a detailed picture of something specific. Some guided tours through the content would also be a helpful aspect to traversing it. In terms of explanations, glossing should be prioritized over translation of the content. Lastly, accessibility and portability are important and should be built into the software so that the user and corpus creator do not have to get bogged down in details. If a printed version is desired, it should be easy to do in any organization. The data should be stored in XML or a similar open format and should be stored in an archive so that it can be easily accessed by other users and tools.

I hope that this document has provided insight into what a cultural corpus might look like and has shown use cases to linguists who work on language documentation. The genre’s goal is to give users a chance to understand language and how it is used within a culture so that they have a full context for when they speak and hear the language. Such software could augment the traditional three pillars of language documentation.


Bird, S. & Simons, G. (2003). Seven Dimensions of Portability for Language Documentation and Description. Language 79(3), 557-582. Linguistic Society of America. Retrieved May 10, 2017, from Project MUSE database.

Debenport, E. (2015). Fixing the Books: Secrecy, Literacy, and Perfectibility in Indigenous New Mexico. Santa Fe: SAR Press. Retrieved May 10, 2017, from Project MUSE database.

GoodTracks, J. G. (1998, January 16). IOWAY-OTOE personal narratives [PDF]. Lawrence: Báxoje~Jiwére Language Project.

With notes on worage, stories during a known time period, based on historical facts.

Meek, B. A. (2011). Growing Up Endangered. In We Are Our Language: An Ethnography of Language Revitalization in a Northern Athabaskan Community (pp. 56-107). Tucson: University of Arizona Press.
