A Crowdsourced Corpus of AAC-like Communications

We used Amazon Mechanical Turk to create a large set of fictional AAC-like communications. Workers were asked to invent communications as if they were using a scanning-style AAC interface for communication. Our AAC corpus contains approximately six thousand communications. We found our crowdsourced collection modeled conversational AAC better than datasets based on telephone conversations or newswire text. We leveraged our crowdsourced messages to intelligently select sentences from much larger sets of Twitter, blog and Usenet data. For details, see our paper.

Below you can download our corpus of communications, some of the test sets we used, and some of our trained language models. Language models are in ARPA text format. If you use this resource in your research, please reference:

Keith Vertanen and Per Ola Kristensson. The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL: 700-711, 2011. PDF BibTeX

We thank Keith Trnka for allowing us to provide the Switchboard test set. We thank Horabail Venkatagiri for allowing us to provide the communication test set. Our specialists test set was created from the phrases suggested by AAC professionals on these pages at the University of Nebraska-Lincoln. The original pages are no longer available.

With the exception of lm_test_switch.txt and lm_test_comm.txt, the resources listed below are licensed under a Creative Commons CC BY 4.0 license.

Corpus:

Communication corpus	2MB	The training, development and test sets of the communication corpus. Contains the word lists we used to build our models. It also contains several of the test sets we used for evaluation.
readme.txt	2K	Description of the files in the corpus (contained in the first file).
sent_train_aac.txt	125K	Training set of 5K communications from 80% of the workers (contained in the first file).
sent_dev_aac.txt	15K	Development set of 551 communications from 10% of the workers (contained in the first file).
sent_test_aac.txt	14K	Test set of 563 communications from 10% of the workers (contained in the first file).
lm_test_switch.txt	3K	Test set of 59 sentences from the Switchboard corpus (contained in the first file).
lm_test_comm.txt	12K	Test set of 251 sentences written in response to hypothetical communication situations (contained in the first file).

Language models:

2-gram mixture	24MB	2-gram LM, cross-entropy difference selection using TurkDev optimal thresholds.
3-gram mixture	122MB	3-gram LM, cross-entropy difference selection using TurkDev optimal thresholds.
4-gram mixture	316MB	4-gram LM, cross-entropy difference selection using TurkDev optimal thresholds.
2-gram mixture, small	14MB	Small 2-gram LM, cross-entropy difference selection but throwing away more data.
3-gram mixture, small	63MB	Small 3-gram LM, cross-entropy difference selection but throwing away more data.
4-gram mixture, small	156MB	Small 4-gram LM, cross-entropy difference selection but throwing away more data.
2-gram mixture, large	42MB	Large 2-gram LM, cross-entropy difference selection but keeping more data.
3-gram mixture, large	232MB	Large 3-gram LM, cross-entropy difference selection but keeping more data.
4-gram mixture, large	635MB	Large 4-gram LM, cross-entropy difference selection but keeping more data.
4-gram mixture, tiny	98MB	Tiny 4-gram LM, cross-entropy difference selection but throwing away even more data.