We used Amazon Mechanical Turk to create a large set of fictional AAC-like communications. Workers were asked to invent messages as if they were communicating with a scanning-style AAC interface. The resulting AAC corpus contains approximately six thousand communications. We found that this crowdsourced collection modeled conversational AAC better than datasets based on telephone conversations or newswire text. We then leveraged our crowdsourced messages to select well-matched sentences from much larger sets of Twitter, blog, and Usenet data using cross-entropy difference selection (sketched below). For details, see our paper.
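For readers unfamiliar with cross-entropy difference selection, here is a minimal sketch of the idea: score each candidate sentence with an in-domain language model and an out-of-domain language model, and keep the sentence if the per-word cross-entropy difference falls below a threshold tuned on a development set. The sketch below uses the kenlm Python bindings; the model file names, input/output paths, and threshold value are illustrative assumptions, not part of the released resources.

```python
# Sketch of cross-entropy difference data selection, assuming two n-gram
# models loadable by kenlm (pip install kenlm). All file names and the
# threshold value below are hypothetical placeholders.
import kenlm

in_domain = kenlm.Model("aac_in_domain.arpa")    # hypothetical LM trained on the AAC corpus
out_domain = kenlm.Model("background.arpa")      # hypothetical LM trained on the raw source text

def xent_diff(sentence: str) -> float:
    """Per-word cross-entropy difference; lower means more in-domain-like."""
    n = len(sentence.split()) + 1  # +1 for the end-of-sentence token
    # kenlm's score() returns a total log10 probability, so negating and
    # normalizing by length gives a per-word cross-entropy (in log10 units).
    h_in = -in_domain.score(sentence, bos=True, eos=True) / n
    h_out = -out_domain.score(sentence, bos=True, eos=True) / n
    return h_in - h_out

THRESHOLD = 0.2  # would be tuned on a development set (e.g., TurkDev); value is made up

with open("candidate_sentences.txt") as src, open("selected.txt", "w") as dst:
    for line in src:
        sentence = line.strip()
        if sentence and xent_diff(sentence) < THRESHOLD:
            dst.write(sentence + "\n")
```

Tightening the threshold keeps less (but more in-domain-like) data; loosening it keeps more, which is the trade-off behind the small, large, and tiny model variants listed below.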
Below you can download our corpus of communications, some of the test sets we used, and some of our trained language models. The language models are in ARPA text format; a short example of loading and evaluating them appears after the listings below. If you use this resource in your research, please reference:
We thank Keith Trnka for allowing us to provide the Switchboard test set. We thank Horabail Venkatagiri for allowing us to provide the communication test set. Our specialists test set was created from phrases suggested by AAC professionals on a set of pages at the University of Nebraska-Lincoln; the original pages are no longer available.
With the exception of lm_test_switch.txt and lm_test_comm.txt, the resources listed below are licensed under a Creative Commons Attribution-NoDerivs 3.0 Unported License.
Corpus:

| Size | Description |
|------|-------------|
| 2MB | The training, development, and test sets of the communication corpus, along with the word lists we used to build our models and several of the test sets we used for evaluation. |
| 2K | Description of the files in the corpus (contained in the first file). |
| 125K | Training set of 5K communications from 80% of the workers (contained in the first file). |
| 15K | Development set of 551 communications from 10% of the workers (contained in the first file). |
| 14K | Test set of 563 communications from 10% of the workers (contained in the first file). |
| 3K | Test set of 59 sentences from the Switchboard corpus, lm_test_switch.txt (contained in the first file). |
| 12K | Test set of 251 sentences written in response to hypothetical communication situations, lm_test_comm.txt (contained in the first file). |
Language models:

| Size | Description |
|------|-------------|
| 24MB | 2-gram LM, cross-entropy difference selection using TurkDev optimal thresholds. |
| 122MB | 3-gram LM, cross-entropy difference selection using TurkDev optimal thresholds. |
| 316MB | 4-gram LM, cross-entropy difference selection using TurkDev optimal thresholds. |
| 14MB | Small 2-gram LM, cross-entropy difference selection but discarding more data. |
| 63MB | Small 3-gram LM, cross-entropy difference selection but discarding more data. |
| 156MB | Small 4-gram LM, cross-entropy difference selection but discarding more data. |
| 42MB | Large 2-gram LM, cross-entropy difference selection but keeping more data. |
| 232MB | Large 3-gram LM, cross-entropy difference selection but keeping more data. |
| 635MB | Large 4-gram LM, cross-entropy difference selection but discarding even more data. |
| 98MB | Tiny 4-gram LM, cross-entropy difference selection but discarding even more data. |