The full dataset is 42 GB and should be downloaded with gsutil; instructions are given below.
The complete NQ dataset contains the HTML of the Wikipedia pages that were shown to annotators. Many participants will only want to use the extracted text, so we have also provided a simplified version of the training data that is only 4 GB. The development and test sets are provided only in the original full NQ format, but we have provided a utility for mapping from the full NQ format to the simplified format. If you use the simplified format, you should use this utility to simplify all of the examples that will be passed to your submitted model at test time.
Natural Questions is released under the Creative Commons Share-Alike 3.0 license. If you want to explore the data format quickly, you can look at 200 samples of the train set and the dev set with our standalone browser.
To download all of the data in the original format, first install gsutil. Note that there is an option to install gsutil as a standalone tool if you don't want to download the Google Cloud SDK. Then run:
gsutil -m cp -R gs://natural_questions/v1.0 <path to your data directory>
This will download the full 41 GB training set, the development set (1 GB), and the samples described above.
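If you prefer to start the download from a script, the command above can be wrapped in a few lines of Python. This is only a sketch: the function names build_download_command and download_nq are hypothetical, and it assumes gsutil is already installed and on your PATH.

```python
import shutil
import subprocess
from typing import List

# Bucket path taken from the gsutil command above.
NQ_BUCKET = "gs://natural_questions/v1.0"


def build_download_command(data_dir: str) -> List[str]:
    """Return the gsutil invocation as an argument list (hypothetical helper)."""
    return ["gsutil", "-m", "cp", "-R", NQ_BUCKET, data_dir]


def download_nq(data_dir: str) -> None:
    """Run the download; raises if gsutil is missing or exits with an error."""
    if shutil.which("gsutil") is None:
        raise RuntimeError("gsutil was not found on PATH; install it first")
    subprocess.run(build_download_command(data_dir), check=True)
```

Calling download_nq with your data directory runs the same command as above, so expect it to transfer the full 42 GB.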
Natural Questions contains 307K training examples, 8K examples for development, and a further 8K examples for testing.
In the paper, we demonstrate a human upper bound of 87% F1 on the long answer selection task, and 76% on the short answer selection task.
We believe that matching human performance on this task will require significant progress in natural language understanding; we encourage you to help make this happen.
For a full description of the methodology used to create Natural Questions, please refer to Natural Questions: a Benchmark for Question Answering Research.
The data are released as gzipped JSON Lines files. Each JSON record contains a single question, a rendered Wikipedia page, a tokenized representation of the text on that page, and the annotations added by our annotators. Each training example has a single annotation, from a single annotator. Examples in the development and test sets have five annotations, from five different annotators. A more complete description of the data format is given on the Natural Questions GitHub page.
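Because each file is gzipped with one JSON record per line, it can be streamed without loading everything into memory. The sketch below uses only the Python standard library; the helper names iter_jsonl_gz and peek_keys are hypothetical, and no field names are assumed, so it simply reports the top-level keys of the first record.

```python
import gzip
import json
from typing import Iterator, List


def iter_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one parsed JSON record per non-empty line of a gzipped file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def peek_keys(path: str) -> List[str]:
    """Return the sorted top-level keys of the first record, or [] if empty."""
    for record in iter_jsonl_gz(path):
        return sorted(record)
    return []
```

Running peek_keys on one of the downloaded files is a quick way to see the record layout before consulting the full format description.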
To help you navigate NQ, we have provided a data browser that you can run on your own machine. We have also released some baseline models and data utilities in Google AI Language's open source repository.