Natural Questions contains 307K training examples, 8K examples for development, and a further 8K examples for testing.

In the paper, we demonstrate a human upper bound of 87% F1 on the long answer selection task and 76% F1 on the short answer selection task.

We believe that matching human performance on this task will require significant progress in natural language understanding; we encourage you to help make this happen.

Natural Questions Data

For a full description of the methodology used to create Natural Questions, please refer to the paper Natural Questions: a Benchmark for Question Answering Research. The data are released as gzipped JSON Lines files. Each JSON record contains a single question, a rendered Wikipedia page, a tokenized representation of the text on that page, and the annotations added by our annotators. Each training example has a single annotation from a single annotator. Examples in the development and test sets each have five annotations from five different annotators. A more complete description of the data format is given on the Natural Questions GitHub page.
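
Because the data are distributed as gzipped JSON Lines, a file can be streamed one record at a time without decompressing it to disk first. The following is a minimal sketch using only the Python standard library, not the official data utilities. The helper names `read_gzipped_jsonl` and `count_annotations` are hypothetical, and the field name `annotations` is an assumption; check the data format description for the actual keys.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def read_gzipped_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON record per non-empty line of a gzipped JSON Lines file."""
    # "rt" opens the gzip stream in text mode so each line is a str.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_annotations(record: Dict[str, Any], key: str = "annotations") -> int:
    """Return how many annotations a record carries.

    The default key name is an assumption, not confirmed by the text above.
    Records without the key count as zero.
    """
    return len(record.get(key, []))
```

With a file in hand, `for record in read_gzipped_jsonl(path): ...` processes one example at a time, which keeps memory use low even though each record includes a full rendered Wikipedia page. If the assumed key is right, `count_annotations` should return 1 for training examples and 5 for development and test examples, matching the description above.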

To help you navigate NQ, we have provided a data browser that you can run on your own machine. We have also released some baseline models and data utilities in Google AI Language's open-source repository.

