
An Analysis of Online Datasets Using Dataset Search (Published, in Part, as a Dataset)

There are tens of millions of datasets on the web, with content ranging from sensor data and government records to the results of scientific experiments and business reports. Indeed, there are datasets for almost anything one can imagine, be it diets of emperor penguins or where remote workers live. More than two years ago, we undertook an effort to design a search engine that would provide a single entry point to these millions of datasets and thousands of repositories. The result is Dataset Search, which we launched in beta in 2018 and fully launched in January 2020. In addition to facilitating access to data, Dataset Search reconciles and indexes datasets using the schema.org metadata descriptions that come directly from the dataset web pages.

As of today, the complete Dataset Search corpus contains more than 31 million datasets from more than 4,600 internet domains. About half of these datasets come from .com domains, but .org and governmental domains are also well represented. The graph below shows the growth of the corpus over the last two years, and while we still don’t know what fraction of datasets on the web are currently in Dataset Search, the number continues to grow steadily.

Growth in the number of datasets indexed by Dataset Search

To better understand the breadth and utility of the datasets made available through Dataset Search, we published “Google Dataset Search by the Numbers”, accepted at the 2020 International Semantic Web Conference. Here we provide an overview of the available datasets, present metrics and insights from their analysis, and suggest best practices for publishing future scientific datasets. In order to enable other researchers to build analyses and tools using the metadata, we are also making a subset of the data publicly available.

A Range of Dataset Topics
In order to determine the distribution of topics covered by the datasets, we infer the research category based on dataset titles and descriptions, as well as other text on the dataset web pages. The two most common topics are geosciences and social sciences, which account for roughly 45% of the datasets. Biology is a close third at ~15%, followed by a roughly even distribution for other topics, including computer science, agriculture, and chemistry, among others.

Distribution of dataset topics
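
To make the idea of text-based topic inference concrete, here is a deliberately simplistic sketch of categorizing a dataset from its title and description; the keyword lists, category names, and the infer_topic helper are illustrative stand-ins, not the classification approach actually used for the analysis above.

```python
# A toy stand-in for inferring a dataset's research category from its title and
# description. The categories and keyword lists are illustrative only.
TOPIC_KEYWORDS = {
    "geosciences": ["climate", "geology", "ocean", "seismic"],
    "social sciences": ["census", "survey", "income", "education"],
    "biology": ["genome", "species", "protein", "cell"],
}

def infer_topic(title: str, description: str) -> str:
    """Return the category whose keywords best match the dataset's text."""
    text = f"{title} {description}".lower()
    scores = {topic: sum(word in text for word in words)
              for topic, words in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

print(infer_topic("Emperor penguin diet observations", "Counts of prey species by colony"))
```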

In our initial efforts to launch Dataset Search, we reached out to specific communities, which was key to bootstrapping widespread use of the corpus. Initially, we focused on geosciences and social sciences, but since then, we have allowed the corpus to grow organically. We were surprised to see that the fields associated with the communities we reached out to early on still dominate the corpus. While their early involvement certainly contributes to their prevalence, there may be other factors involved, such as differences in culture across communities. For instance, the geosciences have been particularly successful in making their data findable, accessible, interoperable, and reusable (FAIR), a core component of reducing barriers to access.

Making Data Easily Citable and Reusable
There is a growing consensus among researchers across scientific disciplines that it is important to make datasets available, to publish details relevant to their use, and to cite them when they are used. Many funding agencies and academic publishers require proper publication and citation of data.

Peer-reviewed journals such as Nature Scientific Data are dedicated to publishing valuable datasets, and efforts such as DataCite provide digital object identifiers (DOIs) for them. Resolution services (e.g., identifiers.org) also provide persistent, dereferenceable identifiers, allowing for easy citation, which is key to making datasets widely available in scientific discourse. Unfortunately, we found that only about 11% of the datasets in the corpus (~3M) have DOIs. We chose this subset of the corpus to be included in our open-source release. From this collection, about 2.4M datasets come from just two sites, figshare.com and datacite.org:

Domain                     Datasets with DOIs
figshare.com                           1,301K
datacite.org                           1,070K
narcis.nl                                118K
openaire.eu                              100K
datadiscoverystudio.org                   72K
osti.gov                                  63K
zenodo.org                                50K
researchgate.net                          41K
da-ra.de                                  40K

Publishers can specify access requirements for a dataset via schema.org metadata properties, including details of the license and whether or not the dataset is accessible for free. Only 34% of datasets specify license information; when no license is specified, users cannot assume that they are allowed to reuse the data. Thus, adding licensing information, ideally under as open a license as possible, will greatly improve the reusability of the data.

Among the datasets that did specify a license, we were able to recognize a known license in 72% of cases. These include the Open Government licenses for the UK and Canada, Creative Commons licenses, and several public domain licenses (e.g., Public Domain Mark 1.0). We found that 89.5% of these datasets are accessible for free, use a license that allows redistribution, or both. Of these open datasets, 5.6M (91%) allow commercial reuse.

Another critical component of data reusability is providing downloadable data, yet only 44% of datasets specify download information in their metadata. A possible explanation for this surprisingly low value is that webmasters (or dataset-hosting platforms) fear that exposing the data download link through schema.org metadata may lead search engines or other applications to give their users direct access to download the data, thus “stealing” traffic from their website. Another concern may be that data needs the proper context to be used appropriately (e.g., methodology, footnotes, and license information), and providers feel that only their web pages can give the complete picture. In Dataset Search, we do not show download links as part of dataset metadata so that users must go to the publisher’s website to download the data, where they will see the full context for the dataset.
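
To make these metadata fields concrete, the snippet below sketches the kind of schema.org/Dataset description a publisher might embed in a page; the property names (license, isAccessibleForFree, distribution, contentUrl) are standard schema.org terms, while the dataset itself and its URLs are hypothetical.

```python
import json

# A hypothetical schema.org/Dataset record. The property names are standard
# schema.org vocabulary; the dataset, license choice, and URLs are made up.
dataset_jsonld = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Emperor penguin diet observations (example)",
    "description": "Illustrative metadata record for a fictional dataset.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "isAccessibleForFree": True,
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data/penguin-diets.csv",
    },
}

# Publishers typically embed this JSON-LD in a <script type="application/ld+json">
# tag on the dataset's landing page so that crawlers can pick it up.
print(json.dumps(dataset_jsonld, indent=2))
```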

What Do Users Access?
Finally, we examine how Dataset Search is being used. Overall, 2.1M unique datasets from 2.6K domains appeared in the top 100 Dataset Search results over 14 days in May 2020. We find that the distribution of topics being queried is different from that of the corpus as a whole. For instance, geoscience takes up a much smaller fraction, and conversely, biology and medicine represent a larger fraction relative to their share of the corpus. This result is likely explained by the timing of our analysis, as it was performed during the first weeks of the COVID-19 pandemic.

Distribution of topics covered by datasets that appear in search results

Best Practices for Publishing Scientific Datasets
Based on our analysis, we have identified a set of best practices that can improve how datasets are discovered, reused and cited.

  • Discoverability
    To improve discoverability, dataset metadata should be on pages that are accessible to web crawlers and should be provided in machine-readable formats such as schema.org markup.

  • Persistence
    Publishing metadata on sites that are likely to be more persistent than personal web pages will facilitate data reuse and citation. Indeed, during our analysis of Dataset Search, we noted a very high rate of turnover — many URLs that hosted a dataset one day did not have it a few weeks or months later. Data repositories, such as Figshare, Zenodo, DataDryad, Kaggle Datasets and many others, are a good way to ensure dataset persistence. Many of these repositories have agreements with libraries to preserve data in perpetuity.

  • Provenance
    With datasets often published in multiple repositories, it would be useful for repositories to describe provenance more explicitly in the metadata. Provenance information helps users understand who collected the data, where its primary source is, and how it might have changed.

  • Licensing
    Datasets should include licensing information, ideally in a machine-readable format. Our analysis indicates that when dataset providers select a license, they tend to choose a fairly open one. So, encouraging and enabling scientists to choose licenses for their data will result in many more datasets being openly available.

  • Assigning persistent identifiers (such as DOIs)
    DOIs are critical for long-term tracking and usability. Not only do these identifiers allow for much easier citation of datasets and version tracking, they are also dereferenceable: if a dataset moves, the identifier can point to the new location, as illustrated in the sketch below.
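
As a small illustration of what “dereferenceable” means in practice, the sketch below resolves a DOI through the doi.org resolver and reports the landing page it currently points to; the DOI string shown is hypothetical, so substitute a real one to try it.

```python
import requests

def resolve_doi(doi: str) -> str:
    """Follow the doi.org redirect chain and return the current landing-page URL."""
    response = requests.get(f"https://doi.org/{doi}", allow_redirects=True, timeout=10)
    response.raise_for_status()
    return response.url

# Hypothetical identifier used only for illustration; replace it with a real DOI.
print(resolve_doi("10.1234/example-dataset"))
```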

Releasing Metadata for Datasets with Persistent Identifiers
As part of the announcement today, we are also releasing a subset of our corpus for others to use. It contains the metadata for more than three million datasets that have DOIs or other types of persistent identifiers; these are the datasets that are most easily citable. Researchers can use this metadata to perform deeper analyses or to build their own applications. For example, much of the growth in DOI usage appears to have occurred within the last decade. How does this timeframe relate to the datasets covered in the corpus? Is DOI usage distributed uniformly across datasets, or are there significant differences between research communities?

We will update the dataset on a regular basis. Finally, we hope that focusing this data release on datasets with persistent citable identifiers will encourage more data providers to describe their datasets in more detail and to make them more easily citable.

In conclusion, we hope that making data more discoverable through tools such as Google's Dataset Search will encourage scientists to share their data more broadly and to do so in a way that makes data truly FAIR.

Acknowledgments
This post reflects the work of the entire Dataset Search team. We are grateful to Shiyu Chen, Dimitris Paparas, Katrina Sostek, Yale Cong, Marc Najork, and Chris Gorgolewski for their contributions. We would also like to thank Hal Varian for suggesting this analysis and for many helpful ideas.

Tackling Open Challenges in Offline Reinforcement Learning

Over the past several years, there has been a surge of interest in reinforcement learning (RL) driven by its high-profile successes in game playing and robotic control. However, unlike supervised learning methods, which learn from massive datasets that are collected once and then reused, RL algorithms use a trial-and-error feedback loop that requires active interaction during learning, collecting data every time a new policy is learned. This approach is prohibitive in many real-world settings, such as healthcare, autonomous driving, and dialogue systems, where trial-and-error data collection can be costly, time consuming, or even irresponsible. Even for problems where some active data collection can be used, the requirement for interactive collection limits dataset size and diversity.

Offline RL (also called batch RL or fully off-policy RL) relies solely on a previously collected dataset, without further interaction with the environment. It provides a way to utilize previously collected datasets — from past RL experiments, from human demonstrations, and from hand-engineered exploration strategies — in order to automatically learn decision-making strategies. While off-policy RL algorithms can, in principle, be used in this fully offline setting, they are generally only successful when allowed active environment interaction: without this direct feedback, they often perform poorly in practice. Consequently, while offline RL has enormous potential, that potential cannot be reached without resolving significant algorithmic challenges.

In “Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems”, we provide a comprehensive tutorial on approaches for tackling the challenges of offline RL and discuss the many issues that remain. To address these issues, we have designed and released an open-source benchmarking framework, Datasets for Deep Data-Driven Reinforcement Learning (D4RL), as well as a new, simple, and highly effective offline RL algorithm, called conservative Q-learning (CQL).

Benchmarks for Offline RL
In order to understand the capabilities of current approaches and to guide future progress, it is first necessary to have effective benchmarks. A common choice in prior work was to simply use data generated by a successful online RL run. However, this data collection approach is artificial, because it involves training an online RL agent, which, as discussed previously, is prohibitive in many real-world settings. Instead, one wishes to learn a policy that is better than the current best from diverse data sources that provide good coverage of the task. For example, one might have data collected from a hand-designed controller of a robot arm and use offline RL to train an improved controller. To enable progress in this field under realistic settings, one needs a benchmark suite that accurately reflects these settings, while being simple and accessible enough to enable rapid experimentation.

D4RL provides standardized environments, datasets, and evaluation protocols, as well as reference scores for recent algorithms, to help accomplish this. It is a “batteries-included” resource, making it ideal for anyone to jump in and get started with minimal fuss.
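
As a sketch of what “minimal fuss” looks like, the snippet below loads one of the D4RL environments and its bundled dataset, following the pattern from the project's documentation; it assumes the d4rl and gym packages (with their MuJoCo dependencies) are installed, and maze2d-umaze-v1 is just one of the many registered task names.

```python
import gym
import d4rl  # Importing d4rl registers its offline RL environments with gym.

# One of the maze navigation tasks described below; many other tasks are registered.
env = gym.make("maze2d-umaze-v1")

# Each environment comes bundled with its offline dataset of logged transitions.
dataset = env.get_dataset()
print(dataset["observations"].shape)  # N x observation_dim array

# qlearning_dataset reorganizes the data into (s, a, r, s', done) arrays,
# the format most offline Q-learning implementations expect.
transitions = d4rl.qlearning_dataset(env)
print(list(transitions.keys()))
```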

Environments in D4RL

The key design goal for D4RL was to develop tasks that reflect both real-world dataset challenges and real-world applications. Previous benchmark datasets were collected either from random agents or from agents trained with RL. Instead, by thinking through potential applications in autonomous driving, robotics, and other domains, we considered how real-world applications of offline RL might require handling data generated from human demonstrations or hard-coded controllers, data collected from heterogeneous sources, and data collected by agents with a variety of different goals.

Aside from the widely used MuJoCo locomotion tasks, D4RL includes datasets for more complex tasks. For example, the Adroit domain, which requires manipulating a realistic robotic hand to use a hammer, illustrates the challenges of working with limited human demonstrations, without which these tasks are extremely challenging. Previous work found that existing datasets could not distinguish between competing methods, whereas the Adroit domain reveals clear differences between them.

Another common scenario for real-world tasks is one in which the dataset used for training is collected from agents performing a wide range of other activities that are related to, but not specifically targeted towards, the task of interest. For example, data from human drivers may illustrate how to drive a car well, but do not necessarily show how to reach a specific desired destination. In this case, one might like offline RL methods to “stitch” together parts of routes in the driving dataset to accomplish a task that was not actually seen in the data (i.e., navigation). As an illustrative example, given paths labeled “A” and “B” in the picture below, offline RL should be able to “remix” them to produce path C.

Although only paths A and B are observed in the data, they can be combined to form a shortest path (C).

We constructed a series of increasingly difficult tasks to exercise this “stitching” ability. The maze environments, shown below, use two robots (a simple ball and an “Ant” robot) that must navigate to locations in a series of mazes.

Maze navigation environments in D4RL, which require “stitching” parts of paths to accomplish new navigational goals that were not seen in the dataset.

A more complex “stitching” scenario is provided by the Franka kitchen domain (based on the Adept environment), where demonstrations from humans using a VR interface comprise a multi-task dataset, and offline RL methods must again “remix” this data.

The “Franka kitchen” domain requires using data from human demonstrators performing a variety of different tasks in a simulated kitchen.

Finally, D4RL includes two tasks that are meant to more accurately reflect potential realistic applications of offline RL, both based on existing driving simulators. One is a first-person driving dataset that utilizes the widely used CARLA simulator developed at Intel, which provides photo-realistic images in realistic driving domains, and the other is a dataset from the Flow traffic control simulator (from UC Berkeley), which requires controlling autonomous vehicles to facilitate effective traffic flow.

D4RL includes datasets based on existing realistic simulators for driving with CARLA (left) and traffic management with Flow (right).

We have packaged these tasks and standardized datasets into an easy-to-use Python package to accelerate research. Furthermore, we provide benchmark numbers for all tasks using relevant prior methods (BC, SAC, BEAR, BRAC, AWR, BCQ), in order to baseline new approaches. We are not the first to propose a benchmark for offline RL: a number of prior works have proposed simple datasets based on running RL algorithms, and several more recent works have proposed datasets with image observations and other features. However, we believe that the more realistic dataset composition in D4RL makes it an effective way to drive progress in the field.
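
When comparing a new method against the reference scores, D4RL's environments expose a normalization helper; the sketch below assumes the same maze2d-umaze-v1 task as earlier and a hypothetical evaluation return.

```python
import gym
import d4rl  # noqa: F401  (importing registers the environments)

env = gym.make("maze2d-umaze-v1")

# Hypothetical average undiscounted return obtained by evaluating a trained policy.
raw_return = 55.0

# D4RL normalizes returns so that 0 corresponds to a random policy and 100 to a
# reference expert-level score, making numbers comparable across tasks.
print(env.get_normalized_score(raw_return) * 100)
```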

An Improved Algorithm for Offline RL
As we developed the benchmark tasks, we found that existing methods could not solve the more challenging tasks. The central challenge arises from a distributional shift: in order to improve over the historical data, offline RL algorithms must learn to make decisions that differ from the decisions taken in the dataset. However, this can lead to problems when the consequences of a seemingly good decision cannot be deduced from the data — if no agent has taken this particular turn in the maze, how does one know if it leads to the goal or not? Without handling this distributional shift problem, offline RL methods can extrapolate erroneously, making over-optimistic conclusions about the outcomes of rarely seen actions. Contrast this with the online setting, where reward bonuses modeled after curiosity and surprise optimistically bias the agent to explore all potentially rewarding paths. Because the agent receives interactive feedback, if the action turns out to be unrewarding, then it can simply avoid the path in the future.
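
As a toy illustration of this failure mode (our own construction, not an example from the paper), the sketch below fits a simple trend to the values of the only actions a behavior policy ever took and then acts greedily on the extrapolated estimates; all numbers are made up.

```python
import numpy as np

# True (unknown) values of 10 discrete actions in some state. The behavior policy
# only ever tried actions 0-4, whose values happen to increase; the untried
# actions 5-9 are actually poor choices.
true_q = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 0.3, 0.2, 0.2, 0.1, 0.0])
observed_actions = np.arange(5)
observed_values = true_q[observed_actions]

# A naive learner fits a global trend to the observed values and extrapolates it
# to actions that never appear in the dataset.
slope, intercept = np.polyfit(observed_actions, observed_values, deg=1)
estimated_q = slope * np.arange(10) + intercept

greedy = int(np.argmax(estimated_q))
print(f"greedy action: {greedy}, estimated value: {estimated_q[greedy]:.2f}, "
      f"true value: {true_q[greedy]:.2f}")
# The greedy choice is an untried action whose value is wildly overestimated,
# and without online feedback nothing ever corrects the mistake.
```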

To address this, we developed conservative Q-learning (CQL), an offline RL algorithm designed to guard against overestimation while avoiding the explicit construction of a separate behavior model and without using importance weights. While standard Q-learning (and actor-critic) methods bootstrap from previous estimates, CQL is unique in that it is fundamentally a pessimistic algorithm: it assumes that if a good outcome was not seen for a given action, that action is likely not a good one. The central idea of CQL is to learn a lower bound on the policy’s expected return (the Q-function), instead of learning to approximate the expected return directly. If we then optimize our policy under this conservative Q-function, we can be confident that its true value is no lower than this estimate, which prevents errors due to overestimation.
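
To make the idea of a lower bound more concrete, here is a schematic sketch (our paraphrase, not the authors' released implementation) of a CQL-style loss for a discrete-action Q-network in PyTorch; the networks, batch format, and the trade-off weight alpha are placeholders.

```python
import torch
import torch.nn.functional as F

def cql_loss(q_network, target_network, batch, gamma=0.99, alpha=1.0):
    """Schematic CQL-style loss: a standard TD error plus a conservative term
    that pushes down Q-values of actions the dataset does not support."""
    s, a, r, s_next, done = batch  # tensors sampled from the fixed offline dataset

    q_values = q_network(s)                                  # [batch, num_actions]
    q_taken = q_values.gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for dataset actions

    # Standard Q-learning target computed from the logged transitions.
    with torch.no_grad():
        next_q = target_network(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * next_q
    bellman_error = F.mse_loss(q_taken, target)

    # Conservative regularizer: log-sum-exp over all actions minus the value of the
    # dataset action. Minimizing it lowers Q-values for unseen actions relative to
    # seen ones, so the learned Q-function tends to underestimate rather than
    # overestimate.
    conservative_term = (torch.logsumexp(q_values, dim=1) - q_taken).mean()

    return bellman_error + alpha * conservative_term
```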

We found that CQL attains state-of-the-art results on many of the harder D4RL tasks: it outperformed other approaches on the AntMaze and Kitchen tasks and on 6 out of 8 Adroit tasks. In particular, on the AntMaze tasks, which require navigating through a maze with an “Ant” robot, CQL is often the only algorithm able to learn non-trivial policies. CQL also performs well on other tasks, including Atari games. On the Atari tasks from Agarwal et al., CQL outperforms prior methods when data is limited (the “1%” dataset). Moreover, CQL is simple to implement on top of existing algorithms (e.g., QR-DQN and SAC), without the need to train additional neural networks.

Performance of CQL on Atari games with the 1% dataset from Agarwal et al.

Future Thoughts
We are excited about the fast-moving field of offline RL. While we took a first step towards a standard benchmark, there is clearly still room for improvement. We expect that as algorithms improve, we will need to reevaluate the tasks in the benchmark and develop more challenging tasks. We look forward to working with the community to evolve the benchmark and evaluation protocols. Together, we can bring the rich promises of offline RL to real-world applications.

Acknowledgements
This work was carried out in collaboration with UC Berkeley PhD students Aviral Kumar, Justin Fu, and Aurick Zhou, with contributions from Ofir Nachum from Google Research.