November 26th, 2024

Introducing OpenQDC - The Open-Source Hub of ML-Ready Quantum Datasets

We curated and consolidated 40+ quantum mechanics (QM) datasets, covering 1.5 billion geometries across 70 atom species and 250+ QM methods, into a single, accessible hub called OpenQDC. It is open source, and the datasets are available through the OpenQDC Python library. Install it with pip (pip install openqdc) to download and use any of these QM datasets in a single line of code.

GitHub page: https://github.com/valence-labs/openQDC
Website: https://www.openqdc.io/

Machine Learning Interatomic Potentials

Machine learning interatomic potentials (MLIPs) provide a promising alternative to empirical force fields. Trained on QM data, they make it possible to expedite energy and force calculations while maintaining precision comparable to QM. Recent advancements in this field, including models like ANI, TorchANI, AIMNet, and MACE, highlight the potential of MLIPs in accelerating biomolecular and materials discovery. It’s a rapidly growing field with lots of development across novel geometric deep-learning architectures, physics-inspired architectures, physical descriptions, and more.

Challenges with QM Datasets

Developing robust MLIPs requires vast amounts of QM data. Unfortunately, there is a lack of standardized, plug-and-play datasets that can be used to train and test new ML algorithms, hindering the prototyping of new research in this field.  

Existing QM datasets span various methods and different chemical spaces. They are also scattered across several repositories (e.g., QCArchive, ColabFit, NablaDFT, GEOM) and often lack metadata (e.g., level of theory and units), which adds an extra layer of complexity to working with these datasets. This not only hampers the adoption and utility of the data, but also stifles opportunities for collaboration among physicists, chemists, ML experts, and experts in other fields, limiting the progress of ML research.

Introducing OpenQDC

With OpenQDC, we aim to unify and standardize existing, well-known datasets to advance the future of MLIP research. We collected publicly available datasets and computed essential metadata that was missing but necessary for accurate data processing (e.g., energy, distance, and force units, as well as isolated atom energies).

The QM methods and physical units are rigorously annotated and validated. These annotations are used to provide statistics, normalization methods, and unit conversions, making it practical to combine multiple datasets in ways that were previously impossible and to further advance the frontier of MLIP research.
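
To make the unit conversions and the role of isolated atom energies more concrete, here is a minimal, library-independent sketch. It does not use the OpenQDC API: the helper names (convert_energy, convert_distance, formation_energy) are hypothetical, and the energies in the example are made-up placeholders rather than values from any dataset. Only the conversion factors are real physical constants (CODATA 2018).

```python
# Conversion factors from Hartree to other energy units (CODATA 2018).
HARTREE_TO = {
    "hartree": 1.0,
    "ev": 27.211386245988,
    "kcal/mol": 627.5094740631,
    "kj/mol": 2625.4996394799,
}

# Conversion factors from Bohr to other distance units.
BOHR_TO = {"bohr": 1.0, "ang": 0.529177210903}


def convert_energy(value, src, dst):
    """Convert an energy between any two units listed in HARTREE_TO."""
    return value / HARTREE_TO[src] * HARTREE_TO[dst]


def convert_distance(value, src, dst):
    """Convert a distance between any two units listed in BOHR_TO."""
    return value / BOHR_TO[src] * BOHR_TO[dst]


def formation_energy(total_energy, atom_symbols, isolated_atom_energies):
    """Subtract the summed isolated atom energies from a total energy.

    All energies must share the same unit and the same level of theory,
    which is why this metadata has to be annotated per dataset.
    """
    reference = sum(isolated_atom_energies[s] for s in atom_symbols)
    return total_energy - reference


# Placeholder numbers for illustration only (in Hartree).
isolated = {"H": -0.5, "O": -75.0}
e_form = formation_energy(-76.4, ["O", "H", "H"], isolated)
print(convert_energy(e_form, "hartree", "kcal/mol"))
```

Subtracting isolated atom energies is one common normalization: it removes the large per-element offset from total energies so that values from molecules of different sizes sit on a comparable scale.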

The OpenQDC Library

The OpenQDC Python library makes it easy to work with all of the quantum datasets in the hub. It’s a package that aims to provide a simple and efficient way to download, load, and utilize various datasets. You can download datasets with just one line of code.

A simple pythonic API: The simplicity of the Python interface ensures ease of use, making it perfect for quick prototyping.
ML-ready: All you manipulate are torch.Tensor, jax.Array, or numpy.ndarray objects.
Quantum ready: The quantum methods used by the datasets are checked and standardized to provide additional values, useful normalization, and different statistics.
Standardized: The datasets are written in standard and performant formats with annotated metadata like units and labels.
Performance matters: Read and write multiple formats (memmap, zarr, xyz, etc.).
Data: Have access to 1.5+ billion data points.
Open source & extensible: OpenQDC and all its files and datasets are open source, and you can add your own dataset and share it with the community in just a few minutes.
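
For readers unfamiliar with the xyz format mentioned in the list above, the following sketch shows its layout: an atom count, a free-form comment line, then one line per atom with an element symbol and three Cartesian coordinates. The helpers to_xyz and from_xyz are hypothetical and written only for illustration; they are not part of the OpenQDC API, and the water geometry is an approximate example.

```python
def to_xyz(symbols, positions, comment=""):
    """Serialize one geometry as XYZ text (coordinates in angstrom by convention)."""
    lines = [str(len(symbols)), comment]
    for symbol, (x, y, z) in zip(symbols, positions):
        lines.append(f"{symbol} {x:.6f} {y:.6f} {z:.6f}")
    return "\n".join(lines) + "\n"


def from_xyz(text):
    """Parse a single-geometry XYZ string back into symbols, positions, comment."""
    lines = text.splitlines()
    n_atoms = int(lines[0])
    comment = lines[1]
    symbols, positions = [], []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()
        symbols.append(symbol)
        positions.append((float(x), float(y), float(z)))
    return symbols, positions, comment


# Approximate water geometry, used only as an example.
water_symbols = ["O", "H", "H"]
water_positions = [(0.0, 0.0, 0.117), (0.0, 0.757, -0.469), (0.0, -0.757, -0.469)]
print(to_xyz(water_symbols, water_positions, comment="water, illustrative"))
```

Text formats like this are easy to inspect and exchange, while binary formats such as memmap or zarr are generally better suited to fast random access over very large collections.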

Getting Started

Install OpenQDC with pip or conda:

Shell

pip install openqdc
# or, with conda:
conda install openqdc -c conda-forge

You can now download any of our QM datasets with the built-in CLI:

Shell

openqdc download SpiceV2

Or using the Python API:

Python

from openqdc import SpiceV2

# Instantiating the dataset automatically downloads the data
dataset = SpiceV2()

Below is a glimpse of how easy it is to use OpenQDC and how it interfaces with torch and torch_geometric:

Python

from openqdc import MACEOFF
from torch.utils.data import DataLoader

# Load the dataset with distances in angstrom, energies in kJ/mol, and torch tensors
dataset = MACEOFF(distance_unit="ang", energy_unit="kj/mol", array_format="torch")

# Create the dataloader by simply passing the dataset
dataloader = DataLoader(dataset, batch_size=32)

# Do your own magic 
...

Because OpenQDC is framework-agnostic, it can easily be used with torch_geometric. In this case, we use the radius_graph function from torch_cluster to build a graph:

Python

from openqdc import SpiceV2
from torch_cluster import radius_graph
from torch_geometric.loader import DataLoader
from torch_geometric.data import Data

# Convert a dataset item into a torch_geometric graph
def to_pyg_data(x):
    # Connect atoms within a 5 angstrom cutoff; any other technique for
    # building a graph works too (e.g., using the SMILES from the dataset)
    edge_index = radius_graph(x.positions, 5)
    return Data(edge_index=edge_index, **x)

# Use the transform argument to automatically convert each item
ds = SpiceV2(array_format="torch", distance_unit="ang", transform=to_pyg_data)

# Create the pyg dataloader by simply passing the new dataset
loader = DataLoader(ds, batch_size=32, shuffle=True)

# Do your own magic 
...
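
For intuition about what a radius graph is, here is a small pure-Python sketch of the idea: every pair of atoms closer than a cutoff is connected by an edge in both directions. This is not the torch_cluster implementation. The function name radius_graph_edges is hypothetical, and details such as the strict "<" comparison, the absence of a neighbor limit, and the edge ordering are assumptions made for this illustration.

```python
import math


def radius_graph_edges(positions, cutoff):
    """Return [sources, targets] for all ordered atom pairs closer than cutoff.

    Self-loops are excluded, and each undirected pair appears twice
    (i -> j and j -> i), mirroring the usual edge_index layout.
    """
    sources, targets = [], []
    for i, p_i in enumerate(positions):
        for j, p_j in enumerate(positions):
            if i != j and math.dist(p_i, p_j) < cutoff:
                sources.append(i)
                targets.append(j)
    return [sources, targets]


# Three points on a line: the first two are 1 unit apart, the third is far away.
example_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(radius_graph_edges(example_positions, cutoff=5.0))
```

This brute-force version scales quadratically with the number of atoms, which is fine for small molecules; optimized libraries use spatial data structures or GPU kernels for larger systems.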

We hope OpenQDC can be a great resource for the community to advance MLIP research towards a future of training universal potentials with greater generalizability and robustness.  

Please feel free to share your feedback or connect with the Valence Labs team on GitHub, X, LinkedIn, or Valence Portal!
