We curated and consolidated 40+ quantum mechanics (QM) datasets, covering 1.5 billion geometries across 70 atom species and 250+ QM methods, into a single, accessible hub called OpenQDC. The hub is open source, and the datasets are available through the OpenQDC Python library. Install it with pip (pip install openqdc) to download and use any of the QM datasets in a single line of code.
GitHub page: https://github.com/valence-labs/openQDC
Website: https://www.openqdc.io/
Machine learning interatomic potentials (MLIPs) provide a promising alternative to empirical force fields. Trained on QM data, they make it possible to expedite energy and force calculations while maintaining precision comparable to QM. Recent advancements in this field, including models like ANI, TorchANI, AIMNet, and MACE, highlight the potential of MLIPs in accelerating biomolecular and materials discovery. It’s a rapidly growing field with lots of development across novel geometric deep-learning architectures, physics-inspired architectures, physical descriptions, and more.
Developing robust MLIPs requires vast amounts of QM data. Unfortunately, there is a lack of standardized, plug-and-play datasets that can be used to train and test new ML algorithms, hindering the prototyping of new research in this field.
Existing QM datasets span various methods and different chemical spaces. They’re also scattered across several repositories (e.g. QCArchive, ColabFit, NablaDFT, GEOM), often with missing metadata (e.g. level of theory and units), which adds an extra layer of complexity to working with these datasets. This not only hampers the adoption and utility of the data, but also stifles opportunities for collaboration among physicists, chemists, ML experts, and experts in other fields, limiting the progress of ML research.
With OpenQDC, we aim to unify and standardize existing, well-known datasets to advance the future of MLIP research. We collected publicly available datasets and computed essential metadata that was missing but necessary for accurate data processing (e.g. energy, distance, and force units, as well as isolated atom energies).
The QM methods and physical units are rigorously annotated and validated. These annotations are used to compute dataset statistics and to provide normalization methods and unit conversions, which make it possible to combine multiple datasets in ways that were not previously possible and to further advance the frontier of MLIP research.
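To make the role of this metadata concrete, here is a minimal, library-independent sketch of two operations it enables: converting energies between units, and subtracting isolated atom energies from a total energy. It does not use the OpenQDC API. The helper names (convert_energy, atomization_energy) are hypothetical, and the isolated atom energies and the total energy are placeholder values chosen for illustration, not taken from any dataset. Only the unit conversion factors are standard values.

```python
# Standard energy conversion factors from hartree (rounded).
HARTREE_TO = {
    "hartree": 1.0,
    "ev": 27.211386,
    "kcal/mol": 627.509474,
    "kj/mol": 2625.499639,
}

# Placeholder isolated atom energies in hartree (illustrative only;
# real values depend on the QM method used for the dataset).
ISOLATED_ATOM_ENERGY = {"H": -0.5, "O": -75.0}


def convert_energy(value, src, dst):
    """Convert an energy from unit `src` to unit `dst` via hartree."""
    return value / HARTREE_TO[src] * HARTREE_TO[dst]


def atomization_energy(total_energy, symbols):
    """Subtract the summed isolated atom energies (all in hartree)."""
    return total_energy - sum(ISOLATED_ATOM_ENERGY[s] for s in symbols)


# Placeholder total energy for one water geometry, in hartree.
e_total = -76.4
e_atomization = atomization_energy(e_total, ["O", "H", "H"])
print(convert_energy(e_atomization, "hartree", "kcal/mol"))
```

Without the level of theory and the units, neither step can be done reliably: isolated atom energies differ between QM methods, and mixing units silently rescales the training targets.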
The OpenQDC Python library makes it easy to work with all of the quantum datasets in the hub. It provides a simple and efficient way to download, load, and use each dataset, often in a single line of code.
Install OpenQDC with pip or conda:
Unset
pip install openqdc
or
conda install openqdc -c conda-forge
You can now download any of our QM datasets with the built-in CLI:
Unset
openqdc download SpiceV2
Or using the Python API:
Python
from openqdc import SpiceV2
# Automatically download the data
dataset = SpiceV2()
Below is a glimpse of how easy it is to use OpenQDC and how it interfaces with torch and torch_geometric:
Python
# Load the dataset
from openqdc import MACEOFF
from torch.utils.data import DataLoader
# Request positions in angstrom, energies in kJ/mol, and torch tensors
dataset = MACEOFF(distance_unit="ang", energy_unit="kj/mol", array_format="torch")
# Create the dataloader by simply passing the dataset
dataloader = DataLoader(dataset, batch_size=32)
# Do your own magic
...
Because OpenQDC is framework-agnostic, it can also be used with torch_geometric. In this example, we use the radius_graph function from torch_cluster to build a graph for each item:
Python
from openqdc import SpiceV2
from torch_cluster import radius_graph
from torch_geometric.loader import DataLoader
from torch_geometric.data import Data
# Define a function that converts each dataset item into a graph
def to_pyg_data(x):
    # Connect atoms within a cutoff radius of 5 (in the dataset's distance unit);
    # any other graph-building technique works too (or use the SMILES from the dataset)
    edge_index = radius_graph(x.positions, 5)
    return Data(edge_index=edge_index, **x)
# Use the transform attribute to automatically convert your items
ds = SpiceV2(array_format="torch", distance_unit="ang", transform=to_pyg_data)
# Create the pyg dataloader by simply passing the new dataset
loader = DataLoader(ds, batch_size=32, shuffle=True)
# Do your own magic
...
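For readers unfamiliar with radius graphs, the idea can be sketched in plain Python: every pair of distinct atoms closer than a cutoff distance is connected by an edge in both directions. The function radius_graph_py below is a hypothetical illustration of the concept, not the torch_cluster implementation. It ignores options such as neighbor limits and batching, and it returns a list of index pairs rather than an edge_index tensor.

```python
import math


def radius_graph_py(positions, cutoff):
    """Return directed edges (i, j), i != j, for atoms closer than `cutoff`."""
    edges = []
    for i, p in enumerate(positions):
        for j, q in enumerate(positions):
            if i != j and math.dist(p, q) < cutoff:
                edges.append((i, j))
    return edges


# Three atoms on a line, in angstrom: neighbors are 3 apart, the outer two 6 apart.
positions = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
edges = radius_graph_py(positions, cutoff=5.0)
print(edges)  # [(0, 1), (1, 0), (1, 2), (2, 1)]
```

With a cutoff of 5, the two outer atoms are not connected directly, so information between them has to pass through the middle atom. This is why the choice of cutoff matters when building graphs for MLIPs.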
We hope OpenQDC can be a great resource for the community to advance MLIP research towards a future of training universal potentials with greater generalizability and robustness.
Please feel free to share your feedback or connect with the Valence Labs team on GitHub, X, LinkedIn, or Valence Portal!