Molecular Geometry Pretraining
with SE(3)-Invariant Denoising Distance Matching
In Submission
- ^{1}Mila
- ^{2}Université de Montréal
- ^{3}National Research Council Canada
- ^{4}University of Ottawa
- ^{5}HEC Montréal
- ^{6}CIFAR AI Chair
Abstract
Pretraining molecular representations is critical in a variety of applications in drug and material discovery due to the limited number of labeled molecules, yet most existing work focuses on pretraining on 2D molecular graphs. The power of pretraining on 3D geometric structures, however, has been less explored, owing to the difficulty of finding a sufficient proxy task that enables the pretraining to effectively extract essential features from the geometric structures. Motivated by the dynamic nature of 3D molecules, where the continuous motion of a molecule in 3D Euclidean space forms a smooth potential energy surface, we propose a 3D coordinate denoising pretraining framework to model such an energy landscape. Leveraging an SE(3)-invariant score matching method, we propose SE(3)-DDM, in which the coordinate denoising proxy task effectively reduces to denoising the pairwise atomic distances in a molecule. Our comprehensive experiments confirm the effectiveness and robustness of our proposed method. The source code for this paper will be released in the near future.
Problem Formulation: Molecular Geometry Pretraining
Molecules are not static but in continuous motion in 3D Euclidean space, forming a potential energy surface (PES). As shown in the above figure, it is desirable to study a molecule at a local minimum of the PES, called a conformer. However, such a stable-state conformer is often subject to noise for the following reasons.
- First, statistical and systematic errors in conformation estimation are unavoidable.
- Second, it is well acknowledged that a conformer can vibrate around a local minimum of the PES.
Denoising Coordinate Matching
The 3D geometric information, i.e., the atomic coordinates, is critical to molecular properties.
Based on this, we propose a geometry perturbation, which adds small noise to the atom coordinates.
For notation, we define the original geometry graph and an augmented geometry graph as two views, denoted as \(g_1=(X_1, R_1)\) and \(g_2=(X_2, R_2)\) respectively.
The augmented geometry graph can be seen as a coordinate perturbation to the original graph with the same atom types, i.e., \(X_2=X_1\) and \(R_2 = R_1 + \epsilon\), where \(\epsilon\) is drawn from a normal distribution.
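The perturbation above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `perturb_geometry`, the noise scale `sigma`, and the toy three-atom geometry are all assumptions introduced here.

```python
import random

def perturb_geometry(atom_types, coords, sigma=0.05, seed=None):
    """Build the augmented view g2 = (X2, R2) with X2 = X1 and R2 = R1 + eps.

    eps is drawn independently per coordinate from N(0, sigma^2);
    sigma is an illustrative choice, not a value from the source.
    """
    rng = random.Random(seed)
    noisy = [[c + rng.gauss(0.0, sigma) for c in xyz] for xyz in coords]
    return list(atom_types), noisy

# Toy three-atom geometry (illustrative values only).
X1 = ["O", "H", "H"]
R1 = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]

X2, R2 = perturb_geometry(X1, R1, sigma=0.05, seed=0)
```

Atom types are copied unchanged, so only the coordinates differ between the two views.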
The two views defined above share certain common information.
By maximizing the mutual information (MI) between them, we expect the learned representation to better capture the geometric information, be insensitive to noise, and thus generalize well to the target downstream tasks.
To maximize the MI, we turn to maximizing the following lower bound on the two geometry views:
\[
\begin{aligned}
I(G_1; G_2)
& = \mathbb{E}_{p(g_1,g_2)} \Big[ \log \frac{p(g_1,g_2)}{p(g_1) p(g_2)} \Big]
\ge \frac{1}{2} \mathbb{E}_{p(g_1,g_2)} \Big[ \log p(g_1|g_2) + \log p(g_2|g_1) \Big] \triangleq \mathcal{L}_{\text{MI}}.
\end{aligned}
\]
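One way to see where this bound comes from is to split the log-ratio symmetrically. The following is a sketch, under the assumption that the marginal entropy terms \(H(G_1)\) and \(H(G_2)\) are non-negative (which holds for discrete variables; for continuous coordinates they can instead be treated as constants that do not depend on the model):
\[
\begin{aligned}
I(G_1; G_2)
& = \frac{1}{2} \mathbb{E}_{p(g_1,g_2)} \Big[ \log \frac{p(g_1|g_2)}{p(g_1)} + \log \frac{p(g_2|g_1)}{p(g_2)} \Big]\\
& = \frac{1}{2} \mathbb{E}_{p(g_1,g_2)} \Big[ \log p(g_1|g_2) + \log p(g_2|g_1) \Big] + \frac{1}{2} \Big( H(G_1) + H(G_2) \Big)\\
& \ge \mathcal{L}_{\text{MI}},
\end{aligned}
\]
where \(H(G_i) = -\mathbb{E}_{p(g_i)}[\log p(g_i)]\). The remaining task is therefore to estimate and maximize the two conditional log-likelihoods in \(\mathcal{L}_{\text{MI}}\).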
To solve this, we use an energy-based model (EBM) for estimation.
To adapt it for MI maximization in our setting, the lower bound can be turned into:
\[
\begin{aligned}
\mathcal{L}_{\text{Coor-MI}}
& = \frac{1}{2} \mathbb{E}_{p(g_1,g_2)} \Big[ \log p(R_1|g_2) \Big] + \frac{1}{2} \mathbb{E}_{p(g_1,g_2)} \Big[ \log p(R_2|g_1) \Big]\\
& = \frac{1}{2} \mathbb{E}_{p(g_1,g_2)} \Big[ \log \frac{\exp(f(R_1, g_2))}{A_{R_1|g_2}} \Big] + \frac{1}{2} \mathbb{E}_{p(g_1,g_2)} \Big[ \log \frac{\exp(f(R_2, g_1))}{A_{R_2|g_1}} \Big],
\end{aligned}
\]
where the functions \(f(\cdot)\) are the negatives of the energy functions, and \(A_{R_1|g_2}\) and \(A_{R_2|g_1}\) are the intractable partition functions.
The first equality holds because the two views share the same atom types, so only the coordinates remain to be modeled.
This objective can be interpreted as denoising the atom coordinates of one view given the geometry of the other view.
Denoising Distance Matching
Then we adopt an SE(3)-invariant denoising score matching method to get the following equation:
\[
\begin{aligned}
\mathcal{L}_{\text{EBM-SM}} = & \frac{1}{2L} \sum_{l=1}^L \sigma_l^\beta \mathbb{E}_{p_{\text{data}}(d_1|g_2)} \mathbb{E}_{q(\tilde d_1|d_1,g_2)} \Big[ \Big\| \frac{s_\theta(\tilde d_1, g_2)}{\sigma_l} - \frac{d_1 - \tilde d_1}{\sigma_l^2}\Big\|^2_2 \Big] \\
& + \frac{1}{2L} \sum_{l=1}^L \sigma_l^\beta \mathbb{E}_{p_{\text{data}}(d_2|g_1)} \mathbb{E}_{q(\tilde d_2|d_2,g_1)} \Big[\Big\|\frac{s_\theta(\tilde d_2, g_1)}{\sigma_l} - \frac{d_2 - \tilde d_2}{\sigma_l^2} \Big\|^2_2 \Big].
\end{aligned}
\]
Here \(d_1\) and \(d_2\) denote the pairwise atomic distances of the two views, \(\tilde d_1\) and \(\tilde d_2\) their perturbed counterparts, \(\sigma_l\) the \(l\)-th of \(L\) noise levels, and \(s_\theta\) the learned score function.
This transforms the coordinate-aware mutual information maximization into denoising distance matching, which serves as the final objective.
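To make the structure of this objective concrete, the sketch below evaluates a Monte Carlo estimate of both terms on a toy geometry. It is not the authors' implementation. Several pieces are assumptions introduced here: the perturbation kernel \(q\) is taken to be Gaussian with standard deviation \(\sigma_l\), `toy_score` is a hypothetical stand-in for the learned network \(s_\theta\), and the noise levels, \(\beta = 2\), and coordinates are illustrative values.

```python
import math
import random

def pairwise_distances(coords):
    """Flattened upper triangle of the interatomic distance matrix."""
    n = len(coords)
    return [math.dist(coords[i], coords[j])
            for i in range(n) for j in range(i + 1, n)]

def toy_score(noisy_d, cond_d, sigma):
    """Hypothetical stand-in for s_theta(d_tilde, g): pulls the noisy
    distances toward the distances of the conditioning view."""
    return [(c - d) / sigma for d, c in zip(noisy_d, cond_d)]

def ddm_term(d_clean, cond_d, sigmas, score_fn, beta=2.0, n_samples=4, seed=0):
    """Monte Carlo estimate of one of the two terms of the objective.

    Assumes q(d_tilde | d, g) is Gaussian with standard deviation sigma_l.
    """
    rng = random.Random(seed)
    total = 0.0
    for sigma in sigmas:
        acc = 0.0
        for _ in range(n_samples):
            d_noisy = [d + rng.gauss(0.0, sigma) for d in d_clean]
            s = score_fn(d_noisy, cond_d, sigma)
            acc += sum((si / sigma - (d - dn) / sigma ** 2) ** 2
                       for si, d, dn in zip(s, d_clean, d_noisy))
        total += sigma ** beta * acc / n_samples
    return total / (2 * len(sigmas))

# Two views of a toy three-atom geometry (illustrative values only).
R1 = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
R2 = [[0.02, -0.01, 0.0], [0.95, 0.03, 0.01], [-0.25, 0.90, -0.02]]
d1, d2 = pairwise_distances(R1), pairwise_distances(R2)

sigmas = [0.01, 0.05, 0.1]
loss = (ddm_term(d1, d2, sigmas, toy_score)
        + ddm_term(d2, d1, sigmas, toy_score))
```

Because the loss depends on the coordinates only through pairwise distances, rotating or translating either view leaves it unchanged, which is the sense in which the objective is SE(3)-invariant. Note that Gaussian noise can in principle produce negative distances; the sketch does not guard against this.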