Getting Started with ML Force Fields

December 20, 2024

Machine learning force fields (MLFFs) are revolutionizing molecular dynamics simulations by providing near-DFT accuracy at a fraction of the computational cost. In this tutorial, I'll walk you through implementing your first MLFF using the Atomic Simulation Environment (ASE) and MACE (Multi-Atomic Cluster Expansion).


Why Machine Learning Force Fields?

Traditional force fields are fast but often lack accuracy for complex systems. DFT calculations are accurate but computationally expensive for large-scale dynamics. MLFFs bridge this gap by:

  • Learning the potential energy surface directly from DFT reference calculations
  • Retaining near-DFT accuracy for configurations similar to the training data
  • Evaluating energies and forces orders of magnitude faster than DFT, with favorable scaling to larger systems

Prerequisites

Before we start, make sure you have:

  • A recent Python installation (3.8+)
  • Basic familiarity with molecular dynamics and ASE
  • Optionally, a CUDA-capable GPU to speed up training

Install the required packages:

pip install ase mace-torch matplotlib
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
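
Training runs much faster on a GPU. A quick sanity check that PyTorch can actually see yours:

import torch
print(torch.cuda.is_available())  # True means the CUDA build found a GPU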

Step 1: Prepare Training Data

First, let's generate some training data. For this example, we'll use a simple water molecule:

from ase import Atoms, units
from ase.calculators.emt import EMT
from ase.calculators.singlepoint import SinglePointCalculator
from ase.md.velocitydistribution import MaxwellBoltzmannDistribution
from ase.md.verlet import VelocityVerlet
from ase.io import write
import numpy as np

# Create a water molecule
water = Atoms('H2O',
              positions=[[0, 0, 0],
                         [0.76, 0.59, 0],
                         [-0.76, 0.59, 0]])

# For demonstration, we'll use EMT (in practice, use DFT)
water.calc = EMT()

# Run a short MD to generate configurations
# (note the explicit fs units on the timestep)
MaxwellBoltzmannDistribution(water, temperature_K=300)
dyn = VelocityVerlet(water, timestep=0.5 * units.fs)

# Collect configurations
configurations = []
for i in range(100):
    dyn.run(10)
    # Perturb a copy so the noise diversifies the dataset
    # without disturbing the running dynamics
    snapshot = water.copy()
    snapshot.positions += np.random.normal(0, 0.01, snapshot.positions.shape)
    # Recompute and freeze energy/forces so they are stored with the geometry
    snapshot.calc = EMT()
    energy = snapshot.get_potential_energy()
    forces = snapshot.get_forces()
    snapshot.calc = SinglePointCalculator(snapshot, energy=energy, forces=forces)
    configurations.append(snapshot)

# Save training data (extended XYZ keeps energies and forces)
write('training_data.xyz', configurations)

Note: In real applications, you should use DFT calculations (e.g., Quantum ESPRESSO, VASP) to generate accurate reference data.
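
For reference, here is a minimal sketch of what the same labeling loop might look like with a DFT backend through ASE's Quantum ESPRESSO interface. The pseudopotential names and cutoff below are placeholders, not tested settings:

from ase.calculators.espresso import Espresso
from ase.calculators.singlepoint import SinglePointCalculator
from ase.io import read, write

# Hypothetical DFT setup -- adapt pseudopotentials, cutoffs, and k-points
dft = Espresso(
    pseudopotentials={'H': 'H.pbe.UPF', 'O': 'O.pbe.UPF'},
    input_data={'system': {'ecutwfc': 60}},
)

frames = read('training_data.xyz', index=':')
for atoms in frames:
    atoms.calc = dft
    energy = atoms.get_potential_energy()  # runs the DFT calculation
    forces = atoms.get_forces()
    # Freeze the results so they are written alongside the geometry
    atoms.calc = SinglePointCalculator(atoms, energy=energy, forces=forces)

write('training_data_dft.xyz', frames)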


Step 2: Convert Data to MACE Format

MACE's training tools read extended XYZ files directly, but it is instructive to see how frames map onto MACE's internal Configuration objects. Let's prepare our dataset:

from mace.data import Configuration
from ase.io import read
import numpy as np

# Read configurations (energies and forces come along from the extxyz file)
configs = read('training_data.xyz', index=':')

# Convert to MACE Configuration objects
mace_configs = []
for atoms in configs:
    config = Configuration(
        atomic_numbers=atoms.numbers,
        positions=atoms.positions,
        energy=atoms.get_potential_energy(),
        forces=atoms.get_forces(),
        cell=np.array(atoms.cell) if atoms.pbc.any() else None,
        pbc=atoms.pbc,
    )
    mace_configs.append(config)

# Split the ASE frames 80/20 into training and validation
n_train = int(0.8 * len(configs))
train_configs = configs[:n_train]
valid_configs = configs[n_train:]

print(f"Training configurations: {len(train_configs)}")
print(f"Validation configurations: {len(valid_configs)}")

Step 3: Train the MACE Model

Now comes the exciting part: training the MLFF.

# Create a configuration file for MACE training. The keys mirror
# mace_run_train flags; check `mace_run_train --help` for the exact
# names accepted by your MACE version.
config_yaml = """
# MACE training configuration
name: water_mace
seed: 123
model: MACE
r_max: 5.0
num_radial_basis: 8
num_cutoff_basis: 5
max_ell: 3
interaction: RealAgnosticResidualInteractionBlock
num_interactions: 2
hidden_irreps: 32x0e + 32x1o
MLP_irreps: 16x0e
correlation: 3
gate: silu
E0s: average  # estimate isolated-atom energies from the data

# Training settings
batch_size: 10
max_num_epochs: 1000
patience: 50
lr: 0.01
weight_decay: 0.0
energy_weight: 1.0
forces_weight: 100.0
stress_weight: 0.0

# Loss and optimization
loss: weighted
optimizer: adam
scheduler: ReduceLROnPlateau
"""

with open('config.yaml', 'w') as f:
    f.write(config_yaml)

# Train the model using the MACE CLI (recent MACE versions accept a
# YAML config via --config; older ones take the same options as flags)
import subprocess
subprocess.run([
    'mace_run_train',
    '--config', 'config.yaml',
    '--train_file', 'train_data.xyz',
    '--valid_file', 'validation_data.xyz',
    '--model_dir', './mace_model',
    '--device', 'cuda',  # or 'cpu' if no GPU
])
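
Once training completes, the fitted model lands in --model_dir, named after the `name` key in config.yaml (here water_mace.model). MACE also ships a batch-evaluation entry point; a minimal sketch, assuming the flag names below match your version (see `mace_eval_configs --help`):

import subprocess
subprocess.run([
    'mace_eval_configs',
    '--configs', 'validation_data.xyz',
    '--model', 'mace_model/water_mace.model',
    '--output', 'validation_predictions.xyz',
])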

Step 4: Use the Trained Model

Once training is complete, let's use our MLFF for molecular dynamics:

from mace.calculators import MACECalculator
from ase.md.langevin import Langevin
from ase import units
from ase.io.trajectory import Trajectory

# Load the trained model (the file name follows the `name` key in
# config.yaml; older MACE releases use the keyword model_path instead)
calc = MACECalculator(model_paths='mace_model/water_mace.model', device='cuda')

# Create a new water molecule
water_md = Atoms('H2O', 
                 positions=[[0, 0, 0],
                           [0.76, 0.59, 0],
                           [-0.76, 0.59, 0]])
water_md.calc = calc

# Run MD simulation with a Langevin thermostat
dyn = Langevin(water_md,
               timestep=0.5 * units.fs,
               temperature_K=300,
               friction=0.01)  # friction is in inverse ASE time units

# Save trajectory
traj = Trajectory('water_mlff_md.traj', 'w', water_md)
dyn.attach(traj.write, interval=10)

# Run for 1000 steps
dyn.run(1000)

print("MD simulation completed!")

Step 5: Analyze Results

Let's analyze our MLFF simulation and compare with reference data:

import matplotlib.pyplot as plt
import numpy as np
from ase.io import read

# Read trajectory
traj = read('water_mlff_md.traj', index=':')

# Extract energies
energies = [atoms.get_potential_energy() for atoms in traj]
# 0.5 fs timestep, frames written every 10 steps -> 5 fs per frame
times = np.arange(len(energies)) * 0.5 * 10  # time in fs

# Plot the potential energy over time (Langevin is a thermostat, so
# expect fluctuations around equilibrium, not strict energy conservation)
plt.figure(figsize=(10, 6))
plt.plot(times, energies)
plt.xlabel('Time (fs)')
plt.ylabel('Potential Energy (eV)')
plt.title('Potential Energy During MLFF MD')
plt.grid(True)
plt.savefig('energy_conservation.png')

# Calculate and plot bond lengths
def get_bond_lengths(atoms):
    d1 = atoms.get_distance(0, 1)  # O-H1
    d2 = atoms.get_distance(0, 2)  # O-H2
    angle = atoms.get_angle(1, 0, 2)  # H-O-H
    return d1, d2, angle

bond_data = [get_bond_lengths(atoms) for atoms in traj]
oh1_lengths = [d[0] for d in bond_data]
oh2_lengths = [d[1] for d in bond_data]
angles = [d[2] for d in bond_data]

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))

ax1.plot(times, oh1_lengths, label='O-H1')
ax1.plot(times, oh2_lengths, label='O-H2')
ax1.set_xlabel('Time (fs)')
ax1.set_ylabel('Bond Length (Å)')
ax1.legend()
ax1.grid(True)

ax2.plot(times, angles)
ax2.set_xlabel('Time (fs)')
ax2.set_ylabel('H-O-H Angle (degrees)')
ax2.grid(True)

plt.tight_layout()
plt.savefig('structural_analysis.png')
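
Beyond structural monitoring, it's worth a quick accuracy check against the reference calculator on the held-out frames from Step 2. A minimal sketch, assuming the labels stored in validation_data.xyz and the calc object from Step 4:

import numpy as np
from ase.io import read

valid = read('validation_data.xyz', index=':')
ref_e, mlff_e = [], []
for atoms in valid:
    ref_e.append(atoms.get_potential_energy())  # stored reference energy
    test = atoms.copy()
    test.calc = calc  # MACECalculator from Step 4
    mlff_e.append(test.get_potential_energy())

mae = np.mean(np.abs(np.array(ref_e) - np.array(mlff_e)))
print(f"Energy MAE vs reference: {1000 * mae:.2f} meV")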

Best Practices and Tips

  1. Data Quality: The quality of your MLFF depends entirely on your training data. Use well-converged DFT calculations.
  2. Data Diversity: Include various configurations:
    • Different temperatures
    • Compressed and expanded structures
    • Transition states if studying reactions
  3. Validation: Always validate your MLFF against unseen DFT data before production runs.
  4. Active Learning: Use uncertainty quantification to find configurations the model is unsure about, then add DFT data for those; a committee-based sketch follows this list.
  5. Model Selection: Different architectures suit different needs:
    • MACE: strong general-purpose accuracy with good inference speed
    • NequIP: very data-efficient; accurate even with small training sets
    • DeepMD: highly scalable, well suited to large-scale simulations
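
Here is one way to quantify uncertainty with a model committee. A sketch, assuming you have trained several models with different seeds; the .model paths below are placeholders:

import numpy as np
from mace.calculators import MACECalculator

# Placeholder paths: train each model with a different `seed` in config.yaml
committee = [MACECalculator(model_paths=path, device='cpu')
             for path in ['seed1.model', 'seed2.model', 'seed3.model']]

def force_disagreement(atoms):
    """Max standard deviation of committee force predictions (eV/Å)."""
    predictions = []
    for member in committee:
        test = atoms.copy()
        test.calc = member
        predictions.append(test.get_forces())
    return np.std(np.stack(predictions), axis=0).max()

# Configurations with large disagreement are candidates for new DFT labels.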

Common Pitfalls

  • Extrapolation: MLFFs interpolate within their training distribution; configurations far outside it can produce confidently wrong predictions.
  • Insufficient data diversity: a model trained only on near-equilibrium structures will misbehave at high temperatures or under strain.
  • Unit mix-ups: ASE works in eV and Å; double-check units when importing reference data from other codes.
  • Overfitting: monitor the validation error, not just the training loss.

Next Steps

Now that you have a working MLFF, consider:

  • Regenerating the training data with a real DFT code instead of EMT
  • Scaling up to condensed-phase systems such as liquid water or surfaces
  • Building an active-learning loop that adds DFT data where the model is uncertain
  • Starting from a pretrained foundation model such as MACE-MP-0

Resources

  • MACE repository: https://github.com/ACEsuit/mace
  • ASE documentation: https://wiki.fysik.dtu.dk/ase/
  • MACE paper: Batatia et al., "MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields", NeurIPS 2022

Happy simulating! Feel free to reach out with questions or share your MLFF experiences.


#MachineLearning #MolecularDynamics #MACE #ComputationalChemistry #MLForceFields

2025 Yi Cao. CC BY-NC-SA 4.0.