Getting Started with ML Force Fields

December 20, 2024

Machine learning force fields (MLFFs) are revolutionizing molecular dynamics simulations by providing near-DFT accuracy at a fraction of the computational cost. In this tutorial, I'll walk you through implementing your first MLFF using the Atomic Simulation Environment (ASE) and MACE (Multi-Atomic Cluster Expansion).


Why Machine Learning Force Fields?

Traditional force fields are fast but often lack accuracy for complex systems. DFT calculations are accurate but computationally expensive for large-scale dynamics. MLFFs bridge this gap by:

  • Learning the potential energy surface directly from DFT reference calculations
  • Retaining near-DFT accuracy for configurations similar to the training data
  • Evaluating energies and forces orders of magnitude faster than DFT, with favorable scaling to larger systems

Prerequisites

Before we start, make sure you have:

  • A recent Python installation (3.8+)
  • Basic familiarity with molecular dynamics and ASE
  • Optionally, a CUDA-capable GPU to speed up training

Install the required packages:

pip install ase mace-torch matplotlib
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
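
Training runs much faster on a GPU. A quick sanity check that PyTorch can actually see yours:

import torch
print(torch.cuda.is_available())  # True means the CUDA build found a GPU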

Step 1: Prepare Training Data

First, let's generate some training data. For this example, we'll use a simple water molecule:

from ase import Atoms, units
from ase.calculators.emt import EMT
from ase.calculators.singlepoint import SinglePointCalculator
from ase.md.velocitydistribution import MaxwellBoltzmannDistribution
from ase.md.verlet import VelocityVerlet
from ase.io import write
import numpy as np

# Create a water molecule
water = Atoms('H2O',
              positions=[[0, 0, 0],
                         [0.76, 0.59, 0],
                         [-0.76, 0.59, 0]])

# For demonstration, we'll use EMT (in practice, use DFT)
water.calc = EMT()

# Run a short MD to generate configurations
# (note the explicit fs units on the timestep)
MaxwellBoltzmannDistribution(water, temperature_K=300)
dyn = VelocityVerlet(water, timestep=0.5 * units.fs)

# Collect configurations
configurations = []
for i in range(100):
    dyn.run(10)
    # Perturb a copy so the noise diversifies the dataset
    # without disturbing the running dynamics
    snapshot = water.copy()
    snapshot.positions += np.random.normal(0, 0.01, snapshot.positions.shape)
    # Recompute and freeze energy/forces so they are stored with the geometry
    snapshot.calc = EMT()
    energy = snapshot.get_potential_energy()
    forces = snapshot.get_forces()
    snapshot.calc = SinglePointCalculator(snapshot, energy=energy, forces=forces)
    configurations.append(snapshot)

# Save training data (extended XYZ keeps energies and forces)
write('training_data.xyz', configurations)

Note: In real applications, you should use DFT calculations (e.g., Quantum ESPRESSO, VASP) to generate accurate reference data.
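
For reference, here is a minimal sketch of what the same labeling loop might look like with a DFT backend through ASE's Quantum ESPRESSO interface. The pseudopotential names and cutoff below are placeholders, not tested settings:

from ase.calculators.espresso import Espresso
from ase.calculators.singlepoint import SinglePointCalculator
from ase.io import read, write

# Hypothetical DFT setup -- adapt pseudopotentials, cutoffs, and k-points
dft = Espresso(
    pseudopotentials={'H': 'H.pbe.UPF', 'O': 'O.pbe.UPF'},
    input_data={'system': {'ecutwfc': 60}},
)

frames = read('training_data.xyz', index=':')
for atoms in frames:
    atoms.calc = dft
    energy = atoms.get_potential_energy()  # runs the DFT calculation
    forces = atoms.get_forces()
    # Freeze the results so they are written alongside the geometry
    atoms.calc = SinglePointCalculator(atoms, energy=energy, forces=forces)

write('training_data_dft.xyz', frames)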


Step 2: Convert Data to MACE Format

MACE's training tools read extended XYZ files directly, but it is instructive to see how frames map onto MACE's internal Configuration objects. Let's prepare our dataset:

from mace.data import Configuration
from ase.io import read
import numpy as np

# Read configurations (energies and forces come along from the extxyz file)
configs = read('training_data.xyz', index=':')

# Convert to MACE Configuration objects
mace_configs = []
for atoms in configs:
    config = Configuration(
        atomic_numbers=atoms.numbers,
        positions=atoms.positions,
        energy=atoms.get_potential_energy(),
        forces=atoms.get_forces(),
        cell=np.array(atoms.cell) if atoms.pbc.any() else None,
        pbc=atoms.pbc,
    )
    mace_configs.append(config)

# Split the ASE frames 80/20 into training and validation
n_train = int(0.8 * len(configs))
train_configs = configs[:n_train]
valid_configs = configs[n_train:]

print(f"Training configurations: {len(train_configs)}")
print(f"Validation configurations: {len(valid_configs)}")

Step 3: Train the MACE Model

Now comes the exciting part: training the MLFF.

# Create a configuration file for MACE training. The keys mirror
# mace_run_train flags; check `mace_run_train --help` for the exact
# names accepted by your MACE version.
config_yaml = """
# MACE training configuration
name: water_mace
seed: 123
model: MACE
r_max: 5.0
num_radial_basis: 8
num_cutoff_basis: 5
max_ell: 3
interaction: RealAgnosticResidualInteractionBlock
num_interactions: 2
hidden_irreps: 32x0e + 32x1o
MLP_irreps: 16x0e
correlation: 3
gate: silu
E0s: average  # estimate isolated-atom energies from the data

# Training settings
batch_size: 10
max_num_epochs: 1000
patience: 50
lr: 0.01
weight_decay: 0.0
energy_weight: 1.0
forces_weight: 100.0
stress_weight: 0.0

# Loss and optimization
loss: weighted
optimizer: adam
scheduler: ReduceLROnPlateau
"""

with open('config.yaml', 'w') as f:
    f.write(config_yaml)

# Train the model using the MACE CLI (recent MACE versions accept a
# YAML config via --config; older ones take the same options as flags)
import subprocess
subprocess.run([
    'mace_run_train',
    '--config', 'config.yaml',
    '--train_file', 'train_data.xyz',
    '--valid_file', 'validation_data.xyz',
    '--model_dir', './mace_model',
    '--device', 'cuda',  # or 'cpu' if no GPU
])
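
Once training completes, the fitted model lands in --model_dir, named after the `name` key in config.yaml (here water_mace.model). MACE also ships a batch-evaluation entry point; a minimal sketch, assuming the flag names below match your version (see `mace_eval_configs --help`):

import subprocess
subprocess.run([
    'mace_eval_configs',
    '--configs', 'validation_data.xyz',
    '--model', 'mace_model/water_mace.model',
    '--output', 'validation_predictions.xyz',
])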

Step 4: Use the Trained Model

Once training is complete, let's use our MLFF for molecular dynamics:

from mace.calculators import MACECalculator
from ase.md.langevin import Langevin
from ase import units
from ase.io.trajectory import Trajectory

# Load the trained model (the file name follows the `name` key in
# config.yaml; older MACE releases use the keyword model_path instead)
calc = MACECalculator(model_paths='mace_model/water_mace.model', device='cuda')

# Create a new water molecule
water_md = Atoms('H2O', 
                 positions=[[0, 0, 0],
                           [0.76, 0.59, 0],
                           [-0.76, 0.59, 0]])
water_md.calc = calc

# Run MD simulation with a Langevin thermostat
dyn = Langevin(water_md,
               timestep=0.5 * units.fs,
               temperature_K=300,
               friction=0.01)  # friction is in inverse ASE time units

# Save trajectory
traj = Trajectory('water_mlff_md.traj', 'w', water_md)
dyn.attach(traj.write, interval=10)

# Run for 1000 steps
dyn.run(1000)

print("MD simulation completed!")

Step 5: Analyze Results

Let's analyze our MLFF simulation and compare with reference data:

import matplotlib.pyplot as plt
import numpy as np
from ase.io import read

# Read trajectory
traj = read('water_mlff_md.traj', index=':')

# Extract energies
energies = [atoms.get_potential_energy() for atoms in traj]
# 0.5 fs timestep, frames written every 10 steps -> 5 fs per frame
times = np.arange(len(energies)) * 0.5 * 10  # time in fs

# Plot the potential energy over time (Langevin is a thermostat, so
# expect fluctuations around equilibrium, not strict energy conservation)
plt.figure(figsize=(10, 6))
plt.plot(times, energies)
plt.xlabel('Time (fs)')
plt.ylabel('Potential Energy (eV)')
plt.title('Potential Energy During MLFF MD')
plt.grid(True)
plt.savefig('energy_conservation.png')

# Calculate and plot bond lengths
def get_bond_lengths(atoms):
    d1 = atoms.get_distance(0, 1)  # O-H1
    d2 = atoms.get_distance(0, 2)  # O-H2
    angle = atoms.get_angle(1, 0, 2)  # H-O-H
    return d1, d2, angle

bond_data = [get_bond_lengths(atoms) for atoms in traj]
oh1_lengths = [d[0] for d in bond_data]
oh2_lengths = [d[1] for d in bond_data]
angles = [d[2] for d in bond_data]

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))

ax1.plot(times, oh1_lengths, label='O-H1')
ax1.plot(times, oh2_lengths, label='O-H2')
ax1.set_xlabel('Time (fs)')
ax1.set_ylabel('Bond Length (Å)')
ax1.legend()
ax1.grid(True)

ax2.plot(times, angles)
ax2.set_xlabel('Time (fs)')
ax2.set_ylabel('H-O-H Angle (degrees)')
ax2.grid(True)

plt.tight_layout()
plt.savefig('structural_analysis.png')
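
Beyond structural monitoring, it's worth a quick accuracy check against the reference calculator on the held-out frames from Step 2. A minimal sketch, assuming the labels stored in validation_data.xyz and the calc object from Step 4:

import numpy as np
from ase.io import read

valid = read('validation_data.xyz', index=':')
ref_e, mlff_e = [], []
for atoms in valid:
    ref_e.append(atoms.get_potential_energy())  # stored reference energy
    test = atoms.copy()
    test.calc = calc  # MACECalculator from Step 4
    mlff_e.append(test.get_potential_energy())

mae = np.mean(np.abs(np.array(ref_e) - np.array(mlff_e)))
print(f"Energy MAE vs reference: {1000 * mae:.2f} meV")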

Best Practices and Tips

  1. Data Quality: The quality of your MLFF depends entirely on your training data. Use well-converged DFT calculations.
  2. Data Diversity: Include various configurations:
    • Different temperatures
    • Compressed and expanded structures
    • Transition states if studying reactions
  3. Validation: Always validate your MLFF against unseen DFT data before production runs.
  4. Active Learning: Use uncertainty quantification to find configurations the model is unsure about, then add DFT data for those; a committee-based sketch follows this list.
  5. Model Selection: Different architectures suit different needs:
    • MACE: strong general-purpose accuracy with good inference speed
    • NequIP: very data-efficient; accurate even with small training sets
    • DeepMD: highly scalable, well suited to large-scale simulations
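
Here is one way to quantify uncertainty with a model committee. A sketch, assuming you have trained several models with different seeds; the .model paths below are placeholders:

import numpy as np
from mace.calculators import MACECalculator

# Placeholder paths: train each model with a different `seed` in config.yaml
committee = [MACECalculator(model_paths=path, device='cpu')
             for path in ['seed1.model', 'seed2.model', 'seed3.model']]

def force_disagreement(atoms):
    """Max standard deviation of committee force predictions (eV/Å)."""
    predictions = []
    for member in committee:
        test = atoms.copy()
        test.calc = member
        predictions.append(test.get_forces())
    return np.std(np.stack(predictions), axis=0).max()

# Configurations with large disagreement are candidates for new DFT labels.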

Common Pitfalls

  • Extrapolation: MLFFs interpolate within their training distribution; configurations far outside it can produce confidently wrong predictions.
  • Insufficient data diversity: a model trained only on near-equilibrium structures will misbehave at high temperatures or under strain.
  • Unit mix-ups: ASE works in eV and Å; double-check units when importing reference data from other codes.
  • Overfitting: monitor the validation error, not just the training loss.

Next Steps

Now that you have a working MLFF, consider:

  • Regenerating the training data with a real DFT code instead of EMT
  • Scaling up to condensed-phase systems such as liquid water or surfaces
  • Building an active-learning loop that adds DFT data where the model is uncertain
  • Starting from a pretrained foundation model such as MACE-MP-0

Resources

  • MACE repository: https://github.com/ACEsuit/mace
  • ASE documentation: https://wiki.fysik.dtu.dk/ase/
  • MACE paper: Batatia et al., "MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields", NeurIPS 2022

Happy simulating! Feel free to reach out with questions or share your MLFF experiences.


#MachineLearning #MolecularDynamics #MACE #ComputationalChemistry #MLForceFields

2025 Yi Cao. CC BY-NC-SA 4.0.