This file may be a little outdated...

BUILDING

Basic:
$ tar -jxvf dmtcp-X.Y.tar.bz2
$ cd dmtcp
$ ./configure
$ make


With (spammy) debugging traces and debugging symbols:
$ tar -zxvf dmtcp-X.Y.tar.gz
$ cd dmtcp
$ ./configure CPPFLAGS="-DDEBUG" CXXFLAGS="-O0 -g3" --enable-debug=full
$ make

For (generic) building/install directions see ./INSTALL

IMPORTANT PROGRAMS

./src/dmtcp_coordinator - checkpoint coordinator that synchronizes checkpoints between nodes (run first)
./src/dmtcp_checkpoint  - wrapper to executate a user program with checkpointing
dmtc_restart_script.sh - outputted by coordinator at checkpoint time, restarts a checkpoint


RUNNING A DMTCP ENABLED PROGRAM

dmtcp_coordinator, dmtcp_checkpoint, and dmtcp_restart are expected to be in your $PATH

First you must run a coordinator server:
$ dmtcp_coordinator [port]
(if left off, port defaults to 7999)

Define environmental variables:
DMTCP_HOST="hostname.of.yourcoordinator.com"
DMTCP_PORT=7999
(if left off, defaults to localhost:7999)


Now run the program
$ dmtcp_checkpoint ./a.out 1 2 3 4

Child processes are also checkpointed.  
Calls to "ssh theHost ./a.out 1 2 3" will be transformed to
  "ssh theHost dmtcp_checkpoint ./a.out 1 2 3"


RESTARTING FROM A CHECKPOINT
The easy way is to just type:  ./dmtcp_restart_script.sh
on the host containing the coordinator.  This is a convenience script
that restarts the processes as shown below.

First you must run a coordinator server:
$ dmtcp_coordinator [port]
(if left off, port defaults to 7999)
Note: Currently coordinator host/port MUST be the same as it was at
      checkpoint time.

In any order, restore the checkpoints of each node in cluster.  The locations 
of each node CAN change, they will use the coordinator server to find the 
new locations of all their peers.

Currently every process must be restored independently with a command like:
$ dmtcp_restart cktp-XXXXXXX-YYYY-ZZZZZ.mtcp

Names of checkpoints are cktp-HOSTID-PID-STARTTIME.mtcp. In working directory
of each program.

Checkpoints can be restarted in any order, execution will not resume until all
peers have been restarted.

Once all peers are restarted, cluster-wide execution will automatically restart
checkpoints will start again.   
