Replacing a Failed Disk

If a disk has failed, or is in the process of failing, you can replace it. Prompt disk replacement keeps the store running and preserves data availability. This section walks you through the steps of replacing a failed disk.

The following example deploys a KVStore to a set of three machines, each with three disks. One disk on each machine holds the KVStore root directory, and the storagedir flag of the makebootconfig command specifies the storage locations on the other two disks.

> java -jar KVHOME/lib/kvstore.jar makebootconfig \
    -root /opt/ondb/var/kvroot \
    -port 5000  \
    -admin 5001 \
    -host node09 \
    -harange 5010,5020 \
    -num_cpus 0  \
    -memory_mb 0 \
    -store-security none \
    -capacity 2  \
    -storagedir /disk1/ondb/data \
    -storagedir /disk2/ondb/data  

With a boot configuration such as that shown above, the directory structure that is created and populated on each machine would then be:

 - Machine 1 (SN1) -     - Machine 2 (SN2) -    - Machine 3 (SN3) -
/opt/ondb/var/kvroot   /opt/ondb/var/kvroot  /opt/ondb/var/kvroot
  log files              log files             log files
  /store-name           /store-name           /store-name
    /log                   /log                  /log
    /sn1                   /sn2                  /sn3
      config.xml             config.xml            config.xml
      /admin1                /admin2               /admin3
        /env                   /env                  /env

/disk1/ondb/data         /disk1/ondb/data        /disk1/ondb/data
  /rg1-rn1                 /rg1-rn2                /rg1-rn3
    /env                     /env                    /env

/disk2/ondb/data         /disk2/ondb/data        /disk2/ondb/data
  /rg2-rn1                 /rg2-rn2                /rg2-rn3
    /env                     /env                    /env 

In this case, configuration information and administrative data are stored in a location that is separate from all of the replication data. In addition, each distinct Replication Node service stores its replication data on separate physical media. Storing data in this way provides failure isolation and typically makes disk replacement simpler and less time-consuming.

To replace a failed disk:

  1. Determine which disk has failed. To do this, you can use standard system monitoring and management mechanisms. In the previous example, suppose disk2 on Storage Node 3 fails and needs to be replaced.
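
    How you detect the failure is platform-specific. As a sketch, on a Linux host you might check the kernel log and the disk's SMART health status (the device name /dev/sdc is illustrative, and smartctl requires the smartmontools package):

    > dmesg | grep -i "i/o error"
    > sudo smartctl -H /dev/sdc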

  2. Given the directory structure, determine which Replication Node service to stop. With the structure described above, the store writes replicated data to disk2 on Storage Node 3, so rg2-rn3 must be stopped before replacing the failed disk.
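
    If you are unsure which Replication Node writes to the failed disk, the admin CLI's show topology command lists each Replication Node and the Storage Node that hosts it; depending on your release, adding -verbose may also display the configured storage directories:

    kv-> show topology -verbose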

  3. Use the plan stop-service command to stop the affected service (rg2-rn3) so that the system no longer attempts to communicate with it. This reduces the amount of error output related to a failure you are already aware of.

    kv-> plan stop-service -service rg2-rn3
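
    Optionally, before touching the hardware, confirm that the service has stopped; the verify configuration command reports the current status of each service in the store:

    kv-> verify configuration
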
  4. Remove the failed disk (disk2) using whatever procedure is dictated by the operating system, disk manufacturer, and/or hardware platform.
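
    The details are platform-specific; as a minimal sketch on Linux, you would at least unmount the failed disk's file system before pulling the drive (this assumes /disk2 is its mount point):

    > sudo umount /disk2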

  5. Install a new disk using any appropriate procedures.

  6. Format and mount the new disk so that it provides the same storage directory as before; in this case, /disk2/ondb/data.
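
    As a sketch, on a Linux host this might involve creating a file system on the replacement device, mounting it at the original mount point, and recreating the storage directory with ownership matching the account that runs the store; the device name, file system type, and ondb user and group here are all assumptions:

    > sudo mkfs -t ext4 /dev/sdc1
    > sudo mount /dev/sdc1 /disk2
    > sudo mkdir -p /disk2/ondb/data
    > sudo chown -R ondb:ondb /disk2/ondb/data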

  7. With the new disk in place, use the plan start-service command to start the rg2-rn3 service.

    kv-> plan start-service -service rg2-rn3

    Note

    Recovering all of the data that previously resided on the disk can take a considerable amount of time, depending on how much data was stored there before the failure. Also note that the system may experience additional network traffic and load while the new disk is being repopulated.
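
    You can track the Replication Node's recovery; for example, the ping utility reports the current status of every service in the store, including whether rg2-rn3 is running again:

    > java -jar KVHOME/lib/kvstore.jar ping -host node09 -port 5000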