Replacing a Failed Storage Node

If a Storage Node has failed, or is in the process of failing, you can replace it. Moving a healthy Storage Node to another machine with better specifications is also a common replacement scenario. Generally, you should repair the underlying problem (be it hardware- or software-related) before proceeding with this procedure.

There are two ways to replace a failed Storage Node.

To replace a failed Storage Node by using a new, different Storage Node (the node uses a different host name, IP address, and port from the failed host):

  1. If you are replacing hardware, bring it up and make sure it is ready for your production environment.

  2. On the new, replacement node, create a "boot config" file using the makebootconfig utility. Do this on the hardware where your new Storage Node will run. Specify the -admin option (the Admin Console's port) only if this hardware will host the Oracle NoSQL Database administration processes.

    To create the "boot config" file, issue the following commands:

    > mkdir -p KVROOT     (if it doesn't already exist)
    > java -Xmx256m -Xms256m \
    -jar KVHOME/lib/kvstore.jar makebootconfig -root KVROOT \
                                               -port 5000 \
                                               -admin 5001 \
                                               -host <hostname> \
                                               -harange 5010,5020 \
                                               -store-security none
  3. Start the Oracle NoSQL Database software on the new node.

    > nohup java -Xmx256m -Xms256m \
    -jar KVHOME/lib/kvstore.jar start -root KVROOT &
  4. Deploy the new Storage Node to the new node. Do this from an existing administrative process, using either the CLI or the Admin Console. Using the CLI:

    > java -Xmx256m -Xms256m \
    -jar KVHOME/lib/kvstore.jar runadmin  \
    -port <port> -host <host>
    kv-> plan deploy-sn -zn <id> -host <host> -port <port> -wait
    kv-> 
  5. Add the new Storage Node to the Storage Node pool. (You created a Storage Node pool when you installed the store, and you added all your Storage Nodes to it, but it is otherwise not used in this version of the product.)

    kv-> show pools
    AllStorageNodes: sn1, sn2, sn3, sn4 ... sn25, sn26
    BostonPool: sn1, sn2, sn3, sn4 ... sn25
    kv-> pool join -name BostonPool -sn sn26
    kv-> show pools
    AllStorageNodes: sn1, sn2, sn3, sn4 ... sn25, sn26
    BostonPool: sn1, sn2, sn3, sn4 ... sn25, sn26
    kv->
  6. Make sure the old Storage Node is not running. If the problem is with the hardware, turn off the broken machine. To stop just the Storage Node software:

    > java -Xmx256m -Xms256m \
    -jar KVHOME/lib/kvstore.jar stop -root KVROOT
  7. Migrate the services from the old Storage Node to the new one. If the old node hosted an admin service, the -admin-port argument is required. The syntax for this plan is:

    plan migrate-sn -from <old SN ID> -to <new SN ID> \
    -admin-port <admin port>

    Assuming that you are migrating from Storage Node sn25 to sn26, with the admin on port 5000, you would use:

    kv-> plan migrate-sn -from sn25 -to sn26 -admin-port 5000
  8. The old Storage Node is still shown in the topology and is reported as UNREACHABLE. The old SNA should be removed and its rootdir deleted. Bringing up the old SNA would also bring up the old Replication Nodes and Admins, which are no longer members of their replication groups. This is harmless to the rest of the store, but it produces log error messages that might be misinterpreted as indicating a problem with the store. Use the plan remove-sn command to remove the old, unused Storage Node from your deployment.

    kv-> plan remove-sn sn25 -wait

Note

Replacing a Storage Node qualifies as a topology change. This means that if you want to restore your store from a snapshot taken before the Storage Node was replaced, you must use the Load program. See Using the Load Program for more information.
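The steps above can be strung together into a single script. The sketch below is a dry run: the run wrapper only prints each command instead of executing it, and every concrete value (the install and root directories, host names, ports, and the sn25/sn26 Storage Node IDs) is a placeholder assumption carried over from the examples above — substitute your own values before use.

```shell
#!/usr/bin/env bash
# Dry-run sketch: replace a failed Storage Node with a new, different node.
# All values below are placeholder assumptions, not required settings.
set -euo pipefail

KVHOME=${KVHOME:-/opt/oracle/kvhome}   # assumed install location
KVROOT=${KVROOT:-/var/kvroot}          # assumed root directory on the new node
NEW_HOST=newhost                       # replacement node's host name (assumed)
ADMIN_HOST=adminhost                   # host running an admin process (assumed)
ADMIN_PORT=5000                        # admin registry port (assumed)
OLD_SN=sn25                            # failed Storage Node ID (from the example)
NEW_SN=sn26                            # ID assigned when the new SN is deployed

run() { echo "would run: $*"; }        # change the body to "$@" to execute

# Steps 2-3: configure and start the software on the replacement node.
run java -Xmx256m -Xms256m -jar "$KVHOME/lib/kvstore.jar" makebootconfig \
    -root "$KVROOT" -port 5000 -admin 5001 -host "$NEW_HOST" \
    -harange 5010,5020 -store-security none
run java -Xmx256m -Xms256m -jar "$KVHOME/lib/kvstore.jar" start -root "$KVROOT"

# Steps 4-8: deploy the new SN, join the pool, migrate services, remove the
# old SN. Assumes runadmin's single-command mode (command given on the
# command line rather than at the interactive kv-> prompt).
for cmd in \
    "plan deploy-sn -zn zn1 -host $NEW_HOST -port 5000 -wait" \
    "pool join -name BostonPool -sn $NEW_SN" \
    "plan migrate-sn -from $OLD_SN -to $NEW_SN -admin-port $ADMIN_PORT" \
    "plan remove-sn $OLD_SN -wait"
do
    run java -Xmx256m -Xms256m -jar "$KVHOME/lib/kvstore.jar" runadmin \
        -host "$ADMIN_HOST" -port "$ADMIN_PORT" $cmd
done
```

Keeping the dry-run wrapper makes it easy to review the exact command sequence against the numbered steps before running anything against a live store.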

To replace a failed Storage Node by using an identical node (the node uses the same host name, IP address, and port as the failed host):

  1. Prerequisite information:

    1. A running Admin process on a known host, with a known registry port.

    2. The ID of the Storage Node to replace (e.g. "sn1").

    3. Before starting the new Storage Node, the SN to be replaced must be taken down. It may have been taken down administratively, or it may already be down as a result of the failure.

    Note

    It is recommended that the KVROOT be empty so that the node performs a full network recovery of its data before proceeding.

    The instructions below assume that the KVROOT is empty and has no valid data. When the new Storage Node Agent starts, it starts the services it hosts, and those services recover their data from other hosts. This recovery happens in the background and may take some time, depending on the size of the shards involved.

  2. Create the configuration using the generateconfig command:

    The usage of generateconfig is:

    > java -Xmx256m -Xms256m \
    -jar KVHOME/lib/kvstore.jar generateconfig \
    -host <hostname> -port <port> -sn <StorageNodeId> -target <zipfile>

    For example:

    > java -Xmx256m -Xms256m \
    -jar KVHOME/lib/kvstore.jar generateconfig -host adminhost \
    -port 13230 -sn sn1 -target /tmp/sn1.config.zip

    The command above creates the target file /tmp/sn1.config.zip, a zip archive containing the configuration required to re-create that Storage Node. The top-level directory in the zip file is the store's KVROOT.

  3. Restore the Storage Node configuration on the target host:

    1. Copy the zip file to the target host.

    2. Unzip the archive into your KVROOT directory. That is, if KVROOT is /opt/kvroot, then do the following:

      > cd /opt
      > unzip <path-to-sn1.config.zip>
  4. Restart the Storage Node on the new host.

    > java -Xmx256m -Xms256m \
    -jar KVHOME/lib/kvstore.jar start -root KVROOT
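The identical-node procedure can likewise be sketched as a dry-run script. As before, the run wrapper only prints each command, and the concrete values (adminhost, registry port 13230, sn1, the /opt/kvroot KVROOT, and the scp copy step) are placeholder assumptions drawn from the examples above.

```shell
#!/usr/bin/env bash
# Dry-run sketch: re-create a failed Storage Node on an identical host.
# adminhost, port 13230, sn1, and /opt/kvroot are placeholder assumptions.
set -euo pipefail

KVHOME=${KVHOME:-/opt/oracle/kvhome}   # assumed install location
ADMIN_HOST=adminhost                   # host running an admin process (assumed)
ADMIN_PORT=13230                       # admin registry port (from the example)
SN_ID=sn1                              # Storage Node to re-create
CONFIG_ZIP=/tmp/${SN_ID}.config.zip    # target archive (from the example)
TARGET_HOST=sn1host                    # replacement node's host name (assumed)
KVROOT_PARENT=/opt                     # parent of KVROOT, i.e. /opt/kvroot

run() { echo "would run: $*"; }        # change the body to "$@" to execute

# Step 2: generate the configuration archive via a running admin process.
run java -Xmx256m -Xms256m -jar "$KVHOME/lib/kvstore.jar" generateconfig \
    -host "$ADMIN_HOST" -port "$ADMIN_PORT" -sn "$SN_ID" -target "$CONFIG_ZIP"

# Step 3: copy the archive to the target host and unpack it; the archive's
# top-level directory is the store's KVROOT. (scp is one assumed mechanism.)
run scp "$CONFIG_ZIP" "$TARGET_HOST:/tmp/"
run cd "$KVROOT_PARENT"
run unzip "$CONFIG_ZIP"

# Step 4: restart the Storage Node; its hosted services then recover their
# data from the other hosts in their shards, in the background.
run java -Xmx256m -Xms256m -jar "$KVHOME/lib/kvstore.jar" start -root kvroot
```

Running the sketch as-is prints the planned command sequence, which is useful for confirming the host names and ports against your own deployment before executing anything.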