Addressing Lost Admin Service Quorum

If quorum is lost because one or more admin service instances have failed, the store cannot be administered, although it can still be used to read and write key/value pairs. Because the admin service uses quorum-based replication to keep the administrative data highly available, deploy enough admins to make quorum loss unlikely, typically an odd number such as 3.
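The arithmetic behind that recommendation can be sketched as follows. This is a minimal illustration, assuming quorum means a simple majority of the deployed admins, which is consistent with the example later in this section where two admins need a quorum of 2. The function names are hypothetical and are not part of any product API.

```python
def quorum_size(electable_admins: int) -> int:
    """Smallest number of admins that forms a simple majority (assumed rule)."""
    if electable_admins < 1:
        raise ValueError("at least one admin is required")
    return electable_admins // 2 + 1


def tolerated_failures(electable_admins: int) -> int:
    """Number of admins that can fail before quorum is lost."""
    return electable_admins - quorum_size(electable_admins)


# Print: deployed admins, quorum size, failures tolerated.
for n in (1, 2, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Under this assumption, two admins tolerate no failures at all, while three admins tolerate one. Going from three to four admins adds no extra tolerance, which is why an odd count is the usual choice.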

Usually, the Storage Node Agent (SNA) automatically restarts a failed admin service, which preserves quorum and the ability to administer the store. But if the Storage Node (SN) itself fails, and specifically its SNA, the admin service remains unavailable.

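In that case, the procedure below lets you reconfigure the store so that the remaining admin service(s) can be used to perform administrative tasks.

The example assumes a store with two admin services: admin1 on Storage Node sn1 (host-sn1) and admin2 on Storage Node sn2. The admin2 service has become unavailable, so quorum has been lost.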

To address the lost admin service quorum:

  1. Verify that admin service quorum has been lost. A CLI command that requires quorum should time out if quorum has been lost. For example:

    kv-> pool create -name new-pool
    Transaction retry limit exceeded (12.1.3.0.1)

  2. Use the show admins command to determine the names of the admin services that were deployed:

    kv-> show admins
    admin1: Storage Node sn1, HTTP port 13231 (master)
    admin2: Storage Node sn2, HTTP port 13231

  3. Identify the remaining healthy admin service(s) so that quorum can be reconfigured for them. Log in to sn1 and run the following command:

    > jps -m | grep Admin
    12276 ManagedService -root /opt/ondb/var/kvroot -class Admin -service BootstrapAdmin.13230 -config config1.xml

    This output shows that only the first admin service deployed to the store (the bootstrap admin) is still healthy and running on host-sn1.

    Note

    If a given admin service is down, either its Storage Node is down and cannot be accessed, or the command above produces no output for that admin service. In this example, admin2 is down and is therefore missing from the output.

  4. Once you have identified the healthy admin service(s), modify the configuration file of each corresponding SN so that the je.rep.electableGroupSizeOverride configuration property of the admin component equals the number of remaining healthy admin services. In this example, the configuration file (config.xml) corresponding to sn1 must be modified. The file has the following structure:

    <config version="1">
      <component name="storageNodeParams" 
                              type="storageNodeParams" validate="true">
        <property name="storageNodeId" value="1" type="INT"/>
        <property name="rootDirPath" 
                              value="/opt/ondb/var/kvroot" type="STRING"/>
        <property name="haHostname" value="host-sn1" type="STRING"/>
        <property name="registryPort" value="13230" type="INT"/>
        ..........
      </component>
      <component name="mountPoints" type="bootstrapParams"
                                                       validate="false">
        <property name="/disk1/ondb/data" value="" type="STRING"/>
      </component>
      <component name="globalParams" type="globalParams" validate="true">
        <property name="storeName" value="example-store" type="STRING"/>
        <property name="isLoopback" value="false" type="BOOLEAN"/>
      </component>
      <component name="admin1" type="adminParams" validate="true">
        <property name="adminHttpPort" value="13231" type="INT"/>
        <property name="storageNodeId" value="1" type="INT"/>
        ..........
      </component>
      <component name="rg1-rn1" type="repNodeParams" validate="true">
        <property name="repNodeId" value="rg1-rn1" type="STRING"/>
        <property name="repNodeType" value="ELECTABLE" type="STRING"/>
        <property name="storageNodeId" value="1" type="INT"/>
        ..........
      </component> 

    In the component named admin1, add (or modify) the property named configProperties so that its value sets je.rep.electableGroupSizeOverride to 1. For example:

    <component name="admin1" type="adminParams" validate="true">
      <property name="configProperties"
                value="je.rep.electableGroupSizeOverride 1;" type="STRING"/>
      ..........
    </component>

  5. Kill each healthy admin service process whose configuration file was modified. To kill the healthy admin service (admin1) on host-sn1, run:

    > kill 12276

    where 12276 is the process id of the single healthy admin service, identified in step 3.

    Note

    The SNA automatically restarts the killed admin service. In this example, the restarted service is configured to operate with a quorum size of 1 rather than 2.

  6. Use the single healthy admin service (admin1) to perform the necessary failure recovery administration tasks on the store.

  7. Once the failed admin service(s) (admin2 in this example) have been recovered, repeat steps 4 and 5 to restore the admin services' original quorum configuration. For this example, je.rep.electableGroupSizeOverride should be set to 2.
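
Steps 3 and 4 can be scripted. The following is a minimal sketch, not a supported tool: it extracts admin process ids from captured jps -m output and sets je.rep.electableGroupSizeOverride in the text of a config.xml file. The function names are hypothetical. It also assumes that configProperties holds semicolon-separated "name value" entries, which matches the single entry shown in step 4 but is otherwise unverified. The sketch works on strings only; reading and writing the real file, and backing it up first, is left to the operator.

```python
import xml.etree.ElementTree as ET

OVERRIDE_KEY = "je.rep.electableGroupSizeOverride"


def find_admin_pids(jps_output):
    """Return PIDs from `jps -m` output lines that contain '-class Admin'."""
    pids = []
    for line in jps_output.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip blank lines and wrapped continuation lines
        if "-class" in fields:
            class_index = fields.index("-class")
            if fields[class_index + 1:class_index + 2] == ["Admin"]:
                pids.append(int(fields[0]))
    return pids


def set_group_size_override(config_xml, admin_name, size):
    """Return config_xml with the override set on the named admin component."""
    root = ET.fromstring(config_xml)
    admin = None
    for component in root.iter("component"):
        if (component.get("name") == admin_name
                and component.get("type") == "adminParams"):
            admin = component
            break
    if admin is None:
        raise LookupError(f"no adminParams component named {admin_name!r}")

    # Reuse an existing configProperties property, or create one.
    prop = None
    for candidate in admin.findall("property"):
        if candidate.get("name") == "configProperties":
            prop = candidate
            break
    if prop is None:
        prop = ET.SubElement(
            admin, "property", {"name": "configProperties", "type": "STRING"}
        )

    # Assumed format: "name value;" entries. Keep unrelated entries and
    # replace any earlier override entry.
    entries = [e.strip() for e in prop.get("value", "").split(";") if e.strip()]
    entries = [e for e in entries if e.split()[0] != OVERRIDE_KEY]
    entries.append(f"{OVERRIDE_KEY} {size}")
    prop.set("value", ";".join(entries) + ";")
    return ET.tostring(root, encoding="unicode")


# Usage with data taken from the example in this section.
jps = ("12276 ManagedService -root /opt/ondb/var/kvroot "
       "-class Admin -service BootstrapAdmin.13230 -config config1.xml")
sample = """<config version="1">
  <component name="admin1" type="adminParams" validate="true">
    <property name="adminHttpPort" value="13231" type="INT"/>
  </component>
</config>"""

admin_pids = find_admin_pids(jps)
updated = set_group_size_override(sample, "admin1", 1)
restored = set_group_size_override(updated, "admin1", 2)
print(admin_pids)
print(updated)
```

Calling set_group_size_override a second time with the original size, as in the restored value above, corresponds to step 7. Note that ElementTree discards XML comments and may change attribute layout when it serializes, so compare the result with the original file before replacing it.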