Configuring Hook Connections for Hive High Availability

Describes how to configure the EzHiveServer2Hook, the EzHiveCliHook, and the EzHiveMetastoreHook to connect to Hive with High Availability (HA) enabled.

EEP 9.2.1 and later include hook connection support in Airflow for connecting to HiveServer2 (HS2) with High Availability (HA) enabled. When a hook is configured for HA and one of the HS2 servers is unreachable, Airflow connects to another server in the list of hosts that you specify.

Configuring the EzHiveServer2Hook for Hive HA

The EzHiveServer2Hook supports a pyhive connection to HiveServer2 HA. To configure the pyhive connection with HiveServer2 HA:
  1. Add the hive_ha property to the extra section of the connection configuration. For example:
    {
      "authMechanism": "MAPRSASL",
      "ssl": "true",
      "hive_ha": "true"
    }
  2. Add the list of your active HS2 instances in the host section using this format:
    <hs2_hostname1>:<port1>,<hs2_hostname2>:<port2>,<hs2_hostname3>:<port3>…
    For example:
    myhost-48-n2.storage.mycorp.net:10000,myhost-23-n2.storage.mycorp.net:10000
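The two values from the steps above can also be generated rather than typed by hand. The following is a minimal sketch; the helper name build_hs2_ha_connection is hypothetical and is not part of Airflow or of the hook. It only formats the host string and the extra JSON in the shape shown above.

```python
import json


def build_hs2_ha_connection(hosts, auth_mechanism="MAPRSASL", ssl=True):
    """Return (host_field, extra_field) strings for an HS2 HA connection.

    `hosts` is a list of (hostname, port) tuples. This helper is hypothetical;
    it only formats the values described in the steps above.
    """
    if not hosts:
        raise ValueError("at least one HiveServer2 instance is required")
    # Host section: comma-separated <hostname>:<port> entries.
    host_field = ",".join(f"{name}:{port}" for name, port in hosts)
    extra = {
        "authMechanism": auth_mechanism,
        # The documented example passes these flags as the string "true".
        "ssl": "true" if ssl else "false",
        "hive_ha": "true",
    }
    return host_field, json.dumps(extra, indent=2)


host_field, extra_field = build_hs2_ha_connection(
    [
        ("myhost-48-n2.storage.mycorp.net", 10000),
        ("myhost-23-n2.storage.mycorp.net", 10000),
    ]
)
print(host_field)
print(extra_field)
```

Paste the first printed value into the host section and the second into the extra section of the connection.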
In the following example, one of the HS2 servers is unusable, so Airflow reconnects to another server:
{ezhive.py:196} INFO - Trying to connect to myhost-23-n2.storage.mycorp.net:10000
{TSocket.py:142} INFO - Could not connect to ('<ip_address>', 10000)
Traceback (most recent call last):
  File "/opt/mapr/airflow/airflow-2.7.3/build/env/lib/python3.9/site-packages/thrift/transport/TSocket.py", line 137, in open
    handle.connect(sockaddr)
  File "/opt/mapr/airflow/airflow-2.7.3/build/python/lib/python3.9/ssl.py", line 1343, in connect
    self._real_connect(addr, False)
  File "/opt/mapr/airflow/airflow-2.7.3/build/python/lib/python3.9/ssl.py", line 1330, in _real_connect
    super().connect(addr)
ConnectionRefusedError: [Errno 111] Connection refused
{TSocket.py:145} ERROR - Could not connect to any of [('<ip_address>', 10000)]
[2023-12-15, 09:06:32 UTC] {ezhive.py:210} WARNING - Failed to connect to myhost-23-n2.storage.mycorp.net:10000
{ezhive.py:196} INFO - Trying to connect to myhost-48-n2.storage.mycorp.net:10000
{hive.py:475} INFO - USE 'default'
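
The behavior in this log can be approximated as a loop over the configured hosts. The sketch below is illustrative only and is not the hook's actual implementation: the function names parse_hosts and connect_first_available are hypothetical, and a plain TCP socket stands in for the pyhive connection.

```python
import socket


def parse_hosts(host_field):
    """Split 'host1:port1,host2:port2' into a list of (host, port) tuples.

    Assumes every entry carries an explicit port, as in the format above.
    """
    pairs = []
    for entry in host_field.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.rpartition(":")
        pairs.append((host, int(port)))
    return pairs


def connect_first_available(host_field, connect=socket.create_connection, timeout=5.0):
    """Try each listed server in order; return (host, port, connection)."""
    errors = []
    for host, port in parse_hosts(host_field):
        try:
            return host, port, connect((host, port), timeout)
        except OSError as exc:
            # ConnectionRefusedError is a subclass of OSError; record it
            # and move on to the next server in the list.
            errors.append(f"{host}:{port} ({exc})")
    raise ConnectionError("Could not connect to any of: " + ", ".join(errors))
```

The connect parameter is injectable so that the loop can be exercised without a live HS2 instance.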

Configuring the EzHiveCliHook for Hive HA

The EzHiveCliHook supports a beeline connection to HiveServer2 HA. To configure the beeline connection with HiveServer2 HA:
  1. Add the following properties to the extra section of the connection configuration:
    {
      "use_beeline": true,
      "ssl": "true",
      "hive_ha": "true",
      "serviceDiscoveryMode": "zooKeeper",
      "zooKeeperNamespace": "hiveserver2"
    }
  2. Add the list of your active ZooKeeper instances in the host section using this format:
    <ZK_FQDN1>:5181,<ZK_FQDN2>:5181,<ZK_FQDN3>:5181
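
With ZooKeeper service discovery, beeline is given the ZooKeeper hosts instead of a single HS2 address, and the remaining properties travel as JDBC URL parameters. The sketch below shows what such a URL looks like in the standard Hive JDBC format. How the hook itself assembles the URL is an assumption, and the function name build_zk_jdbc_url and the zk*.mycorp.net host names are hypothetical.

```python
def build_zk_jdbc_url(zk_host_field, namespace="hiveserver2", database="", ssl=True):
    """Build a Hive JDBC URL that uses ZooKeeper service discovery."""
    # Normalize the comma-separated ZooKeeper list from the host section.
    quorum = ",".join(h.strip() for h in zk_host_field.split(",") if h.strip())
    if not quorum:
        raise ValueError("at least one ZooKeeper instance is required")
    params = [
        "serviceDiscoveryMode=zooKeeper",
        f"zooKeeperNamespace={namespace}",
    ]
    if ssl:
        params.append("ssl=true")
    return f"jdbc:hive2://{quorum}/{database};" + ";".join(params)


url = build_zk_jdbc_url(
    "zk1.mycorp.net:5181,zk2.mycorp.net:5181,zk3.mycorp.net:5181"
)
print(url)
```

A URL of this form can be passed to beeline with the -u option to check the ZooKeeper configuration outside of Airflow.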

Configuring the EzHiveMetastoreHook for Hive HA

The EzHiveMetastoreHook supports an hmsclient connection to Hive Metastore HA. To configure the hmsclient connection with Hive Metastore HA:
  1. Configure Hive Metastore HA as described in Enabling High Availability for Hive Metastore.
  2. Add the following property to the extra section of the connection configuration:
    { "authMechanism": "MAPRSASL" }
  3. In the host section, specify the list of active Hive metastore hosts using the following format:
    <hive_metastore1>,<hive_metastore2>,<hive_metastore3>
With this configuration, if one Hive metastore host is unavailable, Airflow connects to another host in the list. For example:
[2023-12-15, 12:57:54 UTC] {base.py:73} INFO - Using connection ID 'metastore_default' for task execution.
[2023-12-15, 12:57:54 UTC] {hive.py:576} INFO - Trying to connect to myhost-23-n2.storage.mycorp.net:9083
[2023-12-15, 12:57:54 UTC] {hive.py:582} ERROR - Could not connect to myhost-23-n2.storage.mycorp.net:9083
[2023-12-15, 12:57:54 UTC] {hive.py:576} INFO - Trying to connect to myhost-48-n2.storage.mycorp.net:9083
[2023-12-15, 12:57:54 UTC] {hive.py:578} INFO - Connected to myhost-48-n2.storage.mycorp.net:9083
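
The host list for the metastore carries no ports, while the log shows each host being tried on port 9083. A minimal sketch of that expansion follows. It assumes that a single port applies to every listed host, which is inferred from the log and not confirmed; the function name metastore_endpoints is hypothetical.

```python
def metastore_endpoints(host_field, port=9083):
    """Pair every listed metastore host with one shared port.

    9083 is the port that appears in the log above; applying a single
    port to all hosts is an assumption of this sketch.
    """
    return [(h.strip(), port) for h in host_field.split(",") if h.strip()]


endpoints = metastore_endpoints(
    "myhost-23-n2.storage.mycorp.net,myhost-48-n2.storage.mycorp.net"
)
for host, port in endpoints:
    print(f"{host}:{port}")
```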