HDFS Configuration Options

Use the following parameters to configure the Kafka Connect for HPE Ezmeral Data Fabric Streams HDFS connector.

NOTE For the HDFS connector, both Avro and Parquet files can be written.

In standalone mode, specify the HDFS connector configuration in the quickstart-hdfs.properties file. You can also configure the offset storage location and the port for the REST interface, which are specified in the connect-standalone.properties file. See Configuring in Standalone Mode.

/opt/mapr/kafka-connect-hdfs/kafka-connect-hdfs-<version>/etc/kafka-connect-hdfs/quickstart-hdfs.properties
/opt/mapr/kafka/kafka-<version>/config/connect-standalone.properties

In distributed mode, HDFS connector configuration is provided in the POST and PUT requests when creating or modifying the connector. See POST /connectors and PUT /connectors/(string:name)/config for more information about using the REST API. Additional configurations, such as the topics that store the connector state, task configuration state, and connector offset state, are specified in the connect-distributed.properties file. See Configuring in Distributed Mode.

/opt/mapr/kafka/kafka-<version>/config/connect-distributed.properties

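
As a rough illustration of the distributed-mode request described above, the following Python sketch builds a JSON body of the shape that POST /connectors accepts (a name plus a config map). The connector name, topic, host, and port values are hypothetical, and the connector.class and tasks.max keys are standard Kafka Connect settings that are assumed here rather than taken from the parameter table. The sketch only builds the body; it does not send a request.

```python
import json


def build_connector_request(name, topics, hdfs_url, flush_size):
    """Build a JSON body for POST /connectors (Kafka Connect REST API)."""
    config = {
        # Assumption: standard Kafka Connect keys, not listed in this table.
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "tasks.max": "1",
        # Parameters from this page; the values passed in are hypothetical.
        "topics": ",".join(topics),
        "hdfs.url": hdfs_url,
        "flush.size": str(flush_size),
    }
    return json.dumps({"name": name, "config": config}, indent=2)


# Hypothetical connector name, topic, and file system URL.
body = build_connector_request(
    "hdfs-sink-example",
    ["/sample-stream:topic1"],
    "maprfs://hostname:7222",
    3,
)
print(body)
```

The same config map, without the enclosing name field, is the kind of body a PUT /connectors/(string:name)/config request would carry when modifying an existing connector.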
Table 1. HDFS Configuration Parameters
Parameter Description

flush.size

Number of records written to the file system before invoking file commits.

  • Type: int
  • Default: ""

hdfs.url

The file system connection URL. This configuration has the format maprfs://hostname:port and specifies the data fabric file system to export data to.

  • Type: string
  • Default: ""

connect.hdfs.keytab

The path to the keytab file for the HDFS connector principal. This keytab file should only be readable by the connector user.

  • Type: string
  • Default: ""

connect.hdfs.principal

The principal used when the file system is using Kerberos for authentication.

  • Type: string
  • Default: ""

format.class

The format class used when writing data to the file system.

  • Type: string
  • Default: "io.confluent.connect.hdfs.avro.AvroFormat"
    NOTE To write Parquet files, use "io.confluent.connect.hdfs.parquet.ParquetFormat".

hadoop.conf.dir

The Hadoop configuration directory.

  • Type: string
  • Default: ""

hadoop.home

The Hadoop home directory.

  • Type: string
  • Default: ""

hdfs.authentication.kerberos

Specifies whether the file system uses Kerberos for authentication.

  • Type: boolean
  • Default: false

hdfs.namenode.principal

The Kerberos principal for CLDB.

  • Type: string
  • Default: ""

hive.conf.dir

The Hive configuration directory.

  • Type: string
  • Default: ""

hive.database

The database used when the connector creates tables in Hive.

  • Type: string
  • Default: "default"

hive.home

The Hive home directory.

  • Type: string
  • Default: ""

hive.integration

Specifies whether Hive is integrated when running the connector.

  • Type: boolean
  • Default: false

hive.metastore.uris

The Hive metastore URIs. Can be an IP address or fully-qualified domain name and port of the metastore host.

  • Type: string
  • Default: ""

logs.dir

Top-level file system directory to store the write-ahead logs.

  • Type: string
  • Default: "logs"

partitioner.class

The partitioner used when writing data to the file system. You can use DefaultPartitioner, which preserves the Kafka partitions; FieldPartitioner, which partitions the data into different directories according to the value of the partitioning field specified in partition.field.name; or TimeBasedPartitioner, which partitions the data according to the time it was ingested into the file system.

  • Type: string
  • Default: "io.confluent.connect.hdfs.partitioner.DefaultPartitioner"

rotate.interval.ms

The time interval (milliseconds) between file commits. This configuration ensures that a file commit is invoked at every configured interval. It is useful when the data ingestion rate is low and the connector has not written enough messages to trigger a commit through flush.size. The default value -1 means that this feature is disabled.

  • Type: long
  • Default: -1
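
The interaction between flush.size and rotate.interval.ms can be pictured with a small Python sketch. This is a simplified model of the two commit triggers described in this table, not the connector's actual code. The function and argument names are hypothetical, and the rule that a time-based commit needs at least one buffered record is an assumption.

```python
def should_commit(records_written, ms_since_last_commit,
                  flush_size, rotate_interval_ms=-1):
    """Return True when either commit trigger in this simplified model fires."""
    # Size-based trigger: enough records have been written (flush.size).
    if records_written >= flush_size:
        return True
    # Time-based trigger (rotate.interval.ms); -1 disables it.
    # Assumption: an empty buffer is never committed.
    if (rotate_interval_ms != -1
            and records_written > 0
            and ms_since_last_commit >= rotate_interval_ms):
        return True
    return False


print(should_commit(3, 0, flush_size=3))                                # True
print(should_commit(1, 60000, flush_size=3))                            # False
print(should_commit(1, 60000, flush_size=3, rotate_interval_ms=60000))  # True
```

With rotation disabled, a slow topic that never reaches flush.size records leaves its data uncommitted, which is the situation rotate.interval.ms is meant to address.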

schema.compatibility

The schema compatibility rule used when the connector is observing schema changes. The supported configurations are NONE, BACKWARD, FORWARD and FULL.

  • Type: string
  • Default: "NONE"

topics

A list of topics to use as input for the HDFS connector.

  • Type: string
  • Default: ""

topics.dir

Top-level file system directory to store the data ingested from Kafka.

  • Type: string
  • Default: "topics"

locale

The locale used when partitioning with TimeBasedPartitioner.

  • Type: string
  • Default: ""

partition.duration.ms

The duration of a partition (milliseconds) used by TimeBasedPartitioner. The default value -1 means that TimeBasedPartitioner is not being used.

  • Type: long
  • Default: -1

partition.field.name

The name of the partitioning field when FieldPartitioner is used.

  • Type: string
  • Default: ""

path.format

Sets the format of the data directories when partitioning with TimeBasedPartitioner. The format converts the Unix timestamp into directory strings. For example, if you set path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/, the data directories have the format /year=2015/month=12/day=07/hour=15.

  • Type: string
  • Default: ""
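
To make the timestamp-to-directory mapping concrete, here is a minimal Python sketch. It is an illustration only: the connector's pattern syntax differs from Python's strftime, so the pattern is translated by hand, the function name is hypothetical, and flooring the timestamp to the start of its partition.duration.ms window in UTC is an assumption about how the time-based partitioning behaves.

```python
from datetime import datetime, timezone


def partition_dir(timestamp_ms, partition_duration_ms, strftime_pattern):
    """Map a Unix timestamp (ms) to a time-based directory string."""
    # Assumption: floor the timestamp to the start of its partition window.
    floored_ms = timestamp_ms - (timestamp_ms % partition_duration_ms)
    start = datetime.fromtimestamp(floored_ms / 1000, tz=timezone.utc)
    return "/" + start.strftime(strftime_pattern)


# A record ingested at 2015-12-07 15:42:10 UTC, with hourly partitions.
ts_ms = int(datetime(2015, 12, 7, 15, 42, 10, tzinfo=timezone.utc).timestamp() * 1000)
directory = partition_dir(ts_ms, 3_600_000, "year=%Y/month=%m/day=%d/hour=%H")
print(directory)  # /year=2015/month=12/day=07/hour=15
```

In the connector itself, the timezone and locale parameters would influence which hour or day a given timestamp falls into; the sketch fixes UTC to keep the example deterministic.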

shutdown.timeout.ms

Clean shutdown timeout. This makes sure that asynchronous Hive metastore updates are completed during connector shutdown.

  • Type: long
  • Default: 3000

timezone

The timezone to use when partitioning with TimeBasedPartitioner.

  • Type: string
  • Default: ""

filename.offset.zero.pad.width

The width to which offsets are zero-padded in file system file names. Padding short offsets produces fixed-width file names that can be ordered by simple lexicographic sorting.

  • Type: int
  • Default: 10
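
The effect of zero-padding on sort order can be seen in a short Python sketch. The helper name is hypothetical, and only the offset portion of a file name is modeled, not the connector's full file naming scheme.

```python
def pad_offset(offset, width=10):
    """Left-pad an offset with zeros to a fixed width."""
    return str(offset).zfill(width)


offsets = [9, 10, 100]

# Without padding, lexicographic order does not match numeric order.
unpadded = sorted(str(o) for o in offsets)
print(unpadded)  # ['10', '100', '9']

# With padding to the default width of 10, both orders agree.
padded = sorted(pad_offset(o) for o in offsets)
print(padded)  # ['0000000009', '0000000010', '0000000100']
```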

kerberos.ticket.renew.period.ms

The period in milliseconds to renew the Kerberos ticket.

  • Type: long
  • Default: 3600000 (milliseconds)

retry.backoff.ms

Used to notify Kafka Connect to retry delivering a message batch or performing recovery in case of transient exceptions. The retry backoff is in milliseconds.

  • Type: long
  • Default: 5000 (milliseconds)

schema.cache.size

The size of the schema cache used in the Avro converter.

  • Type: int
  • Default: 1000

storage.class

The underlying storage layer. The default is MapR-FS.

  • Type: string
  • Default: "io.confluent.connect.hdfs.storage.HdfsStorage"