Accessing HPE Ezmeral Data Fabric File Store in Java Applications

Because HPE Ezmeral Data Fabric File Store is a high-performance file system, portions of its file client are based on a native maprfs library. When developing an application, specifying a dependency on the JAR file that includes the maprfs library enables you to build applications without having to manage platform-specific dependencies.

The following sections describe how to access the HPE Ezmeral Data Fabric File Store in a Java program.

Writing a Java Application

In your Java application, you will use a Configuration object to interface with the file system. When you instantiate a Configuration object, it is created with values from Hadoop configuration files.

If the program is built with JAR files from the Data Fabric installation, the Hadoop 1 configuration files are in the $MAPR_HOME/hadoop/hadoop-<version>/conf directory, and the Hadoop 2 configuration files are in the $HADOOP_HOME/etc/hadoop directory. This Hadoop configuration directory is in the hadoop classpath that you include when you compile and run the Java program.

If the program is built through Maven using MapR Maven artifacts, the default Hadoop configuration files are included in those artifacts. In that case, you must programmatically update the Hadoop configuration to match the Hadoop configuration files on the Data Fabric cluster.
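For example, a Maven-built application can override the packaged defaults in code before it obtains the FileSystem handle. The following is a minimal sketch; the class name is illustrative, and the property value shown is the default maprfs:/// URI used elsewhere in this topic, so substitute the settings from your own cluster:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ClusterConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override the packaged defaults to match the target cluster.
        // "maprfs:///" resolves to the first entry in mapr-clusters.conf;
        // replace it with your cluster's URI if needed.
        conf.set("fs.defaultFS", "maprfs:///");
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
    }
}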
Sample Code
The following sample code shows how to interface with the MapR file system using Java. The example creates a directory, writes a file, and then reads the contents of the file.
/* Copyright (c) 2009 & onwards. MapR Tech, Inc., All rights reserved */

//package com.mapr.fs;

import java.net.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;

/**
* Assumes mapr installed in /opt/mapr
*
* Compilation:
* javac -cp $(hadoop classpath) MapRTest.java
*
* Run:
* java -cp .:$(hadoop classpath) MapRTest /test
*/
public class MapRTest
{
 public static void main(String args[]) throws Exception {
        byte buf[] = new byte[ 65*1024];
        int ac = 0;
        if (args.length != 1) {
            System.out.println("usage: MapRTest pathname");
            return;
        }

        // maprfs:/// -> uses the first entry in /opt/mapr/conf/mapr-clusters.conf
        // maprfs:///mapr/my.cluster.com/
        // /mapr/my.cluster.com/

        // String uri = "maprfs:///";
        String dirname = args[ac++];

        Configuration conf = new Configuration();
        
        //FileSystem fs = FileSystem.get(URI.create(uri), conf); // if wanting to use a different cluster
        FileSystem fs = FileSystem.get(conf);
        
        Path dirpath = new Path( dirname + "/dir");
        Path wfilepath = new Path( dirname + "/file.w");
        //Path rfilepath = new Path( dirname + "/file.r");
        Path rfilepath = wfilepath;


        // try mkdir
        boolean res = fs.mkdirs( dirpath);
        if (!res) {
            System.out.println("mkdir failed, path: " + dirpath);
            return;
        }

        System.out.println( "mkdir( " + dirpath + ") went ok, now writing file");

        // create wfile
        FSDataOutputStream ostr = fs.create( wfilepath,
                true, // overwrite
                512, // buffersize
                (short) 1, // replication
                (long)(64*1024*1024) // chunksize
                );
        ostr.write(buf);
        ostr.close();

        System.out.println( "write( " + wfilepath + ") went ok");

        // read rfile
        System.out.println( "reading file: " + rfilepath);
        FSDataInputStream istr = fs.open( rfilepath);
        int bb = istr.readInt();
        istr.close();
        System.out.println( "Read ok");
 }
}

Compiling and Running a Java Application

You can compile and run the Java application using JAR files from the MapR Maven repository or from the Data Fabric installation.
Using JARs from the Maven Repository
Maven artifacts from version 2.1.2 onward are published to https://repository.mapr.com/maven/. When compiling for Data Fabric core version 6.1, add the following dependency to the pom.xml file for your project:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.7.0-mapr-1808</version>
</dependency>
This dependency pulls the required artifacts from the MapR Maven repository the next time you run mvn clean install. The JAR that includes the maprfs library is a dependency of the hadoop-common artifact.
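If your build does not already reference the MapR Maven repository in settings.xml, you can also declare it in the pom.xml. The repository id below is arbitrary; the URL is the repository noted above:
<repositories>
  <repository>
    <id>mapr-releases</id>
    <url>https://repository.mapr.com/maven/</url>
  </repository>
</repositories>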
For a complete list of artifacts and further details, see Maven Artifacts for the HPE Ezmeral Data Fabric.
Using JARs from the Data Fabric Installation
The maprfs library is included in the hadoop classpath. Add the hadoop classpath to the Java classpath when you compile and run the Java application.
  • To compile the sample code, use the following command:
    javac -cp $(hadoop classpath) MapRTest.java
  • To run the sample code, use the following command:
    java -cp .:$(hadoop classpath) MapRTest /test

Loading the Data Fabric Native Library

By default, the root class loader loads the native library so that all child class loaders can see and access it. If the native library is instead loaded by a child class loader, other classes will not be able to access it. To allow applications and associated child classes to access the symbols and variables in the native library, we recommend loading the native library through the root class loader.

The loading of the native library via the root class loader is accomplished by injecting code into the root loader. If Data Fabric runs on top of applications (such as Tomcat) where it does not have access to the root class loader, the native library will not be loaded. Child classes that try to access the symbols under the assumption that the root class loader successfully loaded the native library will fail.

The parameter -Dmapr.library.flatclass, when specified with Java, disables the injection of code via the root class loader, thus disabling the loading of the native library using the root class loader. Instead, the application trying to access the symbols can load the native library itself. However, because the native library can be loaded only once and can be seen only by the application that loads it, ensure that only one application within the JVM attempts to load and access the native library.
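For example, to run the earlier MapRTest sample with this flag set (an illustrative command line only):
    java -Dmapr.library.flatclass -cp .:$(hadoop classpath) MapRTest /test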

Garbage Collection in Data Fabric

The garbage collection (GC) algorithms in Java provide opportunities for performance optimizations for your application. Java provides the following GC algorithms:

  • Serial GC. This algorithm is typically used in client-style applications that don't require low pause times. Specify -XX:+UseSerialGC to use this algorithm.
  • Parallel GC, which is optimized to maximize throughput. Specify -XX:+UseParallelGC to use this algorithm.
  • Mostly-Concurrent or Concurrent Mark-Sweep GC, which is optimized to minimize latency. Specify -XX:+UseConcMarkSweepGC to use this algorithm.
  • Garbage First GC, a new GC algorithm intended to replace Concurrent Mark-Sweep GC. Specify -XX:+UseG1GC to use this algorithm.
Consider testing your application with different GC algorithms to determine their effects on performance.
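For example, to test the MapRTest sample with the Garbage First collector (an illustrative command line only; substitute any of the flags above):
    java -XX:+UseG1GC -cp .:$(hadoop classpath) MapRTest /test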
Flags for GC Debugging
Set the following flags in Java to log the GC algorithm's behavior for later analysis:
-verbose:gc
-Xloggc:<filename>
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationConcurrentTime 
-XX:+PrintGCApplicationStoppedTime
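For example, the flags can be combined when launching the application; the log file name gc.log is arbitrary:
    java -verbose:gc -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
         -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationConcurrentTime \
         -XX:+PrintGCApplicationStoppedTime -cp .:$(hadoop classpath) MapRTest /test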
For more information, see the Java Garbage Collection Tuning document or the Java Garbage Collection links.

Converting fid and volid

The following file system APIs are available in com.mapr.fs.MapRFileSystem for converting fid to file path and volid to volume name:

  • public String getMountPathFidCached(String fidStr) throws IOException
  • public String getVolumeNameCached(int volId) throws IOException
  • public String getVolumeName(int volId) throws IOException
  • public String getMountPathFid(String fidStr) throws IOException
Converting fid to File Path
The getMountPathFid(string) and getMountPathFidCached(string) APIs can be used to convert a file ID (fid) to the full path to the file. The getMountPathFid() API makes a call to CLDB and the file system to get the file path from the fid. Because this API does not cache or store this information locally, it might make repeated requests to CLDB and the file system for the same fid, which can result in many RPCs to both CLDB and the file system. The getMountPathFidCached() API makes a call to CLDB and the file system one time and stores the information locally in the shared library of the client. For subsequent calls, it uses the locally stored information to retrieve the file path from the fid. However, if there are many files in the volume, there might still be a large number of calls to CLDB and the file system to determine the file path for each fid in the volume. The caching is useful if the API attempts to determine the file path for the same fid repeatedly. The cache is purged after 15 seconds. If the file name changes before the cache is purged, you will see the old name for the file until the cache expires.
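The following is a minimal sketch of resolving a fid outside of a consumer application. It assumes that fs.defaultFS points at a MapR cluster, so that FileSystem.get() returns a com.mapr.fs.MapRFileSystem instance; the class name is illustrative, and the fid value is a placeholder passed on the command line:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import com.mapr.fs.MapRFileSystem;

public class FidLookup {
    public static void main(String[] args) throws IOException {
        // Cast is valid when the default file system is maprfs:///.
        MapRFileSystem fs = (MapRFileSystem) FileSystem.get(new Configuration());
        String fid = args[0]; // placeholder fid string
        // Cached lookup; the cache entry expires after 15 seconds.
        String path = fs.getMountPathFidCached(fid);
        System.out.println("Path for fid " + fid + " is " + path);
    }
}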

For example, the sample consumer application and the sample uncached consumer application for consuming audit logs as stream messages use these methods as shown below.

  • Sample Cached Consumer
    {
        String token = st1.nextToken();
        /* If the field has a fid, expand it using the cached API */
        if (token.endsWith("Fid")) {
            String lfidStr = st1.nextToken();
            String path = null;
            try {
                path = fs.getMountPathFidCached(lfidStr); // Expand FID to path
            } catch (IOException e) {
            }
            lfidPath = "\"FidPath\":\"" + path + "\",";
            // System.out.println("\nPath for fid " + lfidStr + " is " + path);
        }
    }
  • Sample Uncached Consumer
    {
        String token = st1.nextToken();
        if (token.endsWith("Fid")) {
            String lfidStr = st1.nextToken();
            String path = null;
            try {
                path = fs.getMountPathFid(lfidStr); // Expand FID to path
            } catch (IOException e) {
            }
            lfidPath = "\"FidPath\":\"" + path + "\",";
            // System.out.println("\nPath for fid " + lfidStr + " is " + path);
        }
    }
Converting volid to Volume Name
The getVolumeName() and getVolumeNameCached() APIs can be used to convert a volume ID (volid) to the volume name. The getVolumeName() API makes a call to the CLDB every time to get the volume name from the volid, which can result in a large number of RPCs to CLDB. The getVolumeNameCached() API makes a call to the CLDB one time and stores the information locally in the shared library of the client. For subsequent calls, it uses the locally stored information to retrieve the volume name from the volid. The cache is purged after 15 seconds.
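Similarly, the following minimal sketch resolves a volume ID to its name under the same assumptions as the fid example above; the class name is illustrative, and the volume ID is a placeholder passed on the command line:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import com.mapr.fs.MapRFileSystem;

public class VolumeNameLookup {
    public static void main(String[] args) throws IOException {
        MapRFileSystem fs = (MapRFileSystem) FileSystem.get(new Configuration());
        int volId = Integer.parseInt(args[0]); // placeholder volume id
        // Cached lookup; the cache entry expires after 15 seconds.
        String name = fs.getVolumeNameCached(volId);
        System.out.println("Volume name for volid " + volId + " is " + name);
    }
}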

For example, the sample consumer application and the sample uncached consumer application for consuming audit logs as stream messages use these methods as shown below.

  • Sample Cached Consumer
    if (token.endsWith("volumeId")) {
           String volid = st1.nextToken();
           String name= null;
           try {
             int volumeId = Integer.parseInt(volid);
               // Cached API to convert volume Id to volume Name
               name = fs.getVolumeNameCached(volumeId);
             }
           catch (IOException e){
           }
           lvolName = "\"VolumeName\":\""+name+"\",";
           //  System.out.println("\nVolume Name for volid " + volid +  "is " +  name);
    }
  • Sample Uncached Consumer
    if (token.endsWith("volumeId")) {
           String volid = st1.nextToken();
           String name= null;
           try {
             int volumeId = Integer.parseInt(volid);
               // API to convert volume Id to volume Name
               name = fs.getVolumeName(volumeId);
             }
           catch (IOException e){
           }
           lvolName = "\"VolumeName\":\""+name+"\",";
           //  System.out.println("\nVolume Name for volid " + volid +  "is " +  name);
    }