MapR System Overview

MapR is a complete enterprise-grade distribution for Apache Hadoop. The MapR Converged Data Platform has been engineered to improve Hadoop’s reliability, performance, and ease of use. The MapR distribution provides a full Hadoop stack that includes the MapR File System (MapR-FS), the MapR-DB NoSQL database management system, MapR Streams, the MapR Control System (MCS) user interface, and a full family of Hadoop ecosystem projects. You can use MapR with Apache Hadoop, HDFS, and MapReduce APIs.

MapR supports the Hadoop 2.x architecture and YARN (Yet Another Resource Negotiator). Hadoop 2.x and YARN make up a resource management and scheduling framework that distributes resource management and job management duties.

Hadoop 2.x was designed to solve two main problems in the Hadoop 1.x architecture:

  • Centralized job scheduling, which created scheduler bottlenecks
  • Resource management that was not separated from application programming concerns
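The division of duties described above can be illustrated with a small conceptual model. This is only a sketch, not the YARN API: the class names `ResourceManager` and `ApplicationMaster` borrow YARN's role names, but the methods, fields, and the `run_all` helper are hypothetical and exist only to show that one component hands out capacity while each application tracks its own tasks.

```python
from dataclasses import dataclass


@dataclass
class ResourceManager:
    """Cluster-wide role: hands out containers and knows nothing about job logic."""
    free_containers: int

    def allocate(self, requested: int) -> int:
        # Grant as many containers as are free, up to the requested amount.
        granted = min(requested, self.free_containers)
        self.free_containers -= granted
        return granted

    def release(self, count: int) -> None:
        self.free_containers += count


@dataclass
class ApplicationMaster:
    """Per-application role: schedules and tracks only its own tasks."""
    name: str
    pending_tasks: int
    completed_tasks: int = 0

    def run_round(self, rm: ResourceManager) -> int:
        # Ask for one container per pending task, run the tasks, give containers back.
        granted = rm.allocate(self.pending_tasks)
        self.pending_tasks -= granted
        self.completed_tasks += granted
        rm.release(granted)
        return granted


def run_all(rm: ResourceManager, apps: list) -> int:
    """Hypothetical driver: loop until every application has finished its tasks."""
    if rm.free_containers <= 0:
        raise ValueError("the cluster needs at least one free container")
    rounds = 0
    while any(app.pending_tasks for app in apps):
        for app in apps:
            app.run_round(rm)
        rounds += 1
    return rounds
```

In this model no single component tracks every task of every job: adding another application adds another `ApplicationMaster`, while the `ResourceManager` only counts containers.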

[Figure: high-level view of the MapR Converged Data Platform, showing its main components and supported ecosystem projects]

This system overview contains architectural details about the components that run on the MapR Converged Data Platform, how the components assemble into a cluster, and the relationships between the components.

The MapR distribution provides several unique features that address common concerns with Apache Hadoop:

Each issue below lists how the MapR feature addresses it, followed by the corresponding behavior in Apache Hadoop.

Data Protection
Addressed by MapR feature: MapR Snapshots provide complete recovery capabilities. MapR Snapshots are rapid point-in-time consistent snapshots for both files and tables. MapR Snapshots make efficient use of storage and CPU resources, storing only changes from the point the snapshot is taken. You can configure schedules for MapR Snapshots with easy to use but powerful scheduling tools.
Apache Hadoop: Snapshot-like capabilities are not consistent, require application changes to make consistent, and may lead to data loss in certain situations.

Security
Addressed by MapR feature: With wire-level security, data transmissions to, from, and within the cluster are encrypted, and strong authorization mechanisms enable you to tailor the actions a given user is able to perform. Authentication is robust without burdening end-users. Permissions for users are checked on each file access.
Apache Hadoop: Permissions for users are checked on file open only.

Disaster Recovery
Addressed by MapR feature: MapR provides business continuity and disaster recovery services out of the box with mirroring that’s simple to configure and makes efficient use of your cluster’s storage, CPU, and bandwidth resources.
Apache Hadoop: No standard mirroring solution. Scripts based on distcp quickly become hard to administer and manage. No enterprise-grade consistency.

Enterprise Integration
Addressed by MapR feature: With high-availability Direct Access NFS, data ingestion to your cluster can be made as simple as mounting an NFS share to the data source. Support for Hadoop ecosystem projects like Flume or Sqoop means minimal disruptions to your existing workflow.

Performance
Addressed by MapR feature: MapR uses customized units of I/O, chunking, resync, and administration. These architectural elements allow MapR clusters to run at speeds close to the maximum allowed by the underlying hardware. In addition, the DirectShuffle technology leverages the performance advantages of MapR-FS to deliver strong cluster performance, and Direct Access NFS simplifies data ingestion and access. MapR-DB tables, available with the M7 license, are natively stored in the file system and support the Apache HBase API. MapR-DB tables provide the fastest and easiest to administer NoSQL solution on Hadoop.
Apache Hadoop: Stock Apache Hadoop’s NFS cannot read or write to an open file.

Scalable Architecture (without single points of failure)
Addressed by MapR feature: The MapR Converged Data Platform provides High Availability for the Hadoop components in the stack. MapR clusters don’t use NameNodes and provide stateful high-availability for the MapReduce JobTracker and Direct Access NFS. Works out of the box with no special configuration required.
Apache Hadoop: NameNode HA provides failover, but no failback, while limiting scale and creating complex configuration challenges. NameNode federation adds new processes and parameters to provide cumbersome, error-prone file federation. The High-Availability JobTracker in stock Apache Hadoop does not preserve the state of running jobs. Failover for the JobTracker requires restarting all in-progress jobs and brings complex configuration requirements.
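
The Data Protection entry says that MapR Snapshots store only changes from the point the snapshot is taken. The general idea behind that kind of snapshot can be sketched with a toy copy-on-write model. This is an illustration under stated assumptions, not the MapR-FS implementation: the `Volume` class, its methods, and the block naming are all hypothetical.

```python
_ABSENT = object()  # marks "this block did not exist when the snapshot was taken"


class Volume:
    """Toy volume whose snapshots keep only the blocks that changed afterwards."""

    def __init__(self):
        self.blocks = {}     # live data: block_id -> contents
        self.snapshots = {}  # snapshot name -> {block_id: contents at snapshot time}

    def snapshot(self, name):
        # Taking a snapshot copies nothing; it starts as an empty set of changes.
        self.snapshots[name] = {}

    def write(self, block_id, data):
        old = self.blocks.get(block_id, _ABSENT)
        # Preserve the old contents once per snapshot, on the first overwrite only.
        for preserved in self.snapshots.values():
            preserved.setdefault(block_id, old)
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]

    def read_snapshot(self, name, block_id):
        preserved = self.snapshots[name]
        # Unchanged blocks are read from live data; changed blocks from the snapshot.
        value = preserved.get(block_id, self.blocks.get(block_id, _ABSENT))
        if value is _ABSENT:
            raise KeyError(block_id)
        return value
```

In this model an unchanged block costs a snapshot no extra storage, because the snapshot reads it straight from the live volume; only blocks written after the snapshot add an entry.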

Read the following sections in this document to learn more about these key features.

MapR Editions

The edition of MapR that you use determines which features are available on the cluster. MapR offers the following editions:
MapR Community Edition (formerly M3)
Free community edition
MapR Enterprise Edition (formerly M5)
Adds high availability and data protection, including multi-node NFS
MapR Enterprise Database Edition (formerly M7)
Adds structured table data natively in the storage layer and provides a flexible NoSQL database

Get Started

Now that you know a bit about how the features of the MapR Converged Data Platform work, take a quick tour to see for yourself how they can work for you:
  • MapR Sandbox for Hadoop: Try out a single-node cluster that's ready to roll, right out of the box!
  • Use the MapR Installer to set up a production cluster, large or small.

For more details about the features introduced here, see the following sections: