bookmarks  3

  •  

    Cascading is a data processing API, process planner, and process scheduler for defining and executing complex, scale-free, fault-tolerant data processing workflows on an Apache Hadoop cluster, all without having to 'think' in MapReduce. Cascading is a thin Java library and API that sits on top of Hadoop's MapReduce layer and is executed from the command line like any other Hadoop application. Because the library can be driven from any JVM-based language (Jython, JRuby, Groovy, Clojure, etc.), developers can create applications and frameworks that are "operationalized": a single deployable Jar can encapsulate a series of complex and dynamic processes, all driven from the command line or a shell, instead of relying on external schedulers to glue many individual applications together with XML written against each individual command line interface. The Cascading API approach dramatically simplifies development, regression and integration testing, and deployment of business-critical applications, both on Amazon Web Services (for example, Elastic MapReduce) and on dedicated hardware. Cascading is not a new text-based query syntax (like Pig) or another complex system that must be installed on a cluster and maintained (like Hive), but it is both complementary to and a valid alternative to either application.
    13 years ago by @gresch
  •  

    Sqoop is a tool designed to import data from relational databases into Hadoop. Sqoop uses JDBC to connect to a database. It examines each table’s schema and automatically generates the classes needed to import data into the Hadoop Distributed File System (HDFS). Sqoop then creates and launches a MapReduce job that reads tables from the database via DBInputFormat, the JDBC-based InputFormat. Tables are read into a set of files in HDFS. Sqoop supports both SequenceFile and text-based targets and includes performance enhancements for loading data from MySQL.
    14 years ago by @gresch
  •  

    Katta is a scalable, failure-tolerant, distributed data store for real-time access. Katta serves large, replicated indices as shards to handle high loads and very large data sets. These indices can be of different types. Currently, implementations are available for Lucene and Hadoop MapFiles.
    * Makes serving large or high-load indices easy
    * Serves very large Lucene or Hadoop MapFile indices as index shards on many servers
    * Replicates shards on different servers for performance and fault tolerance
    * Supports pluggable network topologies
    * Master fail-over
    * Fast, lightweight, easy to integrate
    * Plays well with Hadoop clusters
    * Apache Version 2 License
    15 years ago by @gresch
