Tags : Browse Projects

Select a tag to browse associated projects and drill deeper into the tag cloud.

Apache Spark

Claimed by Apache Software Foundation Analyzed about 1 hour ago

Apache Spark is an open source cluster computing system that aims to make data analytics fast, both fast to run and fast to write. To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly, more rapidly than with disk-based systems like Hadoop. To make programming faster, Spark offers high-level APIs in Scala, Java and Python, letting you manipulate distributed datasets like local collections. You can also use Spark interactively to query big data from the Scala or Python shells. Spark integrates closely with Hadoop to run inside Hadoop clusters and can access any existing Hadoop data source.

1.3M lines of code

374 current contributors

about 10 hours since last commit

56 users on Open Hub

Very High Activity
5.0

Apache HBase

Claimed by Apache Software Foundation Analyzed about 15 hours ago

HBase is the Hadoop database. It's an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Try it if your plans for a data store run to big.

989K lines of code

120 current contributors

1 day since last commit

31 users on Open Hub

High Activity
5.0

Apache Mahout

Claimed by Apache Software Foundation Analyzed about 12 hours ago

Apache Mahout's goal is to build scalable machine learning libraries. By scalable we mean: scalable to reasonably large data sets. Our core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However, we do not restrict contributions to Hadoop-based implementations: contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance for non-distributed algorithms as well.

146K lines of code

0 current contributors

2 months since last commit

25 users on Open Hub

Low Activity
3.6

Apache Accumulo

Claimed by Apache Software Foundation Analyzed about 21 hours ago

Apache Accumulo is a sorted, distributed key/value store based on Google's BigTable design. It is built on top of Apache Hadoop, ZooKeeper, and Thrift. It features a few novel improvements on the BigTable design in the form of cell-level access labels and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.

449K lines of code

34 current contributors

7 days since last commit

24 users on Open Hub

High Activity
0.0

Apache Hive

Claimed by Apache Software Foundation No analysis available

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, ad hoc querying and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data, and it also provides a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

0 lines of code

0 current contributors

0 since last commit

23 users on Open Hub

Activity Not Available
5.0
Mostly written in language not available
Licenses: apache_2

Apache Pig

Claimed by Apache Software Foundation Analyzed about 10 hours ago

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

762K lines of code

4 current contributors

9 months since last commit

10 users on Open Hub

Very Low Activity
5.0
Tags hadoop pig

Apache Flink

Claimed by Apache Software Foundation Analyzed about 3 hours ago

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale. Learn more about Flink at http://flink.apache.org/

2.11M lines of code

323 current contributors

about 16 hours since last commit

9 users on Open Hub

Very High Activity
5.0

Apache Avro

Claimed by Apache Software Foundation No analysis available

Avro is a serialization system.

0 lines of code

75 current contributors

0 since last commit

8 users on Open Hub

Activity Not Available
0.0
Mostly written in language not available
Licenses: apache_2

AppScale

Analyzed about 10 hours ago

AppScale is an open-source implementation of the Google AppEngine (GAE) cloud computing interface. AppScale enables execution of GAE applications on virtualized cluster systems. In particular, AppScale enables users to execute GAE applications using their own clusters with greater scalability and reliability than the GAE SDK provides. Moreover, AppScale executes automatically and transparently over cloud infrastructures such as the Amazon Web Services (AWS) Elastic Compute Cloud (EC2) and Eucalyptus, the open-source implementation of the AWS interfaces.

1.23M lines of code

10 current contributors

almost 4 years since last commit

7 users on Open Hub

Inactive
5.0

Apache Impala

Claimed by Apache Software Foundation Analyzed about 22 hours ago

Apache Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.

838K lines of code

64 current contributors

11 days since last commit

7 users on Open Hub

High Activity
5.0