|
The Rutgers University CAC site houses the following ongoing projects. Additional information about these projects is available at the CAC at Rutgers Web site. Adaptive policy application for autonomic system management using decentralized online clustering Autonomic techniques based on dynamic policy application provide a powerful and promising approach for the effective management of distributed computational infrastructures, by reducing management complexity and allowing human administrators to focus primarily on the definition of these policies at a high level. However, these high-level polices (which we refer to as meta-policies) are typically defined with static constraint thresholds and are either associated with specific system goals or with known states of the managed entities, obtained through feedback from events or actions. This limits their applicability in situations where the appropriate management actions depend on dynamic system properties, which require adapting application thresholds and parameters without modifying absolute policy definition constraints.
This project addresses the gap that exists between goal-driven meta-policies expressed in terms of these absolute constraints and the actual thresholds on operational parameters of low-level policies (simply, policies) that must be applied so that these constraints are met. The main contributions of this research project are: 1) a conceptual framework for meta-policy definition in terms of event-based descriptions of system state (clustering profiles), and 2) a mechanism for dynamic policy generation based on a mapping of system states to an agglomeration of patterns in run-time events. Autonomic Data Streaming and In-transit Processing This project addresses the problem of autonomic data streaming at three levels: (1) data extraction level tries to minimize the overheads and impact of the I/O operations on the application execution while extracting data from the running applications, (2) data sharing/redistribution level provides a virtual, distributed memory shared space that supports online data indexing, flexible data querying and data processing (e.g., reduction, min, max, data redistribution, range querying, etc.), and (3) data transport/streaming level tries to optimize the data transport over wide-area networks with in-transit data processing, to satisfy strict data coupling constraints.
DART: Data extraction is implemented using DART. DART is a communication infrastructure built on advanced network interconnects, e.g., RDMA, and uses an asynchronous data transport paradigm provided by the Portals Library. DART provides flexible, asynchronous APIs that allow an application to overlap computation with data communication and thus reduces the CPU overhead spent in I/O operations and dedicate more time to application computations. DART provides higher application throughputs, minimizes the overhead of I/O operations, overlaps computation with communication and increases the CPU availability for an application.
DataSpaces: DataSpaces is a dynamic and asynchronous interaction framework. It provides the abstraction of a distributed memory system through a semantically specialized shared data space framework, i.e., DataSpaces. The framework has decoupled and asynchronous data sharing semantics, which enables cooperative interactions between distinct and heterogeneous parallel applications that run on different resources and can progress at different rates. Our goal is to complement conventional approaches for data movement encountered in workflow engines, i.e., files, with an in-memory sharing mechanism that alleviates the performance penalties, but leverages their flexibility. DataSpaces provides an in-memory temporary storage into which applications can insert or retrieve data in an asynchronous fashion. The applications interactions are abstracted as data queries to/from the space using a simple, yet powerful API, e.g., get(), put(). This simple mechanism can build complex application interactions with different coupling patterns, e.g., one-way, two-ways coupling, one-to-one, one-to-many, many-to-many data redistribution, etc. It can also serve as an implicit coordination and synchronization mechanism between loosely coupled applications.
ADAPT: ADAPT provides services for high-throughput data streaming services and in-transit data manipulation, and provides the mechanisms as well as the management strategies for large-scale data-intensive scientific and engineering workflows. The ADAPT architecture for autonomic data-streaming and in-transit processing focuses on scheduling the in-transit processing of data using available resources while ensuring that the end-to-end QoS constraints are satisfied and the data arrived at the sink “in-time”. The specific research was driven by the requirements of the data-intensive workflows associated with coupled fusion simulations and focused on the definition of a “slack metric” that estimates the time between when the data was produced and when it is required at the sink and determines the amount of processing that can be performed in-transit. The approach used is to develop a two-level strategy. An estimate of the slack was determined in an end-to-end manner between the source and sink based on prior interactions. The slack is then used by the in-transit nodes to make provisioning and processing/forwarding decisions at the in-transit nodes. The slack along with the application data generation rates are also used to determine the size of the in-transit node overlay. Autonomic Computing Engines Consolidated and virtualized cluster-based computing centers have become dominant computing platforms in industry and research for enabling complex and compute intensive applications. However, as scales, operating costs, and energy requirements increase, maximizing efficiency, cost-effectiveness, and utilization of these systems becomes paramount. Furthermore, the complexity, dynamism, and often time critical nature of application workloads makes on-demand scalability, integration of geographically distributed resources, and incorporation of utility computing services extremely critical. Finally, the heterogeneity and dynamics of the system, application, and computing environment require context-aware dynamic scheduling and runtime management. Accelerating Hadoop/MapReduce for Heterogeneous Moderate-Sized Datasets using CometCloud - Deploying real-world applications The objective of this research is to (1) deploy three real world application from BMS, Protein Data Bank, MapDistances and ScorePose and evaluate the performance of the applications as well as MapReduce-CometCloud (2) develop an interface to support multi-threaded worker for multi-processor node. In this research we use CometCloud and its services to build a MapReduce infrastructure that address the above requirements. CometCloud is a decentralized (peer-to-peer) computational infrastructure that supports distributed applications with asynchronous coordination and communication requirements. Specifically, we use CometCloud to enable pull based scheduling of Map tasks as well as stream based coordination and data exchange. Also as many nodes have multiple processors, we developed an interface of multi-threaded worker to maximize the utilization of multi-processor node. A representative worker takes the responsibility to communicate with Comet space to pull tasks and communicate with the master to send results. Other threaded workers concentrate on computation. We deployed the real world applications using the CometCloud-based MapReduce/Hadoop framework on BMS cluster as well as Rutgers campus cluster. The applications ran with multi-threaded workers and demonstrated the performance improvement with multi-threaded workers. Overall, the CometCloud-based MapReduce solution can accelerate the computations on heterogeneous, medium sized datasets by delaying or avoiding the use of distributed file reads and writes. Also it can accelerate the computations more by enabling multi-threaded workers. In this research, we showed preliminary results and ongoing efforts are focused on the extended evaluation of the application performance in various aspects and MapReduce/Hadoop-CometCloud overhead. Besides, we are working on event notification based pull-tasks consumption model and load-balancing on Hadoop-CometCloud. Exploring adaptation for dynamic applications on hybrid grids-clouds infrastructure using CometCloud Clouds support a different although complementary usage model to more traditional High Performance Computing (HPC) grids. Cloud usage models are based upon ondemand access to computing utilities, an abstraction of unlimited computing resources, and a usage-based payment model whereby users essentially “rent” virtual resources and pay for what they use. Several recent efforts have clearly demonstrated that clouds can be effectively used as alternate platforms for certain classes of applications. Many of these applications that currently use clouds are cross-over applications from the legacy & cluster world. Whereas it is important to explore and support the migration of traditional applications to cloud computational platforms, it is also imperative to ask: What new applications and application capabilities can be supported by clouds – either as stand-alone or as part of a hybrid grid-cloud computational platform? Can the addition of clouds enable scientific applications and usage modes, that are not possible otherwise? What abstractions and systems are essential to support these advanced applications on different hybrid grid-cloud platforms (eg. HPC grids-clouds, High Throughput Computing (HTC) grids-clouds). We address these in the context of dynamic applications. Objectives of this project are the following: - Develop an objective-driven autonomic scheduler over a hybrid computing environment
- Integrate TeraGrid as a Grid and several types of EC2 instances (m1.small, m1.large, m1.xlarge, c1.medium, c1.xlarge) as clouds to form a hybrid computing environment
- Explore infrastructure/application/hybrid adaptation for dynamic workflow based Defiant Reservoir Simulator with Ensemble Kalman Filter (EnKF)
We are based on objective-driven hybrid usage and possible objectives are 1) acceleration using clouds 2) conservation of HPC resource (especially TeraGrid) 3) resilience management of HPC resource using clouds. We explored adaptation specifically within the acceleration objective. Three approaches to adapt computational science applications are 1) infrastructure adaptivity which appropriate resource types for objective are selected dynamically, 2) application tuning which optimized application configuration is evaluated in every stage of the workflow and applied for the next stage, 3) hybrid adaptivity which both infrastructure adaptivity and application tuning are applied at the same time. We implemented the autonomic scheduler over the hybrid computing environment and adopted MPI-EnKF for running a task including MPI run across multiple VMs as well as inside of a VM. Results showed the performance improvement when adaptation is applied. We are improving autonomic scheduler to integrate more resource classes such as Nimbus and Eucalyptus and to support multiple applications at the same time. |