Projects
UA Projects

Brief descriptions of ongoing projects at the CAC University of Arizona site are listed below. Additional information about these projects is available at the CAC at University of Arizona Web site.

Autonomic Power and Performance Management of Large-scale Data Centers

The goal of this project is to design an innovative autonomic framework and architecture to optimize performance per watt in traditional server platforms and enclosures. In CAC’s first year, we modeled and simulated the operations of an interleaved memory system and developed a runtime algorithm that automatically allocates memory blocks to applications based on their current requirements. Our evaluation shows that we can reduce power consumption by more than 48% without compromising the performance of the applications.

 

Autonomic Network Defense (AND) System

The complexity, multiplicity, and impact of cyber attacks have been increasing at an alarming rate in spite of significant research and development investment in cyber security products and tools. Current techniques for detecting these smart and sophisticated attacks and protecting cyber infrastructures from them are mainly ad hoc, labor-intensive, and too slow. We are developing an innovative intrusion detection system based on an alternative approach inspired by biological systems, which can efficiently handle complexity, dynamism and uncertainty. In the past year, we have implemented a proof-of-concept prototype that demonstrates the ability of AND to detect and protect against any type of worm, denial-of-service attacks, and scanning attacks. Our detection rate is more than 99% with a very low rate of false alarms. These results are due to a multi-level intrusion detection algorithm that integrates the behavior analysis results from several layers (application, transport, network, and link).
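
The idea of integrating per-layer behavior analysis results can be illustrated with a minimal sketch. The source names the four layers but does not say how their results are combined, so the weighted-mean fusion, the weights, the score range of 0 to 1, and the threshold below are all assumptions made for illustration, not the AND algorithm itself.

```python
# Hypothetical weights: the source lists the layers but not how they are combined.
LAYER_WEIGHTS = {"application": 0.4, "transport": 0.3, "network": 0.2, "link": 0.1}


def fuse_scores(layer_scores, weights=LAYER_WEIGHTS):
    """Return the weighted mean of per-layer anomaly scores (each in [0, 1]).

    Only the layers present in layer_scores contribute, so a sensor that
    reports nothing for one layer does not dilute the others.
    """
    total_weight = sum(weights[layer] for layer in layer_scores)
    if total_weight == 0:
        raise ValueError("no weighted layers present")
    weighted = sum(weights[layer] * score for layer, score in layer_scores.items())
    return weighted / total_weight


def is_attack(layer_scores, threshold=0.5):
    """Flag traffic whose fused anomaly score reaches the (assumed) threshold."""
    return fuse_scores(layer_scores) >= threshold
```

For example, `is_attack({"application": 0.9, "transport": 0.8})` flags the traffic, while low scores at the network and link layers alone do not.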

 

Autonomic High-productivity Computing

In recent years, a holistic view of high-productivity computing (HPC) has been used to address raw productivity or specific system requirements such as high performance or high availability, rather than focusing only on traditional supercomputers or large clusters. HPC factors in all physical (e.g., computing resources, tools, and physical space), runtime environment (e.g., workload changes, faults and automation) and human (e.g., programmers, skills and proficiency with tools) aspects in quantifying its merit. Although the majority of performance-targeted HPC deployments are expensive custom-built solutions, they still suffer from several pain points including, but not limited to, a minimal degree of autonomic capabilities, the efficiency and speed of data transfer, a lack of benchmarks to evaluate productivity, and the difficulty of migrating to off-the-shelf commodity components.

In this research, we look at establishing an HPC testbed environment, defining HPC metrics and targeted benchmarks, and integrating autonomic middleware to enable third-party intelligence add-ons. The autonomic HPC-enabling technologies that will be developed at the proposed HPC lab will enable us to efficiently: 1) optimize resource allocation to respond to runtime workload changes for best performance and/or throughput via support for manageability, virtualization, and dynamicity of runtime adaptation; 2) handle the recovery from hardware and/or software failures; 3) autonomize this highly dynamic environment; and 4) provide support for application workflows consisting of heterogeneous and coupled tasks/jobs through programming and runtime support for a range of computing patterns. The goals of this project are: first, to make high productivity the ultimate goal for high performance computing and promote awareness of HPC utilities; and second, to incorporate and expose autonomic middleware and tools in HPC environments to mitigate deployment and user constraints and enable automation through third-party intelligence. From the user's perspective, HPC utility refers to the value the user places on getting results through faithful execution of their tasks under agreed-upon service-level agreements (SLAs). From the provider's perspective, HPC utility refers to dependably executing users' tasks while minimizing the provider's total cost of ownership (TCO). We will research, develop, test, evaluate and integrate specific autonomic capabilities and establish metrics and benchmarks for HPC. We will then measure and predict HPC efficiencies given the above constraints.

 

 

Autonomic Cardiac Simulation on Multi-GPU Platform

Cardiovascular disease has become a major challenge for modern health care services. Sudden cardiac death is one of the leading causes of mortality in Western countries. Each year, heart disease kills more Americans than cancer. In most cases, sudden cardiac death is caused by a cardiac arrhythmia called ventricular fibrillation. It is now widely accepted that the most dangerous cardiac arrhythmias are associated with abnormal wave propagation caused by reentrant sources of excitation. The mechanisms underlying the initiation and subsequent dynamics of these reentrant sources in the human heart are largely unidentified, primarily due to the limited possibilities for invasively studying cardiac arrhythmias in humans. Consequently, much research has been performed on animal hearts, ranging from mouse, rabbit, and dog hearts to the pig heart. However, the arrhythmias studied in animal hearts are not necessarily the same as those that occur in the human heart. The freedom of experimentation and the range of detail that can be gathered with animal hearts are also limited for ethical reasons. Computer simulation is a promising alternative.

3D simulations of excitation potential propagation in cardiac tissue help medical researchers understand electrical instabilities and excitation dynamics in cardiac tissue. Many mathematical models of cardiac cells have been published to facilitate the simulation of excitation propagation in tissue. Tissue simulation involves solving the cell models while taking into account the effect of surrounding cells. For 3D simulation of cardiac tissue, a more realistic model, referred to as the BiDomain model, has been developed. 3D BiDomain simulations are to date considered large-scale simulations and require a huge amount of computational power. BiDomain simulations treat the heart as a continuous system and involve solving coupled nonlinear partial differential equations, so the computing platform must have enormous number-crunching capability. The general technique is to convert the continuous equations into discrete finite difference equations and solve them using numerical techniques. A value must be computed at every node of a large grid, which is a huge computational burden. As these simulations are inherently data parallel and computationally intensive, general-purpose computation on GPUs looks promising: GPUs can greatly accelerate applications with a SIMD nature. Although NVIDIA introduced CUDA to ease GPU programming, the GPU architecture, synchronization, and complex memory hierarchy still pose problems in developing efficient and scalable applications. By developing scalable data-parallel algorithms, the power of a multi-GPU platform can be harnessed. There has been little research on how well multi-GPU platforms suit large-scale simulations. This project's goal is to identify and solve the problems that arise in developing scalable data-parallel algorithms for multi-GPU architectures.
This study also emphasizes searching the design space to improve the performance of such applications at runtime by dynamically adapting the simulation timestep. By understanding the physics involved in a large-scale simulation, the various phases of the application can be optimized accordingly using Physics Aware Programming (PAP). In the PAP approach, the current execution phase of the simulation is identified, and this information is used to exploit the spatial and temporal attributes of the mathematical solvers.
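
To make the finite-difference and adaptive-timestep ideas concrete, here is a small CPU-only sketch. It is not the BiDomain model and not GPU code: it assumes a much simpler 1D diffusion equation, dv/dt = D d²v/dx², with no-flux boundaries, and the timestep rule (limit the largest per-step change, capped by the explicit stability bound dx²/(2D)) is an illustrative assumption rather than the project's PAP method. All function and parameter names are hypothetical.

```python
def laplacian_1d(v, dx):
    """Second-order central difference with no-flux (mirrored) boundaries.

    Requires at least two grid nodes.
    """
    n = len(v)
    lap = [0.0] * n
    for i in range(n):
        left = v[i - 1] if i > 0 else v[1]
        right = v[i + 1] if i < n - 1 else v[n - 2]
        lap[i] = (left - 2.0 * v[i] + right) / (dx * dx)
    return lap


def step(v, dt, dx, diffusion):
    """Advance every grid node by one explicit (forward Euler) timestep."""
    lap = laplacian_1d(v, dx)
    return [vi + dt * diffusion * li for vi, li in zip(v, lap)]


def adaptive_dt(v, dx, diffusion, dt_min, max_change=0.05):
    """Choose a timestep from how fast the solution is currently changing.

    Quiet phases get the largest stable step; active phases get a smaller
    step so that no node changes by more than max_change per step.
    """
    dt_stable = dx * dx / (2.0 * diffusion)  # explicit-scheme stability bound in 1D
    rate = max(abs(diffusion * lap) for lap in laplacian_1d(v, dx))
    if rate == 0.0:
        return dt_stable
    return min(dt_stable, max(dt_min, max_change / rate))
```

Each node's update depends only on its immediate neighbours, which is the data-parallel structure that makes this kind of solver a natural fit for GPUs.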

 

Scale-Right Provisioning Architecture for Next Generation Data Centers (SRPA for NGDC)

The dramatic and unpredictable fluctuation in the resource requirements of real-time web applications calls for elastic delivery of computing services. Current datacenter deployments tie servers tightly to the applications they run, which results in inefficiencies in provisioning for multiple peak loads, average resource utilization, autonomic features for recording varying runtime workloads, datacenter manageability, and control of overhead in the datacenter Total Cost of Ownership (TCO). Novel research approaches in parallel and distributed computing and web services (e.g., cloud computing and virtualization) have emerged as paradigms for utility maximization and cost minimization.
Inspired by these technologies and motivated by datacenter inefficiencies, this research addresses the following: (i) understand workloads’ resource requirements, including compute, memory, network and storage, as well as their constraints such as Service-Level Agreements (SLAs) and workload signatures; (ii) map each workload to the best-suited set of physical or virtual resources, where best-suited refers not only to performance and guaranteed SLAs, but also to minimal power envelopes and no thermal hotspots during the operational period; and (iii) continuously monitor runtime workloads’ resource requirements, and scale resources up or down accordingly with the above goals in mind.

 

Autonomic Protection System for DNS Protocol (APS-DNS)

Today the Internet is almost unusable without the Domain Name System (DNS) protocol, which is used whenever one browses websites, sends an email or connects to remote PCs. Users trust the DNS protocol and assume it is secure, but it is not as secure as they think. The problem is that most DNS systems in use today are based on RFC 1034 and RFC 1035, which were written in 1987, when performance was the most challenging problem. Consequently, the DNS protocol is not secure and is extremely vulnerable to exploitation by attackers. The importance of DNS security has led some researchers to redesign the protocol with security in mind as DNSSEC. But since DNS is a distributed system, deploying any change to the protocol is very expensive. In addition, DNSSEC itself can be targeted by new attacks.
We propose an alternative approach based on autonomic computing: continuously monitoring and analyzing the behavior of the DNS protocol to detect any anomalous behavior that might be triggered by DNS attacks. In this project we are designing a DNS anomaly detection system that uses behavior analysis of the DNS protocol to define a model of normal protocol activity. Since most attacks do not follow the normal behavior of the protocol, any deviation from this model can be flagged as a potential threat. During the training phase, a pattern generator produces n-grams from a wide range of normal DNS traffic patterns observed during a window of length T. The frequencies of these n-grams define the normal usage profile. During the testing phase, the behavior analysis module analyzes protocol transition sequences and matches them against the normal behavior profile. When an attack exploits vulnerabilities in the DNS protocol, it typically generates illegal or abnormal transitions in the protocol that the DNS behavior analysis module can detect.
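
The training and testing phases described above can be sketched roughly as follows. This is a simplified illustration, not the project's detector: the event labels ("QUERY", "RESPONSE"), the choice of n = 2, and the scoring rule (the fraction of n-grams never seen during training) are assumptions, and the windowing by interval T is omitted.

```python
from collections import Counter


def ngrams(sequence, n):
    """Return every run of n consecutive protocol events as a tuple."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def train_profile(normal_sequences, n=2):
    """Training phase: count how often each n-gram occurs in normal traffic."""
    counts = Counter()
    for seq in normal_sequences:
        counts.update(ngrams(seq, n))
    return counts


def anomaly_score(sequence, profile, n=2):
    """Testing phase: fraction of n-grams that never occurred in training."""
    grams = ngrams(sequence, n)
    if not grams:
        return 0.0
    unseen = sum(1 for gram in grams if gram not in profile)
    return unseen / len(grams)
```

With a profile trained on alternating query/response traffic, a sequence of back-to-back responses yields only unseen transitions and therefore the maximum score. A fuller version could compare observed n-gram frequencies against the trained frequencies instead of only checking membership.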

 

Anomaly Based HTTP Attack Detection System

HTTP has become a universal transport protocol used for file sharing, web services, media streaming, payment processing, and even for protocols such as SSH. With the advent of Web Services and Cloud Computing technologies, more and more businesses are being hosted on the Internet, so we can expect increased use of HTTP in the future. There have been many application-level attacks using HTTP in the past, and new attacks are emerging continually. This work aims to develop a robust Anomaly Based HTTP Attack Detection System. Our current prototype collects data, trains the system on the collected data, and then observes network behavior for any deviation from normal traffic behavior. Currently we consider only HTTP headers and use multiple features to capture the behavior. The RIPPER Association Rule Generation technique is employed to build normal and abnormal profiles. We have performed small-scale testing with an attack library consisting of 15 attacks. The initial results are very encouraging, with a detection rate of over 90% and very few false negatives. We are currently refining our detection approach to further improve the performance of our system by including Temporal Behavior Analysis and using a richer attack library.

 
RU Projects

The Rutgers University CAC site houses the following ongoing projects. Additional information about these projects is available at the CAC at Rutgers Web site.

 

Adaptive policy application for autonomic system management using decentralized online clustering

Autonomic techniques based on dynamic policy application provide a powerful and promising approach for the effective management of distributed computational infrastructures, by reducing management complexity and allowing human administrators to focus primarily on the definition of these policies at a high level. However, these high-level policies (which we refer to as meta-policies) are typically defined with static constraint thresholds and are either associated with specific system goals or with known states of the managed entities, obtained through feedback from events or actions. This limits their applicability in situations where the appropriate management actions depend on dynamic system properties, which require adapting application thresholds and parameters without modifying absolute policy definition constraints.

This project addresses the gap that exists between goal-driven meta-policies expressed in terms of these absolute constraints and the actual thresholds on operational parameters of low-level policies (simply, policies) that must be applied so that these constraints are met. The main contributions of this research project are: 1) a conceptual framework for meta-policy definition in terms of event-based descriptions of system state (clustering profiles), and 2) a mechanism for dynamic policy generation based on a mapping of system states to an agglomeration of patterns in run-time events.

 

Autonomic Data Streaming and In-transit Processing

This project addresses the problem of autonomic data streaming at three levels: (1) the data extraction level tries to minimize the overhead and impact of I/O operations on application execution while extracting data from running applications; (2) the data sharing/redistribution level provides a virtual, distributed-memory shared space that supports online data indexing, flexible data querying and data processing (e.g., reduction, min, max, data redistribution, and range querying); and (3) the data transport/streaming level tries to optimize data transport over wide-area networks with in-transit data processing, to satisfy strict data coupling constraints.

DART: Data extraction is implemented using DART. DART is a communication infrastructure built on advanced network interconnects, e.g., RDMA, and uses an asynchronous data transport paradigm provided by the Portals Library. DART provides flexible, asynchronous APIs that allow an application to overlap computation with data communication, which reduces the CPU overhead spent in I/O operations and leaves more time for application computations. As a result, DART provides higher application throughput and increases the CPU availability for an application.

DataSpaces: DataSpaces is a dynamic and asynchronous interaction framework. It provides the abstraction of a distributed memory system through a semantically specialized shared data space. The framework has decoupled and asynchronous data sharing semantics, which enable cooperative interactions between distinct and heterogeneous parallel applications that run on different resources and can progress at different rates. Our goal is to complement the conventional file-based approach to data movement found in workflow engines with an in-memory sharing mechanism that alleviates the performance penalties of files while retaining their flexibility.
DataSpaces provides in-memory temporary storage into which applications can insert data, and from which they can retrieve it, in an asynchronous fashion. Application interactions are abstracted as data queries to and from the space using a simple yet powerful API, e.g., get() and put(). This simple mechanism can build complex application interactions with different coupling patterns, e.g., one-way or two-way coupling and one-to-one, one-to-many, or many-to-many data redistribution. It can also serve as an implicit coordination and synchronization mechanism between loosely coupled applications.
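
The put()/get() interaction style, and the way it doubles as an implicit synchronization mechanism, can be illustrated with a toy single-process sketch. This is not the DataSpaces API: the class name `ToySpace`, the (name, version) keying, and the blocking `get` with a timeout are assumptions chosen only to show the pattern of a consumer waiting for data that a producer inserts later.

```python
import threading


class ToySpace:
    """A toy in-memory shared space keyed by (name, version). Hypothetical API."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def put(self, name, version, value):
        """Insert a value and wake any consumer waiting for it."""
        with self._cond:
            self._data[(name, version)] = value
            self._cond.notify_all()

    def get(self, name, version, timeout=None):
        """Block until the requested object exists, then return it."""
        key = (name, version)
        with self._cond:
            if not self._cond.wait_for(lambda: key in self._data, timeout):
                raise TimeoutError(f"{key} was not inserted in time")
            return self._data[key]
```

A producer and a consumer never call each other directly; the consumer's `get` simply returns once the matching `put` has happened, which is the loose coupling the paragraph above describes.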

ADAPT: ADAPT provides high-throughput data streaming and in-transit data manipulation services, along with the mechanisms and management strategies for large-scale data-intensive scientific and engineering workflows. The ADAPT architecture for autonomic data streaming and in-transit processing focuses on scheduling the in-transit processing of data using available resources while ensuring that the end-to-end QoS constraints are satisfied and the data arrives at the sink “in time”. The research was driven by the requirements of the data-intensive workflows associated with coupled fusion simulations. It focused on the definition of a “slack metric” that estimates the time between when the data is produced and when it is required at the sink, and thus determines the amount of processing that can be performed in transit. The approach is a two-level strategy. First, an estimate of the slack is determined in an end-to-end manner between the source and sink based on prior interactions. The in-transit nodes then use the slack to make provisioning and processing/forwarding decisions. The slack, along with the application data generation rates, is also used to determine the size of the in-transit node overlay.
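
A rough sketch of how a slack estimate might drive in-transit decisions is shown below. The source defines slack only as the time between data production and when the data is required at the sink; subtracting an estimated transfer time, and greedily choosing processing stages in order until the slack is used up, are assumptions made for illustration. The stage names and costs are hypothetical.

```python
def estimate_slack(deadline_s, produced_s, transfer_s):
    """Time available for in-transit work, after the estimated transfer time."""
    return max(0.0, (deadline_s - produced_s) - transfer_s)


def plan_in_transit(slack_s, stage_costs_s):
    """Pick processing stages, in pipeline order, that fit within the slack.

    stage_costs_s is a list of (stage_name, estimated_seconds) pairs.
    Stages that do not fit are left for the sink to perform.
    """
    chosen = []
    used = 0.0
    for name, cost in stage_costs_s:
        if used + cost > slack_s:
            break
        chosen.append(name)
        used += cost
    return chosen
```

For instance, with 5 seconds of slack and stages costing 2.0, 2.5 and 1.0 seconds, the first two stages run in transit and the third is forwarded unprocessed.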

 

Autonomic Computing Engines

Consolidated and virtualized cluster-based computing centers have become dominant computing platforms in industry and research for enabling complex and compute-intensive applications. However, as scales, operating costs, and energy requirements increase, maximizing efficiency, cost-effectiveness, and utilization of these systems becomes paramount. Furthermore, the complexity, dynamism, and often time-critical nature of application workloads make on-demand scalability, integration of geographically distributed resources, and incorporation of utility computing services extremely critical. Finally, the heterogeneity and dynamics of the system, application, and computing environment require context-aware dynamic scheduling and runtime management.

 

Accelerating Hadoop/MapReduce for Heterogeneous Moderate-Sized Datasets using CometCloud - Deploying real-world applications 

The objectives of this research are to (1) deploy three real-world applications from BMS, namely Protein Data Bank, MapDistances and ScorePose, and evaluate the performance of the applications as well as of MapReduce-CometCloud, and (2) develop an interface that supports multi-threaded workers on multi-processor nodes. In this research we use CometCloud and its services to build a MapReduce infrastructure that addresses the above requirements. CometCloud is a decentralized (peer-to-peer) computational infrastructure that supports distributed applications with asynchronous coordination and communication requirements. Specifically, we use CometCloud to enable pull-based scheduling of Map tasks as well as stream-based coordination and data exchange. Also, because many nodes have multiple processors, we developed a multi-threaded worker interface to maximize the utilization of multi-processor nodes. A representative worker thread is responsible for communicating with the Comet space to pull tasks and with the master to send results. The other worker threads concentrate on computation.
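
The division of labor between a representative worker and compute-only threads can be sketched as below. This is not CometCloud code: a `queue.Queue` stands in for the Comet space, a plain list stands in for the master, and the function name and parameters are hypothetical. The small bounded local queue is an assumption that keeps the pulling demand-driven, so the node takes a new task only when a compute thread is about to be free.

```python
import queue
import threading


def run_worker_node(task_space, results, map_fn, num_compute_threads=2):
    """Pull tasks from task_space, compute them on threads, report results.

    Assumes map_fn does not raise; a failing task would need error handling.
    Returns the number of tasks this node pulled.
    """
    local_tasks = queue.Queue(maxsize=num_compute_threads)
    local_results = queue.Queue()
    stop = object()  # sentinel telling a compute thread to exit

    def compute():
        while True:
            task = local_tasks.get()
            if task is stop:
                break
            local_results.put(map_fn(task))

    threads = [threading.Thread(target=compute) for _ in range(num_compute_threads)]
    for thread in threads:
        thread.start()

    # Representative role: the only code that talks to the shared task space.
    pulled = 0
    while True:
        try:
            task = task_space.get_nowait()
        except queue.Empty:
            break
        local_tasks.put(task)  # blocks while all compute threads are busy
        pulled += 1

    for _ in threads:
        local_tasks.put(stop)
    for thread in threads:
        thread.join()

    # Representative role: the only code that reports back to the master.
    for _ in range(pulled):
        results.append(local_results.get())
    return pulled
```

Because only the representative touches the shared space and the master, the compute threads never block on communication.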

We deployed the real-world applications using the CometCloud-based MapReduce/Hadoop framework on the BMS cluster as well as the Rutgers campus cluster. The applications ran with multi-threaded workers, which improved performance. Overall, the CometCloud-based MapReduce solution can accelerate computations on heterogeneous, medium-sized datasets by delaying or avoiding the use of distributed file reads and writes, and multi-threaded workers accelerate them further. The results shown in this research are preliminary; ongoing efforts focus on extended evaluation of application performance and of the MapReduce/Hadoop-CometCloud overhead. We are also working on an event-notification-based pull model for task consumption and on load balancing in Hadoop-CometCloud.

 

Exploring adaptation for dynamic applications on hybrid grids-clouds infrastructure using CometCloud

Clouds support a different, although complementary, usage model to more traditional High Performance Computing (HPC) grids. Cloud usage models are based upon on-demand access to computing utilities, an abstraction of unlimited computing resources, and a usage-based payment model whereby users essentially “rent” virtual resources and pay for what they use. Several recent efforts have clearly demonstrated that clouds can be effectively used as alternate platforms for certain classes of applications. Many of the applications that currently use clouds are cross-over applications from the legacy and cluster world.

Whereas it is important to explore and support the migration of traditional applications to cloud computational platforms, it is also imperative to ask: What new applications and application capabilities can be supported by clouds, either stand-alone or as part of a hybrid grid-cloud computational platform? Can the addition of clouds enable scientific applications and usage modes that are not possible otherwise? What abstractions and systems are essential to support these advanced applications on different hybrid grid-cloud platforms (e.g., HPC grids-clouds, High Throughput Computing (HTC) grids-clouds)?

We address these in the context of dynamic applications. Objectives of this project are the following:

  • Develop an objective-driven autonomic scheduler over a hybrid computing environment
  • Integrate TeraGrid as a Grid and several types of EC2 instances (m1.small, m1.large, m1.xlarge, c1.medium, c1.xlarge) as clouds to form a hybrid computing environment
  • Explore infrastructure/application/hybrid adaptation for dynamic workflow based Defiant Reservoir Simulator with Ensemble Kalman Filter (EnKF)

Our approach is based on objective-driven hybrid usage. Possible objectives are 1) acceleration using clouds, 2) conservation of HPC resources (especially TeraGrid), and 3) resilience management of HPC resources using clouds. We explored adaptation specifically within the acceleration objective. The three approaches to adapting computational science applications are 1) infrastructure adaptivity, in which resource types appropriate to the objective are selected dynamically; 2) application tuning, in which an optimized application configuration is evaluated at every stage of the workflow and applied to the next stage; and 3) hybrid adaptivity, in which infrastructure adaptivity and application tuning are applied at the same time.

We implemented the autonomic scheduler over the hybrid computing environment and adopted MPI-EnKF so that a task can include an MPI run across multiple VMs as well as inside a single VM. Results showed a performance improvement when adaptation is applied. We are improving the autonomic scheduler to integrate more resource classes, such as Nimbus and Eucalyptus, and to support multiple applications at the same time.

 

  

 

 

 

 
UF Projects

CAC projects at the University of Florida:

Demand-driven Service and Power Management in Data Centers

Power consumption is an increasingly significant percentage of the cost of operating large data centers. These centers are used by banks, investment firms, IT service providers, and other large enterprises. One possible approach to reduce power consumption is to keep machines in standby or off modes except when the data center workload requires them to be fully on. This approach depends on being able to monitor performance, workload or resource demands and to anticipate the need for resources in order to meet service-level agreements of the users who generate the workload.

This project seeks to devise mechanisms that perform the following functions: to monitor, model and predict workload associated with individual services; to model and predict global resource demand; to dynamically allocate and de-allocate virtual machines to physical machines; to devise methods based on control theory and/or market-based approaches to use the above-described mechanisms to minimize the cost of providing individual services while globally minimizing power consumption and delivering contracted service levels; and to develop and evaluate software that implements these methods.
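
As one illustration of allocating virtual machines to physical machines so that fewer hosts need to stay powered on, the sketch below uses first-fit decreasing bin packing. This is a standard heuristic substituted here for illustration; it is not the control-theoretic or market-based method the project proposes. It also assumes a single resource dimension, identical hosts, and demands expressed as integer percentages of one host's capacity.

```python
def place_vms(vm_demands, host_capacity=100):
    """Assign VMs to as few hosts as the first-fit decreasing heuristic finds.

    vm_demands maps a VM name to its demand (e.g., percent of one host).
    Returns a list of hosts, each given as the list of VM names placed on it.
    Hosts that receive no VM can stay in standby or be switched off.
    """
    hosts = []  # each entry: [remaining_capacity, [vm_name, ...]]
    ordered = sorted(vm_demands.items(), key=lambda item: item[1], reverse=True)
    for name, demand in ordered:
        if demand > host_capacity:
            raise ValueError(f"{name} does not fit on any host")
        for host in hosts:
            if host[0] >= demand:
                host[0] -= demand
                host[1].append(name)
                break
        else:
            # No powered-on host has room, so bring one more host online.
            hosts.append([host_capacity - demand, [name]])
    return [names for _, names in hosts]
```

Calling `place_vms({"a": 50, "b": 50, "c": 25, "d": 75})` packs the four VMs onto two hosts. In a demand-driven setting, the placement would be recomputed as predicted workloads change, with migration cost taken into account.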

Self-organizing IP-over-P2P Overlays for Virtual Networking

This project is relevant to industries interested in provisioning virtualized environments in data centers and provisioning seamless IP-layer connectivity in wide-area environments. This project is relevant to CAC as it applies various self-managing, autonomic techniques in the area of overlay networking.

This project focuses on research and development of self-managing virtual IP overlays with the objective of enabling seamless deployment and use of virtual networks that support existing, unmodified operating systems and TCP/IP applications. It builds upon and extends the self-configuring IP-over-P2P (IPOP) overlay system developed at the University of Florida, which enables scalable, robust, self-configuring virtual network overlays interconnecting physical or virtual resources within a LAN or across a WAN (even in the presence of NATs and firewalls), and supports IPsec-based virtual private networking. This project has three focus activities:

  • Self-configuring bootstrapping of secure, private virtual network connections with techniques that efficiently integrate centralized infrastructures such as online social networks (to establish trust and store public-key cryptographic credentials) and decentralized overlays (for resource discovery and routing of IP packets);
  • Efficiently supporting multicast resource discovery in self-organizing IP overlays;
  • Integration with virtual machines and performance enhancements.

To learn more about applications of the IPOP overlay software, visit our project websites: SocialVPN and Grid Appliance.

 

Improving Timer Accuracy in Virtualized Systems for Real-time Computing

Real-time systems are frequently used in defense, transportation, financial, medical and other applications. They require precise timing information which is hard to obtain when using virtual machines. One possible approach to addressing this problem is to automatically adjust how processors are allocated to time-critical processes at runtime in order to increase the timing accuracy of those processes as needed. This approach depends on being able to build autonomic capabilities into processor affinity management middleware.

The goals of this project are to devise processor allocation mechanisms for automatically adjusting processor affinity to time-critical processes that require accurate timing signals; to characterize timing accuracy achievable by these mechanisms in virtualized environments; and to study the robustness of these mechanisms in the presence of varying workloads and job mixes.

 

Health Management of IT Infrastructures

Larger and more powerful computational infrastructures are becoming prevalent due to technological advances and increasing scientific and business needs. Ensuring healthy operation of these facilities despite their scale and complexity is necessary to successfully utilize these resources. 

This research focuses on the use of a modeling framework that will enable systematic design, operation and healthy management of IT facilities.