INFO-I 416 Cloud Computing for Data Science
3 credits
- Prerequisites: Programming (INFO-B 211, CSCI-A 205, or CSCI 23000) and Database (CSCI-N 211, INFO-I 308, or CSCI 44300)
- Delivery: On-Campus, Online
- Data-intensive sciences and the data center model
- Clouds with infrastructure, platform, and software as a service
- Virtualization technologies and tools
- Introduction to FutureGrid (or OpenStack) as an experimental testbed
- Parallel programming using MapReduce vs. MPI
- MapReduce and data parallel applications using Hadoop
- Iterative MapReduce and data mining algorithms using Twister (expectation maximization, clustering, multidimensional scaling, latent Dirichlet allocation, Bayes networks)
- MapReduce on multicore processors and graphics processing units (CUDA)
- NoSQL databases (Google Bigtable and Apache HBase) and parallel query processing
- High-level languages (Hive and Pig)
- Amazon EC2 and Microsoft Azure and their applications
- Explain the main concepts, models, technologies, and services of cloud computing, the reasons for the shift to this model, and its advantages and disadvantages.
- Examine the technical capabilities and commercial benefits of hardware virtualization.
- Analyze tradeoffs for data centers in performance, efficiency, cost, scalability, and flexibility.
- Explain the core challenges of cloud computing deployments, including public, private, and community clouds, in terms of privacy, security, and interoperability.
- Create cloud computing infrastructure models.
- Demonstrate and compare the use of cloud storage vendor offerings, such as Amazon S3, Microsoft Azure, OpenStack, and the Hadoop Distributed File System (HDFS).
- Develop, install, and configure cloud computing applications under software-as-a-service principles, employing Pig, Hive, and other cloud computing frameworks and libraries.
- Apply the MapReduce programming model to data analytics in informatics-related domains.
- Enhance MapReduce performance by redesigning the system architecture (e.g., provisioning and cluster configurations).
This course covers data science concepts, techniques, and tools to support big data analytics, including cloud computing, parallel algorithms, nonrelational databases, and high-level language support. The course applies the MapReduce programming model and virtual-machine utility computing environments to data-driven discovery and scalable data processing for scientific applications.
This course includes the following topics: