INFO-H 516 Cloud Computing for Data Science
3 credits
- Prerequisites: CSCI 54100, LIS S511, INFO B512, or INFO B556; prior programming experience required
- Delivery: On-Campus
Description
This course covers data science concepts, techniques, and tools to support big data analytics, including cloud computing, parallel algorithms, nonrelational databases, and high-level language support. The course applies the MapReduce programming model and virtual-machine utility computing environments to data-driven discovery and scalable data processing for scientific applications.
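As a rough illustration of the MapReduce programming model named above, the following minimal sketch counts words in plain Python on a single machine. It is not course material: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample records are hypothetical, and a real deployment would run the same three phases across a cluster using a framework such as Hadoop or Spark.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence (single process, for illustration)."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample input: each string stands in for one record of a large dataset.
counts = word_count(["big data in the cloud", "data in motion"])
```

Because each record is mapped independently and each key is reduced independently, both phases can in principle be spread over many machines, which is what makes the model suitable for scalable data processing.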
Course Topics
This course includes the following topics:
- Cloud service models: infrastructure, platform, and software as a service (IaaS, PaaS, and SaaS)
- Virtualization technologies and tools
- MapReduce and data-parallel applications using Apache Spark
- Apache Hadoop Distributed File System
- Apache Hadoop YARN cluster resource management and the Apache Mesos distributed systems kernel
- Large-scale data storage: NoSQL databases (Google Bigtable and Apache HBase) and parallel query processing
- Large-scale machine learning: Classification, regression, and clustering using MLlib
- Spark Streaming
- Amazon Web Services (EC2 and S3) and its applications
- Exploring large spatiotemporal datasets
Learning Outcomes
- Research the main concepts, models, technologies, and services of cloud computing, the reasons for the shift to this model, and its advantages and disadvantages.
- Examine the technical capabilities and commercial benefits of hardware virtualization.
- Analyze tradeoffs for data centers in performance, efficiency, cost, scalability, and flexibility.
- Evaluate the core challenges of cloud computing deployments, including public, private, and community clouds, with respect to privacy, security, and interoperability.
- Create cloud computing infrastructure models.
- Demonstrate and compare the use of cloud storage vendor offerings.
- Develop, install, and configure cloud computing applications under software-as-a-service principles, employing cloud computing frameworks and libraries.
- Apply the MapReduce programming model to data analytics in informatics-related domains.
- Enhance MapReduce performance by redesigning the system architecture (e.g., provisioning and cluster configurations).
- Overcome difficulties in managing very large datasets, both structured and unstructured, using nonrelational data storage and retrieval (NoSQL), parallel algorithms, and cloud computing.
- Apply the MapReduce programming model to data-driven discovery and scalable data processing for scientific applications.