Big Data Workshop

28–30 May • 8:30 am–4 pm • 1310 Newmark

General Information

What is Big Data? What is data analytics? How are they being used in ways practical to scientists and (especially) engineers? This workshop gives an overview of the field, its jargon, and its practices, to answer, in short: where do I even get started?

This example-driven three-day research workshop, hosted by Computational Science and Engineering at the University of Illinois, will introduce researchers to Big Data. Short tutorials alternate with hands-on practical exercises, and participants are encouraged both to help one another and to apply what they have learned to their own research problems during and between sessions. Participants should bring their own laptops to work on; contact Neal Davis if you need special accommodation.

The target audience is academic and industrial researchers in the hard sciences and engineering who are interested in analyzing and exploiting "Big Data" to further their insight into processes and structures. The workshop will focus on real research questions and how data analytics provides a chance to progress. It will teach participants the know-how and tools to begin using data analytics in their research.

Lunch will be provided during the three main days of the workshop.

Organizer: Neal Davis

Who: The target audience for this research workshop is academic researchers (faculty, postdoctoral researchers, graduate students, and others) who are interested in exploiting "Big Data" techniques to further their insight into processes and structures.

Where: 1310 Nathan Newmark Civil Engineering Laboratory, 205 North Mathews Avenue, Urbana, Illinois

When: 28–30 May, 2014
8:30 am–4 pm

Contact: Please email cse@cse.illinois.edu for more information.


Registration

Online registration for the CSE Big Data/Data Analytics Research Workshop is handled through Eventbrite.
If you will pay with UIUC research development funds, please purchase using the CFOP ticket option.

Tentative Agenda

Session | Time | Presenter | Materials

Day 1
Introduction | 8:30–9 | Neal Davis, CSE Training Coordinator
Motivation and Big Graph Challenges | 9–10 | Rakesh Nagi, ISE Head and Professor | Slides
Architecture for Big Data and Cloud Computing | 10–12 | Roy Campbell, CS Professor | Slides
Apache Hadoop & MapReduce | 1–4 | Yahoo! Hadoop development team | Slides, Material

Day 2
Application: Phasor measurement units | 8:30–10 | Neal Davis, CSE Training Coordinator | Code (pmu_map.py); Data (only hosted temporarily): fields.out, data.out, Files 1, 2, 3, 4, 5; Notes; Hadoop Handout; Data Sets Handout
Application: Large-scale statistics | 10–11 | Nate Helwig, Statistics Visiting Professor | Materials
Data visualization | 11–12 | Dave Semeraro, NCSA & Blue Waters Asst Director
Data Mining | 1–4 | Adam Filion, MathWorks | Materials (user: anonymous; password: simulink)

Day 3
Application: Astroinformatics | 8:30–10 | Matias Carrasco-Kind, Astronomy | Slides and Materials
Machine Learning | 10–1 | Adam Filion, MathWorks | Materials
Application: Efficient ranking | 2–3:30 | Lidan Wang, Computer Science
Open Questions and Challenges | 3:30–3:45 | Radha Nandkumar, ISE (formerly ICARE Director, NCSA) | Slides
Wrap-Up and Research Opportunities | 3:45–4 | Neal Davis, CSE Training Coordinator


Farzaneh Masoud, CSE Research Coordinator

Setup

To take part in the hands-on elements, bring a laptop with VirtualBox or a compatible virtualization program installed. The VM image will be available for download here as the event draws closer (I will send out a notice). Please download and test it before the workshop starts.

Software Packages

Virtual Machine

Install VirtualBox. Download our VM image. Warning: this file is quite large, so please download it before coming to the workshop. Load the VM into VirtualBox by selecting "New...", creating an Ubuntu 64-bit Linux machine, giving it about half of your RAM, and loading the .vdi file as the hard drive.

The password for sudo and both users (cse-user and hduser) is cse-user1.

MapReduce

Apache Hadoop is a software framework for running certain types of highly parallel jobs across distributed clusters with a simple programming model.

We will use Hadoop locally and on AWS EC2.
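
For readers new to the programming model mentioned above, the following is a minimal in-memory sketch in Python of the three stages a MapReduce job goes through: map, shuffle (group by key), and reduce. It does not use Hadoop at all and is not taken from the workshop materials. The function names and the "sensor,value" record format are assumptions made up for illustration. Hadoop would run the same mapper and reducer logic in parallel across many machines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the (key, value) pairs it emits."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, as the framework does between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Call the reducer once per key with all of that key's values."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical mapper: parse a "sensor,value" line into one (sensor, value) pair.
def parse_reading(line):
    sensor, value = line.split(",")
    yield sensor.strip(), float(value)


# Hypothetical reducer: average all readings for one sensor.
def mean_reading(sensor, values):
    return sum(values) / len(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on a list of input lines."""
    return reduce_phase(shuffle(map_phase(lines, parse_reading)), mean_reading)


if __name__ == "__main__":
    sample = ["a,1.0", "b,4.0", "a,3.0"]  # made-up data
    print(run_job(sample))
```

The mapper and reducer never see the whole data set at once: the mapper handles one record, and the reducer handles one key. That restriction is what lets a framework distribute the work.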

The version of Pig bundled with the VM was compiled for a different version of Hadoop and needs to be recompiled. Please download the script recompile-pig.sh and save it to your home directory in the VM. Then, on the command line, please execute the following:
cd
chmod +x recompile-pig.sh
./recompile-pig.sh

Amazon Web Services Elastic Compute Cloud (EC2)

Amazon Web Services is one of the major infrastructure and service providers of cloud computing today. The EC2 service allows users to create an array of virtual servers, run a map/reduce-style algorithm across the system, and process their data efficiently.

AWS is one of the contenders for an Internet2 contract, making it a likely component of a future Big Data platform at UIUC.

MATLAB

MATLAB is a high-level language and interactive environment for numerical computation, visualization, and programming. MATLAB allows you to analyze data, develop algorithms, and create models and applications. An integrated development environment and toolboxes enable you to explore multiple approaches.

We will use the Statistics Toolbox, Neural Network Toolbox, and Fuzzy Logic Toolbox. Because the University has already purchased a sitewide MATLAB license, we will use MATLAB 8.3 (R2014a), packaged into the VM.

Network MAC address settings

PMU Grid Data Example

In order to complete these hands-on exercises, please execute the following on your VM.

sudo login hduser
cd pmu
wget https://s3.amazonaws.com/PMU_Data/Archive.zip
unzip Archive.zip
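
As an optional check that the download above completed intact, the sketch below uses Python's standard zipfile module to test an archive and list its entries with their sizes. It makes no assumption about what Archive.zip contains. The function name summarize_archive is hypothetical and not part of the workshop materials, and the code assumes a Python 3 interpreter is available in the VM.

```python
import os
import zipfile


def summarize_archive(path):
    """Return (name, uncompressed size in bytes) for each entry in a zip file."""
    with zipfile.ZipFile(path) as archive:
        # testzip() reads every member and returns the first corrupt name, or None.
        bad_member = archive.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile("corrupt member: {}".format(bad_member))
        return [(info.filename, info.file_size) for info in archive.infolist()]


if __name__ == "__main__":
    path = "Archive.zip"  # file name from the wget command; run from the pmu directory
    if os.path.exists(path):
        for name, size in summarize_archive(path):
            print("{:>12d}  {}".format(size, name))
    else:
        print("{} not found in the current directory".format(path))
```

A truncated download typically makes zipfile.ZipFile raise BadZipFile when the file is opened, so an error here suggests repeating the wget step.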

Astronomy Big Data Example

In order to complete these hands-on exercises, please download this script to your VM (right-click, save as).