What is Big Data? What is data analytics? How are they being used in ways that are practical for scientists and (especially) engineers? This workshop gives an overview of the field, its jargon, and its practices to answer, in short: where do I even get started?
This example-driven three-day research workshop, hosted by Computational Science and Engineering at the University of Illinois, will introduce researchers to Big Data. Short tutorials alternate with hands-on practical exercises, and participants are encouraged both to help one another and to try applying what they have learned to their own research problems during and between sessions. Participants should bring their own laptops to work on; contact Neal Davis if you need special accommodation.
The target audience is academic and industrial researchers in the hard sciences and engineering who are interested in analyzing and exploiting "Big Data" to further their insight into processes and structures. The workshop will focus on real research questions and how data analytics can help make progress on them. It will teach participants the know-how and tools to begin using data analytics in their research.
Lunch will be provided during the three main days of the workshop.
Organizer: Neal Davis
Who: The target audience for this research workshop is academic researchers (faculty, postdoctoral students, graduate students, and other researchers) who are interested in exploiting “Big Data” techniques to further their insight into processes and structures.
Where: 1310 Nathan Newmark Civil Engineering Laboratory, 205 North Mathews Avenue, Urbana, Illinois
When: 28–30 May 2014, 8:30 am–4 pm
Contact: Please mail firstname.lastname@example.org for more information.
|Introduction||8:30–9||Neal Davis, CSE Training Coordinator|
|Motivation and Big Graph Challenges||9–10||Rakesh Nagi, ISE Head and Professor||Slides|
|Architecture for Big Data and Cloud Computing||10–12||Roy Campbell, CS Professor||Slides|
|Apache Hadoop & MapReduce||1–4||Yahoo! Hadoop development team||Slides Material|
|Application: Phasor measurement units||8:30–10||Neal Davis, CSE Training Coordinator||Code (pmu_map.py) Data (only hosted temporarily) fields.out data.out File 1 2 3 4 5 Notes Hadoop Handout Data Sets Handout|
|Application: Large-scale statistics||10–11||Nate Helwig, Statistics Visiting Professor||Materials|
|Data visualization||11–12||Dave Semeraro, NCSA & Blue Waters Asst Director|
|Data Mining||1–4||Adam Filion, MathWorks||Materials (User: anonymous ; Password: simulink)|
|Application: Astroinformatics||8:30–10||Matias Carrasco-Kind, Astronomy||Slides and Materials|
|Machine Learning||10–1||Adam Filion, MathWorks||Materials|
|Application: Efficient ranking||2–3:30||Lidan Wang, Computer Science|
|Open Questions and Challenges||3:30–3:45||Radha Nandkumar, ISE; formerly ICARE Director, NCSA|
|Wrap-Up and Research Opportunities||3:45–4||Neal Davis, CSE Training Coordinator; Farzaneh Masoud, CSE Research Coordinator|
Participants who wish to take part in the hands-on elements should bring a laptop with VirtualBox or a compatible virtualization program installed. The VM image will be available for download here as we draw closer to the event (I will send out a notice). Please download and test it before the workshop starts.
Download our VM image.
Warning: this file is quite large, so please download it before coming to the workshop.
Load the VM into VirtualBox by selecting "New...", creating an Ubuntu 64-bit Linux machine, giving it about half of your RAM, and loading the .vdi file as the hard drive.
The password for
sudo and both users (
Apache Hadoop is a software framework for running certain types of highly parallel jobs across distributed clusters with a simple programming model.
We will utilize Hadoop locally and on AWS EC2.
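As a rough illustration of the "simple programming model" mentioned above, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It runs on a single machine and does not use Hadoop at all; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) and the word-count task are hypothetical choices made for illustration, not part of the workshop materials.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as a framework would do
    between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map -> shuffle -> reduce locally over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = ["big data big graphs", "data analytics"]
    print(run_job(sample))
```

In a real cluster the mapper and reducer would run on many machines at once, with the framework handling the grouping and data movement in between; the sketch only shows the shape of the code a user would typically write.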
The version of Pig bundled with the VM was compiled for a different
version of Hadoop and needs to be recompiled. Please download the
script recompile-pig.sh and save it to your home directory in
the VM. Then, on the command line, please execute the following:
chmod +x recompile-pig.sh
Amazon Web Services (AWS) is one of the major infrastructure and service providers of cloud computing today. The EC2 service allows users to create an array of virtual servers, run a map/reduce-style algorithm across the system, and process their data efficiently.
AWS is one of the contenders for an Internet2 contract, making it a likely component of a future Big Data platform at UIUC.
MATLAB is a high-level language and interactive environment for numerical computation, visualization, and programming. MATLAB allows you to analyze data, develop algorithms, and create models and applications. An integrated development environment and toolboxes enable you to explore multiple approaches.
We will use the Statistics Toolbox, Neural Network Toolbox, and Fuzzy Logic Toolbox. As the University has already purchased a sitewide MATLAB license, we will use MATLAB 8.3 (R2014a), packaged into the VM.
In order to complete these hands-on exercises, please execute the following on your VM.
sudo login hduser
In order to complete these hands-on exercises, please download this script to your VM (right-click, save as).