Syllabus

Introduction to Data Science

Students will obtain a global perspective of the different steps of a typical Data Science project. For each of these steps, some of the main techniques and methods will be presented, with further details left to other, more specific UCs. The current UC will allow students to better frame the topics addressed by the other UCs, giving them a general perspective of how they fit within the Data Science area and of their importance in the context of specific Data Science projects. The unit will also provide concrete examples of the application of the main Data Science techniques, thus providing a case-based approach to learning Data Science from the outset.

Time Series and Forecasting

Introduction. Time series data and their characteristics. Measures of dependence: autocorrelation and cross-correlation. Stationary time series. Estimation of correlation. Use of R for time series analysis. Time series decomposition and exponential smoothing. Exploratory data analysis. Estimation of trend, cycle and seasonal components. Loess, STL and “Bureau of the Census” decompositions. Moving averages, exponential smoothing. Forecasting. Time series models. ARMA models. Estimation and forecasting. Integrated ARIMA models for nonstationary data. Multiplicative Seasonal ARIMA models. Forecasting. Box-Jenkins methodology: building SARIMA models (identification, estimation and diagnostics). Model selection. Unit root tests. Forecasting. Visualizing and forecasting big time series data. Representation of many time series. Summarization of main characteristics. Automatic model selection. Automatic forecasting.

Applied Statistics

(This course should be taken by students with a CS background)

At the end of the course, the students are expected to:

feel at ease with basic concepts from Statistical Inference;
have a broad knowledge and understanding of linear regression models;
be able to expand linear regression to Ridge and LASSO regression, and to correctly apply the models to real situations;
be able to expand linear regression to generalized linear models, and correctly apply the models to real situations;
critically interpret the results from any of the previous models;
develop critical thinking skills.

Programming and Databases

(This course should be taken by students with a non-CS background)

Students should learn the main concepts of programming, illustrated through the two programming languages most used for data analysis (R and Python). Major data structures, object-oriented programming concepts and some essential search algorithms will be introduced. Students will also be introduced to key topics related to relational databases. Basic concepts of data modeling will be taught using the EER model, along with basic SQL concepts.

Big Data and Cloud Computing

Deployment of cloud-based infrastructures for big data applications. Programming big data applications using cloud programming models. Understanding of core problems and algorithms in big data applications. Hands-on practice with state-of-the-art tools for cloud computing.

Statistics and Data Analysis

Introduction and formulation of supervised and unsupervised classification problems. Examples of application.
Brief summary of random vectors.
Multivariate normal distribution.
Principal component analysis (PCA).
Classification Analysis: hierarchical and non-hierarchical methods.
Statistical decision theory.
Linear and quadratic discriminant analysis. Logistic regression.
Topics of Statistical Learning.

Machine Learning

Students should be aware of the algorithmic fundamentals of machine learning. They should be able to select the appropriate algorithms for each problem, apply them to new datasets, and understand and evaluate their results. Linear models: least squares, shrinkage (Lasso); Nearest neighbours; Statistical decision theory; Bias-variance tradeoff; Mixture models; Evaluation: cross-validation and bootstrap, measures, using statistical testing; Maximum Likelihood; Expectation-Maximization and Gibbs sampling; MCMC; Boosting and Bagging; Neural Networks, auto-encoders and deep learning; Kernel methods and SVM; Embeddings, matrix factorization and gradient descent.

Management and Entrepreneurship

The purpose of this course is to provide students with: a global vision of organizational management and a comprehensive knowledge of the major strategic issues that enterprises have to deal with; an understanding of the financial and economic analysis needed to evaluate financial and accounting reporting information; the basic entrepreneurship skills that may allow students to build their own business or financial project.

Image Processing and Analysis

The course presents the main concepts and techniques of digital image analysis and processing. The goal is that, at the end of the course, students will be able to plan and implement algorithms for extracting information from images. The course emphasizes the understanding of concepts and methods and their effective use in the analysis of simulated and experimental data. Advanced computational tools (Matlab) will be used intensively.

Statistical Inference

Acquire a solid knowledge in inductive statistics and develop capacities and skills in statistical modelling techniques, fundamental to the presentation, analysis and interpretation of data sets. Upon completing this course, the student should:

have a deep understanding of the fundamental concepts and principles of statistics;
know the fundamental parametric and nonparametric statistical methods and how to apply them to concrete situations;
be able to use the programming language R to analyze different types of data and solve statistical problems;
be able to identify and formulate a problem, to choose adequate statistical methods, to analyze and interpret in a critical way the obtained results and to communicate and present them.

Network Science

At the end of this unit the students should be able to:

explain the key concepts of network science and network analysis;
apply a range of techniques for characterizing network structure;
define methodologies for analyzing networks of different fields;
demonstrate knowledge of recent research in the area.

Bioinformatics

It is intended that students:

learn the foundations of Bioinformatics, with special emphasis on Computational Molecular Biology;
know and understand the types and sources of data used and the most important computational problems;
understand the most important and interesting algorithms, particularly in sequence pairing, phylogeny and pattern recognition (in the genome, proteome and interaction networks);
get a perspective on the most popular tools and open questions in the area.

Parallel Computing

Introduce the students to advanced concepts in the theory and practice of computational models for parallel and distributed memory architectures. Hands-on experience in programming distributed memory architectures with MPI, and in programming shared memory architectures using processes, threads and OpenMP. On completing this course, the students should:

be aware of the main models, paradigms, environments and tools for parallel programming;
understand and assess the concepts related to the structure, operation and performance of parallel programs;
formulate solutions in the main parallel programming paradigms, namely MPI, Pthreads and OpenMP.

Data Stream Mining

Fundamentals of data streams: sufficient statistics, Hoeffding bounds; Algorithms and tools: online algorithms for data stream learning; Evaluation: adapted cross-validation, prequential evaluation; Applications: sensor data, Internet of Things.

Statistical Learning for Data Science

The student should be able to: recognize different unsupervised and supervised classification problems and solve them using the discussed methods and the software R; prepare, solve and present computational data mining projects in which the presented models are discussed, evaluated and compared on concrete cases; solve computational and non-computational exercises on the methodologies.

Advanced Topics in Data Science

Frequent Pattern Mining: frequent itemsets and association rules; Apriori algorithm; itemsets summarization and rules selection; FP-Growth algorithm. Sequential Pattern Mining: GSP algorithm; PrefixSpan algorithm. Web Mining: information retrieval; recommender systems; link analysis. Text Mining: document clustering; document classification. Outlier Mining: challenges; unsupervised, semi-supervised and supervised techniques.

Advanced Topics in Artificial Intelligence

Review of fundamental concepts in artificial intelligence; Unsupervised learning; Knowledge-based Decision Systems; Search and Optimization Algorithms; Monte Carlo Learning and Methods; Neural networks, "deep learning"; Algorithms for search, learning and optimization.

Computer Vision

Digital image: The human visual system, formation of an image, digital representation of an image, color, noise. Image processing: Point-to-point manipulation, spatial filters, extraction of geometric structures, segmentation. Video processing: Optical flow, video compression. Pattern Recognition: Introduction, knowledge representation, statistical recognition of patterns, machine learning. Fields of application. New directions in Bioinformatics.

Manifold Learning

Overview of Dimensionality Reduction: High-Dimensional Data Acquisition. Curse of Dimensionality. Intrinsic and Extrinsic Dimensions. Preliminary Calculus on Manifolds. Geometric Structure of High-Dimensional Data: Similarity and Dissimilarity of Data. Graphs on Data Sets. Spectral Analysis of Graphs. Data Models and Structures of Kernels of DR. Linear Dimensionality Reduction. Classical Multidimensional Scaling. Random Projection. Nonlinear Dimensionality Reduction. Locally Linear Embedding. Local Tangent Space Alignment. Laplacian Eigenmaps. Diffusion Maps. Fast Algorithms for DR Approximation.

Statistical Analysis and Signal Processing

Digital Signal Processing Review. Topics of probabilistic methods in signals, systems and time series. Measures of dependence and joint analysis. Stationarity and ergodicity. Linear modeling and prediction, spectral analysis and filtering. Data-driven signal analysis methods. Introduction to time-frequency analysis and wavelets. Optimal and adaptive signal processing fundamentals. Least mean squares methods. Dimension reduction and data decompositions such as PCA/KL transform and empirical mode decomposition. Introduction to novel paradigms in statistical signal processing and time series, with selected topics such as independent component analysis, Bayesian signal processing and Monte Carlo based approaches. Case study application and critical insight into the studied methods.

Data Driven Decision Making

1. Decision Analysis: decision trees; conditional probabilities.
2. Simulation: simulation models based on random number generators; decision problems combining different random variables.
3. Introduction to Linear Optimization Modeling: formulations, key concepts, graphical solution methods; constructing, solving, and interpreting the solution; sensitivity and economic analysis; informed decision-making with linear optimization.
4. Introduction to Nonlinear Optimization: similarities and differences between linear and nonlinear optimization; applications.
5. Introduction to Discrete Optimization: modeling with discrete variables; discrete optimization to make informed and efficient decisions.
6. Dynamic Optimization: the role of data; applications.

Data Visualization

This course will introduce the concepts of Data Visualization, with a focus on Data Science and Visual Analytics in support of tasks that take the user from raw data to insights. Topics include basic concepts of information visualization; visual analytics of evolving phenomena; analysis of spatial and temporal data sets; visual social media analytics; and the visual analytics of text and multimedia collections. Coursework will integrate graphics developed in R (ggplot2) / Python (plotly) into interactive environments, namely data access dashboards for interactive manipulation of multiple graphs.
