About the Program
About The Certified Big Data Scientist Program
The Big Data Science course covers advanced analytical and machine learning techniques using the most popular tools in the analytics industry, such as Hadoop, Python and Spark. The course follows a case-study approach to learning and blends analytics with technology, making it well suited for aspirants who want to develop advanced data science skills.
Course Objective
This is our advanced Big Data training, in which students gain practical, in-depth skills in Hadoop and also learn advanced analytics concepts through Python and Spark. For extensive hands-on practice, students receive several assignments and projects. At the end of the program, candidates are awarded the Certified Big Data Scientist certification on successful completion of the projects provided as part of the training.
Who should do this course?
Students from IT, software or data warehouse backgrounds who want to move into the Big Data Analytics domain
Who are the trainers?
Our trainers are highly qualified industry experts and certified instructors with more than 10 years of global analytical experience.
Prerequisites
Students must have working knowledge of Hadoop, Python and Spark, and exposure to statistics for data analysis.
Project - Case Studies
Data storage using HDFS
This case study aims to give practical experience in storing and managing different types of data (structured, semi-structured and unstructured), both compressed and uncompressed.
Processing data using map reduce
This case study aims to give practical experience in understanding and developing MapReduce programs in Java and R, and in running streaming jobs from the terminal and Eclipse.
Data integration using sqoop & flume
This case study aims to give practical experience in extracting data from Oracle and loading it into HDFS (and vice versa), as well as extracting data from Twitter and storing it in HDFS.
Data Analysis using Pig
This case study aims to give practical experience in complete data analysis using Pig, including creating and using user-defined functions (UDFs).
Data Analysis using Hive
This case study aims to give practical experience in complete data analysis using Hive, including creating and using user-defined functions (UDFs).
HBase – NoSQL database creation
This case study aims to give practical experience in creating data tables and clusters using HBase.
Final Project: Integration of Hadoop Components
The final project aims to give practical experience in how different modules (Pig, Hive and HBase) can be combined to solve big data problems.
Exam & Certification
The certification is provided by Databyte Academy.
Upon successful completion of the program, students will be conferred with dual certification:
- Certificate of Completion
- CERTIFIED BIG DATA SCIENTIST*
To be certified as part of the course, students need to complete the assignments and the examination. Once all assignments are submitted and evaluated, the certificate is awarded.
New Intake – To be commenced soon
Certified Big Data Scientist
Course ID – CBDS
Duration – 120 Hours
Classes – 15 Days
Tools – HADOOP, SPARK & PYTHON
Learning Mode – Instructor-Led Classroom Training
Next Batch – To be commenced soon
Course Outcome
Ability to understand big data and use Big Data ecosystem tools to store and process it. Students also gain hands-on exposure to using big data technology to improve performance across functions by storing, managing and processing big data in an efficient manner.
Course Content
The field of data analysis, as the name implies, analyses data to discover trends. It has tremendous uses not only in the economics and financial sector but also in fields like law, healthcare, public administration, politics, telecom, social media, manufacturing, and banking and financial institutions, all of which rely on quality data analysis to arrive at strategic business decisions. Working professionals can improve their resume and their job prospects by earning a certificate in data analytics.
Data Science with Python
- What is Data Science?
- Data Science vs. Analytics vs. Data Warehousing, OLAP and MIS Reporting
- Relevance in Industry and need of the hour
- Type of problems and objectives in various industries
- How leading companies are harnessing the power of Data Science
- Different phases of a typical Analytics / Data Science project
PYTHON: Introduction and Essentials
- Overview of Python- Starting Python
- Introduction to Python Editors & IDEs (Canopy, PyCharm, Jupyter, Rodeo, IPython, etc.)
- Custom Environment Settings
- Concept of Packages/Libraries – Important packages (NumPy, SciPy, scikit-learn, Pandas, Matplotlib, etc)
- Installing & loading Packages & Name Spaces
- Data Types & Data objects/structures (Tuples, Lists, Dictionaries)
- List and Dictionary Comprehensions
- Variable & Value Labels – Date & Time Values
- Basic Operations – Mathematical – string – date
- Reading and writing data
- Simple plotting
- Control flow
- Debugging
- Code profiling
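To give a flavour of these essentials, here is a short, illustrative Python sketch covering the core data structures, comprehensions, basic string/date operations and a simple plot; the values are made up purely for demonstration.

```python
# Illustrative sketch of the Python essentials listed above (made-up values).
from datetime import date
import matplotlib.pyplot as plt

# Core data structures: tuple, list, dictionary
point = (3, 4)                           # tuple
scores = [72, 88, 95, 61]                # list
student = {"name": "Asha", "score": 88}  # dictionary

# List and dictionary comprehensions
passed = [s for s in scores if s >= 65]
grades = {s: ("pass" if s >= 65 else "fail") for s in scores}

# Basic string, mathematical and date operations
label = "top score: " + str(max(scores))
days_since = (date.today() - date(2024, 1, 1)).days

print(passed, grades, label, days_since)

# Simple plotting
plt.hist(scores, bins=4)
plt.title("Score distribution")
plt.show()
```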
PYTHON: Accessing/Importing and Exporting Data
- Importing Data from various sources (CSV, TXT, Excel, Access, etc.)
- Database Input (Connecting to database)
- Viewing Data objects – subsetting
- Exporting Data to various formats
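A minimal sketch of importing, viewing and exporting data with pandas is shown below; the file names, the SQLite database and the column names are placeholders, not course-supplied datasets.

```python
# Illustrative sketch: importing, viewing, subsetting and exporting data.
# File, table and column names below are placeholders.
import sqlite3
import pandas as pd

# Importing data from common sources
df_csv = pd.read_csv("sales.csv")
df_xls = pd.read_excel("sales.xlsx", sheet_name=0)

# Database input: connect to a database and run a query
conn = sqlite3.connect("sales.db")
df_db = pd.read_sql("SELECT * FROM orders", conn)

# Viewing data objects and subsetting
print(df_csv.head())
subset = df_csv.loc[df_csv["amount"] > 1000, ["order_id", "amount"]]

# Exporting data to various formats
subset.to_csv("large_orders.csv", index=False)
subset.to_excel("large_orders.xlsx", index=False)
```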
PYTHON: Data Manipulation & Cleansing
- Cleansing Data with Python
- Data Manipulation steps (sorting, filtering, duplicates, merging, appending, subsetting, derived variables, sampling, data type conversions, renaming, formatting, etc.)
- Data manipulation tools (Operators, Functions, Packages, control structures, Loops, arrays etc)
- Python Built-in Functions (Text, numeric, date, utility functions)
- Python User Defined Functions
- Stripping out extraneous information
- Normalizing data
- Formatting data
- Important Python Packages for data manipulation (Pandas, Numpy etc)
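The sketch below illustrates several of these manipulation and cleansing steps on a small made-up DataFrame; the column names, values and derived variable are purely illustrative.

```python
# Illustrative sketch of common cleansing and manipulation steps in pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "customer": [" asha ", "Ravi", "Ravi", "Mei", None],
    "amount":   [1200.0, 850.0, 850.0, np.nan, 430.0],
    "date":     ["2024-01-05", "2024-01-07", "2024-01-07", "2024-01-09", "2024-01-10"],
})

# Stripping out extraneous whitespace and normalizing text
df["customer"] = df["customer"].str.strip().str.title()

# Data type conversion and formatting
df["date"] = pd.to_datetime(df["date"])

# Removing duplicates and handling missing values
df = df.drop_duplicates()
df = df.dropna(subset=["customer"])
df["amount"] = df["amount"].fillna(df["amount"].mean())

# Sorting, filtering, deriving variables and renaming
df = df.sort_values("amount", ascending=False)
df = df[df["amount"] > 500]
df["amount_thousands"] = df["amount"] / 1000  # derived variable
df = df.rename(columns={"amount": "amount_raw"})
print(df)
```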
PYTHON: Data Analysis And Visualization
- Introduction to exploratory data analysis
- Descriptive statistics, Frequency Tables and summarization
- Univariate Analysis (Distribution of data & Graphical Analysis)
- Bivariate Analysis (Cross Tabs, Distributions & Relationships, Graphical Analysis)
- Creating Graphs (bar, pie, line chart, histogram, boxplot, scatter, density, etc.)
- Important Packages for Exploratory Analysis (NumPy Arrays, Matplotlib, Pandas and scipy.stats etc)
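As an illustration of this kind of exploratory analysis, the following sketch builds a tiny made-up dataset and produces summary statistics, a frequency table, a cross tab and two basic plots.

```python
# Illustrative sketch of exploratory analysis: summaries, cross tabs and plots.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "region":  ["North", "South", "North", "East", "South", "North"],
    "channel": ["Web", "Store", "Store", "Web", "Web", "Store"],
    "sales":   [230, 180, 310, 150, 275, 190],
})

# Descriptive statistics and frequency tables
print(df["sales"].describe())
print(df["region"].value_counts())

# Bivariate analysis: cross tab of two categorical variables
print(pd.crosstab(df["region"], df["channel"]))

# Graphical analysis: histogram and boxplot of a numeric variable
df["sales"].plot(kind="hist", bins=5, title="Sales distribution")
plt.show()
df.boxplot(column="sales", by="region")
plt.show()
```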
PYTHON: Basic Statistics
- Basic Statistics – Measures of Central Tendencies and Variance
- Building blocks – Probability Distributions – Normal distribution – Central Limit Theorem
- Inferential Statistics -Sampling – Concept of Hypothesis Testing
- Statistical Methods – Z/t-tests (One sample, independent, paired), Anova, Correlations and Chi-square
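A brief sketch of these statistical methods using scipy.stats on simulated data is given below; the group means and the contingency table are invented purely for illustration.

```python
# Illustrative sketch: one-sample and two-sample t-tests plus a chi-square
# test of independence, on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=15, size=50)   # e.g. scores under method A
group_b = rng.normal(loc=106, scale=15, size=50)   # e.g. scores under method B

# One-sample t-test: is the mean of group A different from 100?
t1, p1 = stats.ttest_1samp(group_a, popmean=100)

# Independent two-sample t-test: do the two groups differ?
t2, p2 = stats.ttest_ind(group_a, group_b)

# Chi-square test of independence on a 2x2 contingency table
table = np.array([[30, 20], [15, 35]])
chi2, p3, dof, expected = stats.chi2_contingency(table)

print(f"one-sample: p={p1:.3f}  two-sample: p={p2:.3f}  chi-square: p={p3:.3f}")
```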
PYTHON: Machine Learning & Predictive Modeling Basics
- Introduction to Machine Learning & Predictive Modeling
- Types of Business problems – Mapping of Techniques
- Major Classes of Learning Algorithms -Supervised vs Unsupervised Learning
- Different Phases of Predictive Modeling (Data Pre-processing, Sampling, Model Building, Validation)
- Overfitting (Bias-Variance Trade off) & Performance Metrics
- Types of validation (Bootstrapping, K-Fold cross-validation, etc.)
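The following sketch illustrates these modeling phases with scikit-learn: a hold-out split to check for overfitting and 5-fold cross-validation as a more robust estimate, using a toy dataset bundled with the library.

```python
# Illustrative sketch: data splitting, model building and k-fold validation
# with scikit-learn on a bundled toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold-out split for a simple check of overfitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))

# 5-fold cross-validation as a more robust performance estimate
scores = cross_val_score(model, X, y, cv=5)
print("5-fold accuracy:", scores.mean())
```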
PYTHON: Machine Learning in Practice
- Linear Regression
- Logistic Regression
- Segmentation – Cluster Analysis (K-Means)
- Decision Trees (CHAID/CART/C5.0)
- Artificial Neural Networks (ANN)
- Support Vector Machines (SVM)
- Ensemble Learning (Random Forest, Bagging & boosting)
- Other Techniques (KNN, Naïve Bayes)
- Important Packages for Machine Learning (scikit-learn, scipy.stats, etc.)
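As a taste of these techniques, the sketch below fits a decision tree, a random forest and a k-means segmentation on scikit-learn's bundled iris dataset; it is an illustration only, not the full set of algorithms covered in class.

```python
# Illustrative sketch: a few of the listed techniques (decision tree,
# random forest, k-means) applied to scikit-learn's toy iris dataset.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised learning: decision tree and random forest
tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print("tree accuracy:  ", tree.score(X_test, y_test))
print("forest accuracy:", forest.score(X_test, y_test))

# Unsupervised learning: k-means segmentation into 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])
```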
Hadoop & Hadoop Ecosystem: Introduction to Big Data
- Introduction and Relevance
- Uses of Big Data analytics in various industries like telecom, e-commerce, finance and insurance
- Problems with Traditional Large-Scale Systems
Hadoop (Big Data) Ecosystem
- Motivation for Hadoop
- Different types of projects by Apache
- Role of projects in the Hadoop Ecosystem
- Key technology foundations required for Big Data
- Limitations and Solutions of existing Data Analytics Architecture
- Comparison of traditional data management systems with Big Data management systems
- Evaluate key framework requirements for Big Data analytics
- Hadoop Ecosystem & Hadoop 2.x core components
- Explain the relevance of real-time data
- Explain how to use Big Data and real-time data as a Business planning tool
Hadoop Cluster-Architecture-Configuration File
- Hadoop Master-Slave Architecture
- The Hadoop Distributed File System – Concept of data storage
- Explain different types of cluster setups (fully distributed, pseudo-distributed, etc.)
- Hadoop cluster set up – Installation
- Hadoop 2.x Cluster Architecture
- A Typical enterprise cluster – Hadoop Cluster Modes
- Understanding cluster management tools like Cloudera Manager and Apache Ambari
Hadoop and Mapreduce (Yarn)
- HDFS Overview & Data storage in HDFS
- Getting data into Hadoop from the local machine (data loading techniques) and vice versa
- Map Reduce Overview (Traditional way Vs. MapReduce way)
- Concept of Mapper & Reducer
- Understanding MapReduce program Framework
- Develop MapReduce Program using Java (Basic)
- Develop MapReduce programs with the Streaming API (Basic)
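A classic illustration of the Streaming API is a word count written in Python; the sketch below is a minimal version, with the mapper and reducer combined into one file purely for brevity.

```python
#!/usr/bin/env python3
# Illustrative word-count mapper and reducer for the Hadoop Streaming API.
# In practice these would live in separate files (e.g. mapper.py / reducer.py).
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from standard input.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so counts for a word arrive together.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    mapper() if role == "map" else reducer()
```

Such scripts are normally submitted through the hadoop-streaming JAR with the -mapper and -reducer options; the exact JAR location depends on the cluster installation.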
Data Integration Using SQOOP & FLUME
- Integrating Hadoop into an Existing Enterprise
- Loading Data from an RDBMS into HDFS by Using Sqoop
- Managing Real-Time Data Using Flume
- Accessing HDFS from Legacy Systems
Data Analysis Using PIG
- Introduction to Data Analysis Tools
- Apache PIG – MapReduce Vs Pig, Pig Use Cases
- PIG’s Data Model
- PIG Streaming
- Pig Latin Program & Execution
- Pig Latin : Relational Operators, File Loaders, Group Operator, COGROUP Operator, Joins and COGROUP, Union, Diagnostic Operators, Pig UDF
- Writing Java UDFs
- Embedding Pig in Java
- PIG Macros
- Parameter Substitution
- Use Pig to automate the design and implementation of MapReduce applications
- Use Pig to apply structure to unstructured Big Data
Data Analysis Using HIVE
- Apache Hive – Hive Vs. PIG – Hive Use Cases
- Discuss the Hive data storage principle
- Explain the File formats and Records formats supported by the Hive environment
- Perform operations with data in Hive
- Hive QL: Joining Tables, Dynamic Partitioning, Custom Map/Reduce Scripts
- Hive Script, Hive UDF
- Hive Persistence formats
- Loading data in Hive – Methods
- Serialization & Deserialization
- Handling Text data using Hive
- Integrating external BI tools with Hadoop Hive
Data Analysis Using IMPALA
- Impala & Architecture
- How Impala executes Queries and its importance
- Hive vs. PIG vs. Impala
- Extending Impala with User Defined functions
Introduction to other Ecosystem Tools
- NoSQL database – HBase
- Introduction to Oozie
SPARK: SPARK in Practice
- Invoking Spark Shell
- Creating the Spark Context
- Loading a File in Shell
- Performing Some Basic Operations on Files in Spark Shell
- Caching Overview
- Distributed Persistence
- Spark Streaming Overview (Example: Streaming Word Count)
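The sketch below shows the kind of basic operations practised in this module: creating a Spark context, loading a file, transforming it and caching the result. Inside the pyspark shell the context `sc` already exists, and the HDFS path used here is a placeholder.

```python
# Illustrative sketch of basic RDD operations with PySpark.
from pyspark import SparkContext

sc = SparkContext(appName="BasicOps")   # not needed inside the pyspark shell

# Load a text file and perform some basic operations
lines = sc.textFile("hdfs:///user/demo/sample.txt")   # placeholder path
words = lines.flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# Caching keeps the result in memory across repeated actions
counts.cache()
print("distinct words:", counts.count())
print("top 5:", counts.takeOrdered(5, key=lambda kv: -kv[1]))

sc.stop()
```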
SPARK: SPARK Meets HIVE
- Analyze Hive and Spark SQL Architecture
- Analyze Spark SQL
- SQLContext in Spark SQL
- Implement a sample example for Spark SQL
- Integrating Hive and Spark SQL
- Support for JSON and Parquet File Formats
- Implement Data Visualization in Spark
- Loading of Data
- Hive Queries through Spark
- Performance Tuning Tips in Spark
- Shared Variables: Broadcast Variables & Accumulators
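A compact sketch of Spark SQL with Hive support, JSON/Parquet handling and shared variables is given below; the table name, paths and lookup values are placeholders, and running Hive queries assumes a Hive-enabled Spark configuration.

```python
# Illustrative sketch: Spark SQL with Hive support, JSON/Parquet formats,
# and shared variables. Names and paths are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SparkMeetsHive")
         .enableHiveSupport()
         .getOrCreate())

# Run a Hive query through Spark SQL
sales = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
sales.show()

# JSON and Parquet support
events = spark.read.json("hdfs:///user/demo/events.json")
events.write.mode("overwrite").parquet("hdfs:///user/demo/events_parquet")

# Shared variables: a broadcast lookup table and an accumulator
region_names = spark.sparkContext.broadcast({"N": "North", "S": "South"})
bad_rows = spark.sparkContext.accumulator(0)
print(region_names.value["N"], bad_rows.value)

spark.stop()
```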
SPARK Streaming
- Extract and analyze data from Twitter using Spark Streaming
- Comparison of Spark and Storm – Overview
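For a feel of Spark Streaming, here is a minimal word count over a local socket using the classic DStream API (fed, for example, by `nc -lk 9999`); the Twitter extraction in the syllabus follows the same pattern with a different input source.

```python
# Illustrative sketch: streaming word count with the classic DStream API.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```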
SPARK GraphX
- Overview of the GraphX module in Spark
- Creating graphs with GraphX
Introduction to Machine Learning Using Spark
- Understand the machine learning framework
- Implement some of the ML algorithms using Spark MLlib
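As a small example of MLlib in action, the sketch below trains a logistic regression model with Spark's DataFrame-based ML API on a tiny in-memory dataset; the columns and values are invented for illustration.

```python
# Illustrative sketch: logistic regression with pyspark.ml on made-up data.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkMLIntro").getOrCreate()

data = spark.createDataFrame(
    [(34.0, 1200.0, 0), (45.0, 3100.0, 1), (23.0, 800.0, 0), (52.0, 4200.0, 1)],
    ["age", "spend", "label"])

# Assemble raw columns into a single feature vector, then fit the model
assembler = VectorAssembler(inputCols=["age", "spend"], outputCol="features")
train = assembler.transform(data)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)

model.transform(train).select("age", "spend", "prediction").show()
spark.stop()
```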
Project
- Consolidate all the learnings
- Work on a Big Data project by integrating various key components
Get Ahead with Databyte’s Certificate
Earn your Certificate
Our Certified Big Data Scientist program is exhaustive, and this certificate is proof that you have taken a big leap toward mastering the domain.
Differentiate yourself with a Certified Big Data Scientist certificate
The knowledge and skills you've gained working on projects, simulations and case studies will set you ahead of the competition.
Share your achievement
Talk about it on LinkedIn, Twitter and Facebook, boost your resume or frame it – and tell your friends and colleagues about it.