Databyte Academy

About the Program

About The Certified Big Data Scientist Program

The Big Data Science course covers advanced analytics and machine learning techniques using the most popular tools in the analytics industry, such as Hadoop, Python and Spark. The course takes a case-study approach to learning and blends analytics with technology, making it apt for aspirants who want to develop advanced data science skills.

Course Objective

This is our advanced Big Data training, where students gain a practical skill set not only in Hadoop in detail but also in advanced analytics concepts through Python, Hadoop and Spark. For extensive hands-on practice, students get several assignments and projects. At the end of the program, candidates are awarded the Certified Big Data Science certification on successful completion of the projects provided as part of the training.

Who should do this course?

Students from IT, software or data warehouse backgrounds who want to get into the Big Data Analytics domain.

Who are the trainers?

Our trainers are highly qualified industry experts and certified instructors with more than 10 years of global analytical experience.


Students must have knowledge of Hadoop, Python and Spark, as well as exposure to statistics in data analysis.

Project - Case Studies

Data storage using HDFS
This case study aims to give practical experience in storing and managing different types of data (structured, semi-structured and unstructured), both compressed and uncompressed.

Processing data using MapReduce
This case study aims to give practical experience in understanding and developing MapReduce programs in Java and R, and in running streaming jobs from the terminal and Eclipse.

Data integration using Sqoop & Flume
This case study aims to give practical experience in extracting data from Oracle and loading it into HDFS (and vice versa), and in extracting data from Twitter and storing it in HDFS.

Data Analysis using Pig
This case study aims to give practical experience in complete data analysis using Pig, including creating and using user-defined functions (UDFs).

Data Analysis using Hive
This case study aims to give practical experience in complete data analysis using Hive, including creating and using user-defined functions (UDFs).

HBase NoSQL database creation
This case study aims to give practical experience in creating data tables and clusters using HBase.

Final Project: Integration of Hadoop components
The final project aims to give practical experience in how different modules (Pig, Hive, HBase) can be used together to solve big data problems.

Exam & Certification

The certification is provided by Databyte Academy

Upon successful completion of the program, students will be conferred with dual certification:

  1. Certificate of Completion

To be “Certified” as part of the course, students need to complete the assignments and the examination. Once all assignments are submitted and evaluated, the certificate is awarded.

New Intake – To be commenced soon

Certified Big Data Scientist

Course ID – CBDS
Duration – 120 Hours
Classes – 15 Days
Learning Mode – Instructor Led-Classroom Training
Next Batch – To be commenced soon

Course Outcome

Ability to understand big data and use Big Data ecosystem tools to store and process it. Students also get hands-on exposure to using big data technology to improve performance across functions by storing, managing and processing big data efficiently.

Course Content

The field of data analysis, as the name implies, analyses data to discover trends. It has tremendous uses not only in the economics and financial sectors but also in fields such as law, healthcare, public administration, politics, telecom, social media, manufacturing, and banking and financial institutions, all of which rely on quality data analysis to arrive at strategic business decisions. Working professionals can improve their resume and job prospects by earning a certificate in data analytics.

Data Science with Python

  1. What is Data Science?
  2. Data Science Vs. Analytics vs. Data warehousing, OLAP, MIS Reporting
  3. Relevance in Industry and need of the hour
  4. Type of problems and objectives in various industries
  5. How leading companies are harnessing the power of Data Science?
  6. Different phases of a typical Analytics / Data Science project

PYTHON: Introduction and Essentials

  1. Overview of Python- Starting Python
  2. Introduction to Python Editors & IDEs (Canopy, PyCharm, Jupyter, Rodeo, IPython etc.)
  3. Custom Environment Settings
  4. Concept of Packages/Libraries – Important packages (NumPy, SciPy, scikit-learn, Pandas, Matplotlib, etc.)
  5. Installing & loading Packages & Namespaces
  6. Data Types & Data objects/structures (Tuples, Lists, Dictionaries)
  7. List and Dictionary Comprehensions
  8. Variable & Value Labels – Date & Time Values
  9. Basic Operations – Mathematical – string – date
  10. Reading and writing data
  11. Simple plotting
  12. Control flow
  13. Debugging
  14. Code profiling
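
As a small illustration of items 6 and 7 above (data structures and comprehensions), here is a minimal sketch. The variable names and sample values are made up for illustration and are not taken from the course material.

```python
# Hypothetical sample data: a list of (name, score) tuples
scores = [("asha", 82), ("ben", 47), ("chen", 91)]

# List comprehension: names of students scoring 50 or more
passed = [name for name, score in scores if score >= 50]

# Dictionary comprehension: map each name to its score
score_by_name = {name: score for name, score in scores}

print(passed)
print(score_by_name["chen"])
```

The same filtering could be written with an explicit loop; comprehensions are usually preferred when the transformation fits on one line.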

PYTHON: Accessing/Importing and Exporting Data

  1. Importing Data from various sources (CSV, TXT, Excel, Access etc.)
  2. Database Input (Connecting to database)
  3. Viewing Data objects – subsetting
  4. Exporting Data to various formats

PYTHON: Data Manipulation – Cleansing

  1. Cleansing Data with Python
  2. Data Manipulation steps (Sorting, filtering, duplicates, merging, appending, subsetting, derived variables, sampling, Data type conversions, renaming, formatting etc.)
  3. Data manipulation tools (Operators, Functions, Packages, control structures, Loops, arrays etc)
  4. Python Built-in Functions (Text, numeric, date, utility functions)
  5. Python User Defined Functions
  6. Stripping out extraneous information
  7. Normalizing data
  8. Formatting data
  9. Important Python Packages for data manipulation (Pandas, Numpy etc)

PYTHON: Data Analysis And Visualization

  1. Introduction to exploratory data analysis
  2. Descriptive statistics, Frequency Tables and summarization
  3. Univariate Analysis (Distribution of data & Graphical Analysis)
  4. Bivariate Analysis (Cross Tabs, Distributions & Relationships, Graphical Analysis)
  5. Creating Graphs (Bar/pie/line chart/histogram/boxplot/scatter/density etc.)
  6. Important Packages for Exploratory Analysis (NumPy Arrays, Matplotlib, Pandas and scipy.stats etc)


  1. Basic Statistics – Measures of Central Tendencies and Variance
  2. Building blocks – Probability Distributions – Normal distribution – Central Limit Theorem
  3. Inferential Statistics -Sampling – Concept of Hypothesis Testing
  4. Statistical Methods – Z/t-tests (One sample, independent, paired), Anova, Correlations and Chi-square
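
To make the one-sample t-test in item 4 above concrete, the following is a minimal sketch that computes only the t statistic using the Python standard library. The function name `one_sample_t` and the sample values are hypothetical; in practice a package such as scipy.stats would typically also be used to obtain the p-value.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return the t statistic for the null hypothesis: population mean == mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    standard_error = sd / math.sqrt(n)
    return (mean - mu0) / standard_error


# Hypothetical measurements, tested against an assumed population mean of 5.0
sample = [5.1, 4.9, 5.3, 5.0, 5.2]
t_stat = one_sample_t(sample, mu0=5.0)
print(round(t_stat, 3))
```

The statistic is then compared against a t distribution with n - 1 degrees of freedom to decide whether to reject the null hypothesis.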

PYTHON: Machine Learning – Predictive Modeling Basics

  1. Introduction to Machine Learning & Predictive Modeling
  2. Types of Business problems – Mapping of Techniques
  3. Major Classes of Learning Algorithms -Supervised vs Unsupervised Learning
  4. Different Phases of Predictive Modeling (Data Pre-processing, Sampling, Model Building, Validation)
  5. Overfitting (Bias-Variance Trade-off) & Performance Metrics
  6. Types of validation (Bootstrapping, K-Fold validation etc.)
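
The K-Fold validation mentioned in item 6 above can be sketched in plain Python as follows. This is an illustrative sketch, not the course's implementation: the helper name `k_fold_indices` is hypothetical, and no shuffling is done, whereas library implementations usually offer it.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold, so every observation is used for validation once and for training k - 1 times.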

PYTHON: Machine Learning in Practice

  1. Linear Regression
  2. Logistic Regression
  3. Segmentation – Cluster Analysis (K-Means)
  4. Decision Trees (CHAID/CART/C5.0)
  5. Artificial Neural Networks (ANN)
  6. Support Vector Machines (SVM)
  7. Ensemble Learning (Random Forest, Bagging & boosting)
  8. Other Techniques (KNN, Naïve Bayes)
  9. Important Packages for Machine Learning (scikit-learn, scipy.stats etc.)

Hadoop & Hadoop Ecosystem: Introduction to Big Data

  1. Introduction and Relevance
  2. Uses of Big Data analytics in various industries like Telecom, E-commerce, Finance and Insurance etc.
  3. Problems with Traditional Large-Scale Systems

Hadoop (Big Data) Ecosystem

  1. Motivation for Hadoop
  2. Different types of projects by Apache
  3. Role of projects in the Hadoop Ecosystem
  4. Key technology foundations required for Big Data
  5. Limitations and Solutions of existing Data Analytics Architecture
  6. Comparison of traditional data management systems with Big Data management systems
  7. Evaluate key framework requirements for Big Data analytics
  8. Hadoop Ecosystem & Hadoop 2.x core components
  9. Explain the relevance of real-time data
  10. Explain how to use Big Data and real-time data as a Business planning tool

Hadoop Cluster-Architecture-Configuration File

  1. Hadoop Master-Slave Architecture
  2. The Hadoop Distributed File System – Concept of data storage
  3. Explain different types of cluster setups (Fully distributed/Pseudo etc.)
  4. Hadoop cluster set up – Installation
  5. Hadoop 2.x Cluster Architecture
  6. A Typical enterprise cluster – Hadoop Cluster Modes
  7. Understanding cluster management tools like Cloudera Manager/Apache Ambari

Hadoop and MapReduce (YARN)

  1. HDFS Overview & Data storage in HDFS
  2. Get data into Hadoop from the local machine (Data Loading Techniques) and vice versa
  3. Map Reduce Overview (Traditional way Vs. MapReduce way)
  4. Concept of Mapper & Reducer
  5. Understanding MapReduce program Framework
  6. Develop MapReduce Program using Java (Basic)
  7. Develop MapReduce program with streaming API (Basic)
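
The Mapper and Reducer concept from items 4 and 5 above can be sketched with a word count simulated locally in plain Python. This is only a sketch of the programming model: it does not use Hadoop or the streaming API, and the function names `mapper`, `reducer` and `run_word_count` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce phase: sum all counts emitted for one word."""
    return word, sum(counts)


def run_word_count(lines):
    # Map: apply the mapper to every input line
    mapped = [pair for line in lines for pair in mapper(line)]
    # Shuffle/sort: bring identical keys next to each other
    mapped.sort(key=itemgetter(0))
    # Reduce: call the reducer once per distinct key
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(mapped, key=itemgetter(0))
    )


result = run_word_count(["big data big ideas", "data science"])
print(result)
```

On a real cluster the framework performs the shuffle and sort between the two phases and runs many mappers and reducers in parallel; here a single sort stands in for that step.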

Data Integration Using SQOOP & FLUME

  1. Integrating Hadoop into an Existing Enterprise
  2. Loading Data from an RDBMS into HDFS by Using Sqoop
  3. Managing Real-Time Data Using Flume
  4. Accessing HDFS from Legacy Systems

Data Analysis Using PIG

  1. Introduction to Data Analysis Tools
  2. Apache PIG – MapReduce Vs Pig, Pig Use Cases
  3. PIG’s Data Model
  4. PIG Streaming
  5. Pig Latin Program & Execution
  6. Pig Latin: Relational Operators, File Loaders, Group Operator, COGROUP Operator, Joins and COGROUP, Union, Diagnostic Operators, Pig UDF
  7. Writing Java UDFs
  8. Embedded Pig in Java
  9. PIG Macros
  10. Parameter Substitution
  11. Use Pig to automate the design and implementation of MapReduce applications
  12. Use Pig to apply structure to unstructured Big Data

Data Analysis Using HIVE

  1. Apache Hive – Hive Vs. PIG – Hive Use Cases
  2. Discuss the Hive data storage principle
  3. Explain the File formats and Records formats supported by the Hive environment
  4. Perform operations with data in Hive
  5. HiveQL: Joining Tables, Dynamic Partitioning, Custom Map/Reduce Scripts
  6. Hive Script, Hive UDF
  7. Hive Persistence formats
  8. Loading data in Hive – Methods
  9. Serialization & Deserialization
  10. Handling Text data using Hive
  11. Integrating external BI tools with Hadoop Hive

Data Analysis Using IMPALA

  1. Impala & Architecture
  2. How Impala executes Queries and its importance
  3. Hive vs. PIG vs. Impala
  4. Extending Impala with User Defined functions

Introduction to other Ecosystem Tools

  1. NoSQL database – HBase
  2. Introduction to Oozie

SPARK: SPARK in Practice

  1. Invoking Spark Shell
  2. Creating the Spark Context
  3. Loading a File in Shell
  4. Performing Some Basic Operations on Files in Spark Shell
  5. Caching Overview
  6. Distributed Persistence
  7. Spark Streaming Overview (Example: Streaming Word Count)


  1. Analyze Hive and Spark SQL Architecture
  2. Analyze Spark SQL
  3. Context in Spark SQL
  4. Implement a sample example for Spark SQL
  5. Integrating Hive and Spark SQL
  6. Support for JSON and Parquet File Formats; Implement Data Visualization in Spark
  7. Loading of Data
  8. Hive Queries through Spark
  9. Performance Tuning Tips in Spark
  10. Shared Variables: Broadcast Variables & Accumulators

SPARK Streaming

  1. Extract and analyze data from Twitter using Spark Streaming
  2. Comparison of Spark and Storm – Overview


  1. Overview of the GraphX module in Spark
  2. Creating graphs with GraphX

Introduction to Machine Learning Using Spark

  1. Understand Machine learning framework
  2. Implement some of the ML algorithms using Spark MLLib


  1. Consolidate all the learnings
  2. Working on Big Data Project by integrating various key components

Get Ahead with Databyte’s Certificate

Earn your Certificate

Our Certified Big Data Science program is exhaustive and this certificate is proof that you have taken a big leap in mastering the domain.

Differentiate yourself with a Certificate in Big Data Science

The knowledge and skills you’ve gained working on projects, simulations and case studies will set you ahead of the competition.

Share your achievement

Talk about it on LinkedIn, Twitter and Facebook, boost your resume or frame it, and tell your friends and colleagues about it.
