Framework for Applied Machine Learning
What is Machine Learning (ML)?
Machine learning is about learning patterns within datasets and modeling them in order to make useful predictions and provide answers to difficult problems.
Academic ML vs applied ML
Academic ML obsesses over the algorithms and the math behind each one, while applied ML focuses on practical results.
To succeed in data science, it's more important to understand the end-to-end framework—plus the practical tools used in each step—than to obsess over the math and theory behind each algorithm.
High-level framework for Machine learning
To create real-world business value with ML, the most important thing is to have a comprehensive framework.
At a very high level, it consists of 5 core steps.
1) Exploratory Analysis
Exploratory Analysis is the process of "getting to know" the dataset before you begin your modeling or other analyses. It consists of plotting key charts, displaying key statistics, and digging into the dataset—often into individual observations—to make sure you have everything you need to complete your project.
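As a minimal sketch of what "getting to know" a dataset looks like in practice, here is a pandas-based first pass. The toy dataset and column names are hypothetical, standing in for a real project's data:

```python
import pandas as pd

# Hypothetical toy dataset standing in for a real project's data.
df = pd.DataFrame({
    "hours_slept":  [5, 6, 6, 7, 8, 8, 9],
    "memory_score": [55, 62, 68, 70, 74, 78, 82],
})

# Shape and dtypes: how many observations, and what kind of columns?
print(df.shape)
print(df.dtypes)

# Key summary statistics for each numeric column.
print(df.describe())

# Dig into individual observations, e.g. the first few rows.
print(df.head())
```

From here you would typically add plots (histograms, scatterplots) to spot distributions and outliers visually.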
2) Data Cleaning
In real-world problems, better data beats fancier algorithms every single time. Garbage in gets you garbage out. On the flipside, if you have a clean dataset, even simple algorithms can learn useful insights from it. While it's not the "sexiest" part of machine learning, proper data cleaning will make or break your project.
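Two of the most common cleaning tasks are removing duplicate observations and handling missing values. A minimal sketch with a hypothetical messy table (one simple strategy among many — the right fix always depends on why the data is missing):

```python
import numpy as np
import pandas as pd

# Hypothetical messy dataset: missing values and one duplicate row.
raw = pd.DataFrame({
    "age":    [25,    32,    np.nan, 32,    41],
    "income": [50000, 64000, 58000,  64000, np.nan],
})

# Drop exact duplicate observations.
clean = raw.drop_duplicates()

# Fill remaining missing numeric values with the column median.
clean = clean.fillna(clean.median(numeric_only=True))

print(clean)
```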
3) Feature Engineering
Feature engineering is the process of creating new input features using your dataset. This is one of the best ways data scientists add value to the ML process and improve model results, as you're able to incorporate domain knowledge with feature engineering.
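To make this concrete, here is a small sketch of deriving new features from raw columns. The transactions table and feature names are purely illustrative:

```python
import pandas as pd

# Hypothetical transactions table; column names are illustrative.
df = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2019-01-05", "2019-01-20", "2019-02-14"]),
    "price":    [100.0, 250.0, 80.0],
    "quantity": [2, 1, 5],
})

# Domain knowledge in action: derive features the raw columns only imply.
df["total_spend"]    = df["price"] * df["quantity"]          # interaction feature
df["purchase_month"] = df["purchase_date"].dt.month          # seasonality signal
df["is_weekend"]     = df["purchase_date"].dt.dayofweek >= 5 # behavioral flag
```

None of these columns existed in the raw data, yet each could carry signal a model can exploit.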
4) Algorithm Selection
For business use cases of DS and ML, it's important to choose modern algorithms that are relevant and applicable to the problem. Generally speaking, the best place for beginners to start is tree ensembles (e.g. random forests), as they are very effective general-purpose algorithms. Don't jump into neural nets and deep learning right away, as those tend to have more niche use cases.
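A minimal random forest baseline using scikit-learn. The dataset here is synthetic (a stand-in for a real business problem), which is the assumption to note:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a business dataset.
X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=4, random_state=42)

# Tree ensembles are effective general-purpose learners and need
# little tuning to produce a reasonable baseline.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# A bonus: feature importances hint at which inputs drive predictions.
importances = model.feature_importances_
print(importances)
```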
5) Model Training
Once you have the previous steps down, training a professional-level model is actually pretty straightforward and formulaic. There are a few best practices, such as cross-validation and train/test splitting, that you'll want to incorporate to avoid overfitting your models.
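The two best practices mentioned above — a held-out test set and cross-validation — can be sketched as follows (synthetic data again, for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out a test set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation on the training set guards against overfitting:
# every training observation is held out exactly once.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final check on the untouched test set.
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print(cv_scores.mean(), test_score)
```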
Now that you know the basic framework for machine learning, go out there and kick some dataset's ass!
Exploring Data Science
Tuesday, September 10, 2019
Friday, January 25, 2019
Descriptive Statistics
Whenever we analyze any survey results, we should first consider the basis of their validity. If not, the analysis would be a classic case of garbage in, garbage out. Here are some important factors to consider in judging whether survey results have credibility before we start analyzing them for insights.
Survey Results Evaluation
- How many people were surveyed (Sample size)
- Who were surveyed (Representativeness of sample)
- How the survey was conducted (Sampling methodology)
Defining Constructs within the survey
Constructs are elements that need to be measured within the survey but lack clear measurement standards. How would you measure something like, say, memory?
It could be:
- the number of faces you remember
- knowing the order in which the faces were shown
- the ethnicity of the faces
- specific facial features on certain faces
Here, an operational definition of the construct helps in measuring memory, e.g. the number of faces one can remember.
The Control and Extraneous Variables
Continuing with the memory example, let's say you did a study and found that individuals getting six hours of sleep remember less than 70% of the faces shown to them, while individuals with more than six hours of sleep remember more than 70% of the faces. Would it be valid to conclude that individuals with more sleep have a better memory?
However, there could be other factors that affect the validity of the conclusion we are about to draw. For example:
- Age
- time of day the test was taken
- stress levels
- attention intensity during the test
Hence it is important to set up a control: a group of individuals for whom most, if not all, of the factors that affect memory are held to a baseline. Another group could then have one of the variables tweaked, so that variable's effect on the results can be studied in isolation.
Independent and Dependent Variable
For this dataset, the memory score is plotted against the independent variable (hours slept). From this scatterplot, we can see that the memory score has a positive correlation with the number of hours slept. However, it is important to note that correlation does not equal causation!
In the scatterplot, at a memory score of 70 there are two data points with different hours slept (6 hrs vs 8 hrs). This suggests that factors other than the predictor/independent variable shown here are at play. Hence we cannot safely conclude that the number of hours slept causes an individual's memory score to increase. Proving that would require a controlled experiment in which the other extraneous factors are controlled.
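The strength of that relationship can be quantified with a Pearson correlation coefficient. The numbers below are hypothetical, chosen to mirror the example above (including two points at a score of 70 with different hours slept):

```python
import numpy as np

# Hypothetical data mirroring the sleep/memory example in the text.
hours_slept  = np.array([5, 6, 6, 7, 8, 8, 9])
memory_score = np.array([55, 62, 70, 71, 70, 78, 82])

# Pearson correlation: +1 is a perfect positive linear relationship,
# -1 a perfect negative one, 0 no linear relationship.
r = np.corrcoef(hours_slept, memory_score)[0, 1]
print(r)
```

A positive r confirms the correlation — but, as above, says nothing by itself about causation.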
Observational study VS Controlled Experiment
Showing relationship -> Observational Study/ Survey
In an observational study, we measure or survey members of a sample without trying to affect them.
Showing causation -> Controlled experiment
In a controlled experiment, we assign people or things to groups and apply some treatment to one of the groups while the other group does not receive the treatment.
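The assignment step is usually randomized, which is what balances extraneous variables (age, stress, and so on) across the groups on average. A minimal sketch with hypothetical participant IDs:

```python
import random

# Hypothetical participant IDs; in a real experiment these would be people.
participants = list(range(20))

# Randomly shuffle, then split: randomization balances extraneous
# variables (age, stress, ...) across the two groups on average.
random.seed(42)
random.shuffle(participants)
treatment = participants[:10]  # receives the treatment (e.g. extra sleep)
control   = participants[10:]  # receives no treatment
```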