Exploring Data Science: January 2019

Whenever we analyze survey results, we should first consider whether they are valid. If they are not, the analysis is a classic case of garbage in, garbage out. Here are some important factors to consider when judging whether survey results are credible, before we start analyzing them for insights.

Survey Results Evaluation

How many people were surveyed (Sample size)
Who was surveyed (Representativeness of sample)
How the survey was conducted (Sampling methodology)
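
To get a feel for why sample size matters, here is a minimal sketch of the usual margin-of-error approximation for a surveyed proportion. The function name and the default values (a 50% proportion and z = 1.96 for roughly 95% confidence) are assumptions chosen for illustration, and the formula only holds for a simple random sample, so it says nothing about the other two factors.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion measured in a survey.

    Assumes a simple random sample drawn from a large population.
    z = 1.96 corresponds to roughly 95% confidence.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# Quadrupling the sample size halves the margin of error.
for n in (100, 400, 1600):
    print(n, margin_of_error(n))
```

A large sample only shrinks random error. A big but unrepresentative sample can still give misleading results.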

Defining Constructs within the survey

Constructs are things that the survey needs to measure but that lack a clear measurement standard. How would you measure something like memory?

It could be:

the number of faces you remember
knowing the order in which the faces were shown
the ethnicity of the faces
specific facial features on certain faces

Here an operational definition of the construct makes it measurable, e.g. memory defined as the number of faces one could remember.
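
As a small sketch of what such an operational definition could look like in code, the function below scores memory as the number of shown faces that were correctly recalled. The function name and the face identifiers are hypothetical, and this is only one of the possible definitions listed above.

```python
def memory_score(shown_faces, recalled_faces):
    """Operational definition of memory (one of several possible):
    the number of distinct shown faces the participant recalled."""
    return len(set(shown_faces) & set(recalled_faces))


# Hypothetical test session with made-up face identifiers.
shown = ["face_01", "face_02", "face_03", "face_04", "face_05"]
recalled = ["face_02", "face_05", "face_09"]  # face_09 was never shown

score = memory_score(shown, recalled)
percent = 100 * score / len(shown)
print(score, percent)  # 2 40.0
```

Once the definition is fixed, every participant is scored the same way, which is what makes the results comparable.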

The Control and Extraneous Variables
Continuing with the memory example, let's say you did a study and found that individuals getting six hours of sleep remember fewer than 70% of the faces shown to them, while individuals getting more than six hours of sleep remember more than 70% of the faces. Would it be reasonable to conclude that individuals with more sleep have a better memory?

However, other factors could affect the validity of the conclusion we are about to draw. For example:

age
time of day the test was taken
stress levels
attention intensity during the test

Hence it is important to set up a control: a group of individuals for whom most, if not all, of the factors that affect memory are held at a baseline. Another group of individuals can then have one of the variables changed, so that the effect of that variable on the results can be studied.
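
The idea can be sketched roughly as follows. All participant records, field names and baseline values below are made up for illustration. Only participants who match the baseline on the extraneous factors are compared, and they differ in the one variable being studied (sleep).

```python
from statistics import mean

# Hypothetical participants; every value here is invented for illustration.
participants = [
    {"age_group": "20-29", "test_time": "morning", "sleep": "6h", "score": 64},
    {"age_group": "20-29", "test_time": "morning", "sleep": "6h", "score": 68},
    {"age_group": "20-29", "test_time": "morning", "sleep": ">6h", "score": 74},
    {"age_group": "20-29", "test_time": "morning", "sleep": ">6h", "score": 78},
    {"age_group": "60-69", "test_time": "evening", "sleep": "6h", "score": 50},
    {"age_group": "60-69", "test_time": "morning", "sleep": ">6h", "score": 66},
]

# Extraneous factors held at an (assumed) baseline.
BASELINE = {"age_group": "20-29", "test_time": "morning"}


def at_baseline(person):
    """True if the person matches the baseline on every extraneous factor."""
    return all(person[key] == value for key, value in BASELINE.items())


def mean_score(people, sleep):
    """Mean memory score of the people with the given sleep category."""
    return mean(p["score"] for p in people if p["sleep"] == sleep)


comparable = [p for p in participants if at_baseline(p)]
control_mean = mean_score(comparable, "6h")   # control group
varied_mean = mean_score(comparable, ">6h")   # only sleep differs
print(control_mean, varied_mean)  # 66 76
```

Stress levels and attention are not recorded in this toy data, so they remain uncontrolled here. A real study would need to handle them as well.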

Independent and Dependent Variable

For this dataset, the memory score (the dependent variable) is plotted against the independent variable (hours slept). From this scatterplot, we can see that the memory score has a positive correlation with the number of hours slept. However, it is important to note that correlation does not equal causation!

From the example below we can see that at a temporal memory score of 70, there are 2 data points with different hours slept (6 hrs vs 8 hrs). This suggests that factors other than the predictor/independent variable shown here may be at play. Hence we cannot safely conclude that the number of hours slept causes the memory score to increase in an individual. Showing that would require a controlled experiment in which the extraneous factors are held constant, so that any increase in memory score can be attributed to this particular variable.
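
To make the correlation concrete, here is a small sketch that computes Pearson's correlation coefficient by hand. The data points are hypothetical and are not the dataset from the scatterplot. They are only chosen so that two participants share a score of 70 with 6 and 8 hours of sleep.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: independent variable first, dependent variable second.
hours_slept = [4, 5, 6, 6, 7, 8, 8, 9]
memory_scores = [52, 58, 64, 70, 72, 70, 80, 85]

r = pearson_r(hours_slept, memory_scores)
print(r)  # strongly positive for this made-up data

# Same score, different hours slept: something else must differ too.
same_score = [h for h, s in zip(hours_slept, memory_scores) if s == 70]
print(same_score)  # [6, 8]
```

A high r only describes how closely the two variables move together. It does not say which one drives the other, or whether a third factor drives both.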

Observational Study vs Controlled Experiment

Showing relationship -> Observational Study / Survey

In an observational study, we measure or survey members of a sample without trying to affect them.

Showing causation -> Controlled experiment

In a controlled experiment, we assign people or things to groups and apply some treatment to one of the groups while the other group does not receive the treatment.
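
A minimal sketch of the group assignment step is shown below. It assumes participants are assigned at random, which is a common way to spread extraneous factors evenly across groups, though the text above does not prescribe it. The function name and the participant IDs are hypothetical.

```python
import random


def assign_groups(participant_ids, seed=None):
    """Randomly split participants into a treatment and a control group."""
    rng = random.Random(seed)  # seed only makes the split reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Twenty hypothetical participants, identified by number.
groups = assign_groups(range(1, 21), seed=42)
print(len(groups["treatment"]), len(groups["control"]))  # 10 10
```

The treatment group would then receive the treatment (for example, more hours of sleep), while the control group would not, and the memory scores of the two groups would be compared.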


Friday, January 25, 2019

