---
output: html_document
---

# Introduction {#ch1}

> "So never lose an opportunity of urging a practical beginning, however small, for it is wonderful how often in such matters the mustard-seed germinates and roots itself."
>
> --- Florence Nightingale


<div class="objective-container">
<div class="objectives"> Learning objectives </div>
1. Describe the field of statistics and its importance in scientific inquiry
2. Learn the scientific method, its purpose, and its connection to statistics
3. Understand the concepts in the general statistical framework
</div>

## Statistics and Evidence-based Research {#ch1_s1}

There are many definitions of statistics:

* the science of learning from experiences
* the study of collecting, analyzing, and interpreting data
* the mathematics of uncertainty
* a required college course designed to make everyone miserable

All of these definitions hold a degree of truth (some more than others -- we'll let you
guess which `r emo::ji("wink")`). Statistics is a highly versatile field, with methods
applicable to and used by a wide variety of disciplines. The importance of the
field of statistics lies in its ability to quantify our uncertainty about a
potential outcome -- such as the most likely cause and treatment for a headache.
Almost everything in life is associated with some uncertainty -- the weather,
your career, the connection between your genetics and your response to certain
medications; however, just because something isn't known with 100% certainty
doesn't mean we can't understand it better. Statistics helps us understand the
world better by allowing us to quantify our uncertainty, describe associations
between phenomena, and make informed decisions.

And how exactly do researchers use statistics to quantify their uncertainty?
Well, this is achieved by collecting data, or evidence, about whatever process
is of interest. For example, if you want to know how likely it is to rain
tomorrow, you might look at the weather in your location the past few days.
Since rain can travel, you might also look at the recent weather in the surrounding
areas. You could also evaluate historical weather records to determine if
this season is known for being particularly rainy or dry in the past. These are
all various examples of data you can collect to inform your weather prediction.
In scientific research, data is obtained and investigated through statistical
inquiry to help answer important questions:

* Which drug should a doctor prescribe to treat an illness?
* Can we predict the longevity of the national population to inform government decisions regarding Social Security?
* What factors increase the risk of an individual developing coronary heart disease?

These questions are too important to be left to opinion, superstition, or
conjecture. Consequently, there has been a tremendous push for objective,
**evidence-based** decision making in medicine, public health, and policy making.
Statistics and biostatistics are the sciences that allow us to make these difficult
decisions. When studying humans, we can't control every aspect of their life,
such as what they eat or where they work. There are ethical considerations - if
we wanted to research the effects of smoking on lung cancer, we would not be
able to force people to smoke (or not). Humans are also incredibly diverse and
variable, not to mention expensive to perform research upon. Despite these
issues, there is a moral imperative to make decisions on potentially life-saving
therapies as fast as possible. To make evidence-based conclusions, we need to
collect, process, analyze, and interpret data in order to draw conclusions in an
objective manner.

---

<div class="definition-container">
<div class="definition"> &nbsp; </div>
<div class="text">
**Evidence-based practice:** <em> Using sound research findings based on observed or collected data to make decisions </em>
</div>
</div>


## Scientific Method {#ch1_s2}

In principle, people collect and process information every day of their lives.
Since it's something we do frequently, you might think we would be really good
at it...but we're not. Unfortunately, humans are not natural statisticians. We
are not good at picking out patterns from a sea of noisy data. And, on the flip
side, we are *too good* at picking out non-existent patterns from small numbers
of observations. We also find it difficult to sort out the effects of multiple
factors occurring simultaneously and we are subject to all sorts of biases
depending on our personalities, emotions, and past experiences.

In order to mitigate any of our own personal biases when answering important
questions about the way the world works (i.e., to do good science), we must be
careful to be rigorous in the way we proceed. The **scientific method** is the
process used to answer scientific questions.

1) Ask a question
2) Construct a hypothesis
3) Test your hypothesis with a study or an experiment
4) Analyze data and draw conclusions
5) Communicate results

As an example of the scientific method, consider that you would like to start a
garden, but you are not sure which type of fertilizer is best. If there are three
types of fertilizer you cannot decide between, you might proceed through the
scientific method as follows:

1) Ask a question

Of fertilizers A, B, and C, which will yield the fastest plant growth in my garden?

2) Construct a hypothesis

You may have read some reviews for each type of fertilizer online. Fertilizer A
has 4.5 stars, fertilizer B has 4.2 stars, and fertilizer C has 4.6 stars. Thus,
you may hypothesize that fertilizer C leads to the fastest plant growth, followed
by fertilizer A, and that fertilizer B has the slowest plant growth.

3) Test your hypothesis with a study or an experiment

To test your hypothesis, you can section off three areas of your garden. In each
area, you use one of the three fertilizers and plant the same number and type of
plants. In order for your experiment to be fair, you will want to make sure each
of the three areas is equal-size, gets the same amount of sunlight, is watered
the same amount, etc.

4) Analyze data and draw conclusions

After 5 weeks, you measure the height of each plant in each of the three
sections. Taking the average plant height from each section of your garden, you
find that the section treated with fertilizer A has the tallest average plant
height, and thus you conclude that fertilizer A is the best for your garden.

5) Communicate results

Other gardeners in your area may be interested in your results. You may share
with them your recommendation for fertilizer A, which is backed by a
well-controlled experiment.
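
The analysis in step 4 can be sketched in a few lines of R. This is only an illustration: the `garden` data frame and every height in it are invented for this sketch (four plants per section, heights in centimeters), chosen so that fertilizer A comes out ahead as in the story above.

```r
# Hypothetical plant heights (cm) after 5 weeks. These numbers are invented
# for illustration and are not data from a real experiment.
garden <- data.frame(
  fertilizer = rep(c("A", "B", "C"), each = 4),
  height_cm  = c(31, 29, 33, 30,   # section treated with fertilizer A
                 27, 28, 26, 29,   # section treated with fertilizer B
                 30, 28, 29, 31)   # section treated with fertilizer C
)

# Average plant height within each section of the garden
avg_height <- tapply(garden$height_cm, garden$fertilizer, mean)
print(avg_height)

# The fertilizer whose section has the tallest average plant height
best <- names(which.max(avg_height))
print(best)
```

With these made-up heights, the section averages are 30.75 cm for A, 27.5 cm for B, and 29.5 cm for C, so `best` is `"A"`.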

Even in this relatively simple example, you can see how much care is required to
have a valid experiment. There are many other factors that may also influence
your decision on the best fertilizer, such as the price and availability. It's
also possible that fertilizer A ended up with the fastest average plant growth
just due to random chance, and if you repeated the entire experiment again you
might find fertilizer B or C to result in taller plants.
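
To see how random chance alone can pick a "winner", here is a minimal simulation sketch. It assumes, purely for illustration, that all three fertilizers are truly identical, with plant heights varying around 30 cm with a standard deviation of 3 cm. The function `run_experiment()` and all of these numbers are hypothetical.

```r
set.seed(1)  # makes the simulated results reproducible

# Assumption for illustration: the three fertilizers are truly identical,
# so any difference in average height is due to chance alone.
run_experiment <- function(plants_per_section = 4) {
  avg_height <- c(
    A = mean(rnorm(plants_per_section, mean = 30, sd = 3)),
    B = mean(rnorm(plants_per_section, mean = 30, sd = 3)),
    C = mean(rnorm(plants_per_section, mean = 30, sd = 3))
  )
  names(which.max(avg_height))  # which fertilizer "won" this time
}

# Repeat the whole experiment 1000 times and tabulate the winners
winners <- replicate(1000, run_experiment())
win_share <- table(factor(winners, levels = c("A", "B", "C"))) / 1000
print(win_share)
```

Under these assumptions, each fertilizer should come out on top in roughly a third of the repeated experiments, even though none of them is actually better. A single small experiment therefore cannot rule out that the observed winner was simply lucky.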

---

<div class="definition-container">
<div class="definition"> &nbsp; </div>
<div class="text">
**Scientific method:** <em> Five steps that can be used to acquire knowledge
  without incorporating personal opinion </em>
</div>
</div>


## Statistical framework {#ch1_s3}

As we saw in the fertilizer example, collecting and analyzing data can be complicated.
Statistics helps us design studies, test hypotheses, and use data to make scientifically
valid conclusions about the world. In general, scientists use the scientific method
to make generalizations about classes of people on the basis of their studies.
The class of people that they are trying to make generalizations about is called
the **population**. Most of the time, it is impractical and expensive to study
all individuals in a population - although many governments do the best they
can every 10 years. Instead of studying everyone in the population, or taking a
**census**, typically we study only a portion of the population called the
**sample**. In order to determine how best to obtain a sample to answer the
research questions, we must be cautious about the **study design**. Then,
researchers make generalizations, or **inference**, about the entire population
based on studying the sample. We can visualize the statistical framework using
this diagram:

```{tikz statFramework, fig.cap = "Statistical Framework", fig.ext = 'png', echo = F, fig.align='center'}
\usetikzlibrary{decorations.pathreplacing,positioning, arrows, shapes, calc,shapes.multipart}
\tikzstyle{block1} = [rectangle, draw, fill=yellow!20,
    text width=10em, text centered, rounded corners, minimum height=6em]
\tikzstyle{block2} = [rectangle, draw, fill=yellow!20,
    text width=5em, text centered, rounded corners, minimum height=3em]
\tikzset{
    %Define standard arrow tip
    >=stealth,
    % Define arrow style
    pil/.style={
           ->,
           thick,
           shorten <=2pt,
           shorten >=2pt,}
}
\tikzstyle{line} = [draw, -latex]
\begin{tikzpicture}[node distance = 3cm, auto]
            % Place nodes
            \node [block1] (pop) {Population \\ (Parameters)};
            \node [block2, below of=pop] (samp) {Sample \\ (Statistics)};

            % Draw edges
            \draw[->, >=latex, shorten >=2pt, shorten <=2pt, bend right=45, thick]  (pop.west) to node[auto, swap] {Study Design}(samp.west);
            \draw[->, >=latex, shorten >=2pt, shorten <=2pt, bend right=45, thick] (samp.east) to node[auto, swap] {Inference}(pop.east);

        \end{tikzpicture}
```

Throughout this book, we will dive deeper into each aspect of the statistical
framework. We consider two broad categories of statistical
analysis: **descriptive statistics**, which deals with methods for summarizing
and/or presenting data, and **inferential statistics**, which deals with methods
for making generalizations about a population using information contained in the
sample.
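
The pieces of the framework can be made concrete with a small simulated example. The sketch below is not drawn from real data: it assumes a hypothetical population of 10,000 individuals whose heights (in cm) are generated by the computer, and it uses a simple random sample of 100 as the study design.

```r
set.seed(42)  # makes the simulated results reproducible

# Hypothetical population: simulated heights (cm) of 10,000 individuals
population <- rnorm(10000, mean = 170, sd = 10)

# Parameter: a number describing the entire population. Only a census,
# which measures every individual, would reveal it exactly.
parameter <- mean(population)

# Study design: here, a simple random sample of 100 individuals
study_sample <- sample(population, size = 100)

# Descriptive statistics: summarize the data in the sample
print(summary(study_sample))

# Statistic: a number computed from the sample. Inference uses it to
# generalize to the population, here as an estimate of the parameter.
statistic <- mean(study_sample)
print(c(parameter = parameter, statistic = statistic))
```

In practice the population values are not available, which is exactly why the statistic computed from the sample is used to learn about the unknown parameter.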

---

<div class="definition-container">
<div class="definition"> &nbsp; </div>
<div class="text">
**Population:** <em> The group towards which scientists attempt to generalize </em>

**Census:** <em> A study which includes all members of the population </em>

**Sample:** <em> A portion of the population which is part of the study </em>

**Study design:** <em> The process of obtaining the best sample to answer the research questions </em>

**Inference:** <em> The process of making generalizations about the population </em>

**Descriptive statistics:** <em> Methods for summarizing and/or presenting data </em>

**Inferential statistics:** <em> Methods for making generalizations about a population
  using information contained in the sample </em>
</div>
</div>