Correlation Between GMO products and Diseases.

Recently, the issue of GMO products has been a controversial topic which is broadly discussed, especially in China.

GMOs, or “genetically modified organisms,” are plants or animals created through the gene splicing techniques of biotechnology (also called genetic engineering, or GE).

As many people know, Cui Yongyuan, a famous former host of China Central Television(CCTV), has traveled to US for the purpose of investigating the situation of GMO products. He has been visiting authority of the Genetic Engineering in the universities, former professors, doctors, and even owners of farms, citizens, trying to find out if GMO products are harmful to human-beings scientifically, and what sort of opinions the US citizens are possessing about GMO products.

In the 68 minutes video, a former navy professor, named Dr.Nancy Swanson, who took a few important chronic diseases showing either the incidence or deaths over time. Her research had given us the evidence of the relation between either glyphosate application or the amount of GE products, and the usual chronic diseases such as stroke, obesity, diabetes, etc, in US.

In Dr.Nancy Swanson’s research, she mentioned “Correlation” to describe it. I would like to talk a little bit about “Correlation”.

Generally, correlation tells us how well two sets of data are linked together. If the change in one variable is accompanied by a change in the other, then the variables are said to be correlated. In this article, it is not so difficult to find that the “two sets of data” would be the amount of GE products, and the incidence of diseases.

The theoretical covariance is the limiting value of the observed covariance of a frequency distribution defined as:

$\displaystyle covariance = \frac{\displaystyle \sum\limits_{i=1}^n \left(x_i-\overline{x}\right)\left(y_i-\overline{y}\right)}{\displaystyle n}$

In here, let $\displaystyle x_i$ be the variable for the amount of GE products, and $\displaystyle y_i$ be the incidence of diseases.

But depending on the measurement unit, the numbers of the variable could be fluctuating(bigger or smaller), and that makes covariance a futility to understand the relationship between two variables. In statistic, correlation is measured by what is called coefficient of correlation, which is defined as the covariance divided by the product of the standard deviations of the two variables:

$\displaystyle r = \frac{\displaystyle \sum\limits_{i=1}^n \left(x_i-\overline{x}\right)\left(y_i-\overline{y}\right)/{n}}{\displaystyle \sqrt{\displaystyle \frac{\sum\limits_{i=1}^n \left(x_i-\overline{x}\right)^{2}}{n} \frac{\sum\limits_{i=1}^n \left(y_i-\overline{y}\right)^{2}}{n}}}\\\\= \frac{\displaystyle \sum\limits_{i=1}^n \left(x_i-\overline{x}\right)\left(y_i-\overline{y}\right)}{\displaystyle \sqrt{\sum\limits_{i=1}^n \left(x_i-\overline{x}\right)^{2} \sum\limits_{i=1}^n \left(y_i-\overline{y}\right)^{2}}}$

Here is just a sample from Dr.Nancy Swanson, showing annual data of “GE soy & corn crops” and “Annual Incidence of Diabetes(per 1000)” of the latest decade.

So you can calculate the $\displaystyle r$ value for a try and find out if it is approaching to 1.

As the number of observations increases the observed correlation coefficient, $\displaystyle r$, tends towards its theoretical counterpart, $\displaystyle \rho$, defined as:

$\displaystyle \rho = \displaystyle \frac{\displaystyle Cov \left(X, Y \right)}{\displaystyle \sqrt{V \left(X \right) V \left(Y \right)}}$

And Dr.Nancy Swanson mentioned that, all the numbers of these correlations(the relations between the glyphosate application or the amount of GE products, and the cancer incidence, these years) are greater than 0.9. And a correlation of 1 is perfect.

Actually, correlation coefficient is bounded between -1 and 1. Here is the proof of it:

Let $\displaystyle \mu = E \left(X \right)$, $\displaystyle \nu = E \left(Y \right)$,

$\displaystyle Cov \left(X, Y \right) = E \left[\left(X- \mu \right) \left(Y- \nu \right) \right]$,

$\displaystyle \Bigl| Cov \left(X, Y \right) \Bigr|^{2} = \Bigl| E \left[\left(X- \mu \right) \left(Y- \nu \right) \right] \Bigr|^{2}$

By the Cauchy–Schwarz inequality,

$\displaystyle \Bigl| E \left(XY \right) \Bigr|^{2} \leq E \left(X^{2} \right)E \left(Y^{2} \right)$,

So we have

$\displaystyle \Bigl| Cov \left(X, Y \right) \Bigr|^{2} \\\\ = \Bigl| E \left[\left(X- \mu \right) \left(Y- \nu \right) \right] \Bigr|^{2} \\\\ \leq E \left[\left(X-\mu \right)^{2} \right] E \left[\left(Y-\nu \right)^{2} \right]\\\\ = V \left(X \right) V \left(Y \right)$

Therefore,

$\displaystyle -1 \leq \rho \leq 1$

Closer the coefficients are to +1.0 and -1.0, greater is the strength of the relationship between the variables.

Believe it or not. According to the data, it is said that the correlations between GE products and terrible diseases are greater than 0.9, which is nearly perfect.

Do you want to buy GE foods any more?

The opportunity to meet the pretty girl in the smoking room.

I am a moderate smoker. I find a pretty girl working in the same company, smokes sporadically in the same smoking room.

A casual thought came into my mind. Say, just like a mathematical puzzle. If both the girl and I have completely the same habit in smoking, for example, let it to be more mathematically: In eight hours of a working day, the girl totally smokes 6 cigarettes, each for 10 minutes, and only one cigarette for each time. Suppose she enters the smoking room at random timing absolutely. And all above is the same to me. Then what is the probability that I can meet the pretty girl at one time when I enter the smoking room?

In probability theory, if a kind of event happens sporadically during a period of time or in some particular space stochastically, we identify the happening of that kind of event as “Poisson distribution”, which is discrete.

On the other hand, the probability of the next Poissonian event happens in the following time $\displaystyle t$ is denoted by “Exponential distribution”.

$\displaystyle f(x)=\lambda e^{-\lambda x} \hspace{6mm} (x>0)$

Accordingly, in this story, the girl and I both have the same randomness, with no mutual interference. The probability of meeting the girl is exactly congruent to the probability of the next poissonian event happens in the following time $\displaystyle t$.

$\displaystyle P(0 \leqslant X \leqslant t) = \int_{0}^{t}\lambda e^{-\lambda x}dx$

$\displaystyle t=10 minutes$,
$\displaystyle \lambda=\frac{6}{8*60}=\frac{1}{80}$

Integrate it:

$\displaystyle \int_{0}^{10}\frac{1}{80}e^{-\frac{1}{80} x}dx \\\\ = 1 - e^{-\frac{1}{8}} \\\\ \approx 0.1175$

Proof of Planck Law(in energy density formula) with statistical mechanics

Max Planck, the founder of quantum theory, who discovered it in 1900, and who coined the term “Quantum”.

Max Planck

Just before the dawn of Quantum Mechanics, an idealized physical conception called “black body” had been used to research the relationship between temperature and radiation spectrum.

In thermal equilibrium (at a constant temperature) in the wall of a black body, the classical expression of radiation energy which is also dubbed “Rayleigh-Jeans Law” is:

$\displaystyle U(\nu) = \frac{8\pi \nu^2}{c^3}kT$

$\displaystyle U(\nu)$: spectral intensity.
$\displaystyle \frac{8\pi \nu^2}{c^3}$: radiation modes per unit frequency per unit volume.
$\displaystyle c$: speed of light.
$\displaystyle kT$: average energy per mode.

But in quantum expression, it is described by Planck as follows:

$\displaystyle U(\nu) = \frac{8\pi}{c^3} \frac{h\nu^3}{\exp \left(\frac{h\nu}{kT} \right) - 1}$

The difference from “Rayleigh-Jeans Law” equation is the factor of $\displaystyle \frac{h\nu}{\exp \left(\frac{h\nu}{kT} \right) - 1}$

Rayleigh-Jeans Law and Planck Law curve

Planck Law can be proved considering both quantum mechanically and statistical mechanically.

In statistical mechanics, given a constant temperature $\displaystyle T$, in thermal equilibrium environment, the probability of energy($\displaystyle E_i$) of a given status($\displaystyle i$) is:

$\displaystyle P(E_i) = \frac{\exp \left(-\beta E_i \right)}{\sum \limits_{n=0}^{\infty} \exp \left(-\beta E_n \right)} \hspace{6mm} (n=0,1,2,...) \hspace{6mm} \textcircled{1}$

$\displaystyle \beta=\frac{1}{kT}$, called Boltzmann factor.

Note : The name quantum mechanics derives from the observation that some physical quantities can change only in discrete amounts, but not in a continuous way.

According to the energy $\displaystyle E_n = nh\nu$, the denominator of $\displaystyle \textcircled{1}$ is:

$\displaystyle Z = \sum \limits_{n=0}^{\infty} \exp \left(-\beta E_n \right) \\\\= 1 + e^{-\beta h\nu} + e^{-2\beta h\nu} + ... \\\\=\frac{1}{1-\exp \left(-\beta h\nu \right)} \hspace{6mm} \textcircled{2}$

Then let us think of the expectation value(mean value) of $\displaystyle E_i$, just in an usual consideration of probability:

$\displaystyle \langle E \rangle = \sum \limits_{i=0}^{\infty} E_i P(E_i) \\\\= \frac{\sum \limits_{i=0}^{\infty} E_i \exp \left(-\beta E_i \right)}{\sum \limits_{n=0}^{\infty} \exp \left(-\beta E_n \right)} \\\\=\frac{-\frac{\partial}{\partial \beta} \sum \limits_{i=0}^{\infty} \exp \left(-\beta E_i \right)}{\sum \limits_{n=0}^{\infty} \exp \left(-\beta E_n \right)} \\\\=-\frac{\partial}{\partial \beta}\ln \left(\sum \limits_{n=0}^{\infty} \exp \left(-\beta E_n \right) \right) \\\\=-\frac{\partial}{\partial \beta} \ln Z \hspace{6mm} \textcircled{3}$

Put $\displaystyle Z$ which we got in $\displaystyle \textcircled{2}$ into $\displaystyle \textcircled{3}$, do $\displaystyle \beta$ partial derivative again:

$\displaystyle \langle E \rangle = \frac{h\nu}{\exp \left(\frac{h\nu}{kT} \right)-1}$

Then times radiation modes per unit frequency per unit volume, which is $\displaystyle \frac{8\pi \nu^2}{c^3}$, we get Planck Law thus:

$\displaystyle U(\nu) = \frac{8\pi}{c^3} \frac{h\nu^3}{\exp \left(\frac{h\nu}{kT} \right) - 1}$

Proof of Chebyshev’s Inequality

Chebyshev’s Inequality is an important tool in probability theory. And it is a theoretical basis to prove the weak law of large numbers.

The theorem is named after Pafnuty Chebyshev, who is one of the greatest mathematician of Russia.

Пафну́тий Льво́вич Чебышёв

It is described as follows:

For a random variable $\displaystyle X$, has mathematical expectation $\displaystyle \mu=E(X)$, and variance $\displaystyle \sigma^2=V(X)$,

$\displaystyle P\left(\mid X-\mu \mid \geqslant k\sigma \right) \leqslant \frac{1}{k^2}$

The fabulous thing is that, Chebyshev’s Inequality works only by knowing the mathematical expectation and variance, whatever the distribution is(no matter the distribution is discrete or continuous).

Here is the proof of Chebyshev’s Inequality.

1. In the case of a discrete random variable $\displaystyle X$, the probability density function is $\displaystyle f(x)$,

$\displaystyle \sigma^2 = \sum\limits_{i=1} (x_i-\mu)^2f(x_i) \hspace{8mm} \textcircled 1$

For those $\displaystyle x_j$ in the domain of $\displaystyle \mid X-\mu \mid \geqslant k\sigma$, $\displaystyle (x_j-\mu)^2 \geqslant k^2\sigma^2$. So,

$\displaystyle \textcircled 1 \geqslant \sum\limits_{j, \mid X-\mu \mid \geqslant k\sigma} (x_j-\mu)^2f(x_j) \\\\ \geqslant \sum\limits_{j, \mid X-\mu \mid \geqslant k\sigma} k^2\sigma^2f(x_j) \\\\ = k^2\sigma^2\sum\limits_{j, \mid X-\mu \mid \geqslant k\sigma} f(x_j) \\\\ = k^2\sigma^2 P \left(\mid X-\mu \mid \geqslant k\sigma \right)$

Then we can get the inequality $\displaystyle P\left(\mid X-\mu \mid \geqslant k\sigma \right) \leqslant \frac{1}{k^2}.$

2. In the case of a continuous random variable $\displaystyle X$,

$\displaystyle \sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2f(x)dx \\\\ = \int_{\mid X-\mu \mid \geqslant k\sigma} (x-\mu)^2f(x)dx + \int_{\mid X-\mu \mid < k\sigma} (x-\mu)^2f(x)dx \\\\ \geqslant \int_{\mid X-\mu \mid \geqslant k\sigma} (x-\mu)^2f(x)dx \hspace{8mm} \textcircled 2$

Just like discrete distribution discussed, for those $\displaystyle x$ in the domain of $\displaystyle \mid X-\mu \mid \geqslant k\sigma$, $\displaystyle (x-\mu)^2 \geqslant k^2\sigma^2$. So,

$\displaystyle \textcircled 2 \geqslant \int_{\mid X-\mu \mid \geqslant k\sigma} k^2\sigma^2f(x)dx \\\\ = k^2\sigma^2 \int_{\mid X-\mu \mid \geqslant k\sigma} f(x)dx \\\\ = k^2\sigma^2 P \left(\mid X-\mu \mid \geqslant k\sigma \right)$

And we can get the same inequality $\displaystyle P\left(\mid X-\mu \mid \geqslant k\sigma \right) \leqslant \frac{1}{k^2}.$

Is it possible to pass a test all by chance?

Suppose you are taking a test of 100 questions, each of which has 4 options and values 1 point, and there is only 1 option is the correct answer. The pass line is set to 60 points. Now, you are holding a lottery of A, B, C, D to guess the correct answers all by chance.

Is it possible to pass the test all by chance?
Let’s think of it by serious mathematics.

First, because you try to choose a correct answer for each question, the possibility you succeed is $\displaystyle p=\frac{1}{4}$. And there is no doubt that the choosing action is a Bernoulli Experiment. So it is adapted to Binomial Distribution. Denote the random variable for a successful choice with $\displaystyle X$.

$\displaystyle X \sim B(100,\frac{1}{4})$

The probability we are chasing is:

$\displaystyle P(X \geqslant 60)=?$

In general, $\displaystyle k$ successes in $\displaystyle n$ choices,

$\displaystyle P(X=k)\\\\=\binom{n}{k}p^k(1-p)^{n-k} \hspace{6mm} (k=0,1,2,..., n)$

And, the cumulative distribution is:

$\displaystyle \sum\limits_{k=0}^n \binom{n}{k}p^k(1-p)^{n-k}=1$

So in our problem here,

$\displaystyle P(X \geqslant 60)=\sum\limits_{k=60}^{100} \binom{100}{k}(\frac{1}{4})^k(\frac{3}{4})^{100-k}$

Obviously, it is pretty hard to calculate this in that form.
But when $\displaystyle n$ is big enough, we can transform the Binomial Distribution to Normal Distribution, by the famous “de Moivre–Laplace theorem“. So approximately,

$\displaystyle X \sim N(np, np(1-p)).$

By the way, a Normal Distribution density function is:

$\displaystyle f(x)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left({-\frac{(x-\mu)^2}{2\sigma^2}}\right).$

$\displaystyle F(X \geqslant x)=\int^{\infty}_x \frac{1}{\sqrt{2\pi}\sigma}\exp\left({-\frac{(x-\mu)^2}{2\sigma^2}}\right)dx.$

And approximately, what we should do in our problem is, get the result of:

$\displaystyle F(X \geqslant 60)\\\\=\int^{\infty}_{60} \frac{1}{\sqrt{2\pi}\sqrt{18.75}}\exp\left({-\frac{(x-25)^2}{2*18.75}}\right)dx$

In order to look up in a Normal Distribution table, transform it to the Standard Normal Distribution format.

$\displaystyle Z=\frac{X-\mu}{\sigma}=\frac{X-25}{\sqrt{18.75}}$

Finally, use Standard Normal Distribution Table to find the result of this:

$\displaystyle P(X \geqslant 60)\\\\=1-\Phi(Z \leqslant \frac{60-25}{\sqrt{18.75}})\\\\ \approx 1-\Phi(Z \leqslant 8.0829)\\\\ \approx 0$

.
Ok, you see it, it is wise to study hard and do not play a lottery in your test. It does not help anyway.

Decision Tree and Entropy algorithm.

This is a study note of “Data Mining-Concepts and Techniques”, Jiawei Han.

In data mining and machine learning, there is a traditional methodology called Decision Tree.

Here is an extremely simple decision tree sample.
it shows how to predict whether a customer will buy a given kind of computer:

Decision Tree is a flowchart-like tree structure. The parts and their names are like this:
1.Node
* Root node(Top of a decision tree)
* Non-leaf node(an internal node denotes a test on an attribute)
* Leaf node(the terminal node which holds a class label)
2.Branch(an outcome of the test)

What is a decision tree for, in simple words, is to predict.

For ID3 solution,
Step 1, procure the class-labeled training tuples from database.

Step 2, get the expected information(also known as entropy) needed to classify a tuple in D.

$\displaystyle Entropy(D) = Info(D) = -\sum\limits_{i=1}^n p_i\log_2 p_i$

(The word entropy, which originally presents a measure of molecular randomness or disorder, is from the second law of thermodynamics, which means that any spontaneous process increases the disorder of the universe.)

Step 3, calculate the information(entropy) on each given attribute. For example on attribute A:

$\displaystyle Entropy(A) = Info_A (D) = \sum\limits_{j=1}^m \frac{|D_j|}{|D|} *Info(D_j)$

Step 4, get the “Information gain” on each attribute, which is used as attribute selection measure. For example the information gain on attribute A:

$\displaystyle Gain(A) = Info(D) - Info_A (D)$

Step 5, find those attributes with higher information gain, and make them the splitting nodes of the decision tree. Another way of saying that is, the highest information gain reduces unpredictability the most.

But when we think about some intemperate cases, for example, a unique attribute such as customer ID, each ID should be corresponding to one single leaf node so,

$\displaystyle Info_{CustomerID}(D) = 0$

The information gain can be maximal but that makes no sense to choose customer ID as a splitting node.

Thus, a C4.5 solution, which is an successor of ID3 is imported. C4.5 uses an extension to information gain known as gain ratio, which attempts to overcome this bias. It normalizes information gain with a “split information” value:

$\displaystyle SplitInfo_A(D) = -\sum\limits_{j=1}^m \frac{|D_j|}{|D|} *\log_2 \left(\frac{|D_j|}{|D|}\right)$

For each outcome, it regards to the number of tuples having that outcome with respect to the total number of tuples in D, which is a consideration of “ratio”. And the gain ratio is defined as

$\displaystyle GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}$

The attribute with the maximum gain ratio is selected as the splitting attribute.
Note, when $\displaystyle SplitInfo_A(D) \rightarrow 0$, the gain ratio turns to be unstable. A constraint is added to avoid this, whereby the information gain of the test selected must be large(at least as great as the average gain over all tests examined).

“Memorylessness” property of some distributions.

Suppose that you are gambling, for example, a simple poker game. You have failed for $\displaystyle s$ times, what is the probability to fail more than $\displaystyle s+t$ times till your first win? You may feel that after so many failures, luck has been accumulated the success is more and more closer. But in fact it is not what it likes.

There are several kinds of probability distribution in the nature, with the property we call “memorylessness”.

In discrete distributions, for a geometric random variable where we count the number of failures till the first success on a sequence of independent Bernoulli test,

$\displaystyle P(X=k)=p(1-p)^k \hspace{6mm} (k=0,1,2,...)$

so the cumulative function is

$\displaystyle P(X>s)\\\\=1-P(X \leqslant s)\\\\=1-\sum\limits_{k=0}^s p(1-p)^s\\\\=(1-p)^{s+1}$

Then we can find,

$\displaystyle P(X>s+t \mid X \geqslant s)\\\\=\frac{P(X>s+t \cap X \geqslant s)}{P(X \geqslant s)}\\\\=(1-p)^{t+1}\\\\=P(X>t)$

That means, knowing that we’ve already seen $\displaystyle s$ failures doesn’t change the likelihood that it will take at least $\displaystyle t$ more trials till the first success. Another way of saying, the trial number till success from now on does not pertain to the trials that has already been failed.

And also in continuous distributions, for an exponential random variable $\displaystyle X \sim Exp(\lambda)$, we have

$\displaystyle f(x)=\lambda e^{-\lambda x} \hspace{6mm} (x>0)$

and for $\displaystyle s>0, t>0$, we can also find the “memorylessness” property:

$\displaystyle P(X>s+t\mid X>s) \\\\= \frac{P(X>s+t\cap X>s)}{P(X>s)} \\\\= \frac{P(X>s+t)}{P(X>s)} \\\\ = \frac{\displaystyle \int_{s+t}^{\infty}\lambda e^{-\lambda x}dx}{\displaystyle \int_{s}^{\infty}\lambda e^{-\lambda x}dx} \\\\= e^{-\lambda t} \\\\= P(X>t)$