Recently, GMO products have become a controversial and widely discussed topic, especially in China.
GMOs, or “genetically modified organisms,” are plants or animals created through the gene splicing techniques of biotechnology (also called genetic engineering, or GE).
As many people know, Cui Yongyuan, a famous former host at China Central Television (CCTV), traveled to the US to investigate the situation of GMO products. He visited authorities on genetic engineering at universities, former professors, doctors, and even farm owners and ordinary citizens, trying to find out whether GMO products are scientifically shown to be harmful to human beings, and what opinions US citizens hold about GMO products.
In the 68-minute video, a former navy professor, Dr. Nancy Swanson, took a few important chronic diseases and showed either their incidence or their deaths over time. Her research presents evidence of a relation between either glyphosate application or the amount of GE products and common chronic diseases in the US such as stroke, obesity, and diabetes.
In her research, Dr. Nancy Swanson used the word “correlation” to describe this relation. I would like to talk a little about “correlation”.
Generally, correlation tells us how well two sets of data are linked together. If a change in one variable is accompanied by a change in the other, then the variables are said to be correlated. In this article, the “two sets of data” are the amount of GE products and the incidence of diseases.
The theoretical covariance is the limiting value of the observed covariance of a frequency distribution, and is defined as:
$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$$
Here, let $X$ be the variable for the amount of GE products, and $Y$ be the incidence of diseases.
However, the numerical values of a variable grow or shrink with the unit of measurement, which makes the raw covariance of little use for judging the strength of the relationship between two variables. In statistics, correlation is therefore measured by the correlation coefficient, which is defined as the covariance divided by the product of the standard deviations of the two variables. For $n$ observed pairs $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
Here is a sample from Dr. Nancy Swanson, showing annual data for “GE soy & corn crops” and “Annual Incidence of Diabetes (per 1000)” over the latest decade.
You can try calculating the correlation coefficient yourself and see whether it approaches 1.
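For anyone who wants to try the calculation, here is a minimal Python sketch of the correlation coefficient as described above (covariance divided by the product of the standard deviations). The two lists are made-up placeholder numbers chosen only to exercise the function. They are not Dr. Swanson's data, and the names `pearson_r`, `x_demo` and `y_demo` are my own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Observed correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Observed covariance and standard deviations (all divided by n).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)


# Placeholder values for illustration only (NOT real measurements).
x_demo = [10.0, 20.0, 30.0, 40.0, 50.0]
y_demo = [5.1, 5.9, 7.2, 7.8, 9.0]

r_demo = pearson_r(x_demo, y_demo)
print(f"r = {r_demo:.4f}")
```

Replacing the placeholder lists with the real yearly figures gives the value to compare against 1.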
As the number of observations increases, the observed correlation coefficient, $r$, tends towards its theoretical counterpart, $\rho$, defined as:
$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
Dr. Nancy Swanson mentioned that all of these correlations (between glyphosate application or the amount of GE products and cancer incidence in recent years) are greater than 0.9, where a correlation of 1 is perfect.
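In fact, the correlation coefficient $\rho$ is bounded between -1 and 1. Here is the proof:
Let $U = X - E[X]$, $V = Y - E[Y]$, so that $\mathrm{Cov}(X, Y) = E[UV]$, $\sigma_X^2 = E[U^2]$ and $\sigma_Y^2 = E[V^2]$.
By the Cauchy–Schwarz inequality,
$$\big(E[UV]\big)^2 \le E[U^2]\,E[V^2]$$
So we have
$$\rho^2 = \frac{\big(E[UV]\big)^2}{E[U^2]\,E[V^2]} \le 1, \quad \text{that is,} \quad -1 \le \rho \le 1$$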
Believe it or not, according to the data, the correlations between GE products and serious diseases are said to be greater than 0.9, which is nearly perfect.
Do you want to buy GE foods any more?
I am a moderate smoker. A pretty girl who works at the same company smokes occasionally in the same smoking room.
A casual thought came to my mind, rather like a mathematical puzzle. Suppose the girl and I have exactly the same smoking habits. To put it more mathematically: in the eight hours of a working day, the girl smokes 6 cigarettes in total, one cigarette per visit, each taking 10 minutes, and she enters the smoking room at completely random times. All of the above also applies to me. Then what is the probability that I meet the pretty girl on a given visit to the smoking room?
In probability theory, if a kind of event happens sporadically and at random during a period of time or in some region of space, the number of occurrences is modeled by the “Poisson distribution”, which is discrete.
On the other hand, the waiting time until the next Poisson event is described by the “Exponential distribution”.
Accordingly, in this story, the girl and I both behave with the same randomness and without influencing each other. The probability of meeting the girl is then equivalent to the probability that the next Poisson event happens within the following time interval.
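As a rough numerical sketch of this idea, the snippet below evaluates the exponential-distribution probability that the next event occurs within a given duration, $1 - e^{-\lambda t}$. The rate (6 visits per 480 minutes) and the window (the 10 minutes I spend in the room) are my own reading of the puzzle and should be treated as assumptions, not as the one true answer.

```python
from math import exp

# Assumed reading of the puzzle: 6 visits spread over an 8-hour day.
RATE_PER_MINUTE = 6 / (8 * 60)
# Assumed window: the 10 minutes one cigarette takes.
STAY_MINUTES = 10


def prob_event_within(rate, duration):
    """P(next Poisson event occurs within `duration`) = 1 - exp(-rate * duration)."""
    return 1.0 - exp(-rate * duration)


p_meet = prob_event_within(RATE_PER_MINUTE, STAY_MINUTES)
print(f"Probability she walks in during my cigarette: {p_meet:.4f}")
```

A different definition of "meeting" (for example, also counting the case where she is already inside when I arrive) would change the window and therefore the number.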
Max Planck founded quantum theory in 1900 and coined the term “Quantum”.
Just before the dawn of Quantum Mechanics, an idealized physical concept called the “black body” was used to study the relationship between temperature and the radiation spectrum.
In thermal equilibrium (at a constant temperature $T$) within the walls of a black body, the classical expression for the radiation energy, also known as the “Rayleigh-Jeans Law”, is:
$$I(\nu) = \frac{8\pi\nu^2}{c^3}\,kT$$
$I(\nu)$: spectral intensity.
$\frac{8\pi\nu^2}{c^3}$: radiation modes per unit frequency per unit volume.
$c$: speed of light.
$kT$: average energy per mode.
In the quantum description, however, Planck expressed it as follows:
$$I(\nu) = \frac{8\pi\nu^2}{c^3}\cdot\frac{h\nu}{e^{h\nu/kT} - 1}$$
The difference from the “Rayleigh-Jeans Law” equation is the factor of
$$\frac{h\nu/kT}{e^{h\nu/kT} - 1}$$
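Planck’s law can be derived by combining quantum mechanics with statistical mechanics.
In statistical mechanics, at a constant temperature $T$ in thermal equilibrium, the probability of the energy ($E_n$) of a given state ($n$) is:
$$P(E_n) = \frac{e^{-E_n/kT}}{\sum_{m} e^{-E_m/kT}}$$
$e^{-E_n/kT}$, called the Boltzmann factor.
With the quantized energy $E_n = nh\nu$ ($n = 0, 1, 2, \dots$), the denominator of $P(E_n)$ is:
$$Z = \sum_{n=0}^{\infty} e^{-nh\nu/kT} = \frac{1}{1 - e^{-h\nu/kT}}$$
Then consider the expectation value (mean value) of $E$, computed in the usual probabilistic way. Writing $\beta = 1/kT$:
$$\langle E \rangle = \sum_{n=0}^{\infty} E_n P(E_n) = \frac{1}{Z}\sum_{n=0}^{\infty} nh\nu\, e^{-\beta nh\nu} = -\frac{1}{Z}\frac{\partial Z}{\partial \beta} = -\frac{\partial \ln Z}{\partial \beta}$$
Substituting the $Z$ obtained above into $\langle E \rangle$ and taking the partial derivative again:
$$\langle E \rangle = \frac{\partial}{\partial \beta}\ln\left(1 - e^{-\beta h\nu}\right) = \frac{h\nu\, e^{-\beta h\nu}}{1 - e^{-\beta h\nu}} = \frac{h\nu}{e^{h\nu/kT} - 1}$$
Multiplying by the number of radiation modes per unit frequency per unit volume, which is $\frac{8\pi\nu^2}{c^3}$, we get Planck’s law:
$$I(\nu) = \frac{8\pi\nu^2}{c^3}\cdot\frac{h\nu}{e^{h\nu/kT} - 1}$$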
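The key step of this derivation, the mean energy per mode, can be checked numerically. The sketch below is my own illustration rather than part of the original derivation: it compares a truncated Boltzmann-weighted sum over the levels $E_n = nh\nu$ with the closed form $h\nu/(e^{h\nu/kT} - 1)$, and also shows that the closed form approaches the classical value $kT$ at low frequency. The function names, the truncation at 2000 levels, and the sample frequencies are assumptions made for the example.

```python
from math import exp, expm1

H = 6.62607015e-34   # Planck constant, J*s
K = 1.380649e-23     # Boltzmann constant, J/K


def mean_energy_sum(nu, temperature, n_levels=2000):
    """Mean energy per mode from a truncated sum over E_n = n*h*nu."""
    x = H * nu / (K * temperature)
    weights = [exp(-n * x) for n in range(n_levels)]  # Boltzmann factors
    z = sum(weights)                                  # partition function
    return sum(n * H * nu * w for n, w in enumerate(weights)) / z


def mean_energy_closed(nu, temperature):
    """Closed form h*nu / (exp(h*nu/kT) - 1)."""
    return H * nu / expm1(H * nu / (K * temperature))


T = 300.0
nu_high = 1.0e13   # Hz, where h*nu is comparable to kT
nu_low = 1.0e9     # Hz, where h*nu is much smaller than kT

print(mean_energy_sum(nu_high, T), mean_energy_closed(nu_high, T))
print(mean_energy_closed(nu_low, T) / (K * T))  # close to 1: classical limit
```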
Chebyshev’s Inequality is an important tool in probability theory, and it is the theoretical basis for proving the weak law of large numbers.
The theorem is named after Pafnuty Chebyshev, one of the greatest mathematicians of Russia.
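It is described as follows:
For a random variable $X$ with mathematical expectation $E(X) = \mu$ and variance $D(X) = \sigma^2$, for any $\varepsilon > 0$,
$$P(|X - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}$$
The remarkable thing is that Chebyshev’s Inequality needs only the mathematical expectation and the variance, whatever the distribution is (discrete or continuous).
Here is the proof of Chebyshev’s Inequality.
1. In the case of a discrete random variable $X$, the probability mass function is $P(X = x_k) = p_k$,
$$P(|X - \mu| \ge \varepsilon) = \sum_{|x_k - \mu| \ge \varepsilon} p_k$$
For those $x_k$ in the domain of $|x_k - \mu| \ge \varepsilon$, $\frac{(x_k - \mu)^2}{\varepsilon^2} \ge 1$. So,
$$\sum_{|x_k - \mu| \ge \varepsilon} p_k \le \sum_{|x_k - \mu| \ge \varepsilon} \frac{(x_k - \mu)^2}{\varepsilon^2}\,p_k \le \frac{1}{\varepsilon^2}\sum_{k} (x_k - \mu)^2\,p_k = \frac{\sigma^2}{\varepsilon^2}$$
Then we get the inequality
$$P(|X - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}$$
2. In the case of a continuous random variable $X$ with probability density function $f(x)$,
$$P(|X - \mu| \ge \varepsilon) = \int_{|x - \mu| \ge \varepsilon} f(x)\,dx$$
Just as in the discrete case, for those $x$ in the domain of $|x - \mu| \ge \varepsilon$, $\frac{(x - \mu)^2}{\varepsilon^2} \ge 1$. So,
$$\int_{|x - \mu| \ge \varepsilon} f(x)\,dx \le \int_{|x - \mu| \ge \varepsilon} \frac{(x - \mu)^2}{\varepsilon^2} f(x)\,dx \le \frac{1}{\varepsilon^2}\int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \frac{\sigma^2}{\varepsilon^2}$$
And we get the same inequality
$$P(|X - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}$$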
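To see the inequality at work, here is a small empirical check in Python. It is only a sketch: the exponential distribution, the sample size, the seed, and the threshold of 2 are arbitrary choices of mine. The fraction of samples at least that far from the mean is compared with the bound variance / threshold².

```python
import random


def chebyshev_check(samples, eps):
    """Return (observed tail fraction, Chebyshev bound) for a list of samples."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    tail = sum(1 for x in samples if abs(x - mu) >= eps) / n
    return tail, var / eps ** 2


rng = random.Random(0)                       # fixed seed for repeatability
samples = [rng.expovariate(1.0) for _ in range(100_000)]

tail, bound = chebyshev_check(samples, 2.0)
print(f"observed tail = {tail:.4f}, Chebyshev bound = {bound:.4f}")
```

Because the mean and variance are computed from the same samples, the inequality holds for this empirical distribution regardless of which distribution the samples were drawn from. The bound is usually far from tight.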
Suppose you are taking a test of 100 questions, each worth 1 point and each with 4 options, exactly 1 of which is correct. The pass line is set at 60 points. Now, you draw lots among A, B, C, and D to guess every answer purely by chance.
Is it possible to pass the test purely by chance?
Let’s think about it with serious mathematics.
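First, when you pick an answer for each question at random, the probability that you succeed is $p = \frac{1}{4}$. Each choice is a Bernoulli trial, so the number of correct answers follows a Binomial Distribution. Denote the number of successful choices by the random variable $X$.
The probability we are after is:
$$P(X \ge 60)$$
In general, for $k$ successes in $n$ choices,
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$
And the cumulative distribution is:
$$P(X \le k) = \sum_{i=0}^{k} \binom{n}{i} p^i (1 - p)^{n - i}$$
So in our problem here,
$$P(X \ge 60) = 1 - P(X \le 59) = 1 - \sum_{i=0}^{59} \binom{100}{i} \left(\frac{1}{4}\right)^i \left(\frac{3}{4}\right)^{100 - i}$$
Obviously, it is pretty hard to calculate this by hand in that form.
But when $n$ is big enough, we can approximate the Binomial Distribution by a Normal Distribution, using the famous “de Moivre–Laplace theorem”. So approximately,
$$X \sim N\big(np,\ np(1 - p)\big) = N(25,\ 18.75)$$
By the way, the density function of a Normal Distribution $N(\mu, \sigma^2)$ is:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
And approximately, what we need in our problem is the value of:
$$P(X \ge 60) = \int_{60}^{\infty} f(x)\,dx, \quad \mu = 25,\ \sigma = \sqrt{18.75} \approx 4.33$$
In order to look it up in a Normal Distribution table, transform it to the Standard Normal Distribution format.
$$P(X \ge 60) = 1 - \Phi\left(\frac{60 - 25}{\sqrt{18.75}}\right) \approx 1 - \Phi(8.08)$$
Finally, use the Standard Normal Distribution Table to find the result:
$$1 - \Phi(8.08) \approx 0$$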
OK, you see, it is wiser to study hard than to draw lots in your test. It does not help anyway.
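With a computer the exact binomial tail is no longer hard to evaluate, so the two routes can be compared directly. The following Python sketch is my own addition, using only the standard `math` module. It computes the exact probability of at least 60 correct guesses and the normal approximation without continuity correction, matching the hand calculation.

```python
from math import comb, erfc, sqrt

N_QUESTIONS = 100
P_CORRECT = 0.25
PASS_MARK = 60


def binomial_tail(n, p, k_min):
    """Exact P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))


def normal_tail(n, p, k_min):
    """Normal approximation of P(X >= k_min), no continuity correction."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    z = (k_min - mu) / sigma
    return 0.5 * erfc(z / sqrt(2))   # equals 1 - Phi(z)


exact = binomial_tail(N_QUESTIONS, P_CORRECT, PASS_MARK)
approx = normal_tail(N_QUESTIONS, P_CORRECT, PASS_MARK)
print(f"exact tail    = {exact:.3e}")
print(f"normal approx = {approx:.3e}")
```

Both numbers are vanishingly small. So far out in the tail the normal approximation is not accurate in relative terms, but the practical conclusion is the same.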
This is a study note on “Data Mining: Concepts and Techniques” by Jiawei Han.
In data mining and machine learning, there is a traditional methodology called the Decision Tree.
Here is an extremely simple decision tree sample. It shows how to predict whether a customer will buy a given kind of computer:
A Decision Tree is a flowchart-like tree structure. Its parts and their names are as follows:
1. Node
* Root node (the top of a decision tree)
* Non-leaf node (an internal node, which denotes a test on an attribute)
* Leaf node (a terminal node, which holds a class label)
2. Branch (an outcome of the test)
In simple words, a decision tree is used to predict.
The ID3 solution works as follows:
Step 1, obtain the class-labeled training tuples $D$ from the database.
Step 2, get the expected information (also known as entropy) needed to classify a tuple in $D$, where $p_i$ is the probability that a tuple in $D$ belongs to class $C_i$ and $m$ is the number of classes:
$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
(The word entropy originally denotes a measure of molecular randomness or disorder. It comes from the second law of thermodynamics, which implies that any spontaneous process increases the disorder of the universe.)
Step 3, calculate the information (entropy) still needed after partitioning on each given attribute. For example, for attribute $A$, which splits $D$ into partitions $D_1, \dots, D_v$:
$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$
Step 4, get the “Information gain” of each attribute, which is used as the attribute selection measure. For example, the information gain of attribute $A$:
$$Gain(A) = Info(D) - Info_A(D)$$
Step 5, find the attributes with the highest information gain and make them the splitting nodes of the decision tree. In other words, the attribute with the highest information gain reduces unpredictability the most.
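Steps 2 to 4 can be sketched in a few lines of Python. This is only an illustration of the entropy and information-gain calculation, not a full ID3 implementation, and the four toy tuples (with attributes `age` and `student` and label `buys`) are hypothetical values I made up so that one attribute is perfectly informative and the other is useless.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Expected information needed to classify a tuple: -sum(p_i * log2(p_i))."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def info_gain(rows, attribute, label):
    """Information gain obtained by splitting `rows` on `attribute`."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[label])
    info_d = entropy([row[label] for row in rows])
    info_a = sum(len(part) / len(rows) * entropy(part)
                 for part in partitions.values())
    return info_d - info_a


# Hypothetical toy training tuples (not taken from the book).
rows = [
    {"age": "youth",  "student": "yes", "buys": "yes"},
    {"age": "youth",  "student": "no",  "buys": "no"},
    {"age": "senior", "student": "yes", "buys": "yes"},
    {"age": "senior", "student": "no",  "buys": "no"},
]

for attribute in ("age", "student"):
    print(attribute, info_gain(rows, attribute, "buys"))
```

On this toy data `student` separates the classes completely and would be chosen as the splitting node, while `age` gains nothing.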
But consider an extreme case: a unique attribute such as customer ID. Each ID corresponds to one single leaf node, so every partition is pure and
$$Info_{\text{customer ID}}(D) = 0$$
The information gain is then maximal, but it makes no sense to choose customer ID as a splitting node.
Thus C4.5, a successor of ID3, was introduced. C4.5 uses an extension to information gain known as gain ratio, which attempts to overcome this bias. It normalizes information gain with a “split information” value:
$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$
For each outcome, it considers the number of tuples having that outcome with respect to the total number of tuples in $D$, which is a “ratio” consideration. The gain ratio is defined as
$$GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}$$
The attribute with the maximum gain ratio is selected as the splitting attribute.
Note that when $SplitInfo_A(D) \to 0$, the gain ratio becomes unstable. A constraint is added to avoid this, whereby the information gain of the test selected must be large (at least as great as the average gain over all tests examined).
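The effect of the normalization on a unique attribute can be seen in a small Python sketch. Again this is my own illustration with hypothetical toy tuples. Returning 0 when the split information is zero is a simplification I chose for the example. It is not the constraint described above.

```python
from collections import Counter
from math import log2


def entropy(values):
    """-sum(p * log2(p)) over the distinct values in the list."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def gain_ratio(rows, attribute, label):
    """Information gain of `attribute` divided by its split information."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[label])
    info_d = entropy([row[label] for row in rows])
    info_a = sum(len(part) / len(rows) * entropy(part)
                 for part in partitions.values())
    gain = info_d - info_a
    # Split information is the entropy of the attribute's own value distribution.
    split_info = entropy([row[attribute] for row in rows])
    if split_info == 0:
        return 0.0   # simplification for this sketch: attribute has one value
    return gain / split_info


# Hypothetical toy tuples; customer_id is unique per tuple.
rows = [
    {"customer_id": 1, "student": "yes", "buys": "yes"},
    {"customer_id": 2, "student": "no",  "buys": "no"},
    {"customer_id": 3, "student": "yes", "buys": "yes"},
    {"customer_id": 4, "student": "no",  "buys": "no"},
]

for attribute in ("customer_id", "student"):
    print(attribute, gain_ratio(rows, attribute, "buys"))
```

Both attributes have the same information gain on this data, but the large split information of `customer_id` halves its gain ratio, so `student` is preferred.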
Suppose that you are gambling on, for example, a simple poker game. You have already failed $m$ times. What is the probability that you fail at least $n$ more times before your first win? You may feel that after so many failures luck has accumulated and success is getting closer and closer. But in fact that is not how it works.
Several kinds of probability distribution have a property we call “memorylessness”.
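Among discrete distributions, consider a geometric random variable $X$ that counts the number of failures before the first success in a sequence of independent Bernoulli trials with success probability $p$,
$$P(X = k) = (1 - p)^k\, p, \quad k = 0, 1, 2, \dots$$
so the cumulative distribution function is
$$P(X \le k) = 1 - (1 - p)^{k + 1}, \quad \text{and hence} \quad P(X \ge k) = (1 - p)^k$$
Then we can find,
$$P(X \ge m + n \mid X \ge m) = \frac{P(X \ge m + n)}{P(X \ge m)} = \frac{(1 - p)^{m + n}}{(1 - p)^m} = (1 - p)^n = P(X \ge n)$$
That means that knowing we have already seen $m$ failures does not change the likelihood that it will take at least $n$ more failures before the first success. In other words, the number of trials still needed from now on does not depend on the trials that have already failed.
The same holds among continuous distributions. For an exponential random variable $X$ with rate $\lambda$, we have
$$f(x) = \lambda e^{-\lambda x}, \quad F(x) = P(X \le x) = 1 - e^{-\lambda x}, \quad x \ge 0$$
and for $s, t \ge 0$, we can also find the “memorylessness” property:
$$P(X > s + t \mid X > s) = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t)$$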
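The memoryless property of the geometric case is easy to observe in a simulation. The Python sketch below is my own illustration. The win probability 0.2, the values 3 and 4 for the past and future failure counts, the seed, and the number of simulated games are arbitrary assumptions.

```python
import random


def failures_before_success(rng, p):
    """Count failed Bernoulli trials before the first success."""
    failures = 0
    while rng.random() >= p:   # a draw below p counts as a win
        failures += 1
    return failures


rng = random.Random(42)        # fixed seed for repeatability
P_WIN, M, N = 0.2, 3, 4        # assumed win probability and failure counts

draws = [failures_before_success(rng, P_WIN) for _ in range(100_000)]

# Among games that already have at least M failures,
# how often do at least N further failures follow?
survivors = [x for x in draws if x >= M]
conditional = sum(1 for x in survivors if x >= M + N) / len(survivors)

# Among all games, how often are there at least N failures from the start?
unconditional = sum(1 for x in draws if x >= N) / len(draws)

exact = (1 - P_WIN) ** N
print(f"conditional = {conditional:.4f}")
print(f"unconditional = {unconditional:.4f}")
print(f"exact (1-p)^n = {exact:.4f}")
```

The two estimated frequencies should agree with each other and with the exact value up to sampling noise, which is the point: past failures carry no information.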
A supervisor from the infra team put a demanding question to me: exactly how much hard drive space is needed on our app server? The server receives file upload requests at random around the clock, probably from branches all over the world, and unfortunately does not have plenty of hard disk space. The good news is that, once the uploaded files have been processed, an app deletes them automatically within one minute. The upload request count is said to be approximately 800 per day, and the size of each file is about 2M.
So I got to work. If a file survives no more than one minute, the key is to find how many files could be uploaded within the same minute, such that the probability of seeing at most that many uploads in one minute is close enough to 1. I will be satisfied with 99.99%.
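Considering that the requests all arrive at random, intuitively the number of requests should follow a Poisson distribution.
Let’s look at how many requests arrive in one minute on average. This is the expectation value $\lambda$ of my Poisson distribution:
$$\lambda = \frac{800}{24 \times 60} \approx 0.556$$
OK, the Poisson distribution will work fine with this tiny $\lambda$.
In the Poisson distribution,
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots$$
So my problem becomes: find the smallest $n$ that satisfies
$$\sum_{k=0}^{n} \frac{\lambda^k e^{-\lambda}}{k!} \ge 0.9999$$
With $\lambda \approx 0.556$, the sum is about 0.9997 for $n = 4$ and about 0.99997 for $n = 5$. So $n = 5$ is what I am looking for. Drive space for 5 concurrent files should be the minimum requirement. That is 10M.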
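The search for the file count can be reproduced with a short Python sketch. The function names are mine, and the inputs (800 requests per day, 2M per file, a 99.99% target) are the figures quoted in the story.

```python
from math import exp, factorial

REQUESTS_PER_DAY = 800
FILE_SIZE_MB = 2
TARGET = 0.9999


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam ** k * exp(-lam) / factorial(k)


def smallest_count(lam, target, limit=1000):
    """Smallest n with P(X <= n) >= target (searching up to `limit`)."""
    cumulative = 0.0
    for n in range(limit + 1):
        cumulative += poisson_pmf(n, lam)
        if cumulative >= target:
            return n
    raise ValueError("target not reached within the search limit")


lam = REQUESTS_PER_DAY / (24 * 60)       # expected uploads per minute
n_files = smallest_count(lam, TARGET)
required_mb = n_files * FILE_SIZE_MB
print(f"lambda = {lam:.4f}, files = {n_files}, space = {required_mb}M")
```

Tightening the target (for example to 99.9999%) or changing the retention time only changes `TARGET` or `lam`, so the same search still applies.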
I found an interesting mathematical puzzle on the internet several days ago. The question was: pick 3 real numbers at random from 0 to 1, then find the mathematical expectation of the smallest one.
OK, let’s solve this with the power of probability techniques!
My solution is like this:
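In the domain of $[0, 1]$,
denote the 3 independent random variables as:
$$X_1, X_2, X_3 \sim U(0, 1), \quad Y = \min(X_1, X_2, X_3)$$
The cumulative distribution function of $Y$ is:
$$F_Y(y) = P(Y \le y) = 1 - P(Y > y) = 1 - P(X_1 > y)\,P(X_2 > y)\,P(X_3 > y)$$
so we have
$$F_Y(y) = 1 - (1 - y)^3, \quad 0 \le y \le 1$$
and the density function is:
$$f_Y(y) = F_Y'(y) = 3(1 - y)^2$$
Now the expectation value is:
$$E[Y] = \int_0^1 y \cdot 3(1 - y)^2\,dy$$
Using the Beta function is quick:
$$E[Y] = 3\,B(2, 3) = 3 \cdot \frac{\Gamma(2)\,\Gamma(3)}{\Gamma(5)} = 3 \cdot \frac{1!\,2!}{4!} = \frac{1}{4}$$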
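A quick Monte Carlo run gives an independent sanity check of this puzzle. The Python sketch below is my own addition, with an arbitrary seed and trial count. It draws three uniform numbers many times and averages the smallest one.

```python
import random

rng = random.Random(7)     # fixed seed for repeatability
TRIALS = 200_000

# Average of the smallest of three independent uniform(0, 1) draws.
estimate = sum(
    min(rng.random(), rng.random(), rng.random()) for _ in range(TRIALS)
) / TRIALS

print(f"estimated E[min] = {estimate:.4f}")
```

The estimate should land close to one quarter, which is the general pattern $\frac{1}{n + 1}$ for the minimum of $n$ independent uniform numbers with $n = 3$.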