The opportunity to meet the pretty girl in the smoking room.

I am a moderate smoker. A pretty girl who works at the same company smokes sporadically in the same smoking room.

A casual thought came into my mind, something like a mathematical puzzle. Suppose the girl and I have exactly the same smoking habit. To put it more mathematically: in an eight-hour working day she smokes 6 cigarettes in total, each taking 10 minutes, one cigarette at a time, and she enters the smoking room at completely random times. The same goes for me. What, then, is the probability that I meet the pretty girl when I enter the smoking room?

In probability theory, when a kind of event happens sporadically and independently over a period of time or a region of space, the number of occurrences is described by the Poisson distribution, which is discrete.

On the other hand, the waiting time until the next Poisson event occurs follows the exponential distribution, with density

\displaystyle f(x)=\lambda e^{-\lambda x} \hspace{6mm} (x>0)

Accordingly, in this story, the girl and I share the same randomness, with no mutual interference. The probability of meeting her is exactly the probability that her next Poisson event occurs within the following time \displaystyle t, i.e. during the 10 minutes I spend in the smoking room.

\displaystyle P(0 \leqslant X \leqslant t) = \int_{0}^{t}\lambda e^{-\lambda x}dx

\displaystyle t=10 minutes,
\displaystyle \lambda=\frac{6}{8*60}=\frac{1}{80} per minute

Integrate it:

\displaystyle \int_{0}^{10}\frac{1}{80}e^{-\frac{1}{80} x}dx \\\\ = 1 - e^{-\frac{1}{8}} \\\\ \approx 0.1175
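As a quick numerical check, here is a minimal Python sketch of the calculation above (the rate and time window are the ones assumed in this story):

```python
import math

# Assumed model from above: her cigarette starts form a Poisson process
# with rate lambda = 6 cigarettes per 480 working minutes = 1/80 per minute.
lam = 6 / (8 * 60)
t = 10  # minutes I stay in the smoking room

# P(0 <= X <= t) for the exponential waiting time: 1 - exp(-lambda * t)
p_meet = 1 - math.exp(-lam * t)
print(round(p_meet, 4))  # -> 0.1175
```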

Proof of the Planck Law (in the energy density form) with statistical mechanics

Max Planck, the founder of quantum theory, discovered it in 1900 and coined the term “quantum”.

Max Planck

Just before the dawn of quantum mechanics, an idealized physical concept called the “black body” was used to study the relationship between temperature and the radiation spectrum.

For a black-body cavity in thermal equilibrium (at a constant temperature), the classical expression for the radiation energy density, also dubbed the “Rayleigh-Jeans Law”, is:

\displaystyle U(\nu) = \frac{8\pi \nu^2}{c^3}kT

\displaystyle U(\nu): spectral energy density.
\displaystyle \frac{8\pi \nu^2}{c^3}: radiation modes per unit frequency per unit volume.
\displaystyle c: speed of light.
\displaystyle kT: average energy per mode.

But in the quantum treatment, Planck described it as follows:

\displaystyle U(\nu) = \frac{8\pi}{c^3} \frac{h\nu^3}{\exp \left(\frac{h\nu}{kT} \right) - 1}

The difference from the “Rayleigh-Jeans Law” equation is that the average energy per mode \displaystyle kT is replaced by the factor \displaystyle \frac{h\nu}{\exp \left(\frac{h\nu}{kT} \right) - 1}

Rayleigh-Jeans Law and Planck Law curve

The Planck Law can be derived by combining quantum-mechanical and statistical-mechanical considerations.

In statistical mechanics, for a system in thermal equilibrium at a constant temperature \displaystyle T, the probability of a given state \displaystyle i with energy \displaystyle E_i is:

\displaystyle P(E_i) = \frac{\exp \left(-\beta E_i \right)}{\sum \limits_{n=0}^{\infty} \exp \left(-\beta E_n \right)} \hspace{6mm} (n=0,1,2,...) \hspace{6mm} \textcircled{1}

\displaystyle \beta=\frac{1}{kT} is the inverse temperature; the exponential \displaystyle \exp \left(-\beta E_i \right) is called the Boltzmann factor.

Note: The name quantum mechanics derives from the observation that some physical quantities can change only in discrete amounts, not in a continuous way.

Since the allowed energies are \displaystyle E_n = nh\nu, the denominator of \displaystyle \textcircled{1} (the partition function \displaystyle Z) is a geometric series:

\displaystyle Z = \sum \limits_{n=0}^{\infty} \exp \left(-\beta E_n \right) \\\\= 1 + e^{-\beta h\nu} + e^{-2\beta h\nu} + ... \\\\=\frac{1}{1-\exp \left(-\beta h\nu \right)} \hspace{6mm} \textcircled{2}

Now consider the expectation value (mean value) of the energy, just as in an ordinary probability calculation:

\displaystyle \langle E \rangle = \sum \limits_{i=0}^{\infty} E_i P(E_i) \\\\= \frac{\sum \limits_{i=0}^{\infty} E_i \exp \left(-\beta E_i \right)}{\sum \limits_{n=0}^{\infty} \exp \left(-\beta E_n \right)} \\\\=\frac{-\frac{\partial}{\partial \beta} \sum \limits_{i=0}^{\infty} \exp \left(-\beta E_i \right)}{\sum \limits_{n=0}^{\infty} \exp \left(-\beta E_n \right)} \\\\=-\frac{\partial}{\partial \beta}\ln \left(\sum \limits_{n=0}^{\infty} \exp \left(-\beta E_n \right) \right) \\\\=-\frac{\partial}{\partial \beta} \ln Z \hspace{6mm} \textcircled{3}

Substituting the \displaystyle Z obtained in \displaystyle \textcircled{2} into \displaystyle \textcircled{3} and taking the partial derivative with respect to \displaystyle \beta gives:

\displaystyle \langle E \rangle = \frac{h\nu}{\exp \left(\frac{h\nu}{kT} \right)-1}

Multiplying by the number of radiation modes per unit frequency per unit volume, \displaystyle \frac{8\pi \nu^2}{c^3}, we obtain the Planck Law:

\displaystyle U(\nu) = \frac{8\pi}{c^3} \frac{h\nu^3}{\exp \left(\frac{h\nu}{kT} \right) - 1}
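As a sanity check, the closed form for \displaystyle \langle E \rangle can be compared against the direct Boltzmann-weighted sum over \displaystyle E_n = nh\nu. This is a rough numerical sketch; the chosen frequency and temperature are merely illustrative:

```python
import math

h = 6.62607015e-34   # Planck constant, J*s
k = 1.380649e-23     # Boltzmann constant, J/K

def mean_energy_closed_form(nu, T):
    """<E> = h*nu / (exp(h*nu / kT) - 1), from the derivation above."""
    return h * nu / math.expm1(h * nu / (k * T))

def mean_energy_direct_sum(nu, T, n_max=2000):
    """Boltzmann-weighted average of E_n = n*h*nu, truncated at n_max."""
    beta = 1.0 / (k * T)
    weights = [math.exp(-beta * n * h * nu) for n in range(n_max)]
    Z = sum(weights)                                  # partition function, equation (2)
    return sum(n * h * nu * w for n, w in zip(range(n_max), weights)) / Z

nu, T = 1e13, 300.0   # an illustrative frequency (Hz) and temperature (K)
print(mean_energy_closed_form(nu, T))
print(mean_energy_direct_sum(nu, T))   # the two values should agree closely
```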

Proof of Chebyshev’s Inequality

Chebyshev’s Inequality is an important tool in probability theory, and it is a theoretical basis for proving the weak law of large numbers.

The theorem is named after Pafnuty Chebyshev, one of the greatest Russian mathematicians.

Пафну́тий Льво́вич Чебышёв (Pafnuty Lvovich Chebyshev)

It is described as follows:

For a random variable \displaystyle X with mathematical expectation \displaystyle \mu=E(X) and variance \displaystyle \sigma^2=V(X), and for any \displaystyle k>0,

\displaystyle P\left(\mid X-\mu \mid \geqslant k\sigma \right) \leqslant \frac{1}{k^2}

The fabulous thing is that Chebyshev’s Inequality works knowing only the mathematical expectation and the variance, whatever the distribution is (discrete or continuous).

Here is the proof of Chebyshev’s Inequality.

1. In the case of a discrete random variable \displaystyle X with probability mass function \displaystyle f(x),

\displaystyle \sigma^2 = \sum\limits_{i=1} (x_i-\mu)^2f(x_i) \hspace{8mm} \textcircled 1

For those \displaystyle x_j in the region where \displaystyle \mid x_j-\mu \mid \geqslant k\sigma, we have \displaystyle (x_j-\mu)^2 \geqslant k^2\sigma^2. So,

\displaystyle \textcircled 1 \geqslant \sum\limits_{j, \mid X-\mu \mid \geqslant k\sigma} (x_j-\mu)^2f(x_j) \\\\ \geqslant \sum\limits_{j, \mid X-\mu \mid \geqslant k\sigma} k^2\sigma^2f(x_j) \\\\ = k^2\sigma^2\sum\limits_{j, \mid X-\mu \mid \geqslant k\sigma} f(x_j) \\\\ = k^2\sigma^2 P \left(\mid X-\mu \mid \geqslant k\sigma \right)

Dividing both sides by \displaystyle k^2\sigma^2, we get the inequality \displaystyle P\left(\mid X-\mu \mid \geqslant k\sigma \right) \leqslant \frac{1}{k^2}.

2. In the case of a continuous random variable \displaystyle X,

\displaystyle \sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2f(x)dx  \\\\ = \int_{\mid X-\mu \mid \geqslant k\sigma} (x-\mu)^2f(x)dx + \int_{\mid X-\mu \mid < k\sigma} (x-\mu)^2f(x)dx \\\\ \geqslant \int_{\mid X-\mu \mid \geqslant k\sigma} (x-\mu)^2f(x)dx \hspace{8mm} \textcircled 2

Just as in the discrete case, for those \displaystyle x in the region where \displaystyle \mid x-\mu \mid \geqslant k\sigma, we have \displaystyle (x-\mu)^2 \geqslant k^2\sigma^2. So,

\displaystyle \textcircled 2 \geqslant \int_{\mid X-\mu \mid \geqslant k\sigma} k^2\sigma^2f(x)dx \\\\ = k^2\sigma^2 \int_{\mid X-\mu \mid \geqslant k\sigma} f(x)dx \\\\ = k^2\sigma^2 P \left(\mid X-\mu \mid \geqslant k\sigma \right)

And we can get the same inequality \displaystyle P\left(\mid X-\mu \mid \geqslant k\sigma \right) \leqslant \frac{1}{k^2}.
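A small simulation makes the bound concrete. This is only a sketch; the exponential distribution, sample size, and values of k are arbitrary choices:

```python
import random

# Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2 for any distribution.
# Check empirically with an exponential distribution, where mu = sigma = 1/lambda.
random.seed(0)
lam = 0.5
mu = sigma = 1 / lam
samples = [random.expovariate(lam) for _ in range(1_000_000)]

for k in (1.5, 2, 3):
    tail = sum(abs(x - mu) >= k * sigma for x in samples) / len(samples)
    print(f"k={k}: empirical tail {tail:.4f} <= bound {1 / k**2:.4f}")
```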

Is it possible to pass a test all by chance?

Suppose you are taking a test with 100 questions. Each question has 4 options, only one of which is correct, and is worth 1 point. The pass line is set at 60 points. Now you draw a lottery among A, B, C, D to guess the answer to every question, purely by chance.

Is it possible to pass the test all by chance?
Let’s think of it by serious mathematics.

First, for each question the probability that you choose the correct answer is \displaystyle p=\frac{1}{4}, and there is no doubt that each choice is an independent Bernoulli trial. So the number of correct answers follows a Binomial Distribution. Denote the random variable counting the successful choices by \displaystyle X.

\displaystyle X \sim B(100,\frac{1}{4})

The probability we are chasing is:

\displaystyle P(X \geqslant 60)=?

In general, the probability of \displaystyle k successes in \displaystyle n trials is

\displaystyle P(X=k)\\\\=\binom{n}{k}p^k(1-p)^{n-k} \hspace{6mm} (k=0,1,2,..., n)

And the probabilities over all \displaystyle k sum to 1:

\displaystyle \sum\limits_{k=0}^n \binom{n}{k}p^k(1-p)^{n-k}=1

So in our problem here,

\displaystyle P(X \geqslant 60)=\sum\limits_{k=60}^{100} \binom{100}{k}(\frac{1}{4})^k(\frac{3}{4})^{100-k}

Obviously, it is pretty hard to calculate this directly in that form.
But when \displaystyle n is large enough, we can approximate the Binomial Distribution by a Normal Distribution, thanks to the famous “de Moivre–Laplace theorem”. So, approximately,

\displaystyle X \sim N(np, np(1-p)).

By the way, a Normal Distribution density function is:

\displaystyle f(x)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left({-\frac{(x-\mu)^2}{2\sigma^2}}\right).

\displaystyle P(X \geqslant a)=\int^{\infty}_a \frac{1}{\sqrt{2\pi}\sigma}\exp\left({-\frac{(x-\mu)^2}{2\sigma^2}}\right)dx.

And approximately, what we need in our problem is the value of:

\displaystyle P(X \geqslant 60)\\\\=\int^{\infty}_{60} \frac{1}{\sqrt{2\pi}\sqrt{18.75}}\exp\left({-\frac{(x-25)^2}{2*18.75}}\right)dx

In order to look up in a Normal Distribution table, transform it to the Standard Normal Distribution format.

\displaystyle Z=\frac{X-\mu}{\sigma}=\frac{X-25}{\sqrt{18.75}}

Finally, use Standard Normal Distribution Table to find the result of this:

\displaystyle P(X \geqslant 60)\\\\=1-\Phi(Z \leqslant \frac{60-25}{\sqrt{18.75}})\\\\ \approx 1-\Phi(Z \leqslant 8.0829)\\\\ \approx 0
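The exact binomial tail can also be computed directly with the standard library, which confirms the verdict of the normal approximation (a minimal sketch):

```python
from math import comb

n, p = 100, 0.25
# Exact P(X >= 60) for X ~ B(100, 1/4)
p_pass = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(60, n + 1))
print(p_pass)   # on the order of 1e-13, i.e. effectively zero
```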

Ok, you see it: it is wise to study hard and not to play a lottery on your test. It does not help anyway.

Decision Tree and Entropy algorithm.

This is a study note on “Data Mining: Concepts and Techniques” by Jiawei Han.

In data mining and machine learning, there is a traditional methodology called Decision Tree.

Here is an extremely simple decision tree example.
It shows how to predict whether a customer will buy a given kind of computer:

decision tree

Decision Tree is a flowchart-like tree structure. The parts and their names are like this:
1. Node
* Root node (top of a decision tree)
* Non-leaf node (an internal node, which denotes a test on an attribute)
* Leaf node (a terminal node, which holds a class label)
2. Branch (an outcome of the test)

What a decision tree is for, in simple words, is prediction.

For the ID3 solution:
Step 1, procure the class-labeled training tuples from the database.

Step 2, compute the expected information (also known as entropy) needed to classify a tuple in D:

\displaystyle Entropy(D) = Info(D) = -\sum\limits_{i=1}^n p_i\log_2 p_i

(The word entropy, which originally denotes a measure of molecular randomness or disorder, comes from the second law of thermodynamics, which states that any spontaneous process increases the disorder of the universe.)

Step 3, calculate the information (entropy) for each given attribute. For example, on attribute A:

\displaystyle Entropy(A) = Info_A (D) = \sum\limits_{j=1}^m \frac{|D_j|}{|D|} *Info(D_j)

Step 4, compute the “information gain” of each attribute, which is used as the attribute selection measure. For example, the information gain of attribute A:

\displaystyle Gain(A) = Info(D) - Info_A (D)

Step 5, pick the attribute with the highest information gain and make it the splitting node of the decision tree, then repeat on each branch. Another way of saying that is, the highest information gain reduces unpredictability the most. A minimal code sketch of steps 2–4 is given below.
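Here is a minimal Python sketch of steps 2–4 on a tiny, made-up table (the attribute values and class labels are hypothetical, in the spirit of the “buys a computer” example above):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Info(D) = -sum p_i * log2(p_i) over the class labels in D."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attr, label):
    """Gain(A) = Info(D) - Info_A(D), with each subset D_j weighted by |D_j|/|D|."""
    info_d = entropy([r[label] for r in rows])
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attr], []).append(r[label])
    info_a = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return info_d - info_a

# Hypothetical class-labeled training tuples D.
D = [
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "middle", "student": "no",  "buys_computer": "yes"},
    {"age": "senior", "student": "no",  "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
]
for a in ("age", "student"):
    print(a, round(info_gain(D, a, "buys_computer"), 3))
# ID3 splits on whichever attribute prints the larger gain.
```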

But consider an extreme case: for a unique attribute such as customer ID, each ID corresponds to exactly one tuple, i.e. one single leaf node, so

\displaystyle Info_{CustomerID}(D) = 0

The information gain then becomes maximal, but it makes no sense to choose customer ID as a splitting node.

Thus the C4.5 solution, a successor of ID3, is introduced. C4.5 uses an extension of information gain known as gain ratio, which attempts to overcome this bias. It normalizes the information gain with a “split information” value:

\displaystyle SplitInfo_A(D) = -\sum\limits_{j=1}^m \frac{|D_j|}{|D|} *\log_2 \left(\frac{|D_j|}{|D|}\right)

For each outcome, it considers the number of tuples having that outcome relative to the total number of tuples in D, which is where the “ratio” comes from. The gain ratio is defined as

\displaystyle GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}

The attribute with the maximum gain ratio is selected as the splitting attribute.
Note that when \displaystyle SplitInfo_A(D) \rightarrow 0, the gain ratio becomes unstable. A constraint is added to avoid this: the information gain of the selected test must be large (at least as great as the average gain over all tests examined).
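Continuing the sketch above (reusing its hypothetical info_gain helper and toy table D), the gain ratio only needs the extra SplitInfo term:

```python
def split_info(rows, attr):
    """SplitInfo_A(D) = -sum (|D_j|/|D|) * log2(|D_j|/|D|)."""
    counts = Counter(r[attr] for r in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain_ratio(rows, attr, label):
    si = split_info(rows, attr)
    # Guard against SplitInfo -> 0 (e.g. a single-valued attribute).
    return info_gain(rows, attr, label) / si if si > 0 else 0.0

for a in ("age", "student"):
    print(a, round(gain_ratio(D, a, "buys_computer"), 3))
```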

“Memorylessness” property of some distributions.

Suppose you are gambling, say in a simple poker game. You have already failed \displaystyle s times; what is the probability that you fail more than \displaystyle s+t times before your first win? You may feel that after so many failures luck has been accumulating and success is getting closer and closer. But in fact that is not how it works.

There are several kinds of probability distribution in nature with the property we call “memorylessness”.

Among discrete distributions, consider a geometric random variable \displaystyle X that counts the number of failures before the first success in a sequence of independent Bernoulli trials with success probability \displaystyle p:

\displaystyle P(X=k)=p(1-p)^k \hspace{6mm} (k=0,1,2,...)

so the tail probability is

\displaystyle P(X>s)\\\\=1-P(X \leqslant s)\\\\=1-\sum\limits_{k=0}^s p(1-p)^k\\\\=(1-p)^{s+1}

Then, since \displaystyle P(X \geqslant s)=P(X>s-1)=(1-p)^{s}, we can find

\displaystyle P(X>s+t \mid X \geqslant s)\\\\=\frac{P(X>s+t \cap X \geqslant s)}{P(X \geqslant s)}\\\\=\frac{P(X>s+t)}{P(X \geqslant s)}\\\\=\frac{(1-p)^{s+t+1}}{(1-p)^{s}}\\\\=(1-p)^{t+1}\\\\=P(X>t)

That means knowing that we have already seen \displaystyle s failures does not change the likelihood that it will take at least \displaystyle t more trials until the first success. Put another way, the number of trials until success from now on does not depend on the trials that have already failed.
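A quick simulation of the geometric case (a sketch; the values of p, s, and t are arbitrary):

```python
import random

random.seed(1)
p, s, t, N = 0.3, 4, 3, 1_000_000

def failures_before_success(p):
    """Count failed Bernoulli trials before the first success."""
    k = 0
    while random.random() >= p:
        k += 1
    return k

xs = [failures_before_success(p) for _ in range(N)]
cond = sum(x > s + t for x in xs) / sum(x >= s for x in xs)  # P(X > s+t | X >= s)
uncond = sum(x > t for x in xs) / N                          # P(X > t)
print(round(cond, 4), round(uncond, 4), round((1 - p) ** (t + 1), 4))  # all close to 0.24
```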

And also in continuous distributions, for an exponential random variable \displaystyle X \sim Exp(\lambda), we have

\displaystyle f(x)=\lambda e^{-\lambda x} \hspace{6mm} (x>0)

and for \displaystyle s>0, t>0, we can also find the “memorylessness” property:

\displaystyle P(X>s+t\mid X>s) \\\\= \frac{P(X>s+t\cap X>s)}{P(X>s)} \\\\= \frac{P(X>s+t)}{P(X>s)} \\\\ = \frac{\displaystyle \int_{s+t}^{\infty}\lambda e^{-\lambda x}dx}{\displaystyle \int_{s}^{\infty}\lambda e^{-\lambda x}dx} \\\\= e^{-\lambda t} \\\\= P(X>t)
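And the same check for the exponential case (again a sketch; the values of λ, s, and t are arbitrary):

```python
import math
import random

random.seed(2)
lam, s, t, N = 0.8, 1.5, 2.0, 1_000_000
xs = [random.expovariate(lam) for _ in range(N)]

cond = sum(x > s + t for x in xs) / sum(x > s for x in xs)  # P(X > s+t | X > s)
print(round(cond, 4), round(math.exp(-lam * t), 4))         # both ~ P(X > t) = e^(-lambda*t)
```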

Determining the required disk space on a server.

An exacting question came to me from a supervisor on the infra team, asking exactly how much hard drive space is needed on our app server. The server receives file upload requests at random around the clock, probably from branches all over the world, and unfortunately it does not have plenty of disk space. The good news is that after some processing, an uploaded file is deleted automatically by an app within one minute. The approximate number of upload requests is 800 per day, and the size of each file is about 2 MB.

So I got to work. Since a file survives no more than one minute, the key is to find how many files could arrive concurrently within one minute, such that the probability of seeing at most that many uploads in any one minute is close enough to 1. I will be satisfied with 99.99%.

Since the requests all arrive stochastically and independently, the number of requests per minute should intuitively follow a Poisson distribution.

\displaystyle X \sim Po(\lambda)

Let us look at the average number of requests in one minute, which is the expectation value \displaystyle \lambda of the Poisson distribution here:

\displaystyle \lambda=\frac{800}{24*60}\approx 0.6

Ok, Poisson distribution will work fine with this tiny \displaystyle \lambda.

In Poisson distribution,

\displaystyle P(X=k)=\frac{\lambda^ke^{-\lambda}}{k!}\hspace{6mm}(k=0,1,2,...)

So my problem becomes: find the smallest \displaystyle n that satisfies

\displaystyle \sum\limits_{k=0}^n P(X=k)\geqslant 0.9999

When \displaystyle n=5,

\displaystyle \sum\limits_{k=0}^5 P(X=k)\approx 0.99996
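The same search can be scripted in a few lines (a sketch; λ = 0.6 and the 99.99% target are the values assumed above):

```python
from math import exp, factorial

lam, target = 0.6, 0.9999

def poisson_pmf(k, lam):
    """P(X = k) = lambda^k * e^(-lambda) / k!"""
    return lam**k * exp(-lam) / factorial(k)

n, cdf = 0, poisson_pmf(0, lam)
while cdf < target:          # find the smallest n with P(X <= n) >= 0.9999
    n += 1
    cdf += poisson_pmf(n, lam)
print(n, round(cdf, 5))      # -> 5 0.99996, i.e. room for 5 files (~10 MB)
```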

So this is what I am looking for: drive space for 5 files at once should be the minimum requirement. That is 10 MB.