The Easy Way to Extract Useful Text from Arbitrary HTML

Let y denote text items and z non-text items, each measured by its length. If the lengths in each class follow a normal distribution and an item is a text item with probability p, the optimal threshold is:

(1)
\theta^* = \frac{\mu_y t_y - \mu_z t_z \pm \sqrt{ t_y t_z } ( \mu_y - \mu_z)}{ t_y - t_z}

where \mu_y and \mu_z are the average lengths of the y- and z-values:

(2)
\mu_y = \bar y = \frac{1}{|y|} \sum_{i=1}^{|y|} y_i
(3)
\mu_z = \bar z = \frac{1}{|z|} \sum_{i=1}^{|z|} z_i

t_y = \sigma_y^2 \ln(p) and t_z = \sigma_z^2 \ln(1-p), where p is the frequency of text items:

(4)
p = \frac{|y|}{|y| + |z|}

and \sigma_y^2 and \sigma_z^2 are the variances, i.e. the mean squared deviations from the respective means:

(5)
\sigma_y^2 = \frac{1}{|y|} \sum_{i=1}^{|y|} (y_i - \bar y)^2
(6)
\sigma_z^2 = \frac{1}{|z|} \sum_{i=1}^{|z|} (z_i - \bar z)^2
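As a minimal sketch, equations (1) to (6) can be evaluated directly from two lists of lengths. The function name `threshold_candidates` and the choice to raise an error when t_y equals t_z are assumptions made for this illustration, not part of the original derivation. The function returns both roots of equation (1), because the text does not say which sign of the \pm to use.

```python
import math
from statistics import fmean, pvariance


def threshold_candidates(y, z):
    """Return both roots of equation (1), smallest first.

    y: lengths of text items, z: lengths of non-text items.
    Both lists must be non-empty (hypothetical helper, for illustration).
    """
    mu_y, mu_z = fmean(y), fmean(z)              # equations (2) and (3)
    p = len(y) / (len(y) + len(z))               # equation (4)
    t_y = pvariance(y, mu_y) * math.log(p)       # sigma_y^2 * ln(p), uses (5)
    t_z = pvariance(z, mu_z) * math.log(1 - p)   # sigma_z^2 * ln(1 - p), uses (6)

    denom = t_y - t_z
    if denom == 0:
        # Assumption: treat the degenerate case as an error rather than guess.
        raise ValueError("t_y equals t_z, so equation (1) is undefined")

    base = mu_y * t_y - mu_z * t_z
    # t_y and t_z are both <= 0, so their product is non-negative.
    spread = math.sqrt(t_y * t_z) * (mu_y - mu_z)
    return sorted([(base + spread) / denom, (base - spread) / denom])
```

Calling `threshold_candidates(text_lengths, markup_lengths)` with two non-empty lists yields the two candidate thresholds. Population variance (`pvariance`) is used because equations (5) and (6) divide by |y| and |z| rather than by |y| - 1 and |z| - 1.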

This model does not take into account that text items are more frequent near the middle of the page, or that text segments are more likely to appear next to each other.

Example (3 text items with lengths 0.2, 0.6 and 0.7, and 2 markup items with lengths 0.3 and 0.36):

(7)
y_1 = 0.2, y_2 = 0.6, y_3 = 0.7
(8)
z_1 = 0.3, z_2 = 0.36
(9)
\bar y = (0.2 + 0.6 + 0.7) / 3 = 0.5
(10)
\bar z = (0.3 + 0.36) / 2 = 0.33
(11)
p = 3 / (3 + 2) = 0.6
(12)
\sigma_y^2 = ((0.2 - 0.5)^2 + (0.6 - 0.5)^2 + (0.7 - 0.5)^2 ) / 3 = 0.0466...
(13)
\sigma_z^2 = ((0.3 - 0.33)^2 + (0.36 - 0.33)^2 ) / 2 = 0.0009
(14)
t_y = 0.0466... \ln 0.6 = -0.0238...
(15)
t_z = 0.0009 \ln 0.4 = -0.000824...
(16)
\theta^* = \frac{\mu_y t_y - \mu_z t_z \pm \sqrt{ t_y t_z } ( \mu_y - \mu_z)}{ t_y - t_z} = \frac{0.5 \cdot (-0.0238) - 0.33 \cdot (-0.000824) \pm \sqrt{ (-0.0238) \cdot (-0.000824) } ( 0.5 - 0.33)}{ -0.0238 - (-0.000824)} \approx 0.473 \ \text{or} \ 0.539

Compare to the average over all samples:

(17)
(0.2 + 0.6 + 0.7 + 0.3 + 0.36) / 5 = 0.432
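To make the arithmetic of the example easy to re-check, the following sketch recomputes each intermediate quantity with plain Python and prints how far each root of the threshold formula lies from the plain average. The variable names are chosen for this illustration only, and the script takes no position on which of the two roots should be used.

```python
import math

y = [0.2, 0.6, 0.7]   # text item lengths from the example
z = [0.3, 0.36]       # markup item lengths from the example

mean_y = sum(y) / len(y)
mean_z = sum(z) / len(z)
p = len(y) / (len(y) + len(z))

# Population variances: divide by the number of items, as in the definitions.
var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
var_z = sum((v - mean_z) ** 2 for v in z) / len(z)

t_y = var_y * math.log(p)
t_z = var_z * math.log(1 - p)

base = mean_y * t_y - mean_z * t_z
spread = math.sqrt(t_y * t_z) * (mean_y - mean_z)
roots = sorted([(base + spread) / (t_y - t_z), (base - spread) / (t_y - t_z)])

# Plain average over all five samples, for comparison.
mean_all = sum(y + z) / len(y + z)

for name, value in [("mean_y", mean_y), ("mean_z", mean_z), ("p", p),
                    ("var_y", var_y), ("var_z", var_z),
                    ("t_y", t_y), ("t_z", t_z), ("mean_all", mean_all)]:
    print(f"{name:<8} = {value:.6f}")

for theta in roots:
    print(f"theta*   = {theta:.6f}  (differs from mean_all by {theta - mean_all:+.6f})")
```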

The difference between this average and the threshold \theta^* should be particularly large when one class contains many more items than the other, or when one class is much more spread out than the other.
