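Let $T$ denote the set of text items and $M$ the set of non-text items. Assume that the length $y$ of the items in each class follows a normal distribution, and that an item is a text item with probability $p$. The optimal threshold $t$ is then the length at which the two weighted densities are equal:

$$\frac{p}{\sigma_T\sqrt{2\pi}} \exp\left(-\frac{(t-\mu_T)^2}{2\sigma_T^2}\right) = \frac{1-p}{\sigma_M\sqrt{2\pi}} \exp\left(-\frac{(t-\mu_M)^2}{2\sigma_M^2}\right)$$

where $\mu_T$ and $\mu_M$ are the average lengths (y-values) of the items in each class:

$$\mu_T = \frac{1}{|T|}\sum_{y \in T} y, \qquad \mu_M = \frac{1}{|M|}\sum_{y \in M} y,$$

$p$ is the frequency of text items:

$$p = \frac{|T|}{|T| + |M|},$$

and $\sigma_T^2$ and $\sigma_M^2$ are the variances, that is, the mean squared deviation from the class mean:

$$\sigma_T^2 = \frac{1}{|T|}\sum_{y \in T} (y - \mu_T)^2, \qquad \sigma_M^2 = \frac{1}{|M|}\sum_{y \in M} (y - \mu_M)^2$$
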
This model does not take into account that text items are more frequent around the middle of the page, or that text segments are more likely to occur next to each other.
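Example (3 text items with lengths 0.2, 0.6 and 0.7, and 2 markup items with lengths 0.3 and 0.36):

Text items: frequency $3/5 = 0.6$, mean length $(0.2 + 0.6 + 0.7)/3 = 0.5$, variance $(0.3^2 + 0.1^2 + 0.2^2)/3 \approx 0.047$.

Markup items: frequency $2/5 = 0.4$, mean length $(0.3 + 0.36)/2 = 0.33$, variance $(0.03^2 + 0.03^2)/2 = 0.0009$.
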
Compare this to the average over all samples:

$$\frac{0.2 + 0.6 + 0.7 + 0.3 + 0.36}{5} = 0.432 \tag{17}$$
The difference between the optimal threshold and this average should be particularly large when one class contains many more items than the other, or when one class is more spread out than the other.
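
The comparison can be checked numerically. The following is a minimal Python sketch, not a definitive implementation. It assumes that the optimal threshold is the length at which the frequency-weighted normal densities of the two classes are equal, and that the variance is the mean squared deviation from the class mean. The function name `gaussian_thresholds` and its arguments are hypothetical, introduced only for this illustration.

```python
import math
from statistics import fmean, pvariance


def gaussian_thresholds(text, markup):
    """Return the lengths where the two weighted normal densities cross.

    Assumption: each class holds at least two distinct lengths, so both
    variances are non-zero, and the class means differ.
    """
    p = len(text) / (len(text) + len(markup))          # frequency of text items
    mu_t, mu_m = fmean(text), fmean(markup)            # class means
    var_t, var_m = pvariance(text), pvariance(markup)  # mean squared deviation

    # Taking logarithms of
    #   p * N(t; mu_t, var_t) = (1 - p) * N(t; mu_m, var_m)
    # yields the quadratic a*t^2 + b*t + c = 0 with:
    a = 1 / (2 * var_m) - 1 / (2 * var_t)
    b = mu_t / var_t - mu_m / var_m
    c = (mu_m ** 2 / (2 * var_m) - mu_t ** 2 / (2 * var_t)
         - math.log((1 - p) / math.sqrt(var_m))
         + math.log(p / math.sqrt(var_t)))

    if math.isclose(var_t, var_m):
        # Equal variances: the quadratic term vanishes, one crossing point.
        return [-c / b]

    disc = b * b - 4 * a * c
    if disc < 0:
        return []  # the weighted densities never cross
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])


text = [0.2, 0.6, 0.7]
markup = [0.3, 0.36]
print(gaussian_thresholds(text, markup))  # crossing points of the two densities
print(fmean(text + markup))               # plain average over all samples
```

Under these assumptions the two densities cross at roughly 0.27 and 0.39 for the example data, while the plain average is about 0.43. Because the markup lengths are tightly clustered, only a narrow band around 0.33 is attributed to markup in this model, and lengths outside that band are treated as text.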





