The usual way of doing a markov chain bot (first order) is that for all pairs of tokens p and q, the likelihood of p following q is:
P(q|p)
P(p)
This gives us a probability graph that looks more or less like a line. The most common pairs are up top, and the least common are at the bottom. This is actually directly equivalent to the inverted graph of information entropy, where the least common would be up top and the most common at the bottom.
The problem is that in language, the most common sequences tend to be meaningless (or exist only for redundancy). Search engines filter these sequences out because they do nothing but make more work for systems that operate based on finding the sequences closest to unique. The outliers on the other side tend also to be meaningless, for a different reason: they tend to be errors. So, the optimal signal is actually of middling entropy.
How do we make bots that will (without special-case coding) automatically avoid succumbing to the usual exploits (such as the twelve year old troll who spams it in PM with the token “mantits” repeated eleven thousand times)? How do we reliably and elegantly improve the signal to noise ratio?
If you try to graph how likely something is to be signal-heavy in such a system, you’ll probably get a parabola that peaks about where the graph of probability and the graph of entropy cross. The goal is to make the graph of this weighted markov probability (which might be called cooked-model probability) quickly approach that of the signal. The easiest way to do this is, rather than incrementing both p and q when p follows q, doing the following:
Pn(q|p) <- Pn-1(q|p) + ( (Pn-1(q|p) )2 - (Pn-1(p) )2 )1/2 Pn-1(p) Pn-1(q|p)
Pn(p) <- Pn-1(p) + 1
As should be clear, the graph will rapidly approach resemblance to the signal graph, and it will slow its mutation as it gets closer to the signal graph, for a known sequence of tokens. This means that such a bot should be capable of operating at a similar signal to noise ratio as some standard with a much smaller training input set.
No comments:
Post a Comment