Preference and restriction bias in decision trees

In machine learning, we begin with the space of all possible hypotheses about the true function that best predicts the data. These could be linear functions, polynomials, series of conjunctions and disjunctions, and so on. The model we use to select from this hypothesis space necessarily imposes inductive biases: assumptions about how to generalize from the training data to that function. This is obviously a necessary step – it is infeasible to manually evaluate all possible function approximations – as much as it is a risky one.

Mitchell (1997) distinguishes between two types of inductive bias: preference and restriction. Restriction bias narrows the hypothesis space by eliminating all but certain kinds of functions. A single-layer perceptron, for example, famously can only produce linearly separable classifications, which prevents it from modeling even simple functions like XOR. Preference bias, in contrast, refers to how a single hypothesis gets selected from the space of all possible hypotheses. Linear regression will prefer hypotheses that minimize a loss function. Decision trees will typically prefer shorter (‘simpler’) trees over longer ones. KNN prefers… nearer neighbors over farther ones, per some definition of ‘near’. And so on.
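
For the perceptron half of that distinction, a minimal sketch (using scikit-learn and a hand-rolled XOR truth table; the classifier settings are arbitrary assumptions on my part) shows the restriction in action: no hypothesis the perceptron can express fits XOR, while a decision tree fits it trivially.

```python
# A minimal sketch of the perceptron's restriction bias on XOR.
# Dataset is the XOR truth table; classifier settings are arbitrary.
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR

perceptron = Perceptron(max_iter=1000).fit(X, y)
tree = DecisionTreeClassifier().fit(X, y)

print(perceptron.score(X, y))  # below 1.0 -- no linear separator exists for XOR
print(tree.score(X, y))        # 1.0 -- axis-aligned splits can carve out XOR
```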

In other words, it looks like restriction bias limits what kinds of functions we can get as hypotheses, and preference bias tells us how we pick a favorite. We might even be tempted to think of it as the restriction giving us the general equation (like y = mx + b), and the preference giving us the specific values for that equation (m = 3, b = 2.5). But isn’t a preference metric (e.g. a particular loss function) a limit – maybe even a restriction – on ‘bad’ hypotheses?

My thought here was that the least-leaky way to think about this is that restrictions describe the space of possible hypothesis functions before we ever see any data, and preferences describe how that space narrows once we’ve seen data. So, for example, decision trees (without any other tricks) prefer shorter trees, but for any tree of depth k, you could envision a dataset that would produce a tree of depth k + 1 from the same algorithm. It’s true that you could never get a DT to yield a tree where deeper nodes carry more information gain than shallower nodes, but that only becomes true after we’ve seen data.
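
To make that depth point concrete, here’s a toy sketch (the dataset construction is entirely my own assumption, just for illustration): the same unrestricted tree-growing algorithm goes one level deeper the moment the labels demand it.

```python
# Sketch: the same unrestricted algorithm yields deeper trees when the data demands it.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))

# Labels determined by one threshold -> one split suffices.
y_one_split = (X[:, 0] > 0.5).astype(int)
# Labels determined by a band (two thresholds) -> the same algorithm needs another level.
y_band = ((X[:, 0] > 0.3) & (X[:, 0] < 0.7)).astype(int)

print(DecisionTreeClassifier().fit(X, y_one_split).get_depth())  # 1
print(DecisionTreeClassifier().fit(X, y_band).get_depth())       # 2
```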

Pre-pruning, where you pre-register a max tree depth k, does impose a hard restriction, because you could never discover or produce a dataset that yields from that model a tree of depth k + 1. Post-pruning, where you (for example) select the subtree where a penalty term α weighs against an excessive number of leaves, doesn’t render any particular hypothesis unreachable – even if a given training set ends up chopping off a few leaves, you could probably change nothing but the targets to make those chops no longer worth the cost (let alone envision a different training set where the post-pruned result looks like the first example’s tree before any pruning was applied).
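
In scikit-learn terms (with a dataset and parameter values that are purely arbitrary), I picture the contrast roughly like this sketch: max_depth is the pre-registered ceiling; ccp_alpha is the post-pruning penalty.

```python
# Sketch: pre-pruning as a hard restriction vs post-pruning as a preference.
# Dataset and parameter values are arbitrary assumptions for illustration.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Pre-pruning: a hard ceiling -- no dataset can make this model produce a tree deeper than 3.
pre_pruned = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Post-pruning: cost-complexity pruning weighs extra leaves against a penalty alpha.
# Deeper trees remain reachable; the extra splits just have to earn their keep.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

print(pre_pruned.get_depth())   # <= 3, no matter the data
print(post_pruned.get_depth())  # depends on whether the splits pay for the penalty
```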

However, Mitchell says the ‘candidate elimination’ algorithm imposes a restriction bias by only considering functions that are consistent with the data. This, of course, suggests that our ‘data awareness’ heuristic is incorrect. We might be tempted to say that the candidate elimination algorithm assumes data consistency, so it cannot produce a hypothesis when the data contradicts itself – for example, if we have training data ((2, 3), (2, 1)), it would simply produce no solution, whereas a decision tree (depending on the flavor) might settle for a deciding factor of m = 2. However, that’s still introducing awareness of the data – before that point, there’s nothing that prevents the candidate elimination algorithm from yielding that same hypothesis.
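
To see the contradiction example play out, here’s a toy filter over a deliberately tiny, made-up hypothesis space (constant functions, purely for illustration, not Mitchell’s actual algorithm); the only point is that requiring consistency can leave the learner with nothing to output.

```python
# Toy sketch of a consistency-only learner over a made-up hypothesis space:
# constant functions y = c for integer c. Contradictory data empties the space.
def consistent_hypotheses(train, candidates=range(10)):
    return [c for c in candidates if all(y == c for _, y in train)]

print(consistent_hypotheses([(2, 3)]))          # [3] -- one hypothesis survives
print(consistent_hypotheses([(2, 3), (2, 1)]))  # []  -- contradictory data, no solution
```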

Isbell ultimately rules that “[m]aking it impossible to have a tree greater than depth k is a restriction bias. Making it less likely is a preference bias.” I think I’m pretty capable of applying this rule, but I’m not sure I’ve successfully absorbed its logic.