A definition of artificial intelligence can fall under four categories:
Thinking humanly | Thinking rationally |
---|---|
Acting humanly | Acting rationally |
The Turing Test is a well-known test for how well an AI performs. It suggests that if an AI is acting intelligently, it is thinking intelligently.
It also suggests that an AI requires knowledge, reasoning, learning (and language).
However, the general consensus is that the Turing Test is not great, as many counterexamples and exploits can be found for it.
One such "counterexample" is Searle's Chinese Room.
Thinking Humanly
There is a field of study called cognitive modelling, which is modelling human thinking in programs. (We will however ignore this)
Thinking rationally
Given a premise/some premises, an agent (the AI model) must derive or deduce the solution via logic. Rational deduction however has problems dealing with uncertainty.
Acting Rationally
Acting rationally can be summed up as "doing the right thing", which is to say, the action that maximises some achievement metric. However, thought or inference in these actions is not necessarily involved.
In actuality, the boundaries between these four categories are quite blurred.
Specialised AI vs AGI
Rational Agents
Specifically, we look at Rational Agents.
These agents can be modelled as a function \(f:P \longrightarrow A\), where P denotes percept histories and A denotes actions.
Agents are typically required to exhibit autonomy, and seek the best performance possible. (Perfect rationality may not always be the best as there is a trade off with computing resources)
Agent Input Model
Above shows a diagram of what inputs an agent takes, and what it does.
Rational Agents
The Goals of a rational agent can be specified by a performance measure, which is a numerical value.
A rational action is an action which aims to maximise the performance measure, given the percept history to date.
Note that an agent is not omniscient, clairvoyant (foreseeing changes), nor always successful. The expected value, not the actual value, is considered - an action to maximise expected value which fails is still rational.
Dimensions of Complexity
The design space / considerations of an AI can be defined by a set of "dimensions of complexity".
Dimension | Values |
---|---|
Modularity | Flat, modular, hierarchical |
Planning Horizon | Static/none, finite (stage), indefinite, infinite |
Representation | Flat state, features, relations |
Computational Limits | Perfect or bounded rationality |
Learning | Knowledge given or learned |
Sensing Uncertainty | Fully or partially observable |
Effect Uncertainty | Deterministic, stochastic |
Preference | Goals, complex preferences |
Number of Agents | Single, multiple |
Interaction | Offline, online |
Modularity
Modularity concerns the structure and abstraction.
Planning Horizon
Planning Horizon concerns how time is treated, and how far into the future the agent plans.
Representation
Representation concerns finding compact representations and exploiting them for computational gains. An AI can reason in terms of explicit states, in terms of features (variables/propositions), or in terms of individuals and relations.
Computational Limits
Computational Limits: whether the agent has perfect rationality (it can always determine the best action) or bounded rationality (it must decide within limits on computation time and memory).
Learning from Experience
Learning from Experience: whether the agent's knowledge is given up front, or must be learned from data and past experience.
Uncertainty
Uncertainty has two dimensions, sensing and effect, each described below.
Sensing Uncertainty
Sensing Uncertainty is about what the agent can sense about the state: the state can be fully observable or partially observable.
Effect Uncertainty
Effect Uncertainty concerns whether an agent could predict an outcome from a state and an action.
Preferences
Preferences concern what an agent tries to achieve: simple (achievement) goals, or more complex preferences over outcomes.
Number of Agents
This dimension concerns the number of reasoning agents: a single agent, or multiple agents whose reasoning must also be taken into account.
Interaction
Interaction concerns when the agent reasons about its actions: offline (before acting) or online (while acting).
Examples
Some examples of the dimensions of different agents
Dimension | Values |
---|---|
Modularity | Flat |
Planning Horizon | indefinite |
Representation | Flat state |
Computational Limits | Perfect rationality |
Learning | Knowledge given |
Sensing Uncertainty | Fully observable |
Effect Uncertainty | Deterministic |
Preference | Goals |
Number of Agents | Single |
Interaction | Offline |
Dimension | Values |
---|---|
Modularity | Flat |
Planning Horizon | indefinite, infinite |
Representation | features |
Computational Limits | Perfect rationality |
Learning | Knowledge learned |
Sensing Uncertainty | Fully observable |
Effect Uncertainty | Stochastic |
Preference | complex preferences |
Number of Agents | Single |
Interaction | online |
Dimension | Values |
---|---|
Modularity | hierarchical |
Planning Horizon | indefinite, infinite |
Representation | relations |
Computational Limits | bounded rationality |
Learning | Knowledge learned |
Sensing Uncertainty | partially observable |
Effect Uncertainty | stochastic |
Preference | complex preferences |
Number of Agents | multiple |
Interaction | online |
Complex interaction
Dimensions interact and combine in complex ways. Partial observability makes multi-agent and indefinite-stage reasoning more complex, and with modularity you can have fully or partially observable modules. Hierarchical reasoning, individuals and relations, and bounded rationality make reasoning simpler.
Representations
A task can be represented in a formal way, from which we can make formal deductions and interpret them to get our solution.
We want a representation complex enough to solve the problem, to be as close to the problem as possible, but also compact, natural, and maintainable.
We want it to be amenable to computation, thus it should:
(1) be able to express features which can be exploited for computational gain;
(2) be able to trade off between accuracy and time/space complexity.
And able to acquire data from people, datasets, and past experiences.
Defining a Solution
Given an informal description of a problem, what counts as a solution? Much is usually left unspecified, but the gaps cannot be filled arbitrarily.
Much work in AI concerns common-sense reasoning: the agent needs to make common-sense deductions about unstated assumptions.
Quality of Solutions
Solutions can be split into classes of quality
Decisions and Outcomes
Good decisions can have bad outcomes (and vice versa). Information can lead to better decisions; the worth of acquiring it is known as the value of information.
We can often tradeoff computation time and complexity: an anytime algorithm can provide a solution at any time, but more time can lead to better answer.
Besides finding the right answer, an agent is also concerned with acquiring appropriate information and computing in a timely manner.
We can often use a quality over time graph with a time discount to represent this.
Choosing a Representation
Problem \(\longrightarrow\) specification of the problem \(\longrightarrow\) appropriate computation.
We need to choose a representation; this can range from a high-level formal/natural-language specification down to programming languages, assembly, and so on.
Physical Symbol Hypothesis
We define a Symbol as a meaningful physical pattern that can be manipulated (like logic symbols).
A Symbol system creates, copies, modifies, and deletes symbols.
The Physical symbol hypothesis states that a physical symbol system is necessary and sufficient for general intelligent action.
Knowledge and Symbol Levels
Two levels of abstraction are common amongst systems: the knowledge level (what the agent knows, believes, and wants) and the symbol level (the representations and reasoning the agent uses internally).
Mapping problem to representation
We have a few questions to ask
Choosing Level of Abstraction
A high-level description is easier for humans to understand. A lower-level abstraction can be more accurate and more predictive, but is harder to reason about, specify, and understand; one may not even know all the details needed for a low-level description. Many cases lead to multiple levels of abstraction.
Reasoning and Acting
Reasoning is computation required to determine what action to do.
Problem Solving
Often we are only given the specification for a solution, and have to search for it. A typical situation is where we have an agent in one state, and have a set of deterministic actions with which it will go towards the goal state.
Often this can be abstracted into finding the goal state in a directed graph of linked states.
Example
Suppose you're in Coventry and want to get to London. How do you achieve this?
Using what transport do you get there? What routes do you go via? More importantly, which routes are reasonable?
Finding the actions that would get you from Coventry to London is problem solving.
A problem-solving agent is a goal-based agent that will determine sequences of actions leading to desirable states. It must in this process determine which states are reasonable to go through.
Problem solving steps
There are 4 steps: goal formulation, problem formulation, search, and execution.
For our example from earlier:
Pseudocode agent
We can make a mockup of an offline reasoning agent in pseudocode.
Naturally all the functions mentioned are left as an exercise to the reader (i.e. assumed implemented).
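In that spirit, a rough Python sketch of one decision step (the helpers `updateState`, `formulateGoal`, `formulateProblem`, and `search` are the assumed-implemented functions mentioned above, named to match these notes rather than Python convention):

```python
def problem_solving_agent(percept, state, plan):
    """One decision step of an offline problem-solving agent (rough sketch).
    `state` is the agent's world model, `plan` is the remaining action sequence."""
    state = updateState(state, percept)        # fold in the newest percept
    if not plan:                               # nothing left to execute: reason offline
        goal = formulateGoal(state)            # see below - uses current perceptions
        problem = formulateProblem(state, goal)
        plan = search(problem)                 # any of the search methods that follow
    action = plan.pop(0) if plan else None     # next action, or a no-op if search failed
    return action, state, plan
```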
formulateGoal
Internally, this uses the agent's current percepts.
Problem types
Problems of course can come in many types.
Vacuum Cleaner World
Let's use an example of a simple environment: a vacuum robot `V` in a world with two locations, `[ ][ ]`. Each location may also contain dirt (`[ d]`, or `[Vd]` for vacuum and dirt together). The vacuum can move left or right, and suck dirt.
Single state. Suppose we know we are in the state `[V ][ d]`. We then know our sequence of actions is just [right, suck].
Multiple State (sensorless / conformant). Suppose we don't know which state we start in. We could be in any of the states:
1. `[Vd][ d]`
2. `[ d][Vd]`
3. `[Vd][  ]`
4. `[ d][V ]`
5. `[V ][ d]`
6. `[  ][Vd]`
7. `[V ][  ]`
8. `[  ][V ]`
Contingency. Suppose we start in state `[V ][ d]`, but our vacuum has received a downgrade: if we suck a clean square, there is a chance that it may make that square less clean. So what do we do?
What we can do is sense whilst acting, thus: [right, if dirt then suck].
State space problems
A State space problem consists of a set of states; subsets of start states, goal states; a set of actions; an action function (f(state, action) -> new state); and a criterion that specifies the quality of an acceptable solution (depends on problem).
Since the real world is complex, state space may require abstraction for problem solving. States, operators, and solutions approximate the real world equivalent, though one must take care that their operations are still valid for the real world counterpart.
8-puzzle
A 3\(\times\)3 grid of tiles numbered 1 to 8, with a blank space for sliding. We want to get from some initial state to some goal state.
283 123
164 8_4
7_5 765
On the left is an initial state, and on the right is the goal state.
State Space Graphs
A general formulation for a problem is a state space graph. It consists of: nodes representing states, arcs representing actions (transitions between states), a start node, and a set of goal nodes.
Tree searching algorithms
A tree search is a basic approach to problem solving. It is an offline, simulated exploration of the state space. Starting with a start state, we expand an explored state by generating / exploring its successor branches, to build a search tree.
We have a single start state, and repeatedly expand until we find a solution, or run out of valid candidates.
Implementation details
A state represents a physical configuration of a puzzle, and a node will be the data structure for one state of the tree.
A node stores the state, parent, children, depth, path cost - the last of these is usually denoted \(g\).
Nodes that have been generated but not yet expanded are called the frontier, which is generally represented as a queue.
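As a rough illustration (my own naming, not the module's exact pseudocode), a node structure and a generic tree search where the choice of frontier policy decides the strategy:

```python
from collections import deque

class Node:
    """One node of the search tree: a state plus book-keeping."""
    def __init__(self, state, parent=None, action=None, path_cost=0.0):
        self.state = state
        self.parent = parent
        self.action = action
        self.path_cost = path_cost                    # usually denoted g
        self.depth = 0 if parent is None else parent.depth + 1

def tree_search(start, is_goal, successors, use_stack=False):
    """Generic (uninformed, cycle-unchecked) tree search.
    `successors(state)` yields (action, next_state, step_cost) triples.
    FIFO frontier = breadth-first, LIFO frontier = depth-first."""
    frontier = deque([Node(start)])
    while frontier:
        node = frontier.pop() if use_stack else frontier.popleft()
        if is_goal(node.state):
            return node
        for action, nxt, cost in successors(node.state):
            frontier.append(Node(nxt, node, action, node.path_cost + cost))
    return None   # ran out of candidates
```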
Search strategies
The strategy of searching is defined by the order in which we expand nodes. These strategies are evaluated over several criteria: completeness (is a solution always found if one exists?), time complexity, space complexity, and optimality (is the best solution found?).
Simple, uninformed strategies
Uninformed search only uses information given in the problem definition - we have no measure of which nodes would be better to expand to, nor do we really check the nodes once we've explored them (i.e. not checking cycles).
Two of these algorithms are Breadth-first search BFS and Depth-first search DFS.
Breadth first
BFS expands the shallowest node: put recently expanded successors at the end of queue: FIFO.
The main problem for BFS is the space usage, which can balloon very quickly (with \(b = 10\) and 1 KB per node, at depth 8 one needs on the order of 100 GB of memory).
Depth first
DFS expands the deepest node: put most recently expanded successors at the start of the "queue" (stack): LIFO.
Notes
This section is uninformed search, and graph searches are similarly uninformed, so much so that the algorithms presented do not even check for cycles, despite common intuition.
The main difference is that in graph search we look at paths, rather than nodes. Expanding into a node would be adding it to the stored path.
Graphs
With trees, if we have looped states these will just branch out to infinity. Graphs are a good way to take loops into account and are a practical representation of such spaces.
Similar to a tree, we incrementally explore, maintaining a frontier of explored paths. The frontier should expand into unexplored nodes until a goal node is encountered.
The way nodes are selected is the search strategy, the neighbours of (all) nodes define a graph, and the goal defines the solution.
We can continue searching from the return call even if a solution has been found, if we want more or all possible solutions.
BFS on Graphs
All paths on the frontier \([p_1 .. p_r]\) are expanded one by one, and the neighbours of an explored path are added to the end of the frontier.
The time complexity is instead written \(O(b^n)\), where \(n\) is the path length (instead of tree depth).
DFS on Graphs
DFS treats the frontier as a stack instead. The time complexity is \(O(b^k)\) where \(k\) is the maximum number of arcs a path can have (i.e. max depth). Space complexity is linear in the maximum path length: \(O(kb)\).
Now, DFS is said to fail on both infinitely long graphs and graphs with cycles, although I'm not sure why you're adding already explored nodes to the frontier (cycle case).
Lowest cost first
Sometimes, there are costs, or weights, associated with edges. The cost of a path is thus the sum of the cost of edges, and an optimal solution has the path with the smallest total cost.
Lowest cost first is a fairly naive algorithm that simply selects the path on the frontier with the lowest cost. I.e., frontier paths are stored in a priority queue ordered by path cost.
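A small sketch using Python's `heapq` as the priority queue (names are mine; `successors(state)` is assumed to yield `(next_state, step_cost)` pairs):

```python
import heapq

def lowest_cost_first(start, is_goal, successors):
    """Expand the cheapest frontier path first.
    Frontier entries are (path_cost, counter, state, path)."""
    counter = 0                                   # tie-breaker so heapq never compares states
    frontier = [(0.0, counter, start, [start])]
    while frontier:
        cost, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path, cost
        for nxt, step in successors(state):
            counter += 1
            heapq.heappush(frontier, (cost + step, counter, nxt, path + [nxt]))
    return None
```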
Lowest cost first is complete, and finds an optimal solution provided arc costs are bounded below by some positive constant.
Supposedly none of these algorithms will halt on a finite graph with cycles, but one has to be a particular type of person to program a graph search without cycle checking / not adding explored nodes to the unexplored frontier.
Overview
Uninformed search is inefficient, especially when we have extra information we can exploit. By using problem-specific knowledge, we can improve our searching - this is informed search.
Still offline.
Overview
If you know where the goal is, when selecting paths do not ignore it.
Employ the use of an extra heuristic measurement \(h(n)\) - an estimate of the remaining cost to the goal.
This heuristic should be efficient to compute, and always be an underestimate (no path to goal strictly less than \(h(n)\)). If a heuristic meets these criteria, then it is admissible.
Examples
If one is on a euclidean plane finding the shortest path, \(h(n)\) can be the straight-line distance to the goal. If the cost is time, \(h(n)\) can be straight-line distance divided by max speed.
We can think of this as reducing the constraints in the problem to an easier representation that we can quickly calculate an answer for. The straight-line distance does not count obstacles, for example.
Best First Search
When we choose nodes to expand in a frontier, the heuristic can determine the order / rank of the frontier structure.
Heuristic DFS is where we select a neighbour such that the best neighbour is selected first.
If by best we mean closest (by some metric) estimated distance to the goal, then we have Greedy Best-First Search (here abbreviated BFS).
The frontier is a priority queue ordered by \(h(n)\).
Problems
Some graphs are very much not good for BFS. Especially if we do not track cycles.
Analysis
Best-first search has an exponential space complexity of \(O(b^m)\), where \(b\) is the branching factor and \(m\) is the maximum depth of the search space. Time complexity is also \(O(b^m)\).
It is not guaranteed to find the solution, even if one exists, and may not necessarily find the shortest path (but will go in its general direction).
A* Search
A* considers both path cost and heuristic values. We denote the cost of a path \(p\) by \(g(p)\) or \(cost(p)\), and the estimate of remaining path to goal is \(h(p)\).
Let \(f(p) = cost(p) + h(p)\), at some node on the path \(p\). \(f(p)\) is thus the overall estimate cost of \(p\). A* searches using this estimate value.
Thus A* can be considered as a mix of lowest-cost and best-first, and its frontier is a priority queue ordered by the metric \(f\).
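A sketch of A* along these lines, reusing the conventions from the lowest-cost-first sketch (`h(state)` is the heuristic, `successors(state)` yields `(next_state, step_cost)` pairs):

```python
import heapq

def a_star(start, is_goal, successors, h):
    """Frontier is a priority queue ordered by f = cost-so-far + heuristic estimate."""
    counter = 0
    frontier = [(h(start), counter, 0.0, start, [start])]   # (f, tie, g, state, path)
    while frontier:
        f, _, g, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path, g
        for nxt, step in successors(state):
            counter += 1
            g2 = g + step
            heapq.heappush(frontier, (g2 + h(nxt), counter, g2, nxt, path + [nxt]))
    return None
```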
Admissibility
A* is admissible (guaranteed to find an optimal solution) if: the branching factor is finite, arc costs are bounded above zero, and \(h(n)\) is an underestimate of the cost from \(n\) to the goal.
Justification. A path \(p\) to the goal is selected by A*. Suppose another path \(p'\) ends on the frontier. Since \(p\) is chosen before \(p'\), and \(h(p) = 0\) (as \(p\) ends at a goal), by definition \[cost(p) \leq cost(p') + h(p')\] And since \(h\) is an underestimate \(cost(p') + h(p') \leq cost(p'_C)\), where \(p'_C\) is a complete path extending \(p'\) and reaching the goal. By definition of the algorithm, we must have for any other unpicked path \(p'_C\) \[cost(p) \leq cost(p'_C).\]
A* can always find a solution if there is one.
The frontier always contains part of a path to the goal, and A* will halt as the costs of the paths increase beyond any finite number.
However, this doesn't mean A* won't pick suboptimal nodes along the way, just that its final result is always the optimal path (even if we have cycles in our graph).
Heuristics are not equal
Some heuristics are better than others. The closer a heuristic is to the true value (without overestimating), the more efficient A* is.
For some optimal cost \(c\), A* will explore all paths \(p:cost(p) + h(p) < c\). A* will explore some paths \(q:cost(q) + h(q) = c\) (until it finds the solution), and will not explore overestimates.
Thus the closer the underestimate, the fewer paths there are where \(cost(p) + h(p) < c\). Additionally, if the number of paths where \(cost(p) + h(p) = c\) is large, then there may be much variability in the space and time of A*.
Complexity
A* is exponential time in \(h\)-error \(\times\) path length to solution. A* is also exponential space, since it stores all nodes in memory.
Cycle Checking
Before, we have not dealt with the cases where a search algorithm hits an already visited node (i.e. go in a cycle). If this is so, we can "prune" a path that leads into a cycle, removing all unnecessary nodes from consideration.
In depth-first methods, cycle checking can be done in constant time, e.g. by keeping a hash set of the nodes on the current path. In most other algorithms the check is linear in the path length.
Multiple Path Pruning
Similarly, this pruning can apply when we have two paths to the same node. We can prune the path to said node and keep just the existing path.
This can be implemented via a set of explored nodes at the end of explored paths, the closed list.
When a path \(\langle n_0 \cdots n_k \rangle\) is selected, if \(n_k \in\) closed list, then discard path, otherwise add \(n_k\) to list and proceed on.
On optimal solutions
How do we make sure we do not prune a path which is better than the already found one?
The most straightforward answer is to make sure that you always select the best path first, though this may be hard. We could remove all paths which use the longer route, but those paths may extend further (past the node which the better route reaches), and that's a lot of work wasted.
Otherwise we can simply replace the longer path: given a path \(p = \langle s \cdots n \cdots m \rangle\) if \(p' = \langle s \cdots n \rangle\) is found which is shorter than the corresponding part in \(p\), replace that specific segment.
Path Pruning and A*
A* selects a path \(p'\) to node \(n'\). Suppose there's a shorter path to \(n'\), via a path \(p\), which currently ends at the frontier node \(n\) (so \(len(p') > len(p) + len(\langle n \cdots n' \rangle )\)).
Since \(p'\) was selected before \(p\), \(f(p') \leq f(p)\), thus \(cost(p') + h(p') \leq cost(p) + h(p)\). Let \(cost(n, n')\) be the real length of a path \(\langle n \cdots n' \rangle\). Since \(p\) is lower cost than \(p'\), \(cost(p) + cost(n, n') < cost(p')\). \begin{align} \implies cost(p') - cost(p) &\leq h(p) - h(p') \\ \textrm{and } cost(n, n') & < cost(p') - cost(p) \\ h(p) - h(p') &= h(n) - h(n') \\ \implies cost(n, n') & < h(n) - h(n') \end{align} If we make sure \(|h(n) - h(n')| \leq cost(n, n')\) then this situation will never happen (always select the best path first option).
Monotone Restriction
A heuristic \(h\) satisfies monotone restriction if \(|h(m) - h(n)| \leq cost(m, n) \; \forall \textrm{arcs } \langle m, n \rangle\). It is then termed consistent.
A* with a consistent heuristic and path pruning always finds the shortest path.
Direction of Search
In graph/tree searches, the direction we search doesn't really matter. We can go from goal to start nodes.
Whilst the forward branching factor is the number of arcs out of a node, the backwards branching factor is the number of arcs going into a node.
If the forward factor is larger than backwards factor, our \(O(b^n)\) algorithm would do much better searching backwards.
Of course, a backwards graph isn't always available.
Bidirectional Search
We have an \(O(b^k)\) (\(k\) depth of goal) algorithm. We can get it to be less horrible by searching from both sides simultaneously. \(2b^{\frac{k}{2}} < b^k\) after all.
The main problem with this is making sure both sides meet.
Oftentimes, BFS is used from the goal for a full set of locations that leads to goal, whilst another method is used from the start to link up the path.
Island Hopping
An island driven search is like an extension of the bidirectional search - we have a series of "islands": useful nodes that can link a start and a finish, \[s \longrightarrow i_1 \longrightarrow i_2 \longrightarrow \cdots \longrightarrow i_m \longrightarrow t\] Now, we have \(m\) subproblems to solve, which can improve efficiency to \(mb^{\frac{k}{m}} \).
The difficulty is then in identifying the correct islands, and it's hard to guarantee optimality with such an approach.
Dynamic Programming
Dynamic programming is a fancy way of saying have an array of values you update and retrieve from as you go along. This is explained more in depth in CS260.
We build a table \(D\), where \(D_n\) is the distance of the shortest path from \(n\) to the goal. Then, apply the following recurrence: \[D_n = \begin{cases} 0 &\textrm{if isGoal}(n) \\ \min_{\langle n, m \rangle} \left(cost(\langle n, m \rangle) + D_m\right) &\textrm{otherwise}\end{cases}\]
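A small sketch of filling such a table by repeatedly relaxing arcs backwards from the goals (a Bellman-Ford-style sweep; the graph encoding and names are my own assumptions, and arc costs are assumed non-negative):

```python
import math

def distances_to_goal(nodes, arcs, goals):
    """D[n] = cost of the cheapest path from n to any goal.
    `arcs[n]` is a list of (m, cost) pairs for arcs <n, m>."""
    D = {n: (0.0 if n in goals else math.inf) for n in nodes}
    changed = True
    while changed:                            # relax until no distance improves
        changed = False
        for n in nodes:
            for m, cost in arcs.get(n, []):
                if cost + D[m] < D[n]:
                    D[n] = cost + D[m]
                    changed = True
    return D
```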
Bounded DFS
A bounded DFS does normal DFS, but has a cost/depth bound at which it will stop exploring. Only part of the search graph is explored.
We do this because DFS uses linear space, so a bounded DFS is appropriate when memory is tight.
Iterative Deepening
If we run bounded DFS over an ever-increasing bound, we have iterative deepening. Starting with a bound of 0, do DFS, and either return the solution or increase the bound.
This finds the first solution BFS does, but uses linear space. The asymptotic overhead is \(\frac{b}{b-1}\) times the cost of expanding the nodes at depth \(k\) with BFS (here \(b\) is the branching factor), i.e. iterative deepening does about \(\frac{b}{b-1}\) times the work of a single search to that depth. As \(b\) increases, the overhead decreases.
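A minimal sketch of the idea (names mine; `successors(state)` yields next states, and `max_bound` is just a cap I add so the loop terminates even if no solution exists):

```python
def depth_bounded_dfs(state, is_goal, successors, bound, path=None):
    """DFS that stops expanding once the depth bound is used up."""
    path = path or [state]
    if is_goal(state):
        return path
    if bound == 0:
        return None
    for nxt in successors(state):
        result = depth_bounded_dfs(nxt, is_goal, successors, bound - 1, path + [nxt])
        if result is not None:
            return result
    return None

def iterative_deepening(start, is_goal, successors, max_bound=50):
    """Repeat bounded DFS with an increasing bound (0, 1, 2, ...)."""
    for bound in range(max_bound + 1):
        result = depth_bounded_dfs(start, is_goal, successors, bound)
        if result is not None:
            return result
    return None
```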
Depth-First Branch & Bound
Depth-first branch and bound combines DFS with heuristic information, and so will find an optimal solution. It is most useful when there are multiple solutions, and it keeps the space complexity of DFS.
We want to find a single optimal solution. Suppose \(b\) is the cost of the cheapest path to the goal found so far.
If we then encounter a path \(p\) with \(cost(p) + h(p) \geq b\), then \(p\) can be pruned. If we instead reach the goal via a path \(q\) that is better, then \(q\)'s cost becomes the new bound \(b\) and \(q\) is remembered as the best solution so far.
This is linear space (since DFS) and we can guarantee an optimal solution when it completes.
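A rough sketch of that process (names mine; `successors(state)` yields `(next_state, step_cost)` pairs and `h` is an admissible heuristic):

```python
import math

def dfbnb(start, is_goal, successors, h, bound=math.inf):
    """Depth-first branch and bound: returns (best_path, best_cost)."""
    best = (None, bound)

    def dfs(state, path, cost):
        nonlocal best
        if cost + h(state) >= best[1]:        # cannot beat the current bound: prune
            return
        if is_goal(state):
            best = (path, cost)               # better solution found: tighten the bound
            return
        for nxt, step in successors(state):
            dfs(nxt, path + [nxt], cost + step)

    dfs(start, [start], 0.0)
    return best
```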
Initialising Bound
We can initialise the bound \(b\) to \(\infty\). We can also set it to an estimate of the minimum path cost.
After DFS terminates, either (1) a solution was found, (2) no solution was found and no path was pruned, in which case no solution exists, or (3) no solution was found but some path was pruned, in which case the bound must be increased and the search repeated.
Can be combined with iterative deepening to increase the bound until a solution is found.
Cycle pruning works well, but multiple path pruning does not, since storing extra paths defeats the point of saving space with DFS.
The 8-puzzle example
The branching factor is roughly 3, and a typical solution is 20 steps. Thus, an exhaustive search visits around \(3^{20}\) states, while there are only \(9!\) distinct board configurations. It would be nice to have a heuristic to search with.
We could use \(h_1\): the number of misplaced tiles, or \(h_2\): the total Manhattan (city-block) distance of each tile from its goal position. Both are admissible; sketches of both are shown below.
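A quick sketch of both heuristics, representing a board as a tuple of 9 entries in row-major order with 0 for the blank (the goal layout below is the one drawn in the 8-puzzle example above):

```python
GOAL = (1, 2, 3, 8, 0, 4, 7, 6, 5)   # the goal state drawn earlier, 0 = blank

def h1_misplaced(state, goal=GOAL):
    """Number of tiles not in their goal position (the blank is not counted)."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def h2_manhattan(state, goal=GOAL):
    """Total Manhattan (city-block) distance of each tile from its goal square."""
    total = 0
    for idx, tile in enumerate(state):
        if tile == 0:
            continue
        gidx = goal.index(tile)
        total += abs(idx // 3 - gidx // 3) + abs(idx % 3 - gidx % 3)
    return total
```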
Characterising Heuristics
Definition. If A* search expands \(N\) nodes and solution is depth \(d\), then the effective branching factor \(b* \) is the factor a uniform tree of depth \(d\) must have to have \(N\) nodes. \[N = 1 + b* + (b*)^2 + \dots + (b*)^d\] Ex. depth 5 and 52 nodes: \(b* = 1.91\).
As \(b* \longrightarrow 1\), the larger the problem that can be solved (efficiently).
Whilst solving for \(b* \) may be difficult, we can estimate it experimentally, over small instances.
On the prior example, an experiment determined that \(h_2\) is better than \(h_1\) (lower EBF).
Definition. If some \(h_2(n) \geq h_1(n) \; \forall n\) then \(h_2\) dominates \(h_1\).
Note that whilst an (admissible) heuristic which is closer to the true cost is better, we must make sure the heuristic still remains cheap to calculate.
Deriving Heuristics - Relaxed problems
Admissible heuristics can be the exact solutions of problems with certain conditions relaxed.
If the problem is defined in formal language this is easier.
We may generate several different relaxed problems, and thus may have several admissible heuristics. If one dominates, use that. Else, if you have \(h_1, h_2, \dots, h_n\) admissible heuristics which do not dominate each other, just use \(h = \max(h_1, h_2, \dots, h_n)\).
Because all options are admissible, the aggregate \(h\) is admissible - and dominates each constituent individually.
Example
If we describe the 8-puzzle as "A tile can move from square A to B if A is adjacent to B, and B is blank", then we can generate some relaxed problems: a tile can move from A to B if A is adjacent to B (giving the Manhattan-distance heuristic \(h_2\)); a tile can move from A to B if B is blank; or a tile can move from A to B unconditionally (giving the misplaced-tiles heuristic \(h_1\)).
Subproblems
We can derive a heuristic from a subproblem.
For the 8-puzzle, we can solve the problem for only tiles 1-4 (ignoring tiles 5-8), and use the cost of that solution as a heuristic. Of course, the subproblem should be pretty easy to solve.
Pattern Databases
We can then store the possible solution costs for each instance of a subproblem in a data structure for easy lookup. Then the heuristic is the looked-up cost of the corresponding subproblem.
Different pattern databases can also be combined to get a maximum of them all.
Disjoint Pattern Databases
If we have multiple independent subproblems, we can store them in different databases and add them together.
These databases form disjoint pattern databases.
Other approaches
We can run searches over smaller training problems, and use the statistical results as a heuristic.
Say we have a base heuristic \(\eta\). If in 95% of our training problems the real cost is 18 when \(\eta = 14\), we can just use \(h = 18\) as our statistical heuristic whenever \(\eta = 14\).
Now this is not admissible, but can often be quite efficient (and good enough).
Variables
Constraint Satisfaction Problems (CSPs) are described with regards to variables and configurations of variables (worlds).
Algebraic Variables are symbols used to describe features. Often denoted with capital letters, like \(X\), which have a corresponding domain of possible values, \(D_X\) or \(dom(X)\).
Symbols internally are just labels - they only have meaning to the user. The meaning associated to a variable should follow the clarity principle, that an omniscient agent can determine values knowing meaning of the variables.
Discrete Variables have a finite or countably infinite domain; a special type of discrete variable with domain size 2 is a binary variable, most often seen as a boolean variable.
Continuous Variables have uncountably infinite domain, e.g. \(\mathbb{R}\)
Assignments and Worlds
Given a set of variables, an assignment is a function \(: \textrm{vars} \rightarrow \textrm{dom(vars)}\). Variables \(\{X_1, X_2, \dots, X_n\} \) would have an assignment \(\{X_1 = v_1, X_2 = v_2, \dots, X_n = v_n\} : v_i \in D_{X_i} \).
Assignments can be partial; a total assignment assigns a value to every variable.
A possible world is a total assignment. If \(X_i\) has value \(v_i\) in the assignment of a world \(w\), then we say "variable \(X_i\) has value \(v_i\) in world \(w\)".
The number of worlds is product of all domains: \(|W| = \prod^n_{i=1} |D_{X_i}|\)
The advantage of using variables over pure discrete states is that a few variables can describe many states.
Constraints
A (hard) constraint specifies legal combinations of some variables.
A scope is a set of (relevant) variables; a constraint \(C\) is just a relation on that scope, \(C : S \longrightarrow \{T, F\}\), which involves each variable in the scope.
An assignment \(A\) on the full set of variables \(S'\) (i.e. the superset of the scope / \(S \subseteq S'\)) satisfies a constraint on \(S\) if everything maps to true. Else, the constraint is violated.
Constraints can be defined by intension and extension. Intension is "define by rule", where formulae are used. Extension is "define by enumeration", where the valid combinations are listed out.
A unary constraint is on 1 variable, a binary on two, etc.
CSPs
A CSP consists of a set of variables, a domain per variable, and a set of constraints. The goal is to make true (satisfy) all constraints.
A finite CSP has a finite set of variables, with finite domains. We will usually be dealing with these. Note that some algorithms which work for one type of CSP may not work for others.
Example. A delivery robot must carry out activities \(a,b,c,d,e\). Those activities happen at any time 1, 2, 3, 4. Let \(A\) be variable of activity \(a\). The domains of all variables are thus \(\{1,2,3,4\}\). The following must be satisfied: \[\{ B \neq 3, C \neq 2, A \neq B, B \neq C, C < D, A = D, E < A, E < B, E < C, E < D, B \neq D\} \]
When we have a CSP, a number of tasks are of interest: determining whether a solution exists, finding one solution, counting or enumerating all solutions, and finding a best solution according to some measure.
Generate and Test
Obviously the simplest way to solve a CSP is via a generate-and-test (G&T) algorithm - to solve via exhaustion.
Let \(D\) be the "assignment space" (the set of total assignments). G&T generates each total assignment in turn (or at random), tests it against all the constraints, and returns any assignment that satisfies them.
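A minimal sketch of generate-and-test (my own encoding: domains as a dict of iterables, constraints as predicates over the assignment dict):

```python
from itertools import product

def generate_and_test(domains, constraints):
    """Exhaustively try every total assignment; return the first satisfying one."""
    variables = list(domains)
    for values in product(*(domains[v] for v in variables)):
        assignment = dict(zip(variables, values))
        if all(c(assignment) for c in constraints):
            return assignment
    return None

# e.g. for part of the delivery-robot example above:
# generate_and_test({v: {1, 2, 3, 4} for v in "ABCDE"},
#                   [lambda a: a["B"] != 3, lambda a: a["C"] != 2,
#                    lambda a: a["A"] != a["B"], lambda a: a["C"] < a["D"]])
```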
Complexity
G&T is an \(O(\textrm{horrible})\) algorithm.
For \(n\) variables each with domain size \(d\), and \(e\) constraints, the total number of constraints tested is \(O(ed^n)\). It's fine for small problems but can quickly spiral out of control.
G&T assigns values to all variables regardless, but some constraints are unary, so some values can be ruled out even before generating.
Backtracking Search
Backtracking is about tree-searching a tree constructed from incremental assignments of variables in a CSP.
We define the search space as follows: nodes are assignments of values to some of the variables, the neighbours of a node assign a value to one additional variable, the start node is the empty assignment, and goal nodes are total assignments satisfying all the constraints.
Example. Let us have a CSP with variables \(A, B, C\) each with domain \(\{1,2,3,4\} \), with constraints \((A < B), (B < C)\). A search tree can be drawn: each node represents the assignments made along the path to it from the root. Anything with a cross means that assignment violates a constraint, and that branch is pruned. Anything with a tick means it is a valid total assignment.
Since all goals are at depth \(n\) (\(n\) vars), we use DFS to save space. The branching factor \(b = (n-l)d\) for a depth \(l\) and domain size \(d\).
Implementation
We can implement backtracking as the following:
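The module's exact pseudocode isn't reproduced here; a sketch in the same spirit (constraints are `(scope, predicate)` pairs, and a constraint is only checked once its whole scope is assigned):

```python
def backtrack(domains, constraints, assignment=None):
    """Assign variables one at a time, abandoning a branch as soon as a
    fully-assigned constraint is violated.  Returns one solution or None."""
    assignment = assignment or {}
    if len(assignment) == len(domains):
        return assignment                              # total, consistent assignment
    var = next(v for v in domains if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        ok = all(pred(assignment)
                 for scope, pred in constraints
                 if all(v in assignment for v in scope))
        if ok:
            result = backtrack(domains, constraints, assignment)
            if result is not None:
                return result
        del assignment[var]                             # undo and try the next value
    return None
```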
Selection heuristic
How do we select a variable? How do we assign a value? We use heuristics (which can be combined) to determine which to select.
Introduction
Though DFS is an improvement on G&T, it still isn't that good.
In example 2, the DFS searched through all possible assignments from a domain, even if they were clearly contradictory with one or more other variables.
We could interleave an algorithm that removes all impossible values given the current assigned domains and the constraints.
Consistency Algorithms
Consistency algorithms operate on a network of constraints:
Each node is a variable, which is annotated with its domain. Each arc between nodes is a constraint over those variables, and annotated with the rule of said constraint.
We say that for a variable \(X\), \(D_X\) will now be the set of possible values (which initially is its domain). For every constraint \(c,\; \forall X\) in scope of \( c ,\; \exists \textrm{ arc } \langle X, c \rangle\).
Another way is to represent variables as round nodes and constraints as rectangular ones, with an arc between a variable and a constraint meaning that variable is in scope of that constraint.
Arc Consistency
In the simplest case, the constraint is unary. The variable is then domain consistent if all values satisfy the constraint.
Then, let constraint \(c\) have the scope \(\{X, Y_1, \dots, Y_k\}\). Arc \(\langle X, c \rangle\) is arc consistent if \(\forall x \in D_X\), \(\exists y_1 \dots y_k : y_i \in D_{Y_i}\) such that \(c(X=x, Y_1=y_1, \dots, Y_k = y_k)\) is satisfied. I.e. every value for \(X\) has some compatible assignment to all the other variables in the constraint.
A network is arc consistent if all arcs are consistent.
If an arc is not consistent, then \(\exists x \in D_X\) which is incompatible with some other variables in the constraint. We can delete values to make an arc consistent.
Deleting a value from one \(D_X\), however, may make other constraint arcs involving \(X\) inconsistent again, so these need to be rechecked. We compile all of this into an algorithm like so:
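A sketch of the idea for binary constraints only (AC-3 flavoured; my own encoding: each constraint is an `(X, Y, pred)` triple meaning `pred(x, y)` must hold):

```python
def arc_consistency(domains, constraints):
    """Prune domains until every arc <X, c> is consistent.
    Returns the reduced domains (an empty domain means no solution)."""
    domains = {v: set(vals) for v, vals in domains.items()}
    # one arc per variable in each binary constraint's scope
    arcs = [(x, y, pred) for x, y, pred in constraints]
    arcs += [(y, x, lambda b, a, p=pred: p(a, b)) for x, y, pred in constraints]
    queue = list(arcs)
    while queue:
        x, y, pred = queue.pop()
        # keep only values of x that have some compatible value of y
        supported = {vx for vx in domains[x] if any(pred(vx, vy) for vy in domains[y])}
        if supported != domains[x]:
            domains[x] = supported
            # x's domain shrank: arcs whose "other" variable is x may now be inconsistent
            queue += [(z, w, p) for z, w, p in arcs if w == x]
    return domains
```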
Algorithm Description
Regardless of arc selection order, the algorithm will terminate with the same arc-consistent graph and set of domains. Three cases can happen: some domain becomes empty (there is no solution), every domain ends up with exactly one value (this is the solution), or some domains are left with several values (the problem must be split or searched further).
Note Well: Arc consistent algorithms can still have no solutions. Example: Take a CSP with variables \(A, B, C\) all with domain \(\{1,2,3\} \), and constraints \(A = B, B = C, C \neq A\). The graph is arc consistent - but as we can clearly see there are no possible solutions.
Complexity
Assume we have only binary constraints.
Suppose there are \(c\) many constraints, and the domain of each variable is size \(d\). There are thus \(2c\) arcs. Checking the arc \(\langle X, c = r(X, Y) \rangle\)* is at the worst case iterating through each domain value in \(Y\) for each in \(X\), thus making \(O(d^2)\).
*As mentioned previously \(c\) is a relation, \(c = r(X,Y)\) is just denoting that it's over the variables X and Y.
Each arc may need to be rechecked once for each element removed from \(dom(Y)\), so there are \(O(cd)\) arc checks in total.
Thus the time complexity of GAC on binary constraints is \(O(cd^3)\), which is linear in \(c\).
The space complexity of GAC is \(O(nd)\) for \(n\) variables.
Extensions
The domain need not be finite, and may be specified through intensions.
If constraints are defined extensionally, when we prune a value from \(X\), we can prune that value from all other constraints involving \(X\).
Higher order techniques like path consistency can consider \(k\)-tuples of vars at once - this may deal with the "arc consistent but no solution" problem but is often less efficient.
Splitting domains
Another method of simplifying a network is to split a problem into disjoint cases, then unioning together the solutions.
We can split a problem into several disjoint subproblems through splitting the domain of a variable, having multiple "worlds" in a sense where the variable is assigned different things.
Suppose \(X : D_X = \{T, F\} \). We can find solutions for where \(X = T\) and \(X = F\) separately.
The idea is that this reduces the problem into smaller subproblems which are easier and faster to solve. Also, if we just want any solution, we can look in one subproblem, and only if no solution we move onto the other.
We have multiple strategies of splitting if the domain is larger than 2. For \(X:D_X = \{1,2,3,4\} \), we could either split into 4 "worlds" of \(X=1,2,3,4\) respectively, or 2 "worlds" where \(X=\{1,2\} \) and \(X = \{3,4\} \), for example. The former makes more progress per split, whilst the latter can do more pruning with fewer steps.
Note that recursively splitting domains into units (and recognising impossible configurations) is literally how the backtrack search works.
Interleaving
It is more efficient to interleave arc consistency with searching/splitting. We can alternate between arc consistency, and domain splitting.
Implementation
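A rough sketch of the interleaving, reusing the `arc_consistency` sketch above (structure and names are my own, not the module's pseudocode):

```python
def solve_by_splitting(domains, constraints):
    """Alternate arc consistency with domain splitting; return one solution or None."""
    domains = arc_consistency(domains, constraints)
    if any(len(d) == 0 for d in domains.values()):
        return None                                    # some domain emptied: dead end
    if all(len(d) == 1 for d in domains.values()):
        return {v: next(iter(d)) for v, d in domains.items()}
    # pick a variable with several values left and split its domain in half
    var = next(v for v, d in domains.items() if len(d) > 1)
    values = sorted(domains[var])
    for half in (values[: len(values) // 2], values[len(values) // 2:]):
        attempt = dict(domains)
        attempt[var] = set(half)
        solution = solve_by_splitting(attempt, constraints)
        if solution is not None:
            return solution
    return None
```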
Extensions
It is possible to find all solutions, by altering such that \(\varnothing\) is returned when no solution found, and a set union of all solutions otherwise.
If assigning a variable disconnects the graph (that variable connects two separate components), then the set of solutions is just the cross product of the solutions to the two components - this shortcut saves computation, particularly when counting solutions.
Problem Structure
We can break a CSP of \(n\) variables up into separate subproblems of \(c\) variables each.
Each solving step takes \(O(d^c)\) (domain size \(d\)), so in total we have \(O(\frac{n}{c}d^c)\), which, through a bit of trickery, is actually linear in \(n\).
But this is genuinely much better: take \(n=80, d=2, c=20\). The whole problem takes \(O(d^n) = 2^{80}\) which is 4 billion years at 10 million processes a second. Splitting it using \(c=20\) we get \(4 \times 2^{20}\) which is 0.4 seconds at same speed.
Tree structures and conditioning
A CSP graph without loops (a tree) can be solved in \(O(nd^2)\), with the following algorithm: pick a root and order the variables so that each variable's parent precedes it; working from the leaves towards the root, make each arc from a parent to its child consistent; then assign values from the root downwards, which never requires backtracking.
However, most of the time we don't have a tree, but removing one or two variables would make it a tree.
Conditioning is the process of instantiating (assigning) a variable and pruning the domains of its neighbours accordingly, so that the rest of the graph resolves down into a tree.
Cutset Conditioning is conditioning but done to a set of variables (a cutset).
Variable Elimination
More generally we can look at simplifying a CSP by "removing" variables from consideration.
When we remove a variable X, we need to ensure that we preserve the effect of X in all remaining constraints, so we can get back from reduced form to the original problem.
The idea is to collect all constraints involving X, (natural) join all the constraints to make \(c_x(X, \bar{Y})\) where \(\bar{Y} \) is all other variables neighbouring X. Then, project onto (select columns of) \(\bar{Y} \) only, and this one big relation would then replace all relations including X.
Natural join is denoted \(\bowtie\); see CS258 / SQL's natural join for how it works.
Say we have a table \(r\) with attributes \(A, B, C\); the projection onto \(B, C\) is denoted \(\pi_{\{B, C\}}r\), with \(\pi\) standing for "project" - it's akin to SQL's `SELECT DISTINCT B, C`.
Example. Let us have a CSP with variables \(A, B, C\) each with domains \(\{1,2,3,4\}\). We want to remove B. The constraints involving B are \(A < B\) and \(B < C\).
\[ \begin{bmatrix} A & B \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 2 & 3 \\ 2 & 4 \\ 3 & 4 \end{bmatrix} \bowtie \begin{bmatrix} B & C \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 2 & 3 \\ 2 & 4 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} A & B & C \\ 1 & 2 & 3 \\ 1 & 2 & 4 \\ 1 & 3 & 4 \\ 2 & 3 & 4 \end{bmatrix} \]
To get the "induced" relation on A and C (removing B), we project the relation table (call it \(r\)) onto A and B: \[ \pi_{\{A, C\}}r = \begin{bmatrix} A & C \\ 1 & 3 \\ 1 & 4 \\ 2 & 4 \end{bmatrix} \]
The original constraints are replaced by this one on A and C. The original constraint however is remembered to reconstruct the full problem afterwards.
Implementation
We can implement this recursively:
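A sketch of the elimination step for one variable, working extensionally (my own encoding: constraints are `(scope_tuple, predicate)` pairs; the caller then drops `X` from the domains and recurses or searches the reduced problem):

```python
from itertools import product

def eliminate_variable(domains, constraints, x):
    """Join all constraints involving x, project x out, and return the
    reduced constraint list (exponential in the number of x's neighbours)."""
    involving = [c for c in constraints if x in c[0]]
    others = [c for c in constraints if x not in c[0]]
    ys = sorted({v for scope, _ in involving for v in scope if v != x})
    allowed = set()
    for combo in product(*(domains[v] for v in ys)):
        assignment = dict(zip(ys, combo))
        # keep the combination if SOME value of x satisfies every joined constraint
        for vx in domains[x]:
            assignment[x] = vx
            if all(pred(assignment) for scope, pred in involving):
                allowed.add(combo)
                break
    new_constraint = (tuple(ys), lambda a, ys=ys, allowed=allowed:
                      tuple(a[v] for v in ys) in allowed)
    return others + [new_constraint]
```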
Efficiency
Efficiency depends on how \(X\) is selected. The intermediate structure depends only on the graph of the constraint network.
Variable Elimination is only effective if the network is sparse.
The treewidth of an elimination ordering is the number of variables in the largest relation created during elimination. The graph treewidth is the minimum treewidth over all orderings.
VE is exponential in treewidth, and linear in variables, whereas searching is exponential in variables.
Heuristic
We have some heuristics to determine which variable to eliminate next, e.g. eliminating the variable involved in the fewest constraints (minimum degree) first.
Local Search
In some problems, only the goal state is important, whilst the path to it is not.
Local search is an approach which only cares about the current state, and surrounding states it can move to.
Whilst local search is not systematic, it is efficient for memory and can often find reasonable solutions, and is useful for many optimisation problems, where there is some objective cost/"goodness" function.
We define the state space as all possible configurations (states) of a world, and the goal as the state we want to end up in. Usually this is optimal by some heuristic metric.
Iterative Improvement Algorithms
Iterative improvements describe algorithms that only consider the current state, and all states it can move to. The idea is to move to a state which gives the greatest increase in "quality" (metric based)
Since we only track the current state, these algorithms usually use constant space (though the time taken can still be very bad).
Common Problems
One common problem in computer science is called the N-Queens Problem, where, given an \(n \times n\) chessboard, fit \(n\) queens on it so that none can take another in one move.
Another, more general, problem is an abstraction of many others. Imagine a state described by two values, plus a "goodness" metric. The best values of the metric form "peaks" in a landscape, and the goal is to travel to the highest possible peak.
Hill Climbing
Hill Climbing, or Greedy Local Search, is the most basic of iterative improvement algorithms, and always tries to improve state (by reducing or increasing a cost function depending on the situation).
There is no search tree, and if there are several equally good values, they are chosen at random.
Implementation
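The module's own pseudocode isn't reproduced here; a generic sketch, assuming hypothetical `neighbours(state)` and `value(state)` helpers, where `value` is the quantity being maximised:

```python
import random

def hill_climbing(state, neighbours, value):
    """Greedy local search: repeatedly move to the best neighbour,
    breaking ties at random, and stop when no neighbour improves."""
    while True:
        options = list(neighbours(state))
        if not options:
            return state
        best_value = max(value(n) for n in options)
        if best_value <= value(state):
            return state                       # local maximum (or edge of a plateau)
        best = [n for n in options if value(n) == best_value]
        state = random.choice(best)            # ties broken at random
```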
Problems
Local maxima: naive hill climbing is very susceptible to getting stuck at local maxima: states whose neighbours are all worse, but which are not the best state overall in the state space. Thus, the algorithm will halt sub-optimally. We will talk about ways to solve this later.
Ridges: whilst not as big of a problem, if we have a ridge with very steep sides, the search algorithm may oscillate around the ridge, thus making very slow uphill progress.
Plateaus: A large flat expanse will cause a random walk. This may be a local-maximum plateau, or a "shoulder" plateau (where one edge leads to higher ground, whilst the others lead lower). We need to allow sideways motion to try to get off a plateau in case it's a shoulder, but this must be limited, else we would walk forever.
Greedy descent for CSPs
Greedy Descent is a hill climbing algorithm for CSPs.
The idea is to maintain an assignment of variables to values (initialised randomly). Select a variable to change, and select a new value for said variable, repeat until a satisfying arrangement is found.
When we assign values randomly to variables, we can break constraints. Each broken constraint is one conflict, and our goal is to get an assignment with zero conflicts. So we can use the number of conflicts as our heuristic measurement.
Selecting variables
We want to find a variable-value pair that minimises the number of conflicts. There are multiple strategies to select: pick the variable-value pair that results in the fewest conflicts overall; pick a variable involved in the most conflicts and then its best value; pick any variable involved in a conflict and then its best value; or pick a variable and value at random.
More Complex Domains
When domains are small / unordered, the neighbours of one state can be equivalent to choosing another value for a variable. When domains are large / ordered, the neighbours can be considered as adjacent values.
If the domains are continuous (imagine a function), we use a process called gradient descent, which changes each variable in proportion to the gradient of the heuristic (cost) function. I.e. we find the gradient at that point (via differentiation), and then go in the opposite direction by a certain step size \(\eta\), to "descend" and minimise cost. \[X_i : v_i \longrightarrow v_i - \eta \frac{\partial h}{\partial X_i}\] Which is saying, variable \(X_i\) goes from value \(v_i\) to a value one gradient step away (\(\frac{\partial h}{\partial X_i}\) is the partial derivative of \(h\) with respect to \(X_i\)). A minimal sketch follows.
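A tiny sketch of one such update step, assuming a hypothetical `grad_h` function that returns the gradient of the cost at the current point:

```python
def gradient_descent_step(values, grad_h, eta=0.01):
    """One gradient descent step: move each variable against its partial derivative.
    `values` is a list of current variable values; `grad_h(values)` returns the
    gradient as a list of the same length."""
    gradient = grad_h(values)
    return [v - eta * g for v, g in zip(values, gradient)]
```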
Randomised Greedy Descent
Stochasticity is a fancy word for randomness. The idea in general is to introduce some randomness into our choices, so we don't beeline it for the local maxima, and gives the algorithm a chance to find the actual maximum.
There are two main methods of randomising greedy descent. Random Steps, where we sometimes move to a random neighbour instead of the best one, and Random restart, where we start again somewhere else, and take the best out of all runs.
If the space is jagged with many small local minima, then random steps can take us closer to the true minimum, but random restart alone would keep landing in one of the poor minima. Conversely, if the space is wide with a few tall peaks, random restart can be quite quick, but random steps wouldn't get out of a minimum well at all.
Stochastic Local Search
Stochastic local search is greedy descent with a mix of both random steps and random restart.
Randomly Walking
When doing a search step, have a fixed percentage where you choose another variable (or value).
Perhaps choose the variable that is in the most conflicts, perhaps choose any in-conflict variable, or just any variable altogether.
Sometimes choose the best value, but sometimes choose a random value instead.
Comparing Stochastic Algorithms
Stochastic algorithms are inherently random, and so how do we measure how good they are? It depends on how much time you have, and how reliable the algorithm is at halting.
Say you run tests on three algorithms, A, B, and C. A is very quick (say, 0.4 s) at solving 30% of a problem set, but doesn't halt for the other 70%. B solves 60% at reasonable speed (about 1 minute) but doesn't halt for anything else, and C solves 100% of problems but takes ages (around 20 minutes) for each one.
By plotting a run-time distribution graph like the one described, we can then determine which algorithm is best for our needs. If our problem is solution critical, then we would go for C, since it always finds a solution. If we are time critical, then A, run multiple times (with a 30% chance of success each time), might be preferable.
Simulated Annealing
Annealing is a process in metallurgy that heat-treats a metal to change its properties. We take the vague concepts from this example and apply it to a CSP by "heat treating" the variables.
We have a temperature parameter, \(T\), which slowly decreases over time.
Now, we pick a variable, and a value at random. If it is an improvement, we always adopt it. If it's not better, with probability based on \(T\), we adopt it anyway.
Note: the actual probability is based on the heuristic \(h()\) of the two values. If the current value is \(n\) and the proposed (worse) value is \(n'\), the probability we adopt it anyway is \( e^{\frac{h(n)-h(n')}{T}} \).
And as we do more iterations, the temperature \(T\) decreases, so worse moves are accepted with ever-lower probability, representing us "honing in" on a target. A sketch is shown below.
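Roughly, in Python (a minimal sketch minimising `h`; `random_neighbour`, the geometric cooling schedule, and the step cap are my own assumptions):

```python
import math
import random

def simulated_annealing(state, random_neighbour, h, t0=10.0, cooling=0.99, steps=10_000):
    """Minimise h: always accept improvements, accept worse moves with
    probability exp((h(n) - h(n')) / T), and cool T geometrically."""
    temperature = t0
    for _ in range(steps):
        candidate = random_neighbour(state)
        delta = h(candidate) - h(state)             # negative means an improvement
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            state = candidate
        temperature *= cooling                       # slowly lower the temperature
        if h(state) == 0:                            # e.g. zero conflicts in a CSP
            break
    return state
```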
"Tabu" Lists
Sometimes our search space has cycles, and we do not want that. A Tabu List (not "taboo"?) is a list of the last \(k\) explored states, that we do not allow moving into.
This is like keeping track of all states, but much less memory and work intensive (depending on the size of \(k\)).
Parallel Search
Similar to random restart, a parallel search does the same thing ... but in parallel. One total assignment is called an individual, and we have \(k\) individuals, all run at once, all updated every step, and get the solutions of all those \(k\) searches.
The first solution to return (i.e. the quickest), or the best solution out of all solved, is taken as the final solution. You could probably multithread this.
Beam Search
A beam search is like parallel search, in that we maintain \(k\) individuals. Every update, however, we generate all the neighbours of all \(k\) individuals and keep only the best \(k\) of them, then repeat.
When \(k = 1\), we have greedy descent, and when \(k = \infty\), we have breadth-first search. Being able to pick \(k\) allows us to limit the amount of space and parallelism we use.
Stochastic Beam Search
Expanding on from beam search, a stochastic beam search picks not the best \(k\) individuals, but rather \(k\) individuals based on probability weighted by their heuristic scores (best heuristics are more likely to be picked).
This maintains diversity and allows some beams to find different local maxima / the global maxima. We can think of this as akin to asexual reproduction.
Genetic Algorithms
A Genetic Algorithm (GA) is a self-improvement algorithm based on Darwin's theory of evolution (and is kinda like AI eugenics). It is often used for training neural networks, but can also be used for CSPs.
We have a population of \(k\) randomly generated individuals (states). Each individual is represented as a string describing the state, to facilitate computation.
Successor states are generated based on two "parent" states, like sexual reproduction.
Individuals can be evaluated by a fitness function, which should return higher values for better states. The fitness of an individual is related to the probability that individual is allowed to "reproduce". All individuals not chosen / below the threshold are culled (discarded)
"Reproduction"
For each chosen pair from a GA, a random "crossover point" is chosen. Two offspring are generated by swapping the parent strings after the crossover point.
Parents "123|41234"
and "567|85678"
, and a crossover point of 3
(shown in parent), generate the children "12385678"
and "56741234"
.
Finally, each child is subject to a random mutation with a small probability, changing maybe one letter.
Please note an individual can be chosen more than once to reproduce (though obviously not with itself)
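A small sketch of the crossover and mutation operators on string-encoded individuals (my own naming; the mutation rate is an arbitrary example value):

```python
import random

def crossover(parent_a, parent_b):
    """Single-point crossover: swap the tails of two parent strings."""
    point = random.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(individual, alphabet, rate=0.05):
    """With small probability, replace each character with a random one."""
    return "".join(random.choice(alphabet) if random.random() < rate else ch
                   for ch in individual)

# e.g. crossover("12341234", "56785678") with crossover point 3 gives
#      ("12385678", "56741234"), as in the example above.
```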
Running style
If parents are quite different, their children will be quite different from both parents.
Thus a GA takes very large steps at the beginning, and slow down as the population begins to converge around some optimal states. This is useful for optimisation problems.
The choice of representation is fundamental - a representation needs to be a meaningful description for a GA to work properly.
Adversarial Search
Consider a multi-agent environment, where agents are in direct conflict with each other. These type of games give rise to adversarial search.
Opponents to an agent create uncertainty, which the agent must deal with by considering contingencies.
These games are often complex, and time is usually important and limited, thus the agent may have to "best guess" based on experience and what it can calculate to be the best in the time allowed.
Uncertainty may not only arise from opponent moves, but also from other factors, like randomness of dice, or time being limited as to not allow perfect decision making.
This uncertainty makes games interesting, and a better representation of the real world.
Let's first consider games which are small enough that we can make a perfect decision.
Games, formally
For a given game, we have: an initial state, a set of legal moves (a successor function), a terminal test saying when the game is over, and a utility function giving a numeric value to each terminal state.
Noughts and Crosses
Let's consider the game Noughts and Crosses. It's sufficiently small and simple that we can build a tree for it.
In the game we can win, draw or lose. These terminal states can be given the utility 1, 0, -1 respectively.
Let's call our players Min and Max. Max goes first, and Min goes second. When points are awarded, if Max wins the points are positive, and if Max loses, the points are negative.
Max wants to maximise the score, Min wants to minimise it, and so every turn after Max plays, Min will try to prevent Max from winning and getting positive score. Max thus needs a strategy to prevent Min from stopping him, or worse, winning.
Obviously as Max we don't know what Min will do, so we have to plan for every possible move Min can do.
Perfect decisions
If we have such a formal, deterministic game (such as noughts and crosses, draughts, chess, etc), we can consider perfect decisions: making the best decision considering every eventuality.
Of course, for some games (like chess) there are too many options to realistically consider, so perfect decision making is not practical, but a computer could play noughts and crosses perfectly.
Two ply game trees
Let's step back and consider a generic two-player game. One full move of the game is when both players have played; each player's turn is one ply, so a tree one move deep is called a two-ply game tree.
We denote Max's moves with an up arrow \(\Delta\), and Min's moves with a down arrow \(\nabla\), as their names would suggest.
Max has three of nine possible moves to make against Min, depending on what move Min makes. When we represent a game as a tree like this, we can use this to make an algorithm for an optimal move for Max.
Minimax
Assuming both players are perfectly rational players, minimax is an algorithm that gives the optimal strategy for Max.
Given a state of the game, pick the move that has the largest minimax value — describing the best possible outcome for Max against the best possible opponent: \[ \textrm{minimaxValue}(n) = \begin{cases} \textrm{utility}(n) & n \textrm{ is a terminal state} \\ \max_{s \in \textrm{successor}(n)}(\textrm{minimaxValue}(s)) & n \textrm{ is a node for Max} \\ \min_{s \in \textrm{successor}(n)}(\textrm{minimaxValue}(s)) & n \textrm{ is a node for Min} \end{cases} \]
Process
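A small recursive sketch of the process (the helpers `is_terminal`, `utility`, and `successors` are hypothetical stand-ins for whatever game implementation is being used):

```python
def minimax_value(state, is_terminal, utility, successors, maximising):
    """Return the minimax value of `state`; `maximising` is True on Max's turn."""
    if is_terminal(state):
        return utility(state)
    values = [minimax_value(s, is_terminal, utility, successors, not maximising)
              for s in successors(state)]
    return max(values) if maximising else min(values)

def best_move_for_max(state, is_terminal, utility, successors):
    """Pick the successor state with the largest minimax value."""
    return max(successors(state),
               key=lambda s: minimax_value(s, is_terminal, utility, successors, False))
```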
Completeness
Minimax is complete ... should the tree be finite. It is also optimal ... given your opponent is too.
Minimax is a depth-first search, so its space complexity is \(O(bd)\) (b branching factor, d depth), but its time complexity is \(O(b^d)\), which is really \(O(\textrm{horrible})\) for complex games.
This is because minimax requires a complete search tree, which, if you have too many combinations, is a no-go.
Multiplayer Minimax!?
Instead of storing utility as a single value, we can store a vector for every player playing. Thus, we can minimax for three or more-ply games.
Then, with some math magic relevant to the game structure, we can max everyone for their own utility against everyone else.
Overview
\(\alpha \beta\)-pruning is based on minimax, and gets its name from the utilities on the min-max tree.
The pruning part comes from when we search through the search tree. In short, if we discover a node which makes a path objectively worse than any other already found, we don't bother to search more on that path, and prune it out entirely - terminate the recursive call.
Example. Take the game state shown earlier.
We run minimax from left to right, so the A path is searched for Max. We find that the value we get is 3.
The B path is searched, and we immediately come across a terminal node with value 2. Since Min will always try to minimise score, it will never pick anything above 2 for that branch, thus A will always be better. We stop searching B and prune it.
Finally, we search C. The 2 value is only found in the last child, and so everything is searched anyway.
Finally, Max plays A.
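As a rough sketch of how this pruning can be implemented (same hypothetical game helpers as the minimax sketch earlier):

```python
import math

def alpha_beta(state, is_terminal, utility, successors, maximising,
               alpha=-math.inf, beta=math.inf):
    """Minimax value of `state` with alpha-beta pruning.
    alpha = best value Max can already guarantee, beta = best for Min."""
    if is_terminal(state):
        return utility(state)
    if maximising:
        value = -math.inf
        for s in successors(state):
            value = max(value, alpha_beta(s, is_terminal, utility, successors,
                                          False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:          # Min will never allow this branch: prune
                break
        return value
    value = math.inf
    for s in successors(state):
        value = min(value, alpha_beta(s, is_terminal, utility, successors,
                                      True, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:              # Max already has something at least as good
            break
    return value
```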
Effectiveness
The effectiveness of ab-pruning depends on the order we search in. In the example above, the second branch was pruned very quickly but the third was fully searched.
To fix this we could examine the best successors first, reducing minimax from \(O(b^d)\) to \(O(b^{\frac{d}{2 }})\) - a significant improvement.
Looking at successors randomly is roughly \(O(b^{\frac{3}{4}d})\), which is still better than nothing.
In practice, always looking at the best successor is very difficult, but even a simple ordering function gives a significant speedup. A ✨heuristic✨ of sorts.
Chance nodes
So far we have only looked at perfectly deterministic games - those where the only "uncertainty" comes from opponent moves.
However, a lot of games contain elements of chance, moves being dependent on the roll of a dice.
We thus have to include chance nodes, labelled with the value and the chance of obtaining that value. We can then calculate an expected value taking into account the probabilities.
Expected minimax
We can expand the minimax value function further by taking into account expected values: \[ \textrm{expMinMax}(n) = \begin{cases} \textrm{utility}(n) & n \textrm{ is terminal node} \\ \max_{s \in \textrm{successor}(n)}(\textrm{expMinMax}(s)) & n \textrm{ is node for Max} \\ \min_{s \in \textrm{successor}(n)}(\textrm{expMinMax}(s)) & n \textrm{ is node for Min} \\ \sum_{s \in \textrm{successor}(n)}(P(s) \cdot \textrm{expMinMax}(s)) & n \textrm{ is chance node} \end{cases} \] Since this considers all outcomes, it has a big O of horrible at \(O(b^m n^m)\) where \(n\) is the number of unique chance outcomes.
We can however forward-prune chance nodes if we put bounds on our utility function. Although a chance node is an average, we would know what bounds it lies in.
Minimax needs to search the complete game tree, which ab-pruning improves somewhat by not searching everything. However, ab-pruning still needs quite a few terminal nodes to begin the pruning, which is impractical when the game is complex and time is limited.
Thus we reach the problem of imperfect decisions.
Limiting the search tree
Before, we used utility values. These are assigned, definite values for a terminal state. Here, we use an evaluation function, which give heuristic values that are estimates of the eventual value of a state.
We also have a cutoff test, which determines when we stop exploring a tree further.
Thus the general overview is to cut off the search tree after a set depth, then rate the nodes using estimates of expected utility obtained from the evaluation function.
We cannot avoid uncertainty, and may not always make the best decision, but we can make a good enough one in a reasonable time.
Evaluation functions
Many evaluation functions are based off features of a state. In chess, this may consist of the number and type of pieces remaining/taken for each player, the number threatened, etc. Note that each feature is represented numerically.
These features can be used to determine the probability of the game resulting in a win, draw or loss.
If these features are independent, then we can use a weighted linear function to scale these features according to how important they are: \[\textrm{eval}(n) = w_1 f_1 + w_2 f_2 + \cdots + w_n f_n\] Each w is a different weight, applied to a different feature.
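For instance, a weighted linear evaluation could look like the sketch below (the features, weights and state layout are invented for illustration):

```python
def material_balance(state):
    # hypothetical feature: my total piece value minus the opponent's
    return state["my_material"] - state["their_material"]

def threatened_pieces(state):
    # hypothetical feature: number of opponent pieces we currently threaten
    return state["threatened"]

FEATURES = [material_balance, threatened_pieces]
WEIGHTS  = [1.0, 0.5]   # importance of each feature, set by the designer (or learned)

def evaluate(state):
    """eval(n) = w1*f1 + w2*f2 + ...: a weighted linear combination of features."""
    return sum(w * f(state) for w, f in zip(WEIGHTS, FEATURES))

# evaluate({"my_material": 39, "their_material": 36, "threatened": 2}) -> 4.0
```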
Cutoff methods
The simplest cutoff method is to set a depth \(d\) and stop searching when \(d\) is reached.
A better approach however is to use iterative deepening, where we continue as much as we can until we run out of time, and return the best move found so far.
None of these are perfectly reliable however, due to us approximating value, which may be wildly off from actual utility.
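A sketch of the iterative-deepening cutoff, assuming a hypothetical `depth_limited_search(state, depth)` that runs the (pruned) minimax to a fixed depth and evaluates at the cutoff; a real version would also abandon a half-finished depth when the clock runs out:

```python
import time

def iterative_deepening(state, depth_limited_search, time_budget=1.0):
    """Search one ply deeper each pass and return the best move from the
    deepest search completed before the time budget runs out."""
    deadline = time.monotonic() + time_budget
    best_move, depth = None, 1
    while time.monotonic() < deadline:
        best_move = depth_limited_search(state, depth)
        depth += 1
    return best_move
```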
Quiescence
We can only apply evaluation to quiescent states — those that are unlikely to change in the near future, by some sort of metric.
Non-quiescent states are expanded until quiescent ones are found.
This quiescent search is an extra search on top of the fixed-depth one, and should be restricted to certain moves, like a capture in chess, which can quickly resolve uncertainty; otherwise we may never stop.
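A rough sketch of that extra search (the `is_quiescent`, `noisy_moves` and `result` helpers are hypothetical; `noisy_moves` would return captures and similar uncertainty-resolving moves):

```python
def quiescence_value(state, game, evaluate, maximizing=True):
    """Only evaluate quiescent states; otherwise keep expanding the 'noisy'
    moves until the position settles down."""
    if game.is_quiescent(state):
        return evaluate(state)
    noisy = game.noisy_moves(state)
    if not noisy:                         # nothing left that resolves the uncertainty
        return evaluate(state)
    values = [quiescence_value(game.result(state, m), game, evaluate, not maximizing)
              for m in noisy]
    return max(values) if maximizing else min(values)
```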
Horizon Problem
This does however lead to the horizon problem: when faced with an unavoidable damaging move, a fixed-depth search can mistake stalling moves, which merely push the damage beyond the search horizon, for moves that actually avoid it.
Mitigating said problem
We can perform a singular extension search, to avoid the horizon problem.
The aim is to search for a move that is clearly better for the opponent (can they move a pawn to 8th row and get a queen?), and prevent that, taking a hit somewhere else.
We can also do forward pruning: immediately pruning some nodes without further consideration, rather than playing them out, much as human players ignore the moves they don't think are important.
Note that this is really only safe in certain cases: if two nodes are symmetric or equivalent, it is sufficient to consider just one; and very deep in the tree, the probability that the opponent plays exactly the right sequence of moves to reach that path is low, so we can afford to consider it less thoroughly.
Monte Carlo tree search is an improvement on \(\alpha \beta\)-pruning which addresses some of the latter's limitations, most notably the need for a hand-crafted evaluation function.
The idea is to estimate the value of a state from the average utility of many simulated games (playouts) starting from that state.
By playing games against itself Monte Carlo can learn which states often lead to which outcomes, rather than the programmer having to come up with some function.
The way moves are made during playouts is called the playout policy, and can vary between games. For some games the algorithm can play against itself and learn via a neural network in unguided play, whereas for others game-specific heuristics can be used.
Pure MCTS
Perform N simulations from the current state, and track which move from the current position has the highest win percentage.
As N increases, this converges to optimal play.
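A sketch of the pure version (with hypothetical `legal_moves`, `result` and `random_playout` helpers, where a playout returns 1 for a win and 0 otherwise):

```python
import random
from collections import defaultdict

def pure_mcts(state, game, n_simulations=1000):
    """Spread N random playouts over the legal moves and pick the move
    with the highest win percentage."""
    wins, plays = defaultdict(int), defaultdict(int)
    moves = game.legal_moves(state)
    for _ in range(n_simulations):
        move = random.choice(moves)
        plays[move] += 1
        wins[move] += game.random_playout(game.result(state, move))
    return max(moves, key=lambda m: wins[m] / plays[m] if plays[m] else 0.0)
```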
Selection policy
Pure MCTS is usually infeasible, given the large amount of processing power needed to run it. Thus, a selection policy is applied to focus simulation on important parts of the game tree.
The policy must balance exploration (simulating from states with few playouts) and exploitation (simulating from known good states to increase the accuracy of their estimates).
Steps
MCTS maintains a search tree, and grows it at each iteration, via the four standard phases:

- Selection: starting at the root, use the selection policy to descend through the tree until a leaf is reached.
- Expansion: add one (or more) new child nodes to that leaf.
- Simulation: run a playout from the new node to the end of the game, using the playout policy.
- Back-propagation: pass the playout result back up the path, updating each node's playout count and utility.
Repeat for either a fixed no. of iterations, or until we are out of time, and return the move with the highest number of playouts.
Returning the most-played move (rather than the highest win rate) is motivated by certainty: 65 wins out of 100 playouts is a much more reliable estimate than 2 out of 3.
Upper confidence selection policy
Upper confidence is a selection policy based off the upper confidence bound formula UCB1: \[ UCB1(n) = \frac{U(n)}{N(n)} + C \sqrt{\frac{\log N(\textrm{parent}(n))}{N(n)}} \] \(U(n)\) is the total utility of all playouts starting from \(n\), \(N(n)\) is the number of playouts from \(n\), \(\textrm{parent}(n)\) is self explanatory, and \(C\) is a set constant, used to balance exploration with exploitation.
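As a sketch (assuming hypothetical node objects carrying `U`, `N`, `parent` and `children`; \(C = \sqrt{2}\) is a common default):

```python
import math

def ucb1(node, C=math.sqrt(2)):
    """Exploitation term (average utility) plus exploration term."""
    if node.N == 0:
        return math.inf               # always try unvisited children first
    return node.U / node.N + C * math.sqrt(math.log(node.parent.N) / node.N)

def select_child(node):
    """Selection step: descend towards the child with the highest UCB1 value."""
    return max(node.children, key=ucb1)
```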
The computation time of a playout is linear in depth of tree.
Intelligent behaviour is conditioned on knowledge: our actions are based on what we know.
Thinking is the process of retrieving relevant knowledge, and reasoning with said knowledge.
We can formally represent knowledge through the use of symbol-based structures and reason computationally with those structures.
Knowledge
Knowledge is usually described using declarative sentences, propositions, which can be true or false (recall CS130).
We call the store of all knowledge (the set of all propositions) for a system the knowledge base (KB).
Reasoning
Reasoning is extrapolating from given knowledge through logical processes.
We say that a new proposition \(\alpha\) is logically entailed by our knowledge base, \(KB \models \alpha\), if \(\alpha\) is true when every proposition in KB is true: KB implies \(\alpha\).
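A tiny worked example (with arbitrary propositions): if \(KB = \{p,\ p \implies q\}\), then \(q\) is true in every situation where everything in KB is true, so \(KB \models q\); an unrelated proposition \(r\) is not entailed, since KB says nothing about it.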
Representation Languages
Languages used to represent knowledge must have "entailment" well defined, and the overall expressiveness of the language directly impacts computational complexity.
A language is logically complete if we can compute (reason out) all propositions the KB entails.
A language is logically sound if we can guarantee that anything we reason to be true, is indeed true.
Knowledge Representation Hypothesis
Any intelligent, mechanical process will contain a collection of propositions about the world that it believes to be true, and is able to formally reason with those propositions during operation.
Propositional Logic
One language we can use to represent knowledge is propositional logic (recall CS130).
Because it is boolean and has nice rules, we can reason with it easily.
Semantics
When we build a knowledge base, we must choose the variables to build our propositions. These should have meaning to the designer - to the machine they are just symbols to manipulate with fixed rules.
The definite clause is the form of proposition we will store knowledge in, given as \(\textrm{head} \impliedby \textrm{body}\).
The head is a single variable, and is implied by the body, a formula built as a conjunction of one or more variables.
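For example (with made-up variables), a clause saying the light is on whenever there is power and the switch is up: \[\textrm{light\_on} \impliedby \textrm{power\_on} \land \textrm{switch\_up}\] Here \(\textrm{light\_on}\) is the head, and \(\textrm{power\_on} \land \textrm{switch\_up}\) is the body.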
User view of semantics
The user must decide the intended interpretation (domain) of the system, and there should be a variable for each atomic proposition the user wants to represent.
The user needs to tell the system what is true - axiomatising the domain. If \(KB \models \alpha\), then \(\alpha\) must also be true in the intended interpretation (i.e. axioms that are true must never entail something that is false in that interpretation).
The system is only responsible for calculation, the user must interpret what it spits out using their set interpretation.
System view of semantics
The system can see a series of symbols and formulas that it can churn through - this is the knowledge base.
The system can also determine if a formula / variable is a logical consequence of the KB (entailment), but the system won't have any idea what they mean - that's up to the user.
Expert systems are not used much any more, but were all the rage in AI back in, like, the 80s and earlier.
Overview
An expert system is a program that represents and reasons with knowledge from some speciality, in order to provide advice.
There is a real life expert, providing knowledge to its knowledge base. That can then be run through an inference engine (which reasons with the KB), and respond to user requests.
An expert has deep knowledge of a domain, can solve problems, but crucially can explain answers.
MYCIN
MYCIN (I can't find what it stands for) was one of the first expert systems, created at Stanford in the 1970s. It was designed to be used in medicine to generate diagnostic hypotheses and weight them based on the evidence provided.
I'm going to put it all in a big collapsible because you just need to get an idea of these systems and move on, not actually know them well.
The crucial thing is that these systems are able to explain why they got a result - essentially showing a trace of all the clauses it reasoned through.
The problem we have is we want to figure out how to treat blood infections. We have a variety of drugs that treat a variety of bacteria (but not all), and want a system to reason what drugs to use, based on test data we feed in. This is the problem domain.
The process is to identify the organism type (via test results), and choose the most appropriate drug(s).
The way the system works is that an expert in infectious diseases uses a program to input true statements into the system's knowledge base, with support from patient data.
The user, the physician in charge of the patient, enters the test results and patient data through the user-facing interface, and the system runs its inference engine to come up with a result, which is displayed to the user.
Crucially, the user can ask the system "why", and it will justify its reasonings based on its knowledge base.
Its knowledge base has a set of rules, and its thought process is: "If condition \(c_1, c_2, \dots\) hold, then draw with tally \(t\) conclusions \(n_1 \dots n_n\)" where the tally is like the probability that those conclusions are true, given the conditions (this is set by the expert).
And the final certainty is combined from the certainties of the separate conditions and the rule tally probability itself.
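As a rough sketch of how such a combination could work (this is an assumption loosely modelled on certainty-factor style systems, not necessarily MYCIN's exact scheme): a conclusion can be no more certain than its least certain condition, scaled by the rule's tally.

```python
def rule_certainty(condition_certainties, tally):
    """Hypothetical MYCIN-style combination: weakest condition times the rule tally."""
    return tally * min(condition_certainties)

# conditions believed with certainty 0.9 and 0.7, rule tally 0.8
print(rule_certainty([0.9, 0.7], 0.8))   # 0.8 * 0.7 = 0.56
```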
Also information on different morphologies, types, etc of bacteria are stored there for good measure.
Patient data is stored in a tree structure, with branches leading to tests, procedures, and operations they found out, and children nodes being the findings, such as what species of bacteria is causing the illness.
MYCIN does its search through a "goal tree", with its top level goals being broken down into sub goals, children, which are broken all the way down into leaves, which are effectively true or false atomic predicates.
It's called an and/or tree: an arc drawn across the branches to a node's children means all of the children must be true for the parent to be true (an AND node); with no arc, at least one child being true is enough (an OR node).
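A small sketch of evaluating such a tree (the node representation here is made up): leaves carry a truth value, AND nodes need every child, OR nodes need at least one:

```python
def holds(node):
    """Evaluate an and/or goal tree."""
    if node["kind"] == "leaf":
        return node["value"]
    child_values = [holds(c) for c in node["children"]]
    return all(child_values) if node["kind"] == "and" else any(child_values)

# hypothetical goal: (a and b) or c
goal = {"kind": "or", "children": [
    {"kind": "and", "children": [
        {"kind": "leaf", "value": True},     # a
        {"kind": "leaf", "value": False}]},  # b
    {"kind": "leaf", "value": True}]}        # c
print(holds(goal))   # True, because c holds
```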
Even though it worked well, MYCIN was incomplete, and more importantly hospitals (in the 1970s) could not afford the computing power to actually run the system in the first place.
Intelligent Behaviour
People tend to see regular behaviour as intelligent behaviour. Regular behaviour is often described by a set of (written or unwritten) rules.
Formally, for AI systems these can be called production rules or condition-action rules.
Canonical systems
Canonical systems are a formal language system based on an alphabet \(A\) of symbols, some "axiomatic strings", and a series of production rules.
These rules act like regular expression replacements. Note that $ will be used as a wildcard, standing for an arbitrary string of input characters.
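As a sketch of how such a rule could be applied (the rule "$ -> a$a" below is invented purely to show the wildcard in action):

```python
import re

def apply_rule(string, pattern, replacement):
    """Apply one production rule, treating '$' as a wildcard that matches an
    arbitrary string and is copied into the replacement."""
    regex = re.escape(pattern).replace(r"\$", "(.*)")
    repl = replacement.replace("$", r"\1")
    return re.sub(regex, repl, string, count=1)

# hypothetical rule "$ -> a$a": whatever $ matched gets wrapped in a's
print(apply_rule("bcb", "$", "a$a"))   # abcba
```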
Example language
Let \(A = \{a, b, c\} \) and our axioms be \(a, b, c, aa, bb, cc\). Our production rules are
Language to represent knowledge
Let's define a language to formally represent object attributes.
We have an "alphabet" of three sets: O, the objects (in our domain), A, the attributes of objects, and V, the