Part 1
The idea of the whole assignment is that a location can be identified as having a particular function by looking at the amino acids near that location. But what do we mean by "looking at the amino acids near a location", anyway? To formalize this, we characterize a given location with a table that lists the presence or absence of amino acids in each "shell". So this is a 20x5 table, presuming that we chose five shells.
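To make the table concrete, here is a minimal sketch of how one might build it. Several things here are assumptions for illustration and are not taken from the assignment: the names `feature_table`, `AMINO_ACIDS`, and `SHELL_WIDTH`, the shell thickness itself, and the idea that each residue has already been reduced to a single (x, y, z) point.

```python
import math

# The 20 standard amino acids, by three-letter code.
AMINO_ACIDS = ["ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
               "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL"]
NUM_SHELLS = 5
SHELL_WIDTH = 1.25  # hypothetical shell thickness; use whatever the assignment specifies


def feature_table(location, residues):
    """Return a dict mapping (amino_acid, shell) -> True/False for one location.

    `residues` is a list of (amino_acid_name, (x, y, z)) pairs. How a residue
    is reduced to one point is an assumption of this sketch.
    """
    # Start with every one of the 20 x 5 = 100 features absent.
    table = {(aa, s): False for aa in AMINO_ACIDS for s in range(NUM_SHELLS)}
    for name, coords in residues:
        # Shell 0 is the innermost shell; each shell is SHELL_WIDTH thick.
        shell = int(math.dist(location, coords) // SHELL_WIDTH)
        if name in AMINO_ACIDS and shell < NUM_SHELLS:
            table[(name, shell)] = True
    return table
```

A dict keyed by (amino acid, shell) is just one convenient layout; a 20x5 array of booleans would carry the same information.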
In part 1, we then look at a lot of these tables for a lot of calcium binding sites, and for a lot of non-binding sites. Then, given the total counts in each table, we can figure out how likely it is that we will or won't see a particular amino acid in a particular shell, given that we know that we're looking at a calcium binding site or a non-binding site.
Formally, once we've done the smoothing and calculated the probabilities, the result of part 1 gives us the following information:
P(feature-1 = true | Location-is-a-site), P(feature-1 = true | Location-is-not-a-site)
P(feature-2 = true | Location-is-a-site), P(feature-2 = true | Location-is-not-a-site)
...
P(feature-100 = true | Location-is-a-site), P(feature-100 = true | Location-is-not-a-site)
Where the notation P(X | Y) means "the probability of X given Y", and by "feature-n = true" I mean "it is true that amino acid m is present in shell p."
Note that we also have the corresponding P(feature-n = false | ...) probabilities, because we know that P(feature-n = true | ...) + P(feature-n = false | ...) = 1. Of course, by "feature-n = false" I mean "it is false that amino acid m is present in shell p."
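As a rough sketch of how these per-feature probabilities could be estimated from counts, consider the function below. The name `estimate_probabilities` and the `pseudocount` parameter are made up for illustration, and add-one (Laplace) smoothing is an assumption; the assignment may call for a different smoothing scheme.

```python
def estimate_probabilities(tables, pseudocount=1.0):
    """Estimate P(feature = true | class) from the feature tables of ONE class.

    `tables` is a list of dicts, each mapping a feature key to True/False.
    Add-one smoothing is an assumption of this sketch, not necessarily the
    smoothing the assignment specifies.
    """
    n = len(tables)
    probs = {}
    for key in tables[0]:
        true_count = sum(1 for t in tables if t[key])
        # Each feature has two outcomes (true/false), so the denominator
        # receives two pseudocounts. This keeps every estimate strictly
        # between 0 and 1.
        probs[key] = (true_count + pseudocount) / (n + 2 * pseudocount)
    return probs
```

You would call this once with the tables for the sites and once with the tables for the non-sites. The "false" probability for any feature is then just 1 minus the value returned here.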
For convenience, from here on, take "P(feature-n | ...)" to mean "P(feature-n = true | ...)" IF our value for feature-n is "true" OR "P(feature-n = false | ...)" IF our value for feature-n is "false". This is a pretty standard notation.
Basically what we have now is two probability distributions - one which tells you the likelihood of a set of features given that we're looking at a site, and one that tells you the likelihood of a set of features given that we're looking at a non-site. (A "set of features" would be something like "feature 1 is true, and feature 2 is false, and ... and feature 100 is true." Basically, a listing of whether each amino acid is present or absent in each shell; that is, a 20x5 table just like above.)
That is, we have:
P(features | site) and P(features | non-site)
Note that technically, this is false. What we have is just the marginal distributions, and not the full joint distributions. We don't have access to the true value of P(feature-1 AND feature-2 AND ... | site), for example. We only have access to the individual probabilities like P(feature-1 | site) and P(feature-2 | site).
There is no guarantee in real life that the features are independent: that is, that P(feature-1 AND feature-2 AND ... | site) = P(feature-1 | site) * P(feature-2 | site) * ...
In fact, they probably aren't independent. Imagine that a site requires a single negative charge in the third shell, but it doesn't matter which charged residue provides that negative charge. If half the sites have glutamate, and half have aspartate, then P(glu-in-shell-3 = true | site) = 0.5 and P(asp-in-shell-3 = true | site) = 0.5, but P(asp-in-shell-3 = true AND glu-in-shell-3 = true | site) = 0, because there's only one negative charge per site in that shell.
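The glutamate/aspartate example can be checked with a few lines of arithmetic. The 100 sites below are made-up data that match the scenario just described (half Glu, half Asp, never both); nothing here comes from a real structure.

```python
# Made-up data: every site has exactly one negative charge in shell 3,
# provided by Glu in half the sites and by Asp in the other half.
sites = [{"glu": True, "asp": False}] * 50 + [{"glu": False, "asp": True}] * 50

p_glu = sum(s["glu"] for s in sites) / len(sites)   # marginal for Glu
p_asp = sum(s["asp"] for s in sites) / len(sites)   # marginal for Asp

# True joint probability: how often both are present at the same site.
p_both = sum(s["glu"] and s["asp"] for s in sites) / len(sites)

# What the independence assumption would claim the joint probability is.
independent_guess = p_glu * p_asp

print(p_glu, p_asp, p_both, independent_guess)
```

The marginals are both 0.5, so the independence assumption predicts a joint probability of 0.25, while the actual joint probability in this data is 0.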
Nevertheless, if we make the assumption, incorrect as it may be, that the probabilities are all independent, then we really can recreate P(features | site) and P(features | non-site) from what we've calculated above. This is the independence assumption that makes "Naive Bayes" so naive.
Part 2
In this part, we want to be able to ask how likely a given (x, y, z) location is to be a site or a non-site. That is, we want to know if P(site | x-y-z-location) > P(non-site | x-y-z-location), or equivalently, if
P(site | x-y-z-location) / P(non-site | x-y-z-location) > 1
However, we don't know how to evaluate this directly. We do know how to turn an (x, y, z) location into a list of features, though - that was a step in our calculations above.
So, for a given (x, y, z), build a table that reports the presence or absence of each amino acid type in each shell out from (x, y, z). That is, build a 20x5 table that has "TRUE" in every location where you have detected one or more of that amino acid in that particular shell, and "FALSE" otherwise. This gets us a list of the 100 features of that location -- the answers to the "is amino acid m present in shell p" questions.
Now we want to ask, "given these features, is it more likely that I'm looking at a site or a non-site?" or equivalently,
P(site | features) / P(non-site | features) > 1
This is starting to look like something we can calculate based on what we got in part 1. We're not there yet, though. All that we can calculate from part 1 is:
P(features | site) / P(features | non-site)
Fortunately, from Bayes' rule, we can show that:
P(site | features) / P(non-site | features) = P(features | site) / P(features | non-site),
*IF* we assume that P(site) = P(non-site). (Not a good assumption, but at least it moves us forward.)
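To see where that assumption enters: Bayes' rule gives P(site | features) = P(features | site) * P(site) / P(features), and likewise for non-sites. In the ratio, P(features) cancels, leaving the likelihood ratio times the prior ratio P(site) / P(non-site). The small sketch below (the function name `posterior_odds` is made up for illustration) shows how the answer shifts when the priors are not equal.

```python
def posterior_odds(likelihood_ratio, prior_site):
    """Return P(site | features) / P(non-site | features) via Bayes' rule.

    `likelihood_ratio` is P(features | site) / P(features | non-site).
    `prior_site` is P(site); P(non-site) is taken to be 1 - P(site).
    """
    prior_odds = prior_site / (1.0 - prior_site)
    return likelihood_ratio * prior_odds


# With equal priors, the posterior ratio equals the likelihood ratio.
print(posterior_odds(3.0, 0.5))
# If sites are rarer than non-sites, the same evidence is less convincing.
print(posterior_odds(3.0, 0.25))
```

With P(site) = 0.5 the prior ratio is 1 and drops out, which is exactly the simplification used above.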
Now we need to calculate P(features | site) / P(features | non-site), for the given list of 100 features that we have for this particular (x, y, z) location.
We make the naive Bayes assumption (detailed above) that:
P(features | site) = P(feature-1 | site)*P(feature-2|site)*...*P(feature-100|site)
and the same for non-sites.
So, we could go ahead and, for a given location, approximate the ratio:
P(site | x-y-z-location) / P(non-site | x-y-z-location)
as
[ P(feature-1 | site) * P(feature-2 | site) * ... * P(feature-100 | site) ] / [ P(feature-1 | non-site) * P(feature-2 | non-site) * ... * P(feature-100 | non-site) ]
However, this isn't a good calculation to do. We're multiplying a lot of small numbers together into really small numbers, and then dividing a really small number by another really small number. This is very unstable numerically.
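Here is a deliberately exaggerated illustration of the problem. The probability values are invented purely to trigger the failure and are much smaller than what you would typically see in this assignment.

```python
import math

# Invented per-feature probabilities, chosen only to force underflow.
site_probs = [1e-4] * 100
nonsite_probs = [2e-4] * 100

# Each product is far below the smallest positive double (about 5e-324),
# so both collapse to exactly 0.0.
numerator = math.prod(site_probs)
denominator = math.prod(nonsite_probs)

print(numerator, denominator)
# The true ratio is 0.5**100, roughly 7.9e-31, which a double can represent
# without trouble, but numerator / denominator is now 0.0 / 0.0 and Python
# raises ZeroDivisionError if we try to compute it.
```

Even when the products do not underflow all the way to zero, they can lose precision, so the ratio of the two products is not something to rely on.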
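Since we're only interested in whether the ratio is larger than 1, we can just take the log of the whole thing and see if it's larger than 0. This is equivalent because log is monotonically increasing and log(1) = 0. It also turns the products into sums, which avoids the really small intermediate numbers.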
Now, if we take the log, expand, and then substitute in the full definition of P(feature-n | ...), we get the score calculation procedure presented on the main assignment page.
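For orientation, a minimal sketch of that log-ratio calculation follows. This is not copied from the main assignment page, so the exact scoring procedure there may differ in its details; the function name `log_odds_score` and the dict-based inputs are assumptions made for illustration. It also assumes the probabilities have been smoothed so that none is exactly 0 or 1.

```python
import math


def log_odds_score(features, p_true_site, p_true_nonsite):
    """Sum of log(P(feature-n | site) / P(feature-n | non-site)) over all features.

    `features` maps a feature key to True/False for one location.
    `p_true_site` and `p_true_nonsite` map the same keys to
    P(feature = true | site) and P(feature = true | non-site).
    """
    score = 0.0
    for key, present in features.items():
        # Pick the "true" or "false" probability to match the observed value.
        p_site = p_true_site[key] if present else 1.0 - p_true_site[key]
        p_non = p_true_nonsite[key] if present else 1.0 - p_true_nonsite[key]
        score += math.log(p_site) - math.log(p_non)
    return score
```

A score above 0 means the location looks more like a site than a non-site under the naive Bayes and equal-prior assumptions; a score below 0 means the opposite.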