For an AI or LM, S (e.g., a predictor of future events, a classifier, etc.),
a requirements specification, RS, for S will determine whether S:
1. matches what actually happened in the real world or the correct classification, etc.
2. matches what a human would predict or classify, or
3. is better than a human at the prediction or classification.
1 and 3 require knowing the real world.
Sometimes knowing what is true requires knowing what a group of humans agree is true in the real world.
2 and 3 require knowing what humans can do.
A gold standard will be needed to define truth in the real world.
It will also be necessary to know how each human contributing to the gold standard did on his or her own.
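One way to picture a gold standard built from human judgments is sketched below. It is only an illustration under stated assumptions: the gold label of each item is taken to be the majority vote of the contributing humans (the notes do not prescribe how agreement is reached), and the function names and the tiny data set are hypothetical.

```python
from collections import Counter

def majority_gold(labels_per_item):
    """Gold label per item: the label most annotators chose.

    Assumption: majority vote stands in for "what a group of humans agree is true".
    """
    return [Counter(labels).most_common(1)[0][0] for labels in labels_per_item]

def per_annotator_agreement(labels_per_item, gold):
    """Fraction of items on which each individual annotator matches the gold label."""
    n_annotators = len(labels_per_item[0])
    return [
        sum(labels[a] == g for labels, g in zip(labels_per_item, gold)) / len(gold)
        for a in range(n_annotators)
    ]

# Hypothetical data: 4 items, each labeled independently by 3 humans.
annotations = [
    ["spam", "spam", "ham"],
    ["ham", "ham", "ham"],
    ["spam", "ham", "spam"],
    ["ham", "spam", "spam"],
]
gold = majority_gold(annotations)
agreement = per_annotator_agreement(annotations, gold)
print(gold, agreement)
```

The per-annotator scores are what options 2 and 3 would compare S against: each individual human falls short of the group's gold standard on some items, so "what a human can do" is generally below 100%.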
Experimentation will be needed.
In each case, there are tolerances on the result (e.g., how closely S must match what a human can do) and a level of statistical significance required of the experiment used to evaluate S.
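As a rough illustration of a tolerance and a significance check, the sketch below compares S with one human on the same gold-standard items. The tolerance value, the choice of an exact two-sided sign test on the items where exactly one of the two is correct, and all names and data are assumptions made for the example, not something the notes specify.

```python
from math import comb

def accuracy(pred, gold):
    """Fraction of items on which pred agrees with the gold standard."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def within_tolerance(s_score, human_score, tolerance):
    """Option 2 style check: S matches the human score to within `tolerance`."""
    return abs(s_score - human_score) <= tolerance

def sign_test_p_value(s_pred, human_pred, gold):
    """Exact two-sided sign test on items where exactly one of S and the human is correct."""
    s_only = sum(s == g and h != g for s, h, g in zip(s_pred, human_pred, gold))
    h_only = sum(s != g and h == g for s, h, g in zip(s_pred, human_pred, gold))
    n = s_only + h_only
    if n == 0:
        return 1.0  # no disagreements in correctness, so no evidence either way
    k = min(s_only, h_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical data: 10 items with binary labels.
gold       = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
s_pred     = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
human_pred = [1, 0, 0, 1, 1, 1, 0, 1, 1, 0]

s_acc = accuracy(s_pred, gold)
h_acc = accuracy(human_pred, gold)
matches_human = within_tolerance(s_acc, h_acc, tolerance=0.05)  # assumed tolerance
p_value = sign_test_p_value(s_pred, human_pred, gold)
print(s_acc, h_acc, matches_human, p_value)
```

With so few items, S can score visibly higher than the human and still fail to reach a conventional significance level, which is why the size of the experiment matters as much as the tolerance.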
The evaluation of 1, 2, or 3 will involve such measures as
a. recall
b. precision
c. F-measure (harmonic mean of recall and precision)
d. weighted F-measure
e. accuracy
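The measures listed above can all be computed from the counts of true and false positives and negatives. The sketch below assumes a binary classification task and assumes that the weighted F-measure is the usual F-beta, in which beta > 1 weights recall more heavily than precision and beta < 1 does the opposite. The function names and data are hypothetical.

```python
def confusion_counts(pred, gold, positive=1):
    """Return (tp, fp, fn, tn) for a binary task with the given positive label."""
    tp = sum(p == positive and g == positive for p, g in zip(pred, gold))
    fp = sum(p == positive and g != positive for p, g in zip(pred, gold))
    fn = sum(p != positive and g == positive for p, g in zip(pred, gold))
    tn = sum(p != positive and g != positive for p, g in zip(pred, gold))
    return tp, fp, fn, tn

def recall(tp, fn):
    """Fraction of the true positives that were found."""
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp):
    """Fraction of the reported positives that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r; beta=1 gives the plain F-measure."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

def accuracy(tp, fp, fn, tn):
    """Fraction of all items classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical data: 10 items, 4 of them truly positive.
gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(pred, gold)
r = recall(tp, fn)
p = precision(tp, fp)
f1 = f_beta(p, r)            # recall and precision weighted equally
f2 = f_beta(p, r, beta=2.0)  # recall counted as more important than precision
acc = accuracy(tp, fp, fn, tn)
print(r, p, f1, f2, acc)
```

Note that accuracy can look respectable when positives are rare even if recall is poor, which is one reason the choice among these measures belongs to the requirements rather than to the evaluator's convenience.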
What S is evaluated against, which measures to use, and the relative weights of recall and precision are determined by the requirements for S and are parts of the RS.