A live integrity tool hands you two things: a score and a timeline of flagged moments. Used well, they make you a sharper interviewer. Used badly, they turn you into a prosecutor working from one data point. The difference is entirely in how you read them.

One flag is not proof

Start with the math, because it is unforgiving. When real cheating is rare in your pool, false positives can outnumber true positives even with a very accurate detector. That follows directly from Bayes' rule, and overlooking it is the base-rate fallacy that Kahneman and Tversky documented. A vivid example: Vanderbilt disabled Turnitin's AI detector after noting that even a 1% false-positive rate, across its 75,000 papers, meant about 750 papers wrongly flagged. The detectors themselves are shaky: OpenAI shut down its own classifier for low accuracy, and in one study GPT detectors misflagged 61% of essays from non-native English speakers as AI-generated. Treat any single flag as a question, never an answer.
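
The arithmetic behind that warning can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's method: the function names, the 0.5% base rate, and the 95% detection rate are assumptions chosen for the example. Only the 1% false-positive rate and the 75,000 papers come from the Vanderbilt case above.

```python
def positive_predictive_value(base_rate: float,
                              sensitivity: float,
                              false_positive_rate: float) -> float:
    """Share of flagged candidates who actually cheated (Bayes' rule)."""
    true_pos = base_rate * sensitivity                  # cheaters who get flagged
    false_pos = (1 - base_rate) * false_positive_rate   # honest people who get flagged
    return true_pos / (true_pos + false_pos)


def expected_false_flags(n_honest: int, false_positive_rate: float) -> float:
    """Expected number of honest submissions flagged anyway."""
    return n_honest * false_positive_rate


# Assumed pool: 0.5% of candidates cheat, the detector catches 95% of them,
# and it wrongly flags 1% of honest candidates.
ppv = positive_predictive_value(base_rate=0.005,
                                sensitivity=0.95,
                                false_positive_rate=0.01)
print(f"Chance a flagged candidate actually cheated: {ppv:.0%}")  # 32%

# The Vanderbilt figure: 1% of 75,000 papers.
print(round(expected_false_flags(75_000, 0.01)))  # 750
```

Under those assumed numbers, roughly two of every three flags point at an honest candidate, even though the detector is "99% accurate" on honest work.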

Read the signals together, not as one "gotcha"

The reliable read comes from convergence. Selection science is explicit that you should weigh converging lines of evidence rather than a single indicator. The test-security field says the same thing about its own anomaly data: it "does not necessarily confirm cheating" and "must be supplemented with other information." A paste spike on its own means little. A paste spike that lines up with a focus switch, an off-screen gaze, and an answer that suddenly outpaces everything before it is a pattern worth a closer look.
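
One way to picture convergence is to look for short stretches of the timeline where several different kinds of signal land together. The sketch below is hypothetical: the `Flag` type, the signal names, the 20-second window, and the three-signal threshold are all assumptions made for illustration, not a description of how any particular tool scores a session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flag:
    t: float    # seconds from the start of the interview
    kind: str   # hypothetical labels, e.g. "paste", "focus_switch", "gaze_off"


def converging_windows(flags, window_s=20.0, min_kinds=3):
    """Return (start, end, kinds) for each stretch of at most `window_s`
    seconds that contains at least `min_kinds` distinct signal kinds."""
    ordered = sorted(flags, key=lambda f: f.t)
    results = []
    i = 0
    while i < len(ordered):
        start = ordered[i].t
        kinds = set()
        j = i
        # Collect every flag that falls inside the window opened at `start`.
        while j < len(ordered) and ordered[j].t - start <= window_s:
            kinds.add(ordered[j].kind)
            j += 1
        if len(kinds) >= min_kinds:
            results.append((start, ordered[j - 1].t, sorted(kinds)))
            i = j       # move past this cluster
        else:
            i += 1      # an isolated flag: not a pattern on its own
    return results


timeline = [
    Flag(30, "gaze_off"),        # lone glance away early on
    Flag(400, "paste"),
    Flag(405, "focus_switch"),
    Flag(410, "gaze_off"),
    Flag(900, "paste"),          # lone paste later
]
print(converging_windows(timeline))
```

In this made-up timeline the lone glance and the lone paste are ignored, and only the cluster between 400 and 410 seconds is surfaced. Even then, the output marks a place to ask a follow-up question, not a verdict.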

Read the timeline in context

A flag is a timestamp plus what was happening around it. Your job is to separate normal behavior from a real pattern.

  • A glance away is not a tell. Breaking eye contact is how people think — averting your gaze actually improves recall and accuracy on hard questions. A pause to think looks nothing like sustained reading off a fixed line.
  • A flag can be an artifact. Automated proctoring famously flags some groups far more than others with no underlying difference in cheating. If a signal can fire on appearance or accent, discount it accordingly.
  • Read it like a security analyst reads telemetry. Events that look benign alone can matter together — judge each flag against a baseline and across time, not as a single snapshot.
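
Judging against a baseline can be as simple as comparing a moment with the candidate's own earlier behavior instead of a global cutoff. A minimal sketch, assuming a hypothetical words-per-minute measure for each answer; the 2x ratio and the three-answer minimum are arbitrary illustrative thresholds:

```python
from statistics import median


def outpaces_baseline(prior_rates, current_rate, ratio=2.0, min_history=3):
    """True if the current answer is much faster than this candidate's
    own typical pace. With too little history there is no baseline, so
    the function declines to call anything unusual."""
    if len(prior_rates) < min_history:
        return False
    return current_rate > ratio * median(prior_rates)


# Hypothetical words-per-minute for a candidate's earlier answers.
earlier = [40, 45, 38, 42]
print(outpaces_baseline(earlier, 120))  # True: far above this person's norm
print(outpaces_baseline(earlier, 60))   # False: faster, but within range
```

The median keeps one unusually fast or slow earlier answer from distorting the baseline, and the same comparison can be applied to any signal that has a per-candidate history.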

Keep due process

Once you act on a flag, the law expects a human to stay in charge. The EU AI Act requires human oversight that guards against "automation bias," meaning over-reliance on the system's output, and the GDPR gives candidates the right to contest a solely automated decision and obtain human intervention. Practically: do not auto-reject on a score. Give the candidate a chance to explain, because a flag can reflect a disability or a tool you did not anticipate, which is why the EEOC and DOJ both stress accommodations and human review. And remember that the employer, not the vendor, owns the decision and the liability.

Turn evidence into a fair decision

NIST frames it cleanly: an AI system can defer to a human or serve as "an additional opinion," and the human-AI loop can amplify bias if you let the score do the deciding. So treat the integrity score as one input alongside the strongest predictor you have, the structured interview (validity around .42), and watch for confirmation bias once a flag appears. The score tells you where to look. The conversation tells you what it means. You make the call.

That is the whole point of a timeline: not to accuse, but to give you evidence you can actually reason about. Trueyy is built to surface those signals in context, with the timestamp, so the decision stays yours.


Sources