ScanCode train gibberish model

ScanCode uses a 2-character Markov chain to detect gibberish in text. At a high level, it decides whether a sequence of letters looks like part of a word or a whole word by examining it two letters at a time and checking how likely each transition from one letter to the next is. These transition probabilities come from a model trained on a large set of valid text: training counts every transition between letters and converts the counts into probabilities. The probabilities and thresholds are stored in a model that is saved as a Python pickle.
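The idea described above can be illustrated with a minimal sketch. This is not ScanCode's actual code: the names ALPHABET, train and avg_transition_prob, the restriction to lowercase letters and spaces, and the starting count of 10 used for smoothing are all assumptions made for illustration.

```python
import math

# Assumption: only lowercase ASCII letters and the space character are modeled.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def bigrams(text):
    """Yield consecutive (previous, next) character pairs found in text."""
    chars = [c for c in text.lower() if c in ALPHABET]
    return zip(chars, chars[1:])


def train(corpus_lines, smoothing=10):
    """Count letter transitions and return a table of log probabilities.

    Every count starts at `smoothing` (an assumed value) so that a
    transition never seen in the corpus keeps a small non-zero probability.
    """
    counts = {a: {b: smoothing for b in ALPHABET} for a in ALPHABET}
    for line in corpus_lines:
        for a, b in bigrams(line):
            counts[a][b] += 1

    log_probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        log_probs[a] = {b: math.log(n / total) for b, n in row.items()}
    return log_probs


def avg_transition_prob(text, log_probs):
    """Return the geometric mean of the transition probabilities in text."""
    pairs = list(bigrams(text))
    if not pairs:
        return 0.0
    log_sum = sum(log_probs[a][b] for a, b in pairs)
    return math.exp(log_sum / len(pairs))
```

With a model trained this way, ordinary words should score a noticeably higher average transition probability than random keyboard mashing, because their letter pairs occur often in the training text.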

The training corpus for the gibberish detector can be found in src/textcode/data/gibberish/.

big.txt contains the main source of valid words that the gibberish detector model is trained on.

good.txt and bad.txt are used to determine the average transition probability threshold: any string whose average transition probability falls below this threshold is classified as gibberish.
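One plausible way to derive such a threshold is sketched below. The midpoint rule, and the names pick_threshold and is_gibberish, are assumptions for illustration and may differ from what ScanCode does. The inputs are the average transition probabilities already computed for each line of the good and bad files.

```python
def pick_threshold(good_probs, bad_probs):
    """Return a cutoff halfway between the worst good and the best bad score.

    good_probs and bad_probs are average transition probabilities for the
    known-good and known-gibberish samples. The midpoint rule is an
    assumption; it only works when the two groups do not overlap.
    """
    lowest_good = min(good_probs)
    highest_bad = max(bad_probs)
    if lowest_good <= highest_bad:
        raise ValueError("good and bad samples overlap; no clean threshold")
    return (lowest_good + highest_bad) / 2


def is_gibberish(avg_prob, threshold):
    """Classify a string as gibberish when its average falls below the cutoff."""
    return avg_prob < threshold
```

Under this scheme, adding harder examples to either file moves the cutoff: a low-scoring good sample pulls the threshold down, and a high-scoring bad sample pushes it up.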

Usage: scancode-train-gibberish-model [OPTIONS]

Quick Reference

--big FILE

Text file containing main training corpus for the gibberish detector

--good FILE

Text file containing text considered to be not gibberish (good)

--bad FILE

Text file containing text considered to be gibberish (bad)

-h, --help

Show this message and exit.


--big option

The --big option allows the user to use a different text file to train the gibberish detector model.

--good option

The --good option allows the user to use a different text file containing strings considered to be valid copyrights. This option is used to adjust the average transition probability threshold that determines whether or not a string is gibberish.

--bad option

The --bad option allows the user to use a different text file containing strings considered to be invalid copyrights. This option is used to adjust the average transition probability threshold that determines whether or not a string is gibberish.