ChroniclItaly 2.0 is the tagged version of ChroniclItaly (Viola 2018), a corpus of Italian-language newspapers published in the USA between 1898 and 1920. The corpus includes seven Italian-language newspapers published in California, Massachusetts, Pennsylvania, Vermont, and West Virginia: L’Italia, Cronaca sovversiva, La libera parola, The patriot, La ragione, La rassegna, and La sentinella del West Virginia. The collection totals 4,810 issues and 16,624,571 words and gathers the front pages of all issues. It was collected from Chronicling America (https://chroniclingamerica.loc.gov/newspapers/), an Internet-based, searchable database of U.S. newspapers published from 1789 to 1963, made available by the Library of Congress. ChroniclItaly 2.0 has been tagged for entities using a sequence tagging tool (Riedl and Padó 2018) implemented in TensorFlow. The Italian language model of the sequence tagging tool was trained on I-CAB (Italian Content Annotation Bank), an open access corpus annotated for entities.
The tags are:
LOC -> Location
GPE -> Geo-political entity
PER -> Person
ORG -> Organization
The output is arranged in columns as follows: the first column is the input word; the second column is the pre-processed, lower-cased word; the third column contains a flag indicating whether the word was seen during training (KNOWN) or not (UNKNOWN). If labels are assigned to the input file, these will appear in the third column. The last column contains the predicted tags. The no-entity tag is O. Because some entities (like Stati Uniti) span multiple words, the tagging scheme distinguishes between the beginning (tag B-...) and the inside of an entity (tag I-...).
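The column layout and B-/I- scheme described above can be read back into whole entities with a few lines of Python. The following is only a minimal sketch under stated assumptions: it assumes the columns are separated by whitespace (the actual delimiter in the files may differ), it takes the first column as the word and the last column as the predicted tag, and the function name extract_entities and the sample lines are hypothetical illustrations rather than part of the corpus or the tagging tool.

```python
from typing import Iterable, List, Tuple


def extract_entities(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity_text, entity_type) pairs.

    Assumes whitespace-separated columns, with the input word in the
    first column and the predicted tag in the last column.
    """
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    current_type = None

    def flush() -> None:
        # Store the entity collected so far (if any) and reset the buffer.
        nonlocal tokens, current_type
        if tokens:
            entities.append((" ".join(tokens), current_type))
        tokens, current_type = [], None

    for line in lines:
        cols = line.split()
        if len(cols) < 2:
            # Blank or malformed line: close any open entity.
            flush()
            continue
        word, tag = cols[0], cols[-1]
        if tag.startswith("B-"):
            # Beginning of a new entity.
            flush()
            tokens, current_type = [word], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == current_type:
            # Continuation of the entity currently being collected.
            tokens.append(word)
        else:
            # "O" tag, or an I- tag without a matching B- tag: not kept.
            flush()
    flush()
    return entities


# Hypothetical sample lines in the assumed four-column layout.
sample = [
    "Gli gli KNOWN O",
    "Stati stati KNOWN B-GPE",
    "Uniti uniti KNOWN I-GPE",
    "e e KNOWN O",
    "Roma roma KNOWN B-GPE",
]
print(extract_entities(sample))
# [('Stati Uniti', 'GPE'), ('Roma', 'GPE')]
```

Reading the tag from the last column rather than from a fixed position keeps the sketch usable whether or not an extra label column is present in the file.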
Please see the GitHub repository https://github.com/lorellav/GeoNewsMiner#chroniclitaly for further information. Finally, the files are organized in two ways: by newspaper title, to allow for comparative analysis across titles, and all together. Within each title, files are arranged chronologically.