No. 1/26 (2026)

Articles

Domain linguistics resources for discovering criminal activities in Polish texts

DOI: https://doi.org/10.25312/j.10126
Published: 2026-03-25

Abstract

This article describes how text data were obtained, the methodology used to build text corpora, and the selection and definition of individual lexical units for a lexicon of crime vocabulary in Polish. The language material was developed for an IT system that supports Polish uniformed services in searching for crimes committed or planned on the Internet. The following crime categories were considered: smuggling and trafficking of drugs, cigarettes, alcohol, vehicles and machinery, weapons and explosives; trafficking in human beings and organs; trafficking and falsification of documents; sexual crimes and paedophilia. The work resulted in a collection of over three thousand words and phrases, together with a linguistic dataset of 3337 full texts from online sources. The lexicon has been adapted to the requirements of computer processing for three system modules: Definition, Context, and Translator. The linguistic material was collected from various types of anonymous forums and online advertising sites that lack content control, moderation, and administration. It has been tested and implemented in the AISearcher Border Guard System.
