Hey there! I'm really interested in understanding how plagiarism checkers operate. There are numerous tools out there like Grammarly, Quetext, Scribbr, EssayPro, and Turnitin that claim to be reliable and accurate, but I'm curious about their inner workings. How do these tools actually identify similarities between two pieces of text or code? Do they utilize techniques like hashing, fingerprinting, or maybe even machine learning to analyze meanings? Also, if I wanted to create my own plagiarism checker in Python, what would be a good approach? Have any of you developed a plagiarism detection system for coding files specifically, not just essays? I'd love to hear your thoughts and advice! Thanks!
4 Answers
If I were building a simple plagiarism checker, I’d write some code to compare two files and keep track of identical text segments over a certain length. This would catch people who just copy and paste, but it would be less effective against rephrased text. It's a pretty straightforward project that anyone with a basic CS background could tackle! Just my quick brainstorming on the matter.
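To make the "identical segments over a certain length" idea concrete, here is a minimal sketch using `difflib` from the standard library. The function name `shared_segments` and the `min_words` threshold are made up for illustration. It compares word by word, and it only reports matches that appear in the same order in both texts, so a shuffled copy could slip through.

```python
from difflib import SequenceMatcher


def shared_segments(text_a, text_b, min_words=5):
    """Return runs of at least `min_words` consecutive words found in both texts.

    Hypothetical helper: lowercases and splits on whitespace, so punctuation
    attached to a word still counts as part of that word.
    """
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # autojunk=False turns off difflib's heuristic that ignores very common items.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        " ".join(words_a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog near the river bank"
    suspect = "yesterday the quick brown fox jumps over a fence"
    print(shared_segments(original, suspect, min_words=4))
```

Raising `min_words` cuts down on false positives from common phrases, at the cost of missing short copied fragments.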
For code, I’d generate an abstract syntax tree (AST) for each program, rename all variables to standard names, and then compare the trees' structure for similarity. I might also apply something like Levenshtein distance to individual lines to measure how closely they match. Check out Google Scholar; there's a wealth of research on this topic that might inspire your approach!
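A rough sketch of both ideas for Python source, using only the standard library. The names `_Normalizer`, `normalized_dump`, `structural_similarity`, and `levenshtein` are made up for this example. It renames identifiers in order of first appearance (builtins included, which is fine as long as it is done consistently) and leaves attribute names alone. Comparing the dumped trees character by character is slow on big files, so treat this as a starting point rather than a finished tool.

```python
import ast
from difflib import SequenceMatcher


class _Normalizer(ast.NodeTransformer):
    """Replace identifiers with placeholders (v0, v1, ...) in order of first use."""

    def __init__(self):
        self.names = {}

    def _canon(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        self.generic_visit(node)  # also normalizes any annotation
        return node

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)
        return node


def normalized_dump(source):
    """Parse source and return a text dump of the AST with names normalized."""
    tree = _Normalizer().visit(ast.parse(source))
    return ast.dump(tree)  # line/column info is left out by default


def structural_similarity(src_a, src_b):
    """Similarity in [0, 1] between the two normalized AST dumps."""
    a, b = normalized_dump(src_a), normalized_dump(src_b)
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def levenshtein(a, b):
    """Edit distance between two strings (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    one = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
    two = "def add_all(items):\n    acc = 0\n    for it in items:\n        acc += it\n    return acc\n"
    print(structural_similarity(one, two))
    print(levenshtein("kitten", "sitting"))
```

Renaming variables alone does not change the normalized tree, so two submissions that differ only in naming come out identical. Reordering statements or swapping a `for` loop for a `while` loop will lower the score.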
I think Harvard's CS50 has a plagiarism checker on its GitHub page that they use themselves (compare50, if I remember right), plus there are AI tools designed for code review. Those might be worth checking out if you're interested in plagiarism detection for code.
To create a solid plagiarism checker, I’d start by building a database of existing works (think libraries, Wikipedia, and various online resources). Then I'd cross-reference student submissions line by line against that database, looking for similarities. Here’s a rough breakdown of the approach:
1. Compare word overlap to flag potential issues.
2. Identify phrases or sentences that are too close to a source.
3. Teach the program to differentiate between plagiarized text and properly cited quotations.
4. Continuously refine the process for better accuracy.
This is a bit more challenging for code, since many problems have a single correct solution, but for unique projects you can definitely spot copied work.
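One way the database lookup could be sketched is with overlapping word n-grams ("shingles"), which is a swapped-in technique and not exactly the line-by-line comparison described above. Everything here (`shingles`, `build_index`, `check_submission`, the `k` parameter, the in-memory dict standing in for a real database) is an assumption for illustration. It only covers the phrase-matching part and does nothing about telling citations apart from copying.

```python
import re
from collections import defaultdict


def shingles(text, k=5):
    """Return the set of all k-word sequences in the text (lowercased)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def build_index(corpus, k=5):
    """Map each shingle to the set of source ids that contain it.

    `corpus` is a dict of source id -> text, standing in for a real database.
    """
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for sh in shingles(text, k):
            index[sh].add(doc_id)
    return index


def check_submission(text, index, k=5):
    """Return {source id: fraction of the submission's shingles found in that source}."""
    subs = shingles(text, k)
    if not subs:
        return {}
    hits = defaultdict(int)
    for sh in subs:
        for doc_id in index.get(sh, ()):
            hits[doc_id] += 1
    return {doc_id: count / len(subs) for doc_id, count in hits.items()}


if __name__ == "__main__":
    corpus = {
        "wiki": "The mitochondria is the powerhouse of the cell and produces energy",
    }
    index = build_index(corpus, k=4)
    essay = "As we know the mitochondria is the powerhouse of the cell"
    print(check_submission(essay, index, k=4))
```

For a large corpus you would probably store a hash of each shingle instead of the text itself and keep the index on disk, but the lookup logic stays the same.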

Sounds interesting! I'll definitely look into that, thanks!