I'm configuring DLP, labeling, and trainable classifiers at my workplace in a Microsoft GCC High environment. The issue is that I'm struggling to train the "CUI" classifiers effectively because we don't have enough actual CUI documents; it seems I need at least 50 positive and 50 negative samples. I've tried generating fake data, but that hasn't worked. Are there any sysadmins or Information Protection Engineers who have tackled this? What steps did you take to set up trainable classifiers without having actual CUI documents?
3 Answers
I don’t have a solid answer, but I'm definitely interested in hearing how this works out for you! Just out of curiosity, why are you trying to set up these classifiers without actual CUI? Is it because your company hasn't provided the documents, or because it just doesn't have enough of them? A few details would really help!
Yeah, it's tough! For compliance, you might be in a bind without enough documents.
You might have some luck asking a contracting officer for sample documents if you're working on any active CUI contracts. They usually have samples that can be used for training. Just a heads-up, though, we've experienced mixed results with the classifiers for CUI info in the past.
Good point! I’ve read that the classifiers can be hit or miss. I think I’ll try reaching out to the CO for samples.
Training classifiers for CUI is tricky because the definition is so broad: NARA lists around 125 categories. Start by limiting SharePoint access to those who need it for CUI, set a default label, and prevent users from changing it. I recommend running auto-label policies in simulation mode first to gain insight into your data. A helpful resource is Summit7’s steps on Microsoft 365 for CMMC. I’m dealing with a similar issue for a client, and it’s definitely a challenge!
I put the DLP in test mode to see where things stand. I have a restricted SharePoint setup for CUI, but to be honest, I’m concerned about my job since we're bringing in an MSP for a lot of this Microsoft work.

We’re aiming for CMMC Level 2 certification. I’m using DLP to identify where our CUI documents are located. We do have some actual CUI documents, but not the 50 required for training.