I'm building a citation generator within a JavaScript application and I'm on the lookout for reliable methods to fetch citation metadata for various types of URLs. I'm targeting different sources like scholarly articles, news websites, blogs, government sites, and direct PDF links. Ideally, I'd like to receive responses in either CSL-JSON or BibTeX format, and possibly some styled citations as well. One of my main concerns is ensuring I don't end up with missing or incorrect authors and dates. What's the most dependable solution you've found—do you prefer a paid API, an open-source library, or perhaps a combination of web scraping, DOI lookups, and PDF parsing? Also, are there any JavaScript libraries you would recommend for this?
3 Answers
Zotero is a great tool for this. It's what Wikipedia uses with their Citoid service. You can find the translation server here: https://github.com/zotero/translation-server.
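In case it helps, here is a rough sketch of calling a self-hosted translation-server from JavaScript. It assumes the server is on its default local port (1969) and uses the `/web` and `/export` endpoints as I understand them from the README, so check that against the version you deploy. The helper names (`buildWebRequest`, `buildExportRequest`, `citeUrl`) are made up for illustration.

```javascript
// Assumption: a self-hosted translation-server on its default local port.
const TRANSLATION_SERVER = "http://127.0.0.1:1969";

// Build the request that asks the server to translate a web page URL.
function buildWebRequest(pageUrl, base = TRANSLATION_SERVER) {
  return {
    url: `${base}/web`,
    init: {
      method: "POST",
      headers: { "Content-Type": "text/plain" },
      body: pageUrl,
    },
  };
}

// Build the request that converts Zotero items into another format
// (for example "csljson" or "bibtex").
function buildExportRequest(items, format = "csljson", base = TRANSLATION_SERVER) {
  return {
    url: `${base}/export?format=${encodeURIComponent(format)}`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(items),
    },
  };
}

// fetchImpl is injectable so the flow can be exercised without a network.
async function citeUrl(pageUrl, format = "csljson", fetchImpl = fetch) {
  const web = buildWebRequest(pageUrl);
  const webRes = await fetchImpl(web.url, web.init);
  if (webRes.status === 300) {
    // The page lists several items; choosing one is not handled here.
    throw new Error("Multiple items found; a selection step is needed");
  }
  if (!webRes.ok) throw new Error(`Translation failed: HTTP ${webRes.status}`);
  const items = await webRes.json();

  const exp = buildExportRequest(items, format);
  const expRes = await fetchImpl(exp.url, exp.init);
  if (!expRes.ok) throw new Error(`Export failed: HTTP ${expRes.status}`);
  return expRes.text(); // raw CSL-JSON or BibTeX text
}
```

Usage would be something like `citeUrl("https://example.org/article", "bibtex")`. Pages without a matching translator can come back with sparse metadata, so it is worth checking authors and dates on the result.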
For formatting citations, you might want to check out citeproc-js. For actually retrieving the citation data, though, you'll probably have to resort to some web scraping.
Thanks for the formatting library recommendation! That actually helps a lot.
The best approach would be a combined pipeline rather than relying solely on a single JS library. Here’s a quick workflow:
1. Use Zotero Translators via the Zotero Translation Server for general web pages.
2. If you have a DOI, PMID, or ISBN, normalize it and enrich it through a registry lookup. For DOIs, content negotiation against doi.org returns CSL-JSON or BibTeX from Crossref or DataCite.
3. For direct PDFs, consider using GROBID to extract metadata, authors, and DOIs, and then export in BibTeX or TEI.
4. If you want a single endpoint that takes a URL and returns citation data, Wikimedia Citoid (either hosted or self-hosted) is a solid choice, as it leverages Zotero translators too.
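The routing part of that workflow could be sketched roughly like this. It only decides which path an input should take and builds the DOI content negotiation request. The translation-server and GROBID calls are left out, and every function name here is hypothetical. The DOI regex is a heuristic, not a full validator (a query string after the DOI would be swallowed, for example). `missingFields` is a small check for the missing author and date problem you mentioned.

```javascript
// Heuristic match for DOIs of the common "10.<registrant>/<suffix>" shape.
const DOI_PATTERN = /\b10\.\d{4,9}\/[^\s"'<>]+/i;

// Accept headers used with DOI content negotiation.
const ACCEPT = {
  csljson: "application/vnd.citationstyles.csl+json",
  bibtex: "application/x-bibtex",
  apa: "text/x-bibliography; style=apa", // pre-formatted citation text
};

// Pull a DOI out of a URL or free text; returns null when none is found.
function extractDoi(input) {
  let text = input;
  try { text = decodeURIComponent(input); } catch { /* keep the raw input */ }
  const match = text.match(DOI_PATTERN);
  if (!match) return null;
  // Drop trailing punctuation and normalize case (DOIs are case-insensitive).
  return match[0].replace(/[.,;)\]]+$/, "").toLowerCase();
}

// Decide which branch of the pipeline should handle the input.
function chooseStrategy(input) {
  if (extractDoi(input)) return "doi"; // registry lookup
  let pathname = input;
  try { pathname = new URL(input).pathname; } catch { /* not a full URL */ }
  if (/\.pdf$/i.test(pathname)) return "pdf"; // hand off to a PDF extractor such as GROBID
  return "web"; // hand off to translation-server or Citoid
}

// Build the doi.org request for a given output format.
function buildDoiRequest(doi, format = "csljson") {
  const path = encodeURIComponent(doi).replace(/%2F/g, "/");
  return {
    url: `https://doi.org/${path}`,
    init: { headers: { Accept: ACCEPT[format] }, redirect: "follow" },
  };
}

// Report which of the fields that tend to go missing are absent from a CSL-JSON item.
function missingFields(item) {
  const missing = [];
  if (!Array.isArray(item.author) || item.author.length === 0) missing.push("author");
  const issued = item.issued || {};
  const parts = issued["date-parts"];
  const hasParts = Array.isArray(parts) && Array.isArray(parts[0]) && parts[0].length > 0;
  if (!hasParts && !issued.raw && !issued.literal) missing.push("issued");
  return missing;
}
```

With something like this, a DOI found anywhere in the URL takes priority over scraping, and any result with a non-empty `missingFields` list can be passed to the next source in the chain rather than shown to the user as-is.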
That's incredibly helpful, thank you!

Thanks for the suggestion! I tried Zotero before but had some trouble. Any other tools you think might work?