As many know, Microsoft Academic is shutting down. Some people may not have heard of it or, if they have, may not understand the service it provided. It was by far the largest source of free/open scholarly metadata. This data powered, directly or indirectly, many of the sites academics use today. There have been a number of blog posts covering “what’s next” (1, 2), but none of them get into the details of where tool builders should turn after MAG. So this is a post attempting to do that with the two major players currently out there: The Lens and Semantic Scholar. The bulk of the test is done by randomly selecting 100,000 papers from the MAG corpus, querying each of the services, and comparing the results.
Deciding Who to Test
I know there are others out there like Crossref, Dimensions, CORE, etc., but for my purposes I want:
- The most data possible. This means I am looking at a data aggregator that blends in data from sources like Crossref, arXiv, etc.
- It needs to be free. Inciteful is free and I can only keep it that way if my data is free. (Dimensions is paid)
- I need either bulk data dumps or insanely high API limits (so I can download each paper individually). (CORE has low API limits)
- It needs to be kept up to date. Data dumps don’t help if they are a year old. (CORE’s data dumps are old)
Given the above criteria, as of now, The Lens and Semantic Scholar are the only ones that fit the bill. If you know of any others that might work, let me know and I’ll try to add them here.
There have been rumblings about OpenAlex, but unfortunately, as of this writing, it is not yet live.
Evaluation Criteria
For the two that made it into the ring, I’m going to evaluate them on a few different criteria:
- Overall Coverage
- Data Structure (how are the authors, affiliations, etc structured?)
- Data Enrichment (citation contexts, external ids, etc)
- Updated Data Accessibility
- API features (search endpoints, etc)
These criteria are just what’s important for Inciteful, YMMV.
Data Sources
There are a ton of different metadata providers out there: Crossref, Dimensions, CORE, OpenCitations, the publishers themselves, etc. Both The Lens and Semantic Scholar pull data from many of these sources. According to The Lens’s website, they get data from the following sources:
It should be noted that The Lens started off as a patent search product that expanded into academic literature, hence places like the USPTO and WIPO. As far as I’m aware, they are also the source of all patent data coming into Microsoft Academic.
According to Semantic Scholar they get their data from the following sources:
It seems like Semantic Scholar integrates more with primary (read: publisher) sources of data than The Lens, which is more of an “aggregator of aggregators”.
1. Overall Coverage
Let’s start by discussing what “coverage” actually means. Coverage in this context can mean a few different things:
- The most items in the index
- The most “academic” papers in the index
- The most citations in the index
- Etc.
For my purposes, I’m most interested in a combination of 2 and 3. I don’t care if the index has a ton of stuff that is not academic in nature, and given the service Inciteful provides, I want the papers that are in the index to have the best citation coverage possible. Also, while patent data from MAG/The Lens is currently in my index and is nice to have, it doesn’t seem terribly important to my users.
The Test
The main part of the test is going to be evaluating the coverage portion. Being data aggregators, each of them gets its data from a different set of sources and integrates them into a coherent database in its own way. I am testing their “coverage” by randomly selecting 100,000 MAG ids out of the data dump from 2021-08-30, then pulling the data from MAG, The Lens, and Semantic Scholar using their respective APIs. From there I am comparing the results from each service, focusing specifically on whether the paper was found and, if so, on the citation and reference counts. The data was pulled from the APIs on 2021-09-29. Since Inciteful is most concerned with making connections between papers, I will be focusing on citations and drawing conclusions about the services based on that coverage. You may read the results differently.
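If you want to replicate something similar, a minimal sketch of the Semantic Scholar side of the pull looks like the code below. The endpoint and the `MAG:` id prefix are from their Graph API docs; the id file name is a placeholder, and The Lens side is analogous but goes through their authenticated API.

```python
import random
import time

import requests

# Placeholder file: one MAG id per line, exported from the 2021-08-30 dump.
with open("mag_ids.txt") as f:
    all_ids = [line.strip() for line in f if line.strip()]

sample = random.sample(all_ids, 100_000)
results = []

for mag_id in sample:
    # The Graph API accepts MAG ids via the "MAG:" prefix.
    resp = requests.get(
        f"https://api.semanticscholar.org/graph/v1/paper/MAG:{mag_id}",
        params={"fields": "citationCount,referenceCount"},
    )
    if resp.status_code == 200:
        paper = resp.json()
        results.append((mag_id, paper.get("citationCount"), paper.get("referenceCount")))
    else:
        results.append((mag_id, None, None))  # not found / filtered out
    time.sleep(0.05)  # be polite to the rate limit
```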
There is also a big hole in this test: the academic literature that never made it into MAG. A future test could include randomly pulling data from other sources such as Crossref, but for now, since MAG is the largest database outside of Google Scholar, this was the best I could do without adding a bunch of extra work.
Summary Data
I’m going to drop a bunch of numbers so you can draw your own conclusions. I’ve also posted the source sqlite database and related queries so you can do your own analysis.
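For example, a query in the spirit of the ones I used (the table and column names here are my guesses at the schema; the posted queries have the real ones):

```python
import sqlite3

con = sqlite3.connect("mag_comparison.db")  # placeholder file name

# Roughly the "Has Cit & Non-Pat." cut: non-patent papers where at least
# one service reports citation data. Schema is hypothetical.
row = con.execute(
    """
    SELECT COUNT(*),
           SUM(lens_found), SUM(ss_found),
           SUM(lens_cits), SUM(ss_cits), SUM(incite_cits)
    FROM papers
    WHERE doc_type != 'Patent'
      AND (lens_cits > 0 OR ss_cits > 0 OR incite_cits > 0)
    """
).fetchone()
print(row)
```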
The first table is simple summary data cut a few different ways. The first column of data is all of the items found in Inciteful. It doesn’t equal 100k exactly because I did the random number generation from the ID database I maintain. This database contains all of the historical IDs as well as the current ones, and over time MAG drops items from its DB for various reasons. So in this instance, ~3.7% of the IDs in my database had been dropped by MAG. The second and third columns filter results to only those papers found by The Lens and Semantic Scholar respectively. The fourth column filters down to papers which have any sort of citation data from any source. The final column is non-patent papers with citation data, as those are the papers I’m most interested in.
If you are reading this on a mobile device, be aware that the tables can scroll horizontally.
| | Incite Found | Lens Found | SS Found | Has Cit Data | Has Cit & Non-Pat. |
|---|---|---|---|---|---|
| Total Papers | 96,331 | 72,102 | 68,451 | 50,929 | 38,674 |
| lens_found | 72,097 | 72,102 | 64,986 | 37,478 | 37,478 |
| ss_found | 66,733 | 64,986 | 68,451 | 37,348 | 37,260 |
| incite_found | 96,331 | 72,097 | 66,733 | 50,046 | 37,791 |
| lens_cits | 622,631 | 622,632 | 611,576 | 622,632 | 622,632 |
| ss_cits | 702,773 | 697,185 | 730,927 | 730,927 | 730,347 |
| incite_cits | 659,948 | 587,171 | 580,993 | 659,948 | 591,214 |
| lens_refs | 647,975 | 647,977 | 619,992 | 647,977 | 647,977 |
| ss_refs | 784,266 | 775,688 | 802,969 | 802,969 | 802,676 |
| incite_refs | 683,193 | 602,020 | 582,488 | 683,193 | 610,618 |
| lens_not_ss | 7,114 | 7,116 | 0 | 1,313 | 1,313 |
| ss_not_lens | 1,750 | 0 | 3,465 | 1,183 | 1,095 |
| incite_more_cits_than_lens | 1,821 | 1,821 | 1,782 | 1,821 | 1,821 |
| lens_more_cits_than_incite | 9,897 | 9,898 | 9,745 | 9,898 | 9,898 |
| incite_more_refs_than_lens | 1,483 | 1,483 | 1,421 | 1,483 | 1,483 |
| lens_more_refs_than_incite | 10,892 | 10,893 | 10,528 | 10,893 | 10,893 |
| incite_more_cits_than_ss | 4,130 | 4,055 | 4,130 | 4,130 | 4,122 |
| ss_more_cits_than_incite | 17,115 | 16,994 | 17,868 | 17,868 | 17,856 |
| incite_more_refs_than_ss | 3,658 | 3,609 | 3,658 | 3,658 | 3,642 |
| ss_more_refs_than_incite | 14,088 | 13,957 | 14,674 | 14,674 | 14,672 |
| lens_more_cits_than_ss | 6,971 | 6,972 | 6,972 | 6,972 | 6,972 |
| ss_more_cits_than_lens | 14,795 | 14,795 | 14,795 | 14,795 | 14,795 |
| lens_more_refs_than_ss | 7,640 | 7,640 | 7,640 | 7,640 | 7,640 |
| ss_more_refs_than_lens | 12,872 | 12,873 | 12,873 | 12,873 | 12,873 |
Paper Coverage
On a surface level it’s clear that there is a big gap between the number of papers which Inciteful found and the others. The data is below but, to make a long story short, it is because MAG includes patents whereas The Lens and SS do not (The Lens does, but through a different API).
| docType | All # | All Cits | All Refs | Missing From Lens # | Missing From Lens Cits | Missing From Lens Refs | Missing From SS # | Missing From SS Cits | Missing From SS Refs |
|---|---|---|---|---|---|---|---|---|---|
| Blank | 31,202 | 28,119 | 88,227 | 180 | 105 | 384 | 5,137 | 2,906 | 4,130 |
| Book | 1,687 | 22,386 | 3,282 | 2 | 0 | 0 | 134 | 286 | 0 |
| BookChapter | 1,436 | 1,728 | 6,871 | 6 | 0 | 35 | 83 | 35 | 67 |
| Conference | 1,891 | 21,872 | 20,933 | 7 | 69 | 83 | 101 | 61 | 910 |
| Dataset | 39 | 5 | 0 | 0 | 0 | 0 | 18 | 1 | 0 |
| Journal | 32,453 | 510,423 | 460,603 | 224 | 2,699 | 4,825 | 1,257 | 5,483 | 19,033 |
| Patent | 23,679 | 68,734 | 72,575 | 23,679 | 68,734 | 72,575 | 22,181 | 68,535 | 72,237 |
| Repository | 1,776 | 5,747 | 18,708 | 122 | 1,170 | 3,271 | 429 | 1,630 | 3,792 |
| Thesis | 2,168 | 934 | 11,994 | 14 | 0 | 0 | 258 | 18 | 536 |
| Total | 96,331 | 659,948 | 683,193 | 24,234 | 72,777 | 81,173 | 29,598 | 78,955 | 100,705 |
Speculating a bit, it seems as though The Lens tries to keep their data as close as possible to MAG and maybe does not get involved in disambiguation, etc. Looking at the second column of the summary table seems to support this: any paper which was found by The Lens was also found by Inciteful (with a few exceptions). Analyzing when the non-patent papers missing from The Lens were created:
| YEAR | # |
|---|---|
| NULL | 2 |
| 2016 | 146 |
| 2017 | 18 |
| 2018 | 12 |
| 2019 | 26 |
| 2020 | 97 |
| 2021 | 254 |
A large portion of these are recent, so we can possibly chalk those up to a timing issue where The Lens has not yet updated their database. The rest are a rounding error for our purposes.
Semantic Scholar, on the other hand, has a lot more missing (~7,000 papers) after removing patents. When I inquired about this, the response I got was that, upon ingestion, they have a filter that looks for “non-scientific” or gray literature and stops those papers from entering the index. So basically they have stricter criteria than MAG for what constitutes academic literature, which makes sense and is actually a good thing: once MAG goes away, they will already have an opinion as to what to index when they encounter something new.
Semantic Scholar is also doing other things like paper disambiguation, which made tracking everything down a bit more complicated. For example, with this paper from arXiv, MAG indexes both the actual article as well as the conference proceeding. So it’s possible they are also doing other disambiguation which I missed. In line with them maintaining their own index rather than mirroring MAG’s (like Inciteful does), it looks as though they have “found” ~1,700 papers that Inciteful did not, as a result of those papers being dropped from MAG.
Citation Coverage
I’m most interested in the last column of the summary table (replicated in part below). I want to look at papers which have some sort of citation data associated with them. Papers which don’t have any citations or references are pretty useless to Inciteful as I cannot build a graph without them. Ideally the more citations the better.
| | Has Cit & Non-Pat. |
|---|---|
| Total Papers | 38,674 |
| lens_found | 37,478 |
| ss_found | 37,260 |
| incite_found | 37,791 |
| lens_cits | 622,632 |
| ss_cits | 730,347 |
| incite_cits | 591,214 |
| lens_refs | 647,977 |
| ss_refs | 802,676 |
| incite_refs | 610,618 |
The first thing that jumps out is that Semantic Scholar has 17% more citations and 23% more references than The Lens, which in turn has more than Inciteful. The second point is particularly interesting because it means The Lens is not just mirroring the citation data it gets from MAG, as it seems to be doing for paper data; it’s actually enriching it from other sources. Semantic Scholar is enriching the data as well.
Digging into the actual comparison between The Lens and Semantic Scholar:
| | Has Cit & Non-Pat. |
|---|---|
| lens_more_cits_than_ss | 6,972 |
| ss_more_cits_than_lens | 14,795 |
| lens_more_refs_than_ss | 7,640 |
| ss_more_refs_than_lens | 12,873 |
There are some instances where The Lens outperforms Semantic Scholar, but almost twice as often, it’s Semantic Scholar that has more citation/reference data.
2. Data Structure
For the data structure piece I am looking into how they are presenting the standard data that is associated with a paper. That means authors, affiliations, URLs, etc. I’ll try to dive into each here. In the next section I’ll cover data enrichments that are not found in MAG.
Paper Data
To start off I’ll focus on the paper specific data and present it in table form for easier consumption.
| | The Lens | Semantic Scholar |
|---|---|---|
| Open Access | Yes, including license and color | Only Yes/No |
| URLs | Yes, multiple URLs per paper | Only to Semantic Scholar |
| Abstract | Yes | Yes |
| Source | Yes, including page numbers, etc. | Only name of journal/conference |
| Publication Type | Yes | No |
| External IDs (PubMed, arXiv, etc.) | Yes | Yes |
| Other Info | Page numbers, volumes, and issues | No |
| Date | Date published and date inserted | Year published |
Clearly The Lens offers more basic data in their response than Semantic Scholar does. It’s enough to construct a citation if that’s the type of service you are looking to build. Semantic Scholar, in comparison, is relatively spartan.
A big callout I want to make here is that through Semantic Scholar’s API, the only URL you get for a paper is the one to Semantic Scholar’s site. While I understand why they do it, to drive more eyeballs to their site from sites that use their data, it does add friction between either Semantic Scholar and the site/tool builder or between the site and the user. There are easy ways around this for papers with external IDs like a DOI, PubMed ID, or arXiv ID.
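For example, a minimal sketch of that workaround, keyed off the `externalIds` object the Graph API returns (the fallback order is just my preference):

```python
def outbound_url(external_ids):
    """Build a link away from Semantic Scholar using a paper's external ids."""
    if external_ids.get("DOI"):
        return f"https://doi.org/{external_ids['DOI']}"
    if external_ids.get("ArXiv"):
        return f"https://arxiv.org/abs/{external_ids['ArXiv']}"
    if external_ids.get("PubMed"):
        return f"https://pubmed.ncbi.nlm.nih.gov/{external_ids['PubMed']}/"
    return None  # fall back to the Semantic Scholar URL
```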
Here are the links to the data structures for those that are interested in digging deeper.
Author Data
For author data, on the other hand, Semantic Scholar seems to offer a bit more.
| | The Lens | Semantic Scholar |
|---|---|---|
| Name | Separate first/last name | Full name and aliases |
| Affiliation Data | Yes, including GRID/ROR | Yes, including GRID/ROR |
| Structured IDs | Yes, the author’s MAG ID | Yes, using own disambiguation |
| URL | No | Semantic Scholar URL and author’s homepage |
Other Data
In terms of “other data”, The Lens offers a bunch of other information, like publisher information, ISSNs, and MeSH terms, that Semantic Scholar does not. None of it is useful for my purposes, but it could be for yours.
3. Data Enrichment
To me a “data enrichment” is something that the service constructs and serves up as part of their API, above and beyond what MAG provides.
The Lens enriches their data by pulling in structured data from other sources. Where possible they identify and present:
- Funding sources
- Clinical trials
- Chemicals
- Patent citations
- Author’s ORCID IDs
- Abstracts
Semantic Scholar’s data enrichment focuses more on text mining and NLP. I think Semantic Scholar has a leg up here given their partnerships with large publishers, which in turn give them access to the full text of a lot of articles that are not Open Access.
- Identify “influential citations”
- Author disambiguation
- Citation contexts (the text surrounding the citation in a paper)
- Citation intents
- SPECTER embeddings (a vector representation of the document)
- Author’s ORCID IDs
- Abstracts
The types of enrichments available vary pretty widely between the two so, depending on your use case, one may work better for you.
4. Updated Data Accessibility
This one is a pretty specific use case, but it’s something that most people ingesting the data on a regular basis will need to think about: how do I ensure I have the most up-to-date data? For Inciteful, I need to pre-process the data and ingest it into my custom data store, which means I can’t really use the API for the bulk of my requirements.
Data Dumps
To start, both do regular data dumps. Semantic Scholar has monthly downloads available to everyone. I was told that The Lens will do data dumps but you have to request special access.
I don’t have access to a Lens data dump so I can’t comment on it. But the downside to using the Semantic Scholar data dump is that (as of this writing) none of the above enrichments, outside of the abstract, are actually in the data dump. You need to hit the API to get them, which is a problem: I could really beef up the functionality of the site with that data, but I need to have access to it locally. I’m not sure how they would feel about me hitting the API a couple hundred million times to get it :)
Update Data Through the API
Once you have downloaded a data dump, it would be nice to just be able to hit the API for the most recently changed data rather than download the entire dump once again. This is how the Crossref API works by default. The Lens allows you to do this in a roundabout way through their search API (see the sketch below, and more on the search API later), but Semantic Scholar does not; you can only search papers by keyword or by ID. That makes it hard to find out about new papers you’ve never seen before.
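For what it’s worth, the roundabout Lens approach looks something like the request below: an Elasticsearch range filter on a date field. I’m assuming `date_published` is the right field to filter on; there may be a better created/updated timestamp in their docs.

```python
import requests

API_KEY = "your-lens-api-key"  # placeholder

body = {
    "query": {
        # Field name is an assumption; confirm against the Lens field list.
        "range": {"date_published": {"gte": "2021-09-01"}}
    },
    "size": 100,
    "include": ["lens_id", "title", "external_ids"],
}

resp = requests.post(
    "https://api.lens.org/scholarly/search",
    json=body,
    headers={"Authorization": f"Bearer {API_KEY}"},
)
recent_papers = resp.json().get("data", [])
```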
5. API Features
Each service has taken a different approach to building their API. You can see each of their API home pages here:
Both The Lens and Semantic Scholar have API endpoints where you can query a specific paper:
- The Lens: https://api.lens.org/scholarly/{lens_id}
- Semantic Scholar: https://api.semanticscholar.org/graph/v1/paper/{paper_id}
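A quick sketch of hitting both (the Semantic Scholar call works without a key at a low rate limit; the Lens id and key are placeholders):

```python
import requests

# Semantic Scholar: ids can be a DOI, arXiv id, MAG id, etc.
ss_paper = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/DOI:10.1038/nature14539",
    params={"fields": "title,citationCount,referenceCount"},
).json()

# The Lens: authenticated, keyed by Lens id.
lens_id = "000-000-000-000-000"  # placeholder
lens_paper = requests.get(
    f"https://api.lens.org/scholarly/{lens_id}",
    headers={"Authorization": "Bearer your-lens-api-key"},
).json()
```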
From here, though, they really diverge.
The Lens
To start, you have to apply for an API key; I ended up getting one with a monthly limit of 10,000 requests. That’s fine for Inciteful because the full text search is a side feature that very few people actually use on the site. With your API key you have access to their search endpoint.
Their search endpoint is basically an exposed Elasticsearch cluster that contains their entire corpus. You can see the documentation here. In the end, there are a few outcomes from this:
- You can query in pretty much any way you want
- It’s not as fast as a purpose-built endpoint
- It’s complicated to get what you want out of it
According to their docs, there are 61 searchable fields and 13 pre-defined filters, giving you a lot of flexibility. But that doesn’t come without its own problems. For example, you can do a full text search across the entire corpus relatively quickly, but the results for multi-word queries are pretty bad because, by default, Elasticsearch does an OR boolean search on the multiple keywords. It also does not give extra weight to items which contain both words, let alone items that contain both words next to each other. On top of that, it will do a full text search across every field, including field of study, journal title, author names, etc. So be sure to specify which fields you want to search and how important each of them is.
The query I ended up with when using the full text search on Inciteful was:
```json
{
  "size": "",
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "fields": [
              "title"
            ],
            "query": "",
            "default_operator": "OR"
          }
        }
      ],
      "should": [
        {
          "query_string": {
            "fields": [
              "title^10",
              "abstract"
            ],
            "query": "",
            "default_operator": "AND"
          }
        },
        {
          "query_string": {
            "fields": [
              "title^100",
              "abstract^5"
            ],
            "query": "",
            "type": "phrase",
            "phrase_slop": 100
          }
        }
      ]
    }
  },
  "include": [
    "lens_id",
    "title",
    "abstract",
    "external_ids"
  ]
}
```
I’m not an Elasticsearch guru; I know the query can be improved, I just don’t know how. So just remember:
“With great power comes great complexity” - (I’m sure someone said that sometime)
Semantic Scholar
You can play around with the Semantic Scholar endpoint without a key; you’ll just be subject to a low rate limit. When I applied for my key, my rate limit was something like 100 requests/second. Woo hoo!
Semantic Scholar offers a few additional options outside of just the paper endpoint:
- A keyword search
- An author endpoint (more information about an author)
- An author’s papers endpoint (all of the papers written by an author)
- A paper’s citations and references endpoints (get information about all the papers citing or cited by the paper in question)
There is a bit of complexity in understanding what data you can get from which endpoint. For example, you can only get a subset of the information about a paper from the keyword search endpoint compared to what you can get from the single paper endpoint. And you can only get a subset of the information about the citations or references from the single paper endpoint; for more info, like intents and contexts, you need to hit the paper’s citations and references endpoints individually.
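To make that concrete, here is roughly how the split plays out (the `fields` parameter names are from the Graph API docs as I understand them; treat this as a sketch):

```python
import requests

BASE = "https://api.semanticscholar.org/graph/v1"
paper_id = "DOI:10.1038/nature14539"  # any supported id format works

# The single-paper endpoint: full paper metadata, but only counts for citations.
paper = requests.get(
    f"{BASE}/paper/{paper_id}",
    params={"fields": "title,abstract,externalIds,citationCount"},
).json()

# The citations endpoint: this is where contexts and intents live.
citations = requests.get(
    f"{BASE}/paper/{paper_id}/citations",
    params={"fields": "title,contexts,intents,isInfluential"},
).json()
```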
I’m not going to go through the details of each endpoint, but I will highlight a few things.
Starting with the keyword search endpoint, it’s dead simple:
https://api.semanticscholar.org/graph/v1/paper/search?query=covid+vaccination
In addition to that, the results are what you get on their own site, so you know they are good.
The author endpoint allows you to get information about a specific author and see all the papers from a specific author. But just like with the keyword search and paper endpoints, you can only see a subset of the information unless you use the “Author’s papers” endpoint. Don’t shoot the messenger.
Conclusion
Congrats if you’ve made it this far. If you read everything, I’m sure you’ve drawn your own conclusions about what is best for your use case. There are clearly pros and cons with each and there is no perfect answer.
But if I had to break out my crystal ball and say who is most likely to pick up the torch and carry it after MAG, I think it will most likely be Semantic Scholar. While there are a number of downsides to Semantic Scholar (fewer available fields, less powerful API, etc.), it seems to me like they have done the legwork to be able to keep running when MAG shuts down. At each point they have demonstrated that they have their own independent infrastructure that exists separately from MAG. They have built their own paper disambiguation, author disambiguation, citation context extraction, and citation intent analysis. They have established partnerships directly with publishers for better access to non-OA papers. They also don’t seem to guard their data as closely as The Lens (higher API limits, public data dumps, etc.).
But all that being said, the ecosystem is lucky to have both and I am very grateful for what each adds to it. None of our tools would be possible without all the hard work they have done along with the others like Crossref, OpenCitations, the I4OC, Unpaywall, ROR, ORCID, etc.
Finally, while it’s incredibly sad that MAG is shutting down, they did the world a huge service by proving the value of enriched/open metadata delivered in a consistent manner with a high level of service and professionalism. I have high hopes that the community will pick up where they left off. It will take some time for the community to reach parity with MAG, but the will is there. I look forward to seeing where the next few years take us.