Introduction
This is documentation for the OACensus tool. To jump right in and use the tool, check out the Quickstart section. More detailed documentation is in the User Guide. An overview of all the standard scrapers is in the Scrapers section, and the standard reports are documented in the Reports section. The Writing Scrapers and Reports section covers how to write your own scrapers and reports. The Use Cases section describes some example scenarios for using this tool.
The oacensus Tool
The oacensus tool consists of several configurable data scrapers and reporting tools. It is written in Python.
Quickstart
Here is an example configuration file named example-orcid.yaml:
- orcid:
orcid: 0000-0002-0068-716X
- oag
The order in which the scrapers are specified is important.
This uses the orcid scraper:
$ oacensus help -scraper orcid
orcid Scraper
Generate lists of articles for authors based on ORCID.
Settings:
cache-expires: Number of units after which to expire cache files. (default value: None)
cache-expires-units: Unit of time for cache-expires. Options are: years, months, weeks, days, hours, minutes, seconds, microseconds (default value: days)
encoding: Which encoding to use. Can be 'chardet'. (default value: None)
end-period: Period (month) in yyyy-mm format. (default value: None)
no-hash-settings: Settings to exclude from hash calculations. (default value: ['start-period', 'end-period'])
orcid: ORCID of author to process, or a list of ORCIDS. (default value: None)
orcid-data-file: File to save data under. (default value: orcid.pickle)
period: Custom setting for article 'period'. (default value: manual)
start-period: Period (month) in yyyy-mm format. (default value: None)
Followed by the oag scraper:
$ oacensus help -scraper oag
oag Scraper
Adds information from Open Article Gauge (OAG) to articles.
Requests information from the OAG API for each article in the oacensus
database which has a DOI (articles must already be populated from some
other source).
Settings:
base-url: Base url of OAG API (default value: http://oag.cottagelabs.com/lookup/)
cache-expires: Number of units after which to expire cache files. (default value: None)
cache-expires-units: Unit of time for cache-expires. Options are: years, months, weeks, days, hours, minutes, seconds, microseconds (default value: days)
encoding: Which encoding to use. Can be 'chardet'. (default value: None)
max-items: Maximum number of items in a single API request. (default value: 1000)
no-hash-settings: Settings to exclude from hash calculations. (default value: [])
repository-name: Name of OAG repository. (default value: Open Article Gauge)
This is run via:
$ oacensus run --config example-orcid.yaml --reports "personal-openness"
running orcid scraper
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/modargs/args.py", line 286, in command_module
function(**options)
File "/home/oacensus/oacensus/oacensus/commands.py", line 191, in run_command
scraper.run()
File "/home/oacensus/oacensus/oacensus/scraper.py", line 105, in run
return self.process()
File "/home/oacensus/oacensus/oacensus/scrapers/orcids.py", line 45, in process
for pub in response.publications:
File "/usr/local/lib/python2.7/dist-packages/orcid/rest.py", line 107, in publications
self._load_works()
File "/usr/local/lib/python2.7/dist-packages/orcid/rest.py", line 101, in _load_works
+ '/orcid-works', headers = BASE_HEADERS)
TypeError: cannot concatenate 'str' and 'NoneType' objects
On a successful run, a personal openness report is then generated (the run captured above failed in the orcid scraper, so no report output appears here).
User Guide
The oacensus tool runs scrapers to download and process data, and then runs reports to present the data. Scrapers are specified in a configuration file; reports are specified on the command line.
Config File Format
Config files are written in YAML and should consist of a list of the scraper aliases to be run, in order. Each scraper alias may be followed by an optional dictionary of custom settings to provide to the scraper.
Here is an example:
- orcid:
orcid: 0000-0002-0068-716X
- oag
In this case we are running the orcid scraper followed by the oag scraper, and the orcid scraper has a custom setting specified, which confusingly is itself named orcid.
You can look at the documentation for each scraper to see its available settings.
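For example, the orcid scraper's start-period and end-period settings (listed in its help output in the Quickstart) could be added alongside the orcid setting, presumably restricting the scrape to that date range (a sketch):

- orcid:
    orcid: 0000-0002-0068-716X
    start-period: '2013-01'
    end-period: '2013-12'
- oag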
Command Line Interface
The command line interface is documented via command line help.
The output from the help command is:
$ oacensus help
Available commands:
help - Prints this help message or help for individual commands, scrapers or reports.
list - List all available scrapers and reports.
run - Runs the oacensus tool.
reports - Runs additional reports using data from the last run.
Run `oacensus help -on cmd` for detailed help on any of these commands.
This lists each of the available commands.
Here is detailed help on the run command:
$ oacensus help -on run
=======================
Help for 'oacensus run'
=======================
Runs the oacensus scrapers specified in the configuration file.
This is the main command for using the oacensus tool. It reads the
configuration specified in YAML and runs the requested scrapers in order
(using data from the cache if available). Data will be stored in a sqlite3
database. After data has been processed and stored in the database, reports
may be run which will present the data.
Arguments:
cachedir - Directory to store cached scraped data.
[optional, defaults to '.oacensus/cache/']
e.g. 'oacensus run --cachedir .oacensus/cache/'
config - YAML file to read configuration from.
[optional, defaults to 'oacensus.yaml']
e.g. 'oacensus run --config oacensus.yaml'
dbfile - Name of sqlite db file.
[optional, defaults to 'oacensus.sqlite3']
e.g. 'oacensus run --dbfile oacensus.sqlite3'
profile - Whether to run in profiler (dev only).
[optional, defaults to 'False'] e.g. 'oacensus run --profile False'
progress - Whether to show progress indicators.
[optional, defaults to 'False'] e.g. 'oacensus run --progress False'
reports - Reports to run.
[optional, defaults to ''] e.g. 'oacensus run --reports '
workdir - Directory to store temp working directories.
[optional, defaults to '.oacensus/work/']
e.g. 'oacensus run --workdir .oacensus/work/'
You can run reports as part of run, but you can also run reports separately after you have executed the run command:
$ oacensus help -on reports
===========================
Help for 'oacensus reports'
===========================
Arguments:
dbfile - db file to use for reports
[optional, defaults to 'oacensus.sqlite3']
e.g. 'oacensus reports --dbfile oacensus.sqlite3'
reports - reports to run
[optional, defaults to ''] e.g. 'oacensus reports --reports '
To get a list of available scrapers or reports, use the list command:
$ oacensus help -on list
========================
Help for 'oacensus list'
========================
List the available scrapers and reports.
Here are the built-in scrapers and reports:
$ oacensus list
Scrapers:
articlescraper
biomed
crossref
crossrefjournals
csvarticles
demo
doaj
doilist
elsevier
gtr
licenses
oag
oai
orcid
pubmed
scimago
wiley
Reports:
excel
institution
personal-openness
textdump
You can get help on individual scrapers:
$ oacensus help -scraper oag
oag Scraper
Adds information from Open Article Gauge (OAG) to articles.
Requests information from the OAG API for each article in the oacensus
database which has a DOI (articles must already be populated from some
other source).
Settings:
base-url: Base url of OAG API (default value: http://oag.cottagelabs.com/lookup/)
cache-expires: Number of units after which to expire cache files. (default value: None)
cache-expires-units: Unit of time for cache-expires. Options are: years, months, weeks, days, hours, minutes, seconds, microseconds (default value: days)
encoding: Which encoding to use. Can be 'chardet'. (default value: None)
max-items: Maximum number of items in a single API request. (default value: 1000)
no-hash-settings: Settings to exclude from hash calculations. (default value: [])
repository-name: Name of OAG repository. (default value: Open Article Gauge)
Or individual reports:
$ oacensus help -report excel
excel Report
Dump all database models to a single excel workbook.
Settings:
date-format-string: Excel style date format string. (default value: D-MMM-YYYY)
filename: Name of file to write excel dump to. (default value: dump.xls)
install-dir: Location where the plugin was defined. (default value: None)
Performance and Caching
Some scrapers have to fetch a lot of data and will be slow to run. The data will be cached after the first run and re-used if the parameters are the same.
You can use the --progress option to have progress notices printed while scrapers run.
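For example, to expire cached doaj data after 30 days instead of its default of 90, you could configure the cache-expires settings shown in the scraper help (a sketch):

- doaj:
    cache-expires: 30
    cache-expires-units: days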
Scrapers
It is up to users to specify scrapers in a sensible order. Some scrapers use the database state from previously-run scrapers in the current batch to do their work. The database is reset at the start of each batch run, and entries will be re-populated from cached or just-downloaded data sources.
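For example, one sensible ordering (using scrapers documented later in this section) runs a journal scraper first, then an article scraper, then a scraper that annotates articles already in the database:

- doaj                  # journal-level openness data
- pubmed:
    search: '"Oxford University"[affiliation] AND 2012[pdat]'
- oag                   # needs articles with DOIs already present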
Open Access Data Sources
We can obtain Open Access information at the article or the journal level.
Journal-level open access data is obtained by querying a publisher’s site or an aggregation service like Directory of Open Access Journals.
Article-level open access data is obtained by querying the Open Article Gauge (OAG).
To some extent, obtaining journal-level data and then applying it to individual articles within the oacensus tool duplicates the work done by the OAG; however, querying the OAG requires a DOI for each article, and DOIs are not always available.
Article-level open access data is stored in the open_access field in the Article data model. The is_open_access() method of an Article object will use both the open_access field on the Article object and the open_access field on the associated Journal object (if any) to determine the openness of the article.
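As a rough sketch of that logic (hypothetical Python, not the actual implementation; the open_access attributes are assumed from the description above):

def is_open_access(article):
    # Hypothetical sketch: article-level data takes precedence when present.
    if article.open_access is not None:
        return article.open_access
    # Otherwise fall back to the journal-level rating, if any.
    if article.journal is not None and article.journal.open_access is not None:
        return article.journal.open_access
    return False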
Journal Scrapers
Journal scrapers should typically be run first.
The JournalScraper class implements a create_or_modify_journal method which should be used as the standard way to add new journal entries. This method looks for an existing journal with the given ISSN and, if it finds one, modifies the existing entry using only those fields specified in update_journal_fields. If there is no existing journal corresponding to the ISSN, all provided data fields are used to create a new journal entry. If a JournalList is provided, the created or modified journal is added to that list (journals can be linked to multiple journal lists).
If the desired behavior is to modify only existing journals, the add-new-journals setting can be set to False.
Unless this paradigm does not fit, the preferred approach is to use create_or_modify_journal rather than calling the Journal.create method directly.
Don’t forget to specify update_journal_fields for each JournalScraper so that oacensus knows how to handle journals which already exist. The default is an empty list, meaning that no data will be updated.
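A minimal sketch of a custom JournalScraper using this pattern (the exact signature of create_or_modify_journal and the helper method are assumptions; consult the JournalScraper source):

# Assumed import paths:
# from oacensus.scraper import JournalScraper

class MyPublisherJournals(JournalScraper):
    """Hypothetical scraper for an imaginary publisher's journal list."""
    _settings = {
        # Fields allowed to overwrite an existing journal with the same ISSN.
        'update-journal-fields': ['title', 'url'],
    }

    def process(self):
        for row in self.parse_journal_listing():  # hypothetical helper
            self.create_or_modify_journal(
                row['issn'],                                  # assumed: ISSN lookup key
                {'title': row['title'], 'url': row['url']},   # assumed: field values
            )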
BioMed Central journals
The biomed scraper creates a Journal entry for each journal published by BioMed Central.
Wiley
Creates a Journal entry for each journal published by Wiley.
Elsevier
Creates a Journal entry for each journal published by Elsevier.
DOAJ
Gets information about open access journals from DOAJ, using the new website’s CSV download option.
Scimago
Not yet complete. Returns Scimago journal ranking information.
Articles & Article Lists
Here are scrapers which create article entries, sometimes organized into lists.
Where possible, articles are assigned to journals by linking on the journal ISSN.
Pubmed scraper
The Pubmed scraper obtains articles returned from a search of Pubmed.
DOI List scraper
Creates articles from an external list of DOIs.
Article Info
Scrapers which add information to existing article entries.
Open Article Gauge
This scraper updates open access-related attributes for an article using data retrieved from the OAG API.
Crossref
Obtain information from Crossref for all articles having a DOI.
Currently, this scraper does not modify any data.
Reports
Built-in reports.
Writing Scrapers and Reports
Scrapers and reports are implemented using the cashew plugin system.
If you implement a custom scraper or report, make sure to add an import statement in the load plugins module so that Cashew will register the plugin.
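For example, if you define a new scraper in a module named myscrapers (a hypothetical name), add an import for it alongside the existing ones so the class definitions execute and Cashew registers them:

# In the load-plugins module (exact module path is an assumption):
import oacensus.scrapers.myscrapers  # imported only for plugin registration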
ORM
Oacensus uses the peewee ORM.
Article
Fields:
- date_published: Date on which article was published.
- doi: Digital object identifier for article.
- id
- journal: Journal object for journal in which article was published.
- period: Name of date-based period in which this article was scraped.
- source: Which scraper populated this information?
- title: Title of article.
- url: Web page for article information.
ArticleList
Fields:
- id
- name
- orcid
- source: Which scraper populated this information?
ArticleListMembership
Fields:
- article
- article_list
- id
- source: Which scraper populated this information?
Instance
Fields:
- article
- download_url: URL for directly downloading the article. May be tested for status and file size.
- end_date: This rating’s properties are in effect up to this date.
- expected_file_hash: Expected file checksum if article is downloadable.
- expected_file_size: Expected file size if article is downloadable.
- free_to_read: Are journal contents available online for free?
- id
- identifier: Identifier within the repository.
- info_url: For convenience, URL of an info page for the article.
- license
- repository: Repository in which this instance is deposited or described.
- source: Which scraper populated this information?
- start_date: This rating’s properties are in effect commencing from this date.
Journal
Fields:
- country: Country of publication for this journal.
- doi: DOI for journal.
- eissn: Electronic ISSN (EISSN) of journal.
- id
- iso_abbreviation
- issn: ISSN of journal. [unique]
- issn_linking
- language: Language(s) in which journal is published.
- medline_ta
- nlm_unique_id
- publisher: Publisher object corresponding to journal publisher.
- source: Which scraper populated this information?
- subject: Subject area which this journal deals with.
- title: Name of journal.
- url: Website of journal.
JournalList
Fields:
- id
- name
- source: Which scraper populated this information?
JournalListMembership
Fields:
- id
- journal
- journal_list
- source: Which scraper populated this information?
License
Fields:
- alias
- id
- source: Which scraper populated this information?
- title
- url
LicenseAlias
Fields:
- alias [unique]
- id
- license
- source: Which scraper populated this information?
OpenMetaCommon
Fields:
- end_date: This rating’s properties are in effect up to this date.
- free_to_read: Are journal contents available online for free?
- id
- license
- source: Which scraper populated this information?
- start_date: This rating’s properties are in effect commencing from this date.
Publisher
Fields:
- id
- name
- source: Which scraper populated this information?
Rating
Fields:
- end_date: This rating’s properties are in effect up to this date.
- free_to_read: Are journal contents available online for free?
- id
- journal
- license
- source: Which scraper populated this information?
- start_date: This rating’s properties are in effect commencing from this date.
Repository
Fields:
- id
- info_url: For convenience, URL of info page for the repository.
- name: Descriptive name for the repository.
- source: Which scraper populated this information?
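As an illustration, once a run has populated oacensus.sqlite3, these models can be queried with ordinary peewee syntax (a sketch; the oacensus.models import path is an assumption):

from oacensus.models import Article  # assumed import path

# List the title and journal of every article that has a DOI.
for article in Article.select().where(Article.doi.is_null(False)):
    journal_title = article.journal.title if article.journal else "(no journal)"
    print(article.title, "/", journal_title)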
Scraper Design
Scrapers work in two phases: scrape, then process. Results of the scrape phase are cached and, if no parameters have changed, re-used in subsequent calls. The scrape phase should do as much pre-processing as possible (for efficiency), but it should not do anything that depends on database state or on the ordering of scrapers. Anything which depends on state should occur in the process phase, which is not cached.
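A schematic custom scraper following this split might look like the following (a sketch; the helper methods are hypothetical and only the scrape/process division is taken from the description above):

# Assumed import paths:
# from oacensus.scraper import Scraper
# from oacensus.models import Article

class ExampleScraper(Scraper):
    def scrape(self):
        # Cached phase: fetch raw data into the working directory.
        # Must not touch the database or depend on other scrapers.
        raw = self.fetch_remote_listing()      # hypothetical helper
        with open("listing.json", "w") as f:   # path handling simplified
            f.write(raw)

    def process(self):
        # Uncached phase: read the cached files and update the database.
        # May rely on database state left by earlier scrapers in the batch.
        for record in self.parse_cached_listing():  # hypothetical helper
            Article.create(title=record["title"], source=self.alias)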
Report Design
Reports take the harvested data and present it. Reports can be of any format.
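For instance, a report that dumps article titles to CSV might look like this sketch (the Report base class location and its run() hook are assumptions based on the plugin system described above):

import csv

# Assumed import paths:
# from oacensus.report import Report
# from oacensus.models import Article

class CsvArticlesReport(Report):
    """Hypothetical report writing one row per article."""
    def run(self):
        with open("articles.csv", "w") as f:
            writer = csv.writer(f)
            writer.writerow(["title", "doi"])
            for article in Article.select():
                writer.writerow([article.title, article.doi])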
Use Cases
Institutional Open Access Census
User Story
A librarian at Oxford University wishes to understand the amount of Open Access content, as defined in different ways, in the research they publish. They first need to create a list of research articles published from Oxford University. They use PubMed and CrossRef as sources of articles that provide affiliation information to generate the list of article DOIs. For each article they then wish to ask: a) Is this in an Open Access Journal (using DOAJ) b) Does the article have an open license (OAG) and c) Is the article in one of the following repositories (PMC/EuropePMC, OpenAIRE, the Oxford Institutional repository[1]). They aim to provide a report on this once a month.
[1] Most IRs can be searched via a standard protocol, OAI-PMH. It would be reasonable to ask the user to supply the appropriate URL for the API endpoint.
PubMed Articles
We’ll retrieve a list of articles where the affiliation is Oxford University. To determine how to configure the pubmed query, we first review the docs for the pubmed scraper:
$ oacensus help -scraper pubmed
pubmed Scraper
Creates a single ArticleList and individual Article objects for all
articles returned from pubmed matching the [required] search query.
Settings:
base-url: Base url of API. (default value: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/)
cache-expires: Number of units after which to expire cache files. (default value: None)
cache-expires-units: Unit of time for cache-expires. Options are: years, months, weeks, days, hours, minutes, seconds, microseconds (default value: days)
datetype: Type of date for period filtering (default value: pdat)
delay: Time in seconds to delay between API requests. (default value: 1)
encoding: Which encoding to use. Can be 'chardet'. (default value: None)
end-period: Period (month) in yyyy-mm format. (default value: None)
filepattern: Names of files which hold data in cache. (default value: data_%04d.xml)
initial-ret-max: Maximum number of entries to return in the initial query. (default value: 5)
ncbi-db: Name of NCBI database to query. (default value: pubmed)
no-hash-settings: Settings to exclude from hash calculations. (default value: ['start-period', 'end-period'])
ret-max: Maximum number of entries to return in any single query. (default value: 10000)
search: Search query to include. (default value: None)
start-period: Period (month) in yyyy-mm format. (default value: None)
We only need to specify the search parameter:
- pubmed:
search: '"Oxford University"[affiliation] AND 2012[pdat]'
CrossRef Articles
TBD.
OAG Licensing Information
The OAG scraper retrieves OAG metadata for any article in the database which has a DOI:
$ oacensus help -scraper oag
oag Scraper
Adds information from Open Article Gauge (OAG) to articles.
Requests information from the OAG API for each article in the oacensus
database which has a DOI (articles must already be populated from some
other source).
Settings:
base-url: Base url of OAG API (default value: http://oag.cottagelabs.com/lookup/)
cache-expires: Number of units after which to expire cache files. (default value: None)
cache-expires-units: Unit of time for cache-expires. Options are: years, months, weeks, days, hours, minutes, seconds, microseconds (default value: days)
encoding: Which encoding to use. Can be 'chardet'. (default value: None)
max-items: Maximum number of items in a single API request. (default value: 1000)
no-hash-settings: Settings to exclude from hash calculations. (default value: [])
repository-name: Name of OAG repository. (default value: Open Article Gauge)
We don’t need to set any parameters:
- oag
DOAJ Metadata
The doaj scraper fetches the full listing of open access journals from DOAJ. Any journals in the database matching DOAJ ISSNs are then updated with DOAJ openness and license information.
$ oacensus help -scraper doaj
doaj Scraper
Generates ratings for journals with openness information from DOAJ.
Settings:
add-new-journals: Whether to create new Journal instances if one doesn't already exist. (default value: True)
cache-expires: Number of units after which to expire cache files. (default value: 90)
cache-expires-units: Unit of time for cache-expires. Options are: years, months, weeks, days, hours, minutes, seconds, microseconds (default value: days)
csv-url: Base url for accessing DOAJ. (default value: http://www.doaj.org/csv)
data-file: File to save data under. (default value: doaj.csv)
encoding: Which encoding to use. Can be 'chardet'. (default value: utf-8)
limit: Limit of journals to process (for testing/dev) (default value: None)
no-hash-settings: Settings to exclude from hash calculations. (default value: [])
update-journal-fields: Whitelist of fields which should be applied when updating an existing journal. (default value: [])
We don’t need to set any parameters:
- doaj
Running the Example
$ oacensus run --config example-oxford-2012.yaml --reports excel
running pubmed scraper
Traceback (most recent call last):
File "/usr/local/bin/oacensus", line 9, in <module>
load_entry_point('oacensus==0.1.0d', 'console_scripts', 'oacensus')()
File "/home/oacensus/oacensus/oacensus/commands.py", line 27, in run
args.parse_and_run_command(sys.argv[1:], mod, default_command=default_cmd)
File "/usr/local/lib/python2.7/dist-packages/modargs/args.py", line 381, in parse_and_run_command
command_module(mod, command, options, cli_options=cli_options)
File "/usr/local/lib/python2.7/dist-packages/modargs/args.py", line 286, in command_module
function(**options)
File "/home/oacensus/oacensus/oacensus/commands.py", line 191, in run_command
scraper.run()
File "/home/oacensus/oacensus/oacensus/scraper.py", line 97, in run
self.scrape()
File "/home/oacensus/oacensus/oacensus/scraper.py", line 294, in scrape
for start_date, end_date in self.periods():
File "/home/oacensus/oacensus/oacensus/scraper.py", line 245, in periods
for i, start_date in enumerate(self.start_dates()):
File "/home/oacensus/oacensus/oacensus/scraper.py", line 228, in start_dates
start_month = self.parse_month("start-period")
File "/home/oacensus/oacensus/oacensus/scraper.py", line 215, in parse_month
raise Exception("%s must be provided in YYYY-MM format" % param_name)
Exception: start-period must be provided in YYYY-MM format
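The traceback shows that this run failed because the pubmed scraper's start-period setting was not set. Presumably the configuration needs period bounds in YYYY-MM format, along the lines of this sketch:

- pubmed:
    search: '"Oxford University"[affiliation] AND 2012[pdat]'
    start-period: '2012-01'
    end-period: '2012-12'
- oag
- doaj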
When the run succeeds, the excel report dumps each database table onto an Excel worksheet for inspection.
Individual Openness Report
User Story
A researcher wishes to provide a report demonstrating that they are a good citizen in generating open content. They use their ORCID profile as a source of article information. For each article they wish to show that it is either available at the publisher website freely to read[2] or is in either PMC or their institutional repository.
[2] "free-to-read" is a metadata element that Crossref will be shortly rolling out. It doesn’t yet exist and will take some time to reach critical mass.
Implementation
For now this report is implemented using just OAG openness data.
Here is the full project configuration:
- orcid:
orcid: 0000-0002-0068-716X
- oag
Here is the run output:
$ oacensus run --config example-orcid.yaml --reports "personal-openness"
running orcid scraper
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/modargs/args.py", line 286, in command_module
function(**options)
File "/home/oacensus/oacensus/oacensus/commands.py", line 191, in run_command
scraper.run()
File "/home/oacensus/oacensus/oacensus/scraper.py", line 105, in run
return self.process()
File "/home/oacensus/oacensus/oacensus/scrapers/orcids.py", line 45, in process
for pub in response.publications:
File "/usr/local/lib/python2.7/dist-packages/orcid/rest.py", line 107, in publications
self._load_works()
File "/usr/local/lib/python2.7/dist-packages/orcid/rest.py", line 101, in _load_works
+ '/orcid-works', headers = BASE_HEADERS)
TypeError: cannot concatenate 'str' and 'NoneType' objects
On a successful run, the resulting report would appear here (the run captured above failed in the orcid scraper, so no report output appears).
Topic Openness Report
User Story
A patient advocate wants to understand how much content related to their disease is available. They search PubMed to identify a set of articles and a comparison set for a different disease. They then wish to know what proportion of articles are free to read via the publisher[2], available in PubMed Central, and openly licensed.
[2] "free-to-read" is a metadata element that Crossref will be shortly rolling out. It doesn’t yet exist and will take some time to reach critical mass.
RCUK Policy Compliance Report
User Story
A UK funder wishes to report on RCUK policy compliance. They use Gateway to Research to generate a list of publications relating to their funding. Compliance can be achieved via two routes: if the article is OA through the publisher website it must have a CC BY license (OAG), or it must be made available through a repository. The funder elects to search PMC, OpenAIRE, and a UK federated institutional repository search tool[3] to identify copies in repositories.