Introduction
This is documentation for the OACensus tool. To jump right in and use the tool, check out the Quickstart section. More detailed documentation is in the User Guide. An overview of all the standard scrapers is in the Scrapers section, and the standard reports are documented in the Reports section. The Writing Scrapers and Reports section covers how to write your own scrapers and reports. The Use Cases section describes some example scenarios for using this tool.
The oacensus Tool
The oacensus tool consists of several configurable data scrapers and reporting tools. It is written in Python.
Quickstart
Here is an example configuration file named example-orcid.yaml:
- orcid:
orcid: 0000-0002-0068-716X
- oag
The order in which the scrapers are specified is important.
This uses the orcid scraper:
$ oacensus help -scraper orcid
orcid Scraper
Generate lists of articles for authors based on ORCID.
Settings:
cache-expires: Number of units after which to expire cache files. (default value: None)
cache-expires-units: Unit of time for cache-expires. Options are: years, months, weeks, days, hours, minutes, seconds, microseconds (default value: days)
encoding: Which encoding to use. Can be 'chardet'. (default value: None)
end-period: Period (month) in yyyy-mm format. (default value: None)
no-hash-settings: Settings to exclude from hash calculations. (default value: ['start-period', 'end-period'])
orcid: ORCID of author to process, or a list of ORCIDS. (default value: None)
orcid-data-file: File to save data under. (default value: orcid.pickle)
period: Custom setting for article 'period'. (default value: manual)
start-period: Period (month) in yyyy-mm format. (default value: None)
Followed by the oag scraper:
$ oacensus help -scraper oag
oag Scraper
Adds information from Open Article Gauge (OAG) to articles.
Requests information from the OAG API for each article in the oacensus
database which has a DOI (articles must already be populated from some
other source).
Settings:
base-url: Base url of OAG API (default value: http://oag.cottagelabs.com/lookup/)
cache-expires: Number of units after which to expire cache files. (default value: None)
cache-expires-units: Unit of time for cache-expires. Options are: years, months, weeks, days, hours, minutes, seconds, microseconds (default value: days)
encoding: Which encoding to use. Can be 'chardet'. (default value: None)
max-items: Maximum number of items in a single API request. (default value: 1000)
no-hash-settings: Settings to exclude from hash calculations. (default value: [])
repository-name: Name of OAG repository. (default value: Open Article Gauge)
This is run via:
$ oacensus run --config example-orcid.yaml --reports "personal-openness"
running orcid scraper
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/modargs/args.py", line 286, in command_module
function(**options)
File "/home/oacensus/oacensus/oacensus/commands.py", line 191, in run_command
scraper.run()
File "/home/oacensus/oacensus/oacensus/scraper.py", line 105, in run
return self.process()
File "/home/oacensus/oacensus/oacensus/scrapers/orcids.py", line 45, in process
for pub in response.publications:
File "/usr/local/lib/python2.7/dist-packages/orcid/rest.py", line 107, in publications
self._load_works()
File "/usr/local/lib/python2.7/dist-packages/orcid/rest.py", line 101, in _load_works
+ '/orcid-works', headers = BASE_HEADERS)
TypeError: cannot concatenate 'str' and 'NoneType' objects
On a successful run, a personal openness report is then generated (the run captured above failed in the orcid scraper, so no report output appears here).
User Guide
The oacensus tool runs scrapers to download and process data, and then runs reports to present the data. Scrapers are specified in a configuration file; reports are specified on the command line.
Config File Format
Config files are written in YAML and should consist of a list of the scraper aliases to be run, in order. Each scraper alias may be followed by an optional dictionary of custom settings to provide to the scraper.
Here is an example:
- orcid:
orcid: 0000-0002-0068-716X
- oag
In this case we are running the orcid scraper followed by the oag scraper, and the orcid scraper has a custom setting specified, which confusingly is itself named orcid.
You can look at the documentation for each scraper to see its available settings.
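For example, the orcid scraper's start-period and end-period settings (listed in its help output in the Quickstart) could be added alongside the orcid setting, presumably restricting the scrape to that date range (a sketch):

- orcid:
    orcid: 0000-0002-0068-716X
    start-period: '2013-01'
    end-period: '2013-12'
- oag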
Command Line Interface
The command line interface is documented via command line help.
The output from the help command is:
$ oacensus help
Available commands:
help - Prints this help message or help for individual commands, scrapers or reports.
list - List all available scrapers and reports.
run - Runs the oacensus tool.
reports - Runs additional reports using data from the last run.
Run `oacensus help -on cmd` for detailed help on any of these commands.
This lists each of the available commands.
Here is detailed help on the run command:
$ oacensus help -on run
=======================
Help for 'oacensus run'
=======================
Runs the oacensus scrapers specified in the configuration file.
This is the main command for using the oacensus tool. It reads the
configuration specified in YAML and runs the requested scrapers in order
(using data from the cache if available). Data will be stored in a sqlite3
database. After data has been processed and stored in the database, reports
may be run which will present the data.
Arguments:
cachedir - Directory to store cached scraped data.
[optional, defaults to '.oacensus/cache/']
e.g. 'oacensus run --cachedir .oacensus/cache/'
config - YAML file to read configuration from.
[optional, defaults to 'oacensus.yaml']
e.g. 'oacensus run --config oacensus.yaml'
dbfile - Name of sqlite db file.
[optional, defaults to 'oacensus.sqlite3']
e.g. 'oacensus run --dbfile oacensus.sqlite3'
profile - Whether to run in profiler (dev only).
[optional, defaults to 'False'] e.g. 'oacensus run --profile False'
progress - Whether to show progress indicators.
[optional, defaults to 'False'] e.g. 'oacensus run --progress False'
reports - Reports to run.
[optional, defaults to ''] e.g. 'oacensus run --reports '
workdir - Directory to store temp working directories.
[optional, defaults to '.oacensus/work/']
e.g. 'oacensus run --workdir .oacensus/work/'
You can run reports as part of run, but you can also run reports separately after you have executed the run command:
$ oacensus help -on reports
===========================
Help for 'oacensus reports'
===========================
Arguments:
dbfile - db file to use for reports
[optional, defaults to 'oacensus.sqlite3']
e.g. 'oacensus reports --dbfile oacensus.sqlite3'
reports - reports to run
[optional, defaults to ''] e.g. 'oacensus reports --reports '
To get a list of available scrapers or reports, use the list command:
$ oacensus help -on list
========================
Help for 'oacensus list'
========================
List the available scrapers and reports.
Here are the built-in scrapers and reports:
$ oacensus list
Scrapers:
articlescraper
biomed
crossref
crossrefjournals
csvarticles
demo
doaj
doilist
elsevier
gtr
licenses
oag
oai
orcid
pubmed
scimago
wiley
Reports:
excel
institution
personal-openness
textdump
You can get help on individual scrapers:
$ oacensus help -scraper oag
oag Scraper
Adds information from Open Article Gauge (OAG) to articles.
Requests information from the OAG API for each article in the oacensus
database which has a DOI (articles must already be populated from some
other source).
Settings:
base-url: Base url of OAG API (default value: http://oag.cottagelabs.com/lookup/)
cache-expires: Number of units after which to expire cache files. (default value: None)
cache-expires-units: Unit of time for cache-expires. Options are: years, months, weeks, days, hours, minutes, seconds, microseconds (default value: days)
encoding: Which encoding to use. Can be 'chardet'. (default value: None)
max-items: Maximum number of items in a single API request. (default value: 1000)
no-hash-settings: Settings to exclude from hash calculations. (default value: [])
repository-name: Name of OAG repository. (default value: Open Article Gauge)
Or individual reports:
$ oacensus help -report excel
excel Report
Dump all database models to a single excel workbook.
Settings:
date-format-string: Excel style date format string. (default value: D-MMM-YYYY)
filename: Name of file to write excel dump to. (default value: dump.xls)
install-dir: Location where the plugin was defined. (default value: None)
Performance and Caching
Some scrapers have to fetch a lot of data and will be slow to run. The data will be cached after the first run and re-used if the parameters are the same.
You can use the --progress option to have progress notices printed while scrapers run.
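For example, to expire cached doaj data after 30 days instead of its default of 90, you could configure the cache-expires settings shown in the scraper help (a sketch):

- doaj:
    cache-expires: 30
    cache-expires-units: days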
Scrapers
It is up to users to specify scrapers in a sensible order. Some scrapers use the database state from previously-run scrapers in the current batch to do their work. The database is reset at the start of each batch run, and entries will be re-populated from cached or just-downloaded data sources.
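For example, one sensible ordering (using scrapers documented later in this section) runs a journal scraper first, then an article scraper, then a scraper that annotates articles already in the database:

- doaj                  # journal-level openness data
- pubmed:
    search: '"Oxford University"[affiliation] AND 2012[pdat]'
- oag                   # needs articles with DOIs already present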
Open Access Data Sources
We can obtain Open Access information at the article or the journal level.
Journal-level open access data is obtained by querying a publisher’s site or an aggregation service like Directory of Open Access Journals.
Article-level open access data is obtained by querying the Open Article Gauge (OAG).
To some extent, obtaining journal-level data and then applying it to individual articles within the oacensus tool duplicates the work done by the OAG; however, querying the OAG requires a DOI for each article, and DOIs are not always available.
Article-level open access data is stored in the open_access field in the Article data model. The is_open_access() method of an Article object will use both the open_access field on the Article object and the open_access field on the associated Journal object (if any) to determine the openness of the article.
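As a rough sketch of that logic (hypothetical Python, not the actual implementation; the open_access attributes are assumed from the description above):

def is_open_access(article):
    # Hypothetical sketch: article-level data takes precedence when present.
    if article.open_access is not None:
        return article.open_access
    # Otherwise fall back to the journal-level rating, if any.
    if article.journal is not None and article.journal.open_access is not None:
        return article.journal.open_access
    return False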
Journal Scrapers
Journal scrapers should typically be run first.
The JournalScraper class implements a create_or_modify_journal method which should be used as the standard way to add new journal entries. This method looks for an existing journal with the given ISSN and, if it finds one, modifies the existing entry using only those fields specified in update_journal_fields. If there is no existing journal corresponding to the ISSN, all provided data fields are used to create a new journal entry. If a JournalList is provided, the created or modified journal is added to that list (journals can be linked to multiple journal lists).
If the desired behavior is to modify only existing journals, the add-new-journals setting can be set to False.
Unless this paradigm does not fit, the preferred approach is to use create_or_modify_journal rather than calling the Journal.create method directly.
Don’t forget to specify update_journal_fields for each JournalScraper so that oacensus knows how to handle journals which already exist. The default is an empty list, meaning that no data will be updated.
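A minimal sketch of a custom JournalScraper using this pattern (the exact signature of create_or_modify_journal and the helper method are assumptions; consult the JournalScraper source):

# Assumed import paths:
# from oacensus.scraper import JournalScraper

class MyPublisherJournals(JournalScraper):
    """Hypothetical scraper for an imaginary publisher's journal list."""
    _settings = {
        # Fields allowed to overwrite an existing journal with the same ISSN.
        'update-journal-fields': ['title', 'url'],
    }

    def process(self):
        for row in self.parse_journal_listing():  # hypothetical helper
            self.create_or_modify_journal(
                row['issn'],                                  # assumed: ISSN lookup key
                {'title': row['title'], 'url': row['url']},   # assumed: field values
            )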
BioMed Central journals
The biomed scraper creates a Journal entry for each journal published by BioMed Central.
Wiley
Creates a Journal entry for each journal published by Wiley.
Elsevier
Creates a Journal entry for each journal published by Elsevier.
DOAJ
Gets information about open access journals from DOAJ, using the new website’s CSV download option.
Scimago
Not yet complete. Returns Scimago journal ranking information.
Articles & Article Lists
Here are scrapers which create article entries, sometimes organized into lists.
Where possible, articles are assigned to journals by linking on the journal ISSN.
Pubmed scraper
The Pubmed scraper obtains articles returned from a search of Pubmed.
DOI List scraper
Creates articles from an external list of DOIs.
Article Info
Scrapers which add information to existing article entries.
Open Article Gauge
This scraper updates open access-related attributes for an article using data retrieved from the OAG API.
Crossref
Obtain information from Crossref for all articles having a DOI.
Currently, this scraper does not modify any data.
Reports
Built-in reports.
Writing Scrapers and Reports
Scrapers and reports are implemented using the cashew plugin system.
If you implement a custom scraper or report, make sure to add an import statement in the load plugins module so that Cashew will register the plugin.
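For example, if you define a new scraper in a module named myscrapers (a hypothetical name), add an import for it alongside the existing ones so the class definitions execute and Cashew registers them:

# In the load-plugins module (exact module path is an assumption):
import oacensus.scrapers.myscrapers  # imported only for plugin registration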
ORM
Oacensus uses the peewee ORM.
Article
Fields:
- date_published: Date on which article was published.
- doi: Digital object identifier for article.
- id
- journal: Journal object for journal in which article was published.
- period: Name of date-based period in which this article was scraped.
- source: Which scraper populated this information?
- title: Title of article.
- url: Web page for article information.
ArticleList
Fields:
- id
- name
- orcid
- source: Which scraper populated this information?
ArticleListMembership
Fields:
- article
- article_list
- id
- source: Which scraper populated this information?
Instance
Fields:
- article
- download_url: URL for directly downloading the article. May be tested for status and file size.
- end_date: This rating’s properties are in effect up to this date.
- expected_file_hash: Expected file checksum if article is downloadable.
- expected_file_size: Expected file size if article is downloadable.
- free_to_read: Are journal contents available online for free?
- id
- identifier: Identifier within the repository.
- info_url: For convenience, URL of an info page for the article.
- license
- repository: Repository in which this instance is deposited or described.
- source: Which scraper populated this information?
- start_date: This rating’s properties are in effect commencing from this date.
Journal
Fields:
- country: Country of publication for this journal.
- doi: DOI for journal.
- eissn: Electronic ISSN (EISSN) of journal.
- id
- iso_abbreviation
- issn: ISSN of journal. [unique]
- issn_linking
- language: Language(s) in which journal is published.
- medline_ta
- nlm_unique_id
- publisher: Publisher object corresponding to journal publisher.
- source: Which scraper populated this information?
- subject: Subject area which this journal deals with.
- title: Name of journal.
- url: Website of journal.
JournalList
Fields:
- id
- name
- source: Which scraper populated this information?
JournalListMembership
Fields:
- id
- journal
- journal_list
- source: Which scraper populated this information?
License
Fields:
- alias
- id
- source: Which scraper populated this information?
- title
- url
LicenseAlias
Fields:
- alias [unique]
- id
- license
- source: Which scraper populated this information?
OpenMetaCommon
Fields:
- end_date: This rating’s properties are in effect up to this date.
- free_to_read: Are journal contents available online for free?
- id
- license
- source: Which scraper populated this information?
- start_date: This rating’s properties are in effect commencing from this date.
Publisher
Fields:
- id
- name
- source: Which scraper populated this information?
Rating
Fields:
- end_date: This rating’s properties are in effect up to this date.
- free_to_read: Are journal contents available online for free?
- id
- journal
- license
- source: Which scraper populated this information?
- start_date: This rating’s properties are in effect commencing from this date.
Repository
Fields:
- id
- info_url: For convenience, URL of info page for the repository.
- name: Descriptive name for the repository.
- source: Which scraper populated this information?
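As an illustration, once a run has populated oacensus.sqlite3, these models can be queried with ordinary peewee syntax (a sketch; the oacensus.models import path is an assumption):

from oacensus.models import Article  # assumed import path

# List the title and journal of every article that has a DOI.
for article in Article.select().where(Article.doi.is_null(False)):
    journal_title = article.journal.title if article.journal else "(no journal)"
    print(article.title, "/", journal_title)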
Scraper Design
Scrapers work in two phases: scrape, then process. Results of the scrape phase are cached and, if no parameters have changed, re-used in subsequent calls. The scrape phase should do as much pre-processing as possible (for efficiency), but it should not do anything that depends on database state or on the ordering of scrapers. Anything which depends on state should occur in the process phase, which is not cached.
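A schematic custom scraper following this split might look like the following (a sketch; the helper methods are hypothetical and only the scrape/process division is taken from the description above):

# Assumed import paths:
# from oacensus.scraper import Scraper
# from oacensus.models import Article

class ExampleScraper(Scraper):
    def scrape(self):
        # Cached phase: fetch raw data into the working directory.
        # Must not touch the database or depend on other scrapers.
        raw = self.fetch_remote_listing()      # hypothetical helper
        with open("listing.json", "w") as f:   # path handling simplified
            f.write(raw)

    def process(self):
        # Uncached phase: read the cached files and update the database.
        # May rely on database state left by earlier scrapers in the batch.
        for record in self.parse_cached_listing():  # hypothetical helper
            Article.create(title=record["title"], source=self.alias)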
Report Design
Reports take the harvested data and present it. Reports can be of any format.
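For instance, a report that dumps article titles to CSV might look like this sketch (the Report base class location and its run() hook are assumptions based on the plugin system described above):

import csv

# Assumed import paths:
# from oacensus.report import Report
# from oacensus.models import Article

class CsvArticlesReport(Report):
    """Hypothetical report writing one row per article."""
    def run(self):
        with open("articles.csv", "w") as f:
            writer = csv.writer(f)
            writer.writerow(["title", "doi"])
            for article in Article.select():
                writer.writerow([article.title, article.doi])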
Use Cases
Institutional Open Access Census
User Story
A librarian at Oxford University wishes to understand the amount of Open Access content, as defined in different ways, in the research they publish. They first need to create a list of research articles published from Oxford University. They use PubMed and CrossRef as sources of articles that provide affiliation information to generate the list of article DOIs. For each article they then wish to ask: a) Is this in an Open Access Journal (using DOAJ) b) Does the article have an open license (OAG) and c) Is the article in one of the following repositories (PMC/EuropePMC, OpenAIRE, the Oxford Institutional repository[1]). They aim to provide a report on this once a month.
[1] Most IRs can be searched via a standard protocol, OAI-PMH. It would be reasonable to ask the user to supply the appropriate URL for the API endpoint.
PubMed Articles
We’ll retrieve a list of articles where the affiliation is Oxford University. To determine how to configure the pubmed query, we first review the docs for the pubmed scraper:
$ oacensus help -scraper pubmed
pubmed Scraper
Creates a single ArticleList and individual Article objects for all
articles returned from pubmed matching the [required] search query.
Settings:
base-url: Base url of API. (default value: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/)
cache-expires: Number of units after which to expire cache files. (default value: None)
cache-expires-units: Unit of time for cache-expires. Options are: years, months, weeks, days, hours, minutes, seconds, microseconds (default value: days)
datetype: Type of date for period filtering (default value: pdat)
delay: Time in seconds to delay between API requests. (default value: 1)
encoding: Which encoding to use. Can be 'chardet'. (default value: None)
end-period: Period (month) in yyyy-mm format. (default value: None)
filepattern: Names of files which hold data in cache. (default value: data_%04d.xml)
initial-ret-max: Maximum number of entries to return in the initial query. (default value: 5)
ncbi-db: Name of NCBI database to query. (default value: pubmed)
no-hash-settings: Settings to exclude from hash calculations. (default value: ['start-period', 'end-period'])
ret-max: Maximum number of entries to return in any single query. (default value: 10000)
search: Search query to include. (default value: None)
start-period: Period (month) in yyyy-mm format. (default value: None)
We only need to specify the search parameter:
- pubmed:
search: '"Oxford University"[affiliation] AND 2012[pdat]'
CrossRef Articles
TBD.
OAG Licensing Information
The OAG scraper retrieves OAG metadata for any article in the database which has a DOI:
$ oacensus help -scraper oag
oag Scraper
Adds information from Open Article Gauge (OAG) to articles.
Requests information from the OAG API for each article in the oacensus
database which has a DOI (articles must already be populated from some
other source).
Settings:
base-url: Base url of OAG API (default value: http://oag.cottagelabs.com/lookup/)
cache-expires: Number of units after which to expire cache files. (default value: None)
cache-expires-units: Unit of time for cache-expires. Options are: years, months, weeks, days, hours, minutes, seconds, microseconds (default value: days)
encoding: Which encoding to use. Can be 'chardet'. (default value: None)
max-items: Maximum number of items in a single API request. (default value: 1000)
no-hash-settings: Settings to exclude from hash calculations. (default value: [])
repository-name: Name of OAG repository. (default value: Open Article Gauge)
We don’t need to set any parameters:
- oag
DOAJ Metadata
The doaj scraper fetches the full listing of open access journals from DOAJ. Any journals in the database matching DOAJ ISSNs are then updated with DOAJ openness and license information.
$ oacensus help -scraper doaj
doaj Scraper
Generates ratings for journals with openness information from DOAJ.
Settings:
add-new-journals: Whether to create new Journal instances if one doesn't already exist. (default value: True)
cache-expires: Number of units after which to expire cache files. (default value: 90)
cache-expires-units: Unit of time for cache-expires. Options are: years, months, weeks, days, hours, minutes, seconds, microseconds (default value: days)
csv-url: Base url for accessing DOAJ. (default value: http://www.doaj.org/csv)
data-file: File to save data under. (default value: doaj.csv)
encoding: Which encoding to use. Can be 'chardet'. (default value: utf-8)
limit: Limit of journals to process (for testing/dev) (default value: None)
no-hash-settings: Settings to exclude from hash calculations. (default value: [])
update-journal-fields: Whitelist of fields which should be applied when updating an existing journal. (default value: [])
We don’t need to set any parameters:
- doaj
Running the Example
$ oacensus run --config example-oxford-2012.yaml --reports excel
running pubmed scraper
Traceback (most recent call last):
File "/usr/local/bin/oacensus", line 9, in <module>
load_entry_point('oacensus==0.1.0d', 'console_scripts', 'oacensus')()
File "/home/oacensus/oacensus/oacensus/commands.py", line 27, in run
args.parse_and_run_command(sys.argv[1:], mod, default_command=default_cmd)
File "/usr/local/lib/python2.7/dist-packages/modargs/args.py", line 381, in parse_and_run_command
command_module(mod, command, options, cli_options=cli_options)
File "/usr/local/lib/python2.7/dist-packages/modargs/args.py", line 286, in command_module
function(**options)
File "/home/oacensus/oacensus/oacensus/commands.py", line 191, in run_command
scraper.run()
File "/home/oacensus/oacensus/oacensus/scraper.py", line 97, in run
self.scrape()
File "/home/oacensus/oacensus/oacensus/scraper.py", line 294, in scrape
for start_date, end_date in self.periods():
File "/home/oacensus/oacensus/oacensus/scraper.py", line 245, in periods
for i, start_date in enumerate(self.start_dates()):
File "/home/oacensus/oacensus/oacensus/scraper.py", line 228, in start_dates
start_month = self.parse_month("start-period")
File "/home/oacensus/oacensus/oacensus/scraper.py", line 215, in parse_month
raise Exception("%s must be provided in YYYY-MM format" % param_name)
Exception: start-period must be provided in YYYY-MM format
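The traceback shows that this run failed because the pubmed scraper's start-period setting was not set. Presumably the configuration needs period bounds in YYYY-MM format, along the lines of this sketch:

- pubmed:
    search: '"Oxford University"[affiliation] AND 2012[pdat]'
    start-period: '2012-01'
    end-period: '2012-12'
- oag
- doaj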
When the run succeeds, the excel report dumps each database table onto an Excel worksheet for inspection.
Individual Openness Report
User Story
A researcher wishes to provide a report demonstrating that they are a good citizen in generating open content. They use their ORCID profile as a source of article information. For each article they wish to show that it is either available at the publisher website freely to read[2] or is in either PMC or their institutional repository.
[2] "free-to-read" is a metadata element that Crossref will be shortly rolling out. It doesn’t yet exist and will take some time to reach critical mass.
Implementation
For now this report is implemented using just OAG openness data.
Here is the full project configuration:
- orcid:
orcid: 0000-0002-0068-716X
- oag
Here is the run output:
$ oacensus run --config example-orcid.yaml --reports "personal-openness"
running orcid scraper
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/modargs/args.py", line 286, in command_module
function(**options)
File "/home/oacensus/oacensus/oacensus/commands.py", line 191, in run_command
scraper.run()
File "/home/oacensus/oacensus/oacensus/scraper.py", line 105, in run
return self.process()
File "/home/oacensus/oacensus/oacensus/scrapers/orcids.py", line 45, in process
for pub in response.publications:
File "/usr/local/lib/python2.7/dist-packages/orcid/rest.py", line 107, in publications
self._load_works()
File "/usr/local/lib/python2.7/dist-packages/orcid/rest.py", line 101, in _load_works
+ '/orcid-works', headers = BASE_HEADERS)
TypeError: cannot concatenate 'str' and 'NoneType' objects
On a successful run, the resulting report would appear here (the run captured above failed in the orcid scraper, so no report output appears).
Topic Openness Report
User Story
A patient advocate wants to understand how much content related to their disease is available. They search PubMed to identify a set of articles and a comparison set for a different disease. They then wish to know what proportion of articles are free to read via the publisher[2], available in PubMed Central, and openly licensed.
[2] "free-to-read" is a metadata element that Crossref will be shortly rolling out. It doesn’t yet exist and will take some time to reach critical mass.
RCUK Policy Compliance Report
User Story
A UK funder wishes to report on RCUK policy compliance. They use Gateway to Research to generate a list of publications relating to their funding. Compliance can be achieved via two routes: if the article is OA through the publisher website it must have a CC BY license (OAG), or it must be made available through a repository. The funder elects to search PMC, OpenAIRE, and a UK federated institutional repository search tool[3] to identify copies in repositories.