Type: | Package |
Title: | Search and Retrieve Scientific Publication Records from PubMed |
Version: | 3.1.6 |
Date: | 2025-08-25 |
Maintainer: | Damiano Fantini <damiano.fantini@gmail.com> |
Description: | Query NCBI Entrez and retrieve PubMed records in XML or text format. Process PubMed records by extracting and aggregating data from selected fields. A large number of records can be easily downloaded via this simple-to-use interface to the NCBI PubMed API. |
URL: | https://www.data-pulse.com/dev_site/easypubmed/ |
Depends: | R(≥ 3.5) |
Imports: | methods, utils, rlang |
Suggests: | knitr, rmarkdown |
VignetteBuilder: | knitr |
LazyData: | true |
Encoding: | UTF-8 |
License: | GPL-3 |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | no |
Packaged: | 2025-08-25 14:58:55 UTC; dami |
Author: | Damiano Fantini [aut, cre] |
Repository: | CRAN |
Date/Publication: | 2025-08-25 18:40:02 UTC |
Retrieve and Process Scientific Publication Records from PubMed
Description
Query NCBI Entrez and retrieve PubMed records in XML or TXT format. PubMed records can be downloaded and saved as XML or text files. Data integrity is enforced during download, which makes it possible to retrieve and save very large numbers of records with little effort. PubMed records can be processed to extract publication- and author-specific information.
Details
This software is based on the information included in the Entrez Programming Utilities Help manual authored by Eric Sayers, PhD and available on the NCBI Bookshelf (NBK25500). This R library is NOT endorsed, supported, or maintained by NCBI, NOR is it affiliated with NCBI.
Author(s)
Damiano Fantini damiano.fantini@gmail.com
References
Tutorials and Help Webpage: https://www.data-pulse.com/dev_site/easypubmed/
NCBI PubMed Help Manual: https://pubmed.ncbi.nlm.nih.gov/help/
Entrez Programming Utilities Help (NBK25500): https://www.ncbi.nlm.nih.gov/books/NBK25500/
Examples
## Example 01: retrieve data in XML format, extract info, show
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
my_query_string <- 'Damiano Fantini[AU] AND "2018"[PDAT]'
epm <- epm_query(my_query_string)
epm <- epm_fetch(epm)
epm <- epm_parse(epm, max_authors = 5, max_references = 10)
processed_data <- get_epm_data(epm)
utils::head(processed_data)
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
## Not run:
## Example 02: retrieve data in medline format
my_query_string <- 'Damiano Fantini[AU] AND "2018"[PDAT]'
epm <- epm_query(my_query_string)
epm <- epm_fetch(epm, format = 'medline')
medline_data <- get_epm_raw(epm)
first_record <- medline_data[[1]]
cat(first_record, sep = '\n')
## Additional Examples: show easyPubMed Vignette
library(easyPubMed)
vignette("easyPubMed_demo")
## End(Not run)
Parse and Format Author Names and Affiliations.
Description
Extract author information from a slice of a raw XML PubMed record. Last name, first name, address and email are returned. Only the first address of each author is returned. A collapsed version of the author list is also returned.
Usage
EPM_auth_parse(x, max_authors = 15, autofill = TRUE)
Arguments
x |
String (character vector of length 1) including an XML Author List section from a PubMed record. |
max_authors |
Numeric, maximum number of authors to include. See details for additional information. |
autofill |
Logical, shall non-missing address information be propagated to fill missing address information for other authors in the same publication. |
Details
The value of the 'max_authors' argument controls which author information is extracted from the input. If 'max_authors' is set to '0', no author information is extracted. If 'max_authors' is set to '-1' (or any negative number), only the information for the last author is extracted. If 'max_authors' is set to '+1', only the information for the first author is extracted. If 'max_authors' is set to any other positive integer, information is extracted for at most the indicated number of authors. In this case, information for both the first and the last author is included.
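The selection rule can be illustrated with a small base R sketch. The helper 'select_author_idx()' is hypothetical and is not part of easyPubMed. It assumes that, for values larger than 1, the leading (max_authors - 1) authors are kept together with the last author, which is how the 'max_authors' parameter is described for 'EPM_validate_parse_params()' in this manual.

```r
# Hypothetical helper (not part of easyPubMed): returns the positions of the
# authors that would be kept for a given 'max_authors' value.
select_author_idx <- function(n_authors, max_authors) {
  idx <- seq_len(n_authors)
  if (n_authors == 0 || max_authors == 0) {
    return(integer(0))          # no author information
  }
  if (max_authors < 0) {
    return(n_authors)           # last author only
  }
  if (max_authors == 1) {
    return(1L)                  # first author only
  }
  if (max_authors >= n_authors) {
    return(idx)                 # every author fits
  }
  # Assumption: leading (max_authors - 1) authors plus the last author
  c(idx[seq_len(max_authors - 1)], n_authors)
}

select_author_idx(10, 0)     # integer(0)
select_author_idx(10, -1)    # 10
select_author_idx(10, 1)     # 1
select_author_idx(10, 3)     # 1 2 10
select_author_idx(2, 15)     # 1 2
```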
Value
list including 2 elements: 'authors' is a data.frame including one row for each author and n=4 columns: lastname, forename, address and email; 'collapsed' is a list including 2 elements (each element is a string): authors and address.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
aff <- paste0('<Author><LastName>Doe</LastName><ForeName>John</ForeName>',
'<Affiliation>Univ A</Affiliation></Author>',
'<Author><LastName>Doe</LastName><ForeName>Jane</ForeName>',
'<Affiliation>jane_doe@univ_a.edu</Affiliation></Author>',
'<Author><LastName>Foo</LastName><ForeName>Bar</ForeName>',
'<Affiliation>Univ B</Affiliation></Author>')
easyPubMed:::EPM_auth_parse(aff)
Check Metadata from Imported XML Files.
Description
Analyze the metadata from different XML files that were imported using easyPubMed and identify which records / files can be merged together and which ones should be excluded. Only files with the same unique ID can be merged together at this step. The goal is to re-build a consistent easyPubMed object.
Usage
EPM_check_guide(x)
Arguments
x |
Data.frame including information from the imported XML files. The following column names are expected: 'index', 'file', 'JobUniqueId', 'JobQuery', 'JobBatch'. |
Value
Data.frame identical to 'x' with an additional (numeric) column ('pass' column).
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
gx <- data.frame(
index = c(1, 2, 3, 4, 5),
JobUniqueId = rep('xyz0x', 5),
JobQuery = rep('test_query', 1),
JobBatch = c(1, 2, 3, 4, 3),
JobBatchNum = rep(4, 5),
stringsAsFactors = FALSE)
easyPubMed:::EPM_check_guide(gx)
Custom XML Tag Matching.
Description
Extract text from a string containing XML or HTML tags. Text included between tags of interest will be returned. If multiple tagged substrings are found, they will be returned as different elements of a list or character vector.
Usage
EPM_custom_grep(xml_data, tag, xclass = NULL, format = "list")
Arguments
xml_data |
String (character vector of length 1), this is a string including PubMed records or string including XML/HTML tags. |
tag |
String (character vector of length 1), the tag of interest (e.g., "Title") (should NOT include < > chars). |
xclass |
String (character vector of length 1), a tag decorator of interest (e.g., "EIdType=\"doi\""). Can be NULL. |
format |
String. Must be a value in c("list", "char"). Indicates the type of output. Defaults to "list". |
Value
List or vector where each element corresponds to an in-tag substring.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
x <- "This string includes <Ti>an XML Tag</Ti>."
easyPubMed:::EPM_custom_grep(x, tag = "Ti")
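The idea of tag matching can be sketched with base R regular expressions. The helper 'grep_tag()' below is hypothetical and is not the package implementation. It assumes well-formed, non-nested tags and ignores the 'xclass' filter.

```r
# Hypothetical helper, not the package implementation: return the text
# enclosed by <tag ...> and </tag>, one element per match.
grep_tag <- function(xml_data, tag, format = "list") {
  # (?s) lets '.' span line breaks; the optional group skips tag attributes
  pattern <- paste0("(?s)<", tag, "(\\s[^>]*)?>(.*?)</", tag, ">")
  hits <- regmatches(xml_data, gregexpr(pattern, xml_data, perl = TRUE))[[1]]
  # Keep only the text between the opening and the closing tag
  out <- sub(pattern, "\\2", hits, perl = TRUE)
  if (format == "list") as.list(out) else out
}

x <- "This string includes <Ti>an XML Tag</Ti>."
grep_tag(x, tag = "Ti", format = "char")    # "an XML Tag"
```

A regular expression is enough for flat tags such as those in the example. Deeply nested or malformed markup would call for a real XML parser instead.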
Parse and Format a PubMed Date Field.
Description
Extract date information from a slice of a raw XML PubMed record. Day, month and year are returned. Months are recoded as numeric if needed (e.g., 'Oct' and 'October' are converted to 10). If the month and/or day information is missing, it is imputed to 1. If the year is missing, NA is returned.
Usage
EPM_date_parse(x)
Arguments
x |
String (character vector of length 1) including an XML date field from a PubMed record. |
Value
list including n=3 numeric elements: day, month and year.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
dt0 <- '<Year>2021</Year><Month>03</Month><Day>12</Day>'
easyPubMed:::EPM_date_parse(dt0)
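The month recoding and imputation rules described for this function can be sketched in base R. The helper 'normalize_date()' is hypothetical. It takes already-extracted field values rather than XML, and it is only an illustration of the rules, not the package code.

```r
# Hypothetical helper, not the package implementation.
normalize_date <- function(year = NA, month = NA, day = NA) {
  if (is.na(year)) return(NA)                 # no year: nothing to return
  m <- suppressWarnings(as.integer(month))
  if (is.na(m) && !is.na(month)) {
    # Match 'Oct' or 'October' on the first three letters, ignoring case
    m <- match(tolower(substr(month, 1, 3)), tolower(month.abb))
  }
  d <- suppressWarnings(as.integer(day))
  list(day   = if (is.na(d)) 1L else d,       # missing day imputed to 1
       month = if (is.na(m)) 1L else m,       # missing month imputed to 1
       year  = as.integer(year))
}

normalize_date("2021", "03", "12")    # day 12, month 3, year 2021
normalize_date("2021", "October")     # day 1, month 10, year 2021
normalize_date(NA, "Oct", "5")        # NA
```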
Decode an XML String into the Corresponding Metadata.
Description
Decode an XML string including a list of meta information associated with an easyPubMed object whose contents were written to a text file on a local disk. This meta information is used to keep track of easyPubMed query jobs and/or to re-build objects starting from XML files saved on a local disk.
Usage
EPM_decode_xml_meta(x)
Arguments
x |
String corresponding to the XML-decorated text including metadata from an easyPubMed object/query job. |
Value
String, chunk of XML-decorated text including meta information.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
xml <- paste0('<EPMxJobData><EPMxJobUniqueId>EPMJ_20231017151112_mi7xvol743',
'rvz5ry5z3n8qm0ww</EPMxJobUniqueId><EPMxJobBatchNum>4</EPMxJo',
'bBatchNum><EPMxJobBatch>1</EPMxJobBatch><EPMxQuery>Test_Quer',
'y</EPMxQuery><EPMxQBatchInitDate>1937/01/22</EPMxQBatchInitD',
'ate><EPMxQBatchEndDate>1980/08/01</EPMxQBatchEndDate><EPMxQB',
'atchDiffDays>15897</EPMxQBatchDiffDays><EPMxQBatchExpCount>2',
'13</EPMxQBatchExpCount><EPMxMaxRecordsPerBatch>1000</EPMxMax',
'RecordsPerBatch><EPMxExpCount>2083</EPMxExpCount><EPMxExpNum',
'OfBatches>4</EPMxExpNumOfBatches><EPMxAllRecordsCovered>TRUE',
'</EPMxAllRecordsCovered><EPMxExpMissedRecords>0</EPMxExpMiss',
'edRecords><EPMxQueryDate>2023-10-17 15:11:12</EPMxQueryDate>',
'<EPMxRawFormat>xml</EPMxRawFormat><EPMxRawEncoding>UTF-8</EP',
'MxRawEncoding><EPMxRawDate>2023-10-17 15:14:12</EPMxRawDate>',
'<EPMxLibVersion>3.01</EPMxLibVersion></EPMxJobData>')
easyPubMed:::EPM_decode_xml_meta(xml)
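One way to turn such a string into key/value pairs is sketched below with base R only. The helper 'decode_meta()' is hypothetical, and the shape of its result (a named list of strings) is an assumption made for illustration. It is not the documented return value of 'EPM_decode_xml_meta()'.

```r
# Hypothetical helper, not the package implementation: collect every
# <EPMxKey>value</EPMxKey> pair found inside the <EPMxJobData> wrapper.
decode_meta <- function(x) {
  inner <- sub("(?s).*<EPMxJobData>(.*)</EPMxJobData>.*", "\\1", x, perl = TRUE)
  m <- regmatches(inner,
                  gregexpr("<(EPMx[A-Za-z]+)>(.*?)</\\1>", inner, perl = TRUE))[[1]]
  keys <- sub("^<(EPMx[A-Za-z]+)>.*$", "\\1", m, perl = TRUE)
  vals <- sub("^<EPMx[A-Za-z]+>(.*)</EPMx[A-Za-z]+>$", "\\1", m, perl = TRUE)
  # Drop the 'EPMx' prefix from the names; values stay character strings
  stats::setNames(as.list(vals), sub("^EPMx", "", keys))
}

xml_demo <- paste0("<EPMxJobData><EPMxJobBatch>1</EPMxJobBatch>",
                   "<EPMxQuery>Test_Query</EPMxQuery></EPMxJobData>")
str(decode_meta(xml_demo))
```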
Detect PubMed Record Identifiers.
Description
Parse a list of PubMed records in XML or Medline format, extract and return the corresponding PubMed record identifiers (PMID).
Usage
EPM_detect_pmid(x, format = "xml", as.list = TRUE)
Arguments
x |
list including PubMed record data (either in 'xml' or 'medline' format). |
format |
string (character of length 1) indicating the format of each element in x (either 'xml' or 'medline'). |
as.list |
logical (of length 1). Shall results be returned as a list. |
Value
list of PubMed record identifiers.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
x <- list(A='First record: <PMID>Rec_1A</PMID> Lorem ipsum dolor sit amet',
B='Another record: <Ti>Title</Ti><PMID>Rec_2</PMID> Lorem ipsum ')
easyPubMed:::EPM_detect_pmid(x, format = 'xml')
Submit a Query to the NCBI EFetch Server.
Description
Submit a Query to the NCBI EFetch Server and capture the response.
Usage
EPM_efetch_basic_q(params)
Arguments
params |
List including the information for querying the NCBI EFetch Server. |
Details
The input list must include the elements listed below.
'web_env'. String, unique value returned by the NCBI ESearch server.
'format'. String corresponding to the desired response data format (e.g., "xml").
'query_key'. Integer, key value returned by the NCBI ESearch server.
'retstart'. Integer, numeric index of the first record to be requested.
'retmax'. Integer, maximum number of records to be retrieved from the server.
'encoding'. String, encoding of the data (e.g., "UTF-8").
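How these elements could map onto an EFetch request is sketched below. The helper 'build_efetch_url()' is hypothetical. It only composes a URL string using the standard E-utilities parameter names (WebEnv, query_key, retstart, retmax, rettype, retmode) and does not contact the server. How easyPubMed assembles its own requests is not shown in this manual, and the 'encoding' element is assumed to apply when the response is read rather than to the URL.

```r
# Hypothetical helper: composes an EFetch URL; no request is sent.
build_efetch_url <- function(params) {
  base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
  # rettype/retmode pairs for the three formats mentioned in this manual
  fmt <- switch(params$format,
                xml     = list(retmode = "xml"),
                medline = list(rettype = "medline", retmode = "text"),
                uilist  = list(rettype = "uilist", retmode = "text"),
                stop("unsupported format: ", params$format))
  fields <- c(list(db = "pubmed",
                   WebEnv = params$web_env,
                   query_key = params$query_key,
                   retstart = params$retstart,
                   retmax = params$retmax),
              fmt)
  # Drop elements that were not supplied (NULL)
  fields <- fields[!vapply(fields, is.null, logical(1))]
  values <- vapply(fields,
                   function(v) utils::URLencode(as.character(v), reserved = TRUE),
                   character(1))
  paste0(base, "?", paste0(names(fields), "=", values, collapse = "&"))
}

build_efetch_url(list(web_env = "MCID_abc", query_key = 1, format = "uilist",
                      retstart = 0, retmax = 500))
```

The 'web_env' value used here is a made-up placeholder. A real value comes from the ESearch response.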
Value
Character vector including the response from the server.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
x <- easyPubMed:::EPM_esearch_basic_q(params = list(q = "easyPubMed"))
x <- easyPubMed:::EPM_esearch_parse(x)
my_params <- list(web_env = x$web_env,
query_key = x$query_key,
format = "uilist")
easyPubMed:::EPM_efetch_basic_q(params = my_params)
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
Encode Metadata to an XML String.
Description
Encode a list of meta information from an easyPubMed object into an XML string. This meta information is used to keep track of easyPubMed query jobs and/or to re-build objects starting from XML files saved on a local disk.
Usage
EPM_encode_meta_to_xml(meta, job_list, i, encoding)
Arguments
meta |
List including metadata associated with an easyPubMed query job. It corresponds to the contents of the 'meta' slot of an easyPubMed object. |
job_list |
Data.frame that defines the list of sub-queries of an easyPubMed query job. It corresponds to the 'job_list' data.frame included in the 'misc' slot of an easyPubMed object. |
i |
Integer, index of the batch (query sub-job) being written to file. |
encoding |
String, this is the Encoding of the contents/text being retrieved from the Entrez server (typically, 'UTF-8'). |
Value
String, chunk of XML-decorated text including meta information.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
tmp_meta <- list(max_records_per_batch = 1000,
exp_count = 10,
exp_num_of_batches = 1,
all_records_covered = TRUE,
exp_missed_records = 0,
query_date = "2023-10-16 23:13:29",
UID = 'EPMJ_20231017141741_c4das',
EPM_version = "3.01")
tmp_jobs <- data.frame(query_string = 'my test query',
init_date = '1990/01/01',
end_date = '2023/01/01',
diff_days = 12053,
exp_count = 10,
stringsAsFactors = FALSE)
easyPubMed:::EPM_encode_meta_to_xml(meta = tmp_meta, job_list = tmp_jobs,
i = 1, encoding = 'UTF-8' )
Submit a Query to the NCBI ESearch Server.
Description
Submit a Query to the NCBI ESearch Server and capture the response.
Usage
EPM_esearch_basic_q(params)
Arguments
params |
List including the information for querying the NCBI ESearch Server. |
Details
The 'params' list must include the elements listed below.
'q'. String corresponding to the Query to be submitted to the server.
'api_key'. (Optional) String corresponding to the NCBI API key.
Value
Character vector including the response from the server.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
my_q <- 'easyPubMed'
my_params <- list(q = my_q)
easyPubMed:::EPM_esearch_basic_q(params = my_params)
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
Retrieve Results via an ESearch and EFetch Sequence.
Description
Submit a Query to the NCBI ESearch Server, capture the response and retrieve the corresponding PubMed records from the NCBI EFetch Server. Up to the first n=10,000 records returned by the query will be retrieved (as per the NCBI policy). This does not include a timeout limit to complete the operation.
Usage
EPM_esearch_efetch_seq(
query_string,
api_key = NULL,
batch_size = 500,
encoding = "UTF-8",
format = "xml",
max_restart_attempts = 10
)
Arguments
query_string |
String (character vector of length 1), corresponding to the query string to be submitted to the remote server. |
api_key |
String (character vector of length 1), corresponding to the NCBI API key. Can be NULL. |
batch_size |
Integer, max number of records to be retrieved as a batch. This corresponds to the "retmax" NCBI parameter. |
encoding |
String (character vector of length 1), encoding of the resulting records (e.g., "UTF-8"). |
format |
String (character vector of length 1), desired format of the PubMed records. This must be one of the values in c("xml", "medline", "uilist"). |
max_restart_attempts |
Integer, max number of attempts in case of a failed iteration. |
Value
Character vector including the response from the server.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
qry <- 'Damiano Fantini[AU] AND "2018"[PDAT]'
easyPubMed:::EPM_esearch_efetch_seq(query_string = qry, format = "uilist")
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
Parse Responses from the NCBI ESearch Server.
Description
Parse Responses from the NCBI ESearch Server and return a list of information that can be used for retrieving PubMed records from the NCBI EFetch Server.
Usage
EPM_esearch_parse(x)
Arguments
x |
String (character vector of length 1), this is the xml string returned by the NCBI ESearch Server. |
Details
The output list includes the following items.
'web_env'. String, unique identifier for fetching PubMed records corresponding to the current query.
'query_key'. Integer, unique numeric key for fetching PubMed records corresponding to the current query.
'count'. Integer, expected number of records returned by the current query.
'query_translation'. String, translation of the Query string provided by the user.
Value
List including information extracted from the NCBI ESearch Server response.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
my_q <- 'easyPubMed'
my_params <- list(q = my_q)
x <- easyPubMed:::EPM_esearch_basic_q(params = my_params)
easyPubMed:::EPM_esearch_parse(x)
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
Generate a Unique Query Key.
Description
Generate a pseudo-random key that uniquely identifies easyPubMed objects. The key is a 46-char string that includes the current date + time and a list of randomly selected characters, numbers and special characters. The unique key is typically saved in the 'meta' slot of an easyPubMed object, and is also written to local files when records are downloaded and saved in XML format. This function takes NO arguments.
Usage
EPM_init_unique_key()
Value
string, a 46-char unique key.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
easyPubMed:::EPM_init_unique_key()
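A key with the same overall layout can be built in a few lines of base R. The helper 'make_job_key()' is hypothetical. The 'EPMJ_' prefix and the timestamp layout are assumptions inferred from the job identifier shown in the 'EPM_decode_xml_meta()' example of this manual, and the random part is drawn from lowercase letters and digits only.

```r
# Hypothetical helper, not the package implementation.
make_job_key <- function() {
  stamp <- format(Sys.time(), "%Y%m%d%H%M%S")             # 14 characters
  pool  <- c(letters, 0:9)
  rnd   <- paste(sample(pool, 26, replace = TRUE), collapse = "")
  paste0("EPMJ_", stamp, "_", rnd)                         # 5 + 14 + 1 + 26 = 46
}

key <- make_job_key()
nchar(key)    # 46
```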
Split A PubMed Retrieval Job into Manageable Batches.
Description
Assess the number of PubMed records expected from a user-provided query and split the job in multiple sub-queries if the number is bigger than "max_records_per_batch" (typically, n=10,000). Sub-queries are split according to the "Create Date" of PubMed records. This does not support splitting jobs returning more than "max_records_per_batch" (typically, n=10,000) records that have the same "Create Date" (i.e., "[CRDT]").
Usage
EPM_job_split(
query_string,
api_key = NULL,
max_records_per_batch = 9999,
verbose = FALSE
)
Arguments
query_string |
String (character vector of length 1), corresponding to the query string. |
api_key |
String (character vector of length 1), corresponding to the NCBI API key. Can be NULL. |
max_records_per_batch |
Integer, maximum number of records that should be expected per sub-query. This number should be in the range 1,000 to 10,000 (typically, max_records_per_batch=10,000). |
verbose |
logical, shall progress information be printed to console. |
Value
Character vector including the response from the server.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
qry <- 'Damiano Fantini[AU] AND "2018"[PDAT]'
easyPubMed:::EPM_job_split(query_string = qry, verbose = TRUE)
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
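The date-based splitting idea can be sketched without contacting NCBI. The helper 'split_by_date()' below is hypothetical and uses plain bisection of the date range, which is an assumption and not necessarily the strategy used by 'EPM_job_split()'. The 'count_fun' argument stands in for an ESearch count request, and 'fake_count()' simulates it with a fixed number of records per day.

```r
# Hypothetical helper: bisect a date window until every piece is expected
# to return at most 'max_records' records.
split_by_date <- function(init_date, end_date, count_fun, max_records = 9999) {
  n <- count_fun(init_date, end_date)
  if (n <= max_records || init_date == end_date) {
    # Small enough, or a single day that cannot be split any further
    return(data.frame(init_date = init_date, end_date = end_date,
                      exp_count = n, stringsAsFactors = FALSE))
  }
  mid <- init_date + floor(as.numeric(end_date - init_date) / 2)
  rbind(split_by_date(init_date, mid, count_fun, max_records),
        split_by_date(mid + 1, end_date, count_fun, max_records))
}

# Simulated counter (assumption): every day holds exactly 40 records
fake_count <- function(from, to) 40 * (as.numeric(to - from) + 1)

jobs <- split_by_date(as.Date("2020-01-01"), as.Date("2020-12-31"),
                      fake_count, max_records = 5000)
jobs
```

A single day holding more than 'max_records' records is returned as is, which mirrors the limitation stated in the Description.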
Parse and Format PubMed MeSH Terms.
Description
Extract MeSH information from a slice of a raw XML PubMed record. Both MeSH codes and MeSH terms are returned.
Usage
EPM_mesh_parse(x)
Arguments
x |
String (character vector of length 1) including an XML Mesh term field/section from a PubMed record. |
Value
list including n=2 elements (character vectors): mesh_codes and mesh_terms.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
msh <- paste0('<MeshHeading><DescriptorName UI=\"D000465\" >',
'Algorithms</DescriptorName></MeshHeading>')
easyPubMed:::EPM_mesh_parse(msh)
Map Job Batches to Filenames.
Description
Build Filenames Matching job sub-tasks. Each filename corresponds to a series of records returned by a specific job batch. The associated filename indicates where the corresponding records will be written on the local disc (if requested by the user).
Usage
EPM_prep_outfile(job_list, path, prefix)
Arguments
job_list |
data.frame. This is the 'job_list' data.frame included in the 'misc' slot of an 'easyPubMed' object. |
path |
folder on the local computer where files will be saved. It must be an existing directory. |
prefix |
string used as common prefix for all files written as part of the same PubMed record download job. |
Value
character vector pointing to the target files where PubMed records will be written.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
test_df <- data.frame(query_string = c('ANY', 'ANY'),
init_date = c('2020/01/01', '2020/01/10'),
end_date = c('2020/01/11', '2020/01/20'),
diff_days = c(10, 10),
exp_count = c(100, 100))
easyPubMed:::EPM_prep_outfile(test_df, path = '.', prefix = 'my_test_job')
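The mapping from batches to file names can be sketched as follows. The helper 'make_batch_files()' is hypothetical. The '_batch_NN.txt' naming pattern is an assumption inferred from the file name 'qpm_qry__batch_01.txt' used in the 'EPM_read_xml()' example of this manual, and the minimum padding width of two digits is a guess.

```r
# Hypothetical helper, not the package implementation.
make_batch_files <- function(n_batches, path, prefix) {
  stopifnot(dir.exists(path))                       # folder must already exist
  width <- max(2, nchar(as.character(n_batches)))   # at least two digits
  idx <- formatC(seq_len(n_batches), width = width, flag = "0")
  file.path(path, paste0(prefix, "_batch_", idx, ".txt"))
}

make_batch_files(2, path = ".", prefix = "my_test_job")
# "./my_test_job_batch_01.txt" "./my_test_job_batch_02.txt"
```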
Import PubMed Records Saved Locally in XML Format.
Description
Read the contents of an XML file and import Metadata and PubMed records for use by easyPubMed. The XML file must be generated by easyPubMed (ver >= 3) via the 'epm_fetch()' function or via the 'fetchEPMData()' method. XML files downloaded from the Web or using other software are currently unsupported. This function can only process one file.
Usage
EPM_read_xml(x)
Arguments
x |
Path to an XML file on the local machine. |
Value
List including four elements: 'guide' (data.frame), 'meta' (list), 'job_info' (data.frame) and 'contents' (named list).
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
## Not run:
x <- epm_query(query_string = 'easyPubMed', verbose = TRUE)
x <- epm_fetch(x = x, write_to_file = TRUE, store_contents = FALSE,
outfile_prefix = 'qpm_qry_', verbose = TRUE)
y <- EPM_read_xml(x = 'qpm_qry__batch_01.txt')
try(unlink('qpm_qry__batch_01.txt'), silent = TRUE)
y
## End(Not run)
Parse and Format References.
Description
Extract reference information from a raw XML string, typically extracted from a PubMed record. Users can select the type of identifier to extract and return, as well as the maximum number of references to be returned.
Usage
EPM_reference_parse(x, max_references = 100, id_type = "pmid")
Arguments
x |
String (character vector of length 1) including a List of references obtained from a PubMed record. |
max_references |
Numeric (of length 1). Maximum number of references to extract/include. This should be an integer '>=0'. |
id_type |
String (character vector of length 1). Type of identifier to be used for references. One of the following values is expected: c('pmid', 'doi', 'pmc'). |
Value
data.frame including one row for each author and n=4 columns: lastname, forename, address and email.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
ref <- paste0('<xml><Reference><Citation>',
'<ArticleId IdType=\"pubmed\">25822800</ArticleId>',
'<ArticleId IdType=\"pmc\">PMC4739640</ArticleId>',
'</Citation></Reference></xml>')
easyPubMed:::EPM_reference_parse(ref)
easyPubMed:::EPM_reference_parse(ref, id_type = 'pmc')
Submit a Query and Retrieve Results from PubMed.
Description
Submit a Query to the NCBI ESearch Server, capture the response and retrieve the corresponding PubMed records from the NCBI EFetch Server. Up to the first n=10,000 records returned by the query will be retrieved (as per the NCBI policy). The operation must be completed within a user-defined timeout window otherwise it will be killed.
Usage
EPM_retrieve_data(
query_string,
api_key = NULL,
format = "xml",
encoding = "UTF-8",
timeout = 600,
batch_size = 500,
max_restart_attempts = 10
)
Arguments
query_string |
String (character vector of length 1), corresponding to the query string. |
api_key |
String (character vector of length 1), corresponding to the NCBI API key. Can be NULL. |
format |
String (character vector of length 1), desired format of the PubMed records. This must be one of the values in c("xml", "medline", "uilist"). |
encoding |
String (character vector of length 1), encoding of the resulting records (e.g., "UTF-8"). |
timeout |
Integer, time allowed for completing the operation (in seconds). |
batch_size |
Integer, max number of records to be retrieved as a batch. This corresponds to the "retmax" NCBI parameter. |
max_restart_attempts |
Integer, max number of attempts in case of a failed iteration. |
Value
Character vector including the response from the server.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
qry <- 'Damiano Fantini[AU] AND "2018"[PDAT]'
easyPubMed:::EPM_retrieve_data(qry, format = "uilist")
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
Submit a Query and Read the Response from the Server.
Description
Submit a request to a server (typically, the Entrez Eutils server) and capture the response.
Usage
EPM_submit_q(qurl)
Arguments
qurl |
String (character vector of length 1), corresponding to the query URL to the remote server. |
Value
Character vector including the response from the server.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
qry <- paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/",
"esearch.fcgi?db=pubmed&term=easyPubMed")
easyPubMed:::EPM_submit_q(qry)
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
Validate Parameters of a PubMed Retrieval Job.
Description
Check and correct (if needed) the parameters of an easyPubMed retrieval job.
Usage
EPM_validate_fetch_params(params)
Arguments
params |
list of user-provided parameters. |
Details
The following elements are expected and/or parsed from the 'params' list:
'encoding'. String, e.g. "UTF-8".
'format'. String, must be one of the following values: c('uilist', 'medline', 'xml').
'store_contents'. Logical, shall retrieved contents be stored in the object. If 'FALSE', the 'write_to_file' argument must be 'TRUE'.
'write_to_file'. Logical, shall retrieved contents be written to a file (or list of files). If 'FALSE', the 'store_contents' argument must be 'TRUE'.
'outfile_path'. String, path to the folder where files will be written. This argument is evaluated only if 'write_to_file' is 'TRUE'.
'outfile_prefix'. String, prefix of the files that will be written locally. This argument is evaluated only if 'write_to_file' is 'TRUE'.
'api_key'. String, NCBI API key. Can be NULL.
'max_records_per_batch'. Integer scalar (numeric vector of length 1), this is the maximum number of records retrieved per batch. It defaults to 10,000.
'verbose'. Logical, shall details about the progress of the operation be printed to console.
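The constraints listed above can be expressed as a short validation sketch. The helper 'check_fetch_params()' is hypothetical. Its default values (other than 10,000 records per batch), the choice to stop with an error when both 'store_contents' and 'write_to_file' are 'FALSE', and the fallback to the current directory are assumptions. The package may correct such inputs differently.

```r
# Hypothetical helper, not the package implementation.
check_fetch_params <- function(params) {
  defaults <- list(encoding = "UTF-8", format = "xml",
                   store_contents = TRUE, write_to_file = FALSE,
                   max_records_per_batch = 10000, verbose = FALSE)
  # User-supplied values override the assumed defaults
  out <- utils::modifyList(defaults, params)
  out$format <- match.arg(out$format, c("uilist", "medline", "xml"))
  if (!out$store_contents && !out$write_to_file) {
    stop("'store_contents' and 'write_to_file' cannot both be FALSE")
  }
  if (out$write_to_file && is.null(out$outfile_path)) {
    out$outfile_path <- "."      # assumed fallback: current directory
  }
  out
}

str(check_fetch_params(list(format = "medline", verbose = TRUE)))
```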
Value
list including the vetted parameters.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
prms <- list(
encoding = 'UTF-8',
format = 'xml',
api_key = NULL,
store_contents = TRUE,
write_to_file = FALSE,
verbose = TRUE)
easyPubMed:::EPM_validate_fetch_params(prms)
Validate Parameters of a PubMed Record Parsing Job.
Description
Check and correct (if needed) the parameters of an easyPubMed Record Parsing job.
Usage
EPM_validate_parse_params(params)
Arguments
params |
list of user-provided parameters. |
Details
The following elements are expected and/or parsed from the 'params' list:
'max_authors'. Numeric, maximum number of authors to retrieve. If this is set to -1, only the last author is extracted. If this is set to 1, only the first author is returned. If this is set to 2, the first and the last authors are extracted. If this is set to any other positive number (i), up to the leading (i-1) authors are retrieved together with the last author. If this is set to a number larger than the number of authors in a record, all authors are returned. Note that at least 1 author has to be retrieved, therefore a value of 0 is not accepted (coerced to -1).
'autofill_address'. Logical, shall author affiliations be propagated within each record to fill missing values.
'compact_output'. Logical, shall record data be returned in a compact format where each row is a single record and author names are collapsed together. If 'FALSE', each row corresponds to a single author of the publication and the record-specific data are recycled for all included authors.
'include_abstract'. Logical, shall abstract text be included in the output 'data.frame'.
'max_references'. Numeric, maximum number of references to return (from each PubMed record).
'ref_id_type'. String, must be one of the following values: c('pmid', 'doi').
'verbose'. Logical, shall details about the progress of the operation be printed to console.
Value
list including the vetted parameters.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
prms <- list(
max_authors = 12,
autofill_address = TRUE,
compact_output = FALSE,
include_abstract = TRUE,
max_references = 100,
ref_id_type = 'doi',
verbose = TRUE)
easyPubMed:::EPM_validate_parse_params(prms)
Write PubMed Records to Local Files.
Description
Write a list of PubMed records to a local file. If already existing, the destination file will be over-written. Original formatting of the PubMed records should be declared and will be preserved in the output file. Format conversion is NOT supported.
Usage
EPM_write_to_file(x, to, format, addon = NULL, verbose = FALSE)
Arguments
x |
List including raw PubMed records. |
to |
Path to the destination file on the local disc. |
format |
String, format of the raw PubMed records that will be saved to the destination file (e.g., 'xml'). |
addon |
String, optional chunk of text in XML format to be written to the destination file (header). This argument is only used when 'format' is set to 'xml'. It can be NULL. |
verbose |
Logical, shall details about the progress of the operation be printed to console. |
Value
integer in the range c(0, 1). A result of 0 indicates that an error occurred while writing the file. A result of 1 indicates that the operation was completed successfully.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
test <- list('Record #1', 'Record #2')
outfile <- './test_file.txt'
file.exists(outfile)
easyPubMed:::EPM_write_to_file(x = test, to = outfile, format = 'xml')
file.exists(outfile)
readLines(outfile)
unlink(outfile)
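The 0/1 return convention described in the Value section can be sketched with 'tryCatch()'. The helper 'safe_write()' is hypothetical. It writes records as plain lines and ignores the 'format' and 'addon' arguments, so it only illustrates the status code, not the file layout produced by the package.

```r
# Hypothetical helper, not the package implementation: returns 1L when the
# file was written and 0L when writing failed.
safe_write <- function(x, to) {
  tryCatch({
    writeLines(unlist(x), con = to)    # overwrites an existing file
    1L
  },
  warning = function(w) 0L,
  error = function(e) 0L)
}

ok_file <- tempfile(fileext = ".txt")
safe_write(list("Record #1", "Record #2"), ok_file)    # 1
unlink(ok_file)

# Writing into a folder that does not exist fails and yields 0
safe_write(list("Record #1"), file.path(tempdir(), "no_such_dir_xyz", "f.txt"))
```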
Harmonize the Elements of a Vector by Adding Leading Zeros.
Description
Coerce a vector to character and then harmonize the number of characters (nchar) of each element by adding a suitable number of leading zeroes (or another user-specified character).
Usage
EPM_zerofill(x, fillchar = "0")
Arguments
x |
vector (numeric or character). |
fillchar |
string corresponding to a single character. This character is going to be added (one or more times) in front of each element of the input vector. |
Value
character vector whose elements have all the same size (number of characters).
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Example 1
easyPubMed:::EPM_zerofill(c(1, 100, 1000))
# Example 2
easyPubMed:::EPM_zerofill(c('Hey,', 'hello', 'there!'), '_')
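The padding logic can be reproduced with base R in a few lines. The helper 'pad_left()' is hypothetical and is shown only to illustrate the technique. It assumes 'fillchar' is a single character.

```r
# Hypothetical helper, not the package implementation.
pad_left <- function(x, fillchar = "0") {
  x <- as.character(x)
  n_missing <- max(nchar(x)) - nchar(x)      # characters to add per element
  paste0(strrep(fillchar, n_missing), x)
}

pad_left(c(1, 100, 1000))                     # "0001" "0100" "1000"
pad_left(c("Hey,", "hello", "there!"), "_")   # "__Hey," "_hello" "there!"
```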
Retrieve Text Between XML Tags
Description
Extract text from a string containing XML or HTML tags. Text included between tags of interest will be returned. If multiple tagged substrings are found, they will be returned as different elements of a list or character vector.
Usage
custom_grep(xml_data, tag, format = "list")
Arguments
xml_data |
String (of class character and length 1): corresponds to the PubMed record or any string including XML/HTML tags. |
tag |
String (of class character and length 1): the tag of interest (does NOT include < > chars). |
format |
c("list", "char"): specifies the format for the output. |
Details
The 'custom_grep()' function is now obsolete. This is a helper function that will be replaced by 'easyPubMed:::EPM_custom_grep()', an internal function that won't be exported. The 'custom_grep()' function will be retired in 2026.
Value
List or vector where each element corresponds to an in-tag substring.
Author(s)
Damiano Fantini damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
try({
## extract substrings based on regular expressions
string_01 <- paste0(
"The itsy bitsy <strong>spider</strong> ",
"Went up the water spout. Down came the rain ",
"And <strong>washed the spider out</strong>.")
print(string_01)
custom_grep(xml_data = string_01, tag = "strong", format = "char")
custom_grep(xml_data = string_01, tag = "strong", format = "list")
}, silent = TRUE)
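Since 'custom_grep()' is scheduled for retirement, it may help to see how in-tag text can be extracted with base R alone. The sketch below uses a regular expression and is only an illustration: 'extract_tag_sketch()' is a hypothetical name, it assumes 'tag' is a plain tag name without regex special characters, and it does not handle nested tags of the same name.

```r
# Hypothetical helper: return the text found between <tag> and </tag>
extract_tag_sketch <- function(xml_data, tag) {
  # (?s) lets '.' match newlines; '.*?' keeps each match as short as possible
  pattern <- paste0("(?s)<", tag, "(\\s[^>]*)?>(.*?)</", tag, ">")
  hits <- regmatches(xml_data, gregexpr(pattern, xml_data, perl = TRUE))[[1]]
  # Strip the opening and closing tags from each match
  strip <- paste0("^<", tag, "(\\s[^>]*)?>|</", tag, ">$")
  gsub(strip, "", hits, perl = TRUE)
}

string_01 <- paste0(
  "The itsy bitsy <strong>spider</strong> ",
  "Went up the water spout. Down came the rain ",
  "And <strong>washed the spider out</strong>.")

as_char <- extract_tag_sketch(string_01, "strong")
as_list <- as.list(as_char)
no_hits <- extract_tag_sketch(string_01, "em")
```

When no tag is found, the sketch returns an empty character vector rather than raising an error.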
Class easyPubMed.
Description
Class easyPubMed defines objects that represent PubMed Query jobs and the corresponding results. Briefly, these objects are initialized using information that will guide the communication with the NCBI Entrez server. Also, easyPubMed objects are used to store raw and processed data retrieved from PubMed.
Usage
## S4 method for signature 'easyPubMed'
initialize(.Object, query_string, job_info)
Arguments
.Object |
The easyPubMed object being built. |
query_string |
String (character vector of length 1) corresponding to the user-provided text of the query to be submitted to PubMed. |
job_info |
List, this should be the output of 'EPM_job_split()'. |
Slots
query
String (character vector of length 1) corresponding to the PubMed request submitted by the user.
meta
List including meta information about the PubMed Query job.
uilist
List including all unique identifiers corresponding to the PubMed records returned by the query. Can be empty.
raw
List including the raw data (in 'xml' or 'medline' format) retrieved from the NCBI eFetch server. Can be empty.
data
Data.frame including processed data based on the xml raw data retrieved from PubMed.
misc
List including additional information.
Author(s)
Damiano Fantini damiano.fantini@gmail.com
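The slot layout listed above, together with the get/set accessor pattern used throughout this manual, can be mimicked with a toy S4 class. Everything in this sketch ('toyEPM', 'getToyQuery()', 'setToyQuery()') is hypothetical and only mirrors the documented slot names; it assumes that setters return the modified object, which is the usual S4 convention but is not confirmed here for easyPubMed.

```r
library(methods)

# Toy class with the same slot names as the documented 'easyPubMed' class
setClass("toyEPM", slots = c(
  query = "character", meta = "list", uilist = "list",
  raw = "list", data = "data.frame", misc = "list"))

# Getter: return the content of the 'query' slot
setGeneric("getToyQuery", function(x) standardGeneric("getToyQuery"))
setMethod("getToyQuery", "toyEPM", function(x) x@query)

# Setter: replace the 'query' slot and return the updated object
setGeneric("setToyQuery", function(x, y) standardGeneric("setToyQuery"))
setMethod("setToyQuery", signature("toyEPM", "character"), function(x, y) {
  x@query <- y
  validObject(x)
  x
})

obj <- new("toyEPM", query = "cancer[TI]")
obj <- setToyQuery(obj, 'bladder cancer[TI] AND "2018"[PDAT]')
current_query <- getToyQuery(obj)
```

Because R objects are copied on modification, the result of a setter has to be assigned back (as in 'obj <- setToyQuery(obj, ...)') for the change to persist.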
Fetch Raw Records from Pubmed.
Description
Fetch raw PubMed records from PubMed. Records can be downloaded in text or xml format and stored into a local object or written to local files.
Usage
epm_fetch(
x,
format = "xml",
api_key = NULL,
write_to_file = FALSE,
outfile_path = NULL,
outfile_prefix = NULL,
store_contents = TRUE,
encoding = "UTF-8",
verbose = TRUE
)
Arguments
x |
An 'easyPubMed' object. |
format |
String, the desired format for the raw records. This argument must take one of the following values: 'c("uilist", "medline", "xml")' and defaults to '"xml"'. |
api_key |
String, corresponding to the NCBI API token (if available). NCBI token strings can be requested from NCBI. Record download will be faster if a valid NCBI token is used. This argument can be 'NULL'. |
write_to_file |
Logical of length 1. Shall raw records be written to a file on the local machine. It defaults to 'FALSE'. |
outfile_path |
Path to the folder on the local machine where files will be saved (if 'write_to_file' is 'TRUE'). It must point to an already existing directory. If 'NULL', the working directory will be used. |
outfile_prefix |
String, prefix that will be added to the name of each file written to the local machine. This argument is parsed only when 'write_to_file' is 'TRUE'. If 'NULL', an arbitrary prefix will be added (easypubmed_job_YYYYMMDDHHMM). |
store_contents |
Logical of length 1. Shall raw records be stored in the 'easyPubMed' object. It defaults to 'TRUE'. It may be convenient to switch this to 'FALSE' when downloading a large number of records. If 'store_contents' is 'FALSE', 'write_to_file' must be 'TRUE'. |
encoding |
String, the encoding of the records retrieved from PubMed. Typically, this is 'UTF-8'. |
verbose |
Logical, shall details about the progress of the operation be printed to console. |
Value
an easyPubMed object.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
x <- epm_query(query_string = 'Damiano Fantini[AU] AND "2018"[PDAT]')
x <- epm_fetch(x = x, format = 'uilist')
x
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
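The constraints among 'write_to_file', 'outfile_path', 'outfile_prefix' and 'store_contents' listed in the Arguments above can be summarized as a short validation sketch. The helper 'check_fetch_args_sketch()' is hypothetical and is not the package's own validation routine; it only encodes the rules stated in this help page.

```r
# Hypothetical validator encoding the documented argument rules
check_fetch_args_sketch <- function(write_to_file = FALSE,
                                    outfile_path = NULL,
                                    outfile_prefix = NULL,
                                    store_contents = TRUE) {
  # Records must end up somewhere: in the object or on disk
  if (!store_contents && !write_to_file) {
    stop("If 'store_contents' is FALSE, 'write_to_file' must be TRUE")
  }
  if (write_to_file) {
    # NULL path: fall back to the working directory
    if (is.null(outfile_path)) outfile_path <- getwd()
    if (!dir.exists(outfile_path)) {
      stop("'outfile_path' must point to an existing directory")
    }
    # NULL prefix: easypubmed_job_YYYYMMDDHHMM
    if (is.null(outfile_prefix)) {
      outfile_prefix <- paste0("easypubmed_job_",
                               format(Sys.time(), "%Y%m%d%H%M"))
    }
  }
  list(write_to_file = write_to_file, outfile_path = outfile_path,
       outfile_prefix = outfile_prefix, store_contents = store_contents)
}

ok_args <- check_fetch_args_sketch(write_to_file = TRUE,
                                   outfile_path = tempdir(),
                                   store_contents = FALSE)
bad_args <- try(check_fetch_args_sketch(store_contents = FALSE),
                silent = TRUE)
```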
Import PubMed Records from Local Files.
Description
Read one or more text files including XML-decorated raw PubMed records and rebuild an 'easyPubMed' object. The function expects all files to be generated from the same query using 'easyPubMed>3.0' and the 'epm_fetch()' function setting 'write_to_file' to 'TRUE'. This function can import a fraction or all of the files resulting from a single query. Files resulting from non-compatible fetch jobs will be dropped.
Usage
epm_import_xml(x)
Arguments
x |
Character vector, the paths to text files including XML-decorated raw PubMed records saved using 'easyPubMed>3.0'. |
Value
an 'easyPubMed' object including raw XML PubMed records.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
x <- epm_query(query_string = 'Damiano Fantini[AU] AND "2018"[PDAT]')
x <- epm_fetch(x = x, format = 'xml', write_to_file = TRUE,
outfile_prefix = 'test', store_contents = FALSE)
y <- epm_import_xml('test_batch_01.txt')
tryCatch({unlink('test_batch_01.txt')}, error = function(e) { NULL })
print(paste0(' Raw Record Num (fetched): ',
getEPMMeta(x)$raw_record_num))
print(paste0('Raw Record Num (read & rebuilt): ',
getEPMMeta(y)$raw_record_num))
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
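In the example above, a job fetched with 'outfile_prefix = "test"' produces a file named 'test_batch_01.txt'. Assuming that further batches follow the same pattern with a two-digit counter (an assumption based only on that single file name), the candidate paths to pass to 'epm_import_xml()' could be assembled as sketched below. 'batch_file_names_sketch()' is a hypothetical helper.

```r
# Hypothetical helper: build candidate batch file paths for a given prefix.
# Assumption: files are named <prefix>_batch_<NN>.txt with a 2-digit index.
batch_file_names_sketch <- function(prefix, n_batches, dir = ".") {
  file.path(dir, sprintf("%s_batch_%02d.txt", prefix, seq_len(n_batches)))
}

candidates <- batch_file_names_sketch("test", n_batches = 3)
# Keep only the files that are actually present on disk
to_import <- candidates[file.exists(candidates)]
print(basename(candidates))
```

Filtering with 'file.exists()' before importing avoids passing paths that were never written, for example when a job produced fewer batches than expected.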
Extract Information from a Raw PubMed Record.
Description
Read the raw PubMed records stored in an 'easyPubMed' object, identify XML tags, extract information and cast it into a structured data.frame. The expected input is an 'easyPubMed' object including raw records downloaded in the 'xml' format. Information about article title, authors, affiliations, journal name and abbreviation, publication date, references, and keywords is returned.
Usage
epm_parse(
x,
max_authors = 10,
autofill_address = TRUE,
compact_output = TRUE,
include_abstract = TRUE,
max_references = 150,
ref_id_type = "doi",
verbose = TRUE
)
Arguments
x |
An 'easyPubMed' object. The object must include raw records (n>0) downloaded in the 'xml' format. |
max_authors |
Numeric, maximum number of authors to retrieve. If this is set to -1, only the last author is extracted. If this is set to 1, only the first author is returned. If this is set to 2, the first and the last authors are extracted. If this is set to any other positive number (n), up to the leading (n-1) authors are retrieved together with the last author. If this is set to a number larger than the number of authors in a record, all authors are returned. Note that at least 1 author has to be retrieved, therefore a value of 0 is not accepted (coerced to -1). |
autofill_address |
Logical, shall author affiliations be propagated within each record to fill missing values. |
compact_output |
Logical, shall record data be returned in a compact format where each row is a single record and author names are collapsed together. If 'FALSE', each row corresponds to a single author of the publication and the record-specific data are recycled for all included authors (legacy approach). |
include_abstract |
Logical, shall abstract text be included in the output data.frame. If 'FALSE', the abstract text column is populated with a missing value. |
max_references |
Numeric, maximum number of references to return (for each PubMed record). |
ref_id_type |
String, must be one of the following values: 'c("pmid", "doi")'. Type of identifier used to describe citation references. |
verbose |
Logical, shall details about the progress of the operation be printed to console. |
Value
an easyPubMed object including a data.frame ('data' slot) that stores information extracted from its raw XML PubMed records.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.7)
try({
x <- epm_query(query_string = 'Damiano Fantini[AU] AND "2018"[PDAT]')
x <- epm_fetch(x = x, format = 'xml')
x <- epm_parse(x, include_abstract = FALSE, max_authors = 1)
get_epm_data(x)
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
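The 'max_authors' rules described in the Arguments above are easier to follow when written out as code. The sketch below applies them to a plain character vector of author names; 'select_authors_sketch()' is a hypothetical helper, not the package's parser, and treating every negative value like -1 is an assumption (only -1 is documented).

```r
# Hypothetical helper applying the documented 'max_authors' rules
select_authors_sketch <- function(authors, max_authors = 10) {
  n_auth <- length(authors)
  if (n_auth == 0) return(authors)
  # A value of 0 is not accepted and is coerced to -1
  if (max_authors == 0) max_authors <- -1
  # -1: last author only (assumption: any negative value behaves the same)
  if (max_authors < 0) return(authors[n_auth])
  # Limit at or above the author count: return everyone
  if (max_authors >= n_auth) return(authors)
  # 1: first author only
  if (max_authors == 1) return(authors[1])
  # n >= 2: leading (n - 1) authors plus the last author
  c(authors[seq_len(max_authors - 1)], authors[n_auth])
}

auth <- c("A", "B", "C", "D", "E")
last_only  <- select_authors_sketch(auth, -1)
first_only <- select_authors_sketch(auth, 1)
first_last <- select_authors_sketch(auth, 2)
three      <- select_authors_sketch(auth, 3)
everyone   <- select_authors_sketch(auth, 10)
```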
Extract Information from a Raw PubMed Record.
Description
Read a raw PubMed record, identify XML tags, extract information and cast it into a structured 'data.frame'. The expected input is an XML-tag-decorated string corresponding to a single PubMed record. Information about article title, authors, affiliations, journal name and abbreviation, publication date, references, and keywords is returned.
Usage
epm_parse_record(
pubmedArticle,
max_authors = 15,
autofill_address = TRUE,
compact_output = TRUE,
include_abstract = TRUE,
max_references = 1000,
ref_id_type = "pmid"
)
Arguments
pubmedArticle |
String, this is an XML-tag-decorated raw PubMed record. |
max_authors |
Numeric, maximum number of authors to retrieve. If this is set to -1, only the last author is extracted. If this is set to 1, only the first author is returned. If this is set to 2, the first and the last authors are extracted. If this is set to any other positive number (n), up to the leading (n-1) authors are retrieved together with the last author. If this is set to a number larger than the number of authors in a record, all authors are returned. Note that at least 1 author has to be retrieved, therefore a value of 0 is not accepted (coerced to -1). |
autofill_address |
Logical, shall author affiliations be propagated within each record to fill missing values. |
compact_output |
Logical, shall record data be returned in a compact format where each row is a single record and author names are collapsed together. If 'FALSE', each row corresponds to a single author of the publication and the record-specific data are recycled for all included authors. |
include_abstract |
Logical, shall abstract text be included in the output data.frame. If 'FALSE', the abstract text column is populated with a missing value. |
max_references |
Numeric, maximum number of references to return (for each PubMed record). |
ref_id_type |
String, must be one of the following values: 'c("pmid", "doi")'. Type of identifier used to describe citation references. |
Value
a data.frame including information extracted from a raw XML PubMed record.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
data(epm_samples)
x <- epm_samples$bladder_cancer_2018$demo_data_03$raw[[1]]
epm_parse_record(x)
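The difference between the two layouts controlled by 'compact_output' (one row per author versus one row per record) can be shown on a toy table. The column names and the "; " separator below are assumptions made for illustration and may differ from the columns actually returned by 'epm_parse_record()'.

```r
# Toy author-level table: record data are recycled on every author row
long <- data.frame(
  pmid   = c("1", "1", "2"),
  title  = c("T1", "T1", "T2"),
  author = c("Smith J", "Doe A", "Lee K"),
  stringsAsFactors = FALSE)

# Collapse author names within each record (assumed separator: "; ")
authors_by_pmid <- vapply(split(long$author, long$pmid),
                          paste, character(1), collapse = "; ")

# Compact layout: one row per record, authors collapsed together
compact <- unique(long[, c("pmid", "title")])
compact$authors <- unname(authors_by_pmid[compact$pmid])
rownames(compact) <- NULL
print(compact)
```

The compact layout is convenient for record-level summaries, while the author-level layout is better suited to per-author analyses such as affiliation counts.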
Search for PubMed Records.
Description
Query PubMed (Entrez) via the PubMed API eSearch utility.
Calling this function results in submitting a query to the NCBI EUtils
server and then capturing and parsing the response.
The number of records expected to be returned by the query is
determined. If this number is bigger than n=10,000, the record retrieval job
is automatically split in a list of smaller manageable sub-queries.
This function returns an "easyPubMed" object, which includes all
information required to retrieve PubMed records using the epm_fetch()
function.
Usage
epm_query(query_string, api_key = NULL, verbose = TRUE)
Arguments
query_string |
String (character vector of length 1), corresponding to the query string. |
api_key |
String (character vector of length 1), corresponding to the NCBI API key. Can be 'NULL'. |
verbose |
logical, shall progress information be printed to console. Defaults to 'TRUE'. |
Details
This function will use "query_string" for querying PubMed. The Query Term can include one or multiple words, as well as the standard PubMed operators (AND, OR, NOT) and tags (e.g., [AU], [PDAT], [Affiliation], and so on).
Value
An easyPubMed object which includes no PubMed records.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
qry <- 'Damiano Fantini[AU] AND "2018"[PDAT]'
epm_query(query_string = qry, verbose = FALSE)
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
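Query strings that combine tags and operators, as described in the Details above, can be assembled programmatically before being passed to 'epm_query()'. The helper below is a hypothetical convenience function written for this illustration; it only builds the string and does not contact NCBI.

```r
# Hypothetical helper: combine an author tag and a publication year
build_query_sketch <- function(author, year) {
  sprintf('%s[AU] AND "%s"[PDAT]', author, year)
}

qry <- build_query_sketch("Damiano Fantini", 2018)
print(qry)

# Several authors can be joined with the OR operator
authors <- c("Smith J", "Doe A")
multi <- paste0("(", paste0(authors, "[AU]", collapse = " OR "), ")")
print(multi)
```

The resulting strings can then be supplied through the 'query_string' argument.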
Query PubMed by Full-length Title.
Description
Execute a PubMed query using a full-length publication title as query string. Tokenization and stopword removal are automatically performed. The goal is to mimic a PubMed citation matching search. Because of this approach, it is possible that a query by full-length title may return more than one record.
Usage
epm_query_by_fulltitle(
fulltitle,
field = "[Title]",
api_key = NULL,
verbose = TRUE
)
Arguments
fulltitle |
String (character vector of length 1) that corresponds to the full-length publication title used for querying PubMed (titles should be used as is, without adding trailing filter tags). |
field |
String (character vector of length 1). This indicates the PubMed record field where the full-length string (fulltitle) should be searched in. By default, this points to the 'Title' field. However, the field can be changed (always use fields supported by PubMed) as required by the user (for example, to attempt an exact-match query using a specific sentence included in the abstract of a record). |
api_key |
String (character vector of length 1), corresponding to the NCBI API key. Can be 'NULL'. |
verbose |
Logical, shall details about the progress of the operation be printed to console. |
Value
an easyPubMed object.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
q <- 'Analysis of Mutational Signatures Using the mutSignatures R Library.'
epm_query_by_fulltitle(q)
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
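To give an idea of what tokenization and stopword removal involve, the sketch below turns a title into a field-tagged query using base R. It is an assumption-laden illustration: the short stopword list is made up for this example (the package ships its own 'epm_stopwords' dataset), joining tokens with AND is a guess, and 'title_to_query_sketch()' is a hypothetical name.

```r
# Hypothetical helper: tokenize a title, drop stopwords, tag each token
title_to_query_sketch <- function(
    fulltitle, field = "[Title]",
    stopwords = c("a", "an", "and", "for", "in", "of", "the", "using")) {
  # Split on anything that is not a letter or a digit
  tokens <- unlist(strsplit(tolower(fulltitle), "[^[:alnum:]]+"))
  # Drop empty tokens and stopwords
  tokens <- tokens[nzchar(tokens) & !tokens %in% stopwords]
  # Assumption: tokens are combined with the AND operator
  paste(paste0(tokens, field), collapse = " AND ")
}

q <- "Analysis of Mutational Signatures Using the mutSignatures R Library."
tagged <- title_to_query_sketch(q)
print(tagged)
```

Because each token is matched independently in this sketch, two different titles that share the same informative words would produce the same query, which is consistent with the note above that more than one record may be returned.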
Query PubMed by PMIDs.
Description
Query PubMed using a list of PubMed record identifiers (PMIDs) as input. The list of identifiers is automatically split into a series of manageable-sized chunks (max n=50 PMIDs per chunk).
Usage
epm_query_by_pmid(pmids, api_key = NULL, verbose = TRUE)
Arguments
pmids |
Vector (character or numeric), list of PubMed record identifiers (PMIDs). Values will be coerced to character. |
api_key |
String (character vector of length 1), corresponding to the NCBI API key. Can be 'NULL'. |
verbose |
Logical, shall details about the progress of the operation be printed to console. |
Value
an easyPubMed object.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
my_pmids <- c(34097668, 34097669, 34097670)
epm_query_by_pmid(my_pmids)
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
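The chunking step mentioned in the Description (at most 50 PMIDs per chunk) can be sketched in base R as follows. 'chunk_pmids_sketch()' is a hypothetical helper, and joining the identifiers of each chunk with commas is an assumption about how a chunk might be turned into a single request term.

```r
# Hypothetical helper: split PMIDs into chunks of at most 'chunk_size'
chunk_pmids_sketch <- function(pmids, chunk_size = 50) {
  pmids <- as.character(pmids)
  # Elements 1..50 go to chunk 1, 51..100 to chunk 2, and so on
  split(pmids, ceiling(seq_along(pmids) / chunk_size))
}

my_pmids <- 34097601:34097720   # 120 consecutive toy identifiers
chunks <- chunk_pmids_sketch(my_pmids)
chunk_sizes <- unname(lengths(chunks))

# Assumption: each chunk is collapsed into one comma-separated string
chunk_terms <- vapply(chunks, paste, character(1), collapse = ",")
print(chunk_sizes)
```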
Preprocessed PubMed Records and Data
Description
This dataset includes a collection of sample data obtained from PubMed records and saved in different formats. This dataset is used to demonstrate specific functionalities of the 'easyPubMed' R library. Each element in the 'epm_samples' list corresponds to a different input or intermediate object.
Usage
data("epm_samples")
Format
The dataset is formatted as a list that includes the following elements:
* 'bladder_cancer_2018': List of 4
* 'bladder_cancer_40y': List of 1
* 'fx': List of 5
Examples
## Display some contents
data("epm_samples")
# Display Query String used for collecting the data
print(epm_samples$bladder_cancer_2018$demo_data_01)
PubMed Query Stopwords
Description
Collection of 133 Stopwords that can be removed from query strings to improve the accuracy of exact-match PubMed queries.
Usage
data("epm_stopwords")
Format
A character vector including all PubMed stopwords that are typically filtered out from queries.
Details
Number of stopwords included, n=133.
Examples
## Display some contents
data("epm_stopwords")
head(epm_stopwords)
Method fetchEPMData.
Description
Retrieve PubMed records for an 'easyPubMed' object.
Usage
fetchEPMData(x, params)
## S4 method for signature 'easyPubMed,list'
fetchEPMData(x, params)
Arguments
x |
an easyPubMed-class object. |
params |
list including parameters to tune the record retrieval job. For more info, see '?easyPubMed:::EPM_validate_fetch_params'. |
Retrieve PubMed Data in XML or TXT Format
Description
Retrieve PubMed records from Entrez following a search performed via the get_pubmed_ids() function. Data are downloaded in the XML or TXT format and are retrieved in batches of up to 5000 records.
Usage
fetch_pubmed_data(
pubmed_id_list,
retstart = 0,
retmax = 500,
format = "xml",
encoding = "UTF8",
api_key = NULL,
verbose = TRUE
)
Arguments
pubmed_id_list |
An easyPubMed object. |
retstart |
Integer (>=0): this argument is ignored. |
retmax |
Integer (>=1): this argument is ignored. |
format |
String: element specifying the output format. The following values are allowed: c("xml", "medline", "uilist"). |
encoding |
String, the encoding of the records retrieved from PubMed. This argument is ignored and set to 'UTF-8'. |
api_key |
String, corresponding to the NCBI API token (if available). NCBI token strings can be requested from NCBI. Record download will be faster if a valid NCBI token is used. This argument can be NULL. |
verbose |
Logical, shall details about the progress of the operation be printed to console. |
Details
The 'fetch_pubmed_data()' function is now obsolete. You should use the 'epm_fetch()' function instead. Please, have a look at the manual or the vignette. The 'fetch_pubmed_data()' function will be retired in 2026.
Value
Character vector of length >= 1. If format is set to "xml" (default), a single String including all PubMed records (decorated with XML tags) is returned. If a different format is selected, a vector of strings is returned, where each element corresponds to a line of the output document.
Author(s)
Damiano Fantini damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/
Examples
## Example 01: retrieve PubMed record Unique Identifiers (uilist)
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
q <- 'Damiano Fantini[AU] AND "2018"[PDAT]'
x <- get_pubmed_ids(pubmed_query_string = q)
y <- fetch_pubmed_data(x, format = "uilist")
y
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
## Not run:
## Example 02: retrieve data in XML format
q <- 'Damiano Fantini[AU] AND "2018"[PDAT]'
x <- epm_query(query_string = q)
y <- fetch_pubmed_data(x, format = "xml")
y
## End(Not run)
Method getEPMData.
Description
Retrieve processed data from an 'easyPubMed' object.
Usage
getEPMData(x)
## S4 method for signature 'easyPubMed'
getEPMData(x)
Arguments
x |
an object of class 'easyPubMed'. |
Method getEPMJobList.
Description
Retrieve the list of record retrieval sub-jobs from an 'easyPubMed' object. Record retrieval sub-jobs are stored in a 'data.frame' and each row corresponds to an independent non-overlapping PubMed query. This 'data.frame' guides the record retrieval process. The 'data.frame' is obtained from the 'misc' slot of an 'easyPubMed' object.
Usage
getEPMJobList(x)
## S4 method for signature 'easyPubMed'
getEPMJobList(x)
Arguments
x |
an object of class 'easyPubMed'. |
Method getEPMMeta.
Description
Retrieve meta data from an 'easyPubMed' object.
Usage
getEPMMeta(x)
## S4 method for signature 'easyPubMed'
getEPMMeta(x)
Arguments
x |
an object of class 'easyPubMed'. |
Method getEPMMisc.
Description
Retrieve miscellaneous information stored in an 'easyPubMed' object.
Usage
getEPMMisc(x)
## S4 method for signature 'easyPubMed'
getEPMMisc(x)
Arguments
x |
an object of class 'easyPubMed'. |
Method getEPMQuery.
Description
Retrieve the user-provided query string from an 'easyPubMed' object.
Usage
getEPMQuery(x)
## S4 method for signature 'easyPubMed'
getEPMQuery(x)
Arguments
x |
an object of class 'easyPubMed'. |
Method getEPMRaw.
Description
Retrieve the raw PubMed record data stored in an 'easyPubMed' object.
Usage
getEPMRaw(x)
## S4 method for signature 'easyPubMed'
getEPMRaw(x)
Arguments
x |
an object of class 'easyPubMed'. |
Method getEPMUilist.
Description
Retrieve the list of unique record identifiers (PMIDs) from an 'easyPubMed' object.
Usage
getEPMUilist(x)
## S4 method for signature 'easyPubMed'
getEPMUilist(x)
Arguments
x |
an object of class 'easyPubMed'. |
Get Processed Data from an easyPubMed Object.
Description
Obtain Processed Data that were extracted from a list of PubMed records. This is a wrapper function that calls the 'getEPMData()' method. This function returns contents from the 'data' slot.
Usage
get_epm_data(x)
Arguments
x |
An 'easyPubMed' object. |
Value
a 'data.frame' including processed data from an 'easyPubMed' object.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
x <- epm_query(query_string = 'Damiano Fantini[AU] AND "2018"[PDAT]')
x <- epm_fetch(x)
x <- epm_parse(x, max_references = 5, max_authors = 5)
get_epm_data(x)
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
Get Meta Data from an easyPubMed Object.
Description
Request Meta Data from an 'easyPubMed' object. This is a wrapper function that calls the 'getEPMMeta()' method. This function returns contents from the 'meta' slot.
Usage
get_epm_meta(x)
Arguments
x |
An 'easyPubMed' object. |
Value
a list including meta data from an 'easyPubMed' object.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
x <- epm_query(query_string = 'Damiano Fantini[AU] AND "2018"[PDAT]')
get_epm_meta(x)
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
Get Raw Data from an easyPubMed Object.
Description
Request Raw Data from an 'easyPubMed' object. This is a wrapper function that calls the 'getEPMRaw()' method. This function returns contents from the 'raw' slot.
Usage
get_epm_raw(x)
Arguments
x |
An 'easyPubMed' object. |
Value
a list including raw data from an 'easyPubMed' object.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
x <- epm_query(query_string = 'Damiano Fantini[AU] AND "2018"[PDAT]')
x <- epm_fetch(x)
get_epm_raw(x)
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
Get PubMed Record Identifiers from an easyPubMed Object.
Description
Request the list of unique PubMed Record Identifiers that are contained in an 'easyPubMed' object. This function is a wrapper function calling the 'getEPMUilist()' method. This function returns contents from the 'uilist' slot.
Usage
get_epm_uilist(x)
Arguments
x |
An 'easyPubMed' object. |
Value
a character vector including a list of unique record identifiers from an 'easyPubMed' object.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
x <- epm_query(query_string = 'Damiano Fantini[AU] AND "2018"[PDAT]')
x <- epm_fetch(x)
get_epm_uilist(x)
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
Simple PubMed Record Search
Description
Query PubMed (Entrez) in a simple way via the PubMed API eSearch function.
Calling this function results in posting the query results on the PubMed
History Server. This allows later access to the resulting data via the
fetch_pubmed_data() function, or other easyPubMed functions.
NOTE: this function has become obsolete. You should use the epm_query()
function instead. Please, have a look at the manual or the vignette.
The get_pubmed_ids()
function will be retired in 2026.
Usage
get_pubmed_ids(pubmed_query_string, api_key = NULL)
Arguments
pubmed_query_string |
String (character vector of length 1), corresponding to the query string used for querying PubMed. |
api_key |
String (character vector of length 1), corresponding to the NCBI API key. Can be NULL. |
Details
This function will use the String provided as argument for querying PubMed via the eSearch function of the PubMed API. The Query Term can include one or multiple words, as well as the standard PubMed operators (AND, OR, NOT) and tags (e.g., [AU], [PDAT], [Affiliation], and so on). ESearch will post the UIDs resulting from the search operation onto the History server so that they can be used directly in a subsequent fetch_pubmed_data() call.
Value
An easyPubMed object which includes no PubMed records.
Author(s)
Damiano Fantini, damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
qry <- 'Damiano Fantini[AU] AND "2018"[PDAT]'
get_pubmed_ids(pubmed_query_string = qry)
}, silent = TRUE)
setTimeLimit(elapsed = Inf)
Method parseEPMData.
Description
Extract, parse and format information from raw PubMed records stored in an 'easyPubMed' object.
Usage
parseEPMData(x, params)
## S4 method for signature 'easyPubMed,list'
parseEPMData(x, params)
Arguments
x |
an easyPubMed-class object |
params |
list including parameters to tune the record data parsing job. For more info, see '?easyPubMed:::EPM_validate_parse_params'. |
Print method of the easyPubMed Class.
Description
Print method of the easyPubMed Class.
Usage
## S4 method for signature 'easyPubMed'
print(x)
Arguments
x |
the 'easyPubMed' object being printed. |
Method setEPMData.
Description
Attach (or replace) processed data to an 'easyPubMed' object.
Usage
setEPMData(x, y)
## S4 method for signature 'easyPubMed,data.frame'
setEPMData(x, y)
Arguments
x |
an object of class 'easyPubMed'. |
y |
'data.frame' including processed data. |
Method setEPMJobList.
Description
Attach (or replace) the list of record retrieval sub-jobs to an 'easyPubMed' object. Record retrieval sub-jobs are stored in a data.frame and each row corresponds to an independent non-overlapping PubMed query. This 'data.frame' guides the record retrieval process. The 'data.frame' is written into the 'misc' slot of an 'easyPubMed' object.
Usage
setEPMJobList(x, y)
## S4 method for signature 'easyPubMed,data.frame'
setEPMJobList(x, y)
Arguments
x |
an object of class 'easyPubMed'. |
y |
'data.frame' including the list of PubMed record retrieval sub-jobs. |
Method setEPMMeta.
Description
Attach (or replace) meta data to an 'easyPubMed' object.
Usage
setEPMMeta(x, y)
## S4 method for signature 'easyPubMed,list'
setEPMMeta(x, y)
Arguments
x |
an object of class 'easyPubMed'. |
y |
list including meta data information. |
Method setEPMMisc.
Description
Attach (or replace) miscellaneous information to an 'easyPubMed' object.
Usage
setEPMMisc(x, y)
## S4 method for signature 'easyPubMed,list'
setEPMMisc(x, y)
Arguments
x |
an object of class 'easyPubMed'. |
y |
list including miscellaneous data and information. |
Method setEPMQuery.
Description
Attach (or replace) a user-provided query string to an 'easyPubMed' object.
Usage
setEPMQuery(x, y)
## S4 method for signature 'easyPubMed,character'
setEPMQuery(x, y)
Arguments
x |
an object of class 'easyPubMed'. |
y |
string (character vector of length 1) corresponding to a PubMed query string. |
Method setEPMRaw.
Description
Attach (or replace) raw PubMed record data to an 'easyPubMed' object.
Usage
setEPMRaw(x, y)
## S4 method for signature 'easyPubMed,list'
setEPMRaw(x, y)
Arguments
x |
an object of class 'easyPubMed'. |
y |
list of PubMed records (raw data). |
Method setEPMUilist.
Description
Attach (or replace) the list of unique record identifiers (PMIDs) to an 'easyPubMed' object.
Usage
setEPMUilist(x, y)
## S4 method for signature 'easyPubMed,list'
setEPMUilist(x, y)
Arguments
x |
an object of class 'easyPubMed'. |
y |
list of unique PubMed record identifiers (PMIDs). |
Show method of the easyPubMed Class.
Description
Show method of the easyPubMed Class.
Usage
## S4 method for signature 'easyPubMed'
show(object)
Arguments
object |
the 'easyPubMed' object being shown. |
Extract Publication and Affiliation Data from PubMed Records
Description
Extract Publication Info from PubMed records and cast data into a data.frame where each row corresponds to a different author. It is possible to limit data extraction to first authors or last authors only, or get information about all authors of each PubMed record.
Usage
table_articles_byAuth(
pubmed_data,
included_authors = "all",
max_chars = 500,
autofill = TRUE,
dest_file = NULL,
getKeywords = TRUE,
encoding = "UTF8"
)
Arguments
pubmed_data |
PubMed Data in XML format: typically, an XML file resulting from a batch_pubmed_download() call or an XML object, result of a fetch_pubmed_data() call. |
included_authors |
Character: c("first", "last", "all"). Only includes information from the first, the last or all authors of a PubMed record. |
max_chars |
This argument is ignored. In this version of the function, the whole Abstract Text is returned. |
autofill |
Logical. If TRUE, missing affiliations are imputed according to the available values (from the same article). |
dest_file |
String (character of length 1). Name of the file that will be written for storing the output. If NULL, no file will be saved. |
getKeywords |
This argument is ignored. In this version of the function MeSH terms and codes (i.e., keywords) are parsed by default. |
encoding |
The encoding of an input/output connection can be specified by name (for example, "ASCII" or "UTF-8"), in the same way as it would be given to the function base::iconv(). See the iconv() help page to find out more about the encodings that can be used on your platform. Here, we recommend using "UTF-8". |
Details
The 'table_articles_byAuth()' function is now obsolete. You should use the 'epm_parse()' function instead. Please, have a look at the manual or the vignette. The 'table_articles_byAuth()' function will be retired in 2026.
Value
Data frame including the following fields: 'c("pmid", "doi", "title", "abstract", "year", "month", "day", "jabbrv", "journal", "keywords", "mesh", "lastname", "firstname", "address", "email")'.
Author(s)
Damiano Fantini damiano.fantini@gmail.com
References
https://www.data-pulse.com/dev_site/easypubmed/
Examples
# Note: a time limit can be set in order to kill the operation when/if
# the NCBI/Entrez server becomes unresponsive.
setTimeLimit(elapsed = 4.9)
try({
q0 <- 'Damiano Fantini[AU] AND "2018"[PDAT]'
q1 <- easyPubMed::get_pubmed_ids(pubmed_query_string = q0)
q2 <- fetch_pubmed_data(pubmed_id_list = q1)
df <- table_articles_byAuth(q2, included_authors = 'first')
df[, c('pmid', 'lastname', 'jabbrv', 'year', 'month', 'day')]
}, silent = TRUE)
setTimeLimit(elapsed = Inf)