Skip to Content

reading sdmx-xml files into a dataframe in R

I was wondering if anyone has managed to read SDMX-XML files into a dataframe. The file I’d like to read is https://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml (1mb).
I saved the file as “pensions_funds.xml” to the pwd and tried to use the XML package to read it:

fileName <- system.file("pensions", "pensions_funds.xml", package="XML")
parsed<-xmlTreeParse("pension_funds.xml",getDTD=F)
r<-xmlRoot(parsed)
tmp = xmlSApply(r, function(x) xmlSApply(x, xmlValue))

The few lines above basically follow the example here http://www.omegahat.org/RSXML/gettingStarted.html
but I think I would first need to somehow ignore the header (I have pasted below the first couple of pages of the file I’m trying to read). So I think the above might work but it starts from the wrong node for my purposes. I would like to grab the obs_values, indexed by their time_period and ref_area.

The first thing would be to find the right node and start there however I suspect I might be on a fool’s errand since I have limited knowledge of data formats and I’m not sure the XML package can be used for SDMX-XML files. Smarter people appear to have tried to do this
http://opensdmxdevelopers.wikispaces.com/RSDMX
I can’t find this package for download on its homepage here
https://r-forge.r-project.org/projects/rsdmx/
(I can’t see any link/download section but maybe I’m blind) and it seems to be early stages. The existence of the rsdmx suggests using the xml package to read sdmx might not be easy so I’m ready to give up at this stage unless anyone has had success with this. Actually I’m mainly interested in reading this file
http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml
But this is a 10mb file so I was starting smaller.

Thanks for your help

Aidan

edit3
attempting sgibb's answer on large file using changes in Mischa's comment
library("XML")

url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"
 
    sdmxHandler <- function() {
  ## data.frame which stores results
  data <- data.frame(stringsAsFactors=FALSE)
  ## counter to store current row
  i <- 1
  ## temp value to store current REF_AREA
  ## temp value to store current REF_AREA
  refArea <- NA
  bsItem <- NA
  bsCountSector <- NA
 
  ## handler subroutine for Obs tag
  Obs <- function(name, attr) {
    ## found an Obs tag and now fill data.frame
    data[i, "refArea"] <<- refArea
    data[i, "timePeriod"] <<- as.numeric(attr["TIME_PERIOD"])
    data[i, "obsValue"] <<- as.numeric(attr["OBS_VALUE"])
    data[i, "bsItem"] <<- bsItem
    data[i, "bsCountSector"] <<- bsCountSector
    i <<- i + 1
  }
 
  ## handler subroutine for Series tag
  Series <- function(name, attr) {
    refArea <<- attr["REF_AREA"]
    bsItem <<- as.character(attr["BS_ITEM"])
    bsCountSector <<- as.numeric(attr["BS_ITEM"])
  }
  return(list(getData=function() {return(data)},
              Obs=Obs, Series=Series))
}
 
## run parser
df <- xmlEventParse(file(url), handlers=sdmxHandler())$getData()
Specification mandate value for attribute OBS_VALUE
attributes construct error
Couldn't find end of Start Tag Obs line 15108
Premature end of data in tag Series line 15041
Premature end of data in tag DataSet line 91
Premature end of data in tag CompactData line 2
Error: 1: Specification mandate value for attribute OBS_VALUE
2: attributes construct error
3: Couldn't find end of Start Tag Obs line 15108
4: Premature end of data in tag Series line 15041
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2
In addition: There were 50 or more warnings (use warnings() to see the first 50)

edit2:
the answer from sgibb looks ideal and works perfectly on the smaller file. I tried to run it on

url <- http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml

(the 10mb file, original link corrected), with the only modification being the addition of two lines:

data[i, "bsItem"] <<- as.character(attr["BS_ITEM"])
 
data[i, "bsCountSector"] <<- as.numeric(attr["BS_COUNT_SECTOR"])

(these are additional id variables which are needed to identify a row in this larger dataset).
It ran for a few minutes then finished with this error:

Specification mandate value for attribute TIME_PE

attributes construct error

Couldn't find end of Start Tag Obs line 20743

Premature end of data in tag Series line 20689

Premature end of data in tag DataSet line 91

Premature end of data in tag CompactData line 2

Error: 1: Specification mandate value for attribute TIME_PE
2: attributes construct error
3: Couldn't find end of Start Tag Obs line 20743
4: Premature end of data in tag Series line 20689
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2
In addition: There were 50 or more warnings (use warnings() to see the first 50)

The basic format of the data seems very similar so I thought this might work. The basic format of the 10mb file is as below:

    <Series FREQ="M" REF_AREA="AT" ADJUSTMENT="N" BS_REP_SECTOR="A" BS_ITEM="A20" MATURITY_ORIG="A" DATA_TYPE="1" COUNT_AREA="U2" BS_COUNT_SECTOR="0000" CURRENCY_TRANS="Z01" BS_SUFFIX="E" TIME_FORMAT="P1M" COLLECTION="E">
        <Obs TIME_PERIOD="1997-09" OBS_VALUE="275.3" OBS_STATUS="A" OBS_CONF="F"/>
        <Obs TIME_PERIOD="1997-10" OBS_VALUE="275.9" OBS_STATUS="A" OBS_CONF="F"/>
        <Obs TIME_PERIOD="1997-11" OBS_VALUE="276.6" OBS_STATUS="A" OBS_CONF="F"/>

edit1:

desired data format:
Ref_area time_period obs_value

At 2006 118

At 2007 119

Be 2006 101

Here’s the first bit of the data.

DataSet xsi:schemaLocation="https://www.ecb.europa.eu/vocabulary/stats/icpf/1 https://www.ecb.europa.eu/stats/sdmx/icpf/1/structure/2011-08-11/sdmx-compact.xsd" xmlns="https://www.ecb.europa.eu/vocabulary/stats/icpf/1">