Mass export data from STATATLAS
Posted on Sun 26 May 2024 in opendata
These days, I find it difficult to compare different municipalities with each other. The Federal Statistical Office (BFS) publishes a lot of data, but it is hard to collect various data from
- a) each municipality in Switzerland (location)
- b) over multiple years (time)
- c) over multiple indicators (scope)
The STATATLAS (statistical atlas) gives a great overview of various indicators, over time, for municipalities, regions or cantons. This is exactly the data I was looking for, because it allows a broad overview of the municipalities in Switzerland. According to the BFS, there is currently no way to download all the STATATLAS data at once. To download multiple years for one indicator, I had to click the download button once for each year of that indicator.
In this post I will explain how I downloaded all the data from STATATLAS in two approaches:
- Approach 1: Use DAM-API to search for datasets related to STATATLAS
- Approach 2: Brute force download URL
Approach 1: Use DAM-API to search for datasets related to STATATLAS
Loaded rows: 2'134'923 (municipalities only)
After asking the BFS, I got a hint: they have an API called DAM-API, where the metadata of their datasets is published. And indeed, the data published on STATATLAS is also listed on this API:
```shell
curl --location --request GET 'https://dam-api.bfs.admin.ch/hub/api/dam/assets?articleModel=900052&spatialdivision=900004'
```
where `articleModel=900052` stands for the STATATLAS product and `spatialdivision=900004` stands for data on municipality level.
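The same query can be built in Python as well. A minimal sketch, assuming only the endpoint and the two query parameters from the curl call above (the helper name is my own):

```python
from urllib.parse import urlencode

# Endpoint and parameters taken from the curl call above.
DAM_API = "https://dam-api.bfs.admin.ch/hub/api/dam/assets"

def dam_assets_url(article_model: int, spatial_division: int) -> str:
    """Build a DAM-API asset query URL (helper name is my own)."""
    params = {
        "articleModel": article_model,        # 900052 = STATATLAS product
        "spatialdivision": spatial_division,  # 900004 = municipality level
    }
    return f"{DAM_API}?{urlencode(params)}"

print(dam_assets_url(900052, 900004))
# The JSON metadata could then be fetched with e.g.
# json.load(urllib.request.urlopen(dam_assets_url(900052, 900004)))
```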
What was still missing was the data itself. By looking at the URL behind the CSV download button, I got something like `https://www.atlas.bfs.admin.ch/core/projects/13/xshared/csv/26712_131.csv`, where `26712` is some kind of ID and `131` must stand for the selected language (when I select French, I get `132`). If I can also find this ID in the DAM-API, I can mass download all indicators for all years on municipality level. In the field `shop.orderNr` I finally found the link between the CSV name and the DAM-API.

It comes in very handy that the data always has the same structure for all indicators and years on STATATLAS, because I can now concatenate the dataframes of all indicators with `pandas.concat`.
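Putting the observations together, every order number from the DAM-API can be turned into a CSV download URL. A small sketch, assuming only the URL pattern and language codes observed above (the helper name and the example order number are my own):

```python
def csv_url(order_nr: int, lang: int = 131) -> str:
    """CSV export URL as observed on STATATLAS; 131 = German, 132 = French."""
    return (
        "https://www.atlas.bfs.admin.ch/core/projects/13/xshared/csv/"
        f"{order_nr}_{lang}.csv"
    )

print(csv_url(26712))
# Because every export has the same columns, the downloads can be stacked:
# frames = [pd.read_csv(csv_url(nr), sep=';') for nr in order_numbers]
# df = pd.concat(frames, ignore_index=True)
```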
Columns of the exported CSV:
```python
ipdb> df.columns
Index(['GEO_ID', 'GEO_NAME', 'VARIABLE', 'VALUE', 'UNIT', 'STATUS',
       'STATUS_DESC', 'DESC_VAL', 'PERIOD_REF', 'SOURCE', 'LAST_UPDATE',
       'GEOM_CODE', 'GEOM', 'GEOM_PERIOD', 'MAP_ID', 'MAP_URL'],
      dtype='object')
```
So, iterating over all indicators on municipality level and over all years on STATATLAS gives me the following dataset with 2.1 million rows:
```python
ipdb> df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2134923 entries, 0 to 4325
Data columns (total 16 columns):
 #   Column       Dtype
---  ------       -----
 0   GEO_ID       object
 1   GEO_NAME     object
 2   VARIABLE     object
 3   VALUE        float64
 4   UNIT         object
 5   STATUS       object
 6   STATUS_DESC  object
 7   DESC_VAL     object
 8   PERIOD_REF   object
 9   SOURCE       object
 10  LAST_UPDATE  object
 11  GEOM_CODE    object
 12  GEOM         object
 13  GEOM_PERIOD  object
 14  MAP_ID       int64
 15  MAP_URL      object
dtypes: float64(1), int64(1), object(14)
memory usage: 276.9+ MB
```
This data can then be loaded into a database, for example, to perform transformations and analysis downstream.
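As a minimal sketch of that last step, assuming SQLite as the target (the table name and the two sample rows are made up for illustration):

```python
import sqlite3

import pandas as pd

# Two invented sample rows with the column names seen in the export above.
df = pd.DataFrame({
    "GEO_ID": ["261", "351"],
    "VARIABLE": ["some_indicator", "some_indicator"],
    "VALUE": [1.0, 2.0],
    "PERIOD_REF": ["2020", "2020"],
})

with sqlite3.connect(":memory:") as con:
    # Write the dataframe into a table; downstream work can then run in SQL.
    df.to_sql("statatlas", con, if_exists="replace", index=False)
    rows = con.execute("SELECT COUNT(*) FROM statatlas").fetchone()[0]
print(rows)
```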
Approach 2: Brute force download URL
Loaded rows: 5'040'141 (all)
Unfortunately, with my first approach I could not find all the maps displayed on STATATLAS in the DAM-API. For example, I could not find the map with the number of inhabitants per municipality for the year 2000. But because the map IDs are roughly ascending, I tried iterating over the numbers between 0 and 30'000, downloading the CSVs and concatenating them into one single dataframe:
```python
import urllib.error
import pandas as pd

# ...
dataframes = []
for i in range(30_000):
    try:
        dataframes.append(pd.read_csv(
            f'https://www.atlas.bfs.admin.ch/core/projects/13/xshared/csv/{i}_131.csv',
            sep=';',
        ))
    except urllib.error.HTTPError:
        continue  # this map ID does not exist
# ...
```
From approach 1, the highest map ID was 27642.
This is surely not an optimal solution, because...
- it sends a lot of requests to the BFS backend to download the data
- the task takes significantly longer (40 vs. 7 minutes) than using the DAM-API
But as far as I can see, it is currently the only way to download all the data.
If you find a cleaner way to download the STATATLAS data, let me know.