

# Introduction
Data is at the core of any data professional’s work. Without useful and valid data sources, we cannot perform our tasks. Moreover, poor-quality or irrelevant data will only cause our work to go to waste. That’s why access to reliable datasets is an essential starting point for data professionals.
Data Commons is an open-source initiative by Google to organize the world’s available data and make it accessible for everyone to use. Anyone can query its publicly available data for free. What sets Data Commons apart from other public dataset projects is that it already performs the schema work, making data ready to use much more quickly.
Given the usefulness of Data Commons for our work, accessing it is becoming important for many data tasks. Fortunately, Data Commons provides a new Python API client to access these datasets.
# Accessing Data Commons with Python
Data Commons works by organizing data into a queryable knowledge graph that unifies information from diverse sources. At its core, it uses the schema-based model from schema.org to standardize data representations.
Using this schema, Data Commons can connect data from various sources into a single graph where nodes represent entities (such as cities, places, and people), events, and statistical variables. Edges depict the relationships between these nodes. Each node is unique and identifiable by a DCID (Data Commons ID), and many nodes include observations: measurements linked to a variable, an entity, and a period.
With the Python API, we can easily access the knowledge graph to acquire the data we need. Let’s take a look at how to do that.
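To make the graph model concrete, here is a minimal sketch in plain Python. It is not how Data Commons is implemented; the class names, the `kind` labels, and the observation period and value are placeholders chosen for illustration. Only the DCID strings come from the examples in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node identified by a unique DCID."""
    dcid: str
    kind: str  # illustrative label, e.g. "entity" or "statistical_variable"


@dataclass
class Observation:
    """A measurement linked to a variable, an entity, and a period."""
    variable: str  # DCID of the statistical variable
    entity: str    # DCID of the entity
    period: str    # e.g. a year
    value: float


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)         # dcid -> Node
    edges: list = field(default_factory=list)         # (source, relation, target)
    observations: list = field(default_factory=list)  # Observation records

    def add_node(self, node):
        # DCIDs are unique, so a second node with the same DCID is rejected.
        if node.dcid in self.nodes:
            raise ValueError(f"duplicate DCID: {node.dcid}")
        self.nodes[node.dcid] = node

    def add_edge(self, source, relation, target):
        # An edge may only connect nodes that already exist in the graph.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        self.edges.append((source, relation, target))

    def observations_for(self, variable, entity):
        # Return every observation recorded for one variable/entity pair.
        return [
            obs for obs in self.observations
            if obs.variable == variable and obs.entity == entity
        ]


graph = Graph()
graph.add_node(Node("country/IDN", "entity"))
graph.add_node(Node("Country", "class"))
graph.add_node(Node("worldBank/GFDD_AI_25", "statistical_variable"))
graph.add_edge("country/IDN", "typeOf", "Country")

# Placeholder period and value, not real data.
graph.observations.append(
    Observation("worldBank/GFDD_AI_25", "country/IDN", "2000", 1.0)
)
print(len(graph.observations_for("worldBank/GFDD_AI_25", "country/IDN")))  # 1
```

The point of the sketch is that an observation never stands alone: it is always addressed by the DCIDs of a variable and an entity, which is the same pair of inputs the Python client asks for.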
First, we need to acquire a free API key to access Data Commons. Create a free account and copy the API key to a secure location. You can also use the trial API key, but access is more limited.
Next, install the Data Commons Python library. We will use the V2 API client, as it is the latest version. Run the following command to install the Data Commons client with optional support for Pandas DataFrames.
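One common way to keep the key out of your source code is to read it from an environment variable. The sketch below assumes a variable named `DC_API_KEY`; that name is hypothetical and chosen only for this example, so use whatever fits your setup.

```python
import os


def load_api_key(env_var="DC_API_KEY"):
    """Return the API key stored in an environment variable.

    DC_API_KEY is a hypothetical name chosen for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key
```

The returned string can then be passed as the `api_key` argument when the client is created, so the key never has to appear in a script or notebook.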
pip install "datacommons-client[Pandas]"
With the library installed, we are ready to fetch data using the Data Commons Python client.
To create the client that will access the data from the cloud, run the following code.
from datacommons_client.client import DataCommonsClient
# Create a client authenticated with your API key
client = DataCommonsClient(api_key="YOUR-API-KEY")
One of the most important concepts in Data Commons is the entity, which refers to a persistent, physical thing in the real world, such as a city or a country. It is an important part of fetching data, as most datasets require specifying the entity. You can visit the Data Commons Place page to learn about all available entities.
For most users, the data we want to acquire is more specific: the statistical variables stored in Data Commons. To select the data we want to retrieve, we need to know the DCID of the statistical variable, which you can find via the Statistical Variable Explorer.
You can filter variables and select a dataset from the options in the explorer. For example, choose the World Bank dataset for “ATMs per 100,000 adults.” In this case, you can obtain the DCID by examining the information provided in the explorer.
If you click on the DCID, you can see all the information related to the node, including how it connects to other information.
Besides the statistical variable DCID, we also need to specify the entity DCID for the geography. We can explore the Data Commons Place page mentioned above, or we can use the following code to see the available DCIDs for a given place name.
# Look up DCIDs by place name (returns multiple candidates)
resp = client.resolve.fetch_dcids_by_name(names="Indonesia").to_dict()
dcid_list = [c["dcid"] for c in resp["entities"][0]["candidates"]]
print(dcid_list)
The output looks similar to the following:
['country/IDN', 'geoId/...' , '...']
Using the code above, we fetch the DCID candidates available for a specific place name. For example, among the candidates for “Indonesia,” we can select country/IDN as the country DCID.
All the information we need is now ready, and we only need to execute the following code:
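Because a name lookup can return several candidates, it can help to select one programmatically rather than by eye. The helper below is a small sketch that works on a dictionary shaped like the `resp` value in the lookup code (an `entities` list whose items hold `candidates` with a `dcid` key). The function name `pick_dcid` and the sample dictionary are hypothetical and hand-written for illustration; they are not part of the library and not real API output.

```python
def pick_dcid(response, prefix="country/"):
    """Return the first candidate DCID that starts with `prefix`, or None."""
    for entity in response.get("entities", []):
        for candidate in entity.get("candidates", []):
            dcid = candidate.get("dcid", "")
            if dcid.startswith(prefix):
                return dcid
    return None


# Hand-written sample in the same shape as the lookup response;
# "example/placeholder" is not a real DCID.
sample = {
    "entities": [
        {
            "candidates": [
                {"dcid": "example/placeholder"},
                {"dcid": "country/IDN"},
            ]
        }
    ]
}

print(pick_dcid(sample))  # country/IDN
```

Filtering on the `country/` prefix is only one possible rule. If you look up cities or regions, you would pass a different prefix or inspect the candidates yourself.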
variable = ["worldBank/GFDD_AI_25"]
entity = ["country/IDN"]
# Fetch every available observation for the variable and entity as a DataFrame
df = client.observations_dataframe(
    variable_dcids=variable,
    date="all",
    entity_dcids=entity
)
The result is returned as a Pandas DataFrame.
The code above returns all available observations for the selected variables and entities across the entire time frame. You will also notice that we are using lists instead of single strings.
This is because we can pass multiple variables and entities simultaneously to acquire a combined dataset. For example, the code below fetches two distinct statistical variables for two entities at once.
variable = ["worldBank/GFDD_AI_25", "worldBank/SP_DYN_LE60_FE_IN"]
entity = ["country/IDN", "country/USA"]
# One request covers every variable and entity in the two lists
df = client.observations_dataframe(
    variable_dcids=variable,
    date="all",
    entity_dcids=entity
)
The resulting DataFrame combines the variables and entities you specified. With this method, you can acquire the data you need without executing separate queries for each combination.
That’s all you need to know about accessing Data Commons with the new Python API client. Use this library whenever you need reliable public data for your work.
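To see what “each combination” means here, the short sketch below enumerates the variable and entity pairs that the single combined request covers. It uses only the standard library and does not call Data Commons; whether every pair actually has observations depends on the data available.

```python
from itertools import product

variables = ["worldBank/GFDD_AI_25", "worldBank/SP_DYN_LE60_FE_IN"]
entities = ["country/IDN", "country/USA"]

# Every (variable, entity) pair covered by the single combined request.
pairs = list(product(variables, entities))
for variable_dcid, entity_dcid in pairs:
    print(variable_dcid, entity_dcid)

print(len(pairs))  # 4
```

With two variables and two entities, that is four pairs in one call instead of four separate calls, and the gap widens quickly as either list grows.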
# Wrapping Up
Data Commons is an open-source project by Google aimed at democratizing data access. The project is inherently different from many public data projects, as the datasets are built on top of a knowledge graph schema, which makes the data easier to unify.
In this article, we explored how to access datasets within the graph using Python, using statistical variables and entities to retrieve observations.
I hope this has helped!
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.