Mobile Monitoring Solutions

Close this search box.

How to use IoT datasets in #AI applications (full stack)

MMS Founder

Article originally posted on Data Science Central. Visit Data Science Central


Recently, google launched a Dataset search – which is a great resource to find Datasets.  In this post, I list some IoT datasets which can be used for Machine Learning or Deep Learning applications. But finding datasets is only part of the story.  A static dataset for IoT is not enough i.e. some of the interesting analysis is in streaming mode. To create an end to end streaming implementation from a given dataset, we need knowledge of full stack skills. These are more complex (and in high demand). In this post, I hence describe the datasets but also a full stack implementation. An end to end flow implementation is described in the book Agile Data Science, 2.0 by Russell Jurney. I use this book in my teaching at the Data Science for Internet of Things course at the University of Oxford. I demonstrate the implementation from this book below. The views here represent my own.

In understanding an end to end application, the first problem is .. how to capture data from a wide range of IoT devices. The protocol used for this is typically MQTT. MQTT is lightweight IoT connectivity protocol. MQTT is publish-subscribe-based messaging protocol used in IoT applications to manage a large number of IoT devices who often have limited connectivity, bandwidth and power. MQTT integrates with Apache Kafka. Kafka provides high scalability, longer storage and easy integration to legacy systems. Apache Kafka is a highly scalable distributed streaming platform. Kafka ingests, stores, processes and forwards high volumes of data from thousands of IoT devices. (source Kai Waehner)


Full stack – End to End

With this background, let us try to understand the end to end (full stack) implementation of an IoT dataset. This section is adapted from the Agile Data Science 2.0 book

Image source:  Agile Data Science, 2.0 by Russell Jurney

We have the following components

Events: represents an occurrence with a relevant timestamp. Events can represent various things (ex logs from the server). In our case, they represent time series data from sensors typically represented as JSON objects

Collectors are event aggregators which collect events from various sources and queue them for action by real-time workers. Typically, Kafka or Azure event hub may be used at this stage.

Bulk storage – represents a file system capable of high I/O – for example S3 or HDFS

Distributed document store – ex MongoDB

A web application server – ex flask, Node.js

The data processing is done via spark. Pyspark is used for the Machine learning (either scikit learn or Sparl MLlib libraries) and the results are stored in MongoDB. Apache Airflow can be used for scheduling



from github repository of Agile Data Science, 2.0

The EC2 scripts: *

The real-time notebook with Spark ML/Streaming :


Finally, below are some of the reference datasets you can use with IoT.

To conclude

To conclude, using the strategy and code described here – you could in principle, create an end to end streaming IoT application. 

IoT datasets


Gas Sensor Array Drift Dataset Data Set

Water Treatment Plant Data Set

Internet Usage Data Data Set

Commercial Building Energy Dataset

Individual household electric power consumption Data Set

AMPds2: The Almanac of Minutely Power dataset (Version 2)

Commercial Building Energy Dataset Energy, – Smart Building Energy related data set from a commercial building where data is sampled more than once a minute.

Individual household electric power consumption Energy, Smart home One-minute sampling rate over a period of almost 4 years 

Energy, Smart home AMPds contains electricity, water, and natural gas measurements at one minute intervals for 2 years of monitoring

UK Domestic Appliance-Level Electricity Energy, Smart Home Power demand from five houses

Gas sensors for home activity monitoring Smart home Recordings of 8 gas sensors 



Smart cities

Traffic Sign Recognition Testsets

Pollution Measurements for the City of Brasov in Romania

GNFUV Unmanned Surface Vehicles Sensor Data Data Set

CGIAR dataset Agriculture, Climate – High-resolution climate datasets for a variety of fields including agricultural

Uber trip data Transportation About 20 million Uber pickups in New York City during 12 months.

Traffic Sign Recognition Transportation

Malaga datasets Smart City A broad range of categories such as energy, ITS, weather, Industry, Sport, etc

CityPulse Dataset Collection Smart City Road Traffic Data, Pollution Data, Weather, Parking

Open Data Institute – node Trento Smart City Weather, Air quality, Electricity, Telecommunication

Taxi Service Trajectory Transportation Trajectories performed by all the 442 taxis running in the city of Porto, in Portugal

T-Drive trajectory data Transportation Chicago Bus Traces data Transportation Bus traces from the Chicago Transport Authority for 18 days

Citypulse ataset Collection

Taxi service trajectories


Health and home activity

Educational Process Mining Education, Recordings of 115 subjects’ activities through a logging application while learning with an educational simulator

PhysioBank databases Healthcare – Archive of over 80 physiological datasets

Saarbruecken Voice Database Healthcare – A collection of voice recordings from more than 2000 persons for pathological voice detection

CASAS datasets for activities of daily living – Smart home Several public datasets related to Activities of Daily Living (ADL) performance in a two- story home, an apartment, and an office settings

ARAS Human Activity Dataset – Smart home Human activity recognition datasets collected from two real houses with multiple residents during two months

MERLSense Data – Smart home, building Motion sensor data of residual traces from a network of over 200 sensors for two years, containing over 50 million records

SportVU Sport Video of basketball and soccer games captured from 6 cameras

RealDisp Sport Includes a wide range of physical activities (warm up, cool down and fitness exercises)

GeoLife GPS Trajectories Transportation A GPS trajectory by a sequence of time-stamped points

Various sensor driving datasets

IoT Network Dataset

Various MHEALTH / physical activity datasets



Source: for some of the datasets Deep Learning for IoT Big Data and Streaming Analytics: A Survey


Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.

  • This field is for validation purposes and should be left unchanged.