Processing Data with Kotlin DataFrame Preview

Johan Janssen

Article originally posted on InfoQ.

Kotlin DataFrame, available as a first public preview, is a new library for processing data from any source, such as CSV, JSON, Excel and Apache Arrow files. The DataFrame library works together with Kotlin data classes and hierarchical data schemas by using a Domain Specific Language (DSL).

A data frame is a two-dimensional table with labeled columns, comparable to a spreadsheet, an SQL table, or a CSV file. The Kotlin DataFrame GitHub repository contains more information about the set of data analysis operations, which are available through a DSL. The library started as a wrapper around the krangl library, but most of the functionality was rewritten over time.
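To make the idea of a table with labeled columns concrete, the following is a minimal conceptual sketch in plain Kotlin, using only the standard library. The `SimpleFrame` class and its members are hypothetical names chosen for illustration; they are not part of Kotlin DataFrame and say nothing about how the library is implemented.

```kotlin
// Conceptual model only: a table stored as named columns of equal length.
// SimpleFrame is a hypothetical type, not part of Kotlin DataFrame.
class SimpleFrame(private val columns: Map<String, List<Any?>>) {
    init {
        // Every column must contribute exactly one value per row.
        require(columns.values.map { it.size }.distinct().size <= 1) {
            "All columns must have the same number of rows"
        }
    }

    val columnNames: List<String> get() = columns.keys.toList()
    val rowCount: Int get() = columns.values.firstOrNull()?.size ?: 0

    // A row is the list of values at one index across all columns.
    fun row(index: Int): List<Any?> = columns.values.map { it[index] }

    // A column is looked up by its label.
    fun column(name: String): List<Any?> =
        columns[name] ?: throw NoSuchElementException("No column named $name")
}

fun main() {
    // linkedMapOf keeps the columns in insertion order.
    val frame = SimpleFrame(
        linkedMapOf(
            "firstName" to listOf("Akmad", "Serik", "Ioel"),
            "country" to listOf("The Netherlands", "India", "Belgium"),
        )
    )
    println(frame.columnNames)
    println(frame.row(0))
    println(frame.column("country"))
}
```

Storing the data per column rather than per row is one common design for such tables, since most operations (selecting, sorting on, or removing a column) address a column by its label.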

The Kotlin DataFrame library was designed to read and display data from any source and allows nesting columns and cells. Any Kotlin object or collection can be stored in and retrieved from a DataFrame.

Data may be supplied via a file such as students.csv:

firstName, lastName, country
Akmad, Desiree, The Netherlands
Serik, Chuy, India
Ioel, Jan, Belgium
Draco, Arti, Argentina
Myrna, Hyginos, Bolivia
Dalila, Dardanos, Belgium

The DataFrame is created based on the contents of the file:

val studentsDataFrame = DataFrame.read("students.csv")
print(studentsDataFrame.head())

By default, the head() method returns the first five rows:

   firstName lastName         country
 0     Akmad  Desiree The Netherlands
 1     Serik     Chuy           India
 2      Ioel      Jan         Belgium
 3     Draco     Arti       Argentina
 4     Myrna  Hyginos         Bolivia

Alternatively, the DataFrame may be created programmatically:

val firstName by columnOf("Akmad", "Serik", "Ioel", "Draco", "Myrna", "Dalila")
val lastName by columnOf("Desiree", "Chuy", "Jan", "Arti", "Hyginos", "Dardanos")
val country by columnOf("The Netherlands", "India", "Belgium", "Argentina", "Bolivia", "Belgium")
val customDataFrame = dataFrameOf(firstName, lastName, country)

By supplying the head() method with an argument, the number of rows may be defined, in this case two:

print(customDataFrame.head(2))
   firstName lastName         country
 0     Akmad  Desiree The Netherlands
 1     Serik     Chuy           India

The API offers various ways to retrieve specific parts of the data, such as the contents of the first row:

println(studentsDataFrame.get(0).values())
[Akmad, Desiree, The Netherlands]

Alternatively, a specific column, such as the country column, may be retrieved by its index:

println(studentsDataFrame.getColumnOrNull(2)?.values())
[The Netherlands, India, Belgium, Argentina, Bolivia, Belgium]

More advanced API methods allow, for example, sorting and removing elements from a DataFrame:

println(studentsDataFrame.sortBy("firstName").remove("country"))
   firstName lastName
 0     Akmad  Desiree
 1    Dalila Dardanos
 2     Draco     Arti
 3      Ioel      Jan
 4     Myrna  Hyginos
 5     Serik     Chuy

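A @DataSchema annotation may be placed on an interface that describes the expected columns and their types, which allows the columns to be referenced through typed properties instead of strings: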

@DataSchema
interface Student {
   val firstName: String
   val lastName: String
   val country: String
}

Now the schema can be used together with the DataFrame API to filter on the country field of Student and sort on the firstName field of Student:

val studentsDataFrame = DataFrame.read("students.csv")

println(studentsDataFrame.filter { it[Student::country] != "Belgium" }.sortBy(Student::firstName))
   firstName lastName         country
 0     Akmad  Desiree The Netherlands
 1     Draco     Arti       Argentina
 2     Myrna  Hyginos         Bolivia
 3     Serik     Chuy           India
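For comparison, the same filter and sort can be expressed on a plain list of objects with the Kotlin standard library. This is only a sketch of what the operation computes, not a use of the DataFrame API: the `StudentRecord` data class and the `nonBelgianStudentsByFirstName` function are hypothetical names introduced here as a stand-in for the schema above.

```kotlin
// Hypothetical stand-in for the Student schema, as a plain data class.
data class StudentRecord(
    val firstName: String,
    val lastName: String,
    val country: String,
)

// The same rows as in students.csv.
val students = listOf(
    StudentRecord("Akmad", "Desiree", "The Netherlands"),
    StudentRecord("Serik", "Chuy", "India"),
    StudentRecord("Ioel", "Jan", "Belgium"),
    StudentRecord("Draco", "Arti", "Argentina"),
    StudentRecord("Myrna", "Hyginos", "Bolivia"),
    StudentRecord("Dalila", "Dardanos", "Belgium"),
)

// Drop students from Belgium, then order the rest by first name.
fun nonBelgianStudentsByFirstName(all: List<StudentRecord>): List<StudentRecord> =
    all.filter { it.country != "Belgium" }
        .sortedBy { it.firstName }

fun main() {
    nonBelgianStudentsByFirstName(students).forEach(::println)
}
```

The plain-collections version works on row objects, whereas the DataFrame version keeps the result as a table that can be passed on to further column-based operations.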

More information can be found in the first video about Kotlin DataFrame, which covers the basic operations and processing tables. Various examples are available, and the #datascience channel on Slack may be used to ask questions after signing up.
