MMS • Johan Janssen
Article originally posted on InfoQ.
Kotlin DataFrame, available as a first public preview, is a new library for processing data from a variety of sources, such as CSV, JSON, Excel and Apache Arrow files. The DataFrame library works together with Kotlin data classes and hierarchical data schemas by using a Domain-Specific Language (DSL).
A data frame is a two-dimensional table with labeled columns, comparable to a spreadsheet, an SQL table, or a CSV file. The Kotlin DataFrame GitHub repository documents the operations that the DSL offers for data analysis. The library started as a wrapper around the Krangl library, but most of the functionality was rewritten over time.
The Kotlin DataFrame library was designed to read and display data from any source and allows nesting columns and cells. Any Kotlin object or collection can be stored in and retrieved from a DataFrame.
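As a minimal sketch of storing and retrieving Kotlin objects, the snippet below converts a list of data class instances to a DataFrame and back. The `Person` class and the `samplePeople()` helper are hypothetical names introduced only for this illustration, and the example assumes the `toDataFrame()` and `toListOf()` functions from the library's `api` package are available in the version being used.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical data class, used only for this illustration.
data class Person(val firstName: String, val lastName: String, val country: String)

// Hypothetical helper that supplies a few objects to store.
fun samplePeople(): List<Person> = listOf(
    Person("Akmad", "Desiree", "The Netherlands"),
    Person("Serik", "Chuy", "India"),
)

fun main() {
    val people = samplePeople()

    // Store: each property of Person becomes a column.
    val df = people.toDataFrame()
    println(df)

    // Retrieve: convert the rows back into Person objects.
    val restored = df.toListOf<Person>()
    println(restored == people)
}
```

Because `Person` is a data class, the restored list can be compared to the original with `==`.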
Data may be supplied via a file such as students.csv:
firstName, lastName, country
Akmad, Desiree, The Netherlands
Serik, Chuy, India
Ioel, Jan, Belgium
Draco, Arti, Argentina
Myrna, Hyginos, Bolivia
Dalila, Dardanos, Belgium
The DataFrame is created based on the contents of the file:
val studentsDataFrame = DataFrame.read("students.csv")
print(studentsDataFrame.head())
By default, the head() method returns the first five rows:
firstName lastName country
0 Akmad Desiree The Netherlands
1 Serik Chuy India
2 Ioel Jan Belgium
3 Draco Arti Argentina
4 Myrna Hyginos Bolivia
Alternatively, the DataFrame may be created programmatically:
val firstName by columnOf("Akmad", "Serik", "Ioel", "Draco", "Myrna", "Dalila")
val lastName by columnOf("Desiree", "Chuy", "Jan", "Arti", "Hyginos", "Dardanos")
val country by columnOf("The Netherlands", "India", "Belgium", "Argentina", "Bolivia", "Belgium")
Passing an argument to the head() method sets the number of rows returned, in this case two:
val customDataFrame = dataFrameOf(firstName, lastName, country)
print(customDataFrame.head(2))
firstName lastName country
0 Akmad Desiree The Netherlands
1 Serik Chuy India
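The column-by-column construction above has a row-oriented counterpart. The following is a sketch, assuming the `dataFrameOf(header...)(values...)` builder form is available in the library version in use; `buildStudents()` is a hypothetical helper name.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical helper: column names come first, then the values row by row.
fun buildStudents(): DataFrame<*> = dataFrameOf("firstName", "lastName", "country")(
    "Akmad", "Desiree", "The Netherlands",
    "Serik", "Chuy", "India",
    "Ioel", "Jan", "Belgium",
)

fun main() {
    val students = buildStudents()
    // Show only the first two rows.
    println(students.head(2))
}
```

The row-oriented form tends to read more naturally for small literal tables, while `columnOf` is convenient when the columns already exist as separate collections.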
The API offers various ways to retrieve specific parts of the data, such as the contents of the first row:
println(studentsDataFrame.get(0).values())
[Akmad, Desiree, The Netherlands]
Alternatively, a specific column, such as the country column, may be retrieved by its index:
println(studentsDataFrame.getColumnOrNull(2)?.values())
[The Netherlands, India, Belgium, Argentina, Bolivia, Belgium]
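Rows and columns may also be addressed with the index operator, and columns by name rather than by position. The snippet below is a sketch that builds a small in-memory frame so it does not depend on students.csv; `sampleStudents()` and `countriesOf()` are hypothetical helper names.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical helper: a small in-memory frame for the illustration.
fun sampleStudents(): DataFrame<*> = dataFrameOf("firstName", "lastName", "country")(
    "Akmad", "Desiree", "The Netherlands",
    "Serik", "Chuy", "India",
)

// Hypothetical helper: look a column up by name and return its values as a list.
fun countriesOf(df: DataFrame<*>): List<Any?> = df["country"].toList()

fun main() {
    val df = sampleStudents()

    // Row by index, then a single cell by column name.
    println(df[0]["firstName"])

    // Column by name instead of by position.
    println(countriesOf(df))
}
```

Looking a column up by name keeps the code working if the column order in the source file changes.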
More advanced API methods allow, for example, sorting and removing elements from a DataFrame:
println(studentsDataFrame.sortBy("firstName").remove("country"))
firstName lastName
0 Akmad Desiree
1 Dalila Dardanos
2 Draco Arti
3 Ioel Jan
4 Myrna Hyginos
5 Serik Chuy
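Operations such as these return a new DataFrame rather than modifying the original, so they can be chained. As a further sketch, the snippet below sorts in descending order and keeps only selected columns, assuming the string-based `sortByDesc` and `select` overloads are available; `sampleStudents()` and `namesDescending()` are hypothetical helper names.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical helper: a small in-memory frame for the illustration.
fun sampleStudents(): DataFrame<*> = dataFrameOf("firstName", "lastName", "country")(
    "Akmad", "Desiree", "The Netherlands",
    "Serik", "Chuy", "India",
    "Ioel", "Jan", "Belgium",
)

// Hypothetical helper: sort by first name (descending) and keep the name columns only.
fun namesDescending(df: DataFrame<*>): DataFrame<*> =
    df.sortByDesc("firstName").select("firstName", "lastName")

fun main() {
    val students = sampleStudents()
    println(namesDescending(students))

    // The original frame still has all three columns.
    println(students.columnNames())
}
```

Here `select` names the columns to keep, whereas `remove` names the columns to drop.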
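A @DataSchema annotation may be specified on an interface that describes the expected columns, so that they can be referenced through typed properties instead of strings: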
@DataSchema
interface Student {
val firstName: String
val lastName: String
val country: String
}
Now the schema can be used together with the DataFrame API to filter on the country field of Student and sort on the firstName field of Student:
val studentsDataFrame = DataFrame.read("students.csv")
println(studentsDataFrame.filter{it[Student::country] != "Belgium"}.sortBy(Student::firstName))
firstName lastName country
0 Akmad Desiree The Netherlands
1 Draco Arti Argentina
2 Myrna Hyginos Bolivia
3 Serik Chuy India
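For comparison, the same filter and sort can be sketched with plain column names instead of the schema. This version needs no `Student` interface, but a misspelled column name is only detected at runtime. `sampleStudents()` and `nonBelgianSorted()` are hypothetical helper names, and the data is built in memory so the example does not depend on students.csv.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*

// Hypothetical helper: the same six students as in students.csv.
fun sampleStudents(): DataFrame<*> = dataFrameOf("firstName", "lastName", "country")(
    "Akmad", "Desiree", "The Netherlands",
    "Serik", "Chuy", "India",
    "Ioel", "Jan", "Belgium",
    "Draco", "Arti", "Argentina",
    "Myrna", "Hyginos", "Bolivia",
    "Dalila", "Dardanos", "Belgium",
)

// Hypothetical helper: drop students from Belgium, then sort by first name.
fun nonBelgianSorted(df: DataFrame<*>): DataFrame<*> =
    df.filter { it["country"] != "Belgium" }.sortBy("firstName")

fun main() {
    println(nonBelgianSorted(sampleStudents()))
}
```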
More information can be found in the first video about Kotlin DataFrame, which covers the basic operations and the processing of tables. Various examples are available, and the #datascience channel on Slack may be used to ask questions after signing up.