MMS • Daniel Dominguez
Article originally posted on InfoQ.
AWS announced that users can now create labeled synthetic data with Amazon SageMaker Ground Truth. SageMaker Ground Truth is a data labeling service that makes it simple to label data and gives you the choice of human annotators through third-party suppliers, Amazon Mechanical Turk, or your own private workforce. Alternatively, you can now generate labeled synthetic data without manually collecting or labeling real-world data. SageMaker Ground Truth can produce millions of automatically labeled synthetic images on your behalf.
Creating machine-learning models is an iterative process that begins with data gathering and preparation, then moves on to model training and model deployment. The initial stage is often especially difficult and time-consuming, because it requires collecting large, varied, and accurately labeled datasets for model training.
Combining your real-world data with synthetic data helps build more comprehensive training datasets for your machine-learning models.
Synthetic data is created by simple rules, statistical models, computer simulations, or other techniques. This makes it possible to generate vast amounts of synthetic data with highly accurate labels, with annotations spanning tens of thousands of images. Labels can be accurate at a very fine granularity, such as the pixel or sub-object level, and across modalities. Bounding boxes, polygons, depth, and segments are some examples.
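To illustrate why rule-generated data comes with exact labels, here is a minimal sketch in plain Python. It is not how SageMaker Ground Truth works internally; the function `make_labeled_image` and its parameters are hypothetical. The point is only that the generator chooses the object's position before drawing it, so the bounding box and the pixel-level mask are known without any annotation step.

```python
import random


def make_labeled_image(width=64, height=48, seed=None):
    """Draw one filled rectangle on a grayscale image and return exact labels.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Pick the rectangle first; the labels are derived from these same numbers.
    box_w = rng.randint(4, width // 2)
    box_h = rng.randint(4, height // 2)
    x0 = rng.randint(0, width - box_w)
    y0 = rng.randint(0, height - box_h)
    background = rng.randint(0, 100)
    foreground = rng.randint(150, 255)

    # Image and mask are lists of rows; mask is 1 where the object is.
    image = [[background] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    for y in range(y0, y0 + box_h):
        for x in range(x0, x0 + box_w):
            image[y][x] = foreground
            mask[y][x] = 1

    # Bounding box as (x, y, width, height), plus the per-pixel mask.
    label = {"bbox": (x0, y0, box_w, box_h), "mask": mask}
    return image, label


image, label = make_labeled_image(seed=0)
```

Because the same numbers drive both the drawing and the label, the mask area always equals the bounding-box area in this toy case, which is the kind of pixel-level agreement that is hard to guarantee with manual annotation.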
Synthetic data is a powerful solution to two different problems: data limitations and privacy risks. When labeled data is scarce, training data can be supplemented with synthetic data to reduce overfitting. For privacy protection, data curators can provide made-up data rather than actual data in a way that safeguards users’ privacy while preserving the original data’s usefulness.
Combining your real-world data with synthetic data adds diversity that real-world data may lack, producing more complete and balanced datasets.
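As a rough sketch of what balancing could look like in practice, the snippet below counts the classes in a real-world label list and reports how many synthetic examples per class would bring every class up to the size of the largest one. The function name `synthetic_needed` and the class names are assumptions made for illustration, not part of any AWS API.

```python
from collections import Counter


def synthetic_needed(real_labels, target_per_class=None):
    """Return how many synthetic examples each class needs to reach the target.

    By default the target is the size of the largest real-world class.
    """
    counts = Counter(real_labels)
    if not counts:
        return {}
    target = target_per_class if target_per_class is not None else max(counts.values())
    # Classes already at or above the target need no synthetic examples.
    return {cls: max(0, target - n) for cls, n in counts.items()}


# Hypothetical real-world dataset that is skewed toward one class.
real = ["car"] * 5 + ["truck"] * 2 + ["bike"]
print(synthetic_needed(real))  # {'car': 0, 'truck': 3, 'bike': 4}
```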
With SageMaker Ground Truth, you are free to design any imaging scenario with synthetic data, including edge cases that could be challenging to identify and replicate in real-world data. Variations can be added to objects and surroundings to reflect changing lighting, colors, textures, poses, or backgrounds.
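One common way to describe such variations is to sample scene parameters at random for each image. The sketch below assumes a hypothetical set of parameters (`lighting`, `background`, `hue_shift_deg`, `pose_yaw_deg`) and value ranges chosen only for illustration; it does not reflect how scenes are actually specified to SageMaker Ground Truth.

```python
import random

# Hypothetical option lists; a real project would define its own.
LIGHTING = ["dim", "daylight", "backlit"]
BACKGROUNDS = ["warehouse", "street", "plain"]


def sample_scene(rng):
    """Sample one scene description with randomized appearance parameters."""
    return {
        "lighting": rng.choice(LIGHTING),
        "background": rng.choice(BACKGROUNDS),
        "hue_shift_deg": rng.uniform(-30.0, 30.0),  # color variation
        "pose_yaw_deg": rng.uniform(0.0, 360.0),    # object rotation
    }


def sample_scenes(n, seed=0):
    """Sample n scenes reproducibly from a seeded random generator."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n)]


scenes = sample_scenes(5, seed=1)
```

Seeding the generator keeps the sampled scenes reproducible, so a dataset can be regenerated or extended later with the same settings.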
In other words, you can essentially "order" the exact use case for which your machine-learning model is being trained. Amazon SageMaker Ground Truth synthetic data is available in US East (N. Virginia). Synthetic data is priced on a per-label basis.