Research keeps producing new techniques for automatic labeling that make creating training data cheaper and less time-consuming. The area currently getting the most attention is Snorkel. The other approach, which is always worth considering, is to build your model on publicly available datasets. Many such datasets exist, covering almost any subject you might want to research, and if you are looking for data for machine learning and statistical techniques, you have several options.
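To make the idea of automatic labeling a little more concrete, here is a minimal sketch in plain Python. It does not use Snorkel's API. It only illustrates the general pattern of writing several small, noisy labeling rules and combining their votes, with a simple majority vote standing in for the learned model a real tool would use. The rule names, label constants, and example messages are all made up for illustration.

```python
from collections import Counter

# Hypothetical label constants for a toy spam/ham task.
ABSTAIN, HAM, SPAM = -1, 0, 1

# Each rule looks at one signal and either votes or abstains.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) <= 4 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]

def weak_label(text):
    """Combine the rules by majority vote; abstain on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]

examples = [
    "Claim your prize at http://example.com now",
    "See you at lunch",
    "The quarterly report is attached for your review today",
]
labels = [weak_label(t) for t in examples]
print(labels)
```

Messages that no rule covers stay unlabeled, which is the usual trade-off with this kind of labeling: it is cheap, but coverage depends on how many rules you write.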
Wouldn't it be nice to simply type in a subject and get back the datasets you are looking for? For weather, for example, you have NOAA and NASA, which are favorites of many people. The search relies on Schema.org metadata, which is recognized by Google's knowledge graph. You can find the beta here:
Items that appear to be datasets are already indexed by Google, although coverage is not yet complete, and some refinements are already available. For public datasets, there is also a subsidiary search site:
Some special subsets are listed on this page:
Google Genomics Public Datasets
There is plenty more there that might interest you.
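Since the search described above depends on Schema.org metadata, it may help to see roughly what that metadata looks like. The sketch below builds a Schema.org `Dataset` description as JSON-LD, the form typically embedded in a dataset's web page. The property names (`name`, `description`, `url`, `license`, `distribution`, `encodingFormat`, `contentUrl`) come from the Schema.org vocabulary. The helper function, the dataset, and every URL are hypothetical placeholders.

```python
import json

def dataset_jsonld(name, description, url, download_url, license_url):
    """Build a Schema.org Dataset description as a plain dict (hypothetical helper)."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": "text/csv",
                "contentUrl": download_url,
            }
        ],
    }

# All values below are made-up placeholders.
markup = dataset_jsonld(
    name="Example daily temperature readings",
    description="Daily minimum and maximum temperatures for an example station.",
    url="https://example.org/datasets/daily-temperature",
    download_url="https://example.org/datasets/daily-temperature.csv",
    license_url="https://creativecommons.org/licenses/by/4.0/",
)

# JSON-LD is usually placed in the page inside a script tag of this type.
script_tag = (
    '<script type="application/ld+json">\n'
    + json.dumps(markup, indent=2)
    + "\n</script>"
)
print(script_tag)
```

A page that carries markup like this describes itself as a dataset in a machine-readable way, which is what allows a crawler to list it in dataset search results.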
Microsoft has launched a site called Microsoft Research Open Data.
Microsoft Research Open Data does not search the entire web; instead, it returns proprietary deep learning datasets, both image and text.
It offers 2,000 datasets totaling about 28 terabytes. It covers a wide range of topics and is distributed in order to share a large number of datasets. You can search within it, but not through Google's site. Besides downloading, you can also upload your own dataset to this site for others to use.
It is a commercial platform that helps you prototype, maintain, and retrain machine learning models. It offers 101 datasets from multiple sources, covering text, video, speech, and more.
It has 565 datasets.
KAGGLE PUBLIC DATASETS
The current listing is 10,992 datasets.
It has 302,944 datasets.
It offers only 8 datasets, since its main business is providing data in the loop. The accurate data it provides helps clients improve the accuracy of their own data. The most convincing case is in NLP, where chatbots are trained on items that have each been reviewed by multiple people drawn from multiple demographics. If active learning interests you, there is a good DSC webinar here featuring Figure Eight's leading experts in this field.
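For readers new to active learning, the core loop is simple: let the current model score the unlabeled pool, then send the items it is least sure about to human labelers. The following is a minimal sketch of that selection step (uncertainty sampling) for a binary classifier. The function names, the example chatbot messages, and the probabilities are invented for illustration and are not taken from the webinar or from any vendor's product.

```python
def uncertainty(prob_positive):
    """Return 1.0 when the model is maximally unsure (p = 0.5), 0.0 when it is certain."""
    return 1.0 - abs(prob_positive - 0.5) * 2.0

def select_for_labeling(predictions, budget):
    """Pick the `budget` items whose predicted probability is closest to 0.5."""
    ranked = sorted(predictions, key=lambda item: uncertainty(item[1]), reverse=True)
    return [text for text, _ in ranked[:budget]]

# Hypothetical pool: (message, model's probability that it is a support question).
pool = [
    ("Where is my order?", 0.97),
    ("Can I change it maybe?", 0.52),
    ("Thanks, bye", 0.05),
    ("Is this the refund desk", 0.40),
]

to_label = select_for_labeling(pool, budget=2)
print(to_label)
```

The two selected messages are the ones the model is least decided about, so a human label on them is likely to teach the model more than a label on the confident cases.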