This is the repository of the most frequently asked questions about the CrowdSignals.io data collection campaign.

Personal device data is one of the largest drivers of innovation in areas as diverse as computing, journalism, public health, social science, and urban planning, among others. Yet data collection campaigns are very expensive, time-consuming, and technically and legally challenging. In addition, researchers in many fields lack the funding and expertise to collect and analyze data from personal devices. Our goal is to provide students, developers, researchers, and scientists across a variety of fields with the data they need to solve important societal problems - and to collect this data using best practices that inform, engage, and compensate data collection volunteers while respecting privacy.

Since cost is a fundamental barrier to the collection of large, high-quality datasets, CrowdSignals.io will use crowdfunding to finance the collection of a massive, shared dataset at a per-sponsor cost that is orders of magnitude lower than do-it-yourself data collection. We aim to have enough sponsors to make the data accessible even to students who need data for a thesis or class project (e.g., sponsors pay $1-$2 per data collection volunteer). The hope is that CrowdSignals.io could generate a massive dataset that jumpstarts work on hundreds of problems that cannot currently be solved due to a lack of data.

The crowdfunding campaign will take place in early 2016. Data collection will follow shortly thereafter, and the collected data will be available a few weeks after collection ends - by August 2016.

AlgoSnap is a C corporation founded and run full-time by Evan Welbourne. AlgoSnap is the entity to which all legal agreements with CrowdSignals.io Sponsors and Volunteers will refer. AlgoSnap also built and runs the system that securely collects, anonymizes, stores, and maintains the personal data during collection. AlgoSnap receives advising from a global panel of experts in different areas such as privacy, data transparency, and ethical data collection. AlgoSnap also works with a major Silicon Valley law firm with world-class specialists in data privacy law. For more on AlgoSnap visit our website.

We have more than a decade of expertise in sensors, signal processing, machine learning, and pattern classification for mobile and smart devices. AlgoSnap was founded to provide solutions that accelerate the development of intelligent algorithms for IoT devices. As such, AlgoSnap's business is fundamentally about support for algorithm development, not the collection and sale of data. Moreover, all funds contributed by sponsors will be used to pay for data collection - this includes compensation for AlgoSnap staff in addition to the data collection volunteers. At the same time, the lack of quality data is a huge, pervasive problem in the development of intelligent algorithms - and one for which there is no suitable solution available. As a first step toward solving the data scarcity problem we're proposing CrowdSignals.io to build at least one massive dataset that can begin to propel research, innovation, and problem solving in many areas.

Our long-term plan with respect to CrowdSignals is to build the most powerful, robust, flexible, and easy-to-use platform for IoT data collection and make it available to the community for efficient, low-overhead data collection campaigns. We'll leverage our experience in building data collection and analytics infrastructure for numerous institutions including UW, Intel, Nokia, and Samsung, among others. We're testing the CrowdSignals platform with limited partners as of early 2016.

Regarding legal structure, AlgoSnap is a C corporation. If we can validate the CrowdSignals.io concept by crowdfunding and executing several initial data collection campaigns then we will look into a legal structure that allows CrowdSignals.io to be spun out as a non-profit.

In late 2015 we collected feedback from academia and industry on priorities for: (1) data collection parameters (e.g., data recorded, volunteer demographics, device categories), (2) ground truth event labels to collect from volunteers (e.g., activities, events, and situations) and (3) anonymization techniques and data license terms that maximize the utility of the data collected while offering an adequate level of protection for volunteers.

We're also gathering data on the number of potential sponsors in order to define appropriate funding levels for the crowdfunding campaign when it launches in early 2016. As noted above, we aim to have enough sponsors to make the data accessible even to students (e.g., $1-$2 per data collection volunteer), but this will depend on the community response.

Make sure we take your feedback into consideration by completing our online surveys:

The data will be collected from volunteers across the United States who are 19 years of age or older and who give their informed consent to participate in this data collection.
We're constraining data collection to the United States because we best understand data privacy law in the US - this simplifies legal agreements and reduces the cost of legal services. If CrowdSignals.io is successful we will certainly consider data collection outside US borders.
The types of data that can be collected from Android smartphones and smartwatches are listed on the main page of the CrowdSignals.io site under the "The Data" section. There you can select each data type for more details and examples of the format in which the data will be recorded. Also, take our 2-minute survey to vote on the types of data that you would like to see collected in CrowdSignals.io.

We are currently collecting feedback from academia and industry on which event labels are most important to collect from data collection volunteers. Take our 2-minute survey to vote on the labels that you would like to see collected in CrowdSignals.io. When we are closer to the data collection campaign, we will select the top labels suggested by the community and collect those during the campaign. You can also contact us if you have more specific feedback on the labels you would like to see collected.

Data will be collected from Android smartphones and smartwatches of different brands, models, and Android versions across the United States.
Data will be collected from volunteers 19 years of age and older. Efforts will be made to recruit volunteers from diverse demographics within this age range. Early data collection trials will include fewer than 100 volunteers, but we expect successive trials to include hundreds or thousands of volunteers. Take our 2-minute survey to vote on the demographics of the data collection participants to be recruited for CrowdSignals.io. Also, contact us if you have more specific feedback on this topic.

The cost of the data per sponsor will be determined by the number of sponsors that join the CrowdSignals.io crowdfunding campaign. We're actively refining estimates of the number of potential sponsors in order to set the costs associated with each funding level in our early 2016 crowdfunding campaign. As noted above, our goal is to drive the cost so low that we can make the data accessible to as many sponsors as possible, including students working on thesis or class projects (e.g., each sponsor pays $1-$2 USD per volunteer). For example, a dataset containing rich data from 100 users would cost approximately $100 (US) instead of $50k-$100k+ for a do-it-yourself data collection. The price and terms of use will vary slightly for academic vs. commercial use. We'd love to hear your feedback on the cost of the dataset.
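
To illustrate how the per-sponsor cost falls as more sponsors join, here is a minimal sketch in Python. It assumes the total campaign cost is split evenly across sponsors, which is a simplification: actual funding levels have not been set and will differ for academic and commercial use. The function names and the $50,000 campaign cost (the low end of the do-it-yourself range quoted above) are hypothetical values chosen only for illustration.

```python
def per_sponsor_cost(total_campaign_cost: float, num_sponsors: int) -> float:
    """Split the total campaign cost evenly across all sponsors (assumed model)."""
    if num_sponsors <= 0:
        raise ValueError("num_sponsors must be positive")
    return total_campaign_cost / num_sponsors


def per_volunteer_price(total_campaign_cost: float,
                        num_sponsors: int,
                        num_volunteers: int) -> float:
    """What each sponsor effectively pays per data collection volunteer."""
    if num_volunteers <= 0:
        raise ValueError("num_volunteers must be positive")
    return per_sponsor_cost(total_campaign_cost, num_sponsors) / num_volunteers


if __name__ == "__main__":
    # Hypothetical inputs, not campaign figures.
    campaign_cost = 50_000.0   # assumed total cost of collecting the dataset
    volunteers = 100           # dataset size used in the example above

    for sponsors in (50, 250, 500):
        cost = per_sponsor_cost(campaign_cost, sponsors)
        price = per_volunteer_price(campaign_cost, sponsors, volunteers)
        print(f"{sponsors} sponsors: ${cost:,.2f} per sponsor, "
              f"${price:,.2f} per volunteer")
```

Under these assumed numbers, 250 sponsors correspond to $2 per volunteer and 500 sponsors to $1 per volunteer, which is the range the campaign is aiming for.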

Sensitive data from volunteers such as contacts, call logs, SMS logs, and location information will be anonymized; datasets will also be watermarked for each individual sponsor receiving the data so that any unauthorized transfers of the data to third parties can be traced. Furthermore, all sponsors will be subject to a strict but fairly standard legal agreement that protects participant privacy (e.g., no reverse engineering identities, only ethical use).
To gain access to the data you need to become a sponsor of CrowdSignals.io during our crowdfunding campaign in early 2016. You can select one of the funding levels that will be made available to receive either an academic or commercial license to the data. To receive the data, you will have to fill out a simple application (e.g., name, institution, high-level reason for requesting the data) and agree to the data license.
This will depend on the academic institution on a case-by-case basis. If needed, researchers can use our "Institutional Review Board Kit" to minimize the pain of applying for approval. Our kit will provide the information most commonly requested in IRB and ethics committee applications. You will have to update a few sections, such as the one in which you describe what type of research you are performing with the data.
This will depend on the company on a case-by-case basis. If needed, researchers can use our "Legal Kit" when contacting their legal department for approval. Our kit will provide the information most commonly requested by legal teams. You will have to update a few sections, such as the one in which you describe what type of work you are performing with the data.
To gain access to the data, you will have to sign a nonexclusive, non-transferable, non-assignable, non-sub-licensable data license with AlgoSnap Inc. The following is a high-level summary of the terms:

  • Ownership: the volunteers who share their data, the sponsors, and the personal data to be collected will all be protected by a legal framework consisting of contracts and license agreements. AlgoSnap will be referenced in the legal documents as the entity that collects, protects, and licenses the personal data. To create and enforce license agreements, the dataset collected shall remain the exclusive property of AlgoSnap. AlgoSnap will also be responsible for collecting descriptions of the research results produced by backers (including paper publications) and posting them on its website so the community has access to them. AlgoSnap will also be responsible for posting information about what institutions and groups within the organizations have access to the dataset on its website.
  • Watermarking: sponsors consent to unique fingerprinting of the dataset, specific to the sponsor receiving the data. This ensures traceability of the data, as sponsors are not permitted to share the dataset with any third parties.
  • Permitted staff: the sponsor or institution will designate specific staff members to have access to the dataset. The name of the institution and group(s) having access to the data will be displayed on the AlgoSnap website for the broader community.
  • Site manager: sponsors will designate a person as a site manager in charge of supervising access to the data, its confidentiality, and ensuring no reverse engineering of the dataset to reveal personally identifiable information.
  • Data access: sponsors agree to inform AlgoSnap about who the sponsor's institutional site manager is, as well as the groups having access to the data so this information can be made available to data collection volunteers via the AlgoSnap community.
  • No distribution: backer agrees to not transfer or resell the data to third parties under any circumstance.
  • Commercial and academic use: both commercial and academic use of the data are permitted. In general, sponsors will be required to inform AlgoSnap about how the data was used, in what research area, and for what problem, as well as to provide a summary of the results so that they can be made available to the wider community including the participants who originally collected the data. See the "Usage Purpose" section of the license for more details on this. Please contact us with any early feedback on licensing.
  • Illegal use: any unethical, illegal, or criminal use of the dataset is strictly prohibited.
  • Publications: publications derived from the data are permitted as long as (1) a link to the publication is provided to AlgoSnap for sharing with the broader community on its website once the publication has appeared publicly at a conference and (2) a reference to the CrowdSignals.io campaign is made in the publication.
  • Usage purpose: sponsors must commit to informing AlgoSnap of how the data is being used and for what purposes (e.g., which research area and which particular problem to solve) so that this information can be shared with the broader community, including data collection volunteers, on the AlgoSnap website. To protect the confidentiality of companies' new products and competitiveness in academia, the sharing of this information with AlgoSnap Inc. can be postponed until the product is released into the market or the academic research article has been published.
  • Data license expiration: backer agrees to stop using and destroy their copy of the dataset after the data license period has expired.
  • Disclaimer: The above information about data licensing agreements is for reference purposes only and can change at any time as we collect feedback from the community on the best ways to balance the data's utility with the privacy and confidentiality needs of volunteers.

    Yes, the data collected in the CrowdSignals.io effort will be licensed to a group including yourself via legal agreement. However, a sponsor cannot share the data beyond the declared group. The dataset will also be watermarked with a fingerprint that is unique to each sponsor so that any unauthorized sharing of the data can be traced. Sponsors must appoint a site manager who supervises who has access to the data.

    Privacy and legal constraints are key reasons why we do not allow the dataset to be shared with any third party. Sponsors gaining access to the data will need to sign a legal agreement with AlgoSnap in which they commit to protect the personal data, for example by not reverse engineering the identities of the data collection volunteers or sharing the data with other entities who have not signed this legal agreement and might use the data for unethical or illegal purposes.

    There will be a limit; we are currently working out the details. If you have feedback, please do not hesitate to contact us.

    Yes, you can build a product with the collected data. We are still working out the details of the commercial license. The commercial license will be slightly more expensive, and licensees will still need to inform AlgoSnap (at a high level) about the kinds of uses of the data they are planning. However, to protect the confidentiality of the product, any more specific information sharing by licensees can be postponed until their product is released.
    Yes, you can, as long as you (1) send a link to the publication (this could be a link to a publisher's website) to AlgoSnap so that it can be made publicly available on our website after publication, and (2) reference the CrowdSignals.io campaign in your publication.
    Neither raw audio nor video will be recorded automatically in the data collection campaign due to privacy concerns. However, data collection participants may initiate raw audio and video recording if they wish to for a given ground truth labeling task (e.g., "tell/show what you're eating"). In such cases, volunteers will be fully informed and all the necessary consent will be in place.
    We will transfer the data to a sponsor's Google Drive or similar online storage account. At that time, extensive documentation on the format of the dataset will also be made available on the AlgoSnap website.

    Funds collected by the crowdfunding campaign will be applied to compensate volunteers as well as to pay for any equipment, cloud services, software development, AlgoSnap admin personnel, consulting, or legal services.

    We may define stretch goals if we surpass the initial funding goal set in a crowdfunding campaign.

    You can read our paper "CrowdSignals: A Call to Crowdfund the Community’s Largest Mobile Dataset" published at the ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp 2014).

    Please note that not all technical details in the paper (e.g., specific architecture and software components) are consistent with our current approach, nor is the notion of "open data", because the data collected in CrowdSignals.io will initially be available only to sponsors of the project. However, the main ideas and motivations are there, as well as a good summary of the costs associated with previous large-scale data collections.

    Note that there are two key reasons why we are not making the data collected in CrowdSignals.io immediately open to everyone: (1) sponsors gaining access to the data will need to sign a legal agreement with AlgoSnap in which they commit to protect the personal data by, for example, not reverse engineering the identities of the volunteers or sharing the data with other entities that have not signed this legal agreement, a level of protection that would not be possible with a completely open dataset; and (2) it would be unfair for sponsors who contributed funds to the campaign to later find out that other (possibly competing) groups obtained free access to the data at the same time. However, we will release the data more widely and free of charge 12-18 months after it is collected.

