
EXCLUSIVE – Data for good – What does a multinational financial services company have to do with a non-profit fighting trafficking?


OpenGov spoke to Mr. Steve Totman, Financial Services Industry Lead at Cloudera, about the use of big data by NGOs and charitable organisations to solve complex real-world problems, and about how Cloudera is involved in such initiatives.

A few years ago, MasterCard approached Cloudera to develop a PCI-compliant Hadoop environment². In the credit card industry¹, it is essential to ensure that cardholder data is properly secured and protected and that merchants and third-party solution providers meet minimum privacy levels for any application, database, or file system that plays a role in storing, processing, or transmitting account-related data. The Payment Card Industry Data Security Standard (PCI DSS) was formalised as an industry-wide standard in 2004, originating as separate data security standards established by the five major credit card companies: MasterCard, VISA, Discover, American Express, and the Japan Credit Bureau.

NGOs are also dealing with incredibly sensitive data. Mr. Totman said, “Sure you get upset if you lose your credit card. But imagine what it is like when you are dealing with victims of domestic abuse or with homosexual victims in a country where they might be persecuted or even killed.”

It turns out that the tools and frameworks used by multinational banks and credit card companies for collecting, processing, and protecting data, as well as for finding wrongdoing, translate remarkably well to the needs of NGOs dealing with some of the most vulnerable populations and the most dangerous criminals.

Going further, consider a large financial regulatory body that tracks stock transactions, looking for insider trades. It receives data from hundreds of brokerages: not just structured transaction data, but also audio files, e-mail communications, and text messages. All that data is placed into a massive store, so that the regulator can look for individual stock transactions with specific characteristics indicating collusion. That kind of logic is exactly what NGOs would want to use if they are looking for people doing bad things or searching for people who need help. For example, organisations looking at trafficking will look at sources such as Craigslist or discussion boards on the dark web, pulling in unstructured data in the form of video, audio, text, and images.

Hadoop is well-equipped to handle unstructured data, and it can blend structured and unstructured data together. Traditional relational databases are built around a data model, or schema, that must be defined before data is loaded. In Hadoop, you can store any form of data and flexibly apply the schema afterwards. This was one of three points mentioned by Mr. Totman in explaining how Hadoop differs from legacy databases and why it is especially suited to the requirements of large corporates such as banks and telecom companies, as well as non-profits trying to use data to tackle challenging problems.
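The idea of storing data first and applying a schema afterwards can be illustrated with a minimal sketch. It uses plain Python rather than any Hadoop API, and the sample records and field names are hypothetical, chosen only for illustration.

```python
import json

# Raw records are stored as-is; no schema is enforced at write time.
# (Hypothetical sample data; the field names are assumptions.)
raw_store = [
    '{"type": "transaction", "account": "A1", "amount": "120.50"}',
    '{"type": "message", "sender": "A1", "text": "call me"}',
    '{"type": "transaction", "account": "B7", "amount": "15"}',
]


def read_transactions(lines):
    """Apply a schema at read time: keep transactions, cast amount to float."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if record.get("type") == "transaction":
            rows.append(
                {"account": record["account"], "amount": float(record["amount"])}
            )
    return rows


transactions = read_transactions(raw_store)
print(transactions)
```

The same raw store could later be read with a different schema, for example to extract only the messages, without reloading or restructuring the stored data.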

The other two differentiators are the significantly lower cost of dealing with large volumes of data (between 20 and 100 thousand dollars a year for 1 TB of data on traditional databases, versus a couple of thousand dollars on Hadoop) and the flexibility to add new data sources and analyse them within short time frames: a few hours, compared to a few months on earlier systems.

Use cases

Mr. Totman walked us through a few examples of the kind of big data applications he had been talking about.

An Israel-based data analytics company, Treato, aggregates patient experiences from the Internet, organising them into usable insights for patients, physicians, and other healthcare professionals. It crawls the web for user-generated content about medicines, symptoms, side effects, and other health-related topics.

The volume of data is not the only challenge. Treato also needs to process colloquial language, such as that used in social media posts, combine it with medical terminology, and translate it into actionable insights. By 2013, Treato had aggregated and analysed more than 1.1 billion online posts about over 11,000 medications and over 13,000 conditions from thousands of English language websites. The Treato website currently claims to provide information on 14,748 symptoms and conditions and 26,616 medications and treatments.

In collaboration with Cloudera, Patterns and Predictions, a predictive analytics firm, developed an artificial intelligence (AI) solution that predicts mental health risk through opt-in analysis of social media and mobile text, with the goal of identifying indicators of suicidality, particularly among veterans, so that preventative action can be taken. The solution represented an extension of previous collaborations between the two organisations as part of The Durkheim Project, a DARPA-funded research program that ran from 2011 to 2015 and demonstrated the capability of big data technologies to effectively detect suicide risk at Internet scale.

Thorn is the non-profit referred to in the title of this article.

Thorn: Digital Defenders of Children is a non-profit dedicated to driving technology innovation to fight child sexual exploitation. Thorn partners with players from the technology industry, government, and non-governmental organisations, working to deter predatory behaviour, disrupt platforms that enable abuse, and accelerate victim identification.

Children are often bought and sold online through classified sites or escort pages (this applies to 63% of child sex trafficking victims, according to the Thorn website). If technology was facilitating these heinous crimes, Thorn wanted to find the solution within technology as well, using the online information about these crimes to find the children more rapidly and connect them with victim services.

Thorn and Digital Reasoning, a provider of cognitive computing services to intelligence agencies and financial institutions, created Spotlight, a cloud-based collection and analysis tool that provides intelligence and leads on suspected human trafficking networks and individuals, in order to identify and assist victims. Cloudera’s CDH platform provides the infrastructure: the harvested data is organised in HDFS, and distributed processing runs state-of-the-art natural language processing and analytic algorithms on it.

Spotlight has become the leading investigative tool for child sex trafficking investigations in the United States, with over 1,300 law enforcement users across 46 states.

Going the extra mile

Mr. Totman explained that it is not that difficult for charities to get software or consulting services at little or no cost. But they also need skilled people who know how to use those resources and how to deal with data correctly.

Through Cloudera Cares, employees are encouraged to donate time and resources to these initiatives. The company’s customers have also expressed interest and are searching for mechanisms to get involved. Typically, they will throw money at the problem. But they can also provide data scientists, and Cloudera is attempting to facilitate this borrowing of talent.

For instance, Cloudera recently collaborated with Intel and the National Center for Missing & Exploited Children (NCMEC) on a month-long virtual hackathon focused on innovative ways to locate missing children. Cloudera also organised a hackathon last year to explore new ways of using data to fight and prevent the Zika virus. These events provide opportunities for Cloudera and its partners to contribute to the use of “data for good.”


At the recent Strata Hadoop World San Jose event, Mr. Totman moderated a panel discussion titled “Big Data as a force for good”, which addressed the unique challenges humanitarian organisations and not-for-profits face in the big data world. The panelists included NetHope, a non-profit organisation working with over 20 international development organisations to identify key ICT-related needs arising from the Syrian refugee crisis. Its efforts have included providing Wi-Fi hotspots and charging stations in camps and along the migration route. As Mr. Totman explained, the first thing refugees need when they get off the boats is food and water. The next most important thing is connectivity. They need it to let their families know that they have arrived safely. Sometimes it is essential for their safety and survival, as during a period in late 2015 and early 2016 when applications for asylum in Greece could only be submitted through Skype.

To give someone a Wi-Fi connection, you end up storing the MAC (media access control) address of the phone, which amounts to holding some basic information about the person. The EU’s General Data Protection Regulation (GDPR) includes the right to be forgotten, which essentially means the right to have one’s data deleted. For NetHope to be able to delete that information on request, it would need to store additional information, so that people who later ask for deletion can prove that the data belongs to them.

This adds to the burden of protecting the information. The network has to be guarded against antagonists engaged in the Syrian conflict, who could infiltrate and cripple it and expose both refugees and humanitarian aid workers to outside risks. There is also the risk of a private entity or a government carrying out hacks in support of national interests. Strong cybersecurity and privacy protocols had to be integrated into the network.

Mr. Totman said that data wants to be shared. But he pointed out that there are concerns regarding storage, protection, and ownership. There are strict legal and ethical implications around that.

Cloud platforms offer a range of interesting options. But it also matters whether the cloud platform has a local data centre. Anonymisation and tokenisation play an important role. Anonymisation turns data into a form from which information about individuals cannot be recovered. Tokenisation transforms the data in such a way that the original values can be recovered under certain legal circumstances.
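The difference between the two approaches can be shown with a simplified sketch. It is not how Cloudera or NetHope implement either technique: the `TokenVault` class, the `anonymise` helper, and the sample record are hypothetical, and anonymisation is reduced here to its simplest form, dropping the sensitive fields. A real token vault would also need encrypted storage and strict access controls.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault mapping random tokens to original values."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenise(self, value):
        # Reuse the existing token so the same value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenise(self, token):
        # Recovery is possible only for whoever is allowed to query the vault.
        return self._token_to_value[token]


def anonymise(record, sensitive_fields):
    """Drop the sensitive fields entirely; nothing is kept to reverse this."""
    return {k: v for k, v in record.items() if k not in sensitive_fields}


record = {"name": "Jane Doe", "mac": "00:11:22:33:44:55", "camp": "A"}
vault = TokenVault()

tokenised = dict(
    record,
    name=vault.tokenise(record["name"]),
    mac=vault.tokenise(record["mac"]),
)
anonymised = anonymise(record, {"name", "mac"})
```

The tokenised record can be mapped back to the original through the vault, while the anonymised record retains nothing from which the name or MAC address could be restored.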

There are questions around what to anonymise and what to tokenise. With anonymisation, the frequency of the data and the relationships between fields have to be taken into account. For instance, you might choose to anonymise an uncommon surname, but if the user is in a country where their first name is rare, the first name alone could be enough for identification. Cloudera went through these kinds of issues with MasterCard.
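One way to reason about the frequency concern is to count how many records share each combination of field values and flag the combinations that are too rare. The sketch below is loosely inspired by k-anonymity and is an illustration only; the function name, the threshold `k`, and the sample records are assumptions.

```python
from collections import Counter


def rare_combinations(records, fields, k):
    """Return field-value combinations shared by fewer than k records."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical records in which the surname has already been removed.
records = [
    {"first_name": "Maria", "country": "GR"},
    {"first_name": "Maria", "country": "GR"},
    {"first_name": "Maria", "country": "GR"},
    {"first_name": "Xanthippe", "country": "GR"},
]

risky = rare_combinations(records, ["first_name", "country"], k=2)
print(risky)
```

Any combination returned here identifies so few people that the remaining fields may still single out an individual, even though the surname is gone.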

Cloudera strengthened its encryption capabilities with the acquisition of Gazzang in 2014 and later by pushing the encryption itself into the chipset, working directly with Intel. Today, hackers are very sophisticated and organised, sharing data, information on vulnerabilities, and hacking tools. But companies have not been coordinating in the same fashion. To bridge this gap, Intel and Cloudera initiated an open-source project called Apache Spot.

Cloudera has developed a data governance solution called Navigator, which enables monitoring access to sensitive assets and seamlessly enforcing policies across the enterprise. The data lineage or provenance can be traced through Navigator.

Ultimately, data governance is a combination of people, processes, and technology. Frameworks like privacy-by-design help, but there are no simple answers.

Data can be a force for good in the world, helping chip away at apparently intractable problems. But it’s not enough to have data to solve a problem. You must show how it was collected, how it was stored and used, and it has to be protected all the way through. Security, lineage, and governance matter to banks and to charities alike. The transfer and sharing of tools, talent, and knowledge would help unlock the true potential of data while dealing with these tricky concerns.

¹ Steve Totman is Cloudera's Industry Leader in Financial Services, Data Management Tooling and Ethical Data Governance, helping companies monetize their Big Data assets using Cloudera’s Enterprise Data Hub. Steve works with over 100 customers worldwide and helps several verticals in building architectures through data management tools and data models. Prior to Cloudera, Steve ran strategy for a Mainframe to Hadoop company and drove product strategy at IBM for DataStage and Information Server after the Ascential acquisition. He architected IBM’s Infosphere product suite and led the design and creation of governance and metadata products like Business Glossary and Metadata Workbench. Steve holds several patents in data integration and governance/metadata related designs.

² Cloudera is the largest provider of Apache Hadoop based software, support and services. Apache Hadoop is an open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

