Interview with Chao Han, Head of Data Science at Lucidworks

By Sudipto Ghosh On Sep 5, 2018

Know My Company

Tell us about your interaction with AI and other intelligent technologies in your daily life.

As the Head of Data Science at Lucidworks, it’s my job to build AI solutions into our search platform called Fusion and provide a better search experience to our customers.

As a data scientist, how do you see the raging trend of including ‘AI in everything’?

Based on our customers’ feedback, we do see that simply turning on the AI features in Fusion can bring big KPI growth and a fast lift in click-through rate. That’s the way to go to gain competitive advantage and save human resource costs.

What are the most challenging aspects of working with AI/Machine Learning? How do data scientists turn this into a goldmine of opportunities for the industry and customers?

There are two major challenges customers face before using Fusion.

The implementation time of ML models is usually very long. It’s not uncommon for engineers to take at least six months to put an ML model, built by data scientists, into production at large scale.
The ML algorithms need to be robust and easy to use. Extra effort usually has to be put into the algorithm to make the models robust across different dataset samples and easy for business users to tune.

Fusion has many out-of-the-box (OOTB) solutions that shorten the implementation time drastically, and we have built several algorithms in-house that are easy to tune and deliver higher quality than existing open source ML solutions.

As a mentor in the tech industry, how should young marketers and sales professionals train themselves to work better with AI and virtual assistants?

Today, even for professionals who are not data scientists, it would be helpful to know basic ML concepts such as clustering, classification, and forecasting. The deeper your understanding of ML methods, the better you can connect your business problem with the right AI tool.

If you are interested, some basic data extraction and visualization skills can be fun and helpful to have, so that you can easily use the AI tool to perform your own analysis and build your own reports.

But overall a great AI tool should make the consumption of ML models seamless for the user.

How much do data scientists interact with the company’s business leaders and the decision-makers?

Because Lucidworks is an AI-driven software company, our data scientists interact very closely with executive-level decision makers to make sure every party is clear on what AI can make possible. It’s also beneficial for data scientists to be aware of the marketing focus or any direction changes in the company. I think any company that wants to put AI as a top priority should make sure there is a direct communication channel between C-level executives and data scientist group leaders.

How do you measure the success/failure of your work at Lucidworks?

The ultimate measurement is our customer satisfaction. The data science team is in direct contact with our users to get feedback about our AI features.

Can they solve the problem at hand with minimal setup effort?

What’s the quality of the results?

How about the job running speed?

We ask those questions and gather information, often with user review results, to better understand our performance and provide fast improvements if desired.

How do you consume information on AI/ML and related topics to build your opinion?

Visualizations are fast and powerful tools to help understand the data distribution, compare different options, and draw conclusions. Keep in mind that hardly any ML model can provide 100% accurate results, but the models can automate many different manual processes and increase business efficiency. It’s also good to pay attention to the assumptions of the model or tests to make sure the model results actually answer your questions.

How can companies further improve their search functionality by using Natural Language Processing (NLP) and AI? What makes Lucidworks a leader in this space?

Lucidworks delivers AI-powered enterprise search to Fortune 2000 companies both on-premise and in the cloud. Examples of search problems we help our customers with include search on corporate intranets, product search on an ecommerce website, and general information search such as search on Reddit.

Our flagship platform is called Fusion. It allows companies to incorporate intelligent search features such as product recommendations, spell-check, auto-suggestions, and query rewriting into their applications without having to build their own systems from scratch, enabling them to better compete with the likes of Amazon or Google.

Since we have focused on ML features over the last few years and enabled many NLP capabilities, our applications are not limited to just search. For example, we have a suite of e-discovery algorithms such as document clustering, anomaly detection, topic trend analysis, and synonym discovery, which can serve needs beyond the search department of a company.

Our ultimate goal is to solve the last-mile problem in AI: to make complex data science available to end users directly, without requiring the user to have any background in data science.

What makes understanding AI so hard when it comes to actually deploying it?

Many AI tools are designed mainly around data, not people. A great AI platform, by contrast, is usually solution-based rather than tool-based. For example, the entry point of the platform can be a questionnaire that lets users point to their problems, so the platform can automatically choose and combine tools or workflows to help solve them. Or if a certain problem always follows a certain pattern or process (e.g. risk management, cyber security), then the platform can be built specifically for such problems.

A key differentiator between Fusion and other vendors is that we provide operational AI that non-data scientists can easily adapt and use to solve their own problems.

We conduct extensive testing to make sure the default parameters of our OOTB models work for most use cases in different scenarios. And we always include a domain expert in the entire software design cycle to make sure the product is easy to use and aimed at the right target.

How does an end-to-end solution that captures online behavior data help a company better compete with the likes of Amazon or Google search?

Data richness and quality can matter more than modeling techniques in data science. Knowing that, we provide end-to-end solutions in Fusion, from data capturing to result interpretation.

For example, Fusion can automatically capture online user behavior such as searches, product clicks, add-to-cart events, and purchases. That information is ingested and stored in Solr, then automatically transformed into a format that our ML models can consume. In the backend, we use Spark to run ML models such as recommendation, learning to rank (LTR), query intent classification, query analytics, and spell checking.

Finally, the results are presented in our UI dashboard and connected directly to a pipeline to help improve search relevance at query time. The whole process is very streamlined and easy to set up, which makes AI more operational. Rather than waiting months or even years to see fruit from a data science project, you can see a big impact in a short time frame using Fusion.
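The step of turning captured behavior into something a model can consume can be illustrated with a minimal sketch. This is not Fusion’s implementation: the event fields, the signal weights, and the function names below are all hypothetical, and plain Python stands in for the Solr and Spark stages described above.

```python
from collections import defaultdict

# Hypothetical weights per signal type; illustrative only, not Fusion defaults.
SIGNAL_WEIGHTS = {"click": 1.0, "add_to_cart": 3.0, "purchase": 5.0}


def aggregate_signals(events):
    """Sum weighted signals for each (query, product) pair."""
    scores = defaultdict(float)
    for event in events:
        weight = SIGNAL_WEIGHTS.get(event["type"])
        if weight is None:
            continue  # skip signal types that have no assigned weight
        scores[(event["query"], event["product"])] += weight
    return dict(scores)


def top_products(scores, query, k=3):
    """Rank products for one query by aggregated score, highest first."""
    ranked = [(product, score) for (q, product), score in scores.items() if q == query]
    # Sort by descending score, breaking ties alphabetically by product id.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return [product for product, _ in ranked[:k]]


# Made-up sample events, only to show the shape of the data.
events = [
    {"query": "laptop", "product": "A", "type": "click"},
    {"query": "laptop", "product": "B", "type": "click"},
    {"query": "laptop", "product": "B", "type": "purchase"},
    {"query": "laptop", "product": "A", "type": "add_to_cart"},
    {"query": "phone", "product": "C", "type": "click"},
]
scores = aggregate_signals(events)
print(top_products(scores, "laptop"))
```

Aggregated scores like these are one simple kind of input a recommendation or ranking model could start from; a production system would typically also account for time decay, position bias, and much larger data volumes.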

Which is harder – choosing AI or working with it?

There can be a big difference in implementation time and usability between a good AI platform and a bad one. Choosing the right one can make your life easier, and you’ll see improvements in a short period of time. By contrast, a bad choice of AI software can waste a lot of time and require more human input than needed. It may even reduce business efficiency if the investment is too big. Always compare different vendors and make sure the vendor of choice has a great support and advisory program.

Would most businesses turn to AI eventually for better performance?

Not necessarily. Again, it depends on the balance between AI investment and reward. Avoid software that requires a long implementation period, because software that sits on the shelf without being adopted can be a big waste.

Where do you see AI/Machine learning and other smart technologies heading beyond 2020?

As for trends in the ML industry, in recent years, with the healthy development of the open source community, more and more analysts have been shifting from traditional analytics tools to open source languages such as Python, Scala and R. Large-scale adoption of open source tools will keep happening beyond 2020.

On the methodology side, it’s worth mentioning the fast-growing deep learning (DL) community. New DL architectures come out every month. A graduate student can easily build a DL model that beats state-of-the-art models traditionally built by a big team. But I have to admit, because we still lack knowledge of the underlying math of “why” and “how” DL works so well, we still need to use different tricks while building models from scratch to prevent overfitting and to avoid burning a lot of GPU power. Those are the factors that make DL in production and at large scale hard.

That’s why in our R&D practice at Lucidworks, we always design the research experiments to compare traditional methods vs DL methods. If we have to run a model on a GPU farm for a week to increase prediction accuracy by less than 2%, then the traditional way can be a more cost-effective solution. Combining that with each customer’s cost concerns, we are able to provide either a traditional solution or a DL solution.
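The trade-off in this comparison can be written down as a small decision rule. The sketch below is only an illustration of the reasoning, not the process Lucidworks uses: the function name, the thresholds, and the reading of “2%” as two percentage points of accuracy are all assumptions.

```python
def choose_model(baseline_accuracy, dl_accuracy, dl_gpu_hours,
                 min_gain=0.02, gpu_hour_budget=7 * 24):
    """Pick "traditional" or "deep_learning" from accuracy gain and GPU cost.

    min_gain and gpu_hour_budget are hypothetical thresholds: a gain below
    two accuracy points that costs a week or more of GPU time is treated
    as not worth it.
    """
    gain = dl_accuracy - baseline_accuracy
    if gain <= 0:
        # The DL model is no better, so the cheaper traditional model wins.
        return "traditional"
    if dl_gpu_hours >= gpu_hour_budget and gain < min_gain:
        # Small improvement at a high compute cost.
        return "traditional"
    return "deep_learning"


# One accuracy point gained for about 200 GPU hours: not worth it under this rule.
print(choose_model(0.90, 0.91, dl_gpu_hours=200))
# Ten accuracy points gained for the same cost: the DL model is chosen.
print(choose_model(0.80, 0.90, dl_gpu_hours=200))
```

In practice the thresholds would come from each customer’s cost concerns rather than fixed defaults.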

Given the fast progress of DL research, a breakthrough may come around 2020 or even sooner to make DL models more tractable and easy to tune, thus making DL in production easier.

The Good, Bad and Ugly about AI that you have heard or predict

The Good: AI can reduce manual work, lift KPIs, and provide preventative alerts, and it can contribute to many industries such as health care, banking and ecommerce.

The Bad: high cost if you choose the wrong AI platform, plus long implementation and adaptation time.

The Ugly: The AI community needs to pay attention to ethics problems caused by AI, such as cyber security, fake tweets, and using robots as weapons.

The Crystal Gaze:

What AI start-ups and labs are you keenly following?

OpenAI, fast.ai, H2O, PipelineAI

What technologies within AI and computing are you interested in?

All kinds, ranging from unsupervised and semi-supervised to supervised ML methods.

Recently I’ve been especially interested in DL-backed NLP methods and different encoding techniques. I’m also using Spark with Scala daily to build ML jobs in Fusion and have found it to be a very good tool for bridging the gap between data scientists and engineers.

As a tech leader, which industries do you think will be fastest to adopt AI/ML smoothly and efficiently? What are the new emerging markets for AI technology?

Industries like marketing and ecommerce are better equipped to adopt AI/ML because regulations place fewer constraints on their use of open source tools.

Certain departments at banking and pharmaceutical companies are changing too. A lot can be done in the area of IoT, in addition to data collection and summarization. I can also see the healthcare industry undergoing a revolutionary change led by current DL-driven technologies.

What’s your smartest work-related shortcut or productivity hack?

I follow several great data scientists on LinkedIn. If at least two of them share the same blog post or paper, I had better read it to keep up to date with the most recent AI developments.

Tag the one person in the industry whose answers to these questions you would love to read:

Mike Tamir, head of data science at Uber.

Thank you, Chao! That was fun, and we hope to see you back on AiThority soon.

About Chao

Chao is a data scientist with over 10 years of analytical experience in both academia and industry. She currently works at Lucidworks, an enterprise search engine company, where she helps build a new product called Fusion AI, which offers functionality such as recommendation, query analytics, automatic document clustering and a QA system.

Chao received her PhD in Statistics from Virginia Tech in 2012 (dissertation: Bayesian visual analytics for high dimensional data, with 8 publications). After graduation she worked at JPMorgan Chase R&D supporting projects in the areas of transaction text mining, social media sentiment analysis, fraud detection, default prediction and target marketing. Chao also initiated and led the “Robot Modeler” project to reduce predictive modeling time from months to days. She joined SAS in 2015 to help develop a new platform: an in-memory multi-threaded analytic engine that enables fast model implementation calculations on a gridded network.

About Lucidworks

Lucidworks builds AI-powered search and discovery applications for some of the world’s largest brands. Fusion, Lucidworks’ advanced development platform, provides the enterprise-grade capabilities needed to design, develop, and deploy intelligent search apps at any scale. Reddit, Red Hat, Moody’s, Commvault, and the US Census are just a few of the organizations that rely on Lucidworks every day to power their consumer-facing and enterprise search apps. Lucidworks’ investors include Top Tier Capital Partners, Shasta Ventures, Granite Ventures, Silver Lake Waterman, and Walden International.