It’s time to address the reproducibility crisis in AI


GUEST: Recently I interviewed Clare Gollnick, CTO of Terbium Labs, on the reproducibility crisis in science and its implications for data scientists. The podcast seemed to really resonate with listeners (judging by the number of comments we’ve received via the show notes page and Twitter), for several reasons. To sum up the issue: Many resear…Read More

Getting Linked In to Data Science with Dr. Igor Perisic

Dr. Igor Perisic – Chief Data Officer

Episode 11, February 7, 2018


Big data is a big deal, and if you follow the popular technical press, you’ll have heard all the metaphors: data is the new oil, the new bacon, the new currency, the new electricity. It’s even been called the new black. While data may not actually be any of these things, we can say this: in today’s networked world, data is increasingly valuable, and it is essential to research, both basic and applied.

How IBM builds an effective data science team

Data science is a team sport. This sentiment rings true not only with our experiences within IBM, but with our enterprise customers, who often ask us for advice on how to structure data science teams within their own organizations.

Before that can be done, however, it’s important to remember that the various skills required to execute a data science project are both rare and distinct. That means we need to make sure that each team member can focus on what he or she does best.

Consider this breakdown of a data science project, along with the skills required for each role:


Yet Another Turning Point….

As some readers at this place already know, the boring fact is that I started work in the publishing and information industry in October 1967, and have thus spent over fifty years as an observer of change in these parts. And, in what some regard as a fifty-year dotage, I am prone to remark that change is the new normal etc. etc. and pour scorn on the wealthy publisher who I approached for work in 1993 and who replied “tell me when your digital revolution thing is over and then help me to cope with the next five hundred years of the post-printing world”. And I quite see the point. Revolutions are not for everyone. And there were comfortable years in my twenties when it seemed possible to believe that Longman and OUP, Nelson and Macmillan, could go on ruling the post-colonial world of school textbook publishing with nothing more exciting than a revised Latin syllabus to stir the waters of their creativity. Yet in truth the world of print, from the rise of Gutenberg to the fall of the house of Murdoch, has been full of change. It just happens faster and more completely now.

Publishing with Apache Kafka at The New York Times

At The New York Times we have a number of different systems that are used for producing content. We have several Content Management Systems, and we use third-party data and wire stories. Furthermore, given 161 years of journalism and 21 years of publishing content online, we have huge archives of content that still need to be available online, that need to be searchable, and that generally need to be available to different services and applications.

Your WordPress plugins might be silently losing business data

If your WordPress site uses third-party plugins, you may be experiencing data loss and other problematic behavior without even knowing it.

Like many of you, I’ve become quite attached to WordPress over the past 15 years. It is by far the most popular content management system, powering 28 percent of the Internet, and still the fastest growing, with over 500 sites created on the platform each day. Considering myself well versed in the software, I was surprised to discover — while working on a digital design project for a client — what could be the Y2K of WordPress. Many WordPress plugins are suffering data loss, and it looks like this problem will soon explode if not properly addressed.

The issue is essentially due to the fact that WordPress discards entire datasets even when only one of the data elements within the set contains too many characters for the insertion field. Because WordPress doesn’t log the data loss or any errors related to it, few developers are aware of the issue. And because of one particular scenario involving storing a visitor’s data when they’re connecting with an IPv6 address, the situation is exponentially worse.

Example: Say a WordPress site owner has a plugin installed that lets users add comments. Plugins like that typically store the user’s IP address along with comments they submit, for analytics purposes. For years, plugin developers have assumed that IP addresses were always in the standard IPv4 format, which is at most 15 characters long. Thus, plugin developers typically set the maximum allowed characters for the IP address database field their plugin uses to about 15-20 characters. However, IPv6 has a much longer 39-character format that looks like this: 2001:0db8:85a3:0000:0000:8a2e:0370:7334.
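To make the length mismatch concrete, here is a minimal Python sketch using the standard `ipaddress` module. It is an illustration only, not WordPress or plugin code: the 15-character `COLUMN_MAX` and the `fits` helper are hypothetical stand-ins for a plugin's IP address column, and `255.255.255.255` is used simply because it is the longest possible IPv4 string.

```python
import ipaddress

# Longest possible dotted IPv4 string, and the IPv6 example from the text.
LONGEST_IPV4 = "255.255.255.255"
IPV6_EXAMPLE = "2001:0db8:85a3:0000:0000:8a2e:0370:7334"

# Hypothetical column width, standing in for a plugin's 15-character field.
COLUMN_MAX = 15


def fits(address: str, column_max: int = COLUMN_MAX) -> bool:
    """Return True if a valid IP address string fits in the column."""
    # ip_address() raises ValueError for malformed input.
    ipaddress.ip_address(address)
    return len(address) <= column_max


ipv4_len = len(LONGEST_IPV4)
# .exploded is the full, uncompressed textual form of the address.
ipv6_len = len(ipaddress.ip_address(IPV6_EXAMPLE).exploded)
ipv4_fits = fits(LONGEST_IPV4)
ipv6_fits = fits(IPV6_EXAMPLE)
print(ipv4_len, ipv6_len, ipv4_fits, ipv6_fits)
```

The IPv4 string fills the hypothetical column exactly, while the fully written-out IPv6 address needs more than twice that width.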

Unbeknownst to many users, site owners, and developers alike, these longer IPv6 addresses are becoming increasingly widespread. Those new addresses won’t fit into the database fields developers have been using for years. Furthermore, for security purposes, WordPress specifically validates that each part of a data set about to be stored will fit. In the example above, if the IP address is too long, WordPress discards the entire data set (not just the oversized IP address string). Worse, WordPress doesn’t log an error when this happens. The data is simply lost to the ether, without leaving a trace. This two-year-old WordPress bug thread shows how long the WP core devs have known that the community didn’t like this, but they still haven’t addressed it.

Yes, this currently just affects data coming from IPv6 addresses (currently about 17 percent of users). But while IPv6 use may be in the minority right now, it won’t be for long, and as it becomes the majority, these unexplained issues with data loss will reach pandemic proportions if left untreated.

Just how widespread is this?

1.02 million active WordPress plugin installs are silently discarding real visitor logs, content submissions curated by users, and more, right now, all because IPv6 addresses are present in the data being stored. Here are some other interesting stats:

  • 50,336 plugins are available today
  • 200 plugins (~1 in 250) create IP address fields that are too short
  • Those 200 plugins have over 1 million active installs — a total of 1,023,280.
  • Here’s a publicly-accessible Google Sheet my team created that lists all known offending plugins. For each plugin, that sheet includes one example where that plugin declares an IP address field that is too short.

The fix is easy peasy: You simply need to change the table schema so that the column that stores IP addresses allows 39 characters (or more) instead of 15.

This problem can affect applications other than WordPress; really, any application that utilizes IP addresses and stores them in MySQL/PostgreSQL tables (especially in STRICT mode, which would prevent row inserts) where the column max is expecting a 15-character IPv4 IP address.

Debuggin’ the plugin

I uncovered this situation while recently working on a site that needed a rating system that allowed authenticated users to vote on specific post types. So naturally, I did a search of existing plugins that could meet the requirements and found one fairly quickly, CBX Rating, and it was a breeze to configure and get working. Then came the intermittent reports of the form submissions not going through.

I spent hours deactivating other plugins, digging through code, and guiding users via screenshare. I was unable to narrow it down or find any smoking gun. No success message, no error message, no errors in the console log, nothing in the server logs. How could form submissions be failing without errors?

I remembered something I had seen in WordPress before: row inserts silently failing if the data strings were longer than the table column maximums. So I shifted my attention to the back end, and that’s where I found the problem. My boss, Erik Neff (the company’s CTO), helped identify exactly why it was happening.

MySQL databases, not in STRICT mode, will truncate values if they’re over the max character count for a particular column and will insert the new record with a warning. When in STRICT mode, MySQL will not accept the record and will return an error. WordPress, on the other hand, won’t execute a query if it determines the length is longer than the max, and will instead return false, with no error or warning.

When using the WordPress $wpdb->insert method, you get back the number of rows inserted (1) upon success and false upon failure. But a function is called before any MySQL statements are executed, and that’s where the problem lies. The function, a protected method called process_field_lengths, checks whether each value’s length is within the max allowable length for its table column. If any value is longer than allowed, the entire insert is aborted and false is returned with no error message or explanation. This is a known issue with WordPress core, and makes debugging that much harder.
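The three behaviors described above can be contrasted in a short Python simulation. This is a sketch of the behavior as the article describes it, not WordPress or MySQL code: the function names, the `limits` dictionary, and the `StrictModeError` class are all hypothetical.

```python
class StrictModeError(Exception):
    """Stands in for the error a strict-mode database returns."""


def insert_non_strict(row, limits):
    """Truncate oversized values, store the row, and collect warnings."""
    warnings = [
        f"{col} truncated" for col, val in row.items() if len(val) > limits[col]
    ]
    stored = {col: val[: limits[col]] for col, val in row.items()}
    return stored, warnings


def insert_strict(row, limits):
    """Reject the row with an explicit error if any value is too long."""
    for col, val in row.items():
        if len(val) > limits[col]:
            raise StrictModeError(f"data too long for column {col}")
    return dict(row)


def insert_silent(row, limits):
    """Abort the whole insert and return False, with no message at all."""
    if any(len(val) > limits[col] for col, val in row.items()):
        return False
    return 1


# Hypothetical column limits and a comment submitted from an IPv6 address.
limits = {"comment": 200, "ip": 15}
row = {"comment": "Great post!", "ip": "2001:0db8:85a3:0000:0000:8a2e:0370:7334"}

stored, warnings = insert_non_strict(row, limits)
silent_result = insert_silent(row, limits)
try:
    insert_strict(row, limits)
    strict_error = None
except StrictModeError as exc:
    strict_error = str(exc)

# The caller must test for False explicitly to notice the silent case.
if silent_result is False:
    print("insert aborted: the comment was lost along with the IP address")
```

In the first two cases something is left behind to debug with (a warning or an error). In the third, the only signal is the return value, so code that never checks it loses the whole row without a trace.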

The CBX Rating plugin we were using didn’t account for this failure point. I checked the plugin’s table schema and started increasing varchar max lengths across the board. Touchdown! Soon after, I got wind from users of all types that all forms were now being submitted successfully.

My mind raced to how this could be an epidemic, so Erik and I set out to determine the scale. The result of a (rather lengthy) check of WordPress plugins yielded a list of every place an IP address field was declared with an incorrect length. You can find those results in the Google sheet that I’ve made public.
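For readers who want to audit their own code, one way such a check could be approximated is with a regular expression over schema declarations. The Python sketch below is not the method Erik and I actually used. The pattern, the assumption that IP columns have "ip" in their name, and the sample schema are all hypothetical.

```python
import re

# Matches declarations such as `user_ip varchar(15)`. Assuming that IP columns
# contain "ip" in their name is a heuristic for this sketch only.
COLUMN_RE = re.compile(
    r"`?(\w*ip\w*)`?\s+(?:var)?char\s*\(\s*(\d+)\s*\)", re.IGNORECASE
)

# Length of a fully written-out IPv6 address, per the example earlier.
MIN_IPV6_LENGTH = 39


def find_short_ip_columns(source, minimum=MIN_IPV6_LENGTH):
    """Return (column_name, declared_length) pairs shorter than `minimum`."""
    return [
        (name, int(length))
        for name, length in COLUMN_RE.findall(source)
        if int(length) < minimum
    ]


# Hypothetical schema, not taken from any real plugin.
sample = """
CREATE TABLE example_ratings (
  id bigint(20) NOT NULL AUTO_INCREMENT,
  user_ip varchar(15) NOT NULL,
  comment varchar(200) NOT NULL,
  proxy_ip varchar(45) NOT NULL
);
"""

short_columns = find_short_ip_columns(sample)
print(short_columns)
```

A name-based heuristic like this will produce false positives (a column named `description` also contains "ip"), so any hits would still need manual review.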

Brett Exnowski is senior developer at Primitive Spark and specializes in complex web applications.

LiveStories raises $10 million to help you access public health and census data


LiveStories, which provides software that simplifies access to civic data on poverty, health, economics, and more, today announced that it has raised $10 million in funding. Ignition Partners led the round, with participation from returning investors True Ventures and Founders Co-Op.

The Seattle-based startup sources data from federal, state, and local governments, including the Bureau of Labor Statistics, the U.S. Census, and the Centers for Disease Control and Prevention.

“The civic data workflow is fragmented across multiple tools and vendors,” wrote LiveStories founder and CEO Adnan Mahmud, in an email to VentureBeat. “For example, you might use Google to find the data, Excel to clean it up, Tableau to explore it, and Word to create a static report.”

According to Mahmud, LiveStories’ software allows customers to find and communicate civic data in a more interactive way — across charts, videos, and images. “Our platform automatically visualizes the data, down to city and county localities,” wrote Mahmud. The data can then be shared on social media networks like Facebook and Twitter.

LiveStories claims to have more than 120 customers, which include LA County, CDPH, San Diego County, UCLA, and the Gates Foundation.

Today’s funding will be used to further develop the product and increase sales and marketing. Founded in 2015, LiveStories has raised a total of $14 million and currently has 20 employees.


7 takeaways from Mary Meeker’s 2017 Internet Trends report

Mary Meeker’s Internet Trends report has become an annual ritual for Silicon Valley. It’s as if the tech industry had an annual physical exam and received a health report in the form of a 355-page presentation.

As in years past, Meeker’s 2017 report contained a few notable trends in its firehose of data points, which are interesting in how they show the tech industry evolving. Here are some of the key takeaways.

Growth in Internet population is slowing, but growth in online ads is accelerating.

The number of global users on the Internet reached 3.4 billion in 2016, equal to 46 percent of the world’s population. That’s more than double the figure in 2009, but the growth rate has flatlined around 10 percent a year for the past five years.

Meanwhile, growth in online advertising is accelerating, at least in the U.S. Digital advertising rose 22 percent to $73 billion last year, up from 20 percent in 2015 and 15 percent in 2014. Unsurprisingly, the growth is coming from mobile ads, which are growing fast enough to more than offset a decline in desktop ads.

Meeker said that the amount of money spent on digital ads will surpass spending on TV ads sometime in the next six months.

Ecommerce growth is also accelerating.

That online retail sales are growing year after year is a given. But the pace of growth has been accelerating for the past three years, rising steadily from 14 percent in 2013 to 15 percent last year.

Credit Amazon, of course, but Walmart is also seeing new online growth in the wake of its purchase of a deep-discount site. Meanwhile, physical retailers are expected to close nearly 1,700 shops in the U.S., the largest number in 20 years, the report says. Those closings have more to do with unwise overexpansion in recent years than with Amazon or ecommerce in general.

Gaming continues to lead and shape the online experience.

Another unsurprising insight concerns the growth and popularity in gaming, but it’s interesting to see the figures Meeker has collected to show that growth.

Meeker estimates that there are 2.6 billion gamers around the world, up from 100 million in 1995. The gaming industry generated $100 billion in global revenue last year, with nearly half of that, $47 billion, coming from Asia. Games are central to defining the overall online experience. In her presentation, Meeker speculated that they may be preparing society for the rise of human-computer interaction.

Revenue in the music industry is rising again.

The Internet has not been kind to the music recording industry. For the past 16 years, revenue has declined by an average of 4 percent a year. The rate of decline had slowed in the past several years as downloaded and streaming music began to offset the vanishing sales of CDs.

Last year, overall music revenue grew by 11 percent to more than $12 billion, its highest figure since 2009. Subscription and streaming revenue made up more than half of the total figure for the first time.

Digital health care is approaching an inflection point.

Health care is at once a data-driven industry and one that is perhaps the worst at managing data. Meeker says health care “is at a digital inflection point,” one of those terms that act as red meat for investors because it signals strong growth ahead.

Fitness trackers and health apps are collecting more user data than ever, while hospitals are sharing more health care information with patients. The average hospital holds 50 petabytes of health care data, and the total amount of that data is growing by 48 percent a year, Meeker says.

The bottleneck to analyzing that data is patient privacy. Health care data can be used to the benefit and the detriment of patients. A survey of consumers asking which tech companies they’d share their health data with shows 60 percent trust Google and 56 percent trust Microsoft. Less trusted are Amazon and Facebook — only 39 percent of consumers would share health data with them.

China is growing as a tech rival to the U.S.

The biggest market caps in tech belong to none other than the Big Four: Apple, Alphabet, Amazon, and Facebook. Together, they are worth a collective $2.4 trillion. But seven of the next 16 on the list are Chinese companies like Tencent and Alibaba. Those seven are worth $929 billion in aggregate.

U.S. companies may still dominate the money invested in tech, but their Chinese rivals are quickly catching up.

Immigrants are core to the Valley’s DNA.

The story of Silicon Valley is in good part the story of immigrants who have played a part in building and shaping its technology. Meeker looked at the 25 most highly valued tech companies and found that 15 of them had founders who were first- or second-generation Americans.

The shift in the Trump Administration’s “America first” stance on work visas may put that in jeopardy. To underscore the importance of foreign workers and founders in tech, Meeker showed that half of the most highly valued private tech companies were founded by first-generation immigrants. Those companies — including Uber, SpaceX, and Slack — have created 48,000 jobs.

The full report can be found here

Is Your Data Ready?

Through meeting with amazing companies doing amazing things, we see where the industry is headed and the level of innovation going on in our sector. Data meets analytics. Analytics meets decision-making. It’s now all embedded in software and magic happens.

I had the pleasure of sitting down recently with Harry Blount, CEO of DISCERN. They have built what Google does for images but for data. What I love about what they’ve done and are doing is they’ve created the holy grail for business decision-makers — the integration of internal data with free external data (think open web) combined with external licensed data. Further, they have options for users to license additional third-party data at point of use, creating doorways for vendors who want to control their data, but enabling access for users in keeping with their and third-party licensor needs. Wow.

At Outsell, we have been tracking and analyzing this industry for 20 years. In that time, I have seen enterprise customers want the holy grail. The refrain goes like this: “Please let me integrate my data with your data and with other people’s data, so we can make the decisions we must make with the right data, at the right place and at the right time.” Sound familiar?

This is the first platform I have seen that actually makes this work. It is agnostic and has a visual front end rather than boring row after row of text strings … so yesterday. This platform … so now. I can show you diagrams from reports Outsell wrote in the 90s, for heaven’s sake, which point to the need for this type of integration. It only took 20 years and near obliteration by Google, Facebook, Twitter, and the other big platforms for the technology to catch up and our industry’s innovators to rise to the occasion to meet this need. DISCERN has a really interesting solution and opportunity in front of them.

Now, what I love most about them is the learning they’ve applied to vendors to ensure their data is ready for integration. I’ve written about “your data being ugly but no one wanting to tell you.” We spoke about this at Outsell’s DataMoney. We hear every day that data isn’t fit for purpose, and we hear how “backwater” legacy information and data company architectures are. So, with the DISCERN platform comes Harry’s rules for Data Readiness. They are learned from experience, and they are relevant for anyone because he wrote them for one of the hardest customer segments to serve — financial services, which always leads the way in our industry. So, if your data is ready for a company on Wall Street, it’s ready for just about anyone.

Checklist for Marketing Data to Wall Street Quantamentalists:

1. Has the data been sold for financial securities decision-making before?

2. For which securities (bonds, stocks, commodities), geographies (US, World, Europe, etc.), and sectors (11 GICS or 134 sub-GICS) can the data be used to enhance investment or lending decisions?

3. How is the data delivered? CSV, FTP, API, and is there a front-end?

4. How much history is available?

5. How frequently is the data updated — constantly, daily, weekly, etc.?

6. Have there been any white papers (or other documentation) published establishing the predictive value of the data?

7. Has the data been restated?

8. What is your business process for informing customers about changes to your data dictionary? How much advance warning, etc.?

9. Do you have a data dictionary ready to send?

10. Do you have a relevant data sample ready to test? (Wall Street firms often want to test 30–180 days before deciding)

11. Why (and how) is your data different from other vendors’ data?

Follow Harry’s checklist. While you are at it, if you have an application that needs a great data front and the ability to integrate data, give Harry a call at (650) 336–0222 or email him at

Anthea C. Stratigos is Co-founder & CEO of Outsell, Inc., the leading research and advisory firm focused exclusively on data, media, information, and technology. Get professional and personal lessons from a career spent mentoring successful leaders. Tell your story or ask a question — confidentially. Ask Anthea!

Is Your Data Ready? was originally published in Outsell Inc. on Medium, where people are continuing the conversation by highlighting and responding to this story.

Here’s How Technology Will Shape Marketing Over The Next Decade

Ten years ago, social media was in its infancy. Nobody had even heard of mobile marketing, content marketing or big data. The iPhone hadn’t even been launched yet. If you took a reasonably competent marketer from 2007 and transported her to today, much of what she knew about her job would be irrelevant.

We’re at a similar point now. Many of the most powerful technologies that will shape marketing over the next ten years are just emerging and many marketers will be left behind. Clearly, anybody who thinks that they can get by doing more of what they’re doing today is kidding themselves.

Unfortunately, there’s no way to perfectly predict the future, but we can look at today’s technology and make some basic judgments. Big data and artificial intelligence will become much more powerful and interact more completely with the physical world. That, in turn, will transform how we identify and serve customers to something very different from today.

The Rise Of Voice And Visual Interfaces

At the most recent Consumer Electronics Show, Amazon’s Alexa wowed the crowds like no other product. The device, which like Apple’s Siri is wholly voice activated, goes a step further by adding skills, which work like apps on a smartphone. Google’s competing product, Google Home, has a similar function it calls “actions.”

For example, the History Channel developed a skill that allows users to see what happened “on this day in history.” Starbucks has one that lets customers reorder their favorite beverage. Some TV manufacturers are even building new sets with Alexa integration included, so that their customers can flip channels without having to look for the remote.

Scott Brinker, Co-Founder of Ion Interactive and creator of the Chief Marketing Technologist blog, told me, “I think what’s going on with voice is incredibly interesting. In the next 12–24 months a lot of marketing innovation will be in things like Amazon Alexa skills and Google Home actions. There’s already been something like 1000 brands already active in this area.”

We’ll soon see a similar trend with visual interfaces. It’ll start by using facial recognition instead of password log-ins and will move quickly to allow us to point and gesture to interact with augmented and virtual reality spaces. Computers, as we know them today, will disappear into the ether.

Analyzing Personality And Mood

We’ve all become accustomed to hearing the phrase, “Your call may be monitored for quality and training purposes” before we are connected to a customer service representative. But monitored by who? Nobody ever actually says that an actual person will be listening in.

Mattersight is a ten-year-old company that uses artificial intelligence algorithms, combined with a methodology developed by NASA to evaluate the compatibility of astronauts sent into space together, to profile customers. As it gathers data, the system is better able to pair those customers with the representatives whose compatible personalities will serve them best.

These technologies are accelerating at a blinding pace. While Mattersight’s technology is highly advanced, more basic personality and mood analysis tools are now available on platforms like Amazon Web Services, Watson Developer Cloud, and Microsoft Azure. IBM even expects that within five years we will be able to diagnose psychiatric disorders through voice interfaces.

David Gustafson, Chief Operating Officer at Mattersight, told me “What we’ve seen over the past ten years or so is that you advance the technology on two planes. The first is the accuracy of the algorithms themselves and that is a fairly natural process. The second, which is perhaps even more important, is finding the right application to apply those insights to.”

Creating Custom Experiences

When we go into a store, we take it for granted that a salesperson will approach us, ask a few questions and within seconds design a sales experience that caters to our needs. If it’s a store we frequent, we expect the salesperson to already know our preferences and further customize the experience for our needs on that particular day.

BloomReach is a platform that performs a similar function for e-commerce. If you are looking for a cocktail dress, it will immediately show you items based on your past behavior as well as recent shopping trends. The firm recently acquired content management company Hippo to expand the range of experiences it can deliver.

Increasingly, this type of personalization is moving into the physical world. Buy a ticket for the next sporting event and you may receive an RFID bracelet in the mail and be given a short questionnaire that asks you about things like your favorite team, your shirt size and other preferences.

When you arrive at the event, you will find that when you go to the bar, the area around your seat lights up with your team colors. Sponsor booths will give you a complimentary t-shirt in your size. If they see from your history that you already received a t-shirt from another sponsor, they might offer you a hat instead. Everywhere you go, you’ll feel like a VIP.

This may seem futuristic, but The Solomon Group creates these features for events like the NBA All-Star game on the Mendix platform, which is so simple even marketers with no coding ability can build on it. Also, because Mendix allows you to suck in resources from AWS, Azure and Watson, you can still access the most sophisticated technology on the planet.

The Future Of Technology Is Always More Human

When you watch old reruns on late night TV, you’ll immediately notice the difference in technology compared to modern shows in the form of production values. But soon it will become clear that there is also a stark contrast in emotional content. Because today’s programming caters to niche audiences, it is able to make a creative statement that connects more powerfully.

Gustafson of Mattersight has noticed a similar effect at call centers. “What we’ve seen is that people buy from people they like,” he says. “We track engagement during sales calls and what we’ve seen is that there is almost a perfect linear relationship between engagement and sales. When customers become more engaged with the experience, sales go through the roof.”

Samuel Moore, Head of Global Public Relations at BloomReach, told me, “Our platform is based on data and context. As the sources of data expand and improve to include personality and mood profiles, there is the potential for marketers to provide customers with the best possible experience, in real time and at scale.”

The truth is the future of technology is always more human. In the years to come, marketers will need to go beyond seeing consumers as bland combinations of demographics and psychographics and begin to know them on a more visceral, personal level. Brands that continue to try to work the averages will find it difficult to compete.

What should be clear by now is that we need to shift from crafting messages to creating experiences. This process will be machine mediated, but ultimately it will put people at the center. Algorithms can analyze and target, but only humans can truly inspire other humans.

– Greg

An earlier version of this article first appeared in

The Conversation: Research and Scholarly Publishing in the Age of Big Data #alpsp16

Ziyad Marar, Global Publishing Director at SAGE Publishing, chaired the opening plenary at the 2016 ALPSP Conference. He was joined by his colleague Ian Mulvany, SAGE’s Head of Product Innovation, and by Francine Bennett, CEO and co-founder of big data consultancy Mastodon C. They discussed how data is to the 21st century what oil was to the 20th century, and how this has major implications for researchers and publishers alike. Information of all kinds is now being produced, collected, and analyzed at unprecedented speed, breadth, depth and scale. The big data revolution promises to ask and answer fundamental questions about individuals and collectives, but large datasets alone will not solve major social or scientific problems.
