Monday, 3 June 2019

Deep learning and shallow understanding


Martyn Richard Jones

Bruxelles 21st May 2019

 

Hi, I'm Martyn Jones over at GoodStrat.Com (the Good Strategy Company), and today I want to talk about something I like to call deep learning and shallow understanding.

To begin at the beginning

In spite of, or perhaps because of, the many years I worked in artificial intelligence, I believe the current long-distance love affair with AI, and with what is euphemistically termed 'deep learning', to be somewhat irrational in its exuberance.
In the eighties I was experimenting with automatic feature extraction, pattern recognition and parallel distributed processing (see Rumelhart and McClelland), which led me to adaptive neural networks.
My research and development in this field showed me that automatically mining nuggets of gold, or automatically gaining real insight from data, is not easy, far from stable and frequently not achievable. The experiments were interesting, but frequently impractical in a business setting.
        
Back in the day I was interested in examples of possibilities in this 'technology space' and in trying to design, train and productise neural networks. At Unisys we started out trying to do things such as predicting whether a person would be a member of the Sharks or the Jets (New York gangs, if I remember rightly) and applying that approach to divining the credit risks of individuals and businesses; elsewhere we were creating simple applications for handwritten number recognition and (a bit more sophisticated) basic document recognition.
To that end, I addressed the European IEEE conference on Neural Networks in Nice regarding our work in the USA and at the European Centre for AI and Advanced Information Technologies in Madrid. At that centre I was working at the intersection of AI and advanced database technologies, with peers from across the industry worldwide. At the same time, my company, Sperry Univac, was the go-to IT company for engineers, designers and computer scientists.
Sperry and Unisys gave me license to research, design and build proof-of-concept prototypes in the areas of complex very-large database management, a tightly integrated 4GL and expert system development platform (extending a 4th generation language product called Mapper), heterogeneous cross-platform database query management (a project for the EU), and AI and data mining, the deep learning of today.
But, central to this blog piece was my work in the area of AI and data mining. The goal I set myself was to find ways of extracting business-oriented rules from the data-mining process and to be able to apply and explain rule-based lines of reasoning in a way that a subject matter expert could understand and find credible.
Keep in mind that the ability of AI technology to produce explanations of its lines of reasoning was an absolute necessity back in the day. That was reasonable but also extraordinary, because not so long ago we didn't even expect any sort of rational line-of-reasoning explanation from many 'experts'.

So, what?

If we remove statistics from the scenario, the automatic recognition of patterns in databases and naïve learning, whilst having interesting pasts, a fuzzy and nuanced present and an uncertain future, are terrains plagued with shallow and imprecise understanding, exaggeration and make-believe, which makes any real and meaningful understanding quite problematic.
The fact of the matter is that there is plenty of visionary noise surrounding deep learning and AI, but very little in terms of concrete strategies to address truly significant challenges using this technology. There is also a lot of speculative fantasy about the promise of AI, but very little in terms of coherent and tangible answers to the absolutely fundamental question of "to what ends?" On top of that, there hasn't been an entirely satisfactory explanation of why we would place much trust in technology that can't even explain itself. Is "the computer says no" really good enough?
Now I want to briefly look at what I call the lessons of the past, the realities of the present and the promises of the future.

Some lessons of the past

Over almost six decades, AI has seen a few peaks of hype followed by prolonged troughs of incredulity, disappointment and recrimination. History has shown us that much of the promise of AI comes unstuck when immature, unsound or unproven technology is taken to market too soon and without a reasonable understanding of its applicability; when companies embrace and spend significant time and effort trying to exploit things they don’t comprehend in order to achieve ends they can’t define in tangible, reasonable and realistic terms.
So, interesting experiments in AI are brought out of academia far too soon, hyped to the heavens and eagerly acquired (if not ultimately used) by commercial IT. The fact is that IT companies (and more recently IT service companies) have unwittingly killed AI time and time again, through their own irrational exuberance; bringing into question more than just the technology.
Then we have the issue of hyperbole and AI. There is no end to the charlatans waiting in the wings to big-up the latest tech trend. But when things go pear-shaped, it's the business culture that will seek to destroy something that promised to deliver so much yet failed to deliver anything other than liabilities, costs and wasted opportunities.
In the eighties one of my major projects in AI was to design easy-to-use Expert System development and delivery capabilities and to integrate them tightly with my company's flagship 4GL product (Mapper, now known as Unisys BIS). It was a rules-based system, and it worked well. But one of the major limitations of all Expert System shells at that time was the difficulty of managing and maintaining large rule sets. It was the first time I realised that, without adequate tools to manage complexity, business and technology risk exposure would become quite an issue. As a result, I tried to identify ways to simplify the tools without losing intrinsic value, and to provide the tools required to help reduce the perception of complexity. It was a partial success.
Later I had a project to address a classification challenge, and for this I decided to look at data mining again. Using small training datasets we initially had some interesting results, but we noticed that when we increased the size of the datasets the learning became skewed, and not in a good way. We pondered the problem and decided that we weren't getting the right answers because we weren't using enough data; our inability to feed the network appropriately was producing the wrong outcomes. That's when we hit upon a brick wall of an idea: we threw tons of data at the neural network, so much so that it became incapable of discriminating or discerning anything at all. We had created a narrow and shallow idiot-savant, and trained it so well that it eventually knew nothing.
I mentioned some of these anecdotes to my colleagues at IBM’s global data mining centre in Dublin. To my surprise, they had stumbled upon exactly the same issues and had tried to resolve them in exactly the same way, to no avail.
Another project I was involved in was at a major investment bank. The purpose was to build an artificial-trader. To achieve this, it was decided to capture the expertise of expert-traders as rules, and to combine the evaluation and execution of those rules with networks trained by data mining price curves and historic trading data. When the application was eventually trialled, three observations were made: the AI trader performed marginally better than the real expert-trader; the real trader came a close second; and when the two worked together, the outcomes were far worse than when the human trader worked alone. There was also a belief that the trader performed less well because of the active benchmarking against a machine.

Some realities from the present

The more things change the more they stay the same as each new generation of techno-babbler tries to reinvent the wheel, whether it is needed or not, whilst astutely ignoring the sage advice of “if it ain’t broke, don’t fix it”.
What we see again is an ingrained inability to learn from the past or even the present, and to judiciously apply those lessons in evaluating the present. This seems to be a constant in the evolution of humankind, made more obvious by the use and abuse of technology, hyperbole and ignorance.
What are the realities of the present? Here are a few:
People are either afraid of complexity and prefer not to see it, or they embrace it without really understanding the implications. The complexity of significant challenges is being ignored, and so are the complexity and risks of technological and process options.
People don’t know how to disambiguate or deconstruct complexity. In short, people don’t know how to consume the elephant, and people who do know aren’t that much appreciated either.
No alignment of imagination and the practical. People don't know how to successfully simplify the complex, and those who do know are treated like gods, fools or pariahs.
People are willingly and uncritically embracing techno-fad dogma. The visibility and audibility of tech-fad slogans have gone beyond the sloganizing used in tractor factories in the communist era.

That best engineering principles are just so much academic theory.
That more data means better data. More data doesn’t mean better data. It’s a dopey generalisation without any reasonable theoretical or practical underpinning.
That data is the new oil. Try telling that to your car engine. No, data is not like oil. It's quite a dopey analogy.
That reasonable explanation is not a thing. If a machine is being used to produce recommendations for anything that requires a duty of care, then it must also be able to produce a reasonable, accurate and verifiable line of reasoning at the same time. Legal and compliance issues are also important considerations.
A data scientist can do the job of a qualified statistician. “Without a grounding in statistics, a Data Scientist is a Data Lab Assistant.”
It isn't important to know what it is, just use it. Just because you can pick up a tool doesn't mean you know how to use it. Just because it's called a tool doesn't necessarily mean that it is one. If all you want is one hammer, being given a thousand hammers at the same time doesn't help much. And if what you want to do is boil an egg, then a hammer isn't really the thing.
In short, the present reality is the presence of a surfeit of actors in the big data, deep learning, data science and AI technology solutions spaces, who are running around aimlessly, throwing faeces and feasting upon the hype and hyperbole of it all, like acid-dropping chimps at a chimpanzee’s tea-party, run amok.

Some promises for the future

What does the future offer? Probably more of the same, but more so.
However, if I take an optimistic view, here are some areas where I think we might make some advances:
The development of tools for managing and deploying rule-based expertise into apps.
The development of tools for reducing the complexity of managing and maintaining rule-bases that can be used, for example, in the automatic generation of rule-centric APIs.
Using data governors with elements of AI to actually reduce the volume of streaming data-as-noise.
The evolution of tools for treating data as an asset, whether as a value-adding asset, an asset of no apparent value or a liability.
Using explainable AI to challenge and to potentially negate the predictions of data mining apps. Thereby limiting the damage that such an app could inflict.
Socialising Expert System development and delivery. Would Microsoft like to embed an expert system shell in Excel, for example? I know someone who has done that.
We must insist that AI be either explainable or fully controllable.
Just as we have for data, we must also have AI governance and a General AI Protection Regulation.
Finally, reducing the buzzword bingo bullshit and term abuse in IT, AI and data. That people take the time to learn what things actually mean and where things are really applicable or not. That people understand why using terms that they don’t understand is a very bad idea.

Summary

Now, wouldn’t that be nice? Of course, there is a lot more than that.  But, blogs being blogs…
With such a subject it is easy to become conflicted.
At the back of my mind, I have the idea that deep learning is just God’s way of telling companies that they have too much money and not enough sense.  I also believe that a lot more work needs to be done in investigating feature extraction from data before we can even begin to consider it as a maturing technology.
That said, the world of data and the use of that data can be an exciting place. Maybe one day not too far away someone will come up with a useful AI or data technology that doesn't actually require hyping.
Many thanks for reading.
Have a beautiful June 2019.
Martyn Jones
One of the sales execs looked at a TI Explorer workstation I was using and asked… can it do accounts? Then followed it up with “it’s no better for business than a bloody expensive anchor”.

Thursday, 17 March 2016

Professional networking? Yo! BlankedOut Sucked

Martyn Richard Jones

Hello, readers.

Before my Aunt Dolly went to a better life she received a handwritten letter from her dear friend and long-time admirer Sir Arthur Streeb-Greebling, which was to be passed on to the CEO of, what he called, an interweb professional dating site. Now, she didn't actually give me a precise name, so I now find myself at a loss. So, if there's anyone out there that recognises who this might have been written for, then please let me know.

What follows is Sir Arthur's text, as relayed to my Aunt Dolly. 

Dear Mister Def Archibald Quengler,

We've never met, and we probably never will, and I don't much like the cut of your jib, but, I would like to take this opportunity to draw your attention to the demise of a once burgeoning professional dating site where decent chaps and chapesses and an assortment of pathetic likeminded individuals could share likeminded individual things. Such as pictures of cats, sexist crap, professional resumes, tips and tricks, insightful comments, 'me too' inanities, hype, boloney, mendacity, political detritus and even worse religious detritus.

Personally, I blame lack of national service and the parents… oh, and the teachers.

Anyway, I am writing to you in the tepid hope that your amazing and absolutely fabulous online concern does not fall victim to the same malaise.

You may have heard of the once significant, successful and utterly sensational BlankedOut web setup; it was a so-called professional link-up site for pros, or some such dreck.

I am reliably informed that people loved BlankedOut, just a bit, and that they also hated it even more. Many people said to me "BlankedOut is the Facebook of the crass and dim-witted wannabe class, and apart from some minority exceptions, it is a gushing channel of crap and a conduit of intense mediocrity." But, not being aware of the game, I was in no position to make a judgement. I will leave that for others.

BlankedOut has been variously described as "a place where capital interests took us for a ride", and where members were generally treated as hapless schmucks, captive clowns and useful idiots, and that according to the observers on the hustings, they did it in such a way that people lapped up the ride.

My distinguished colleague Mister Bernice Hill, PhN observed that, as a role model for a Big Data and Big Data analytics company, that BlankedOut "sucked, big time".

He went on to state "It sucks from top to bottom, from left to right, and around the whole global enchilada." Bernice was tough, a hard-hearted man, and he had a way with words.

The former judge Sir James Beauchamp also didn't hold back when he stated "From its obtuse, obnoxious and incessant promotions of sponsored rancid 'content' to its insipid, trite and fatuous love-affair with its god-damn-awful Effluence®s and fawning sycophants, BlankedOut stood as a shining internet beacon of manipulation, exploitation and hypocrisy." I seem to recall a certain Barnie Puddle as being one of the most mendacious and manipulative of the Effluence®s. But, whatever, as the young people are wont to say these days.

So, you see. I knew bugger all about the matter of these sorts of high-class professional career-oriented pimping sites, past or present. But now, and you may call me an incurable romantic, when I look upon the history of the deceased BlankedOut community, as dead as a Norwegian blue, what I see is something that leads my thoughts to visions of a massive work of misuse, abuse and deception. 

Which is not a good omen.

Of course, alternatives to BlankedOut existed, but they were ascetically professional and did not venture much into the wild-side of vulgarity, populism and cant. They stuck to their core competences, like troopers, and trusted their clientele to be just as serious, decent and professional as they were. More fool them, what?

But not so, at poor, dead and despised BlankedOut, lying in a state of disgrace, like some sort of dead pisshead society on a pyre of burning nothing.

So, Mister Def Whiner, heed my words: don't let your business turn into yet another BlankedOut, an object lesson, if ever there was one, in snatching failure from the jaws of success.

Carpe diem, man, carpe diem!

So, I just have this to tell you, and I will say it only once. Good will to all women and men and all of that. If you are still an admirer of what was that dreadful BlankedOut business model, then bugger off and take your bloody dogs with you!

Yours sincerely,

Sir Arthur Streeb-Greebling
Admiral of the Grand Fleet, retired

Well, nothing much to add from me. Sir Arthur seems to have said it all. Although, I would still like to know who this letter is supposed to have been written to, because try as I might, I can't track down anyone who goes by the name of Mister Def Archibald Quengler. That stated, the next time I am in Palma de Mallorca I will ask my Aunt Dolly, now that she is in a far better place and has more free-time on her hands.

Next week I will be looking at financial scams that concern greenhorns, their parentages and protectors, culture establishments, incongruous financial arrangements, the government and more importantly, the police and the judicial system.

Stay tuned.

Many thanks for reading.


Sunday, 6 March 2016

Free Business Analytics Content – Thanks to Wikipedia – Part 1

Why buy when you can get it for free?
Here is the first fantastic delivery of an amazing and fabulous selection of free and widely available business analytics learning content, which has been prepared… just for you.
  1. A/B testing is a way to compare two versions of a single variable, typically by testing a subject's response to variable A against variable B, and determining which of the two variables is more effective. (A minimal worked sketch follows this list.) https://en.wikipedia.org/wiki/A/B_testing
  2. Choice modelling attempts to model the decision process of an individual or segment via Revealed preferences or stated preferences made in a particular context or contexts. Typically, it attempts to use discrete choices (A over B; B over A, B & C) in order to infer positions of the items (A, B and C) on some relevant latent scale (typically “utility” in economics and various related fields). https://en.wikipedia.org/wiki/Choice_modelling
  3. Adaptive control is the control method used by a controller which must adapt to a controlled system with parameters which vary, or are initially uncertain. For example, as an aircraft flies, its mass will slowly decrease as a result of fuel consumption; a control law is needed that adapts itself to such changing conditions. https://en.wikipedia.org/wiki/Adaptive_control
  4. Multivariate Testing. In marketing, multivariate testing or multi-variable testing techniques apply statistical hypothesis testing on multi-variable systems, typically consumers on websites. Techniques of multivariate statistics are used. https://en.wikipedia.org/wiki/Multivariate_testing_in_marketing
  5. In probability theory, the multi-armed bandit problem (sometimes called the K- or N-armed bandit problem) is a problem in which a gambler at a row of slot machines (sometimes known as "one-armed bandits") has to decide which machines to play, how many times to play each machine and in which order to play them. https://en.wikipedia.org/wiki/Multi-armed_bandit
  6. A t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution if the null hypothesis is supported. https://en.wikipedia.org/wiki/Student%27s_t-test
  7. Visual analytics is an outgrowth of the fields of information visualization and scientific visualization that focuses on analytical reasoning facilitated by interactive visual interfaces. https://en.wikipedia.org/wiki/Visual_analytics
  8. In statistics, dependence is any statistical relationship between two random variables or two sets of data. Correlation refers to any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to the extent to which two variables have a linear relationship with each other. Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price. https://en.wikipedia.org/wiki/Correlation_and_dependence
  9. Scenario analysis is a process of analyzing possible future events by considering alternative possible outcomes (sometimes called "alternative worlds"). Thus, scenario analysis, which is a main method of projections, does not try to show one exact picture of the future. Instead, it consciously presents several alternative future developments. https://en.wikipedia.org/wiki/Scenario_analysis
  10. Forecasting is the process of making predictions of the future based on past and present data and analysis of trends. https://en.wikipedia.org/wiki/Forecasting
  11. Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values. https://en.wikipedia.org/wiki/Time_series
  12. Data mining is an interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets ("big data") involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. https://en.wikipedia.org/wiki/Data_mining
  13. In statistical modeling, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). https://en.wikipedia.org/wiki/Regression_analysis
  14. Text mining, also referred to as text data mining and roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. https://en.wikipedia.org/wiki/Text_mining
  15. Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. Sentiment analysis is widely applied to reviews and social media for a variety of applications, ranging from marketing to customer service. https://en.wikipedia.org/wiki/Sentiment_analysis
  16. Image analysis is the extraction of meaningful information from images, mainly from digital images by means of digital image processing. Image analysis tasks can be as simple as reading bar-coded tags or as sophisticated as identifying a person from their face. https://en.wikipedia.org/wiki/Image_analysis
  17. Video content analysis (also video content analytics, VCA) is the capability of automatically analyzing video to detect and determine temporal and spatial events. https://en.wikipedia.org/wiki/Video_content_analysis
  18. Speech analytics is the process of analyzing recorded calls to gather information; it brings structure to customer interactions and exposes information buried in customer contact center interactions with an enterprise. https://en.wikipedia.org/wiki/Speech_analytics
  19. Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to use other mathematical methods. Monte Carlo methods are mainly used in three distinct problem classes: optimization, numerical integration, and generating draws from a probability distribution. (A toy example follows this list.) https://en.wikipedia.org/wiki/Monte_Carlo_method
  20. Linear programming (LP; also called linear optimization) is a method to achieve the best outcome (such as maximum profit or lowest cost) in a mathematical model whose requirements are represented by linear relationships. Linear programming is a special case of mathematical programming (mathematical optimization). https://en.wikipedia.org/wiki/Linear_programming
  21. Cohort analysis is a subset of behavioral analytics that takes the data from a given eCommerce platform, web application, or online game and rather than looking at all users as one unit, it breaks them into related groups for analysis. These related groups, or cohorts, usually share common characteristics or experiences within a defined time-span. https://en.wikipedia.org/wiki/Cohort_analysis
  22. Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. For example, it is possible that variations in, say, six observed variables mainly reflect the variations in two unobserved (underlying) variables. https://en.wikipedia.org/wiki/Factor_analysis
  23. Adaptive (or Artificial) Neural Networks. Like other machine learning methods – systems that learn from data – neural networks have been used to solve a wide variety of tasks that are hard to solve using ordinary rule-based programming, including computer vision and speech recognition. https://en.wikipedia.org/wiki/Artificial_neural_network
  24. Meta Analysis. The basic tenet of a meta-analysis is that there is a common truth behind all conceptually similar scientific studies, but which has been measured with a certain error within individual studies. The aim in meta-analysis then is to use approaches from statistics to derive a pooled estimate closest to the unknown common truth based on how this error is perceived. In essence, all existing methods yield a weighted average from the results of the individual studies and what differs is the manner in which these weights are allocated and also the manner in which the uncertainty is computed around the point estimate thus generated. https://en.wikipedia.org/wiki/Meta-analysis
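To make items 1 and 6 a little more concrete, here is a minimal sketch of an A/B comparison evaluated with a two-sample t-test. Everything in it is an assumption invented for illustration (the metric, the group sizes and the numbers); it simply shows the mechanics using NumPy and SciPy.

```python
# A minimal, illustrative A/B comparison evaluated with a two-sample t-test.
# The metric, group sizes and numbers below are invented purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical per-user metric (e.g. seconds spent on page) for each variant.
variant_a = rng.normal(loc=30.0, scale=8.0, size=500)   # control
variant_b = rng.normal(loc=31.5, scale=8.0, size=500)   # treatment

# Welch's t-test: does not assume the two groups have equal variances.
t_stat, p_value = stats.ttest_ind(variant_a, variant_b, equal_var=False)

print(f"mean A = {variant_a.mean():.2f}, mean B = {variant_b.mean():.2f}")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# A small p-value (say, below 0.05) suggests the observed difference is
# unlikely to be due to chance alone; it does not prove causation, and it says
# nothing about whether the difference is big enough to matter commercially.
```

None of this replaces proper experimental design (randomisation, sample sizing, stopping rules); it merely shows how little code the statistical test itself takes.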
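Item 19 (Monte Carlo methods) also lends itself to a tiny illustration: the classic Monte Carlo estimate of π by repeated random sampling. It is a toy example only, but it captures the "numerical integration by random draws" idea in a few lines.

```python
# A toy Monte Carlo estimate of pi by repeated random sampling.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Draw points uniformly in the unit square and count how many land inside
# the quarter circle of radius 1; that proportion approximates pi / 4.
x = rng.random(n)
y = rng.random(n)
inside = (x**2 + y**2) <= 1.0

pi_estimate = 4.0 * inside.mean()
print(f"pi is approximately {pi_estimate:.5f} after {n:,} samples")
```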
I hope you find the content useful. Of course, all thanks should really go to Wikipedia and its unpaid expert contributors.
I will try to get the next part of 'Free Business Analytics Content' onto LinkedIn Pulse over the next weekend.
Many thanks for reading.
Just a few points before closing.
Firstly, please consider joining The Big Data Contrarians, here on LinkedIn: https://www.linkedin.com/groups/8338976
Secondly, keep in touch. My strategy blog is at http://www.goodstrat.com and I can be followed on Twitter at @GoodStratTweet. Please also connect on LinkedIn if you wish. If you have any follow-up questions, leave a comment or send me an email at martyn.jones@cambriano.es.
Thirdly, you may also be interested in some other articles I have written, for example on the subject of Data Warehousing.

Tuesday, 1 March 2016

A data superhero is something to be





A data warehousing superhero is something to be


Not all that glitters is Big Data, and Big Data has a long way to go before it can deliver anything like the same satisfying results, tangible benefits and organisational agility that a properly implemented Inmon Enterprise Data Warehouse can provide.

Therefore, I have a question for you.

Do you want to win friends and influence people in the world of data architecture and management? Do you want to do something in IT that, atypically, will bring kudos and credibility? Do you want to enjoy what you are doing because you are actually doing the right thing right for an appreciative audience?

Okay, the recipe that I will now reveal has the power to turn you into not only a data hero, but a 4th generation enterprise data warehousing superhero, with Big Data bells and whistles attached. Even more amazingly, it is offered for nothing, gratis, and for keeps.

Yes, you read it right. I am feeling generous, and although a rare animal, there is such a thing as a free lunch. In this instance, the free lunch takes the form of a cookbook for successful data sourcing, warehousing and provisioning, one that will turn you into a truly modern day digital superhero.

Follow the suggestions to the letter and it will be hard to fail. However, drop any magic ingredient from the mix and expect, eventually, to run out of luck – that is rhyming slang for Donald Duck, down my way. Almost as important, please apply your own criteria of good sense at every step of the way.

The craft of data  


The craft of data includes temporary-permanence in exploitation, revolution and institution.

When Sun Tzu was talking about the Art of War, he was also talking about the craft of data.

In the 21st century the highest expression of the craft of data in an organisation, whether public, private or military, is the enterprise data warehouse.

These are some of the key rules and guidelines for ensuring that you prevail and not your adversaries. The items are necessarily terse, but should provide a sound basis for further research, thought and strategic practice.

So without further ado, let us get to the crux of the matter.

1.       This is the first piece of advice, and it's a little bit of a 'downer', but you may just thank me for it later. The business sponsor of any significant Data Warehouse initiative or iteration cannot be the CIO, CTO or any member of the IT organisation. When this unfortunately happens, and it happens far too often, you should know that this particular data warehouse project is dead before it even gets off the ground - guaranteed. If you can afford to walk away from such a project, then do so. Now for the more positive aspects.

2.       All data in the data warehouse must be subject-oriented.

3.       We must integrate all data before it enters into the data warehouse.

4.       All data in the data warehouse must be time-variant or specifically indeterminate.

5.       Data in the data warehouse must be non-volatile – within periods of explicit and implicit snapshot coverage.

6.       Data in the data warehouse is primarily used to feed into management decision making (by order of importance: strategic, tactical and then operational).

7.       We build the data warehouse iteratively and over time. We never build the data warehouse using a 'big bang' approach.

8.       We base each build iteration of the data warehouse on a specific set of well-bound departmental-oriented requirements, deliverable in a short and specific timeframe. We never try to build the data warehouse using a 'boil the ocean' approach.

9.       We never run more concurrent iterative developments in a data warehouse programme than we would in any other agile environment. This means that for a mature data warehousing setup, we run a maximum of five concurrent developments. The more immature the organisation, the lower the number of concurrent iterations.

10.   We use a contemporary two-tier approach to the data-warehousing super-component. A well architected, designed and engineered third-normal form database that supports true historicity and time-variance-modelling forms the basis of the decision support database of record.

11.   We build departmental and process-centric data marts on top of the data warehouse layer, as the end-user-centric semantic-layer of the data warehouse.

12.   We use 3NF to model the data warehouse data-model. We typically use dimensional modelling to model the data mart models, although other modelling options are also valid. Target use cases will inform the decisions we make regarding the choice of data mart model.

13.   Never trust anyone who claims that we can service the strategic data needs of a complex and volatile enterprise by implementing a faux data warehouse built using a collection of conformed dimensions and facts. This approach may initially appear to work, however, this is a massive strategic, tactical and operational mistake, which will eventually involve costly reengineering, loss of valuable data, organisational disruption and dissatisfied clients.

14.   We store transaction data in the data warehouse at the lowest possible level of granularity. We store transaction and fact data in the data marts at the aggregation levels appropriate to the target audience.

15.   Based on use cases and performance needs, we aggregate data in the data marts accordingly. If, in the future, lower-level data granularity is required in a data mart, then we can easily provide it by reconstructing the data mart from the atomic-level data stored in the data warehouse (a minimal sketch follows this list).

16.   We should never second-guess business requirements. No business imperatives means no requirement. You're aiming to be a successful data superhero, keep that goal in mind. Don't be beguiled into doing the wrong things even when accosted by 'right-sounding reasons'.

17.   Data warehousing is about the permanent incremental development and redefinition of minimum viable products and a minimum viable service. Iteratively grow the data warehouse and ignore those who claim that Inmon is about 'big bang', 'bottom up' and 'boil the ocean'.

18.   Avoid pork barrel political games in data warehouse programmes. You should not use a data warehouse programme as a means to leverage a raft of other related data, operational and DevOps projects in the organisation. For example, Corporate Data Governance, Data Quality and Disaster Recovery/Business Continuity should not be packed into data warehousing programmes, at any level. Again, this is a massive strategic, tactical and operational mistake.

19.   We ensure, as a minimum, that data in the data warehouse is as reliable as the data at source. Simply stated, we do not allow unnecessary entropy to affect the data in its journey from source systems to the target data warehouse or data marts.

20.   No data is 'corrected' or 'cleaned' in the data warehouse without the explicit, verifiable and express consent of the fiduciary duty holder with respect to that data. If the data warehouse is to act as a system of record then it must also hold metadata relative to any 'cleaning' that has been applied to that data, and should also hold 'before' and 'after' states of corrected data – for auditing purposes.

21.   We secure all data in the data warehouse in accordance with prevailing legislation and corporate rules and guidelines. In any conflict between corporate rule and legal jurisdiction, the current laws prevail.

22.   Ensure that competent and independent design authorities, with the support of the Data Warehouse architect, are ultimately responsible for all data-warehouse architectural, process and design decisions.

23.   Architectural and process choices govern the selection of methodology, product and partner. Always remember mens sana in corpore sano. Prejudice, speculation and opinion generally lead to very bad data-warehouse acquisition decisions, and can potentially lead to strategic, tactical and operational mistakes.

24.   Data warehousing iterations have clear top-level phases: start-up; DW management phase; analysis phase; design phase; build phase; testing phase; and, implementation phase. We complement these phases with data warehousing tracks: project management track; user track and requirements; data track; technical track; and, metadata track. This approach is used by a number of data warehousing methodologies, including the Cambriano methodology for data warehousing, information management and data integration.

25.   To conclude, I would like to reiterate some of the reasons why we should follow an Inmon-based approach to the building of a Data Warehouse. The Inmon approach is very much based on:

                    i.            Iteratively solving specific business challenges, iteration by iteration. This is not just a flippant excuse for spending other peoples' money. The Inmon DW is not about 'boiling the ocean', 'bottom up' or 'big bang'. Neither is it an insistence that one can build a whale by carefully configuring a collection of minnows. There's a 'little bit more' to it than that.

                   ii.            Delivering perceived and visible value within a reasonable timeframe.

                 iii.            Achieving high returns on investment.

                 iv.            Meeting or exceeding expectations.

                  v.            Meeting user requirements, first time and every time.

                 vi.            Delivering a quality data-warehouse solution on schedule, within budget, whilst effectively utilizing the resources available.

               vii.            The rational and economic need to minimize the impact that any strategic data initiative will have on operational systems and the organisation.

              viii.            The goal of maximizing information availability and analytical capabilities throughout the organisation and even to stakeholders and clients, if we so wish.

                 ix.            Designing towards maximum flexibility to ensure that we can accommodate much of the future decision support needs immediately and that we swiftly and coherently address new requirements.
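As a small illustration of points 14 and 15 (referenced from point 15 above), the sketch below keeps hypothetical transactions at atomic grain, derives a monthly, customer-level data mart from them, and then rebuilds a finer-grained mart from the same atomic data. The table and column names are made up for the example; in a real programme this shaping would live in the database and ETL layers rather than in pandas.

```python
# Illustrative only: atomic-grain "warehouse" data aggregated into data marts.
# Table and column names are hypothetical.
import pandas as pd

# Atomic-grain transactions, as they might be held in the warehouse layer.
transactions = pd.DataFrame({
    "customer_id": [101, 101, 102, 102, 103],
    "txn_date": pd.to_datetime(
        ["2016-01-05", "2016-01-20", "2016-01-07", "2016-02-02", "2016-02-15"]),
    "amount": [120.00, 35.50, 80.00, 42.75, 210.00],
})

# A departmental mart: monthly spend per customer, aggregated from atomic data.
monthly_mart = (
    transactions
    .assign(month=transactions["txn_date"].dt.to_period("M"))
    .groupby(["customer_id", "month"], as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "monthly_spend"})
)
print(monthly_mart)

# If a finer (daily) grain is needed later, rebuild the mart from the same
# atomic data; nothing is lost, because the warehouse keeps the lowest level
# of detail.
daily_mart = (
    transactions
    .groupby(["customer_id", "txn_date"], as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "daily_spend"})
)
print(daily_mart)
```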

Now what?


Now that I've given out a wealth of valuable information and indications, you may be asking 'and now what?'

This is the next step, dear budding data superhero:

1.       Take each of the items mentioned above and study them to the best of your ability. Do lots of research, and start to fit together the pieces of the jigsaw.

2.       Invent scenarios, or better still, ask other people for scenarios and hypothetical challenges, and then work through how you would go about responding to those scenarios and challenges.

3.       If you have any questions that you cannot research and answer yourself, then I will be glad to help. That is, if the request is regarding a particular aspect of data warehousing or management. Please email me your questions at martyn.jones@cambriano.es. Please send one email per question (e.g. if you have three questions, send three emails), so that I can prioritise the questions and manage the time I can set aside to respond to them.

The subtle evolution of Inmon's definitive Data Warehousing


What I have described are elements and requisites of a solid, coherent and cohesive approach to fourth generation Enterprise Data Warehousing, a proven approach to the provision of quality data for management decision support. The approach is the evolution of the classic Inmon approach, which has evolved over the intervening decades, thanks to Bill Inmon himself, and those who adopted and developed his approach to cohesive, coherent and comprehensive data warehousing.

Many thanks for reading


So, that's it. Many thanks for reading this piece and I sincerely hope you found it of interest.

Do keep in touch. You can connect with me via LinkedIn and you can also keep up to date with my activities on Twitter (User handle @GoodStratTweet) and on my personal blog http://www.goodstrat.com (GoodStrat.com)

I am the manager of The Big Data Contrarians group on LinkedIn. Consider joining that group, if only for the critical thinking that it could potentially provoke.

You may also be interested in some other articles I have written on the subject of Data Warehousing.








Martyn Richard Jones

Palma de Mallorca

23rd September 2015