Five Tips To Convert Big Data into a Big Success

Can data be considered the new gold? Considering the pace at which data is growing all across the globe, there is little question. Consider the following:

  • Netflix saves $1 billion per year on customer retention by utilizing big data alone.
  • Holding the largest share of the search engine market, Google handles 1.2 trillion searches every year, with more than 40,000 search queries every second!
  • Additionally, 15% of all Google searches are new and have never been typed before, meaning Google continuously generates new data. The main agenda is to convert data into information and then convert that information into insights.

Organizations were storing tons of data in their databases without knowing what to do with it until big data analytics became a fully developed discipline. Poor data quality can cost businesses from $9.7 million to $14.2 million every year. Moreover, poor data quality can lead to wrong business strategies or poor decision-making. It also results in low productivity and damages the relationship between customers and the organization, causing the organization to lose its reputation in the market.

To avoid this problem, here is a list of five things an enterprise must adopt in order to turn its big data into a big success:

Strong Leadership Driving Big Data Initiatives  

The most important factor for nurturing a data-driven decision-making culture is proper leadership. Organizations must have well-defined leadership roles for big data analytics to boost the successful implementation of big data initiatives. Proper stewardship is crucial for making big data analytics an integral part of regular business operations.

Leadership-driven big data initiatives help organizations make their big data commercially viable. Unfortunately, only 34% of organizations have appointed a chief data officer to handle the implementation of big data initiatives. A pioneer in the use of big data in the United States' banking industry, Bank of America appointed a Chief Data Officer (CDO) who is responsible for all of the bank's data management standards and policies, for simplifying the IT tools and infrastructure required for implementation, and for setting up the bank's big data platform.

Invest in Appropriate Skills Before Technology

Having the right skills is crucial even before the technology is implemented. Three in particular matter:

  • utilizing disparate open-source software for the integration and analysis of both structured and unstructured data,
  • framing and asking appropriate business questions with a crystal-clear line of sight into how the insights will be utilized, and
  • bringing the appropriate statistical tools to bear on data for performing predictive analytics and generating forward-looking insights. 

All of the above-mentioned skills can be proactively developed through both hiring and training. It is essential to search for those senior leaders within the organization who not only believe in the power of big data but are also willing to take risks and perform experimentation. Such leaders play a vital role in driving swift acquisitions and the success of data applications.

Perform Experimentation With Big Data Pilots

Start by identifying the most critical problems of the business and how big data can solve them. After identifying the problem, bring the relevant aspects of big data into a laboratory where pilots can be run before making any major investment in the technology. Such pilot programs provide an extensive collection of big data tools and expertise that prove value for the organization without hefty investments in IT or talent. By working with such pilots, these efforts can be implemented at a grassroots level with minimal investment in the technology.

Search for a Needle in an Unstructured Haystack

The thing that always remains top of mind for businesses is unstructured and semi-structured data – information contained in documents, spreadsheets, and similar non-traditional data sources. According to Gartner, organizational data will grow by 800% over the next five years, and 80% of that data will be unstructured. There are three crucial principles associated with unstructured data.

  • Having the appropriate technology is essential for storing and analyzing unstructured data. 
  • Prioritizing unstructured data that is rich in information value and sentiment.
  • Extracting relevant signals and combining them with structured data to boost business predictions and insights.

Incorporate Operational Analytics Engines

One potential advantage of using big data is the ability to tailor experiences to customers based on their most up-to-the-minute behavior. Businesses can no longer extract last month's data, analyze it offline for two months, and act on the analysis three months later if they want big data to be a competitive advantage.

Take, as an example, loyal customers who enter promotional codes at the time of checkout but discover that their discount is not applied, resulting in a poor customer experience.

Businesses need to shift their mindset from traditional offline analytics to tech-powered analytic engines that enable real-time and near-real-time decision-making, adopting a measured test-and-learn approach. This can be achieved by making 20% of the organization's decisions with tech-powered analytical engines and then gradually increasing the percentage of decisions processed in this way as comfort with the process grows.

Final Thoughts 

In this tech-oriented world and digitally powered economy, big data analytics plays a vital role in navigating the market properly and coming up with appropriate predictions and decisions. Organizations must never ignore understanding patterns and deterring flows, especially as enterprises deal with different types of data each day, in different sizes, shapes, and forms. The big data analytics market is growing dramatically and will reach $62.10 billion by 2025. Considering that progression, 97.2% of organizations are already investing in artificial intelligence as well as big data. Hence, organizations must take appropriate measures and keep in mind all of the crucial above-mentioned tips for turning their big data into a big success to stay competitive in this ever-changing world.


5 Promising Tips To Convert Big Data into a Big Success

Can data be considered the new gold? Considering the pace at which data is growing all across the globe, definitely yes!

Let me show you some eye-opening facts and statistics. 

Did you know that Netflix saves $1 billion per year on customer retention by utilizing big data alone? That's not all. Holding the largest share of the search engine market, Google handles 1.2 trillion searches every year, with more than 40,000 search queries every second! There's more: 15% of all Google searches are new and have never been typed before, leading to a new set of data generated by Google regularly. The main agenda is to convert data into information and then convert that information into insights.

Organizations were storing tons of data in their databases without knowing what to do with it until big data analytics became a fully developed discipline. Poor data quality can cost businesses from $9.7 million to $14.2 million every year. Moreover, poor data quality can lead to wrong business strategies or poor decision-making. It also results in low productivity and damages the relationship between customers and the organization, causing the organization to lose its reputation in the market.

To avoid this problem, here is a list of 5 promising tips enterprises must adopt to turn their big data into a big success.

1. A Strong Leader for Driving Big Data Initiatives  

The most important factor for nurturing a data-driven decision-making culture is proper leadership. Organizations must have well-defined leadership roles for big data analytics to boost the successful implementation of big data initiatives. Proper stewardship is crucial for making big data analytics an integral part of regular business operations. Leadership-driven big data initiatives help organizations convert their big data into a big success. Unfortunately, only 34% of organizations have appointed a Chief Data Officer to oversee the successful implementation of big data initiatives. A pioneer in the use of big data in the United States' banking industry, Bank of America has appointed a Chief Data Officer who is responsible for all of the bank's data management standards and policies, for simplifying the IT tools and infrastructure required for implementation, and for setting up the bank's big data platform.

2. Invest in Appropriate Skills Before Technology

Having the right technological skills is crucial; the following three are required:

  • The capability of utilizing disparate open-source software for the integration and analysis of both structured and unstructured data. 
  • The capability of properly framing and asking appropriate business questions with a crystal-clear line of sight into how the insights will be utilized.
  • The capability of bringing the appropriate statistical tools to bear on data for performing predictive analytics and generating forward-looking insights. 

All of the above-mentioned skills can be proactively developed through both hiring and training. It is essential to search for those senior leaders within the organization who not only believe in the power of big data but are also willing to take risks and perform experimentation. Such leaders play a vital role in driving swift acquisitions and the success of data applications.

3. Perform Experimentation With Big Data Pilots

In the present age, numerous big data conversations originate with technology vendors, whether or not they have anything to do with the business case and return on investment (ROI) of big data. Start by identifying the most critical problems of the business and how big data can solve them. After identifying the problem, bring the relevant aspects of big data into a big data laboratory where pilots can be run before making any major investment in the technology. Big data labs provide an extensive collection of big data tools and expertise that permit organizations to run a pilot and prove value effectively without hefty investments in IT and talent. These efforts can then be implemented at a grassroots level with minimal investment in the technology.

4. Search for a Needle in an Unstructured Haystack

The thing that always remains top of mind for businesses is unstructured and semi-structured data. According to Gartner, organizational data will grow by 800% over the next five years, and 80% of that data will be unstructured. Let us look at the three most crucial principles associated with unstructured data.

  • Ensure that the appropriate technology is in place for storing and analyzing unstructured data.
  • Prioritize and pay attention to unstructured data that can be linked back to the individual. It is also imperative to prioritize unstructured data that is rich in information value and sentiment.
  • Analyzing the unstructured data alone is not enough. Relevant signals must be extracted from the insights and combined with structured data to boost business predictions and insights.

 

5. Incorporate Operational Analytics Engines

One of the potential advantages of using big data is the ability to tailor experiences to customers based on their most up-to-the-minute behavior. Businesses can no longer extract last month's data, analyze it offline for two months, and act on the analysis three months later if they want big data to be a competitive advantage. Take the high-value case of loyal customers who enter a promotional code at checkout but find that the discount is not applied, resulting in a poor customer experience. It is high time for businesses to shift their mindset from traditional offline analytics to tech-powered analytic engines that enable real-time and near-real-time decision-making. Companies must adopt a measured test-and-learn approach: making 20% of the organization's decisions with tech-powered analytical engines and then gradually increasing that percentage helps organizations develop a greater level of comfort.

Final Thoughts 

In this tech-oriented world and digitally powered economy, big data analytics plays a vital role in navigating the market properly and coming up with appropriate predictions and decisions. Organizations must never, at any cost, ignore the natural instinct to understand patterns and deter flows. Enterprises deal with different types of data each day, and that data exists in different sizes, shapes, and forms. The big data analytics market is progressing tremendously and will reach $62.10 billion by 2025. Considering that progression, 97.2% of organizations are already investing in artificial intelligence as well as big data. Hence, organizations must take appropriate measures and keep in mind all of the crucial above-mentioned tips for turning their big data into a big success to stay competitive in this ever-changing world.


Difference Between Algorithm and Artificial Intelligence

By 2035, AI could boost average profitability rates by 38 percent and lead to an economic increase of $14 trillion.

The terms Artificial Intelligence (AI) and algorithm are often misused and misunderstood. They are often used interchangeably when they shouldn't be, which leads to unnecessary confusion.

In this article, let’s understand what AI and algorithms are, and what the difference between them is.

An algorithm is a form of automated instruction. An algorithm can be a sequence of simple if-then statements, such as "if this button is pressed, execute that action," or it can involve more complex mathematical equations.
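As a rough illustration (the function and values below are made up for this article, not taken from any real system), a single if-then rule can be expressed in a few lines of Python:

# A minimal sketch of an algorithm as one if-then rule: if a button is
# pressed, execute an action; a defined input leads to a defined output.
def handle_button(button_pressed: bool) -> str:
    if button_pressed:
        return "execute action"
    return "do nothing"

print(handle_button(True))   # execute action
print(handle_button(False))  # do nothing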

· Examples where algorithms are used

  • YouTube’s algorithm knows what kind of ads should be displayed to a particular user
  • The e-commerce giant Amazon's algorithm knows what kind of products a specific user likes and, based on that, shows similar products.

· Types of algorithms

The complexity of an algorithm depends on the complexity of each step it must execute, as well as on the sheer number of steps involved. Most algorithms are quite simple.

  • Basic algorithm

If a defined input leads to a defined output, then the system’s journey can be called an algorithm. This program journey between the start and the end emulates the basic calculative ability behind formulaic decision-making.

  • Complex algorithm

If a system is able to come to a defined output based on a set of complex rules, calculations, or problem-solving operations, then that system's journey can be called a complex algorithm. As with the basic algorithm, this program journey emulates the calculative ability behind formulaic, but more complex, decision-making.
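As a sketch of that idea, the hypothetical Python function below reaches a defined output through several rules and a small formula rather than a single if-then statement; the thresholds and weights are illustrative assumptions only:

# A "complex" algorithm sketch: multiple rules plus a formula, still fully
# deterministic, with a defined output for every defined input.
def loan_decision(income: float, debt: float, years_employed: int) -> str:
    debt_ratio = debt / income if income > 0 else float("inf")
    score = (0.6 * (1 - min(debt_ratio, 1.0))
             + 0.4 * min(years_employed / 10, 1.0))
    if score > 0.7:
        return "approve"
    if score > 0.4:
        return "manual review"
    return "decline"

print(loan_decision(income=60000, debt=12000, years_employed=5))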

Artificial intelligence is a set of algorithms that is able to cope with unforeseen circumstances. It differs from Machine Learning (ML) in that it can be fed unstructured data and still function. One of the reasons why AI is often used interchangeably with ML is that it's not always straightforward to know whether the underlying data is structured or unstructured. This is not so much about supervised and unsupervised learning, but about the way the data is formatted and presented to the AI algorithm.

The term "AI algorithm" is usually used when referring to the details of these algorithms, but the more accurate term is "machine learning algorithm". AI is a culmination of technologies that embrace Machine Learning (ML). ML is a set of algorithms that enables computers to learn from previous outcomes and update themselves with new information without human intervention. ML is simply fed a huge amount of structured data in order to complete a task.

Based on the data acquired, AI algorithms will develop assumptions and come up with possible new outcomes by taking several factors into account, which helps them make better decisions than humans.

In AI algorithms, outputs are not predefined but are designated based on a complex mapping of user data that is weighted against each possible output. This program's journey emulates the human ability to come to a decision based on collected data. The more an intelligent system can enhance its output based on additional inputs, the more advanced the application of AI becomes.

· Examples where AI algorithms are used

  • Self-driving cars are one of the best examples of AI algorithms.
  • Recognition-based applications such as facial, speech, and object recognition mapping

· Learning algorithms

Artificial intelligence algorithms are also called learning algorithms. There are three major kinds of algorithms in ML.

  • Supervised learning

Supervised learning algorithms are based on an outcome or target variable (the dependent variable), which is predicted from a specific set of predictors (independent variables). Using this set of variables, one can generate a function that maps inputs to the desired outputs. The core supervised learning algorithms are Support Vector Machines (SVM), decision trees, naïve Bayes classifiers, Ordinary Least Squares (OLS), random forests, regression, logistic regression, and KNN.
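For illustration, a minimal supervised-learning sketch using one of the algorithms listed above (a decision tree) might look like the following; it assumes scikit-learn is installed and uses made-up toy data:

# Supervised learning sketch: labeled examples (X, y) train a model that
# maps new inputs to predicted target values.
from sklearn.tree import DecisionTreeClassifier

X = [[25, 40000], [47, 85000], [35, 60000], [52, 120000]]  # predictors
y = [0, 1, 0, 1]                                            # known target labels
model = DecisionTreeClassifier().fit(X, y)                  # learn the input -> output mapping
print(model.predict([[40, 70000]]))                         # predict for an unseen input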

  • Unsupervised learning

These are similar to supervised learning algorithms, but there is no specific target or result to estimate or predict; the algorithms keep adjusting their models entirely based on the input data. The algorithm operates through a self-training process without any external intervention. The core unsupervised learning algorithms are Independent Component Analysis (ICA), the apriori algorithm, K-means, Singular Value Decomposition (SVD), and Principal Component Analysis (PCA).
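Similarly, a minimal unsupervised-learning sketch with K-means (again assuming scikit-learn and toy data) shows that no target labels are supplied:

# Unsupervised learning sketch: K-means groups the inputs on its own,
# without any target labels or external supervision.
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each input row
print(kmeans.cluster_centers_)  # learned cluster centers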

  • Reinforcement Learning (RL)

RL involves constant iteration based on trial and error: the machine generates outputs under specific conditions and is trained to take relevant decisions. The machine learns from past experiences and then captures the most suitable and relevant information to make accurate business decisions. Examples of RL include Q-Learning, Markov Decision Processes, SARSA (state-action-reward-state-action), and DeepMind's AlphaZero chess AI.
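As a toy sketch of the trial-and-error idea behind Q-Learning (the states, actions, and numbers here are illustrative assumptions, not part of the original article):

# Q-learning update sketch: the value of a (state, action) pair is nudged
# toward the observed reward plus the best estimated future value.
alpha, gamma = 0.1, 0.9                      # learning rate and discount factor
Q = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
     ("s1", "left"): 0.0, ("s1", "right"): 0.0}

def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in ("left", "right"))
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

update("s0", "right", reward=1.0, next_state="s1")
print(Q[("s0", "right")])  # the value moves toward the observed reward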

An algorithm is a set of automated instructions, simple or complex, that takes some input, applies logic in the form of code, and produces an output based on the predefined guidelines described in the algorithm.

An AI algorithm, by contrast, adapts to the data it receives, whether structured or unstructured, learns from that data, and comes up with unique solutions. It also has the capability to alter its algorithms and develop new ones in response to learned inputs.

Humans and machines must work together to build humanized technology grounded by diverse socio-economic backgrounds, cultures, and various other perspectives. Knowledge of algorithms and AI will help to develop better solutions and to be successful in today’s volatile and complex world.


DSC Weekly Digest 06 July 2021

Before launching into my editorial this week, I wanted to make the announcement that starting with this issue, the DSC Newsletter will be sent out on Tuesdays rather than Mondays. To subscribe to the DSC Newsletter, go to Data Science Central and become a member today. It’s free! 

There has long been a pattern with computer technology. At a certain point in the evolution of technology, there is a realization that things that had been done repeatedly as one-offs occur often enough to start building libraries, or even extensions to languages. For a while, mastery of these libraries defines a certain subset of programmers or analysts, and typically most of the innovations tend to take place as improvements to these libraries, articles on technical sites or journals, and so forth.

Eventually, however, the capabilities are abstracted into more sophisticated stand-alone applications, frequently with user interfaces that provide ways to handle the most frequent use cases while relegating the edge cases to specialized screens. The results of these, in turn, are wrapped within some kind of container (such as Kubernetes) that can then be incorporated into a pipeline with other similar containers.

This is the direction that machine learning is going. MLOps now complements DevOps, managing changes to machine learning models – from the data engineering necessary to ensure that the source data is ready for production, to feature engineering that can be altered on the fly to try out different scenarios, through to presentation and productization that not only makes the results understandable to a business audience but can also feed into other operational channels. MLOps is here now, and it will likely become commonplace within the industry within the next couple of years.

This transformation is critical for several reasons. First, it makes it far easier to create ensemble models, models that are developed and work in parallel, and that can handle different starting scenarios. This is key because the more generalized a model has to be, the more expensive, time-consuming, and complex it turns out to be, and the less likely it is that it can handle edge cases accurately. This is especially important when dealing with sparse datasets, where the danger is that single, comprehensive models can badly overfit the input, making such models very brittle to initial conditions.

In addition, by reducing the overall costs of implementing models from months to weeks, or even days, organizations are able to better productize their data analytics in ways that would have been unheard of even a couple of years before. As not all problems can (or should) be solved with machine learning in the first place, the ability to take advantage of more generalized DevOps pipelines within your organization puts machine learning right where it belongs – as a powerful tool among many, rather than a single, potentially shaky foundation on its own.

For machine learning and data science specialists, this has other implications as well. Domain proficiency in a given sector will mean more, and the ability to write Python or R will mean less, save for those who focus more specifically on tool-building within integrated frameworks. However, a good understanding of data operations in general and machine learning operations in particular, both engineering tasks, will likely be in dramatically higher demand over the next few years. Additionally, those who are better at productizing data, integrating ML streams with other streams to create digital assets that can then be published as physical assets, will do quite well.

Machine learning is maturing. There’s nothing wrong with that.

In media res,

Kurt Cagle
Community Editor,
Data Science Central



What’s In a Name?

I wrote on this topic way back in 2016, but when a recent reader indicated that the original article didn’t have images anymore (it happens), it seemed like a good opportunity to write about it again.

I am Kurt Cagle, or, according to my birth certificate, Kurt Alan Cagle. My name is Kurt Cagle.

Now, think about that for a bit. The to be verb is remarkably slippery, and it is slippery in almost every language on the planet that has such a construct. For instance, consider the following statements:

I am Kurt Cagle.

I am a writer.

These are two of the most fundamental assertions in language. The first statement can be broken down as:

There exists a label associated with the referenced entity that (at least locally) identifies that entity to differentiate it from other entities.

The second statement can also be restated as:

There exists a set associated with the referenced entity that indicates membership of that entity within that set, which in turn has a label.

Makes sense, right? Welcome to the world of ontology!

The shape of names evolved over time. The concept goes way back: the Proto-Indo-European word for name (which had its origins in the Crescent Valley) was nomen (nuh-min), and outside of that family tree, the Chinese root for name is ming, which many linguists would recognize as a cognate of nomen, meaning that people have been using names for at least six thousand years, and possibly far longer.

The first names were likely given names, and were in essence “gifted” names: the name bestowed by others (typically the parent) to signify a given aspiration – such as Grace, Hope, Luke (shining) – or a beseechment or dedication to a deity, such as Mark (Mars-like or martial) or Michael (gift of God) or Gabriel (strength of God), with the suffix “el” in the latter two cases meaning Lord or Ruler (from the Sumerian and Phoenician Ba’el, reflected in the name Al-lah in Arabic and Muslim cultures).

Women's names were typically diminutives of men's names, where a diminutive was a shortened or "softened" form of a man's name that often stemmed from the roots for small, such as Gabriella or Marcia, softened forms of Gabriel and Mark respectively. They were also given names that reflected beauty, such as plant names (e.g., Holly, Ivy, or Lily), or gem names (Ruby, Pearl). Occasionally male names in different languages than the naming language would become feminized variants, such as the French Jean (John, in English) becoming the feminine form of John in England. In general, there are many more variants of female names than male ones.

Within family groups, this differentiation was sufficient to ensure uniqueness most of the time, though in small groups you might have adjectives that qualify these names – Big John, Tall John, Red John, and so forth. In some cases, especially among rulers, these qualifiers became parts of their name – Charlemagne was, as an example, Charles the Great. The word nickname, by the way, has nothing to do with the devil (Old Nick) but instead started out as ekename in Old English, where eke meant “also” or “alternative”. As eke fell out of usage in comparison to also in OE, eke became nekename, with the middle syllable eventually lost to become nickname. Alternative names, synonyms, or aliases, tend to be weaker because they generally have weaker authority (a lesson that ontologists should pay especially close attention to).

Once cultures reached a certain size, given names were no longer adequate to fully differentiate members of that population. One solution to this, seen especially in northern cultures, was to use familial relationships: John, Son of James (John Jameson), was different from John, Son of John (John Johnson). Admittedly, this made more sense in villages where people knew one another's families reasonably well, but it also accounts for the reason that Johnson is one of the most common surnames in regions with strong Nordic roots. In other places (especially in England and Germany) profession names were used to differentiate family lines – Smith, Sawyer (a person who used saws to cut down trees, or a lumberjack), Miller, Tinker (a tin smith), Carpenter, and so forth often uniquely identified a person in that profession, and as family trades were frequently handed down, so too were the differentiating surnames.

Finally, family names also tended to echo prominent place features – Lake, Brook, Craig (a mountain), Fields, etc. – associated with the family. This was especially true of nobles and other officials, who often took the name of a given property or city that they had dominion over, though the use of originating cities or regions as qualifiers also goes way back.

The use of both a given name and a family or surname almost invariably was tied into tax collection. For instance, after the invasion of England by Willelm of Normandy (a.k.a. William the Conqueror) in 1066, one of the first orders of business was to identify the wealthy people and assets in the country, in a survey called the Domesday Book. These tax records served to freeze what had been until that time colloquial names (such as the use of professional names such as Smith or Miller as differentiators), while also formalizing "House Names" such as the Houses of York or Lancaster (lampshaded in George R.R. Martin's Game of Thrones series as House Stark and House Lannister respectively).

It's worth noting that taxonomists and ontologists refer to the given + family name (or surname) as a qualified name; the surname qualifies the given (or local) name. From a more formal code standpoint, the qualified name acts as a namespace for the terms (names) within that space, and the qualifier typically denotes set or class membership. Such a system dramatically reduces the likelihood that a name may refer to more than one person. As such, it is a mechanism for determining uniqueness in a broader set.

Note that beyond the emergence of given and surnames, there are other qualifiers that can differentiate a name, such as patronymics (senior, junior, the third, elder, younger, etc.), and honorifics that ironically also qualify a person by profession or distinction (sir, which is a contraction of Senior, doctor, reverend, etc.) as well as gender identifiers, up to and including the latest fashion of specifying pronouns for address purposes.

Western European styles also reflect a cultural preference for putting the given name first in narrative prose, though in legal contracts and other communications, the reverse order of surname and given name, separated by a comma, is frequently used to facilitate sorting by family name. Asian countries, on the other hand (with notable exceptions including Thailand and the Philippines), always use the qualifying (sur)name first. As such, it is typical to store a common usage name in the Western style while also storing given names and surnames separately in order to facilitate sorting using either convention.
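As a small sketch of that storage convention (the field names and records below are purely illustrative), keeping the given name and surname as separate fields makes it easy to sort under either ordering:

# Store a display name plus separate given name and surname so records can
# be sorted by either convention.
people = [
    {"display": "Kurt Cagle", "given": "Kurt", "surname": "Cagle"},
    {"display": "Jane Doe",   "given": "Jane", "surname": "Doe"},
]

western_order = sorted(people, key=lambda p: (p["given"], p["surname"]))
family_first_order = sorted(people, key=lambda p: (p["surname"], p["given"]))
print([p["display"] for p in family_first_order])  # sorted by family name first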

Cardinality and Reification

It is dangerous to assume that there is always a one-to-one correspondence between an individual and a name. Indeed, for fifty percent of the population, it is likely that their name will change at least once in their lifetime. That segment, of course, is women. Until comparatively recently (the 1960s in the United States) if a woman married, she was expected to take the surname of her husband. The feminist movement started changing that, in part as a reflection of shifting expectations about property ownership, taxation, and a weakening of the ecclesiastical view of marriage and divorce. While still a fairly low percentage, women in more marriages than ever are choosing to keep their “maiden names” if they marry, or both partners (especially in same-sex relationships) are choosing to create hyphenated surnames that differ from their pre-marriage surnames.

Nonetheless, in modeling individuals, the assumption should be that surnames especially will change over time, and given names may very well change too. Once again, gender plays a role. A person may very well either physically change their sex through surgery or may at least publicly present themselves as the opposite gender, with names reflecting this event.

It's worth noting that there are always political dimensions when it comes to data modeling, and nowhere is that as intense as with identity modeling. Any modeling involves making certain assumptions, assumptions that are often informed by cultural norms and expectations. We are now entering an era where identity is fluid: it changes over time based upon gender intent, relational status, professional appellation (The Artist Formerly Known as Prince) and even social context. For instance, you are increasingly seeing gender pronoun preferences (he, him, his; she, her, hers; ze, zir, zis) in social media.

Yet at the same time this adds to the complexity of the model. From a semantics perspective, this recreates a structure that occurs whenever you have temporal evolution, what I’d call the now-then pattern.

The now part of the pattern is an assertion that, at the time the assertion is made, is true:

Her name is Jane Doe

The then part of the pattern, on the other hand, is a set of assertions that specify a range (possibly open-ended) identifying an event or state:

This is an event.
This event refers to a property called name.
The value of this property is Jane Doe.
This event began on March 16, 1993.
This event ended on June 17, 2021.
This event was reported by Kurt Cagle.

This second structure is known in semantic circles as an example of reification, meaning that the second set of assumptions describes a single relationship. The this in this case is in fact the statement Her name is Jane Doe. For those familiar with SQL, reification typically describes 3rd Normal Forms (or 3NF).

In more abstract terms, the initial statement can be broken down as:

r = {s->[p]->o}

where s is a reference to a subject entity, p is a reference to a relationship or property, and o is a reference to an object or value relative to that relationship. The reification is then a set of other relationships that refer to the given assertion or statement r:

r is a reification.
r has property p.
r has subject s.
r has object o.
r starts at time t1
r optionally ends at time t2.
r was reported by m.

The reification is significant because it specifies the time to live of a given relationship between two things. Reifications can also hold other metadata (for instance, specifying a pronoun type indicating preferred gender designation). However, it’s also worth noting that you can have a great deal of information within a reification, but that also adds significantly to the number of assertions (triples) bound to that reification.

In terms of a graph, a reification is in fact the metadata associated with the information about an edge, when given two objects. For instance, if s is an airport, o is also an airport, and p is an indication that a route exists between s and o, then r = {s->[p]->o} is in fact the route between s and o:

airport:_SEA airport:hasRoute airport:_DEN (Seattle has a route to Denver).

The route is in effect a reification (especially as routes, which are largely ephemeral and abstract entities, change far more quickly than airports do).

The route can assign a mean travel time as a property on the reification. This is, effectively, contextual information, information that belongs not to either airport but rather to the relationship that exists between the two.
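As a rough sketch outside of RDF (the structure and numbers are illustrative assumptions), the same idea can be expressed by making the route its own object that carries the edge metadata:

# The reification is the relationship itself, modeled as its own object;
# deleting the route leaves both airports intact.
airports = {"SEA": {"name": "Seattle"}, "DEN": {"name": "Denver"}}

route = {
    "subject": "SEA",
    "predicate": "hasRoute",
    "object": "DEN",
    "mean_travel_time_hours": 2.8,   # contextual info belonging to the edge, not to either airport
}

print(route["mean_travel_time_hours"])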

With regard to names, this introduces some interesting modeling issues. A personal name goes from being a simple label to being something with a structure, a presence, and a role or type. More on that in a bit, but before digging into the weeds, it's time to emphasize an important point here:

Reifications are almost invariably trade-offs between the need to deal with transients and the complexity of combinatorics. In the case of names, for instance, a given individual may have multiple names, though some may be birth names, some nicknames, some professional names, and some due to change in marital status or presentation status. A person may even have multiple names simultaneously. Names are, of course, not necessarily unique, but they still serve as one of the most commonly used identifiers for people, and for this reason as much as any other, this kind of reification makes sense.

Modeling Names (and a Sneak Peek of Templeton)

Given all of this, what would the best model for names look like? The now-then pattern suggests a two-pronged approach: first, model what a Personal Name should look like; then, from the set of all such names for the individual, choose the primary name, the name that is currently used to best represent that individual.

The following example is in what I’m calling Templeton (short for RDF Template Notation).

?PersonalName a Class:_PersonalName;

      PersonalName:hasType ?PersonalNameType;

      PersonalName:hasFullName ?fullName;

      PersonalName:hasGivenName ?givenName; #+

      PersonalName:hasSecondaryName; #*

      PersonalName:hasSignatoryName; #? ## Name used on a legal document

      PersonalName:hasFamilyName ?familyName; #*

      PersonalName:hasFamilySortName ?sortName; #? ## For convenient sorting

      PersonalName:hasHonorific ?honorofic; #* ## Mr., Ms., Dr., etc.

      PersonalName:hasPatronymic ?patronyic; #* ## Sr, Jr, III

      PersonalName:hasDistinction ?distinction; #* ## PhD, JD

      PersonalName:hasNominativePronoun ?nominativePronoun; #? ## he, she, ze

      PersonalName:hasPosessivePronoun ?possessivePronoun; #? ## his,hers,zes

      PersonalName:hasObjectivePronoun ?objectivePronoun; #? ## him,her,zem

      PersonalName:hasStartDate ?startDate; #? xsd:date

      PersonalName:hasEndDate ?endDate; #? xsd:date

      PersonalName:hasLanguage; #? ## indicates the language code of the name (en,de,es,cn, etc.)

      .

PersonalName:hasFullName a Class:_Property;

      rdfs:subPropertyOf rdfs:label

      .

?Person a Class:_Person;

      Person:hasPrimaryPersonalName ?PersonalName;

      Person:hasPersonalName ?PersonalName; #+

      Person:hasPrimaryNameString ?fullName;

      .

?PersonalNameType a Class:_PersonalNameType.

%[

PersonalNameType:_BirthName,

PersonalNameType:_AdoptedName,

PersonalNameType:_LegalChangedName,

PersonalNameType:_ProfessionalName,

PersonalNameType:_MarriedName,

PersonalNameType:_LegalAlias,

PersonalNameType:_IllegalAlias,

PersonalNameType:_NickName,

]% a Class:_PersonalNameType.

First a few words about the notation. The core of it (just as with SPARQL) is Turtle as a way of describing assertions (triples here). Variable names (beginning with a question mark) provide a label, and in some cases (such as ?fullName) a value used in multiple assertion templates. If a line is indented (and the preceding line ends with a semicolon) then the un-indented first term remains in force. For instance,

?PersonalName a Class:_PersonalName;

      PersonalName:hasType ?PersonalNameType;

      .

is short for

?PersonalName a Class:_PersonalName;

?PersonalName PersonalName:hasType ?PersonalNameType;

The hash mark (#) is a comment, but in the template it's used to signal cardinality. Thus #* indicates that the previous assertion may be repeated zero or more times, #+ indicates a one-or-more repetition, and #? indicates an optional assertion. If a variable starts with an uppercase letter, it indicates an IRI (or reference pointer); if it starts with a lowercase letter, then the value is an atomic value, defaulting to a string. Thus,

?PersonalName personalName:hasStartDate ?startDate; #? xsd:date

indicates that ?startDate in this particular case is a date.

The notation

%[a,b,c,…]% a class:PersonalNameType.

indicates that the list of items are each subjects to the associated predicate and object, and is very useful for specifying type enumerations. Finally the single a is a shorthand for rdf:type.

Note: Templeton is a shorthand templating notation I’ve been developing as a way of creating schemas that can be expanded to OWL, SHACL, XML Schema, or JSON-Schema. I’m working on a parser for it now.

Of Compositions, Associations, and the Now/Then Pattern.

The modeling of PersonalName should seem straightforward, with a few caveats. First, it has been my observation working with dozens of ontologies over the years that almost every time you define a class, there is usually some kind of intent indicator needed. Such indicators do not materially change the definition of the class, but they do provide a level of context about what a particular instance is intended to do. For instance, PersonalNameType identifies whether something is a birth name, a married name, an alias, or a professional name (among others). These are differentiated from being subclasses because they do not change any other properties.

The second caveat has to do with modeling. UML differentiates between a composition and an association. An association typically describes a relationship between two disparate entities, and in the semantic parlance could be considered the same as a reification (or third normal form construction in SQL). A composition, on the other hand, occurs when there is an existential dependency between the subject and object. For instance, even if you have two people who have the same personal name, these two instances are distinctive (having different start and end dates, for instance). Should a person be deleted from the database, all of the names associated with that person would also need to be deleted (which is not true for associations).

In my own modeling, compositions should always belong to the reference subject, or, put another way, the relationship points from the subject to the object semantically. Associations, on the other hand, generally are reifications – there is a reifying object such as the route in our airport example, that binds two entities together. If you delete the reification (the route, here), you don't in this case delete the associated entities (the airports).

There are some objects that seem to skirt the boundaries. An address is a good example. If a person has an associated address, a naive modeling would make an address a composition. However, it’s not. Multiple people can live at the same address. If one person moves away, that does not cause the address itself to “disappear”. This also means that the association of a person with an address should be seen as being a reification. I use the term Habitation as the class for that reification, one that points to both a person and an address:

?Habitation a Class:_Habitation;

     Habitation:hasType ?HabitationType;

     Habitation:hasTenant ?Person;

     Habitation:hasAddress ?Address;

     Habitation:hasStartDate ?startDate;

     Habitation:hasEndDate ?endDate; #?

     .

Regardless of whether something is a composition or an association, there are times where you just want to know what a person’s current primary name is, without having to build complex queries to find it. This is where inferred triples come into play. An inferred triple is typically generated, either through a SPARQL Update query or as part of a CONSTRUCT (these are more or less the same, depending upon how inferred triples are persisted).

For instance, the following SPARQL Query will change the primary name for a person to the specified value:

# Update Primary Name

delete {

    ?Person Person:hasPrimaryName ?oldPrimaryName;

            Person:hasPrimaryNameString ?oldFullName.

    }

insert {

    ?Person Person:hasPrimaryName ?newPrimaryName;

            Person:hasPrimaryNameString ?newFullName.

    }

where {

    values (?Person ?newPrimaryName) {(Person:_JaneDoe PersonName:_JaneDoeBirthName)}

    ?Person Person:hasPrimaryName ?oldPrimaryName.

    ?Person Person:hasPrimaryNameString ?oldFullName.

    ?newPrimaryName PersonName:hasFullName ?newFullName.

    }

   

   

Inferred triples are frequently transitory assertions – they reflect the default value from a set of objects, but that can change, and frequently they provide a way of short-circuiting complex queries. For instance, Person:hasPrimaryNameString is the string representation of the default personal name. This can be made even more powerful by making that particular property the subproperty of something like skos:prefLabel (assuming a basic inference engine), so that a naive query, such as:

select ?s ?name where {

    ?s skos:prefLabel ?name.

    filter (contains(?name, "Jane Doe"))

}

will return a list of all entities which have a primary label of “Jane Doe” in them. Note that this isn’t a terribly efficient query, but it can be handy, nonetheless.

So when you’re thinking about the design of your models, identify those properties that you’d intuitively want to see for the classes in question that can be inferred or derived, and in effect pre-generate or update these properties as the state of the object changes so that your users don’t have to build complex queries. Remember, a triple store is an index, and such actions can be thought of as optimizing that index.

Summary

Modeling, when it comes right down to it, is the process of questioning your assumptions and optimizations. A big issue that arises with most traditional SQL systems is that many database modelers optimize for complexity by reducing the number of database tables and joins, but this also reduces the contextual metadata that is increasingly a requirement in today’s data rich world.


An Overview of Logistic Regression Analysis
An Intuitive study of Logistic Regression Analysis

Logistic regression is a statistical technique for finding the association between a categorical dependent (response) variable and one or more categorical or continuous independent (explanatory) variables.

We can define the regression model as,

G(probability of event) = β0 + β1x1 + β2x2 + … + βkxk

We determine G using a link function, as follows:

Y = 1 if β0 + β1x1 + ϵ > 0

Y = 0 otherwise

There are three types of link functions (a small sketch of the logit link follows the list below). They are:

  • Logit
  • Normit (probit)
  • Gompit
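As a small sketch of the logit link (the coefficient values below are made-up assumptions), the linear predictor β0 + β1x1 is mapped to a probability between 0 and 1:

# Logit link sketch: the linear predictor is passed through the logistic
# function to produce a probability.
import math

def logit_link(x, b0=-1.0, b1=0.5):
    linear_predictor = b0 + b1 * x
    return 1 / (1 + math.exp(-linear_predictor))   # P(event)

print(logit_link(4.0))   # probability of the event when x = 4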

          

An Intuitive study of Logistic Regression Analysis

Why do we use logistic regression?

We use it when we have:

  • one categorical response variable,
  • one or more explanatory variables, and
  • no linear relationship between the dependent and independent variables.

Assumptions of Logistic Regression

  • The dependent variable should be categorical (binary, ordinal, nominal, or a count of occurrences).
  • The predictor or independent variables should be continuous or categorical.
  • The correlation among the predictors (multi-collinearity) should not be severe, but there should be a linear relationship between the independent variables and the log odds.
  • The data should be a representative sample of the population, recorded in the order it was collected.
  • The model should provide a good fit to the data.

Logistic regression vs Linear regression

  • In the case of Linear Regression, the outcome is continuous, while in the case of logistic regression the outcome is discrete (not continuous).
  • To perform linear regression, we require a linear relationship between the dependent and independent variables, but to perform Logit we do not.
  • Linear Regression is about fitting a straight line to the data, while Logit is about fitting a curve to the data.
  • Linear Regression is a regression algorithm for machine learning, while Logit is a classification algorithm for machine learning.
  • Linear regression assumes a Gaussian (normal) distribution of the dependent variable, while Logit assumes a binomial distribution of the dependent variable (a short code sketch follows this comparison).

*Logit=logistic regression
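A minimal sketch of this contrast, assuming scikit-learn is installed and using made-up toy data, shows logistic regression returning a discrete class (and a probability) rather than a continuous value:

# Logistic regression as a classifier: discrete class labels plus class
# probabilities for a binary response.
from sklearn.linear_model import LogisticRegression

X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]   # one continuous predictor
y = [0, 0, 0, 1, 1, 1]                           # binary (categorical) response
clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5], [4.5]]))               # discrete class labels
print(clf.predict_proba([[2.5]]))                # P(class 0), P(class 1)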

Types

There are four types of logistic regression. They are,

  • Binary logistic: When the dependent variable has two categories and the characteristics are at two levels, such as yes or no, pass or fail, or high or low, the regression is called binary logistic regression.
  • Ordinal logistic: When the dependent variable has three or more categories with a natural ordering of the levels, such as survey results (disagree, neutral, agree), the regression is called ordinal logistic regression.
  • Nominal logistic: When the dependent variable has three or more categories with no natural ordering of the levels, such as colors (red, blue, green), the regression is called nominal logistic regression.
  • Poisson logistic: When the dependent variable is the number of times an event occurs, such as 0, 1, 2, 3, etc., the regression is called Poisson logistic regression.

Source…


Defining “Value” – the Key to AI Success

I recently conducted a 3-day, remote “Data Monetization: Thinking Like a Data Scientist” workshop for a transportation agency in the Middle East.  Doing this training remotely is a personal challenge as I miss the face-to-face interaction in ideating, validating, and prioritizing the business areas that can benefit from data and analytics.  However, conducting the workshop remotely did provide some valuable learnings for me.

One learning was my “Thinking Like a Data Scientist” visual was outdated (Figure 1).

Figure 1: Original “Thinking Like a Data Scientist” (TLADS) Visual

Figure 1 portrayed the "Thinking Like a Data Scientist" (TLADS) process as a linear process, where you would complete one step and then cleanly move on to the next step.  But in reality, the process is highly iterative, and it is common for learnings from one step to affect an earlier step, such as refining the KPIs against which the targeted business initiative's progress and success will be measured.  So, I created an updated "Thinking Like a Data Scientist" visual in Figure 2 to reflect the highly iterative nature of the TLADS process.

Figure 2: Updated “Thinking Like a Data Scientist” Visual

One other learning from the workshop was the need to spend more time thoroughly understanding and defining the "value" that the business initiative sought to create.  And that's where things get tricky.  Too many organizations limit how they define and measure "value".  And defining a robust and diverse set of KPIs and metrics against which to measure business initiative progress and success becomes critically important as we apply AI to continuously optimize the business initiative.

To understand the AI “value” quandary, one must first understand how an AI model works:

  1. The AI model interacts with its environment to gain feedback in order to continuously learn and adapt its performance (using backpropagation and stochastic gradient descent).
  2. The AI model's continuous learn-and-adapt process is guided by the AI Utility Function, the metrics and KPIs that define AI model progress and success.
  3. The AI model seeks to continuously make the “right” decisions, as framed by the AI Utility Function, as the AI model continuously interacts with its environment.
  4. In order to create an AI model that makes the “right” decisions, the AI Utility Function must be comprised of a holistic and robust definition of “value” including financial, economic, operational, customer, society, environmental, and maybe even spiritual.

Bottom-line: the AI model determines “right and wrong” actions based upon the definition of “value” as articulated in the AI Utility Function (Figure 3).

Figure 3: AI Rational Agent Makes the “Right” Decisions based on AI Utility Function
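As a hedged sketch (the dimensions, weights, and scores below are illustrative assumptions, not a prescription), an AI Utility Function that scores a candidate decision against several "value" dimensions rather than a single financial KPI might look like this:

# Weighted multi-KPI utility: each decision is scored across several value
# dimensions, and the "right" decision depends on how value is defined.
kpi_weights = {
    "financial": 0.30,
    "customer": 0.25,
    "operational": 0.20,
    "environmental": 0.15,
    "societal": 0.10,
}

def utility(kpi_scores: dict) -> float:
    # Weighted sum of normalized KPI scores (each assumed to be in 0..1).
    return sum(kpi_weights[k] * kpi_scores.get(k, 0.0) for k in kpi_weights)

decision_a = {"financial": 0.9, "customer": 0.4, "operational": 0.7,
              "environmental": 0.2, "societal": 0.3}
decision_b = {"financial": 0.6, "customer": 0.8, "operational": 0.6,
              "environmental": 0.7, "societal": 0.7}
print(utility(decision_a), utility(decision_b))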

Consequently, we must invest the effort across a diverse set of stakeholders to thoroughly explore and validate a diverse set of metrics and KPIs against which the AI model is seeking to optimize.  And that’s exactly one of the key objectives of the TLADS process.

If we want AI to work for us humans (versus us humans working for AI), then we must thoroughly define “value” before we start building our AI ML models. Consequently, I expanded the TLADS process to drive a more thorough exploration and definition of “value” (Figure 4).

Figure 4: How Does Your Organization Define “Value”

A more holistic suite of “value” dimensions that today’s organizations need to consider are represented in red in Figure 4.  To support the ideation around the exploration of these expanded value dimensions, I updated TLADS Step #1 and Template #1 (Figure 5) to include:

  • What is the targeted Business Initiative? A clear statement about what the business initiative is trying to accomplish.
  • What are the KPIs or metrics against which business initiative progress and success will be measured? There should be at least 6 – 8 KPIs and metrics against which the organization is measuring the progress and success of the targeted business initiative.
  • What are the Ideal Outcomes from this business initiative? This is a “future visioning” exercise to envision what successful execution of the business initiative looks like.
  • What are the Benefits from the business initiative from the value perspectives of financial, customer, product, operational, environmental, societal? There should be at least 6 to 8 benefits across the broader dimensions that define value.
  • What are the Potential Impediments to successful execution of the business initiative? There should be at least 6 to 8 potential impediments across technology, data, skills, personnel, competitive, market, and organizational factors.
  • What are the Ramifications of the Failure of this business initiative? This one is the most fun because it gives everyone a chance to envision and explore all the different ways where things can go wrong.

Note: capturing a robust set of KPIs, benefits and impediments should not be difficult if 1) you have a diverse group of stakeholders participating in the brainstorming process and 2) you ensure that everyone has an equal voice in the ideation process.

See Figure 5 for an updated Template #1 using my traditional Chipotle example, where the items in red are related to the expanded definition of “value” for Chipotle.

Figure 5:  Updated Template 1 of the “Thinking Like a Data Scientist” methodology

By the way, how your organization defines “value” probably says more about your organization than whatever your charter and mission statement states.  Or said another way:

You are what you measure, and you measure what you reward

Yea, you may say that your organization’s charter is such and such, but your organization’s charter is actually defined by the metrics and KPIs against which you measure (and reward) business success.  Period.

Value definition is critical from an AI execution perspective.  Unfortunately, in a rush to get to the fun part of the AI job and start playing with the AI algorithms, organizations sometimes shortchange the upfront work in thoroughly and holistically defining the value (metrics and KPIs) against which business initiative progress and success will be measured.

Organizations must be thoughtful and thorough in how they define the values against which the operations of the business will be measured.  Getting those "values" wrong can lead to unintended, biased, and disastrous consequences in your AI models (check out Terminators, VIKI, and ARIIA…your homework assignment for my next blog).


What Skills Does an IT Business Analyst Need?

The success of an IT project largely depends on a Business Analyst – the intermediary between IT processes and a business. Thanks to the Business Analyst, products of the required quality prosper on the market. We’ll tell you what skills this specialist should have so that the above is true. 

The Business Analyst’s mission on a project 

The Business Analyst analyzes future products to figure out what needs to be improved so that the development is as useful to consumers and as profitable to the customer as possible. The Business Analyst performs the following tasks at different stages of the software development life cycle (SDLC):

  • studies the market and competitors to improve the product’s functionality if possible, 
  • communicates with customers to collect and document product requirements, 
  • approves the requirements with stakeholders,
  • advises the teams on the product, and more.

To summarize, the Business Analyst provides the teams with high-quality requirements, strives to avoid the development of useless features, and maximizes business value.

What skills the Business Analyst needs 

The following six skills help the Business Analyst ensure that the project is completed to good quality, on time, and within budget.

  • Technical skills.

IT Business Analysts should:

  • understand complex technologies and terms and know how to use different tools,
  • have a good understanding of the Big Data concept and of ways to obtain information to analyze a business,
  • have a working knowledge of software architecture, understand the basics of testing and programming, and know fundamental SQL queries.

This knowledge allows Business Analysts to elaborate development plans and strategies for improving the product at all stages of SDLC.

  • Research skills.

Every project begins with a request from a customer. The Business Analyst must research the customer’s business, identify problems or opportunities, and recommend a solution. In the course of their work, Business Analysts study the market and competitors, estimate possible benefits for the business, and suggest the best way to reach the customer’s goals.

  • Analytical skills.

The Business Analyst has to study lots of information: statistics, requirements, documentation, market conditions, and so on. The wealth of information that Business Analysts obtain after the research is completed needs to be analyzed. This allows them to estimate risks, forecast success, and choose the best solution for the business. 

  • Communication skills.

Since Business Analysts are intermediaries between customers and development teams, they are in constant communication with both parties. 

The Business Analyst receives requirements forming the basis of a project through communication with the customer. The documentation that Business Analysts create must be clear, consistent, and without any ambiguity, as the product development is based on it.

As they know all the nuances of the project, Business Analysts also advise other employees. They receive feedback in the course of development and modify the product creation plan.

  • Leadership skills.

Business Analysts’ work is closely tied to management: they eliminate problems such as project delays and are responsible for the project results. All SDLC participants go to the Business Analyst with development-related issues because they are an authoritative source of knowledge. After all, it is hard to negotiate with a customer if you don’t have leadership skills. 

  • Negotiation skills. 

Negotiation and persuasion skills differ from the ability to simply communicate with teams. The Business Analyst interacts with managers of different levels and convinces them that a particular decision is correct. If the customer wants certain features in the app but the Business Analyst sees a better option, they must prove their point and strike a balance between customer desires and business needs.

To sum up, we can say that competent Business Analysts balance expertise with interpersonal skills. These specialists combine technical and non-technical competence to ensure the competitive edge of products, which is needed in a world of rapid business development.

Source Prolead brokers usa

fine tuning transformer model for invoice recognition
Fine-Tuning Transformer Model for Invoice Recognition
A step-by-step guide from annotation to training

                                                    Photo by Andrey Popov from Dreamstime

Introduction

Building on my recent tutorial on how to annotate PDFs and scanned images for NLP applications, we will fine-tune Microsoft’s recently released LayoutLM model on an annotated custom dataset that includes French and English invoices. While the previous tutorials focused on using the publicly available FUNSD dataset to fine-tune the model, here we will show the entire process, from annotation and pre-processing to training and inference.

LayoutLM Model

The LayoutLM model is based on the BERT architecture but adds two types of input embeddings. The first is a 2-D position embedding that denotes the relative position of a token within a document, and the second is an image embedding for scanned token images within a document. The model achieved new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24), and document image classification (from 93.07 to 94.42). For more information, refer to the original article.
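To make the layout input concrete, here is a minimal, hypothetical forward pass using the Hugging Face transformers implementation: alongside the usual token ids, the model takes one [x0, y0, x1, y1] box per token, normalized to a 0–1000 grid, from which the 2-D position embeddings are computed (the sentence and the reused dummy box below are illustrative only):

# Minimal sketch of LayoutLM's extra "bbox" input (Hugging Face transformers).
from transformers import LayoutLMTokenizer, LayoutLMModel
import torch

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMModel.from_pretrained("microsoft/layoutlm-base-uncased")

encoding = tokenizer("Total en EUR 3,20", return_tensors="pt")
seq_len = encoding["input_ids"].shape[1]
# One box per token, scaled to 0-1000; a single dummy box is reused here for simplicity.
bbox = torch.tensor([[[302, 195, 321, 199]] * seq_len])
outputs = model(input_ids=encoding["input_ids"], bbox=bbox,
                attention_mask=encoding["attention_mask"])
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])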

Thankfully, the model was open-sourced and made available in the Hugging Face library. Thanks, Microsoft!

For this tutorial, we will clone the model directly from the huggingface library and fine-tune it on our own dataset. But first, we need to create the training data.

Invoice Annotation

Using the UBIAI text annotation tool, I have annotated around 50 personal invoices. I am interested in extracting both the keys and the values of the entities; for example, in the text “Date: 06/12/2021” we would annotate “Date” as DATE_ID and “06/12/2021” as DATE. Extracting both the keys and values helps us correlate the numerical values to their attributes. Here are all the entities that have been annotated:

DATE_ID, DATE, INVOICE_ID, INVOICE_NUMBER, SELLER_ID, SELLER, MONTANT_HT_ID, MONTANT_HT, TVA_ID, TVA, TTC_ID, TTC

Here are a few entity definitions:

MONTANT_HT: Total price pre-tax
TTC: Total price with tax
TVA: Tax amount

Below is an example of an annotated invoice using UBIAI:


                                                             Image by author: Annotated invoice

After annotation, we export the train and test files from UBIAI directly in the correct format, without any pre-processing step. The export includes three files each for the training and test datasets, plus one text file named labels.txt that contains all the labels:

Train/Test.txt

2018 O
Sous-total O
en O
EUR O
3,20 O
€ O
TVA S-TVA_ID
(0%) O
0,00 € S-TVA
Total B-TTC_ID
en I-TTC_ID
EUR E-TTC_ID
3,20 S-TTC
€ O
Services O
soumis O
au O
mécanisme O
d'autoliquidation O
- O

Train/Test_box.txt (contains the bounding box for each token):

€ 912 457 920 466
Services 80 486 133 495
soumis 136 487 182 495
au 185 488 200 495
mécanisme 204 486 276 495
d'autoliquidation 279 486 381 497
- 383 490 388 492

Train/Test_image.txt (contains the bounding box, document size, and file name for each token; see the normalization sketch after this listing):

€ 912 425 920 434 1653 2339 image1.jpg
TVA 500 441 526 449 1653 2339 image1.jpg
(0%) 529 441 557 451 1653 2339 image1.jpg
0,00 € 882 441 920 451 1653 2339 image1.jpg
Total 500 457 531 466 1653 2339 image1.jpg
en 534 459 549 466 1653 2339 image1.jpg
EUR 553 457 578 466 1653 2339 image1.jpg
3,20 882 457 911 467 1653 2339 image1.jpg
€ 912 457 920 466 1653 2339 image1.jpg
Services 80 486 133 495 1653 2339 image1.jpg
soumis 136 487 182 495 1653 2339 image1.jpg
au 185 488 200 495 1653 2339 image1.jpg
mécanisme 204 486 276 495 1653 2339 image1.jpg
d'autoliquidation 279 486 381 497 1653 2339 image1.jpg
- 383 490 388 492 1653 2339 image1.jpg
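
LayoutLM’s 2-D position embeddings expect each box on a 0–1000 grid, which is why the page size is exported alongside the coordinates. Below is a minimal sketch of that normalization convention, an illustration rather than the exact code the dataset loader runs, using the “Total” line from the file above:

# Hypothetical helper: scale a pixel-space box to LayoutLM's 0-1000 grid.
def normalize_box(box, page_width, page_height):
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# "Total 500 457 531 466 1653 2339 image1.jpg"
print(normalize_box([500, 457, 531, 466], 1653, 2339))  # -> [302, 195, 321, 199]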

labels.txt:

B-DATE_ID
B-INVOICE_ID
B-INVOICE_NUMBER
B-MONTANT_HT
B-MONTANT_HT_ID
B-SELLER
B-TTC
B-DATE
B-TTC_ID
B-TVA
B-TVA_ID
E-DATE_ID
E-DATE
E-INVOICE_ID
E-INVOICE_NUMBER
E-MONTANT_HT
E-MONTANT_HT_ID
E-SELLER
E-TTC
E-TTC_ID
E-TVA
E-TVA_ID
I-DATE_ID
I-DATE
I-SELLER
I-INVOICE_ID
I-MONTANT_HT_ID
I-TTC
I-TTC_ID
I-TVA_ID
O
S-DATE_ID
S-DATE
S-INVOICE_ID
S-INVOICE_NUMBER
S-MONTANT_HT_ID
S-MONTANT_HT
S-SELLER
S-TTC
S-TTC_ID
S-TVA
S-TVA_ID

Fine-Tuning LayoutLM Model:

Here, we use Google Colab with a GPU to fine-tune the model. The code below is based on the original LayoutLM paper and this tutorial.

First, install the LayoutLM package…

! rm -r unilm
! git clone -b remove_torch_save https://github.com/NielsRogge/unilm.git
! cd unilm/layoutlm
! pip install unilm/layoutlm

…as well as the transformers package, from which the model will be downloaded:

! rm -r transformers
! git clone https://github.com/huggingface/transformers.git
! cd transformers
! pip install ./transformers

Next, create a list containing the unique labels from labels.txt:

from torch.nn import CrossEntropyLoss

def get_labels(path):
    with open(path, "r") as f:
        labels = f.read().splitlines()
    if "O" not in labels:
        labels = ["O"] + labels
    return labels

labels = get_labels("./labels.txt")
num_labels = len(labels)
label_map = {i: label for i, label in enumerate(labels)}
pad_token_label_id = CrossEntropyLoss().ignore_index

Then, create a PyTorch dataset and dataloader:

from transformers import LayoutLMTokenizer
from layoutlm.data.funsd import FunsdDataset, InputFeatures
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

args = {'local_rank': -1,
        'overwrite_cache': True,
        'data_dir': '/content/data',
        'model_name_or_path': 'microsoft/layoutlm-base-uncased',
        'max_seq_length': 512,
        'model_type': 'layoutlm'}

# class to turn the keys of a dict into attributes
class AttrDict(dict):
    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self

args = AttrDict(args)

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")

# the LayoutLM authors already defined a specific FunsdDataset, so we are going to use this here
train_dataset = FunsdDataset(args, tokenizer, labels, pad_token_label_id, mode="train")
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset,
                              sampler=train_sampler,
                              batch_size=2)

eval_dataset = FunsdDataset(args, tokenizer, labels, pad_token_label_id, mode="test")
eval_sampler = SequentialSampler(eval_dataset)
eval_dataloader = DataLoader(eval_dataset,
                             sampler=eval_sampler,
                             batch_size=2)

batch = next(iter(train_dataloader))
input_ids = batch[0][0]
tokenizer.decode(input_ids)

Load the pre-trained model from the Hugging Face hub; this is the model we will fine-tune on our dataset.

from transformers import LayoutLMForTokenClassification
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LayoutLMForTokenClassification.from_pretrained("microsoft/layoutlm-base-uncased", num_labels=num_labels)
model.to(device)

Finally, start the training:

from transformers import AdamW
from tqdm import tqdm

optimizer = AdamW(model.parameters(), lr=5e-5)

global_step = 0
num_train_epochs = 50
t_total = len(train_dataloader) * num_train_epochs  # total number of training steps

# put the model in training mode
model.train()
for epoch in range(num_train_epochs):
    for batch in tqdm(train_dataloader, desc="Training"):
        input_ids = batch[0].to(device)
        bbox = batch[4].to(device)
        attention_mask = batch[1].to(device)
        token_type_ids = batch[2].to(device)
        labels = batch[3].to(device)

        # forward pass
        outputs = model(input_ids=input_ids, bbox=bbox, attention_mask=attention_mask,
                        token_type_ids=token_type_ids, labels=labels)
        loss = outputs.loss

        # print loss every 100 steps
        if global_step % 100 == 0:
            print(f"Loss after {global_step} steps: {loss.item()}")

        # backward pass to get the gradients
        loss.backward()
        # print("Gradients on classification head:")
        # print(model.classifier.weight.grad[6,:].sum())

        # update
        optimizer.step()
        optimizer.zero_grad()
        global_step += 1

You should be able to see the training progress and the loss getting updated.


                                                 Image by author: Layout LM training in progress

After training, evaluate the model’s performance with the following code:


import numpy as np
from seqeval.metrics import (
    classification_report,
    f1_score,
    precision_score,
    recall_score,
)

eval_loss = 0.0
nb_eval_steps = 0
preds = None
out_label_ids = None

# put model in evaluation mode
model.eval()
for batch in tqdm(eval_dataloader, desc="Evaluating"):
    with torch.no_grad():
        input_ids = batch[0].to(device)
        bbox = batch[4].to(device)
        attention_mask = batch[1].to(device)
        token_type_ids = batch[2].to(device)
        labels = batch[3].to(device)

        # forward pass
        outputs = model(input_ids=input_ids, bbox=bbox, attention_mask=attention_mask,
                        token_type_ids=token_type_ids, labels=labels)

        # get the loss and logits
        tmp_eval_loss = outputs.loss
        logits = outputs.logits
        eval_loss += tmp_eval_loss.item()
        nb_eval_steps += 1

        # compute the predictions
        if preds is None:
            preds = logits.detach().cpu().numpy()
            out_label_ids = labels.detach().cpu().numpy()
        else:
            preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
            out_label_ids = np.append(out_label_ids, labels.detach().cpu().numpy(), axis=0)

# compute average evaluation loss
eval_loss = eval_loss / nb_eval_steps
preds = np.argmax(preds, axis=2)

out_label_list = [[] for _ in range(out_label_ids.shape[0])]
preds_list = [[] for _ in range(out_label_ids.shape[0])]
for i in range(out_label_ids.shape[0]):
    for j in range(out_label_ids.shape[1]):
        if out_label_ids[i, j] != pad_token_label_id:
            out_label_list[i].append(label_map[out_label_ids[i][j]])
            preds_list[i].append(label_map[preds[i][j]])

results = {
    "loss": eval_loss,
    "precision": precision_score(out_label_list, preds_list),
    "recall": recall_score(out_label_list, preds_list),
    "f1": f1_score(out_label_list, preds_list),
}
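
Note that classification_report was imported above but not yet used; it breaks the scores down per entity type, which helps reveal which fields need more annotated examples:

print(results)
print(classification_report(out_label_list, preds_list))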

With only 50 documents, we get the following scores:


                                         Image by author: Evaluation score after training

With more annotations, we should get higher scores.

Finally, save the model for future prediction:

PATH='./drive/MyDrive/trained_layoutlm/layoutlm_UBIAI.pt'
torch.save(model.state_dict(), PATH)
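
Since only the state dict is saved, reloading it later requires instantiating the architecture first. A minimal sketch, assuming the same num_labels used during training:

# Recreate the architecture, then load the fine-tuned weights.
from transformers import LayoutLMForTokenClassification
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=num_labels)
model.load_state_dict(torch.load(PATH, map_location=device))
model.to(device)
model.eval()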

Inference:

Now comes the fun part: let’s upload an invoice, OCR it, and extract the relevant entities. For this test, we use an invoice that was not in the training or test dataset. To parse the text from the invoice, we use the open-source Tesseract package. Let’s install it:

!sudo apt install tesseract-ocr
!pip install pytesseract

Before running predictions, we need to parse the text from the image and pre-process the tokens and bounding boxes into features. To do so, I have created a preprocessing Python file, layoutlm_preprocess.py, that makes it easier to preprocess the image (a rough sketch of what such a step might look like follows the snippet below):

import sys
sys.path.insert(1, './drive/MyDrive/UBIAI_layoutlm')
from layoutlm_preprocess import *
image_path='./content/invoice_test.jpg'
image, words, boxes, actual_boxes = preprocess(image_path)
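
The contents of layoutlm_preprocess.py are not reproduced here, but a rough, hypothetical sketch of what such a preprocessing step might do is shown below: run Tesseract over the image, collect word-level boxes, and scale them to the 0–1000 grid (the function name and details are illustrative, not the actual UBIAI helper):

# Hypothetical OCR + box-extraction step (illustrative, not the actual helper).
import pytesseract
from PIL import Image

def simple_preprocess(image_path):
    image = Image.open(image_path).convert("RGB")
    width, height = image.size
    ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    words, boxes, actual_boxes = [], [], []
    for text, left, top, w, h in zip(ocr["text"], ocr["left"], ocr["top"],
                                     ocr["width"], ocr["height"]):
        if not text.strip():
            continue  # skip empty OCR cells
        box = [left, top, left + w, top + h]  # pixel coordinates
        actual_boxes.append(box)
        boxes.append([int(1000 * box[0] / width),   # scaled to the 0-1000 grid
                      int(1000 * box[1] / height),
                      int(1000 * box[2] / width),
                      int(1000 * box[3] / height)])
        words.append(text)
    return image, words, boxes, actual_boxes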

Next, load the model and get word predictions with their bounding boxes:

model_path='./drive/MyDrive/trained_layoutlm/layoutlm_UBIAI.pt'
model=model_load(model_path,num_labels)
word_level_predictions, final_boxes=convert_to_features(image, words, boxes, actual_boxes, model)

Finally, display the image with the predicted entities and bounding boxes:

from PIL import ImageDraw, ImageFont

draw = ImageDraw.Draw(image)
font = ImageFont.load_default()

def iob_to_label(label):
    if label != 'O':
        return label[2:]
    else:
        return ""

label2color = {'date_id': 'green', 'date': 'green', 'invoice_id': 'blue', 'invoice_number': 'blue',
               'montant_ht_id': 'black', 'montant_ht': 'black', 'seller_id': 'red', 'seller': 'red',
               'ttc_id': 'grey', 'ttc': 'grey', '': 'violet', 'tva_id': 'orange', 'tva': 'orange'}

for prediction, box in zip(word_level_predictions, final_boxes):
    predicted_label = iob_to_label(label_map[prediction]).lower()
    draw.rectangle(box, outline=label2color[predicted_label])
    draw.text((box[0] + 10, box[1] - 10), text=predicted_label, fill=label2color[predicted_label], font=font)

image

Et voila:


                                               Image by author: Predictions on a test invoice

While the model made a few mistakes, such as assigning the TTC label to a purchased item or failing to identify some IDs, it was able to extract the seller, invoice number, date, and TTC correctly. The results are impressive and very promising given the low number of annotated documents (only 50)! With more annotated invoices, we should reach higher F1 scores and more accurate predictions.

Conclusion:

Overall, the results from the LayoutLM model are very promising and demonstrate the usefulness of transformers in analyzing semi-structured text. The model can be fine-tuned on other types of semi-structured documents, such as driver’s licenses, contracts, government documents, and financial documents.

If you have any questions, don’t hesitate to ask below or send us an email at [email protected]

If you liked this article, please like and share!

Source Prolead brokers usa

a must have tool to analyse latest ai research papers fast
A Must-Have Tool to Analyse Latest AI Research Papers Fast

Each entry below lists the paper link, title, a short description, and the abstract.

https://papers.labml.ai/paper/2105.04026

The Modern Mathematics of Deep Learning

We describe the new field of mathematical analysis of deep learning. This field emerged around a list of research questions that were not answered within the classical framework of learning theory. We present an overview of modern approaches that yield partial answers to these questions.

We describe the new field of mathematical analysis of deep learning. This field emerged around a list of research questions that were not answered within the classical framework of learning theory. These questions concern: the outstanding generalization power of overparametrized neural networks, the role of depth in deep architectures, the apparent absence of the curse of dimensionality, the surprisingly successful optimization performance despite the non-convexity of the problem, understanding what features are learned, why deep architectures perform exceptionally well in physical problems, and which fine aspects of an architecture affect the behavior of a learning task in which way. We present an overview of modern approaches that yield partial answers to these questions. For selected approaches, we describe the main ideas in more detail.

https://papers.labml.ai/paper/2106.04554

A Survey of Transformers

Transformers have achieved great success in many artificial intelligence fields. Up to the present, a great variety of Transformer variants (a.k.a. X-formers) have been proposed. A systematic literature review on these Transformer variants is still missing.

Transformers have achieved great success in many artificial intelligence fields, such as natural language processing, computer vision, and audio processing. Therefore, it is natural to attract lots of interest from academic and industry researchers. Up to the present, a great variety of Transformer variants (a.k.a. X-formers) have been proposed, however, a systematic and comprehensive literature review on these Transformer variants is still missing. In this survey, we provide a comprehensive review of various X-formers. We first briefly introduce the vanilla Transformer and then propose a new taxonomy of X-formers. Next, we introduce the various X-formers from three perspectives: architectural modification, pre-training, and applications. Finally, we outline some potential directions for future research.

https://papers.labml.ai/paper/2106.06561

GANs N’ Roses: Stable, Controllable, Diverse Image to Image Translation (works for videos too!)

A map that takes a content code, derived from a face, and a randomly chosen style code to an anime image. The map is not just diverse, but also correctly represents the probability of an anime, conditioned on an input face.

We show how to learn a map that takes a content code, derived from a face image, and a randomly chosen style code to an anime image. We derive an adversarial loss from our simple and effective definitions of style and content. This adversarial loss guarantees the map is diverse — a very wide range of anime can be produced from a single content code. Under plausible assumptions, the map is not just diverse, but also correctly represents the probability of an anime, conditioned on an input face. In contrast, current multimodal generation procedures cannot capture the complex styles that appear in anime. Extensive quantitative experiments support the idea the map is correct. Extensive qualitative results show that the method can generate a much more diverse range of styles than SOTA comparisons. Finally, we show that our formalization of content and style allows us to perform video to video translation without ever training on videos.

https://papers.labml.ai/paper/2106.03253

Tabular Data: Deep Learning is Not All You Need

Several deep learning models for tabular data have been proposed. They claim to outperform XGBoost for some use-cases. We show that an ensemble of the deep models and XGBoost performs better on these datasets.

A key element of AutoML systems is setting the types of models that will be used for each type of task. For classification and regression problems with tabular data, the use of tree ensemble models (like XGBoost) is usually recommended. However, several deep learning models for tabular data have recently been proposed, claiming to outperform XGBoost for some use-cases. In this paper, we explore whether these deep models should be a recommended option for tabular data, by rigorously comparing the new deep models to XGBoost on a variety of datasets. In addition to systematically comparing their accuracy, we consider the tuning and computation they require. Our study shows that XGBoost outperforms these deep models across the datasets, including datasets used in the papers that proposed the deep models. We also demonstrate that XGBoost requires much less tuning. On the positive side, we show that an ensemble of the deep models and XGBoost performs better on these datasets than XGBoost alone.

https://papers.labml.ai/paper/2009.05673

Applications of Deep Neural Networks

Deep learning is a group of exciting new technologies for neural networks. It is now possible to create neural networks that can handle tabular data, images, text, and audio as both input and output. Readers will use the Python programming language to implement deep learning using Google TensorFlow and Keras

Deep learning is a group of exciting new technologies for neural networks. Through a combination of advanced training techniques and neural network architectural components, it is now possible to create neural networks that can handle tabular data, images, text, and audio as both input and output. Deep learning allows a neural network to learn hierarchies of information in a way that is like the function of the human brain. This course will introduce the student to classic neural network structures, Convolution Neural Networks (CNN), Long Short-Term Memory (LSTM), Gated Recurrent Neural Networks (GRU), General Adversarial Networks (GAN), and reinforcement learning. Application of these architectures to computer vision, time series, security, natural language processing (NLP), and data generation will be covered. High-Performance Computing (HPC) aspects will demonstrate how deep learning can be leveraged both on graphical processing units (GPUs), as well as grids. Focus is primarily upon the application of deep learning to problems, with some introduction to mathematical foundations. Readers will use the Python programming language to implement deep learning using Google TensorFlow and Keras. It is not necessary to know Python prior to this book; however, familiarity with at least one programming language is assumed.

https://papers.labml.ai/paper/2104.13478

Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

Deep learning can be used to solve complex problems. It can be applied to complex problems such as learning to fold a ball into a square. It is also a way to study the structure of the human brain.

The last decade has witnessed an experimental revolution in data science and machine learning, epitomised by deep learning methods. Indeed, many high-dimensional learning tasks previously thought to be beyond reach — such as computer vision, playing Go, or protein folding — are in fact feasible with appropriate computational scale. Remarkably, the essence of deep learning is built from two simple algorithmic principles: first, the notion of representation or feature learning, whereby adapted, often hierarchical, features capture the appropriate notion of regularity for each task, and second, learning by local gradient-descent type methods, typically implemented as backpropagation. While learning generic functions in high dimensions is a cursed estimation problem, most tasks of interest are not generic, and come with essential pre-defined regularities arising from the underlying low-dimensionality and structure of the physical world. This text is concerned with exposing these regularities through unified geometric principles that can be applied throughout a wide spectrum of applications. Such a ‘geometric unification’ endeavour, in the spirit of Felix Klein’s Erlangen Program, serves a dual purpose: on one hand, it provides a common mathematical framework to study the most successful neural network architectures, such as CNNs, RNNs, GNNs, and Transformers. On the other hand, it gives a constructive procedure to incorporate prior physical knowledge into neural architectures and provide principled way to build future architectures yet to be invented.

https://papers.labml.ai/paper/2106.08962

Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better

Deep Learning has revolutionized the fields of computer vision, natural language understanding, speech recognition, information retrieval and more. We believe this is the first comprehensive survey in the efficient deep learning space that covers the landscape of model efficiency from modeling techniques to hardware support.

Deep Learning has revolutionized the fields of computer vision, natural language understanding, speech recognition, information retrieval and more. However, with the progressive improvements in deep learning models, their number of parameters, latency, resources required to train, etc. have all have increased significantly. Consequently, it has become important to pay attention to these footprint metrics of a model as well, not just its quality. We present and motivate the problem of efficiency in deep learning, followed by a thorough survey of the five core areas of model efficiency (spanning modeling techniques, infrastructure, and hardware) and the seminal work there. We also present an experiment-based guide along with code, for practitioners to optimize their model training and deployment. We believe this is the first comprehensive survey in the efficient deep learning space that covers the landscape of model efficiency from modeling techniques to hardware support. Our hope is that this survey would provide the reader with the mental model and the necessary understanding of the field to apply generic efficiency techniques to immediately get significant improvements, and also equip them with ideas for further research and experimentation to achieve additional gains.

https://papers.labml.ai/paper/2007.01547

Descending through a Crowded Valley — Benchmarking Deep Learning Optimizers

Analyze over 50,000 runs with different optimizers. Optimizer performance varies across tasks. Adam remains strong.

Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made based on anecdotes. In this work, we aim to replace these anecdotes, if not with a conclusive ranking, then at least with evidence-backed heuristics. To do so, we perform an extensive, standardized benchmark of fifteen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices. Analyzing more than 50,000 individual runs, we contribute the following three points: (i) Optimizer performance varies greatly across tasks. (ii) We observe that evaluating multiple optimizers with default parameters works approximately as well as tuning the hyperparameters of a single, fixed optimizer. (iii) While we cannot discern an optimization method clearly dominating across all tested tasks, we identify a significantly reduced subset of specific optimizers and parameter choices that generally lead to competitive results in our experiments: Adam remains a strong contender, with newer methods failing to significantly and consistently outperform it. Our open-sourced results are available as challenging and well-tuned baselines for more meaningful evaluations of novel optimization methods without requiring any further computational efforts.

https://papers.labml.ai/paper/2106.14843

CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders

CLIPDraw synthesizes novel drawings based on natural language input. The algorithm does not require any training. It operates over vector strokes rather than pixel images.

This work presents CLIPDraw, an algorithm that synthesizes novel drawings based on natural language input. CLIPDraw does not require any training; rather a pre-trained CLIP language-image encoder is used as a metric for maximizing similarity between the given description and a generated drawing. Crucially, CLIPDraw operates over vector strokes rather than pixel images, a constraint that biases drawings towards simpler human-recognizable shapes. Results compare between CLIPDraw and other synthesis-through-optimization methods, as well as highlight various interesting behaviors of CLIPDraw, such as satisfying ambiguous text in multiple ways, reliably producing drawings in diverse artistic styles, and scaling from simple to complex visual representations as stroke count is increased. Code for experimenting with the method is available at: https://colab.research.google.com/github/kvfrans/clipdraw/blob/main…

https://papers.labml.ai/paper/2106.06981

Thinking Like Transformers

Transformers have no such familiar parallel. We propose a computational model for the transformer-encoder in the form of a programming language. We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer.

What is the computational model behind a Transformer? Where recurrent neural networks have direct parallels in finite state machines, allowing clear discussion and thought around architecture variants or trained models, Transformers have no such familiar parallel. In this paper we aim to change that, proposing a computational model for the transformer-encoder in the form of a programming language. We map the basic components of a transformer-encoder — attention and feed-forward computation — into simple primitives, around which we form a programming language: the Restricted Access Sequence Processing Language (RASP). We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer, and how a Transformer can be trained to mimic a RASP solution. In particular, we provide RASP programs for histograms, sorting, and Dyck-languages. We further use our model to relate their difficulty in terms of the number of required layers and attention heads: analyzing a RASP program implies a maximum number of heads and layers necessary to encode a task in a transformer. Finally, we see how insights gained from our abstraction might be used to explain phenomena seen in recent works.

https://papers.labml.ai/paper/2106.02584

Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning

We challenge a common assumption underlying most supervised deep learning. Our approach uses self-attention to reason about relationships between datapoints explicitly. Empirically, our models solve cross-datapoint lookup and complex reasoning tasks unsolvable by traditional deep learning models.

We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric attention mechanisms. However, unlike conventional non-parametric models, we let the model learn end-to-end from the data how to make use of other datapoints for prediction. Empirically, our models solve cross-datapoint lookup and complex reasoning tasks unsolvable by traditional deep learning models. We show highly competitive results on tabular data, early results on CIFAR-10, and give insight into how the model makes use of the interactions between points.

https://papers.labml.ai/paper/2106.10207

Distributed Deep Learning in Open Collaborations

Large corporations and institutions use dedicated High-Performance Computing clusters. Grid- or volunteer computing has seen successful applications in scientific areas. Using this approach for machine learning is difficult due to high latency, asymmetric bandwidth, and other challenges.

Modern deep learning applications require increasingly more compute to train state-of-the-art models. To address this demand, large corporations and institutions use dedicated High-Performance Computing clusters, whose construction and maintenance are both environmentally costly and well beyond the budget of most organizations. As a result, some research directions become the exclusive domain of a few large industrial and even fewer academic actors. To alleviate this disparity, smaller groups may pool their computational resources and run collaborative experiments that benefit all participants. This paradigm, known as grid- or volunteer computing, has seen successful applications in numerous scientific areas. However, using this approach for machine learning is difficult due to high latency, asymmetric bandwidth, and several challenges unique to volunteer computing. In this work, we carefully analyze these constraints and propose a novel algorithmic framework designed specifically for collaborative training. We demonstrate the effectiveness of our approach for SwAV and ALBERT pretraining in realistic conditions and achieve performance comparable to traditional setups at a fraction of the cost. Finally, we provide a detailed report of successful collaborative language model pretraining with 40 participants.

https://papers.labml.ai/paper/2106.12627

Provably efficient machine learning for quantum many-body problems

Classical machine learning (ML) provides a potentially powerful approach to solving challenging quantum many-body problems. We prove that classical ML algorithms can efficiently predict ground state properties of gapped Hamiltonians in finite spatial dimensions.

Classical machine learning (ML) provides a potentially powerful approach to solving challenging quantum many-body problems in physics and chemistry. However, the advantages of ML over more traditional methods have not been firmly established. In this work, we prove that classical ML algorithms can efficiently predict ground state properties of gapped Hamiltonians in finite spatial dimensions, after learning from data obtained by measuring other Hamiltonians in the same quantum phase of matter. In contrast, under widely accepted complexity theory assumptions, classical algorithms that do not learn from data cannot achieve the same guarantee. We also prove that classical ML algorithms can efficiently classify a wide range of quantum phases of matter. Our arguments are based on the concept of a classical shadow, a succinct classical description of a many-body quantum state that can be constructed in feasible quantum experiments and be used to predict many properties of the state. Extensive numerical experiments corroborate our theoretical results in a variety of scenarios, including Rydberg atom systems, 2D random Heisenberg models, symmetry-protected topological phases, and topologically ordered phases.

https://papers.labml.ai/paper/2106.10745

Calliar: An Online Handwritten Dataset for Arabic Calligraphy

Calligraphy is an essential part of the Arabic heritage and culture. It has been used in the past for the decoration of houses and mosques. In the past few years, there has been a considerable effort to digitize such type of art.

Calligraphy is an essential part of the Arabic heritage and culture. It has been used in the past for the decoration of houses and mosques. Usually, such calligraphy is designed manually by experts with aesthetic insights. In the past few years, there has been a considerable effort to digitize such type of art by either taking a photo of decorated buildings or drawing them using digital devices. The latter is considered an online form where the drawing is tracked by recording the apparatus movement, an electronic pen for instance, on a screen. In the literature, there are many offline datasets collected with a diversity of Arabic styles for calligraphy. However, there is no available online dataset for Arabic calligraphy. In this paper, we illustrate our approach for the collection and annotation of an online dataset for Arabic calligraphy called Calliar that consists of 2,500 sentences. Calliar is annotated for stroke, character, word and sentence level prediction.

https://papers.labml.ai/paper/2106.11189

Regularization is all you Need: Simple Neural Nets can Excel on Tabular Data

Tabular datasets are the last “unconquered castle” for deep learning. Traditional ML methods like Gradient-Boosted Decision Trees still perform strongly against specialized neural architectures. We propose regularizing plain MLPs by searching for the optimal combination of 13 regularization techniques for each dataset.

Tabular datasets are the last “unconquered castle” for deep learning, with traditional ML methods like Gradient-Boosted Decision Trees still performing strongly even against recent specialized neural architectures. In this paper, we hypothesize that the key to boosting the performance of neural networks lies in rethinking the joint and simultaneous application of a large set of modern regularization techniques. As a result, we propose regularizing plain Multilayer Perceptron (MLP) networks by searching for the optimal combination/cocktail of 13 regularization techniques for each dataset using a joint optimization over the decision on which regularizers to apply and their subsidiary hyperparameters. We empirically assess the impact of these regularization cocktails for MLPs on a large-scale empirical study comprising 40 tabular datasets and demonstrate that (i) well-regularized plain MLPs significantly outperform recent state-of-the-art specialized neural network architectures, and (ii) they even outperform strong traditional ML methods, such as XGBoost.

https://papers.labml.ai/paper/2106.11959

Revisiting Deep Learning Models for Tabular Data

The choice between GBDT and DL models highly depends on data and there is still no universally superior solution. We demonstrate that a simple ResNet-like architecture is a surprisingly effective baseline, which outperforms most of the sophisticated models.

The necessity of deep learning for tabular data is still an unanswered question addressed by a large number of research efforts. The recent literature on tabular DL proposes several deep architectures reported to be superior to traditional “shallow” models like Gradient Boosted Decision Trees. However, since existing works often use different benchmarks and tuning protocols, it is unclear if the proposed models universally outperform GBDT. Moreover, the models are often not compared to each other, therefore, it is challenging to identify the best deep model for practitioners. In this work, we start from a thorough review of the main families of DL models recently developed for tabular data. We carefully tune and evaluate them on a wide range of datasets and reveal two significant findings. First, we show that the choice between GBDT and DL models highly depends on data and there is still no universally superior solution. Second, we demonstrate that a simple ResNet-like architecture is a surprisingly effective baseline, which outperforms most of the sophisticated models from the DL literature. Finally, we design a simple adaptation of the Transformer architecture for tabular data that becomes a new strong DL baseline and reduces the gap between GBDT and DL models on datasets where GBDT dominates.

Source Prolead brokers usa
