The second method is to take the median (or another quantile, such as the 20% quantile). This method splits the sorted users by headcount: the median divides the population into two equal halves, and other quantiles can also have a reasonable business interpretation, such as the familiar idea that 20% of users contribute 80% of the effect.

I personally think this method is relatively simple and easy to implement, and I recommend it.
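As a minimal sketch of the quantile approach, the snippet below computes a median threshold and an 80% quantile threshold with Python's standard `statistics` module. The `monetary` list is made-up illustration data (deliberately built with a heavy right tail), and the variable names are assumptions, not part of any particular RFM tool.

```python
from statistics import median, quantiles

# Hypothetical monetary values (M) for ten users; the numbers are made up.
monetary = [12, 15, 18, 20, 25, 30, 42, 60, 150, 900]

# Median threshold: half of the users fall on each side.
m_median = median(monetary)

# Quintile cut points (20%, 40%, 60%, 80%).
# method="inclusive" treats the list as the whole population.
cuts = quantiles(monetary, n=5, method="inclusive")
m_p80 = cuts[3]  # 80% quantile: about the top 20% of users lie above it

# Users above the 80% quantile, and their share of total spend.
high_value = [m for m in monetary if m > m_p80]
top_share = sum(high_value) / sum(monetary)
```

In this made-up sample the two users above the 80% quantile account for more than 80% of total spend, which is the kind of business reading the quantile threshold allows.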

For a skewed distribution (taking right skew as an example), the mode, median, and mean typically satisfy the following relationship: mode < median < mean.
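To make the relationship between the three statistics concrete, here is a small check on a right-skewed sample. The `frequency` list is invented for illustration; real data will not always order the three statistics this cleanly.

```python
from statistics import mean, median, mode

# Hypothetical purchase counts (F) with a long right tail; the numbers are made up.
frequency = [1, 1, 1, 1, 2, 2, 3, 4, 8, 27]

f_mode = mode(frequency)      # most common value
f_median = median(frequency)  # middle of the sorted values
f_mean = mean(frequency)      # pulled upward by the few heavy buyers

# In this sample the mean sits well above the median, so a mean-based
# threshold would place most users in the "low" group.
below_mean = sum(f < f_mean for f in frequency)
```

Here 8 of the 10 users fall below the mean, which is one reason a median or other quantile can be a more balanced threshold for skewed data.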

In addition to using a statistic directly as the threshold, there is a third method that is also commonly seen in practice: the scoring method.

The so-called scoring method first converts the raw R, F, and M values into scores from 1 to 5, and then uses the average of those scores as the division threshold. For example, see the following picture:

This method involves a lot of work: first scoring, then averaging. I personally don't recommend it. First, originally only one threshold had to be calculated, but now the values have to be split into 5 segments, and how should those 5 segments be divided reasonably? Second, it is not clear what the scoring actually adds, and it increases the computational complexity.
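The scoring-and-averaging procedure can be sketched as follows. How to cut the five segments is exactly the open question raised above; this sketch simply assumes quintile cut points, which is one choice among many. The function name `score_1_to_5` and the sample data are hypothetical.

```python
from bisect import bisect_left
from statistics import mean, quantiles

def score_1_to_5(values, higher_is_better=True):
    """Map raw values to scores 1..5 using quintile cut points (an assumed binning)."""
    cuts = quantiles(values, n=5, method="inclusive")
    # bisect_left counts how many cut points lie strictly below the value.
    scores = [bisect_left(cuts, v) + 1 for v in values]
    # For recency, fewer days since the last purchase is better, so invert.
    return scores if higher_is_better else [6 - s for s in scores]

# Hypothetical monetary values (M); the numbers are made up.
monetary = [12, 15, 18, 20, 25, 30, 42, 60, 150, 900]

m_scores = score_1_to_5(monetary)
m_threshold = mean(m_scores)                 # average score is the threshold
m_is_high = [s > m_threshold for s in m_scores]
```

Note that the result is still a single high/low split per dimension, obtained through an extra binning step, which is the added complexity the text objects to.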

If the goal is to deal with outliers or an uneven distribution, the quantile method is good enough. I don't really understand the point of the scoring method that is so popular in practice. One possibility I thought of is that scoring puts the three dimensions on the same scale, so that a composite RFM score can be calculated for each user and users can be ranked by it. As shown below:

If so, then I think whether to use the scoring method mainly depends on the goal of the model. If the goal is to divide users into 8 discrete tiers, scoring is not necessary; if the goal is to obtain a composite RFM score for each user, scoring is necessary. Other than that, I really can't think of the point of scoring, and I would welcome pointers from anyone more experienced.
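A composite score of this kind might look like the sketch below. The per-user scores are invented, and the equal weights are an assumption; the text does not say how the three dimensions should be weighted.

```python
# Hypothetical 1-5 scores per user as (R, F, M); users and numbers are made up.
scores = {
    "u3": (4, 1, 2),
    "u1": (5, 4, 5),
    "u4": (1, 2, 1),
    "u2": (2, 5, 3),
}

# Equal weights are an assumption; a business might weight the dimensions differently.
weights = (1 / 3, 1 / 3, 1 / 3)

def composite(rfm, w=weights):
    """Weighted average of the three scores."""
    return sum(s * wi for s, wi in zip(rfm, w))

# Rank users from highest to lowest composite score.
ranking = sorted(scores, key=lambda u: composite(scores[u]), reverse=True)
```

Because all three dimensions share the 1 to 5 scale, they can be combined and ranked directly, which is not possible with the raw values (days, order counts, and currency).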

4. User tier calculation

After working through the competing threshold methods above, what follows is relatively straightforward: calculating the user tiers.

This step is relatively easy to understand. Using the three predetermined thresholds, determine which interval each user falls into, and label the user accordingly. I won't go into details.
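For readers who do want the details, a minimal sketch of the labeling step follows. It assumes median thresholds, treats a small recency value (few days since the last purchase) as "high", and uses three-letter codes such as `HLL` instead of named tiers; the users and values are made up.

```python
from statistics import median

# Hypothetical users: (days since last purchase, number of orders, total spend).
users = {
    "u1": (3, 12, 900),
    "u2": (40, 2, 60),
    "u3": (10, 1, 25),
    "u4": (90, 8, 150),
}

# One threshold per dimension; medians are used here as an example.
r_cut = median(v[0] for v in users.values())
f_cut = median(v[1] for v in users.values())
m_cut = median(v[2] for v in users.values())

def label(r, f, m):
    """Return a three-letter code (R, F, M), each H (high) or L (low)."""
    r_flag = "H" if r <= r_cut else "L"  # recent buyers count as high R
    f_flag = "H" if f > f_cut else "L"
    m_flag = "H" if m > m_cut else "L"
    return r_flag + f_flag + m_flag

segments = {uid: label(*rfm) for uid, rfm in users.items()}
# segments == {"u1": "HHH", "u2": "LLL", "u3": "HLL", "u4": "LHH"}
```

Three binary flags give 2 x 2 x 2 = 8 possible codes, which correspond to the 8 user tiers.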

5. Model optimization

The so-called model optimization mainly consists of adjusting the thresholds.

The thresholds need to be adjusted based on the resulting user groups, the related operational results, and the campaign rules, until the division is as reasonable as possible.
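One simple way to support this tuning is to check how the size of the "high" group changes as the threshold moves. The sketch below does this for a few candidate percentiles; the helper name `high_group_size`, the candidate percentiles, and the data are all assumptions for illustration.

```python
from statistics import quantiles

# Hypothetical monetary values (M); the numbers are made up.
monetary = [12, 15, 18, 20, 25, 30, 42, 60, 150, 900]

def high_group_size(values, pct):
    """Count users strictly above the pct-th percentile (pct from 1 to 99)."""
    cut = quantiles(values, n=100, method="inclusive")[pct - 1]
    return sum(v > cut for v in values)

# Compare a few candidate thresholds before settling on one.
sizes = {pct: high_group_size(monetary, pct) for pct in (50, 70, 80)}
# sizes == {50: 5, 70: 3, 80: 2}
```

Group size is only one input; whether a given split is reasonable still depends on the operational results and campaign rules mentioned above.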

Third, the advantages and disadvantages of the RFM model

As mentioned at the beginning of this article, the RFM model is widely used and has significant advantages, but it also has quite a few disadvantages. Let's go through them.

1. Advantages of the model

The biggest advantage is probably the availability of data.

At present, Internet companies collect fairly complete data, including all kinds of user behavior data, which makes user labeling and tiered operations easier. In traditional industries, however, there is not much behavioral data, and the data that can be used is relatively limited.

However, no matter how incomplete the company's data is, there must be transaction data (unless the company has no income...). As long as there is transaction data, RFM analysis can be carried out, which is the biggest advantage. Moreover, the RFM model based on transaction data is quite effective.

Second, the tiers produced by the model are highly interpretable.

Other models often stratify users by clustering, and the resulting groups are hard to explain to the business. The RFM model, by contrast, divides users into 8 categories that are very easy to understand.

2. Disadvantages of the model

The RFM model is actually a lagging analysis model: RFM analysis can only be performed after the user has made a purchase. The model also assumes that a user's future behavior will not differ from their past behavior.

In addition, it should be noted that the model applies differently in different industries.

The most typical case is the difference between fast-moving consumer goods and durable consumer goods. RFM analysis is not a very effective model for durable consumer goods. Take refrigerators as an example: after buying one, a user may not buy another for more than ten years, so this behavior cannot be analyzed with RFM. Forcing the model onto such data doesn't make any sense.